CN104657362B - Data storage, querying method and device - Google Patents


Info

Publication number
CN104657362B
Authority
CN
China
Prior art keywords
data
compressed
mark
content
bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310577254.5A
Other languages
Chinese (zh)
Other versions
CN104657362A (en)
Inventor
张元龙
林汇宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201310577254.5A
Publication of CN104657362A
Application granted
Publication of CN104657362B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides a data storage method. Data to be stored comprises an original data identifier and original data content corresponding to the original data identifier. The method includes: clustering the data to be stored according to the original data identifiers, each class corresponding to one data block, and generating a data block identifier from the original data identifiers of the data to be stored corresponding to the data block; compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining an address identifier of the compressed data content; and generating an index from the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and storing the index in the index area of the data block. The data storage method provided by the invention greatly improves the efficiency with which data can be used. The invention also provides a data storage device, a data query method, and a data query device.

Description

Data storage, querying method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data storage method and device and a data query method and device.
Background
As computer technology advances and increasingly shapes people's lives and work, large amounts of data are generated that computers must store and process. The volume of data to be stored has grown from the megabyte (MB) and gigabyte (GB) scale to the terabyte (TB) and even petabyte (PB) scale. Faced with this explosive growth, reducing the cost of data storage has become an urgent problem.
The traditional way to reduce storage cost is to compress data directly before storing it. However, this approach trades computation time for storage space and lowers the efficiency with which the data can be used; for example, when querying, all of the compressed data must be decompressed before it can be used.
Summary of the invention
In view of this, and to address the problem that traditional data compression methods reduce the efficiency of data use, it is necessary to provide a data storage method and device and a data query method and device.
A data storage method, in which data to be stored comprises an original data identifier and original data content corresponding to the original data identifier, the method comprising:
Clustering the data to be stored according to the original data identifiers, each class corresponding to one data block, and generating a data block identifier from the original data identifiers of the data to be stored corresponding to the data block;
Compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining an address identifier of the compressed data content;
Generating an index from the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and storing the index in the index area of the data block.
A data storage device, in which data to be stored comprises an original data identifier and original data content corresponding to the original data identifier, the device comprising:
A data block generation module, configured to cluster the data to be stored according to the original data identifiers, each class corresponding to one data block, and to generate a data block identifier from the original data identifiers of the data to be stored corresponding to the data block;
An original data content compression module, configured to compress the original data content of the data to be stored corresponding to the data block, store the compressed data content in the data area of the data block, and obtain an address identifier of the compressed data content;
An index generation module, configured to generate an index from the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and to store the index in the index area of the data block.
With the above data storage method and device, after the data to be stored is clustered, each class of data to be stored corresponds to one data block, and a data block identifier is generated from the original data identifiers of the data to be stored corresponding to the data block. The data block comprises an index area and a data area: the data area stores the compressed data content, and the index area stores the index generated from the original data identifiers and the address identifiers of the compressed data content in the data area. When querying, the data block holding the queried data can be determined directly from the query data identifier, the address identifier of the query data content is obtained from the index in the data block, the compressed query data content is read from the data area according to the address identifier, and the query data content is obtained after decompression. This achieves good compression and reduces the space occupied by stored data, while also allowing the storage location to be found quickly, so that data can be queried quickly without decompressing all the compressed data, which greatly improves the efficiency of data use.
A data query method, the method comprising:
Obtaining an original query data identifier;
Determining the data block corresponding to the original query data identifier according to the original query data identifier and the pre-stored data block identifiers;
Obtaining the address identifier corresponding to the original query data identifier according to the index in the index area of the data block;
Obtaining compressed query data content from the data area of the data block according to the address identifier, and decompressing the compressed query data content to obtain the query data content.
A data query device, the device comprising:
An original query data identifier acquisition module, configured to obtain an original query data identifier;
A data block determining module, configured to determine the data block corresponding to the original query data identifier according to the original query data identifier and the pre-stored data block identifiers;
An address identifier acquisition module, configured to obtain the address identifier corresponding to the original query data identifier according to the index in the index area of the data block;
A query data content acquisition module, configured to obtain compressed query data content from the data area of the data block according to the address identifier, and to decompress the compressed query data content to obtain the query data content.
With the above data query method and device, after the original query data identifier is obtained, the data block corresponding to it, that is, the data block holding the queried data, can be determined from the original query data identifier and the pre-stored data block identifiers. The address identifier corresponding to the original query data identifier is obtained from the index area of that data block, the compressed query data content is read from the data area of the data block according to the address identifier, and decompressing it yields the query data content. The storage location can thus be found quickly and retrieval is fast, without decompressing all the compressed data, which greatly improves the efficiency of data use.
Description of the drawings
Fig. 1 is a flow diagram of the data storage method in one embodiment;
Fig. 2 is a flow diagram, in one embodiment, of the step of truncating the low-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the index area of the data block in correspondence with the address identifier of the compressed data content;
Fig. 3 is a flow diagram, in one embodiment, of the step of compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining the address identifier of the compressed data content;
Fig. 4 is a flow diagram of the step of generating a decoding acceleration table in one embodiment;
Fig. 5 is a structure diagram of a data block in a specific application scenario;
Fig. 6 is a flow diagram of the data query method in one embodiment;
Fig. 7 is a flow diagram, in one embodiment, of the step of truncating the low-order part of the original query data identifier to obtain a compressed query data identifier, and obtaining the address identifier corresponding to the compressed query data identifier stored in the index area of the data block;
Fig. 8 is a schematic diagram of a Huffman tree and a decoding acceleration table in a specific application scenario;
Fig. 9 is a structure diagram of the data storage device in one embodiment;
Fig. 10 is a structure diagram of the index generation module in one embodiment;
Fig. 11 is a structure diagram of the original data content compression module in one embodiment;
Fig. 12 is a structure diagram of the data storage device in another embodiment;
Fig. 13 is a structure diagram of the data query device in one embodiment;
Fig. 14 is a structure diagram of the address identifier acquisition module in one embodiment;
Fig. 15 is a structure diagram of the first acquisition module in one embodiment;
Fig. 16 is a structure diagram of the first acquisition module in another embodiment;
Fig. 17 is a structure diagram of the query data content acquisition module in one embodiment;
Fig. 18 is a module diagram of a computer system capable of implementing embodiments of the present invention in one embodiment.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
It will be understood that the terms "first", "second", and the like used in the present invention may describe various elements, but the elements are not limited by these terms. The terms only distinguish one element from another. For example, without departing from the scope of the invention, the first preset number of bits could be called the second preset number of bits, and similarly the second preset number of bits could be called the first preset number of bits. The first preset number of bits and the second preset number of bits are both preset numbers of bits, but they are not the same preset number of bits.
Unless the context clearly indicates otherwise, elements and components in the present invention may exist in the singular or in the plural, and the invention places no limit on this. Although the steps are numbered, the numbering is not intended to limit their order; unless the order of steps is explicitly stated or the execution of a step depends on other steps, the relative order of the steps may be adjusted. It will be understood that the term "and/or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items.
As shown in Fig. 1, in one embodiment, a data storage method is provided. The method includes:
Step 102, clustering the data to be stored according to the original data identifiers, each class corresponding to one data block, and generating a data block identifier from the original data identifiers of the data to be stored corresponding to the data block.
Data to be stored comprises an original data identifier and original data content corresponding to it. "Original" means that the data is in its original state, uncompressed and not otherwise processed. The original data identifier may be the key of the data to be stored and is used to distinguish different data to be stored. Clustering the data to be stored allows original data identifiers that share a common part to be stored together; the common part is extracted and not stored repeatedly, so that only the non-common part of each original data identifier needs to be stored. This amounts to fixed-length compression of the original data identifiers, and data compressed to a fixed length still supports random access, which makes lookup convenient.
A data block is one or several groups of records arranged contiguously in order; it is the unit of data transferred between main memory and input/output devices or external storage. Data blocks may be stored in the memory of a computer. After the data to be stored is clustered by original data identifier, each class corresponds to one data block, and a data block identifier is generated from the original data identifiers of the data to be stored corresponding to the data block. When querying, the data block that holds the data can then be located from the query data identifier, which makes fast lookup possible.
Step 104, compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining the address identifier of the compressed data content.
To obtain a higher compression ratio, the original data content may be compressed with variable-length coding. Variable-length coding here means entropy coding: a coding scheme based on the probability or frequency of occurrence of each data unit, in which the most frequent units are assigned the shortest codes so that the encoded length is as short as possible. The final encoding result is a binary string. Huffman coding is a typical variable-length coding algorithm.
Variable-length coding can achieve a very high compression ratio, but because code lengths are not fixed, the storage address of a piece of compressed data is hard to predict directly from the data structure, and it is generally necessary to decompress all the compressed data to find the required data. The address identifier identifies the storage location of the compressed data content. By obtaining the address identifier of the compressed data content when it is stored, the compressed data content can later be located directly from the address identifier, without decoding all of the compressed data.
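As an illustration of why code lengths vary, the following is a minimal sketch of Huffman code assignment over bytes. It is not the patent's own encoder: the function name `huffman_codes` and the choice of single bytes as data units are assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Return a prefix-free bit string for each byte value in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps tuple comparison away from the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes(b"aaaaaaaabbbc")
```

In this input the byte `a` is the most frequent and receives the shortest code. Because encoded records end up with different bit lengths, their positions cannot be computed from a record number alone, which is the reason the method records an address identifier for each one.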
Step 106, generating an index from the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and storing the index in the index area of the data block.
An index is generated from the original data identifiers of the data to be stored corresponding to the data block and the address identifiers of the compressed data content, and the index is stored in the index area of the data block. The address identifier of the compressed data content corresponding to an original data identifier can then be obtained from that identifier, and the compressed data content obtained from the address identifier, so that data can be queried quickly.
With the above data storage method, after the data to be stored is clustered, each class of data to be stored corresponds to one data block, and a data block identifier is generated from the original data identifiers of the data to be stored corresponding to the data block. The data block comprises an index area and a data area: the data area stores the compressed data content, and the index area stores the index generated from the original data identifiers and the address identifiers of the compressed data content in the data area. When querying, the data block holding the queried data can be determined directly from the query data identifier, the address identifier of the query data content is obtained from the index in the data block, the compressed query data content is read from the data area according to the address identifier, and the query data content is obtained after decompression. This achieves good compression and reduces the space occupied by stored data, while also allowing the storage location to be found quickly, so that data can be queried quickly without decompressing all the compressed data, which greatly improves the efficiency of data use.
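Steps 102 to 106 and the matching query can be sketched end to end as follows. This is a simplified illustration under several assumptions: identifiers are 32-bit integers, the low 8 bits select the data block, `zlib` stands in for the variable-length coding described above, and the index keeps the full identifier rather than a compressed one. The names `build_blocks` and `query` are hypothetical.

```python
import zlib

LOW_BITS = 8  # assumed value of the "first preset number of bits"

def build_blocks(records):
    """records: dict mapping a 32-bit integer identifier to bytes content."""
    blocks = {}
    for ident, content in sorted(records.items()):
        # Step 102: the low-order bits choose the data block.
        block_id = ident & ((1 << LOW_BITS) - 1)
        block = blocks.setdefault(block_id, {"index": {}, "data": bytearray()})
        # Step 104: compress and append to the data area, noting the offset.
        compressed = zlib.compress(content)
        offset = len(block["data"])  # plays the role of the address identifier
        block["data"] += compressed
        # Step 106: record identifier -> (offset, length) in the index area.
        block["index"][ident] = (offset, len(compressed))
    return blocks

def query(blocks, ident):
    """Locate the block, look up the address, decompress only that record."""
    block = blocks.get(ident & ((1 << LOW_BITS) - 1))
    if block is None or ident not in block["index"]:
        return None
    offset, length = block["index"][ident]
    return zlib.decompress(bytes(block["data"][offset:offset + length]))

blocks = build_blocks({0x00000200: b"alpha", 0x00000300: b"beta", 0x00000201: b"gamma"})
```

A call such as `query(blocks, 0x00000300)` touches only one data block and decompresses only one record; the other compressed records are never decoded.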
In one embodiment, step 102 includes: treating data to be stored whose original data identifiers have the same low-order part of a first preset number of bits as one class, each class corresponding to one data block, and using the low-order part as the data block identifier of the data block.
In this embodiment, data to be stored with the same low-order part of the first preset number of bits forms one class, each class corresponds to one data block, and the low-order part serves as the data block identifier. When querying, the low-order part of the query data identifier determines the data block holding the queried data, which enables fast queries.
The first preset number of bits is set as needed. For example, if original data identifiers are represented as 32-bit unsigned integers, each original data identifier needs 4 bytes of storage. If an original data identifier is "0x00000200", where "0x" denotes hexadecimal, then the low 8 bits of the identifier are "00", and the first preset number of bits is 8. The low 8 bits here are the low 8 bits in binary, equivalent to the last 2 digits in hexadecimal.
In one embodiment, step 106 includes: truncating the low-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the index area of the data block in correspondence with the address identifier of the compressed data content.
Because data to be stored with the same low-order part of the first preset number of bits forms one class, the original data identifiers in each class share the same low-order part. Truncating that part reduces the space needed to store data identifiers in the index area; this compresses the original data identifier into a compressed data identifier. For example, truncating a low-order part of 8 bits saves 1 byte of storage per identifier. The truncated low-order part serves as the data block identifier, and combining the data block identifier with the compressed data identifier recovers the original data identifier. Storing each compressed data identifier in the index area in correspondence with the address identifier of the compressed data content means that the index in the index area records the correspondence between compressed data identifiers and address identifiers.
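The truncation and recombination can be written as plain bit operations. The sketch below assumes 32-bit identifiers and a first preset number of bits equal to 8, as in the "0x00000200" example; the function names are hypothetical.

```python
LOW_BITS = 8  # assumed first preset number of bits

def split_identifier(ident):
    """Split an identifier into (data block identifier, compressed identifier)."""
    block_id = ident & ((1 << LOW_BITS) - 1)  # low-order part
    compressed_id = ident >> LOW_BITS         # remaining high-order bits
    return block_id, compressed_id

def join_identifier(block_id, compressed_id):
    """Recover the original identifier from its two parts."""
    return (compressed_id << LOW_BITS) | block_id

block_id, compressed_id = split_identifier(0x00000200)
```

For `0x00000200` the data block identifier is `0x00` and the compressed identifier is `0x000002`, which fits in 3 bytes instead of 4.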
As shown in Fig. 2, in one embodiment, the step of truncating the low-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the index area of the data block in correspondence with the address identifier of the compressed data content, specifically includes:
Step 202, treating original data identifiers of the data block that have the same high-order part of a second preset number of bits as one subclass, the original data identifiers of each subclass corresponding to a respective data bucket in the index area of the data block; using the high-order part as the data bucket identifier of the data bucket; and storing, in the index area of the data block, the correspondence between each data bucket identifier and the bucket start storage address of the data bucket in the index area.
Here, the sum of the first preset number of bits and the second preset number of bits is less than or equal to the total number of bits of the original data identifier. A data bucket is a data structure in the index area of the data block used to store data.
Specifically, original data identifiers with the same high-order part of the second preset number of bits form one subclass, and the index area of the data block is divided into multiple data buckets, with subclasses and data buckets in one-to-one correspondence. The original data identifiers of each subclass share the same high-order part, which can be used as the data bucket identifier.
The index area also needs to store the correspondence between each data bucket identifier and the bucket start storage address of the data bucket in the index area. Specifically, an array whose length equals the number of data buckets in the data block may store the bucket start storage addresses, with the array subscript serving as the data bucket identifier; this records the correspondence between data bucket identifiers and bucket start storage addresses. Keeping the sum of the first and second preset numbers of bits no greater than the total number of bits of the original data identifier ensures that the original data identifier can be compressed normally.
Step 204, truncating the low-order part and the high-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the data bucket whose data bucket identifier is that high-order part.
The first preset number of bits may be an integer multiple of 8 bits in binary, and so may the second preset number of bits. Because 8 bits make one byte, this ensures that the truncated parts are a whole number of bytes; since one character usually occupies one byte of storage, this effectively reduces the space that the stored data occupies.
Truncating the low-order and high-order parts from the original data identifiers of the data to be stored corresponding to the data block compresses the original data identifiers into compressed data identifiers. Each compressed data identifier is stored in the data bucket whose identifier is the truncated high-order part. The data block identifier, the data bucket identifier, and the compressed data identifier together make up the original data identifier, so compressing the original data identifier saves storage space.
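The bucket layout of steps 202 and 204 might look like the sketch below. It assumes 32-bit identifiers with an 8-bit low-order part and an 8-bit high-order part, leaving a 16-bit compressed identifier; the bucket start positions are list indices rather than real storage addresses, and all names are hypothetical.

```python
TOTAL_BITS = 32
LOW_BITS = 8    # assumed first preset number of bits (data block identifier)
HIGH_BITS = 8   # assumed second preset number of bits (data bucket identifier)
MID_BITS = TOTAL_BITS - LOW_BITS - HIGH_BITS

def split3(ident):
    """Return (data block id, data bucket id, compressed data identifier)."""
    block_id = ident & ((1 << LOW_BITS) - 1)
    bucket_id = ident >> (TOTAL_BITS - HIGH_BITS)
    compressed_id = (ident >> LOW_BITS) & ((1 << MID_BITS) - 1)
    return block_id, bucket_id, compressed_id

def join3(block_id, bucket_id, compressed_id):
    """Recombine the three parts into the original identifier."""
    return (bucket_id << (TOTAL_BITS - HIGH_BITS)) | (compressed_id << LOW_BITS) | block_id

def build_index(identifiers):
    """Lay out one block's compressed identifiers bucket by bucket."""
    buckets = [[] for _ in range(1 << HIGH_BITS)]
    for ident in identifiers:
        _, bucket_id, compressed_id = split3(ident)
        buckets[bucket_id].append(compressed_id)
    bucket_start = []  # array subscript = data bucket identifier
    flat = []          # compressed identifiers, stored contiguously
    for bucket in buckets:
        bucket_start.append(len(flat))
        flat.extend(sorted(bucket))
    return bucket_start, flat

bucket_start, flat = build_index([0x01000200, 0x01000100, 0x02000300])
```

Here the three identifiers share the low byte `0x00`, so they belong to one data block; two fall into bucket `0x01` and one into bucket `0x02`, and each is stored as a 2-byte compressed identifier.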
Step 206, storing, in the index area of the data block, the address identifier of the compressed data content corresponding to each compressed data identifier stored in the data bucket.
Specifically, the index area of the data block may store each compressed data identifier together with the address identifier, within the data area, of the corresponding compressed data content, so that the compressed data content stored in the data area can be obtained through the compressed data identifier.
In this embodiment, multiple data buckets are opened in the index area of the data block. After the high-order and low-order parts of an original data identifier are truncated to obtain the compressed data identifier, the compressed data identifier is stored in a data bucket, and the index area also stores the address identifier corresponding to the compressed data identifier. When querying, the data bucket is found through the stored correspondence between data bucket identifiers and bucket start storage addresses, the stored compressed data identifier is obtained from the data bucket, and the address identifier corresponding to the compressed data identifier is then obtained from the index area, so that the query data content is obtained from the data area of the data block according to the address identifier. Storing original data identifiers in the index area in compressed form saves storage space, and the compressed data content in the data area can be located quickly through the index area, which preserves query efficiency.
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area. Step 206 then includes: storing, in the index area, the start offset corresponding to the data bucket, and the in-bucket offset of the compressed data content corresponding to each compressed data identifier in the data bucket, the in-bucket offset being relative to the start offset.
In this embodiment, the offset of the storage address of the compressed data content relative to the start storage address of the data area is used as the address identifier of the compressed data content. The start storage address is the address at which data begins to be stored in the data area. Because the data is stored contiguously, the start storage address plus the offset of the compressed data content corresponding to a compressed data identifier gives the actual storage address of that compressed data content, that is, the address in the data area at which the compressed data content begins. Moreover, the offset of the compressed data content corresponding to the adjacent next compressed data identifier gives the storage end position of the compressed data content corresponding to this compressed data identifier, so no additional data length needs to be stored and the storage space occupied is small.
The start offset is the offset, relative to the start storage address above, of the address in the data area at which the compressed data content corresponding to the compressed data identifiers in the data bucket begins to be stored. The in-bucket offset is the offset of each compressed data content stored for the data bucket relative to the start offset. The sum of the start offset and the in-bucket offset for a compressed data identifier is its actual offset, and the sum of the actual offset and the start storage address is the actual storage address of the corresponding compressed data content. In this embodiment, storing offsets takes less space than storing the storage addresses themselves, and in-bucket offsets further reduce the space occupied by address identifiers.
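The address arithmetic of this embodiment is sketched below. The function name `content_span` and the use of the next bucket's start offset to close the last entry of a bucket are assumptions made for the example; the text itself only states that the next entry's offset marks the end.

```python
def content_span(data_area_start, start_offset, in_bucket_offsets, position, next_start_offset):
    """Return (begin, end) storage addresses of one compressed content.

    data_area_start   -- start storage address of the data area
    start_offset      -- start offset stored for the data bucket
    in_bucket_offsets -- in-bucket offsets, one per compressed identifier
    position          -- position of the wanted identifier within the bucket
    next_start_offset -- start offset of the following bucket (assumed to be
                         available, used only for the bucket's last entry)
    """
    begin = data_area_start + start_offset + in_bucket_offsets[position]
    if position + 1 < len(in_bucket_offsets):
        # The next entry's offset marks where this content ends.
        end = data_area_start + start_offset + in_bucket_offsets[position + 1]
    else:
        end = data_area_start + next_start_offset
    return begin, end

span = content_span(1000, 40, [0, 5, 12], 1, 60)
```

With a data area starting at address 1000, a bucket start offset of 40, and in-bucket offsets 0, 5 and 12, the second content occupies addresses 1045 up to 1052.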
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area. Step 206 includes: storing, in the index area, the start offset of the compressed data content corresponding to the data bucket and the data length of each compressed data content corresponding to the data bucket.
In this embodiment, the index area stores the start offset of the compressed data content corresponding to the data bucket and the data length of each compressed data content corresponding to the data bucket. Because a data length takes less storage than an offset, the data volume can be compressed further. When querying, it is necessary to obtain the start offset and the data lengths corresponding to all compressed data identifiers in the data bucket up to and including the compressed query data identifier. Because the data is stored contiguously, the sum of the data lengths corresponding to all compressed data identifiers before the compressed query data identifier (not including itself) is the in-bucket offset of the compressed query data identifier. The sum of the in-bucket offset and the start offset gives the storage position, relative to the start storage address of the data area, of the compressed data content corresponding to the compressed query data identifier, that is, the position at which the compressed data content begins. The end position of the compressed data content is then obtained from the data length of the compressed query data identifier itself.
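The length-based variant replaces stored in-bucket offsets with a running sum. A minimal sketch, with the hypothetical name `span_from_lengths`:

```python
def span_from_lengths(start_offset, lengths, position):
    """Return (begin, end) offsets of one content within the data area.

    start_offset -- start offset stored for the data bucket
    lengths      -- data length of each compressed content in the bucket
    position     -- position of the wanted identifier within the bucket
    """
    # Lengths of all earlier entries (excluding this one) give the in-bucket offset.
    in_bucket_offset = sum(lengths[:position])
    begin = start_offset + in_bucket_offset
    # The entry's own length gives the end position.
    return begin, begin + lengths[position]

span = span_from_lengths(40, [5, 7, 8], 1)
```

The trade-off is visible in the code: each lookup sums the lengths of the preceding entries in the bucket, so the index is smaller but the query does a little more arithmetic.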
In one embodiment, the compressed data identifiers in a data bucket of the index area of the data block are stored in ascending or descending order of their numerical values.
In this embodiment, the compressed data identifiers in a data bucket are stored in ascending or descending order of their numerical values, so a compressed data identifier can be located quickly by binary search, which substantially increases the speed of querying data.
Specifically, binary search is a search algorithm that finds a particular element in a sorted array. The search starts at the middle element of the array. If the middle element is the element being sought, the search ends. If the sought element is greater than or less than the middle element, the search continues in the half of the array that is greater than or less than the middle element, again starting the comparison from the middle element of that half. If the array is empty at some step, the element cannot be found. Each comparison halves the search range, which greatly improves the efficiency of querying data.
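The binary search just described can be sketched as follows, assuming the compressed data identifiers of one bucket are held in an ascending Python list; the function name is illustrative.

```python
def binary_search(sorted_ids, target):
    """Return the position of `target` in ascending `sorted_ids`, or -1."""
    lo, hi = 0, len(sorted_ids) - 1
    while lo <= hi:                 # the remaining range is not empty
        mid = (lo + hi) // 2        # middle element of the remaining range
        if sorted_ids[mid] == target:
            return mid
        if sorted_ids[mid] < target:
            lo = mid + 1            # continue in the upper half
        else:
            hi = mid - 1            # continue in the lower half
    return -1                       # range became empty: not found


position = binary_search([3, 17, 42, 200], 42)
```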
As shown in figure 3, in one embodiment, step 104 includes:
Step 302: divide the original data content of the data to be stored into data units to obtain a set of data units, and calculate the frequency of occurrence of each data unit in the set.
The division into data units can be determined according to actual needs: the original data content itself can serve as a single data unit, or it can be divided into multiple data units. For example, if the data content is an IP address A.B.C.D, the IP address can be represented with 4 bytes, one byte each for A, B, C and D, and divided into the four data units A, B, C and D.
Step 304: assign a data unit code to each data unit according to its frequency of occurrence, and encode the original data content of the data to be stored corresponding to the data block, data unit by data unit, according to the correspondence between data units and data unit codes, to obtain the compressed data content.
The frequency of occurrence of each data unit is calculated; data units with a high frequency are assigned shorter codes and data units with a low frequency are assigned longer codes. The original data content of the data to be stored corresponding to the data block is then variable-length encoded, data unit by data unit, according to the correspondence between data units and data unit codes, to obtain the compressed data content.
Step 306: store the compressed data content and the coding table recording the correspondence between data units and data unit codes in the data area of the data block, and obtain the address identifier of the compressed data content.
The compressed data content and the coding table are stored in the data area of the data block. The coding table records the correspondence between data units and data unit codes, so the compressed data content can be decoded according to the coding table when data is queried, yielding the original data content.
In this embodiment, the original data content of the data to be stored is divided into data units and variable-length encoded according to the frequency of occurrence of the data units before being stored, which greatly reduces the storage space required for the original data. Each compressed data content has a corresponding address identifier, so the storage address of the compressed data content can be located quickly through the address identifier when data is queried, balancing data volume against query efficiency.
In one embodiment, the step of assigning a data unit code to each data unit according to its frequency of occurrence includes: constructing a Huffman tree according to the frequencies of occurrence of the data units, and generating the data unit codes according to the paths from the root node of the Huffman tree to the leaf nodes. The method further includes: storing the Huffman tree in the data area of the data block.
Specifically, each data unit corresponds to a leaf node of the Huffman tree. The data units are sorted by frequency of occurrence, and the two leaf nodes with the lowest frequency are merged into a new node whose frequency is the sum of the frequencies of the two data units. The two merged leaf nodes are then set aside, and the two nodes with the lowest frequency among the remaining leaf nodes and the new node are merged in turn, and so on until everything has been merged into the root node of the Huffman tree. The left subtree of each node of the Huffman tree is defined as 0 and the right subtree as 1 (the left subtree could equally be defined as 1 and the right subtree as 0; this only illustrates the principle), so that a binary data unit code is generated from the path from the root node to each leaf node. The Huffman tree can be stored in the data area of the data block together with the compressed data content and the coding table.
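The construction described above can be sketched with a priority queue. The tuple-based node layout `(unit, left, right)`, the function names and the use of IP address bytes as data units are assumptions made for illustration; the patent does not prescribe a data structure.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(units):
    """Repeatedly merge the two lowest-frequency nodes until one root is left."""
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(freq, next(tie), (unit, None, None))
            for unit, freq in Counter(units).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (None, n1, n2)))
    return heap[0][2]


def assign_codes(node, prefix="", table=None):
    """Left edge = '0', right edge = '1'; a leaf's code is its root path."""
    if table is None:
        table = {}
    unit, left, right = node
    if left is None:                      # leaf node
        table[unit] = prefix or "0"       # single-unit edge case
    else:
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table


# Data units: the bytes of 192.168.10.11, 192.168.11.9 and 192.168.9.11.
units = [192, 168, 10, 11, 192, 168, 11, 9, 192, 168, 9, 11]
codes = assign_codes(build_huffman_tree(units))
encoded = "".join(codes[u] for u in units)
```

The resulting `codes` dictionary plays the role of the coding table: frequent bytes such as 192 and 168 receive codes no longer than rare ones such as 10.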
When the compressed data content is decoded with the Huffman tree, it can be treated as a bit stream and decoded bit by bit. Specifically, following the definition used when the Huffman tree was constructed, decoding starts from the root node of the Huffman tree, moves to the left child node on a 0 and to the right child node on a 1, and continues until a leaf node is reached. The binary code along the path from the root node to that leaf node is one data unit code, and its length equals the depth of the leaf node in the Huffman tree. The data unit corresponding to the data unit code is then obtained from the coding table, and bit-by-bit decoding continues from the root node of the Huffman tree until data of a preset form is obtained or decoding is complete.
If the original data content has a fixed form, data of a preset form can be specified, and decoding ends as soon as the data of that preset form has been obtained. The required content is thus obtained without decoding the complete compressed data content, which improves the efficiency of querying data.
In this embodiment, the original data content is compressed with Huffman coding, which greatly reduces the storage space required for the compressed data content.
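Bit-by-bit decoding with early termination can be sketched as below. The hand-built tree (codes a = 0, b = 10, c = 11), the `(unit, left, right)` node layout and the `max_units` parameter are assumptions for illustration; here "data of a preset form" is interpreted simply as a preset number of leading data units.

```python
def decode_bitwise(root, bits, max_units=None):
    """Walk the tree bit by bit: '0' goes left, '1' goes right.

    Stops early once `max_units` data units have been decoded, so the
    rest of the compressed content is never touched.
    """
    out = []
    node = root
    for bit in bits:
        _, left, right = node
        node = left if bit == "0" else right
        if node[1] is None:          # reached a leaf: one data unit decoded
            out.append(node[0])
            node = root              # restart from the root for the next code
            if max_units is not None and len(out) == max_units:
                break
    return out


# Hypothetical tree: a = 0, b = 10, c = 11.
tree = (None, ("a", None, None),
              (None, ("b", None, None), ("c", None, None)))
full = decode_bitwise(tree, "010110")                   # all units
head = decode_bitwise(tree, "010110", max_units=2)      # only the first two
```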
As shown in Fig. 4, in one embodiment, the data storage method further includes a step of generating a decoding acceleration table, including:
Step 402: construct a decoding acceleration table from data unit codes whose data length is a preset length, and map each data unit code in the decoding acceleration table to the leaf node in the Huffman tree that corresponds to that data unit code.
Data that has been Huffman encoded generally has to be decoded bit by bit, which is relatively inefficient, so the conventional decoding approach cannot meet the performance requirements of a high-performance query service; a decoding acceleration table needs to be built to achieve fast decoding. Specifically, the decoding acceleration table is constructed from data unit codes whose data length is the preset length, and each data unit code in the decoding acceleration table is first mapped to the corresponding leaf node in the Huffman tree.
Step 404: map each data unit code in the decoding acceleration table that is identical to the preset-length prefix of a long data unit code to the node in the Huffman tree corresponding to that prefix of the long data unit code.
Here, a long data unit code is a data unit code whose data length is greater than the preset length. For a long data unit code, the data unit code in the decoding acceleration table that is identical to the preset-length prefix of the long data unit code is mapped to the node in the Huffman tree corresponding to that prefix.
Step 406: map each data unit code in the decoding acceleration table whose prefix is identical to a short data unit code to the leaf node in the Huffman tree corresponding to that short data unit code.
Here, a short data unit code is a data unit code whose data length is less than the preset length. For a short data unit code, every data unit code in the decoding acceleration table whose prefix is identical to the short data unit code is mapped to the leaf node in the Huffman tree corresponding to the short data unit code.
Step 408: store the decoding acceleration table in the data area of the data block.
When the compressed data content is decoded, the decoding acceleration table is obtained from the data area of the data block. A read pointer can be used to take data of the preset length from the compressed data content, and that data is mapped to a node in the Huffman tree according to the decoding acceleration table. If the mapped node is a leaf node, the data of the preset length is either a data unit code of the preset length or begins with a short data unit code, and the corresponding data unit can be obtained directly from the coding table. The read pointer is then moved, in the direction away from the starting storage address of the compressed data content, by a distance corresponding to the depth of the leaf node, that is, by the length of the data unit code corresponding to the leaf node. Data of the preset length is then taken and decoded again, until the original data content, or the partial data of a preset form in the original data content, has been decoded.
If the mapped node is a non-leaf node, the read pointer is moved, in the direction away from the starting storage address of the compressed query data content, by a distance corresponding to the preset length. Starting from the node to which the coded data is mapped, the Huffman tree is then used to advance the read pointer and decode bit by bit until a leaf node is reached and the data unit corresponding to that leaf node is obtained. Data of the preset length continues to be read in this way until the query data content, or the partial data of a preset form in the query data content, has been decoded.
In this embodiment, a decoding acceleration table is generated and stored in the data area of the data block. During decoding, data of the preset length can be taken from the compressed data content in one step and decoded, which avoids bit-by-bit decoding as far as possible and improves decoding efficiency.
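Steps 402 to 408 and the table-driven decoding can be sketched together. The sketch assumes bit strings instead of packed bytes, a tiny preset length of 2 bits (the example later in the text uses 16), a `(unit, left, right)` node layout, a tree with at least two leaves, and zero padding of the final chunk; all names are illustrative.

```python
def walk(root, pattern):
    """Follow `pattern` from the root, stopping early at a leaf.

    Returns (node, depth): a leaf with depth <= len(pattern) for short or
    exact-length codes, or an internal node with depth == len(pattern)
    when the pattern is only the prefix of a long code.
    """
    node, depth = root, 0
    for bit in pattern:
        if node[1] is None:
            break
        node = node[1] if bit == "0" else node[2]
        depth += 1
    return node, depth


def build_accel_table(root, k):
    """Map every k-bit pattern to the Huffman node it leads to."""
    patterns = (format(i, "0{}b".format(k)) for i in range(2 ** k))
    return {p: walk(root, p) for p in patterns}


def decode_with_table(table, k, bits):
    out, pos = [], 0
    while pos < len(bits):
        chunk = bits[pos:pos + k].ljust(k, "0")   # pad the final chunk
        node, depth = table[chunk]
        pos += depth                  # leaf: code length; internal: k bits
        while node[1] is not None:    # long code: finish bit by bit
            node = node[1] if bits[pos] == "0" else node[2]
            pos += 1
        out.append(node[0])
    return out


# Hypothetical tree: a = 0, b = 10, c = 110, d = 111.
tree = (None, ("a", None, None),
              (None, ("b", None, None),
                     (None, ("c", None, None), ("d", None, None))))
table = build_accel_table(tree, 2)
units = decode_with_table(table, 2, "0101101110")
```

With a 2-bit table, the codes for a and b are resolved by a single lookup, while c and d need one extra bit-by-bit step after the lookup lands on an internal node.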
The principle of the above data storage method is illustrated below with a concrete application scenario in which the data storage method is applied to a computer. The data to be stored includes accounts and the IP address corresponding to each account; the account is the original data identifier and the IP address is the original data content, as shown in the following table:
Account    IP address
0000001    192.168.10.11
0000002    192.168.11.9
0000003    192.168.9.11
...        ...
Table 1: Data to be stored
Here, 32-bit unsigned integers are used for the account and the offset by way of illustration. Without compression, each account needs 8 bytes of storage: 4 bytes to store the account and 4 bytes to store the offset, in the data area, of the compressed IP address corresponding to the account.
As shown in Fig. 5, accounts are first divided into 256 classes by their low 8 bits, each class corresponding to one data block 520, so that 256 data blocks 520 are generated in total. The low 8 bits (in binary) of the accounts corresponding to a data block 520 are the data block identifier of that data block. A data block 520 includes an index area 522 and a data area 524. The index area 522 includes 65536 (that is, 2^16) data buckets, the data bucket identifier is the high 16 bits of the account, and a pointer array of length 65536 is allocated to record the start position of each data bucket. Each data bucket then only needs to store bits 8 to 15 of the account, which compresses the account. The index area also stores the start offset of each data bucket and the in-bucket offset corresponding to each compressed account in the data bucket. The data area 524 stores the compressed IP addresses; the storage location of a compressed IP address in the data area 524 can be located from the start offset of the data bucket in the index area and the in-bucket offset corresponding to the compressed account.
In this application scenario, the additional storage overhead of one data block is only the 65536 × 4 bytes = 256 KB occupied by the pointer array, for a total storage overhead of 256 KB × 256 = 64 MB, while compressing the accounts saves N × 3 bytes of storage space, where N is the number of accounts. The more accounts there are, the more pronounced the compression effect.
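The bit layout and the overhead arithmetic of this scenario can be checked with a short sketch. The function and constant names are illustrative; the break-even figure follows directly from the 64 MB overhead and the 3 bytes saved per account stated above.

```python
def split_account(account):
    """Split a 32-bit account into the three parts used by the scenario."""
    block_id = account & 0xFF               # low 8 bits: data block identifier
    bucket_id = (account >> 16) & 0xFFFF    # high 16 bits: data bucket identifier
    compressed_id = (account >> 8) & 0xFF   # bits 8 to 15: stored in the bucket
    return block_id, bucket_id, compressed_id


POINTER_BYTES = 4
PER_BLOCK_OVERHEAD = 65536 * POINTER_BYTES      # pointer array of one block
TOTAL_OVERHEAD = PER_BLOCK_OVERHEAD * 256       # all 256 data blocks
SAVED_PER_ACCOUNT = 4 - 1                       # 4-byte account stored as 1 byte


def net_saving(n_accounts):
    """Bytes saved on accounts minus the pointer-array overhead."""
    return n_accounts * SAVED_PER_ACCOUNT - TOTAL_OVERHEAD


# Smallest number of accounts for which compression pays off.
break_even = TOTAL_OVERHEAD // SAVED_PER_ACCOUNT + 1
parts = split_account(0x12345678)
```

Under these figures the scheme pays off once the number of accounts exceeds roughly 22.4 million.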
As shown in Fig. 6, in one embodiment, a data query method is provided for querying data stored by the above data storage method. The method includes:
Step 602: obtain an original query data identifier.
The original query data identifier can be input by a user or obtained from a third-party application, and is used to look up, in the stored data, the query data content corresponding to the original query data identifier.
Step 604: determine the data block corresponding to the original query data identifier according to the original query data identifier and the pre-stored data block identifiers.
Data block identifiers distinguish different data blocks and are generated from data identifiers. The data block identifier that matches the original query data identifier can therefore be determined from the original query data identifier, which in turn determines the data block corresponding to the original query data identifier. This is the data block that stores the compressed query data content corresponding to the original query data identifier.
Step 606: obtain the address identifier corresponding to the original query data identifier according to the index in the index area of the data block.
The index area of the data block stores an index generated from data identifiers and address identifiers; the address identifier corresponding to the original query data identifier can be obtained from this index.
Step 608: obtain the compressed query data content from the data area of the data block according to the address identifier, and decompress the compressed query data content to obtain the query data content.
The address identifier indicates the storage location of the compressed query data content in the data area of the data block. The compressed query data content can therefore be obtained from the data area of the data block according to the address identifier, and decoding it yields the query data content corresponding to the original query data identifier.
In the above data query method, after the original query data identifier is obtained, the data block corresponding to it, that is, the data block in which the queried data is located, can be determined from the original query data identifier and the pre-stored data block identifiers. The address identifier corresponding to the original query data identifier is obtained from the index area of that data block, the compressed query data content is obtained from the data area of the data block according to the address identifier, and the query data content is obtained by decompressing the compressed query data content. The data can thus be located rapidly and retrieved quickly without decompressing all of the compressed data, which greatly improves the efficiency of data use.
In one embodiment, step 604 includes: determining the data block corresponding to the data block identifier that matches the low-order part, of a first preset number of bits, of the original query data identifier.
Specifically, when data is stored, the low-order part of the first preset number of bits of the data identifier is used as the data block identifier. When data is queried, the low-order part of the first preset number of bits of the original query data identifier can therefore be extracted and compared with the pre-stored data block identifiers. If a data block identifier matches the low-order part, the data block corresponding to that data block identifier is the data block that stores the compressed query data content.
In this embodiment, the data block that stores the compressed query data content is determined from the low-order part of the first preset number of bits of the original query data identifier, which is simple to compute and efficient.
In one embodiment, step 606 includes: truncating the low-order part of the original query data identifier to obtain a compressed query data identifier, and obtaining the address identifier corresponding to the compressed query data identifier stored in the index area of the data block.
Specifically, the index area of the data block stores the correspondence between address identifiers and compressed data identifiers from which the low-order part has been truncated. At query time, the low-order part of the original query data identifier is truncated to obtain the compressed query data identifier, and the address identifier corresponding to the compressed query data identifier is then obtained from the index area of the data block.
In this embodiment, data identifiers are compressed when stored, which saves storage space, and a query only requires truncating the low-order part of the original query data identifier, so query efficiency is not affected. Both data volume and the efficiency of data use are taken into account.
As shown in Fig. 7, in one embodiment, the step of truncating the low-order part of the original query data identifier to obtain the compressed query data identifier and obtaining the address identifier corresponding to the compressed query data identifier stored in the index area of the data block includes:
Step 702: determine the data bucket identifier that matches the high-order part, of a second preset number of bits, of the original query data identifier.
Specifically, the index area of the data block includes multiple data buckets, and when data is stored, the high-order part of the second preset number of bits of the data identifier is used as the data bucket identifier. When data is queried, the data bucket identifier that matches the high-order part of the second preset number of bits of the original query data identifier is therefore determined in the index area of the data block; the data bucket corresponding to the matching data bucket identifier is the data bucket that stores the compressed query data identifier. The sum of the first preset number of bits and the second preset number of bits is less than or equal to the total number of bits of the original query data identifier, which guarantees the validity of the query.
Step 704: obtain, from the index area of the data block, the bucket starting storage address in the index area of the data bucket corresponding to the data bucket identifier.
When data is stored, the index area of the data block stores the correspondence between each data bucket identifier and the bucket starting storage address of the data bucket in the index area. When data is queried, the bucket starting storage address in the index area of the data bucket corresponding to the matching data bucket identifier can therefore be obtained from the index area of the determined data block. Once the bucket starting storage address is obtained, the data stored in the data bucket can be read by accessing that storage address.
Step 706: truncate the high-order part and the low-order part of the original query data identifier to obtain the compressed query data identifier.
In this embodiment, when a data identifier is stored, its high-order part and low-order part are truncated to obtain the compressed data identifier, which is then stored. When data is queried, the high-order part and the low-order part of the original query data identifier therefore need to be truncated to obtain the compressed query data identifier.
Step 708: starting from the bucket starting storage address of the data bucket in the index area, search the data bucket for the compressed data identifier that matches the compressed query data identifier.
Once the bucket starting storage address of the data bucket in the index area is obtained, the search for the compressed data identifier that matches the compressed query data identifier can begin from that address. In addition, the end position of the queried data bucket can be determined from the bucket starting storage address of the next data bucket adjacent to it, so that the search does not run on without stopping when the data bucket does not contain the compressed query data identifier.
Step 710: obtain, from the index area, the address identifier corresponding to the matched compressed data identifier.
The index area of the data block stores the correspondence between compressed data identifiers and address identifiers, so the address identifier corresponding to the matched compressed data identifier can be obtained from the index area. The address identifier is used to determine the storage location of the compressed query data content in the data area of the data block.
In this embodiment, data identifiers are compressed when data is stored, which saves a large amount of storage space. When data is queried, the data block containing the compressed query data content can be determined quickly from the original query data identifier, the address identifier is obtained from the index area of that data block, and the compressed query data content stored in the data area of the data block is obtained according to the address identifier. Queries are fast and do not require decompressing all of the compressed data, which greatly improves the efficiency of data use.
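Steps 702 to 710 can be sketched for a single data block as follows. The in-memory layout (a bucket start array with one extra sentinel entry, plus parallel lists of compressed identifiers and address identifiers) and the function names are assumptions for illustration; the search inside a bucket is bounded by the start of the next bucket, as described for step 708.

```python
from bisect import bisect_left


def build_index(entries, bucket_count):
    """entries: iterable of (bucket_id, compressed_id, address_id).

    Returns (bucket_start, ids, addrs) where bucket b occupies
    positions bucket_start[b] .. bucket_start[b + 1] - 1.
    """
    entries = sorted(entries)                 # by bucket, then by compressed id
    bucket_start = [0] * (bucket_count + 1)   # extra sentinel closes last bucket
    for bucket_id, _, _ in entries:
        bucket_start[bucket_id + 1] += 1
    for b in range(bucket_count):
        bucket_start[b + 1] += bucket_start[b]
    ids = [c for _, c, _ in entries]
    addrs = [a for _, _, a in entries]
    return bucket_start, ids, addrs


def lookup(index, bucket_id, compressed_id):
    """Return the address identifier, or None if the identifier is absent."""
    bucket_start, ids, addrs = index
    lo = bucket_start[bucket_id]          # bucket starting position
    hi = bucket_start[bucket_id + 1]      # start of the next bucket = end
    pos = bisect_left(ids, compressed_id, lo, hi)   # binary search in bucket
    if pos < hi and ids[pos] == compressed_id:
        return addrs[pos]
    return None


# Hypothetical block with 4 buckets instead of 65536.
index = build_index([(2, 0x56, 100), (2, 0x10, 40), (0, 0x01, 0), (3, 0x56, 7)], 4)
address = lookup(index, 2, 0x56)
```

Bounding the search by the next bucket's start is what lets a lookup in an empty bucket, or for a missing identifier, terminate immediately.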
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the starting storage address of the data area. Step 710 includes: obtaining, from the index area, the start offset corresponding to the data bucket and the in-bucket offset corresponding to the matched compressed data identifier; and calculating the offset of the query data content from the start offset and the in-bucket offset.
In this embodiment, the offset of the storage address of the compressed data content in the data area relative to the starting storage address of the data area is used as the address identifier of the compressed data content. The starting storage address is the storage address at which data begins to be stored in the data area. Because data is stored contiguously, obtaining the starting storage address and the offset corresponding to the matched compressed data identifier gives the actual storage address of the matched compressed data content, that is, the actual storage address of the compressed query data content. The offset of the compressed data content corresponding to the next compressed data identifier adjacent to the matched compressed data identifier gives the storage end position of the compressed data content corresponding to the matched compressed data identifier, so no data length needs to be stored separately and little storage space is occupied.
The start offset is the offset, relative to the above starting storage address, of the storage address at which the compressed data content corresponding to the compressed data identifiers in a data bucket begins in the data area. The in-bucket offset is the offset of each compressed data content stored for the data bucket relative to that start offset. When the offset is calculated, the sum of the start offset and the in-bucket offset corresponding to the compressed query data identifier is the actual offset of the compressed query data identifier, and the sum of the actual offset and the starting storage address is the actual storage address of the compressed data content corresponding to the compressed query data identifier. In this embodiment, storing an offset already takes less space than storing the storage address itself, and using in-bucket offsets further reduces the space occupied by the address identifiers while having little effect on query efficiency.
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the starting storage address of the data area. Step 710 includes: obtaining, from the index area, the start offset corresponding to the data bucket and the data lengths of the matched compressed data identifier and of all compressed data identifiers that precede it in the data bucket; and calculating the offset of the query data content from the start offset and the obtained data lengths.
In this embodiment, the index area stores the start offset of the compressed data content corresponding to the data bucket and the data length of each compressed data content corresponding to the data bucket. Because a data length occupies less storage space than an offset, the data volume can be compressed further. When data is queried, the start offset must be obtained together with the data lengths corresponding to all compressed data identifiers in the data bucket up to and including the compressed query data identifier. Because data is stored contiguously, the sum of the data lengths corresponding to all compressed data identifiers that precede the compressed query data identifier in the data bucket (excluding itself) is the in-bucket offset of the compressed query data identifier. The sum of the in-bucket offset and the start offset locates the storage address of the compressed data content corresponding to the compressed query data identifier (as an offset from the starting storage address of the data area), that is, the address at which the compressed data content begins. The end address of the compressed data content is then obtained from the data length of the compressed query data identifier itself.
In this embodiment, recording the data length of each compressed data content instead of the in-bucket offset achieves a higher compression ratio and occupies less storage space. Although a query takes more time than with in-bucket offsets, recording data lengths is the better choice when storage resources are tight.
In one embodiment, the compressed data identifiers in a data bucket of the index area of the data block are stored in ascending or descending order of their numerical values; the step of searching the data bucket for the compressed data identifier that matches the compressed query data identifier includes: using binary search to find, in the data bucket, the compressed data identifier that matches the compressed query data identifier.
In this embodiment, the compressed data identifiers in a data bucket are stored in ascending or descending order of their numerical values, so a compressed data identifier can be located quickly by binary search, which substantially increases the speed of querying data.
Specifically, binary search is a search algorithm that finds a particular element in a sorted array. The search starts at the middle element of the array. If the middle element is the element being sought, the search ends. If the sought element is greater than or less than the middle element, the search continues in the half of the array that is greater than or less than the middle element, again starting the comparison from the middle element of that half. If the array is empty at some step, the element cannot be found. Each comparison halves the search range, which greatly improves the efficiency of querying data.
In one embodiment, step 608 includes: obtaining, from the data area, the coding table recording the correspondence between data units and data unit codes; decoding the compressed query data content, which consists of data unit codes, according to the coding table to obtain data units; and obtaining the query data content from the data units.
The coding table recording the correspondence between data units and data unit codes is stored in the data area of the data block. The coding table is obtained from the data area, the compressed query data content is decoded according to the coding table to obtain data units, and the obtained data units form the query data content.
In one embodiment, the step of obtaining the coding table from the data area, decoding the compressed query data content according to the coding table to obtain data units, and obtaining the query data content from the data units includes: obtaining, from the data area, the Huffman tree and the coding table recording the correspondence between data units and data unit codes; dividing the compressed query data content into data unit codes according to the Huffman tree, and obtaining the data unit corresponding to each data unit code according to the coding table; and obtaining the query data content from the data units.
When the compressed query data content is decoded with the Huffman tree, it can be treated as a bit stream and decoded bit by bit. Specifically, following the definition used when the Huffman tree was constructed, in which the left subtree of a node is 0 and the right subtree is 1, decoding starts from the root node of the Huffman tree, moves to the left child node on a 0 and to the right child node on a 1, and continues until a leaf node is reached. The binary code along the path from the root node to that leaf node is one data unit code, and its length equals the depth of the leaf node in the Huffman tree. The data unit corresponding to the data unit code is then obtained from the coding table, and bit-by-bit decoding continues from the root node of the Huffman tree until data of a preset form is obtained or decoding is complete.
In this embodiment, the original data content is compressed with Huffman coding, which greatly reduces the storage space required for the compressed data content.
In one embodiment, the step of obtaining the Huffman tree and the coding table from the data area, dividing the compressed query data content into data unit codes according to the Huffman tree, obtaining the data unit corresponding to each data unit code according to the coding table, and obtaining the query data content from the data units includes steps 1) to 4):
1) Obtain, from the data area, the Huffman tree, the decoding acceleration table, and the coding table recording the correspondence between data units and data unit codes.
Here, the decoding acceleration table includes multiple data unit codes of a preset length, and each data unit code in the decoding acceleration table is mapped to the leaf node in the Huffman tree that corresponds to that data unit code. A data unit code in the decoding acceleration table that is identical to the preset-length prefix of a long data unit code is mapped to the node in the Huffman tree corresponding to that prefix of the long data unit code; a long data unit code is a data unit code whose data length is greater than the preset length. A data unit code in the decoding acceleration table whose prefix is identical to a short data unit code is mapped to the leaf node in the Huffman tree corresponding to that short data unit code; a short data unit code is a data unit code whose data length is less than the preset length.
2)The coding of preset length is read since the initial memory address of Compressed text search data content using reading pointer Data, the node being mapped to coded data using decoding accelerometer in Huffman tree, and judge whether the node of mapping is leaf Child node.
3)If the node of coded data mapping is leaf node, is decoded according to coding schedule and obtain the corresponding number of leaf node According to unit, and by reading pointer support or oppose the initial memory address from Compressed text search data content direction movement with leaf node After the corresponding distance of depth, continue to read the coded data of preset length, until decoding obtains inquiry data content or inquiry number According to the partial data of presets in content.
4)If the node of coded data mapping is non-leaf nodes, reading pointer is supported or opposed from Compressed text search data content Initial memory address direction move corresponding with preset length distance after, the section that is mapped using Huffman tree from coded data Point starts to move reading pointer by turn and decode until leaf node, obtains the corresponding data cell of leaf node, continues to read The coded data of preset length, until decoding obtains inquiry data content or inquires the part number of presets in data content According to.
For example, as shown in Fig. 8, in Huffman tree 802 the left subtree of a non-leaf node is 0 and the right subtree is 1. When the bit stream 010010... is decoded bit by bit using Huffman tree 802, a 0 means turning left and a 1 means turning right. As shown in Fig. 8, starting from the root node, the path turns left, right, left, left, right, left and finally reaches a leaf node, and the data unit corresponding to 010010 can be obtained from the coding table.
When decoding acceleration table 804 is used for decoding, each data unit code in the decoding acceleration table is 16 bits long. The read pointer is used to read 16 bits of coded data, the coded data is mapped to a node in Huffman tree 802 using decoding acceleration table 804, and it is judged whether the mapped node is a leaf node.
If the node to which the coded data maps is a leaf node, there are two cases: either the coded data is a 16-bit data unit code, or a prefix of the 16-bit coded data is a short data unit code of fewer than 16 bits. In both cases the data unit can be obtained directly from the data unit code corresponding to the mapped leaf node, and the read pointer is moved by the distance corresponding to the depth of that node (for example, if the depth of the node is 16, the read pointer is moved by 16 bits, because these 16 bits of data have been decoded). Reading of 16-bit coded data then continues, and decoding stops when it is complete or when the data units obtained constitute data in the preset form.
If the node to which the coded data maps is a non-leaf node, the coded data is the prefix part of a long data unit code. After the read pointer is moved by 16 bits, bit-by-bit decoding continues using Huffman tree 802, and one data unit is obtained when a leaf node is reached. Reading and decoding of 16-bit coded data then continues from the current read pointer, and decoding stops when it is complete or when the data units obtained constitute data in the preset form.
In this embodiment, a decoding acceleration table is generated and stored in the data area of the data block. During decoding, data of the preset length can be taken from the compressed data content in one read and decoded, which avoids bit-by-bit decoding as far as possible and improves decoding efficiency.
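The table-driven decoding described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the coding table `CODES`, the preset length `K = 3` (the example above uses 16 bits), the nested-dict tree layout, and all function names are hypothetical and chosen only to keep the example small.

```python
K = 3  # assumed preset length in bits; the embodiment's example uses 16

# Hypothetical prefix-free coding table: data unit -> data unit code.
CODES = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}


def build_tree(codes):
    """Build the tree: internal nodes are dicts keyed by '0'/'1', leaves are data units."""
    root = {}
    for unit, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = unit
    return root


def walk(node, bits):
    """Follow bits from node, stopping at a leaf; return (node_reached, bits_consumed)."""
    used = 0
    for bit in bits:
        if not isinstance(node, dict):
            break
        node = node[bit]
        used += 1
    return node, used


def build_accel_table(root, k):
    """Map every k-bit pattern to the node it reaches and the depth of that node."""
    return {format(v, f"0{k}b"): walk(root, format(v, f"0{k}b")) for v in range(2 ** k)}


def decode(bits, root, table, k):
    out, pos = [], 0
    while pos < len(bits):
        chunk = bits[pos:pos + k]
        # A final chunk shorter than k is not in the table, so walk the tree for it.
        node, used = table[chunk] if len(chunk) == k else walk(root, chunk)
        pos += used  # leaf: advance by its depth; non-leaf: advance by k
        while isinstance(node, dict):  # prefix of a long code: continue bit by bit
            if pos >= len(bits):
                raise ValueError("bit stream ends inside a data unit code")
            node = node[bits[pos]]
            pos += 1
        out.append(node)
    return "".join(out)
```

In this sketch a table entry that reaches a leaf covers both the exact-length and the short-code cases, since the stored depth tells the decoder how far to advance the read pointer; only codes longer than `K` fall back to bit-by-bit traversal.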
As shown in Fig. 9, in one embodiment, a data storage device is provided. The data to be stored includes an original data identifier and original data content corresponding to the original data identifier. The device includes a data block generation module 902, an original data content compression module 904, and an index generation module 906.
The data block generation module 902 is configured to cluster the data to be stored according to the original data identifiers, with each class corresponding to one data block, and to generate a data block identifier from the original data identifiers of the data to be stored that corresponds to the data block.
The data to be stored includes an original data identifier and original data content corresponding to the original data identifier. "Original" indicates that the data is in its original state, uncompressed and not otherwise processed. The original data identifier can be the key of the data to be stored and can be used to distinguish different data to be stored. The data block generation module 902 can cluster the data to be stored so that original data identifiers sharing a common part are stored together. The common part of the original data identifiers is extracted and no longer stored repeatedly, so only the non-common part needs to be stored when an original data identifier is stored, which achieves fixed-length compression of the original data identifiers. Data compressed to a fixed length still supports random access and is therefore easy to look up.
After the data to be stored is clustered according to the original data identifiers, each class corresponds to one data block, and the data block identifier is generated from the original data identifiers of the data to be stored that corresponds to the data block. In this way, when data is queried, the data block in which the data is located can be determined from the query data identifier, which achieves fast lookup.
The original data content compression module 904 is configured to compress the original data content of the data to be stored that corresponds to the data block, store the compressed data content in the data area of the data block, and obtain the address identifier of the compressed data content.
To obtain a larger compression ratio, the original data content compression module 904 may compress the original data content using variable-length coding. Variable-length coding can reach a very high compression ratio, but because the code length is not fixed, the storage address of the compressed data is difficult to predict directly from the data structure, and it is generally necessary to decompress all the compressed data before the required data can be found. The address identifier identifies the storage location of the compressed data content. Obtaining the address identifier when the compressed data content is stored makes it possible to locate the compressed data content directly from the address identifier, without decoding all the compressed data.
The index generation module 906 is configured to generate an index from the original data identifiers of the data to be stored that corresponds to the data block and the address identifiers, and to store the index in the index area of the data block.
The index generation module 906 can generate the index from the original data identifiers of the data to be stored that corresponds to the data block and the address identifiers of the compressed data content, and store the index in the index area of the data block. This makes it possible to obtain, from an original data identifier, the address identifier of the corresponding compressed data content, and then to obtain the compressed data content from the address identifier, so that data can be queried quickly.
In the above data storage device, after the data to be stored is clustered, each class of data to be stored corresponds to one data block, and the data block identifier is generated from the original data identifiers of the data to be stored that corresponds to the data block. The data block includes an index area and a data area. The data area stores the compressed data content, and the index area stores the index generated from the original data identifiers and the address identifiers of the compressed data content in the data area. When data is queried, the data block containing the queried data can be determined directly from the query data identifier, the address identifier of the query data content is obtained from the index in the data block, the compressed query data content is obtained from the data area according to the address identifier, and the query data content is obtained after decompression. This achieves a good compression effect and reduces the space occupied by the stored data, while still allowing the storage location to be found quickly, so that data can be queried without decompressing all the compressed data, which greatly improves the efficiency of using the data.
In one embodiment, the data block generation module 902 is further configured to treat data to be stored whose original data identifiers have the same low-order part of a first preset number of bits as one class, with each class corresponding to one data block, and to use the low-order part as the data block identifier of the data block.
In this embodiment, data to be stored with the same low-order part of the first preset number of bits is treated as one class, each class corresponds to one data block, and the low-order part is used as the data block identifier of the data block. When data is queried, the data block containing the queried data can be determined from the low-order part of the preset number of bits of the query data identifier, which enables fast queries.
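As a minimal sketch of this clustering, the Python fragment below groups records by the low-order bits of an integer identifier. The value `LOW_BITS = 8`, the use of integers as identifiers, and the function names are assumptions made for illustration only.

```python
from collections import defaultdict

LOW_BITS = 8  # assumed value for the first preset number of bits


def cluster_into_blocks(records):
    """Group (identifier, content) pairs by the low LOW_BITS bits of the identifier.

    The low-order part serves as the data block identifier.
    """
    mask = (1 << LOW_BITS) - 1
    blocks = defaultdict(list)
    for key, content in records:
        blocks[key & mask].append((key, content))
    return dict(blocks)


def locate_block(blocks, query_key):
    """Return the block that would hold query_key, or None if no such block exists."""
    return blocks.get(query_key & ((1 << LOW_BITS) - 1))
```

Locating the block at query time is a single mask operation followed by one dictionary lookup, which is why this step costs almost nothing compared with decoding.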
In one embodiment, the index generation module 906 is further configured to clip the low-order part from the original data identifier of the data to be stored that corresponds to the data block, obtaining a compressed data identifier, and to store the compressed data identifier together with the address identifier of the corresponding compressed data content in the index area of the data block.
Because data to be stored with the same low-order part of the first preset number of bits is treated as one class, the original data identifiers in each class share the same low-order part. Clipping the low-order part reduces the storage space needed to store data identifiers in the index area and compresses the original data identifier into the compressed data identifier. For example, clipping a low-order part of 8 bits saves 1 byte of storage space. The clipped low-order part serves as the data block identifier, so combining the data block identifier with the compressed data identifier yields the original data identifier. The compressed data identifier is stored in the index area of the data block in correspondence with the address identifier of the compressed data content, so the index in the index area stores the correspondence between compressed data identifiers and address identifiers.
As shown in Fig. 10, in one embodiment, the index generation module 906 includes a data bucket generation module 906a, a compressed data identifier generation module 906b, and a first memory module 906c.
The data bucket generation module 906a is configured to treat original data identifiers of the data block that have the same high-order part of a second preset number of bits as one subclass, to make the original data identifiers of each subclass correspond to a respective data bucket in the index area of the data block, to use the high-order part as the data bucket identifier of the data bucket, and to store in the index area of the data block the correspondence between each data bucket identifier and the bucket start storage address of the data bucket in the index area.
Here, the sum of the first preset number of bits and the second preset number of bits is less than or equal to the total number of bits of the original data identifier. Specifically, the data bucket generation module 906a treats original data identifiers with the same high-order part of the second preset number of bits as one subclass and divides the index area of the data block into multiple data buckets, with subclasses and data buckets in one-to-one correspondence. The original data identifiers of each subclass share the same high-order part, which can be used as the data bucket identifier of the data bucket.
The index area also needs to store the correspondence between each data bucket identifier and the bucket start storage address of the data bucket in the index area. Specifically, the data bucket generation module 906a may store the bucket start storage addresses in an array whose length equals the number of data buckets in the data block, with the data bucket identifier used as the array subscript, thereby storing the correspondence between data bucket identifiers and bucket start storage addresses in the index area. The sum of the first preset number of bits and the second preset number of bits does not exceed the total number of bits of the original data identifier, which ensures that the original data identifier can be compressed normally.
The compressed data identifier generation module 906b is configured to clip the low-order part and the high-order part from the original data identifier of the data to be stored that corresponds to the data block, obtaining the compressed data identifier, and to store the compressed data identifier in the data bucket whose data bucket identifier is that high-order part.
The first preset number of bits can be an integral multiple of 8 binary bits, and so can the second preset number of bits. Since 8 bits make up one byte, this ensures that the clipped parts are a whole number of bytes. As one character usually occupies one byte of storage space, this effectively reduces the space occupied by the stored data.
The low-order part and the high-order part are clipped from the original data identifier of the data to be stored that corresponds to the data block, which compresses the original data identifier into the compressed data identifier. The compressed data identifier is stored in the data bucket whose data bucket identifier is the clipped high-order part. The data block identifier, the data bucket identifier, and the compressed data identifier together form the original data identifier, and compressing the original data identifier saves storage space.
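The clipping and recombination described above can be illustrated with a short Python sketch. The 32-bit identifier width, the 8-bit low-order and high-order parts, and the function names are assumed values chosen for illustration, not figures from the original text.

```python
TOTAL_BITS = 32  # assumed total width of an original data identifier
LOW_BITS = 8     # assumed first preset number of bits (data block identifier)
HIGH_BITS = 8    # assumed second preset number of bits (data bucket identifier)
MID_BITS = TOTAL_BITS - LOW_BITS - HIGH_BITS  # width of the compressed data identifier


def split_identifier(key):
    """Return (block_id, bucket_id, compressed_id) for an original data identifier."""
    block_id = key & ((1 << LOW_BITS) - 1)
    bucket_id = key >> (TOTAL_BITS - HIGH_BITS)
    compressed_id = (key >> LOW_BITS) & ((1 << MID_BITS) - 1)
    return block_id, bucket_id, compressed_id


def join_identifier(block_id, bucket_id, compressed_id):
    """Recombine the three parts into the original data identifier."""
    return (bucket_id << (TOTAL_BITS - HIGH_BITS)) | (compressed_id << LOW_BITS) | block_id
```

Under these assumed widths, only the 16-bit middle part has to be stored per entry in the index area, which halves the space a 32-bit identifier would otherwise take.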
The first memory module 906c is configured to store, in the index area of the data block, the address identifiers of the compressed data content corresponding to the compressed data identifiers stored in the data buckets.
Specifically, the first memory module 906c can store in the index area of the data block each compressed data identifier together with the address identifier, within the data area, of the data content corresponding to that compressed data identifier. The compressed data content stored in the data area can thus be obtained from the compressed data identifier.
In this embodiment, multiple data buckets are opened up in the index area of the data block. The high-order part and the low-order part of the original data identifier are clipped to obtain the compressed data identifier, the compressed data identifier is stored in a data bucket, and the address identifier corresponding to the compressed data identifier is also stored in the index area. When data is queried, the stored compressed data identifier can be obtained from the data bucket using the correspondence, stored in the index area, between data bucket identifiers and bucket start storage addresses of the data buckets in the index area. The address identifier corresponding to the compressed data identifier is then obtained from the index area, and the query data content is obtained from the data area of the data block according to the address identifier. Storing the original data identifier in the index area after compression saves storage space, and the compressed data content in the data area can still be located quickly through the index area, which ensures query efficiency.
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area. The first memory module 906c is further configured to store in the index area the start offset corresponding to each data bucket and, for each compressed data identifier in the data bucket, the in-bucket offset of the corresponding compressed data content relative to the start offset.
In this embodiment, the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area is used as the address identifier of the compressed data content. The start storage address is the storage address at which data storage begins in the data area. Because data is stored contiguously, the actual storage address of the compressed data content, that is, the address in the data area at which storage of that compressed data content begins, can be obtained from the start storage address and the offset of the compressed data content corresponding to the compressed data identifier. The storage end position of that compressed data content can be obtained from the offset of the compressed data content corresponding to the next adjacent compressed data identifier, so no data length needs to be stored separately and little storage space is occupied.
The start offset is the offset, relative to the above start storage address, of the storage address in the data area at which the compressed data content corresponding to the compressed data identifiers in the data bucket begins to be stored. The in-bucket offset is the offset of each compressed data content stored for the data bucket relative to the start offset. The sum of the start offset and the in-bucket offset corresponding to a compressed data identifier is the actual offset for that compressed data identifier, and the sum of the actual offset and the start storage address is the actual storage address of the compressed data content corresponding to that compressed data identifier. In this embodiment, storing an offset takes less storage space than storing the storage address itself, and using in-bucket offsets further reduces the space occupied by the address identifiers.
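The offset arithmetic above can be made concrete with a small Python sketch. The function name `content_span` and the `next_bucket_start` parameter (used to close the last entry of a bucket) are assumptions introduced for the example; offsets are counted in bytes from the start storage address of the data area.

```python
def content_span(bucket_start, in_bucket_offsets, next_bucket_start, i):
    """Return (begin, end) offsets of the i-th compressed data content in a bucket.

    bucket_start       start offset of the bucket's contents in the data area
    in_bucket_offsets  offset of each content relative to bucket_start
    next_bucket_start  start offset of the following bucket (or end of the data area)
    """
    begin = bucket_start + in_bucket_offsets[i]
    if i + 1 < len(in_bucket_offsets):
        # The next entry's offset marks where this content ends.
        end = bucket_start + in_bucket_offsets[i + 1]
    else:
        end = next_bucket_start
    return begin, end


# Hypothetical data area: 4 bytes of another bucket, then three contents, then 2 more bytes.
data_area = b"XXXX" + b"aaa" + b"bbbbb" + b"cc" + b"YY"
begin, end = content_span(4, [0, 3, 8], 14, 1)
second_content = data_area[begin:end]
```

Because contents are stored contiguously, the end of each content is implied by the start of the next one, so no length field is needed in this variant.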
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area. The first memory module 906c is further configured to store in the index area the start offset of the compressed data content corresponding to each data bucket and the data length of each compressed data content corresponding to the data bucket.
In this embodiment, the index area stores the start offset of the compressed data content corresponding to each data bucket and the data length of each compressed data content corresponding to the data bucket. Because a data length occupies less storage space than an offset, the data volume can be compressed further. When data is queried, the start offset and the data lengths corresponding to all compressed data identifiers in the data bucket up to and including the compressed query data identifier need to be obtained. Because data is stored contiguously, the sum of the data lengths corresponding to all compressed data identifiers in the data bucket before the compressed query data identifier (not including it) is the in-bucket offset of the compressed query data identifier. The sum of the in-bucket offset and the start offset gives the storage address of the compressed data content corresponding to the compressed query data identifier, that is, the address at which storage of that compressed data content begins. The end address of the compressed data content can then be obtained from the data length corresponding to the compressed query data identifier itself.
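The length-based variant can be sketched in Python as below. The function name and the sample lengths are hypothetical; offsets are again counted in bytes from the start storage address of the data area.

```python
def content_span_from_lengths(bucket_start, lengths, i):
    """Return (begin, end) offsets of the i-th compressed data content in a bucket.

    bucket_start  start offset of the bucket's contents in the data area
    lengths       data length of each compressed data content in the bucket, in order
    """
    # Sum of the lengths before entry i (excluding i) is its in-bucket offset.
    in_bucket_offset = sum(lengths[:i])
    begin = bucket_start + in_bucket_offset
    # The entry's own length gives the end address.
    return begin, begin + lengths[i]
```

The trade-off in this variant is a running sum at query time in exchange for smaller index entries, since a length usually needs fewer bits than an offset.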
In one embodiment, the compressed data identifiers are stored in the data buckets of the index area of the data block in ascending or descending order of their numerical values.
In this embodiment, the compressed data identifiers in a data bucket are stored in ascending or descending order of their numerical values, so a compressed data identifier can be located quickly by binary search, which greatly increases query speed.
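A minimal Python sketch of the binary search over an ascending bucket follows, using the standard-library `bisect` module. The function name `find_in_bucket` and the sample identifiers are assumptions made for illustration.

```python
from bisect import bisect_left


def find_in_bucket(sorted_ids, compressed_query_id):
    """Return the position of compressed_query_id in an ascending list, or None if absent."""
    pos = bisect_left(sorted_ids, compressed_query_id)
    if pos < len(sorted_ids) and sorted_ids[pos] == compressed_query_id:
        return pos
    return None
```

The returned position can then be used to pick the matching address identifier stored alongside the compressed data identifiers.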
As shown in Fig. 11, in one embodiment, the original data content compression module 904 includes a frequency computing module 904a, a compressed data content generation module 904b, and a second memory module 904c.
The frequency computing module 904a is configured to divide the original data content of the data to be stored into data units, obtaining a set of data units, and to calculate the frequency of occurrence of each data unit in the set.
The division into data units can be determined according to actual needs. The original data content itself can be used as one data unit, or the original data content can be divided into multiple data units.
The compressed data content generation module 904b is configured to assign data unit codes to the data units according to their frequencies of occurrence, and to encode, data unit by data unit, the original data content of the data to be stored that corresponds to the data block according to the correspondence between data units and data unit codes, obtaining the compressed data content.
The frequency of occurrence of each data unit is calculated. Data units with a high frequency of occurrence are assigned shorter codes and data units with a low frequency of occurrence are assigned longer codes, so that the original data content of the data to be stored that corresponds to the data block is encoded with variable-length coding, data unit by data unit, according to the correspondence between data units and data unit codes, yielding the compressed data content.
The second memory module 904c is configured to store the compressed data content and the coding table that records the correspondence between data units and data unit codes in the data area of the data block, and to obtain the address identifier of the compressed data content.
The compressed data content and the coding table are stored in the data area of the data block. The coding table records the correspondence between data units and data unit codes, and when data is queried the compressed data content can be decoded according to the coding table to obtain the original data content.
In this embodiment, the original data content of the data to be stored is divided into data units and stored after variable-length coding according to the frequencies of occurrence of the data units, which can greatly reduce the storage space required. Each compressed data content has a corresponding address identifier, so the storage address of the compressed data content can be located quickly through the address identifier when data is queried, balancing data volume against query efficiency.
In one embodiment, the compressed data content generation module 904b is further configured to construct a Huffman tree according to the frequencies of occurrence of the data units and to generate the data unit codes from the paths from the root node of the Huffman tree to the leaf nodes. The second memory module 904c is further configured to store the Huffman tree in the data area of the data block.
Specifically, the compressed data content generation module 904b can make each data unit correspond to a leaf node of the Huffman tree, sort the data units by frequency of occurrence, and merge the 2 leaf nodes with the lowest frequencies of occurrence into a new node whose frequency of occurrence is the sum of the frequencies of occurrence of the two data units. The two merged leaf nodes are then excluded, and the two nodes with the lowest frequencies of occurrence among the remaining leaf nodes and the new node are merged again. Merging continues in this way until the root node of the Huffman tree is formed. The left subtree of a node of the Huffman tree is defined as 0 and the right subtree as 1 (the left subtree could equally be defined as 1 and the right subtree as 0; this only illustrates the principle), so that binary data unit codes are generated from the paths from the root node to the leaf nodes. When the compressed data content and the coding table are stored, the Huffman tree can be stored together with them in the data area of the data block.
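The merging procedure above can be sketched in Python with the standard-library `heapq` module. This is only an illustration under stated assumptions: data units are taken to be single characters, a counter is used to break frequency ties (the text does not specify tie-breaking), and the function name `build_codes` is hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(units):
    """Build Huffman data unit codes from an iterable of data units.

    Data units are assumed not to be tuples, because tuples represent internal nodes here.
    """
    freq = Counter(units)
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), u) for u, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # degenerate case: a single distinct data unit
    while len(heap) > 1:
        # Merge the two nodes with the lowest frequencies into a new node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def assign(node, prefix):
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")  # left subtree is 0
            assign(node[1], prefix + "1")  # right subtree is 1
        else:
            codes[node] = prefix

    assign(heap[0][2], "")
    return codes
```

For the input `"aaaaaaaabbbbccd"` the most frequent unit `a` receives a 1-bit code while the rarest units `c` and `d` receive 3-bit codes, matching the rule that frequent data units get shorter codes.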
When the compressed data content is decoded using the Huffman tree, the compressed data content can be treated as a bit stream and decoded bit by bit. Specifically, following the definition used when the Huffman tree was constructed, decoding starts from the root node of the Huffman tree, goes to the left child node on a 0 and to the right child node on a 1, and continues until a leaf node is reached. The binary code along the path from the root node to that leaf node is one data unit code, and the length of the data unit code equals the depth of the leaf node in the Huffman tree. The data unit corresponding to the data unit code is then obtained from the coding table, and bit-by-bit decoding resumes from the root node of the Huffman tree until data in the preset form is obtained or decoding is complete.
If the original data content has a fixed format, data in a preset form can be defined, and decoding ends as soon as the data in that preset form is obtained. The required content is obtained without decoding an entire compressed data content, which improves query efficiency.
In this embodiment, the original data content is compressed using Huffman coding, which can greatly reduce the storage space required for the compressed data content.
As shown in Fig. 12, in one embodiment, the device further includes a decoding acceleration table generation module 905, which includes a first mapping module 905a, a second mapping module 905b, a third mapping module 905c, and a third memory module 905d.
The first mapping module 905a is configured to construct the decoding acceleration table from data unit codes whose data length is the preset length, and to map each data unit code in the decoding acceleration table to the leaf node in the Huffman tree that corresponds to that data unit code.
Data that has been Huffman coded generally has to be decoded bit by bit, which is relatively inefficient. The conventional decoding method therefore cannot meet the performance requirements of a high-performance query service, and a decoding acceleration table needs to be built to achieve fast decoding. Specifically, the decoding acceleration table is constructed from data unit codes whose data length is the preset length, and each data unit code in the decoding acceleration table is first mapped to the corresponding leaf node in the Huffman tree.
The second mapping module 905b is configured to map a data unit code in the decoding acceleration table that is identical to the preset-length prefix of a long data unit code to the node in the Huffman tree that corresponds to that prefix of the long data unit code. A long data unit code is a data unit code whose data length is greater than the preset length.
That is, for a long data unit code whose data length is greater than the preset length, the data unit code in the decoding acceleration table that is identical to the preset-length prefix of the long data unit code is mapped to the node in the Huffman tree that corresponds to that prefix.
The third mapping module 905c is configured to map a data unit code in the decoding acceleration table whose prefix is identical to a short data unit code to the leaf node in the Huffman tree that corresponds to the short data unit code. A short data unit code is a data unit code whose data length is less than the preset length.
That is, for a short data unit code whose data length is less than the preset length, each data unit code in the decoding acceleration table whose prefix is identical to the short data unit code is mapped to the leaf node in the Huffman tree that corresponds to the short data unit code.
The third memory module 905d is configured to store the decoding acceleration table in the data area of the data block.
When the compressed data content is decoded, the decoding acceleration table is obtained from the data area of the data block, and a read pointer can be used to take data of the preset length from the compressed data content and map it, through the decoding acceleration table, to a node in the Huffman tree. If the mapped node is a leaf node, the data of the preset length either is a data unit code of the preset length or begins with a short data unit code, and the corresponding data unit can be obtained directly from the coding table. The read pointer is then moved away from the start storage address of the compressed data content by the distance corresponding to the depth of the leaf node, that is, by the distance corresponding to the length of the data unit code of that leaf node, after which data of the preset length continues to be taken and decoded, until decoding yields the original data content or the partial data in the preset form within the original data content.
If the mapped node is a non-leaf node, the read pointer is moved away from the start storage address of the compressed query data content by the distance corresponding to the preset length. Then, starting from the node to which the coded data maps, the read pointer is moved bit by bit and decoding proceeds using the Huffman tree until a leaf node is reached and the data unit corresponding to that leaf node is obtained. Reading of data of the preset length then continues, until decoding yields the query data content or the partial data in the preset form within the query data content.
In this embodiment, a decoding acceleration table is generated and stored in the data area of the data block. During decoding, data of the preset length can be taken from the compressed data content in one read and decoded, which avoids bit-by-bit decoding as far as possible and improves decoding efficiency.
As shown in figure 13, in one embodiment, a kind of data query arrangement is provided, including original query Data Identification Acquisition module 1302, data block determining module 1304, address identifier acquisition module 1306 and inquiry data content acquisition module 1308。
Original query Data Identification acquisition module 1302 is used to obtain original query Data Identification.
Original query Data Identification can be inputted by user or be obtained from third party application, for according to original query Data Identification inquires inquiry data content corresponding with the original query Data Identification from the data of storage.
Data block determining module 1304 is original for being determined according to original query Data Identification and the data block identifier to prestore Inquire the data block corresponding to Data Identification.
Data block identifier can distinguish different data blocks, and data block identifier is generated according to Data Identification, therefore basis Original query Data Identification can determine with the matched data block identifier of original query Data Identification, may thereby determine that initial data The corresponding data block of mark, the data block are to store the data of the corresponding Compressed text search data content of original query Data Identification Block.
Address identifier acquisition module 1306 for the index in the index area according to data block, obtains and original query number It is identified according to corresponding address is identified.
The index according to Data Identification and address mark generation is stored in the index area of data block, it, can according to the index Obtain address mark corresponding with original query Data Identification.
Data content acquisition module 1308 is inquired, for obtaining Compressed text search from the data field of data block according to address mark Data content, decompression module for decompressing Compressed text search data content, obtain inquiry data content.
The address identifies the storage location of Compressed text search data content in data data field in the block, therefore root Compressed text search data content can be obtained from the data field of data block according to address mark, after Compressed text search data content is decoded Inquiry data content corresponding with original query Data Identification can be obtained.
After obtaining the original query data identifier, the above data query apparatus determines the data block corresponding to the original query data identifier from that identifier and the pre-stored data block identifiers, which establishes the data block in which the query data resides. It then obtains the address identifier corresponding to the original query data identifier from the index area of the data block, obtains the compressed query data content from the data area of the data block according to the address identifier, and decompresses it to obtain the query data content. The storage location can thus be found rapidly and the data retrieved quickly without decompressing all of the compressed data, which greatly improves the efficiency with which the data is used.
In one embodiment, the data block determining module 1304 is further configured to determine the data block whose data block identifier matches the low-order part, of a first preset number of bits, of the original query data identifier.
Specifically, when data is stored, the low-order part of the first preset number of bits of the data identifier is used as the data block identifier. When data is queried, the data block determining module 1304 can therefore extract the low-order part of the first preset number of bits of the original query data identifier and compare it with the pre-stored data block identifiers. If a data block identifier matches the low-order part, the data block corresponding to that data block identifier is the one that stores the compressed query data content.
In this embodiment, the data block that stores the compressed query data content is determined from the low-order part of the first preset number of bits of the original query data identifier, which is simple to compute and efficient.
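The block lookup described above can be sketched as follows. This is a minimal illustration, assuming that data identifiers are non-negative integers and that the pre-stored data block identifiers are kept as the keys of a dictionary; the names `block_id_of`, `find_block` and `blocks` are hypothetical and do not come from the patent.

```python
def block_id_of(data_id: int, low_bits: int) -> int:
    """Return the low-order `low_bits` bits of a data identifier.

    Assumption: identifiers are non-negative integers, so the low-order
    part can be taken with a bit mask.
    """
    return data_id & ((1 << low_bits) - 1)


def find_block(data_id: int, low_bits: int, blocks: dict):
    """Look up the data block whose identifier matches the low-order part.

    `blocks` is a hypothetical mapping {data block identifier: data block}.
    Returns None when no pre-stored block identifier matches.
    """
    return blocks.get(block_id_of(data_id, low_bits))


# Usage: an 8-bit low-order part selects the block for identifier 0xABCD.
blocks = {0xCD: "block-cd"}
print(find_block(0xABCD, 8, blocks))
```

A mask and one dictionary lookup are enough here, which is consistent with the statement that this step is cheap to compute.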
In one embodiment, the address identifier acquisition module 1306 is further configured to truncate the low-order part of the original query data identifier to obtain a compressed query data identifier, and to obtain the address identifier that is stored in the index area of the data block in correspondence with the compressed query data identifier.
Specifically, the index area of the data block stores the correspondence between the compressed data identifiers, from which the low-order part has been truncated, and the address identifiers. At query time, the low-order part of the original query data identifier is truncated to obtain the compressed query data identifier, and the address identifier corresponding to the compressed query data identifier is then obtained from the index area of the data block.
In this embodiment, the data identifiers are compressed before being stored, which saves storage space. At query time only the low-order part of the original query data identifier needs to be truncated, so query efficiency is unaffected; the embodiment thus balances data volume against efficiency of data use.
As shown in Figure 14, in one embodiment the address identifier acquisition module 1306 includes a data bucket identifier determining module 1306a, a bucket start storage address determining module 1306b, a compressed query data identifier acquisition module 1306c, a searching module 1306d and a first acquisition module 1306e.
The data bucket identifier determining module 1306a is configured to determine the data bucket identifier that matches the high-order part, of a second preset number of bits, of the original query data identifier.
Specifically, the index area of the data block includes multiple data buckets, and when data is stored the high-order part of the second preset number of bits of the data identifier is used as the data bucket identifier. When data is queried, the data bucket identifier determining module 1306a therefore determines, in the index area of the data block, the data bucket identifier that matches the high-order part of the second preset number of bits of the original query data identifier; the data bucket corresponding to the matching data bucket identifier is the one that stores the compressed query data identifier. The sum of the first preset number of bits and the second preset number of bits is less than or equal to the total number of bits of the original query data identifier, which ensures that the query is valid.
The bucket start storage address determining module 1306b is configured to obtain, from the index area of the data block, the bucket start storage address in the index area of the data bucket corresponding to the data bucket identifier.
When data is stored, the index area of the data block records the correspondence between each data bucket identifier and the bucket start storage address of that data bucket in the index area. When data is queried, the bucket start storage address of the data bucket corresponding to the matching data bucket identifier can therefore be obtained from the index area of the determined data block. Once the bucket start storage address is known, the data stored in the data bucket can be read by accessing that storage address.
The compressed query data identifier acquisition module 1306c is configured to truncate the high-order part and the low-order part of the original query data identifier to obtain the compressed query data identifier.
In this embodiment, when a data identifier is stored, its high-order part and low-order part are truncated to obtain the compressed data identifier, which is what is stored. When data is queried, the compressed query data identifier acquisition module 1306c therefore truncates the high-order part and the low-order part of the original query data identifier to obtain the compressed query data identifier.
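The three-way split of an identifier can be sketched as below. This is an illustrative sketch only, assuming fixed-width non-negative integer identifiers; the function name `split_identifier` and its parameters are hypothetical. The check on the bit counts mirrors the constraint that the first and second preset numbers of bits together must not exceed the total number of bits.

```python
def split_identifier(data_id: int, total_bits: int, high_bits: int, low_bits: int):
    """Split an identifier into (bucket id, compressed id, block id).

    high_bits: second preset number of bits (data bucket identifier)
    low_bits:  first preset number of bits (data block identifier)
    The remaining middle bits form the compressed identifier.
    """
    if high_bits + low_bits > total_bits:
        raise ValueError("high_bits + low_bits must not exceed total_bits")
    if not 0 <= data_id < (1 << total_bits):
        raise ValueError("data_id does not fit in total_bits")
    mid_bits = total_bits - high_bits - low_bits
    block_id = data_id & ((1 << low_bits) - 1)
    bucket_id = data_id >> (total_bits - high_bits)
    compressed_id = (data_id >> low_bits) & ((1 << mid_bits) - 1)
    return bucket_id, compressed_id, block_id


# Usage: a 32-bit identifier with an 8-bit high part and an 8-bit low part.
bucket_id, compressed_id, block_id = split_identifier(0x12345678, 32, 8, 8)
print(hex(bucket_id), hex(compressed_id), hex(block_id))
```

In this example only the 16 middle bits would need to be stored in the index, since the other 16 bits are implied by the block and the bucket in which the entry sits.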
The searching module 1306d is configured to search the data bucket, according to the bucket start storage address of the data bucket in the index area, for the compressed data identifier that matches the compressed query data identifier.
Once the bucket start storage address of the data bucket in the index area has been obtained, the search for the compressed data identifier that matches the compressed query data identifier can begin at that address. In addition, the bucket start storage address of the next data bucket, adjacent to the queried data bucket, determines the end position of the queried data bucket, so that the search terminates even when the compressed query data identifier is not present in the data bucket.
The first acquisition module 1306e is configured to obtain, from the index area, the address identifier corresponding to the matching compressed data identifier.
The index area of the data block stores the correspondence between compressed data identifiers and address identifiers. The address identifier corresponding to the matching compressed data identifier can therefore be obtained from the index area; this address identifier determines the storage location of the compressed query data content in the data area of the data block.
In this embodiment, the data identifiers are compressed when data is stored, which saves a large amount of storage space. When data is queried, the original query data identifier quickly determines the data block in which the compressed query data content resides; the address identifier is then obtained from the index area of that data block, and the compressed query data content stored in the data area is obtained according to the address identifier. Queries are fast and do not require decompressing all of the compressed data, which greatly improves the efficiency with which the data is used.
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area.
As shown in Figure 15, the first acquisition module 1306e includes a second acquisition module 1306e1 and a first offset calculation module 1306e2.
The second acquisition module 1306e1 is configured to obtain, from the index area, the start offset corresponding to the data bucket and the in-bucket offset corresponding to the matching compressed data identifier.
The first offset calculation module 1306e2 is configured to calculate the offset of the query data content according to the start offset and the in-bucket offset.
In this embodiment, the offset of the storage address of the compressed data content in the data area, relative to the start storage address of the data area, serves as the address identifier of the compressed data content. The start storage address is the storage address at which data begins to be stored in the data area. Because the data is stored contiguously, the start storage address together with the offset corresponding to the matching compressed data identifier gives the actual storage address of the matching compressed data content, that is, the actual storage address of the compressed query data content. The offset of the compressed data content corresponding to the next compressed data identifier, adjacent to the matching one, gives the end position of the compressed data content corresponding to the matching compressed data identifier. No data length therefore needs to be stored separately, and the storage space occupied is small.
The start offset is the offset, relative to the above start storage address, of the storage address at which the compressed data content corresponding to the compressed data identifiers in the data bucket begins to be stored in the data area. The in-bucket offset is the offset, relative to the start offset, of each item of compressed data content stored for the data bucket. When the offset is calculated, the sum of the start offset and the in-bucket offset corresponding to the compressed query data identifier is the actual offset for that identifier, and the sum of the actual offset and the start storage address is the actual storage address of the corresponding compressed data content. In this embodiment, storing an offset takes less space than storing the storage address itself, and using in-bucket offsets further reduces the space occupied by the address identifiers while having little effect on query efficiency.
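The offset arithmetic of this embodiment can be illustrated with a short sketch. All names are hypothetical, and one detail is an assumption rather than something the text states: for the last entry of a bucket, the end position is taken from the start offset of the next bucket.

```python
def locate_by_offsets(data_start, bucket_start_offset, in_bucket_offsets,
                      index, next_bucket_start_offset):
    """Return (start, end) storage addresses of one compressed content item.

    data_start:               start storage address of the data area
    bucket_start_offset:      start offset stored for the data bucket
    in_bucket_offsets:        in-bucket offsets, one per compressed identifier
    index:                    position of the matching compressed identifier
    next_bucket_start_offset: start offset of the following bucket (assumed
                              to bound the last entry of this bucket)
    """
    base = data_start + bucket_start_offset
    start = base + in_bucket_offsets[index]
    if index + 1 < len(in_bucket_offsets):
        # The next entry's offset marks where this entry ends.
        end = base + in_bucket_offsets[index + 1]
    else:
        end = data_start + next_bucket_start_offset
    return start, end


# Usage: data area at address 1000, bucket at offset 200, three entries.
print(locate_by_offsets(1000, 200, [0, 15, 40], 1, 260))
```

Because contiguous storage lets the next offset double as an end marker, no per-entry length has to be stored in this variant.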
In one embodiment, the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area.
As shown in Figure 16, the first acquisition module 1306e includes a third acquisition module 1306e3 and a second offset calculation module 1306e4.
The third acquisition module 1306e3 is configured to obtain, from the index area, the start offset corresponding to the data bucket and the data lengths of the matching compressed data identifier and of all compressed data identifiers that precede it in the data bucket.
The second offset calculation module 1306e4 is configured to calculate the offset of the query data content according to the start offset and the obtained data lengths.
In this embodiment, the index area stores the start offset of the compressed data content corresponding to the data bucket and the data length of each item of compressed data content corresponding to the data bucket. Because a data length occupies less storage space than an offset, the data volume can be reduced further. When data is queried, the start offset and the data lengths corresponding to all compressed data identifiers in the data bucket up to and including the compressed query data identifier are obtained. Because the data is stored contiguously, the sum of the data lengths corresponding to all compressed data identifiers that precede the compressed query data identifier in the data bucket (excluding the identifier itself) is the in-bucket offset of the compressed query data identifier. The sum of the in-bucket offset and the start offset gives the storage address of the compressed data content corresponding to the compressed query data identifier, that is, the address at which that compressed data content begins. The data length of the compressed query data identifier itself then gives the end address of the compressed data content.
In this embodiment, recording the data length of the compressed data content in place of the in-bucket offset achieves a higher compression ratio and occupies less storage space. Queries take somewhat longer than with in-bucket offsets, but when storage resources are scarce, recording data lengths is the better choice.
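The length-based variant can be sketched in the same style. This is a minimal illustration with hypothetical names; it assumes the per-entry lengths of a bucket are available as a list in storage order.

```python
def locate_by_lengths(data_start, bucket_start_offset, lengths, index):
    """Return (start, end) storage addresses using per-entry data lengths.

    The in-bucket offset is the sum of the lengths of all entries that
    precede the matching entry (the entry itself is excluded); the entry's
    own length then gives the end address.
    """
    in_bucket_offset = sum(lengths[:index])
    start = data_start + bucket_start_offset + in_bucket_offset
    return start, start + lengths[index]


# Usage: same layout as an offset list [0, 15, 40] with a 20-byte last entry.
print(locate_by_lengths(1000, 200, [15, 25, 20], 1))
```

The summation is the extra query-time cost mentioned above: it grows with the position of the entry in the bucket, whereas a stored in-bucket offset is read directly.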
In one embodiment, the compressed data identifiers in a data bucket of the index area of the data block are stored in ascending or descending order of their numerical value; the searching module 1306d is further configured to use binary search to find, in the data bucket, the compressed data identifier that matches the compressed query data identifier.
In this embodiment, because the compressed data identifiers in a data bucket are stored in ascending or descending order of numerical value, binary search can locate a compressed data identifier quickly, which substantially increases query speed.
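A bounded binary search over one bucket might look like the following sketch. It assumes, for illustration only, that the compressed data identifiers of all buckets sit in one flat list, that each bucket is sorted in ascending order, and that a bucket spans the half-open range from its own start position to the start position of the next bucket. The name `find_in_bucket` is hypothetical; `bisect_left` is the Python standard library function.

```python
from bisect import bisect_left


def find_in_bucket(compressed_ids, bucket_start, bucket_end, query_id):
    """Binary-search one bucket of a flat index for `query_id`.

    compressed_ids: flat list holding the identifiers of all buckets
    bucket_start:   position of the bucket's first identifier
    bucket_end:     start position of the next bucket (exclusive bound)
    Returns the position of the match, or None if the bucket lacks it.
    """
    pos = bisect_left(compressed_ids, query_id, bucket_start, bucket_end)
    if pos < bucket_end and compressed_ids[pos] == query_id:
        return pos
    return None


# Usage: two buckets, positions [0, 3) and [3, 7), each sorted ascending.
ids = [3, 9, 20, 1, 7, 8, 30]
print(find_in_bucket(ids, 3, 7, 8))
```

Passing the next bucket's start as the upper bound is what keeps the search from running past the bucket when the identifier is absent.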
In one embodiment, the query data content acquisition module 1308 is further configured to obtain, from the data area, the coding table that records the correspondence between data units and data unit encodings, to decode the compressed query data content, which consists of data unit encodings, according to the coding table to obtain data units, and to obtain the query data content from the data units.
The data area of the data block stores the coding table that records the correspondence between data units and data unit encodings. The coding table is obtained from the data area and the compressed query data content is decoded according to it to obtain data units; the data units obtained make up the query data content.
In one embodiment, the query data content acquisition module 1308 is further configured to obtain, from the data area, the Huffman tree and the coding table that records the correspondence between data units and data unit encodings; to divide the compressed query data content into data unit encodings according to the Huffman tree; to obtain the data units corresponding to the data unit encodings according to the coding table; and to obtain the query data content from the data units.
When the compressed query data content is decoded with the Huffman tree, it can be treated as a bit stream and decoded bit by bit. Specifically, following the convention used when the Huffman tree was built, the left subtree of a node is 0 and the right subtree is 1. Starting from the root node of the Huffman tree, a 0 moves to the left child node and a 1 moves to the right child node until a leaf node is reached. The binary code along the path from the root node to that leaf node is one data unit encoding, and its length equals the depth of the leaf node in the Huffman tree. The data unit corresponding to the data unit encoding is obtained from the coding table, and bit-by-bit decoding then resumes from the root node of the Huffman tree until a preset number of data units has been obtained or decoding is complete.
In this embodiment, the original data content is compressed with Huffman coding, which greatly reduces the storage space required for the compressed data content.
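The bit-by-bit walk can be sketched as follows. For readability this sketch represents the bit stream as a string of "0" and "1" characters and rebuilds the tree from a coding table held as a dictionary; both representations, and the names `build_tree` and `decode_bitwise`, are assumptions made for illustration rather than the storage format of the patent.

```python
def build_tree(code_table):
    """Build a binary tree from {data unit: encoding}.

    Each node is a dict; key "0" leads to the left child, key "1" to the
    right child, and a leaf carries its data unit under the key "unit".
    """
    root = {}
    for unit, code in code_table.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node["unit"] = unit
    return root


def decode_bitwise(bits, root, max_units=None):
    """Decode a bit string by walking the tree one bit at a time.

    Stops when the bits are exhausted or `max_units` units were decoded.
    """
    units = []
    node = root
    for bit in bits:
        node = node[bit]          # 0 goes left, 1 goes right
        if "unit" in node:        # reached a leaf: one encoding is complete
            units.append(node["unit"])
            node = root           # restart from the root for the next unit
            if max_units is not None and len(units) == max_units:
                break
    return units


# Usage with a small prefix-free code.
root = build_tree({"a": "0", "b": "10", "c": "11"})
print(decode_bitwise("010110", root))
```

The optional `max_units` argument corresponds to stopping once a preset number of data units has been obtained.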
As shown in Figure 17, in one embodiment the query data content acquisition module 1308 includes a fourth acquisition module 1308a, a node mapping module 1308b, a leaf node processing module 1308c and a non-leaf node processing module 1308d.
The fourth acquisition module 1308a is configured to obtain, from the data area, the Huffman tree, the decoding acceleration table, and the coding table that records the correspondence between data units and data unit encodings.
The decoding acceleration table contains multiple data unit encodings of a preset length. Each data unit encoding in the decoding acceleration table is mapped to the leaf node in the Huffman tree that corresponds to that data unit encoding. An encoding in the decoding acceleration table that is identical to the preset-length prefix of a long data unit encoding is mapped to the node in the Huffman tree corresponding to that prefix of the long data unit encoding, where a long data unit encoding is a data unit encoding whose data length exceeds the preset length. An encoding in the decoding acceleration table whose prefix is identical to a short data unit encoding is mapped to the leaf node in the Huffman tree corresponding to that short data unit encoding, where a short data unit encoding is a data unit encoding whose data length is less than the preset length.
The node mapping module 1308b is configured to use a read pointer to read coded data of the preset length, starting from the start storage address of the compressed query data content, and to map the coded data to a node in the Huffman tree using the decoding acceleration table.
The leaf node processing module 1308c is configured, if the node to which the coded data is mapped is a leaf node, to decode according to the coding table to obtain the data unit corresponding to the leaf node, to move the read pointer away from the start storage address of the compressed query data content by a distance corresponding to the depth of the leaf node, and then to continue reading coded data of the preset length until decoding yields the query data content or a preset number of data items of the query data content.
The non-leaf node processing module 1308d is configured, if the node to which the coded data is mapped is a non-leaf node, to move the read pointer away from the start storage address of the compressed query data content by a distance corresponding to the preset length, then to use the Huffman tree, starting from the node to which the coded data is mapped, moving the read pointer and decoding bit by bit until a leaf node is reached, to obtain the data unit corresponding to the leaf node, and to continue reading coded data of the preset length until decoding yields the query data content or a preset number of data items of the query data content.
In this embodiment, a decoding acceleration table is generated and stored in the data area of the data block. During decoding, data of the preset length can be taken from the compressed data content and decoded in one step, which avoids bit-by-bit decoding as far as possible and improves decoding efficiency.
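One possible reading of the table-driven decoding is sketched below. Several things here are assumptions made for illustration: the bit stream is a string of "0" and "1" characters, the tree is built from a dictionary coding table, the table is keyed by every bit pattern of the preset length `k`, and a tail shorter than `k` bits is decoded bit by bit. The names `build_accel_table` and `decode_accelerated` are hypothetical.

```python
def build_tree(code_table):
    """Build a binary tree from {data unit: encoding}; leaves hold "unit"."""
    root = {}
    for unit, code in code_table.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})
        node["unit"] = unit
    return root


def build_accel_table(root, k):
    """Map every k-bit pattern to (node, bits consumed to reach that node).

    A pattern that starts with a short encoding maps to that encoding's leaf
    (depth < k); a pattern equal to a full encoding maps to its leaf
    (depth == k); a pattern that is only a prefix of a long encoding maps to
    the internal node for that prefix (depth == k).
    Assumes a full Huffman tree (every internal node has two children).
    """
    table = {}
    for value in range(1 << k):
        pattern = format(value, f"0{k}b")
        node, depth = root, 0
        for bit in pattern:
            node = node[bit]
            depth += 1
            if "unit" in node:
                break
        table[pattern] = (node, depth)
    return table


def decode_accelerated(bits, root, table, k, max_units=None):
    """Decode using one table lookup per unit, then bitwise steps if needed."""
    units, pos = [], 0
    while pos < len(bits) and (max_units is None or len(units) < max_units):
        chunk = bits[pos:pos + k]
        if len(chunk) == k:
            node, depth = table[chunk]   # jump straight to the mapped node
        else:
            node, depth = root, 0        # short tail: decode it bit by bit
        pos += depth                     # advance by leaf depth, or by k
        while "unit" not in node:        # long encoding: finish bit by bit
            if pos >= len(bits):
                return units             # truncated input
            node = node[bits[pos]]
            pos += 1
        units.append(node["unit"])
    return units


# Usage: encodings of length 1, 2 and 3 with a preset length of 2 bits.
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
root = build_tree(codes)
table = build_accel_table(root, 2)
print(decode_accelerated("0101101110", root, table, 2))
```

Advancing by the leaf depth rather than by `k` after a short encoding matters: the unused bits of the chunk belong to the next encoding and must be read again.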
Figure 18 is a block diagram of a computer system 1000 in which embodiments of the present invention can be implemented. The computer system 1000 is only one example of a computing environment suitable for the present invention and should not be taken as imposing any limitation on the scope of use of the invention. Nor should the computer system 1000 be interpreted as depending on, or requiring, any one or combination of the components of the illustrated exemplary computer system 1000.
The computer system 1000 shown in Figure 18 is one example of a computer system suitable for the present invention. Other architectures with different subsystem configurations may also be used.
As shown in Figure 18, the computer system 1000 includes a processor 1010, a memory 1020 and a system bus 1022. Various system components, including the memory 1020 and the processor 1010, are connected to the system bus 1022. The processor 1010 is hardware that executes computer program instructions through the basic arithmetic and logical operations of a computer system. The memory 1020 is a physical device for temporarily or permanently storing computer programs or data (for example, program state information). The system bus 1022 may be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus and a local bus. The processor 1010 and the memory 1020 can communicate data over the system bus 1022. The memory 1020 includes read-only memory (ROM) or flash memory (neither shown in the figure) and random access memory (RAM), where RAM typically refers to the main memory into which the operating system and application programs are loaded.
The computer system 1000 further includes a display interface 1030 (for example, a graphics processing unit), a display device 1040 (for example, a liquid crystal display), an audio interface 1050 (for example, a sound card) and an audio device 1060 (for example, a loudspeaker). The display device 1040 and the audio device 1060 are media devices for experiencing multimedia content.
The computer system 1000 generally includes a storage device 1070. The storage device 1070 may be selected from a variety of computer-readable media, where a computer-readable medium is any available medium that can be accessed by the computer system 1000, including both removable and fixed media. For example, computer-readable media include, but are not limited to, flash memory (micro SD card), CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the required information and can be accessed by the computer system 1000.
The computer system 1000 further includes an input device 1080 and an input interface 1090 (for example, an I/O controller). A user can enter instructions and information into the computer system 1000 through the input device 1080, such as a keyboard, a mouse, or a touch panel on the display device 1040. The input device 1080 is usually connected to the system bus 1022 through the input interface 1090, but may also be connected through other interfaces or bus structures, such as a universal serial bus (USB).
The computer system 1000 can be logically connected to one or more network devices in a network environment. A network device may be a personal computer, a server, a router, a smartphone, a tablet computer or another common network node. The computer system 1000 is connected to the network devices through a local area network (LAN) interface 1100 or a mobile communication unit 1110. A local area network (LAN) is a computer network formed by interconnecting devices over network media within a limited area, such as a home, a school, a computer laboratory or an office building. WiFi and twisted-pair Ethernet are the two most commonly used technologies for building a local area network. WiFi is a technology that enables computer systems 1000 to exchange data with one another or to connect to a wireless network by radio waves. The mobile communication unit 1110 can place and answer calls over a radio communication link while moving within a wide geographic area. In addition to calls, the mobile communication unit 1110 also supports Internet access in 2G, 3G or 4G cellular communication systems that provide mobile data services.
It should be pointed out that other computer systems, with more or fewer subsystems than the computer system 1000, may also be suitable for the invention. For example, the computer system 1000 may include a Bluetooth unit for exchanging data over short distances, an image sensor for taking photographs, and an accelerometer for measuring acceleration.
As described in detail above, a computer system 1000 suitable for the present invention can perform the specified operations of the data storage method and/or the data query method. The computer system 1000 performs these operations by having the processor 1010 run software instructions held in a computer-readable medium. These software instructions may be read into the memory 1020 from the storage device 1070 or, through the LAN interface 1100, from another device. The software instructions stored in the memory 1020 cause the processor 1010 to perform the above data storage method and/or data query method. The present invention can equally be implemented by hardware circuits or by hardware circuits combined with software instructions. The implementation of the present invention is therefore not limited to any specific combination of hardware circuits and software.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (38)

1. A data storage method, wherein data to be stored comprises original data identifiers and original data content corresponding to the original data identifiers, the method comprising:
clustering the data to be stored according to the original data identifiers, such that data to be stored whose original data identifiers share a common part forms one class and each class corresponds to one data block, and generating a data block identifier according to the original data identifiers of the data to be stored corresponding to the data block;
compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining an address identifier of the compressed data content;
generating an index according to the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and storing the index in the index area of the data block.
2. The method according to claim 1, wherein clustering the data to be stored according to the original data identifiers, each class corresponding to one data block, and generating a data block identifier according to the original data identifiers of the data to be stored corresponding to the data block, comprises:
treating data to be stored whose original data identifiers have the same low-order part of a first preset number of bits as one class, each class corresponding to one data block, and using the low-order part as the data block identifier of the data block.
3. The method according to claim 2, wherein generating an index according to the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and storing the index in the index area of the data block, comprises:
truncating the low-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the index area of the data block in correspondence with the address identifier of the compressed data content.
4. The method according to claim 3, wherein truncating the low-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the index area of the data block in correspondence with the address identifier of the compressed data content, comprises:
treating original data identifiers of the data block that have the same high-order part of a second preset number of bits as one subclass, the original data identifiers of each subclass corresponding to one data bucket in the index area of the data block, using the high-order part as the data bucket identifier of the data bucket, and storing in the index area of the data block the correspondence between the data bucket identifier and the bucket start storage address of the data bucket in the index area;
truncating the low-order part and the high-order part from the original data identifiers of the data to be stored corresponding to the data block to obtain compressed data identifiers, and storing each compressed data identifier in the data bucket whose data bucket identifier is the high-order part;
storing, in the index area of the data block, the address identifiers of the compressed data content corresponding to the compressed data identifiers stored in the data bucket.
5. The method according to claim 4, wherein the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area;
and wherein storing, in the index area of the data block, the address identifiers of the compressed data content corresponding to the compressed data identifiers stored in the data bucket, comprises:
storing, in the index area, the start offset corresponding to the data bucket and, for each compressed data identifier in the data bucket, the in-bucket offset of the corresponding compressed data content relative to the start offset.
6. The method according to claim 4, wherein the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area;
and wherein storing, in the index area of the data block, the address identifiers of the compressed data content corresponding to the compressed data identifiers stored in the data bucket, comprises:
storing, in the index area, the start offset of the compressed data content corresponding to the data bucket and the data length of each item of compressed data content corresponding to the data bucket.
7. The method according to any one of claims 4 to 6, wherein the compressed data identifiers are stored in the data bucket of the index area of the data block in ascending or descending order of the numerical value of the compressed data identifiers.
8. The method according to any one of claims 1 to 6, wherein compressing the original data content of the data to be stored corresponding to the data block, storing the compressed data content in the data area of the data block, and obtaining an address identifier of the compressed data content, comprises:
dividing the original data content of the data to be stored into data units to obtain a data unit set, and calculating the frequency of occurrence of each data unit in the data unit set;
assigning data unit encodings to the data units according to their frequency of occurrence, and encoding the original data content of the data to be stored corresponding to the data block data unit by data unit, according to the correspondence between the data units and the data unit encodings, to obtain the compressed data content;
storing the compressed data content, and a coding table that records the correspondence between the data units and the data unit encodings, in the data area of the data block, and obtaining the address identifier of the compressed data content.
9. The method according to claim 8, wherein assigning data unit encodings to the data units according to their frequency of occurrence comprises:
constructing a Huffman tree according to the frequency of occurrence of the data units, and generating the data unit encodings from the paths from the root node of the Huffman tree to its leaf nodes;
the method further comprising: storing the Huffman tree in the data area of the data block.
10. The method according to claim 9, characterized in that the method further comprises:
Constructing a decoding acceleration table from data unit codes whose data length is a preset length, each data unit code in the decoding acceleration table being mapped to the leaf node in the Huffman tree that corresponds to that data unit code;
Mapping each data unit code in the decoding acceleration table that is identical to the prefix of the preset length of a long data unit code to the node in the Huffman tree that corresponds to that prefix of the long data unit code;
Mapping each data unit code in the decoding acceleration table whose prefix is identical to a short data unit code to the leaf node in the Huffman tree that corresponds to the short data unit code;
Storing the decoding acceleration table in the data area of the data block.
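The three mapping rules of claim 10 can be illustrated with a small sketch. It assumes, purely for illustration, that codes are bit strings, that a tree node can be identified by the bit path leading to it, and that table entries are tuples; `build_acceleration_table` and the tuple layout are hypothetical names, not taken from the patent.

```python
from itertools import product


def build_acceleration_table(codes: dict, k: int) -> dict:
    """Map every k-bit string to a Huffman tree node.

    codes: coding table {data unit: bit-string code}, assumed prefix-free.
    Entries are ("leaf", unit, depth) or ("inner", path_to_node).
    """
    code_to_unit = {code: unit for unit, code in codes.items()}
    table = {}
    for bits in map("".join, product("01", repeat=k)):
        # Rules 1 and 3: the k bits equal a code, or begin with a shorter
        # code; either way they resolve directly to that code's leaf node.
        for n in range(1, k + 1):
            if bits[:n] in code_to_unit:
                table[bits] = ("leaf", code_to_unit[bits[:n]], n)
                break
        else:
            # Rule 2: the k bits are the k-bit prefix of one or more longer
            # codes, so they resolve to the internal node on that path.
            if any(code.startswith(bits) for code in code_to_unit):
                table[bits] = ("inner", bits)
    return table
```

With the codes `{"a": "0", "b": "10", "c": "110", "d": "111"}` and a preset length of 2, the entries `00` and `01` both resolve to the leaf of `a`, `10` resolves to the leaf of `b`, and `11` resolves to the internal node shared by `c` and `d`.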
11. A data query method, the method comprising:
Obtaining an original query data identifier;
Determining, according to the original query data identifier and pre-stored data block identifiers, the data block corresponding to the original query data identifier; and obtaining, according to the index in the index area of the data block, the address identifier corresponding to the original query data identifier, which comprises: removing the low-order part of the original query data identifier to obtain a compressed query data identifier, and obtaining the address identifier that corresponds to the compressed query data identifier stored in the index area of the data block;
Obtaining compressed query data content from the data area of the data block according to the address identifier, and decompressing the compressed query data content to obtain query data content.
12. The method according to claim 11, characterized in that determining, according to the original query data identifier and the pre-stored data block identifiers, the data block corresponding to the original query data identifier comprises:
Determining the data block corresponding to the data block identifier that matches the low-order part, of a first preset number of bits, of the original query data identifier.
13. The method according to claim 11, characterized in that removing the low-order part of the original query data identifier to obtain a compressed query data identifier, and obtaining the address identifier that corresponds to the compressed query data identifier stored in the index area of the data block, comprises:
Determining the data bucket identifier that matches the high-order part, of a second preset number of bits, of the original query data identifier;
Obtaining, from the index area of the data block, the bucket start storage address, in the index area, of the data bucket corresponding to the data bucket identifier;
Removing the high-order part and the low-order part of the original query data identifier to obtain the compressed query data identifier;
Searching the data bucket, according to the bucket start storage address of the data bucket in the index area, for the compressed data identifier that matches the compressed query data identifier;
Obtaining, from the index area, the address identifier corresponding to the matching compressed data identifier.
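The way claims 12 and 13 carve one original query data identifier into a data block identifier, a data bucket identifier and a compressed query data identifier can be sketched with bit operations. The sketch assumes the identifier is a fixed-width unsigned integer; the function name, parameter names and the bit widths used in the usage note are illustrative assumptions, not values from the patent.

```python
def split_identifier(original_id: int, total_bits: int,
                     low_bits: int, high_bits: int):
    """Split a fixed-width identifier into (block_id, bucket_id, compressed_id).

    low_bits  : the "first preset number of bits" (selects the data block)
    high_bits : the "second preset number of bits" (selects the data bucket)
    """
    # The low-order part matches the data block identifier.
    block_id = original_id & ((1 << low_bits) - 1)
    # The high-order part matches the data bucket identifier.
    bucket_id = original_id >> (total_bits - high_bits)
    # What remains after removing both parts is the compressed identifier.
    middle_bits = total_bits - low_bits - high_bits
    compressed_id = (original_id >> low_bits) & ((1 << middle_bits) - 1)
    return block_id, bucket_id, compressed_id
```

For a 32-bit identifier `0xABCDEF12` with 8 low-order bits and 8 high-order bits, this yields block `0x12`, bucket `0xAB` and compressed identifier `0xCDEF`, so only 16 of the 32 bits need to be stored per index entry.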
14. The method according to claim 13, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area;
Obtaining, from the index area, the address identifier corresponding to the matching compressed data identifier comprises:
Obtaining, from the index area, the start offset corresponding to the data bucket and the in-bucket offset corresponding to the matching compressed data identifier;
Calculating the offset of the query data content according to the start offset and the in-bucket offset.
15. The method according to claim 13, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area;
Obtaining, from the index area, the address identifier corresponding to the matching compressed data identifier comprises:
Obtaining, from the index area, the start offset corresponding to the data bucket, and the data lengths of the matching compressed data identifier and of all compressed data identifiers preceding the matching compressed data identifier in the data bucket;
Calculating the offset of the query data content according to the start offset and the obtained data lengths.
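The two offset calculations in claims 14 and 15 amount to simple arithmetic, sketched below. The claims do not spell out the formulas, so both functions are assumptions: the first adds the in-bucket offset to the bucket's start offset, and the second assumes the stored lengths are the lengths of the compressed content for each entry, so that the lengths of the preceding entries locate the start and the matching entry's own length tells how much to read. All names are hypothetical.

```python
def offset_from_in_bucket_offset(bucket_start: int, in_bucket_offset: int) -> int:
    """Claim 14 variant (assumed formula): start offset plus in-bucket offset."""
    return bucket_start + in_bucket_offset


def offset_from_lengths(bucket_start: int, lengths: list, index: int):
    """Claim 15 variant (assumed formula).

    lengths[i] is taken to be the length of the compressed content of the
    i-th entry in the bucket; index is the position of the matching entry.
    Returns (offset of the content, length of the content).
    """
    offset = bucket_start + sum(lengths[:index])  # skip preceding entries
    return offset, lengths[index]
```

Storing per-entry lengths instead of offsets trades a short summation at query time for smaller index entries when the lengths need fewer bits than the offsets.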
16. The method according to any one of claims 13-15, characterized in that the compressed data identifiers in the data buckets of the index area of the data block are stored in ascending or descending order of their numerical values;
Searching the data bucket for the compressed data identifier that matches the compressed query data identifier comprises:
Searching the data bucket, using binary search, for the compressed data identifier that matches the compressed query data identifier.
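Because the identifiers within a bucket are kept sorted, the lookup in claim 16 can be done in logarithmic time. A minimal sketch, assuming ascending order and an in-memory list of integers standing in for the bucket (the function name `find_in_bucket` is hypothetical), using Python's standard `bisect` module:

```python
from bisect import bisect_left
from typing import List, Optional


def find_in_bucket(bucket_ids: List[int], compressed_query_id: int) -> Optional[int]:
    """Return the position of the matching compressed identifier, or None.

    bucket_ids must be sorted in ascending order.
    """
    i = bisect_left(bucket_ids, compressed_query_id)
    if i < len(bucket_ids) and bucket_ids[i] == compressed_query_id:
        return i
    return None
```

The returned position can then be used to fetch the address identifier stored for that entry.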
17. The method according to any one of claims 11-15, characterized in that obtaining compressed query data content from the data area of the data block according to the address identifier, and decompressing the compressed query data content to obtain query data content, comprises:
Obtaining, from the data area, a coding table recording the correspondence between data units and data unit codes; decoding, according to the coding table, the compressed query data content, which comprises data unit codes, to obtain data units; and obtaining the query data content according to the data units.
18. The method according to claim 17, characterized in that obtaining, from the data area, the coding table recording the correspondence between data units and data unit codes, decoding, according to the coding table, the compressed query data content comprising data unit codes to obtain data units, and obtaining the query data content according to the data units, comprises:
Obtaining, from the data area, a Huffman tree and the coding table recording the correspondence between data units and data unit codes; dividing the compressed query data content into data unit codes according to the Huffman tree, and obtaining the data units corresponding to the data unit codes according to the coding table; and obtaining the query data content according to the data units.
19. The method according to claim 18, characterized in that obtaining, from the data area, the Huffman tree and the coding table recording the correspondence between data units and data unit codes; dividing the compressed query data content into data unit codes according to the Huffman tree, and obtaining the data units corresponding to the data unit codes according to the coding table; and obtaining the query data content according to the data units, comprises:
Obtaining, from the data area, the Huffman tree, a decoding acceleration table, and the coding table recording the correspondence between data units and data unit codes; wherein the decoding acceleration table comprises a plurality of data unit codes of a preset length, and each data unit code in the decoding acceleration table is mapped to the leaf node in the Huffman tree that corresponds to that data unit code; a data unit code in the decoding acceleration table that is identical to the prefix of the preset length of a long data unit code is mapped to the node in the Huffman tree that corresponds to that prefix of the long data unit code; and a data unit code in the decoding acceleration table whose prefix is identical to a short data unit code is mapped to the leaf node in the Huffman tree that corresponds to the short data unit code;
Reading coded data of the preset length with a read pointer, starting from the start storage address of the compressed query data content, and mapping the coded data to a node in the Huffman tree using the decoding acceleration table;
If the node to which the coded data is mapped is a leaf node, decoding according to the coding table to obtain the data unit corresponding to the leaf node, moving the read pointer, in the direction away from the start storage address of the compressed query data content, by a distance corresponding to the depth of the leaf node, and then continuing to read coded data of the preset length, until the query data content or a preset portion of the query data content is obtained by decoding;
If the node to which the coded data is mapped is a non-leaf node, moving the read pointer, in the direction away from the start storage address of the compressed query data content, by a distance corresponding to the preset length; then, using the Huffman tree and starting from the node to which the coded data is mapped, moving the read pointer bit by bit and decoding until a leaf node is reached, obtaining the data unit corresponding to the leaf node, and continuing to read coded data of the preset length, until the query data content or a preset portion of the query data content is obtained by decoding.
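The table-driven decoding loop of claim 19 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the compressed content is modelled as a string of "0" and "1" characters, a tree node is identified by the bit path leading to it (so extending the path by one bit stands in for stepping to a child node), the final window is padded with zeros, and the input is assumed to be a well-formed code stream. The function name `decode` and the entry layout are hypothetical.

```python
def decode(bitstream: str, codes: dict, k: int) -> list:
    """Decode a Huffman bit string using a k-bit decoding acceleration table."""
    code_to_unit = {code: unit for unit, code in codes.items()}
    # Build the acceleration table: every k-bit window maps either to a leaf
    # (unit, depth) or to the internal node reached after k bits.
    table = {}
    for i in range(2 ** k):
        bits = format(i, "0{}b".format(k))
        hit = next((n for n in range(1, k + 1) if bits[:n] in code_to_unit), None)
        if hit is not None:
            table[bits] = ("leaf", code_to_unit[bits[:hit]], hit)
        else:
            table[bits] = ("inner", bits, k)
    out, pos = [], 0  # pos plays the role of the read pointer
    while pos < len(bitstream):
        window = bitstream[pos:pos + k].ljust(k, "0")  # pad the final window
        kind, value, depth = table[window]
        if kind == "leaf":
            out.append(value)
            pos += depth  # advance by the depth of the leaf only
        else:
            path = value
            pos += k  # advance by the preset length
            while path not in code_to_unit:  # then walk the tree bit by bit
                path += bitstream[pos]
                pos += 1
            out.append(code_to_unit[path])
    return out
```

With the codes `{"a": "0", "b": "10", "c": "110", "d": "111"}`, the bit string `0101101110` decodes to `a, b, c, d, a` for a preset length of either 2 or 3. Codes no longer than the preset length are resolved with a single table lookup, and only longer codes fall back to bit-by-bit traversal.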
20. A data storage device, characterized in that data to be stored comprises an original data identifier and original data content corresponding to the original data identifier, the device comprising:
A data block generation module, configured to cluster the data to be stored according to the original data identifiers, take the data to be stored whose original data identifiers share a common part as one class, each class corresponding to one data block, and generate a data block identifier according to the original data identifiers of the data to be stored corresponding to the data block;
An original data content compression module, configured to compress the original data content of the data to be stored corresponding to the data block, store the compressed data content in the data area of the data block, and obtain the address identifier of the compressed data content;
An index generation module, configured to generate an index according to the original data identifiers of the data to be stored corresponding to the data block and the address identifiers, and store the index in the index area of the data block.
21. The device according to claim 20, characterized in that the data block generation module is further configured to take the data to be stored whose original data identifiers have the same low-order part of a first preset number of bits as one class, each class corresponding to one data block, and to use the low-order part as the data block identifier of the data block.
22. The device according to claim 21, characterized in that the index generation module is further configured to remove the low-order part from the original data identifier of the data to be stored corresponding to the data block to obtain a compressed data identifier, and to store the compressed data identifier and the address identifier of the compressed data content, in correspondence with each other, in the index area of the data block.
23. The device according to claim 22, characterized in that the index generation module comprises:
A data bucket generation module, configured to take the original data identifiers of the data block that have the same high-order part of a second preset number of bits as one subclass, map the original data identifiers of each subclass to a respective data bucket in the index area of the data block, use the high-order part as the data bucket identifier of the data bucket, and store, in the index area of the data block, the correspondence between the data bucket identifier and the bucket start storage address of the data bucket in the index area;
A compressed data identifier generation module, configured to remove the low-order part and the high-order part from the original data identifier of the data to be stored corresponding to the data block to obtain a compressed data identifier, and store the compressed data identifier in the data bucket whose data bucket identifier is the high-order part;
A first storage module, configured to store, in the index area of the data block, the address identifiers of the compressed data content corresponding to the compressed data identifiers stored in the data buckets.
24. The device according to claim 23, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area;
The first storage module is further configured to store, in the index area, the start offset corresponding to the data bucket, and the in-bucket offset, relative to the start offset, of the compressed data content corresponding to each compressed data identifier in the data bucket.
25. The device according to claim 23, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed data content in the data area relative to the start storage address of the data area;
The first storage module is further configured to store, in the index area, the start offset of the compressed data content corresponding to the data bucket and the data length of each piece of compressed data content corresponding to the data bucket.
26. The device according to any one of claims 23-25, characterized in that the compressed data identifiers are stored in the data buckets of the index area of the data block in ascending or descending order of their numerical values.
27. The device according to any one of claims 20-25, characterized in that the original data content compression module comprises:
A frequency calculation module, configured to divide the original data content of the data to be stored into data units to obtain a data unit set, and calculate the frequency of occurrence of each data unit in the data unit set;
A compressed data content generation module, configured to assign data unit codes to the data units according to their frequencies of occurrence, and to encode, data unit by data unit, the original data content of the data to be stored corresponding to the data block according to the correspondence between the data units and the data unit codes, to obtain the compressed data content;
A second storage module, configured to store the compressed data content, together with a coding table recording the correspondence between the data units and the data unit codes, in the data area of the data block, and obtain the address identifier of the compressed data content.
28. The device according to claim 27, characterized in that the compressed data content generation module is further configured to construct a Huffman tree according to the frequencies of occurrence of the data units, and to generate each data unit code from the path from the root node of the Huffman tree to a leaf node;
The second storage module is further configured to store the Huffman tree in the data area of the data block.
29. The device according to claim 28, characterized in that the device further comprises a decoding acceleration table generation module, comprising:
A first mapping module, configured to construct a decoding acceleration table from data unit codes whose data length is a preset length, each data unit code in the decoding acceleration table being mapped to the leaf node in the Huffman tree that corresponds to that data unit code;
A second mapping module, configured to map each data unit code in the decoding acceleration table that is identical to the prefix of the preset length of a long data unit code to the node in the Huffman tree that corresponds to that prefix of the long data unit code;
A third mapping module, configured to map each data unit code in the decoding acceleration table whose prefix is identical to a short data unit code to the leaf node in the Huffman tree that corresponds to the short data unit code;
A third storage module, configured to store the decoding acceleration table in the data area of the data block.
30. A data query device, characterized in that the device comprises:
An original query data identifier acquisition module, configured to obtain an original query data identifier;
A data block determination module, configured to determine, according to the original query data identifier and pre-stored data block identifiers, the data block corresponding to the original query data identifier; an address identifier acquisition module, configured to obtain, according to the index in the index area of the data block, the address identifier corresponding to the original query data identifier, and further configured to remove the low-order part of the original query data identifier to obtain a compressed query data identifier, and obtain the address identifier that corresponds to the compressed query data identifier stored in the index area of the data block;
A query data content acquisition module, configured to obtain compressed query data content from the data area of the data block according to the address identifier, and to decompress the compressed query data content to obtain query data content.
31. The device according to claim 30, characterized in that the data block determination module is further configured to determine the data block corresponding to the data block identifier that matches the low-order part, of a first preset number of bits, of the original query data identifier.
32. The device according to claim 30, characterized in that the address identifier acquisition module comprises:
A data bucket identifier determination module, configured to determine the data bucket identifier that matches the high-order part, of a second preset number of bits, of the original query data identifier;
A bucket start storage address determination module, configured to obtain, from the index area of the data block, the bucket start storage address, in the index area, of the data bucket corresponding to the data bucket identifier;
A compressed query data identifier acquisition module, configured to remove the high-order part and the low-order part of the original query data identifier to obtain the compressed query data identifier;
A search module, configured to search the data bucket, according to the bucket start storage address of the data bucket in the index area, for the compressed data identifier that matches the compressed query data identifier;
A first acquisition module, configured to obtain, from the index area, the address identifier corresponding to the matching compressed data identifier.
33. The device according to claim 32, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area;
The first acquisition module comprises:
A second acquisition module, configured to obtain, from the index area, the start offset corresponding to the data bucket and the in-bucket offset corresponding to the matching compressed data identifier;
A first offset calculation module, configured to calculate the offset of the query data content according to the start offset and the in-bucket offset.
34. The device according to claim 32, characterized in that the address identifier of the compressed data content is the offset of the storage address of the compressed query data content in the data area relative to the start storage address of the data area;
The first acquisition module comprises:
A third acquisition module, configured to obtain, from the index area, the start offset corresponding to the data bucket, and the data lengths of the matching compressed data identifier and of all compressed data identifiers preceding the matching compressed data identifier in the data bucket;
A second offset calculation module, configured to calculate the offset of the query data content according to the start offset and the obtained data lengths.
35. The device according to any one of claims 32-34, characterized in that the compressed data identifiers in the data buckets of the index area of the data block are stored in ascending or descending order of their numerical values;
The search module is further configured to search the data bucket, using binary search, for the compressed data identifier that matches the compressed query data identifier.
36. The device according to any one of claims 30-34, characterized in that the query data content acquisition module is further configured to obtain, from the data area, a coding table recording the correspondence between data units and data unit codes; to decode, according to the coding table, the compressed query data content, which comprises data unit codes, to obtain data units; and to obtain the query data content according to the data units.
37. The device according to claim 36, characterized in that the query data content acquisition module is further configured to obtain, from the data area, a Huffman tree and the coding table recording the correspondence between data units and data unit codes; to divide the compressed query data content into data unit codes according to the Huffman tree, and obtain the data units corresponding to the data unit codes according to the coding table; and to obtain the query data content according to the data units.
38. The device according to claim 37, characterized in that the query data content acquisition module comprises:
A fourth acquisition module, configured to obtain, from the data area, the Huffman tree, a decoding acceleration table, and the coding table recording the correspondence between data units and data unit codes; wherein the decoding acceleration table comprises a plurality of data unit codes of a preset length, and each data unit code in the decoding acceleration table is mapped to the leaf node in the Huffman tree that corresponds to that data unit code; a data unit code in the decoding acceleration table that is identical to the prefix of the preset length of a long data unit code is mapped to the node in the Huffman tree that corresponds to that prefix of the long data unit code; and a data unit code in the decoding acceleration table whose prefix is identical to a short data unit code is mapped to the leaf node in the Huffman tree that corresponds to the short data unit code;
A node mapping module, configured to read coded data of the preset length with a read pointer, starting from the start storage address of the compressed query data content, and map the coded data to a node in the Huffman tree using the decoding acceleration table;
A leaf node processing module, configured to, if the node to which the coded data is mapped is a leaf node, decode according to the coding table to obtain the data unit corresponding to the leaf node, move the read pointer, in the direction away from the start storage address of the compressed query data content, by a distance corresponding to the depth of the leaf node, and then continue to read coded data of the preset length, until the query data content or a preset portion of the query data content is obtained by decoding;
A non-leaf node processing module, configured to, if the node to which the coded data is mapped is a non-leaf node, move the read pointer, in the direction away from the start storage address of the compressed query data content, by a distance corresponding to the preset length; then, using the Huffman tree and starting from the node to which the coded data is mapped, move the read pointer bit by bit and decode until a leaf node is reached, obtain the data unit corresponding to the leaf node, and continue to read coded data of the preset length, until the query data content or a preset portion of the query data content is obtained by decoding.
CN201310577254.5A 2013-11-18 2013-11-18 Data storage, querying method and device Active CN104657362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310577254.5A CN104657362B (en) 2013-11-18 2013-11-18 Data storage, querying method and device

Publications (2)

Publication Number Publication Date
CN104657362A CN104657362A (en) 2015-05-27
CN104657362B true CN104657362B (en) 2018-07-10

Family

ID=53248510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310577254.5A Active CN104657362B (en) 2013-11-18 2013-11-18 Data storage, querying method and device

Country Status (1)

Country Link
CN (1) CN104657362B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1858747A (en) * 2006-04-30 2006-11-08 北京金山软件有限公司 Data storage/searching method and system
CN101676899A (en) * 2008-09-18 2010-03-24 上海宝信软件股份有限公司 Profiling and inquiring method for massive database records
CN102024047A (en) * 2010-12-14 2011-04-20 青岛普加智能信息有限公司 Data searching method and device thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100257174A1 (en) * 2009-04-02 2010-10-07 Matthew Dino Minuti Method for data compression utilizing pattern-analysis and matching means such as neural networks

Also Published As

Publication number Publication date
CN104657362A (en) 2015-05-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant