CN102306187A - Hash sorting method for two-dimensional table - Google Patents

Info

Publication number
CN102306187A
CN102306187A CN201110254893A CN201110254893A CN102306187A CN 102306187 A CN102306187 A CN 102306187A CN 201110254893 A CN201110254893 A CN 201110254893A CN 201110254893 A CN201110254893 A CN 201110254893A CN 102306187 A CN102306187 A CN 102306187A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110254893A
Other languages
Chinese (zh)
Inventor
蒋云良
范婧
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201110254893A priority Critical patent/CN102306187A/en
Publication of CN102306187A publication Critical patent/CN102306187A/en
Pending legal-status Critical Current

Abstract

The invention provides a hash sorting method for a two-dimensional table. The method is efficient and does not change the relative order of records that have identical attribute values, that is, the sorting method is stable. The method comprises the following steps: (1) initializing the two-dimensional table as a linked list and establishing the initial partition block; (2) hashing the elements of a partition block into a hash-cube on the current attribute; (3) traversing the hash-cube to rebuild the partition blocks; and (4) checking whether every partition block containing more than one element has been rebuilt and whether this rebuilding has been completed on all attributes; if so, stopping; otherwise performing steps (2), (3), and (4) on the next partition block or the next attribute.

Description

Hash sorting method for a two-dimensional table
Technical field
The present invention relates to a method for sorting two-dimensional tables, and in particular to a hash sorting method for two-dimensional tables of discrete attributes.
Background art
The two-dimensional table is one of the main ways of representing information. A data set used in information processing is represented as a two-dimensional table described by several attributes; the values on each attribute are discrete numbers or characters (continuous data can first be discretized in a preprocessing step), and each row of the table is one record. Sorting such a table means sorting the records on several attribute fields according to a given priority, in ascending or descending order. Two-dimensional table sorting is needed in many areas of computing. In typical commercial data processing, for example, each data item is a record and the records are sorted by key, which greatly simplifies looking up or updating a record. For classification problems, sorting the samples as a two-dimensional table is also a useful and necessary preprocessing step. Table sorting is an important step in many data processing algorithms: sorting the record set on a specified attribute set first, and then deriving the equivalence classes that this attribute set induces on the universe, reduces the complexity of computing upper and lower approximations, attribute reducts, and cores.
The typical existing method is two-dimensional table quicksort, but it depends on the distribution of the data: when most of the data is in reverse order its performance degrades quickly, and it is not stable. Moreover, with the rapid development of information technology and the explosive growth of data, processing large data sets is increasingly common, and the performance of this method on very large data is unsatisfactory. An efficient two-dimensional table sorting method suited to large data sets is therefore needed.
Summary of the invention
The technical problem addressed by the present invention is to provide a hash sorting method for two-dimensional tables that is efficient and does not change the relative order of records with identical attribute values, that is, a stable sorting method. To this end, the present invention adopts the following technical solution.
The hash sorting method for a two-dimensional table comprises the following steps:
(1) initialize the two-dimensional table as a linked list and establish the initial partition block;
(2) hash the elements of a partition block into a hash-cube on the current attribute;
(3) traverse the hash-cube to rebuild the partition blocks;
(4) check whether every partition block containing more than one element has been rebuilt, and whether this rebuilding has been completed on all attributes; if so, stop; otherwise perform steps (2), (3), and (4) on the next partition block or the next attribute.
On the basis of the above technical solution, the present invention may also adopt the following further technical solutions.
In step (1), the records of the two-dimensional table are stored in a linked list. Each list node comprises: the start address of a record in the table; a pointer to the next partition block; and a pointer to the next element in the same partition block.
In step (2), the hashing proceeds as follows: find the minimum and maximum elements a_min and a_max of the partition block on the current attribute; compute the dimension d of the planar matrix of the hash-cube, that is, its number of rows and columns, as the smallest integer whose square is at least the number of distinct values between a_min and a_max inclusive,
d = ⌈√(a_max - a_min + 1)⌉;
compute the position in the hash-cube of each data element of the partition block, and thereby build the hash-cube of the partition block.
In the hash-cube, i, j, and k denote the row, column, and layer coordinates respectively; d is the dimension of the planar matrix; x is a data element to be sorted in the partition block; a_min and a_max are the minimum and maximum elements of the partition block on the current attribute; a data element is the value of a record on the current attribute; and [ ] denotes taking the integer part (rounding down).
Computing the position of each data element in the hash-cube comprises the following steps: compute the row coordinate of element x as i = [(x - a_min)/d]; compute the column coordinate of element x as j = (x - a_min) % d; each pair (i, j) has a counter k_ij with initial value 0, the current value of k_ij is the layer coordinate of the element, and k_ij is increased by 1 each time (i, j) occurs again. k_ij thus identifies the layer of the planar matrix within the hash-cube. The planar matrices of all layers differ only in their layer index and have the same number of rows and columns. Within a planar matrix, the elements of a row increase from left to right, the elements of a column increase from top to bottom, and every element is greater than all elements with a smaller row coordinate. Each cell of a planar matrix holds exactly one data element and no two cells of the same layer hold the same value, so every distinct element of the partition block has a position in such a matrix. Taking layer 0 of the hash-cube as the base, all distinct elements of the partition block are mapped onto this layer, and repeated elements are mapped onto the following layers in turn.
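The coordinate calculation can be illustrated with a minimal C++ sketch. The function names `cubeDim` and `cubeCoords` and the use of `std::map` for the counters k_ij are assumptions made for this illustration and do not appear in the patent; the dimension formula is the one implied by the requirement that one layer hold every distinct value between a_min and a_max, and the integer division rounds down because x - a_min is never negative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Side length d of one layer: the smallest d with d * d >= aMax - aMin + 1.
int cubeDim(int aMin, int aMax) {
    assert(aMin <= aMax);
    const int span = aMax - aMin + 1;
    int d = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(span))));
    while (d * d < span) ++d;                        // guard against rounding
    while (d > 1 && (d - 1) * (d - 1) >= span) --d;
    return d;
}

// (row i, column j, layer k) of every element of xs, in input order.
std::vector<std::tuple<int, int, int>> cubeCoords(const std::vector<int>& xs) {
    std::vector<std::tuple<int, int, int>> coords;
    if (xs.empty()) return coords;
    const auto mm = std::minmax_element(xs.begin(), xs.end());
    const int aMin = *mm.first;
    const int d = cubeDim(aMin, *mm.second);
    std::map<std::pair<int, int>, int> counter;      // k_ij, starts at 0
    for (int x : xs) {
        const int i = (x - aMin) / d;                // rounds down: x - aMin >= 0
        const int j = (x - aMin) % d;
        const int k = counter[std::make_pair(i, j)]++;  // layer, then bump k_ij
        coords.emplace_back(i, j, k);
    }
    return coords;
}
```

For the sequence {5, 5, 3, 4, 5, 4} this sketch gives d = 2 and the coordinates (1,0,0), (1,0,1), (0,0,0), (0,1,0), (1,0,2), (0,1,1), which agree with Table 1 of the embodiment.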
Rebuilding the partition blocks in step (3) comprises: traversing the hash-cube in the order of layer, row, and column; elements whose coordinates i and j are identical and that differ only in coordinate k belong to the same partition block; the order between partition blocks is given by coordinates i and j; i, j, and k denote the row, column, and layer coordinates of the hash-cube respectively.
A hash table combines an array, a table, and a specific mathematical method into a structure that efficiently supports dynamic storage and retrieval of data, and it has important applications in data storage and information security. With a suitably constructed hash function, the access and mapping operations on a large amount of data can be completed in linear time. The present invention uses this property to construct a sorting method: a hash sorting method for two-dimensional discrete data tables with linear time complexity, which is more efficient than the commonly used two-dimensional table sorting methods based on quicksort. The average time complexity of the method can be reduced to O(m × n), where m is the number of keys of the table and n is the number of records. It also achieves good efficiency when the data is non-uniformly distributed, and its performance is unaffected by changes in data saturation. The method is stable and is suitable for sorting problems in data mining on temporal data, sequential machine learning, and massive data processing.
In the method, the elements of a partition block are hashed into a hash-cube, and traversing the hash-cube in a fixed order yields the sorted data sequence; the method is suitable both for sorting two-dimensional tables and for one-dimensional data sequences. The hash-cube is a cube formed by stacking several planar matrices. In each layer, the cells formed by the rows and columns hold the data elements; i and j denote the row and column coordinates of an element, k denotes the layer, and (i, j, k) identifies the unique position of a data element in the hash-cube. Elements are stored in the hash-cube in a definite order, so the hash-cube converts an unordered data sequence into an ordered one: traversing the cube by layer, row, and column turns its data elements into an ordered sequence.
Sorting a two-dimensional attribute table is equivalent to an ordered partitioning of an information system on its whole attribute set. The method combines breadth-first processing with partitioning and applies one-dimensional hash sorting to each attribute. Each call to the one-dimensional hash sort splits the data set into smaller blocks; the one-dimensional hash sort is then called on each block on the next attribute, until sorting on the last attribute has finished or every current partition block contains only one element. Because each partition block must be hashed frequently during sorting, the method stores the attribute data in a static linked list, and the order between records is indicated by pointers. The partition blocks formed after hashing are ordered; to preserve this order, each list node contains three fields: the start address of a record; a pointer to the next partition block; and a pointer to the next element in the same partition block.
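As a rough illustration of the one-dimensional case, the following C++ sketch hashes a sequence into a hash-cube and reads it back in sorted order. It is a sketch under stated assumptions, not the patent's implementation: the name `hashCubeOrder` is hypothetical, each grid cell is stored as a growable vector of layers instead of a fixed three-dimensional array, and the cells are visited row by row and column by column with the layers of each cell read from the bottom up, which is the visiting order used by the embodiment's pseudocode (step 2.3).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the positions 0..n-1 of `keys`, ordered so that the keys ascend.
// Equal keys keep their input order, so the result is stable.
std::vector<int> hashCubeOrder(const std::vector<int>& keys) {
    std::vector<int> order;
    if (keys.empty()) return order;
    const auto mm = std::minmax_element(keys.begin(), keys.end());
    const int aMin = *mm.first;
    const int span = *mm.second - aMin + 1;
    int d = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(span))));
    while (d * d < span) ++d;  // guard against floating-point rounding

    // cell[i * d + j] holds the layers of grid cell (i, j), lowest layer first.
    std::vector<std::vector<int>> cell(static_cast<std::size_t>(d) * d);
    for (int p = 0; p < static_cast<int>(keys.size()); ++p) {
        const int off = keys[p] - aMin;
        cell[(off / d) * d + off % d].push_back(p);
    }
    // Visit cells by row, then column, then layer.
    for (const std::vector<int>& layers : cell)
        order.insert(order.end(), layers.begin(), layers.end());
    return order;
}
```

Because i × d + j equals x - a_min, the traversal visits values in ascending order, and appending to a cell in input order is what preserves the relative order of equal keys. Memory grows with the value range a_max - a_min, so the sketch assumes a small discrete range.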
struct Node {
    int  *element;        // value field: address of the record in memory
    Node *next_Category;  // pointer to the next partition block
    Node *next_element;   // pointer to the next element in the same partition block
};
Description of drawings
Fig. 1 is a flowchart of the method for hashing the elements of a partition block into the hash-cube in an embodiment of the invention.
Fig. 2 is a schematic diagram of the hash-cube of a sequence to be sorted in an embodiment of the invention.
Fig. 3 is a flowchart of the two-dimensional table hash sorting method in an embodiment of the invention.
Fig. 4 is a schematic diagram of initializing the two-dimensional table as a linked list and establishing the initial partition block in an embodiment of the invention.
Figs. 5 to 7 are schematic diagrams of the sorting results of the two-dimensional table on each attribute in an embodiment of the invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it.
Two-dimensional table sorting is an extension of one-dimensional sorting: the two-dimensional hash sorting method repeatedly calls the one-dimensional hash sorting method, and the sequence to be sorted by the one-dimensional hash sort is the element sequence of a partition block in the two-dimensional method. The hashing method of the embodiment is described below.
Fig. 1 is a flowchart of the method for hashing the elements of a partition block into the hash-cube in an embodiment of the invention.
Step 101: find the minimum and maximum elements a_min and a_max of the partition block on the current attribute.
All elements of the partition block on the current attribute lie between the minimum and maximum elements (inclusive). If there are no repeated values, all elements can be mapped in a definite order into a square matrix, with the data stored in the cells formed by its rows and columns.
Step 102: compute the dimension d of the planar square matrix of the hash-cube.
The square matrix must be able to hold every distinct value between the minimum and maximum elements (inclusive). Its dimension d, the number of rows and columns of the hash-cube, is
d = ⌈√(a_max - a_min + 1)⌉
(⌈ ⌉ denotes rounding up).
Step 103: compute the position in the hash-cube of each data element of the partition block, and build the hash-cube of the partition block.
In general the elements of a partition block contain repeated values on the current attribute, so mapping them into a planar square matrix is not one-to-one but many-to-one. In the hash-cube, repeated elements are stored in the following layers, so every element of the partition block is hashed to a unique position. The coordinates (i, j, k) of a hashed element x are i = [(x - a_min)/d], j = (x - a_min) % d, and k = k_ij, where k_ij has initial value 0 and is increased by 1 each time (i, j) occurs again.
For example, for the sequence A = {5, 5, 3, 4, 5, 4}, the coordinates of each element in the hash-cube are listed in Table 1, and the hash-cube formed by A is shown in Fig. 2. Here a_min = 3 and a_max = 5, and the coordinates in Table 1 correspond to d = 2.
Element of sequence A    Coordinates in the hash-cube
5                        (1, 0, 0)
5                        (1, 0, 1)
3                        (0, 0, 0)
4                        (0, 1, 0)
5                        (1, 0, 2)
4                        (0, 1, 1)
Table 1
Fig. 3 is a flowchart of the two-dimensional table hash sorting method in an embodiment of the invention.
The two-dimensional table hash sorting method is:
1. Initialize the linked list L and the attribute r_i: L is initialized as a list with a single partition block, and r_i = r_0.
2. Traverse the linked list L.
2.1 If the partition block contains more than one element, scan its elements to obtain r_imax and r_imin and compute d; otherwise go to 2.4.
2.2 Map the elements of the partition block into the hash-cube, represented by the array a[d][d][n], as follows.
2.2.1 Initialize the counting array count[d][d] = 0 (the count array holds the k coordinate of the elements).
2.2.2 For each element with value r, compute the row and column of the mapping: i = (r - r_imin)/d, j = (r - r_imin) % d.
2.2.3 count[i][j]++
2.2.4 Store the node of the current element in a[i][j][count[i][j]-1] (the list nodes are kept for rebuilding the list).
2.3 Rebuild the linked list: traverse the array count[d][d] and, guided by its counts, traverse the hash-cube.
2.3.1 If count[i][j] != 0, perform the following operation; otherwise skip it.
2.3.2 If count[i][j] = 1, insert a[i][j][0] into the list L as the next partition block. If count[i][j] > 1, insert a[i][j][0] into L as the next partition block, and insert a[i][j][1] to a[i][j][count[i][j]-1] as the following elements of that same partition block.
2.4 Repeat steps 2.1, 2.2, and 2.3 on the next partition block until all partition blocks have been processed.
3. Repeat step 2 on the next attribute r_i+1 until sorting on attribute r_m has finished or every partition block contains only one element.
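The numbered procedure above can be sketched in C++ as follows. This is an illustrative sketch under stated assumptions rather than the patent's implementation: the names `Row`, `Block`, `splitBlock`, and `hashSortTable` are hypothetical, partition blocks are kept as vectors of record indices instead of the linked list L, and each grid cell is a growable vector instead of the array a[d][d][n].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::vector<int>;    // one record: its attribute values
using Block = std::vector<int>;  // one partition block: record indices in order

// Steps 2.1 to 2.3 for one block: hash its records on attribute `attr`
// and return the non-empty cells as new, ordered partition blocks.
std::vector<Block> splitBlock(const std::vector<Row>& table,
                              const Block& block, std::size_t attr) {
    assert(!block.empty());
    int lo = table[block[0]][attr];
    int hi = lo;
    for (int r : block) {
        lo = std::min(lo, table[r][attr]);
        hi = std::max(hi, table[r][attr]);
    }
    const int span = hi - lo + 1;
    int d = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(span))));
    while (d * d < span) ++d;  // guard against floating-point rounding

    // cell[i * d + j] collects the layers of grid cell (i, j).
    std::vector<Block> cell(static_cast<std::size_t>(d) * d);
    for (int r : block) {
        const int off = table[r][attr] - lo;
        cell[(off / d) * d + off % d].push_back(r);
    }
    std::vector<Block> parts;
    for (Block& c : cell)
        if (!c.empty()) parts.push_back(std::move(c));
    return parts;
}

// Returns the record indices in sorted order (attribute 0 has top priority).
std::vector<int> hashSortTable(const std::vector<Row>& table) {
    std::vector<int> order;
    if (table.empty()) return order;
    Block all;
    for (std::size_t r = 0; r < table.size(); ++r)
        all.push_back(static_cast<int>(r));
    std::vector<Block> blocks(1, all);  // step 1: a single initial block

    for (std::size_t attr = 0; attr < table[0].size(); ++attr) {  // step 3
        std::vector<Block> next;
        bool anySplit = false;
        for (const Block& b : blocks) {                           // step 2
            if (b.size() <= 1) { next.push_back(b); continue; }
            anySplit = true;
            std::vector<Block> parts = splitBlock(table, b, attr);
            for (Block& p : parts) next.push_back(std::move(p));
        }
        blocks.swap(next);
        if (!anySplit) break;  // every block already holds a single record
    }
    for (const Block& b : blocks)
        order.insert(order.end(), b.begin(), b.end());
    return order;
}
```

Applied to the eight records of Table 2 (indexed 0 to 7 for x1 to x8), the sketch returns the order x6, x7, x3, x4, x1, x2, x8, x5; the records x1 and x2, which are equal on all three attributes, keep their original relative order.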
Step 301: initialize the two-dimensional table as a linked list and establish the initial partition block.
In the embodiment, at initialization all records of the table to be sorted are placed in one equivalence class, that is, no order is yet distinguished between the records, and they are placed in the linked list in the order they have in the table. Each list node comprises the start address of a record in the table, a pointer to the next partition block, and a pointer to the next element in the same partition block. The value field of a node holds the address of the record in memory, so that the record can be found quickly during sorting. At initialization the next_Category pointer of every node is NULL, and the next_element pointers link the records of the table into one list. To illustrate the method, consider a simple example: Table 2 is a two-dimensional table with 3 attributes and 8 records, all attributes being of type int. In Fig. 4, 401 to 408 are the storage locations of the 8 records in memory, and 411 is the linked list after initialization. Taking record x1 as an example, the attribute value of a record on a given attribute is located during sorting as follows. The variable r holds an offset with initial value 0 (410 in Fig. 4), and the value field of node 409 holds the address of record x1 in memory, 4000. Then 4000 + 0 × 2 is the memory location of the first attribute of x1. When sorting on the second attribute, the offset in r is increased by 1, and the location of the second attribute of x1 is computed from the value field of node 409 and the offset as 4000 + 1 × 2.
Record    r0    r1    r2
x1        4     2     5
x2        4     2     5
x3        3     5     2
x4        3     5     4
x5        5     4     3
x6        2     2     4
x7        3     4     2
x8        5     3     5
Table 2
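The initialization of the linked list and the offset-based attribute lookup described in step 301 can be sketched in C++ as follows. The helper names `initList` and `attributeAt` are assumptions for this illustration, as is the choice to keep all records in one row-major buffer and all nodes in one `std::vector`. C++ pointer arithmetic multiplies the offset by the element size automatically, which plays the role of the factor 2 in the address 4000 + r × 2 of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int*  element = nullptr;        // address of the record's first attribute
    Node* next_Category = nullptr;  // next partition block
    Node* next_element = nullptr;   // next element in the same partition block
};

// Builds the initial list: one partition block holding every record in
// table order. `records` is a row-major buffer with m int attributes per
// record. The nodes point at each other, so the returned vector must not
// be copied or resized while the list is in use.
std::vector<Node> initList(std::vector<int>& records, std::size_t m) {
    assert(m > 0 && records.size() % m == 0);
    std::vector<Node> nodes(records.size() / m);
    for (std::size_t r = 0; r < nodes.size(); ++r) {
        nodes[r].element = records.data() + r * m;
        if (r + 1 < nodes.size()) nodes[r].next_element = &nodes[r + 1];
    }
    return nodes;
}

// Value of attribute `offset` (0-based) of the record referenced by `node`.
int attributeAt(const Node& node, std::size_t offset) {
    return node.element[offset];
}
```

With the first three records of Table 2 stored in the buffer, `attributeAt(nodes[0], 1)` reads the second attribute of x1, which is 2.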
Step 302: check whether the partition block contains more than one element.
If a partition block contains exactly one element, it holds a single record, which needs no comparison or sorting against other records. Only when a partition block contains more than one element must its records, whose order cannot yet be distinguished, be sorted on the next attribute. Fig. 5 shows the result of sorting on attribute r0: partition blocks 501, 502, and 503 each contain more than one element, that is, the records within each of these blocks have equal r0 values and their order cannot be distinguished, so they must be sorted on attribute r1. Likewise, partition blocks 601 and 602 in Fig. 6 must be sorted again on attribute r2.
Step 303: hash the elements of the partition block into the hash-cube on the current attribute.
This step uses the one-dimensional hashing method: first the coordinates in the hash-cube are computed by the formulas from the current attribute value of each record to be sorted, then the record is hashed into the hash-cube. The hash-cube has the same structure as the one shown in Fig. 2.
Step 304: traverse the hash-cube to rebuild the partition blocks.
When the partition blocks are rebuilt, elements with identical coordinates i and j that differ only in coordinate k belong to the same partition block; all other elements belong to different blocks. The order between partition blocks is given by coordinates i and j.
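The rebuilding rule of step 304 can be illustrated with a small C++ sketch that relinks the nodes of a single hash-cube. The function name `relink` and the representation of the cube as a row-major vector of cells are assumptions for this illustration; splicing the rebuilt blocks back into a longer list L that holds other partition blocks is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int*  element = nullptr;        // address of the record
    Node* next_Category = nullptr;  // next partition block
    Node* next_element = nullptr;   // next element in the same partition block
};

// cells[c] lists the nodes hashed to grid cell c (row-major), lowest layer
// first. Each non-empty cell becomes one partition block: its first node is
// the block head, linked to the next head through next_Category, and the
// remaining layers follow through next_element.
// Returns the head of the first block, or nullptr if every cell is empty.
Node* relink(const std::vector<std::vector<Node*>>& cells) {
    Node* first = nullptr;
    Node* prevHead = nullptr;
    for (const std::vector<Node*>& layers : cells) {
        if (layers.empty()) continue;  // corresponds to count[i][j] == 0
        Node* head = layers[0];
        if (prevHead != nullptr) prevHead->next_Category = head;
        else first = head;
        prevHead = head;
        for (std::size_t k = 0; k < layers.size(); ++k) {
            layers[k]->next_Category = nullptr;  // heads are relinked later
            layers[k]->next_element =
                (k + 1 < layers.size()) ? layers[k + 1] : nullptr;
        }
    }
    return first;
}
```

Only block heads carry a non-null next_Category after the call, so walking next_Category visits the blocks in (i, j) order and walking next_element visits the records of one block in layer order.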
Step 305: move to the next partition block, and check whether the last partition block has been rebuilt.
If so, work on all partition blocks on the current attribute is finished; otherwise repeat steps 302, 303, and 304 on the partition blocks.
Step 306: move to the next attribute, and check whether the partition blocks have been rebuilt on the last attribute.
If so, sorting is finished; otherwise repeat steps 302, 303, 304, and 305 on the partition blocks on the next attribute. Figs. 5, 6, and 7 show the sorting results on attributes r0, r1, and r2 respectively.

Claims (5)

1. A hash sorting method for a two-dimensional table, characterized in that it comprises the following steps:
(1) initializing the two-dimensional table as a linked list and establishing the initial partition block;
(2) hashing the elements of a partition block into a hash-cube on the current attribute;
(3) traversing the hash-cube to rebuild the partition blocks;
(4) checking whether every partition block containing more than one element has been rebuilt, and whether this rebuilding has been completed on all attributes; if so, stopping; otherwise performing steps (2), (3), and (4) on the next partition block or the next attribute.
2. The method according to claim 1, characterized in that a list node comprises:
the start address of a record in the two-dimensional table;
a pointer to the next partition block;
a pointer to the next element in the same partition block.
3. The method according to claim 1, characterized in that, in step (2), the hashing comprises:
finding the minimum and maximum elements a_min and a_max of the partition block on the current attribute;
computing the dimension d of the planar matrix of the hash-cube, that is, its number of rows and columns,
d = ⌈√(a_max - a_min + 1)⌉;
computing the position in the hash-cube of each data element of the partition block, and building the hash-cube of the partition block.
4. The method according to claim 3, characterized in that said computing the position in the hash-cube of each data element of the partition block comprises the following steps:
computing the row coordinate of element x in the hash-cube as i = [(x - a_min)/d];
computing the column coordinate of element x in the hash-cube as j = (x - a_min) % d;
each pair (i, j) having a counter k_ij with initial value 0, the value of k_ij being the layer coordinate, and k_ij being increased by 1 each time (i, j) occurs again;
wherein i, j, and k denote the row, column, and layer coordinates of the hash-cube respectively, d is the dimension of the planar matrix of the hash-cube, x is a data element to be sorted in the partition block, a_min and a_max are the minimum and maximum elements of the partition block on the current attribute, and [ ] denotes taking the integer part (rounding down).
5. The method according to claim 1, characterized in that rebuilding the partition blocks comprises:
traversing the hash-cube in the order of layer, row, and column;
elements whose coordinates i and j are identical and that differ only in coordinate k belonging to the same partition block;
the order between partition blocks being given by coordinates i and j;
wherein i, j, and k denote the row, column, and layer coordinates of the hash-cube respectively.
CN201110254893A 2011-08-31 2011-08-31 Hash sorting method for two-dimensional table Pending CN102306187A (en)

Publications (1)

Publication Number Publication Date
CN102306187A true CN102306187A (en) 2012-01-04

Family

ID=45380049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110254893A Pending CN102306187A (en) 2011-08-31 2011-08-31 Hash sorting method for two-dimensional table

Country Status (1)

Country Link
CN (1) CN102306187A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120104