CN102306187A

CN102306187A - Hash sorting method for two-dimensional table

Info

Publication number: CN102306187A
Application number: CN201110254893A
Authority: CN
Inventors: 蒋云良; 范婧; 刘勇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2012-01-04

Abstract

The invention provides a hash sorting method for a two-dimensional table. The method is efficient and stable, that is, it does not change the relative order of records that have identical attribute values in the two-dimensional table. The method comprises the following steps: (1) initializing the two-dimensional table as a linked list and establishing the initial partition block; (2) hashing the elements of a partition block into a hash-cube on the current attribute; (3) traversing the hash-cube to reconstruct the partition blocks; and (4) checking whether every partition block containing more than one element has been reconstructed and whether this reconstruction has been completed on all attributes; if so, stopping, and otherwise performing steps (2), (3) and (4) on the next partition block or the next attribute.

Description

Hash sorting method for two-dimensional table

Technical field

The present invention relates to a method for sorting a two-dimensional table, and in particular to a hash sorting method for a two-dimensional table of discrete attributes.

Background technology

Bivariate table is one of main expression-form of information; The data set table that is used for information processing is shown a bivariate table of being described by a plurality of attributive character; Value is discrete numerical value or character (can carry out the discretize pre-service earlier for the continuous type data is discrete data) on each dimension attribute; Record of a behavior in the bivariate table; This class two-dimensional table sorted is meant that a plurality of attribute fields to data set carry out sorting operation according to certain priority level, like ascending order, descending sort etc.In many fields of computing machine, usually need the bivariate table sorting operation; For example; Data item is a record in the typical commercial data processing; And record sorts according to key word; The process of searching or upgrading a record has been simplified in ordering greatly; For classification problem, sample is carried out two dimension ordering also be one and be highly profitable and necessary preprocessing process.The bivariate table ordering is the important step of many data processing algorithms, earlier record set is sorted on the property set of appointment, asks this property set that domain is divided the equivalence class that generates again, can reduce the complexity of calculating approximate up and down collection, attribute reduction, nuclear.

Existing typical bivariate table sort method is the bivariate table quick sort, but the method depends on the distribution situation of data, and when most of data were the inverted order arrangement, performance descended very fast, and does not have stability.On the other hand, along with fast development of information technology, the growth of data explosion property, it is more and more general to handle large data sets, and the method performance when data scale is very big is not ideal enough.Seeking the bivariate table sort method that adapts to large data sets efficiently is necessary.

Summary of the invention

The technical problem to be solved by the present invention is to provide a hash sorting method for a two-dimensional table that is efficient and stable, that is, it does not change the relative order of records with identical attribute values in the two-dimensional table. To this end, the present invention adopts the following technical solution:

The hash sorting method for a two-dimensional table comprises the following steps:

(1) initializing the two-dimensional table as a linked list and establishing the initial partition block;

(2) hashing the elements of a partition block into a hash-cube on the current attribute;

(3) traversing the hash-cube to reconstruct the partition blocks;

(4) checking whether every partition block containing more than one element has been reconstructed, and whether this reconstruction has been completed on all attributes; if so, stopping; if not, performing steps (2), (3) and (4) on the next partition block or the next attribute.

On the basis of the above technical solution, the present invention may also adopt the following further technical solutions:

In step (1), the records of the two-dimensional table are stored in a linked list, and each linked-list node comprises: the start address of a record in the two-dimensional table; a pointer to the next partition block; and a pointer to the next element in the same partition block.

In step (2), the hashing method comprises: finding the minimum and maximum elements $a_{\min}$ and $a_{\max}$ of the partition block on the current attribute; calculating the dimension d of the plane matrix of the hash-cube, that is, the number of its rows and columns,

$d = \lceil \sqrt{a_{\max} - a_{\min} + 1} \rceil$;

and calculating the position in the hash-cube of each data element of the partition block, thereby building the hash-cube of the partition block.

In the hash-cube, i, j and k denote the row, column and layer coordinates respectively; d is the dimension of the plane matrix of the hash-cube; x is a data element to be sorted in the partition block; $a_{\min}$ and $a_{\max}$ are the minimum and maximum elements of the partition block on the current attribute; a data element is the value of a record on the current attribute; and [ ] denotes taking the integer part.

Calculating the position of each data element in the hash-cube comprises the following steps: calculating the row coordinate of element x in the hash-cube according to $i = [(x - a_{\min})/d]$; calculating the column coordinate of element x in the hash-cube according to $j = (x - a_{\min}) \bmod d$; and keeping, for each pair (i, j), a counter $k_{ij}$ with initial value 0, where, if (i, j) has already occurred, $k_{ij} = k_{ij} + 1$, and $k_{ij}$ is the layer coordinate. $k_{ij}$ in effect identifies the layer of the plane matrix within the hash-cube. The plane matrices of all layers have the same number of rows and columns and differ only in their layer index. Within a plane matrix, the data elements of a row increase from left to right, the data elements of a column increase from top to bottom, and every data element is greater than all data elements whose row coordinate is smaller than its own. Each grid cell of a plane matrix holds exactly one data element, no value appears in more than one cell, and every distinct element of the partition block has a corresponding position in such a plane matrix. Taking layer 0 of the hash-cube as the base, all distinct elements of the partition block are mapped onto this plane, and repeated elements are mapped in turn onto the following layers.

In step (3), reconstructing the partition blocks comprises: traversing the hash-cube by layer, row and column; elements whose coordinates i and j are the same and that differ only in coordinate k belong to the same partition block; the order between partition blocks is determined by coordinates i and j; i, j and k denote the row, column and layer coordinates of the hash-cube respectively.

A hash table combines arrays, lists and a specific mathematical method into a structure that efficiently supports dynamic storage and retrieval of data, and it has important applications in data storage and information security. With a suitably constructed hash function, access and mapping operations on large amounts of data can be completed in linear time. The present invention uses this property to build a sorting method: a hash sorting method of linear time complexity for two-dimensional tables of discrete data, which is more efficient than the two-dimensional table sorting methods based on quicksort that are generally used at present. The average time complexity of the method can be reduced to O(m × n), where m is the number of keys of the two-dimensional table and n is the number of records. The method also achieves good computational efficiency when the data is non-uniformly distributed, and its performance is unaffected by changes in data saturation. The method is stable and is suitable for sorting problems in data mining and sequential machine learning on data with sequential properties, and in massive data processing.

In the method of the invention, the elements of a partition block are hashed into a hash-cube, and traversing the hash-cube in a fixed order yields the sorted data sequence. The method is suitable both for sorting two-dimensional tables and for sorting one-dimensional data sequences. A hash-cube is a cube formed by stacking several plane matrices as layers. Each plane matrix stores data elements in the grid cells formed by its rows and columns; i and j denote the row and column coordinates of an element, k denotes the layer, and (i, j, k) uniquely identifies the position of a data element in the hash-cube. The elements in the hash-cube are stored in a definite order, so the hash-cube converts an unordered data sequence into an ordered one: traversing the cube by layer, row and column, visiting the cells in ascending (i, j) order and reading the layers of each cell in turn, yields the data elements as an ordered sequence.
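The one-dimensional case described above can be sketched in C++ as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `hashCubeOrder` is hypothetical, the stack of layers above each grid cell is stored as a vector, and d is assumed to be the smallest integer whose square covers the value range. The function returns the positions of the input keys in sorted order, so equal keys keep their original relative order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the indices of `keys` in ascending, stable order.
std::vector<std::size_t> hashCubeOrder(const std::vector<int>& keys) {
    std::vector<std::size_t> order;
    if (keys.empty()) return order;

    // Find the minimum and maximum key.
    int aMin = keys[0], aMax = keys[0];
    for (int x : keys) {
        if (x < aMin) aMin = x;
        if (x > aMax) aMax = x;
    }

    // Assumed dimension: smallest d with d * d >= number of possible values.
    long long span = static_cast<long long>(aMax) - aMin + 1;
    long long d = 1;
    while (d * d < span) ++d;
    assert(d * d >= span);

    // cell[i * d + j] holds the layers of grid cell (i, j) in arrival order,
    // so the position inside the vector plays the role of the layer k.
    std::vector<std::vector<std::size_t>> cell(static_cast<std::size_t>(d * d));
    for (std::size_t p = 0; p < keys.size(); ++p) {
        long long off = static_cast<long long>(keys[p]) - aMin;
        long long i = off / d;  // row coordinate
        long long j = off % d;  // column coordinate
        cell[static_cast<std::size_t>(i * d + j)].push_back(p);
    }

    // Visit cells in ascending (i, j) order and read each cell's layers in turn.
    for (const auto& layers : cell) {
        for (std::size_t p : layers) order.push_back(p);
    }
    return order;
}
```

For the keys {5, 5, 3, 4, 5, 4}, the returned positions are 2, 3, 5, 0, 1, 4: the three occurrences of 5 come out in their original order, which is the stability property the text claims.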

Sorting a two-dimensional attribute table is equivalent to an ordered partitioning of an information system on its whole attribute set. The method of the invention follows a breadth-first, divide-and-conquer approach and applies a one-dimensional hash sort on each attribute. Each call of the one-dimensional hash sort splits the data set into smaller blocks; the one-dimensional hash sort is then called on each block on the next attribute, until sorting on the last attribute is finished or every current partition block contains only one element. Because each partition block must be hashed frequently during sorting, the method stores the attribute data in a static linked list while sorting, and the order between records is indicated by pointers. The partition blocks formed by hashing are ordered. To preserve this order, each linked-list node contains three fields: the start address of a record; a pointer to the next partition block; and a pointer to the next element in the same partition block.

struct Node {
    int* element;          // value field: address of the record in memory
    Node* next_Category;   // pointer to the next partition block
    Node* next_element;    // pointer to the next element in the same partition block
};
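A minimal sketch of how the initial linked list might be built from such nodes, assuming the table is stored row by row in one contiguous int array and the nodes live in a preallocated vector (one way to read "static linked list"). The function `initList` and its parameters are hypothetical and do not appear in the original text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int* element;          // address of the record in memory
    Node* next_Category;   // pointer to the next partition block
    Node* next_element;    // pointer to the next element in the same block
};

// Hypothetical helper: links all records into a single initial partition
// block, in table order. `data` holds the table row by row with `attrs`
// values per record. Returns the head node, or nullptr for an empty table.
Node* initList(std::vector<Node>& nodes, std::vector<int>& data, std::size_t attrs) {
    assert(attrs > 0 && data.size() % attrs == 0);
    const std::size_t n = data.size() / attrs;
    nodes.assign(n, Node{nullptr, nullptr, nullptr});
    for (std::size_t r = 0; r < n; ++r) {
        nodes[r].element = &data[r * attrs];  // start address of record r
        nodes[r].next_Category = nullptr;     // only one block exists so far
        nodes[r].next_element = (r + 1 < n) ? &nodes[r + 1] : nullptr;
    }
    return n > 0 ? &nodes[0] : nullptr;
}
```

The node pointers stay valid only while `nodes` and `data` are not resized, which is why the sketch fills both once and then leaves their sizes unchanged.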

Description of drawings

Fig. 1 is a flowchart of the method for hashing the elements of a partition block into a hash-cube in an embodiment of the invention.

Fig. 2 is a schematic diagram of the hash-cube of a sequence to be sorted in an embodiment of the invention.

Fig. 3 is a flowchart of the two-dimensional table hash sorting method in an embodiment of the invention.

Fig. 4 is a schematic diagram of initializing the two-dimensional table as a linked list and establishing the initial partition block in an embodiment of the invention.

Fig. 5 to Fig. 7 are schematic diagrams of the sorting results of the two-dimensional table on each attribute in an embodiment of the invention.

Embodiment

To make the above objects, features and advantages of the present invention easier to understand, the invention is described in detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only serves to explain the invention and does not limit it.

Two-dimensional table sorting is an extension of one-dimensional sorting. The two-dimensional table hash sorting method calls the one-dimensional hash sorting method recursively, and the sequence to be sorted by the one-dimensional hash sort corresponds to the element sequence of a partition block in the two-dimensional method. The hashing method of this embodiment is described below.

Referring to Fig. 1, a flowchart of the method for hashing the elements of a partition block into a hash-cube in an embodiment of the invention is shown.

Step 101: find the minimum and maximum elements $a_{\min}$ and $a_{\max}$ of the partition block on the current attribute.

All elements of the partition block on the current attribute lie between the minimum and maximum elements (inclusive). If there is no repeated data, all elements can be mapped in a definite order into a square matrix, with the data stored in the grid cells formed by its rows and columns.

Step 102: calculate the dimension d of the plane square matrix of the hash-cube.

The square matrix must be able to hold every distinct value between the minimum and maximum elements (inclusive). Its dimension d is the number of rows and columns of the hash-cube:

$d = \lceil \sqrt{a_{\max} - a_{\min} + 1} \rceil$ (where $\lceil\ \rceil$ denotes rounding up)
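The dimension calculation of step 102 can be sketched as below, assuming d is the smallest integer for which a d × d square holds every value from the minimum to the maximum inclusive. The helper name `cubeDimension` is hypothetical; integer arithmetic is used so that no floating-point rounding of the square root is involved.

```cpp
#include <cassert>

// Hypothetical helper: smallest d with d * d >= (aMax - aMin + 1),
// that is, the ceiling of the square root, computed without floating point.
int cubeDimension(int aMin, int aMax) {
    assert(aMin <= aMax);
    long long span = static_cast<long long>(aMax) - aMin + 1;  // possible distinct values
    long long d = 1;
    while (d * d < span) {
        ++d;
    }
    return static_cast<int>(d);
}
```

For the range 3 to 5 this gives d = 2, since a 2 × 2 square is the smallest that holds three distinct values.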

Step 103: calculate the position in the hash-cube of each data element of the partition block, and build the hash-cube of the partition block.

In general, the elements of a partition block contain repeated values on the current attribute, so mapping the elements of the partition block into the plane square matrix is not a one-to-one mapping but a many-to-one mapping. In the hash-cube, repeated elements are stored in the plane square matrix of the next layer, so that every element of the partition block is hashed to a unique position. The coordinates (i, j, k) in the hash-cube of an element x to be hashed are $i = [(x - a_{\min})/d]$, $j = (x - a_{\min}) \bmod d$ and $k = k_{ij}$, where $k_{ij}$ has initial value 0 and is increased by 1 each time (i, j) has already occurred.

For example, for the sequence A = {5, 5, 3, 4, 5, 4}, the coordinates of each element in the hash-cube are given in Table 1, and the hash-cube formed from sequence A is shown in Fig. 2.

Element in sequence A	Coordinates (i, j, k) in the hash-cube
5	(1, 0, 0)
5	(1, 0, 1)
3	(0, 0, 0)
4	(0, 1, 0)
5	(1, 0, 2)
4	(0, 1, 1)

Table 1
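The coordinates in Table 1 can be reproduced with a short C++ sketch. The names `CubeCoord` and `cubeCoordinates` are hypothetical, and the sketch assumes integer division for the row coordinate, d chosen as the smallest integer whose square covers the value range, and a key range small enough for `int` arithmetic.

```cpp
#include <cassert>
#include <vector>

struct CubeCoord {
    int i;  // row
    int j;  // column
    int k;  // layer
};

// Hypothetical helper: computes the hash-cube coordinates of every element
// of `seq`, in input order.
std::vector<CubeCoord> cubeCoordinates(const std::vector<int>& seq) {
    assert(!seq.empty());
    int aMin = seq[0], aMax = seq[0];
    for (int x : seq) {
        if (x < aMin) aMin = x;
        if (x > aMax) aMax = x;
    }
    int d = 1;
    while (d * d < aMax - aMin + 1) ++d;  // assumed dimension of the plane matrix

    std::vector<int> count(d * d, 0);     // one counter k_ij per cell, initially 0
    std::vector<CubeCoord> out;
    for (int x : seq) {
        int i = (x - aMin) / d;           // integer division
        int j = (x - aMin) % d;
        int k = count[i * d + j]++;       // layer = number of earlier hits on (i, j)
        out.push_back({i, j, k});
    }
    return out;
}
```

Called with A = {5, 5, 3, 4, 5, 4}, the function yields (1, 0, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0), (1, 0, 2), (0, 1, 1), matching Table 1.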

Referring to Fig. 3, a flowchart of the two-dimensional table hash sorting method in an embodiment of the invention is shown.

The two-dimensional table hash sorting method is as follows:

1. Initialize the linked list L and the current attribute $r_i$: L is initialized as a linked list with a single partition block, and $r_i = r_0$.

2. Traverse the linked list L.

2.1 If the number of elements in the partition block is greater than 1, scan the elements of the partition block to obtain $r_{i\max}$ and $r_{i\min}$ and calculate the value of d; otherwise go to 2.4.

2.2 Perform the following steps to map the elements of the partition block into the hash-cube, which is represented by the array a[d][d][n].

2.2.1 Initialize the counting array count[d][d] = 0 (the count array holds the k coordinates of the elements).

2.2.2 For each element with value r on the current attribute, calculate the row i and column j of the mapping: i = (r − $r_{i\min}$)/d, j = (r − $r_{i\min}$) % d.

2.2.3 count[i][j]++.

2.2.4 Store the node of the current element in a[i][j][count[i][j]−1] (the linked-list nodes are stored so that the linked list can be reconstructed).

2.3 Reconstruct the linked list: traverse the array count[d][d] and, according to its counts, traverse the hash-cube.

2.3.1 When count[i][j] != 0, perform the following operation; otherwise skip it.

2.3.2 If count[i][j] = 1, insert a[i][j][0] into the linked list L as the next partition block. If count[i][j] > 1, insert a[i][j][0] into L as the next partition block, and insert a[i][j][1] to a[i][j][count[i][j]−1] as the following elements of the same partition block.

2.4 Repeat steps 2.1, 2.2 and 2.3 for the next partition block until all partition blocks have been processed.

3. Repeat step 2 for the next attribute $r_{i+1}$, until sorting on attribute $r_m$ has finished or every partition block contains only one element.
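The procedure above can be sketched end to end in C++. This is a simplified illustration, not the patented implementation: partition blocks are held as vectors of row indices instead of linked-list nodes, the three-dimensional array a[d][d][n] is replaced by one vector per grid cell, and the names `Row`, `Block`, `splitBlock` and `hashSortTable` are hypothetical. All rows are assumed to have the same number of attributes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::vector<int>;
using Block = std::vector<std::size_t>;  // row indices of one partition block

// Steps 2.1 to 2.3: hash one block into a hash-cube on attribute `attr`
// and read back the non-empty cells, in order, as new partition blocks.
std::vector<Block> splitBlock(const std::vector<Row>& table, const Block& block,
                              std::size_t attr) {
    assert(!block.empty());
    int rMin = table[block[0]][attr], rMax = rMin;
    for (std::size_t r : block) {
        int v = table[r][attr];
        if (v < rMin) rMin = v;
        if (v > rMax) rMax = v;
    }
    long long span = static_cast<long long>(rMax) - rMin + 1;
    long long d = 1;
    while (d * d < span) ++d;  // assumed dimension of the plane matrix

    std::vector<Block> cell(static_cast<std::size_t>(d * d));
    for (std::size_t r : block) {
        long long off = static_cast<long long>(table[r][attr]) - rMin;
        long long i = off / d, j = off % d;
        cell[static_cast<std::size_t>(i * d + j)].push_back(r);  // next layer of (i, j)
    }
    std::vector<Block> out;
    for (Block& c : cell) {
        if (!c.empty()) out.push_back(std::move(c));
    }
    return out;
}

// Steps 1 to 3: refine the blocks attribute by attribute; returns the row
// indices in sorted order.
std::vector<std::size_t> hashSortTable(const std::vector<Row>& table) {
    if (table.empty()) return {};
    Block all(table.size());
    for (std::size_t r = 0; r < table.size(); ++r) all[r] = r;
    std::vector<Block> blocks{all};  // one initial partition block

    for (std::size_t attr = 0; attr < table[0].size(); ++attr) {
        std::vector<Block> next;
        bool anySplit = false;
        for (const Block& b : blocks) {
            if (b.size() <= 1) { next.push_back(b); continue; }  // nothing to order
            anySplit = true;
            for (Block& piece : splitBlock(table, b, attr)) next.push_back(std::move(piece));
        }
        blocks.swap(next);
        if (!anySplit) break;  // every block holds a single record
    }
    std::vector<std::size_t> order;
    for (const Block& b : blocks) {
        for (std::size_t r : b) order.push_back(r);
    }
    return order;
}
```

Because rows are appended to a cell in the order in which they are visited, records that agree on every attribute keep their original relative order.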

Step 301: initialize the two-dimensional table as a linked list and establish the initial partition block.

In this embodiment, at initialization all records of the two-dimensional table to be sorted are placed in a single equivalence class, that is, no order is yet distinguished between the records, and they are placed in the linked list in the order in which they appear in the table. Each linked-list node comprises the start address of a record in the two-dimensional table, a pointer to the next partition block, and a pointer to the next element in the same partition block. The value field of a node holds the address of the record in memory, so that the record can be found quickly during sorting. At initialization, next_Category of every node is NULL, and the next_element pointers link the records of the table into a single list. To illustrate the method, consider a simple example: Table 2 is a two-dimensional table with 3 attributes and 8 records, in which all attributes are of type int. In Fig. 4, 401 to 408 are the storage locations of the 8 records in memory, and 411 is the linked list after initialization. Taking record $x_1$ as an example, the value of a record on a given attribute is located during sorting as follows. The variable r holds an offset whose initial value is 0 (410 in Fig. 4), and the value field of node 409 holds the address of record $x_1$ in memory, 4000. The first attribute of $x_1$ is therefore at address 4000 + 0 × 2. When sorting on the second attribute, the offset in r is increased by 1, and the address of the second attribute of $x_1$ is computed from the value field of node 409 and the offset as 4000 + 1 × 2.
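The address calculation can be sketched in C++ as follows. The helper `attributeAt` is hypothetical. In C++ the multiplication by the element size is implicit in pointer arithmetic, so the factor 2 in the example (apparently the size of one attribute value in that memory layout) becomes `sizeof(int)` on the target platform.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: value of the record starting at `record` on the
// attribute selected by offset `r`.
// Address read = base address + r * sizeof(int).
inline int attributeAt(const int* record, std::size_t r) {
    assert(record != nullptr);
    return *(record + r);
}
```

With record $x_1$ = (4, 2, 5) stored contiguously, offsets 0, 1 and 2 return 4, 2 and 5, so moving to the next attribute only requires incrementing r, not touching the nodes.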

Record	$r_0$	$r_1$	$r_2$
$x_1$	4	2	5
$x_2$	4	2	5
$x_3$	3	5	2
$x_4$	3	5	4
$x_5$	5	4	3
$x_6$	2	2	4
$x_7$	3	4	2
$x_8$	5	3	5

Table 2

Step 302: check whether the number of elements in the partition block is greater than 1.

If the number of elements in a partition block is 1, the block contains a single record, which need not be compared or sorted against other records. Only when a partition block contains more than one element must the records whose order cannot yet be distinguished be sorted on the next attribute. Fig. 5 shows the result of sorting on attribute $r_0$. Partition blocks 501, 502 and 503 each contain more than one element, that is, the records within each of these three blocks have equal values on $r_0$ and cannot be ordered, so they must be sorted on attribute $r_1$. Likewise, partition blocks 601 and 602 in Fig. 6 must be sorted again on attribute $r_2$.

Step 303: hash the elements of the partition block into a hash-cube on the current attribute.

This step uses the one-dimensional hashing method: the coordinates in the hash-cube are first calculated by formula from the current attribute value of each record to be sorted, and the record is then hashed into the hash-cube. The hash-cube has the same structure as the one shown in Fig. 2.

Step 304: traverse the hash-cube to reconstruct the partition blocks.

When the partition blocks are reconstructed, elements whose coordinates i and j are the same and that differ only in coordinate k belong to the same partition block; all other elements belong to different partition blocks. The order between partition blocks is determined by coordinates i and j.

Step 305: move to the next partition block and check whether the last partition block has been reconstructed.

If so, the work on all partition blocks on the current attribute is finished; otherwise steps 302, 303 and 304 are repeated for the next partition block.

Step 306: move to the next attribute and check whether partition block reconstruction has been completed on the last attribute.

If so, sorting is finished; otherwise steps 302, 303, 304 and 305 are repeated for the partition blocks on the next attribute. Figs. 5, 6 and 7 show the sorting results on attributes $r_0$, $r_1$ and $r_2$ respectively.

Claims

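1. A hash sorting method for a two-dimensional table, characterized in that it comprises the following steps:

(1) initializing the two-dimensional table as a linked list and establishing the initial partition block;

(2) hashing the elements of a partition block into a hash-cube on the current attribute;

(3) traversing the hash-cube to reconstruct the partition blocks;

(4) checking whether every partition block containing more than one element has been reconstructed, and whether this reconstruction has been completed on all attributes; if so, stopping; if not, performing steps (2), (3) and (4) on the next partition block or the next attribute.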

2. The method according to claim 1, characterized in that each linked-list node comprises:

the start address of a record in the two-dimensional table;

a pointer to the next partition block;

a pointer to the next element in the same partition block.

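3. The method according to claim 1, characterized in that, in step (2), the hashing method comprises:

finding the minimum and maximum elements $a_{\min}$ and $a_{\max}$ of the partition block on the current attribute;

calculating the dimension d of the plane matrix of the hash-cube, that is, the number of its rows and columns, $d = \lceil \sqrt{a_{\max} - a_{\min} + 1} \rceil$;

calculating the position in the hash-cube of each data element of the partition block, and building the hash-cube of the partition block.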

4. The method according to claim 3, characterized in that said "calculating the position in the hash-cube of each data element of the partition block" comprises the following steps:

calculating the row coordinate of element x in the hash-cube according to $i = [(x - a_{\min})/d]$;

calculating the column coordinate of element x in the hash-cube according to $j = (x - a_{\min}) \bmod d$;

keeping, for each pair (i, j), a counter $k_{ij}$ with initial value 0, where, if (i, j) has already occurred, $k_{ij} = k_{ij} + 1$, and $k_{ij}$ is the layer coordinate;

where i, j and k denote the row, column and layer coordinates of the hash-cube respectively, d is the dimension of the plane matrix of the hash-cube, x is a data element to be sorted in the partition block, $a_{\min}$ and $a_{\max}$ are the minimum and maximum elements of the partition block on the current attribute, and [ ] denotes taking the integer part.

5. The method according to claim 1, characterized in that reconstructing the partition blocks comprises:

traversing the hash-cube by layer, row and column;

elements whose coordinates i and j are the same and that differ only in coordinate k belong to the same partition block;

the order between partition blocks is determined by coordinates i and j;

where i, j and k denote the row, column and layer coordinates of the hash-cube respectively.