CN103345491A

CN103345491A - Method for quickly obtaining neighborhoods using hash bucketing

Info

Publication number: CN103345491A
Application number: CN2013102610816A
Authority: CN
Inventors: 蒋云良; 曾志勇; 刘勇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2013-06-26
Filing date: 2013-06-26
Publication date: 2013-10-09
Anticipated expiration: 2033-06-26
Also published as: CN103345491B

Abstract

The invention discloses a fast neighborhood computation method that uses hash bucketing to reduce the search space for neighborhood information granules. The method partitions the set of sample records into buckets by hashing on the distances between sample records, so that the search space for the neighborhood information granule of any sample record in the set is reduced to three adjacent buckets. Further observation shows that, when the neighborhood space of the sample records is searched iteratively, the symmetry of the neighborhood system allows the search space to be reduced further to two buckets. The method is flexible and simple to implement, greatly reduces the number of sample comparisons and the search space, and is advantageous for processing large data sets.

Description

A method for quickly obtaining neighborhoods using hash bucketing

Technical field

The invention belongs to the field of information processing, and in particular relates to a method for quickly obtaining neighborhoods that uses the 2-norm distance as the metric and hash bucketing to reduce the search space for neighborhood information granules.

Background technology

With the rapid development of information technology and the widespread use of database management systems (DBMS), people record more and more data. Much important information is hidden behind this rapidly growing data, and people want to analyze it at a higher level in order to make better use of it.

T.Y. Lin proposed the concept of the neighborhood model in 1988. He granulated the universe of discourse using spatial neighborhoods, interpreted spatial neighborhoods as basic information granules, and then used these basic granules to describe other concepts in the universe. Professor Yao Yiyu in 1998 and Professor Wu Weizhi in 2002 each studied in depth the basic properties of neighborhood operators and neighborhood systems. Yao discussed the relationship between granular computing and data mining tools such as rough sets and quotient spaces, and constructed a logical framework for the granular world by using a logical decision language to describe granularity. Skowron also described a granule language in the literature; he regarded the meaning set of a logical formula defined on an information table as an information granule, and discussed the syntax and semantics of such granules. Building on this work, Hu Qinghua introduced the neighborhood model into rough sets, gave a detailed definition of the neighborhood rough set model, and designed reduction algorithms that can reduce nominal, numerical, and mixed data at the same time.

With the explosive growth of data, time efficiency becomes the primary concern when the neighborhood model is used to process large data sets. How to reduce the time needed to search for and compute neighborhood information granules is an important problem.

A neighborhood can be defined in two ways: by the number of objects it contains, as in the classic k-nearest-neighbor method; or by the maximum distance, under some metric, from the center of the neighborhood to its boundary. The present invention uses the second definition.

Given a nonempty finite set $U = \{x_1, x_2, x_3, \ldots, x_n\}$ in real space, for any object $x_i$ in $U$ its θ-neighborhood is $\theta(x_i) = \{x \in U \mid \Delta(x, x_i) \le \theta\}$, where $\theta \ge 0$. $\theta(x_i)$ is called the θ-neighborhood information granule generated by $x_i$, or the neighborhood granule of $x_i$ for short. In two-dimensional real space, the neighborhoods based on the 1-norm, the 2-norm, and the infinity norm are a diamond, a circle, and a square, respectively, as shown in Fig. 3. The metric has the following properties: (1) $\theta(x_i) \ne \emptyset$, because $x_i \in \theta(x_i)$; (2) $x_j \in \theta(x_i) \Rightarrow x_i \in \theta(x_j)$; (3) $\bigcup_{i=1}^{n} \theta(x_i) = U$.

The family of neighborhood information granules $\{\theta(x_i) \mid i = 1, 2, \ldots, n\}$ constitutes a cover of $U$.

The family of neighborhood information granules induces a neighborhood relation $N$ on the universe $U$. This relation can be represented by a relation matrix $M(N) = (r_{ij})_{n \times n}$, where $r_{ij} = 1$ if $x_j \in \theta(x_i)$ and $r_{ij} = 0$ otherwise.
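The definitions above can be illustrated with a brute-force sketch in Python. This is only an illustration of the θ-neighborhood and the relation matrix, not the bucketing method of the invention; the function names `distance`, `theta_neighborhood`, and `relation_matrix` and the sample data are assumptions made for this example.

```python
import math

def distance(x, y, p=2):
    """Minkowski distance between two points; p may be 1, 2, or math.inf."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def theta_neighborhood(U, i, theta, p=2):
    """Indices j of all samples with distance(U[j], U[i]) <= theta."""
    return [j for j, x in enumerate(U) if distance(x, U[i], p) <= theta]

def relation_matrix(U, theta, p=2):
    """r[i][j] = 1 if U[j] lies in the theta-neighborhood of U[i], else 0."""
    n = len(U)
    return [[1 if distance(U[j], U[i], p) <= theta else 0 for j in range(n)]
            for i in range(n)]

# Hypothetical sample data: four points in the plane.
U = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
M = relation_matrix(U, theta=1.0)
```

With this data the matrix is reflexive and symmetric, matching properties (1) and (2), but the relation is not transitive: the first and second points are neighbors, the second and third are neighbors, yet the first and third are not under the 2-norm.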

Summary of the invention

The object of the invention is to provide a method for quickly obtaining neighborhoods using hash bucketing, so as to reduce the time needed to search for and compute neighborhood information granules and to make the neighborhood model fast enough for processing large data sets. To this end, the invention adopts the following technical solution.

The specific steps of the method are as follows:

A method for quickly obtaining neighborhoods using hash bucketing, characterized in that it comprises the following steps:

Step 1, determine the origin $x_0$ of the bucketing coordinate system:

The neighborhood system $NRS = \langle U, N, \theta \rangle$ is given, where $U$ is the set of all sample records, $N$ denotes the neighborhood relation, and $\theta$ is the neighborhood radius.

Step 2, compute the sample distances:

For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$.

Step 3, according to the sample distances from Step 2, build the search buckets with the hash method:

For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute

$$k = \left\lceil \frac{\|x_i - x_0\|}{\theta} \right\rceil$$

where $k$ is a nonnegative integer. Build a hash table with $k$ as the hash key. The hash table corresponds to a sphere model in space: taking each hash key, in units of $\theta$, as the radius of a sphere, the hash table is a series of nested spheres. The samples under a given hash key lie in the space between the sphere whose radius is that hash key and the sphere whose radius is the hash key minus 1; this space between adjacent spheres is called a bucket. $B_1, B_2, \ldots, B_b$ are the $b$ buckets obtained by taking the $b$ hash key values as radii.

Step 4, obtain the neighborhood:

For a sample $x$ in bucket $B_k$, search the records in buckets $B_{k-1}$, $B_k$, $B_{k+1}$ to obtain the neighborhood of $x$.

On the basis of the above technical solution, the invention may also adopt the following further technical solutions:

$x_0$ is taken as the origin, or as the feature vector formed by the minimum attribute values in $N$.

When neighborhoods are obtained for the sample records in the search buckets, an iterative method is used, and only the records in buckets $B_k$, $B_{k+1}$ need to be searched to obtain the neighborhood of sample $x$.

Based on the distances between sample records, the invention uses hashing to partition the set of sample records into buckets, so that the search space for the neighborhood information granule of any sample record $x_i$ in the set is reduced to three adjacent buckets $B_{k-1}$, $B_k$, $B_{k+1}$. Closer observation further shows that, when the neighborhood space of the sample records is searched iteratively, the symmetry of the neighborhood system allows the search space to be narrowed to the two buckets $B_k$, $B_{k+1}$.
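The three-bucket bound can be justified with the triangle inequality. The derivation below is a sketch that assumes $\theta > 0$ and that the bucket index of a sample $x$ is $k(x) = \lceil \|x - x_0\| / \theta \rceil$, the reading consistent with the description of buckets as shells between adjacent spheres; the notation $k(\cdot)$ is introduced here for illustration.

```latex
\begin{aligned}
\|x - y\| \le \theta
&\;\Rightarrow\; \bigl|\,\|x - x_0\| - \|y - x_0\|\,\bigr| \le \|x - y\| \le \theta \\
&\;\Rightarrow\; \frac{\|y - x_0\|}{\theta} \le \frac{\|x - x_0\|}{\theta} + 1 \\
&\;\Rightarrow\; k(y) = \left\lceil \frac{\|y - x_0\|}{\theta} \right\rceil
   \le \left\lceil \frac{\|x - x_0\|}{\theta} \right\rceil + 1 = k(x) + 1 .
\end{aligned}
```

Exchanging the roles of $x$ and $y$ gives $k(x) \le k(y) + 1$, so $|k(x) - k(y)| \le 1$ and every neighbor of a sample in $B_k$ lies in $B_{k-1}$, $B_k$, or $B_{k+1}$. The two-bucket reduction then follows from symmetry: a neighbor pair with $k(y) = k(x) - 1$ is already found when $y$ is processed and $B_{k(y)+1}$ is searched, so it does not need to be searched for again from $x$.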

The method yields different bucketing results for different neighborhood sizes θ. As θ increases, the number of buckets decreases while the number of sample records in each bucket increases. When the buckets are continuous, the reduction in search space achieved when computing neighborhood information granules becomes smaller, and the benefit of bucketing weakens. When the buckets are discrete, adjacent buckets may not exist, so the search space shrinks substantially and the benefit of bucketing in reducing the search space is stronger. As θ decreases, the number of buckets gradually increases and the number of samples in each bucket decreases, which may reduce the amount of information contained in each neighborhood information granule.
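The trade-off described above can be observed on a small synthetic example. The sketch below assumes the bucket index is k = ceil(d / θ) for a sample at distance d from the origin of the bucketing coordinate system; the helper `bucket_sizes` and the evenly spaced distances are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def bucket_sizes(distances, theta):
    """Map each bucket index k = ceil(d / theta) to its number of samples."""
    return Counter(math.ceil(d / theta) for d in distances)

# Hypothetical data: 100 samples at distances 1, 2, ..., 100 from the origin.
distances = [float(i) for i in range(1, 101)]

# For each theta, record (number of buckets, size of the largest bucket).
summary = {}
for theta in (1.0, 10.0, 50.0):
    sizes = bucket_sizes(distances, theta)
    summary[theta] = (len(sizes), max(sizes.values()))
```

For this data, a larger θ gives fewer but fuller buckets, so each three-bucket search covers a larger share of the samples.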

In summary, for a given neighborhood system $NRS = \langle U, N, \theta \rangle$, choosing a suitable value of θ makes full use of the search-space reduction that bucketing provides and reduces the time needed to search for and compute neighborhood information granules.

Description of drawings

Fig. 1 is a schematic diagram of the method of the invention when the buckets are continuous.

Fig. 2 is a schematic diagram of the method of the invention when the buckets are discrete.

Fig. 3 shows the neighborhoods based on the 1-norm, the 2-norm, and the infinity norm in two-dimensional real space.

Embodiment

For a better understanding of the technical solution of the invention, it is further described below with reference to the drawings and embodiments.

Step 1, determine the origin $x_0$ of the bucketing coordinate system:

According to the given neighborhood system $NRS = \langle U, N, \theta \rangle$, determine the origin $x_0$ of the bucketing coordinate system. Taking $N = C \cup D$, the neighborhood system $NRS = \langle U, N, \theta \rangle$ becomes the neighborhood decision system $NRS = \langle U, C \cup D, \theta \rangle$, and $x_0$ is the feature vector formed by the minimum value of each attribute. Here $U = \{x_1, x_2, x_3, \ldots, x_n\}$ is the sample set, $C = \{a_1, a_2, \ldots, a_m\}$ is the set of sample attributes, $a_i$ is the $i$-th attribute of the samples, and $D$ is the set of decision attributes of the samples.

Then

$$x_0 = \bigl(\min\{x_1(a_1), x_2(a_1), \ldots, x_n(a_1)\},\ \min\{x_1(a_2), x_2(a_2), \ldots, x_n(a_2)\},\ \ldots,\ \min\{x_1(a_m), x_2(a_m), \ldots, x_n(a_m)\}\bigr)$$
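A minimal sketch of this step in Python, assuming each sample is stored as a tuple of its m numeric attribute values; the function name `bucket_origin` and the sample values are hypothetical.

```python
def bucket_origin(U):
    """Return x0: the attribute-wise minimum over all samples in U."""
    # zip(*U) regroups the samples by attribute (one column per attribute).
    return tuple(min(column) for column in zip(*U))

# Hypothetical data: three samples with three attributes each.
U = [(2.0, 5.0, 1.0), (4.0, 3.0, 7.0), (3.0, 9.0, 0.5)]
x0 = bucket_origin(U)
```

Choosing the attribute-wise minimum places every sample in the nonnegative orthant relative to x0, so all distances are measured from one corner of the data's bounding box.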

Step 2, compute the sample distances:

For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$:

$$\|x_i - x_0\| = \sqrt{[x_i(a_1) - x_0(a_1)]^2 + [x_i(a_2) - x_0(a_2)]^2 + \cdots + [x_i(a_m) - x_0(a_m)]^2}$$
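The 2-norm distance above can be written directly in Python. This is a small sketch; `distance_to_origin` is a hypothetical name, and the standard-library function `math.dist` (Python 3.8+) computes the same Euclidean distance.

```python
import math

def distance_to_origin(x, x0):
    """Euclidean (2-norm) distance between sample x and the origin x0."""
    return math.sqrt(sum((xi - oi) ** 2 for xi, oi in zip(x, x0)))

# Hypothetical data: x0 is the attribute-wise minimum of the three samples.
x0 = (2.0, 3.0)
U = [(2.0, 3.0), (5.0, 7.0), (2.0, 4.0)]
d = [distance_to_origin(x, x0) for x in U]
```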

Step 3, according to the distances from Step 2, build the buckets with the hash method:

For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute

$$k = \left\lceil \frac{\|x_i - x_0\|}{\theta} \right\rceil$$

where $k$ is a nonnegative integer, and build a hash table with $k$ as the hash key. Taking each hash key, in units of $\theta$, as the radius of a sphere, the hash table is a series of nested spheres in space. The samples under a given hash key lie in the space between the sphere whose radius is that hash key and the sphere whose radius is the hash key minus 1; this space between adjacent spheres is called a bucket. $B_1, B_2, \ldots, B_b$ are the $b$ buckets obtained by taking the $b$ hash key values as radii. The spatial distribution of some of the records in $U$ is then as shown in Fig. 1 and Fig. 2. Fig. 1 shows a rather special case in which the values of $k$ are continuous; in practice, for specific data, $k$ may be discontinuous, as shown in Fig. 2.
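A sketch of the bucket construction, assuming the bucket index is k = ceil(‖x − x0‖ / θ) with θ > 0 and using a Python dictionary as the hash table. The function name `build_buckets` and the sample data are hypothetical.

```python
import math
from collections import defaultdict

def build_buckets(U, x0, theta):
    """Hash each sample index into the bucket keyed by k = ceil(d / theta)."""
    buckets = defaultdict(list)
    for i, x in enumerate(U):
        k = math.ceil(math.dist(x, x0) / theta)
        buckets[k].append(i)
    return buckets

# Hypothetical data: the keys come out discontinuous (no bucket 3 or 4),
# which corresponds to the situation described for Fig. 2.
U = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.5), (0.0, 2.0), (4.5, 0.0)]
buckets = build_buckets(U, x0=(0.0, 0.0), theta=1.0)
```

Only a sample that coincides with x0 receives key 0. Because the table stores only keys that actually occur, empty shells cost nothing, which is why a hash table suits the discontinuous case. Floating-point rounding can shift a sample that lies exactly on a shell boundary into the neighboring bucket, so an implementation may want to treat boundary cases with care.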

Step 4, obtain the neighborhood:

For a sample in bucket $B_k$, search the records in buckets $B_{k-1}$, $B_k$, $B_{k+1}$ to find its neighborhood. When neighborhoods are computed for the sample records in the search buckets iteratively, the symmetry of the neighborhood system means that only the records in buckets $B_k$, $B_{k+1}$ need to be searched. As shown in Fig. 1, for the samples in the bucket with k=2, only the samples in the buckets with k=2 and k=3 need to be searched. As shown in Fig. 2, when the buckets are discontinuous, the adjacent bucket with k=3 is empty, so the search space for the neighborhoods of the samples in the bucket with k=2 is reduced further.
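The two search strategies can be sketched as follows and checked against a brute-force computation. This is an illustrative sketch under the assumption that the bucket index is k = ceil(‖x − x0‖ / θ) with θ > 0; all function names and the grid data are hypothetical, and the two-bucket variant additionally skips already-compared pairs inside the same bucket, which is a further use of the same symmetry.

```python
import math
from collections import defaultdict

def build_buckets(U, x0, theta):
    """Group sample indices by bucket index k = ceil(distance to x0 / theta)."""
    buckets = defaultdict(list)
    for i, x in enumerate(U):
        buckets[math.ceil(math.dist(x, x0) / theta)].append(i)
    return buckets

def neighborhoods_three_buckets(U, x0, theta):
    """For each sample in B_k, scan only B_{k-1}, B_k, and B_{k+1}."""
    buckets = build_buckets(U, x0, theta)
    result = {i: set() for i in range(len(U))}
    for k, members in buckets.items():
        # .get() does not create missing keys, so empty shells are skipped.
        candidates = [j for kk in (k - 1, k, k + 1) for j in buckets.get(kk, [])]
        for i in members:
            for j in candidates:
                if math.dist(U[i], U[j]) <= theta:
                    result[i].add(j)
    return result

def neighborhoods_two_buckets(U, x0, theta):
    """Symmetric variant: scan only B_k and B_{k+1}, recording both directions."""
    buckets = build_buckets(U, x0, theta)
    result = {i: {i} for i in range(len(U))}  # every sample is its own neighbor
    for k in sorted(buckets):
        same = buckets[k]
        nxt = buckets.get(k + 1, [])
        for pos, i in enumerate(same):
            # Later members of B_k plus all of B_{k+1}; earlier pairs are done.
            for j in same[pos + 1:] + nxt:
                if math.dist(U[i], U[j]) <= theta:
                    result[i].add(j)
                    result[j].add(i)
    return result

def neighborhoods_brute_force(U, theta):
    """Reference result: compare every pair of samples."""
    n = len(U)
    return {i: {j for j in range(n) if math.dist(U[i], U[j]) <= theta}
            for i in range(n)}

# Hypothetical data: a 6 x 6 integer grid with the origin at its minimum corner.
U = [(float(a), float(b)) for a in range(6) for b in range(6)]
x0, theta = (0.0, 0.0), 1.5
three = neighborhoods_three_buckets(U, x0, theta)
two = neighborhoods_two_buckets(U, x0, theta)
brute = neighborhoods_brute_force(U, theta)
```

On this grid both bucketed variants return the same neighborhoods as the brute-force comparison while examining fewer candidate pairs.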

Building the buckets by hashing has time complexity $O(n)$, where $n = |U|$. Let $m$ be the number of attributes in the sample attribute set $C = \{a_1, a_2, \ldots, a_m\}$ and $b$ the number of buckets built. When the samples are evenly distributed among the buckets, the complexity of the neighborhood computation is $O(m|U|^2 / b)$; as $b$ approaches $|U|$, the complexity of the neighborhood computation approaches $O(m|U|)$.
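One way to arrive at a bound consistent with the stated limit is the following counting argument. It is a sketch under the uniform-distribution assumption above, with each bucket holding about $|U|/b$ samples, each sample compared against at most three buckets, and each distance evaluation costing $O(m)$; the symbol $T_{\text{neigh}}$ is introduced here for illustration.

```latex
T_{\text{neigh}}
  = \underbrace{|U|}_{\text{samples}}
    \cdot \underbrace{3\,\frac{|U|}{b}}_{\text{candidates per sample}}
    \cdot \underbrace{O(m)}_{\text{one distance}}
  = O\!\left(\frac{m\,|U|^{2}}{b}\right),
\qquad
b \to |U| \;\Rightarrow\; T_{\text{neigh}} \to O(m\,|U|).
```

With the symmetric two-bucket search the constant factor drops from 3 to 2, which does not change the asymptotic bound.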

Claims

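1. A method for quickly obtaining neighborhoods using hash bucketing, characterized in that it comprises the following steps:

Step 1, determine the origin $x_0$ of the bucketing coordinate system;

Step 2, compute the sample distances:

for every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$;

Step 3, according to the sample distances from Step 2, build the search buckets with the hash method:

for every $x_i \in U$, $i = 1, 2, \ldots, n$, compute

$$k = \left\lceil \frac{\|x_i - x_0\|}{\theta} \right\rceil$$

and build a hash table with $k$ as the hash key;

Step 4, obtain the neighborhood: search the records in buckets $B_{k-1}$, $B_k$, $B_{k+1}$ to obtain the neighborhood of sample $x$.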

2. The method for quickly obtaining neighborhoods using hash bucketing as claimed in claim 1, characterized in that $x_0$ is taken as the origin, or as the feature vector formed by the minimum attribute values in $N$.

3. The method for quickly obtaining neighborhoods using hash bucketing as claimed in claim 1, characterized in that, when neighborhoods are obtained for the sample records in the search buckets, an iterative method is used, and only the records in buckets $B_k$, $B_{k+1}$ need to be searched to obtain the neighborhood of sample $x$.