CN103345491A - Method for quickly obtaining neighborhoods using hash bucket partitioning


Publication number
CN103345491A
CN103345491A CN2013102610816A CN201310261081A CN103345491A CN 103345491 A CN103345491 A CN 103345491A CN 2013102610816 A CN2013102610816 A CN 2013102610816A CN 201310261081 A CN201310261081 A CN 201310261081A CN 103345491 A CN103345491 A CN 103345491A
Authority
CN
China
Prior art keywords
neighborhood
hash
bucket
sample
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102610816A
Other languages
Chinese (zh)
Other versions
CN103345491B (en)
Inventor
蒋云良
曾志勇
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201310261081.6A
Publication of CN103345491A
Application granted
Publication of CN103345491B
Status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fast neighborhood computation method that uses hash bucketing to reduce the search space for neighborhood information granules. The method partitions the set of sample records into buckets by hashing on the distances computed for the sample records, so that the search space for the neighborhood information granule of any sample record in the set is reduced to three adjacent buckets. On this basis, further observation shows that when the neighborhoods of the sample records are searched iteratively, the symmetry of the neighborhood system allows the search space to be reduced further to two buckets. The method is flexible and simple to implement, greatly reduces the number of sample comparisons and the size of the search space, and is advantageous for processing large data sets.

Description

A method for quickly obtaining neighborhoods using hash bucketing
Technical field
The invention belongs to the field of information processing, and in particular relates to a method for quickly obtaining neighborhoods that uses the 2-norm distance as the metric and uses hash bucketing to reduce the search space for neighborhood information granules.
Background technology
With the rapid development of information technology and the widespread use of database management systems (DBMS), people record more and more data. Much important information is hidden behind this rapidly growing data, and people want to analyze it at a higher level in order to make better use of it.
T. Y. Lin proposed the concept of the neighborhood model in 1988. He granulated the universe of discourse using spatial neighborhoods, interpreted the spatial neighborhoods as basic information granules, and then used these basic granules to describe other concepts in the universe. Professor Yao Yiyu in 1998 and Professor Wu Weizhi in 2002 each studied the basic properties of neighborhood operators and neighborhood systems in depth. Yao discussed the relationship between granular computing and data mining tools such as rough sets and quotient spaces, and constructed a logical framework for the granular world by using a logical decision language to describe granularity. Skowron also described a granular language in the literature; he regarded the meaning sets of the logical formulas defined on an information table as information granules and discussed the syntax and semantics of such granules. Building on this research, Hu Qinghua introduced the neighborhood model into rough sets, defined the neighborhood rough set model in detail, and designed reduction algorithms that can reduce nominal, numerical, and mixed data together.
With the explosive growth of data, time efficiency becomes the primary consideration when neighborhood models are used to process big data. How to reduce the time needed to search for and compute neighborhood information granules is an important problem.
A neighborhood can be defined in two ways. One is determined by the number of objects contained in the neighborhood, as in the classical k-nearest-neighbor method; the other is defined by the maximum distance, under some metric, from the center of the neighborhood to its boundary. The present invention uses the second definition.
Given a nonempty finite set $U = \{x_1, x_2, x_3, \ldots, x_n\}$ in real space, for any object $x_i$ in $U$ its θ-neighborhood is $\theta(x_i) = \{x \in U \mid \Delta(x, x_i) \le \theta\}$, where $\theta \ge 0$. $\theta(x_i)$ is called the θ-neighborhood information granule generated by $x_i$, or the neighborhood granule of $x_i$ for short. In two-dimensional real space, the neighborhoods based on the 1-norm, the 2-norm, and the infinity norm are a diamond, a circle, and a square, respectively, as shown in Fig. 3. The following properties hold: (1) $\theta(x_i) \neq \emptyset$, because $x_i \in \theta(x_i)$; (2) $x_j \in \theta(x_i) \Rightarrow x_i \in \theta(x_j)$; (3) $\bigcup_{i=1}^{n} \theta(x_i) = U$, that is, the family of neighborhood information granules $\{\theta(x_i) \mid i = 1, 2, \ldots, n\}$ forms a covering of $U$.
The family of neighborhood information granules induces a neighborhood relation $N$ on the universe $U$. This relation can be represented by a relation matrix $M(N) = (r_{ij})_{n \times n}$, where $r_{ij} = 1$ if $x_j \in \theta(x_i)$ and $r_{ij} = 0$ otherwise.
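As a point of reference, the definition above can be evaluated directly by comparing every pair of samples. The following is a minimal sketch, assuming the 2-norm as the metric Δ; the function names `theta_neighborhood` and `relation_matrix` and the sample data are illustrative and do not come from the patent.

```python
import math

def theta_neighborhood(samples, i, theta):
    """Indices j with ||x_j - x_i||_2 <= theta, found by scanning all of U."""
    return [j for j, x in enumerate(samples)
            if math.dist(x, samples[i]) <= theta]

def relation_matrix(samples, theta):
    """r[i][j] = 1 if x_j lies in the theta-neighborhood of x_i, else 0."""
    n = len(samples)
    return [[1 if math.dist(samples[i], samples[j]) <= theta else 0
             for j in range(n)]
            for i in range(n)]

# Illustrative data: two nearby points and one far away.
U = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
M = relation_matrix(U, theta=1.5)
print(M)
```

This brute-force form compares every sample with every other sample, which is the cost the bucketing method aims to reduce.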
Summary of the invention
The object of the invention is to provide a method for quickly obtaining neighborhoods using hash bucketing, so as to reduce the time needed to search for and compute neighborhood information granules and to make the processing of big data with neighborhood models fast. To this end, the invention adopts the following technical solution.
The concrete steps of the method are as follows:
A method for quickly obtaining neighborhoods using hash bucketing, characterized in that it comprises the following steps:
Step 1: determine the coordinate origin $x_0$ of the bucketing coordinate system.
A neighborhood system $NRS = \langle U, N, \theta \rangle$ is given, where $U$ is the set of all sample records, $N$ denotes the neighborhood relation, and $\theta$ is the neighborhood radius.
Step 2: compute the sample distances.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$ between the sample and the origin.
Step 3: using the sample distances from Step 2, build the search buckets by hashing.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the value $k$:
[Formula image: definition of $k$; not reproduced in this text.]
$k$ is a nonnegative integer. A hash table is built with $k$ as the hash key. The hash table has a spherical model in space: taking the hash key as the radius of a sphere, the hash table is a series of nested spheres in space. The samples stored under a given hash key lie in the space between the sphere whose radius is the hash key and the sphere whose radius is the hash key minus 1; this space between adjacent spheres is called a bucket. $B_1, B_2, \ldots, B_b$ are the $b$ buckets obtained by using the $b$ hash key values as radii.
Step 4: obtain the neighborhood.
For a sample $x$ whose hash key is $k$, search the records in buckets $B_{k-1}$, $B_k$, and $B_{k+1}$ to obtain the neighborhood of $x$.
On the basis of the above technical solution, the invention may also adopt the following further technical features:
$x_0$ is taken to be the origin, or the feature vector formed by the minimum value of each attribute in $N$.
When the neighborhoods of the sample records are obtained from the search buckets, an iterative method is used, and only the records in buckets $B_k$ and $B_{k+1}$ need to be searched to obtain the neighborhood of a sample $x$.
According to the distances computed for the sample records, the invention partitions the set of sample records into buckets by hashing, so that the search space for the neighborhood information granule of any sample record $x_i$ in the set is reduced to three adjacent buckets $B_{k-1}$, $B_k$, $B_{k+1}$. On this basis, closer observation shows that when the neighborhoods of the sample records are searched iteratively, the symmetry of the neighborhood system allows the search space for neighborhood information granules to be reduced further to the two buckets $B_k$, $B_{k+1}$.
The method achieves different bucketing effects for different neighborhood sizes $\theta$. As $\theta$ increases, the number of buckets decreases but the number of sample records in each bucket increases. When the buckets are continuous, the reduction in the search space for computing neighborhood information granules becomes smaller and the benefit of bucketing weakens; when the buckets are discrete, an adjacent bucket may not exist, so the search space shrinks substantially and the benefit of bucketing is enhanced. As $\theta$ decreases, the number of buckets gradually increases and the number of samples in each bucket decreases, which may reduce the amount of information contained in each neighborhood information granule.
Accordingly, for different neighborhood systems $NRS = \langle U, N, \theta \rangle$, choosing a suitable value of $\theta$ makes full use of the search-space reduction that bucketing provides and reduces the time needed to search for and compute neighborhood information granules.
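The trade-off between θ and the number of buckets can be illustrated with a small sketch. The formula for the hash key appears only as an image in the source, so the key used here, k = ⌈‖x − x_0‖ / θ⌉, is an assumption chosen to match the description of a bucket as the space between the spheres of radius k − 1 and k (measured in units of θ). The name `bucket_sizes` and the data are hypothetical.

```python
import math

def bucket_sizes(samples, x0, theta):
    """Number of samples per hash key, with the ASSUMED key ceil(d / theta)."""
    sizes = {}
    for x in samples:
        k = math.ceil(math.dist(x, x0) / theta)
        sizes[k] = sizes.get(k, 0) + 1
    return sizes

# Eight samples at distances 1, 2, ..., 8 from the origin.
U = [(float(i), 0.0) for i in range(1, 9)]
small = bucket_sizes(U, (0.0, 0.0), theta=1.0)   # many small buckets
large = bucket_sizes(U, (0.0, 0.0), theta=4.0)   # few large buckets
print(len(small), len(large))
```

With the smaller radius each sample lands in its own bucket, while the larger radius groups the same samples into two buckets of four, so each search has to scan more records.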
Description of drawings
Fig. 1 is a schematic diagram of the bucketing in the method when the buckets are continuous.
Fig. 2 is a schematic diagram of the bucketing in the method when the buckets are discrete.
Fig. 3 shows the neighborhoods based on the 1-norm, the 2-norm, and the infinity norm in two-dimensional real space.
Embodiment
For a better understanding of the technical solution of the invention, it is further described below with reference to the drawings and an embodiment.
Step 1: determine the coordinate origin $x_0$ of the bucketing coordinate system.
Given the neighborhood system $NRS = \langle U, N, \theta \rangle$, determine the coordinate origin $x_0$ of the bucketing coordinate system. Taking $N = C \cup D$, the neighborhood system $NRS = \langle U, N, \theta \rangle$ becomes the neighborhood decision system $NRS = \langle U, C \cup D, \theta \rangle$, and $x_0$ is the feature vector formed by the minimum value of each attribute. Here $U = \{x_1, x_2, x_3, \ldots, x_n\}$ is the sample set, $C = \{a_1, a_2, \ldots, a_m\}$ is the set of sample attributes, $a_i$ is the $i$-th attribute of a sample, and $D$ is the set of decision attributes.
The origin is then
$x_0 = (\min\{x_1(a_1), x_2(a_1), \ldots, x_n(a_1)\},$
$\min\{x_1(a_2), x_2(a_2), \ldots, x_n(a_2)\}, \ldots, \min\{x_1(a_m), x_2(a_m), \ldots, x_n(a_m)\})$
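The origin defined above is the attribute-wise minimum over all samples. A minimal sketch, assuming each sample is a tuple of numeric attribute values; the name `bucket_origin` is illustrative:

```python
def bucket_origin(samples):
    """Feature vector made of the minimum value of each attribute."""
    # zip(*samples) regroups the data by attribute (column) instead of by sample.
    return tuple(min(column) for column in zip(*samples))

U = [(2.0, 5.0), (4.0, 1.0), (3.0, 3.0)]
x0 = bucket_origin(U)
print(x0)
```

Choosing the attribute-wise minimum places every sample in the nonnegative orthant relative to $x_0$, so all distances are measured outward from one corner of the data.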
Step 2: compute the sample distances.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$ between the sample and the origin:
$\|x_i - x_0\| = \sqrt{[x_i(a_1) - x_0(a_1)]^2 + [x_i(a_2) - x_0(a_2)]^2 + \cdots + [x_i(a_m) - x_0(a_m)]^2}$
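The 2-norm distance above translates directly into code. A minimal sketch; the name `distance_to_origin` is illustrative, and the standard-library call `math.dist` computes the same quantity:

```python
import math

def distance_to_origin(x, x0):
    """2-norm distance between sample x and the bucketing origin x0."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x0)))

d = distance_to_origin((4.0, 5.0), (1.0, 1.0))   # sqrt(3^2 + 4^2)
print(d)
```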
Step 3: using the distances from Step 2, build the buckets by hashing.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the value $k$:
[Formula image: definition of $k$; not reproduced in this text.]
$k$ is a nonnegative integer, and a hash table is built with $k$ as the hash key. Taking the hash key as the radius of a sphere, the hash table is a series of nested spheres in space. The samples stored under a given hash key lie in the space between the sphere whose radius is the hash key and the sphere whose radius is the hash key minus 1; this space between adjacent spheres is called a bucket. $B_1, B_2, \ldots, B_b$ are the $b$ buckets obtained by using the $b$ hash key values as radii. The spatial distribution of some of the records in $U$ is then as shown in Fig. 1 and Fig. 2. Fig. 1 shows a rather special case in which $k$ is continuous; in practice, for specific data, $k$ may be discontinuous, as shown in Fig. 2.
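Building the hash table can be sketched as follows. Because the formula for $k$ is only an image in the source, the key k = ⌈‖x_i − x_0‖ / θ⌉ used here is an assumption; it is consistent with the text in that $k$ is a nonnegative integer and bucket $k$ is the space between the spheres of radius $(k-1)\theta$ and $k\theta$. The name `build_buckets` and the data are hypothetical, and θ is assumed to be strictly positive.

```python
import math
from collections import defaultdict

def build_buckets(samples, x0, theta):
    """Map each ASSUMED hash key k = ceil(d / theta) to the indices of its samples."""
    buckets = defaultdict(list)
    for i, x in enumerate(samples):
        k = math.ceil(math.dist(x, x0) / theta)
        buckets[k].append(i)
    return dict(buckets)

U = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (4.2, 0.0)]
buckets = build_buckets(U, x0=(0.0, 0.0), theta=1.0)
print(buckets)
```

In this example the keys 3 and 4 never occur, which corresponds to the discontinuous case of Fig. 2: a dictionary only stores the buckets that actually contain samples.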
Step 4: obtain the neighborhood.
For a sample in bucket $B_k$, search the records in buckets $B_{k-1}$, $B_k$, and $B_{k+1}$ to find its neighborhood. When the neighborhoods of the sample records are computed from the search buckets iteratively, the symmetry of the neighborhood system means that only the records in $B_k$ and $B_{k+1}$ need to be searched, since any neighbor in $B_{k-1}$ is found, by symmetry, when the samples of $B_{k-1}$ search $B_k$. As shown in Fig. 1, for a sample in the bucket with $k = 2$, only the samples in the $k = 2$ and $k = 3$ buckets need to be searched when computing its neighborhood. As shown in Fig. 2, when the buckets are discontinuous, the adjacent $k = 3$ bucket is empty for a sample in the $k = 2$ bucket, so the search space for the neighborhoods of the samples in that bucket shrinks further.
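The four steps can be combined into one sketch of the two-bucket search. This is not the patent's reference implementation: the hash key ⌈d / θ⌉ is an assumption (the source gives the formula only as an image), θ is assumed positive, and `bucket_neighborhoods` is a hypothetical name. Under that key, two samples within distance θ of each other differ by at most θ in their distance to $x_0$ (reverse triangle inequality), so their keys differ by at most 1.

```python
import math
import random
from collections import defaultdict

def bucket_neighborhoods(samples, theta):
    """Theta-neighborhoods of all samples using the two-bucket symmetric scan."""
    x0 = tuple(min(col) for col in zip(*samples))            # Step 1
    buckets = defaultdict(list)
    for i, x in enumerate(samples):                          # Steps 2 and 3
        buckets[math.ceil(math.dist(x, x0) / theta)].append(i)  # assumed key

    neigh = {i: {i} for i in range(len(samples))}            # x_i is in its own neighborhood
    for k, members in buckets.items():                       # Step 4
        nxt = buckets.get(k + 1, [])                         # empty if B_{k+1} does not exist
        for pos, i in enumerate(members):
            # Compare with later members of B_k and all of B_{k+1} only.
            # Pairs with B_{k-1} are recorded when B_{k-1} scans forward into B_k.
            for j in members[pos + 1:] + nxt:
                if math.dist(samples[i], samples[j]) <= theta:
                    neigh[i].add(j)                          # symmetry: record both directions
                    neigh[j].add(i)
    return neigh

rng = random.Random(0)                                       # fixed seed for a repeatable run
U = [(rng.random(), rng.random()) for _ in range(200)]
neigh = bucket_neighborhoods(U, theta=0.1)
```

Each unordered pair of samples in the same or adjacent buckets is compared exactly once, and the result can be checked against a brute-force scan over all pairs.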
Building the buckets by hashing has time complexity $O(n)$, where $n = |U|$. Let $m$ be the number of attributes in the sample attribute set $C = \{a_1, a_2, \ldots, a_m\}$ and $b$ the number of buckets built. If the samples are distributed evenly among the buckets, the complexity of the neighborhood computation is
[Formula image: complexity expression; not reproduced in this text.]
As $b$ approaches $|U|$, the complexity of the neighborhood computation approaches $O(m|U|)$.
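The limiting case in which the number of buckets approaches $|U|$ can be checked by counting distance comparisons. This sketch does not reproduce the complexity expression from the source, which is available only as an image; it simply counts the pairs visited by a two-bucket scan, using the assumed key ⌈d / θ⌉ and the hypothetical name `count_comparisons`.

```python
import math
from collections import defaultdict

def count_comparisons(samples, x0, theta):
    """Distance comparisons made by a two-bucket scan (assumed key ceil(d / theta))."""
    buckets = defaultdict(list)
    for i, x in enumerate(samples):
        buckets[math.ceil(math.dist(x, x0) / theta)].append(i)
    total = 0
    for k, members in buckets.items():
        size = len(members)
        total += size * (size - 1) // 2               # pairs inside B_k
        total += size * len(buckets.get(k + 1, []))   # pairs between B_k and B_{k+1}
    return total

n = 100
U = [(i + 0.5, 0.0) for i in range(n)]   # one sample per bucket when theta = 1
bucketed = count_comparisons(U, (0.0, 0.0), theta=1.0)
brute = n * (n - 1) // 2                 # all unordered pairs
print(bucketed, brute)
```

With one sample per bucket the scan makes a number of comparisons that is linear in the number of samples, each costing $O(m)$ for $m$ attributes, whereas comparing all pairs grows quadratically.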

Claims (3)

1. A method for quickly obtaining neighborhoods using hash bucketing, characterized in that it comprises the following steps:
Step 1: determine the coordinate origin $x_0$ of the bucketing coordinate system.
A neighborhood system $NRS = \langle U, N, \theta \rangle$ is given, where $U$ is the set of all sample records, $N$ denotes the neighborhood relation, and $\theta$ is the neighborhood radius.
Step 2: compute the sample distances.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the distance $\|x_i - x_0\|$ between the sample and the origin.
Step 3: using the sample distances from Step 2, build the search buckets by hashing.
For every $x_i \in U$, $i = 1, 2, \ldots, n$, compute the value $k$:
[Formula image: definition of $k$; not reproduced in this text.]
$k$ is a nonnegative integer. A hash table is built with $k$ as the hash key. The hash table has a spherical model in space: taking the hash key as the radius of a sphere, the hash table is a series of nested spheres in space. The samples stored under a given hash key lie in the space between the sphere whose radius is the hash key and the sphere whose radius is the hash key minus 1; this space between adjacent spheres is called a bucket. $B_1, B_2, \ldots, B_b$ are the $b$ buckets obtained by using the $b$ hash key values as radii.
Step 4: obtain the neighborhood.
[Formula image; not reproduced in this text.]
Search the records in buckets $B_{k-1}$, $B_k$, and $B_{k+1}$ to obtain the neighborhood of sample $x$.
2. The method for quickly obtaining neighborhoods using hash bucketing as claimed in claim 1, characterized in that $x_0$ is taken to be the origin, or the feature vector formed by the minimum value of each attribute in $N$.
3. The method for quickly obtaining neighborhoods using hash bucketing as claimed in claim 1, characterized in that when the neighborhoods of the sample records are obtained from the search buckets, an iterative method is used, and only the records in buckets $B_k$ and $B_{k+1}$ need to be searched to obtain the neighborhood of sample $x$.
CN201310261081.6A 2013-06-26 2013-06-26 A kind of method applying Hash Hash division bucket quickly to obtain neighborhood Active CN103345491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310261081.6A CN103345491B (en) 2013-06-26 2013-06-26 A kind of method applying Hash Hash division bucket quickly to obtain neighborhood

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310261081.6A CN103345491B (en) 2013-06-26 2013-06-26 A kind of method applying Hash Hash division bucket quickly to obtain neighborhood

Publications (2)

Publication Number Publication Date
CN103345491A true CN103345491A (en) 2013-10-09
CN103345491B CN103345491B (en) 2016-11-23

Family

ID=49280286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310261081.6A Active CN103345491B (en) 2013-06-26 2013-06-26 A kind of method applying Hash Hash division bucket quickly to obtain neighborhood

Country Status (1)

Country Link
CN (1) CN103345491B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125892A (en) * 2019-12-12 2020-05-08 北京科技大学 Data storage and indexing method and system for molecular dynamics simulation program
CN114490011A (en) * 2020-11-12 2022-05-13 上海交通大学 Parallel acceleration implementation method of N-body simulation in heterogeneous architecture

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020321A (en) * 2013-01-11 2013-04-03 广东图图搜网络科技有限公司 Neighbor searching method and neighbor searching system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘勇等: "异构平台上多维线性哈希的研究", 《计算机科学》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125892A (en) * 2019-12-12 2020-05-08 北京科技大学 Data storage and indexing method and system for molecular dynamics simulation program
CN111125892B (en) * 2019-12-12 2021-10-12 北京科技大学 Data storage and indexing method and system for molecular dynamics simulation program
CN114490011A (en) * 2020-11-12 2022-05-13 上海交通大学 Parallel acceleration implementation method of N-body simulation in heterogeneous architecture

Also Published As

Publication number Publication date
CN103345491B (en) 2016-11-23


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant