CN107818147A

CN107818147A - Distributed temporal index system based on Voronoi diagram

Info

Publication number: CN107818147A
Application number: CN201710976062.XA
Authority: CN
Inventors: 季长清; 汪祖民; 秦静; 高杨; 刘飞
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2017-10-19
Filing date: 2017-10-19
Publication date: 2018-03-20

Abstract

A distributed temporal index system based on Voronoi diagrams belongs to the field of data indexing and addresses the problem of improving the indexing efficiency of existing data query methods. The key technical points are: the distance between each object r in data set R, each object s in data set S, and the representative points p is computed, and each object r, s is assigned to its nearest representative point P; with m representative points, every object r and every object s is collected into the Voronoi cell of its nearest representative point, thereby producing m Voronoi cells as partitions, and <VC_m, List(P_i)> pairs are output. Effect: space cost is greatly reduced, so space efficiency is very high.

Description

Distributed temporal index system based on Voronoi diagram

Technical field

The invention belongs to the field of data indexing and relates to big data processing and the application of spatial query algorithms.

Background

With the rapid development of mobile communication and location-based service technologies, technologies such as cloud computing, big data, the Internet of Things, mobile computing, and spatial positioning are gradually maturing. Data from GPS, cameras, Bluetooth, and similar sources keep growing, producing large volumes of spatial data, which poses a huge challenge for the storage and processing of spatial data and objects.

Big data processing frequently runs into long computation times and low spatio-temporal data query efficiency. Traditional computer systems support only a limited number of threads and have poor parallel and distributed performance, and the computing resources of a single machine are usually limited (for example, by disk or memory size or by weak CPU capability), so such systems cannot be applied directly.

Indexing has an important influence on the efficiency of large-scale data access. New spatial indexing methods needed to be integrated into traditional database processing engines, which led to the R-tree structure. An R-tree index is equivalent to an extension of the B+ tree to multidimensional data environments. Many algorithms perform nearest neighbor (Nearest Neighbor, NN) queries on R-tree indexes, but all of them execute as a single thread on a single computer. When the data scale grows rapidly, a distributed database system is needed for indexing, data query, and related processing.

Summary of the invention

To improve the indexing efficiency of existing data query methods, the invention provides the following scheme:

A distributed temporal index system based on Voronoi diagrams, in which a plurality of instructions are stored, the instructions being adapted to be loaded and executed by a processor: an inverted Voronoi index is built using Spark; given two data sets R and S in d-dimensional space, Spark splits the input by its default mechanism, and several mappers run in parallel; the default reducer is used in the Spark job; before the map function starts, representative points p are obtained using a pre-clustering algorithm and loaded into the main memory of each map task;

In each map process, the input splits are read in turn using TextInputFormat, which reads data from files into Mapper instances. The distance between each object r in data set R, each object s in data set S, and the representative points p is computed, and each object r, s is assigned to its nearest representative point P. With m representative points, every object r and every object s is collected into the Voronoi cell of its nearest representative point, thereby producing m Voronoi cells as partitions, and <VC_m, List(P_i)> pairs are output. Given a query point p, its nearest partition, or a set of several nearest partitions, is identified; for the nearest partition or partition set, the mapper outputs each object r, s of the original data sets together with the id of its partition VC_m. The mapper output is written to the Spark file system.

A Voronoi diagram divides a space into multiple disjoint polygons. The nearest neighbor of any point inside a polygon is the generating point of the Voronoi cell that contains it. Each polygon in the diagram is called the Voronoi cell associated with a point p, and p is the nearest neighbor of every point in its cell.

The inverted Voronoi index comprises two parts: a primary index, containing all cluster centers; and a secondary index, containing the object queue stored for each partition VC.

In the distributed temporal index system based on inverted Thiessen polygons, representative points are obtained as follows. Internal cluster points and adjacent points are determined; the data around each internal cluster point are clustered, and the cluster centers selected after clustering are indexed. The required data are the adjacent points connected to an internal cluster point. Taking the internal cluster point as the center, a circle is drawn that includes the adjacent cluster center points, and the triangle having this circle as its circumscribed circle is taken as a Delaunay triangle. In this method, two different internal cluster points each establish a Delaunay triangle, and the two Delaunay triangles, sharing the adjacent points as common points, form a Delaunay triangulation network. The data objects are divided into several large partitions, and one clustering representative point in each is selected as the representative point. Each partitioned object is clustered into one Voronoi cell, and each Voronoi cell contains object ids.

The Voronoi diagram is given by $VD(p) = \{V(p_1), V(p_2), \ldots, V(p_m)\}$, where $VD(p)$ is the collection of Voronoi regions over the point set P and $V(p_1)$ is the Voronoi region of $p_1$. The set of regions associated with all the points is called the Voronoi diagram generated by P under the distance function Dist(). The Voronoi region of each point p necessarily contains all points that are closer to p than to any other generating point; the nearest neighbor of a query point q is therefore the generating point of the Voronoi region that encloses q;

Voronoi cell: in the space R, a set of n points $P = \{p_1, p_2, \ldots, p_n\}$ divides the d-dimensional space into regions. The region given by partition VC, that is, the region $VC(p_i)$ associated with point $p_i$, satisfies $VC(p_i) = \{\, p \mid d(p, p_i) \le d(p, p_j) \,\}$ and is called the Voronoi cell associated with $p_i$;

Here p is a specified point or query point, $d(p, p_i)$ is the Euclidean distance between p and $p_i$, and i, j are index variables with $n \ge 2$, $p_i \ne p_j$ for $i \ne j$, and $i, j \in I_n = \{1, \ldots, n\}$; i takes every value in $1, \ldots, n$, and for each value of i, j takes every value in $1, \ldots, n$ except the current value of i.

Beneficial effects: the invention uses a Voronoi-diagram indexing method. Because a multidimensional Voronoi index is used, the index supports spatial data integration and is suited to indexing multidimensional data sets, so it supports both massive data sets and multiple dimensions. Spatial object storage needs only a very small space, because only the representative point information of each object has to be stored; space cost is therefore greatly reduced and space efficiency is very high. Inverted Thiessen polygons are used to index distributed medical spatio-temporal regions, and this solution has an important influence on the efficiency of large-scale data access.

Brief description of the drawings

Fig. 1 is a Voronoi diagram;

Fig. 2 is a schematic diagram of the inverted Voronoi diagram index;

Fig. 3 is an illustrative example of the invention;

Fig. 4 is a schematic diagram of establishing the Delaunay triangulation network.

Detailed description of the embodiments

Embodiment 1: A distributed temporal index method based on Voronoi diagrams. The method is executed by a distributed temporal index system based on Voronoi diagrams, in which a plurality of instructions are stored, the instructions being adapted to be loaded and executed by a processor. The steps are as follows: an inverted Voronoi index is built using Spark; two data sets R and S are given in d-dimensional space; Spark, an existing computing engine, splits the input by its default mechanism, and several mappers run in parallel; the default reducer is used in the Spark job; before the map function starts, representative points p are obtained using a pre-clustering algorithm and loaded into the main memory of each map task;

In each map process, the input splits are read in turn using TextInputFormat, which reads data from files into Mapper instances. The distance between each object r in data set R, each object s in data set S, and the representative points p is computed, and each object r, s is assigned to its nearest representative point P. With m representative points, every object r and every object s is collected into the Voronoi cell of its nearest representative point, thereby producing m Voronoi cells as partitions, and <VC_m, List(P_i)> pairs are output, where P_i is the resulting series of nearest representative points and i denotes the position of a point in the sequence. Given a query point p, its nearest partition, or a set of several nearest partitions, is identified; for the nearest partition or partition set, the mapper outputs each object r, s of the original data sets together with the id of its partition VC_m. The mapper output is written to the Spark file system.

The Voronoi diagram divides a space into multiple disjoint polygons. The nearest neighbor of any point inside a polygon is the generating point of the Voronoi cell that contains it. Each polygon in the diagram is called the Voronoi cell associated with a point p, and p is the nearest neighbor of every point in its cell.

The Voronoi diagram is given by $VD(p) = \{V(p_1), V(p_2), \ldots, V(p_m)\}$, where $VD(p)$ is the collection of Voronoi regions over the point set P and $V(p_1)$ is the Voronoi region of $p_1$. The set of regions associated with all the points is called the Voronoi diagram generated by P under the distance function Dist(). The Voronoi region of each point p necessarily contains all points that are closer to p than to any other generating point; the nearest neighbor of a query point q is therefore the generating point of the Voronoi region that encloses q;

Representative points are obtained as follows. Internal cluster points and adjacent points are determined; the data around each internal cluster point are clustered, and the cluster centers selected after clustering are indexed. The required data are the adjacent points connected to an internal cluster point. Taking the internal cluster point as the center, a circle is drawn that includes the adjacent cluster center points, and the triangle having this circle as its circumscribed circle is taken as a Delaunay triangle. In this method, two different internal cluster points each establish a Delaunay triangle, and the two Delaunay triangles, sharing the adjacent points as common points, form a Delaunay triangulation network. The data objects are divided into several large partitions, and one clustering representative point in each is selected as the representative point. Each partitioned object is clustered into one Voronoi cell, and each Voronoi cell contains object ids.
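
The role of the circumscribed circle in a Delaunay triangle can be illustrated with a small check. The sketch below is not the construction procedure of this embodiment; it only tests the standard empty-circumcircle property of a Delaunay triangle in two dimensions. The function names `circumcircle` and `is_delaunay_triangle`, the tolerance `eps`, and the sample coordinates are assumptions made for illustration.

```python
import math

def circumcircle(a, b, c):
    """Return (center, radius) of the circle through the 2-D points a, b, c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        raise ValueError("points are collinear, no circumscribed circle")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.hypot(ax - ux, ay - uy)

def is_delaunay_triangle(triangle, other_points, eps=1e-9):
    """True if no other point lies strictly inside the triangle's circumcircle."""
    center, radius = circumcircle(*triangle)
    return all(math.dist(center, q) >= radius - eps for q in other_points)

# Hypothetical cluster centers used only to exercise the check.
triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
print(circumcircle(*triangle))
print(is_delaunay_triangle(triangle, [(5.0, 5.0)]))
print(is_delaunay_triangle(triangle, [(1.0, 1.0)]))
```

A triangulation in which every triangle passes this check is a Delaunay triangulation, and its dual is the Voronoi diagram of the same points.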

Embodiment 2: This embodiment supplements or further explains Embodiment 1. As shown in Fig. 1, a Voronoi diagram divides a space into multiple disjoint polygons. The nearest neighbor of any point inside a polygon is the generating point of the Voronoi cell that contains it. Each polygon in the diagram is called the Voronoi cell associated with a point p, so p is the nearest neighbor of every point in its cell. In a Voronoi-based nearest neighbor search, the data point p of each Voronoi cell can therefore be checked to verify whether it is a neighbor of some query point q. An inverted index, by contrast, is commonly used for text similarity search, where the position of a record is determined by an attribute value.

Voronoi diagram (Voronoi Diagram, VD): the set associated with all the points, given by $VD(p) = \{V(p_1), V(p_2), \ldots, V(p_m)\}$, is called the Voronoi diagram generated by P under the distance function Dist(). The Voronoi region of each point p necessarily contains all points that are closer to p than to any other generating point; the nearest neighbor of a query point q is therefore the generating point of the Voronoi region that encloses q. Fig. 1 shows a Voronoi diagram of 8 neighboring points in two-dimensional Euclidean space.

Voronoi cell (Voronoi Cell, VC): in the space R, a set of n points $P = \{p_1, p_2, \ldots, p_n\}$ divides the d-dimensional space into regions, where $n \ge 2$, $p_i \ne p_j$ for $i \ne j$, and $i, j \in I_n = \{1, \ldots, n\}$. The region given by VC is $VC(p_i) = \{\, p \mid d(p, p_i) \le d(p, p_j) \,\}$, where $d(p, p_i)$ is the Euclidean distance between p and $p_i$; this region is called the Voronoi cell associated with $p_i$.
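
The cell definition above translates directly into a membership test. The following is a minimal sketch assuming points are given as coordinate tuples; the names `euclidean`, `in_voronoi_cell`, and `generators` are illustrative and do not appear in the original.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def in_voronoi_cell(p, i, generators):
    """True if p lies in VC(p_i), i.e. d(p, p_i) <= d(p, p_j) for every j != i."""
    d_i = euclidean(p, generators[i])
    return all(
        d_i <= euclidean(p, g) for j, g in enumerate(generators) if j != i
    )

# Hypothetical generating points p_1, p_2, p_3 (0-based indices in the code).
generators = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(in_voronoi_cell((1.0, 1.0), 0, generators))  # True
print(in_voronoi_cell((3.0, 1.0), 0, generators))  # False
```

Because the inequality is not strict, a point on the boundary between two cells belongs to both, which matches the definition with the less-than-or-equal sign.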

Our inverted Voronoi index combines the inverted index with the Voronoi index to produce a new index that has the advantages of both. Specifically, the inverted Voronoi index is a large-scale spatial data structure that stores mapped data points. Given a large data set P containing a set of data objects in Euclidean space, the data set is indexed so that each object is clustered into one Voronoi cell, and the Voronoi diagram can be expressed as $VC(p) = \{VC_1, VC_2, \ldots, VC_m\}$. We use $VC(p)$ as the key of the inverted index. The ids of all data objects $\{P_i\} \in VC_m$ are stored in a queue as the value. That is, each Voronoi cell contains a large number of object ids.

Such a system faces the following conditions:

S1. the data to be processed are very large;

S2. query points occur at random and are not contained in the data set, and the data set may be skewed in its distribution;

S3. the data model and distances are established in multidimensional Euclidean space.

The inverted Voronoi index (Inverted Voronoi Index, IVI) comprises two parts:

S1. a primary index, containing all cluster centers;

S2. a secondary index, containing the object queue stored in each VC.

The inverted index serves to efficiently locate the data objects in the queues adjacent to the query object. Given a query, we can identify the nearest VC or a set of several nearby VCs, and then collect the queue elements of these VCs to obtain the kNN query result.
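
The two-part structure and the query procedure can be sketched on a single machine as follows. This is an illustrative sketch, not the distributed implementation: the class name `InvertedVoronoiIndex`, its methods, and the parameter `n_cells` are assumptions, and each queue stores the object's coordinates next to its id (the text only mentions ids) so that candidates can be ranked by distance.

```python
import heapq
import math

class InvertedVoronoiIndex:
    """Primary index: cluster centers. Secondary index: one object queue per cell."""

    def __init__(self, centers):
        self.centers = list(centers)
        self.queues = {i: [] for i in range(len(self.centers))}

    def nearest_cells(self, point, n_cells=1):
        """Ids of the n_cells cells whose centers are closest to point."""
        order = sorted(
            range(len(self.centers)),
            key=lambda i: math.dist(point, self.centers[i]),
        )
        return order[:n_cells]

    def insert(self, obj_id, point):
        """Append the object to the queue of its nearest center's cell."""
        cell = self.nearest_cells(point)[0]
        self.queues[cell].append((obj_id, point))
        return cell

    def knn(self, query, k, n_cells=2):
        """Gather the queues of the nearest cells and keep the k closest objects."""
        candidates = []
        for cell in self.nearest_cells(query, n_cells):
            candidates.extend(self.queues[cell])
        return heapq.nsmallest(
            k, candidates, key=lambda item: math.dist(query, item[1])
        )

# Hypothetical centers and objects.
index = InvertedVoronoiIndex([(0.0, 0.0), (10.0, 0.0)])
for obj_id, point in [("a", (1.0, 0.0)), ("b", (2.0, 1.0)),
                      ("c", (9.0, 0.0)), ("d", (6.0, 0.0))]:
    index.insert(obj_id, point)
print([obj_id for obj_id, _ in index.knn((5.0, 0.0), k=2)])
```

Probing only a few neighboring cells keeps the candidate set small; if too few cells are probed the result may miss a true neighbor, so the value of `n_cells` is a trade-off between accuracy and cost.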

Fig. 2 illustrates an IVI containing two-dimensional spatial objects. Based on the Voronoi partitioning, we divide the objects into 6 partitions. For simplicity, we select P as the representative points; each object is therefore assigned to the same Voronoi cell as its nearest representative point. Intuitively, the partitioning method of the inverted Voronoi diagram index divides the multidimensional space into multiple Voronoi cells stored in inverted form.

Our IVI therefore has the following advantages:

S1. Support for massive data sets: because the inverted Voronoi diagram index structure inherits the form of the inverted index, it is intuitively clear that this indexing scheme is suited to distributed processing.

S2. Support for multiple dimensions: a multidimensional Voronoi index is used, so the index supports spatial data integration and is suited to indexing multidimensional data sets.

S3. Space efficiency: spatial object storage needs only a very small space. Because only the representative point information of each object has to be stored, space cost is greatly reduced.

Building the inverted Voronoi diagram index with Spark

We now describe how the IVI is built using Spark. Because a Voronoi diagram can be obtained by splitting and merging multiple Voronoi partitions (VP), constructing the inverted Voronoi index fits the Spark model. In particular, the sub-VPs are merged to obtain the final Voronoi diagram.
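
One way to picture the merge step is to treat each sub-VP as a mapping from cell id to the object ids found in one input split, and to merge the mappings by cell id. The sketch below is plain Python under that assumption; the function name `merge_partials` and the sample ids are hypothetical, and the actual merge performed by the Spark job is not specified in the text.

```python
from collections import defaultdict

def merge_partials(partials):
    """Merge per-split {cell_id: [object ids]} mappings into one mapping."""
    merged = defaultdict(list)
    for partial in partials:
        for cell_id, object_ids in partial.items():
            merged[cell_id].extend(object_ids)
    return dict(merged)

# Hypothetical partial results from two input splits.
split_1 = {0: ["r1", "s4"], 2: ["r7"]}
split_2 = {0: ["r9"], 1: ["s2", "s3"]}
print(merge_partials([split_1, split_2]))
```

Because every split uses the same representative points, cells with the same id in different splits describe the same region, which is what makes a merge by key sufficient in this sketch.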

As shown in Algorithm 1, two data sets R and S are given in d-dimensional space. Spark splits the input according to its default mechanism, and several mappers run in parallel. In the Spark job, we use the default reducer. Before the map function starts, we obtain the representative points p using a fast pre-clustering algorithm and load them into the main memory of each map task.
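
The text does not name the pre-clustering algorithm. As a stand-in only, the sketch below uses a few iterations of plain k-means (Lloyd's algorithm) with the first k points as initial centers to produce representative points; the function name `pre_cluster`, its parameters, and the sample data are assumptions.

```python
import math

def pre_cluster(points, k, iterations=10):
    """Return k representative points using a basic k-means loop (an assumed stand-in)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)), key=lambda i: math.dist(p, centers[i])
            )
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centers[i]
            for i, group in enumerate(groups)
        ]
    return centers

# Hypothetical sample drawn from the data sets.
sample = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
print(pre_cluster(sample, k=2))
```

In a distributed setting the clustering would typically run on a small sample so that the resulting representative points fit in the memory of every map task, as the paragraph above requires.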

Then, in each map process, the input splits are read in turn using TextInputFormat (according to the input format in the distributed file system), which reads data from files into Mapper instances. The distance between each r and s object and the points p is computed, and r and s are assigned to the nearest representative point P. In lines 2-3 of the algorithm, each point is collected into a Voronoi cell, producing m Voronoi cells. In lines 4-6 of the algorithm, <VC_m, List(P_i)> pairs are output: for the raw data set (R or S), the mapper outputs each object r, s assigned to its nearest partition together with the id of its partition VC_m.
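
The map step described above can be sketched in plain Python, without Spark, as a function that processes one input split. The record layout `(source, object_id, point)`, the function name `map_split`, and the sample data are assumptions made for illustration; the algorithm line numbers mentioned in the text refer to Algorithm 1, not to this sketch.

```python
import math
from collections import defaultdict

def map_split(records, representatives):
    """Assign each record of one split to its nearest representative point.

    records: iterable of (source, object_id, point), source being "R" or "S".
    Returns a sorted list of (cell_id, [(source, object_id), ...]) pairs.
    """
    cells = defaultdict(list)
    for source, object_id, point in records:
        cell_id = min(
            range(len(representatives)),
            key=lambda i: math.dist(point, representatives[i]),
        )
        cells[cell_id].append((source, object_id))
    return sorted(cells.items())

# Hypothetical representative points and one input split.
representatives = [(0.0, 0.0), (10.0, 10.0)]
records = [
    ("R", "r1", (1.0, 1.0)),
    ("S", "s1", (9.0, 9.0)),
    ("R", "r2", (0.0, 2.0)),
]
for cell_id, members in map_split(records, representatives):
    print(cell_id, members)
```

Each returned pair plays the role of a <VC_m, List(P_i)> pair: the key is the partition id and the value is the list of objects assigned to that partition.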

Finally, in lines 8-10 of the algorithm, we write the mapper output to the Spark file system through a customized MultipleOutputFormat function, according to our own needs. This function determines how task results are written back to the underlying persistent storage. Algorithm 1 gives the detailed pseudocode for building the Spark-based Voronoi index structure. With the IVI, once the representative points are given, we can start a Spark job to partition the data and collect some data information for each partition.

Embodiment 3: Today, as medical and social security services develop rapidly and living standards keep rising, the demand for medical services is becoming more humane and personalized, and more and more people need medical services that are more convenient and complete. At the same time, with the rapid development of mobile communication and location-based service technologies, technologies such as cloud computing, big data, the Internet of Things, mobile computing, and spatial positioning are gradually maturing. Data from GPS, cameras, Bluetooth, and similar sources keep growing, producing large volumes of spatial data, which poses a huge challenge for the storage and processing of spatial data and objects. In the health care industry, applications such as electronic medical records, nursing call center systems, and large-scale medical databases are also developing quickly, and mobile medical technology plays an increasing role in improving work efficiency, improving medical services, and saving medical costs.

However, China's geographic environments differ greatly, economic development is uneven, and medical resources are unevenly distributed; in particular, medical standards differ greatly between developed and remote areas. Meanwhile, with migration from rural areas to cities and the rapid growth of industries such as tourism, an already large mobile population is growing exponentially. A patient who falls ill in an unfamiliar place often does not know where to seek treatment, may have to book a hospital months in advance and queue to register, or travels between several hospitals, so that large amounts of effort and money are wasted on transport while the illness is not treated in time. In daily life we also often need emergency treatment without knowing which hospitals are nearby, which hospital can handle the condition, or which hospital is closer to the patient and offers better service; the resulting delay means treatment is not timely and can even lead to death.

Although many hospitals now have their own websites that allow advance registration and inquiries, and online consultation has become easy, China has a great many hospitals, genuine medical websites large and small are hard to tell apart from fraudulent ones, and the qualifications of online doctors cannot be verified. In addition, PC equipment is not easy to carry, so consultations become very difficult when complex queries or emergency calls from home are needed.

In recent years, with the arrival of the medical big data era, more data related to medical resources have become available, and the concept of mobile health care has emerged. Mobile health care refers to using mobile communication technologies and devices to provide medical services and medical information to the public at any time and place. The rapid development of the Internet, mobile communication, and multimedia technologies, especially 3G and 4G, has brought significant progress in mobile medical technology. However, big data processing of such mobile medical data frequently runs into long computation times and low spatio-temporal data query efficiency. Traditional computer systems support only a limited number of threads and have poor parallel and distributed performance, and the computing resources of a single machine are usually limited (for example, by disk or memory size or by weak CPU capability), so they cannot be applied directly to large-scale mobile medical data processing. This poses a series of challenges for big data query and processing in mobile medical systems.

It is well known that indexing has an important influence on the efficiency of large-scale data access. New spatial indexing methods needed to be integrated into traditional database processing engines, which led to the R-tree structure. An R-tree index is equivalent to an extension of the B+ tree to multidimensional data environments. Many algorithms perform nearest neighbor (Nearest Neighbor, NN) queries on R-tree indexes, but all of them execute as a single thread on a single computer. When the data scale grows rapidly, a distributed database system is needed for indexing, data query, and related processing.

This embodiment applies the distributed temporal index method based on Voronoi diagrams of Embodiment 1 or 2 to the field of mobile medical calling. There are currently three kinds of medical call systems: bus-based medical intercom systems, IP network semi-digital medical intercom systems, and IP network medical information intercom systems. These systems have significant limitations: they can only transmit information over short ranges, so they cannot operate if the patient is outside the transmission range. A medical call system that executes the distributed temporal index method based on Voronoi diagrams is not subject to these limits and can effectively improve the efficiency of large-scale NN queries in a distributed environment. This makes the invention particularly important, especially for patients with sudden-onset diseases or patients who need closer attention and must be given better service; equipment is also needed that better matches the services a patient needs with communication with medical staff and provides a good medical environment.

A system capable of executing the distributed temporal index method based on Voronoi diagrams classifies patient information by attribute and then establishes it as internal cluster points. When a patient uses the medical call system, the system analyzes the attributes of the patient information to determine which kind of help the patient needs most at that moment, whether help requiring substantial medical knowledge or help with an inconvenience in daily life. The patient information is then treated as discrete point data, and the point in the nearest Thiessen polygon is found, yielding the help the patient needs most so that the patient receives the best help.

In the invention, the system capable of executing the distributed temporal index method based on Voronoi diagrams uses a multidimensional Voronoi index. The index supports spatial data integration and is suited to indexing multidimensional data sets, so it supports both massive data sets and multiple dimensions. Spatial object storage needs only a very small space, because only the representative point information of each object has to be stored; space cost is therefore greatly reduced, space efficiency is very high, and patients can get help in time.

In another embodiment, the inverted Voronoi diagram index is built using Spark. Two medical data sets R and S are given in 3-dimensional space. R is the medical resource data set, containing data that reflect medical resource information such as doctors, medical equipment, and locations. S is the patient data set, containing data that reflect patient conditions such as case information and locations. The two data sets are uploaded to HDFS, and Spark splits them according to its default mechanism. Several mappers run in parallel. In the Spark job, we use the default reducer. Before the map function starts, we use a fast pre-clustering algorithm to obtain the representative points p of the medical resources in a region and load them into the main memory of each map task.

Then, in each map process, the input splits are read in turn using TextInputFormat (according to the input format in the distributed file system), which reads file data into Mapper instances in a streaming manner. The distance between each medical resource object r, each patient object s, and the points p is computed, and r and s are assigned to the nearest representative point P. In the algorithm, each medical resource representative point is collected into one Voronoi cell, producing m Voronoi cells (in a practical scenario, a large-scale concentration of medical resources is divided into m medical areas of the same nature, such as a city medical center, and each area has a representative point standing for its medical resources, such as a Grade A hospital). When executed, the program outputs <VC_m, List(P_i)> pairs: for the raw data set (R or S), the mapper outputs each object r, s assigned to its nearest partition together with the id of its partition VC_m. We write the mapper output to the Spark file system through a customized MultipleOutputFormat function, according to our own needs; this function determines how task results are written back to the underlying persistent storage. With the medical inverted IVI, given a query request from a patient user, for example to find, in nationwide medical data, a hospital that meets the diagnosis and treatment needs of a case, we can start a Spark job to partition the data and collect some data information for each partition. The key of the inverted index locates the medical resource representative point, that is, a representative hospital; the relevant medical resources are then found from that hospital's specific data and fed back to the patient. In this way, the Spark data processing system can use thousands of computers to find the relevant data quickly, in a distributed manner, from large-scale medical resources.
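
The two-step lookup in this embodiment (index key to representative hospital, then the hospital's own data to a specific resource) can be sketched on toy data. Everything below is hypothetical: the region names, coordinates, specialties, and the function `find_resource` are invented for illustration and are not taken from any real medical data set.

```python
import math

# Hypothetical representative point (hospital) per region, in 3-dimensional space.
representative_points = {
    "region_a": (0.0, 0.0, 0.0),
    "region_b": (10.0, 0.0, 0.0),
}
# Hypothetical resources per region: (name, position, specialties).
region_resources = {
    "region_a": [
        ("clinic_1", (1.0, 0.0, 0.0), {"cardiology"}),
        ("clinic_2", (0.0, 2.0, 0.0), {"orthopedics"}),
    ],
    "region_b": [
        ("clinic_3", (9.0, 1.0, 0.0), {"cardiology"}),
    ],
}

def find_resource(patient_position, needed_specialty):
    """Return (region, resource name or None) for a patient request."""
    # Step 1: the index key is the region whose representative point is nearest.
    region = min(
        representative_points,
        key=lambda r: math.dist(patient_position, representative_points[r]),
    )
    # Step 2: search that region's own data for a matching resource.
    matches = [
        (name, position)
        for name, position, specialties in region_resources[region]
        if needed_specialty in specialties
    ]
    if not matches:
        return region, None
    name, _ = min(matches, key=lambda m: math.dist(patient_position, m[1]))
    return region, name

print(find_resource((2.0, 1.0, 0.0), "orthopedics"))
print(find_resource((8.0, 0.0, 0.0), "orthopedics"))
```

When the nearest region has no matching resource, a fuller version would probe the next nearest regions, in line with the text's option of using a set of several nearest partitions.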

Claims

1. A distributed temporal index system based on Voronoi diagrams, in which a plurality of instructions are stored, the instructions being adapted to be loaded and executed by a processor: an inverted Voronoi index is built using Spark; given two data sets R and S in d-dimensional space, Spark splits the input by its default mechanism, and several mappers run in parallel; the default reducer is used in the Spark job; before the map function starts, representative points p are obtained using a pre-clustering algorithm and loaded into the main memory of each map task;

2. The distributed temporal index system based on inverted Thiessen polygons as claimed in claim 1, characterized in that: a Voronoi diagram divides a space into multiple disjoint polygons; the nearest neighbor of any point inside a polygon is the generating point of the Voronoi cell that contains it; each polygon in the diagram is called the Voronoi cell associated with a point p, and p is the nearest neighbor of every point in its cell.

3. The distributed temporal index system based on inverted Thiessen polygons as claimed in claim 1, characterized in that: the inverted Voronoi index comprises two parts: a primary index, containing all cluster centers; and a secondary index, containing the object queue stored in each partition VC.

4. The distributed temporal index system based on inverted Thiessen polygons as claimed in claim 1, characterized in that representative points are obtained as follows: internal cluster points and adjacent points are determined; the data around each internal cluster point are clustered, and the cluster centers selected after clustering are indexed; the required data are the adjacent points connected to an internal cluster point; taking the internal cluster point as the center, a circle is drawn that includes the adjacent cluster center points, and the triangle having this circle as its circumscribed circle is taken as a Delaunay triangle; in this method, two different internal cluster points each establish a Delaunay triangle, and the two Delaunay triangles, sharing the adjacent points as common points, form a Delaunay triangulation network; the data objects are divided into several large partitions, and one clustering representative point in each is selected as the representative point; each partitioned object is clustered into one Voronoi cell, and each Voronoi cell contains object ids.

5. The distributed temporal index system based on inverted Thiessen polygons as claimed in claim 4, characterized in that: the Voronoi diagram is given by $VD(p) = \{V(p_1), V(p_2), \ldots, V(p_m)\}$, where $VD(p)$ is the collection of Voronoi regions over the point set P and $V(p_1)$ is the Voronoi region of $p_1$; the set of regions associated with all the points is called the Voronoi diagram generated by P under the distance function Dist(); the Voronoi region of each point p necessarily contains all points that are closer to p than to any other generating point, and the nearest neighbor of a query point q is therefore the generating point of the Voronoi region that encloses q;

Voronoi cell: in the space R, a set of n points $P = \{p_1, p_2, \ldots, p_n\}$ divides the d-dimensional space into regions; the region given by partition VC, that is, the region $VC(p_i)$ associated with point $p_i$, satisfies $VC(p_i) = \{\, p \mid d(p, p_i) \le d(p, p_j) \,\}$ and is called the Voronoi cell associated with $p_i$;

where p is a specified point or query point, $d(p, p_i)$ is the Euclidean distance between p and $p_i$, and i, j are index variables with $n \ge 2$, $p_i \ne p_j$ for $i \ne j$, and $i, j \in I_n = \{1, \ldots, n\}$; i takes every value in $1, \ldots, n$, and for each value of i, j takes every value in $1, \ldots, n$ except the current value of i.