CN103995861B - Distributed data device, method and system based on spatial correlation

Info

Publication number: CN103995861B
Application number: CN201410208628.0A
Authority: CN
Inventors: 罗敬宁
Original assignee: Individual
Current assignee: Individual
Priority date: 2014-05-16
Filing date: 2014-05-16
Publication date: 2018-08-28
Anticipated expiration: 2034-05-16
Also published as: CN103995861A

Abstract

The present invention relates to the technical field of data systems, and in particular to a distributed data device, method and system based on spatial correlation. The storage method includes dividing data with spatial characteristics into multiple grids, each grid holding the data of the space in which it is located, and storing the data in the grids in a plurality of storage nodes according to the association between the spatial positions of the grids. The beneficial effects of the invention are that, for massive spatial data of various types, highly parallel data writing and reading can be achieved; data divided according to spatial attributes is guaranteed to be stored on the nodes in a balanced, secure manner that preserves spatial correlation; and the system is highly scalable, with performance scaling linearly with system expansion and without large numbers of idle nodes or I/O bottleneck nodes, which fulfils the original purpose of distributed system design.

Description

Distributed data device, method and system based on spatial correlation

Technical field

The present invention relates to the technical field of data systems, and in particular to a distributed data device, method and system based on spatial correlation.

Background art

After years of development, distributed data systems have become an important solution for storing and using massive data with high efficiency, high availability and a high performance-to-cost ratio, and have played a decisive role in advancing cloud computing and big data applications. The core idea of a distributed data system is to disperse data storage: data is divided into standard subsets, each subset is stored on one of multiple nodes, and the location information of the subsets is stored on a master node. When data is read, each storage node is responsible only for providing its own data subsets; a client interface component reassembles the subsets and finally delivers the data file or record. Segmented storage and dispersed reading greatly reduce the I/O pressure on any single node and allow multiple nodes to be read in parallel, so the performance of the whole data system improves greatly and can meet the challenges of rapidly growing data volumes and ever-rising application demands.

The distributed data systems implemented in the prior art mainly process data files or data records, and are in essence still extensions of a file system or database storage system. Current distributed data systems are therefore mostly implemented as distributed file systems, in which the file is the basic processing unit and is split into subsets that are stored dispersedly on the nodes. Data segmentation in a distributed file system is based entirely on physical cutting, and the only basis for associating data subsets is their position within the file. Moreover, a distributed file system has no attribute-based association between data files, and the storage locations of data files and their subsets are allocated according to random rules. Such a system performs well for whole-file reads and writes, but when part of a file is read according to data attributes, or when multiple files are read in association, the storage nodes can hardly maintain a reasonable load balance: a few nodes bear most of the I/O pressure while many nodes sit idle, which severely degrades system performance.

Prior art one provides a distributed file system, such as the Hadoop Distributed File System (HDFS). HDFS is designed to run on commodity hardware, provides high-throughput data access, and is well suited to the storage and use of large-scale data. HDFS takes the file as the storage unit, physically splits each file into data blocks of fixed capacity (64MB), and stores the blocks dispersedly on storage nodes with multiple replicas. HDFS mainly consists of one name node and multiple data nodes; the name node maintains the file name index, determines the mapping between data blocks and data nodes, receives data read and write requests, and provides data as a data stream.

However, the inventor found that prior art one has at least the following disadvantages. First, HDFS is a distributed implementation of a file system whose processing subject is still the individual file, so it cannot be called a true data system. Second, for data with obvious attribute associations, such as data types with spatial, temporal or hierarchical attributes, HDFS cannot recognize this attribute information, cannot split data according to attribute associations, and cannot establish associations between data blocks, which leads to an unreasonable data distribution. Third, because HDFS does not split and store data according to data attributes, it cannot provide reasonable load balancing when data is read according to some attribute feature; the I/O pressure on some nodes becomes excessive and overall performance declines. Finally, HDFS data access is file-based: it cannot access data by attribute feature, cannot provide attribute-based operations such as query, merging and cropping, and cannot read and reassemble data from multiple files according to attribute associations.

Prior art two provides the Google BigTable distributed database system (Bigtable: A Distributed Storage System for Structured Data), which can store massive numbers of data records across hundreds or thousands of nodes. BigTable is based on a Key/Value storage structure and introduces the concept of the column family (Column Family): a record consists of a Key and one or more column families, and is stored by column family. Each BigTable data table is made up of multiple Tablets; a Tablet is a unit holding a set of data records, with capacity generally limited to 100-200MB, and Tablet Servers are responsible for Tablet location information. BigTable builds a B+ tree index structure over the Tablets and uses the Google Chubby service for highly available access lock control; a user access is connected through Chubby to the Root Tablet, then navigates to the specific Tablet, and finally obtains the corresponding column family by Key. BigTable is well suited to storing huge volumes of unstructured or semi-structured data and provides good distributed parallel read/write performance.

However, the inventor found that prior art two has at least the following disadvantages. First, the Key/Value structure of BigTable has only a single key, generally a character string, which cannot describe keys with a multidimensional attribute structure, such as spatial data with spatial position, time and hierarchy attributes. Second, BigTable allocates stored data in arrival order or at random and forms the different Tablets in this way, which leads to an unreasonable storage of spatially correlated data: data accesses may be directed to a single Tablet frequently, losing the advantage of a distributed system. Third, BigTable data records cannot express attribute associations through the Key, such as spatial adjacency, temporal order or hierarchical relationships, so when data is queried and read, tasks cannot be distributed and processed according to attribute features, which makes system performance unbalanced.

Summary of the invention

In view of the defects and deficiencies of current distributed file systems, the present invention designs and implements a distributed data device, method and system based on spatial correlation for data types with spatial attributes. The data system as a whole has no concept of a file. The system establishes different data spaces using standard spatial parameters; input data is split into grids according to the matching data space; a distribution sequence code is calculated by grid hashing; and the gridded data blocks are stored on the distributed nodes according to this code. The distributed system thus becomes a truly data-oriented system rather than a simple file system: data and data blocks in the system are split and distributed entirely according to their own spatial attributes, a data read can involve any region of the data space, and the system automatically splices and crops data according to the spatial range and delivers the final result as a data stream or a file.

An embodiment of the present invention provides a distributed data storage method based on spatial correlation, including:

dividing data with spatial characteristics into multiple grids, each grid holding the data of the space in which the grid is located;

storing the data in the grids in a plurality of storage nodes according to the association between the spatial positions of the grids.

According to a further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, the data with spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data.

According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, a data space, serving as a container for data with the same spatial attributes, is composed of multiple spatial grids of identical extent; each grid has a time axis and stores the data of different periods as time slices; each time slice further includes at least one physical layer, and the data is divided into multiple data blocks according to the physical layers.

According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, storing the data in the grids in a plurality of storage nodes according to the association between the spatial positions of the grids further includes storing data that is close in spatial position in dispersed storage nodes.

According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, before the data in the grids is stored in a plurality of storage nodes according to the spatial positions of the grids, the method further includes converting the multiple grids from a multidimensional space to a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids.

According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, the method for converting the multiple grids from a multidimensional space to a one-dimensional sequence further includes a Hilbert curve, a row-order curve or a Z-order curve.

According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, after the sequence code reflecting the spatial relationship between grids is obtained, the method further includes mapping the sequence code to storage nodes and, according to the mapping, storing data that is close in spatial position in dispersed storage nodes.

An embodiment of the present invention further provides a parallel reading method for distributed data based on spatial correlation, including:

determining, according to a read request for data with spatial characteristics, the grids that cover the read request;

determining, according to the covering grids, the storage nodes that store the data in those grids;

reading the requested data in parallel from the determined storage nodes.

According to a further aspect of the parallel reading method for distributed data based on spatial correlation of the embodiment of the present invention, after the requested data is obtained from the determined storage nodes, the method further includes splicing and cropping the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.

An embodiment of the present invention further provides a distributed data storage device based on spatial correlation, including:

a grid division unit, configured to divide data with spatial characteristics into multiple grids, each grid holding the data of the space in which the grid is located;

a storage unit, configured to store the data in the grids in a plurality of storage nodes according to the association between grids.

According to a further aspect of the distributed data storage device based on spatial correlation of the embodiment of the present invention, the device further includes a hash computing unit, connected between the grid division unit and the storage unit, configured to convert the multiple grids from a multidimensional space to a one-dimensional sequence to obtain grid numbers that reflect the spatial relationship between grids; and a mapping unit, connected between the storage unit and the hash computing unit, configured to map the grids to storage nodes.

According to another further aspect of the distributed data storage device based on spatial correlation of the embodiment of the present invention, the hash computing unit converts the multiple grids from a multidimensional space to a one-dimensional sequence using a Hilbert curve, a row-order curve or a Z-order curve.

An embodiment of the present invention further provides a parallel reading device for distributed data based on spatial correlation, including: a request acquisition unit, configured to determine, according to a read request for data with spatial characteristics, the grids that cover the read request;

a processing unit, configured to determine, according to the covering grids, the storage nodes that store the data in those grids;

a parallel reading unit, configured to read the requested data in parallel from the determined storage nodes.

According to a further aspect of the distributed data reading device based on spatial correlation of the embodiment of the present invention, the device further includes a splicing and cropping unit, connected to the reading unit, configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.

An embodiment of the present invention further provides a distributed data system based on spatial correlation, including:

a storage unit, configured to store the data in the grids in a plurality of storage nodes according to the association between grids;

a request acquisition unit, configured to determine, according to a read request for data with spatial characteristics, the grids that cover the read request.

According to a further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the system includes a hash computing unit, configured to convert the multiple grids formed by dividing the data space from a multidimensional space to a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids;

a mapping unit, configured to map the grids to storage nodes.

According to another further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the hash computing unit, mapping unit, request acquisition unit and processing unit are located on the master node;

the grid division unit, storage unit and parallel reading unit are located on the client node.

According to another further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the system further includes an intermediate layer node, connected between the client node and the storage nodes, configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.

With the methods and devices of the above embodiments, highly parallel data reading and writing can be achieved for massive spatial data of various types; data divided according to spatial attributes is guaranteed to be stored on the nodes in a balanced, secure manner that preserves spatial correlation; and various forms of data reading applications can be accommodated, providing users with a truly data-centric service rather than simple file reading. At the same time, the system is highly scalable, and performance scales linearly with system expansion; the system does not end up with large numbers of idle nodes or I/O bottleneck nodes, which fulfils the original purpose of distributed system design, namely dispersing pressure and improving efficiency.

Description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 is a flow chart of a distributed data storage method based on spatial correlation according to an embodiment of the present invention;

Fig. 2 is a flow chart of a parallel reading method for distributed data based on spatial correlation according to an embodiment of the present invention;

Fig. 3 is a structure diagram of a distributed data storage device based on spatial correlation according to an embodiment of the present invention;

Fig. 4 is a structural schematic diagram of a parallel reading device for distributed data based on spatial correlation according to an embodiment of the present invention;

Fig. 5 is a structure diagram of a distributed data system based on spatial correlation according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the data space and grid division of the embodiment;

Fig. 7 is a schematic diagram of the grid data model of the embodiment of the present invention;

Fig. 8 is a schematic diagram of the correspondence between grid identifiers and hash sequence numbers in the embodiment of the present invention;

Fig. 9 is a schematic diagram of the mapping between grid identifiers and storage nodes in the embodiment of the present invention;

Fig. 10 is a schematic diagram of satellite remote sensing data and its position in the grids in the embodiment of the present invention;

Fig. 11 is the attribute table of the grid data of the embodiment of the present invention;

Fig. 12 is a schematic diagram of the mapping of gridded data to storage nodes in the embodiment of the present invention;

Fig. 13 is a schematic diagram of writing data in the embodiment of the present invention;

Fig. 14 is a schematic diagram of writing multiple files into the data space in the embodiment of the present invention;

Fig. 15 is a schematic diagram of reading a subset of the data space by rectangular region in the embodiment of the present invention;

Fig. 16 is a schematic diagram of reading the data space by polygonal region in the embodiment of the present invention;

Fig. 17 is a flow chart of reading the data space by polygonal region in the embodiment of the present invention.

Detailed description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions are used to explain the present invention and are not a limitation of it.

Fig. 1 is a flow chart of a distributed data storage method based on spatial correlation according to an embodiment of the present invention.

The method includes step 101: dividing data with spatial characteristics into multiple grids, each grid holding the data of the space in which the grid is located.

Step 102: storing the data in the grids in a plurality of storage nodes according to the association between the spatial positions of the grids.

As an embodiment of the present invention, the data with spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data. Data with spatial characteristics is a collection of data that, within a specific coordinate system, describes attribute, physical, statistical, natural and other information at continuous or discrete positions in space. It mainly includes geography-based spatial entity data, continuously distributed spatial grid data, discretely distributed information point data, and image and scientific matrix data with row and column coordinates.

As an embodiment of the present invention, a data space, serving as a container for data with the same spatial attributes, is composed of multiple spatial grids of identical extent. Each grid has a time axis for storing the data of different periods; each period further includes at least one physical layer, and the data is divided into multiple data blocks according to the physical layers.
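The containment hierarchy described above (grid, then time slice, then physical layer, then data block) can be pictured with a small in-memory sketch. The class name `GridStore`, its methods, and the layer names used in the usage note are hypothetical and only illustrate the nesting; the text does not specify how the hierarchy is actually represented.

```python
from collections import defaultdict


class GridStore:
    """Toy model of the hierarchy: grid -> time slice -> physical layer -> data block."""

    def __init__(self):
        # grid_id -> time_slice -> layer name -> block bytes
        self._grids = defaultdict(lambda: defaultdict(dict))

    def put(self, grid_id, time_slice, layer, block):
        """Store one data block for a given grid, time slice and physical layer."""
        self._grids[grid_id][time_slice][layer] = block

    def get(self, grid_id, time_slice, layer):
        """Return the data block stored for the given grid, time slice and layer."""
        return self._grids[grid_id][time_slice][layer]

    def layers(self, grid_id, time_slice):
        """List the physical layers present in one time slice of one grid."""
        return sorted(self._grids[grid_id][time_slice])
```

For example, after `store.put(7, "2014-05-16T00", "temperature", b"...")`, the block is addressed by the triple (grid 7, that time slice, that layer), so one grid can hold many periods and several layers per period.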

As an embodiment of the present invention, storing the data in the grids in a plurality of storage nodes according to the spatial positions of the grids further includes storing data that is close in spatial position in dispersed storage nodes.

As an embodiment of the present invention, before the data in the grids is stored in a plurality of storage nodes according to the spatial positions of the grids, the method further includes converting the multiple grids from a multidimensional space to a one-dimensional sequence to obtain grid numbers that reflect the spatial relationship between grids; for example, grids that are closer in spatial position have closer numbers.

As an embodiment of the present invention, the conversion of the multiple grids from a multidimensional space to a one-dimensional sequence further includes a Hilbert curve, a row-order curve or a Z-order curve.
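As an illustration of two of the curves named above, the following sketch numbers two-dimensional grid cells with a Z-order (Morton) code and with a row-order code. The function names and the choice of placing column bits in the even positions are assumptions made for the example; the text does not prescribe a particular bit layout, and the Hilbert curve is omitted for brevity.

```python
def z_order(col: int, row: int, bits: int = 16) -> int:
    """Interleave the bits of (col, row) into a single Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)        # column bits go to even positions
        code |= ((row >> i) & 1) << (2 * i + 1)    # row bits go to odd positions
    return code


def row_order(col: int, row: int, n_cols: int) -> int:
    """Row-order code: number the cells row by row, left to right."""
    return row * n_cols + col
```

With this layout the four cells of a 2X2 block receive the Z-order codes 0, 1, 2 and 3, so cells that are spatially adjacent tend to receive nearby numbers, which is the property the later node mapping relies on.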

As an embodiment of the present invention, after the grid numbers reflecting the spatial relationship between grids are obtained, the method further includes mapping the grid numbers to storage nodes and, according to the mapping, storing data that is close in spatial position in dispersed storage nodes.

As an embodiment of the present invention, storing data that is close in spatial position in dispersed storage nodes further includes storing the data of grids that are close in spatial position in dispersed storage nodes.
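One simple way to obtain the dispersion described above is to assign consecutive grid numbers to storage nodes in round-robin fashion. The sketch below assumes this modulo rule purely for illustration; the text states only that grid numbers are mapped to storage nodes, not which mapping function is used, and the names `node_for` and `build_mapping` are hypothetical.

```python
def node_for(sequence_code: int, n_nodes: int) -> int:
    """Assumed round-robin rule: consecutive codes land on different nodes."""
    return sequence_code % n_nodes


def build_mapping(sequence_codes, n_nodes: int) -> dict:
    """Return a {sequence code: node index} table for the given codes."""
    return {code: node_for(code, n_nodes) for code in sequence_codes}
```

Because spatially close grids have close numbers, any run of `n_nodes` consecutive numbers touches every node once, so a read over a compact region is spread across the nodes instead of concentrating on one of them.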

With the method of the above embodiment, highly parallel data writing can be achieved for massive spatial data of various types; data divided according to spatial attributes is guaranteed to be stored on the nodes in a balanced, secure manner that preserves spatial correlation; and the system is highly scalable, with performance scaling linearly with system expansion and without large numbers of idle nodes or I/O bottleneck nodes, which fulfils the original purpose of distributed system design.

Fig. 2 is a flow chart of a parallel reading method for distributed data based on spatial correlation according to an embodiment of the present invention.

The method includes step 201: determining, according to a read request for data with spatial characteristics, the grids that cover the read request.

Step 202: determining, according to the covering grids, the storage nodes that store the data in those grids.

Step 203: reading the requested data in parallel from the determined storage nodes.

As an embodiment of the present invention, after the requested data is obtained from the determined storage nodes, the method further includes splicing and cropping the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
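The first and last steps of the read flow (finding the covering grids and cropping each grid to the requested range) can be sketched for a rectangular request as follows. The request is assumed to be a half-open rectangle in data-space pixel coordinates, and the default cell size of 250 is taken from the specific embodiment described later; the function names are hypothetical.

```python
def covering_cells(x0, y0, x1, y1, cell=250):
    """Grid (col, row) pairs overlapped by the half-open rectangle [x0, x1) x [y0, y1)."""
    cols = range(x0 // cell, (x1 - 1) // cell + 1)
    rows = range(y0 // cell, (y1 - 1) // cell + 1)
    return [(c, r) for r in rows for c in cols]


def crop_window(col, row, x0, y0, x1, y1, cell=250):
    """Part of one grid cell inside the request, in cell-local pixel coordinates."""
    left, top = col * cell, row * cell
    return (
        max(x0, left) - left,
        max(y0, top) - top,
        min(x1, left + cell) - left,
        min(y1, top + cell) - top,
    )
```

A request covering pixels [200, 600) by [100, 300) overlaps six cells; each cell's block would be fetched from its storage node, cropped to its window, and the cropped pieces spliced into the final result.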

With the above embodiment of the reading method, grid data is selected according to spatial neighborhood relations when data is read, so most storage nodes can be driven to work in parallel, which keeps the I/O performance of the system optimized.

Fig. 3 is a structure diagram of a distributed data storage device based on spatial correlation according to an embodiment of the present invention.

The device includes a grid division unit 301, configured to divide data with spatial characteristics into multiple grids, each grid holding the data of the space in which the grid is located.

A storage unit 302, configured to store the data in the grids in a plurality of storage nodes according to the association between grids.

As an embodiment of the present invention, the data with spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data.

As an embodiment of the present invention, a data space, serving as a container for data with the same spatial attributes, is composed of multiple spatial grids of identical extent. Each grid has a time axis and stores the data of different periods as time slices; each time slice further includes at least one physical layer, and the data is divided into multiple data blocks according to the physical layers.

As an embodiment of the present invention, the storage unit 302 is further configured to store data that is close in spatial position in dispersed storage nodes.

As an embodiment of the present invention, the device further includes a hash computing unit 303, connected between the grid division unit 301 and the storage unit 302, configured to convert the multiple grids from a multidimensional space to a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids;

and a mapping unit 304, connected between the storage unit 302 and the hash computing unit 303, configured to map the sequence code to storage nodes; the storage unit 302 stores the grid data in the storage nodes according to the mapping.

As an embodiment of the present invention, the hash computing unit 303 converts the multiple grids from a multidimensional space to a one-dimensional sequence using a Hilbert curve, a row-order curve or a Z-order curve.

As an embodiment of the present invention, the storage unit 302 is further configured to store the data of grids that are close in spatial position in dispersed storage nodes.

With the device of the above embodiment, highly parallel data writing can be achieved for massive spatial data of various types; data divided according to spatial attributes is guaranteed to be stored on the nodes in a balanced, secure manner that preserves spatial correlation; and the system is highly scalable, with performance scaling linearly with system expansion and without large numbers of idle nodes or I/O bottleneck nodes, which fulfils the original purpose of distributed system design.

Fig. 4 is a structural schematic diagram of a parallel reading device for distributed data based on spatial correlation according to an embodiment of the present invention.

The device includes a request acquisition unit 401, configured to determine, according to a read request for data with spatial characteristics, the grids that cover the read request.

A processing unit 402, configured to determine, according to the covering grids, the storage nodes that store the data in those grids.

A reading unit 403, configured to obtain the requested data from the determined storage nodes.

As an embodiment of the present invention, the device further includes a splicing and cropping unit 404, connected to the reading unit 403, configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.

With the above embodiment of the reading device, grid data is selected according to spatial neighborhood relations when data is read, so most storage nodes can be driven to work in parallel, which keeps the I/O performance of the system optimized.

Fig. 5 is a structure diagram of a distributed data system based on spatial correlation according to an embodiment of the present invention.

The system includes a master node 501, a plurality of storage nodes 502 and a client node 503.

The master node 501 further includes:

A data space unit 5011, configured to establish the description of a data container space with consistent attributes, divide it into multiple spatial grids, and calculate the unique identifier of each grid.

A hash computing unit 5011, configured to convert the multiple grids formed by dividing the data space from a multidimensional space to a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids. The data space serves as a data container with the same spatial feature description: the data space is established according to spatial attributes, coordinate system and data characteristics, and is divided into multiple grids according to the grid division standard, yielding the unique identifier of each grid.

A mapping unit 5012, configured to map the obtained sequence codes to the plurality of storage nodes to obtain a one-to-one mapping between grids and storage nodes.

A request acquisition unit 5013, configured to receive the read request for data with spatial characteristics sent by the client node 503 and determine, according to the read request, the grids that cover it.

A processing unit 5014, configured to determine, according to the covering grids, the storage nodes that store the data in those grids.

The client node 503 further includes:

A grid division unit 5031, configured to divide data with spatial characteristics into multiple grids, each grid holding the data of the space in which the grid is located.

A storage unit 5032, configured to store the data in the grids in the plurality of storage nodes 502 according to the association (that is, the mapping) of the grids.

A reading unit 5033, configured to read the requested data in parallel from the determined storage nodes.

The client node 503 may further include a splicing and cropping unit 5035, configured to splice and crop the obtained data according to the spatial range in the read request to obtain exactly the requested data.

Alternatively, an intermediate layer node 504 is further included between the client node 503 and the storage nodes 502, configured to crop the obtained data according to the spatial range in the read request and send exactly the requested data to the client node 503.

In the above embodiments, the hash computing unit, mapping unit, request acquisition unit and processing unit may be located on the master node (in the embodiments that follow they are all located on the master node), on the storage nodes (that is, with no master node), or on the client node (that is, with no master node). Alternatively, in some embodiments, if the data space has been divided into multiple grids in advance, the system does not need the hash computing unit and mapping unit to divide the grids again: the data input at the client node is directly divided into multiple grids according to the grid division standard of the data space and is then stored in different storage nodes according to the association between the grids (that is, their spatial positions).

A specific embodiment will be used below, and embodiment of the present invention and technological merit is elaborated.

First, systematic parameter is initialized, host node database is established, disposes memory node management module, client is installed Interface library is completed sample data input and is prepared.

Fig. 6 is a schematic diagram of the data space and grid division of the embodiment. This embodiment takes two-dimensional data as an example and creates a two-dimensional data space; in other embodiments a multi-dimensional data space can be created for multi-dimensional data. A data space based on geographic coordinates is created in the most basic two-dimensional data space, covering the whole of China and the surrounding region, where: the coordinate range is 70°-140°E, 15°-55°N; the resolution is 0.02° in the X direction and 0.02° in the Y direction; the earth ellipsoid model follows the WGS-84 standard; the upper left corner is the coordinate origin, the X axis is longitude, positive to the right, and the Y axis is latitude, positive downward; the physical size of the data space is 3500×2000. These parameters are combined into an attribute data record, the data space identifier is set to "ChinaRegion", and the record is stored in the host node database, which completes the creation of the data space. The host node database uses a matrix-style relational database deployed as a cluster of 2-4 nodes, depending on system scale; every node holds a complete copy of the data, and data consistency is maintained at strict transaction level. Through a configuration interface, the physical information of the memory nodes is entered, including address, capacity, interface protocol, primary/standby setting and active state, to build the memory node configuration table.
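The relationship between the coordinate range, the resolution and the 3500×2000 physical size can be checked with a short sketch. The class name `DataSpace` and its methods are hypothetical and introduced only for illustration; the numeric parameters are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSpace:
    """Hypothetical record of the data space parameters described above."""
    name: str
    lon_min: float  # western edge, degrees east
    lon_max: float  # eastern edge, degrees east
    lat_min: float  # southern edge, degrees north
    lat_max: float  # northern edge; the origin is (lon_min, lat_max)
    res_x: float    # degrees per pixel along X
    res_y: float    # degrees per pixel along Y

    def size(self):
        """Physical size of the data space in pixels (width, height)."""
        width = round((self.lon_max - self.lon_min) / self.res_x)
        height = round((self.lat_max - self.lat_min) / self.res_y)
        return width, height

    def to_pixel(self, lon, lat):
        """Pixel position of a coordinate; X grows eastward, Y grows southward."""
        x = round((lon - self.lon_min) / self.res_x)
        y = round((self.lat_max - lat) / self.res_y)
        return x, y


CHINA_REGION = DataSpace("ChinaRegion", 70.0, 140.0, 15.0, 55.0, 0.02, 0.02)
print(CHINA_REGION.size())                 # (3500, 2000)
print(CHINA_REGION.to_pixel(110.0, 45.0))  # (2000, 500)
```

Rounding is used instead of truncation because 0.02 is not exactly representable in binary floating point.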

The next step is to plan the grid division. The grid size must take into account the data type, data volume, data space size and so on; the data blocks produced by the grid division should be of moderate size, preferably in the range of 128KB-4MB, where the I/O performance of the system is best. In this embodiment the grid is designed as a square region of 250×250, so the data space "ChinaRegion" contains 112 grids forming a 14×8 grid matrix, as shown in Fig. 6. For each grid, parameters such as its upper left corner coordinates, X and Y resolution, and X and Y width are recorded. In addition, a grid identifier is generated by a grid identifier calculation function. For example, the grid identifier can be a 64-bit integer in which the high 4 bits represent the data space type, the following 8 bits are reserved, the next 26 bits are the X coordinate value of the upper left corner of the grid, and the last 26 bits are the Y coordinate value of the upper left corner of the grid. For example, the data space type of this embodiment is the China grid, with code 15, and the grid whose upper left corner is 110.0°E, 45.0°N is calculated to have the grid identifier 16142813667133857664. The system also provides an inverse grid identifier calculation function, by which the X and Y coordinate values of the upper left corner of a grid can be computed directly from its grid identifier; for example, for the grid with identifier 16142914330428357664, the inverse calculation gives upper left corner X and Y coordinate values of 120.0°E and 35.0°N. The grid attribute parameters and the grid identifier are combined into a data record and stored in the grid attribute table of the host node database.
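The grid count and the 4/8/26/26-bit identifier layout can be sketched as follows. The bit layout is taken from the text; how a coordinate is turned into a 26-bit integer is not stated, so the scaling used here (longitude offset by 180°, latitude offset by 90°, both in units of 10⁻⁵ degree) is an assumption inferred from the identifier 16142914330428357664, whose top four bits hold the value 14. The first identifier quoted in the text does not round-trip to its stated coordinates under this assumption, so the scaling should be read as illustrative only. All function names are hypothetical.

```python
TYPE_BITS, RESERVED_BITS, COORD_BITS = 4, 8, 26   # layout stated in the text
COORD_MASK = (1 << COORD_BITS) - 1
TYPE_SHIFT = RESERVED_BITS + 2 * COORD_BITS       # 60
X_SHIFT = COORD_BITS                              # 26

# Assumed coordinate encoding (not given in the source).
SCALE = 100_000
LON_OFFSET, LAT_OFFSET = 180.0, 90.0


def grid_matrix_shape(width, height, cell):
    """Number of grid columns and rows needed to cover width x height pixels."""
    return -(-width // cell), -(-height // cell)   # ceiling division


def encode_grid_id(space_type, lon, lat):
    """Pack the space type and upper left corner into a 64-bit identifier."""
    x = round((lon + LON_OFFSET) * SCALE)
    y = round((lat + LAT_OFFSET) * SCALE)
    if not (0 <= space_type < (1 << TYPE_BITS)
            and 0 <= x <= COORD_MASK and 0 <= y <= COORD_MASK):
        raise ValueError("field does not fit its bit width")
    return (space_type << TYPE_SHIFT) | (x << X_SHIFT) | y


def decode_grid_id(grid_id):
    """Inverse calculation: recover (space_type, lon, lat) from an identifier."""
    space_type = grid_id >> TYPE_SHIFT
    x = (grid_id >> X_SHIFT) & COORD_MASK
    y = grid_id & COORD_MASK
    return space_type, x / SCALE - LON_OFFSET, y / SCALE - LAT_OFFSET


print(grid_matrix_shape(3500, 2000, 250))      # (14, 8), i.e. 112 grids
print(decode_grid_id(16142914330428357664))    # (14, 120.0, 35.0)
```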

The grids in the above steps can also hold data at several levels. Fig. 7 is a schematic diagram of the grid data model of the embodiment of the present invention; in this example the model has three levels, namely grid, time slice and physical layer, each containing the next. First, a data space is established according to a spatial range and coordinate system, and the data space is divided into multiple grids; each grid is an independent data cell, and the relationship between grids is maintained by spatial topological association. Second, each grid has a time axis on which data is distributed in the form of time slices; a time slice has a "thickness" attribute, where different thicknesses represent different time ranges, and the thickness is set when the data is input. Third, spatial data generally contains one or more physical levels, so the data within a time slice is divided by attribute into multiple physical layers, and the data is divided by physical layer into multiple data blocks, which become the minimum unit of storage in the system.
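One way to picture the three-level model is as a key that addresses a single data block. The following is only a sketch: the names `BlockKey` and `split_into_blocks`, the use of `datetime` for the time axis, and the layer names in the example are all assumptions, not part of the described system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BlockKey:
    """Hypothetical address of one data block: grid, then time slice, then layer."""
    grid_id: int            # which grid of the data space
    slice_start: datetime   # position of the time slice on the grid's time axis
    thickness: timedelta    # time range covered by the slice, set at input time
    layer: str              # physical layer inside the time slice


def split_into_blocks(grid_id, slice_start, thickness, layers):
    """Split one grid's data for one time slice into per-layer data blocks.

    `layers` maps a physical layer name to that layer's payload.
    """
    return {
        BlockKey(grid_id, slice_start, thickness, name): payload
        for name, payload in layers.items()
    }


blocks = split_into_blocks(
    grid_id=42,
    slice_start=datetime(2014, 5, 16, 0, 0),
    thickness=timedelta(hours=1),
    layers={"band1": b"\x00\x01", "band2": b"\x02\x03"},
)
for key in blocks:
    print(key.grid_id, key.slice_start.isoformat(), key.layer)
```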

The 14×8 grid matrix obtained in the previous step forms a sequence of 112 grid data blocks when data is actually stored. How this one-dimensional sequence is arranged has a significant effect on the I/O performance of the system, so the embodiment of the present invention uses an algorithm model to compute the mapping from the two-dimensional matrix to the one-dimensional sequence.

In this step, the data space "ChinaRegion" uses the Hilbert curve algorithm model to compute the hash of the grid matrix. The Hilbert curve is a fractal curve that can fill an entire plane space, and its value lies in establishing a one-to-one correspondence between one-dimensional space and two-dimensional space. The plane space traversed by a Hilbert curve must be a square matrix of size 2ⁿ×2ⁿ, so the 14×8 grid matrix of the data space "ChinaRegion" has to be extended to a 2⁴×2⁴ matrix space. The Hilbert curve computation is carried out over this space, and the 14×8 matrix space is then cut out starting from the upper left corner as the hash result. The upper part of Fig. 8 is a schematic diagram of the data space "ChinaRegion" overlaid with the Hilbert curve. The result of the Hilbert curve computation is a sequence number representing the rank of a grid of the grid matrix along the Hilbert curve. Two points that are consecutive on the Hilbert curve are also adjacent in two-dimensional space, so data that is closely connected in space also has close sequence codes on the Hilbert curve. Table 1 and Table 2 in Fig. 8 are the correspondence tables between grid identifiers and Hilbert curve sequence numbers for the two regions A and B.
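The extend-compute-cut procedure can be sketched with the standard iterative Hilbert index computation. The curve orientation and starting corner used in Fig. 8 are not recoverable from the text, so the sequence numbers produced here may differ from those in the figure's tables; function names are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve over an n x n square.

    n must be a power of two. This is the widely used iterative formulation;
    the orientation is an assumption and may differ from Fig. 8.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_numbers(cols, rows):
    """Extend cols x rows to the enclosing 2^k square, then keep only the
    cells of the original matrix, starting from the upper left corner."""
    n = 1
    while n < max(cols, rows):
        n *= 2
    return {(x, y): hilbert_index(n, x, y)
            for y in range(rows) for x in range(cols)}


numbers = hilbert_numbers(14, 8)   # 14 x 8 matrix inside a 16 x 16 square
print(len(numbers))                # 112
```

Because cells are cut away after the computation, the surviving sequence numbers are not contiguous; they still preserve the ordering along the curve.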

In other embodiments, other hashing algorithms can also be used. For example, when the spatial range is large and the grid space matrix is larger than 4×4, the hash distribution is computed with the Hilbert curve to obtain a spatially correlated one-dimensional sequence; when the spatial range is small and the grid space matrix is smaller than 4×4, the hash distribution is computed with an ordinary row curve; and when the X and Y dimensions of the grid space matrix differ greatly, the hash distribution is computed with a Z curve.
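The alternative curves and the selection rule can be sketched as below. The text gives no numeric threshold for a "great" difference between the X and Y dimensions and does not say how a matrix of exactly 4×4 is treated, so `max_aspect=4.0`, the handling of the 4×4 boundary and the order of the checks are assumptions; the function names are hypothetical.

```python
def row_index(x, y, cols):
    """Row curve: cells are numbered row by row, left to right."""
    return y * cols + x


def z_order_index(x, y, bits=16):
    """Z curve (Morton order): interleave the bits of x and y, x in even bits."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def choose_curve(cols, rows, small=4, max_aspect=4.0):
    """Pick a curve for a cols x rows grid matrix (thresholds are assumed)."""
    if cols < small and rows < small:
        return "row"
    if max(cols, rows) / min(cols, rows) >= max_aspect:
        return "z"
    return "hilbert"


print(choose_curve(14, 8))   # hilbert
print(choose_curve(3, 3))    # row
print(choose_curve(64, 4))   # z
```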

After the Hilbert sequence numbers are obtained, a Cartesian product operation is performed between these sequence numbers and the memory node configuration table, which yields the mapping between grids and memory nodes. Fig. 9 is a schematic diagram of the mapping between grid identifiers and memory nodes, using region B chosen in Fig. 8 as an example: Table 1 of Fig. 9 is the correspondence table between grid identifiers and Hilbert curve codes, Table 2 of Fig. 9 is the memory node configuration table, and Table 3 of Fig. 9 is the computed grid-to-memory-node mapping table. Table 3 of Fig. 9 shows that the Hilbert curve grid hash distributes the grid sequence of region B evenly across all memory nodes, with nearly identical load on each node: each of the 8 memory nodes stores at least one grid and at most 2 grids. The spatial correlation is maintained across the distributed memory nodes rather than existing within a single memory node, so when grid data is selected by spatial neighborhood for reading, the distributed data system of the present invention can drive most of the memory nodes to work concurrently, which optimizes the I/O performance of the system.
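The text does not spell out the operation that combines sequence numbers with the node configuration table. As a stand-in, the sketch below assigns each grid to a node by taking its Hilbert sequence number modulo the number of nodes, which has the property the paragraph relies on: grids that are consecutive on the curve, and therefore adjacent in space, land on different nodes. The modulo rule, the function name and the sample sequence numbers are assumptions.

```python
from collections import Counter


def map_grids_to_nodes(hilbert_numbers, node_ids):
    """Assign every grid to a node from its Hilbert sequence number.

    Assumed rule: sequence number modulo node count (not confirmed by the
    source, which only names a Cartesian product operation).
    """
    return {grid: node_ids[num % len(node_ids)]
            for grid, num in hilbert_numbers.items()}


# Made-up example: 16 grids with consecutive sequence numbers, 8 nodes.
sample_numbers = {f"g{i}": 32 + i for i in range(16)}
nodes = [f"Node{i}" for i in range(1, 9)]

mapping = map_grids_to_nodes(sample_numbers, nodes)
load = Counter(mapping.values())
print(load)   # every node holds exactly 2 grids in this example
```

With non-contiguous sequence numbers, such as those left after cutting a 14×8 matrix out of a 16×16 square, the per-node counts are no longer exactly equal, which is consistent with the slightly uneven distributions reported for the figures.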

After the above processing is complete, the data space can serve as a data container for input data; any data type that matches the spatial attribute description of the data space "ChinaRegion" can be stored. A typical spatial data type is used here as an example. Satellite remote sensing data is a typical spatial data type whose basic characteristic is raster data continuously distributed in space, which can be described with a spatial grid model. The upper part of Fig. 10 shows a portion of the satellite remote sensing data. A satellite remote sensing image is used here as the input data; the file is named "file1.dat", its spatial range is 100°E-120°E and 30°N-50°N, and its coordinate system, resolution and ellipsoid model are consistent with the data space "ChinaRegion". The lower part of Fig. 10 shows the position of the input data within the grid. The upper part of Fig. 11 shows the grid division of the input data into 16 grid regions, with the Hilbert curve through this region overlaid on the data image. Table 1 of Fig. 11 is the grid spatial attribute table of the input data and Table 2 of Fig. 11 is the grid hash code table; both are obtained through the host node database by a spatial query on the grid attribute table of the data space "ChinaRegion". Fig. 12 shows the mapping between the input data grids and the memory nodes obtained by a further query: Table 1 of Fig. 12 is the mapping table between the 16 grids and the 8 memory nodes, and the lower part of Fig. 12 shows the distribution of the 16 grids over the memory nodes. Each node corresponds to at least one grid; Node2 and Node5 correspond to the most grids, 3 each; the overall distribution is fairly balanced.
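The 16 grid regions of "file1.dat" follow directly from the data space parameters: a 250-pixel grid at 0.02° per pixel spans 5°, so a 20°×20° extent aligned to the grid covers 4×4 grids. A minimal sketch, with a hypothetical function name and grids addressed by (column, row) from the upper left corner:

```python
LON_MIN, LAT_MAX = 70.0, 55.0   # origin of the data space "ChinaRegion"
RES = 0.02                      # degrees per pixel in X and Y
CELL = 250                      # grid edge length in pixels


def covering_grids(lon_w, lon_e, lat_s, lat_n):
    """(column, row) indices of all grids that overlap the given extent."""
    # Work in integer pixels to avoid floating point edge effects.
    x0 = round((lon_w - LON_MIN) / RES)
    x1 = round((lon_e - LON_MIN) / RES)
    y0 = round((LAT_MAX - lat_n) / RES)
    y1 = round((LAT_MAX - lat_s) / RES)
    cols = range(x0 // CELL, -(-x1 // CELL))   # floor .. ceiling
    rows = range(y0 // CELL, -(-y1 // CELL))
    return [(c, r) for r in rows for c in cols]


grids = covering_grids(100.0, 120.0, 30.0, 50.0)
print(len(grids))            # 16
print(grids[0], grids[-1])   # (6, 1) (9, 4)
```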

Fig. 13 is a schematic diagram of data writing. According to the grid division standard, the input data is first split into 16 grid data volumes; then, according to the mapping between grids and memory nodes, the client uses the memory node addresses obtained to write the 16 grid data volumes in parallel to the memory nodes (Node1, Node2 ... Node8); finally the host node is notified that the data was written successfully. The main data flow of the write process arises only between the client and the memory nodes, and only control information is transferred between the client and the host node, so the load of the whole system is well distributed and no I/O bottleneck appears.
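The write path can be sketched with in-process stand-ins for the memory nodes and the host node. Everything here is hypothetical scaffolding (the `MemoryNode` class, the `notify_host` callback, the modulo mapping in the example); a real deployment would write over the network using the node addresses from the configuration table.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MemoryNode:
    """In-process stand-in for a memory node (assumption for illustration)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self._lock = threading.Lock()

    def write(self, grid_id, payload):
        with self._lock:
            self.blocks[grid_id] = payload
        return grid_id


def parallel_write(blocks, grid_to_node, nodes, notify_host):
    """Write every grid data volume to its node in parallel, then send only
    control information (the list of written grids) to the host node."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(nodes[grid_to_node[g]].write, g, data)
                   for g, data in blocks.items()]
        written = [f.result() for f in futures]
    notify_host(sorted(written))
    return written


nodes = {f"Node{i}": MemoryNode(f"Node{i}") for i in range(1, 9)}
blocks = {g: bytes([g]) * 4 for g in range(16)}               # 16 grid data volumes
grid_to_node = {g: f"Node{g % 8 + 1}" for g in range(16)}     # made-up mapping
host_log = []

parallel_write(blocks, grid_to_node, nodes, host_log.append)
print({name: len(node.blocks) for name, node in nodes.items()})
```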

Fig. 14 is a schematic diagram of writing multiple files into the data space. file1.dat, file2.dat and file3.dat are 3 data files with the same spatial attributes, each written according to the grid division of the data space "ChinaRegion". Fig. 14 shows the image of the entire data space after the 3 files have been input: the three pieces of data are spliced into a whole, and subsequent data access addresses only the data space, without needing to know which file the data belongs to.

The distributed data system of the present invention has more pronounced advantages and efficiency in data reading. Data reading can be regarded as the inverse of data writing, but with a well-chosen hashing algorithm the parallelism of the read process improves qualitatively. Taking the data space "ChinaRegion" described above, into which 3 files have been written, data is read by a rectangular region and by a polygonal region to illustrate the parallelism and load of the whole system.

Fig. 15 is a schematic diagram of reading a subset of the data space by a rectangular region. It shows a rectangular region selected at the far left of the data space, with range 95°E-105°E and 30°N-45°N, containing 9 grids in total; part B of Fig. 15 is the final image of the data in this rectangular region. The client obtains from the host node database the grid identifiers (Grid_ID) of all grids covering the input region, then obtains the corresponding memory node numbers (Node_ID) from the Hilbert sequence numbers (Hilbert_Num), and finally obtains the memory node addresses (IP_Addr) from the memory node numbers. Table 1 of Fig. 15 is the resulting grid-to-memory-node mapping table, which determines the storage location of each grid, and the client reads the grid data volumes in parallel from the relevant memory nodes. The data flow arises only between the memory nodes and the client, so the system load is well distributed. The lower part of Fig. 15 shows the load of each memory node: the 9 grids involved in this read are distributed very evenly over the 8 nodes; only Node6 has to provide 2 grids, and every other node provides just one. No memory node sits idle and no node is under high I/O pressure, so the system is in a good load state. This is where the distributed data system of the present invention shows its greatest advantage over other distributed systems.
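The parallel read and the per-node load can be sketched as follows. The grid numbers, the mapping and the `fetch` callback are invented so that the example reproduces the load pattern described for Fig. 15 (9 grids over 8 nodes, one node serving 2); they are not values from the figure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def node_load(grid_ids, grid_to_node):
    """Number of grids each memory node has to serve for one read request."""
    return Counter(grid_to_node[g] for g in grid_ids)


def parallel_read(grid_ids, grid_to_node, fetch):
    """Fetch every grid from its node concurrently; returns grid -> payload.

    `fetch(node, grid)` is a caller-supplied function standing in for the
    network read from a memory node.
    """
    workers = max(1, len(set(grid_to_node[g] for g in grid_ids)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pairs = pool.map(lambda g: (g, fetch(grid_to_node[g], g)), grid_ids)
        return dict(pairs)


# Made-up request: 9 grids with consecutive sequence numbers 5..13.
requested = list(range(5, 14))
grid_to_node = {g: f"Node{g % 8 + 1}" for g in requested}

load = node_load(requested, grid_to_node)
data = parallel_read(requested, grid_to_node,
                     fetch=lambda node, grid: f"{node}:{grid}")
print(load["Node6"], max(load.values()), len(load))   # 2 2 8
```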

For data reading, the embodiment of the present invention can also provide a spatial processing middle layer, either built into the client or connected between the client and the memory nodes, which splices the scattered grid data into a whole, crops it precisely to the spatial region of the input request, and returns to the client a data volume that meets the user's requirement. Fig. 16 illustrates data reading for a polygonal region. Part A of Fig. 16 shows the position of the input polygonal region in the data space; using a spatial retrieval algorithm, the grid sequence covering this polygonal region is obtained, and part B of Fig. 16 shows the minimum bounding grid rectangle of the polygonal region, which involves 9 grids in total. The grid data is read from each memory node in the parallel manner described, fed into the spatial processing middle layer, and spliced, cropped and reassembled according to spatial algorithms. Part C of Fig. 16 shows the final read result, which is returned to the client through a stream socket.
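The splice-then-crop step of the middle layer can be sketched with plain nested lists. This sketch crops to a rectangular pixel window only; masking to an arbitrary polygon, as in Fig. 16, would need an additional point-in-polygon test that is not shown. Function names and the tiny 2×2-pixel grids in the example are assumptions for illustration.

```python
def splice(tiles, cell):
    """Join grid tiles into one mosaic.

    `tiles` maps (column, row) to a cell x cell list of rows. Returns the
    mosaic and the pixel coordinates of its upper left corner.
    """
    cols = sorted({c for c, _ in tiles})
    rows = sorted({r for _, r in tiles})
    mosaic = []
    for r in range(rows[0], rows[-1] + 1):
        for line in range(cell):
            out = []
            for c in range(cols[0], cols[-1] + 1):
                out.extend(tiles[(c, r)][line])
            mosaic.append(out)
    return mosaic, (cols[0] * cell, rows[0] * cell)


def crop(mosaic, origin, x0, y0, x1, y1):
    """Cut the pixel window [x0, x1) x [y0, y1) out of the mosaic."""
    ox, oy = origin
    return [row[x0 - ox:x1 - ox] for row in mosaic[y0 - oy:y1 - oy]]


def make_tile(c, r, cell):
    """Test tile whose pixels carry their own data space coordinates."""
    return [[(c * cell + k, r * cell + line) for k in range(cell)]
            for line in range(cell)]


CELL = 2
tiles = {(c, r): make_tile(c, r, CELL) for c in (3, 4) for r in (1, 2)}
mosaic, origin = splice(tiles, CELL)
window = crop(mosaic, origin, 7, 3, 9, 5)
print(origin)   # (6, 2)
print(window)   # [[(7, 3), (8, 3)], [(7, 4), (8, 4)]]
```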

Fig. 17 shows the data reading flow of the embodiment of the present invention. First, the user issues a data read request through the client (for example the dotted region within the grid in the figure), which is sent to the host node; the data read request includes attribute parameters such as space, time and physical layer. Then the host node uses a spatial retrieval algorithm to query the minimum grid sequence covering the input spatial range, for example the 16 grids in the figure, and returns the grid identifiers to the client. Next, the client submits the grid identifier sequence to the host node and queries the host node database for the memory nodes corresponding to the grids, and the client requests the grid data from each node in parallel. Each memory node transfers its grid data to the spatial processing middle layer, where the data is spliced and cropped to form the data of the dotted region requested by the client. Finally, the spatial processing middle layer returns the final result to the client through a stream socket, completing the data read. Throughout the read process the data flow and the control flow are completely separated: the I/O pressure of the data flow is spread over the memory nodes, while the host node carries no data flow I/O and is responsible only for control information and metadata queries. The system load is therefore well balanced and suited to large-scale data applications.

With the above embodiments, highly parallel data reading and writing can be achieved for massive spatial data of all types, and the data divided by spatial attribute is guaranteed to be stored on the nodes in a balanced, spatially correlated and secure manner. The system adapts to many forms of data reading and provides users with a genuinely data-centered service rather than simple file reading. At the same time, the system is highly extensible, and performance grows linearly with system expansion; large numbers of idle nodes or I/O bottleneck nodes do not appear, which realizes the original intent of distributed system design, namely dispersing pressure and improving efficiency.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A distributed data system based on spatial correlation, characterized by comprising:

a mesh generation unit, configured to divide data having spatial characteristics, time slices and physical characteristics into multiple grids, each grid holding the data of the space in which the grid lies, each grid having a time axis on which data of different periods is stored in the form of time slices, each time slice further including at least one physical layer, and the data being divided into multiple data blocks according to physical layer; wherein the data with spatial characteristics includes: geographically based spatial entity data, continuously distributed spatial raster data, discretely distributed information point data, images with row and column coordinates, and scientific matrix data;

a storage unit, configured to store the data in the grids in a plurality of memory nodes according to the association relationship of the grids;

an acquisition request unit, configured to determine, from a read request for data with spatial characteristics, the grids that cover the read request;

a parallel reading unit, configured to read the requested data in parallel from the determined memory nodes;

and further comprising:

a hash computing unit, configured to convert the multiple grids formed by dividing the data space from multi-dimensional space to a one-dimensional sequence, obtaining sequence codes that reflect the spatial relationship between the grids;

a map unit, configured to map the grids to the memory nodes.

2. The distributed data system based on spatial correlation according to claim 1, characterized in that

the hash computing unit, map unit, acquisition request unit and processing unit are located at the host node.

3. The distributed data system based on spatial correlation according to claim 2, characterized by further comprising a middle layer node, connected between the client node and the memory nodes, configured to splice and crop the data of the covering grids according to the spatial range in the read request, so as to obtain exactly the requested data.