CN103995861B - Distributed data device, method and system based on spatial correlation - Google Patents
- Publication number
- CN103995861B (application CN201410208628.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- grid
- space
- node
- spatial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of data systems, and in particular to a distributed data device, method and system based on spatial correlation. The storage method includes dividing data having spatial characteristics into multiple grids, each grid holding the data of the space in which it lies, and storing the data of the grids in a plurality of storage nodes according to the association between the spatial positions of the grids. The beneficial effects of the invention are as follows: for massive spatial data of various types, highly parallel data writing and reading can be achieved; the data, divided according to its spatial attributes, is stored on the nodes in a balanced and secure manner that preserves spatial association; and the system is highly scalable, with performance that scales linearly as the system expands. The system does not end up with large numbers of idle nodes or with I/O bottleneck nodes, which fulfils the original purpose of distributed system design.
Description
Technical Field
The present invention relates to the technical field of data systems, and in particular to a distributed data device, method and system based on spatial correlation.
Background
After years of development, distributed data systems have become an important solution for storing and using massive data with high efficiency, high availability and good cost-effectiveness, and they play a decisive role in advancing cloud computing and big data applications. The core idea of a distributed data system is to disperse data storage: the data is divided into standard subsets, the subsets are stored on multiple nodes, and the location information of each subset is stored on a master node. When data is read, each storage node is responsible only for providing its own data subsets; a client interface component reassembles the subsets and finally delivers the data file or record. By splitting the stored data and dispersing reads, the I/O pressure on any single node is greatly reduced and multiple nodes can be read in parallel. The performance of the whole data system improves substantially, allowing it to meet the challenges of rapidly growing data volumes and ever higher application demands.
The distributed data systems implemented in the prior art mainly process data files or data records, and are in essence still extensions of a file system or a database storage system. Current distributed data systems are therefore mostly implemented as distributed file systems, in which the file is the basic processing unit and subsets of a file are dispersed across the nodes. Data splitting in a distributed file system is based entirely on physical cutting, and data subsets are associated only by their position within the file. Moreover, a distributed file system has no attribute association between data files, and the storage locations of files and their data subsets are assigned by random rules. Such a system performs well when whole files are read or written. However, when part of a file is read according to data attributes, or when several files are read in association, the storage nodes can hardly maintain a reasonable load balance: a small number of nodes bear most of the I/O pressure while a large number of nodes sit idle, which severely degrades system performance.
Prior art one provides a distributed file system, for example the Hadoop Distributed File System (HDFS), which is designed to run on commodity hardware. HDFS provides high-throughput data access and is well suited to storing and using large-scale data. HDFS uses the file as its storage unit, physically splits each file into data blocks of fixed capacity (64 MB), and disperses the blocks across storage nodes with multiple replicas. HDFS consists mainly of one name node and multiple data nodes. The name node maintains the file name index, determines the mapping between data blocks and data nodes, receives data read and write requests, and provides data as a data stream.
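As a rough illustration of the purely physical, fixed-size splitting described above, the following sketch computes block boundaries for a file. It assumes the 64 MB block capacity mentioned in the text; the function name split_into_blocks is hypothetical, and this is not the HDFS API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed block capacity cited in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for fixed-size physical blocks.

    Hypothetical helper for illustration only; it models byte-offset
    cutting and knows nothing about the content of the file.
    """
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


# A 150 MB file is cut into two full blocks and one partial block.
blocks = split_into_blocks(150 * 1024 * 1024)
for offset, length in blocks:
    print(offset, length)
```

The cut points depend only on byte offsets, so records that are adjacent in space or time can end up in unrelated blocks.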
However, the inventors found that prior art one has at least the following disadvantages. First, HDFS is a distributed implementation of a file system whose processing subject is still the individual file, so it cannot be called a true data system. Second, for data with obvious attribute associations, such as data types with spatial, temporal or hierarchical attributes, HDFS cannot recognize the attribute information, cannot split the data according to attribute association, and cannot establish associations between data blocks, which leads to an unreasonable data distribution. Third, because HDFS does not split and store data according to data attributes, it cannot provide reasonable load balancing when data is read according to some attribute feature; the I/O pressure on some nodes becomes excessive and overall performance declines. Finally, HDFS data access is file-based: it cannot access data by attribute feature, cannot provide attribute-based operations such as querying, merging and cropping, and cannot read and recombine data from multiple files according to attribute association.
Prior art two provides the Google BigTable distributed database system (BigTable: A Distributed Storage System for Structured Data), which can store massive data records across hundreds or thousands of nodes. BigTable is based on a Key/Value storage structure and introduces the concept of a column family: a record consists of a Key and one or more column families, and is stored by column family. Each BigTable data table consists of multiple Tablets. A Tablet is a unit holding a set of data records, its capacity is generally limited to 100-200 MB, and Tablet Servers are responsible for Tablet location information. BigTable builds a Tablet index structure in the form of a B+ tree and uses the Google Chubby service for highly available access lock control. A user access connects through Chubby to the Root Tablet, then locates the specific Tablet, and finally obtains the corresponding column family by Key. BigTable is well suited to storing huge volumes of unstructured or semi-structured data and provides good distributed parallel read/write performance.
However, the inventors found that prior art two has at least the following disadvantages. First, the Key/Value structure of BigTable has only a single key, generally a character string, which cannot describe keys with a multidimensional attribute structure, such as spatial data with spatial position, time and hierarchy attributes. Second, BigTable allocates stored data by order of arrival or at random and forms the different Tablets on that basis. This places spatially associated data unreasonably, so data accesses frequently hit a single Tablet and the advantage of a distributed system is lost. Third, the attribute association between BigTable data records, such as spatial adjacency, time order or hierarchical relationship, cannot be determined from the Key. As a result, tasks cannot be distributed and processed according to attribute features when data is queried and read, and system performance becomes unbalanced.
Summary of the Invention
In view of the defects and shortcomings of current distributed file systems, the present invention designs and implements a distributed data device, method and system based on spatial correlation for data types with spatial attributes. The data system as a whole has no concept of a file. The system establishes different data spaces from standard spatial parameters. Input data is divided into grids according to the matching data space, a distribution sequence code is computed by hashing the grids, and the gridded data blocks are stored on the distributed nodes according to this code. The distributed system is thus a truly data-oriented system rather than a simple file system: the data and data blocks in the system are split and distributed entirely according to their own spatial attributes. A read may involve any region of the data space; the system automatically splices and crops the data according to the spatial range and delivers the final result as a data stream or a file.
An embodiment of the present invention provides a distributed data storage method based on spatial correlation, including:
dividing data having spatial characteristics into multiple grids, each grid holding the data of the space in which it lies;
storing the data of the grids in a plurality of storage nodes according to the association between the spatial positions of the grids.
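The first step of the method, dividing spatial data into grids, can be sketched as follows. This is a minimal illustration that assumes two-dimensional point records and square cells of a fixed size; the function name partition_into_grids and the record layout are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def partition_into_grids(points, cell_size):
    """Group (x, y, value) records by the grid cell that contains them.

    Hypothetical helper: cells are squares of side `cell_size`, and a
    cell is identified by its (column, row) index.
    """
    grids = defaultdict(list)
    for x, y, value in points:
        # Floor division maps a coordinate to the index of its cell.
        cell = (int(x // cell_size), int(y // cell_size))
        grids[cell].append((x, y, value))
    return dict(grids)


points = [(0.5, 0.5, "a"), (1.5, 0.2, "b"), (0.9, 0.1, "c"), (3.2, 2.7, "d")]
grids = partition_into_grids(points, cell_size=1.0)
for cell, records in sorted(grids.items()):
    print(cell, records)
```

Each resulting cell then becomes the unit that is placed on a storage node in the second step.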
According to a further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, the data having spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data.
According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, a data space serves as a container for data with the same spatial attributes and consists of multiple spatial grids of identical extent. Each grid has a time axis and stores the data of different periods as time slices. Each time slice further includes at least one physical layer, and the data is divided into multiple data blocks by physical layer.
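The grid, time slice and physical layer hierarchy described above can be pictured with a small data model. The sketch below is only one possible reading of that description; the class names GridCell and TimeSlice, the string time keys and the layer names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSlice:
    """One period on a grid's time axis (hypothetical structure)."""
    time_key: str
    layers: dict = field(default_factory=dict)  # layer name -> data block (bytes)


@dataclass
class GridCell:
    """One spatial grid of the data space (hypothetical structure)."""
    cell_id: tuple
    slices: dict = field(default_factory=dict)  # time key -> TimeSlice

    def put_block(self, time_key, layer, block):
        # Create the time slice on first use, then store one block per layer.
        ts = self.slices.setdefault(time_key, TimeSlice(time_key))
        ts.layers[layer] = block

    def get_block(self, time_key, layer):
        return self.slices[time_key].layers[layer]


cell = GridCell(cell_id=(3, 2))
cell.put_block("2014-05-16", "band1", b"\x01\x02")
cell.put_block("2014-05-16", "band2", b"\x03\x04")
print(sorted(cell.slices["2014-05-16"].layers))
```

Under this reading, a data block is addressed by the triple (grid, time slice, physical layer).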
According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, storing the data of the grids in a plurality of storage nodes according to the association between the spatial positions of the grids further includes storing data that is close in spatial position on dispersed storage nodes.
According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, before the data of the grids is stored in a plurality of storage nodes according to the spatial positions of the grids, the method further includes converting the multiple grids from multidimensional space into a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids.
According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, the methods for converting the multiple grids from multidimensional space into a one-dimensional sequence include the Hilbert curve, the row-order curve or the Z-order curve.
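Two of the curves named above are simple enough to sketch directly. The following illustration assumes two-dimensional grids identified by non-negative integer (column, row) indices; the function names are hypothetical, and the patent does not prescribe this exact bit layout.

```python
def row_order_code(col, row, n_cols):
    """Row-order curve: number the cells row by row, left to right."""
    return row * n_cols + col


def z_order_code(col, row, bits=16):
    """Z-order (Morton) curve: interleave the bits of col and row.

    Bit i of `col` goes to bit 2*i of the code and bit i of `row`
    goes to bit 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


# The four cells of a 2 x 2 block receive four consecutive Z-order codes.
for row in range(2):
    for col in range(2):
        print((col, row), row_order_code(col, row, n_cols=4), z_order_code(col, row))
```

The row-order code keeps neighbours within a row close together, while the Z-order code keeps small square blocks together, which matters when read requests are rectangular regions.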
According to another further aspect of the distributed data storage method based on spatial correlation of the embodiment of the present invention, after the sequence code reflecting the spatial relationship between grids is obtained, the method further includes mapping the sequence code to the storage nodes and, according to the mapping, storing data that is close in spatial position on dispersed storage nodes.
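One simple way to obtain such a mapping is round-robin assignment by sequence code, so that consecutive codes, which belong to spatially close grids, land on different nodes. The sketch below assumes this modulo rule; the patent does not state that it uses this particular rule, and the node names are placeholders.

```python
def map_codes_to_nodes(codes, nodes):
    """Assign each sequence code to a storage node in round-robin order.

    Assumption for illustration: node index = code modulo node count.
    """
    return {code: nodes[code % len(nodes)] for code in codes}


nodes = ["node-0", "node-1", "node-2", "node-3"]
mapping = map_codes_to_nodes(range(8), nodes)
for code, node in mapping.items():
    print(code, node)
```

With this rule, any run of consecutive codes no longer than the number of nodes is spread over distinct nodes.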
An embodiment of the present invention further provides a distributed data parallel reading method based on spatial correlation, including:
determining, according to a read request for data having spatial characteristics, the grids that cover the read request;
determining, according to the covering grids, the storage nodes that store the data of those grids;
reading the requested data in parallel from the determined storage nodes.
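The three steps of the reading method can be sketched as follows for a rectangular request. This is an illustration under assumptions: the request is an axis-aligned bounding box, cells are squares of a fixed size, and the callables cell_to_node and read_cell are hypothetical stand-ins for the mapping lookup and the per-node read.

```python
from concurrent.futures import ThreadPoolExecutor


def covering_cells(xmin, ymin, xmax, ymax, cell_size):
    """Step 1: list the (col, row) cells that cover a bounding box.

    A box edge lying exactly on a cell boundary also selects the next cell.
    """
    c0, c1 = int(xmin // cell_size), int(xmax // cell_size)
    r0, r1 = int(ymin // cell_size), int(ymax // cell_size)
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def parallel_read(cells, cell_to_node, read_cell):
    """Steps 2 and 3: group cells by node, then read each node in parallel."""
    by_node = {}
    for cell in cells:
        by_node.setdefault(cell_to_node(cell), []).append(cell)

    def read_from_node(item):
        node, node_cells = item
        return {cell: read_cell(node, cell) for cell in node_cells}

    result = {}
    with ThreadPoolExecutor(max_workers=max(1, len(by_node))) as pool:
        for part in pool.map(read_from_node, by_node.items()):
            result.update(part)
    return result


cells = covering_cells(0.5, 0.5, 2.5, 1.5, cell_size=1.0)
data = parallel_read(
    cells,
    cell_to_node=lambda cell: (cell[0] + cell[1]) % 3,          # placeholder mapping
    read_cell=lambda node, cell: f"block {cell} from node {node}",  # placeholder read
)
print(len(cells), "cells read")
```

One worker per involved node is used here so that every node serving part of the request is read at the same time.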
According to a further aspect of the distributed data parallel reading method based on spatial correlation of the embodiment of the present invention, after the requested data is obtained from the determined storage nodes, the method further includes splicing and cropping the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
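Splicing and cropping can be illustrated with small raster tiles. The sketch below assumes each grid holds a square tile stored as a list of rows, that the covering grids form a complete rectangle, and that the request is given in global pixel coordinates with exclusive upper bounds; the function names stitch and crop are hypothetical.

```python
def stitch(tiles, tile_size):
    """Splice tiles keyed by (col, row) into one mosaic.

    Returns the mosaic (list of rows) and the global (x, y) of its
    top-left pixel. Assumes the tiles form a complete rectangle.
    """
    cols = sorted({c for c, _ in tiles})
    rows = sorted({r for _, r in tiles})
    mosaic = []
    for r in rows:
        for i in range(tile_size):
            line = []
            for c in cols:
                line.extend(tiles[(c, r)][i])
            mosaic.append(line)
    origin = (cols[0] * tile_size, rows[0] * tile_size)
    return mosaic, origin


def crop(mosaic, origin, x0, y0, x1, y1):
    """Cut the window [x0, x1) x [y0, y1) out of the mosaic."""
    ox, oy = origin
    return [line[x0 - ox:x1 - ox] for line in mosaic[y0 - oy:y1 - oy]]


# Four 2 x 2 tiles whose pixel value encodes the global position as x + 10 * y.
tiles = {
    (c, r): [[(c * 2 + j) + 10 * (r * 2 + i) for j in range(2)] for i in range(2)]
    for c in range(2)
    for r in range(2)
}
mosaic, origin = stitch(tiles, tile_size=2)
window = crop(mosaic, origin, 1, 1, 3, 3)
print(window)
```

The cropped window spans all four tiles, which is the case where splicing before cropping is needed.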
An embodiment of the present invention further provides a distributed data storage device based on spatial correlation, including:
a grid division unit, configured to divide data having spatial characteristics into multiple grids, each grid holding the data of the space in which it lies;
a storage unit, configured to store the data of the grids in a plurality of storage nodes according to the association between the grids.
According to a further aspect of the distributed data storage device based on spatial correlation of the embodiment of the present invention, the device further includes a hash computing unit, connected between the grid division unit and the storage unit and configured to convert the multiple grids from multidimensional space into a one-dimensional sequence to obtain grid numbers that reflect the spatial relationship between grids; and a mapping unit, connected between the storage unit and the hash computing unit and configured to map the grids to the storage nodes.
According to another further aspect of the distributed data storage device based on spatial correlation of the embodiment of the present invention, the hash computing unit converts the multiple grids from multidimensional space into a one-dimensional sequence using methods including the Hilbert curve, the row-order curve or the Z-order curve.
An embodiment of the present invention further provides a distributed data parallel reading device based on spatial correlation, including:
a request acquisition unit, configured to determine, according to a read request for data having spatial characteristics, the grids that cover the read request;
a processing unit, configured to determine, according to the covering grids, the storage nodes that store the data of those grids;
a parallel reading unit, configured to read the requested data in parallel from the determined storage nodes.
According to a further aspect of the distributed data reading device based on spatial correlation of the embodiment of the present invention, the device further includes a splicing and cropping unit, connected to the reading unit and configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
An embodiment of the present invention further provides a distributed data system based on spatial correlation, including:
a grid division unit, configured to divide data having spatial characteristics into multiple grids, each grid holding the data of the space in which it lies;
a storage unit, configured to store the data of the grids in a plurality of storage nodes according to the association between the grids;
a request acquisition unit, configured to determine, according to a read request for data having spatial characteristics, the grids that cover the read request;
a processing unit, configured to determine, according to the covering grids, the storage nodes that store the data of those grids;
a parallel reading unit, configured to read the requested data in parallel from the determined storage nodes.
According to a further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the system further includes a hash computing unit, configured to convert the multiple grids formed by dividing the data space from multidimensional space into a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids;
and a mapping unit, configured to map the grids to the storage nodes.
According to another further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the hash computing unit, mapping unit, request acquisition unit and processing unit are located on the master node;
the grid division unit, storage unit and parallel reading unit are located on the client node.
According to another further aspect of the distributed data system based on spatial correlation of the embodiment of the present invention, the system further includes a middle-layer node, connected between the client node and the storage nodes and configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
With the method and device of the above embodiments, highly parallel data reading and writing can be achieved for massive spatial data of various types. Data divided according to its spatial attributes is stored on the nodes in a balanced and secure manner that preserves spatial association, and the system adapts to various forms of data reading applications, providing users with a truly data-centric service rather than simple file reading. At the same time, the system is highly scalable, and its performance scales linearly as the system expands. The system does not end up with large numbers of idle nodes or with I/O bottleneck nodes, which fulfils the original purpose of distributed system design: dispersing pressure and improving efficiency.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of a distributed data storage method based on spatial correlation according to an embodiment of the present invention;
Fig. 2 is a flowchart of a distributed data parallel reading method based on spatial correlation according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a distributed data storage device based on spatial correlation according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a distributed data parallel reading device based on spatial correlation according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a distributed data system based on spatial correlation according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the data space and grid division of an embodiment;
Fig. 7 is a schematic diagram of the grid data model of an embodiment of the present invention;
Fig. 8 is a schematic diagram of the correspondence between grid identifiers and hash sequence numbers in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the mapping between grid identifiers and storage nodes in an embodiment of the present invention;
Fig. 10 is a schematic diagram of satellite remote sensing data and its position in the grid in an embodiment of the present invention;
Fig. 11 is an attribute table of the grid data of an embodiment of the present invention;
Fig. 12 is a schematic diagram of the mapping between the coarse gridding and the storage nodes in an embodiment of the present invention;
Fig. 13 is a schematic diagram of writing data in an embodiment of the present invention;
Fig. 14 is a schematic diagram of writing multiple files into the data space in an embodiment of the present invention;
Fig. 15 is a schematic diagram of reading a subset of the data space by rectangular region in an embodiment of the present invention;
Fig. 16 is a schematic diagram of reading a polygonal region of the data space in an embodiment of the present invention;
Fig. 17 is a flowchart of reading a polygonal region of the data space in an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions are used to explain the present invention and do not limit it.
Fig. 1 is a flowchart of a distributed data storage method based on spatial correlation according to an embodiment of the present invention.
The method includes step 101: data having spatial characteristics is divided into multiple grids, each grid holding the data of the space in which it lies.
Step 102: the data of the grids is stored in a plurality of storage nodes according to the association between the spatial positions of the grids.
As an embodiment of the present invention, the data having spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data. Data having spatial characteristics is a data set that describes, in a specific coordinate system, information such as attributes and physical, statistical or natural properties at continuous or discrete positions in space. It mainly includes geography-based spatial entity data, continuously distributed spatial grid data, discretely distributed information point data, and images and scientific matrix data with row and column coordinates.
As an embodiment of the present invention, a data space serves as a container for data with the same spatial attributes and consists of multiple spatial grids of identical extent. Each grid has a time axis for storing the data of different periods. Each period further includes at least one physical layer, and the data is divided into multiple data blocks by physical layer.
As an embodiment of the present invention, storing the data of the grids in a plurality of storage nodes according to the spatial positions of the grids further includes storing data that is close in spatial position on dispersed storage nodes.
As an embodiment of the present invention, before the data of the grids is stored in a plurality of storage nodes according to the spatial positions of the grids, the method further includes converting the multiple grids from multidimensional space into a one-dimensional sequence to obtain grid numbers that reflect the spatial relationship between grids; for example, grids that are closer in spatial position have closer numbers.
As an embodiment of the present invention, the methods for converting the multiple grids from multidimensional space into a one-dimensional sequence include the Hilbert curve, the row-order curve or the Z-order curve.
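The property that grids close in space receive close numbers is strongest for the Hilbert curve, where consecutive numbers always belong to edge-adjacent cells. The sketch below uses the widely known iterative coordinate-to-index conversion for a square grid whose side length n is a power of two; it is offered as an illustration of the curve, not as the patent's own numbering scheme, and the function name hilbert_index is hypothetical.

```python
def hilbert_index(n, x, y):
    """Return the position of cell (x, y) along a Hilbert curve.

    `n` is the side length of the square grid and must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Number every cell of a 4 x 4 grid and list the cells in curve order.
index = {(x, y): hilbert_index(4, x, y) for x in range(4) for y in range(4)}
path = sorted(index, key=index.get)
print(path)
```

Walking the cells in the order of their numbers never jumps: each step moves to a neighbouring cell, which is what makes the number a useful proxy for spatial proximity.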
As an embodiment of the present invention, after the grid numbers reflecting the spatial relationship between grids are obtained, the method further includes mapping the grid numbers to the storage nodes and, according to the mapping, storing data that is close in spatial position on dispersed storage nodes.
As an embodiment of the present invention, storing data that is close in spatial position on dispersed storage nodes further includes storing the data of grids that are close in spatial position on dispersed storage nodes.
With the method of the above embodiment, highly parallel data writing can be achieved for massive spatial data of various types. Data divided according to its spatial attributes is stored on the nodes in a balanced and secure manner that preserves spatial association. At the same time, the system is highly scalable, its performance scales linearly as the system expands, and it does not end up with large numbers of idle nodes or with I/O bottleneck nodes, which fulfils the original purpose of distributed system design.
Fig. 2 is a flowchart of a distributed data parallel reading method based on spatial correlation according to an embodiment of the present invention.
The method includes step 201: according to a read request for data having spatial characteristics, the grids that cover the read request are determined.
Step 202: according to the covering grids, the storage nodes that store the data of those grids are determined.
Step 203: the requested data is read in parallel from the determined storage nodes.
As an embodiment of the present invention, after the requested data is obtained from the determined storage nodes, the method further includes splicing and cropping the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
With the above embodiment of the reading method, grid data is selected according to spatial adjacency when data is read, so most storage nodes can be driven to work in parallel, which helps optimize the I/O performance of the system.
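How evenly a read is spread over the nodes can be checked by counting, per node, the grids that a request touches. The sketch below assumes a row-order sequence code and modulo placement, neither of which is confirmed by the patent as its actual scheme; node_load is a hypothetical helper.

```python
from collections import Counter


def node_load(cells, n_cols, n_nodes):
    """Count how many of the requested cells each storage node serves."""
    load = Counter()
    for col, row in cells:
        code = row * n_cols + col   # row-order sequence code (assumed)
        load[code % n_nodes] += 1   # modulo placement (assumed)
    return load


# A 4 x 2 window of adjacent cells in a grid that is 8 cells wide, on 4 nodes.
window = [(c, r) for r in (1, 2) for c in range(2, 6)]
load = node_load(window, n_cols=8, n_nodes=4)
print(dict(load))
```

With this particular pairing of curve and mapping, a window one column wide would land on a single node, which is one reason the choice of curve and of the code-to-node mapping matters for load balance.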
Fig. 3 is a structural diagram of a distributed data storage device based on spatial correlation according to an embodiment of the present invention.
The device includes a grid division unit 301, configured to divide data having spatial characteristics into multiple grids, each grid holding the data of the space in which it lies.
A storage unit 302 is configured to store the data of the grids in a plurality of storage nodes according to the association between the grids.
As an embodiment of the present invention, the data having spatial characteristics includes one-dimensional, two-dimensional, three-dimensional or multidimensional spatial data.
As an embodiment of the present invention, a data space serves as a container for data with the same spatial attributes and consists of multiple spatial grids of identical extent. Each grid has a time axis and stores the data of different periods as time slices. Each time slice further includes at least one physical layer, and the data is divided into multiple data blocks by physical layer.
As an embodiment of the present invention, the storage unit 302 is further configured to store data that is close in spatial position on dispersed storage nodes.
As an embodiment of the present invention, the device further includes a hash computing unit 303, connected between the grid division unit 301 and the storage unit 302 and configured to convert the multiple grids from multidimensional space into a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids;
and a mapping unit 304, connected between the storage unit 302 and the hash computing unit 303 and configured to map the sequence code to the storage nodes; the storage unit 302 stores the grid data on the storage nodes according to the mapping.
As an embodiment of the present invention, the hash computing unit 303 converts the multiple grids from multidimensional space into a one-dimensional sequence using methods including the Hilbert curve, the row-order curve or the Z-order curve.
As an embodiment of the present invention, the storage unit 302 is further configured to store the data of grids that are close in spatial position on dispersed storage nodes.
With the device of the above embodiment, highly parallel data writing can be achieved for massive spatial data of various types. Data divided according to its spatial attributes is stored on the nodes in a balanced and secure manner that preserves spatial association. At the same time, the system is highly scalable, its performance scales linearly as the system expands, and it does not end up with large numbers of idle nodes or with I/O bottleneck nodes, which fulfils the original purpose of distributed system design.
Fig. 4 is a structural diagram of a distributed data parallel reading device based on spatial correlation according to an embodiment of the present invention.
The device includes a request acquisition unit 401, configured to determine, according to a read request for data having spatial characteristics, the grids that cover the read request.
A processing unit 402 is configured to determine, according to the covering grids, the storage nodes that store the data of those grids.
A reading unit 403 is configured to obtain the requested data from the determined storage nodes.
As an embodiment of the present invention, the device further includes a splicing and cropping unit 404, connected to the reading unit 403 and configured to splice and crop the data of the covering grids according to the spatial range in the read request to obtain exactly the requested data.
With the above embodiment of the reading device, grid data is selected according to spatial adjacency when data is read, so most storage nodes can be driven to work in parallel, which helps optimize the I/O performance of the system.
Fig. 5 is a structural diagram of a distributed data system based on spatial correlation according to an embodiment of the present invention.
The system includes a master node 501, a plurality of storage nodes 502 and a client node 503.
The master node 501 further includes:
A data space unit 5011, configured to establish the description of a data container space with consistent attributes, divide it into multiple spatial grids and compute the unique identifier of each grid.
A hash computing unit 5011, configured to convert the multiple grids formed by dividing the data space from multidimensional space into a one-dimensional sequence to obtain a sequence code that reflects the spatial relationship between grids. The data space serves as a data container with the same spatial feature description: the data space is established according to the spatial attributes, coordinate system and data features, is divided into multiple grids according to the grid division standard, and the unique identifier of each grid is obtained.
A mapping unit 5012, configured to map the obtained sequence codes to the plurality of storage nodes to obtain a one-to-one mapping relationship between grids and storage nodes.
A request acquisition unit 5013, configured to determine, according to a read request for data having spatial characteristics sent by the client node 503, the grids that cover the read request.
A processing unit 5014, configured to determine, according to the covering grids, the storage nodes that store the data of those grids.
The client node 503 further includes:
A grid division unit 5031, configured to divide data having spatial characteristics into a plurality of grids,
each grid holding the data of the space in which it lies.
A storage unit 5032, configured to store the data of the grids into the plurality of storage nodes 502
according to the association relationship (that is, the mapping relationship) of the grids.
A reading unit 5033, configured to read the requested data in parallel from the determined storage nodes.
The client node 503 may also include a splicing and cropping unit 5035, configured to splice and crop the
acquired data according to the spatial range of the read request, obtaining exactly the requested data.
Alternatively, a middle layer node 504 is provided between the client node 503 and the storage nodes 502,
configured to crop the acquired data according to the spatial range in the read request and send exactly the
requested data to the client node 503.
In the above embodiment, the hash computing unit, mapping unit, request acquisition unit and processing unit may
be located at the master node (in the embodiments that follow they are all located at the master node), at a
storage node (in which case there is no master node), or at the client node (again with no master node).
Alternatively, in some embodiments, if the data space has been divided into grids in advance, the system does
not need the hash computing unit and mapping unit to divide the grids again: the data input at the client node
are split directly into a plurality of grids according to the grid division standard of the data space, and the
grids are then stored in different storage nodes according to their association relationship (that is, their spatial position).
A specific embodiment is used below to describe the implementation and technical advantages of the present invention in detail.
First, the system parameters are initialized, the master node database is established, the storage node
management module is deployed, the client interface library is installed, and the sample data are prepared for input.
Fig. 6 is a schematic diagram of the data space and grid division of the embodiment. This embodiment takes
two-dimensional data as an example and establishes a two-dimensional data space; in other embodiments a
multidimensional data space can be established for multidimensional data. In the basic two-dimensional case, a
data space based on geographic coordinates is created, covering the whole of China and the surrounding region,
where: the coordinate range is 70°-140°E, 15°-55°N; the resolution is 0.02° in the X direction and 0.02° in the
Y direction; the earth ellipsoid model follows the WGS-84 standard; the upper left corner is the coordinate
origin, the X axis is longitude, positive to the left, and the Y axis is latitude, positive downward; the
physical size of the data space is 3500×2000. These parameters are combined into an attribute data record, the
data space identifier is set to "ChinaRegion", and the record is stored in the master node database, which
completes creation of the data space. The master node database uses a matrix-style relational database cluster
deployment consisting of 2-4 nodes depending on system scale; every node holds a complete copy of the data, and
data consistency reaches strict transaction level. Through the configuration interface, the physical
information of the storage nodes is entered, including address, capacity, interface protocol, active/standby
setting and activity status, and the storage node configuration table is established.
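The physical size quoted above follows directly from the coordinate range and the resolution. A minimal sketch of such an attribute record, assuming illustrative field names (`lon_min`, `res_x` and so on are not taken from the patent):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSpace:
    """Attribute record for a data space (field names are illustrative)."""
    name: str
    lon_min: float  # western edge, degrees east
    lon_max: float  # eastern edge, degrees east
    lat_min: float  # southern edge, degrees north
    lat_max: float  # northern edge, degrees north
    res_x: float    # degrees per cell along X
    res_y: float    # degrees per cell along Y

    def size(self) -> tuple[int, int]:
        """Physical size in cells as (width, height)."""
        width = round((self.lon_max - self.lon_min) / self.res_x)
        height = round((self.lat_max - self.lat_min) / self.res_y)
        return width, height


china = DataSpace("ChinaRegion", 70.0, 140.0, 15.0, 55.0, 0.02, 0.02)
print(china.size())  # (3500, 2000)
```

A 70° by 40° extent at 0.02° per cell gives the 3500×2000 size stated in the text.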
The next step is to plan the grid division. The grid size must take into account the data type, the data
volume, the size of the data space and so on; the data blocks produced by grid division should be of moderate
size, preferably in the range 128 KB-4 MB, where the I/O performance of the system is best. In this embodiment,
grids are designed as square regions of 250×250, so the data space "ChinaRegion" contains 112 grids forming a
14×8 grid matrix, as shown in Fig. 6. For each grid, parameters such as the coordinates of its upper left
corner, its X and Y resolution and its X and Y width are recorded. A grid identifier is also generated by a
grid identifier calculation function. For example, the grid identifier can be a 64-bit integer in which the
high 4 bits represent the data space type, the following 8 bits are reserved, the next 26 bits are the X
coordinate of the grid's upper left corner, and the last 26 bits are the Y coordinate of the grid's upper left
corner. For instance, if the data space type of this embodiment is the China grid with code 15, the grid whose
upper left corner is at 110.0°E, 45.0°N is calculated to have the grid identifier 16142813667133857664. The
system also provides an inverse grid identifier function, which computes the X and Y coordinates of the upper
left corner directly from the identifier; for example, for the grid identifier 16142914330428357664 the inverse
calculation gives upper left coordinates 120.0°E, 35.0°N. The grid attribute parameters and the grid identifier
are combined into a data record and stored in the grid attribute table of the master node database.
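The grid count and the 4/8/26/26-bit identifier layout can be sketched as follows. The text does not say how degrees are converted into the 26-bit integer fields, so this sketch takes integer coordinates as given (here, hypothetical cell offsets) and does not attempt to reproduce the example identifier values quoted above; the function names are also assumptions.

```python
GRID_CELLS = 250                 # grid edge length in cells
SPACE_W, SPACE_H = 3500, 2000    # data space size in cells

cols, rows = SPACE_W // GRID_CELLS, SPACE_H // GRID_CELLS   # 14 columns, 8 rows


def pack_grid_id(space_type: int, x: int, y: int) -> int:
    """Pack type (4 bits), 8 reserved zero bits, X (26 bits) and Y (26 bits)."""
    if not (0 <= space_type < 1 << 4 and 0 <= x < 1 << 26 and 0 <= y < 1 << 26):
        raise ValueError("field out of range")
    return (space_type << 60) | (x << 26) | y


def unpack_grid_id(grid_id: int) -> tuple[int, int, int]:
    """Inverse of pack_grid_id: return (space_type, x, y)."""
    mask26 = (1 << 26) - 1
    return grid_id >> 60, (grid_id >> 26) & mask26, grid_id & mask26


# Hypothetical example: type code 15, upper left corner at cell offset (2000, 500).
gid = pack_grid_id(15, 2000, 500)
print(cols * rows, unpack_grid_id(gid))
```

Because the fields occupy fixed bit positions, the inverse calculation needs only shifts and masks and no lookup in the grid attribute table.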
The grids in the above steps can also contain data at several levels. Fig. 7 is a schematic diagram of the grid
data model of an embodiment of the present invention; in this example the model has three levels: grid, time
slice and physical layer. Each level contains the next. First, the data space is established according to
spatial range and coordinate system and is divided into a plurality of grids; each grid is an independent data
unit, and the links between grids are maintained through spatial topological association. Second, each grid has
a time axis along which data are distributed in the form of time slices; a time slice has a "thickness"
attribute, different thicknesses represent different time ranges, and the thickness is set when the data are
input. Third, spatial data generally contain one or more physical levels, so within a time slice the data are
divided by attribute into a plurality of physical layers, and the data are divided by physical layer into data
blocks, which are the smallest units of system storage.
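The three-level containment described above can be pictured as nested records. This is only an illustrative data structure; the class and field names (`TimeSlice`, `thickness`, `layers`) are assumptions, not the patent's schema.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSlice:
    start: float       # start of the slice on the grid's time axis
    thickness: float   # time range covered by the slice, set at input time
    layers: dict[str, bytes] = field(default_factory=dict)  # physical layer -> data block


@dataclass
class Grid:
    grid_id: int
    slices: list[TimeSlice] = field(default_factory=list)

    def blocks(self):
        """Yield (grid_id, slice start, layer name, block): the smallest storage units."""
        for s in self.slices:
            for layer, block in s.layers.items():
                yield self.grid_id, s.start, layer, block


g = Grid(1, [TimeSlice(0.0, 3600.0, {"red": b"a", "nir": b"b"}),
             TimeSlice(3600.0, 3600.0, {"red": b"c"})])
print(len(list(g.blocks())))  # 3
```

Each yielded tuple identifies one data block by grid, time slice and physical layer, which matches the statement that the data block is the smallest unit the system stores.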
When data are actually stored, the 14×8 grid matrix obtained in the previous step forms a sequence of 112 grid
data blocks. How this one-dimensional sequence is arranged has a significant effect on the I/O performance of
the system, so the embodiment uses an algorithmic model to compute the mapping from the two-dimensional matrix
to the one-dimensional sequence.
In this step, the data space "ChinaRegion" uses the Hilbert curve as the algorithmic model for the grid matrix
hash. The Hilbert curve is a fractal curve that fills a plane; its value here is that it establishes a
one-to-one correspondence between one-dimensional and two-dimensional space. The plane traversed by a Hilbert
curve must be a square matrix of size 2^n × 2^n, so the 14×8 grid matrix of "ChinaRegion" is extended to a
2^4 × 2^4 matrix, the Hilbert curve is computed over that matrix, and the 14×8 matrix is then cut out starting
from the upper left corner as the result of the hash computation. The upper part of Fig. 8 shows the data space
"ChinaRegion" overlaid with the Hilbert curve. The result of the Hilbert curve computation is a sequence number
that gives the rank of a grid of the grid matrix along the Hilbert curve. Two points that are consecutive on the
Hilbert curve are also adjacent in two-dimensional space, so data that are close together in space also have
sequence codes that are close together on the Hilbert curve. Tables 1 and 2 in Fig. 8 list the grid identifiers
of the two regions A and B against their Hilbert curve sequence numbers.
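The Hilbert step can be sketched with the widely used iterative coordinate-to-index conversion. The patent does not give its own routine or the curve orientation, so the function below is a standard stand-in, and ranking the 112 retained cells by curve index is one possible reading of "cutting out" the 14×8 matrix.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along the Hilbert curve of cell (x, y) in an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


N = 16               # the 14 x 8 matrix padded to 2^4 x 2^4
COLS, ROWS = 14, 8

# Keep only the cells of the 14 x 8 matrix and rank them by their curve index.
cells = [(x, y) for y in range(ROWS) for x in range(COLS)]
order = sorted(cells, key=lambda c: hilbert_index(N, *c))
seq = {cell: rank for rank, cell in enumerate(order)}
print(len(seq))  # 112
```

On the full 16×16 matrix, consecutive indices always belong to edge-adjacent cells; after cropping, the ranks stay in curve order but a few consecutive ranks may no longer be adjacent where the curve leaves the 14×8 area.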
In other embodiments, other hash algorithms can also be used. For example: when the spatial range is large and
the grid matrix is larger than 4×4, the hash distribution is computed with the Hilbert curve to obtain a
spatially correlated one-dimensional sequence; when the spatial range is small and the grid matrix is smaller
than 4×4, the hash distribution is computed with an ordinary row-order curve; and when the X and Y dimensions
of the grid matrix differ greatly, the hash distribution is computed with a Z-order curve.
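The two alternative orderings and the selection rule can be sketched as below. The text does not define how large a difference between the X and Y dimensions counts as "great", how an exactly 4×4 matrix is treated, or which rule takes precedence, so the `skew` threshold and the order of the checks are assumptions.

```python
def row_order(cols: int, x: int, y: int) -> int:
    """Row-by-row scan index of cell (x, y) in a matrix with `cols` columns."""
    return y * cols + x


def z_order(x: int, y: int, bits: int = 16) -> int:
    """Morton code: bits of x go to even positions, bits of y to odd positions."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def choose_curve(cols: int, rows: int, skew: float = 4.0) -> str:
    """Pick a curve from the matrix shape; `skew` is an assumed threshold."""
    if max(cols, rows) / min(cols, rows) >= skew:
        return "z-order"
    if min(cols, rows) > 4:
        return "hilbert"
    return "row"


print(choose_curve(14, 8), choose_curve(3, 3), choose_curve(32, 2))
```

With these assumed thresholds the 14×8 matrix of the embodiment selects the Hilbert curve, consistent with the choice described in the text.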
Once the Hilbert sequence numbers have been obtained, a Cartesian product relation operation between these
sequence numbers and the storage node configuration table yields the mapping between grids and storage nodes.
Fig. 9 is a schematic diagram of the mapping between grid identifiers and storage nodes, using region B
selected in Fig. 8 as the example: Table 1 of Fig. 9 lists grid identifiers against Hilbert curve codes, Table 2
of Fig. 9 is the storage node configuration table, and Table 3 of Fig. 9 is the computed mapping between grids
and storage nodes. Table 3 of Fig. 9 shows that the Hilbert curve grid hash distributes the grid sequence of
region B evenly over all storage nodes: the load of the nodes is almost the same, and each of the 8 storage
nodes stores at least one grid and at most 2. Spatial correlation is thus maintained across the distributed
storage nodes rather than within a single storage node, so when grid data are selected according to spatial
adjacency during a read, the distributed data system of the present invention can drive most of the storage
nodes to work in parallel, optimizing the I/O performance of the system.
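The text does not spell out the rule that turns a sequence number into a storage node. One simple rule with the balancing property described above is to deal consecutive sequence numbers out to consecutive nodes; the modulo assignment below is that assumption, not the patent's stated operation.

```python
from collections import Counter

nodes = [f"Node{i}" for i in range(1, 9)]   # 8 storage nodes, as in the example


def node_for(seq_num: int) -> str:
    """Assumed rule: consecutive sequence numbers go to consecutive nodes."""
    return nodes[seq_num % len(nodes)]


mapping = {seq_num: node_for(seq_num) for seq_num in range(112)}
load = Counter(mapping.values())
print(load["Node1"], len({node_for(s) for s in range(40, 48)}))  # 14 8
```

Under this rule any run of 8 consecutive sequence numbers touches all 8 nodes, so a spatially compact region, whose grids have nearby sequence numbers, is spread over many nodes rather than concentrated on one.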
After the above processing, the data space can serve as a data container for input data: any data type that
matches the spatial attribute description of the data space "ChinaRegion" can be stored. A typical spatial data
type is used here as an example. Satellite remote sensing data are a typical spatial data type; their basic
characteristic is raster data distributed continuously in space, which can be described with a spatial grid
model. The upper part of Fig. 10 shows part of the satellite remote sensing data. A satellite remote sensing
image is used here as input data; the file is named "file1.dat", its spatial range is 100°E-120°E and
30°N-50°N, and its coordinate system, resolution and ellipsoid model are consistent with those of the data space
"ChinaRegion". The lower part of Fig. 10 shows the position of the input data within the grids. The upper part
of Fig. 11 shows the grid division of the input data, which are divided into 16 grid regions, with the Hilbert
curve through this region overlaid on the data image. Table 1 of Fig. 11 is the grid spatial attribute table of
the input data and Table 2 of Fig. 11 is the grid hash code table; both are obtained from the master node
database by a spatial query on the grid attribute table of the data space "ChinaRegion". Fig. 12 shows the
mapping between the input data grids and the storage nodes obtained by a further query: Table 1 of Fig. 12 is
the mapping table between the 16 grids and the 8 storage nodes, and the lower part of Fig. 12 shows how the 16
grids are distributed over the storage nodes. Each node corresponds to at least one grid; Node2 and Node5
correspond to the most grids, 3 each, and the overall distribution is fairly balanced.
Fig. 13 is a schematic diagram of data writing. According to the grid division standard, the input data are
first split into 16 grid data bodies; then, according to the mapping between grids and storage nodes and using
the storage node addresses obtained, the client writes the 16 grid data bodies in parallel to the storage nodes
(Node1, Node2 ... Node8); finally, the master node is notified that the data were written successfully. The main
data flow of the write process occurs only between the client and the storage nodes, and only control
information is transmitted between the client and the master node, so the load of the whole system is well
balanced and no I/O bottleneck arises.
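The parallel write can be sketched with a thread pool in which each task stands for one client-to-storage-node transfer. The in-memory `storage` dictionary, the dummy blocks and the `node_of` mapping are stand-ins for the real nodes, data and master node lookup, which the text does not specify at this level.

```python
from concurrent.futures import ThreadPoolExecutor

storage = {f"Node{i}": {} for i in range(1, 9)}   # stand-in for 8 storage nodes


def write_block(node: str, grid_id: int, block: bytes) -> int:
    """Stand-in for a network write of one grid data body to one storage node."""
    storage[node][grid_id] = block
    return grid_id


def write_file(blocks: dict[int, bytes], node_of: dict[int, str]) -> list[int]:
    """Send every grid block to its node in parallel; return the ids written."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(write_block, node_of[g], g, b) for g, b in blocks.items()]
        return [f.result() for f in futures]


blocks = {g: bytes([g]) for g in range(16)}          # 16 dummy grid data bodies
node_of = {g: f"Node{g % 8 + 1}" for g in blocks}    # assumed mapping obtained from the master
written = write_file(blocks, node_of)
print(sorted(written) == list(range(16)))  # True
```

Only the small `node_of` table would come from the master node; the block payloads travel directly from the client to the storage nodes, which is the separation the paragraph describes.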
Fig. 14 is a schematic diagram of writing multiple files into the data space. file1.dat, file2.dat and
file3.dat are three data files with the same spatial attributes, each written according to the grid division of
the data space "ChinaRegion". Fig. 14 shows the image of the whole data space after the three files have been
input: the three pieces of data are spliced into a whole, and subsequent data access addresses only the data
space without needing to know which file the data came from.
The distributed data system of the present invention has an even more obvious advantage in efficiency when
reading data. Reading can be regarded as the inverse of writing, but with a well-chosen hash algorithm the
parallelism of the read process improves qualitatively. Taking the data space "ChinaRegion" into which the
three files were written above, data are read by rectangular region and by polygonal region to illustrate the
parallelism and load of the whole system.
Fig. 15 is a schematic diagram of reading a subset of the data space by rectangular region. The leftmost part of
Fig. 15 shows a rectangular region selected in the data space, with range 95°E-105°E and 30°N-45°N, containing 9
grids in total; part B of Fig. 15 is the final image of the data for this rectangular region. The client obtains
from the master node database the grid identifiers (Grid_ID) of all grids covering the input region, then
obtains the corresponding storage node numbers (Node_ID) from the grid sequence numbers (Hilbert_Num), and
finally obtains the storage node addresses (IP_Addr) from the storage node numbers. Table 1 of Fig. 15 is the
resulting mapping table between grids and storage nodes, which determines the storage location of each grid; the
client then reads the grid data bodies in parallel from the relevant storage nodes. Data flow occurs only
between the storage nodes and the client, and the system load is well balanced. The lower part of Fig. 15 shows
the load of each storage node: the 9 grids involved in this read are distributed very evenly over the 8 nodes,
only Node6 has to provide 2 grids, and every other node provides one. No storage nodes sit idle in large numbers
and no node is under high I/O pressure, so the system is in a good load state; this is the clearest advantage of
the distributed data system of the present invention over other distributed systems.
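The lookup chain Grid_ID, Hilbert_Num, Node_ID, IP_Addr followed by a parallel fetch can be sketched as below. The mapping rows, addresses and `fetch` function are invented for illustration and do not reproduce Table 1 of Fig. 15.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping rows: (Grid_ID, Hilbert_Num, Node_ID, IP_Addr) for 9 grids.
rows = [(1000 + i, i, i % 8 + 1, f"10.0.0.{i % 8 + 1}") for i in range(9)]


def fetch(ip: str, grid_ids: list[int]) -> dict[int, bytes]:
    """Stand-in for one request to a storage node; returns dummy grid data bodies."""
    return {g: f"{ip}:{g}".encode() for g in grid_ids}


# Group the requested grids by the address of the node that stores them.
by_node = defaultdict(list)
for grid_id, _, _, ip in rows:
    by_node[ip].append(grid_id)

# One parallel request per storage node.
with ThreadPoolExecutor(max_workers=len(by_node)) as pool:
    parts = pool.map(lambda item: fetch(*item), by_node.items())
    data = {g: blk for part in parts for g, blk in part.items()}

load = {ip: len(ids) for ip, ids in by_node.items()}
print(sorted(load.values()))  # [1, 1, 1, 1, 1, 1, 1, 2]
```

With 9 grids on 8 nodes, exactly one node serves two grids and the rest serve one, the same load shape the paragraph reports for the rectangular read (there the doubly loaded node is Node6).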
For data reading, the embodiment can also provide a spatial processing middle layer, either built into the
client or connected between the client and the storage nodes, which splices the scattered grid data into a
whole, crops it precisely to the spatial region of the input request, and returns to the client a data body
that meets the user's need. Fig. 16 illustrates reading data for a polygonal region. Part A of Fig. 16 shows the
position of the input polygon in the data space. A spatial retrieval algorithm yields the maximum grid sequence
covering the polygon; part B of Fig. 16 shows the minimum bounding grid rectangle of the polygon, which involves
9 grids in total. Grid data are read from the storage nodes in parallel and passed to the spatial processing
middle layer, which splices, crops and reassembles them according to spatial algorithms. Part C of Fig. 16 shows
the final read result, which is returned to the client over a stream socket.
Fig. 17 shows the data read flow of the embodiment. First, the user submits a data read request through the
client (the dotted region within the grids in the figure), which is sent to the master node; the request
includes attribute parameters such as space, time and physical quantity. The master node then uses a spatial
retrieval algorithm to find the minimum grid sequence covering the input spatial range, for example the 16 grids
in the figure, and returns the grid identifiers to the client. Next, the client submits the grid identifier
sequence to the master node and queries the master node database for the storage nodes of those grids, then
issues grid data requests to the nodes in parallel. Each storage node transfers its grid data to the spatial
processing middle layer, where the data are spliced and cropped to form the dotted region requested by the
client. Finally, the spatial processing middle layer returns the final result to the client over a stream
socket, completing the read. Throughout the read process the data flow and control flow are completely
separated: the I/O pressure of the data flow is spread over the storage nodes, while the master node carries no
data flow I/O and is responsible only for queries of control information and metadata. The system load is
therefore well distributed and suited to large-scale data applications.
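The separation of control flow and data flow can be made concrete with a toy model in which the master answers only location queries and the storage nodes serve all payload bytes. The classes, the three-grid layout and the 4-byte blocks are illustrative assumptions.

```python
class Master:
    """Holds only metadata (grid -> node); grid payloads never pass through it."""

    def __init__(self, grid_to_node):
        self.grid_to_node = grid_to_node

    def locate(self, grid_ids):
        return {g: self.grid_to_node[g] for g in grid_ids}


class StorageNode:
    """Holds grid data blocks and counts the bytes it serves."""

    def __init__(self):
        self.blocks = {}
        self.bytes_served = 0

    def read(self, grid_id):
        block = self.blocks[grid_id]
        self.bytes_served += len(block)
        return block


nodes = {name: StorageNode() for name in ("Node1", "Node2")}
master = Master({0: "Node1", 1: "Node2", 2: "Node1"})
for g, name in master.grid_to_node.items():
    nodes[name].blocks[g] = b"x" * 4

where = master.locate([0, 1, 2])                              # control flow
result = {g: nodes[name].read(g) for g, name in where.items()}  # data flow
print(nodes["Node1"].bytes_served, nodes["Node2"].bytes_served)  # 8 4
```

All 12 payload bytes are accounted for on the storage nodes, while the master handled nothing but the grid-to-node lookup.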
With the above embodiments, highly parallel data reading and writing can be achieved for massive spatial data
of various types. Data divided by spatial attribute are stored on the nodes in a balanced and secure manner that
preserves spatial correlation, which suits various forms of data reading and provides users with a truly
data-centric service rather than simple file reading. The system also has great capacity for expansion, with
performance scaling linearly with system size, and does not develop large numbers of idle nodes or I/O
bottleneck nodes, realizing the original aim of distributed system design: spread the pressure and improve efficiency.
Those skilled in the art should understand that embodiments of the present invention may be provided as a
method, a system, or a computer program product. Accordingly, the present invention may take the form of an
entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware
aspects. Moreover, the present invention may take the form of a computer program product implemented on one or
more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory,
etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device
(system) and computer program product according to embodiments of the present invention. It should be
understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or
blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These
computer program instructions may be provided to the processor of a general-purpose computer, special-purpose
computer, embedded processor or other programmable data processing device to produce a machine, such that the
instructions executed by the processor of the computer or other programmable data processing device produce
means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks
of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a
computer or other programmable data processing device to work in a specific manner, such that the instructions
stored in the computer-readable memory produce an article of manufacture including instruction means that
implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing
device, such that a series of operational steps are performed on the computer or other programmable device to
produce computer-implemented processing, whereby the instructions executed on the computer or other programmable
device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one
or more blocks of the block diagrams.
The specific embodiments described above further explain the purpose, technical solutions and beneficial
effects of the present invention in detail. It should be understood that the above are only specific embodiments
of the present invention and are not intended to limit its scope of protection; any modification, equivalent
substitution or improvement made within the spirit and principles of the present invention shall fall within the
scope of protection of the present invention.
Claims (3)
1. A distributed data system based on spatial correlation, characterized by comprising:
a grid division unit, configured to divide data having spatial characteristics, time slices and physical
characteristics into a plurality of grids, wherein each grid holds the data of the space in which it lies, each
grid has a time axis and stores the data of different periods in the form of time slices, each time slice
further includes at least one physical layer, and the data are divided into a plurality of data blocks according
to physical layer; wherein the data having spatial characteristics include: geographically based spatial entity
data, continuously distributed spatial grid data, discretely distributed information point data, images with
row and column coordinates, and scientific matrix data;
a storage unit, configured to store the data of the grids into a plurality of storage nodes according to the association relationship of the grids;
a request acquisition unit, configured to determine, from a read request for data having spatial
characteristics, the grids that cover the read request;
a processing unit, configured to determine, from the covering grids, the storage nodes that store the data of those grids;
a parallel reading unit, configured to read the requested data in parallel from the determined storage nodes;
and further comprising:
a hash computing unit, configured to convert the grids formed by dividing the data space from the
multidimensional space into a one-dimensional sequence, obtaining a sequential coding that reflects the spatial
relationships between grids;
a mapping unit, configured to map the grids to the storage nodes.
2. The distributed data system based on spatial correlation according to claim 1, characterized in that
the hash computing unit, mapping unit, request acquisition unit and processing unit are located at a master node;
the grid division unit, storage unit and parallel reading unit are located at a client node.
3. The distributed data system based on spatial correlation according to claim 2, characterized by further
comprising a middle layer node connected between the client node and the storage nodes, configured to splice and
crop the data of the covering grids according to the spatial range in the read request, obtaining exactly the
requested data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410208628.0A CN103995861B (en) | 2014-05-16 | 2014-05-16 | A kind of distributed data device based on space correlation, method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410208628.0A CN103995861B (en) | 2014-05-16 | 2014-05-16 | A kind of distributed data device based on space correlation, method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103995861A CN103995861A (en) | 2014-08-20 |
CN103995861B true CN103995861B (en) | 2018-08-28 |
Family
ID=51310026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410208628.0A Active CN103995861B (en) | 2014-05-16 | 2014-05-16 | A kind of distributed data device based on space correlation, method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103995861B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376079B (en) * | 2014-11-17 | 2017-11-07 | 四川汇源吉迅数码科技有限公司 | A kind of mass data processing based on location service information and storage device and its method |
US10922474B2 (en) * | 2015-03-24 | 2021-02-16 | Intel Corporation | Unstructured UI |
CN105159903B (en) * | 2015-04-29 | 2018-09-28 | 北京交通大学 | Based on the big data Organization And Management method and system for dividing shape multistage honeycomb grid |
CN105279260B (en) * | 2015-10-21 | 2018-08-10 | 武汉大学 | A kind of government affairs big data method for digging divided based on space lattice |
CN106802786B (en) * | 2016-12-09 | 2019-02-19 | 中电科华云信息技术有限公司 | Time-space matrix computing system and method based on Hash area maps |
CN106528793B (en) * | 2016-12-14 | 2019-12-24 | 自然资源部国土卫星遥感应用中心 | Space-time fragment storage method of distributed spatial database |
CN107766491A (en) * | 2017-10-18 | 2018-03-06 | 浪潮金融信息技术有限公司 | File memory method and device, computer-readable recording medium, terminal |
CN109298836B (en) * | 2018-09-04 | 2022-07-08 | 航天信息股份有限公司 | Method, apparatus and storage medium for processing data |
CN109271113B (en) * | 2018-09-28 | 2022-03-29 | 武汉烽火众智数字技术有限责任公司 | Data management system and method based on cloud storage |
CN109871418A (en) * | 2019-01-04 | 2019-06-11 | 广州市城市规划勘测设计研究院 | A kind of space index method and system of space-time data |
CN111767264A (en) * | 2019-04-02 | 2020-10-13 | 中国石油化工股份有限公司 | Distributed storage method and data reading method based on geological information coding |
CN111797174A (en) * | 2019-04-08 | 2020-10-20 | 华为技术有限公司 | Method and apparatus for managing spatiotemporal data |
CN110442444B (en) * | 2019-06-18 | 2021-12-10 | 中国科学院计算机网络信息中心 | Massive remote sensing image-oriented parallel data access method and system |
CN110795605B (en) * | 2020-01-03 | 2020-05-12 | 北京东方通科技股份有限公司 | Data storage system based on distributed memory grid |
CN111460775B (en) * | 2020-03-05 | 2022-04-05 | 北京师范大学 | Method and device for generating trade characteristic grid graph |
CN113177091B (en) * | 2021-05-19 | 2023-10-10 | 杭州华橙软件技术有限公司 | Incremental data storage method and device, storage medium and electronic device |
CN115840752B (en) * | 2023-02-24 | 2023-05-02 | 西安索格亚航空科技有限公司 | Global aviation navigation data storage and query method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7640257B2 (en) * | 2007-08-09 | 2009-12-29 | Teradata Us, Inc. | Spatial join in a parallel database management system |
CN101692231B (en) * | 2009-01-14 | 2012-11-07 | 中国科学院地理科学与资源研究所 | Remote sensing image block sorting and storing method suitable for spatial query |
Non-Patent Citations (2)
Title |
---|
"基于HBase的矢量空间数据分布式存储研究";范建龙 等;《地理与地理信息科学》;20120930;第28卷(第5期);第39-42页 * |
崔鑫."海量空间数据的分布式存储管理及并行处理技术研究".《中国优秀硕士学位论文全文数据库•信息科技辑》.2012,I137-45. * |
Also Published As
Publication number | Publication date |
---|---|
CN103995861A (en) | 2014-08-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |