KR20170085786A

KR20170085786A - System and method for storing data in big data platform

Info

Publication number: KR20170085786A
Application number: KR1020160005344A
Authority: KR
Inventors: 주인학; 이강우; 장인성
Original assignee: 한국전자통신연구원
Priority date: 2016-01-15
Filing date: 2016-01-15
Publication date: 2017-07-25

Abstract

The present invention relates to a system and method for the distributed storage of spatial data in a big data platform, for efficiently storing and retrieving large-capacity spatial data having location information in a distributed environment. The system includes: a spatial index generation unit that reads distributed data stored in a plurality of distributed data stores, generates a spatial index for the read data, and stores the index in a spatial index store; a data distribution unit that reads the spatial index stored in the spatial index store and determines the range of a block, which is the unit of spatial data distribution; a data storage unit that reads the stored distributed data according to the block range determined by the data distribution unit, reconstructs it into blocks, and stores the blocks in the plurality of distributed data stores; and a block metadata storage unit that stores the range and reference information of the blocks determined by the data distribution unit as block metadata.

Description

[0001] The present invention relates to a system and method for storing data in a big data platform.

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to technology for processing spatial data among the large-scale data processing techniques of a big data platform, and more particularly to a system and method for the distributed storage of spatial data in a big data platform.

In general, a database management system (DBMS) is used to efficiently store and manage structured data and retrieve it through a quick query.

The conventional database management system is based on the processing of statically stored data, and thus it is possible to respond quickly and accurately to the processing of general data.

In recent years, however, big data environments have emerged in which data that is large in terms of volume, generation cycle, and format is produced widely, and such data cannot be processed by an existing database management system.

Big data processing technology, which finds meaningful value by processing and analyzing large amounts of data, is attracting increasing attention, and the big data era has arrived, as shown by heavy corporate investment in big data analysis.

Because the volume of big data is far larger than that of conventional data, the processing time for collecting, storing, searching, and analyzing it increases rapidly when a database management system alone is used. That is, a conventional centrally managed database management system has difficulty responding quickly to queries, because the load increases when a large number of queries are processed against a large, continuously changing body of data.

Big data processing techniques are being developed to solve these problems. A typical example is Hadoop, a framework that supports distributed application programs running on a cluster composed of a plurality of computers.

Hadoop provides a way to collect large amounts of data, both structured and unstructured, and to store the data redundantly and process it in parallel on a distributed network cluster. Hadoop's core component, the Hadoop Distributed File System (HDFS), is a file storage system that can store large amounts of data. Another key element, MapReduce, is a programming model for distributed parallel data processing.

Both HDFS and MapReduce assume that data is distributed and stored across a cluster. Each computer in the cluster is called a node, and data is stored and managed redundantly and evenly across a plurality of nodes. HDFS stores files in block units, which are set to 64 MB by default, and reads and writes are also performed in block units. To overcome the drawback of processing that runs mostly on a single node, one master node generates the information related to data processing and configuration, and the actual tasks are executed in parallel on the worker nodes.
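As a rough illustration of block-unit storage, the number of blocks a file occupies follows from a ceiling division. The sketch below assumes the 64 MB default stated above; the function name `block_count` is hypothetical, and replication is ignored.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated in the text


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file (ceiling division)."""
    if file_size_bytes <= 0:
        return 0
    return -(-file_size_bytes // block_size)


# A 200 MB file does not fit in three 64 MB blocks, so it occupies four.
blocks_for_200mb = block_count(200 * 1024 * 1024)
```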

On the other hand, most big data includes location information. More than 80% of the information produced and used in various fields, including the Internet of Things (IoT), social networks, and mobile terminals, is directly or indirectly related to location information (spatial information).

Therefore, it is very important to acquire, process, and analyze spatial information in big data processing. In recent years, demand for using and processing the spatial information contained in social network services, the web, and similar sources has been increasing.

However, existing distributed data storage methods, including HDFS, do not consider the characteristics of spatial data: spatial objects in spatially adjacent regions are distributed randomly, and there is no guarantee that they are physically stored on the same node.

Therefore, when search and calculation operations are frequently performed on spatial objects in spatially adjacent regions, the communication cost of exchanging data between nodes in order to read and operate on the data increases.

SUMMARY OF THE INVENTION It is an object of the present invention to solve the above problems of the prior art in processing spatial information in a big data platform, and to provide a system and method for distributing spatial data in a big data platform that improve spatial data processing performance and efficiently support the various spatial queries and spatial analyses requested from outside, by providing a distributed processing method and system matched to the characteristics of the spatial data that frequently occurs in a big data platform environment.

Another object of the present invention is to provide a system and method for distributing spatial data in a big data platform that store spatial data in a distributed manner in a Hadoop-based big data environment, so that large amounts of spatial data containing location information, generated by sensors, social networks, or mobile terminals, can be analyzed and served efficiently.

According to an aspect of the present invention, to increase the efficiency of spatial data processing, in which searches and queries are typically performed on spatially adjacent objects, spatially adjacent objects are bundled together by means of a spatial index, either while the spatial data is being loaded or after it has been loaded, and are physically stored in the same store in the distributed environment wherever possible.

A spatial data distribution storage system in a big data platform according to an aspect of the present invention includes: a spatial index generation unit that reads distributed data stored in a plurality of distributed data stores, generates a spatial index for the read data, and stores the index in a spatial index store; a data distribution unit that reads the spatial index stored in the spatial index store and determines the range of a block, which is the unit of spatial data distribution; a data storage unit that reads the stored distributed data according to the block range determined by the data distribution unit, reconstructs it into blocks, and stores the blocks in the plurality of distributed data stores; and a block metadata storage unit that stores the range and reference information of the blocks determined by the data distribution unit as block metadata.

According to the present invention, the inefficiency of existing methods of distributing and storing data in a big data environment, which do not take the characteristics of spatial data into account, is overcome, and spatial data can be searched and analyzed efficiently.

In addition, when spatial data is distributed and stored, data corresponding to spatially adjacent regions are physically stored in the same place, thereby enhancing the performance of spatial data processing having characteristics in which operations are frequently performed on spatially adjacent regions.

Also, the spatial data distribution storage system according to the method of the present invention has an effect of providing an efficient spatial data analysis function when an application program for searching and analyzing a large amount of spatial data is used.

BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram of a spatial data distribution storage system in a big data platform according to the present invention.
FIG. 2 is an exemplary diagram for explaining a spatial data distribution operation in the present invention.
FIG. 3 is a flowchart showing the operation flow of a spatial data distribution method in a big data platform according to the present invention.
FIG. 4 is a flowchart illustrating a method for searching for spatial data stored in a spatial data distribution storage system in a big data platform according to the present invention.

The advantages and features of the present invention, and the manner of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The present invention is defined by the claims. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In the present specification, the singular form includes the plural unless otherwise specified. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more components, steps, or operations other than those stated.

Before the present invention is described, the terms and basic concepts used in the description are briefly explained.

In the present invention, the basic unit for storing data is referred to as a block, the node that manages the distributed storage state of the data across the entire cluster is referred to as the main node, and the plurality of nodes that store the actual data are referred to as distributed nodes.

In Hadoop, the nodes that play these roles are called the name node and the data nodes, respectively. However, the present invention is not limited to Hadoop and can operate in any similar distributed processing framework, in which case the description should be read with reference to the corresponding nodes of that framework.

In the configuration of the present invention, the method of distributing spatial data to the distributed nodes is first briefly described. A spatial index is used to determine the unit in which spatial data is distributed and stored on each distributed node in a distributed environment.

A spatial index is a technique for speeding up searches by indexing spatial objects of two or more dimensions, which cannot be given a one-dimensional order in the way an integer or a string can. Typically, the two-dimensional space in which the data exists is divided into rectangular regions, and a tree structure is constructed over them. Representative examples include the R-tree, the quadtree, and the grid.

The R-tree is a representative tree structure for efficiently processing spatial data; it stores a minimum bounding rectangle (MBR) for each object. A grid of fixed-size tiles can also be used, as can GeoHash, a grid-based code that represents the two coordinates of a latitude/longitude pair as a single value.
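To make the idea of a single grid-based code for a latitude/longitude pair concrete, the following is a minimal sketch of the publicly known geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). It is offered as an illustration only; the description does not specify which encoding details the invention uses, and the function name `geohash_encode` is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value, nbits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1  # upper half: emit 1, raise lower bound
            rng[0] = mid
        else:
            value <<= 1               # lower half: emit 0, lower upper bound
            rng[1] = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:                # five bits make one base-32 character
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)


code = geohash_encode(57.64911, 10.40744, 5)  # "u4pru"
```

Points that lie in the same grid cell at a given precision receive the same code, which is what makes such a code usable as a grouping key for spatially adjacent data.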

In the present invention, spatial data retrieval includes region queries such as "Find the spatial objects (buildings, roads, etc.) contained in an area given by the user", proximity queries such as "Find the spatial object closest to a given point", and queries such as "Find a pair of schools and factories within a ...".
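The first two query types can be sketched with a brute-force scan over point objects. This is a minimal illustration under simplifying assumptions (objects are points, the query area is an axis-aligned rectangle, distance is Euclidean); the names `SpatialObject`, `region_query`, and `nearest_query` are hypothetical and not taken from the description.

```python
import math
from typing import NamedTuple


class SpatialObject(NamedTuple):
    name: str
    x: float
    y: float


def region_query(objects, xmin, ymin, xmax, ymax):
    """Return the objects contained in a rectangular area."""
    return [o for o in objects if xmin <= o.x <= xmax and ymin <= o.y <= ymax]


def nearest_query(objects, x, y):
    """Return the object closest to the given point."""
    return min(objects, key=lambda o: math.hypot(o.x - x, o.y - y))


objects = [
    SpatialObject("school", 1.0, 1.0),
    SpatialObject("factory", 4.0, 5.0),
    SpatialObject("road", 2.0, 2.5),
]
in_area = region_query(objects, 0.0, 0.0, 3.0, 3.0)  # school and road
closest = nearest_query(objects, 4.0, 4.0)           # factory
```

A spatial index avoids the full scan used here by restricting the search to the cells or tree nodes that overlap the query.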

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, preferred embodiments of a spatial data distribution system and method in a Big Data Platform according to the present invention will be described in detail with reference to the accompanying drawings.

1 is a block diagram of a spatial data distribution storage system in a Big Data Platform according to the present invention.

As shown in FIG. 1, a spatial data distribution storage system in a big data platform according to the present invention may include a data processing unit 200, a data store 300, and a search unit 400.

The data processing unit 200 includes a data loading unit 210, a spatial index generation unit 220, a data distribution unit 230, a data storage unit 240, and a block metadata storage unit 250.

The data store 300 includes a spatial index store 310, a block metadata store 320, and a plurality of distributed data stores 330.

The search unit 400 includes a plurality of distributed data retrieval units 410 and a data retrieval unit 420.

The spatial index generator 220 reads spatial data from a plurality of the distributed data repositories 330 and generates a spatial index serving as a reference for storing the spatial data.

The data distribution unit 230 generates the criterion for distributing spatial data by determining the range of a block, which is the unit of spatial data distribution, and provides this criterion to the block metadata storage unit 250 and the data storage unit 240.

The data storage unit 240 re-stores the data held in the data store 300 according to the spatial data distribution criterion provided by the data distribution unit 230.

The data storage unit 240 reads data from the distributed data storage 330 according to the range of data blocks provided by the data distribution unit 230, reconstructs the data blocks, and stores the data blocks in the distributed data storage 330.

The block metadata storage unit 250 stores information of blocks in which spatial data is distributed, and stores data block range and reference information provided by the data distribution unit 230 as block metadata.

The data loading unit 210 reads the raw data and stores it in the data store 300 when the data has not yet been loaded into the distributed data store 330.

The data retrieval unit 420 of the search unit 400 splits a search according to the distribution state of the spatial data by referring to the block metadata stored in the block metadata store 320, and provides search commands to the distributed data retrieval units 410.

The data retrieval unit 420 reads the spatial index stored in the spatial index store 310 to look up the placement of the data blocks, logically splits the search request, and provides the resulting requests to the distributed data retrieval units 410. It then merges the data received from the distributed data retrieval units 410 to generate a final result and provides that result to the requesting user node.

The plurality of distributed data retrieval units 410 receive retrieval commands from the data retrieval unit 420 and retrieve data stored in a plurality of distributed data repositories 330. In other words, the plurality of distributed data retrieval units 410 correspond to the respective data nodes of the plurality of distributed data repositories 330 in a one-to-one correspondence, and retrieve data from the respective data nodes and provide them to the data retrieval unit 420.

The user node may be a user who performs a spatial search directly, or an external application program or service that queries data through a data processing API such as ODBC or JDBC. The user node submits a spatial search or spatial query to the system of the present invention through the data retrieval unit 420 and receives the result.

Hereinafter, the operation of the spatial data distribution system in the Big Data Platform according to the present invention will be described.

First, the spatial index generator 220 reads data stored in the distributed data repository 330 to generate a spatial index, and stores the index in the spatial index repository 310.

The data distribution unit 230 reads the spatial index stored in the spatial index store 310, determines the range of a block, which is the unit of spatial data distribution, to generate the criterion for spatial data distribution, and provides it to the block metadata storage unit 250 and the data storage unit 240.

The data storage unit 240 reads the distributed data from the distributed data storage 330 according to the range of the block determined by the data distribution unit 230, reconstructs the distributed data into blocks, and stores the blocks in the distributed data storage 330.

The block metadata storage unit 250 stores the range and reference information of the block determined by the data distribution unit 230 in the block metadata storage 320.

When a search request is received from a user, the data retrieval unit 420 reads the block metadata stored in the block metadata store 320, determines how the blocks are stored in the distributed data store 330, and logically splits the search request.

The data retrieval unit 420 divides the user's search request and transmits each part to the corresponding distributed data retrieval unit 410.

Each distributed data retrieval unit 410 reads the data from the corresponding distributed node of the distributed data store 330, generates result data, and provides the generated result data to the data retrieval unit 420.

The data retrieval unit 420 then merges the result data received from the distributed data retrieval units 410, generates the requested result data, and provides it to the requesting user node.
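The split, dispatch, and merge flow between the data retrieval unit 420 and the distributed data retrieval units 410 resembles a scatter/gather pattern. The sketch below is a single-process stand-in, assuming threads in place of distributed nodes and an in-memory dictionary in place of the distributed data store 330; `NODE_DATA`, `search_node`, and `search` are hypothetical names and the records are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the distributed data store:
# node id -> list of (name, x, y) records held on that node.
NODE_DATA = {
    2: [("p1", 12.0, 31.0), ("p2", 18.0, 22.0)],
    4: [("p3", 25.0, 33.0), ("p4", 38.0, 39.0)],
}


def search_node(node_id, bbox):
    """Role of one distributed data retrieval unit: filter local records."""
    xmin, ymin, xmax, ymax = bbox
    return [r for r in NODE_DATA[node_id]
            if xmin <= r[1] <= xmax and ymin <= r[2] <= ymax]


def search(node_ids, bbox):
    """Role of the data retrieval unit: dispatch in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(node_ids))) as pool:
        partials = list(pool.map(lambda n: search_node(n, bbox), node_ids))
    merged = []
    for part in partials:
        merged.extend(part)
    return merged


result = search([2, 4], (10.0, 20.0, 30.0, 35.0))
```

Because `pool.map` returns results in the order of `node_ids`, the merged list is deterministic even though the per-node searches run concurrently.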

The method of distributing and storing spatial data in a big data platform, and the method of searching the stored data, which correspond to the operation of the system configured as described above, are now described concretely, step by step.

First, referring to FIGS. 2 and 3, a distributed data storing method will be described.

FIG. 2 is an exemplary diagram for explaining a spatial data distribution operation in the present invention, and FIG. 3 is a flowchart illustrating an operation flow for a spatial data distribution method in a big data platform according to the present invention.

First, the data loading unit 210 shown in FIG. 1 reads the raw data, processes it into a form suitable for the store, and stores it in the distributed data store 330 (S302). In the example of FIG. 2, the data is stored on the distributed nodes. Here, raw data means spatial data, that is, data containing coordinates or position information: for example, social network data tagged with coordinates, data collected from sensors, records of buses and places whose positions are recorded, and electronic maps such as digital maps. When Hadoop, which is widely used for storing and managing large data in a distributed environment, is used, the distributed data store is HDFS.

The spatial index generator 220 shown in FIG. 1 reads the data stored in the distributed data store 330 to generate a spatial index according to the spatial data (S303).

The generated spatial index is stored in a main memory, a disk, or a spatial index storage 310.

Then, the data distribution unit 230 reads the spatial index stored in the spatial index storage 310, and determines a block range when the data is distributed, that is, a range in which data is to be bundled into a block that is a basic storage unit (S304). Here, the block range depends on the spatial index used, and may be determined, for example, as data corresponding to a cell of a fixed grid or a leaf node of an R-tree.
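For a fixed grid, the block range of step S304 can be pictured as one cell's worth of data. The following minimal sketch assigns points to cells of a 4x4 grid and groups them by cell. The cell edge length, the orientation (columns A to D along x, rows 1 to 4 along y), and the names `cell_of` and `build_grid_index` are assumptions made for illustration; coordinates are assumed to be non-negative.

```python
from collections import defaultdict

COLS = "ABCD"  # assumption: columns A-D run along the x axis
ROWS = 4       # assumption: rows 1-4 run along the y axis
CELL = 10.0    # hypothetical cell edge length in map units


def cell_of(x: float, y: float) -> str:
    """Return the label of the fixed-grid cell that contains (x, y)."""
    col = min(int(x // CELL), len(COLS) - 1)  # clamp points on the far edge
    row = min(int(y // CELL), ROWS - 1)
    return f"{COLS[col]}{row + 1}"


def build_grid_index(points):
    """Group points by cell; each group is a candidate block range."""
    index = defaultdict(list)
    for x, y in points:
        index[cell_of(x, y)].append((x, y))
    return dict(index)


index = build_grid_index([(1.0, 1.0), (2.0, 2.0), (15.0, 25.0), (35.0, 35.0)])
```

With an R-tree, the same role would be played by the entries of a leaf node rather than by a grid cell.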

The data distribution unit 230 provides the determined block range information to the data storage unit 240. According to that block range, the data storage unit 240 stores all of the data in the distributed data store 330 such that adjacent data blocks are placed on the same distributed node (S305).

According to the above process, the data is stored twice, in steps S302 and S305. The first stored copy is unnecessary once the second has been stored and is therefore preferably deleted. In some cases, the first storage can be done in memory rather than on disk.

Next, the block metadata storage unit 250 stores the block range, the correspondence between cells and blocks, and the information on which distributed node of the distributed data store 330 each block is placed, as block metadata on the main node (S306).

The spatial index generation method of step S303 will be described in more detail with reference to the example of FIG. 2.

Although FIG. 2 illustrates a simple example of a grid having a fixed size for the sake of clarity, the nature is the same even if other spatial indices such as R-tree are used.

In FIG. 2, the map data area is divided into areas of fixed size cells A1 to D4, and spatial data (indicated by dots in the example of FIG. 2) in each area is allocated to each area.

Because the areas A1, B1, A2, and B2 are spatially adjacent to one another, they are allocated to one distributed node (number 1 in FIG. 2). Likewise, the areas C1, D1, C2, and D2 are stored together on one distributed node, and the areas C3, D3, C4, and D4 are stored on distributed node 4.

The block metadata storage unit 250 stores, in the block metadata store 320, information indicating that the data in the cells of areas A1, B1, A2, and B2 is stored on distributed node 1. In distributed storage such as Hadoop, data replication is commonly used to store each block on a plurality of nodes, but this is not shown in FIG. 2 to avoid confusion.
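The cell-to-node correspondence held as block metadata for the FIG. 2 example can be sketched as a small lookup table. Nodes 1 and 4 follow the description; the assignment of the A3/B3/A4/B4 group to node 2 and the C1/D1/C2/D2 group to node 3 is an inference from the region query example later in the description, where cells B3 and B4 are found on the second node. The names `node_of`, `block_metadata`, and `cells_by_node` are hypothetical.

```python
COLS = "ABCD"
ROWS = (1, 2, 3, 4)


def node_of(cell: str) -> int:
    """Map a cell label to its node, grouping cells in 2x2 squares.

    Nodes 1 and 4 are as stated in the text; nodes 2 and 3 are inferred.
    """
    col_half = COLS.index(cell[0]) // 2  # 0 for A/B, 1 for C/D
    row_half = (int(cell[1]) - 1) // 2   # 0 for rows 1-2, 1 for rows 3-4
    return 1 + row_half + 2 * col_half


# Block metadata: cell label -> distributed node holding that cell's data.
block_metadata = {f"{c}{r}": node_of(f"{c}{r}") for c in COLS for r in ROWS}

# Reverse view: node -> sorted list of cells stored on it.
cells_by_node = {}
for cell, node in sorted(block_metadata.items()):
    cells_by_node.setdefault(node, []).append(cell)
```

Replication, which the description notes is omitted from FIG. 2, would turn each value into a list of nodes rather than a single node.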

The method of determining the block range in step S304 of FIG. 3 is now described in more detail. Preferably, the amount of data in a cell does not exceed the data block size. The data corresponding to one cell is spatially adjacent, but if the amount of data in a cell exceeds the block size, that spatially adjacent data is divided across a plurality of blocks. In this case, to determine which blocks hold the data of a given cell, the list of blocks corresponding to each cell is stored in the block metadata.
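The overflow case can be sketched as follows: a cell whose data exceeds the block capacity is split into several blocks, and the metadata keeps the list of block identifiers per cell. As a simplifying assumption, capacity is counted in records rather than bytes; `split_into_blocks` and `build_cell_block_lists` are hypothetical names.

```python
def split_into_blocks(cell_records, block_capacity):
    """Cut one cell's records into chunks of at most block_capacity."""
    return [cell_records[i:i + block_capacity]
            for i in range(0, len(cell_records), block_capacity)]


def build_cell_block_lists(cell_index, block_capacity):
    """Return (blocks, cell_to_blocks) for a cell -> records mapping."""
    blocks = {}          # block id -> records stored in that block
    cell_to_blocks = {}  # cell label -> list of block ids (block metadata)
    next_id = 0
    for cell, records in sorted(cell_index.items()):
        ids = []
        for chunk in split_into_blocks(records, block_capacity):
            blocks[next_id] = chunk
            ids.append(next_id)
            next_id += 1
        cell_to_blocks[cell] = ids
    return blocks, cell_to_blocks


# Cell A1 holds five records but a block holds only two, so A1 spans 3 blocks.
blocks, cell_to_blocks = build_cell_block_lists(
    {"A1": [1, 2, 3, 4, 5], "B1": [6]}, block_capacity=2
)
```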

After the distribution of data is completed, a method of searching for data according to a request of a user will be described with reference to FIG.

FIG. 4 is a flowchart illustrating a method of searching for spatial data stored in a spatial data distribution storage system according to the present invention.

Referring to FIG. 4, when a data retrieval request is received from a user, the data retrieval unit 420 reads the spatial index stored in the spatial index store 310 (S402).

Next, referring to the read spatial index, it is determined which cell corresponds to the search target, and the block metadata is read from the block metadata storage 320 (S403).

Next, the block(s) corresponding to those cells, and the distributed node(s) on which the block(s) are stored, are identified. A search command is then transmitted to the distributed data retrieval unit 410 of each such distributed node so that it reads each block (S404).

Because the block contains candidate data that may satisfy the search condition, the distributed data retrieval unit 410 performs a spatial operation (e.g., spatial inclusion analysis) on the data (S405).

The computed result is temporarily stored on each distributed node and then provided to the data retrieval unit 420 (S406). Steps S404 to S406 are performed in parallel on the plurality of distributed nodes.

Then, the data retrieval unit 420 merges the data transmitted from the distributed nodes to generate a final result, and transmits the final result data to the user node that requested the search (S407).

Referring to FIG. 2 as an example, suppose the main node receives a region query requesting information on the area indicated by the ellipse in the map area of FIG. 2. The data retrieval unit 420 first determines, by referring to the spatial index, that the cells to be read are B3, C3, D3, B4, C4, and D4.

Then, the data retrieving unit 420 reads the block metadata, and obtains information that the data corresponding to the cell region exists only in the second and fourth distributed nodes and is not present in the first and third distributed nodes.
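The two lookups in this example, from query area to cells and from cells to distributed nodes, can be sketched as below. The ellipse is approximated by its bounding rectangle, and the cell size, grid orientation, and the cell-to-node layout for nodes 2 and 3 are assumptions consistent with the example rather than details given in the description; all function names are hypothetical.

```python
COLS = "ABCD"
CELL = 10.0  # hypothetical cell edge length; rows 1-4 assumed to run along y


def node_of(cell: str) -> int:
    """Assumed FIG. 2 layout: 2x2 cell groups, one group per node."""
    col_half = COLS.index(cell[0]) // 2
    row_half = (int(cell[1]) - 1) // 2
    return 1 + row_half + 2 * col_half


def cells_for_bbox(xmin, ymin, xmax, ymax):
    """Cells of the 4x4 grid that overlap the query rectangle."""
    last = len(COLS) - 1
    c0, c1 = max(int(xmin // CELL), 0), min(int(xmax // CELL), last)
    r0, r1 = max(int(ymin // CELL), 0), min(int(ymax // CELL), last)
    return [f"{COLS[c]}{r + 1}"
            for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def nodes_for_cells(cells):
    """Distinct distributed nodes that hold at least one of the cells."""
    return sorted({node_of(c) for c in cells})


cells = cells_for_bbox(12.0, 22.0, 38.0, 38.0)
nodes = nodes_for_cells(cells)
```

Under these assumptions the query rectangle covers cells B3, C3, D3, B4, C4, and D4, and only nodes 2 and 4 need to be contacted.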

The data retrieval unit 420 delivers the command to the distributed data retrieval units 410 of the second and fourth distributed nodes. Each of these distributed data retrieval units 410 accesses its data to obtain candidate objects that may match, and performs spatial operations such as spatial inclusion tests to obtain the actual region query results.

Each distributed data retrieval unit 410 transmits the computation result to the main node (data retrieval unit), and the main node performs processing such as removing redundant data from each computation result, and merges the data to generate the final result.

When the method of the present invention is not used, the spatial data (points) in, for example, the B3 region is divided randomly among distributed nodes 1 to 4 without regard to spatial adjacency, so all of distributed nodes 1 to 4 must be accessed, and the number of data transfers between the main node and the distributed nodes increases.

Although a simple example has been described with reference to FIG. 2, in a typical real deployment the overall area is wider, the data volume is larger, and the number of cells is greater. For spatial searches, which characteristically operate on adjacent regions within a relatively narrow range, only a small number of distributed nodes are then accessed, which reduces the amount of data transmitted from the distributed nodes to the main node and increases processing speed.

Without this method, a relatively large number of distributed nodes must be accessed to reach randomly distributed spatial objects. With this method, the block metadata indicates that spatially adjacent objects are stored on only a few distributed nodes, so only those nodes are accessed and processing speed improves.

Although the spatial data distribution storage system and method in a big data platform according to the present invention have been described by way of embodiments, the scope of the present invention is not limited to those specific embodiments, and various alternatives, modifications, and changes may be made within the scope of the invention.

Therefore, the embodiments described in the present invention and the accompanying drawings are intended to illustrate rather than limit the technical spirit of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and accompanying drawings. The scope of protection of the present invention should be construed according to the claims, and all technical ideas within the scope of equivalents should be interpreted as being included in the scope of the present invention.
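The difference in the number of nodes accessed can be illustrated with a toy comparison. Round-robin placement is used here only as a deterministic stand-in for placement that ignores location; it is not the placement policy of HDFS or of any particular system. The workload of eight records in cell B3 and the choice of node 2 for that cell are assumptions, and `nodes_touched` is a hypothetical name.

```python
def nodes_touched(placement, wanted_cell):
    """Distinct nodes that must be read to fetch all records of one cell."""
    return {node for cell, node in placement if cell == wanted_cell}


# Hypothetical workload: eight records that all fall in cell B3.
records = ["B3"] * 8

# Location-agnostic placement: records spread over nodes 1-4 in turn.
round_robin = [(cell, 1 + i % 4) for i, cell in enumerate(records)]

# Spatially aware placement: the whole cell is kept on one node.
spatial = [(cell, 2) for cell in records]

agnostic_nodes = nodes_touched(round_robin, "B3")
spatial_nodes = nodes_touched(spatial, "B3")
```

In this toy case the location-agnostic layout requires contacting all four nodes, while the spatially aware layout requires contacting one.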

200: Data processing unit
210: Data loading unit
220: Spatial index generation unit
230: Data distribution unit
240: Data storage unit
250: Block metadata storage unit
300: Data store
310: Spatial index store
320: Block metadata store
330: Distributed data store
400: Search unit
410: Distributed data retrieval unit
420: Data retrieval unit

Claims

A spatial data distribution storage system in a big data platform, comprising: a spatial index generation unit for reading distributed data stored in a plurality of distributed data stores, generating a spatial index for the read distributed data, and storing the index in a spatial index store;
a data distribution unit for reading the spatial index stored in the spatial index store and determining the range of a block that is the unit of spatial data distribution;
a data storage unit for reading the stored distributed data according to the range of the block determined by the data distribution unit, reconstructing the distributed data into blocks, and storing the blocks in the plurality of distributed data stores; and
a block metadata storage unit for storing the range and reference information of the block determined by the data distribution unit as block metadata.