CN106708993A

CN106708993A - Spatial data storage processing middleware framework realization method based on big data technology

Info

Publication number: CN106708993A
Application number: CN201611170711.9A
Authority: CN
Inventors: 吴信才; 万波; 吴亮; 周顺平; 胡茂胜; 杨林; 陈波
Original assignee: BEIJING ZONDY CYBER TECHNOLOGY CO LTD; WUHAN ZONDY CYBER CO Ltd
Current assignee: BEIJING ZONDY CYBER TECHNOLOGY CO LTD; WUHAN ZONDY CYBER CO Ltd
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2017-05-24
Anticipated expiration: 2036-12-16
Also published as: CN106708993B

Abstract

The invention relates to a spatial data storage processing middleware framework realization method based on a big data technology. The method enables a user to quickly acquire blended data content of existing multi-source heterogeneous structured data and unstructured data, and a mainstream big data storage tool is adopted to improve distributed storage efficiency. The spatial data storage processing middleware framework realization method based on the big data technology comprises a data extraction and conversion step and a data distributed storage step; multi-source heterogeneous spatial data is extracted, converted and loaded to construct a diversified fragmented unstructured data distributed virtual storage framework, and the data content capable of being read directly is provided for subsequent spatial big data analysis and mining.

Description

Spatial data storage and processing middleware framework implementation method based on big data technology

Technical field

The present invention relates to a method for implementing a spatial data storage and processing middleware framework based on big data technology. The method gives users a way to quickly acquire data content in which existing multi-source heterogeneous structured data is mixed with unstructured data, and it uses mainstream big data storage tools to improve distributed storage efficiency.

Background technology

Spatial data is data that represents the position, shape, size, and distribution characteristics of spatial entities. It can describe objects from the real world and has positional, qualitative, temporal, and spatial-relationship properties. Spatial data represents the natural world on which people depend, using basic spatial data structures such as points, lines, polygons, and solids.

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time. These are massive, fast-growing, and diversified information assets that require new processing models in order to yield stronger decision-making power, insight, and process optimization capability.

In "The Big Data Era" by Viktor Mayer-Schönberger and Kenneth Cukier, big data refers to analyzing all of the data rather than taking the shortcut of random analysis (sampling). The 5V characteristics of big data (proposed by IBM) are Volume, Velocity, Variety, Value, and Veracity.

The strategic significance of big data technology lies not in holding huge volumes of data but in processing that meaningful data professionally. In other words, if big data is compared to an industry, the key to its profitability is improving the "processing capability" applied to data and realizing the "added value" of data through "processing".

Technically, big data and cloud computing are as inseparable as the two sides of a coin. Big data cannot be processed on a single computer and requires a distributed architecture. Its distinguishing feature is distributed data mining over massive data, which must rely on the distributed processing, distributed databases, cloud storage, and virtualization technology of cloud computing.

With the arrival of the cloud era, big data has attracted increasing attention. The term is commonly used to describe the large volumes of unstructured and semi-structured data that a company creates, which would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing because real-time analysis of large data sets requires a framework such as MapReduce to distribute work across tens, hundreds, or even thousands of computers.

Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. Distributed computing is a broad and varied field today, but Hadoop stands out in several respects: (1) Convenient: Hadoop runs on large clusters of ordinary commodity machines or on cloud computing services such as Amazon Elastic Compute Cloud (EC2). (2) Robust: Hadoop runs on commodity hardware, where hardware faults may occur and disrupt running programs, and it handles such failures gracefully. (3) Scalable: a Hadoop cluster can be extended easily by adding compute nodes, so it copes well with large data sets. (4) Writing efficient parallel code on Hadoop is quick and convenient. These inherent advantages give Hadoop a clear edge for writing large distributed programs. Companies and individuals alike can build their own Hadoop clusters from very cheap PCs to study distributed parallel computing. These same characteristics have made Hadoop popular in both academia and industry.

HBase is an open-source, distributed, column-oriented database. The technology originates from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable uses the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache's Hadoop project. HBase differs from a typical relational database: it is a database suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model.

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but its differences from them are also significant. HDFS is highly fault-tolerant and suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets.

HDFS supports a traditional hierarchical file organization. A user or application can create directories as needed and store files in them. The hierarchy of the file system namespace is similar to that of most existing file systems: users can create, delete, move, or rename files. At present, HDFS does not yet support user disk quotas or access permission control, nor hard links and soft links, although the HDFS architecture can accommodate these features.

HDFS can reliably store very large files across the machines of a large cluster. It splits each file into a sequence of data blocks; all blocks except the last are the same size. To ensure fault tolerance, every data block of a file is replicated. The block size and replication factor are configurable per file. An application can specify the number of replicas of any given file. The replication factor can be specified when the file is created and changed later.
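The block splitting and replication described above can be illustrated with a minimal Python sketch. This is not HDFS code: the function names `split_into_blocks` and `place_replicas`, the tiny block size, and the round-robin placement policy are all hypothetical choices made only for the example.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split data into fixed-size blocks; only the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is an assumption for illustration only;
    it is not the placement policy used by HDFS.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }
```

Both the block size and the replication factor are parameters here, mirroring the statement that they are configurable per file.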

Apache Ambari is a web-based tool that supports provisioning, managing, and monitoring Apache Hadoop clusters. Ambari currently supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming service, distributed synchronization, and group services.

The goal of ZooKeeper is to encapsulate complex, error-prone key services and to offer users a system with an easy-to-use interface, efficient performance, and stable functionality.

ETL is the abbreviation of Extract-Transform-Load. It describes the process of moving data from a source to a destination through extraction, transformation, and loading. The term is most common in data warehousing, but its scope is not limited to data warehouses.

ETL is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model.
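The extract, transform, and load sequence can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the tooling used by the invention: `run_etl`, `normalize`, and the `Record` type are hypothetical names, and the "warehouse" is whatever callable the caller passes as `load`.

```python
from typing import Callable, Iterable

Record = dict[str, str]


def run_etl(extract: Callable[[], Iterable[Record]],
            transform: Callable[[Record], Record],
            load: Callable[[Record], None]) -> int:
    """Pull each record from the source, convert it, and write it to the target.

    Returns the number of records loaded.
    """
    count = 0
    for raw in extract():
        load(transform(raw))
        count += 1
    return count


def normalize(raw: Record) -> Record:
    """Example cleaning step: lower-case the field names and trim the values."""
    return {key.strip().lower(): value.strip() for key, value in raw.items()}
```

Keeping the three stages as separate callables lets a different source or target be swapped in without touching the cleaning logic.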

Sqoop (pronounced "skup") is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, and so on). It can import data from a relational database (for example MySQL, Oracle, or Postgres) into Hadoop's HDFS, and it can also export HDFS data into a relational database.

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume supports customizing various data senders in a logging system in order to collect data. It also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

Summary of the invention

The technical problem to be solved by the present invention is to provide, in a distributed computer cluster environment, a method for implementing a spatial data storage and processing middleware framework based on big data technology. By extracting, transforming, and loading multi-source heterogeneous spatial data, the method builds a distributed virtualized storage framework for diversified, fragmented unstructured data and provides directly readable data content for subsequent spatial big data analysis and mining.

To solve the above technical problem, the method of the present invention for implementing a spatial data storage and processing middleware framework based on big data technology is characterized in that it comprises the following steps:

Step A) For large volumes of multi-source heterogeneous spatial data and system data, ETL data extraction and conversion tools extract the data and convert it into a general format. The data extraction and conversion step is as follows: MapGIS data is stored in a MapGIS database; the MapGIS conversion tool imports the MapGIS data in the MapGIS database into the HBase distributed database, and it can also export HBase data into the MapGIS database;

Step B) Data distributed storage step: the MapGIS Conversion Tools for Hadoop convert the MapGIS-format data in the spatial database into a file format managed by Hadoop and store the converted MapGIS spatial data in the distributed database HBase. These tools extract the geographic extent and annotation text of the MapGIS-format data and store them in the content library (HBase). Extracting the annotation text makes it possible to retrieve maps by content, unlike non-vector maps, which can only be retrieved by file name. The GIS map information becomes part of the content library and, together with the results data content, supports spatial big data mining.

In the above scheme, the data distributed storage step is followed by a data association RDF step: an index and a semantic catalog of the spatial data are built and stored in the data association graph RDF. The association between entities and data is based on the concept of a graph; the data association graph can associate geospatial entities with large amounts of structured or unstructured data.

In the above scheme, the specific steps of the data association RDF include:

Semantic association tree step 301: entities and their relations are stored in a semantic association tree. The tree stores triple data; a triple records the relation between entities and the URL address of the actual resource;

Resource URI step 302: the entities of step 301 and the spatial data of step 303 are linked to each other through resource URIs and can access each other;

HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map. The data in each column family is stored together, which reduces I/O overhead on reads and writes and keeps similar data together;

The HBase distributed storage database uses KeyValue column storage. The Rowkey is the primary key of a row and uniquely identifies it; the records in a table are sorted by Rowkey. Here the archive URL is used as the primary key. All data is accessed by Rowkey (primary key), and one wide row can hold all the data related to a primary key;

A KeyValue is a key-value pair consisting of a column name and a column value; multiple KeyValues form a column family;

A column family contains any number of attribute values (columns) from multiple logical attribute groups. A table has one or more column families in the horizontal direction. A column family can consist of any number of columns and supports dynamic extension, with no need to predefine the number or type of columns. Values are stored in binary, and users must perform type conversion themselves. Column families preserve as much of the information in the original material as possible, so that the data can be organized and described faithfully;

A table keyed by archive file number and title, containing the attributes of the archive report, thus forms a distributed content library.
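The storage model described in step 303 (row key, column families, dynamically added columns, binary values, rows sorted by key) can be mimicked with a small in-memory Python sketch. It is a toy stand-in, not the HBase client API: the class `WideColumnTable` and its methods are hypothetical, and the assumption that column families are fixed at table creation while columns are added freely is stated here as an assumption of the sketch.

```python
from typing import Iterator, Optional


class WideColumnTable:
    """Toy wide-column table: row key -> column family -> column -> bytes."""

    def __init__(self, families: list[str]) -> None:
        # Assumption: families are declared up front; columns are not.
        self.families = set(families)
        self.rows: dict[str, dict[str, dict[str, bytes]]] = {}

    def put(self, row_key: str, family: str, column: str, value: bytes) -> None:
        """Store a binary value; new columns need no prior declaration."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key: str, family: str, column: str) -> Optional[bytes]:
        """Return the stored bytes, or None if the cell is absent (sparse table)."""
        return self.rows.get(row_key, {}).get(family, {}).get(column)

    def scan(self) -> Iterator[tuple[str, dict[str, dict[str, bytes]]]]:
        """Yield rows in row-key order, as records are sorted by Rowkey."""
        for key in sorted(self.rows):
            yield key, self.rows[key]
```

Because values are plain bytes, the caller converts types itself, for example with `int.from_bytes` for a page count or `bytes.decode` for a title.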

In the above scheme, the algorithm of the semantic association tree is as follows:

Step 1) Start;

Step 2) Predefine the root node, and set its child nodes for the relations RowKey and GeomID to empty;

Step 3) Read the primary key Key, the spatial attribute URI, and the specified feature attribute from the content library;

Step 4) If the spatial attribute URI is empty, perform step 5; otherwise, perform step 6;

Step 5) Match the corresponding feature attribute in the spatial data, build the URI of the matching record, and save it in the corresponding attribute column of the content library;

Step 6) Segment the feature attribute text into words, and take the root node as the parent node;

Step 7) Take values from the word segmentation result set in order, then perform steps 8, 9, and 10;

Step 8) Search the semantic association tree for the corresponding node whose relation is SubNode; if no such node exists, perform steps 9 and 10; otherwise, return to step 7;

Step 9) If the URI is empty, match the corresponding feature attribute in the spatial data and build the URI of the matching record;

Step 10) Create a node Node with this value; create a child node Key with relation RowKey, i.e. the triple [Node, RowKey, Key]; create a child node URI with relation GeomID, i.e. the triple [Node, GeomID, URI]; and, with Node as the child node, establish a SubNode relation with the parent node;

Step 11) End.
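One possible reading of steps 1 to 11 is sketched below in Python. Several points are assumptions of the sketch rather than statements of the method: whitespace splitting stands in for word segmentation, a plain dictionary `spatial_lookup` stands in for matching the feature attribute against the spatial data, the URI is resolved once before the loop (steps 5 and 9 combined), and the parent node stays at the root for every word because the text does not say whether it advances. The names `SemanticTree` and `add_record` are hypothetical.

```python
from dataclasses import dataclass, field

SUBNODE, ROWKEY, GEOMID = "SubNode", "RowKey", "GeomID"
Triple = tuple[str, str, str]  # (subject, relation, object)


@dataclass
class SemanticTree:
    root: str = "ROOT"
    triples: list[Triple] = field(default_factory=list)

    def has_child(self, parent: str, value: str) -> bool:
        """Step 8: is `value` already linked to `parent` by a SubNode relation?"""
        return (parent, SUBNODE, value) in self.triples

    def add_record(self, key: str, uri: str, feature_text: str,
                   spatial_lookup: dict[str, str]) -> str:
        """Insert one content-library record (steps 3 to 10); return its URI."""
        if not uri:
            # Steps 5 and 9: build the URI from the matching spatial record.
            uri = spatial_lookup.get(feature_text, "")
        parent = self.root  # Step 6: the root is the parent node.
        for word in feature_text.split():  # Step 7: words in order.
            if self.has_child(parent, word):
                continue  # Step 8: node exists, go to the next word.
            # Step 10: create the node and its three relations.
            self.triples.append((word, ROWKEY, key))
            self.triples.append((word, GEOMID, uri))
            self.triples.append((parent, SUBNODE, word))
        return uri
```

Under this reading, a word that already exists in the tree is skipped entirely, so only the first record containing that word contributes its RowKey and GeomID triples.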

Compared with the prior art, the beneficial effects of the invention are as follows. The spatial big data extraction, conversion, and distributed storage method of the invention gives users a way to quickly acquire data content in which existing multi-source heterogeneous structured data is mixed with unstructured data, and it uses mainstream big data storage tools to improve distributed storage efficiency.

Content in HBase is stored by column family. The data in each column family is stored together, which reduces I/O overhead on reads and writes; because similar data is kept together, compression saves a great deal of storage space.

Hadoop technology is used to store and organize unstructured spatial data in a content-oriented model. This solves the problems of homogenizing unstructured spatial data and organizing it for data mining, so that diversified, fragmented data is homogenized and integrated. Unstructured data is stored using Key/Value pairs, large fields, and the like, which makes subsequent retrieval and use of spatial data fast and effective.

Brief description of the drawings

Fig. 1 is a schematic diagram of the data storage and processing middleware framework of the invention;

Fig. 2 is a flow diagram of a specific embodiment of the method of the invention for spatial data extraction, conversion, and distributed storage;

Fig. 3 is the association graph between spatial entities and data of the invention;

Fig. 4 shows the 1:500,000 stratigraphic unit data;

Fig. 5 shows the data size and block size of the 1:500,000 stratigraphic unit data;

Fig. 6 shows the block storage details of the 1:500,000 stratigraphic unit data.

Detailed description of the embodiments

The invention is further described below with reference to Figs. 1 to 6 and specific embodiments, so that those skilled in the art can better understand and practice it. The embodiments described do not limit the invention.

The invention provides a method for storing and processing spatial data based on big data technology, comprising the following steps:

Step A) For large volumes of multi-source heterogeneous spatial data and system data, ETL data extraction and conversion tools extract the data and convert it into a general format;

Step B) The data is stored in virtualized form in the spatial big data distributed storage framework and managed uniformly.

Further, the multiple distributed data sources include local file systems, relational databases, and spatial data management platforms. Data is imported and exported in both directions between these sources and the big data system, and user-defined association data associates spatial structured data with the unstructured data in the big data system, laying the foundation for subsequent data analysis.

Further, the ETL tools are data extraction, transformation, and loading tools. They extract multi-structured data from the data sources and load the raw data into the big data container quickly and efficiently, so that data can be converted in both directions between spatial big data storage and traditional storage. According to data type, they are divided into three tools:

a real-time data conversion tool, which imports real-time data through web crawlers and Flume;

a custom data conversion tool, which uses the Sqoop big data storage tool to improve storage efficiency, allows conversion tools to be customized for specific business data types, and provides a file loading function;

a spatial data conversion tool, which converts spatial data from its native data format into a general format.

Further, the distributed storage framework includes five tools:

a data association RDF graph database, which supports storing the relations between geospatial data and other types of data;

the distributed file system (HDFS), which stores the raw spatial data and information documents. Distributed file storage is provided on the basis of the HDFS framework to handle large volumes of unstructured data such as multimedia files, and a custom extension of its storage plug-in enables it to store GIS spatial data;

the HBase distributed database. Integrating the HBase database supports storing and organizing data in the form of conventional data tables or semi-structured data types. Based on its development interface specification, storage of GIS spatial data is implemented, and associations between structured and unstructured data are established in tables, providing rich query results for subsequent data queries. For fast retrieval of file data, the original documents are reorganized and then stored in the distributed real-time access database HBase. Figures, tables, attachments, and similar files are stored individually, while the main document is stored separately by chapter. An index is also built over the content stored in HBase and kept in the distributed cache Memcached or Redis, so that a lookup only needs to fetch the index from memory;

the ZooKeeper coordination service, a centralized service that maintains configuration information and naming and provides distributed synchronization and group services;

Ambari cluster node management and monitoring, whose role is to create, manage, and monitor Hadoop clusters; it is a tool that makes Hadoop and related big data software easier to use. Ambari is itself distributed software consisting mainly of two parts: Ambari Server and Ambari Agent. In short, the user instructs the Ambari Agents, through Ambari Server, to install the corresponding software. Each Agent periodically sends the state of every software module on its machine to Ambari Server, and this status information is finally presented in the Ambari GUI, so that the user can see the various states of the cluster and carry out the corresponding maintenance.

As shown in Fig. 1, the data storage and processing middleware framework of the invention comprises the following modules:

Data source module 101: the data sources of spatial big data include spatial data, internet data, log stream data, local data files, relational data, and so on. The data in these sources takes the form of GIS data, document data, image data, and the like, and is stored in a dispersed manner in different types of database nodes such as relational databases and spatial databases.

ETL tool module 102: the ETL tools extract, transform, and load the dispersed data sources of various formats;

The ETL tools fall into three classes: the real-time data conversion tool, the custom data conversion tool, and the spatial data conversion tool;

These three classes of tools each extract the corresponding data from the data sources and convert it into a unified, readable format;

For example, relational data is accessed with the Sqoop tool, and spatial data is accessed with the spatial data conversion tool.

HDFS distributed file system module 103: part of the data extracted and converted by the ETL tools, such as file loading data, is stored in distributed form in the HDFS distributed file system.

HBase distributed database module 104: part of the data extracted and converted by the ETL tools, such as spatial data and real-time data, is stored in distributed form in the HBase distributed database.

Data association RDF graph database module 105: while the ETL tools extract and convert data from the data sources and store it in the distributed database, a data index and a semantic catalog are built and stored in the data association graph RDF.

ZooKeeper coordination service module 106: coordinates and manages the distribution of HBase RegionServers across the multiple nodes of the distributed environment.

Ambari cluster node management and monitoring module 107: provides visual installation and monitoring of the nodes of the cluster in the distributed environment.
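The routing of source data to ETL tools and storage modules described for modules 102 to 104 can be summarized as a lookup table. The Python sketch below is only an illustration: the `Kind` enumeration, the labels, and the `route` function are hypothetical, and the storage target of relational data is left as `None` because the text does not state it.

```python
from enum import Enum
from typing import Optional


class Kind(Enum):
    REALTIME = "realtime"
    RELATIONAL = "relational"
    FILE = "file"
    SPATIAL = "spatial"


# Which class of ETL tool handles each kind of source data.
TOOL_FOR_KIND = {
    Kind.REALTIME: "real-time data conversion tool",
    Kind.RELATIONAL: "custom data conversion tool (Sqoop)",
    Kind.FILE: "custom data conversion tool (file loading)",
    Kind.SPATIAL: "spatial data conversion tool",
}

# Where the converted data is stored, where the text says so.
STORE_FOR_KIND = {
    Kind.FILE: "HDFS",
    Kind.SPATIAL: "HBase",
    Kind.REALTIME: "HBase",
}


def route(kind: Kind) -> tuple[str, Optional[str]]:
    """Return (ETL tool, storage target); the target is None when unspecified."""
    return TOOL_FOR_KIND[kind], STORE_FOR_KIND.get(kind)
```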

As shown in Fig. 2, a specific embodiment of the method of the invention for spatial data extraction, conversion, and distributed storage comprises the following steps:

Data extraction and conversion step 201: spatial data is mainly stored in spatial databases; for example, MapGIS data is stored in a MapGIS database. The MapGIS conversion tool imports the MapGIS data in the MapGIS database into the HBase distributed database, and it can also export HBase data into the MapGIS database.

Data distributed storage step 202: the MapGIS Conversion Tools for Hadoop convert the MapGIS-format data in the spatial database into a file format managed by Hadoop and store the converted MapGIS spatial data in the distributed database HBase. These tools extract the geographic extent and annotation text of the MapGIS-format data and store them in the content library (HBase). Extracting the annotation text makes it possible to retrieve maps by content, unlike non-vector maps, which can only be retrieved by file name. The GIS map information becomes part of the content library and, together with the results data content, supports later data mining of spatial big data.
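Retrieving maps by their annotation text, rather than by file name, can be illustrated with a small inverted index in Python. This is a sketch under stated assumptions, not the content library itself: the function names `build_annotation_index` and `search` are hypothetical, an in-memory dictionary stands in for HBase, and lower-cased whitespace splitting stands in for proper text segmentation.

```python
from collections import defaultdict


def build_annotation_index(maps: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every annotation term to the set of map identifiers that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for map_id, annotations in maps.items():
        for text in annotations:
            for term in text.lower().split():
                index[term].add(map_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the maps whose annotations contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query such as "river" then finds every map with a river annotation, regardless of what the map files are called.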

The data association RDF step follows: an index and a semantic catalog of the spatial data are built and stored in the data association graph RDF.

The association between entities and data is based on the concept of a graph. The data association graph can associate geospatial entities with large amounts of structured or unstructured data, laying the foundation for subsequent unified analysis and application.

As shown in Fig. 3, a specific embodiment of the association graph between spatial entities and data of the invention comprises the following steps:

Semantic association tree step 301: entities and their relations are stored in a semantic association tree. The tree stores triple data; a triple records the relation between entities and information such as the URL address of the actual resource.

Resource URI step 302: the entities of step 301 and the spatial data of step 303 are linked to each other through resource URIs (unique identifiers of the data) and can access each other.

HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map. The data in each column family is stored together, which reduces I/O overhead on reads and writes; because similar data is kept together, compression saves a great deal of storage space;

A column family contains any number of attribute values (columns) from multiple logical attribute groups. A table has one or more column families in the horizontal direction. A column family can consist of any number of columns and supports dynamic extension, with no need to predefine the number or type of columns. Values are stored in binary, and users must perform type conversion themselves. Column families preserve as much of the information in the original material as possible, so that the data can be organized and described faithfully.

A table keyed by archive file number and title, containing the attributes of the archive report (such as file name, geographic spatial extent, and attached figures and tables), forms the distributed content library.

The algorithm of the semantic association tree is described in detail below:

Step 1) Start;

Step 11) End.

A triple is a concept from data structures. It is mainly used as a compressed representation for storing sparse matrices and refers to a set of the form ((x, y), z), often abbreviated as (x, y, z). In this technical scheme, a triple records the relation between entities and information such as the URL address of the actual resource.
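The sparse-matrix use of triples mentioned above can be shown with a short Python sketch. The function names `to_triples` and `from_triples` are hypothetical and the example is generic; it only illustrates the (x, y, z) form, with x and y as the row and column position and z as the stored value.

```python
def to_triples(matrix: list[list[int]]) -> list[tuple[int, int, int]]:
    """Keep only the non-zero cells as (row, column, value) triples."""
    return [
        (r, c, value)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    ]


def from_triples(triples: list[tuple[int, int, int]], rows: int, cols: int) -> list[list[int]]:
    """Rebuild the full matrix from its triples; missing cells are zero."""
    matrix = [[0] * cols for _ in range(rows)]
    for r, c, value in triples:
        matrix[r][c] = value
    return matrix
```

In the semantic association tree the same three-part shape is reused with a node, a relation name, and a target in place of row, column, and value.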

Claims

1. A method for implementing a spatial data storage and processing middleware framework based on big data technology, characterized in that it comprises the following steps:

2. The method for implementing a spatial data storage and processing middleware framework based on big data technology according to claim 1, characterized in that the data distributed storage step is followed by a data association RDF step: an index and a semantic catalog of the spatial data are built and stored in the data association graph RDF; wherein the association between entities and data is based on the concept of a graph, and the data association graph can associate geospatial entities with large amounts of structured or unstructured data.

3. The method for implementing a spatial data storage and processing middleware framework based on big data technology according to claim 2, characterized in that the specific steps of the data association RDF include:

4. The method for implementing a spatial data storage and processing middleware framework based on big data technology according to claim 3, characterized in that the algorithm of the semantic association tree is as follows:

Step 1) Start;

Step 11) End.