CN106708993B

CN106708993B - Method for realizing space data storage processing middleware framework based on big data technology

Info

Publication number: CN106708993B
Application number: CN201611170711.9A
Authority: CN
Inventors: 吴信才; 万波; 吴亮; 周顺平; 胡茂胜; 杨林; 陈波
Original assignee: Zondy Cyber Group Co ltd; WUHAN ZONDY CYBER TECHNOLOGY CO LTD
Current assignee: Zondy Cyber Group Co ltd; WUHAN ZONDY CYBER TECHNOLOGY CO LTD
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2021-06-08
Anticipated expiration: 2036-12-16
Also published as: CN106708993A

Abstract

The invention relates to a method for implementing a spatial data storage and processing middleware framework based on big data technology. The method lets users quickly obtain the content of existing multi-source heterogeneous data, both structured and unstructured, and improves distributed storage efficiency by adopting mainstream big data access tools. The method comprises a data extraction and conversion step and a data distributed storage step: by extracting, converting and loading multi-source heterogeneous spatial data, it constructs a distributed virtual storage framework for diversified, fragmented unstructured data and provides directly readable data content for subsequent spatial big data analysis and mining.

Description

Method for realizing space data storage processing middleware framework based on big data technology

Technical Field

The invention relates to a method for implementing a spatial data storage and processing middleware framework based on big data technology. The method lets users quickly obtain the content of existing multi-source heterogeneous data, both structured and unstructured, and improves distributed storage efficiency by adopting mainstream big data access tools.

Background

Spatial data refers to data that represents the position, shape, size, distribution and other characteristics of spatial entities. It can describe objects from the real world and carries positional, qualitative, temporal and spatial-relationship characteristics. Spatial data represents the natural world in which people live, using basic spatial data structures such as points, lines, surfaces and solids.

Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within an acceptable time frame. It is a massive, fast-growing and diversified information asset that requires new processing modes in order to deliver stronger decision-making power, insight discovery and process optimization.

In "The Big Data Era" by Viktor Mayer-Schönberger and Kenneth Cukier, big data means analyzing all of the data rather than taking the shortcut of random analysis (sample surveys). The 5V characteristics of big data (proposed by IBM) are Volume (large scale), Velocity (high speed), Variety (diversity), Value and Veracity (authenticity).

The strategic significance of big data technology lies not in possessing huge amounts of data but in processing the meaningful data professionally. In other words, if big data is compared to an industry, the key to profitability in that industry is to improve the "processing ability" applied to the data and to realize the "added value" of the data through that "processing".

Technically, big data and cloud computing are as inseparable as the two sides of a coin. Big data cannot be processed by a single computer, so a distributed architecture must be adopted. Its hallmark is distributed data mining over massive data, which in turn must rely on the distributed processing, distributed databases, cloud storage and virtualization technologies of cloud computing.

With the advent of the cloud era, big data has attracted more and more attention. The term is often used to describe the large amounts of unstructured and semi-structured data created by a company, data that would take excessive time and money to load into a relational database for analysis. Big data analysis is often tied to cloud computing because real-time analysis of large data sets requires a MapReduce-like framework to distribute work across tens, hundreds or even thousands of computers.

Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. Distributed computing is now applied in a wide variety of fields, but Hadoop stands out in four respects. (1) It is convenient: Hadoop runs on large clusters of ordinary commodity machines or on cloud computing services such as the Amazon Elastic Compute Cloud (EC2). (2) It is robust: Hadoop runs on commodity hardware, where hardware failures occur and can disrupt running programs, and it is designed to handle most such failures gracefully. (3) It is scalable: a Hadoop cluster can be expanded simply by adding computing nodes, so that larger data sets can be processed. (4) It is simple: efficient parallel code can be written quickly on Hadoop. These advantages make Hadoop well suited to writing large distributed programs. A company or an individual can build a Hadoop cluster of their own from very cheap PCs and use it to study distributed parallel computing. Because of these characteristics, Hadoop is very popular in both academia and industry.

HBase is a distributed, column-oriented open-source database whose technology derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable takes advantage of the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a sub-project of the Apache Hadoop project. Unlike a general relational database, HBase is suited to storing unstructured data; another difference is that HBase uses a column-based rather than a row-based model.

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented and scalable distributed storage system; with HBase, a large-scale structured storage cluster can be built on low-cost PC servers.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but its differences from them are also clear. HDFS is a highly fault-tolerant system suitable for deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets.

HDFS supports a traditional hierarchical file organization. A user or application can create directories and then save files in them. The hierarchy of the file system namespace is similar to that of most existing file systems: users can create, delete, move or rename files. At present, HDFS supports neither user disk quotas and access control nor hard links and soft links, but the HDFS architecture leaves room to add these features.

HDFS can reliably store very large files across the machines of a large cluster. It splits each file into a sequence of data blocks, all of which are the same size except the last. To guarantee fault tolerance, every data block of a file is replicated. The block size and replication factor are configurable per file. An application may specify the number of replicas of any particular file; the replication factor may be set when the file is created and changed later.
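The block-splitting and replication rule described above can be illustrated with a little arithmetic. The following is a minimal Python sketch, not HDFS code: `plan_blocks` and `BlockLayout` are hypothetical names, and the 128 MiB block size and replication factor of 3 are illustrative values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockLayout:
    full_blocks: int        # blocks that are exactly block_size bytes long
    last_block_bytes: int   # size of the trailing partial block (0 if none)
    raw_bytes: int          # bytes consumed across the cluster, replicas included

    @property
    def block_count(self) -> int:
        return self.full_blocks + (1 if self.last_block_bytes else 0)


def plan_blocks(file_bytes: int, block_size: int, replication: int) -> BlockLayout:
    """Split a file into fixed-size blocks; only the last block may be smaller."""
    if file_bytes < 0 or block_size <= 0 or replication <= 0:
        raise ValueError("invalid file size, block size or replication factor")
    full_blocks, last_block_bytes = divmod(file_bytes, block_size)
    # Every block is stored `replication` times, so raw usage scales linearly.
    return BlockLayout(full_blocks, last_block_bytes, file_bytes * replication)


MIB = 1024 * 1024
# Illustrative values only: a 300 MiB file, 128 MiB blocks, 3 replicas.
layout = plan_blocks(file_bytes=300 * MIB, block_size=128 * MIB, replication=3)
print(layout.block_count, layout.last_block_bytes // MIB, layout.raw_bytes // MIB)
```

With these example values the file occupies two full blocks plus one 44 MiB block, and the replicas together consume 900 MiB of raw storage.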

Apache Ambari is a web-based tool that supports the provisioning, management and monitoring of Apache Hadoop clusters. Ambari already supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop and HCatalog.

ZooKeeper is an open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming service, distributed synchronization and group services.

ZooKeeper aims to encapsulate complex, error-prone key services and to offer users a simple, easy-to-use interface on top of an efficient and stable system.

ETL, an abbreviation of Extract-Transform-Load, describes the process of extracting, transforming and loading data from a source to a destination. The term is most commonly used for data warehouses, but its objects are not limited to them.

ETL is an important link in building a data warehouse: the required data is extracted from the data sources, cleaned, and finally loaded into the data warehouse according to a predefined data warehouse model.

Sqoop is an open-source tool used mainly to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). It can import data from a relational database (e.g., MySQL, Oracle or Postgres) into the HDFS of Hadoop, and it can also export data from HDFS into a relational database.

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating and transmitting massive volumes of logs. Flume supports customizing various data senders in a logging system to collect data; it can also perform simple processing on the data and write it to various (customizable) data receivers.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, for a distributed computer cluster environment, a method for implementing a spatial data storage and processing middleware framework based on big data technology. By extracting, converting and loading multi-source heterogeneous spatial data, the method constructs a distributed virtual storage framework for diversified, fragmented unstructured data and thereby provides directly readable data content for subsequent spatial big data analysis and mining.

To solve this technical problem, the invention discloses a method for implementing a spatial data storage and processing middleware framework based on big data technology, characterized by comprising the following steps:

step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into data in a general format; the data extraction and conversion step is as follows: MapGIS data is stored in a MapGIS database, the MapGIS data in the MapGIS database is imported into an HBase distributed database through a MapGIS conversion tool, and data in HBase can likewise be imported into the MapGIS database;

step B), data distributed storage step: using the MapGIS Conversion Tools for Hadoop, converting the MapGIS-format data in the spatial database into a file format that Hadoop can manage; storing the converted MapGIS spatial data in the distributed database HBase; and extracting the geographic range and the annotation text content of the MapGIS-format data and storing them in a content library (HBase). Extracting the annotation text content allows a map to be retrieved by its content, unlike non-vector maps, which can only be retrieved by file name; the GIS map information thus becomes a component of the content library and, together with the result data content, supports the mining of spatial big data.

In the above scheme, the data distributed storage step is followed by a data association RDF step: establishing an index and a semantic directory of the spatial data and storing them in a data association graph (RDF); the association between entities and data is based on the concept of a graph, so the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data.

In the above scheme, the specific steps of the data association RDF include:

semantic association tree step 301: storing entities and their relationships in a semantic association tree; storing triple data in the semantic association tree, wherein the triple records the relationship between the entities and URL address information of the entity resources;

resource URI step 302: the entity of step 301 and the spatial data of step 303 are connected to each other by a resource URI and are accessible to each other;

HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map; the data in each column family is stored together, which keeps similar data together and effectively reduces I/O overhead during reads and writes;

the HBase distributed storage database uses KeyValue column storage; the RowKey is the primary key of a row and uniquely identifies it, and the records in a table are sorted by RowKey; here the URL of the data file is used as the primary key; all data is accessed through the RowKey, and one wide row can contain all the data related to one primary key;

a KeyValue is a key-value pair consisting of a column name and a column value, and multiple KeyValues form a column family;

a column family comprises any number of attribute values (columns) of a logical attribute group; a table has one or more column families in the horizontal direction, and each column family can consist of any number of columns; column families support dynamic expansion, the number and types need not be predefined, values are stored in binary form, and type conversion is left to the user; column families lose as little of the information in the original data as possible, and can therefore faithfully organize and describe the data;

and a table whose primary keys are the file archive number and name contains the attributes of the archive report, thereby forming the distributed content library.
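The wide-row layout described above (sorted row keys, column families holding arbitrary columns, binary values converted by the caller) can be modelled in a few lines. The following is a toy in-memory sketch for illustration only, not the HBase client API; the class `WideRowTable`, the family name `attr` and the example URLs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, Tuple

Row = Dict[str, Dict[str, bytes]]  # column family -> column name -> raw bytes


class WideRowTable:
    """Toy model of a wide-row table: row key -> family -> column -> bytes."""

    def __init__(self) -> None:
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key: str, family: str, column: str, value: bytes) -> None:
        # Columns are created on first write; values are stored as raw bytes,
        # so encoding and decoding are the caller's responsibility.
        self._rows[row_key][family][column] = value

    def get(self, row_key: str) -> Row:
        # One wide row returns everything related to a single primary key.
        return {fam: dict(cols) for fam, cols in self._rows.get(row_key, {}).items()}

    def scan(self) -> Iterator[Tuple[str, Row]]:
        # Rows are returned in row-key order, mirroring a sorted map.
        for key in sorted(self._rows):
            yield key, self.get(key)


table = WideRowTable()
# Hypothetical data-file URLs used as row keys, as the text suggests.
table.put("hdfs://example/reports/b.pdf", "attr", "name", "Report B".encode("utf-8"))
table.put("hdfs://example/reports/a.pdf", "attr", "name", "Report A".encode("utf-8"))
table.put("hdfs://example/reports/a.pdf", "attr", "extent", b"110.0,30.0,111.0,31.0")

for key, row in table.scan():
    print(key, sorted(row["attr"]))
```

Because the two rows carry different column sets, the example also shows why such a table is described as sparse: a column that a row does not use simply does not exist for that row.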

In the above scheme, the algorithm of the semantic association tree is as follows:

step 1), starting;

step 2), predefining a root node, and setting its child nodes under the RowKey and GeomID relationships to null;

step 3), reading the primary key Key, the spatial attribute URI and a specified feature attribute from the content library;

step 4), if the spatial attribute URI is empty, executing step 5; otherwise, executing step 6;

step 5), matching the corresponding feature attribute in the spatial data, constructing the URI of the corresponding record, and storing that URI in the corresponding attribute column of the content library;

step 6), segmenting the feature attribute text into words, and taking the root node as the parent node;

step 7), taking values from the word segmentation result set in sequence, and executing step 8, step 9 and step 10 for each;

step 8), searching the semantic association tree for the corresponding SubNode node; if the node does not exist, executing step 9 and step 10; otherwise, returning to step 7;

step 9), if the URI is empty, matching the corresponding feature attribute in the spatial data and constructing the URI of the corresponding record;

step 10), creating a node Node from the value; creating a RowKey relationship whose child node is Key, i.e. the triple [Node, RowKey, Key]; creating a GeomID relationship whose child node is URI, i.e. the triple [Node, GeomID, URI]; and taking Node as a child node and establishing a SubNode relationship with the parent node;

step 11), end.
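Steps 1 to 11 above can be read as the construction of a word tree whose nodes point back to the content library (RowKey) and to the spatial data (GeomID). The following Python sketch is one possible reading under stated assumptions, not the patented implementation: whitespace splitting stands in for word segmentation, the `resolve_uri` callback stands in for matching the feature attribute against the spatial data, the parent is assumed to advance to the matched or newly created node after each word, and the names `Node`, `Record`, `build_tree` and the `geom://` URIs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]


@dataclass
class Node:
    value: str
    row_key: Optional[str] = None     # target of the RowKey relationship
    geom_uri: Optional[str] = None    # target of the GeomID relationship
    children: Dict[str, "Node"] = field(default_factory=dict)  # SubNode links


@dataclass
class Record:
    key: str                          # primary key in the content library
    feature: str                      # the specified feature attribute (text)
    uri: Optional[str] = None         # spatial attribute URI, possibly empty


def build_tree(
    records: List[Record],
    resolve_uri: Callable[[str], Optional[str]],
    segment: Callable[[str], List[str]] = str.split,
) -> Tuple[Node, List[Triple]]:
    root = Node("ROOT")               # step 2: RowKey and GeomID start empty
    triples: List[Triple] = []
    for rec in records:               # step 3: read Key, URI and feature
        if not rec.uri:               # steps 4-5: construct a missing URI
            rec.uri = resolve_uri(rec.feature)
        parent = root                 # step 6: segmentation starts at the root
        for word in segment(rec.feature):          # step 7
            node = parent.children.get(word)       # step 8: SubNode lookup
            if node is None:
                if not rec.uri:                    # step 9: retry the match
                    rec.uri = resolve_uri(rec.feature)
                node = Node(word, rec.key, rec.uri)            # step 10
                parent.children[word] = node
                triples.append((word, "RowKey", rec.key))
                triples.append((word, "GeomID", rec.uri))
                triples.append((parent.value, "SubNode", word))
            parent = node             # assumption: descend to the matched node
    return root, triples


# Hypothetical spatial lookup table and content-library records.
spatial = {"Devonian limestone": "geom://strata/17"}
records = [
    Record("doc-001", "Devonian limestone"),
    Record("doc-002", "Devonian sandstone", "geom://strata/42"),
]
root, triples = build_tree(records, spatial.get)
for triple in triples:
    print(triple)
```

In this reading an existing node keeps the Key and URI of the record that created it, because step 8 returns to step 7 without creating new triples; a variant that attaches every matching record to the node would need an extra step that the text does not describe.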

Compared with the prior art, the beneficial effects of the invention are as follows: the spatial big data extraction, conversion and distributed storage method lets users quickly obtain the content of existing multi-source heterogeneous structured and unstructured data, and improves distributed storage efficiency by adopting mainstream big data access tools.

Content in HBase is stored by column family: the data in each column family is stored together, which keeps similar data together, effectively reduces I/O overhead during reads and writes, and greatly saves storage space after compression.

By adopting Hadoop technology, unstructured spatial data is stored and organized in a content-oriented manner, which solves the problems of homogenizing unstructured spatial data and organizing it for data mining, so that diversified and fragmented data is homogenized and integrated; unstructured data is stored as Key/Value pairs, large fields and the like, so that the spatial data can be conveniently and effectively retrieved and used later.

Drawings

FIG. 1 is a data storage processing middleware framework of the present invention;

FIG. 2 is a flow chart illustrating an embodiment of a method for implementing spatial data extraction transformation and distributed storage according to the present invention;

FIG. 3 is a graph of associations between spatial entities and data according to the present invention;

FIG. 4 is a diagram showing the data size and block size of the 1:500,000 stratigraphic unit data;

FIG. 5 shows the details of the data block storage of the 1:500,000 stratigraphic unit data.

Detailed Description

The present invention will be further described with reference to the accompanying FIGS. 1-5 and the specific embodiments so that those skilled in the art can better understand and implement it, but the embodiments are not to be construed as limiting the present invention.

The invention provides a method for storing and processing spatial data based on big data technology, which comprises the following steps:

step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into data in a general format;

step B), virtualizing the data and storing it in a spatial big data distributed storage framework for unified management.

Furthermore, the multi-source heterogeneous data sources include local file systems and relational databases; data can be imported and exported in both directions between the spatial data management platform and the big data system; and users can define associated data that links the spatial structured data and unstructured data in the big data system, laying a foundation for subsequent data analysis.

Further, the ETL tool is a data extraction, transformation and loading tool that extracts structured data of many kinds from the data sources and loads the original data into the big data container quickly and efficiently, so that data can be converted between spatial big data storage and traditional storage. According to the data type, the ETL tool is divided into three tools:

the real-time data conversion tool, which imports real-time data through a web crawler and Flume;

the custom data conversion tool, which uses the Sqoop big data access tool to improve storage efficiency, can be tailored to specific business data types, and provides a file upload function;

and the spatial data conversion tool, which converts data in spatial data formats into a general format.

Further, the distributed storage framework includes five tools, respectively:

a data association RDF graph database supporting storage of relationships between geospatial data and other types of data;

a distributed file system (HDFS), used for storing original spatial data and data documents; distributed file storage is provided on the basis of the HDFS framework to handle large amounts of unstructured data such as multimedia files, and its storage plug-ins are custom-extended to support the storage of GIS spatial data;

an HBase distributed database: by integrating the HBase database, structured or semi-structured data is stored in a way that supports conventional data tables, and GIS spatial data storage is implemented on the basis of its development interface specification; association relationships between structured and unstructured data are established in the tables, providing rich query results for subsequent data queries and fast access to file data; original documents are reorganized and then stored in the distributed real-time access database HBase, with files such as attached drawings, attached tables and attachments stored separately and the main document stored chapter by chapter; in addition, an index is built over the content stored in HBase and kept in a distributed cache (Memcached or Redis), so that a search only needs to fetch the index from memory;

a ZooKeeper coordination service, a centralized service used to maintain configuration information and naming and to provide distributed synchronization and group services;

Ambari cluster node management and monitoring, used for creating, managing and monitoring a Hadoop cluster. Ambari is a tool that makes Hadoop and related big data software easier to use; it is itself distributed software and consists mainly of two parts, the Ambari Server and the Ambari Agent. Put simply, the user instructs the Ambari Agents, through the Ambari Server, to install the corresponding software; each Agent periodically sends the state of every software module on its machine to the Ambari Server, and this state information is finally presented in the Ambari GUI, so that the user can see the state of the cluster and carry out the corresponding maintenance.

As shown in FIG. 1, the data storage processing middleware module framework of the present invention comprises the following modules:

the data source module 101: the data sources of the spatial big data comprise spatial data, internet data, log stream data, local data files, relational data and the like, the data formats of the data sources comprise GIS data, document data, image data and the like, and the data sources are stored in different types of database nodes such as a relational database, a spatial database and the like in a scattered manner.

ETL tool module 102: the ETL tool extracts, converts and loads the data sources in various formats which are stored dispersedly;

the ETL tool comprises a real-time data conversion tool, a custom data conversion tool and a space data conversion tool;

the three tools respectively extract corresponding data in the data source and convert the data into a uniform readable format;

for example, relational data is accessed using the Sqoop tool, and spatial data is accessed using the spatial data transformation tool.
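The routing performed by the ETL tool module (each data type handled by its own conversion tool, all producing one uniform readable format) can be sketched as a small dispatcher. This is an illustrative sketch only: the three converter functions are trivial stand-ins for the crawler/Flume, Sqoop and spatial conversion tools, JSON is assumed as the uniform format, and every name in the block is hypothetical.

```python
import json
from typing import Callable, Dict


def convert_realtime(payload: str) -> dict:
    # Stand-in for the real-time tool: treat the payload as one log line.
    return {"kind": "realtime", "body": payload.strip()}


def convert_custom(payload: str) -> dict:
    # Stand-in for the custom tool: treat the payload as one comma-separated row.
    return {"kind": "custom", "body": payload.split(",")}


def convert_spatial(payload: str) -> dict:
    # Stand-in for the spatial tool: parse "x y" text into a point geometry.
    x, y = (float(v) for v in payload.split())
    return {"kind": "spatial", "body": {"type": "Point", "coordinates": [x, y]}}


CONVERTERS: Dict[str, Callable[[str], dict]] = {
    "realtime": convert_realtime,
    "custom": convert_custom,
    "spatial": convert_spatial,
}


def extract_transform(kind: str, payload: str) -> str:
    """Route a payload to the converter for its data type and emit JSON."""
    try:
        converter = CONVERTERS[kind]
    except KeyError:
        raise ValueError(f"no conversion tool registered for {kind!r}") from None
    return json.dumps(converter(payload), sort_keys=True)


print(extract_transform("custom", "17,Devonian,limestone"))
print(extract_transform("spatial", "114.3 30.6"))
```

Keeping the converters in a lookup table means a new data type can be supported by registering one more function, without touching the routing code.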

HDFS distributed file system module 103: part of the data extracted and converted by the ETL tool, such as file upload data, is stored in the HDFS distributed file system in a distributed mode.

HBase distributed database module 104: partial data extracted and converted by the ETL tool, such as spatial data, real-time data and the like, are stored in the HBase distributed database in a distributed mode.

Data association RDF graph database module 105: while the ETL tool extracts and converts data from the data sources and stores it in the distributed database, data indexes and semantic directories are established and stored in the data association graph (RDF).

ZooKeeper coordination service module 106: cooperatively manages the distribution of HBase RegionServers across the nodes of the distributed environment.

Ambari cluster node management monitoring module 107: installs and monitors the nodes of the cluster in the distributed environment through a visual interface.

As shown in fig. 2, a specific embodiment of the method for implementing spatial data extraction transformation and distributed storage according to the present invention includes the following steps:

data extraction and conversion step 201: the spatial data is mainly stored in a spatial database; for example, MapGIS data is stored in a MapGIS database, the MapGIS data in the MapGIS database is imported into an HBase distributed database through a MapGIS conversion tool, and data in HBase can likewise be imported into the MapGIS database.

Data distributed storage step 202: using the MapGIS Conversion Tools for Hadoop, the MapGIS-format data in the spatial database is converted into a file format that Hadoop can manage, and the converted MapGIS spatial data is stored in the distributed database HBase; the geographic range and the annotation text content of the MapGIS-format data are extracted and stored in a content library (HBase). Extracting the annotation text content allows a map to be retrieved by its content, unlike non-vector maps, which can only be retrieved by file name; the GIS map information thus becomes a component of the content library and, together with the result data content, supports subsequent spatial big data mining.
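The benefit of extracting the geographic range and annotation text, namely that a map can be found by what it contains rather than by its file name, can be illustrated with a small inverted index. The sketch below is a simplified in-memory illustration and does not reflect how the content library is actually stored in HBase; `MapContentIndex`, the map identifiers and the coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Extent, b: Extent) -> bool:
    """True when two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class MapContentIndex:
    """Inverted index from annotation words to maps, plus each map's extent."""

    def __init__(self) -> None:
        self.extents: Dict[str, Extent] = {}
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, map_id: str, extent: Extent, annotations: Iterable[str]) -> None:
        self.extents[map_id] = extent
        for text in annotations:
            for token in text.lower().split():
                self.postings[token].add(map_id)

    def search(self, query: str, window: Optional[Extent] = None) -> List[str]:
        tokens = query.lower().split()
        if not tokens:
            return []
        # A map matches only if every query word occurs in its annotations.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        if window is not None:
            hits = {m for m in hits if overlaps(self.extents[m], window)}
        return sorted(hits)


index = MapContentIndex()
index.add("maps/wuhan_geology", (113.7, 29.9, 115.1, 31.4),
          ["Yangtze River", "Devonian limestone"])
index.add("maps/yichang_geology", (110.2, 29.9, 112.1, 31.6),
          ["Yangtze River", "Three Gorges"])

print(index.search("yangtze river"))
print(index.search("river", window=(114.0, 30.0, 114.5, 30.8)))
```

The second query combines a content match with a spatial window, which is the kind of lookup that a file-name search cannot express.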

The data association RDF step then begins: an index and a semantic directory of the spatial data are established and stored in the data association graph (RDF).

The association between entities and data is based on the concept of a graph: the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data, laying the basis for subsequent unified analysis and application.

As shown in FIG. 3, one embodiment of the association graph between spatial entities and data of the present invention comprises the following steps:

semantic association tree step 301: storing entities and their relationships in a semantic association tree; and storing triple data in the semantic association tree, wherein the triple records the relationship between the entities, the URL address of the entity resource and other information.

Resource URI step 302: the entity of step 301 and the spatial data of step 303 are connected to each other by a resource URI (unique identifier of data) and are accessible to each other.

HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map; the data in each column family is stored together, which keeps similar data together, effectively reduces I/O (input/output) overhead during reads and writes, and greatly saves storage space after compression;

a column family comprises any number of attribute values (columns) of a logical attribute group; a table has one or more column families in the horizontal direction, and each column family can consist of any number of columns; column families support dynamic expansion, the number and types need not be predefined, values are stored in binary form, and type conversion is left to the user. Column families lose as little of the information in the original data as possible, and can therefore faithfully organize and describe the data.

A table whose primary keys are the file archive number and name, containing the attributes of the archive report (e.g., archive name, geospatial range, attached charts), forms a distributed content repository.

The algorithm of the semantic association tree is further described below:

step 1), starting;

step 11), end.

A triple is a concept from data structures, used mainly as a compressed way of storing sparse matrices; it refers to a collection of the form ((x, y), z), often abbreviated as (x, y, z). The triples in this technical scheme record the relationships between entities and information such as the URL addresses at which the entity resources are located.
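Both uses of triples mentioned above can be shown side by side: (row, column, value) triples that compress a sparse matrix, and (subject, relationship, object) triples that link an entity to its resources. The following is a minimal Python sketch; the helper names and the example identifiers `doc-001` and `geom://strata/17` are hypothetical.

```python
from typing import List, Tuple

MatrixTriple = Tuple[int, int, int]     # (row, column, value)
LinkTriple = Tuple[str, str, str]       # (subject, relationship, object)


def to_triples(matrix: List[List[int]]) -> List[MatrixTriple]:
    """Keep only the non-zero cells of a matrix as (row, column, value)."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]


def from_triples(triples: List[MatrixTriple], n_rows: int, n_cols: int) -> List[List[int]]:
    """Rebuild the dense matrix; cells without a triple are zero."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        dense[r][c] = v
    return dense


def objects(links: List[LinkTriple], subject: str, relationship: str) -> List[str]:
    """Follow one relationship from a subject to the objects it points at."""
    return [o for s, p, o in links if s == subject and p == relationship]


sparse = [[0, 0, 5],
          [0, 0, 0],
          [7, 0, 0]]
packed = to_triples(sparse)
print(packed)                           # two triples instead of nine cells
print(from_triples(packed, 3, 3) == sparse)

links = [("Devonian", "RowKey", "doc-001"),
         ("Devonian", "GeomID", "geom://strata/17")]
print(objects(links, "Devonian", "GeomID"))
```

The same three-field shape serves both purposes: in the matrix case the first two fields locate a value, while in the entity case they name a node and the relationship that leads from it to a key or a URI.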

Claims

1. A space data storage processing middleware framework implementation method based on big data technology is characterized in that: which comprises the following steps:

step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into data in a general format; the data extraction and conversion step is as follows: MapGIS data is stored in a MapGIS database, the MapGIS data in the MapGIS database is imported into an HBase distributed database through a MapGIS conversion tool, and the HBase data is likewise imported into the MapGIS database;

step B), data distributed storage step: using the MapGIS Conversion Tools for Hadoop, converting the MapGIS-format data in the spatial database into a file format that Hadoop can manage; storing the converted MapGIS spatial data in the distributed database HBase; and extracting the geographic range and the annotation text content of the MapGIS-format data and storing them in a content library (HBase), wherein the extraction of the annotation text content enables the map to be retrieved according to its content, which differs from the retrieval mode of non-vector maps that can only be retrieved by file name, and the GIS map information becomes a component of the content library and, together with the result data content, supports the mining of spatial big data;

the data distributed storage step is followed by a data association RDF step: establishing an index and a semantic directory of the spatial data and storing them in a data association graph (RDF); wherein the association between entities and data is based on the concept of a graph, and the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data; the specific steps of the data association RDF comprise:

the HBase distributed storage database uses KeyValue column storage; the RowKey is the primary key of a row and uniquely identifies it, and the records in a table are sorted by RowKey; here the URL of the data file is used as the primary key; all data is accessed through the RowKey primary key;

a column family comprises any number of attribute values of a logical attribute group; a table has one or more column families in the horizontal direction, and each column family consists of any number of columns; column families support dynamic expansion, the number and types need not be predefined, values are stored in binary form, and type conversion is left to the user; column families lose as little of the information in the original data as possible, and can therefore faithfully organize and describe the data;

a table whose primary keys are the file archive number and name, the table containing the attributes of the archive report, thereby forming a distributed content library;

the algorithm of the semantic association tree is as follows:

step 1), starting;

step 11), end.