CN106708993B - Method for realizing space data storage processing middleware framework based on big data technology - Google Patents

Publication number
CN106708993B
Authority
CN
China
Prior art keywords
data
column
spatial
mapgis
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611170711.9A
Other languages
Chinese (zh)
Other versions
CN106708993A (en)
Inventor
吴信才
万波
吴亮
周顺平
胡茂胜
杨林
陈波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zondy Cyber Group Co ltd
WUHAN ZONDY CYBER TECHNOLOGY CO LTD
Original Assignee
Zondy Cyber Group Co ltd
WUHAN ZONDY CYBER TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zondy Cyber Group Co ltd, WUHAN ZONDY CYBER TECHNOLOGY CO LTD filed Critical Zondy Cyber Group Co ltd
Priority to CN201611170711.9A priority Critical patent/CN106708993B/en
Publication of CN106708993A publication Critical patent/CN106708993A/en
Application granted granted Critical
Publication of CN106708993B publication Critical patent/CN106708993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for implementing a spatial data storage and processing middleware framework based on big data technology. It gives users a way to quickly obtain the content of existing multi-source heterogeneous data, both structured and unstructured, and improves distributed storage efficiency by adopting mainstream big data access tools. The method comprises a data extraction and conversion step and a distributed data storage step: by extracting, converting and loading multi-source heterogeneous spatial data, it builds a distributed virtual storage framework for diverse, fragmented unstructured data and provides directly readable data content for subsequent spatial big data analysis and mining.

Description

Method for realizing space data storage processing middleware framework based on big data technology
Technical Field
The invention relates to a method for implementing a spatial data storage and processing middleware framework based on big data technology. It gives users a way to quickly obtain the content of existing multi-source heterogeneous data, both structured and unstructured, and improves distributed storage efficiency by adopting mainstream big data access tools.
Background
Spatial data is data that represents the position, shape, size and distribution of spatial entities. It can describe objects from the real world and carries positional, qualitative, temporal and spatial-relationship characteristics. Spatial data represents the natural world people live in, using basic spatial data structures such as points, lines, areas and solids.
Big data refers to data sets that cannot be captured, managed and processed with conventional software tools within an acceptable time. It is a high-volume, fast-growing and diverse information asset that requires new processing modes to deliver stronger decision-making power, insight discovery and process optimization.
In "The Big Data Era" by Viktor Mayer-Schönberger and Kenneth Cukier, big data means analyzing all of the data rather than taking the shortcut of random analysis (sample surveys). Big data has five "V" characteristics (proposed by IBM): Volume, Velocity, Variety, Value and Veracity.
The strategic significance of big data technology lies not in possessing huge amounts of data but in processing the meaningful data professionally. In other words, if big data is compared to an industry, the key to making that industry profitable is to improve the "processing capability" applied to the data and to add value to the data through that processing.
Technically, big data and cloud computing are as inseparable as the two sides of a coin. Big data cannot be processed by a single computer, so a distributed architecture must be adopted. Its defining feature is distributed data mining over massive data, which in turn must rely on the distributed processing, distributed databases, cloud storage and virtualization technologies of cloud computing.
With the advent of the cloud era, big data has attracted more and more attention. The term is often used to describe the large amounts of unstructured and semi-structured data a company creates, which would take too much time and money to load into a relational database for analysis. Big data analysis is often tied to cloud computing because real-time analysis of large data sets requires a MapReduce-like framework to distribute work across tens, hundreds or even thousands of computers.
Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. Distributed computing is applied in many fields today, but Hadoop stands out in four ways. (1) Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as the Amazon Elastic Compute Cloud (EC2). (2) Robust: it runs on commodity hardware, which can fail and disrupt running programs, and Hadoop is designed to handle such failures gracefully. (3) Scalable: a Hadoop cluster can be expanded simply by adding computing nodes, so that it can process larger data sets. (4) Simple: Hadoop lets users write efficient parallel code quickly. These advantages make Hadoop well suited to writing large distributed programs. A company or an individual can build a Hadoop cluster of their own from inexpensive PCs and use it to study distributed parallel computing. Because of these characteristics, Hadoop is popular in both academia and industry.
HBase is a distributed, column-oriented open-source database whose design derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. Unlike a general relational database, HBase is suited to storing unstructured data, and it uses a column-based rather than a row-based model.
HBase (the Hadoop Database) is a highly reliable, high-performance, column-oriented and scalable distributed storage system; with HBase, a large-scale structured storage cluster can be built on low-cost PC servers.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but the differences are also clear: HDFS is highly fault tolerant and suited to deployment on inexpensive machines, and it provides high-throughput data access, which makes it well suited to applications with large data sets.
HDFS supports a traditional hierarchical file organization. A user or application can create directories and store files in them. The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move or rename files. At present, HDFS supports neither user disk quotas and access permissions nor hard and soft links, but the HDFS architecture does not preclude implementing these features.
HDFS can reliably store very large files across the machines of a large cluster. It splits each file into a sequence of data blocks, all of which are the same size except the last. For fault tolerance, every block of a file is replicated. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later.
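The block and replica arithmetic described above can be illustrated with a small sketch. It is not HDFS code: the `Block` class and both functions are hypothetical, and the 128 MB block size and replication factor of 3 are illustrative values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    index: int      # position of the block within the file
    size: int       # bytes held by this block
    replicas: int   # number of copies kept in the cluster


def split_into_blocks(file_size: int, block_size: int, replication: int) -> list[Block]:
    """Split a file into blocks; every block except possibly the last is full."""
    if file_size < 0 or block_size <= 0 or replication <= 0:
        raise ValueError("invalid file size, block size or replication factor")
    blocks = []
    offset = 0
    while offset < file_size:
        size = min(block_size, file_size - offset)
        blocks.append(Block(index=len(blocks), size=size, replicas=replication))
        offset += size
    return blocks


def raw_storage(blocks: list[Block]) -> int:
    """Total bytes consumed across the cluster, counting every replica."""
    return sum(b.size * b.replicas for b in blocks)


MB = 1024 * 1024
# Illustrative values only: a 300 MB file, 128 MB blocks, 3 replicas.
blocks = split_into_blocks(file_size=300 * MB, block_size=128 * MB, replication=3)
print([b.size // MB for b in blocks])  # [128, 128, 44]
print(raw_storage(blocks) // MB)       # 900
```

Only the last block is smaller than the configured block size, and the raw storage used grows linearly with the replication factor.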
Apache Ambari is a web-based tool that supports the provisioning, management and monitoring of Apache Hadoop clusters. Ambari currently supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop and HCatalog.
ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization and group services.
ZooKeeper aims to encapsulate complex, error-prone key services and to offer users a simple, easy-to-use interface on top of a high-performance, stable system.
ETL, short for Extract-Transform-Load, describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most commonly used for data warehouses, but it is not limited to them.
ETL is an important link in building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a predefined warehouse model.
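The three ETL stages can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not the tooling used by the invention: the record fields (`name`, `lon`, `lat`) and the in-memory `warehouse` list are hypothetical stand-ins for a real source and destination.

```python
def extract(source):
    """Extract: yield raw records from a data source (here, an in-memory list)."""
    yield from source


def transform(records):
    """Transform: clean the records and map them onto the target model."""
    for rec in records:
        name = (rec.get("name") or "").strip()
        if not name:  # data cleaning: drop records without a name
            continue
        yield {"name": name.title(), "lon": float(rec["lon"]), "lat": float(rec["lat"])}


def load(records, warehouse):
    """Load: append the cleaned records to the destination."""
    for rec in records:
        warehouse.append(rec)
    return warehouse


# Hypothetical source rows: one usable record and one that cleaning removes.
source = [
    {"name": " wuhan ", "lon": "114.31", "lat": "30.59"},
    {"name": "", "lon": "0", "lat": "0"},
]
warehouse = load(transform(extract(source)), [])
print(warehouse)  # [{'name': 'Wuhan', 'lon': 114.31, 'lat': 30.59}]
```

Because each stage is a generator, records stream through the pipeline one at a time instead of being materialized between stages.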
Sqoop is an open-source tool used mainly to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL and so on). Data in a relational database (e.g., MySQL, Oracle, PostgreSQL) can be imported into Hadoop's HDFS, and HDFS data can be exported back into the relational database.
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating and transporting massive volumes of log data. It supports customizable data senders in a logging system for collecting data, and it can perform simple processing on the data and write it to various (customizable) data receivers.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide, in a distributed computer cluster environment, a method for implementing a spatial data storage and processing middleware framework based on big data technology, which builds a distributed virtual storage framework for diverse, fragmented unstructured data by extracting, converting and loading multi-source heterogeneous spatial data, and thereby provides directly readable data content for subsequent spatial big data analysis and mining.
In order to solve this technical problem, the invention discloses a method for implementing a spatial data storage and processing middleware framework based on big data technology, characterized by comprising the following steps:
step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into a general format; the data extraction and conversion step is as follows: MapGIS data is stored in a MapGIS database, the MapGIS data in the MapGIS database is imported into an HBase distributed database through a MapGIS conversion tool, and HBase data can likewise be imported into the MapGIS database;
step B), distributed data storage step: converting the MapGIS-format data in the spatial database, through the MapGIS conversion tool (MapGIS Conversion Tools for Hadoop), into a file format that Hadoop can manage; storing the converted MapGIS spatial data in the distributed database HBase; and extracting the geographic extent and the annotation text content of the MapGIS data and storing them in a content library (HBase); extracting the annotation text content allows maps to be retrieved by content, unlike non-vector maps, which can only be retrieved by file name, so that GIS map information becomes part of the content library and, together with the result data content, supports spatial big data mining.
In the above scheme, the distributed data storage step is followed by a data association (RDF) step: an index and a semantic directory of the spatial data are established and stored in the data association graph (RDF); the association between entities and data is based on the concept of a graph, so the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data.
In the above scheme, the data association (RDF) step specifically includes:
semantic association tree step 301: storing entities and their relationships in a semantic association tree; the semantic association tree stores triple data, where each triple records the relationship between entities and the URL address of the entity resource;
resource URI step 302: the entities of step 301 and the spatial data of step 303 are linked to each other by resource URIs and can be accessed from each other;
HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map; the data in each column family is stored together, which effectively reduces I/O overhead during reads and writes and keeps similar data together;
the HBase distributed storage database uses KeyValue column storage; the RowKey is the primary key of a row and identifies a unique row, and the records in a table are sorted by RowKey; here, the data file URL is used as the primary key; all data is accessed through the RowKey, and one wide row can contain all data related to one primary key;
a KeyValue is a key-value pair consisting of a column name and a column value, and multiple KeyValues form a column family (Column-family);
a column family groups the attribute values (columns) of a logical attribute group; horizontally, a table has one or more column families, and each column family can consist of any number of columns; column families support dynamic expansion, so the number and types of columns need not be predefined; values are stored in binary form and type conversion is left to the user; column families avoid losing information from the original data as far as possible, so they can faithfully organize and describe the data;
and a table with the file archive number and name as the primary key contains the attributes of the archive report, thereby forming the distributed content library.
In the above scheme, the algorithm of the semantic association tree is as follows:
step 1), starting;
step 2), predefining a root node whose child nodes for the RowKey and GeomID relationships are empty;
step 3), reading the primary key (Key), the spatial attribute URI and the specified feature attribute from the content library;
step 4), if the spatial attribute URI is empty, executing step 5; otherwise, executing step 6;
step 5), matching the corresponding feature attribute in the spatial data, constructing the URI of the matching record, and storing that URI in the corresponding attribute column of the content library;
step 6), performing word segmentation on the feature attribute text and taking the root node as the parent node;
step 7), taking values from the word segmentation result set in sequence, and executing steps 8, 9 and 10 for each value;
step 8), searching the semantic association tree for a SubNode node corresponding to the value; if no such node exists, executing steps 9 and 10; otherwise, returning to step 7;
step 9), if the URI is empty, matching the corresponding feature attribute in the spatial data and constructing the URI of the matching record;
step 10), creating a node (Node) from the value; creating a child node Key with the relationship RowKey, i.e. the triple [Node, RowKey, Key]; creating a child node URI with the relationship GeomID, i.e. the triple [Node, GeomID, URI]; and attaching Node as a child of the parent node through a SubNode relationship;
step 11), end.
Compared with the prior art, the invention has the following beneficial effects: the spatial big data extraction, conversion and distributed storage method gives users a way to quickly obtain the content of existing multi-source heterogeneous data, both structured and unstructured, and improves distributed storage efficiency by adopting mainstream big data access tools.
Content in HBase is stored by column family; the data in each column family is stored together, which effectively reduces I/O overhead during reads and writes, keeps similar data together, and saves considerable storage space after compression.
By adopting Hadoop technology, unstructured spatial data is stored and organized in a content-oriented way, which solves the problems of homogenizing unstructured spatial data and organizing it for data mining, so that diverse, fragmented data is homogenized and integrated; unstructured data is stored as Key/Value pairs, large fields and the like, so that the spatial data can be obtained and used conveniently and effectively later on.
Drawings
FIG. 1 is a data storage processing middleware framework of the present invention;
FIG. 2 is a flow chart illustrating an embodiment of a method for implementing spatial data extraction transformation and distributed storage according to the present invention;
FIG. 3 is a graph of associations between spatial entities and data according to the present invention;
FIG. 4 shows the data size and block size of the 1:500,000 stratigraphic unit data;
FIG. 5 shows the data block storage details of the 1:500,000 stratigraphic unit data.
Detailed Description
The present invention is further described below with reference to FIGS. 1-5 and specific embodiments, so that those skilled in the art can better understand and implement it; the embodiments are not to be construed as limiting the present invention.
The invention provides a method for storing and processing spatial data based on big data technology, which comprises the following steps:
step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into a general format;
step B), virtualizing the data and storing it in the spatial big data distributed storage framework for unified management.
Furthermore, the multi-source heterogeneous data sources include local file systems and relational databases; data is imported and exported in both directions between the spatial data management platform and the big data system; and users define custom associations that link the structured and unstructured spatial data in the big data system, laying a foundation for subsequent data analysis.
Further, the ETL tool is a data extraction, transformation and loading tool. It extracts data of multiple structures from the data sources and loads the original data into the big data container quickly and efficiently, so that data can be converted between spatial big data storage and traditional storage. Depending on the data type, the ETL tool is divided into three tools:
the real-time data conversion tool, which imports real-time data through a web crawler and Flume;
the custom data conversion tool, which uses the Sqoop big data access tool to improve storage efficiency, can be tailored to specific business data types, and provides a file upload function;
and the spatial data conversion tool, which converts data in spatial data formats into a general format.
Further, the distributed storage framework includes five tools, respectively:
a data association RDF graph database supporting storage of relationships between geospatial data and other types of data;
the distributed file system (HDFS), which stores the original spatial data and data documents; distributed file storage based on the HDFS architecture handles large amounts of unstructured data such as multimedia files, and its storage plug-ins are extended through customization to support storage of GIS spatial data;
the HBase distributed database: by integrating HBase, structured or semi-structured data is stored in conventional data tables, and GIS spatial data storage is implemented on top of its development interface specification; associations between structured and unstructured data are established in the tables, which provides rich results for subsequent queries and fast access to file data; original documents are reorganized and then stored in HBase, the distributed real-time access database, with attached figures, tables and other attachments stored separately and the main document stored chapter by chapter; in addition, an index is built over the content stored in HBase and kept in a distributed cache (Memcached or Redis), so that a search only needs to read the index from memory;
the ZooKeeper coordination service, a centralized service that maintains configuration information and naming and provides distributed synchronization and group services;
Ambari cluster node management and monitoring, used to create, manage and monitor Hadoop clusters; Ambari is a tool that makes Hadoop and related big data software easier to use, is itself distributed software, and consists of two main parts: the Ambari Server and the Ambari Agent. Put simply, the user instructs the Ambari Agents, through the Ambari Server, to install the required software; each Agent periodically reports the state of every software module on its machine to the Ambari Server, and this state information is presented in the Ambari GUI so that the user can follow the state of the cluster and carry out maintenance.
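The idea of keeping an index over the stored content in memory, described for the HBase store in the list above, can be sketched as follows. This is an assumption-laden illustration: plain dictionaries stand in for both the HBase table (`store`) and the cache entry (`index`), the row keys are invented, and no Memcached or Redis client is used.

```python
from collections import defaultdict


def build_index(store):
    """Map each lowercase term to the sorted row keys whose text contains it."""
    index = defaultdict(set)
    for row_key, text in store.items():
        for term in text.lower().split():
            index[term].add(row_key)
    return {term: sorted(keys) for term, keys in index.items()}


def lookup(index, store, term):
    """Resolve a term through the in-memory index, then fetch only matching rows."""
    return {key: store[key] for key in index.get(term.lower(), [])}


# Hypothetical content: report chapters stored under separate row keys.
store = {
    "report-001/ch1": "regional geology overview",
    "report-001/ch2": "stratigraphic units and faults",
    "report-002/ch1": "geology of the survey area",
}
index = build_index(store)
print(lookup(index, store, "geology"))
```

The search touches only the index to find candidate row keys, so the backing table is read for the matching rows alone.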
As shown in FIG. 1, the data storage processing middleware module framework of the present invention comprises the following modules:
the data source module 101: the data sources of spatial big data include spatial data, internet data, log stream data, local data files, relational data and so on; their formats include GIS data, document data, image data and so on; and they are scattered across different types of database nodes such as relational databases and spatial databases.
ETL tool module 102: the ETL tool extracts, converts and loads the scattered data sources in their various formats;
the ETL tool comprises a real-time data conversion tool, a custom data conversion tool and a spatial data conversion tool;
the three tools each extract the corresponding data from the data sources and convert it into a uniform readable format;
for example, relational data is accessed with the Sqoop tool, and spatial data is accessed with the spatial data conversion tool.
HDFS distributed file system module 103: part of the data extracted and converted by the ETL tool, such as uploaded files, is stored in a distributed manner in the HDFS distributed file system.
HBase distributed database module 104: part of the data extracted and converted by the ETL tool, such as spatial data and real-time data, is stored in a distributed manner in the HBase distributed database.
Data association RDF graph database module 105: while the ETL tool extracts and converts the source data and stores it in the distributed database, data indexes and semantic directories are built and stored in the data association graph (RDF).
ZooKeeper coordination service module 106: coordinates the distribution of HBase RegionServers across multiple nodes in the distributed environment.
Ambari cluster node management and monitoring module 107: visually installs and monitors the nodes of the cluster in the distributed environment.
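The way ETL tool module 102 routes each data source to one of its three conversion tools can be sketched as a lookup table. Everything below is hypothetical: the converter functions are placeholders that only tag the payload, and the source-type names in `ROUTES` are assumptions derived from the data sources listed for module 101.

```python
from typing import Callable


# Placeholder converters standing in for the three ETL tools named above.
def convert_realtime(payload: dict) -> dict:
    return {"kind": "realtime", "body": payload}


def convert_custom(payload: dict) -> dict:
    return {"kind": "custom", "body": payload}


def convert_spatial(payload: dict) -> dict:
    return {"kind": "spatial", "body": payload}


# Assumed mapping from source type to tool; the type names are illustrative.
ROUTES: dict[str, Callable[[dict], dict]] = {
    "internet": convert_realtime,    # web crawler input
    "log_stream": convert_realtime,  # Flume-style log input
    "relational": convert_custom,    # Sqoop-style database access
    "local_file": convert_custom,    # file upload
    "spatial": convert_spatial,      # spatial data formats
}


def run_etl(source_type: str, payload: dict) -> dict:
    """Pick the conversion tool for a source type and apply it to the payload."""
    try:
        converter = ROUTES[source_type]
    except KeyError:
        raise ValueError(f"no ETL tool registered for source type {source_type!r}") from None
    return converter(payload)


print(run_etl("relational", {"table": "boreholes"})["kind"])  # custom
print(run_etl("spatial", {"layer": "strata"})["kind"])        # spatial
```

Keeping the routing in a table means a new source type can be supported by registering one more entry rather than editing the dispatch logic.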
As shown in fig. 2, a specific embodiment of the method for implementing spatial data extraction transformation and distributed storage according to the present invention includes the following steps:
data extraction and conversion step 201: the spatial data is mainly stored in spatial databases; for example, MapGIS data is stored in a MapGIS database, the MapGIS data in the MapGIS database is imported into an HBase distributed database through a MapGIS conversion tool, and HBase data can likewise be imported into the MapGIS database.
Distributed data storage step 202: the MapGIS-format data in the spatial database is converted, through the MapGIS Conversion Tools for Hadoop, into a file format that Hadoop can manage; the converted MapGIS spatial data is stored in the distributed database HBase; and the geographic extent and the annotation text content of the MapGIS data are extracted and stored in a content library (HBase). Extracting the annotation text content allows maps to be retrieved by content, unlike non-vector maps, which can only be retrieved by file name, so that GIS map information becomes part of the content library and, together with the result data content, supports subsequent spatial big data mining.
The data association (RDF) step then begins: an index and a semantic directory of the spatial data are established and stored in the data association graph (RDF).
The association between entities and data is based on the concept of a graph; the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data, which lays the basis for subsequent unified analysis and application.
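Content-based map retrieval, as enabled by storing the geographic extent and annotation text in step 202, can be sketched as follows. The `MapRecord` layout, the row keys and the sample annotations are hypothetical, and a simple list replaces the HBase content library.

```python
from dataclasses import dataclass


@dataclass
class MapRecord:
    row_key: str                               # hypothetical key, e.g. derived from a file URL
    extent: tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
    annotations: list[str]                     # annotation text extracted from the map


def intersects(a, b) -> bool:
    """True when two (min_x, min_y, max_x, max_y) rectangles overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, keyword=None, window=None):
    """Return row keys whose annotations contain keyword and whose extent meets window."""
    hits = []
    for rec in records:
        if keyword is not None and not any(keyword.lower() in t.lower() for t in rec.annotations):
            continue
        if window is not None and not intersects(rec.extent, window):
            continue
        hits.append(rec.row_key)
    return hits


# Invented sample records for illustration.
library = [
    MapRecord("maps/geology_a", (110.0, 29.0, 112.0, 31.0), ["Yangtze River", "Fault zone F1"]),
    MapRecord("maps/geology_b", (113.0, 30.0, 115.0, 31.0), ["Wuhan", "Yangtze River"]),
]
print(search(library, keyword="yangtze"))
# ['maps/geology_a', 'maps/geology_b']
print(search(library, keyword="yangtze", window=(114.0, 30.0, 114.5, 30.5)))
# ['maps/geology_b']
```

A map found this way is matched on what it contains and where it lies, not on what its file happens to be called.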
As shown in FIG. 3, one embodiment of the association graph between spatial entities and data of the present invention comprises the following steps:
semantic association tree step 301: entities and their relationships are stored in a semantic association tree; the semantic association tree stores triple data, where each triple records the relationship between entities, the URL address of the entity resource, and other information.
Resource URI step 302: the entities of step 301 and the spatial data of step 303 are linked to each other by resource URIs (unique identifiers of the data) and can be accessed from each other.
HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map; the data in each column family is stored together, which effectively reduces I/O (input/output) overhead during reads and writes, keeps similar data together, and saves considerable storage space after compression;
the HBase distributed storage database uses KeyValue column storage; the RowKey is the primary key of a row and identifies a unique row, and the records in a table are sorted by RowKey; here, the data file URL is used as the primary key; all data is accessed through the RowKey, and one wide row can contain all data related to one primary key;
a KeyValue is a key-value pair consisting of a column name and a column value, and multiple KeyValues form a column family (Column-family);
a column family groups the attribute values (columns) of a logical attribute group; horizontally, a table has one or more column families, and each column family can consist of any number of columns; column families support dynamic expansion, so the number and types of columns need not be predefined; values are stored in binary form and type conversion is left to the user. Column families avoid losing information from the original data as far as possible, so they can faithfully organize and describe the data.
A table with the file archive number and name as the primary key, containing the attributes of the archive report (e.g., archive name, geospatial extent, attached charts), forms the distributed content library.
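The wide-row model described in step 303 can be illustrated with a toy table. This sketch is not the HBase API: `WideRowTable` is a hypothetical in-memory class, and the row keys, family names and values are invented. It only mirrors the properties stated above, namely rows sorted by RowKey, columns grouped into families, columns added without a predefined schema, and values held as raw bytes.

```python
class WideRowTable:
    """Toy model of a sparse, sorted, column-family table (not the HBase API)."""

    def __init__(self):
        self.rows = {}  # row key -> {family: {column: bytes}}

    def put(self, row_key, family, column, value):
        """Store one cell; callers must encode values to bytes themselves."""
        if not isinstance(value, bytes):
            raise TypeError("values are stored as raw bytes; encode before writing")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family, column):
        """Return the cell value, or None when the row or column is absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(column)

    def scan(self, start="", stop=None):
        """Yield (row_key, families) in row-key order, like a range scan."""
        for key in sorted(self.rows):
            if key >= start and (stop is None or key < stop):
                yield key, self.rows[key]


table = WideRowTable()
# Hypothetical rows keyed by a data file URL, as the text suggests.
table.put("hdfs://docs/report-002", "meta", "name", "Survey B".encode("utf-8"))
table.put("hdfs://docs/report-001", "meta", "name", "Survey A".encode("utf-8"))
table.put("hdfs://docs/report-001", "geo", "extent", b"110,29,112,31")

print([key for key, _ in table.scan()])
# ['hdfs://docs/report-001', 'hdfs://docs/report-002']
print(table.get("hdfs://docs/report-001", "meta", "name").decode("utf-8"))
# Survey A
```

The second row has no `geo` family at all, which is what "sparse" means here: absent cells cost nothing, and type conversion (the `encode` and `decode` calls) stays with the caller.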
The algorithm of the semantic association tree is further described below:
step 1), starting;
step 2), predefining a root node whose child nodes for the RowKey and GeomID relationships are empty;
step 3), reading the primary key (Key), the spatial attribute URI and the specified feature attribute from the content library;
step 4), if the spatial attribute URI is empty, executing step 5; otherwise, executing step 6;
step 5), matching the corresponding feature attribute in the spatial data, constructing the URI of the matching record, and storing that URI in the corresponding attribute column of the content library;
step 6), performing word segmentation on the feature attribute text and taking the root node as the parent node;
step 7), taking values from the word segmentation result set in sequence, and executing steps 8, 9 and 10 for each value;
step 8), searching the semantic association tree for a SubNode node corresponding to the value; if no such node exists, executing steps 9 and 10; otherwise, returning to step 7;
step 9), if the URI is empty, matching the corresponding feature attribute in the spatial data and constructing the URI of the matching record;
step 10), creating a node (Node) from the value; creating a child node Key with the relationship RowKey, i.e. the triple [Node, RowKey, Key]; creating a child node URI with the relationship GeomID, i.e. the triple [Node, GeomID, URI]; and attaching Node as a child of the parent node through a SubNode relationship;
step 11), end.
A triple is a data-structure concept, best known as a compressed storage form for sparse matrices: a collection of elements of the form ((x, y), z), often abbreviated as (x, y, z). In this technical scheme, a triple records the relationship between entities together with information such as the URL address where an entity resource is located.
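The construction steps of the semantic association tree can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `geom://` URI scheme, the dictionary-based content rows and spatial records, and whitespace splitting as a stand-in for word segmentation are all hypothetical. The sketch also assumes that a GeomID triple is simply omitted when no URI can be constructed, which the source does not specify.

```python
ROOT = "root"  # step 2: predefined root node


def build_uri(feature, spatial_records):
    """Steps 5 and 9: match the feature attribute and build a record URI.

    The "geom://" scheme is a placeholder, not something the source defines.
    """
    geom_id = spatial_records.get(feature)
    return None if geom_id is None else f"geom://{geom_id}"


def build_semantic_tree(content_rows, spatial_records, segment=str.split):
    """Return the tree as a set of (subject, relation, object) triples."""
    triples = set()  # root starts with no RowKey/GeomID children
    for row in content_rows:  # step 3: read key, URI and feature attribute
        key, feature = row["key"], row["feature"]
        if not row.get("uri"):  # step 4: URI empty?
            # step 5: construct the URI and write it back to the content row
            row["uri"] = build_uri(feature, spatial_records)
        uri = row["uri"]
        parent = ROOT  # step 6: root is the parent node
        for word in segment(feature):  # step 7: iterate segmented words
            if (parent, "SubNode", word) in triples:  # step 8: node exists
                continue
            if not uri:  # step 9: repeat the lookup, as the listed steps do
                uri = build_uri(feature, spatial_records)
            # step 10: create the node and its three relationships
            triples.add((word, "RowKey", key))
            if uri:  # assumption: skip GeomID when no URI could be built
                triples.add((word, "GeomID", uri))
            triples.add((parent, "SubNode", word))
    return triples


# Hypothetical content-library rows and spatial records.
rows = [
    {"key": "0001|Survey A", "uri": None, "feature": "Wuhan geology"},
    {"key": "0002|Survey B", "uri": "geom://7", "feature": "Wuhan hydrology"},
]
spatial = {"Wuhan geology": 42}
tree = build_semantic_tree(rows, spatial)
```

Followed literally, step 8 skips a word whose node already exists, so in this sketch the node `Wuhan` stays linked only to the first record that introduced it.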

Claims (1)

1. A method for implementing a spatial data storage and processing middleware framework based on big data technology, characterized in that it comprises the following steps:
step A), for multi-source heterogeneous spatial data and system data of large volume, extracting the data with an ETL (data extraction and conversion) tool and converting it into data of a general format; the data extraction and conversion steps are as follows: MapGIS data is stored in a MapGIS database; the MapGIS data in the MapGIS database is imported into the HBase distributed database through a MapGIS conversion tool, and, in the other direction, HBase data is imported into the MapGIS database;
step B), data distributed storage step: converting the MapGIS-format data in the spatial database, through MapGIS Conversion Tools, into a file format that Hadoop can manage; storing the converted MapGIS spatial data in the distributed database HBase; and extracting the geographic range and the annotation text content of the MapGIS-format data and storing them in a content library (HBase); extracting the annotation text content allows a map to be retrieved by its content, so that GIS map information becomes a component of the content library and, together with the result data content, supports the mining of spatial big data, in contrast to non-vector maps, which can only be retrieved by file name;
the data distributed storage step is followed by a data association RDF step: establishing an index and a semantic directory of the spatial data, and storing them in a data association graph (RDF); the association between entities and data is based on the concept of a graph, and the data association graph can associate a spatial geographic entity with a large amount of structured or unstructured data; the data association RDF step specifically comprises:
semantic association tree step 301: storing entities and their relationships in a semantic association tree; the semantic association tree stores triple data, where each triple records the relationship between entities and the URL address information of the entity resources;
resource URI step 302: the entities of step 301 and the spatial data of step 303 are connected to each other through resource URIs and are accessible from each other;
HBase distributed storage step 303: HBase is a column-oriented, sparse, distributed, multidimensional sorted map; the data in each column family is stored together, so that similar data is kept together and I/O overhead during reads and writes is effectively reduced;
the HBase distributed storage database adopts KeyValue columnar storage; the RowKey is the primary key of a row and uniquely identifies that row, and the records in a table are sorted by RowKey; here, the data file URL is used as the primary key; all data is accessed through the RowKey primary key;
a KeyValue is a key-value pair consisting of the column name and the column value of a column, and a plurality of KeyValues form a Column-family;
the Column-family holds the attribute values of a logical attribute group; a table has one or more column families in the horizontal direction, and each column family consists of any number of columns; column families support dynamic expansion, the number and types of columns need not be predefined, values are stored as binary, and type conversion is carried out by the user; the Column-family preserves as much of the information in the original data as possible, so the data can be faithfully organized and described;
a table with archive file numbers and names as primary keys, containing the attributes of the archive reports, thereby forms a distributed content library;
the algorithm of the semantic association tree is as follows:
step 1), start;
step 2), predefine a root node, and set its child nodes under the RowKey and GeomID relationships to null;
step 3), read the primary key Key, the spatial attribute URI and the specified feature attribute from the content library;
step 4), if the spatial attribute URI is empty, execute step 5; otherwise, execute step 6;
step 5), match the corresponding feature attribute in the spatial data, construct the URI of the matching record, and store that URI in the corresponding attribute column of the content library;
step 6), segment the feature attribute text into words, and take the root node as the parent node;
step 7), take values from the word segmentation result set in sequence, and for each value execute step 8, step 9 and step 10;
step 8), search the semantic association tree for the corresponding SubNode node; if the node does not exist, execute step 9 and step 10; otherwise, return to step 7;
step 9), if the URI is empty, match the corresponding feature attribute in the spatial data and construct the URI of the matching record;
step 10), create a Node from the value; create a child node Key under the RowKey relationship, i.e. the triple [Node, RowKey, Key]; create a child node URI under the GeomID relationship, i.e. the triple [Node, GeomID, URI]; then attach the Node as a child node by establishing a SubNode relationship with the parent node;
step 11), end.
CN201611170711.9A 2016-12-16 2016-12-16 Method for realizing space data storage processing middleware framework based on big data technology Active CN106708993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611170711.9A CN106708993B (en) 2016-12-16 2016-12-16 Method for realizing space data storage processing middleware framework based on big data technology

Publications (2)

Publication Number Publication Date
CN106708993A (en) 2017-05-24
CN106708993B (en) 2021-06-08

Family

ID=58939039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611170711.9A Active CN106708993B (en) 2016-12-16 2016-12-16 Method for realizing space data storage processing middleware framework based on big data technology

Country Status (1)

Country Link
CN (1) CN106708993B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038261B (en) * 2017-05-28 2019-09-20 海南大学 A kind of processing framework resource based on data map, Information Atlas and knowledge mapping can Dynamic and Abstract Semantic Modeling Method
CN107133369A (en) * 2017-06-16 2017-09-05 郑州云海信息技术有限公司 A kind of distributed reading shared buffer memory aging method based on the expired keys of redis
CN107194007A (en) * 2017-06-20 2017-09-22 哈尔滨工业大学 A kind of integrated management system of spacecraft isomery test data
CN108491364A (en) * 2018-01-25 2018-09-04 苏州麦迪斯顿医疗科技股份有限公司 Medical treatment and nursing paperwork management system
CN108920519A (en) * 2018-06-04 2018-11-30 贵州数据宝网络科技有限公司 One-to-many data supply system and method
CN109344212A (en) * 2018-08-24 2019-02-15 武汉中地数码科技有限公司 A kind of geographical big data of subject-oriented feature excavates the method and system of recommendation
CN109254989B (en) * 2018-08-27 2020-11-20 望海康信(北京)科技股份公司 Elastic ETL (extract transform load) architecture design method and device based on metadata drive
CN109446296A (en) * 2018-09-10 2019-03-08 上海勋立信息科技有限公司 A kind of magnanimity unstructured data treating method and apparatus
CN110427446B (en) * 2019-08-02 2023-05-16 武汉中地数码科技有限公司 Method and system for rapidly publishing and browsing mass image services
CN112749216B (en) * 2019-10-30 2024-07-26 北京国双科技有限公司 Data importing method, device and equipment based on rule analysis
CN111190602A (en) * 2019-12-30 2020-05-22 富通云腾科技有限公司 Heterogeneous cloud resource-oriented conversion method
CN111310230B (en) * 2020-02-10 2023-04-14 腾讯云计算(北京)有限责任公司 Spatial data processing method, device, equipment and medium
CN111680041B (en) * 2020-05-31 2023-11-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Safety high-efficiency access method for heterogeneous data
CN111858483A (en) * 2020-07-29 2020-10-30 湖南泛联新安信息科技有限公司 Software sample hybrid storage system based on multiple databases and file systems
CN112463837B (en) * 2020-12-17 2022-08-16 四川长虹电器股份有限公司 Relational database data storage query method
CN113378219B (en) * 2021-06-07 2024-05-28 北京许继电气有限公司 Unstructured data processing method and system
CN116881244B (en) * 2023-06-05 2024-03-26 易智瑞信息技术有限公司 Real-time processing method and device for space data based on column storage database

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183834A (en) * 2015-08-31 2015-12-23 上海电科智能系统股份有限公司 Ontology library based transportation big data semantic application service method
CN105468702A (en) * 2015-11-18 2016-04-06 中国科学院计算机网络信息中心 Large-scale RDF data association path discovery method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436192B (en) * 2007-11-16 2011-03-16 国际商业机器公司 Method and apparatus for optimizing inquiry aiming at vertical storage type database
CN101826100A (en) * 2010-03-16 2010-09-08 中国测绘科学研究院 Automatic integrated system and method of wide area network (WAN)-oriented multisource emergency information
CN103678665B (en) * 2013-12-24 2016-09-07 焦点科技股份有限公司 A kind of big data integration method of isomery based on data warehouse and system
CN104598606A (en) * 2015-01-30 2015-05-06 北京东方泰坦科技股份有限公司 Integration method aiming at dynamic heterogeneous spatial information plotting data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
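Li Jianhua (李建华) et al.; "Research on the structured design of pattern-oriented spatial data storage middleware" (面向模式的空间数据存储中间件结构化设计研究); 《测绘信息与工程》; 2005-09-08 (No. 4); pp. 22-24 *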

Also Published As

Publication number Publication date
CN106708993A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106611046B (en) Spatial data storage processing middleware system based on big data technology
CN106708993B (en) Method for realizing space data storage processing middleware framework based on big data technology
US11816126B2 (en) Large scale unstructured database systems
Makris et al. A classification of NoSQL data stores based on key design characteristics
Padhy et al. RDBMS to NoSQL: reviewing some next-generation non-relational database’s
Tauro et al. Comparative study of the new generation, agile, scalable, high performance NOSQL databases
CN107451225B (en) Scalable analytics platform for semi-structured data
Zafar et al. Big data: the NoSQL and RDBMS review
CN110633186A (en) Log monitoring system for electric power metering micro-service architecture and implementation method
Liang et al. Express supervision system based on NodeJS and MongoDB
Chavan et al. Survey paper on big data
EP2973051A2 (en) Scalable analysis platform for semi-structured data
Mohammed et al. A review of big data environment and its related technologies
Hashem et al. An Integrative Modeling of BigData Processing.
Gao et al. Geospatial data storage based on HBase and MapReduce
US12061579B2 (en) Database gateway with machine learning model
Pothuganti Big data analytics: Hadoop-Map reduce & NoSQL databases
Dhanda Big data storage and analysis
Asaad et al. NoSQL databases: yearning for disambiguation
Dutta Distributed computing technologies in big data analytics
Gopalan et al. MYSQL to cassandra conversion engine
Chai et al. A document-based data warehousing approach for large scale data mining
Saxena et al. NoSQL Databases-Analysis, Techniques, and Classification
CN110569310A (en) Management method of relational big data in cloud computing environment
Aljarallah Comparative study of database modeling approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant