CN108446293B - Method for constructing city portrait based on city multi-source heterogeneous data - Google Patents

Info

Publication number
CN108446293B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810057801.XA
Other languages
Chinese (zh)
Other versions
CN108446293A (en)
Inventor
胡青阳
张一杨
袁祖瑞
舒元昊
邵建伟
张芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETHIK Group Ltd
Original Assignee
CETHIK Group Ltd
Application filed by CETHIK Group Ltd
Priority to CN201810057801.XA
Publication of CN108446293A
Application granted
Publication of CN108446293B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for constructing a city portrait based on city multi-source heterogeneous data. The method first obtains original data from a city multi-source heterogeneous data interface and publishes the original data to a specified internal classification of a message queue. It then establishes a data rasterization conversion module corresponding to the urban multi-source heterogeneous data, extracts data from the internal classification of the message queue, and performs rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, service characteristics and spatio-temporal attributes of the data. Finally, the raster data formed after rasterization conversion and data completion are stored in a target database designed according to the normalized raster data model, forming a city portrait composed of urban multi-source heterogeneous data. The method maps multi-source heterogeneous data in the urban space into a uniform and regular grid space, realizes the association and fusion of the heterogeneous data, and can provide basic data for specific smart-city applications.

Description

Method for constructing city portrait based on city multi-source heterogeneous data
Technical Field
The invention belongs to the technical field of big data processing, and particularly relates to a method for constructing a city portrait based on city multi-source heterogeneous data.
Background
Currently, urban development faces challenges such as population expansion, environmental deterioration, frequent public health events, traffic congestion, resource waste, etc., and smart cities are a future development trend.
The smart city is based on the combination of information technology and the internet. By means of various intelligent applications, it improves the operating efficiency of city infrastructure, raises the level of city operation management and public services, and makes people's lives better. In the construction and management of smart cities, technologies such as the internet of things, cloud computing and big data play increasingly important roles. In a smart city, many infrastructures and devices that adopt internet of things technology have sensing and monitoring functions, so they generate a large amount of data. The data come from a wide range of sources and have various structures. They cover big data resources such as intelligent transportation, intelligent medical treatment, intelligent buildings, intelligent power grids, intelligent agriculture, intelligent security, intelligent environmental protection, intelligent tourism, intelligent education and intelligent water affairs, and touch every aspect of smart-city applications. They are mainly massive structured or unstructured data generated through channels such as the internet, sensing equipment, video monitoring, mobile equipment, intelligent equipment and non-traditional information equipment, and they constantly permeate every aspect of the daily management and operation of cities. These data form the big data of the smart city, which are important information resources supporting smart-city development. City operation signs are expressed quantitatively through the data; collecting the data of all departments about city operation signs by means of big data technology helps city managers summarize and analyze the data and, ultimately, manage the quantified form of the city signs, namely the various types of data.
In the practice of smart cities, how to collect, maintain and apply the multi-source and heterogeneous data becomes a key link and a necessary premise for further intelligent analysis and processing of the data. However, most of the existing urban data management technologies are based on a specific application scenario, focus on and use data in a certain topic, and lack overall planning and integration of urban data.
Disclosure of Invention
The invention aims to provide a method for constructing city portraits based on city multi-source heterogeneous data, which manages the multi-source heterogeneous city data through aggregation, accumulation and standardized storage, realizes the association and fusion of the multi-source heterogeneous data, and provides high-efficiency data service on the basis, thereby constructing multi-dimensional data 'city portraits' capable of serving deep learning, training and prediction and omnibearing city data display.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a method for constructing a city portrait based on city multi-source heterogeneous data comprises the following steps:
designing a normalized raster data model corresponding to the urban multi-source heterogeneous data;
acquiring original data from a city multi-source heterogeneous data interface, preprocessing the acquired original data, serializing a preprocessing result, and publishing the preprocessing result into a specified internal classification of a message queue;
establishing a data rasterization conversion module corresponding to urban multi-source heterogeneous data, extracting data from the internal classification of the message queue, and performing rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, service characteristics and spatio-temporal attributes of the data;
and storing the grid data formed after grid conversion and data completion into a target database designed according to a normalized grid data model to form a city portrait formed by multi-source heterogeneous data of the city, and providing a corresponding data interface according to the characteristics of the grid data to serve the deep learning algorithm and the application of the smart city.
Further, the normalized raster data model divides the urban space into a plurality of grids, maps the data into each corresponding grid according to the spatial distribution of the data, numbers all the grids, and uses the grid numbers as the spatial information of the data in the target database.
Further, the target database is a grid database, and the grid database comprises an implementation of a relational database and an implementation of a non-relational database; the relational database establishes an independent data table for each type of data, and in the data table the grid number and the time information corresponding to the data are each a column, and each attribute in the data corresponds to a column; the non-relational database uses row keys to represent data attributes and time information, uses columns to represent spatial information, and uses the grid numbers as the column names.
Further, the row key of the non-relational database comprises a character string obtained by encoding the data attribute by adopting a hash algorithm, and time information corresponding to the data.
Further, performing rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, the service characteristics and the time-space attributes of the data, and including:
for sensing data that are distributed at point coordinates in space, change over time, and have real values only at a limited number of positions, projecting the real data into the corresponding grids according to the coordinates of the sensing data, and on that basis calculating and completing the data in the remaining grids by using the similarity of the data in the spatial and temporal dimensions;
for data which are distributed in a space and only have position attributes, respectively counting the number of each type of data in each grid, and converting the position distribution data into rasterized density distribution data;
for data of which the original data has grid characteristics, the original grid size of the data is converted into grid data with uniform size in a normalized grid data model, and the original grid data distribution is mapped into a target grid in the conversion process according to the percentage of types or numerical values.
Further, the preprocessing the acquired raw data includes: error information caused by data source exception is filtered.
Further, the internal classifications are classified and established according to data sources, or contents, or places, or time, each internal classification corresponds to a preset standard format, and the preset standard format comprises an attribute category, a data storage type and a serialization mode.
Further, the extracting data from the internal classification of the message queue further includes:
and preprocessing the data extracted from the internal classification to remove duplicate and irrelevant data.
Further, the target database comprises a grid database and a data warehouse, and the method for constructing the city portrait based on the city multi-source heterogeneous data comprises the following steps:
and establishing a data rasterization conversion module corresponding to the urban multi-source heterogeneous data, extracting data from the internal classification of the message queue, and storing the extracted data into a data warehouse in a target database.
According to the method for constructing the city portrait based on the city multi-source heterogeneous data, the normalized grid data model corresponding to the city multi-source heterogeneous data is constructed, the city multi-source heterogeneous data is subjected to correlation fusion processing according to the normalized grid data model to obtain the city portrait, the multi-source heterogeneous data in the city space is mapped into the uniform and regular grid space, the goal of correlation fusion of heterogeneous data is achieved, and basic data can be provided for specific application of smart cities.
Drawings
FIG. 1 is a flow chart of a method for constructing a city portrait based on city multi-source heterogeneous data according to the present invention.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the drawings and examples, which should not be construed as limiting the present invention.
The urban multi-source heterogeneous data addressed by this technical scheme cover all acquirable data generated in urban operation, including various sensing data accessed in real time, map data, remote sensing image data and information derived from them, basic urban surface data and statistical data recorded by government departments, urban traffic flow data, air quality data, meteorological data, POI distribution data, land utilization data and the like. These data come from different sources, have various formats, and have different temporal and spatial accuracies. Therefore, before the data are applied to complex operations such as deep learning, a set of normalization processes is needed to normalize, associate and fuse them.
As shown in fig. 1, an embodiment of the present technical solution is a method for constructing a city portrait based on city multi-source heterogeneous data, including the following steps:
Step S1, designing a normalized raster data model corresponding to the urban multi-source heterogeneous data.
The technical scheme first establishes a normalized grid data model: the urban space is divided into a plurality of grids (for example, at equal intervals of 1 km × 1 km), and all the grids are numbered, that is, the original geographic information (longitude and latitude) is converted into grid numbers, which serve as the spatial information of the data in the database. The data are then mapped into each grid according to their spatial distribution, that is, rasterized. Data of the same type are treated as uniformly distributed within each grid and may differ between grids; when the original data have a time dimension, the data in the grids also change with time. In this way, multi-source heterogeneous data in the urban space are mapped to a uniform and regular grid space, which achieves the goal of heterogeneous data association and fusion.
It should be noted that the grid may be divided into irregular shapes according to the administrative district, and the invention is not limited thereto.
In the normalized raster data model of this embodiment, a relational database (e.g., MySQL, Oracle) is used to store raster data with insignificant time change, and a non-relational database (NoSQL database, e.g., HBase) suitable for big data storage is used to store raster data with significant time change.
The relational database is suitable for storing structured data that do not change significantly over time, such as grid metadata (namely basic grid information such as grid numbers, grid longitude and latitude ranges, and administrative divisions), original POI (point of interest) data, land utilization data and the like. The table structure for such data is designed as follows: a separate data table is established for each type of data, each attribute in the structured data corresponds to one column, and each record corresponds to one row. This approach offers high query efficiency and flexible query modes, but is not suitable for storing massive data.
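As an illustration of the table-per-data-type layout described above, the following is a minimal sketch only. It uses SQLite from the Python standard library as a stand-in for MySQL or Oracle, and the table names, column names and sample rows are hypothetical, not taken from the patent.

```python
import sqlite3

# SQLite stands in for the relational database; all names below are hypothetical.
conn = sqlite3.connect(":memory:")

# One table for grid metadata: one row per grid, one column per attribute.
conn.execute(
    """
    CREATE TABLE grid_metadata (
        grid_id  INTEGER PRIMARY KEY,
        lon_min  REAL,
        lon_max  REAL,
        lat_min  REAL,
        lat_max  REAL,
        district TEXT
    )
    """
)

# A separate table for original POI data, linked to a grid by its number.
conn.execute(
    """
    CREATE TABLE poi (
        poi_id  INTEGER PRIMARY KEY,
        name    TEXT,
        type    TEXT,
        address TEXT,
        grid_id INTEGER REFERENCES grid_metadata(grid_id)
    )
    """
)

conn.execute(
    "INSERT INTO grid_metadata VALUES (1, 120.00, 120.01, 30.00, 30.01, 'District A')"
)
conn.executemany(
    "INSERT INTO poi (name, type, address, grid_id) VALUES (?, ?, ?, ?)",
    [
        ("Cafe X", "restaurant", "1 Example Rd", 1),
        ("School Y", "education", "2 Example Rd", 1),
    ],
)

# Query POIs of one grid, grouped by type.
rows = conn.execute(
    "SELECT type, COUNT(*) FROM poi WHERE grid_id = ? GROUP BY type ORDER BY type",
    (1,),
).fetchall()
```

Keeping the grid number as an ordinary column lets the same table be filtered by grid, by attribute, or by both.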
The non-relational database HBase is suitable for storing massive grid data which obviously change along with time, such as rasterized traffic situation data, meteorological data and the like. Data attribute and time information are represented by "row key" in HBase, and spatial information is represented by "column". The column name is the "grid number" obtained in the previous step. Besides, the non-relational database can also adopt Cassandra, which is not described in detail here.
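The row-key and column layout can be pictured with a small in-memory stand-in. This is a sketch only: a Python dictionary plays the role of the wide-column table, no HBase or Cassandra client API is shown, and the function names and the readable placeholder row keys are assumptions for illustration.

```python
from collections import defaultdict

# In-memory stand-in for a wide-column table: row key -> {column name: value}.
# Each row key identifies one attribute at one time; each column is one grid.
layers = defaultdict(dict)


def put(row_key: str, grid_number: int, value: float) -> None:
    """Store one grid value under the row for (attribute, time)."""
    layers[row_key][str(grid_number)] = value


def get_layer(row_key: str) -> dict:
    """Return all grid values stored for one (attribute, time) row."""
    return dict(layers.get(row_key, {}))


# Placeholder row keys, kept readable here for illustration.
put("temperature_2017110614", 1, 18.5)
put("temperature_2017110614", 2, 18.7)
put("temperature_2017110615", 1, 19.0)

layer = get_layer("temperature_2017110614")
```

Reading one row returns the values of all grids for a single attribute and hour, which matches the add-and-query-by-attribute-and-time access pattern the text describes.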
The key to the design of the rasterized data HBase database is its row key (i.e., RowKey). The design of the row key mainly considers the following two problems:
First, because the main application requirement for rasterized data is to add and query data by data attribute name and time, the row key must contain both the data attribute name and the time.
Second, the high-order bits of the row key determine the physical storage path of the data in the cluster, so the high-order bits of the row key should be specially processed in order to balance the load of the cluster.
In summary, the row key is designed as follows: the data attribute is encoded with a hash algorithm (common hash algorithms include MD5, SHA-1, RIPEMD, HAVAL and the like) to ensure the randomness of the high-order bits of the row key, and the time information is appended after the attribute so that queries by time are convenient.
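The row key design just summarized can be sketched as follows. This is a minimal illustration assuming MD5 as the hash and an hour-resolution timestamp in `YYYYMMDDHH` form joined by an underscore; the function name and the exact attribute string that is hashed are assumptions, so the digest produced here need not match any key published in the patent.

```python
import hashlib
from datetime import datetime


def make_row_key(attribute: str, when: datetime) -> str:
    """Build '<md5 of attribute>_<YYYYMMDDHH>' (format assumed for illustration)."""
    # Hashing the attribute name randomizes the high-order part of the key,
    # which spreads different attributes across the cluster.
    digest = hashlib.md5(attribute.encode("utf-8")).hexdigest()
    # Appending the time keeps rows of one attribute sorted by time.
    return f"{digest}_{when:%Y%m%d%H}"


key = make_row_key("temperature", datetime(2017, 11, 6, 14))
prefix = key.split("_")[0]
```

Because all rows of one attribute share the same hashed prefix, a range query over a time interval reduces to a scan between two keys with that prefix.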
For example, the row key corresponding to the temperature data at 14:00 on November 6, 2017 is:
"78b21a804f24074f8103e571472556be_2017110614".
The key point and difficulty of column design is how to express geographical information simply. In this embodiment, in order to rasterize data, a city is defined as a rectangular area composed of 1 km × 1 km grids, and all grids are numbered sequentially from west to east and from north to south, that is, geographical information in other forms such as longitude and latitude is converted into grid numbers. The data corresponding to one row key are regarded as a "layer", namely the data of all grids for a specific time point and a specific attribute, composed of a plurality of columns of data. Assuming that the designated area has 1000 grids, each row key (layer) corresponds to 1000 columns with column names in [1, 1000].
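The west-to-east, north-to-south numbering can be sketched as below. This is an illustration under stated assumptions: coordinates are taken to be already projected to a planar system in metres, the rectangle is described by its north-west corner, and the function and parameter names are hypothetical. Converting longitude and latitude to such planar coordinates is a separate step not shown here.

```python
from typing import Optional


def grid_number(
    x_m: float,
    y_m: float,
    west_m: float,
    north_m: float,
    n_cols: int,
    n_rows: int,
    cell_m: float = 1000.0,
) -> Optional[int]:
    """Return the 1-based grid number of a point, or None if outside the area."""
    col = int((x_m - west_m) // cell_m)   # columns count from west to east
    row = int((north_m - y_m) // cell_m)  # rows count from north to south
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        return None
    return row * n_cols + col + 1


# A 40 x 25 rectangle of 1 km cells gives 1000 grids numbered 1..1000.
first = grid_number(500.0, 24500.0, 0.0, 25000.0, 40, 25)
last = grid_number(39500.0, 500.0, 0.0, 25000.0, 40, 25)
outside = grid_number(-1.0, 500.0, 0.0, 25000.0, 40, 25)
```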
The data that can be accessed and associated and fused by the technical solution includes (but is not limited to):
Air quality data: obtained from public data interfaces (data cloud markets and the like). The range can reach 300 cities nationwide, and the air quality monitoring station data of each city, updated every hour, are obtained. Specific data types include AQI, PM2.5 concentration, NO2 concentration, SO2 concentration, and the like.
Meteorological data: weather monitoring data updated every hour for each city, obtained from professional weather websites such as the China Meteorological Data Network, including public interface data and file data; the range covers cities at or above the county level nationwide. Specific types include air temperature, air pressure, precipitation, etc.
Traffic situation data: obtained from the public interface of a map service provider such as Gaode Map; the range includes several specific cities, and the data are available in the form of the traffic situation for a given area or a given route. Specific types include overall traffic situation, smooth traffic rate, slow traffic rate, congestion rate and the like.
POI data: POI information within an area, including name, type, address and the like, obtained from the public interface of a map service provider such as Gaode Map, where the area is any circular or polygonal area within a radius of 50 km. POI data can be converted into rasterized data by category counting; their evolution period is relatively long, so they do not necessarily include a time dimension.
Land utilization data: public land use classification data obtained by means of remote sensing images, land surveys and the like, such as ArcGIS grid files containing coverage information and altitude information. The data range can cover the whole country, and the precision depends on the data source (for example, 90 m × 90 m). A first-level class and a second-level class are obtained for each area: the first-level class includes types such as woodland, grassland and artificial surface, and the second-level class further subdivides the first-level class. The evolution period of land use data is relatively long, so they do not necessarily include a time dimension.
Because the grids are distributed in space like the pixel lattice of an image, the rasterized urban multi-source heterogeneous data have a data structure similar to that of an image, and each type of data is analogous to a component of a color image or a remote sensing image (such as the RGB components or the various spectral bands of a remote sensing image). Rasterized data may be viewed as a "city portrait" that describes the state of the city's integrated operations.
Step S2, acquiring original data from the urban multi-source heterogeneous data interface, preprocessing the acquired original data, serializing the preprocessing result, and publishing the preprocessing result to the specified internal classification of the message queue.
Message queue middleware, such as ActiveMQ, RabbitMQ, ZeroMQ, MetaMQ and Kafka, is an important component in a distributed system and implements message publishing and subscribing. In order to process multi-source, high-volume streaming data, the technical scheme adopts the Kafka architecture, which features high throughput, high scalability, high fault tolerance, and message persistence with O(1) time complexity.
In this embodiment, the Kafka architecture is adopted to separate data acquisition from data processing, so that the real-time data access and processing requirement is met. The data acquisition module acquires data from an external data source and publishes the data to an internal classification, and the data rasterization conversion module then extracts the data from the internal classification for subsequent processing. This resolves the mismatch between the data acquisition speed and the data processing speed, and thereby enables real-time processing of massive data.
The Kafka data acquisition module of this embodiment acquires the raw data of various data sources in real time according to the characteristics of each data source. Taking the acquisition of meteorological data through the public interface of the China Meteorological Data Network as an example: the data interface updates the meteorological data of the previous hour every hour, so a timed data acquisition module is designed, which acquires the original data of the target ground meteorological monitoring stations through an HTTP request every hour.
In the embodiment, a data acquisition module of a message queue middleware is adopted and accessed to a city multi-source heterogeneous data interface to acquire original data. Aiming at different data sources, different pre-designed data acquisition modules are arranged and are connected to data interfaces of different data sources to acquire original data.
For example:
The traffic situation data correspond to data acquisition module 1, which accesses the public interface of the Gaode Map service provider to acquire the traffic situation data.
The meteorological data correspond to data acquisition module 2, which accesses the public interface of the China Meteorological Data Network to acquire the meteorological data.
The data acquisition module of this embodiment also preprocesses the original data, filters out various error information caused by data source abnormalities, serializes the preprocessing result (for example, by converting it into a binary stream or a JSON string) and publishes it to the specified internal classification in Kafka.
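The filter, serialize and publish sequence can be sketched as follows. This is an illustration only: a dictionary of lists stands in for the message queue (no Kafka client is used), and the record fields, validity rules, value range and topic name are all assumptions rather than details from the patent.

```python
import json
from collections import defaultdict

# Stand-in for the message queue: topic name -> list of serialized messages.
queue = defaultdict(list)


def is_valid(record: dict) -> bool:
    """Reject records of the kind a faulty source might emit (assumed rules)."""
    value = record.get("value")
    return (
        record.get("station_id") is not None
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
        and -90.0 <= value <= 60.0  # assumed plausible air temperature range, deg C
    )


def publish(topic: str, records: list) -> int:
    """Filter, serialize to JSON bytes, and append to the topic; return count sent."""
    sent = 0
    for record in records:
        if not is_valid(record):
            continue  # drop erroneous readings before they reach the queue
        queue[topic].append(json.dumps(record, sort_keys=True).encode("utf-8"))
        sent += 1
    return sent


raw = [
    {"station_id": "S1", "time": "2017110614", "value": 18.5},
    {"station_id": "S2", "time": "2017110614", "value": None},    # missing reading
    {"station_id": "S3", "time": "2017110614", "value": 9999.0},  # sentinel value
]
sent = publish("weather.temperature", raw)
```

Filtering on the acquisition side keeps obviously bad readings out of the queue, so every downstream subscriber sees the same cleaned stream.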
The Kafka internal classification (topic) in this embodiment is established according to elements such as data source, content, place and time. Taking air quality as an example: first, the air quality data sources are various, such as the free data interface provided by pm25.in and the national urban air quality real-time release platform provided by the China National Environmental Monitoring Centre, and the data of each source differ, so the Kafka internal classifications can be divided into multiple types by data source; second, the air quality content comprises two parts, namely city-level overall air quality data and the air quality data of monitoring stations within a city, so the Kafka internal classifications can be divided into two types by content; third, according to the geographical position of the air quality monitoring stations, the Kafka internal classifications can be divided into several classes by region; finally, according to attributes such as the acquisition time of the air monitoring station and the data acquisition time of the acquisition program, the Kafka internal classifications can be divided into several classes by time elements such as year, month and day. Alternatively, all air quality data can be placed in a single class. This embodiment also estimates the data volume of each internal classification, and selects suitable configurations, such as the number of brokers and partitions of the Kafka cluster and the retention time of Kafka data, according to conditions such as the update frequency and volume of the data sources, the scale and performance of the server cluster, and the real-time and accuracy requirements of the system, which facilitates subsequent use, maintenance and expansion.
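One way to picture classification by source, content, place and time is a naming helper that joins whichever elements are used. This is a hypothetical sketch: the dotted naming convention, the function name and the sample labels are assumptions for illustration, not a scheme stated in the patent.

```python
def topic_name(source: str, content: str, region: str = "", period: str = "") -> str:
    """Join the classification elements that are present into one dotted name."""
    parts = [source, content, region, period]
    return ".".join(
        p.strip().lower().replace(" ", "_") for p in parts if p and p.strip()
    )


# Fine-grained classification: source + content + place + time.
fine = topic_name("pm25in", "station", "Hangzhou", "2017")

# Coarse classification: only source and content.
coarse = topic_name("platform_b", "city")
```

Leaving out the optional elements yields the coarser classifications mentioned in the text, down to a single class for all air quality data.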
Each internal classification in this embodiment corresponds to a preset standard format, and the preset standard format includes information such as an attribute category, a data storage type, a serialization manner, and the like, and is used as a standard of a subsequent program, so that implementation of a data acquisition program and a data processing program is decoupled, and design and implementation of a "message subscription-publishing mode" are facilitated.
For example, if the preset standard format corresponding to an internal classification is JSON, the original TXT files downloaded from the weather website and the weather data downloaded from the public API interface are all finally converted into the uniform JSON format.
In this embodiment, the data acquisition modules all correspond to the designated internal classifications, and through this step, the acquired raw data is uniformly converted into the preset standard format corresponding to each internal classification. Therefore, the urban multi-source heterogeneous data is uniformly converted into a data format which can be uniformly processed by the message queue middleware Kafka, and subsequent further processing is facilitated.
Step S3, establishing a data rasterization conversion module corresponding to the urban multi-source heterogeneous data, extracting data from the internal classification of the message queue, and performing rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, service characteristics and spatio-temporal attributes of the data.
In this embodiment, different data rasterization conversion modules are designed to process the data that need to be stored in the corresponding target database. Each data rasterization conversion module subscribes to the corresponding Kafka internal classifications, thereby selecting the data to be processed, and extracts data from the subscribed internal classifications during data processing.
In this embodiment, according to the normalized raster data model and the original structure, the service features, and the spatio-temporal attributes of the data, a corresponding algorithm is adopted to perform rasterization conversion and data completion on the extracted data, which includes:
For sensing data such as air quality and meteorological data, which are distributed at point coordinates in space, change over time, and have real values only at a limited number of positions, the real data are projected into the corresponding grids according to the coordinates of the sensing data.
For data that are distributed in space and have only a position attribute, such as POIs, the number of POIs of each type in each grid is counted, so that the position distribution data are converted into rasterized density distribution data.
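The category counting described above can be sketched as follows. The sketch assumes each POI has already been assigned a grid number; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def poi_density(pois):
    """pois: iterable of (grid_number, poi_type). Returns {grid_number: Counter}."""
    density = defaultdict(Counter)
    for grid, poi_type in pois:
        density[grid][poi_type] += 1  # one more POI of this type in this grid
    return density


pois = [(1, "restaurant"), (1, "restaurant"), (1, "school"), (2, "hospital")]
density = poi_density(pois)
```

Each grid then carries one count per POI type, so every type can be treated as its own layer of the rasterized data.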
For data whose original form already has grid characteristics, such as land elevation and land utilization, the original grid sizes differ, so the original data also need to be converted into grid data of uniform size in the grid data model; during conversion, the original grid data distribution is mapped into the target grid according to the percentage of types or numerical values.
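Mapping by percentage of types can be illustrated with an area-overlap calculation. This is a sketch under assumptions: cells are axis-aligned rectangles `(x0, y0, x1, y1)` in a planar coordinate system, the share of each land class is taken as its overlap area divided by the total overlapping area, and the function names are hypothetical. The patent does not specify the exact weighting.

```python
def overlap_area(a, b):
    """Area of the intersection of two rectangles given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def class_fractions(target, source_cells):
    """Share of each land class inside the target cell, weighted by overlap area."""
    areas = {}
    for rect, land_class in source_cells:
        area = overlap_area(target, rect)
        if area > 0:
            areas[land_class] = areas.get(land_class, 0.0) + area
    total = sum(areas.values())
    return {c: a / total for c, a in areas.items()} if total else {}


# One 100 x 100 target cell covered by two source cells of different classes;
# the second source cell extends beyond the target and is clipped to it.
target = (0, 0, 100, 100)
source_cells = [
    ((0, 0, 25, 100), "woodland"),
    ((25, 0, 125, 100), "artificial surface"),
]
fractions = class_fractions(target, source_cells)
```

For numerical layers such as elevation, the same overlap areas could serve as weights for an average instead of class shares.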
The present embodiment also calculates and supplements the data in the remaining grids by using the similarity of the data in the spatial and temporal dimensions.
It should be noted that, due to the limitations of the data sources, some data (such as air quality and weather sensing data, and traffic situation data) cannot be acquired completely, so data may be missing in some grids. To complete the grids with missing data, this embodiment applies methods such as the K-nearest neighbor algorithm (kNN), the conditional random field algorithm (CRF) and the artificial neural network (ANN), and uses the similarity of the data in the spatial and temporal dimensions to calculate and fill in the missing values. Taking the K-nearest neighbor algorithm as an example, the basic idea is to find, for a grid whose data cannot be acquired, the K nearest grids that have data, and to take their average as the data of that grid. The completion result is further optimized by combining data such as land utilization. For example, if a grid has no traffic situation data and its artificial surface coverage is 0, its traffic situation data are not associated with those of neighboring grids. In addition, a machine learning model such as a CRF or an ANN can be established to predict the values of grids with missing data, where the grids containing data serve as the model training set, and the feature vector is formed by combining the basic features of a grid with its other grid data.
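The K-nearest-neighbor idea can be sketched as below. This is a simplified illustration: it uses only spatial distance between grid centres on a rectangular grid numbered row by row, breaks distance ties by grid number, and ignores the temporal dimension and the land-use refinement mentioned in the text. The function name and parameters are assumptions.

```python
import math


def knn_fill(values, n_cols, n_rows, k=3):
    """values: {grid_number: value} for grids with data. Returns a filled copy."""

    def center(g):
        # Grid numbers run row by row, starting at 1 in the first row.
        row, col = divmod(g - 1, n_cols)
        return col + 0.5, row + 0.5

    filled = dict(values)
    known = sorted(values)
    for g in range(1, n_cols * n_rows + 1):
        if g in values or not known:
            continue
        gx, gy = center(g)
        # Pick the k grids with data that are closest to the empty grid.
        nearest = sorted(
            known, key=lambda h: (math.dist((gx, gy), center(h)), h)
        )[:k]
        # The mean of the neighbors becomes the value of the empty grid.
        filled[g] = sum(values[h] for h in nearest) / len(nearest)
    return filled


# 3 x 3 grid with data only at the four corners.
filled = knn_fill({1: 10.0, 3: 20.0, 7: 30.0, 9: 40.0}, n_cols=3, n_rows=3, k=2)
```

Only the originally observed grids are used as neighbors, so filled values never feed into other filled values.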
Finally, data with spatial attributes are placed into the corresponding grids according to those attributes (i.e., location information). In this embodiment, the multi-source heterogeneous data of the urban space is mapped into a uniform, regular grid space, so that a single grid holds several types of data, such as traffic data, meteorological data, air quality data and land use data. Once the various data have been associated and fused, their organization closely resembles the RGB components of a color image or the multispectral information of a remote sensing image, and it can visually represent the overall operating state of each grid unit in the city.
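The analogy with image channels can be made concrete with a small sketch that stacks several attribute layers into one value vector per grid, much as a pixel holds R, G and B components. The layer names, the sample values and the row-major numbering are hypothetical.

```python
def stack_layers(layers, n_rows, n_cols, fill=None):
    """Stack attribute layers into a rows x cols x channels nested list.

    layers: {attribute_name: {grid_number: value}}
    Returns the channel order and the stacked cube; grids missing from a
    layer receive `fill` in that channel.
    """
    names = sorted(layers)
    cube = [
        [[layers[name].get(r * n_cols + c, fill) for name in names]
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
    return names, cube

# Hypothetical layers on a 1x3 grid
layers = {
    "traffic_speed": {0: 42.0, 1: 18.5},
    "pm25": {0: 35.0, 1: 41.5, 2: 38.0},
    "artificial_surface_pct": {0: 0.8, 1: 0.6, 2: 0.0},
}
channels, cube = stack_layers(layers, n_rows=1, n_cols=3)
```

A structure of this shape can be passed to image-oriented models with little adaptation, which is presumably why the text draws the comparison.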
The main purpose of this technical scheme is to associate and fuse the urban multi-source heterogeneous data to construct the city portrait, so the target database may comprise only the raster database.
In this embodiment, after the data rasterization conversion module extracts data from the subscribed internal classifications, the data must be preprocessed; the preprocessing includes steps such as removing duplicate data and removing irrelevant data.
Step S4: store the raster data formed after raster conversion and data completion into a target database designed according to the normalized raster data model, thereby forming the city portrait composed of urban multi-source heterogeneous data, and provide corresponding data interfaces according to the characteristics of the raster data to serve deep learning algorithms and smart city applications.
In this embodiment, the raster data formed after raster conversion and data completion is stored in a target database designed according to the normalized raster data model, where the target database is a raster database that includes a relational database implementation and a non-relational database implementation. The relational database establishes an independent data table for each type of data; in the data table, the grid number and the time information corresponding to the data each occupy a column, and each attribute of the data corresponds to a column. The non-relational database uses row keys to represent data attributes and time information, uses columns to represent spatial information, and uses the grid numbers as column names.
The grid database comprises a relational database and a non-relational database, wherein the relational database is used for storing grid data which does not change obviously with time, and the non-relational database is used for storing grid data which changes obviously with time.
For relational databases, separate data tables are built for each type of data. In the data table, the spatial information (grid number) and the time information of the data are respectively a column in the database, and each attribute in the data corresponds to a column, that is, each row of data includes all data of a specific time and a specific grid.
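The relational layout can be sketched with an in-memory SQLite table standing in for the MySQL environment mentioned in this embodiment. The table name, column names and values are hypothetical; only the shape (one column for the grid number, one for time, one per attribute) follows the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per data type: grid number and time are columns, as is each attribute.
conn.execute(
    """
    CREATE TABLE land_use (
        grid_no                INTEGER NOT NULL,
        obs_time               TEXT    NOT NULL,
        artificial_surface_pct REAL,
        water_pct              REAL,
        PRIMARY KEY (grid_no, obs_time)
    )
    """
)
conn.executemany(
    "INSERT INTO land_use VALUES (?, ?, ?, ?)",
    [
        (17, "2018-01-01", 0.80, 0.05),
        (18, "2018-01-01", 0.00, 0.90),
    ],
)
# Each row holds all attributes of one grid at one time.
row = conn.execute(
    "SELECT artificial_surface_pct, water_pct FROM land_use "
    "WHERE grid_no = ? AND obs_time = ?",
    (17, "2018-01-01"),
).fetchone()
```

The composite primary key on grid number and time reflects the statement that each row contains all data of a specific time and a specific grid.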
For the non-relational database, this embodiment selects HBase (a column-oriented, non-relational distributed database), which offers strong scalability and high query efficiency. According to the pre-constructed normalized raster data model, the grid number of each item of preset-standard-format data extracted from the internal classifications is calculated from the spatial attributes of the data, which determines the corresponding grid; the grid number is used as the column name and the data in the grid is put into that column, so that the data is associated and fused into the target database.
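The wide-row layout can be illustrated without an HBase client by modelling a table as a dictionary of rows. The row key joins a hash of the attribute name with the time information, and each column is named after a grid number. The MD5 prefix length, the key separator, the timestamp format and the `g:` column family are assumptions chosen for illustration only.

```python
import hashlib

def row_key(attribute, timestamp):
    """Build a row key from a hashed attribute name plus the time information."""
    prefix = hashlib.md5(attribute.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{timestamp}"

def put(table, attribute, timestamp, grid_no, value):
    """Write one cell; the column qualifier is the grid number."""
    table.setdefault(row_key(attribute, timestamp), {})[f"g:{grid_no}"] = value

table = {}  # stand-in for an HBase table: {row_key: {column: value}}
put(table, "pm25", "201801220800", 17, 35.0)
put(table, "pm25", "201801220800", 18, 41.5)
put(table, "pm25", "201801220900", 17, 38.0)

# One row holds the whole city for one attribute at one time.
snapshot = table[row_key("pm25", "201801220800")]
```

With this layout a single row read returns a full spatial snapshot, and a scan over consecutive row keys of one attribute returns a time series of snapshots.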
Preferably, the target database includes a grid database and a data warehouse, and the method for constructing a city portrait based on city multi-source heterogeneous data further includes:
establishing a data rasterization conversion module corresponding to the urban multi-source heterogeneous data, extracting data from the internal classifications of the message queue, and storing the extracted data into the data warehouse in the target database.
In this case, the target database comprises a grid database and a data warehouse. The data warehouse stores the data extracted from the subscribed internal classifications; the data is stored directly in the preset standard format of each internal classification, so that all data collected from the external data sources is retained. This embodiment selects Hive as the data warehouse, which supports SQL-like queries and offers high fault tolerance.
It should be noted that, first, the data warehouse of this embodiment stores all the data extracted from the internal classifications and therefore retains the location information (longitude and latitude) of the data; second, the target database of this embodiment has a separate data table dedicated to storing the correspondence between grid numbers and longitude and latitude, which is mainly used during rasterization; and finally, the grid database does not store longitude and latitude information, and the spatial information of the associated and fused raster data is expressed only by the grid number.
According to the technical scheme, the target database is established after rasterization, so that basic data can be provided for deep learning algorithms and specific application of smart cities. Therefore, factors such as data capacity, data increment scale, data interface access amount and the like are comprehensively considered, server resources and various public cloud services are evaluated, a server cluster is established, and HBase, Hive, Kafka and MySQL environments are established.
The grid database stored in an environment such as HBase can provide interfaces for specific applications such as raster data visualization and deep learning model training. The specific interfaces include:
acquiring raster data: acquire the raster data of all positions or of specified positions according to attribute, time range and (optionally) grid numbers;
acquiring POI data: acquire the coordinate set of all POIs of a given type according to the POI type;
acquiring grid information: acquire the longitude and latitude information of all grid points and the longitude and latitude information of the area;
acquiring prediction data: acquire, according to attribute, time range and (optionally) grid numbers, the raster data predicted by a specific model for all areas or a specified area, and on that basis acquire the error between the predicted data and the real data at the corresponding time points;
exporting a training set: output the raster data in a specified file format as required, for use as the training set of machine learning models such as deep neural networks.
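Three of the interfaces listed above (raster data query, prediction error and training set export) can be sketched over an in-memory record list. The class name, method names, CSV export format and the use of mean absolute error are assumptions; the patent names the interfaces but does not define their signatures or the error measure.

```python
import csv
import io

class GridStore:
    """In-memory stand-in for the grid database (hypothetical interface)."""

    def __init__(self, records):
        # records: iterable of (attribute, timestamp, grid_no, value)
        self.records = list(records)

    def get_raster_data(self, attribute, start, end, grid_numbers=None):
        """Return (timestamp, grid_no, value) rows; grid_numbers is optional."""
        return [
            (ts, g, v)
            for a, ts, g, v in self.records
            if a == attribute and start <= ts <= end
            and (grid_numbers is None or g in grid_numbers)
        ]

    def export_training_set(self, attribute, start, end):
        """Serialize the selected raster data as CSV text."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["timestamp", "grid_no", attribute])
        writer.writerows(self.get_raster_data(attribute, start, end))
        return buf.getvalue()

def mean_absolute_error(predicted, actual):
    """Mean absolute error over the grids present in both mappings."""
    keys = predicted.keys() & actual.keys()
    return sum(abs(predicted[k] - actual[k]) for k in keys) / len(keys)

store = GridStore([
    ("pm25", "2018-01-22T08:00", 17, 35.0),
    ("pm25", "2018-01-22T08:00", 18, 41.5),
    ("pm25", "2018-01-22T09:00", 17, 38.0),
])
subset = store.get_raster_data("pm25", "2018-01-22T08:00", "2018-01-22T08:59", {17})
csv_text = store.export_training_set("pm25", "2018-01-22T08:00", "2018-01-22T08:59")
error = mean_absolute_error({17: 36.0, 18: 40.0}, {17: 35.0, 18: 41.5})
```

Timestamps are kept as ISO-style strings so that plain string comparison matches chronological order.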
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art can make various corresponding changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, but these corresponding changes and modifications should fall within the protection scope of the appended claims.

Claims (7)

1. A method for constructing a city portrait based on city multi-source heterogeneous data, characterized in that the method comprises the following steps:
designing a normalized raster data model corresponding to the urban multi-source heterogeneous data, wherein the normalized raster data model divides an urban space into a plurality of grids, maps the data into each corresponding grid according to the spatial distribution of the data, numbers all the grids, and uses the grid numbers as the spatial information of the data in a target database;
acquiring original data from a city multi-source heterogeneous data interface, preprocessing the acquired original data, serializing a preprocessing result, and publishing the preprocessing result into a specified internal classification of a message queue;
establishing a data rasterization conversion module corresponding to urban multi-source heterogeneous data, extracting data from internal classes of a message queue, and performing rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, service characteristics and time-space attributes of the data;
storing raster data formed after raster conversion and data completion into a target database designed according to the normalized raster data model to form a city portrait composed of urban multi-source heterogeneous data, and providing a corresponding data interface according to the characteristics of the raster data to serve deep learning algorithms and smart city applications;
wherein performing rasterization conversion and data completion on the extracted data by adopting a corresponding algorithm according to the normalized raster data model and the original structure, the service characteristics and the time-space attributes of the data comprises the following steps:
for sensing data which are distributed at spatial point coordinates, change over time and have real values only at a limited number of positions, projecting the real values into the corresponding grids according to the coordinates of the sensing data;
for data which are distributed in a space and only have position attributes, respectively counting the number of each type of data in each grid, and converting the position distribution data into rasterized density distribution data;
for data of which the original data has grid characteristics, converting the original grid size into grid data of uniform size in a normalized grid data model, and mapping the original grid data distribution into a target grid in the conversion process according to the percentage of types or numerical values;
and calculating and complementing the data in the rest grids by utilizing the similarity of the data in the spatial dimension and the time dimension.
2. The method for constructing a city portrait based on city multi-source heterogeneous data according to claim 1, wherein the target database is a grid database, the grid database comprising a relational database implementation and a non-relational database implementation; the relational database establishes an independent data table for each type of data, and in the data table, the grid number and the time information corresponding to the data each occupy a column, and each attribute of the data corresponds to a column; the non-relational database uses row keys to represent data attributes and time information, uses columns to represent spatial information, and uses the grid numbers as column names.
3. The method for constructing a city portrait based on city multisource heterogeneous data as claimed in claim 2, wherein the row key of the non-relational database comprises a character string obtained by encoding data attributes by a hash algorithm, and time information corresponding to the data.
4. The method for constructing a city portrait based on city multi-source heterogeneous data according to claim 1, wherein the preprocessing of the collected raw data comprises: error information caused by data source exception is filtered.
5. The method for constructing a city portrait based on city multisource heterogeneous data as claimed in claim 1, wherein the internal classifications are established by classification according to data source, or content, or location, or time, each internal classification corresponds to a preset standard format, and the preset standard format comprises an attribute category, a data storage type and a serialization mode.
6. The method for constructing a city portrait based on city multi-source heterogeneous data according to claim 1, wherein the extracting data from internal classifications of a message queue further comprises:
preprocessing the data extracted from the internal classifications by removing duplicate data and irrelevant data.
7. The method for constructing a city portrait based on city multi-source heterogeneous data according to claim 1, wherein the target database comprises a grid database and a data warehouse, and the method further comprises the following steps:
establishing a data rasterization conversion module corresponding to the urban multi-source heterogeneous data, extracting data from the internal classifications of the message queue, and storing the extracted data into the data warehouse in the target database.
CN201810057801.XA 2018-01-22 2018-01-22 Method for constructing city portrait based on city multi-source heterogeneous data Active CN108446293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810057801.XA CN108446293B (en) 2018-01-22 2018-01-22 Method for constructing city portrait based on city multi-source heterogeneous data

Publications (2)

Publication Number Publication Date
CN108446293A CN108446293A (en) 2018-08-24
CN108446293B true CN108446293B (en) 2020-12-15

Family

ID=63191067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810057801.XA Active CN108446293B (en) 2018-01-22 2018-01-22 Method for constructing city portrait based on city multi-source heterogeneous data

Country Status (1)

Country Link
CN (1) CN108446293B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005331896A (en) * 2004-05-20 2005-12-02 Kankou:Kk Sampling inspection method of geographic data
CN101739460A (en) * 2009-12-16 2010-06-16 中国科学院对地观测与数字地球科学中心 Grid-based spatial data source unification service system and method
CN102750363A (en) * 2012-06-13 2012-10-24 天津市规划信息中心 Construction method of urban geographic information data warehouse
CN103116825A (en) * 2013-01-29 2013-05-22 江苏省邮电规划设计院有限责任公司 Intelligent city management system
CN104182454A (en) * 2014-07-04 2014-12-03 重庆科技学院 Multi-source heterogeneous data semantic integration model constructed based on domain ontology and method
CN104462244A (en) * 2014-11-19 2015-03-25 武汉大学 Smart city heterogeneous data sharing method based on meta model
CN105069020A (en) * 2015-07-14 2015-11-18 国家信息中心 3D visualization method and system of natural resource data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
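Research on Underground Coal Mine Robot Navigation and Wireless Video Monitoring Technology Based on the Industrial Internet; 谭玉新; China Master's Theses Full-text Database, Engineering Science and Technology I; 20170615; B021-57 *
Research on Modeling of Geographic Spatio-temporal Raster Data Based on 智方体; 黄祥志; China Doctoral Dissertations Full-text Database, Basic Sciences; 20151015; A008-3 *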

Also Published As

Publication number Publication date
CN108446293A (en) 2018-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant