CN108710625B

CN108710625B - Automatic thematic knowledge mining system and method

Info

Publication number: CN108710625B
Application number: CN201810222910.2A
Authority: CN
Inventors: 刘强; 刘沛文; 黄耀森; 陈晨
Original assignee: CHENGDU RESEARCH INSTITUTE OF UESTC
Current assignee: CHENGDU RESEARCH INSTITUTE OF UESTC
Priority date: 2018-03-16
Filing date: 2018-03-16
Publication date: 2022-03-22
Anticipated expiration: 2038-03-16
Also published as: CN108710625A

Abstract

The invention belongs to the technical field of big data processing and discloses a system and a method for automatically mining thematic knowledge. The system comprises an Internet of Things interface module, a semantic query module, a data mining module, and a map aggregation and visualization module. The method comprises the following steps: first, data are collected and a database is established, with data acquired from the web ubiquitous network, monitoring sensor data and various thematic data collected and stored in a thematic database; next, thematic knowledge is mined through ontology construction, semantic query and deep information mining, and the resulting files are transmitted to an FTP file server; the thematic knowledge acquired from the FTP file server is then geographically associated with a geographic base map to form mapping data; map layout design, thematic chart design and map finishing are carried out; and finally a thematic map is output. The invention processes and analyzes massive land resource information efficiently and accurately by means of machine learning algorithms for big data.

Description

Automatic thematic knowledge mining system and method

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to a system and a method for automatically mining thematic knowledge.

Background

In recent years, big data technology has developed rapidly around the world and has stimulated intense research interest, drawing close attention from industry, academia and governments worldwide. With the rapid development and spread of computers and information technology, industrial application data is growing explosively. Industry data sets that easily reach hundreds of terabytes or even petabytes far exceed the processing capacity of traditional computing technologies and information systems. At the same time, big data usually contains deep knowledge and value that are absent at small data volumes. Intelligent analysis and mining of big data brings great commercial value to industry, enables services with high added value, and improves the level of production and management decision-making as well as economic returns.

Spatial analysis is one of the earliest areas of geographic research and has a long history; in a sense, geography grew out of spatial analysis. In early times, driven by the needs of survival and development, people had to learn to understand and analyze the spatial relationships among surrounding geographic objects, and various forms of spatial analysis came into use. Maps became the second language of geography, and people began to use maps for many kinds of spatial analysis, including measuring distances, directions and areas between geographic objects, and even for tactical research and strategic planning. In recent years the main technologies involved in spatial analysis have changed greatly: geographic information system (GIS) and remote sensing technologies provide a powerful environment for spatial data analysis, and new analysis models and processing methods for solving spatial problems continue to be developed. Because ever-increasing volumes of spatial data are changing the spatial analysis process, methods designed for large amounts of data, such as exploratory spatial data analysis, spatial visualization, spatial data mining and artificial-intelligence-based spatial analysis, have attracted much attention and developed rapidly in recent years. These methods have a high tolerance for the uncertainty and inaccuracy of large-scale spatial analysis.

Over time, the application field of GIS spatial analysis has become wider and wider. Yanjin studied the application of GIS spatial analysis in forest fire prevention and addressed important problems encountered in forest fires. Tang Xian applied GIS spatial analysis methods to modeling the spatial distribution of disease and to predicting that distribution. Zhuhai Yan examined the role of GIS spatial analysis in tropical cyclone research, combining the two. Lihuqiong studied the application of GIS spatial analysis to improving the spatial allocation of educational resources and optimized that allocation. Huanu carried out research on three-dimensional geological modeling and spatial analysis, applying spatial analysis technology to three-dimensional analysis. Qinu studied the application of GIS spatial analysis in supermarket site selection. Wu Jianhua et al. used GIS spatial analysis in an electronic chart to determine whether dangerous points, lines and areas exist within the yaw limit range of a route, providing decision support for route design and real-time monitoring of the route. Hanyong et al. studied the structure of an urban underground pipeline database based on GIS spatial analysis and constructed several spatial analysis models. Cornishi, starting from the basic concepts of the digital city, analyzed the relationship between the digital city and the urban geographic information system. Liuwei et al. proposed applying GIS spatial analysis technology to the environmental impact assessment of mineral resources and obtained related research results. Other work studied linear buffer algorithms for GIS and implemented GIS buffer analysis of linear targets. Lixiangji researched GIS spatial data theory and spatial analysis methods, and designed and implemented spatial analysis algorithms for several geographic information systems.

Prior art methods mainly analyze and process static spatial data to produce thematic maps, whereas smart cities and their applications are built on the Internet of Things. Through the Internet of Things and sensors, city information can be collected, stored and shared. However, because the data are multi-source, heterogeneous and real-time (or quasi-real-time), traditional spatial analysis software has difficulty acquiring and processing real-time thematic knowledge and generating thematic maps from it, and cannot meet the informatization and knowledge requirements of the era of smart cities, the Internet of Things and big data.

In summary, the problem with the prior art is that traditional spatial statistical analysis methods cannot receive real-time sensor data and carry out knowledge mining and analysis synchronously. The present invention addresses real-time data reception, analysis and processing, thematic knowledge mining, and the real-time, synchronous, integrated and automatic generation of thematic maps.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an automatic thematic knowledge mining software system.

The invention is realized as follows. An automatic thematic knowledge mining system first obtains multi-source, heterogeneous, real-time spatial data and information through the Web ubiquitous network, real-time sensor data, thematic data collection and other means; it then synchronously preprocesses the data; on this basis it uses semantic technology and machine learning models to mine and analyze thematic knowledge; and finally it automatically produces and visualizes thematic maps in real time (or quasi-real time). The whole process is highly automated and real-time. The automatic thematic knowledge mining system comprises:

the Internet of things interface module is used for acquiring, storing, retrieving, exporting and displaying data;

the semantic query module is connected with the Internet of things interface module and used for constructing an ontology model from an ontology file and performing semantic retrieval queries by class;

the data mining module is connected with the Internet of things interface module and is used for analyzing and mining data in the thematic database to obtain thematic knowledge; importing the data analysis result into an Excel table;

and the map aggregation and visualization module is connected with the data mining module and used for making and displaying thematic maps by using the obtained thematic knowledge.

Further, the internet of things interface module comprises:

the data acquisition and storage module is used for acquiring and storing data; it obtains internet data by sending a request to a data providing website and then parses and stores the data; it also receives real-time data sent by the monitoring equipment through GPRS wireless transmission;

the data retrieval module is connected with the data acquisition and storage module and is used for retrieving data by acquisition time or by the numerical range of each index;

the data export module is connected with the data acquisition and storage module and is used for exporting the data into an Excel table and saving the table locally;

and the data display module is connected with the data acquisition and storage module and is used for displaying the matching data from the database in the interface, in order, after the index types and numerical ranges are selected in the input box or the list.

Further, the semantic query module first selects an ontology file from the local files and generates an ontology model from it; after the ontology model is generated, a category and a search term are selected and the semantic query result is displayed;

the semantic query module comprises: the system comprises an ontology construction module and an ontology query module; the ontology construction module constructs an ontology model according to an ontology construction object determined by requirements; and the ontology query module carries out semantic query on the constructed ontology model.

Further, the data mining module comprises an FP-tree association analysis module and a random forest classification module; the FP-tree association analysis module is used for reversely mining the indexes that have a high degree of association with the thematic knowledge;

the FP-tree stores compressed information about frequent patterns for frequent item set mining and is formed from a root node and the item prefix subtrees that are its children; each node of an item prefix subtree consists of three fields: node name, node count and node link; the node count is the number of transactions represented by the path reaching the node, and the node link points to the next node in the tree with the same name;

the random forest classification module is used for selecting the data of the indexes and training a model with the random forest method to complete random forest classification; the random forest classification module is an ensemble classifier comprising a plurality of unpruned classification and regression trees; the ensemble classifier is formed by introducing independent and identically distributed random variables, generating decision trees from the training set data and these random variables, and finally combining all the decision trees according to the ensemble learning approach.

Further, the map aggregation and visualization module represents the attribute data of a given theme on a map by means of color rendering, pattern filling, histograms or pie charts; the visual effect of the thematic map is used to present the corresponding results to the user intuitively;

the map aggregation and visualization module establishes a new cartographic workflow for thematic map making by introducing the latest cartographic and GIS technology; it first acquires the geographic base map and the thematic knowledge data, then designs the map layout, designs the thematic charts in detail, handles map conflicts, and finally outputs the map.

Further, the data acquisition and storage module comprises a network API module and a GPRS wireless transmission module; the network API module is responsible for obtaining internet data by sending a request to a data providing website and for parsing and storing the data; meteorological monitoring data are acquired through the network API as follows: the client sends a request to the server, the server calls an API (application programming interface) to acquire the data and then sends JSON (JavaScript Object Notation) data to the client, and the client parses the received JSON data and stores it in a local MySQL database;

the GPRS wireless transmission module is responsible for receiving real-time data sent by the monitoring equipment through GPRS wireless transmission; during data acquisition, the PC end listens on a fixed port; after it detects a connection request, it sends an instruction to the equipment through the output stream and receives the data returned by the equipment through the input stream, then parses the data; the parsed data are likewise stored in the MySQL database, supporting the establishment of the thematic knowledge database.
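
The listen, instruct, receive and parse sequence described above can be sketched in Python. This is a minimal illustration only: the instruction bytes `READ_COMMAND`, the `key=value;key=value` frame layout, the port number and all function names are assumptions made for the example, since the original text does not specify the device protocol.

```python
import socket

# Hypothetical instruction and frame format: neither is specified in the source.
READ_COMMAND = b"READ\n"


def parse_frame(frame: bytes) -> dict:
    """Parse an assumed 'key=value;key=value' ASCII frame into floats."""
    readings = {}
    for field in frame.decode("ascii").strip().split(";"):
        if field:
            name, value = field.split("=", 1)
            readings[name] = float(value)
    return readings


def handle_device(conn: socket.socket) -> dict:
    """Send the instruction to a connected device and parse its reply."""
    conn.sendall(READ_COMMAND)           # output stream: instruction to the device
    with conn.makefile("rb") as stream:  # input stream: data returned by the device
        frame = stream.readline()
    return parse_frame(frame)


def serve_once(host: str = "0.0.0.0", port: int = 9000) -> dict:
    """Listen on a fixed port, accept one device connection, return its readings."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            return handle_device(conn)
```

In a deployment along the lines of the text, the dictionary returned by `handle_device` would then be written to the database; that step is omitted here.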

Another objective of the present invention is to provide an automatic thematic knowledge mining method, which comprises:

firstly, collecting and establishing a database, and collecting and storing data acquired based on a web ubiquitous network, data of a monitoring sensor and various thematic data into a thematic database;

then, mining thematic knowledge: forming thematic knowledge through ontology construction, semantic query and deep information mining, and transmitting the resulting files to an FTP file server; geographically associating the thematic knowledge acquired from the FTP file server with a geographic base map to form mapping data;

carrying out map layout design, thematic chart design and map finishing; and finally outputting a thematic map.

Further, the automatic thematic knowledge mining method specifically comprises the following steps:

1) network-based data acquisition:

obtaining data from the internet through an API;

data retrieval and display: querying and displaying data in the database according to the query conditions;

data export: exporting the equipment monitoring data in the database to an Excel table;

2) ontology construction and semantic query:

ontology construction: generating an ontology model that semantically describes the ontology, thereby constructing the ontology;

semantic query: after the ontology model is generated, selecting a search term category and a search term from the list, and displaying the semantic query result through retrieval;

3) association analysis, comprising:

generating an association data set: first selecting an original data set from a local file, then processing it with the association data set generating function into an association data set for association analysis; the original data set is in table format, and the generated association data set is in txt format;

association analysis: after the association data set is successfully generated, selecting it, setting a frequency threshold, and displaying the association analysis result; then inputting the selected parameters to generate the data set used for random forest classification;

4) random forest classification, comprising:

performing classification: first uploading the data set to be classified to the specified file path /user/hadoop/2014 AQI/test/ in the distributed file system on the Linux system; then filling in the IP address of the virtual machine, the number of indexes in the data set, the number of parameters per decision tree and the number of decision trees in the forest, after which classification is carried out;

exporting the classification result: selecting the classification result and the original data set in their respective fields, and importing the classification result into the original data set table;

5) thematic map generation, comprising:

acquiring the data source of the thematic map:

collecting data in real time using air monitoring and acquisition equipment;

compiling the thematic map: establishing a new cartographic workflow for thematic map making by introducing the latest cartographic and GIS technology; first adding the thematic page currently to be edited to the thematic map, then performing symbolization, map editing, map border finishing, layout design and thematic chart design according to the drawing template, and then checking for map conflicts to complete the thematic map.
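
As an illustration of the association data set generation in step 3), the sketch below turns a table into a txt-style transaction file with one transaction per line. The column names, the two-level `high`/`low` discretization, the thresholds and the space-separated output layout are all assumptions made for the example; the source only states that the input is a table and the output is a txt file.

```python
import csv
import io


def to_transactions(table_text: str, thresholds: dict) -> list:
    """Turn each table row into a list of items such as 'PM25=high'.

    `thresholds` maps a column name to the (hypothetical) cut-off at or above
    which a value is labelled 'high'.
    """
    transactions = []
    for row in csv.DictReader(io.StringIO(table_text)):
        items = []
        for column, threshold in thresholds.items():
            level = "high" if float(row[column]) >= threshold else "low"
            items.append(f"{column}={level}")
        transactions.append(items)
    return transactions


def write_txt(transactions: list) -> str:
    """Serialise transactions as one space-separated line each."""
    return "\n".join(" ".join(items) for items in transactions)
```

For example, `write_txt(to_transactions("PM25,SO2\n80,10\n", {"PM25": 75.0, "SO2": 30.0}))` yields the single line `PM25=high SO2=low`, which a frequent item set miner can read as one transaction.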

Further, the random forest classification specifically includes:

first, K different sample data sets are randomly drawn from the original data set by the bootstrap method and serve as the training subsets of the individual decision trees; each sample has the same size as the original data set, and the data not drawn in a given round form the out-of-bag data;

second, a classification and regression tree is built for each sample data set, generating K decision trees; during generation, for each node of a decision tree, the original set of variables is randomly sampled to obtain a variable subset, and the optimal variable is selected from the subset according to the minimum Gini index criterion to split the node;

third, each classification and regression tree grows by recursive branching from top to bottom until the configured minimum leaf node size is reached, at which point the decision tree stops growing; all the decision trees are combined into a random forest;

fourth, the test data are input into the random forest model, predictions are made by each of the K decision trees, and the average of the prediction results of the decision trees is taken as the regression value.
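
The four steps above rely on three small building blocks: bootstrap sampling with out-of-bag data, a Gini-based split chosen from a random variable subset, and averaging over the K trees. The following Python sketch shows each in isolation. It is a simplified illustration with made-up function names, not the implementation used by the system (which, according to the operating environment listed later, runs on Mahout), and it omits the recursive tree growth of step three.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows: int, rng: random.Random):
    """Step one: draw n_rows indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def gini(labels) -> float:
    """Gini index of a label list: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, n_candidates: int, rng: random.Random):
    """Step two: among a random variable subset, pick the split with minimum Gini.

    Returns (weighted_gini, variable_index, threshold), or None if no split exists.
    """
    variables = rng.sample(range(len(rows[0])), n_candidates)
    best = None
    for v in variables:
        for threshold in sorted({row[v] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[v] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[v] > threshold]
            if not left or not right:
                continue  # a split must send data to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, v, threshold)
    return best


def forest_average(tree_predictions):
    """Step four: average the K per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which helps when comparing forests built with different numbers of trees or variable subset sizes.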

The advantages and positive effects of the invention are as follows. The system processes and analyzes multi-source, heterogeneous, real-time land resource information efficiently and accurately by means of machine learning algorithms for big data. Its goals are to construct and query ontologies in a database, to mine valuable thematic knowledge with machine learning algorithms, and to produce thematic maps of that knowledge. To this end, database warehousing specifications are established on the basis of data acquisition through the ubiquitous network and the Internet of Things interface, and an automatic thematic knowledge mining software system is developed using GIS (geographic information system) technology and deep mining technology. The system supports the acquisition of web-based ubiquitous network data and GPRS wireless transmission data, together with warehousing management and querying of the data; it supports file download through an FTP server; it supports the construction and querying of ontologies; and it supports thematic knowledge association analysis in a knowledge base, classification of thematic knowledge with a random forest classification model, and finally the production of thematic maps from the mined information.

1. Real-time thematic knowledge mining and thematic map making are supported for multi-source, heterogeneous, real-time spatial data.

2. A good solution is provided for applications such as smart cities based on the Internet of Things and big data.

Drawings

Fig. 1 is a functional structure diagram of an automatic thematic knowledge mining software system according to an embodiment of the present invention.

In the figure: 1. Internet of things interface module; 2. semantic query module; 3. data mining module; 4. map aggregation and visualization module.

Fig. 2 is a framework of an automatic thematic knowledge mining system according to an embodiment of the present invention.

Fig. 3 is a flowchart of an automatic thematic knowledge mining system according to an embodiment of the present invention.

Fig. 4 is a process diagram for acquiring monitoring data by the API provided in the embodiment of the present invention.

Fig. 5 is a flow chart of GPRS acquisition data according to an embodiment of the present invention.

Fig. 6 is a flow chart of random forest calculation according to an embodiment of the present invention.

Fig. 7 is a real-time data acquisition module according to an embodiment of the present invention.

Fig. 8 is a flowchart of thematic map compilation according to an embodiment of the present invention.

FIG. 9 is a graph of a polyline generation result provided by an embodiment of the invention.

Fig. 10 illustrates the role of the city management database provided by the embodiment of the present invention in a smart city.

Fig. 11 is a geographic information database system setup framework for smart cities according to an embodiment of the present invention.

Fig. 12 is a geographic information database system setup framework for smart cities according to an embodiment of the present invention.

FIG. 13 is a conceptual model of spatial basis data provided by an embodiment of the present invention.

FIG. 14 is a conceptual model of city data provided by an embodiment of the invention.

FIG. 15 is a conceptual model of system management data provided by embodiments of the present invention.

FIG. 16 is a database general logic design provided by an embodiment of the present invention.

FIG. 17 is an FP-tree diagram constructed after reading the first entry as provided by embodiments of the present invention.

FIG. 18 is an FP-tree diagram constructed after reading the first three entries provided by embodiments of the present invention.

Fig. 19 is a diagram of a path extracted at the end of an e-node according to an embodiment of the present invention.

FIG. 20 is a diagram of the newly built FP-tree, which is the conditional FP-tree of e, according to an embodiment of the present invention.

Detailed Description

In order to further understand the contents, features and effects of the present invention, the following embodiments are illustrated and described in detail with reference to the accompanying drawings.

The structure of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the automatic thematic knowledge mining software system comprises an Internet of things interface module 1, a semantic query module 2, a data mining module 3 and a map aggregation and visualization module 4.

The internet of things interface module 1 mainly performs data acquisition and storage, conditional data retrieval, data export and data display. Its main role is to collect data and provide data support for the mining system.

Data acquisition and storage: the module is divided into a network API part and a GPRS wireless transmission part. The network API part is responsible for obtaining internet data by sending a request to a data providing website and for parsing and storing the data. The flow for obtaining weather monitoring data through the network API is shown in fig. 4: a client sends a request to a server, the server calls an API function to obtain the data and then sends JSON data to the client, and the client parses the received JSON data and stores it in a local MySQL database. The GPRS wireless transmission part is responsible for receiving the real-time data sent by the monitoring equipment through GPRS wireless transmission. During data acquisition, the PC end first listens on a port, sends an instruction to the device after detecting a connection request, and then receives and parses the data sent by the device; the flow chart is shown in fig. 5. The PC end listens on a fixed port and, after receiving a request, sends the instruction to the equipment through the output stream and receives the data returned by the equipment through the input stream. The parsed data are stored in the MySQL database to support the establishment of the thematic knowledge database.
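
The client side of the API flow (parse the received JSON, then store it) can be sketched as follows. The JSON layout, the table name `weather` and its columns are assumptions made for illustration, and Python's built-in `sqlite3` stands in for the MySQL database named in the text so that the example is self-contained.

```python
import json
import sqlite3


def store_readings(conn: sqlite3.Connection, payload: str) -> int:
    """Parse a JSON payload and insert one row per reading; return the row count.

    The payload shape {"readings": [{...}, ...]} is hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather "
        "(station TEXT, observed_at TEXT, temperature REAL, humidity REAL)"
    )
    readings = json.loads(payload)["readings"]
    rows = [
        (r["station"], r["observed_at"], r["temperature"], r["humidity"])
        for r in readings
    ]
    # Parameterised inserts keep values out of the SQL text.
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With a MySQL driver the structure would stay the same; only the connection object and the parameter placeholder style would differ.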

Data retrieval: data can be retrieved by acquisition time or by the numerical range of each index.

Data export: the data can be exported locally as Excel tables.

Data display according to the search conditions: after the index type and numerical range are selected in the input box or the list, the matching data in the database are displayed in the interface in order.
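
Retrieval by acquisition time or by the numerical range of an index amounts to building a filtered query. A minimal sketch follows, again using `sqlite3` in place of MySQL; the table `monitoring`, the columns `observed_at`, `pm25` and `so2`, and the function name are hypothetical.

```python
import sqlite3

# Hypothetical index columns; the whitelist keeps column names out of user control.
INDEX_COLUMNS = {"pm25", "so2"}


def query_readings(conn, start=None, end=None, index=None, low=None, high=None):
    """Filter rows by time window and/or by the value range of one index."""
    clauses, params = [], []
    if start is not None:
        clauses.append("observed_at >= ?")
        params.append(start)
    if end is not None:
        clauses.append("observed_at <= ?")
        params.append(end)
    if index is not None:
        if index not in INDEX_COLUMNS:
            raise ValueError(f"unknown index: {index}")
        if low is not None:
            clauses.append(f"{index} >= ?")
            params.append(low)
        if high is not None:
            clauses.append(f"{index} <= ?")
            params.append(high)
    sql = "SELECT observed_at, pm25, so2 FROM monitoring"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY observed_at"  # display rows in order of acquisition time
    return conn.execute(sql, params).fetchall()
```

Timestamps are compared as text here, which only orders correctly when they are stored in a sortable form such as `YYYY-MM-DD HH:MM`.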

The semantic query module 2 enables the system to construct an ontology model from an ontology file and to perform semantic retrieval queries by class. To run a semantic query, the user first selects an ontology file from the local files and clicks the "generate ontology model" button; after the ontology model is generated successfully, the user selects the category of the search term and the search term itself, and clicks search to display the semantic query result. The semantic query module is divided into an ontology construction part and an ontology query part. The ontology construction part is responsible for constructing an ontology model for the ontology construction object determined by the requirements. The ontology query part is responsible for performing semantic queries on the constructed ontology model.
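
The idea of querying by class can be illustrated with a toy in-memory ontology. The sketch below is a plain Python stand-in and not the system's implementation, which according to the development environment uses Jena and Protégé; the dictionary layout, the class names and the keyword matching rule are all assumptions.

```python
def subclasses(ontology: dict, cls: str) -> set:
    """Return cls together with all of its transitive subclasses."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(ontology["children"].get(current, []))
    return found


def query(ontology: dict, cls: str, keyword: str) -> list:
    """List instances of cls (or a subclass) whose name contains the keyword."""
    classes = subclasses(ontology, cls)
    return sorted(
        name
        for name, owner in ontology["instances"].items()
        if owner in classes and keyword.lower() in name.lower()
    )


# A made-up air-quality ontology: class hierarchy plus instance-to-class mapping.
EXAMPLE = {
    "children": {"Pollutant": ["Particulate", "Gas"]},
    "instances": {"PM2.5": "Particulate", "PM10": "Particulate", "SO2": "Gas", "Chengdu": "City"},
}
```

Querying the class `Pollutant` with the search term `pm` returns both particulate instances, because the search follows the subclass relation rather than matching the class name literally; this is what distinguishes a semantic query from a plain keyword search.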

The data mining module 3 is mainly used for analyzing and mining the data in the thematic database to obtain thematic knowledge. After the system finishes the data mining operation, the data analysis result is imported into an Excel table. The module is divided into an FP-tree association analysis part and a random forest classification part. The FP-tree association analysis part is responsible for reversely mining the indexes that have a high degree of association with the thematic knowledge. The random forest classification part is responsible for selecting the data of those indexes and training a model with the random forest method to complete the classification.

FP-tree association analysis: association rule discovery mines valuable knowledge describing the interrelationships between data items from large amounts of data. As the data collected and stored in databases grow larger and larger, interest in mining such association knowledge from them is increasing. FP-tree is an abbreviation of frequent pattern tree, and its main role is to store compressed information about frequent patterns for frequent item set mining. It is assembled from a root node and the item prefix subtrees that are its children. Each node of an item prefix subtree consists of three fields: node name, node count and node link. The node count is the number of transactions represented by the path reaching the node, and the node link points to the next node in the tree with the same name.
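
The node structure described above (name, count, link) and the way item prefix subtrees share paths can be sketched in Python. This is a minimal construction routine only, without the mining of conditional trees; the `header` table, the tie-breaking order and the example transactions are assumptions added for the illustration.

```python
from collections import Counter


class FPNode:
    """One FP-tree node: name, count and a link to the next node of the same name."""

    def __init__(self, name, parent=None):
        self.name = name
        self.count = 0
        self.parent = parent
        self.children = {}
        self.link = None


def build_fp_tree(transactions, min_support):
    """Build an FP-tree; return (root, header) where header maps item -> first node."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_support}
    root, header, tails = FPNode(None), {}, {}
    for t in transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        items = sorted((i for i in set(t) if i in frequent), key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].link = child  # extend the chain of same-name nodes
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header


def linked_nodes(header, item):
    """Follow the node links for one item."""
    node = header.get(item)
    while node is not None:
        yield node
        node = node.link


# Made-up transactions used only to exercise the sketch.
EXAMPLE = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"], ["a", "d", "e"], ["a", "b", "c"]]
```

Summing `count` over `linked_nodes(header, item)` recovers the support of an item, which is why the node links make it cheap to collect all prefix paths ending in that item.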

Random forest classification: a random forest is an ensemble classifier comprising multiple unpruned classification and regression trees. The ensemble is formed by introducing independent and identically distributed random variables, generating decision trees from the training set data and these random variables, and finally combining all the decision trees according to the ensemble learning approach. For regression, the prediction of the algorithm is the average of the predicted values of all decision trees. The random forest is an effective classification method that mitigates overfitting, mainly because two randomization strategies, bagging and the feature subspace method, are introduced when each decision tree is built. By integrating the classification or regression results of the individual trees, the random forest offsets part of the random error and tolerates outliers and noise well.
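
The quantities mentioned above can be written compactly. The symbols are introduced here for illustration and do not appear in the original: $K$ is the number of trees, $T_k$ the $k$-th tree, $p_c$ the proportion of class $c$ among the samples $D$ at a node, and, for the last line, the trees are assumed to be identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$.

```latex
% Gini index used as the node-splitting criterion
\mathrm{Gini}(D) = 1 - \sum_{c=1}^{C} p_c^{2}

% Regression output of the forest: the average over the K trees
\hat{y}(x) = \frac{1}{K} \sum_{k=1}^{K} T_k(x)

% Variance of that average under the stated assumption
\operatorname{Var}\bigl[\hat{y}(x)\bigr] = \rho\,\sigma^{2} + \frac{1-\rho}{K}\,\sigma^{2}
```

The last expression is one way to read the claim that part of the random error is offset: the second term shrinks as $K$ grows, while bagging and feature subspace sampling reduce $\rho$ and therefore the first term.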

The map aggregation and visualization module 4 is mainly used for making and displaying thematic maps from the obtained thematic knowledge. A thematic map is a map that emphasizes one or several natural elements or socioeconomic phenomena. The attribute data of a given theme are represented on the map by means of color rendering, pattern filling, histograms or pie charts. The visual effect of the thematic map allows the corresponding results to be presented to the user intuitively. Building on the traditional thematic map making workflow, a new cartographic workflow is established by introducing the latest cartographic and GIS technology, chiefly to address the automation and intelligence of thematic map making. The workflow comprises five steps: acquiring the geographic base map and the thematic knowledge data, designing the map layout, designing the thematic charts in detail, handling map conflicts, and outputting the map.
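
Associating thematic knowledge with the base map and rendering it by color can be sketched as an attribute join followed by class assignment. The example below is a simplified, GIS-free illustration: the region records, the equal-interval classification and the color names are assumptions, whereas the system itself works with ArcEngine components.

```python
def class_breaks(values, n_classes):
    """Equal-interval upper bounds for n_classes classes."""
    low, high = min(values), max(values)
    step = (high - low) / n_classes
    return [low + step * (i + 1) for i in range(n_classes)]


def classify(value, breaks):
    """Index of the first class whose upper bound is not below the value."""
    for index, upper in enumerate(breaks):
        if value <= upper:
            return index
    return len(breaks) - 1  # guard against floating-point rounding at the top end


def join_to_base_map(regions, knowledge, colours):
    """Attach a thematic value and a fill colour to every base-map region.

    Regions without thematic knowledge keep value and colour set to None.
    Assumes at least one region has a value.
    """
    values = [knowledge[r["id"]] for r in regions if r["id"] in knowledge]
    breaks = class_breaks(values, len(colours))
    styled = []
    for region in regions:
        value = knowledge.get(region["id"])
        colour = colours[classify(value, breaks)] if value is not None else None
        styled.append({**region, "value": value, "colour": colour})
    return styled
```

Equal intervals are only one possible classification; quantile or natural-break schemes would slot into `class_breaks` without changing the join.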

The development environment is as follows:

① Operating system: Windows 7, Windows 8 or Windows 10; the software is portable and compatible with both 32-bit and 64-bit systems.

② Development platform: ArcEngine 10.1 component package, TerraExplore 6.5 component package, .NET Framework 4.0, Visual Studio 2010 development environment, C# 4.0 language environment, MySQL database, Eclipse development environment, Jena, Protégé.

③ Hardware environment: a single-core CPU of at least 2GHz or a multi-core CPU of at least 1.5GHz, at least 4GB of available memory, at least 100GB of available disk space, an integrated or discrete network card, a local area network (LAN) connection to the operation management and work spaces, and a discrete graphics card with at least 512MB of video memory.

The operation environment is as follows:

Operating platform: ArcEngine Runtime 10.1, MySQL, Java, Mahout, .NET Framework 4.0.

③ Hardware environment: a single-core CPU of at least 2GHz or a multi-core CPU of at least 1.5GHz, at least 4GB of available memory and at least 1TB of available disk space.

The Internet of things interface module 1 mainly provides data support for the mining system. The semantic query module 2 enables the system to build an ontology model from ontology files and to search and query by class. The data mining module 3 analyzes and mines the data in the thematic database to obtain thematic knowledge. The map aggregation and visualization module 4 mainly uses the acquired thematic knowledge to make and display maps.

The automatic thematic knowledge mining method provided by the embodiment of the invention comprises the following steps:

First, data collection and database building are carried out: data obtained from the web-based ubiquitous network, data from monitoring sensors and various thematic data are collected, organized and stored in a thematic database.

Then, thematic knowledge is mined: thematic knowledge is formed through ontology construction, semantic query and deep information mining, and the resulting files are transmitted to an FTP file server. The thematic knowledge obtained from the FTP file server is geographically associated with a geographic base map to form mapping data, after which the map layout design, thematic chart design and map finishing are carried out.

Finally, the thematic map is output.

The invention is further described below in connection with GIS spatial analysis under big data.

First, GIS spatial analysis under big data

1. Platform construction

The cluster used in this project was built from three hosts in a laboratory. The configuration of the cluster hosts is shown in the following table:

| No. | Model | Processor | Memory | Hard disk | Operating system |
|---|---|---|---|---|---|
| 1 | ThinkStation P500 | Intel Xeon E5-1600 v3 | 16GB | 1TB | CentOS-6.5 |
| 2 | ThinkStation P500 | Intel Xeon E5-1600 v3 | 16GB | 1TB | CentOS-6.5 |
| 3 | Precision T5810 | Intel Xeon E5-1600 v3 | 8GB | 500GB | CentOS-6.5 |

2. Data pre-processing

The invention provides several algorithmic approaches for processing raster map data based on the MapReduce framework, which were programmed and run on an actual cluster. They cover common raster map operations such as slope, buffer, Euclidean allocation, interpolation and kernel density, as well as statistical calculations on raster data. To make the data content easy to inspect and to simplify programming, the raster TIFF files are uniformly converted to a text file format with ArcGIS before processing.

The file header gives, in order, the number of image columns, the number of image rows, the image position coordinates, the image resolution and the default value. These values need to be passed into the Configuration object of the Job so that they can be used as global parameters during distributed MapReduce computation. The header is followed by the value at each pixel position of the image; different values carry different meanings.

After the conversion is completed, the row number is inserted at the head of each row of the image matrix so that the data can be restored after being split. The preprocessed image file is then placed directly into the Hadoop Distributed File System (HDFS).
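The preprocessing step can be sketched as follows. This is a minimal illustration only: it assumes that the converted text file uses a "key value" header in the style of the ESRI ASCII grid format (for example `ncols`, `nrows`, `cellsize`, `NODATA_value`), and the function name `preprocess_ascii_grid` is hypothetical rather than part of the described system.

```python
def preprocess_ascii_grid(lines):
    """Split a text raster into header parameters and row-numbered data lines.

    Assumption: header lines look like "ncols 4"; every other non-empty
    line holds the pixel values of one image row.
    """
    header = {}
    numbered_rows = []
    row_no = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0][0].isalpha():
            # Header line: keep it as a global parameter for the job.
            header[parts[0].lower()] = float(parts[1])
        else:
            # Data line: prefix the row number so the row can be put back
            # in place after the file has been split across map tasks.
            numbered_rows.append(f"{row_no} " + " ".join(parts))
            row_no += 1
    return header, numbered_rows
```

The returned `header` dictionary plays the role of the global parameters that would be handed to the job configuration, and the numbered lines are what would be written to HDFS.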

3. Algorithm implementation

The way images are split and recombined under the MapReduce framework differs between algorithms, but regular patterns can be found. Algorithms with similar implementation processes are therefore grouped and explained by class.

(1) Slope class

This class of algorithm needs the values of several pixels within a rectangular neighborhood around each pixel and uses them to compute the output for the target pixel. It is mainly applied to computing slope, aspect and the like.

Assume a matrix of part of an existing elevation image:

To obtain the slope value at A₃₃, the values of the nine pixels in the rectangular box are required. Assuming the image resolution is g, the slope formulas are:

f_y = [A₂₂ - A₄₂ + 2(A₂₃ - A₄₃) + A₂₄ - A₄₄] / (8g)

f_x = [A₂₂ - A₂₄ + 2(A₃₂ - A₃₄) + A₄₂ - A₄₄] / (8g)

When the Map process traverses to A₂₃, it outputs the K-V pair:

When the Map process traverses to A₃₃, it outputs the K-V pair:

When the Map process traverses to A₄₃, it outputs the K-V pair:

When the Reduce process traverses to A₃₃, its input K-V pairs are:

The slope of each pixel can then be calculated with the formulas above in the Reduce process. A 90 m resolution DEM image of the Dujiangyan area (a TIF file of about 214MB, about 637MB as a processed text file) was processed with ArcGIS and with MapReduce, and excerpts of the two slope result images are compared as follows:

Verification shows that, in the MapReduce result, points deviating by no more than 10% account for 88.63% of all pixels, which leaves some room for improvement. The error probably arises because the slope algorithm encapsulated in ArcGIS uses a formula that is refined relative to the common slope formula. Visual inspection shows that both images reflect the local slope variation clearly. Each method was run five times: serial processing in ArcGIS took about 144 seconds on average, while the MapReduce framework took about 116 seconds on average. The processing-time advantage of parallel computing becomes more pronounced as the file size and the number of MapReduce cluster hosts increase.
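The slope-class pattern can be illustrated with a small in-memory simulation of the Map and Reduce phases. This is a sketch under stated assumptions, not the Hadoop implementation described above: each row is sent to its own key and to the keys of the rows directly above and below, and the final slope angle is taken as arctan(sqrt(f_x² + f_y²)), which is the common convention and is assumed here because the text only gives f_x and f_y. All function names are hypothetical.

```python
import math
from collections import defaultdict


def map_rows(rows):
    """Map phase: emit each row under its own key and its neighbours' keys."""
    for r, values in enumerate(rows):
        for key in (r - 1, r, r + 1):
            if 0 <= key < len(rows):
                yield key, (r, values)


def reduce_slope(key, tagged_rows, g):
    """Reduce phase: slope in degrees for the interior pixels of row `key`."""
    by_row = dict(tagged_rows)
    if key - 1 not in by_row or key + 1 not in by_row:
        return None  # first or last row: no complete 3x3 window
    up, mid, down = by_row[key - 1], by_row[key], by_row[key + 1]
    out = [None] * len(mid)
    for c in range(1, len(mid) - 1):
        f_y = (up[c - 1] - down[c - 1] + 2 * (up[c] - down[c])
               + up[c + 1] - down[c + 1]) / (8 * g)
        f_x = (up[c - 1] - up[c + 1] + 2 * (mid[c - 1] - mid[c + 1])
               + down[c - 1] - down[c + 1]) / (8 * g)
        # Assumed convention: slope angle from the two partial derivatives.
        out[c] = math.degrees(math.atan(math.hypot(f_x, f_y)))
    return out


def slope_mapreduce(rows, g):
    """Group the mapped rows by key (the shuffle) and reduce each group."""
    grouped = defaultdict(list)
    for key, tagged in map_rows(rows):
        grouped[key].append(tagged)
    return {key: reduce_slope(key, vals, g)
            for key, vals in sorted(grouped.items())}
```

For an elevation plane that rises by one cell size per column, the interior pixels of the middle row come out at 45 degrees, which is a quick sanity check of the two formulas.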

(2) Euclidean distance class

This class of algorithm applies when the values of some scattered points (source pixels) are known and every pixel in the image must be evaluated according to its positional relationship to all source pixels. Examples include Euclidean distance, Euclidean allocation (Thiessen polygons) and inverse distance interpolation.

Assume a pixel matrix of part of an existing raster image:

Given two source pixels, <1,2> and <3,1>, the Map process traverses the matrix to find all source pixels and attaches them to every row:

The Reduce process traverses each pixel of each row, calculates the geometric distance between that pixel and each source pixel, and writes the minimum distance as the output:

On a 12000 × 12000 raster image, 26 points were manually selected as source pixels, giving a preprocessed image of about 620MB. The Euclidean distance was computed with ArcGIS and with MapReduce, and excerpts of the two result images are compared as follows:

Verification shows that, in the MapReduce result, points deviating by no more than 5% account for 98.64% of all pixels. The error is small because the distance formula is relatively simple and the two implementations differ little. Visual inspection shows that both images clearly reflect the distance between each pixel and the source pixels. Each method was run five times: serial processing in ArcGIS took about 143 seconds on average, while the MapReduce framework took about 108 seconds on average. The processing-time advantage of parallel computing becomes more pronounced as the file size and the number of MapReduce cluster hosts increase.
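The Euclidean distance class can be sketched in the same simulated style. The following is an assumption-laden illustration rather than the described Hadoop job: source pixels are taken to be the cells whose value differs from a hypothetical `background` value, the Map phase attaches the full source list to every row, and the Reduce phase keeps the minimum distance per pixel.

```python
import math


def map_broadcast_sources(grid, background=0):
    """Map phase: find all source pixels and attach them to every row key."""
    sources = [(r, c) for r, row in enumerate(grid)
               for c, v in enumerate(row) if v != background]
    for r in range(len(grid)):
        yield r, sources


def reduce_min_distance(r, sources, ncols, cell=1.0):
    """Reduce phase: distance from each pixel of row r to its nearest source."""
    if not sources:
        return [None] * ncols  # nothing to measure against
    return [cell * min(math.hypot(r - sr, c - sc) for sr, sc in sources)
            for c in range(ncols)]


def euclidean_distance(grid, background=0, cell=1.0):
    """Run the simulated Map and Reduce phases over the whole grid."""
    ncols = len(grid[0])
    return [reduce_min_distance(r, sources, ncols, cell)
            for r, sources in map_broadcast_sources(grid, background)]
```

Because every row carries the complete source list, each row can be reduced independently of the others, which is what makes this class easy to parallelize.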

(3) Kernel density class

This class of algorithm applies when the values of some scattered points (source pixels) are known and only the pixels within a certain distance of a source pixel need to be evaluated.

When the target point x lies within the analysis ranges of n source pixels, the kernel density formula is

Suppose an image matrix has two source pixels, A and B, and the radius expressed in pixels is RCell = r/cell = 1, as shown in the following table:

In the Map process, the row and column numbers and the value of each source pixel are attached to every row within its radius; rows that are not affected remain empty:

In the Reduce process, every pixel of a row left empty outputs a null value. In the other rows, a pixel that lies within the influence range of a source pixel has its kernel density computed; otherwise it outputs a null value:

On a 12000 × 12000 raster image, 75 points were manually selected as source pixels, each assigned a different initial value between 200 and 20000, and the influence radius of each point was set to 500 pixels. The preprocessed image was about 620MB. The kernel density was computed with ArcGIS and with MapReduce, and excerpts of the two result images are compared as follows:

The comparison of results when the influence radius of each point is set to 1000 pixels is as follows:

Verification shows that points deviating by no more than 5% account for 95.84% of all pixels in the first experiment and 96.56% in the second, so the error is small. Visual inspection shows that both images clearly reflect the degree of point influence across the image. Each method was run five times per experiment. In the first experiment, ArcGIS took about 136 seconds on average and the MapReduce framework about 192 seconds; in the second, ArcGIS took about 140 seconds and the MapReduce framework about 214 seconds. ArcGIS is clearly faster than MapReduce in this test because kernel density in ArcGIS can only be computed from a vector point layer. Compared with the raster-based MapReduce implementation, it skips the initial traversal of the matrix and is therefore more efficient.
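The kernel density class can be sketched as follows. Since the kernel formula itself is not reproduced in the text, this illustration assumes the quartic kernel that is widely used in GIS software, 3/(π·r²)·(1 − (d/r)²)², weighted by the source value; that choice, the integer pixel radius and all function names are assumptions made for the sketch.

```python
import math
from collections import defaultdict


def map_sources(sources, nrows, rcell):
    """Map phase: attach each source (row, col, value) to rows within its radius."""
    for sr, sc, val in sources:
        for r in range(max(0, sr - rcell), min(nrows, sr + rcell + 1)):
            yield r, (sr, sc, val)


def reduce_row(r, row_sources, ncols, rcell):
    """Reduce phase: density for pixels inside a radius, None (null) elsewhere."""
    out = [None] * ncols
    for c in range(ncols):
        total, hit = 0.0, False
        for sr, sc, val in row_sources:
            d = math.hypot(r - sr, c - sc)
            if d <= rcell:
                hit = True
                # Assumed quartic kernel, scaled by the source value.
                total += val * 3.0 / (math.pi * rcell ** 2) * (1 - (d / rcell) ** 2) ** 2
        if hit:
            out[c] = total
    return out


def kernel_density(sources, nrows, ncols, rcell):
    """Group the mapped sources by row and reduce every row of the grid."""
    grouped = defaultdict(list)
    for r, src in map_sources(sources, nrows, rcell):
        grouped[r].append(src)
    return [reduce_row(r, grouped.get(r, []), ncols, rcell) for r in range(nrows)]
```

Rows that receive no source in the Map phase are reduced without any distance computation, which mirrors the "empty row outputs null values" rule described above.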

The invention is further described below in conjunction with the design of the key algorithm models.

Second, design of the key algorithm models

1. Ontology semantic model

An ontology is an explicit, formal specification of a shared conceptual model. This definition captures four aspects of an ontology. The conceptual model is a model obtained by abstracting the relevant concepts of some phenomenon in the objective world, and its meaning is independent of any concrete environmental state. Explicit means that the concepts used and the constraints on their use are clearly defined. Formal means that the ontology is computer-readable. Shared means that the ontology embodies commonly accepted knowledge and reflects a recognized set of concepts in the relevant domain; it is directed at groups rather than individuals.

The goal of an ontology is to capture the knowledge of the relevant domain, provide a common understanding of that knowledge, determine the commonly recognized terms within the domain, and give clear definitions of these terms and their interrelationships at different levels of formalization.

Because an ontology is usually oriented to a specific subject field, covers a wide range and has complicated internal attribute relationships, a sound method is needed to standardize the construction process so that the builders remain consistent throughout. No unified method yet achieves this goal. The methods commonly used to construct ontologies include the skeletal method, the Methontology method, the Sensus method, the IDEF5 method and the seven-step method.

(1) Step of ontology construction

The system constructs its ontology with the seven-step method. Developed at the Stanford University School of Medicine, the seven-step method is mainly used for building domain ontologies. Its steps are: determining the domain and scope of the ontology; considering the reuse of existing ontologies; enumerating the important terms in the ontology; defining the classes and the class hierarchy (feasible approaches to refining the class hierarchy include top-down, bottom-up and combined methods); defining the properties of the classes; defining the facets of the properties; and creating instances.

(2) Query of ontology

Jena is a Java API developed by HP Labs to support Semantic Web applications. It supports description languages such as OWL, DAML+OIL and RDFS, supports access to databases such as Oracle, SQL Server and MySQL, includes the ARQ query engine, and supports the SPARQL and RDQL query languages.

Jena mainly provides the following functions. It reads and writes RDF files in the form of triples; RDF is the W3C standard for describing resources, and with Jena the content of RDF documents can be read quickly, RDF models created, content written and queries executed. It provides an ontology API for processing ontologies based on the RDF model, which supports operations on ontologies described in RDFS, DAML+OIL and OWL; combined with the reasoning subsystem, it can extract relevant information from a specific ontology. It supports two storage modes for ontologies, files and relational databases. It provides a query engine: SPARQL is an RDF-based query language in which the query states the required description, the triples serving as query conditions are matched against the triples in the ontology model, and the results are returned as a set of bindings. Its inference mechanism provides a means of constructing inference rules; when the ontology model is built, an inference engine is attached to the model to realize rule-based inference. Ontology query is an important means of searching an ontology according to the user's specific conditions in order to obtain the classes, attributes, instances and related information that meet the requirements.

For an ontology with a clear hierarchy, a query can be decomposed according to the user's requirements at retrieval time, and the result is returned after the query has been evaluated against the ontology, which improves the efficiency of communication between the user and the information system. During retrieval, the ontology offers the user multiple entry points. Because the internal elements of the ontology are stored as triples, the results of an ontology-based query reflect richer and more comprehensive information about the ontology database.
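The idea of matching query triples against the triples of the ontology model and returning a set of bindings can be shown with a small pure-Python sketch. This is not Jena's or ARQ's API; the in-memory triple list, the `?`-prefixed variable convention and the example resource names are all hypothetical and serve only to illustrate the matching step.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new_binding = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if new_binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None  # constant term does not match
    return new_binding


def query(triples, patterns):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical ontology fragment as (subject, predicate, object) triples.
TRIPLES = [
    ("PM2.5", "rdf:type", "Pollutant"),
    ("SO2", "rdf:type", "Pollutant"),
    ("PM2.5", "monitoredBy", "StationA"),
]
```

Querying `TRIPLES` with the two patterns `("?x", "rdf:type", "Pollutant")` and `("?x", "monitoredBy", "?s")` yields a single binding in which `?x` is `PM2.5` and `?s` is `StationA`.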

2. FP-tree association analysis

Association rule discovery is the mining, from a large amount of data, of valuable knowledge that describes the interrelationships between data items. As the volume of data collected and stored in databases grows, interest in mining such association knowledge from the data grows with it.

The FP-tree construction algorithm is as follows:

(1) Scan the transaction database D once to obtain the set F₁ of all frequent items contained in D and their respective support counts. Sort the frequent items in F₁ in descending order of support to obtain the list L.

(2) Create the root node T of the FP-tree and label it "null". Scan the transaction database again. For each transaction in D, select its frequent items and sort them in the order given by L. Let the sorted frequent-item list be [p | P], where p is the first frequent item and P is the list of remaining frequent items. Call insert_tree([p | P], T). The procedure insert_tree([p | P], T) works as follows: if T has a child N such that N.item_name = p.item_name, increment the count of N by 1; otherwise create a new node N, set its count to 1, link it to its parent node T, and link it via node_link to the nodes with the same item_name. If P is not empty, call insert_tree(P, N) recursively. The FP-tree is a highly compressed structure that stores all the information needed to mine frequent item sets. The memory it occupies is proportional to the depth and width of the tree. The depth of the tree is generally the maximum number of items contained in a single transaction; the width is the average number of items contained in each level. Since transactions usually share a large number of frequent items, the tree is usually much smaller than the original database. The items of each transaction are arranged in descending order of support, so items with higher support lie closer to the root of the FP-tree, which gives more opportunities for node sharing and further ensures the high compression of the FP-tree.
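The two-scan construction procedure can be sketched in Python as follows. The class and function names are hypothetical, ties in support are broken alphabetically (an assumption, since the text does not specify tie-breaking), and new nodes are prepended to the node_link chain of their item for simplicity.

```python
from collections import Counter


class Node:
    """One FP-tree node: item name, count, parent, children and node_link."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}
        self.node_link = None


def insert_tree(items, node, header):
    """Insert the sorted frequent items [p | P] below `node`."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = Node(p, node)
        node.children[p] = child
        # Chain the new node to the other nodes carrying the same item.
        child.node_link = header[p]
        header[p] = child
    else:
        child.count += 1
    insert_tree(rest, child, header)


def build_fp_tree(transactions, min_support):
    """First scan: count items. Second scan: insert the sorted transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    order = [item for item, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
             if c >= min_support]
    rank = {item: i for i, item in enumerate(order)}
    root = Node(None, None)  # plays the role of the "null" root
    header = {item: None for item in order}
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        insert_tree(items, root, header)
    return root, header, order


def linked_nodes(header, item):
    """Walk the node_link chain of `item`, starting from the header table."""
    node = header[item]
    while node is not None:
        yield node
        node = node.node_link
```

The `header` dictionary corresponds to a header table: following `node_link` from `header[item]` visits every node of the tree that carries that item.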

To make the FP-tree easier to understand, a simple example is given below. Table 1 shows a data set that contains 10 transactions and 5 items. (A transaction can be understood intuitively as the set of high-index monitoring items corresponding to a high-AQI area; the algorithm is used to discover the items with high support and high correlation.)

TABLE 1 FP-tree dataset

The FP-tree constructed after reading the first transaction is as follows:

The FP-tree constructed after reading the first three transactions is shown in FIG. 17:

The FP-tree constructed after reading all ten transactions is shown in FIG. 18:

Generally, the FP-tree is smaller than the uncompressed data because the transactions in the original data often share common items. In the best case, all transactions have the same item set and the FP-tree contains only a single path of nodes. The worst case arises when every transaction has a unique item set: since the transactions share no items, the FP-tree is effectively the same size as the original data. The storage requirement of the FP-tree is then actually larger, because additional space is needed to store the pointers between nodes and the count of each item.

The HeadLink table of this FP-tree after construction is as follows:

TABLE 2 HeadLink example

The FP-tree serves as a bridge from a frequent item set to a data set; this method can be regarded as the reverse process of FP-tree-based frequent item set mining.

The steps are generally as follows. First, an FP-tree is found that satisfies the support count constraints of the given frequent item sets. Then, a temporary database TempD that satisfies the given constraints and consists only of frequent items is generated from the FP-tree. Finally, on the basis of TempD, the infrequent items are scattered into it within the limit of the minimum support threshold to generate a series of target databases that satisfy the constraints.

In the reverse mining algorithm, the FP-tree, as a very compact data structure, stores all the information in the transaction database that is relevant to frequent item set mining. It can be regarded as an intermediate product between the original database and the corresponding frequent item sets, which makes the conversion from a given frequent item set to a database smooth and natural. After the FP-tree is constructed, a series of target databases satisfying the given constraints is generated from the frequent items.

After the FP-tree is constructed, the algorithm first looks for the frequent item sets ending in e, then those ending in d, c and b, and finally those ending in a. Since each transaction maps to a path in the FP-tree, the frequent item sets ending in a particular node (for example, e) can be found by examining only the paths that contain that node. These paths can be accessed quickly through the pointers associated with node e. FIG. 19 shows the extracted paths ending in node e:

After the paths ending in e have been found, e is removed from each of them to obtain the set of prefix paths of e. The prefix path set of e in the above figure is {{a, c, d}, {a, d}, {b, c}}. Item b appears only once in the prefix paths, does not meet the frequency condition and is discarded. The resulting {{a, c, d}, {a, d}, {c}} is called the conditional pattern base of e, and the FP-tree built from it is the conditional FP-tree of e, as shown in FIG. 20:

The item sets ending in {e, d}, {e, c}, {e, d, a}, {e, d, c, a} and so on are then examined recursively until no further conditional FP-tree can be constructed. The item sets ending in d, c, b and a are handled in the same way, and the conditional pattern bases that meet the frequency condition at each step are collected to find the frequent item sets.
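The recursive mining step can be sketched as follows. To stay short, this illustration keeps each conditional pattern base as a list of (prefix path, count) pairs instead of building a linked conditional FP-tree, so it is a simplified pattern-growth variant rather than the exact procedure described above. The ten-transaction data set `EXAMPLE` is an assumed illustration chosen so that the prefix paths of e are {a, c, d}, {a, d} and {b, c}; it is not taken from Table 1.

```python
from collections import Counter


def fp_growth(pattern_base, min_support, suffix=()):
    """Yield (frequent item set, support) pairs from a list of (path, count)."""
    counts = Counter()
    for items, cnt in pattern_base:
        for item in set(items):
            counts[item] += cnt
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: k for k, item in enumerate(order)}
    # Drop infrequent items and sort every path by descending support.
    paths = [(sorted((i for i in set(items) if i in rank), key=rank.get), cnt)
             for items, cnt in pattern_base]
    for item in reversed(order):  # least frequent item first
        new_suffix = (item,) + suffix
        yield frozenset(new_suffix), frequent[item]
        # Conditional pattern base: the prefix of each path that contains item.
        cond_base = [(tuple(p[:p.index(item)]), cnt) for p, cnt in paths if item in p]
        cond_base = [(p, cnt) for p, cnt in cond_base if p]
        yield from fp_growth(cond_base, min_support, new_suffix)


# Assumed illustrative data set: 10 transactions over the 5 items a to e.
EXAMPLE = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]


def mine(transactions, min_support):
    """Convenience wrapper: map every frequent item set to its support."""
    return dict(fp_growth([(tuple(t), 1) for t in transactions], min_support))
```

With a minimum support count of 2, `mine(EXAMPLE, 2)` reports {e} with support 3 and {a, d, e} with support 2, while {c, d, e} is absent because it occurs in only one transaction.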

3. Random forest classification model

The random forest classification model is mainly used to classify and manage thematic knowledge. The random forest algorithm is an ensemble learning method in the field of machine learning: it combines the classification results of multiple decision trees to form an overall classifier. Compared with other classification algorithms, the random forest algorithm has many advantages. In terms of classification performance, it offers high accuracy, small generalization error and the ability to handle high-dimensional data. In terms of training, the learning process is fast and easy to parallelize. Because of these two groups of advantages, the random forest algorithm is widely applied and has become one of the preferred algorithms for classification problems.

The random forest calculation flow is shown in FIG. 6. First, K different sample data sets are drawn at random from the original data set by the bootstrap method to serve as the training subset of each decision tree; each sample has the same size as the original data set, and the data not drawn in a given round form the out-of-bag data. Second, a classification and regression tree is built for each sample data set, giving K decision trees; during growth, at each node a subset of variables is drawn at random from the original variable set, and the best variable in the subset is selected by the minimum Gini index criterion to split the node. Third, each classification and regression tree grows recursively from the top down until the configured minimum leaf size is reached, at which point the tree stops growing, and all decision trees are combined into a random forest. Fourth, the test data are fed into the random forest model, each of the K decision trees makes a prediction, and the average of the trees' predictions is taken as the regression value, that is, the predicted value.

In the random forest algorithm, the number of decision trees determines the classification accuracy: too few trees directly reduce accuracy, while too many tend to cause overfitting, which also affects accuracy. Forests with different numbers of trees are therefore built and their results validated. Because the construction of a random forest involves randomness, the accuracy obtained with the same parameters deviates somewhat above or below the true level each time a new forest is built.
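The four steps can be sketched in plain Python as follows. This is a toy illustration, not the Mahout implementation used by the system: all names are hypothetical, features are assumed to be numeric, and because the example predicts class labels it combines the trees by majority vote instead of averaging, which is the usual counterpart of averaging for classification.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Pick the (feature, threshold) with the lowest weighted Gini index."""
    best = None
    for f in features:
        for thr in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best


def grow(rows, labels, m, min_leaf, rng):
    """Grow one tree top-down; a leaf is simply the majority label."""
    if len(set(labels)) == 1 or len(labels) <= min_leaf:
        return Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, thr = split
    left = [i for i, r in enumerate(rows) if r[f] <= thr]
    right = [i for i, r in enumerate(rows) if r[f] > thr]
    return (f, thr,
            grow([rows[i] for i in left], [labels[i] for i in left], m, min_leaf, rng),
            grow([rows[i] for i in right], [labels[i] for i in right], m, min_leaf, rng))


def predict_tree(tree, row):
    """Follow the splits of one tree down to a leaf label."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


def random_forest(rows, labels, k, m, min_leaf=1, seed=0):
    """Build k trees, each on a bootstrap sample; keep the out-of-bag indices."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size n
        oob = set(range(n)) - set(idx)              # out-of-bag data
        tree = grow([rows[i] for i in idx], [labels[i] for i in idx], m, min_leaf, rng)
        forest.append((tree, oob))
    return forest


def predict(forest, row):
    """Majority vote over all trees (assumption: classification, not regression)."""
    return Counter(predict_tree(t, row) for t, _ in forest).most_common(1)[0][0]
```

Here `k` is the number of trees and `m` the size of the random variable subset tried at each node, which correspond to the two parameters the user fills in before building the forest.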

The invention is further described below in connection with the detailed functional design of the system.

Third, detailed functional design of the system

Based on the above results, an automatic thematic map knowledge mining system was developed. It can automatically generate thematic map knowledge and information for urban management and provides ubiquitous intelligent information services for smart urban management decisions.

The invention adopts a C/S architecture, uses ArcEngine for secondary development and adopts the Geodatabase geographic database model. The system comprises four main functional modules.

The workflow of the automatic thematic knowledge mining system is divided into five steps. First, data collection and database building are carried out: data obtained from the web-based ubiquitous network, data from monitoring sensors and various thematic data are collected, organized and stored in a thematic database. Then, thematic knowledge is mined: thematic knowledge is formed through ontology construction, semantic query and deep information mining, and the resulting files are transmitted to an FTP file server. The thematic knowledge obtained from the FTP file server is geographically associated with a geographic base map to form mapping data; the map layout is designed, the thematic charts are designed and the map is finished; and finally the thematic map is output. The flow chart is shown in FIG. 3.

1. Network-based data acquisition

(1) Obtaining data from the internet through an API

When the network data interface is clicked to acquire data, the system automatically starts to acquire air quality data from the network and stores them in the database. Similarly, when the GPRS data interface is clicked, the system automatically sends an instruction to the device and starts listening on the corresponding port to receive the transmitted data.

(2) Data retrieval and presentation

The data query function allows data in the database to be queried and displayed according to query conditions. Filling in the start and end times displays the data acquired in that period; when the start and end times are left empty, the system displays all data in the database. The user can also select individual index fields and query the database records to be displayed by the range of the field values.
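The query behavior described above (an optional time window plus optional value ranges per index field) can be sketched as a parameterized query builder. The system itself uses MySQL; this sketch uses Python's built-in `sqlite3` module as a stand-in, and the table name `air_quality`, the column `sample_time` and the field names in `ALLOWED_FIELDS` are hypothetical.

```python
import sqlite3

# Hypothetical index fields that the user may filter on.
ALLOWED_FIELDS = {"aqi", "pm25", "pm10"}


def build_query(start=None, end=None, ranges=None):
    """Build a parameterized SELECT; empty conditions return all records."""
    clauses, params = [], []
    if start:
        clauses.append("sample_time >= ?")
        params.append(start)
    if end:
        clauses.append("sample_time <= ?")
        params.append(end)
    for field, (low, high) in (ranges or {}).items():
        # Column names cannot be bound as parameters, so check a whitelist.
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field}")
        clauses.append(f"{field} BETWEEN ? AND ?")
        params.extend([low, high])
    sql = "SELECT * FROM air_quality"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY sample_time", params


def run_query(conn, **conditions):
    """Execute the built query on an open connection and return all rows."""
    sql, params = build_query(**conditions)
    return conn.execute(sql, params).fetchall()
```

Values are always passed as bound parameters, and only whitelisted field names are interpolated into the SQL text, which keeps user input out of the statement itself.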

(3) Data export

When device monitoring data in the database need to be exported, the user clicks "Export Excel table" on the interface, selects the export path and clicks "Open"; the Excel table is then exported to the specified path.

2. Ontology construction and semantic query

(1) Ontology construction

Ontology construction, that is, the generation of the ontology model, is the most basic step in semantic query by the system. The standard format of the ontology model is OWL. OWL (Web Ontology Language) is a web ontology language developed by the W3C for the semantic description of ontologies. It remains compatible with DAML+OIL and RDFS while providing stronger semantic expressiveness, and it preserves the decidable reasoning of description logic (DL). The ontology file is selected from the local files, and the ontology is then constructed through the "generate ontology model" function.

(2) Semantic query

After the ontology model is successfully generated, the category and the search word are selected from the list, and the semantic query result can be displayed through the search function.

3. Association analysis

(1) Generating the association data set

For association analysis, an original data set is first selected from the local files and then processed by the "generate association data set" function into an association data set on which association analysis can be performed. The original data set is in table format, and the generated association data set is in txt format.

(2) Association analysis

After the association data set has been generated successfully, it is selected and a frequency threshold is set; the association analysis function then displays the association analysis results. After the selection parameters are entered, clicking the "generate classification data set" button generates the data set that is subsequently used for random forest classification.

4. Random forest classification

(1) Performing classification

For random forest classification, the classification data set is first uploaded to the specified path /user/hadoop/2014AQI/test/ of the distributed file system on the Linux system (a virtual machine is required). The IP address of the virtual machine, the index number of the data set, the number of variables per decision tree and the number of decision trees in the forest are then filled in, and classification is completed with the "generate description", "build forest" and "start classification" functions.

(2) Exporting the classification result

The classification result file part-m-00000 is exported to the Windows system. The result file and the original data set are selected in the "classification result" and "original data set" fields respectively, and clicking the "import classification result" button imports the classification result into the original data set table.

5. Thematic map generation

(1) Data source of thematic map

A thematic map comprises a basic base map and thematic content. The basic base map is the positioning basis for the thematic content and describes the relationship between the thematic content and the surrounding geographic environment. The thematic content is determined by the theme of the map. The basic base map is usually compiled from general geographic maps and image maps of the same scale, whereas the data used for the thematic content come from a variety of sources.

DLG (digital line graphic) and DEM (digital elevation model) data from the national geographic conditions census are used, with the administrative division data in the DLG serving as the general base map data.

Air quality data are published in real time, and the invention collects real-time air quality data in two ways. In the first, several monitoring points are selected in Sichuan province and air quality index data are collected in real time with air monitoring and acquisition equipment. In the second, Internet technology is used to obtain current data such as the air quality of the cities and prefectures in the province from the network (such as quasi-real-time data providers like weather and wind). These data originate from the real-time monitoring equipment at each monitoring site and have been suitably processed. The technical route is shown in FIG. 7.

Historical air quality data: web crawler technology is used to acquire historical data for each area of Sichuan province from relevant reference websites.

The acquired data are diverse and structurally complex. In general, data take three forms: structured, unstructured and semi-structured. Structured data mostly exist as two-dimensional tables, such as relational databases and Excel tables, including the 2015 historical air quality data collected here. Unstructured data mostly appear as text documents without fixed rules, such as statistical yearbooks. Semi-structured data lie between the two and exist on the Internet in the form of HTML web pages (including the XML and JSON formats behind web pages).

(3) Thematic map compilation

On the basis of the traditional thematic map production workflow, a new workflow is established by introducing the latest cartographic and GIS technology, mainly addressing the automation and intelligence of thematic map production. The specific design process is shown in FIG. 8. The thematic map project is opened and the thematic page to be edited is added. Symbolization, map editing, map border finishing, layout design, thematic chart design and so on are then carried out according to the map template. After the map conflict check is completed, inspectors interactively correct any problematic maps by hand until no errors remain, which completes the production of the thematic map.

The invention is further described below in conjunction with the specific workflow of thematic atlas compilation.

The specific workflow of thematic atlas compilation is roughly divided into the following modules:

1. Page design

Page creation: enter the page title and select the page type. The current page can be set once the page has been created.

Page management: adjusts the page numbers of the pages, modifies page information and deletes pages.

2. Layout design

① Map frame design

Select the frame elements and set parameters such as border (lace) size, background color and line color; the system then sets the decorative map border automatically.

② Map page design

Select the page finishing elements and set parameters such as height and width; the system then sets the corresponding page elements automatically. Alternatively, select a finishing element and click on the map, and the system generates the page element at the specified position.

3. Data editing

① Data selection tool: selects elements.

② Editor selection tool: selects elements in the editing state.

③ Rotation: rotates an element.

④ Line reversal: exchanges the start point and the end point of a line element.

⑤ Merging: merges adjacent polygon elements.

⑥ Polygon splitting: divides one polygon element into two or more; the element to be divided is selected first.

⑦ Line trimming: changes the current bending state of a line element, and the like; only one element can be selected.

⑧ Element creation: creates an element.

4. Data normalization

① Arithmetic (equal-difference) grading standardization

There are two grading models in the arithmetic grading standardization, and a proper model can be selected to grade data according to the requirement.

② geometric grading standardization

Two grading models exist in the geometric grading standardization, and a proper model can be selected for grading data according to needs.

③ Statistical grading standardization

The statistical grading normalization is based on the standard deviation of the statistical values.
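The three grading approaches can be illustrated with a short Python sketch. The text does not specify the two models available within the arithmetic and geometric approaches, so the functions below are assumptions: one simple equal-difference model, one constant-ratio model, and class boundaries at whole standard deviations around the mean. All function names are hypothetical.

```python
import statistics


def arithmetic_breaks(values, classes):
    """Class boundaries with a constant difference between neighbours."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / classes
    return [lo + step * i for i in range(classes + 1)]


def geometric_breaks(values, classes):
    """Class boundaries with a constant ratio between neighbours (needs min > 0)."""
    lo, hi = min(values), max(values)
    if lo <= 0:
        raise ValueError("geometric grading needs strictly positive values")
    ratio = (hi / lo) ** (1 / classes)
    return [lo * ratio ** i for i in range(classes + 1)]


def stddev_breaks(values, half_width=2):
    """Class boundaries placed at whole standard deviations around the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [mean + k * sd for k in range(-half_width, half_width + 1)]


def grade(value, breaks):
    """Return the 1-based class of value; values outside the breaks are clamped."""
    for i in range(1, len(breaks) - 1):
        if value < breaks[i]:
            return i
    return len(breaks) - 1
```

Geometric grading suits strongly skewed indexes, whereas standard-deviation grading shows how far each unit lies from the average.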

5. Thematic design

① Thematic map rendering. Using at least four colors, adjacent areas of the generated layer are given different colors.

② Thematic bar chart. (1) Prepare the data. (2) Generate the thematic bar chart.

③ Thematic histogram.

Three-dimensional quantity histogram

(1) Prepare the data: store the number of industrial activity units in each city. (2) Set the quantity histogram attributes.

If the histogram is not associated with a map layer, the association attribute is not set. This example associates a layer.

Two-dimensional quantity histogram

(1) Prepare the data. (2) Set the quantity histogram attributes.

④ Thematic pie chart.

(1) Prepare the data. (2) Set the attributes of the three-dimensional ring pie chart.

The three-dimensional pie chart and the three-ring pie chart are produced in the same way as the three-dimensional ring pie chart. Note that if the pie chart is associated with a layer, its association attribute must be set.

⑤ Thematic line chart.

(1) Prepare the data. (2) Set the attributes of the line chart.

⑥ Thematic chart editing.

(1) Editing a thematic chart: set the chart size and the symbol attributes.

(2) Moving thematic charts: move the chart to the appropriate location.
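The rendering rule in step ① of the thematic design, under which adjacent areas receive different colors chosen from four, can be sketched as a small backtracking search. This is an illustrative assumption rather than the rendering method of the system; the region names and adjacency below are hypothetical.

```python
def four_color(adjacency, colors=("red", "green", "blue", "yellow")):
    """Assign one of the given colors to every region by backtracking search."""
    # Regions with the most neighbours are colored first to prune the search early.
    regions = sorted(adjacency, key=lambda r: len(adjacency[r]), reverse=True)
    assignment = {}

    def place(index):
        if index == len(regions):
            return True
        region = regions[index]
        used = {assignment[n] for n in adjacency[region] if n in assignment}
        for color in colors:
            if color not in used:
                assignment[region] = color
                if place(index + 1):
                    return True
                del assignment[region]
        return False

    return assignment if place(0) else None


# Hypothetical symmetric adjacency between five regions of one layer.
ADJACENCY = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "C", "E"],
    "E": ["B", "C", "D"],
}
coloring = four_color(ADJACENCY)
```

The function returns None when the supplied colors cannot satisfy the adjacency constraints.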

6. Symbol collision handling

① General tools.

(1) Element edge simplification.

(2) Water-road conflict detection: the distance between rivers and roads is detected, and rivers and roads that are too close to each other are marked.
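The conflict detection step can be sketched as a minimum-distance test between polylines. The example below assumes planar (projected) coordinates, a hypothetical threshold `min_gap`, and illustrative geometry; it is not the detection routine of the system itself.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def segments_cross(a, b, c, d):
    """True when segments a-b and c-d properly cross each other."""
    return _cross(a, b, c) * _cross(a, b, d) < 0 and _cross(c, d, a) * _cross(c, d, b) < 0


def segment_distance(a, b, c, d):
    """Minimum distance between segments a-b and c-d (0 when they cross)."""
    if segments_cross(a, b, c, d):
        return 0.0
    return min(
        point_segment_distance(a, c, d),
        point_segment_distance(b, c, d),
        point_segment_distance(c, a, b),
        point_segment_distance(d, a, b),
    )


def polyline_distance(line1, line2):
    """Minimum distance over all pairs of segments of two polylines."""
    return min(
        segment_distance(a, b, c, d)
        for a, b in zip(line1, line1[1:])
        for c, d in zip(line2, line2[1:])
    )


def find_conflicts(rivers, roads, min_gap):
    """Return (river name, road name, distance) for pairs closer than min_gap."""
    conflicts = []
    for river_name, river in rivers.items():
        for road_name, road in roads.items():
            gap = polyline_distance(river, road)
            if gap < min_gap:
                conflicts.append((river_name, road_name, gap))
    return conflicts
```

Comparing every pair of segments is adequate for a sketch; a production tool would normally use a spatial index to limit the pairs tested.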

7. Annotation

Adding annotations: set attributes such as the text content and font size of the annotation.

To change the attributes of an annotation, select the annotation and open its attribute table to change them, as indicated by the small blue circle in the figure.

8. Map output

The laid-out thematic map is output as a print map in the publishing view.

The classification result obtained from the mined knowledge is then mapped. Grade 1 represents an annual average AQI of 0-25, grade 2 an annual average AQI of 25-50, grade 3 an annual average AQI of 50-75, and grade 4 an annual average AQI of 75-100. The map shows the relation between urban air quality and indexes such as population and gross production value, although the air quality of each city is also influenced by wind direction.
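The grade definition can be written as a small lookup. The grade ranges share their end points (25, 50, 75), and the text does not say which grade a boundary value belongs to; the sketch below assumes it falls in the lower grade and rejects values outside 0-100.

```python
import bisect

GRADE_UPPER_BOUNDS = [25, 50, 75, 100]  # upper AQI limits of grades 1 to 4


def aqi_grade(annual_mean_aqi):
    """Map an annual average AQI in [0, 100] to grade 1-4."""
    if not 0 <= annual_mean_aqi <= 100:
        raise ValueError("annual average AQI outside the graded range 0-100")
    # bisect_left places a boundary value in the lower grade (an assumption).
    return bisect.bisect_left(GRADE_UPPER_BOUNDS, annual_mean_aqi) + 1
```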

The historical statistics thematic map counts, for each county in Sichuan province, the number of days from May to October 2016 on which the air quality was excellent, good, lightly polluted, moderately polluted, heavily polluted or severely polluted, so that a user can clearly see the air quality of each region of Sichuan during that period. For example, the air quality in Aba Prefecture is the best, being excellent on almost every day. It can also be seen that the more urbanized and economically developed an area is, the more concentrated its population and the worse its air quality.
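A sketch of the per-county day count behind such a map is given below. The text does not state the AQI limits of the six categories; the limits used here follow the Chinese ambient air quality index technical regulation (HJ 633-2012) and are therefore an assumption, as are the record layout and the sample values.

```python
from collections import Counter, defaultdict

# Assumed upper AQI limit of each category; AQI above 300 is "severely polluted".
CATEGORY_LIMITS = [
    (50, "excellent"),
    (100, "good"),
    (150, "lightly polluted"),
    (200, "moderately polluted"),
    (300, "heavily polluted"),
]


def category(aqi):
    """Return the air quality category name for a daily AQI value."""
    for upper, name in CATEGORY_LIMITS:
        if aqi <= upper:
            return name
    return "severely polluted"


def count_days(daily_records):
    """Count days per category for each county from (county, day, aqi) records."""
    counts = defaultdict(Counter)
    for county, _day, aqi in daily_records:
        counts[county][category(aqi)] += 1
    return counts


# Hypothetical daily values, not measured data.
sample = [
    ("County A", "2016-05-01", 42),
    ("County A", "2016-05-02", 48),
    ("County B", "2016-05-01", 135),
    ("County B", "2016-05-02", 88),
]
day_counts = count_days(sample)
```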

According to the release of the historical statistical thematic map, the air quality analysis can be specifically carried out on certain areas, corresponding pollution sources can be found out, and a constructive suggestion can be provided for government organs.

The invention is further described below in connection with database construction.

Fourth, database construction

1. Construction technical route

The invention provides a database building scheme designed to support data mining for the smart city: smart city data are managed uniformly on a properly constructed database platform. The main data comprise spatial basic data (such as water system data, administrative area data and traffic line data) and city data (such as population data, population income data and air quality data), so that all of these data can serve data mining.

The geographic information database oriented to the smart city needs to be placed in the whole smart city construction framework, the position, the action and the relation with other parts of the geographic information database are analyzed, and the scene of the geographic information database in supporting and constructing city management application is shown in fig. 10.

In the figure, the core of the geographic information system is a set of databases comprising a city base database, an application subject database, a shared exchange database, a personalized service product database, and a resource catalog and metadata database. As the centralized management center for the various data of city management, the database plays the following roles: (1) it serves as the metadata management center for the geographic information resources of the smart city; (2) it provides a unified retrieval center for the geographic information resources of the smart city; (3) it supports the construction of the informatization infrastructure for city management.

Internet of things data access and mass data centralized storage for smart city management: the construction of a data warehouse needs to be based on a large-scale commercial data warehouse technology, take a policy standard guarantee system as a basis, apply a mature database theory and method, combine a system architecture, break through multiple key technologies, introduce the latest database, storage and other technologies, and finally construct a geographic information database system. The construction framework of the geographic information database system facing the smart city is shown in fig. 11.

The overall database framework is founded on and supported by a standard specification system and a security guarantee system. It adopts a multi-layer architecture divided into a source data layer, a data import layer, a data storage layer, a data management layer, a data mining layer, a data service layer and a user layer.

Standard specification system: one of the most important and basic tasks in constructing and operating the geographic information warehouse for city management; its aim is that the warehouse be established and operated in a purposeful, planned, stepwise, regulated and standardized manner.

Security guarantee system: established to protect the information security of the geographic information warehouse and to guarantee the security of data at every stage of its life cycle, such as production, processing, transmission and storage.

Source data layer: urban geographic information resources are rich and involve many departments, so they are multi-source, heterogeneous and massive.

Data import layer: once the information requirements of the data warehouse have been determined, data modeling is required; the processes of extracting, cleaning and converting data from the data sources into the data warehouse are determined, dimensions are analyzed and divided, and the physical data model and storage structure of the data warehouse are determined.

Data storage layer: the core of the data warehouse; it is divided into metadata and entity data and is the main body through which the data warehouse provides services externally.

Data mining layer: the window through which the subject-oriented data warehouse provides services externally; personalized products are manufactured with data mining technology through standard product processing, statistical analysis, application customization and the like.

Data service layer: provides users with a variety of service applications, including directory services, map services, data access services and integrated presentation services.

Data management layer: provides all the management tools that support the definition, management, service, operation and maintenance of the data warehouse.

From the above it can be summarized that, in smart city construction, the situation of geographic information resources is complicated, and the core problem is to build a complete capability that ranges from the centralized storage of massive geographic information resources, the unified management of multi-source data and data exchange and sharing between heterogeneous systems to the manufacture of personalized products. Analysis and design are therefore carried out around the following four key technologies, to solve the core problem of constructing the geographic information data warehouse system and to support system development and implementation.

2. Database design and implementation

(1) Demand analysis

In order to realize data management and application of the smart city database, the smart city database needs to be constructed on the basis of the established spatial database. The database system should meet the application requirements of space data integration management, database updating maintenance, system security management, achievement application service and the like. The most basic requirement of a database system is the integrated management and display of various achievement data, so that the resource management, integrated display, query retrieval and other aspects of intelligent spatial data and non-spatial data need to be developed on the basis of the database, and the method specifically comprises the following aspects:

(1) Integrated management of achievement data: integrated storage and management is required for smart city achievement data, collected and collated thematic data, urban statistical result data and the like.

(2) Data visualization: symbolization and two-dimensional overlay display of the various achievement data are realized, together with functions such as three-dimensional terrain display based on terrain and high-resolution remote sensing images and two-dimensional/three-dimensional linkage.

(3) The comprehensive query retrieval function: the comprehensive query and retrieval functions based on the database are needed, and comprise query and retrieval functions in various modes such as mutual query of spatial position and attribute information, length and area measurement, statistical unit query, remote sensing interpretation sample query, air quality, ground surface coverage and traffic element query, buffer area query, metadata query and the like.

(2) Data update and maintenance

In order to facilitate the updating and maintenance of the database, the system needs to provide data warehousing inspection, data preprocessing, data storage, data exchange, road network and water network construction, historical data management and maintenance, and the like.

Data warehousing inspection: data to be warehoused or updated undergo the necessary pre-warehousing checks, which cover the consistency of the files and structure of the data, topological consistency, logical consistency, the spatial reference and the correctness of edge matching of vector data, so as to ensure that the data can be warehoused and updated smoothly.

Data preprocessing: and performing objectification pretreatment on the result data subjected to warehousing examination before warehousing, wherein the objectification pretreatment comprises the functions of result data sorting, projection conversion, attribute structure adjustment, object element combination, object entity coding, data derivation extraction, water system and road network data processing and the like.

Data storage: the geographic national condition data that have undergone objectification preprocessing, or the collected and collated thematic data, are warehoused or used to update the data in the database.

Data exchange: besides realizing database updating and data input and output of external distribution service, the system also needs to have a data exchange function between a provincial database and a national database, and data in the provincial database can be imported into the national database after being exported to update data in provincial regions or partial regions.

Constructing a road network and a water network: and constructing a road network and a water system network under a database environment according to the road network and water system network data model.

Managing historical data: in order to realize historical data management, the system needs to effectively manage historical information of the geographic national condition census database, establish a historical evolution relation between spatial information and attribute information, and provide a function of historical backtracking or geographic national conditions at multiple time points. The specific management functions include: when the data is updated, the current data in the database is converted into historical data, the data is inquired and retrieved, the data is deleted and maintained, the version data is extracted, the data is exported, and the like.
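The historical data management described above, in which current data becomes historical data when an update arrives and earlier states can be traced back, can be sketched as follows. The table names, the column names and the use of an in-memory SQLite database are assumptions for illustration; the text does not prescribe a storage layout.

```python
import sqlite3


def open_store():
    """Create hypothetical current and history tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feature_current (id TEXT PRIMARY KEY, attrs TEXT, valid_from TEXT)")
    conn.execute("CREATE TABLE feature_history (id TEXT, attrs TEXT, valid_from TEXT, valid_to TEXT)")
    return conn


def update_feature(conn, feature_id, attrs, when):
    """Archive the current version (if any) as history, then store the new one."""
    row = conn.execute(
        "SELECT attrs, valid_from FROM feature_current WHERE id = ?", (feature_id,)
    ).fetchone()
    if row is not None:
        conn.execute(
            "INSERT INTO feature_history VALUES (?, ?, ?, ?)",
            (feature_id, row[0], row[1], when),
        )
    conn.execute(
        "INSERT OR REPLACE INTO feature_current VALUES (?, ?, ?)",
        (feature_id, attrs, when),
    )
    conn.commit()


def feature_as_of(conn, feature_id, when):
    """Return the attributes valid at the given ISO date, or None."""
    row = conn.execute(
        "SELECT attrs FROM feature_current WHERE id = ? AND valid_from <= ?",
        (feature_id, when),
    ).fetchone()
    if row is not None:
        return row[0]
    row = conn.execute(
        "SELECT attrs FROM feature_history WHERE id = ? AND valid_from <= ? AND valid_to > ?",
        (feature_id, when, when),
    ).fetchone()
    return row[0] if row else None
```

Dates are kept as ISO strings so that plain string comparison orders them correctly.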

(3) Database overall design

The smart city database system is composed of four parts, namely the infrastructure, the database, the database management and application service system, and the database construction technical specification, as shown in fig. 12.

Infrastructure: the system is a software, hardware and network environment supporting the operation of the whole database management and application service system, and mainly comprises IT infrastructure resources such as computing resources, storage resources, network resources and safety equipment. The virtual technology can be adopted to perform virtual management on infrastructure resources, and a cloud infrastructure platform is realized.

Database: the data resource of the whole database system, providing data storage and management capability. It is divided into three major categories (general survey results, statistical analysis results and thematic data) and seven sub-databases. The seven sub-databases of the geographic national condition general survey database are: topography and landforms, remote sensing image interpretation samples, ground surface coverage, geographic national condition elements, thematic data, geographic national condition statistical analysis results, and other data.

Database management and application service system: based on an infrastructure and a data unified access interface, functional components in the aspects of data updating maintenance and system security management and service interfaces such as two-dimensional visualization, query retrieval, statistical analysis, result release and the like are designed and developed, and a database management and application service system is constructed facing different application modes of a desktop end and a WEB end on the basis.

Database construction technical specification: the data content of the database, the database design requirement, the technical process, the operating environment and system safety design requirement, the provincial database building result convergence and other requirements are specified, and the provincial database and the national database are unified.

(4) Database concept design

(1) The spatial basic data (as shown in fig. 13) mainly consists of traffic lines, boundary lines and water systems, and describes the basic makeup of an area. A road network needs to be constructed on the basis of road elements and traffic auxiliary facilities, and a water system network on the basis of water system elements and hydraulic facilities. Road facilities and water systems are mainly composed of directed arc segments, each of which represents the connecting line between two adjacent points. Nodes connect the arc segments and guide movement from one arc segment to another. A boundary line is composed of curves and arc segments, and each closed arc segment represents an administrative region.

The traffic line is essentially a mesh structure and is composed of three elements of a network edge, a network node and an obstacle limiting point. The network side is composed of road sections such as highways and railways, the road sections have directions and can pass in one direction or two directions, the highways are divided into expressways, national roads, provincial roads, county roads and the like, and urban roads can be divided into subways (light rails) and other urban roads. The network node is composed of an expressway entrance, an expressway intersection, a broken road terminal, a subway (light rail) station and the like, and mainly plays a role in communication between road sections, the expressway can be communicated with other roads only at the entrance and the exit of the expressway when being intersected with other roads, and the communication relationship between the road sections at the plane intersection and the road sections is divided into straight going, left turning and right turning. The obstacle limit point has a limiting factor on some aspects of traffic, including road bridges, tunnels and the like.
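The three elements of the traffic network (network edges, network nodes and obstacle limit points) can be illustrated with a small directed graph. In this sketch an obstacle limit point is modeled, as a simplifying assumption, by closing the road section it lies on; turn restrictions at intersections are not modeled. Class and method names are hypothetical.

```python
import heapq
from collections import defaultdict


class RoadNetwork:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(neighbour, length, section id)]
        self.blocked = set()            # section ids closed by an obstacle limit point

    def add_section(self, section_id, start, end, length, two_way=True):
        """Add a road section; one-way sections are passable from start to end only."""
        self.edges[start].append((end, length, section_id))
        if two_way:
            self.edges[end].append((start, length, section_id))

    def block(self, section_id):
        """Mark a section as impassable."""
        self.blocked.add(section_id)

    def shortest_path(self, source, target):
        """Dijkstra search; returns (total length, node list) or None."""
        queue = [(0.0, source, [source])]
        settled = set()
        while queue:
            dist, node, path = heapq.heappop(queue)
            if node == target:
                return dist, path
            if node in settled:
                continue
            settled.add(node)
            for nxt, length, section_id in self.edges[node]:
                if section_id not in self.blocked and nxt not in settled:
                    heapq.heappush(queue, (dist + length, nxt, path + [nxt]))
        return None
```

For example, a one-way section from A to C can be used for a route from A to C but never for the return trip.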

The water system network consists of water system arc segments, water system nodes, barrier points and the like. Water system arc segments include the center-line structure lines of rivers and canals and the structure lines of canals, reservoirs, ponds, lakes and the like that are collected to keep the water system connected; they have a flow direction. Water system nodes are water system junctions, river sources and river termini (estuaries); the junctions connect the arc segments and pass the flow between them. Hydraulic auxiliary facilities on the water system, such as dams and water gates, can artificially divert or obstruct the water flow and affect the water volume, the flow direction and navigation, so the water system network also includes barrier limit points on the water system. The dams that affect the water flow include rolling dams and retaining dams, and the gates include water inlet gates, water discharge gates, check gates, tidal gates, ship locks and canal head gates.

(2) The city data (as shown in fig. 14) mainly comprises administrative boundary lines, air quality data and social statistics data, which are representative indexes of a city. An administrative boundary line is a boundary that distinguishes administrative jurisdictions, here the administrative division lines of the counties and cities of Sichuan province; it can be used for regional statistics on data within a jurisdiction, such as water system density and population density maps.

Social statistics data are closely related to geographic units, indexes cannot be regarded as entity objects, the areas described by the indexes are regarded as entities, and all the indexes are used as components for supporting. It requires administrative boundary data for support.

Air quality data reflects the air pollution level of a city, which is a complex phenomenon where the air pollutant concentration at a particular time and place is influenced by many factors. The magnitude of the emission of man-made pollutants from stationary and mobile sources is one of the most important factors affecting air quality, including exhaust gases from vehicles, ships, airplanes, industrial pollution, residential and heating, waste incineration, etc. The development density of cities, landforms, weather and the like are also important factors influencing the air quality.

(3) System management data conceptual model

The database system management data comprises the various data required for managing the database system, including user data, authority data, log data, a data dictionary, geographic national condition overview data and the like. The entity objects of the conceptual model comprise users, function authorities, functions, data authorities, data directories, system logs, the data dictionary and the like. Furthermore, since the spatial geographic basic data is equivalent to geographic data at the 1:50,000 or 1:10,000 scale level, in order to facilitate system management and meet the need to display system data hierarchically, geographic national condition overview data is formed as small-scale display data by integration and processing on the basis of the spatial geographic basic data and the existing 1:1,000,000 basic geographic data. The system management data conceptual model is shown in fig. 15.

(5) Database logic design

The geographic national condition census database is logically designed on the basis of the GeoDatabase model. Fields of type SDO_GEOMETRY are used to store the GIS spatial data, and all data are built uniformly in the China Geodetic Coordinate System 2000 (CGCS2000) using geographic coordinates in degrees. The national geographic national condition database is named GNCDB; a provincial database name consists of GNCDB followed by the 2-letter abbreviation code of the provincial administrative region name, which is determined according to GB/T 2260.
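The naming rule can be illustrated with a short helper. The two letter codes shown are the GB/T 2260 abbreviations for Sichuan and Beijing; the dictionary is deliberately partial and the function name is hypothetical.

```python
# Partial, illustrative mapping of province names to GB/T 2260 letter codes.
PROVINCE_LETTER_CODES = {"Sichuan": "SC", "Beijing": "BJ"}

NATIONAL_DB_NAME = "GNCDB"


def provincial_db_name(province):
    """Build a provincial database name: GNCDB plus the 2-letter province code."""
    return NATIONAL_DB_NAME + PROVINCE_LETTER_CODES[province]
```

Under this rule the Sichuan provincial database would be named GNCDBSC.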

The data in the database is stored and managed in several forms, namely vector data sets, table data, document data and the like. The vector data sets include the road network, the water system network, geographic unit metadata, thematic data and statistical analysis result data; the common table data include social and economic statistical data, network relation tables and other table data. The overall logical structure of the database is shown in fig. 16.

All vector element layers are logically spliced seamlessly over the whole database range (the whole country for the national database; the province, autonomous region or municipality directly under the central government for a provincial database). Because statistical analysis is mainly calculated by county-level administrative district, layers with large data volumes are partitioned by county-level administrative district when the database is built, in order to improve data access performance and statistical analysis efficiency. That is, such a layer is still logically stored and managed as one data layer in the database; adjacent element objects with identical attributes must be physically merged within a county-level administrative district, whereas the same object element adjacent across different administrative districts is only logically connected and remains physically separate. The design of the vector data for the road network and the water system network, and of table data such as the social and economic statistical data and the network turn table, must be followed in both the national database and the provincial databases; the design of the remaining data may be applied selectively in the provincial databases.

The design of the database logic needs to add the following contents to the original data:

Entity data objectification coding: to facilitate objectified query and statistical analysis of the general survey data, a unique element identifier (OBJECT field) is added to each element of each vector element layer; entities such as road elements and water area elements are objectified and given objectification codes (entity name field), and a county/city (ID) attribute field is added to identify precisely the administrative jurisdiction to which each road element and water area element belongs.

Road and water system network: on the basis of general survey result data of roads, water areas, structures and the like, a road and water system network is constructed according to a road and water system network data model. The railway data and the water area data form a railway network and a water system network after the network is constructed. The road network data is formed by combining roads, urban roads and rural roads.

(1) Traffic line element logic design

For simplicity of design, only the traffic line layer is kept in the database:

TABLE 3 Traffic line layer table

Field name	Data type	Length	Nullable	Remarks
OBJECTID	Object ID	-	-	-
Shape	Geometry	-	Yes	-
Classification code	Long	-	Yes	-
Classification name	String	60	Yes	-
Entity name	String	60	Yes	-
Road code	String	10	Yes	-
Main zone	String	10	Yes	-
Sub-zone	String	10	Yes	-
Shape_Length	Double	-	Yes	-
County (city district)	String	-	Yes	-
County (city district) ID	String	-	Yes	-
Shape Type	String	-	Yes	The attribute value is line

(2) Boundary line logic design

The boundary line layers are data layers marking administrative regions, such as national administrative regions, provincial administrative regions, special administrative regions, prefecture-level administrative regions, county-level administrative regions, township-level administrative regions, administrative boundary lines at all levels, administrative villages, central urban areas of cities and other special administrative regions. Only county-level administrative boundaries are used here.

TABLE 4 Boundary line layer table

Field name	Data type	Length	Nullable	Remarks
OBJECTID	Object ID	-	-	-
SHAPE	Geometry	-	Yes	-
Classification code	Long	-	Yes	-
Classification name	String	60	Yes	-
Main zone	String	10	Yes	-
Sub-zone	String	10	Yes	-
SHAPE_Length	Double	-	Yes	-
County (city district)	String	-	Yes	-
County (city district) ID	String	-	Yes	-
Shape Type	String	-	Yes	The attribute value is line

(3) Water line logic design

The water system line layer comprises the fields OBJECTID, SHAPE, classification code, classification name, main zone, sub-zone, SHAPE_Length, county (city district) ID and Shape Type. The logic design of the water system line is shown in Table 5 below:

TABLE 5 Water system line layer table

(4) Social statistics

Social statistics are usually stored in the form of tables, with fields recording data such as the regional gross production value, the regional secondary industry output value, the regional year-end registered household population, the regional fixed telephone users and the regional total fixed asset investment; the spatial data is the geographic unit corresponding to these data. The table is compiled with the county level as the statistical unit. The main logic design of the social statistics data is shown in Table 6 below:

TABLE 6 Social statistics data layer table

Field name	Data type	Length	Nullable	Remarks
Regional gross production value (10,000 yuan)	Long	-	Yes	-
Regional secondary industry output value (10,000 yuan)	Long	-	Yes	-
Regional year-end registered household population (thousands of people)	Long	-	Yes	-
Regional fixed telephone users (households)	Long	-	Yes	-
Regional total fixed asset investment (10,000 yuan)	Long	-	Yes	-
County (city district)	String	-	Yes	-
County (city district) ID	String	-	Yes	-
Shape Type	String	-	Yes	The attribute value is face (polygon)
FAI	Long	-	Yes	-
DEM	Long	-	Yes	-
Vegetation index	Long	-	Yes	-
AQI	Long	-	Yes	-

TABLE 7 associated data

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims

1. An automatic thematic knowledge mining system, characterized in that the automatic thematic knowledge mining system comprises:

the data mining module comprises: the FP-tree correlation analysis module and the random forest classification module; the FP-tree correlation analysis module is used for reversely mining an index with high correlation degree with thematic knowledge;

the random forest classification module is used for selecting data of the indexes and training a model by adopting a random forest method to finish random forest classification; the random forest classification module is a combined classifier comprising a plurality of non-pruning classification regression trees; the combined classifier is formed by introducing independent and identically distributed random variables, generating decision trees by using training set data and the random variables and finally combining all the decision trees by using an integrated learning idea;

2. The automatic thematic knowledge mining system according to claim 1, wherein the internet of things interface module comprises:

3. The automatic thematic knowledge mining system according to claim 1, wherein the semantic query module selects an ontology file from a local file, generates an ontology model, selects a category and a term of the term after the ontology model is generated, and displays a semantic query result;

4. The automatic thematic knowledge mining system of claim 1, wherein the map aggregation and visualization module represents certain subject content attribute data on a map by using color rendering, pattern filling, histogram or pie chart forms; corresponding achievements are visually displayed to the user by utilizing the visual effect of the thematic map;

5. The automatic thematic knowledge mining system according to claim 2, wherein the data acquisition and storage module comprises a network API module and a GPRS wireless transmission module; the network API module is responsible for obtaining internet data by sending a request to a data providing website and parsing and storing the internet data; meteorological monitoring data is acquired through a network API; the client sends a request to the server, the server calls an API (application programming interface) to acquire data and then sends JSON (JavaScript Object Notation) data to the client, and the client parses the received JSON data and then stores it in a local MySQL database;

6. The automatic thematic knowledge mining method of the automatic thematic knowledge mining system according to claim 1, wherein the automatic thematic knowledge mining method comprises:

7. The automatic thematic knowledge mining method according to claim 6, wherein the automatic thematic knowledge mining method specifically comprises:

1) network-based data acquisition:

obtaining data from the internet through an API;

2) carrying out ontology construction and semantic query:

5) generating a thematic map: the method comprises the following steps:

acquiring a data source of a thematic map:

8. The automatic thematic knowledge mining method according to claim 7, wherein,

the random forest classification specifically comprises:

secondly, respectively establishing a classification regression tree for each sample data set to generate K decision trees, randomly sampling the original data variable set to obtain a variable subset for each node of the decision tree in the generation process, and selecting the optimal variable from the subsets according to the Gini index minimum criterion to carry out node splitting and branching;