CN117314006A

CN117314006A - Intelligent data analysis method and system

Info

Publication number: CN117314006A
Application number: CN202311297297.8A
Authority: CN
Inventors: 惠青; 陈松; 殷承虹; 王纪忠; 王栖; 潘珠; 吕业清
Original assignee: HAINAN COLLEGE OF ECONOMICS AND BUSINESS
Current assignee: HAINAN COLLEGE OF ECONOMICS AND BUSINESS
Priority date: 2023-10-09
Filing date: 2023-10-09
Publication date: 2023-12-29

Abstract

The invention provides an intelligent data analysis method and system in the technical field of data analysis. First, urban and rural data are acquired from multiple sources and preprocessed to obtain preprocessed data. Second, the preprocessed data are analyzed to obtain an analysis result. Future urban and rural data are then predicted to obtain a prediction result, and finally the prediction result is visualized together with interpretability information. The method integrates multi-source urban and rural data and applies spatial association analysis, spatial cluster analysis, spatial classification analysis and similar techniques to obtain multidimensional indexes and relations of the data. It processes and uses urban and rural data effectively, provides high-quality, efficient analysis and prediction results that can be visualized and explained, gives government decision makers a deeper understanding, and helps narrow the urban-rural development gap.

Description

Intelligent data analysis method and system

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to an intelligent data analysis method and system.

Background

Urban and rural data are data that reflect the population, economy, society, environment and other aspects of urban and rural areas, such as population density, urbanization rate, per-capita income, consumption level and environmental quality. Analyzing and predicting these data is important for understanding the differences and relations between urban and rural areas, for making reasonable development policies and plans, and for promoting coordinated urban-rural development. Imbalanced urban-rural development is a global problem that strongly affects government decisions, urban planning and rural development, so an intelligent method for analyzing urban and rural data is of practical importance. Traditional analysis methods generally rely on manual processing and statistical analysis; they struggle with the large variety of urban and rural data sources and cannot provide reliable predictions or in-depth interpretation. An approach is therefore needed that integrates multi-source urban and rural data and performs preprocessing, analysis, prediction and visualization, so that governments and decision makers can make more targeted decisions.

Disclosure of Invention

To address the above technical problems, the intelligent data analysis method and system provided by the invention integrate multi-source urban and rural data and apply spatial association analysis, spatial cluster analysis, spatial classification analysis and similar techniques, so that multidimensional indexes and relations of the urban and rural data can be obtained. This gives government decision makers a deeper understanding and helps narrow the urban-rural development gap.

The invention provides an intelligent data analysis method, which comprises the following steps:

step S1: acquiring urban and rural data of various sources;

step S2: preprocessing the urban and rural data to obtain preprocessed data;

step S3: carrying out data analysis on the preprocessed data to obtain an analysis result;

step S4: predicting future urban and rural data to obtain a prediction result;

step S5: visualizing the prediction result and providing interpretability information.

Optionally, the preprocessing operation is performed on the urban and rural data to obtain preprocessed data, which specifically includes:

sequentially performing data cleaning, data compression, data encoding and data decoding on the urban and rural data to obtain the preprocessed data.

Optionally, the data analysis is performed on the preprocessed data to obtain an analysis result, which specifically includes:

carrying out spatial association analysis on the preprocessed data to obtain the urban-rural spatial correlation coefficient, which is calculated as follows:

wherein URSC is the urban-rural spatial correlation coefficient; $n$ is the number of data items; $w_{ij}$, the element in row $i$ and column $j$ of the spatial weight matrix $W$, represents the strength of the spatial relationship between the $i$-th and $j$-th data items; $x_i$ is the urban attribute value of the $i$-th data item; $y_i$ is the rural attribute value of the $i$-th data item; $\bar{x}$ is the mean of the urban attribute values of all data; and $\bar{y}$ is the mean of the rural attribute values of all data;

in the formula, UPDC is the urban-rural population difference coefficient, $P_u$ is the urban population, $P_r$ is the rural population, $S_u$ is the urban population structure index, and $S_r$ is the rural population structure index. The urban and rural population structure indexes are calculated from factors such as the age, gender and education of the population, using the following formula:

wherein $S$ is the population structure index, $N$ is the number of population structure factors, and $\omega_I$ is the value of the $I$-th factor. The UPDC reflects the degree of difference between the urban and rural populations and takes values in $[0, 1]$: a value of 0 means the urban and rural populations are identical, and a value of 1 means they are completely different.

In the formula, UEGC is the urban-rural economic gap coefficient, $Y_u$ is the urban per-capita GDP, $Y_r$ is the rural per-capita GDP, $G_u$ is the urban GDP growth rate, $G_r$ is the rural GDP growth rate, and $\alpha$ is a tuning parameter that controls the sensitivity to the size of the gap.

In the formula, UEDC is the urban-rural environment difference coefficient, $Q_u$ is the urban environmental quality index, $Q_r$ is the rural environmental quality index, $R_u$ is the urban resource utilization rate, $R_r$ is the rural resource utilization rate, and $E_u$ and $E_r$ are the corresponding urban and rural terms; the coefficient reflects the degree of difference between the urban and rural environments;

performing spatial clustering analysis on the preprocessed data to obtain a clustering result, wherein the method specifically comprises the following steps:

constructing an undirected weighted graph $G = (V_G, E_G, W_G)$, wherein $V_G$ is the vertex set, representing all urban and rural data points; $E_G$ is the edge set, representing the connections among the data points; and $W_G$ is the weight matrix, representing the weights of all edges;

calculating the degree of each vertex, and sorting the vertices in descending order according to the degree;

traversing the vertices starting from the one with the largest degree: taking the current vertex as the center of a new class or cluster, assigning to that cluster every vertex that is connected to it and not yet assigned, and updating the cluster center to the mean of the coordinates of all its vertices;

repeating the traversal until all vertices have been assigned or the preset number of classes or clusters has been reached;
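The clustering steps above can be sketched as follows. This is only an illustrative reading of the description, not the patented implementation: it assumes the degree of a vertex is the sum of the weights of its incident edges, and the function name `degree_ordered_clusters` and the sample points are hypothetical.

```python
from collections import defaultdict


def degree_ordered_clusters(points, edges, max_clusters=None):
    """Greedy clustering on an undirected weighted graph.

    points: dict mapping vertex name -> coordinate tuple
    edges:  iterable of (u, v, weight) tuples
    Assumption: "degree" means weighted degree (sum of incident edge weights).
    """
    neighbours = defaultdict(set)
    degree = defaultdict(float)
    for u, v, w in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
        degree[u] += w
        degree[v] += w

    # Visit vertices in descending order of degree (ties keep insertion order).
    order = sorted(points, key=lambda v: degree[v], reverse=True)
    assigned = set()
    clusters = []
    for v in order:
        if v in assigned:
            continue
        if max_clusters is not None and len(clusters) >= max_clusters:
            break
        # The vertex seeds a new cluster and absorbs its unassigned neighbours.
        members = [v] + sorted(n for n in neighbours[v] if n not in assigned)
        assigned.update(members)
        dims = len(points[v])
        centre = tuple(sum(points[m][k] for m in members) / len(members)
                       for k in range(dims))
        clusters.append({"members": members, "centre": centre})
    return clusters


if __name__ == "__main__":
    sample_points = {"a": (0, 0), "b": (0, 2), "c": (2, 0), "d": (10, 10), "e": (10, 12)}
    sample_edges = [("a", "b", 1.0), ("a", "c", 1.0), ("d", "e", 1.0)]
    for cluster in degree_ordered_clusters(sample_points, sample_edges):
        print(cluster)
```

Vertices left unassigned when the preset cluster count is reached are simply not returned; a real system would need a rule for them, which the description does not specify.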

performing spatial classification analysis on the preprocessed data to obtain classification results, wherein the method specifically comprises the following steps:

dividing the data set into a training set and a testing set, and carrying out normalization processing;

constructing a multi-layer perceptron neural network model, and initializing the weight of the neural network;

inputting the training set into the neural network, calculating the output of the network by forward propagation and comparing it with the true class or label, propagating the resulting error backward, updating the weights according to the error, and repeating this process until the error is minimized or the maximum number of iterations is reached;

inputting the test set into the neural network, calculating the output of the network by forward propagation, comparing it with the true class or label, and calculating the performance indexes.
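A minimal sketch of the classification procedure described above, written in plain Python so that each step is visible. The network size, sigmoid activations, squared-error loss, learning rate and the toy urban/rural samples are all assumptions made for illustration; the source does not specify them.

```python
import math
import random


def fit_minmax(rows):
    """Column-wise minimum and maximum, learned from the training set only."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_minmax(rows, lo, hi):
    """Scale every column with the training-set range (normalization step)."""
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid activations, squared-error loss (assumed)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)  # weight initialization
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train(self, xs, ys, lr=0.5, max_iter=500, tol=1e-3):
        """Per-sample gradient descent; stops on small error or max_iter."""
        history = []
        for _ in range(max_iter):
            total = 0.0
            for x, y in zip(xs, ys):
                hidden, out = self.forward(x)          # forward propagation
                total += (out - y) ** 2
                # Backward propagation (factor 2 of the gradient folded into lr).
                d_out = (out - y) * out * (1.0 - out)
                d_hidden = [d_out * w * h * (1.0 - h) for w, h in zip(self.w2, hidden)]
                for j, h in enumerate(hidden):
                    self.w2[j] -= lr * d_out * h
                    self.b1[j] -= lr * d_hidden[j]
                    for i, v in enumerate(x):
                        self.w1[j][i] -= lr * d_hidden[j] * v
                self.b2 -= lr * d_out
            history.append(total / len(xs))
            if history[-1] < tol:
                break
        return history

    def predict(self, xs):
        return [1 if self.forward(x)[1] >= 0.5 else 0 for x in xs]


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical samples: (population density, per-capita income); 1 = urban, 0 = rural.
    train_x = [[5200, 48], [4800, 52], [5100, 45], [300, 12], [250, 15], [400, 10]]
    train_y = [1, 1, 1, 0, 0, 0]
    test_x, test_y = [[4900, 50], [350, 11]], [1, 0]
    lo, hi = fit_minmax(train_x)
    model = TinyMLP(n_in=2, n_hidden=3)
    model.train(apply_minmax(train_x, lo, hi), train_y)
    print("test accuracy:", accuracy(model.predict(apply_minmax(test_x, lo, hi)), test_y))
```

Accuracy is used here as the performance index; other indexes such as precision or recall could be computed from the same predictions.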

Optionally, the predicting the future urban and rural data to obtain a prediction result specifically includes:

predicting urban population growth, wherein the prediction formula is as follows:

$$y_t = f(x_t, z_t, \theta) + \epsilon_t$$

wherein $y_t$ is the urban population, $x_t$ is the time variable and other influencing factors, $z_t$ is the urban-rural population difference coefficient, $\theta$ is the model parameter, and $\epsilon_t$ is a random error term;
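The source leaves the function $f$ unspecified. As one possible instance, the sketch below assumes $f$ is linear, $f(x_t, z_t, \theta) = \theta_0 + \theta_1 x_t + \theta_2 z_t$, and estimates $\theta$ by ordinary least squares. The helper names and the sample numbers are hypothetical and serve only to show how such a model could be fitted and used for prediction.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(xs, zs, ys):
    """Least-squares estimate of theta for y = theta0 + theta1 * x + theta2 * z."""
    rows = [[1.0, x, z] for x, z in zip(xs, zs)]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(ata, aty)


def predict(theta, x, z):
    return theta[0] + theta[1] * x + theta[2] * z


if __name__ == "__main__":
    # Hypothetical history: year index, population difference coefficient, urban population (millions).
    years = [0, 1, 2, 3, 4]
    updc = [0.50, 0.40, 0.45, 0.30, 0.35]
    population = [5.0, 5.3, 5.5, 5.9, 6.1]
    theta = fit_linear(years, updc, population)
    print("forecast for year 5:", predict(theta, 5, 0.32))
```

A nonlinear $f$ (for example a neural network) would replace `fit_linear` and `predict` while keeping the same inputs and output.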

The rural yield change is predicted, and the prediction formula is as follows:

wherein $y_p$ is the yield of the $p$-th crop in the rural area, $x_p$ is the time variable and other influencing factors, $w_p$ is the urban-rural economic gap coefficient, the remaining symbol in the formula is the model parameter, and $\eta_p$ is a random error term;

the social economic index is predicted, and the prediction formula is as follows:

wherein $Y$ is the output variable vector, representing the values of the urban and rural socioeconomic indexes; $X$ is the input variable vector, representing the values of the urban and rural influencing factors; and $Z$ is the urban-rural spatial correlation coefficient;

The price of the agricultural product is predicted, and the prediction formula is as follows:

wherein $I_T$ is an $m \times 1$ vector representing the prices of $m$ agricultural products, $A_0$ is an $m \times 1$ constant vector, $B$ is an $m \times 1$ coefficient vector, $v_T$ is the urban-rural environment difference coefficient, $A_K$ is an $m \times m$ coefficient matrix, and $u_T$ is an $m \times 1$ white-noise vector;
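The symbols defined above match the structure of a vector autoregression with one exogenous variable. Under that assumption, and with a lag order $p$ that is not stated in the source, one form consistent with the definitions would be:

```latex
I_T = A_0 + B\,v_T + \sum_{K=1}^{p} A_K\, I_{T-K} + u_T
```

Here each $A_K$ weights the price vector observed $K$ periods earlier, and $B\,v_T$ adds the effect of the urban-rural environment difference coefficient in the current period. This is an illustrative reading only; the exact formula of the source is not recoverable from the text.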

the medical demand is predicted, and the prediction formula is as follows:

$$R_{cd} = f(R, S, \Psi)$$

wherein $R_{cd}$ is the predicted rating of the $d$-th medical item for the $c$-th user, $R$ is the rating matrix, $S$ is a feature matrix comprising the urban-rural economic gap coefficient, the urban-rural environment difference coefficient, the urban-rural spatial correlation coefficient, the urban-rural population difference coefficient and the like, and $\Psi$ is the model parameter.

Optionally, the visualizing the prediction result provides explanatory information, which specifically includes:

Selecting a visual data type according to the prediction result to obtain a visual result;

the interpretation information is integrated in the visual results, providing an interpretation of the predicted results.

The invention also provides an intelligent data analysis system, which comprises:

the urban and rural data acquisition module is used for acquiring urban and rural data of various sources;

the preprocessing operation module is used for preprocessing the urban and rural data to obtain preprocessed data;

the data analysis module is used for carrying out data analysis on the preprocessed data to obtain an analysis result;

the urban and rural data prediction module is used for predicting future urban and rural data to obtain a prediction result;

and the result visualization interpretation module is used for visualizing the prediction result and providing interpretability information.

Optionally, the preprocessing operation module specifically includes:

the data layer-by-layer processing sub-module is used for sequentially performing data cleaning, data compression, data encoding and data decoding on the urban and rural data to obtain the preprocessed data.

Optionally, the data analysis module specifically includes:

wherein URSC is the urban-rural spatial correlation coefficient; $n$ is the number of data items; $w_{ij}$, the element in row $i$ and column $j$ of the spatial weight matrix $W$, represents the strength of the spatial relationship between the $i$-th and $j$-th data items; $x_i$ is the urban attribute value of the $i$-th data item; $y_i$ is the rural attribute value of the $i$-th data item; $\bar{x}$ is the mean of the urban attribute values of all data; and $\bar{y}$ is the mean of the rural attribute values of all data;

In the formula, UEDC is the urban-rural environment difference coefficient, $Q_u$ is the urban environmental quality index, $Q_r$ is the rural environmental quality index, $R_u$ is the urban resource utilization rate, $R_r$ is the rural resource utilization rate, and $E_u$ and $E_r$ are the corresponding urban and rural terms; the coefficient reflects the degree of difference between the urban and rural environments;

Optionally, the urban and rural data prediction module specifically includes:

$$y_t = f(x_t, z_t, \theta) + \epsilon_t$$

the rural yield change is predicted, and the prediction formula is as follows:

wherein $I_T$ is an $m \times 1$ vector representing the prices of $m$ agricultural products, $A_0$ is an $m \times 1$ constant vector, $B$ is an $m \times 1$ coefficient vector, $v_T$ is the urban-rural environment difference coefficient, $A_K$ is an $m \times m$ coefficient matrix, and $u_T$ is an $m \times 1$ white-noise vector;

the medical demand is predicted, and the prediction formula is as follows:

$$R_{cd} = f(R, S, \Psi)$$

Optionally, the result visual interpretation module specifically includes:

the visualization type analysis sub-module is used for selecting a visualization data type according to the prediction result to obtain a visualization result;

And the information interpretation sub-module is used for integrating interpretation information in the visual result and providing interpretation about the prediction result.

Compared with the prior art, the invention has the following beneficial effects:

the invention automatically identifies and corrects erroneous, missing or abnormal values in the data, which ensures high-quality input data and improves data reliability; by applying spatial association analysis, spatial cluster analysis, spatial classification analysis and similar techniques, it obtains multidimensional indexes and relations of urban and rural data and gives government decision makers a deeper understanding; by predicting urban population growth, rural yield change, socioeconomic indexes, agricultural product prices, medical demand and the like, it helps narrow the urban-rural development gap; and by applying visualization technology, it displays the analysis and prediction results intuitively and attractively together with explanatory information, so that users can easily understand and use them.

Drawings

FIG. 1 is a flow chart of an intelligent data analysis method of the present invention;

FIG. 2 is a block diagram of an intelligent data analysis system according to the present invention.

Detailed Description

The invention is further described below in connection with specific embodiments and the accompanying drawings, but the invention is not limited to these embodiments.

Example 1

As shown in FIG. 1, the invention discloses an intelligent data analysis method, which comprises the following steps:

Step S1: acquiring urban and rural data from multiple sources.

Step S2: preprocessing the urban and rural data to obtain preprocessed data.

Step S3: carrying out data analysis on the preprocessed data to obtain an analysis result.

Step S4: predicting future urban and rural data to obtain a prediction result.

Step S5: visualizing the prediction result and providing interpretability information.

The steps are discussed in detail below:

Step S1: acquiring urban and rural data from multiple sources.

The step S1 specifically comprises the following:

Urban and rural data are acquired in several ways, as follows:

A web crawler is a program that automatically collects data from the internet. It obtains urban and rural data such as demographics, economic indicators, social events and polls from websites such as government portals, news media, social platforms, forums and blogs. The crawler must comply with each website's crawler protocol (for example robots.txt) so that it neither places an excessive burden on the site nor violates privacy.
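As an illustration of complying with a site's crawler protocol, the sketch below uses Python's standard `urllib.robotparser` to filter a list of candidate URLs before any request is made. The robots.txt content, the user agent `urban-rural-bot` and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a statistics website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def crawl_plan(robots_txt, user_agent, urls):
    """Return the URLs the crawler may fetch and the requested delay in seconds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, parser.crawl_delay(user_agent)


if __name__ == "__main__":
    candidates = [
        "https://stats.example.org/population.csv",
        "https://stats.example.org/private/households.csv",
    ]
    allowed, delay = crawl_plan(ROBOTS_TXT, "urban-rural-bot", candidates)
    print(allowed, delay)
```

In a real crawler the robots.txt text would be downloaded from the site, and the crawler would wait at least the returned delay between requests to limit the load on the server.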

A sensor is a device that senses and measures physical or chemical phenomena. Sensors acquire urban and rural data such as temperature, humidity, air quality, noise and illumination from various environments. They need to be installed in suitable locations and regularly maintained and calibrated.

Satellite images are images of the earth's surface taken by artificial satellites. They provide urban and rural data such as land use, vegetation coverage, building distribution and water resources from high altitude. Satellite images need to be preprocessed and interpreted to extract useful information.

A questionnaire is a method of collecting data from a target population using pre-designed questions. It captures urban and rural data such as satisfaction, demand and preference from people's subjective feelings and opinions. Questionnaires require care in the design and selection of questions and in the validity and representativeness of the data.

These data are stored in databases, file systems, cloud platforms and the like for later use.

The step S2 specifically comprises the following steps:

Data cleaning, data compression, data encoding and data decoding are performed in sequence on the urban and rural data to obtain the preprocessed data, as follows:

Data cleaning: the acquired urban and rural data are cleaned and denoised using an autoencoder (Autoencoder) to remove erroneous, missing or abnormal values and improve data quality. An autoencoder is an unsupervised learning model based on a neural network that learns the internal structure of the data and reconstructs its input. It consists of two parts, an encoder (Encoder) and a decoder (Decoder). The encoder compresses the input data into a low-dimensional latent vector, and the decoder restores the latent vector to output data similar to the input. The autoencoder is trained so that its output is as close to its input as possible, which is how cleaning and denoising are achieved. For example, if the input is a noisy picture, the output is the picture with the noise removed.

Urban and rural data that are not images, such as tables or text, are also cleaned and denoised using an autoencoder. The network structure and loss function of the autoencoder are designed according to the type and structure of the data. For tabular data, a multi-layer perceptron (Multi-Layer Perceptron) is used as the network structure and the mean squared error (Mean Squared Error) as the loss function, so that erroneous or missing values in the table are repaired or filled in. For text data, a recurrent neural network (Recurrent Neural Network) or a Transformer is used as the network structure and cross entropy (Cross Entropy) as the loss function, so that text containing spelling or grammatical errors is corrected or regenerated.

Data compression: the acquired data are compressed and their features extracted using sparse coding (Sparse Coding), which reduces storage space and computational complexity while retaining the main information of the data. Sparse coding is an unsupervised learning model based on dictionary learning that represents high-dimensional input data as low-dimensional sparse vectors. It consists of two parts, a dictionary (Dictionary) and sparse coefficients (Sparse Coefficient). The dictionary is a matrix composed of basis elements, and the sparse coefficient is a vector with only a few non-zero elements. Sparse coding is trained so that the input data are approximated by the product of the dictionary and the sparse coefficients, which achieves compression and feature extraction. For example, if the input is a land-cover picture, the dictionary consists of sub-images of vegetation features and the sparse coefficients are the weights of those sub-images in the original image.

Urban and rural data that are not images, such as tables or text, are also compressed and have their features extracted using sparse coding. The representation of the dictionary and the sparse coefficients is chosen according to the type and structure of the data. For tabular data, matrix factorization (Matrix Factorization) or tensor factorization (Tensor Factorization) decomposes the table into a dictionary matrix and a sparse coefficient matrix or tensor, which extracts and compresses the important information and associations in the table. For text data, word embedding (Word Embedding) or a topic model (Topic Model) converts the text into a dictionary vector and a sparse vector, which extracts and compresses the semantic and topic information in the text.

Data encoding: the compressed data are encoded and indexed using spatio-temporal hashing (Spatio-Temporal Hashing), which integrates and normalizes data with different time points and spatial resolutions and improves retrieval efficiency. Spatio-temporal hashing is a coding method based on a hash function that maps continuous space-time coordinates to discrete binary codes. It consists of two parts, spatio-temporal partitioning (Spatio-Temporal Partitioning) and hash mapping (Hash Mapping). Spatio-temporal partitioning divides the space-time range into a number of spatio-temporal units (Spatio-Temporal Units), each with a unique number. Hash mapping converts each number into a fixed-length binary code. For example, if the compressed data are population density changes at different places in a city over different time periods, spatio-temporal partitioning divides the city area and the time axis into grids and intervals, each grid-interval pair corresponding to a number, and hash mapping converts each number into a binary code.
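The two parts described above, partitioning and mapping, can be sketched as follows. The sketch assumes a uniform square grid over a bounding box and equal-length time intervals, and it uses the fixed-width binary form of the unit number as the code; the class name `Partition`, its fields and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Uniform spatio-temporal partition (an assumed, simplified scheme)."""
    min_x: float
    max_x: float
    min_y: float
    max_y: float
    cells: int       # grid cells per axis
    t_start: float
    t_step: float    # length of one time interval
    slots: int       # number of time intervals

    def unit_number(self, x, y, t):
        """Unique number of the spatio-temporal unit containing (x, y, t).

        Assumes the point lies inside the bounding box and time range;
        values on the upper boundary are clamped into the last cell.
        """
        col = min(int((x - self.min_x) / (self.max_x - self.min_x) * self.cells),
                  self.cells - 1)
        row = min(int((y - self.min_y) / (self.max_y - self.min_y) * self.cells),
                  self.cells - 1)
        slot = min(int((t - self.t_start) / self.t_step), self.slots - 1)
        return (slot * self.cells + row) * self.cells + col

    def code(self, x, y, t):
        """Fixed-length binary code of the unit number."""
        total_units = self.cells * self.cells * self.slots
        bits = max(1, (total_units - 1).bit_length())
        return format(self.unit_number(x, y, t), f"0{bits}b")


if __name__ == "__main__":
    # A 10 km x 10 km city, 4 x 4 grid, four 6-hour intervals per day.
    city = Partition(0, 10, 0, 10, cells=4, t_start=0, t_step=6, slots=4)
    print(city.code(2.6, 7.5, 13))   # same unit as the next observation
    print(city.code(3.0, 9.9, 17))
    print(city.code(9.0, 1.0, 0))
```

Observations that fall into the same grid cell and time interval receive the same code, so the code can serve directly as an index key for retrieval.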

Urban and rural data that are not images, such as tables or text, are also encoded and indexed using spatio-temporal hashing. The spatio-temporal partitioning and hash mapping are defined according to the type and structure of the data. For tabular data, a method based on grid division or cluster division assigns each row or column of the table to a spatio-temporal unit and gives each unit a number; each number is then converted into a binary code using a hash function or a neural-network-based method. For text data, a method based on a time window or a semantic window assigns each word or sentence to a spatio-temporal unit and gives each unit a number; each number is likewise converted into a binary code using a hash function or a neural-network-based method.

Data decoding: the encoded data are decoded and reconstructed using a generative adversarial network (Generative Adversarial Network) to generate data with high spatio-temporal resolution and to improve the visual presentation of the data. A generative adversarial network is a deep learning model based on game theory that generates realistic data through two competing neural networks. It consists of two parts, a generator (Generator) and a discriminator (Discriminator). The generator takes random noise or encoded data as input and outputs fake data similar to the real data. The discriminator takes real or fake data as input and outputs the probability that the data are real. The network is trained so that the generator learns to deceive the discriminator while the discriminator learns to recognize the generator's output; in this way the encoded data are decoded and reconstructed into data with high spatio-temporal resolution. For example, if the encoded data are binary codes of population density changes at different places in a city over different time periods, the generator takes the binary codes as input and outputs an image resembling the real population density changes, and the discriminator takes a real or generated population density image as input and outputs the probability that the image is real.

Urban and rural data that are not images, such as tables or text, are also decoded and reconstructed using a generative adversarial network. The network structures and loss functions of the generator and the discriminator are designed according to the type and structure of the data. For tabular data, a conditional generative adversarial network (Conditional Generative Adversarial Network) takes binary codes or random noise as input and outputs fake tabular data similar to the real tabular data. The discriminator uses a multi-layer perceptron or a convolutional neural network (Convolutional Neural Network) as its network structure and cross entropy or least squares (Least Squares) as its loss function to judge whether the input tabular data are real or fake. For text data, a sequence-to-sequence (Sequence to Sequence) model or a Transformer is used as the network structure of the generator, which takes binary codes or random noise as input and outputs fake text similar to the real text. The discriminator uses a recurrent neural network or a Transformer as its network structure and cross entropy or least squares as its loss function to judge whether the input text is real or fake.

In this embodiment, the decoded data are subsequently displayed to the user in three dimensions using virtual reality (Virtual Reality), with multi-angle and multi-scale observation and interaction functions. Virtual reality is a technique that uses computer technology to simulate real or imaginary scenes and experiences so that the user is immersed in a simulated environment. It consists of three parts, the virtual scene (Virtual Scene), the virtual device (Virtual Device) and virtual interaction (Virtual Interaction). The virtual scene is composed of three-dimensional models, textures, lights and other elements and is dynamically generated and updated from the decoded data. A virtual device, such as a head-mounted display, headphones or gloves, transmits the user's visual, auditory, tactile and other sensory information to the virtual scene and returns feedback from the virtual scene to the user. Virtual interaction lets the user perform operations and controls in the virtual scene, such as moving, scaling and rotating. Displaying the decoded data in this way allows the urban and rural data to be understood intuitively and in depth.

Urban and rural data that are not images, such as tables or text, are also presented to the user in three dimensions using virtual reality, with multi-angle and multi-scale observation and interaction functions. Different virtual scenes, virtual devices and interaction modes are provided according to the type and structure of the data. For tabular data, visualization methods based on histograms, pie charts, scatter plots and the like convert the table into a three-dimensional virtual scene, and a head-mounted display, gloves and similar devices let the user observe and manipulate the data in the scene. For text data, visualization methods based on word clouds, topic maps, sentiment analysis and the like convert the text into a three-dimensional virtual scene, and a head-mounted display, headphones and similar devices let the user read and listen to the data in the scene.

The step S3 specifically comprises the following steps:

Data analysis is the core of urban and rural data analysis. Its input is the preprocessed data output by step S2, and its output is the result obtained after analysis, mining, visualization and other operations on the data. These results reflect the status, characteristics, trends and problems of urban and rural areas and provide a basis and reference for the subsequent steps.

The data analysis content is as follows:

Descriptive analysis describes the basic characteristics of urban and rural data, including distribution, central tendency, dispersion and correlation. It measures these characteristics with statistics such as the mean, median, standard deviation, variance, maximum, minimum and quartiles, and displays them with histograms, box plots, scatter plots and the like. Descriptive analysis is used to understand the age structure, income level, education level and so on of the urban and rural populations and to compare the differences between them.
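The statistics listed above can be computed with Python's standard `statistics` module, as in the following sketch. The function name `describe` and the monthly income figures are hypothetical; the standard deviation and variance are the sample (n - 1) versions.

```python
import statistics


def describe(values):
    """Summary statistics used in descriptive analysis."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "min": min(values),
        "max": max(values),
        "quartiles": (q1, q2, q3),
    }


if __name__ == "__main__":
    # Hypothetical monthly per-capita income samples.
    urban_income = [3200, 4100, 3900, 5200, 4600]
    rural_income = [1500, 1800, 1700, 2100, 1900]
    urban, rural = describe(urban_income), describe(rural_income)
    print("urban:", urban)
    print("rural:", rural)
    print("difference in mean income:", urban["mean"] - rural["mean"])
```

Comparing the two summaries side by side is the simplest way to quantify an urban-rural difference before any spatial analysis is applied.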

Correlation analysis examines the relationships among urban and rural data, both linear and nonlinear. It measures the degree of correlation between variables with correlation coefficients such as the Pearson, Spearman and Kendall coefficients, and displays the results with heat maps, regression lines and the like. Correlation analysis is used to explore the relationship between urban and rural population density and the level of economic development and to check whether a significant positive or negative correlation exists. The specific correlation coefficient is chosen according to the actual situation.
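To make the difference between a linear and a rank-based coefficient concrete, the sketch below implements the Pearson coefficient and the Spearman coefficient (Pearson applied to average ranks) in plain Python. The sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: strength of the monotonic relationship."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Hypothetical data: population density versus an economic development index.
    density = [1, 2, 3, 4, 5]
    development = [1, 8, 27, 64, 125]   # monotonic but nonlinear
    print("Pearson:", pearson(density, development))
    print("Spearman:", spearman(density, development))
```

For a monotonic but nonlinear relationship the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which is one reason to choose the coefficient according to the data at hand.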

Spatial association analysis is carried out on the preprocessed data to obtain the urban-rural spatial correlation coefficient, using the following formula:

wherein URSC is the urban-rural spatial correlation coefficient; $n$ is the number of data items; $w_{ij}$, the element in row $i$ and column $j$ of the spatial weight matrix $W$, represents the strength of the spatial relationship between the $i$-th and $j$-th data items; $x_i$ is the urban attribute value of the $i$-th data item, such as urban population density or urban economic development level; $y_i$ is the rural attribute value of the $i$-th data item, such as rural population density or rural economic development level; $\bar{x}$ is the mean of the urban attribute values of all data; and $\bar{y}$ is the mean of the rural attribute values of all data.

The urban-rural spatial correlation coefficient reflects the spatial interaction and influence between cities and rural areas. Its value range is [-1, 1]. It indicates the degree of spatial correlation between cities and rural areas, and this information is used to predict their future development trends and changes. A positive value indicates positive spatial correlation: regions with high urban attribute values are adjacent or close to regions with high rural attribute values, and vice versa. Such correlation may persist or strengthen in the future and promote coordinated urban-rural development. A negative value indicates negative spatial correlation: regions with high urban attribute values are adjacent or close to regions with low rural attribute values, and vice versa. Larger gaps and conflicts may then arise in these regions and hinder harmonious urban-rural development. A value of 0 indicates no significant spatial correlation between cities and rural areas.

To calculate the urban and rural spatial correlation coefficient, the spatial weight matrix $W$ is determined first. The spatial weight matrix describes the spatial adjacency relationship between data and is constructed by a distance-based method. Its size is $n \times n$, where $n$ is the number of data. The element $w_{ij}$ in the $i$-th row and $j$-th column of $W$ indicates the strength of the spatial relationship between the $i$-th data and the $j$-th data, that is, how adjacent or close the two are.

$W$ is a symmetric matrix, that is, $w_{ij} = w_{ji}$. The rows and columns of $W$ are regarded as the numbers or indexes of the data, not the data themselves. In other words, the $i$-th row of $W$ represents the spatial relationship strengths of all data adjacent to the $i$-th data, and the $j$-th column of $W$ represents the spatial relationship strengths of all data adjacent to the $j$-th data.
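A minimal sketch of a distance-based weight matrix follows. The text only states that the matrix is distance-based; the inverse-distance rule, the zero diagonal and the function name `inverse_distance_weights` are assumptions made for illustration.

```python
import numpy as np

def inverse_distance_weights(coords, eps=1e-12):
    """Build a symmetric n x n weight matrix with w_ij = 1 / distance(i, j).

    coords: n x d array of point coordinates (for example, city/county centroids).
    The diagonal is left at zero so that a point is not its own neighbour.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distances
    W = np.zeros_like(dist)
    mask = dist > eps                            # skip zero distances (diagonal, duplicates)
    W[mask] = 1.0 / dist[mask]
    return W
```

Because the Euclidean distance is symmetric, the resulting matrix satisfies $w_{ij} = w_{ji}$, consistent with the symmetry stated above.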

The urban and rural population difference coefficient (UPDC) measures the degree of difference between cities and rural areas in population size and structure, and the calculation formula is as follows:

in the formula, UPDC is the urban and rural population difference coefficient, $P_u$ is the urban population, $P_r$ is the rural population, $S_u$ is the urban population structure index, and $S_r$ is the rural population structure index. The urban and rural population structure indexes are calculated from factors such as population age, gender and education, and the calculation formula is as follows:

The urban and rural economic gap coefficient (UEGC) measures the degree of the gap between cities and rural areas in economic level and development speed, and the calculation formula is as follows:

in the formula, UEGC is the urban and rural economic gap coefficient, $Y_u$ is the urban per capita GDP, $Y_r$ is the rural per capita GDP, $G_u$ is the urban GDP growth rate, $G_r$ is the rural GDP growth rate, and $\alpha$ is an adjustment parameter that controls the sensitivity to the degree of the gap. When $\alpha > 1$, the coefficient is more sensitive to larger gaps; when $\alpha < 1$, it is more sensitive to smaller gaps; when $\alpha = 1$, it is equally sensitive to all gaps. The coefficient reflects the degree of difference between urban and rural economies, and its value range is [0, 1]. When the value is 0, the urban and rural economies are completely the same; when the value is 1, the urban and rural economies are completely different.

The urban and rural environmental difference coefficient (UEDC) measures the degree of difference between cities and rural areas in environmental quality and resource utilization, and the calculation formula is as follows:

in the formula, UEDC is the urban and rural environmental difference coefficient, $Q_u$ is the urban environmental quality index, $Q_r$ is the rural environmental quality index, $R_u$ is the urban resource utilization rate, $R_r$ is the rural resource utilization rate, $E_u$ is the urban environmental protection investment, and $E_r$ is the rural environmental protection investment. The coefficient reflects the degree of difference between urban and rural environments, and its value range is [0, 1]. When the value is 0, the urban and rural environments are completely the same; when the value is 1, the urban and rural environments are completely different. The coefficient also takes into account the investment of cities and rural areas in environmental protection: if the investments of the two are similar, the two share common responsibilities and goals on environmental issues, which reduces the degree of difference; if the investments of the two differ greatly, the two attach different importance to environmental issues and have different demands, which increases the degree of difference.

Cluster analysis classifies and groups urban and rural data, dividing the data into several categories or clusters according to the similarity or distance between the data. Cluster analysis uses clustering algorithms such as the K-means algorithm, hierarchical clustering and density-based clustering to cluster the data, and displays the results visually with dendrograms, radar charts and the like. Urban and rural areas are divided and evaluated by cluster analysis, and different development strategies and suggestions are provided according to the clustering results.

Spatial cluster analysis is performed on the preprocessed data to obtain a clustering result, specifically as follows:

In this embodiment, a graph theory-based urban and rural clustering algorithm (GBURC) is adopted, which performs spatial cluster analysis on urban and rural data using the concepts and methods of graph theory. Graph theory is the branch of mathematics that studies the structure and properties of graphs; a graph is an abstract structure of vertices and edges that can represent relationships between data. Spatial cluster analysis divides data into different categories or clusters according to the spatial similarity or distance between the data, and can be used to discover the spatial distribution characteristics and patterns of the data.

The basic idea of the GBURC algorithm is to treat urban and rural data as the vertices of a graph, the spatial relationships between urban and rural data as the edges, and the spatial relationship strengths as the edge weights. The center of a category or cluster is selected according to the degree of a vertex (the number of edges connected to the vertex), and the vertices adjacent or close to the center are assigned to the same category or cluster, thereby realizing spatial clustering of urban and rural data.

The specific steps of the GBURC algorithm are as follows:

First, an undirected weighted graph $G = (V_G, E_G, W_G)$ is constructed from the urban and rural data of Hainan province, such as population, economic, social and environmental indexes, wherein $V_G$ is the vertex set, representing all urban and rural data points, namely all cities and counties; $E_G$ is the edge set, representing the connection relationships between all urban and rural data points, namely between all cities and counties; and $W_G$ is the weight matrix, representing the weights of all edges. The weight matrix may be calculated from the spatial distance or similarity between urban and rural data points, such as the Euclidean distance or cosine similarity. In general, the smaller the spatial distance, or the greater the similarity, the greater the weight, indicating that the two urban and rural data points are adjacent or close.

Second, the degree of each vertex, that is, the number of edges connected to the vertex, is calculated, and the vertices are sorted in descending order of degree. In general, a larger degree indicates that the vertex is adjacent or close to more vertices and is therefore more suitable as the center of a category or cluster.

Third, the vertices are traversed starting from the vertex with the largest degree. The vertex is taken as the center of a new category or cluster, the vertices that are connected to it and not yet assigned are assigned to that category or cluster, and the center of the category or cluster is updated to the mean of the coordinates of all its vertices. In this way, urban and rural data points within each category or cluster have high spatial similarity or proximity, while different categories or clusters have low spatial similarity or proximity.

Fourth, the third step is repeated until all vertices have been assigned or the preset number of categories or clusters is reached. In this way, the most suitable urban and rural data classification scheme can be determined according to the actual situation.
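The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the definitive GBURC implementation: an edge is assumed to exist wherever the weight reaches a hypothetical `threshold`, ties in degree are broken by vertex index, and the function name `gburc` and its parameters are invented for the example.

```python
import numpy as np

def gburc(coords, W, threshold, max_clusters=None):
    """Degree-ordered greedy clustering on an undirected weighted graph.

    coords: n x d vertex coordinates; W: n x n symmetric weight matrix.
    Returns (labels, centers); vertices left unassigned keep the label -1.
    """
    coords = np.asarray(coords, dtype=float)
    W = np.asarray(W, dtype=float)
    n = len(coords)
    # Step 1: edges of the graph (assumption: weight >= threshold means connected).
    adj = W >= threshold
    np.fill_diagonal(adj, False)
    # Step 2: vertex degrees, sorted in descending order.
    degree = adj.sum(axis=1)
    order = np.argsort(-degree, kind="stable")
    labels = np.full(n, -1)
    centers = []
    # Steps 3 and 4: traverse by degree until every vertex is assigned
    # or the preset number of clusters is reached.
    for v in order:
        if labels[v] != -1:
            continue
        if max_clusters is not None and len(centers) >= max_clusters:
            break
        neighbours = [u for u in np.flatnonzero(adj[v]) if labels[u] == -1]
        members = [v] + neighbours
        labels[members] = len(centers)
        # The cluster center is the mean of the member coordinates.
        centers.append(coords[members].mean(axis=0))
    return labels, np.array(centers)
```

In this sketch `max_clusters` plays the role of the preset number of categories or clusters; when it is reached, any remaining vertices stay unassigned, and how to handle them is left to the actual situation.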

In this embodiment, urban and rural areas in Hainan province are divided and evaluated, and different development strategies and suggestions are provided according to different clustering results.

Classification analysis judges urban and rural data and assigns the data to predefined categories or labels according to the features or attributes of the data. Classification analysis uses classification algorithms such as decision trees, support vector machines and naive Bayes to classify the data, and displays the results visually with confusion matrices, ROC curves and the like.

In this embodiment, a neural network-based urban and rural classification algorithm (Neural Network-Based Urban-Rural Classification Algorithm, NNURC) is adopted, and the steps are as follows:

First, the urban and rural data of Hainan province are acquired. Some of the data can be found through an internet search, and more detailed and accurate data can be requested from government departments and statistical institutions.

Second, suitable features or attributes are selected to represent the urban and rural data of Hainan province. The features or attributes are selected according to the problem and goal of the analysis, such as population size, population density, population structure, urban and rural proportion, economic development level, land use type and traffic conditions.

Third, the urban and rural data of Hainan province are divided into a training set and a test set, and the data are normalized or standardized to eliminate differences in dimension and range. Common data division and processing methods are used, such as random division, min-max normalization and standard deviation standardization.

Fourth, a multi-layer perceptron (MLP) neural network model is constructed, wherein the number of input layer nodes equals the number of features or attributes of the data, the number of output layer nodes equals the number of categories or labels of the data, and the number of hidden layers and hidden layer nodes can be determined according to the complexity and scale of the data. The weights and biases of the neural network are initialized to random values, and the Keras framework is used as the implementation basis.

Fifth, each data item in the training set is input into the neural network, the output of the neural network is calculated by forward propagation and compared with the true category or label to obtain the error, the error is propagated backward through the network, and the weights and biases are updated according to the error. This process is repeated until the error reaches a minimum or the maximum number of iterations is reached. The existing urban and rural data of Hainan province are used to train and tune the model and to find the optimal parameters and hyperparameters, so that the model fits the characteristics and patterns of the data as closely as possible. The model is evaluated and validated with evaluation criteria such as the mean square error and mean absolute error. The generalization ability and stability of the model are checked against the evaluation results to avoid overfitting or underfitting.

Sixth, each data item in the test set is input into the neural network, the output is calculated by forward propagation and compared with the true category or label, and performance indexes of the neural network such as accuracy, recall and F1 score are calculated. Common neural network evaluation and visualization methods are used, such as the confusion matrix, ROC curve and AUC value.
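Steps three to six can be sketched end to end as below. The embodiment names Keras as the implementation basis; this example instead uses plain NumPy so that forward and backward propagation are visible, and it is only a sketch under assumptions: one tanh hidden layer, a softmax output, cross-entropy loss and full-batch gradient descent. All function names and default values are hypothetical.

```python
import numpy as np

def split_and_scale(X, y, test_ratio=0.25, seed=0):
    """Step 3: random split, then min-max scaling fitted on the training set only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_ratio))
    te, tr = idx[:n_test], idx[n_test:]
    lo = X[tr].min(axis=0)
    span = X[tr].max(axis=0) - lo
    span[span == 0] = 1.0                        # guard against constant features
    return (X[tr] - lo) / span, y[tr], (X[te] - lo) / span, y[te]

def forward(params, X):
    """Forward propagation: tanh hidden layer followed by a softmax output."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    logits = H @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return H, P / P.sum(axis=1, keepdims=True)

def train_mlp(X, y, n_classes, hidden=8, lr=0.5, epochs=500, seed=0):
    """Steps 4 and 5: random initialisation, then gradient descent with backpropagation."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 0.5, (X.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0, 0.5, (hidden, n_classes)), np.zeros(n_classes)]
    Y = np.eye(n_classes)[y]                     # one-hot encoded labels
    for _ in range(epochs):
        W2 = params[2]
        H, P = forward(params, X)
        d_logits = (P - Y) / len(X)              # gradient of the mean cross-entropy
        d_H = (d_logits @ W2.T) * (1.0 - H ** 2) # propagate the error back through tanh
        grads = [X.T @ d_H, d_H.sum(axis=0), H.T @ d_logits, d_logits.sum(axis=0)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

def evaluate(params, X, y, positive=1):
    """Step 6: accuracy, recall and F1 with one class treated as positive."""
    pred = forward(params, X)[1].argmax(axis=1)
    tp = int(np.sum((pred == positive) & (y == positive)))
    fp = int(np.sum((pred == positive) & (y != positive)))
    fn = int(np.sum((pred != positive) & (y == positive)))
    accuracy = float(np.mean(pred == y))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1
```

The scaling parameters are taken from the training set and reused on the test set, so that no information from the test set influences training.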

In this embodiment, the trained model can be used to predict future urban and rural data of Hainan province and make recommendations, giving the corresponding confidence interval and error range. Uncertainty and risk factors also need to be considered to avoid overly optimistic or pessimistic results. The prediction results can provide reference and basis for urban and rural planning and management in Hainan province, or suggestions for improvement and optimization. The model needs to be updated and improved with newly collected urban and rural data of Hainan province so that it adapts to data changes and a dynamic environment. Factors such as the extent of the data update, weight distribution and learning rate adjustment are considered when updating the model. The model may be inspected and evaluated periodically to ensure its validity and reliability.

Regression analysis can be used to model and predict the relationship between the urban and rural population growth rate and the GDP growth rate, and the trend and potential of urban and rural development are evaluated according to the regression results. Anomaly detection can also be used to identify and handle abnormal values or outliers in urban and rural data, finding data that differ markedly or deviate strongly from normal data, and analyzing their causes and influence.
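As a minimal sketch of the regression and anomaly detection described above, the example below assumes a simple linear relationship and a z-score rule on the regression residuals. Both choices are illustrative and are not specified in this embodiment; the function names and the threshold of 2.0 are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def flag_outliers(x, y, slope, intercept, z_thresh=2.0):
    """Flag points whose residual z-score exceeds z_thresh in absolute value.

    Assumes the residuals are not all identical (non-zero standard deviation).
    """
    resid = y - (slope * x + intercept)
    z = (resid - resid.mean()) / resid.std()
    return np.abs(z) > z_thresh
```

Here `x` could hold, for example, population growth rates and `y` GDP growth rates; flagged points are candidates for the cause and influence analysis mentioned above rather than automatic deletions.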

The step S4 specifically comprises the following steps:

Data prediction is an extension of urban and rural data analysis. Its input is the output of the data analysis, that is, the analysis result. Its output is the result obtained by predicting future urban and rural data and making recommendations. These results show the changes, risks and opportunities that may arise in urban and rural areas in the future, and provide guidance and advice for subsequent work.

$y_t = f(x_t, z_t, \theta) + \epsilon_t$

wherein $y_t$ is the urban population, $x_t$ is the time variable and other influencing factors, $z_t$ is the urban and rural population difference coefficient (UPDC), $\theta$ is the model parameter, $\epsilon_t$ is the error term, and $f$ is a deep learning model.

In this embodiment, urban population prediction forecasts the population change of cities within a certain future period according to historical data and related factors. The prediction model uses a long short-term memory network (LSTM), a convolutional neural network (CNN) and an attention mechanism (Attention), because these models can efficiently process time series data, capture the long-term dependencies and local features of the data, and automatically learn the importance of the data. Specifically, first, urban population data, the urban and rural population difference coefficient, the time variable and other influencing factors (economic growth rate, birth rate, death rate, urban area, economic development level, population mobility) are taken as input variables, and the urban population is taken as the output variable. Then, the LSTM layer encodes the input variables and extracts their temporal features. Next, the CNN layer convolves the encoded input variables to extract their local features. Finally, the Attention layer performs a weighted summation of the convolved input variables to obtain the final representation vector, which is input into a fully connected layer to obtain the predicted value of the output variable. To evaluate the performance of the model, the mean square error is used as the loss function, which measures the difference between the predicted and true results. The model parameters are optimized to minimize the loss function. The other influencing factors are set according to the actual situation.

In this embodiment, in addition to the time variable and other influencing factors, urban population growth prediction may take the urban and rural population difference coefficient as an input variable, because it reflects the unevenness of the urban and rural population distribution and thereby influences the growth or decline of the urban population. For example, if the coefficient is large, meaning that the rural population is large and the urban population is small, more of the rural population may migrate to cities, causing the urban population to grow. Conversely, if the coefficient is small, meaning that the rural population is small and the urban population is large, more of the urban population may migrate to rural areas, causing the urban population to decline. Therefore, the urban and rural population difference coefficient is added to the model as an additional input variable to improve the prediction accuracy.

The rural yield change is predicted, and the prediction formula is as follows:

wherein $y_p$ is the yield of the $p$-th crop in the rural area, $x_p$ is the time variable and other influencing factors, $w_p$ is the urban and rural economic gap coefficient (UEGC), $\eta_p$ is the error term, and $g$ is a machine learning model with its own model parameters.

In this embodiment, rural yield change prediction forecasts the change in rural crop yield within a certain future period according to historical data and related factors. A support vector machine (SVM) model is selected because it handles regression problems effectively, has high accuracy and robustness, and can perform feature selection and feature engineering automatically. Specifically, first, various features of a rural area or farm (such as air temperature, rainfall, soil pH and fertilization amount) are taken as input features, and the yield values of crops (such as wheat yield and corn yield) are taken as output values. Then, the model is fitted to the input features to obtain predicted values of the output values. To evaluate performance, the mean square error (MSE) is again used as the loss function, and the goal is to minimize the loss function by optimizing the model parameters or structure.

In this embodiment, in addition to the time variable and other influencing factors, rural yield change prediction may take the urban and rural economic gap coefficient as an input feature, because it reflects the difference in economic development level between urban and rural areas and thereby influences rural agricultural production and farmer income. For example, if the coefficient is large, meaning that the income level of urban residents is high and that of rural residents is low, agricultural investment in rural areas is insufficient, agricultural technology lags behind, and agricultural yield is low. Conversely, if the coefficient is small, meaning that the income level of urban residents is lower and that of rural residents is higher, agricultural investment in rural areas is sufficient, agricultural technology is advanced, and agricultural yield is high. Therefore, the urban and rural economic gap coefficient is added to the machine learning model as an additional input feature to improve the prediction accuracy.

wherein $Y$ is the output variable vector, containing the values of the various socioeconomic indexes of urban and rural areas; $X$ is the input variable vector, containing the values of the various influencing factors of urban and rural areas; and $Z$ is the urban and rural spatial correlation coefficient (URSC).

In this embodiment, socioeconomic index prediction forecasts the change in the socioeconomic indexes of urban and rural areas within a certain future period according to historical data and related factors. A Bayesian network (Bayesian Network) probabilistic graphical model is selected because it handles probabilistic reasoning problems effectively, calculates conditional and posterior probabilities using Bayes' theorem, and can take into account the dependency and state transition relationships among the variables. Specifically, first, various factors of urban and rural areas (such as political stability, social welfare, education level and employment opportunities) are taken as input variables, and various indexes of urban and rural areas (such as average income, poverty rate, happiness index and development index) are taken as output variables. Then, a Bayesian network is used to model the relationship between the input variables and the output variables and to obtain the posterior probability distribution of the output variables. To evaluate the performance of the model, the log-likelihood (Log-likelihood) is used as the objective function, which measures the similarity between the predicted and actual results. The objective function is maximized by optimizing the model parameters or structure.

wherein $I_T$ is an $m \times 1$ vector representing the prices of $m$ agricultural products at the $T$-th time point, such as the wheat price and corn price; $A_0$ is an $m \times 1$ constant vector representing the intercept term of the model; $B$ is an $m \times 1$ coefficient vector representing the degree of influence of the urban and rural environmental difference coefficient on agricultural product prices; $v_T$ is the urban and rural environmental difference coefficient (UEDC), a scalar representing the degree of difference between urban and rural environments at the $T$-th time point; $A_K$ is an $m \times m$ coefficient matrix representing the lag effect between agricultural product prices; $u_T$ is an $m \times 1$ white noise vector representing the random error term of the model at the $T$-th time point; and $P$ is a positive integer representing the lag order of the model, which determines how many historical periods of data are included in the model. For example, if $P = 1$, only the data of the previous period ($T-1$) are included in the model; if $P = 2$, the data of the previous two periods ($T-1$ and $T-2$) are included; and so on. $P$ is selected according to the characteristics of the data and the fitting effect of the model. $K$ is a positive integer representing the lag index, which traverses all historical periods included in the model; if $P = 2$, $K$ can be 1 or 2, corresponding to the two periods $T-1$ and $T-2$ respectively.
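The price prediction formula is not reproduced in this text. The definitions above are consistent with a vector autoregression of order $P$ with the UEDC as an exogenous scalar, that is, $I_T = A_0 + B v_T + \sum_{K=1}^{P} A_K I_{T-K} + u_T$. Assuming that form, which is a reconstruction and may differ from the formula of this embodiment, a one-step forecast (with the zero-mean noise term omitted) could be sketched as follows; the function name `varx_forecast` is hypothetical.

```python
import numpy as np

def varx_forecast(history, v_next, A0, B, A_lags):
    """One-step price forecast under the assumed model
    I_T = A0 + B * v_T + sum_{K=1..P} A_K @ I_{T-K}  (noise term u_T omitted).

    history: array of past price vectors, oldest first, with at least P rows.
    v_next:  scalar UEDC value at the forecast time point.
    A_lags:  list [A_1, ..., A_P] of m x m lag coefficient matrices.
    """
    pred = np.asarray(A0, dtype=float) + np.asarray(B, dtype=float) * v_next
    for K, A_K in enumerate(A_lags, start=1):
        # history[-K] is the price vector K periods before the forecast point.
        pred = pred + np.asarray(A_K) @ np.asarray(history[-K], dtype=float)
    return pred
```

The length of `A_lags` is the lag order $P$, so passing one matrix uses only period $T-1$ and passing two uses $T-1$ and $T-2$, matching the examples given above.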

In this embodiment, agricultural product price prediction forecasts the change in the market price of agricultural products within a certain future period according to historical data and related factors. The autoregressive moving average (ARMA) time series analysis method is selected because it handles the trend and periodicity of price data effectively and can take the stationarity and differencing of the price data into account. Specifically, first, the price values of agricultural products (such as the wheat price and corn price) are taken as both the input values and the output values. Then, the price data are modeled with the ARMA model to obtain predicted values of future prices. To evaluate the performance of the method, the mean square error (MSE) is again used as the loss function, and the loss function is minimized by optimizing the parameters or structure of the method.

In this embodiment, in addition to the time variable, agricultural product price prediction may take the urban and rural environmental difference coefficient as an input value, because it reflects the difference in environmental quality between urban and rural areas and thereby affects the supply and demand relationship and market competition of agricultural products. For example, if the coefficient is large, meaning that urban environmental pollution is high and rural environmental pollution is low, urban residents may have a higher demand for agricultural products produced in rural areas, which raises the price of agricultural products. Conversely, if the coefficient is small, meaning that urban environmental pollution is lower and rural environmental pollution is higher, urban residents may have a lower demand for agricultural products produced in rural areas, which lowers the price of agricultural products. Therefore, the urban and rural environmental difference coefficient can be added to the time series analysis method as an additional input value to improve the prediction accuracy.

The medical demand is predicted, and the prediction formula is as follows:

$R_{cd} = f(R, S, \Psi)$

In this embodiment, medical demand prediction forecasts the change in urban and rural medical demand within a certain future period according to historical data and related factors. Recommendation system methods such as collaborative filtering (Collaborative Filtering) or matrix factorization (Matrix Factorization) are selected because they handle demand analysis problems effectively, use the scoring data between users and medical items for similarity calculation and recommendation, and can take into account the implicit features and nonlinear relationships between users and medical items. Specifically, first, the scoring data of users on medical items (such as willingness to visit and satisfaction) are taken as both the input values and the output values, and a scoring matrix is constructed. Then, the scoring matrix is decomposed or fitted by collaborative filtering to obtain predicted values of the unknown scores. To evaluate the performance of the method, the mean square error (MSE) is used as the loss function, and the loss function is minimized by optimizing the parameters or structure of the method.
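A minimal sketch of the matrix factorization variant follows, fitted by stochastic gradient descent on the observed scores with the mean square error as the loss. The rank, learning rate, regularization strength and function names are assumptions chosen for illustration, and the subscripts $c$ and $d$ of $R_{cd}$ are assumed to index the user and the medical item.

```python
import numpy as np

def factorize(R, mask, rank=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Approximate the observed entries of R by U @ V.T using SGD.

    R:    users x items scoring matrix.
    mask: boolean matrix, True where a score was actually observed.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))   # latent user features
    V = 0.1 * rng.standard_normal((n_items, rank))   # latent item features
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for c, d in zip(rows, cols):
            err = R[c, d] - U[c] @ V[d]              # error on one observed score
            u_old = U[c].copy()
            U[c] += lr * (err * V[d] - reg * U[c])
            V[d] += lr * (err * u_old - reg * V[d])
    return U, V

def mse(R, mask, U, V):
    """Mean square error over the observed scores only."""
    pred = U @ V.T
    return float(((R - pred)[mask] ** 2).mean())
```

After fitting, `U @ V.T` supplies predicted values for the entries where `mask` is False, which correspond to the unknown scores mentioned above.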

In this embodiment, in addition to the scoring data, medical demand prediction may take the urban and rural economic gap coefficient, the urban and rural environmental difference coefficient, the urban and rural spatial correlation coefficient, the urban and rural population difference coefficient and the like as input values or features, because they reflect the differences in development level and quality of life between urban and rural areas and thereby affect the medical demands and preferences of urban and rural residents. For example, if these coefficients are large, indicating a large gap between urban and rural areas, the scores and demands of urban and rural residents for medical items may differ greatly. Conversely, if these coefficients are small, indicating a small gap between urban and rural areas, the scores and demands of urban and rural residents for medical items may differ little. Therefore, these coefficients are added to the recommendation system method as additional inputs or features to improve the prediction accuracy.

Prediction is performed according to the clustering result obtained by the graph theory-based urban and rural clustering algorithm, specifically as follows:

The prediction comprises economic development prediction, education and medical service prediction, resource allocation and urban and rural coordinated development prediction, and environmental protection and sustainable development prediction.

The economic development prediction includes:

If the clustering result shows that Hainan province has two categories or clusters, namely an urban group and a rural group, the economic growth of the urban group can be predicted to be faster. The government is recommended to strengthen infrastructure construction and attract investment in the urban group, while promoting rural industry development and rural reform in the rural group. Hainan province is evaluated as having an obvious urban-rural gap and unbalanced development; it is recommended to strengthen coordinated development and integrated construction between urban and rural areas and to promote resource sharing and mutual cooperation between them.

If the clustering result shows that Hainan province has three categories or clusters, namely a central city, peripheral cities and remote rural areas, the central city can be predicted to become the engine of economic growth, and the government is recommended to support industrial upgrading and innovation in the central city. Hainan province is evaluated as having an urban-rural gap and unbalanced development to a certain extent; it is recommended to strengthen the connection and cooperation between the central city and the peripheral cities and to promote support and assistance between the peripheral cities and the remote rural areas.

If the clustering result shows that Hainan province has four categories or clusters, namely developed cities, developing cities, underdeveloped rural areas and ecological protection areas, Hainan province is evaluated as having urban-rural gaps and unbalanced development of different degrees. The developed cities are predicted to continue to lead; it is recommended to strengthen competition and cooperation between the developed cities and the developing cities, to promote transfer payments and poverty alleviation between the developing cities and the underdeveloped rural areas, and to protect the natural resources and environmental quality of the ecological protection areas.

In this embodiment, economic development prediction can be realized with a BP neural network (Back Propagation Neural Network), which fits and learns a nonlinear complex system by adjusting the weights and biases of the neural network.

Educational and medical service predictions include:

According to the different urban and rural clustering results, the education and medical demands of different areas can be predicted. The government is recommended to strengthen the allocation of higher education and medical resources in urban areas while providing basic education and basic medical services in rural areas. The prediction is realized with the ARIMA model (Autoregressive Integrated Moving Average Model), a time series analysis method based on autoregression and moving averages, which describes the stationarity and correlation of the data by establishing a difference equation and performs parameter estimation and model selection by maximum likelihood estimation or Bayesian estimation.

The resource allocation and urban and rural coordinated development prediction comprises the following steps:

Based on the clustering result, the government can adjust its resource allocation strategy to ensure resource sharing and mutual cooperation between urban and rural areas, so as to promote their coordinated development. The prediction is realized with a random forest (Random Forest), an ensemble learning method based on the bootstrap method and the random subspace method, which performs classification and regression for a nonlinear complex system by constructing multiple decision trees.

Environmental protection and sustainable development predictions include:

If an ecological protection area category exists, it can be predicted that the area requires special environmental protection policies and investment, and the government is recommended to strengthen ecological and environmental protection measures to promote sustainable development. The prediction can be realized with a grey prediction model (Grey Prediction Model), a prediction method based on a small amount of data and incomplete information, which describes the law of change of the data by establishing a differential equation, and performs prediction and verification by accumulated generation and inverse accumulated restoration.
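The embodiment does not state which grey model is used. As one common instance, a GM(1,1) model can be sketched as below; the choice of GM(1,1) and the function name `gm11_forecast` are assumptions made for illustration.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Forecast the next `steps` values of a short positive series with GM(1,1).

    Assumes the fitted development coefficient a is non-zero.
    """
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    x1 = np.cumsum(x0)                        # accumulated generating operation
    z1 = 0.5 * (x1[1:] + x1[:-1])             # background values
    # Least-squares estimate of a and b in the grey equation x0(k) + a * z1(k) = b.
    B = np.column_stack((-z1, np.ones(n - 1)))
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    # Time response of the whitening differential equation dx1/dt + a * x1 = b.
    k = np.arange(n + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.empty(n + steps)
    x0_hat[0] = x0[0]
    x0_hat[1:] = np.diff(x1_hat)              # inverse accumulation restores the series
    return x0_hat[n:]
```

The in-sample part of the restored series (discarded here) can be compared with the observed values as the verification step mentioned above.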

Prediction is performed according to the classification result of the neural network-based urban and rural classification algorithm, specifically as follows:

The prediction comprises urban and rural consumption behavior prediction and urban and rural data type prediction.

Urban and rural consumption behavior prediction: differences in the consumption behavior of urban and rural residents can be predicted by classification analysis of the urban and rural data. This can be used to formulate marketing strategies that promote different products and services according to the needs of different areas. For example, the consumption behaviors of urban and rural residents can be classified using indexes in the urban and rural data such as per capita disposable income, consumption structure, consumption level and consumption preference, and corresponding marketing strategies are provided according to the classification results, such as providing products and services with high quality, high added value and high technology for urban residents, and products and services with low cost, high efficiency and high practicability for rural residents.

Urban and rural data type prediction: based on the classification results, the urban and rural data can be divided into a balanced type and an unbalanced type. This helps to better understand the relationship between urban and rural areas so that appropriate policies and actions can be taken. For example, if the classification result shows that the urban and rural data are balanced, a good coordination and balance exists between the urban and rural areas, so this state can be maintained and resource sharing and mutual cooperation between them can be strengthened; if the classification result shows that the urban and rural data are unbalanced, a large imbalance exists between the urban and rural areas, and measures need to be taken to improve this state and to promote coordinated development and integrated construction between them.

In this embodiment, the analysis may be performed on population count, population density, urbanization rate, gross regional product, per capita income, poverty incidence, forest coverage and PM2.5 concentration, and the result indicates two categories or labels (urban and rural). Which data are specifically needed and which classification analysis is obtained depend on the actual situation.

The step S5 specifically comprises the following steps:

selecting a visual data type according to the prediction result to obtain a visual result, which specifically comprises the following steps:

Firstly, a suitable visualization type is selected according to the characteristics and purpose of the prediction result. For example, if the prediction result is time series data, a line chart, a bar chart, an area chart, or the like may be selected; if the prediction result is spatial data, a map, a scatter plot, a heat map, or the like may be selected; if the prediction result is categorical data, a pie chart, a radar chart, a box plot, or the like may be selected.
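The selection rule above can be expressed as a small lookup. The following is a sketch under stated assumptions: the field names `date`, `lat` and `lon` used to recognize the kind of prediction result are hypothetical, as are the function names; only the mapping from result kind to chart types comes from the description.

```python
# Candidate chart types for each kind of prediction result.
CHART_OPTIONS = {
    "time_series": ["line chart", "bar chart", "area chart"],
    "spatial": ["map", "scatter plot", "heat map"],
    "categorical": ["pie chart", "radar chart", "box plot"],
}


def infer_result_kind(record):
    """Guess the kind of a prediction record from its (hypothetical) fields."""
    if "date" in record:
        return "time_series"
    if "lat" in record and "lon" in record:
        return "spatial"
    return "categorical"


def choose_chart_types(records):
    """Return the inferred kind and the chart types suggested for it."""
    kind = infer_result_kind(records[0])
    return kind, CHART_OPTIONS[kind]


prediction = [{"date": "2025", "urban_population": 1.20},
              {"date": "2026", "urban_population": 1.24}]
print(choose_chart_types(prediction))
```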

Secondly, the corresponding graphics are generated according to the visualization type using suitable data analysis tools and libraries. In this embodiment, the Python language is used for data analysis, and libraries such as Matplotlib, Seaborn and Plotly are used to generate the various types of graphics. The generated graphics then need to be refined and optimized: elements such as titles, labels and legends are added, and parameters such as colors, fonts and sizes are adjusted. This can be done using the functions and methods provided by the above tools.

Finally, the generated graph needs to be displayed to the user, and an interactive function is provided, so that the user can freely explore and analyze the data. This can be done using a Dash framework, a Bokeh library, a Shiny package, etc.

Integrating explanatory information in the visual results to provide an explanation about the predicted results, specifically including:

Firstly, explainable artificial intelligence technology is needed to make the decision process of the prediction model interpretable and to extract the important features and parameters of the model. This can be done using the LIME library, the ELI5 library, or the like.

Secondly, the explanatory information needs to be converted into a language and format that users can easily understand, and combined with the visualization result. This can be done using natural language generation techniques and template filling techniques.

It is then necessary to evaluate and optimize the explanatory information and ensure its accuracy and credibility. The evaluation may be performed using a manual evaluation or an automatic evaluation method.

Finally, the explanatory information needs to be displayed to the user, and feedback and modification functions are provided, so that the user can evaluate and improve the explanatory information. It may also be performed using the functions and methods provided in the tools described above. It can show which features have the greatest effect on the prediction results and how the model derives the advice. This helps the user to understand the recommendation and prediction results of the model more deeply.
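The template filling step mentioned above can be sketched as follows. The feature weights are assumed to have been produced beforehand by an explanation tool such as LIME; the weights, feature names and the function name are hypothetical and serve only to show how the most influential features can be turned into a sentence for the user.

```python
def explain_prediction(target, prediction, feature_weights, top_k=3):
    """Fill a sentence template with the top_k most influential features."""
    # Rank features by the absolute size of their contribution.
    ranked = sorted(feature_weights.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    clauses = []
    for name, weight in ranked[:top_k]:
        direction = "supports" if weight > 0 else "opposes"
        clauses.append(f"{name} {direction} this result (weight {weight:+.2f})")
    return (f"Predicted {target}: {prediction}. Main factors: "
            + "; ".join(clauses) + ".")


# Hypothetical feature contributions for one classified area.
weights = {"population density": 0.42,
           "PM2.5 concentration": 0.05,
           "forest coverage": -0.31}
print(explain_prediction("area type", "urban", weights, top_k=2))
```

The generated sentence can be shown next to the chart, so the user sees which features had the greatest effect on the prediction.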

In this embodiment, the visualization, interpretation and data prediction functions can be integrated and deployed together, so that a user can view the prediction results on the platform and carry out further analysis and decision making through the visualization and interpretation functions. This can be done using tools such as the Flask framework or the Django framework.

The joint process of the integration and deployment module is as follows:

firstly, each module needs to be integrated into a unified urban and rural data analysis platform, so that the cooperative work among the modules is ensured. Version control and collaborative development can be performed using the Git tools.

Secondly, a cloud solution needs to be provided, so that users can access and use the intelligent data analysis tool anytime and anywhere. Cloud deployment and management can be performed using services such as AWS and Azure.

Then, security and user privacy of the data need to be ensured, and data desensitization, encryption, access control and other technologies are adopted to protect the data. Encryption of data transmission can be performed by using SSL/TLS protocol, encryption of data storage can be performed by using AES algorithm or other symmetric encryption algorithm, and user access control can be performed by using OAuth2.0 protocol or other identity authentication protocol.

And finally, testing and maintaining the urban and rural data analysis platform to ensure the stability and reliability of the platform. Automated testing and debugging may be performed using tools such as the Pytest library, the Selenium library, etc.

In this embodiment, after the urban and rural data analysis platform is built, the urban and rural data analysis platform can be used by a user, and specifically includes:

Government decision makers may use the platform to view economic growth predictions for urban and rural areas. They can select different policy variables on the interactive interface, simulate the impact of different policies on economic growth, and learn through the explainable AI why a certain policy works better, which helps them formulate more targeted policies.

Urban planners can use the platform to analyze data such as land use and demographics in urban and rural areas to guide urban planning and land use planning.

Researchers can use the platform to conduct academic research, explore data on urban and rural areas, generate papers and reports, and share research results.

Example 2

As shown in fig. 2, the present invention discloses an intelligent data analysis system, which comprises:

and the urban and rural data acquisition module 10 is used for acquiring urban and rural data of various sources.

The preprocessing operation module 20 is used for preprocessing urban and rural data to obtain preprocessed data.

The data analysis module 30 is configured to perform data analysis on the preprocessed data to obtain an analysis result.

And the urban and rural data prediction module 40 is used for predicting future urban and rural data to obtain a prediction result.

The result visualization interpretation module 50 is configured to visualize the prediction result and provide interpretability information.

As an alternative embodiment, the preprocessing operation module 20 of the present invention specifically includes:

and the data layer-by-layer processing sub-module is used for sequentially performing data cleaning, data compression, data encoding and data decoding on urban and rural data to obtain preprocessed data.

As an alternative embodiment, the data analysis module 30 of the present invention specifically includes:

where URSC is the urban-rural spatial correlation coefficient, $n$ is the number of data items, $w_{ij}$ is the element in row $i$ and column $j$ of the spatial weight matrix $W$ and represents the strength of the spatial relationship between the $i$-th and $j$-th data items, $x_i$ is the urban attribute value of the $i$-th data item, $y_i$ is the rural attribute value of the $i$-th data item, $\bar{x}$ is the mean of all urban attribute values, and $\bar{y}$ is the mean of all rural attribute values.

where UEDC is the urban-rural environmental difference coefficient, $Q_u$ is the urban environmental quality index, $Q_r$ is the rural environmental quality index, $R_u$ is the urban resource utilization rate, $R_r$ is the rural resource utilization rate, and $E_u$ and $E_r$ are the corresponding urban and rural quantities. The coefficient reflects the degree of difference between the urban and rural environments.

constructing an undirected weighted graph $G = (V_G, E_G, W_G)$, where $V_G$ is the vertex set, representing all urban and rural data points; $E_G$ is the edge set, representing the connections among all urban and rural data points; and $W_G$ is the weight matrix, representing the weights of all edges.

And calculating the degree of each vertex, and sorting the vertices in descending order according to the degree.

Traversing from the vertex with the largest degree, taking the vertex as the center of a new class or cluster, dividing the vertex which is connected with the vertex and is not divided into the class or cluster, and updating the center of the class or cluster to be the average value of all vertex coordinates.
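The degree-ordered traversal described above can be sketched as follows. Several choices are assumptions rather than part of the description: the degree is taken as the number of incident edges (the edge weights are carried but not used), ties are broken by vertex name, and the point coordinates and function name are hypothetical.

```python
def degree_clustering(points, edges):
    """Cluster graph vertices by visiting them in descending order of degree.

    points: {vertex: (x, y)} coordinates of the urban and rural data points.
    edges:  {(u, v): weight} undirected weighted edges.
    """
    neighbors = {v: set() for v in points}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Sort vertices by degree, largest first (ties broken by name).
    order = sorted(points, key=lambda v: (-len(neighbors[v]), v))

    assigned = {}
    clusters = []
    for v in order:
        if v in assigned:
            continue
        # v becomes the center of a new cluster and pulls in every
        # connected vertex that has not been assigned yet.
        members = [v] + sorted(n for n in neighbors[v] if n not in assigned)
        for m in members:
            assigned[m] = len(clusters)
        # Update the cluster center to the mean of the member coordinates.
        cx = sum(points[m][0] for m in members) / len(members)
        cy = sum(points[m][1] for m in members) / len(members)
        clusters.append({"members": members, "center": (cx, cy)})
    return clusters


points = {"a": (0, 0), "b": (2, 0), "c": (0, 2), "d": (10, 10), "e": (12, 10)}
edges = {("a", "b"): 1.0, ("a", "c"): 0.8, ("d", "e"): 0.9}
for cluster in degree_clustering(points, edges):
    print(cluster)
```

A weighted variant would replace the neighbor count in the sort key with the sum of the incident edge weights.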

carrying out space classification analysis on the preprocessed data to obtain classification results, wherein the method specifically comprises the following steps:

dividing the data set into a training set and a testing set, and carrying out normalization processing.

And constructing a multi-layer perceptron neural network model, and initializing the weight of the neural network.

The training set is input into the neural network, the output of the neural network is calculated through forward propagation and compared with the real class or label, the error is calculated through backward propagation, the weight is updated according to the error, and the process is repeated until the error reaches the minimum or the maximum iteration number.
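The three steps above (split and normalize, build and initialize the network, train by forward and backward propagation) can be sketched with a dependency-free multi-layer perceptron. This is an illustration only: the single hidden layer, the sigmoid activations, the cross-entropy loss, the learning rate and the two-feature data set are all assumptions, and a library such as scikit-learn's MLPClassifier would normally be used instead.

```python
import math
import random


def min_max_normalize(rows):
    """Scale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid activations, one binary output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)  # random weight initialization
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
                  for ws, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, target, lr):
        hidden, out = self.forward(x)          # forward propagation
        delta_out = out - target               # output error signal
        for j, h in enumerate(hidden):         # backward propagation
            delta_h = delta_out * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * delta_out * h
            self.b1[j] -= lr * delta_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_h * xi
        self.b2 -= lr * delta_out
        clipped = min(max(out, 1e-12), 1.0 - 1e-12)
        return -(target * math.log(clipped)
                 + (1 - target) * math.log(1.0 - clipped))

    def fit(self, xs, ys, epochs=500, lr=0.5, min_loss=0.01):
        history = []
        for _ in range(epochs):                # stop at max iterations ...
            losses = [self.train_step(x, t, lr) for x, t in zip(xs, ys)]
            history.append(sum(losses) / len(losses))
            if history[-1] < min_loss:         # ... or when the error is small
                break
        return history

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


# Hypothetical rows: [population density, urbanization rate]; 1 = urban, 0 = rural.
raw = [[5200, 0.91], [310, 0.32], [4800, 0.88], [150, 0.21],
       [6100, 0.95], [420, 0.40], [3900, 0.83], [90, 0.18]]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
data = min_max_normalize(raw)  # for brevity, normalized before splitting
train_x, train_y = data[:6], labels[:6]
test_x, test_y = data[6:], labels[6:]

model = TinyMLP(n_inputs=2, n_hidden=3)
history = model.fit(train_x, train_y)
correct = sum(model.predict(x) == y for x, y in zip(test_x, test_y))
print(f"loss {history[0]:.3f} -> {history[-1]:.3f}, test correct {correct}/{len(test_y)}")
```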

As an alternative embodiment, the urban and rural data prediction module 40 of the present invention specifically includes:

$$y_t = f(x_t, z_t, \theta) + \epsilon_t$$

where $y_t$ is the urban population, $x_t$ denotes the time variable and other influencing factors, $z_t$ is the urban-rural population difference coefficient, $\theta$ is the model parameter, and $\epsilon_t$ is a random error term.

The rural yield change is predicted, and the prediction formula is as follows:

where $y_p$ is the yield of the $p$-th crop in the rural area, $x_p$ denotes the time variable and other influencing factors, $w_p$ is the urban-rural economic gap coefficient, $\eta_p$ is a random error term, and the remaining symbol in the formula is the model parameter.

where $I_T$ is an $m \times 1$ vector representing the prices of $m$ agricultural products, $A_0$ is an $m \times 1$ constant vector, $B$ is an $m \times 1$ coefficient vector, $v_T$ is the urban-rural environmental difference coefficient, $A_K$ is an $m \times m$ coefficient matrix, and $u_T$ is an $m \times 1$ white noise vector.

The medical demand is predicted, and the prediction formula is as follows:

$$R_{cd} = f(R, S, \Psi)$$

As an alternative embodiment, the result visualization interpretation module 50 of the present invention specifically includes:

And the visualization type analysis sub-module is used for selecting the type of the visualization data according to the prediction result to obtain the visualization result.

And the information interpretation sub-module is used for integrating the interpretation information in the visual result and providing interpretation about the prediction result.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An intelligent data analysis method, the method comprising:

step S1: acquiring urban and rural data of various sources;

step S2: preprocessing the urban and rural data to obtain preprocessed data;

2. The intelligent data analysis method according to claim 1, wherein the preprocessing operation is performed on the urban and rural data to obtain preprocessed data, and the method specifically comprises:

3. The intelligent data analysis method according to claim 1, wherein the data analysis is performed on the preprocessed data to obtain an analysis result, and specifically comprises:

where UPDC is the urban-rural population difference coefficient, $P_u$ is the urban population, $P_r$ is the rural population, $S_u$ is the urban population structure index, and $S_r$ is the rural population structure index. The urban and rural population structure indexes are calculated from factors such as the age, gender and education of the population, and the calculation formula is as follows:

where $S$ is the population structure index, $N$ is the number of population structure factors, and $\omega_I$ is the value of the $I$-th factor. The coefficient reflects the degree of difference between the urban and rural populations and takes values in $[0, 1]$: a value of 0 means the urban and rural populations are identical, and a value of 1 means they are completely different.

constructing an undirected weighted graph $G = (V_G, E_G, W_G)$, where $V_G$ is the vertex set, representing all urban and rural data points; $E_G$ is the edge set, representing the connections among all urban and rural data points; and $W_G$ is the weight matrix, representing the weights of all edges;

4. The intelligent data analysis method according to claim 1, wherein the predicting the future urban and rural data to obtain the prediction result specifically comprises:

$$y_t = f(x_t, z_t, \theta) + \epsilon_t$$

the rural yield change is predicted, and the prediction formula is as follows:

the medical demand is predicted, and the prediction formula is as follows:

$$R_{cd} = f(R, S, \Psi)$$

5. The method for intelligent data analysis according to claim 1, wherein the visualizing the prediction result provides interpretability information, and specifically comprises:

6. An intelligent data analysis system, the system comprising:

7. The intelligent data analysis system according to claim 6, wherein the preprocessing operation module specifically comprises:

8. The intelligent data analysis system according to claim 6, wherein the data analysis module specifically comprises:

where $S$ is the population structure index, $N$ is the number of population structure factors, and $\omega_I$ is the value of the $I$-th factor. The coefficient reflects the degree of difference between the urban and rural populations and takes values in $[0, 1]$: a value of 0 means the urban and rural populations are identical, and a value of 1 means they are completely different.

9. The intelligent data analysis system according to claim 6, wherein the urban and rural data prediction module specifically comprises:

$$y_t = f(x_t, z_t, \theta) + \epsilon_t$$

where $y_t$ is the urban population, $x_t$ denotes the time variable and other influencing factors, $z_t$ is the urban-rural population difference coefficient, $\theta$ is the model parameter, and $\epsilon_t$ is a random error term;

the rural yield change is predicted, and the prediction formula is as follows:

the medical demand is predicted, and the prediction formula is as follows:

$$R_{cd} = f(R, S, \Psi)$$

10. The intelligent data analysis system according to claim 6, wherein the result visualization interpretation module specifically comprises: