CN108595414B - Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Info

Publication number: CN108595414B
Application number: CN201810239430.7A
Authority: CN
Inventors: 史舟; 徐烨; 贾晓琳; 尤其浩
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2020-07-10
Anticipated expiration: 2038-03-22
Also published as: CN108595414A

Abstract

The invention discloses a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning. First, polluted-enterprise data, enterprise POI data and heavy metal pollution data of the area to be researched are acquired; the enterprise industry category distribution of the data set is adjusted; and, after word segmentation and removal of place-name words, the data are split into a training data set and a test data set. Next, a corpus is built from the two data sets, the frequency of each word appearing in each sample is counted as the text feature of that sample, a multinomial naive Bayes model is trained with the samples of the training set, and the model is evaluated by its score on the test set. Finally, the industry classification of the acquired enterprise POI data is predicted, the predicted enterprise counts and the heavy metal pollution indexes are aggregated in a grid generated according to the topological shape of the research area, spatial analysis is performed with a bivariate spatial autocorrelation method, the spatial distribution relationship between pollution and enterprises is judged, and heavy metal point source and non-point source pollution areas in the research area are identified.

Description

Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Technical Field

The invention relates to a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning, and in particular to a method that combines classification based on text mining with bivariate spatial autocorrelation analysis.

Background Art

With the development of modern industrialization, some non-compliant enterprises discharge the "three industrial wastes" (waste gas, waste water and solid waste) indiscriminately, causing serious environmental pollution; soil heavy metal pollution in particular has become a worldwide environmental problem. According to surveys, the overall exceedance rate of soil pollution in China is 16.1%, heavy metals are the dominant pollution type, and about two million hectares of cultivated farmland have been damaged. Farmland soil pollution is mainly divided into point source pollution and non-point source pollution. Non-point source pollution refers to soil pollution caused by soil erosion, surface runoff and similar processes, without a fixed discharge point. Point source pollution has a fixed emission source and an identifiable extent, so it is easier to control and manage than non-point source pollution; enterprise pollution belongs to point source pollution. Many research methods and models already exist for analyzing the sources of soil heavy metal pollution. For example, Wangzhou pine and Qiyong (Wangzhou pine, Qiyong. Xuzhou city surface soil heavy metal environmental risk measure and source analysis [J]. Geochemistry, 2006, 35(1): 88-94.) used the statistical methods of factor analysis and cluster analysis to determine the sources and categories of heavy metal elements in the surface soil of a research area; Saby et al. (Saby N P, Thioulouse J, Jolivet C, et al. Multivariate analysis of the spatial patterns of 8 trace elements using the French soil monitoring network data [J]. Science of the Total Environment, 2009, 407(21): 5644-: the soil matrix, the soil texture, soil weathering and anthropogenic factors.
However, these source analysis methods and models have certain shortcomings. Traditional statistical and chemical methods, such as correlation analysis, principal component analysis, cluster analysis and factor analysis, ignore the spatial position of heavy metal pollution and therefore offer limited help for preventing and controlling soil heavy metal pollution. Combining spatial interpolation with traditional multivariate statistics does not provide reliable quantitative analysis and does not handle the spatial variability of pollution well. Moreover, because the sources and accumulation mechanisms of enterprise-induced soil heavy metal pollution are very complex, preventing, controlling and remediating enterprise pollution is currently very difficult. Therefore, a bivariate spatial autocorrelation model (bivariate Moran's I) can be used to study the spatial correlation between soil heavy metal pollution and enterprise distribution, providing effective guidance for the management and control of enterprise pollution.

However, because of data silos, cooperation between departments is difficult and enterprise information is hard to obtain. A classification method based on text mining can therefore be adopted: the industry category of a polluting enterprise is identified from the enterprise name, and the result serves as the basis for studying the relationship between enterprise distribution and pollution distribution. Text classification is an important data mining method: a classification function or model is constructed from a certain amount of existing data, and text data of unknown category are then assigned to predefined categories, according to their content, under a given classification system. For example, a text classification model has been used to classify massive point-of-interest (POI) data (Zhang, ZhaJ, L-channel, C-Text Classification) 647 by using a Convolutional Text Classification model.

The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning is based on text mining: a classification model is built mainly with the multinomial naive Bayes method, and bivariate spatial autocorrelation analysis is then performed on the classified enterprise data and the local pollution data, which provides guidance for research on the relationship between enterprise distribution and pollution distribution.

Disclosure of Invention

The invention aims to solve the problems in the prior art and provides a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning. The specific technical scheme is as follows:

the method for identifying the pollution source of the soil heavy metal enterprise based on source-sink space variable reasoning comprises the following steps:

step 1) data acquisition: acquiring polluted enterprise data, enterprise POI data and soil heavy metal pollution data of an area to be researched, wherein the polluted enterprise data comprises enterprise names and corresponding industry classifications thereof; the enterprise POI data comprises all enterprise names and longitude and latitude information of an area to be researched; the soil heavy metal pollution data are soil survey data of an area to be researched and comprise pollution indexes and longitude and latitude information of each heavy metal element of soil;

step 2) enterprise data preprocessing: carrying out descriptive analysis on the polluted enterprise data acquired in step 1), and adjusting the enterprise industry category distribution of the data set according to the analysis result so that the category distribution of the enterprise samples is more even; then performing word segmentation on the enterprise names and removing place-name words; finally, splitting the data into a training data set and a test data set in a given proportion;

step 3) enterprise data classification: extracting, from the result of step 2), the set of all words or phrases appearing in the training data set and the test data set as a corpus; counting, according to the corpus, the frequency of the words appearing in the enterprise name of each sample, and extracting the text features corresponding to the samples; training a multinomial naive Bayes model with the samples of the training set to obtain the optimal model parameters; and evaluating the model by its score on the test data set;

step 4) spatial analysis: performing word segmentation on the enterprise POI data acquired in step 1), removing place-name words, inputting the result into the multinomial naive Bayes model trained in step 3), predicting the industry classification of the enterprises in the data set, and performing spatial density analysis on the different enterprise categories with a kernel density method; meanwhile, generating a regular grid of a specified size according to the topological shape of the area to be researched, and counting, within each grid, the number of enterprises of each industry classification and the pollution index of each soil heavy metal element; then performing spatial analysis with a bivariate spatial autocorrelation method;

step 5) pollution source judgment: analyzing the spatial distribution relationship between the soil heavy metal pollution and the polluted enterprises, judging the point source and non-point source pollution distribution characteristics of the area to be researched, and identifying the enterprise pollution sources.

Preferably, in step 2), the method for adjusting the enterprise industry category distribution of the data set is: according to the Pareto principle, sorting the industry categories in the analysis result by frequency from high to low, selecting as representative categories the leading categories whose cumulative share first exceeds a threshold, and merging all remaining industry categories into a single category, so that the industry category distribution of the samples is more even.
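The category adjustment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `merge_rare_categories`, the `"other"` label and the sample labels are hypothetical, and the 0.8 default simply mirrors the 80% threshold mentioned in the embodiment.

```python
from collections import Counter

def merge_rare_categories(labels, threshold=0.8, other_label="other"):
    """Keep the most frequent categories until their cumulative share
    exceeds `threshold`; relabel every remaining category as `other_label`."""
    counts = Counter(labels)
    total = len(labels)
    kept, cumulative = set(), 0
    # most_common() yields categories from most to least frequent.
    for category, count in counts.most_common():
        if cumulative / total > threshold:
            break  # the kept categories already exceed the threshold
        kept.add(category)
        cumulative += count
    return [lab if lab in kept else other_label for lab in labels]

# Hypothetical industry labels for 100 enterprises.
labels = ["metal"] * 50 + ["chem"] * 25 + ["textile"] * 10 + ["food"] * 8 + ["paper"] * 7
merged = merge_rare_categories(labels)
```

In this toy data the three leading categories cover 85% of the samples, so the two remaining categories are merged into one.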

Preferably, in step 2), the word segmentation engine used for the enterprise names is jieba; the place-name words removed include the names of places at or above the county/town level of the administrative division.

Preferably, the specific steps of step 3) are as follows:

3.1) extracting text features: first, finding the set of words or n-gram phrases in the training data set and the test data set, the total number of words or phrases being N; then numbering the words or phrases from 1 to N and taking the numbered set as the corpus; then constructing an N-dimensional vector for each sample in the training data set and the test data set, in which the value of the m-th dimension is the frequency, in that sample, of the word or phrase numbered m; the resulting N-dimensional vector is the extracted text feature;

3.2) training a multinomial naive Bayes model: using the text features of the training set, the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model are tuned by a grid search based on 10-fold cross-validation; the evaluation index of the cross-validation is the classification accuracy, and the parameters with the highest average classification accuracy are selected as the optimal parameters;

3.3) after the optimal parameters of the model are determined, evaluating the model by the classification accuracy Acc and the Kappa coefficient on the test data set.
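The feature extraction of step 3.1 can be sketched as follows, assuming the enterprise names have already been segmented into token lists. The function names and the sample tokens are hypothetical, and treating the corpus as all phrases of 1 to n consecutive tokens is one plausible reading of the text feature parameter n, not something the text states explicitly.

```python
def ngrams(tokens, n):
    """All phrases of 1..n consecutive tokens, kept as tuples."""
    return [tuple(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def build_corpus(samples, n):
    """Number every distinct word or phrase (index 0..N-1 plays the role of 1..N)."""
    vocab = {}
    for tokens in samples:
        for term in ngrams(tokens, n):
            vocab.setdefault(term, len(vocab))
    return vocab

def vectorize(tokens, vocab, n):
    """N-dimensional term-frequency vector for one segmented enterprise name."""
    vec = [0] * len(vocab)
    for term in ngrams(tokens, n):
        if term in vocab:  # terms missing from the corpus are dropped
            vec[vocab[term]] += 1
    return vec

# Hypothetical segmented enterprise names (place names already removed).
samples = [["hong", "metal", "products"], ["hong", "textile"]]
vocab = build_corpus(samples, n=2)
features = [vectorize(tokens, vocab, n=2) for tokens in samples]
```

Tuples are used as vocabulary keys so that two different token sequences can never collapse into the same phrase string.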

Preferably, in the step 4), the specific method for generating a regular grid with a specified size according to the topological shape of the region to be studied is: calculating the range represented by the minimum circumscribed rectangle according to the topological shape of the area to be researched, and then dividing the grid from a certain vertex of the minimum circumscribed rectangle according to a preset size specification to obtain grid data; the specific method for counting the number of enterprises classified in each industry and the pollution index of each soil heavy metal element in the grid is as follows: respectively counting the number of enterprise POI points of each industry classification falling into each grid, wherein the counting value represents the enterprise aggregation degree in the grid area; simultaneously counting each soil heavy metal element pollution index in each grid, and if a plurality of investigation points exist in a certain grid, regarding a certain soil heavy metal element, taking the average value of the soil heavy metal element pollution indexes of all the investigation points in the grid as the soil heavy metal element pollution index in the grid area; dividing grid data into different industry categories and different soil heavy metal elements to perform bivariate spatial correlation analysis, wherein a specific analysis formula is as follows:

$$I_i^{ab} = z_i^a \sum_{j} w_{ij}\, z_j^b \qquad (2)$$

in formula (2), $z_i^a$ represents the binarized value of attribute a in grid i, where attribute a is the pollution index of a certain soil heavy metal element in the grid, binarized as follows: the pollution index is redefined as 0 when it is less than or equal to 1, and as 1 when it is greater than 1; $z_j^b$ represents the z-score standardized value of attribute b in grid j, where attribute b is the number of enterprise POI points of a certain enterprise category in the grid; $w_{ij}$ is the spatial weight matrix; $I_i^{ab}$ is the local spatial correlation index of attribute a and attribute b at grid i; if $I_i^{ab}$ is significantly positive, the soil heavy metal pollution degree at grid i is positively correlated with the enterprise aggregation degree in the adjacent range; if $I_i^{ab}$ is significantly negative, the soil heavy metal pollution degree at grid i is negatively correlated with the enterprise aggregation degree in the adjacent range; if $I_i^{ab}$ is not significant, the soil heavy metal pollution degree at grid i and the enterprise aggregation degree in the adjacent range are not significantly correlated; the $I_i^{ab}$ values obtained for the individual grids form a corresponding spatial cluster map.
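The grid generation and per-grid statistics described above can be sketched as follows. This is an illustration under stated assumptions: coordinates are taken to be already projected to a planar system in the same unit as the cell size, the minimum circumscribed rectangle is approximated by the axis-aligned bounding box, and all function names and sample points are hypothetical.

```python
import math
from collections import defaultdict

def build_grid(boundary, cell):
    """Bounding-box origin and grid shape for a list of (x, y) boundary vertices."""
    xs, ys = [p[0] for p in boundary], [p[1] for p in boundary]
    x0, y0 = min(xs), min(ys)
    ncols = max(1, math.ceil((max(xs) - x0) / cell))
    nrows = max(1, math.ceil((max(ys) - y0) / cell))
    return x0, y0, nrows, ncols

def cell_of(x, y, x0, y0, cell, nrows, ncols):
    """Row and column of the cell containing (x, y); edge points are clamped."""
    row = min(int((y - y0) // cell), nrows - 1)
    col = min(int((x - x0) // cell), ncols - 1)
    return row, col

def aggregate(enterprises, surveys, x0, y0, cell, nrows, ncols):
    """enterprises: (x, y) points of one industry category.
    surveys: (x, y, pollution_index) points of one heavy metal element."""
    counts = [[0] * ncols for _ in range(nrows)]
    values = defaultdict(list)
    for x, y in enterprises:
        r, c = cell_of(x, y, x0, y0, cell, nrows, ncols)
        counts[r][c] += 1  # enterprise aggregation degree of the cell
    for x, y, index in surveys:
        values[cell_of(x, y, x0, y0, cell, nrows, ncols)].append(index)
    # A cell with one survey point keeps its value; several points are averaged.
    mean_index = {rc: sum(v) / len(v) for rc, v in values.items()}
    return counts, mean_index

boundary = [(0, 0), (3, 0), (3, 2), (0, 2)]
x0, y0, nrows, ncols = build_grid(boundary, cell=1)
counts, mean_index = aggregate(
    [(0.5, 0.5), (0.7, 0.2), (2.5, 1.5)],
    [(0.1, 0.1, 0.8), (0.9, 0.9, 1.6), (2.2, 1.1, 0.5)],
    x0, y0, 1, nrows, ncols)
```

Cells without any survey point are simply absent from `mean_index`; how such cells are treated in the later analysis is left open here.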

Preferably, in step 5), the spatial distribution relationship between the soil heavy metal pollution and the polluted enterprises is judged from the spatial cluster map: if, in a certain area of the map, the pollution index attribute of a certain soil heavy metal and the POI count attribute of a certain enterprise category are High-High, the pollution source of that soil heavy metal in the area is judged to be possible point source pollution caused by enterprises; if the pollution index attribute of a soil heavy metal and the POI count attributes of all enterprise categories are High-Low in a certain area, the pollution source of that soil heavy metal in the area is judged to be possible non-point source pollution.

The advantages of the method are as follows. The model is built on classification data of known polluted enterprises, so the category of a polluted enterprise can be identified directly from the name of a public enterprise POI. Based on the predicted categories, grid data are established and bivariate spatial autocorrelation is used to analyze the spatial distribution relationship between each enterprise category and the soil pollution of each heavy metal element, so that discrete soil heavy metal pollution point data and enterprise point data can be analyzed accurately. This extends the existing analysis methods and ideas, and has important theoretical, practical and application value for the management and control of enterprise pollution.

Drawings

FIG. 1 shows the industry category distribution of the polluted enterprise data in the embodiment.

FIG. 2 shows the spatial distribution density of the predicted polluted enterprise categories in the embodiment (a. textile industry; b. metal products industry; c. chemical raw materials and chemical products manufacturing industry; d. other industries).

FIG. 3 is the spatial cluster map of the soil heavy metal Cd pollution degree and the aggregation degree of metal products enterprises in the research area in the embodiment.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The invention discloses a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning, which comprises the following steps:

step 1) data acquisition: and acquiring polluted enterprise data, enterprise POI data and soil heavy metal pollution data of the area to be researched. The polluted enterprise data comprises enterprise names and corresponding industry classifications, and the industry classifications accord with national economic industry classification standard GB/T4754-2011. The enterprise POI data should include all enterprise names in the area to be studied and longitude and latitude information of the location of the enterprise. The soil heavy metal pollution data is soil investigation data of an area to be researched, and comprises soil heavy metal element pollution indexes of investigation points and longitude and latitude information of the investigation points, and when pollution sources of various heavy metals need to be identified, the investigation data also comprises corresponding heavy metal element pollution indexes.

Step 2) enterprise data preprocessing: descriptive analysis is performed on the polluted enterprise data acquired in step 1) to obtain the distribution of the industry classifications to which the enterprises belong. Because the industry category distribution of the original polluted enterprise data is severely uneven, only a few representative categories (those whose samples together account for about 80% of the total) are kept and all remaining categories are merged into one, so that the sample distribution is as even as possible. The adjustment method in this embodiment is: sort the industry categories in the analysis result by frequency from high to low, select as representative categories the leading categories whose cumulative share first exceeds a threshold (80% may be used), and merge all other industry categories into a single category. After the adjustment, the enterprise names are segmented with the jieba word segmentation engine (which is trained with a hidden Markov model and segments well), and words that are names of places at or above the village/town level of the administrative division are removed from the enterprise names. Finally, the processed enterprise data are split into a training data set and a test data set at a sample ratio of 8:2; each data set contains the words remaining in each enterprise name after segmentation and removal, together with the industry category of that enterprise.

Step 3) enterprise data classification: the set of all words or phrases appearing in the training data set and the test data set is extracted from the result of step 2) as a corpus; according to the corpus, the frequency of the words appearing in the enterprise name of each sample is counted, and the text features corresponding to the samples are extracted; a multinomial naive Bayes model is trained with the samples of the training set to obtain the optimal model parameters; and the model is evaluated by its score on the test data set. The specific steps are as follows:

3.1) extracting text features: first, finding the set of words or n-gram phrases in the training data set and the test data set, giving N words or phrases in total; then numbering the words or phrases from 1 to N and taking the numbered set as the corpus; then constructing an N-dimensional vector for each sample in the training data set and the test data set, in which the value of the m-th dimension is the frequency, in that sample, of the word or phrase numbered m; the resulting N-dimensional vector is the extracted text feature;

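3.2) training a multinomial naive Bayes model: using the text features of the training set, the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model are tuned by a grid search based on 10-fold cross-validation; the evaluation index of the cross-validation is the classification accuracy, and the parameters with the highest average classification accuracy are selected as the optimal parameters;

the hyper-parameter n and the naive Bayes smoothing parameter α mentioned above are explained as follows: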

the feature parameter n is a means of expanding the corpus after word segmentation. Suppose segmentation yields m distinct words; in practice m may be very small, and a classification model built on so few words performs poorly. Starting from these m words, n consecutive words can be combined, in order, into a new phrase, which expands the corpus. Clearly n should not be too large; it is a positive integer with a minimum of 1, so it needs to be tuned by the grid search method;

the naive Bayes smoothing parameter α is a means of handling new words. Naive Bayes modeling relies on the corpus; even if the corpus is expanded with the hyper-parameter n, it cannot cover every possible word, so the features of a new word are lost when it is vectorized, and overfitting occurs more easily. A smoothing technique is therefore introduced when calculating the posterior probability to alleviate this, and the specific formula is as follows:

$$P(x_1, x_2, \ldots, x_n \mid c) = \frac{N_c + \alpha}{N + \alpha n} \qquad (1)$$

in formula (1), α is the smoothing parameter; n is the number of features, equal to the number of words in the corpus; c is a certain category; $x_i$ is the value of the i-th feature, i = 1, 2, 3, …, n; $P(x_1, x_2, \ldots, x_n \mid c)$ is the probability that the feature values of a sample are $x_1, x_2, \ldots, x_n$ given that the category of the sample is known to be c; N is the number of samples in the whole sample set whose feature values are $x_1, x_2, \ldots, x_n$; and $N_c$ is the number of samples in category c whose feature values are $x_1, x_2, \ldots, x_n$.

3.3) after the optimal parameters of the model are determined, the classification model is evaluated by the classification accuracy Acc and the Kappa coefficient on the test data set.

If both evaluation indexes meet the requirements, the model is considered trained and can be used for subsequent prediction: enterprise POI data that have been preprocessed in the same way as the training data are input into the model, and the industry classification of each enterprise is predicted from the words in its name.
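The training and tuning described in step 3 can be sketched in plain Python as follows. This is a hedged illustration rather than the patented implementation: it uses the common textbook form of multinomial naive Bayes, in which the smoothed likelihood of a term is (count of the term in the class + α) / (total term count in the class + α × number of features), which may differ in detail from the formula given in the text. The class and function names are hypothetical, and only α is searched, because the phrase length n would have to be varied at feature-extraction time.

```python
import math

class MultinomialNB:
    """Multinomial naive Bayes with additive smoothing (alpha must be > 0)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        n_features = len(X[0])
        self.classes = sorted(set(y))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [vec for vec, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(y))
            # Total count of each term over all samples of class c.
            term_counts = [sum(col) for col in zip(*rows)]
            denom = sum(term_counts) + self.alpha * n_features
            self.log_like[c] = [math.log((t + self.alpha) / denom) for t in term_counts]
        return self

    def predict(self, X):
        preds = []
        for vec in X:
            score = {c: self.log_prior[c]
                        + sum(v * w for v, w in zip(vec, self.log_like[c]))
                     for c in self.classes}
            preds.append(max(score, key=score.get))
        return preds

def cv_accuracy(X, y, alpha, k=10):
    """Mean accuracy over k folds; every k-th sample forms one fold (needs len(X) >= k)."""
    n, scores = len(X), []
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        model = MultinomialNB(alpha).fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in test])
        scores.append(sum(p == y[i] for p, i in zip(preds, test)) / len(test))
    return sum(scores) / k

def grid_search(X, y, alphas, k=10):
    """Return the alpha with the highest mean cross-validated accuracy."""
    return max(alphas, key=lambda a: cv_accuracy(X, y, a, k))
```

A call such as `grid_search(X, y, [0.1, 0.5, 1.0])` returns the candidate with the best average fold accuracy; ties go to the first candidate listed.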

Step 4) spatial analysis: word segmentation is performed on the enterprise POI data acquired in step 1), place-name words are removed, the processed data are input into the multinomial naive Bayes model trained in step 3), the industry classification of the enterprises in the data set is predicted, and spatial density analysis is performed on the different enterprise categories with a kernel density method. Meanwhile, a regular grid of a specified size is generated according to the topological shape of the area to be researched, as follows: the extent of the minimum circumscribed rectangle is calculated from the topological shape of the area, and the grid is then divided, starting from one vertex of that rectangle, according to a preset cell size (which can be adjusted as needed) to obtain the grid data. The number of enterprises of each industry classification and the pollution index of each soil heavy metal element are counted in the grid as follows: the number of enterprise POI points of each industry classification falling into each grid is counted, and this count represents the enterprise aggregation degree in the grid area; at the same time, the pollution index of each soil heavy metal element (each element separately, if there are several) is determined for each grid: if a grid contains only one survey point, the data of that point directly represent the soil heavy metal element pollution index of the grid; if a grid contains several survey points, the average of the pollution indexes of all survey points in the grid, for a given soil heavy metal element, is taken as the pollution index of that element in the grid area.
After the statistics are complete, spatial analysis is performed with the bivariate spatial autocorrelation method. During the analysis, the grid data must be split by industry category and by soil heavy metal element, and bivariate spatial correlation analysis is carried out on each pair in turn, for example on industry category A and heavy metal B, where the specific choice of A and B can be adjusted according to research needs. The specific analysis formula is as follows:

$$I_i^{ab} = z_i^a \sum_{j} w_{ij}\, z_j^b \qquad (2)$$

in formula (2), $z_i^a$ represents the binarized value of attribute a in grid i, where attribute a is the pollution index of a certain soil heavy metal element in the grid, binarized as follows: when the pollution index is less than or equal to 1, that is, at the level below the warning limit, it is redefined as 0; when the pollution index is greater than 1, that is, at the level of mild, moderate or severe pollution, it is redefined as 1; $z_j^b$ represents the z-score standardized value of attribute b in grid j, where attribute b is the number of enterprise POI points of a certain enterprise category in the grid; $w_{ij}$ is the spatial weight matrix; $I_i^{ab}$ is the local spatial correlation index of attribute a and attribute b at grid i.

If $I_i^{ab}$ is significantly positive, the soil heavy metal pollution degree at grid i is positively correlated with the enterprise aggregation degree in the adjacent range; if $I_i^{ab}$ is significantly negative, the two are negatively correlated; if $I_i^{ab}$ is not significant, the soil heavy metal pollution degree at grid i and the enterprise aggregation degree in the adjacent range are not significantly correlated. From the $I_i^{ab}$ values obtained for the individual grids, a corresponding spatial cluster map can be formed.
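The local index can be sketched for a full rectangular grid as follows. Several points are assumptions of this illustration and not statements of the text: the weights are row-standardized Queen contiguity weights, every cell has both a pollution index and an enterprise count, the enterprise counts are not all identical (otherwise the z-score is undefined), and the significance test (usually a permutation test) is omitted. Function names are hypothetical.

```python
def queen_neighbors(r, c, nrows, ncols):
    """Cells sharing an edge or a corner with cell (r, c)."""
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < nrows and 0 <= c + dc < ncols]

def zscore(grid):
    """z-score standardization over all cells (population standard deviation)."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    return [[(v - mean) / std for v in row] for row in grid]

def bivariate_local_moran(pollution_index, enterprise_count):
    """I_i = a_i * sum_j w_ij * b_j, with a binarized and b z-scored."""
    nrows, ncols = len(pollution_index), len(pollution_index[0])
    # Attribute a: 1 if the pollution index exceeds 1, else 0.
    a = [[1 if v > 1 else 0 for v in row] for row in pollution_index]
    b = zscore(enterprise_count)
    result = []
    for r in range(nrows):
        row = []
        for c in range(ncols):
            nb = queen_neighbors(r, c, nrows, ncols)
            # Row-standardized weights: the lag is the neighbours' mean of b.
            lag = sum(b[i][j] for i, j in nb) / len(nb) if nb else 0.0
            row.append(a[r][c] * lag)
        result.append(row)
    return result

index = bivariate_local_moran([[2.0, 0.5], [0.5, 0.5]], [[0, 4], [4, 4]])
```

In this toy grid only the top-left cell is polluted, and its neighbours all have above-average enterprise counts, so its local index is positive while every other cell is zero.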

Step 5) pollution source judgment: the spatial distribution relationship between the soil heavy metal pollution and the polluted enterprises is judged from the spatial cluster map, which has four attributes: High-High, High-Low, Low-High and Low-Low. If a certain area of the spatial cluster map is High-High, both the soil heavy metal pollution index attribute and the POI count attribute of a certain industry category are high, that is, the soil heavy metal pollution is relatively severe and the enterprises of that industry are relatively concentrated, so the pollution source of that soil heavy metal in the area is judged to be point source pollution possibly caused by those enterprises. The other soil heavy metal pollution index attributes to be identified and the POI count attributes of the other enterprise categories can be judged one by one in the same way. One heavy metal may have several types of enterprise pollution sources, and one type of enterprise may produce pollution by several heavy metals. If, however, the pollution index attribute of a soil heavy metal and the POI count attributes of all enterprise categories are High-Low in a certain area, the heavy metal pollution in that area is not considered to be caused by enterprises, and its source is judged to be possible non-point source pollution.
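The judgment rule of step 5 can be sketched as follows. The labelling convention is an assumption of this illustration: the first part of the label comes from the binarized pollution index of the cell, the second from the sign of the neighbouring enterprise aggregation (the spatial lag), and the significance flag is taken as given. The function names and industry keys are hypothetical.

```python
def cluster_label(polluted, neighbor_lag, significant):
    """Label one grid cell for one (heavy metal, industry) pair."""
    if not significant:
        return "Not significant"
    first = "High" if polluted else "Low"
    second = "High" if neighbor_lag > 0 else "Low"
    return f"{first}-{second}"

def judge_source(labels_by_industry):
    """labels_by_industry maps each industry category to the cluster label
    of the same cell (or area) for one heavy metal element."""
    high_high = [ind for ind, lab in labels_by_industry.items() if lab == "High-High"]
    if high_high:
        # Pollution is high where these industries are concentrated.
        return "possible point source", high_high
    if labels_by_industry and all(lab == "High-Low" for lab in labels_by_industry.values()):
        # Pollution is high but no enterprise category is concentrated nearby.
        return "possible non-point source", []
    return "undetermined", []

verdict = judge_source({
    "metal products": cluster_label(True, 0.6, True),
    "textile": cluster_label(True, -0.3, True),
})
```

Here `verdict` names the metal products category as a possible point source, because its label is High-High even though the textile label is High-Low.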

Next, an area in the southeastern coastal region of China is selected as the research area to demonstrate the method. The main steps are as described above and are not repeated; only the implementation details and results specific to this embodiment are presented.

Example:

In this embodiment, the analysis is performed with the above method, and the specific steps are as follows:

step 1) data acquisition: acquiring the polluted enterprise data, the enterprise POI data, the administrative division topological data and the soil heavy metal pollution data of the research area; the classification of the polluted enterprise data is required to conform to the national economic industry classification standard GB/T 4754-2011; local soil survey data are used as the soil heavy metal pollution data of the research area; the enterprise POI data are required to include longitude and latitude information in the WGS84 coordinate system; in addition, the POI data are downloaded through the Web API of Baidu Maps;

step 2) enterprise data preprocessing: descriptive analysis of the polluted enterprise data acquired in step 1) shows that the enterprise industry classification distribution is severely uneven, as shown in FIG. 1; for the sake of the subsequent modeling, the distribution needs to be balanced to some extent. Based on the result, only three main categories are retained, namely the metal products industry, the chemical raw materials and chemical products manufacturing industry, and the textile industry, and the rest are merged into one category; after category balancing, word segmentation is performed on the enterprise names, and place-name words at or above the county/city level are removed during segmentation; after segmentation, the data are split into a training data set and a test data set at a sample ratio of 8:2;

step 3) enterprise data classification: for the two data sets obtained in step 2), first extracting all words appearing in the two data sets and combining them according to the number n of words forming a phrase (n-gram language model) to form a corpus; counting, according to the corpus, the frequency of the corresponding words in each sample as the text feature of that sample; then training a naive Bayes model with the samples of the training set, during which the number n of words forming a phrase and the Bayes smoothing parameter α are tuned by a grid search based on 10-fold cross-validation, the optimal parameters being selected by the average classification accuracy Acc over the 10 validation sets; after the model parameters are determined, evaluating the model by Acc and the Kappa coefficient k on the test set; the calculated Acc is 86.3% and the Kappa coefficient k is 0.82;

the calculation formula of each index is as follows:

$$Acc = \frac{n_c}{n} \qquad (2)$$

$$k = \frac{Acc - p_e}{1 - p_e} \qquad (3)$$

$$p_e = \frac{\sum_{i=1}^{m} C_i P_i}{n^2} \qquad (4)$$

in formula (2), Acc is the classification accuracy, that is, the ratio of the samples predicted correctly by the model to all samples, where n is the number of all samples and $n_c$ is the number of correctly predicted samples; in formula (3), k is the Kappa coefficient, Acc is given by formula (2), and $p_e$ is given by formula (4); in formula (4), m is the number of categories, $C_i$ is the number of samples whose true category is i, $P_i$ is the number of samples whose predicted category is i, and n is the number of all samples.
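The two evaluation indexes can be computed directly from the true and predicted labels, as in this short sketch. The function names and the toy labels are hypothetical; the arithmetic follows the definitions of Acc, p_e and k given above.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Acc = n_c / n: share of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """k = (Acc - p_e) / (1 - p_e), with p_e = sum_i(C_i * P_i) / n^2."""
    n = len(y_true)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # A category never predicted contributes C_i * 0 to the sum.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (accuracy(y_true, y_pred) - p_e) / (1 - p_e)

y_true = ["metal", "metal", "textile", "textile"]
y_pred = ["metal", "metal", "textile", "metal"]
acc, k = accuracy(y_true, y_pred), kappa(y_true, y_pred)
```

For these four samples Acc is 0.75, p_e is (2×3 + 2×1)/16 = 0.5, and k is therefore 0.5.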

Step 4) Spatial analysis: the POI enterprise data of the study area obtained in step 1) undergo the same word segmentation and text feature extraction as used for the model in step 3), and the category of each POI enterprise is then predicted with the naive Bayes model. After prediction, the POI enterprise data are converted into spatial point data (vector data). The enterprise point data of each category are input into a kernel density model with an output cell size of 1 km and a search radius of 10 km, giving spatial distribution density maps for the different enterprise categories. A regular grid of 1 km × 1 km is generated according to the topological shape of the administrative division of the province, and the soil heavy metal pollution data are loaded into the provincial grid data. Taking metal products enterprises and the soil heavy metal element Cd as an example, the number of metal products enterprises in each grid cell is counted and a Cd value is assigned to each grid cell. A Queen contiguity spatial adjacency relation is generated for the grid cells, and spatial autocorrelation analysis is performed on the point-source data using an improved bivariate Moran's I method, yielding a map of the spatial distribution relationship between enterprises and pollution in the study area;
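The Queen contiguity relation and a bivariate Moran statistic on a regular grid can be sketched as below. This shows the standard bivariate local Moran's I with row-standardized weights, not the improved variant referred to in the text, whose details are not given here. The function names and toy grids are hypothetical; the grid is assumed to be rectangular with at least two cells, neither variable is constant, and taking Cd as the first variable and the enterprise count as the lagged second variable is an assumption.

```python
import math

def zscores(values):
    """Standardize a flat list to zero mean and unit population variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def queen_neighbors(n_rows, n_cols, r, c):
    """Cells sharing an edge or a corner with cell (r, c)."""
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
            and 0 <= r + dr < n_rows and 0 <= c + dc < n_cols]

def bivariate_local_moran(x_grid, y_grid):
    """Map each cell to (z_x, spatial lag of z_y, local bivariate Moran's I)."""
    n_rows, n_cols = len(x_grid), len(x_grid[0])
    zx = zscores([v for row in x_grid for v in row])
    zy = zscores([v for row in y_grid for v in row])
    results = {}
    for r in range(n_rows):
        for c in range(n_cols):
            nbrs = queen_neighbors(n_rows, n_cols, r, c)
            # Row-standardized weights: the lag is the neighbor mean of z_y.
            lag_y = sum(zy[nr * n_cols + nc] for nr, nc in nbrs) / len(nbrs)
            z = zx[r * n_cols + c]
            results[(r, c)] = (z, lag_y, z * lag_y)
    return results

def global_bivariate_moran(results):
    """Global statistic as the mean of the local values."""
    return sum(local for _, _, local in results.values()) / len(results)

# Toy 3 x 3 grids: Cd value per cell and metal products enterprise count.
cd_grid = [[0.9, 0.8, 0.2], [0.7, 0.6, 0.1], [0.2, 0.1, 0.1]]
enterprise_grid = [[5, 4, 0], [3, 4, 1], [0, 1, 0]]
local = bivariate_local_moran(cd_grid, enterprise_grid)
global_i = global_bivariate_moran(local)
```

In practice the local values would also be tested for significance, for example by random permutation, before being mapped; that step is left out of this sketch.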

Step 5) Pollution source judgment: the spatial cluster map of the soil heavy metal Cd pollution degree and the aggregation degree of metal products enterprises in the study area is analyzed to determine the spatial distribution characteristics of point-source and area-source pollution. If a region of the spatial cluster map is High-High, the soil heavy metal Cd in that region is judged to be point-source pollution caused by metal products enterprises. If a region of the spatial cluster map is High-Low, the soil heavy metal Cd there is not point-source pollution caused by metal products enterprises, and it is judged to be area-source pollution that may be caused by livestock farming, dry and wet deposition, chemical fertilizer application, or other types of enterprises.
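The judgment rule of step 5 can be expressed as a small lookup. In this sketch the function names are hypothetical, the first part of each label is assumed to refer to the standardized Cd value of the cell and the second part to the spatially lagged enterprise count, and significance filtering of the clusters is omitted. Only the High-High and High-Low cases are described in the text, so the other two labels are left unjudged.

```python
def cluster_label(z_cd, lag_enterprise):
    """Quadrant label from the standardized Cd value of a cell and the
    spatial lag of the standardized enterprise count around it."""
    first = "High" if z_cd > 0 else "Low"
    second = "High" if lag_enterprise > 0 else "Low"
    return f"{first}-{second}"

def judge_source(label):
    """Apply the step 5 rule to one cluster label."""
    if label == "High-High":
        return "point source (metal products enterprises)"
    if label == "High-Low":
        return "area source (other causes)"
    return "not judged in step 5"

# Toy cells: (standardized Cd value, lagged standardized enterprise count).
cells = {"A": (1.2, 0.8), "B": (1.5, -0.6), "C": (-0.4, 0.9)}
judgments = {name: judge_source(cluster_label(z, lag))
             for name, (z, lag) in cells.items()}
```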

The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims

1. A soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning is characterized by comprising the following steps:

step 4), spatial analysis: performing word segmentation on the enterprise POI data acquired in step 1), removing place-name words, inputting the result into the multinomial naive Bayes model trained in step 3), predicting the industry classification of each enterprise in the data set, and performing spatial density analysis on the different categories of enterprises using a kernel density method; meanwhile, generating a regular grid of a specified size according to the topological shape of the area to be studied, and counting, within each grid cell, the number of enterprises of each industry classification and the pollution index of each soil heavy metal element; then performing spatial analysis using a bivariate spatial autocorrelation method;

2. The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning as claimed in claim 1, wherein in step 2), the enterprise industry category distribution of the data set is adjusted as follows: according to the Pareto principle, the industry categories are sorted from high to low by frequency, the leading industry categories whose cumulative proportion exceeds a threshold are selected as representative categories, and all remaining industry categories are merged into one category, so that the industry category distribution of the samples is balanced.
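The category adjustment of claim 2 can be illustrated by the following sketch. The function name, the label used for the merged category, and the threshold of 0.8 are assumptions made for illustration; the claim does not state a threshold value.

```python
from collections import Counter

def merge_minor_categories(labels, threshold=0.8, other="other"):
    """Keep the most frequent categories until their cumulative share
    exceeds the threshold; relabel all remaining categories as `other`."""
    counts = Counter(labels)
    total = len(labels)
    kept, cumulative = set(), 0
    for category, count in counts.most_common():
        kept.add(category)
        cumulative += count
        if cumulative > threshold * total:
            break
    return [lab if lab in kept else other for lab in labels]

# Toy labels: 5 metal, 3 chemical, 2 textile, 1 food, 1 paper.
labels = (["metal"] * 5 + ["chemical"] * 3 + ["textile"] * 2
          + ["food", "paper"])
merged = merge_minor_categories(labels)
```

With these toy labels the cumulative shares are 5/12, 8/12 and 10/12, so the first three categories are kept and the last two are merged.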

3. The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning as claimed in claim 1, wherein in step 2), the word segmentation engine used to segment the enterprise names is jieba; the removed place-name words include the names of places at or above the county/town level of the administrative divisions.

4. The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning as claimed in claim 1, wherein the specific steps of step 3) are as follows:

5. The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning as claimed in claim 4, wherein in step 4), the regular grid of a specified size is generated according to the topological shape of the area to be studied as follows: the extent of the minimum bounding rectangle is calculated from the topological shape of the area to be studied, and the grid is then divided starting from one vertex of the minimum bounding rectangle according to a preset cell size to obtain grid data; the number of enterprises of each industry classification and the pollution index of each soil heavy metal element in each grid cell are counted as follows: the number of enterprise POI points of each industry classification falling into each grid cell is counted, this count representing the degree of enterprise aggregation in the grid cell; at the same time, the pollution index of each soil heavy metal element in each grid cell is determined, and if a grid cell contains several survey points, then for a given soil heavy metal element the mean of that element's pollution index over all survey points in the cell is taken as the pollution index of that element for the cell; the grid data are divided by industry category and by soil heavy metal element for bivariate spatial correlation analysis, the specific analysis formula being as follows:

in the formula (2), the first and second groups,

Values, form a corresponding spatial cluster map.
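The grid construction and per-cell statistics described in claim 5 can be sketched as follows. This is an illustration under several assumptions: the study area is given as a list of polygon vertices in projected coordinates, the minimum bounding rectangle is taken to be axis-aligned, the grid starts at its lower-left vertex, cells outside the polygon are not masked out, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def bounding_rectangle(polygon):
    """Axis-aligned bounding rectangle (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def grid_statistics(polygon, enterprise_points, survey_points, cell_size):
    """Count enterprise points and average survey values per grid cell.

    enterprise_points: iterable of (x, y)
    survey_points: iterable of (x, y, pollution_index)
    Returns ((n_rows, n_cols), counts, means) keyed by (row, col).
    """
    min_x, min_y, max_x, max_y = bounding_rectangle(polygon)
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)

    def cell_of(x, y):
        # Points on the upper or right edge are clamped into the last cell.
        col = min(int((x - min_x) // cell_size), n_cols - 1)
        row = min(int((y - min_y) // cell_size), n_rows - 1)
        return row, col

    counts = defaultdict(int)
    for x, y in enterprise_points:
        counts[cell_of(x, y)] += 1

    values = defaultdict(list)
    for x, y, index in survey_points:
        values[cell_of(x, y)].append(index)
    # Several survey points in one cell are reduced to their mean.
    means = {cell: sum(v) / len(v) for cell, v in values.items()}
    return (n_rows, n_cols), dict(counts), means

# Toy study area of 3 km x 2 km with 1 km cells (coordinates in km).
area = [(0, 0), (3, 0), (3, 2), (0, 2)]
enterprises = [(0.5, 0.5), (0.7, 0.2), (2.5, 1.5)]
surveys = [(0.1, 0.1, 0.4), (0.9, 0.9, 0.6), (2.2, 1.1, 1.0)]
shape, counts, means = grid_statistics(area, enterprises, surveys, 1.0)
```

The resulting count and mean per cell are the two variables that would enter the bivariate spatial correlation analysis.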