CN108595414B - Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning - Google Patents


Publication number
CN108595414B
CN108595414B CN201810239430.7A CN201810239430A CN108595414B CN 108595414 B CN108595414 B CN 108595414B CN 201810239430 A CN201810239430 A CN 201810239430A CN 108595414 B CN108595414 B CN 108595414B
Authority
CN
China
Prior art keywords
enterprise
heavy metal
pollution
grid
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810239430.7A
Other languages
Chinese (zh)
Other versions
CN108595414A (en
Inventor
史舟
徐烨
贾晓琳
尤其浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810239430.7A priority Critical patent/CN108595414B/en
Publication of CN108595414A publication Critical patent/CN108595414A/en
Application granted granted Critical
Publication of CN108595414B publication Critical patent/CN108595414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Processing Of Solid Wastes (AREA)

Abstract

The invention discloses a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning. First, polluting-enterprise data, enterprise POI data and heavy metal pollution data are acquired for the study area. The industry category distribution of the data set is adjusted, the enterprise names are segmented into words and place-name words are removed, and the data are split into a training data set and a test data set. A corpus is then built from the two data sets, the frequency of each corpus word in each sample is counted and used as that sample's text feature, a multinomial naive Bayes model is trained on the training samples, and the model is evaluated by its score on the test set. Finally, industry categories are predicted for the acquired enterprise POI data, enterprise counts and heavy metal pollution indexes are aggregated in a grid generated from the topological shape of the study area, and a bivariate spatial autocorrelation analysis is used to determine the spatial relationship between pollution and enterprises and to identify heavy metal point source and non-point source pollution areas in the study area.

Description

Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning
Technical Field
The invention relates to a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning, and in particular to a method that combines text-mining-based classification with bivariate spatial autocorrelation analysis.
Background Art
With modern industrialization, some non-compliant enterprises discharge the "three industrial wastes" (waste gas, waste water and solid waste) indiscriminately, causing serious environmental pollution; soil heavy metal pollution in particular has become a worldwide environmental problem. According to survey results, the overall exceedance rate of soil pollution in China is 16.1%, heavy metals are the dominant pollutant type, and about two million hectares of cultivated farmland have been damaged. Farmland soil pollution is mainly divided into point source pollution and non-point source pollution. Non-point source pollution is soil pollution that arises through soil erosion, surface runoff and similar pathways without a fixed discharge point. Point source pollution has a fixed emission source and an identifiable extent, and is therefore easier to control and manage than non-point source pollution; enterprise pollution is point source pollution. Many research methods and models exist for analysing the sources of soil heavy metal pollution. For example, Wangzhou pine and Qiyong (Wangzhou pine, Qiyong. Xuzhou city surface soil heavy metal environmental risk measure and source analysis [J]. Geochemistry, 2006, 35(1): 88-94.) used factor analysis and cluster analysis to determine the sources and categories of heavy metal elements in the surface soil of a study area; Saby et al. (Saby N P, Thioulouse J, Jolivet C, et al. Multivariate analysis of the spatial patterns of 8 trace elements using the French soil monitoring network data [J]. Science of the Total Environment, 2009, 407(21): 5644-: the parent material of the soil, the texture of the soil, the weathering of the soil and anthropogenic factors.
However, these source analysis methods and models have shortcomings. Traditional statistical and chemical methods, such as correlation analysis, principal component analysis, cluster analysis and factor analysis, ignore the spatial location of heavy metal pollution and are therefore of limited help in preventing and controlling soil heavy metal pollution. Combining spatial interpolation with traditional multivariate statistics does not provide reliable quantitative analysis and cannot adequately handle the spatial variability of pollution. Moreover, because the sources and accumulation mechanisms of enterprise-caused soil heavy metal pollution are very complex, preventing, controlling and remediating enterprise pollution is currently very difficult. A bivariate spatial autocorrelation model (bivariate Moran's I) can therefore be used to study the spatial correlation between soil heavy metal pollution and enterprise distribution, providing effective guidance for managing and controlling enterprise pollution.
However, because of data silos, cooperation on data between departments is difficult and enterprise information is hard to obtain. A classification method based on text mining can therefore be used to identify the industry category of a polluting enterprise from its name, which then serves as the basis for studying the relationship between enterprise distribution and pollution distribution. Text classification is an important data mining method: a classification function or model is built from a certain amount of existing data, and text of unknown category is then assigned to predefined categories, under a given classification system, according to its content. For example, a convolutional text classification model has been used to classify massive point-of-interest (POI) data (Zhang, ZhaJ, L-channel, C-Text Classification) 647.
The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning is built on text mining: a classification model is established mainly with the multinomial naive Bayes method, and bivariate spatial autocorrelation analysis of the classified enterprise data and the local pollution data guides research on the relationship between enterprise distribution and pollution distribution.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning. The specific technical scheme is as follows:
The method for identifying soil heavy metal enterprise pollution sources based on source-sink space variable reasoning comprises the following steps:
Step 1) Data acquisition: polluting-enterprise data, enterprise POI data and soil heavy metal pollution data are acquired for the study area. The polluting-enterprise data comprise enterprise names and their corresponding industry categories; the enterprise POI data comprise the names and the longitude and latitude of all enterprises in the study area; the soil heavy metal pollution data are soil survey data for the study area and comprise the pollution index of each soil heavy metal element together with longitude and latitude.
Step 2) Enterprise data preprocessing: a descriptive analysis is performed on the polluting-enterprise data acquired in step 1), and the industry category distribution of the data set is adjusted according to the result so that the category distribution of the enterprise samples is evened out; the enterprise names are then segmented into words and place-name words are removed; finally, the data are split into a training data set and a test data set at a fixed ratio.
Step 3) Enterprise data classification: from the result of step 2), the set of all words or phrases appearing in the training and test data sets is extracted as the corpus; based on the corpus, the frequency of each word in the enterprise name of each sample is counted to obtain that sample's text features; a multinomial naive Bayes model is trained on the training samples to obtain the optimal model parameters; and the model is evaluated by its score on the test data set.
Step 4) Spatial analysis: the POI enterprise data acquired in step 1) are segmented into words, place-name words are removed, and the result is fed into the multinomial naive Bayes model trained in step 3) to predict the industry category of each enterprise in the data set; the spatial density of the different enterprise categories is analysed with the kernel density method; at the same time, a regular grid of a specified size is generated from the topological shape of the study area, and the number of enterprises of each industry category and the pollution index of each soil heavy metal element are aggregated in the grid; spatial analysis is then performed with the bivariate spatial autocorrelation method.
Step 5) Pollution source judgment: the spatial relationship between soil heavy metal pollution and polluting enterprises is analysed, the distribution of point source and non-point source pollution in the study area is determined, and enterprise pollution sources are identified.
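As an illustration of the text feature extraction in step 3), the following Python sketch builds a numbered corpus from segmented enterprise names and turns one name into a word-frequency vector. It is a minimal sketch under stated assumptions, not the patented implementation: the function names (`ngrams`, `build_vocabulary`, `vectorize`), the English sample tokens, the 0-based numbering and the choice of forming phrases from 1 to n consecutive words are all assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All phrases of 1..n consecutive tokens (assumed reading of 'words or phrases')."""
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def build_vocabulary(samples, n):
    """Number every distinct word or phrase (0-based here; the text numbers from 1)."""
    vocab = {}
    for tokens in samples:
        for term in ngrams(tokens, n):
            vocab.setdefault(term, len(vocab))
    return vocab

def vectorize(tokens, vocab, n):
    """Term-frequency vector with one dimension per corpus entry."""
    vec = [0] * len(vocab)
    for term, freq in Counter(ngrams(tokens, n)).items():
        if term in vocab:  # terms outside the corpus are ignored
            vec[vocab[term]] = freq
    return vec

# Hypothetical segmented enterprise names (place names already removed).
samples = [["metal", "products"], ["textile", "mill"], ["metal", "plating"]]
vocab = build_vocabulary(samples, n=2)
features = vectorize(["metal", "products"], vocab, n=2)
print(len(vocab), features)
```

Building the corpus from both the training and the test names, as the text describes, means no test term is unknown at vectorization time; terms from new POI names can still fall outside the corpus, which is why the sketch ignores them.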
Preferably, in step 2), the enterprise industry category distribution of the data set is adjusted as follows: following the Pareto principle, the industry categories in the analysis result are sorted by frequency from high to low, the leading categories whose cumulative share exceeds a threshold are selected as representative categories, and all remaining categories are merged into a single category, so that the category distribution of the samples is evened out.
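The Pareto-style merging of rare categories can be sketched in a few lines of Python. This is only an illustration: the function name `merge_rare_categories`, the label `"other"`, the 0.8 default threshold and the sample labels are assumptions chosen for the example.

```python
from collections import Counter

def merge_rare_categories(labels, threshold=0.8, other_label="other"):
    """Keep the most frequent categories until their cumulative share
    reaches `threshold`; relabel every remaining category as `other_label`."""
    counts = Counter(labels)
    total = len(labels)
    kept, cumulative = set(), 0.0
    # most_common() yields categories from highest to lowest frequency.
    for category, count in counts.most_common():
        if cumulative >= threshold:
            break
        kept.add(category)
        cumulative += count / total
    return [lab if lab in kept else other_label for lab in labels]

# Hypothetical, heavily skewed industry labels.
labels = (["metal"] * 40 + ["chemical"] * 30 + ["textile"] * 15
          + ["paper"] * 10 + ["food"] * 5)
merged = merge_rare_categories(labels, threshold=0.8)
print(Counter(merged))
```

In this example the three most frequent categories together cover 85% of the samples, so they are kept and the remaining 15 samples are merged into one category.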
Preferably, in step 2), the word segmentation engine used for the enterprise names is jieba, and the place-name words removed include place names at or above the county/town level of the administrative division.
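A minimal sketch of the place-name removal is given here. To keep the example free of third-party dependencies, it starts from a token list that is assumed to have been produced already by a segmenter (for instance `jieba.lcut(name)`); the gazetteer `place_names`, the enterprise name and the function name `remove_place_names` are hypothetical.

```python
def remove_place_names(tokens, place_names):
    """Drop tokens that appear in a gazetteer of administrative place names."""
    return [tok for tok in tokens if tok not in place_names]

# Hypothetical gazetteer and a hypothetical, already segmented enterprise name.
# In practice the tokens would come from a segmenter such as jieba.lcut(name).
place_names = {"浙江省", "杭州市", "萧山区"}
tokens = ["杭州市", "萧山区", "宏达", "金属", "制品", "有限公司"]
cleaned = remove_place_names(tokens, place_names)
print(cleaned)
```

Removing place names keeps the classifier from associating an industry with a locality simply because many enterprises of that industry happen to be registered there.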
Preferably, the specific steps of step 3) are as follows:
3.1) Extracting text features: first, find the set of words or n-gram phrases in the training and test data sets, N words or phrases in total; then number the words or phrases from 1 to N and use the numbered set as the corpus; finally, construct an N-dimensional vector for each sample in the training and test data sets, in which the value of the m-th dimension is the frequency in that sample of the word numbered m. This N-dimensional vector is the extracted text feature;
3.2) Training the multinomial naive Bayes model: using the text features of the training set, tune the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model by grid search with 10-fold cross-validation; the cross-validation metric is classification accuracy, and the parameter combination with the highest mean classification accuracy is selected as optimal;
3.3) After the optimal model parameters have been determined, evaluate the model by its classification accuracy Acc and Kappa coefficient on the test data set.
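The two evaluation measures named in step 3.3) can be computed as in the following sketch. The Kappa coefficient is taken here to be Cohen's kappa, which is the usual meaning in classification work but is an assumption about the patent's intent; the function names and the sample labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).
    Undefined (division by zero) when chance agreement p_e equals 1."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical test-set labels and predictions.
y_true = ["metal", "metal", "textile", "textile", "other", "other"]
y_pred = ["metal", "textile", "textile", "textile", "other", "metal"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa is useful alongside accuracy because the merged "other" category can still dominate the data, in which case a high accuracy may partly reflect chance agreement.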
Preferably, in step 4), the regular grid of a specified size is generated from the topological shape of the study area as follows: the extent of the minimum bounding rectangle of the study area is computed, and the grid is then laid out from one vertex of that rectangle according to a preset cell size to obtain the grid data. The number of enterprises of each industry category and the pollution index of each soil heavy metal element are aggregated in the grid as follows: the number of enterprise POI points of each industry category falling in each grid cell is counted, and this count represents the degree of enterprise aggregation in the cell; at the same time, the pollution index of each soil heavy metal element is aggregated in each cell, and if a cell contains several survey points, the mean of the pollution indexes of that element over all survey points in the cell is taken as the cell's pollution index for the element. The grid data are then analysed by bivariate spatial correlation for each combination of industry category and soil heavy metal element, using the following formula:

$$I_{ab}^{i} = z_{a}^{i} \sum_{j} w_{ij}\, z_{b}^{j} \qquad (2)$$

In formula (2), $z_{a}^{i}$ is the binarized value of attribute a in grid cell i, where attribute a is the pollution index of a given soil heavy metal element in the cell; the binarization sets the pollution index to 0 when it is less than or equal to 1 and to 1 when it is greater than 1. $z_{b}^{j}$ is the z-score-standardized value of attribute b in grid cell j, where attribute b is the number of enterprise POI points of a given enterprise category in the cell. $w_{ij}$ is the spatial weight matrix, and $I_{ab}^{i}$ is the local spatial correlation index of attributes a and b at cell i. If $I_{ab}^{i}$ is significantly positive, the degree of soil heavy metal pollution at cell i is positively correlated with the degree of enterprise aggregation in the neighbouring cells; if $I_{ab}^{i}$ is significantly negative, the two are negatively correlated; if $I_{ab}^{i}$ is not significant, there is no significant correlation between the degree of soil heavy metal pollution at cell i and the degree of enterprise aggregation in the neighbouring cells. The $I_{ab}^{i}$ values obtained for the individual cells form the corresponding spatial cluster map.
Preferably, in step 5), the spatial relationship between soil heavy metal pollution and polluting enterprises is judged from the spatial cluster map. If, in some area of the map, the pollution index attribute of a soil heavy metal and the POI count attribute of an enterprise category are High-High, the soil heavy metal pollution in that area is judged to be possible point source pollution caused by enterprises; if the pollution index attribute of the soil heavy metal in the area is High-Low with respect to the POI count attributes of all enterprise categories, the pollution is judged to be possible non-point source pollution.
The method has the following advantages. The model is built from the classification data of known polluting enterprises, so the category of a polluting enterprise can be identified directly from its name in public POI data. Based on the predicted categories, the spatial relationship between the distribution of each type of enterprise and soil pollution by different heavy metal elements is analysed by building grid data and applying bivariate spatial autocorrelation, so that discrete soil heavy metal pollution points and enterprise points can be analysed accurately. This extends existing analysis methods and ideas and has important theoretical, practical and application value for managing and controlling enterprise pollution.
Drawings
FIG. 1 shows the industry category distribution of the polluting-enterprise data in the embodiment.
FIG. 2 shows the spatial distribution density of the predicted polluting-enterprise categories in the embodiment (a. textile industry; b. metal products industry; c. chemical raw materials and chemical products manufacturing industry; d. other industries).
FIG. 3 is the spatial cluster map of the degree of soil heavy metal Cd pollution and the degree of aggregation of metal products enterprises in the study area in the embodiment.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning comprises the following steps:
Step 1) Data acquisition: polluting-enterprise data, enterprise POI data and soil heavy metal pollution data are acquired for the study area. The polluting-enterprise data comprise enterprise names and their corresponding industry categories, which conform to the national economic industry classification standard GB/T4754-2011. The enterprise POI data should include the names of all enterprises in the study area and the longitude and latitude of each enterprise. The soil heavy metal pollution data are soil survey data for the study area and comprise the soil heavy metal pollution index and the longitude and latitude of each survey point. When pollution sources of several heavy metals are to be identified, the survey data must include the pollution index of each of those elements.
Step 2) Enterprise data preprocessing: a descriptive analysis is performed on the polluting-enterprise data acquired in step 1) to obtain the distribution of enterprises over industry categories. Because the industry category distribution of the original polluting-enterprise data is severely unbalanced, only a few representative categories (which together account for about 80% of the samples) are extracted and the remaining categories are merged into one, so that the sample distribution is evened out as far as possible. In this embodiment the adjustment is performed as follows: the industry categories in the analysis result are sorted by frequency from high to low, the leading categories whose cumulative share exceeds a threshold (80% may be used) are selected as representative categories, and all other categories are merged into one category. After the adjustment, the enterprise names are segmented into words with the jieba segmentation engine (the engine is trained with a hidden Markov model and segments well), and words in the enterprise name that are place names at or above the village/town level of the administrative division are removed. Finally, the processed enterprise data are split into a training data set and a test data set at a sample ratio of 8:2; each data set contains, for every enterprise, the words remaining in its name after segmentation and removal, together with its industry category.
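The 8:2 split at the end of step 2) might look like the following sketch. The text does not say how samples are assigned to the two sets, so the random shuffle with a fixed seed is an assumption, as are the function name `split_train_test` and the sample data.

```python
import random

def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at `train_ratio`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (tokens, industry category) pairs.
data = [(["metal", "products"], "metal")] * 5 + [(["textile", "mill"], "textile")] * 5
train, test = split_train_test(data)
print(len(train), len(test))
```

A stratified split, which preserves the category proportions in both sets, would be a reasonable alternative when some categories remain small after merging.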
Step 3) Enterprise data classification: from the result of step 2), the set of all words or phrases appearing in the training and test data sets is extracted as the corpus; based on the corpus, the frequency of each word in the enterprise name of each sample is counted to obtain that sample's text features; a multinomial naive Bayes model is trained on the training samples to obtain the optimal model parameters; and the model is evaluated by its score on the test data set. The specific steps are as follows:
3.1) Extracting text features: first, find the set of words or n-gram phrases in the training and test data sets, N words or phrases in total; then number the words or phrases from 1 to N and use the numbered set as the corpus; finally, construct an N-dimensional vector for each sample in the training and test data sets, in which the value of the m-th dimension is the frequency in that sample of the word numbered m. This N-dimensional vector is the extracted text feature;
3.2) Training the multinomial naive Bayes model: using the text features of the training set, tune the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model by grid search with 10-fold cross-validation; the cross-validation metric is classification accuracy, and the parameter combination with the highest mean classification accuracy is selected as optimal;
The hyper-parameter n and the naive Bayes smoothing parameter α are explained as follows:
The feature parameter n is a way of enlarging the corpus after word segmentation. Suppose segmentation yields m words in total; in practice m may be very small, in which case the resulting classification model performs poorly. Starting from these m words, n consecutive words can be combined in order into a new term, which enlarges the corpus. Clearly n should not be too large; it is a positive integer with a minimum of 1, so it is tuned by grid search;
The naive Bayes smoothing parameter α is a means of handling new words. Naive Bayes modelling relies on a corpus, and even if the corpus is enlarged with the hyper-parameter n it cannot cover every possible term. The features of new words are therefore lost when they are vectorized, and overfitting becomes more likely, so smoothing is introduced when computing the posterior probability to mitigate this. The formula is as follows:
$$P(x_1, x_2, \ldots, x_n \mid c) = \frac{N_c + \alpha}{N + \alpha n} \qquad (1)$$
In formula (1), α is the smoothing parameter; n is the number of features, equal to the number of words in the corpus; c is a category; $x_i$ is the value of the i-th feature, i = 1, 2, 3, …, n; $P(x_1, x_2, \ldots, x_n \mid c)$ is the probability that a sample has feature values $x_1, x_2, \ldots, x_n$ given that its category is c; N is the number of samples in the whole sample set whose feature values are $x_1, x_2, \ldots, x_n$; and $N_c$ is the number of samples in category c whose feature values are $x_1, x_2, \ldots, x_n$.
3.3) After the optimal model parameters have been determined, the classification model is evaluated by its classification accuracy Acc and Kappa coefficient on the test data set.
If all of the model's classification accuracy measures meet the requirements, the model is considered trained and can be used for prediction: POI enterprise data preprocessed in the same way as the training data are fed into the model, which predicts the industry category of each enterprise from the words in its name.
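The tuning procedure of step 3.2) can be sketched end to end in plain Python. This is a hedged illustration rather than the patent's implementation: the naive Bayes estimator uses the additive smoothing commonly found in multinomial naive Bayes implementations, which may differ in detail from formula (1); the fold assignment, the function names and the toy data are assumptions; and α must be greater than 0 to avoid taking the logarithm of zero.

```python
import math
import random
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All phrases of 1..n consecutive tokens (assumed meaning of parameter n)."""
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def train_nb(docs, labels):
    """Collect per-class term counts, per-class sample counts and the vocabulary."""
    term_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for terms, label in zip(docs, labels):
        term_counts[label].update(terms)
        vocab.update(terms)
    return term_counts, class_counts, vocab

def predict_nb(model, terms, alpha):
    """Return the class with the highest smoothed log-posterior (alpha > 0)."""
    term_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, doc_count in class_counts.items():
        denom = sum(term_counts[c].values()) + alpha * len(vocab)
        score = math.log(doc_count / total_docs)
        for t in terms:
            if t in vocab:  # terms never seen in training are skipped
                score += math.log((term_counts[c][t] + alpha) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

def cv_accuracy(samples, labels, n, alpha, folds=10, seed=0):
    """Mean classification accuracy over `folds` cross-validation folds."""
    docs = [ngrams(tokens, n) for tokens in samples]
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        test_idx = order[f::folds]          # every folds-th sample is held out
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        model = train_nb([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict_nb(model, docs[i], alpha) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds

def grid_search(samples, labels, n_values, alpha_values, folds=10):
    """Pick the (n, alpha) pair with the highest mean cross-validated accuracy."""
    grid = [(n, a) for n in n_values for a in alpha_values]
    return max(grid, key=lambda p: cv_accuracy(samples, labels, p[0], p[1], folds))

# Hypothetical toy data: 20 segmented names in two categories.
samples = [["metal", "products"], ["textile", "mill"]] * 10
labels = ["metal", "textile"] * 10
best_n, best_alpha = grid_search(samples, labels, n_values=[1, 2], alpha_values=[0.1, 1.0])
print(best_n, best_alpha)
```

The sketch needs at least as many samples as folds. On real data a library implementation of multinomial naive Bayes and grid search would normally be preferred; the hand-written version is used here only to keep the example self-contained.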
Step 4) Spatial analysis: the POI enterprise data acquired in step 1) are segmented into words and place-name words are removed; the processed data are fed into the multinomial naive Bayes model trained in step 3) to predict the industry category of each enterprise in the data set, and the spatial density of the different enterprise categories is analysed with the kernel density method. At the same time, a regular grid of a specified size is generated from the topological shape of the study area: the extent of the minimum bounding rectangle of the study area is computed, and the grid is laid out from one vertex of that rectangle according to a preset cell size (adjustable as needed) to obtain the grid data. The number of enterprises of each industry category and the pollution index of each soil heavy metal element are then aggregated in the grid: the number of enterprise POI points of each industry category falling in each grid cell is counted, and this count represents the degree of enterprise aggregation in the cell; the pollution index of each soil heavy metal element (each element separately, if there are several) is also aggregated per cell. If a cell contains only one survey point, that point's value is used directly as the cell's pollution index; if a cell contains several survey points, the mean of the pollution indexes of the element over all survey points in the cell is used.
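The grid construction and per-cell aggregation described above can be sketched as follows. The sketch makes several assumptions that go beyond the text: the minimum bounding rectangle is taken to be axis-aligned, coordinates are treated as planar (longitude and latitude would normally be projected first), the grid starts at the lower-left vertex, and all function names and sample values are hypothetical.

```python
import math
from collections import defaultdict

def build_grid(boundary, cell_size):
    """Origin and dimensions of a grid covering the axis-aligned bounding
    rectangle of the boundary vertices."""
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    min_x, min_y = min(xs), min(ys)
    n_cols = math.ceil((max(xs) - min_x) / cell_size)
    n_rows = math.ceil((max(ys) - min_y) / cell_size)
    return min_x, min_y, n_rows, n_cols

def cell_of(x, y, min_x, min_y, cell_size):
    """(row, column) of the cell containing (x, y). Points lying exactly on
    the upper or right edge of the rectangle fall just outside the grid."""
    return (int((y - min_y) // cell_size), int((x - min_x) // cell_size))

def count_enterprises(points, min_x, min_y, cell_size):
    """Number of enterprise POI points per cell (enterprise aggregation degree)."""
    counts = defaultdict(int)
    for x, y in points:
        counts[cell_of(x, y, min_x, min_y, cell_size)] += 1
    return dict(counts)

def mean_pollution(survey, min_x, min_y, cell_size):
    """Mean pollution index per cell over all survey points in the cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, index in survey:
        acc = sums[cell_of(x, y, min_x, min_y, cell_size)]
        acc[0] += index
        acc[1] += 1
    return {cell: total / k for cell, (total, k) in sums.items()}

# Hypothetical study area (4 x 3 units), POI points and survey points.
boundary = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
min_x, min_y, n_rows, n_cols = build_grid(boundary, cell_size=1.0)
poi = [(0.5, 0.5), (0.7, 0.2), (3.2, 2.9)]
survey = [(0.5, 0.5, 0.8), (0.6, 0.4, 1.6), (3.5, 2.5, 2.0)]
counts = count_enterprises(poi, min_x, min_y, 1.0)
pollution = mean_pollution(survey, min_x, min_y, 1.0)
print(n_rows, n_cols, counts, pollution)
```

A cell with a single survey point simply gets that point's value, because the mean of one value is the value itself, so the two cases in the text need no separate handling.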
After the aggregation is complete, spatial analysis is performed with the bivariate spatial autocorrelation method. The grid data must be analysed by bivariate spatial correlation separately for each combination of industry category and soil heavy metal element in turn, for example for heavy metal A and heavy metal B, where the choice of A and B can be adjusted to the needs of the study. The formula is as follows:

$$I_{ab}^{i} = z_{a}^{i} \sum_{j} w_{ij}\, z_{b}^{j} \qquad (2)$$

In formula (2), $z_{a}^{i}$ is the binarized value of attribute a in grid cell i, where attribute a is the pollution index of a given soil heavy metal element in the cell. The binarization is as follows: when the pollution index is less than or equal to 1, that is, at the level below the warning limit, it is redefined as 0; when the pollution index is greater than 1, that is, at one of the levels of mild, moderate or severe pollution, it is redefined as 1. $z_{b}^{j}$ is the z-score-standardized value of attribute b in grid cell j, where attribute b is the number of enterprise POI points of a given enterprise category in the cell. $w_{ij}$ is the spatial weight matrix, and $I_{ab}^{i}$ is the local spatial correlation index of attributes a and b at cell i. If $I_{ab}^{i}$ is significantly positive, the degree of soil heavy metal pollution at cell i is positively correlated with the degree of enterprise aggregation in the neighbouring cells; if $I_{ab}^{i}$ is significantly negative, the two are negatively correlated; if $I_{ab}^{i}$ is not significant, there is no significant correlation between the degree of soil heavy metal pollution at cell i and the degree of enterprise aggregation in the neighbouring cells. From the $I_{ab}^{i}$ values obtained for the individual cells, the corresponding spatial cluster map can be formed.
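The local index described for formula (2) can be illustrated with a small Python sketch: the pollution index at each cell is binarized, the enterprise counts are z-score standardized, and the binarized value is multiplied by the weighted sum of the standardized counts in the neighbouring cells. The text does not specify the spatial weights, so row-standardized rook contiguity (cells sharing an edge) is an assumption here; the function names and the 2 x 2 example are hypothetical, and the significance test (usually done by permutation) is omitted.

```python
import math

def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation.
    Fails with a division by zero if all values are equal."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def rook_weights(n_rows, n_cols):
    """Row-standardized weights; cells sharing an edge are neighbours (assumed)."""
    weights = {}
    for r in range(n_rows):
        for c in range(n_cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols]
            weights[(r, c)] = {nb: 1.0 / len(nbrs) for nb in nbrs}
    return weights

def bivariate_local_moran(pollution, counts, n_rows, n_cols):
    """Local index per cell: binarized pollution at the cell times the
    weighted sum of standardized enterprise counts in neighbouring cells."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    a = {cell: 1.0 if pollution[cell] > 1 else 0.0 for cell in cells}
    zb = dict(zip(cells, zscore([counts[cell] for cell in cells])))
    w = rook_weights(n_rows, n_cols)
    return {i: a[i] * sum(w_ij * zb[j] for j, w_ij in w[i].items()) for i in cells}

# Hypothetical 2 x 2 grid: pollution index and enterprise count per cell.
pollution = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 2.0}
counts = {(0, 0): 0, (0, 1): 4, (1, 0): 4, (1, 1): 0}
local_i = bivariate_local_moran(pollution, counts, 2, 2)
print(local_i)
```

Because the pollution attribute is binarized to 0 or 1, cells at or below the warning limit always get an index of 0, and for polluted cells the sign of the index follows the enterprise density of their neighbours.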
Step 5) Pollution source judgment: the spatial relationship between soil heavy metal pollution and polluting enterprises is judged from the spatial cluster map, which has four attributes: High-High, High-Low, Low-High and Low-Low. If an area of the map is High-High, both the pollution index attribute of the soil heavy metal and the POI count attribute of the enterprise category are high there; that is, soil heavy metal pollution is relatively severe and enterprises of that category are relatively concentrated, so the soil heavy metal pollution in the area is judged to be possible point source pollution caused by enterprises. The pollution index attributes of the other soil heavy metals to be identified and the POI count attributes of the other enterprise categories can be judged one by one in the same way. The same heavy metal may come from several types of enterprise, and the same type of enterprise may produce pollution by several heavy metals. If, however, the pollution index attribute of a soil heavy metal in an area is High-Low with respect to the POI count attributes of all enterprise categories, the pollution in that area is not considered to be caused by enterprises and is judged to be possible non-point source pollution.
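The judgment rule of step 5) can be written down as a small decision function. This is a sketch under assumptions: a cell is labelled "High" for pollution when its binarized index is 1 and "High" for enterprises when the weighted neighbourhood value of the standardized counts is positive, significance filtering is left out, and the function names and the "undetermined" outcome are hypothetical additions for cases the text does not cover.

```python
def cluster_label(a_binary, lag_zb):
    """Cluster type from binarized pollution and neighbourhood enterprise density."""
    first = "High" if a_binary == 1 else "Low"
    second = "High" if lag_zb > 0 else "Low"
    return f"{first}-{second}"

def judge_source(labels_by_category):
    """labels_by_category maps each industry category to the cluster label of
    one area for one heavy metal."""
    suspects = [cat for cat, lab in labels_by_category.items() if lab == "High-High"]
    if suspects:
        return "possible point source", suspects
    if labels_by_category and all(lab == "High-Low" for lab in labels_by_category.values()):
        return "possible non-point source", []
    return "undetermined", []

# Hypothetical labels for one area and one heavy metal.
print(judge_source({"metal": "High-High", "textile": "High-Low"}))
print(judge_source({"metal": "High-Low", "textile": "High-Low"}))
```

Returning the list of High-High categories reflects the remark that one heavy metal may be associated with several types of enterprise.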
Next, an area in the coastal region of southeast China is selected as the research area to demonstrate the method. The main steps are as described above and are not repeated; only the implementation details and results specific to this embodiment are presented.
Example:
In this embodiment, the analysis is carried out with the above method; the specific steps are as follows:
Step 1) data acquisition: acquire the polluting enterprise data, the enterprise POI data, the administrative division topological data and the soil heavy metal pollution data of the research area. The industry classification of the polluting enterprise data must conform to the national economic industry classification standard GB/T 4754-2011; local soil survey data are used as the soil heavy metal pollution data of the research area; the enterprise POI data must contain longitude and latitude information in the WGS84 coordinate system. The POI data are downloaded through the Web API of Baidu Map;
Step 2) enterprise data preprocessing: descriptive analysis of the polluting enterprise data acquired in step 1) shows that the distribution of enterprise industry categories is severely unbalanced, as shown in fig. 1, so the categories need to be balanced to some extent for the sake of the subsequent modeling. According to the result, only three main categories are retained (metal products industry; chemical raw materials and chemical products manufacturing industry; textile industry) and all the rest are merged into one category. After class balancing, the enterprise names are segmented into words, and place-name words at or above the county/city level are removed during segmentation. After word segmentation, the data are split into a training data set and a test data set at a sample ratio of 8:2;
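The preprocessing in step 2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the enterprise names are taken to be already segmented into tokens (for example with jieba), the cumulative-share threshold of 0.75, the label "other", the place-name set and all sample values are hypothetical, and the function names are not from the patent.

```python
import random
from collections import Counter

def merge_minor_classes(labels, threshold, other="other"):
    """Keep the most frequent classes until their cumulative share reaches
    `threshold`; relabel every remaining class as `other`."""
    counts = Counter(labels)
    keep, running = set(), 0
    for name, n in counts.most_common():
        if running >= threshold * len(labels):
            break
        keep.add(name)
        running += n
    return [lab if lab in keep else other for lab in labels]

def remove_place_names(tokens, place_names):
    """Drop tokens that are place names (county/city level or above)."""
    return [tok for tok in tokens if tok not in place_names]

def split_train_test(samples, train_share=0.8, seed=0):
    """Shuffle reproducibly, then cut into training and test sets (8:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample: ten enterprises in four industry categories.
labels = ["metal"] * 5 + ["chemical"] * 3 + ["textile"] + ["food"]
merged = merge_minor_classes(labels, threshold=0.75)
tokens = remove_place_names(["Hangzhou", "Hongda", "metal", "products"],
                            {"Zhejiang", "Hangzhou"})
train, test = split_train_test(list(zip(labels, merged)))
print(Counter(merged), tokens, len(train), len(test))
```

Fixing the random seed keeps the split reproducible, which matters when the same training and test sets are reused for parameter tuning and final evaluation.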
Step 3) enterprise data classification: for the two data sets obtained in step 2), first extract the combinations of all words appearing in them, combining words into phrases according to the number N of words per phrase (N-gram language model) to form a corpus; according to the corpus, count the frequency of each corresponding word in each sample as the text features of that sample. A naive Bayes model is then trained with the training set samples. During training, the number N of words per phrase and the Bayes smoothing parameter α are tuned by grid search based on 10-fold cross-validation, and the optimal parameters are selected by the average classification accuracy Acc over the 10 validation sets. After the model parameters are determined, the model is evaluated on the test set by Acc and the Kappa coefficient k; the calculated Acc is 86.3% and the Kappa coefficient k is 0.82;
The indices are calculated as follows:

$$Acc = \frac{n_c}{n} \qquad (2)$$

$$k = \frac{Acc - p_e}{1 - p_e} \qquad (3)$$

$$p_e = \frac{\sum_{i=1}^{m} C_i \times P_i}{n^2} \qquad (4)$$

In formula (2), Acc denotes the classification accuracy, i.e. the proportion of correctly predicted samples among all samples, where n is the number of all samples and $n_c$ is the number of correctly predicted samples. In formula (3), k denotes the Kappa coefficient, where Acc is given by formula (2) and $p_e$ by formula (4). In formula (4), m is the number of categories, $C_i$ is the number of samples whose true category is i, $P_i$ is the number of samples whose predicted category is i, and n is the number of all samples.
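Step 3) and its two evaluation indices can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: it assumes n-grams of every length from 1 up to a maximum, uses a multinomial naive Bayes with additive smoothing α, builds the vocabulary from the training samples only, and omits the grid search with 10-fold cross-validation. The tokens, labels and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def ngram_counts(tokens, n_max):
    """Count every word n-gram of length 1..n_max in one segmented name."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

class MultinomialNB:
    """Multinomial naive Bayes with additive smoothing parameter alpha."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.vocab = {term for doc in docs for term in doc}
        self.class_sizes = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_sizes.values())
        scores = {}
        for label, size in self.class_sizes.items():
            denom = sum(self.term_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(size / n_docs)
            for term, freq in doc.items():
                if term in self.vocab:  # terms unseen in training are ignored
                    score += freq * math.log(
                        (self.term_counts[label][term] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

def accuracy(y_true, y_pred):
    """Acc = n_c / n."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """k = (Acc - p_e) / (1 - p_e), with p_e = sum_i C_i * P_i / n^2."""
    n = len(y_true)
    true_n, pred_n = Counter(y_true), Counter(y_pred)
    p_e = sum(true_n[c] * pred_n[c] for c in true_n) / (n * n)
    return (accuracy(y_true, y_pred) - p_e) / (1 - p_e)

# Hypothetical segmented enterprise names with their industry labels.
train = [(["hongda", "metal", "products"], "metal"),
         (["yongxing", "metal", "processing"], "metal"),
         (["xinhua", "textile", "dyeing"], "textile"),
         (["huafeng", "textile", "garments"], "textile")]
test = [(["dongfang", "metal", "products"], "metal"),
        (["jinxiu", "textile", "dyeing"], "textile")]

model = MultinomialNB(alpha=1.0).fit([ngram_counts(t, 2) for t, _ in train],
                                     [label for _, label in train])
y_true = [label for _, label in test]
y_pred = [model.predict(ngram_counts(t, 2)) for t, _ in test]
print(accuracy(y_true, y_pred), kappa(y_true, y_pred))
```

The Kappa coefficient is undefined when p_e equals 1 (for example when only one category occurs), so a real evaluation would need to guard against that case.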
Step 4) spatial analysis: the POI enterprise data of the research area obtained in step 1) undergo the same word segmentation and text feature extraction as in step 3), and their categories are then predicted with the naive Bayes model. After prediction, the POI enterprise data are converted into spatial point data (vector data). The enterprise point data of each category are input into a kernel density model with an output cell size of 1 km and a search radius of 10 km, giving spatial distribution density maps for the different enterprise categories. A regular grid of 1 km × 1 km is generated according to the administrative division topological shape of the province, and the soil heavy metal pollution data are loaded into the grid data. Taking metal products enterprises and the soil heavy metal element Cd as an example, the number of metal products enterprises in each grid is counted and the Cd value is assigned to each grid cell. Spatial autocorrelation analysis is then performed on these data with the improved bivariate Moran's I method, using the Queen spatial adjacency relation between grids, to obtain a spatial cluster map of soil heavy metal pollution and enterprise distribution in the research area;
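The grid statistics in step 4) can be sketched in Python as follows. The sketch assumes that coordinates have already been projected from WGS84 to a metric system, that the grid starts at the lower-left corner of the bounding rectangle, and that several soil survey points in one cell are averaged; the function names and sample points are hypothetical, and the kernel density step is not shown.

```python
from collections import defaultdict
from statistics import mean

CELL = 1000.0  # grid size in meters (1 km)

def cell_of(x, y, x_min, y_min, size=CELL):
    """(column, row) of the cell containing a point, counted from the
    lower-left corner (x_min, y_min) of the bounding rectangle."""
    return int((x - x_min) // size), int((y - y_min) // size)

def count_enterprises(points, x_min, y_min):
    """Number of enterprise points per cell for one industry category."""
    counts = defaultdict(int)
    for x, y in points:
        counts[cell_of(x, y, x_min, y_min)] += 1
    return dict(counts)

def mean_pollution(samples, x_min, y_min):
    """Mean pollution index per cell; samples are (x, y, index) tuples."""
    by_cell = defaultdict(list)
    for x, y, index in samples:
        by_cell[cell_of(x, y, x_min, y_min)].append(index)
    return {cell: mean(values) for cell, values in by_cell.items()}

# Hypothetical projected coordinates in meters.
enterprises = [(150.0, 200.0), (900.0, 950.0), (1500.0, 300.0)]
soil = [(100.0, 100.0, 1.5), (800.0, 700.0, 0.5), (1200.0, 100.0, 0.75)]
print(count_enterprises(enterprises, 0.0, 0.0))
print(mean_pollution(soil, 0.0, 0.0))
```

Cells that contain no enterprise or no survey point are simply absent from the returned dictionaries; how such cells are filled before the autocorrelation analysis is a separate design choice.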
Step 5) pollution source judgment: the spatial cluster map of the pollution degree of the soil heavy metal element Cd and the aggregation degree of metal products enterprises in the research area is analyzed for the spatial distribution characteristics of point source pollution and area source pollution. If a certain area of the spatial cluster map is High-High, the soil heavy metal Cd in that area is judged to be point source pollution caused by metal products enterprises; if a certain area is High-Low, the soil heavy metal Cd there is not point source pollution caused by metal products enterprises, and is judged to be area source pollution possibly caused by livestock farming, dry and wet deposition, fertilizer application, or other enterprise types.
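The judgment in step 5) amounts to a lookup from cluster label to conclusion, which the following Python sketch illustrates for the Cd and metal products example. The rule for assigning labels (pollution index above 1 counts as High, a positive standardized neighborhood enterprise count counts as High) and the function names are assumptions made for illustration; only the High-High and High-Low conclusions are taken from the text.

```python
def cluster_label(pollution_index, neighbor_lag, significant):
    """Label a grid cell for the spatial cluster map.

    pollution_index: soil heavy metal pollution index of the cell
    neighbor_lag:    mean standardized enterprise count of adjacent cells
    significant:     whether the local index passed the significance test
    """
    if not significant:
        return "Not significant"
    first = "High" if pollution_index > 1 else "Low"
    second = "High" if neighbor_lag > 0 else "Low"
    return f"{first}-{second}"

JUDGMENT = {
    "High-High": "point source pollution possibly caused by metal products enterprises",
    "High-Low": "not point source pollution from metal products enterprises; "
                "possibly area source pollution",
}

def judge(label):
    """Return the judgment the text gives for a label, if it gives one."""
    return JUDGMENT.get(label, "no judgment stated for this label")

# Hypothetical cells: (pollution index, neighborhood lag, significant?)
for args in [(1.8, 0.9, True), (1.5, -0.4, True), (0.6, 0.7, True), (1.2, 0.3, False)]:
    label = cluster_label(*args)
    print(label, "->", judge(label))
```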
The above embodiment is merely a preferred embodiment of the present invention and should not be construed as limiting the invention. A person of ordinary skill in the relevant art may make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, any technical solution obtained by equivalent replacement or equivalent transformation falls within the protection scope of the invention.

Claims (5)

1. A soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning, characterized by comprising the following steps:
step 1) data acquisition: acquiring polluting enterprise data, enterprise POI data and soil heavy metal pollution data of the area to be researched, wherein the polluting enterprise data comprise enterprise names and their corresponding industry classifications; the enterprise POI data comprise the names and the longitude and latitude information of all enterprises in the area to be researched; the soil heavy metal pollution data are soil survey data of the area to be researched and comprise the pollution index of each soil heavy metal element and longitude and latitude information;
step 2) enterprise data preprocessing: performing descriptive analysis on the polluting enterprise data acquired in step 1), and adjusting the enterprise industry category distribution of the data set according to the analysis result so as to balance the category distribution of the enterprise samples; then performing word segmentation on the enterprise names and removing place-name vocabulary; finally, splitting the data into a training data set and a test data set according to a set proportion;
step 3) enterprise data classification: from the result of step 2), extracting the set of all words or phrases appearing in the training data set and the test data set as a corpus; according to the corpus, counting the word frequency of the words appearing in the enterprise name of each sample, and extracting the text features corresponding to the sample; training a multinomial naive Bayes model with the samples of the training set to obtain the optimal model parameters; and evaluating the model by scoring on the test data set;
step 4) spatial analysis: performing word segmentation on the enterprise POI data acquired in step 1), removing place-name vocabulary, inputting the result into the multinomial naive Bayes model trained in step 3), predicting the industry classification of each enterprise in the data set, and performing spatial density analysis on the different categories of enterprises with a kernel density method; meanwhile, generating a regular grid of a specified size according to the topological shape of the area to be researched, and counting, within each grid, the number of enterprises of each industry classification and the pollution index of each soil heavy metal element; then performing spatial analysis with a bivariate spatial autocorrelation method;
step 5) pollution source judgment: analyzing the spatial distribution relationship between soil heavy metal pollution and polluting enterprises, judging the distribution characteristics of point source pollution and area source pollution in the area to be researched, and identifying the enterprise pollution sources.
2. The method for identifying the pollution sources of soil heavy metal enterprises based on source-sink space variable reasoning as claimed in claim 1, wherein in step 2) the enterprise industry category distribution of the data set is adjusted as follows: according to the Pareto principle, the industry categories in the analysis result are sorted by frequency from high to low, the leading industry categories whose cumulative proportion exceeds a threshold are selected as representative categories, and all remaining industry categories are merged into one category, so that the industry category distribution of the samples is balanced.
3. The method for identifying the pollution sources of soil heavy metal enterprises based on source-sink space variable reasoning as claimed in claim 1, wherein in step 2) the word segmentation engine used for segmenting the enterprise names is jieba; the removed place-name vocabulary comprises place names at or above the county/town level of the administrative divisions.
4. The method for identifying the pollution sources of soil heavy metal enterprises based on source-sink space variable reasoning as claimed in claim 1, wherein step 3) specifically comprises:
3.1) extracting text features: first, finding the set of words or n-gram phrases in the training data set and the test data set, the total number of words or phrases being N; then numbering the words or phrases from 1 to N and taking the numbered words or phrases as the corpus; then constructing, for each sample in the training data set and the test data set, an N-dimensional vector in which the value of the m-th dimension is the word frequency in that sample of the word numbered m, the constructed N-dimensional vector being the extracted text feature;
3.2) training the multinomial naive Bayes model: using the text features of the training set data, tuning the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model by a grid search method based on 10-fold cross-validation, the evaluation index of the cross-validation being the classification accuracy, and finally selecting the parameters with the highest average classification accuracy as the optimal parameters;
3.3) after the optimal model parameters are determined, evaluating the model by the classification accuracy Acc and the Kappa coefficient on the test data set.
5. The method for identifying the pollution sources of soil heavy metal enterprises based on source-sink space variable reasoning as claimed in claim 4, wherein in step 4) the regular grid of a specified size is generated according to the topological shape of the area to be researched as follows: the extent of the minimum bounding rectangle is calculated from the topological shape of the area to be researched, and the grid is then divided from one vertex of the minimum bounding rectangle according to a preset size specification to obtain the grid data; the number of enterprises of each industry classification and the pollution index of each soil heavy metal element in the grid are counted as follows: the number of enterprise POI points of each industry classification falling into each grid is counted, the count representing the enterprise aggregation degree in the grid area; at the same time, the pollution index of each soil heavy metal element in each grid is determined, and if a grid contains several survey points, the average of the pollution indexes of a given soil heavy metal element over all survey points in the grid is taken as the pollution index of that element in the grid area; bivariate spatial correlation analysis is then performed on the grid data separately for each industry category and each soil heavy metal element, the specific analysis formula being as follows:
Figure FDA0002469093350000041
in formula (2),
Figure FDA0002469093350000042
denotes the binarized value of attribute a in grid i, where attribute a is the pollution index of a certain soil heavy metal element in the grid and the binarization is as follows: the pollution index is redefined as 0 when it is less than or equal to 1, and as 1 when it is greater than 1;
Figure FDA0002469093350000043
denotes the value of attribute b in grid i after z-score standardization, where attribute b is the number of enterprise POI points of a certain enterprise category in the grid; w_ij is the spatial weight matrix;
Figure FDA0002469093350000044
denotes the local spatial correlation index of attribute a and attribute b at grid i; if
Figure FDA0002469093350000045
is significantly positive, the soil heavy metal pollution degree at grid i is positively correlated with the enterprise aggregation degree in the adjacent range; if
Figure FDA0002469093350000047
is significantly negative, the soil heavy metal pollution degree at grid i is negatively correlated with the enterprise aggregation degree in the adjacent range; if
Figure FDA0002469093350000046
is not significant, the soil heavy metal pollution degree at grid i and the enterprise aggregation degree in the adjacent range are not significantly correlated; from the
Figure FDA0002469093350000048
values obtained for the individual grids, a corresponding spatial cluster map is formed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810239430.7A CN108595414B (en) 2018-03-22 2018-03-22 Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Publications (2)

Publication Number Publication Date
CN108595414A CN108595414A (en) 2018-09-28
CN108595414B true CN108595414B (en) 2020-07-10

Family

ID=63626992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810239430.7A Active CN108595414B (en) 2018-03-22 2018-03-22 Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Country Status (1)

Country Link
CN (1) CN108595414B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785912A (en) * 2019-02-13 2019-05-21 中国科学院大气物理研究所 A kind of factor method for quickly identifying and device for target contaminant source resolution
CN110175739A (en) * 2019-04-12 2019-08-27 广东省生态环境技术研究所 A kind of heavy industries pollution Source Apportionment, system and storage medium
CN110175647A (en) * 2019-05-28 2019-08-27 北华航天工业学院 A kind of pollution source discrimination clustered based on principal component analysis and K-means
CN110706004B (en) * 2019-06-27 2022-03-29 华南农业大学 Farmland heavy metal pollutant tracing method based on hierarchical clustering
CN111310803B (en) * 2020-01-20 2021-06-01 江苏神彩科技股份有限公司 Environment data processing method and device
CN112084286B (en) * 2020-09-14 2021-06-29 智慧足迹数据科技有限公司 Spatial data processing method and device, computer equipment and storage medium
CN112288247B (en) * 2020-10-20 2024-04-09 浙江大学 Soil heavy metal risk identification method based on space interaction relationship
CN112903660A (en) * 2021-03-11 2021-06-04 广西大学 Method for judging current situation and source of pollution of watershed water body
CN113902249B (en) * 2021-09-02 2022-07-22 北京市农林科学院信息技术研究中心 Method and device for analyzing soil heavy metal influence factors
CN116662853B (en) * 2023-05-29 2024-04-30 新禾数字科技(无锡)有限公司 Method and system for automatically identifying analysis result of pollution source

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138668A (en) * 2015-09-06 2015-12-09 中山大学 Urban business center and retailing format concentrated area identification method based on POI data
CN105844301A (en) * 2016-04-05 2016-08-10 北华航天工业学院 Soil heavy metal pollution source analysis method based on Bayes source identification
CN107657284A (en) * 2017-10-11 2018-02-02 宁波爱信诺航天信息有限公司 A kind of trade name sorting technique and system based on Semantic Similarity extension

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Spatial distribution of soil heavy metal pollution estimated by different interpolation methods: Accuracy and uncertainty analysis"; Yunfeng Xie, Tong-bin Chen, Mei Lei, Jun Yang, Qing-jun Guo, Bo Song, Xiao-y; Chemosphere; 2011-01-31; vol. 82, no. 3; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant