CN108595414A - Method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference - Google Patents


Info

Publication number
CN108595414A
Authority
CN
China
Prior art keywords
pollution
enterprise
heavy metal
data
soil
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810239430.7A
Other languages
Chinese (zh)
Other versions
CN108595414B (en
Inventor
史舟
徐烨
贾晓琳
尤其浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201810239430.7A
Publication of CN108595414A
Application granted
Publication of CN108595414B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Processing Of Solid Wastes (AREA)

Abstract

The invention discloses a method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference. First, polluting-enterprise data, enterprise POI data, and soil heavy metal pollution data are obtained for the study area. The industry category distribution of the polluting-enterprise data set is then adjusted, the enterprise names are segmented into words, place-name vocabulary is removed, and the data are split into a training data set and a test data set. A corpus is built from the two data sets, and the frequencies of the words occurring in each sample are counted to form that sample's text features. A multinomial naive Bayes model is trained on the training set and evaluated by its scores on the test set. Finally, the industry categories predicted for the acquired enterprise data and the soil heavy metal pollution indices are aggregated over a grid generated from the topology of the study area, and spatial analysis is performed with the bivariate spatial autocorrelation method to determine the spatial relationship between pollution and enterprises and to identify the regions of heavy metal point-source and non-point-source pollution in the study area.

Description

Method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference
Technical field
The present invention relates to a method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference, and in particular to a classification technique based on text mining combined with a bivariate spatial autocorrelation analysis method.
Technical background
With the development of modern industrialization, some unsupervised enterprises discharge the industrial "three wastes" (waste gas, wastewater, and solid waste) indiscriminately, causing serious environmental pollution; heavy metal pollution of soil in particular has become a global environmental problem. According to surveys, the overall exceedance rate of soil pollution nationwide is 16.1%, the pollution is mainly heavy metal pollution, and about 20,000,000 hectares of arable farmland have been damaged. Farmland soil pollution is broadly divided into point-source pollution and non-point-source pollution. Non-point-source pollution is soil pollution that has no fixed discharge point and arises through processes such as soil erosion and surface runoff. Point-source pollution has a fixed discharge source and an identifiable extent, so it is easier to control and remediate than non-point-source pollution; enterprise pollution is point-source pollution. Many research methods and models for source apportionment of soil heavy metal pollution already exist. For example, Wang Xuesong and Qin Yong (Wang Xuesong, Qin Yong. Environmental risk assessment and source apportionment of heavy metals in urban topsoil of Xuzhou [J]. Geochimica, 2006, 35(1): 88-94.) used the statistical methods of factor analysis and cluster analysis to determine the sources and categories of the heavy metal elements in the topsoil of their study area. Saby et al. (Saby N P, Thioulouse J, Jolivet C C, et al. Multivariate analysis of the spatial patterns of 8 trace elements using the French soil monitoring network data [J]. Science of the Total Environment, 2009, 407(21): 5644-5652.) used principal component analysis to calculate the cumulative effects of eight heavy metals in topsoil under natural and anthropogenic factors, reduced the heavy metal contents to four principal components by dimensionality reduction, and then spatially interpolated the principal component scores with a robust geostatistical interpolation model to obtain four sources of heavy metals: soil parent material, soil texture, soil weathering, and anthropogenic factors.
These source apportionment methods and models have shortcomings, however. Traditional statistical and chemical methods such as correlation analysis, principal component analysis, cluster analysis, and factor analysis all ignore the spatial location of heavy metal pollution, which greatly limits their usefulness for the prevention and control of soil heavy metal pollution. Combining spatial interpolation with traditional multivariate statistics does not provide reliable quantitative analysis and cannot handle the spatial variability of pollution well. Moreover, the source and sink mechanisms of soil heavy metal pollution caused by enterprises are very complex, which makes the prevention and control of enterprise pollution quite difficult at present. A bivariate spatial autocorrelation model (Moran's I) can therefore be used to study the spatial correlation between soil heavy metal pollution and enterprises, providing effective guidance for the control of enterprise pollution.
However, because of data silos, sharing data files across departments is difficult and departments find it hard to cooperate, so enterprise information is hard to obtain. A classification technique based on text mining can be used instead: the industry category of a polluting enterprise is identified from its name, which provides the basis for studying the relationship between enterprises and the distribution of pollution. Text classification is an important method in data mining. A classification function or model is constructed from a certain amount of existing data, and text of unknown category is then assigned, according to its content, to predefined categories under a given classification system. Duan Lian (Duan Lian. POI classification retrieval based on a random vocabulary iteration model [J]. Application Research of Computers, 2014, 31(10): 3024-3027.) used a random vocabulary iteration model to classify massive point of interest (POI) data. Zhang et al. (Zhang X, Zhao J, LeCun Y. Character-level Convolutional Networks for Text Classification [J]. 2015: 649-657.) used character-level convolutional neural networks for text classification.
The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference uses a text mining technique, mainly multinomial naive Bayes, to build a classification model, and performs bivariate spatial autocorrelation analysis on the enterprise data obtained by classification together with local pollution data, thereby giving direction to research on the relationship between enterprises and the distribution of pollution.
Summary of the invention
The object of the invention is to solve the problems of the prior art by providing a method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference. The specific technical solution is as follows:
The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference includes the following steps:
Step 1) Data acquisition: obtain polluting-enterprise data, enterprise POI data, and soil heavy metal pollution data for the study area. The polluting-enterprise data include enterprise names and their corresponding industry classifications; the enterprise POI data include the names and the longitude and latitude of all enterprises in the study area; the soil heavy metal pollution data are soil survey data for the study area, including the pollution index of each soil heavy metal element and the longitude and latitude of each survey point;
Step 2) Enterprise data preprocessing: perform descriptive analysis on the polluting-enterprise data obtained in step 1) and, based on the results, adjust the industry category distribution of the data set so that the categories of the enterprise samples are balanced; then segment the enterprise names into words and remove place-name vocabulary; finally, split the data into a training data set and a test data set in a set proportion;
Step 3) Enterprise data classification: from the results of step 2), first extract the set of all words or phrases that occur in the training and test data sets as the corpus; based on this corpus, count the frequency of each word occurring in the enterprise name of each sample and use the frequencies as that sample's text features; train a multinomial naive Bayes model on the training samples to obtain the optimal model parameters; and evaluate the model by its scores on the test data set;
Step 4) Spatial analysis: segment the enterprise names in the POI data obtained in step 1) into words, remove place-name vocabulary, and input the result into the multinomial naive Bayes model trained in step 3) to predict the industry classification of each enterprise in the data set; then use kernel density estimation to analyze the spatial density of the different enterprise types; meanwhile, generate a regular grid of specified cell size from the topology of the study area, and determine for each grid cell the number of enterprises of each industry category and the pollution index of each soil heavy metal element; then perform spatial analysis with the bivariate spatial autocorrelation method;
Step 5) Pollution source identification: analyze the spatial relationship between soil heavy metal pollution and polluting enterprises, determine the distribution characteristics of point-source and non-point-source pollution in the study area, and identify enterprise pollution sources.
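The gridding and per-cell statistics of step 4) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes planar coordinates, takes the starting vertex of the minimum bounding rectangle as `origin`, and the function names `cell_of` and `grid_statistics` are hypothetical.

```python
from collections import defaultdict

def cell_of(x, y, origin, cell_size):
    """Return the (column, row) index of the grid cell that contains point (x, y)."""
    return (int((x - origin[0]) // cell_size), int((y - origin[1]) // cell_size))

def grid_statistics(enterprises, survey_points, origin, cell_size):
    """Aggregate enterprises and survey points over a regular grid.

    enterprises:   iterable of (x, y, industry) tuples
    survey_points: iterable of (x, y, pollution_index) tuples for one heavy metal
    origin:        (x, y) of the bounding-rectangle vertex where gridding starts
    """
    # Number of enterprise POI points per (cell, industry): the aggregation degree.
    counts = defaultdict(int)
    for x, y, industry in enterprises:
        counts[(cell_of(x, y, origin, cell_size), industry)] += 1

    # Mean pollution index per cell; a cell with one survey point keeps that value.
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, value in survey_points:
        acc = sums[cell_of(x, y, origin, cell_size)]
        acc[0] += value
        acc[1] += 1
    mean_index = {cell: total / n for cell, (total, n) in sums.items()}
    return dict(counts), mean_index
```

Cells that contain no survey point are simply absent from `mean_index`; how such cells should be treated is not stated in the text and is left open here.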
Preferably, in step 2), the industry category distribution of the data set is adjusted as follows: following the Pareto principle, the industry categories in the analysis result are sorted by frequency from high to low, the leading categories whose cumulative share exceeds a threshold are selected as the representative categories, and all remaining industry categories are merged into a single category, so that the industry category distribution of the samples is balanced.
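The category adjustment described above can be sketched in a few lines. The function name `merge_minor_categories` and the label `"other"` are illustrative assumptions; the 0.8 default follows the 80% threshold mentioned in the embodiment.

```python
from collections import Counter

def merge_minor_categories(labels, threshold=0.8, other="other"):
    """Keep the most frequent categories until their cumulative share
    reaches `threshold`; relabel every remaining category as `other`."""
    counts = Counter(labels)
    total = len(labels)
    kept, cumulative = set(), 0
    for category, count in counts.most_common():  # sorted from high to low frequency
        if cumulative / total >= threshold:
            break
        kept.add(category)
        cumulative += count
    return [label if label in kept else other for label in labels]
```

For example, with ten enterprises of which five are metal products, three chemical, one textile, and one food, the first two categories already reach 80%, so the last two are merged into `"other"`.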
Preferably, in step 2), the word segmentation engine used to segment the enterprise names is jieba, and the place-name vocabulary removed comprises the place names of administrative divisions at township/town level and above.
Preferably, step 3) is carried out as follows:
3.1) Extract text features: first, find the set of words or n-gram phrases in the training and test data sets, N words or phrases in total; then number these words or phrases from 1 to N and use the numbered terms as the corpus; then, for any sample in the training or test data set, construct an N-dimensional vector in which the value of the m-th dimension is the frequency in that sample of the term numbered m. The N-dimensional vector so constructed is the extracted text feature;
3.2) Train the multinomial naive Bayes model: using the text features of the training data, tune the text featurization parameter n and the smoothing parameter α of the multinomial naive Bayes model by grid search with 10-fold cross-validation, with classification accuracy as the cross-validation metric, and select the parameters with the highest mean classification accuracy as the optimal parameters;
3.3) After the optimal model parameters are determined, evaluate the model by the classification accuracy Acc and the Kappa coefficient on the test data set.
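The tuning procedure of step 3.2) can be sketched generically as below. The code shows only the search loop; `evaluate` is a hypothetical callback that would train the model with the given parameters on the training indices and return its accuracy on the validation indices. All names here are assumptions for illustration.

```python
import itertools
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def grid_search(param_grid, evaluate, n_samples, k=10):
    """Return the parameter combination with the highest mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A call such as `grid_search({"n": [1, 2, 3], "alpha": [0.1, 0.5, 1.0]}, evaluate, n_samples)` would cover the two hyperparameters named in the text; the candidate values are placeholders, since the patent does not list the ranges searched.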
Preferably, in step 4), the regular grid of specified cell size is generated from the topology of the study area as follows: compute the extent of the minimum bounding rectangle of the study area from its topology, then, starting from one vertex of that rectangle, divide it into grid cells of the preset size. The number of enterprises of each industry category and the pollution index of each soil heavy metal element in each cell are determined as follows: count the enterprise POI points of each industry category that fall in each cell, the count representing the degree of enterprise aggregation in that cell; at the same time, determine the pollution index of each soil heavy metal element in each cell, and if a cell contains several survey points, use the mean pollution index of that element over all survey points in the cell as the cell's pollution index for that element. Bivariate spatial correlation analysis is then performed cell by cell for each combination of industry category and soil heavy metal element, using the following formula:
In formula (2), the first variable is the binarized value of attribute a in grid cell i, where attribute a is the pollution index of a given soil heavy metal element in the cell. The binarization is as follows: a pollution index less than or equal to 1 is redefined as 0, and a pollution index greater than 1 is redefined as 1. The second variable is the value of attribute b after z-score standardization, where attribute b is the number of enterprise POI points of a given industry category in the cell. w_ij is the spatial weight matrix, and the result of the formula is the local spatial correlation index of attribute a and attribute b at cell i. If this index is significantly positive, the degree of soil heavy metal pollution at cell i is positively correlated with the degree of enterprise aggregation in its neighborhood; if it is significantly negative, the two are negatively correlated; if it is not significant, the degree of soil heavy metal pollution at cell i has no evident correlation with the degree of enterprise aggregation in its neighborhood. A spatial cluster map is formed from the index values obtained for the grid cells.
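The equation image for formula (2) is not present in this text. Assuming it has the standard form of the bivariate local Moran's I, and using symbols chosen here for illustration (z_i^a, z_j^b, P_i, and x_j^b are not taken from the original), the definitions above would correspond to:

```latex
I_i^{ab} = z_i^{a} \sum_{j=1}^{n} w_{ij}\, z_j^{b},
\qquad
z_i^{a} =
\begin{cases}
0, & P_i \le 1 \\
1, & P_i > 1
\end{cases},
\qquad
z_j^{b} = \frac{x_j^{b} - \bar{x}^{b}}{\sigma^{b}}
```

Here P_i is the pollution index of the chosen heavy metal in cell i, x_j^b is the number of enterprise POI points of the chosen industry in cell j, and the mean and standard deviation are taken over all n cells.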
Preferably, in step 5), the spatial relationship between soil heavy metal pollution and polluting enterprises is judged from the spatial cluster map. If, in a region of the cluster map, the pollution index of a given soil heavy metal and the number of enterprise POI points of a given industry category are High-High, the pollution source of that heavy metal in the region is judged to be possibly point-source pollution from enterprises of that type. If the pollution index of a soil heavy metal in a region is High-Low with respect to the POI counts of all enterprise types, the pollution source of that heavy metal in the region is judged to be possibly non-point-source pollution.
The beneficial effects of the invention are as follows. A model is built from the classification data of known polluting enterprises, so that the category of an enterprise can subsequently be identified directly from its name in open POI data. Based on the predicted categories, a grid is established and bivariate spatial autocorrelation is used to analyze the spatial distribution relationship between each type of enterprise and the soil pollution of each heavy metal element, so that the spatial relationship between discrete soil heavy metal pollution points and enterprise points can be analyzed accurately. This extends existing analysis methods and approaches, and has theoretical and practical significance and application value for the remediation and control of enterprise pollution.
Description of the drawings
Fig. 1 shows the industry distribution of the polluting-enterprise data in the embodiment.
Fig. 2 shows the spatial distribution density of the predicted polluting-enterprise types in the embodiment (a. textile industry; b. metal products industry; c. chemical raw materials and chemical products manufacturing; d. other industries).
Fig. 3 is the spatial cluster map of the pollution level of the soil heavy metal Cd and the degree of aggregation of metal products enterprises in the study area of the embodiment.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and an embodiment.
The method of the present invention for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable inference includes the following steps:
Step 1) Data acquisition: obtain polluting-enterprise data, enterprise POI data, and soil heavy metal pollution data for the study area. The polluting-enterprise data should include enterprise names and their corresponding industry classifications, the classification conforming to the national standard for industrial classification of national economic activities, GB/T 4754-2011. The enterprise POI data should include the names of all enterprises in the study area and the longitude and latitude of each enterprise's location. The soil heavy metal pollution data are soil survey data for the study area, including the pollution index of each soil heavy metal element at each survey point and the longitude and latitude of the survey point. When the pollution sources of several heavy metals are to be identified, the survey data must also include the pollution index of each of those heavy metals.
Step 2) Enterprise data preprocessing: perform descriptive analysis on the polluting-enterprise data obtained in step 1) to obtain a descriptive summary of the industry classifications of the enterprises. Because the industry category distribution of the raw polluting-enterprise data is severely unbalanced, only a few representative categories need to be extracted (these few categories together account for about 80% of the samples) and all remaining categories are merged into a single category; adjusting the industry category distribution of the data set in this way makes the sample distribution as balanced as possible. In this embodiment the adjustment is carried out as follows: the industry categories in the analysis result are sorted by frequency from high to low, the leading categories whose cumulative share exceeds a threshold (80% may be chosen) are selected as the representative categories, and the remaining industry categories are all merged into one category. After the adjustment, the enterprise names are segmented into words with the jieba segmentation engine (this engine is trained with a hidden Markov model and segments well), and place-name vocabulary of administrative divisions at township/town level and above is removed from the enterprise names. Finally, the processed enterprise data are split into a training data set and a test data set at a sample ratio of 8:2; each data set contains the words remaining in each enterprise name after segmentation and removal, together with the industry category of the enterprise.
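The place-name removal and the 8:2 split can be sketched as below. The sketch assumes the names have already been segmented into token lists (for example with jieba's `lcut` function, which is not called here so that the block stays dependency-free). The English tokens stand in for segmented Chinese words, and the function names and the place-name set are hypothetical.

```python
import random

def strip_place_names(tokens, place_names):
    """Drop tokens that are place names at township/town level or above."""
    return [token for token in tokens if token not in place_names]

def split_train_test(samples, train_ratio=0.8, seed=42):
    """Shuffle the samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical segmented enterprise name and place-name list.
tokens = ["Hangzhou", "Hongda", "textile", "limited-company"]
place_names = {"Zhejiang", "Hangzhou"}
cleaned = strip_place_names(tokens, place_names)  # ["Hongda", "textile", "limited-company"]
```

A plain random split is used here; the text does not say whether the 8:2 split was stratified by industry category.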
Step 3) Enterprise data classification: from the results of step 2), first extract the set of all words or phrases that occur in the training and test data sets as the corpus; based on this corpus, count the frequency of each word occurring in the enterprise name of each sample and use the frequencies as that sample's text features; train a multinomial naive Bayes model on the training samples to obtain the optimal model parameters; and evaluate the model by its scores on the test data set. The procedure is as follows:
3.1) Extract text features: first, find the set of words or n-gram phrases in the training and test data sets, N words or phrases in total; then number these words or phrases from 1 to N and use the numbered terms as the corpus; then, for any sample in the training or test data set, construct an N-dimensional vector in which the value of the m-th dimension is the frequency in that sample of the term numbered m. The N-dimensional vector so constructed is the extracted text feature;
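Step 3.1) can be illustrated with a small term-frequency vectorizer. This is a sketch under stated assumptions: terms are represented as tuples of consecutive tokens, indices start at 0 rather than 1, and the function names are hypothetical.

```python
def ngrams(tokens, n_max):
    """All groups of 1 to n_max consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(token_lists, n_max):
    """Number every distinct term in order of first appearance (0-based)."""
    vocab = {}
    for tokens in token_lists:
        for term in ngrams(tokens, n_max):
            vocab.setdefault(term, len(vocab))
    return vocab

def term_frequency_vector(tokens, vocab, n_max):
    """N-dimensional vector whose m-th entry is the frequency of term m."""
    vector = [0] * len(vocab)
    for term in ngrams(tokens, n_max):
        if term in vocab:  # terms outside the corpus are ignored
            vector[vocab[term]] += 1
    return vector
```

With the two token lists `["Hongda", "textile"]` and `["textile", "dyeing"]` and `n_max=2`, the corpus holds five terms and the first sample maps to `[1, 1, 1, 0, 0]`.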
3.2) Train the multinomial naive Bayes model: using the text features of the training data, tune the text featurization parameter n and the smoothing parameter α of the multinomial naive Bayes model by grid search with 10-fold cross-validation, with classification accuracy as the cross-validation metric, and select the parameters with the highest mean classification accuracy as the optimal parameters;
The hyperparameter n and the naive Bayes smoothing parameter α are described in detail as follows:
The text featurization parameter n is in effect a way of enlarging the corpus after word segmentation. Suppose segmentation yields m words in total; in practice m may be small, and a classification model built on it would perform poorly. Based on these m words, n consecutive words can be taken in order to form a new term, which enlarges the corpus. Clearly n cannot be too large, its minimum is 1, and it is a positive integer, so it needs to be tuned by grid search;
The naive Bayes smoothing parameter α is a means of handling new words. Naive Bayes modeling depends on a corpus; even though the corpus can be enlarged with the hyperparameter n, it cannot cover all possible text, so the features of new words are lost when they are vectorized and overfitting becomes more likely. When computing the posterior probability, a smoothing technique is therefore introduced to alleviate this, with the following formula:
In formula (1), α is the smoothing parameter; n is the number of features, equal to the number of words in the corpus; c is a class; x_i is the value of the i-th feature, i = 1, 2, 3, ..., n; P(x_1, x_2, ..., x_n | c) is the probability that a sample has the feature values x_1, x_2, ..., x_n given that its class is c; N is the number of samples in the whole sample set whose feature values are x_1, x_2, ..., x_n; and N_c is the number of such samples in class c.
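The equation image for formula (1) is not present in this text. As an illustration of how α acts, the sketch below implements the standard additive (Laplace/Lidstone) smoothing of multinomial naive Bayes, in which the probability of feature i in class c is estimated as (count of feature i in class c + α) / (total count in class c + α·n). This is the textbook form, assumed here rather than confirmed as the patent's exact formula; the class name `SimpleMultinomialNB` is hypothetical.

```python
import math
from collections import defaultdict

class SimpleMultinomialNB:
    """Multinomial naive Bayes with additive smoothing (alpha must be > 0)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        n_features = len(X[0])
        self.classes = sorted(set(y))
        class_counts = defaultdict(int)
        feature_counts = {c: [0] * n_features for c in self.classes}
        for row, label in zip(X, y):
            class_counts[label] += 1
            for i, value in enumerate(row):
                feature_counts[label][i] += value
        self.log_prior = {c: math.log(class_counts[c] / len(y)) for c in self.classes}
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(feature_counts[c])
            # Smoothing keeps terms never seen in class c at a non-zero probability.
            self.log_likelihood[c] = [
                math.log((count + self.alpha) / (total + self.alpha * n_features))
                for count in feature_counts[c]
            ]
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            scores = {
                c: self.log_prior[c]
                + sum(v * w for v, w in zip(row, self.log_likelihood[c]))
                for c in self.classes
            }
            predictions.append(max(scores, key=scores.get))
        return predictions
```

The rows of `X` are term-frequency vectors of the kind described in step 3.1), and `y` holds the industry categories.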
3.3) After the optimal model parameters are determined, evaluate the classification model by the classification accuracy Acc and the Kappa coefficient on the test data set.
If every accuracy metric of the model meets the requirement, the model is considered trained and can be used for subsequent prediction: POI enterprise data preprocessed in the same way as the training data are input into the model, and the industry classification of each enterprise is predicted from the words in its name.
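The two evaluation metrics of step 3.3) can be computed as follows. The sketch assumes that "Kappa coefficient" means Cohen's kappa between true and predicted labels, which is the usual reading for classifier evaluation but is not spelled out in the text.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Expected agreement if predictions were independent of the true labels.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa is undefined when the expected agreement equals 1, for instance when every true and predicted label is the same single class; the sketch does not guard against that case.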
Step 4) Spatial analysis: segment the enterprise names in the POI data obtained in step 1) into words, remove place-name vocabulary, and input the processed data into the multinomial naive Bayes model trained in step 3) to predict the industry classification of each enterprise in the data set; then use kernel density estimation to analyze the spatial density of the different enterprise types. Meanwhile, generate a regular grid of specified cell size from the topology of the study area as follows: compute the extent of the minimum bounding rectangle of the study area from its topology, then, starting from one vertex of that rectangle, divide it into grid cells of the preset size (the size can be adjusted as needed). The number of enterprises of each industry category and the pollution index of each soil heavy metal element in each cell are determined as follows: count the enterprise POI points of each industry category that fall in each cell, the count representing the degree of enterprise aggregation in that cell; at the same time, determine the pollution index of each soil heavy metal element in each cell (if there are several elements, each is processed separately). If a cell contains only one survey point, the data of that point represent the pollution index of the cell; if a cell contains several survey points, the mean pollution index of the element over all survey points in the cell is used as the cell's pollution index for that element. After these statistics are complete, spatial analysis is performed with the bivariate spatial autocorrelation method: bivariate spatial correlation analysis is carried out cell by cell for each combination of industry category and soil heavy metal element in turn, for example heavy metal A against industry B, where A and B can be chosen according to the needs of the study. The analysis formula is as follows:
In formula (2), the first variable is the binarized value of attribute a in grid cell i, where attribute a is the pollution index of a given soil heavy metal element in the cell. The binarization is as follows: a pollution index less than or equal to 1, which is at or below the warning limit, forms one level and is redefined as 0; a pollution index greater than 1, which covers slight, moderate, and severe pollution, forms the other level and is redefined as 1. The second variable is the value of attribute b after z-score standardization, where attribute b is the number of enterprise POI points of a given industry category in the cell. w_ij is the spatial weight matrix, and the result of the formula is the local spatial correlation index of attribute a and attribute b at cell i. If this index is significantly positive, the degree of soil heavy metal pollution at cell i is positively correlated with the degree of enterprise aggregation in its neighborhood; if it is significantly negative, the two are negatively correlated; if it is not significant, the degree of soil heavy metal pollution at cell i has no evident correlation with the degree of enterprise aggregation in its neighborhood. A spatial cluster map can then be formed from the index values obtained for the grid cells.
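The computation described for formula (2) can be sketched as below, assuming the standard bivariate local Moran form in which the index at cell i is the binarized pollution value of cell i multiplied by the weighted sum of the standardized enterprise counts of its neighbors. Row-standardized weights and the population standard deviation are assumptions of this sketch, and the significance test (typically a permutation test) is omitted.

```python
import math

def binarize(values, limit=1.0):
    """1 where the pollution index exceeds the warning limit, else 0."""
    return [1 if v > limit else 0 for v in values]

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def row_standardize(weights):
    """Scale each row of the spatial weight matrix to sum to 1."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total if total else 0.0 for w in row])
    return result

def bivariate_local_moran(a, b, weights):
    """Index for every cell i: a[i] * sum_j weights[i][j] * b[j]."""
    return [a_i * sum(w * b_j for w, b_j in zip(row, b))
            for a_i, row in zip(a, weights)]

# Four cells in a row; neighbors share an edge (hypothetical data).
pollution = [1.5, 0.8, 2.0, 0.5]
enterprise_counts = [4, 4, 0, 0]
adjacency = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
index = bivariate_local_moran(binarize(pollution),
                              z_scores(enterprise_counts),
                              row_standardize(adjacency))
```

Because attribute a is binarized to 0 or 1, the index is zero in every cell whose pollution index is at or below the limit.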
Step 5) Pollution source identification: judge the spatial distribution relationship between soil heavy metal pollution and polluting enterprises from the spatial cluster map, which has four attributes: High-High, High-Low, Low-High, and Low-Low. If a region of the cluster map is High-High, both the pollution index of a given soil heavy metal and the number of enterprise POI points of a given industry category are high there; that is, pollution by this heavy metal element is relatively severe and enterprises of this industry are relatively concentrated, so the pollution source of this heavy metal in the region is judged to be possibly point-source pollution from enterprises of that type. Likewise, the other soil heavy metal pollution indices to be identified and the POI counts of other enterprise types can be examined one by one; one heavy metal may be associated with several types of polluting enterprise, and one type of enterprise may produce pollution by several heavy metals. If, however, the pollution index of a soil heavy metal in a region is High-Low with respect to the POI counts of all enterprise types, pollution by this element is relatively severe there but enterprises of no industry are concentrated; the pollution is then not point-source pollution caused by any type of enterprise, and the pollution source of the heavy metal in the region may be non-point-source pollution. The identification result should of course not be taken as final and should preferably be confirmed by on-site inspection. Nevertheless, the method can quickly identify possible pollution sources over a large area, greatly reducing the manpower and material resources required.
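The decision rule of step 5) can be written down compactly. The sketch assumes that the High/Low labels come from the signs of the standardized pollution value and of the neighborhood-weighted enterprise value, as is conventional for local Moran cluster maps; the text does not state how the labels are derived, and the function names are hypothetical.

```python
def cluster_label(pollution_z, lagged_enterprise_z, significant):
    """Label one cell of the cluster map for one heavy metal and one industry."""
    if not significant:
        return "Not significant"
    first = "High" if pollution_z > 0 else "Low"
    second = "High" if lagged_enterprise_z > 0 else "Low"
    return f"{first}-{second}"

def likely_source(labels_by_industry):
    """Apply the step 5) rule to the labels of one cell for one heavy metal.

    labels_by_industry: {industry name: cluster label}
    """
    point_industries = sorted(industry for industry, label in labels_by_industry.items()
                              if label == "High-High")
    if point_industries:
        return "point source", point_industries
    if labels_by_industry and all(label == "High-Low"
                                  for label in labels_by_industry.values()):
        return "non-point source", []
    return "undetermined", []
```

For example, `likely_source({"metal products": "High-High", "textile": "High-Low"})` returns `("point source", ["metal products"])`, a candidate to be confirmed by on-site inspection as the text recommends.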
A region on the southeast coast of China is chosen below as the study area to demonstrate the above method. The main steps are as described above and are not repeated; only the implementation details and results specific to this embodiment are presented.
Embodiment:
In the present embodiment, the analysis is carried out using the above method, as follows:
Step 1) Data acquisition: Obtain the study area's polluting enterprise data, POI enterprise data, administrative division topology data and soil heavy metal pollution data. For the polluting enterprise data, the classification is required to conform to the national economic industry classification standard GB/T 4754-2011. For the soil heavy metal pollution data, local soil survey data are selected. The POI enterprise data must include latitude and longitude information in the WGS84 coordinate system; in addition, the POI data are downloaded through the Web API of Baidu Maps.
Step 2) Enterprise data preprocessing: A descriptive analysis is performed on the polluting enterprise data obtained in step 1). As shown in Fig. 1, the distribution of enterprise industry categories is severely imbalanced, so the category distribution needs to be balanced to some extent for the sake of subsequent modelling. Based on the analysis results, only three main categories are retained (metal products industry, chemical raw materials and chemical products manufacturing, and textile industry), and all remaining categories are merged into a single class. After class balancing, the enterprise names are segmented into words, and place-name words at or above the county/city level are removed during segmentation. After segmentation, the data are split into a training dataset and a test dataset at a sample ratio of 8:2.
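The preprocessing in step 2) can be sketched in Python as below. The sketch assumes that the enterprise names have already been segmented into token lists by a word segmenter, that the place-name list is supplied by the user, and that the 8 parts of the 8:2 split go to training; all function names, labels and tokens are hypothetical.

```python
import random
from collections import Counter

def merge_minor_categories(labels, keep, other="Other"):
    """Keep the named major categories and map every other label to one class."""
    return [lab if lab in keep else other for lab in labels]

def remove_place_names(tokens, place_names):
    """Drop tokens that are place names from a segmented enterprise name."""
    return [tok for tok in tokens if tok not in place_names]

def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle samples reproducibly and split them by the given ratio."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy labels: three major categories plus two rare ones.
labels = (["Metal products"] * 5 + ["Chemical"] * 3 + ["Textile"] * 2
          + ["Food", "Paper"])
merged = merge_minor_categories(labels, keep={"Metal products", "Chemical", "Textile"})
category_counts = Counter(merged)

tokens = remove_place_names(["Hangzhou", "metal", "products"], {"Hangzhou"})
train, test = split_train_test(list(range(10)))
```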
Step 3) Enterprise data classification: For the two datasets obtained in step 2), all words occurring in the two datasets are first extracted and combined according to the number N of words per phrase (N-gram language model) to form a corpus. Based on this corpus, the frequency of each word or phrase in each sample is counted and used as the text feature of that sample. A naive Bayes model is then trained on the training-set samples; in this process, a grid search based on 10-fold cross-validation is used to tune the number N of words per phrase and the naive Bayes smoothing parameter α, and the optimal parameters are selected by the average classification accuracy Acc over the 10 validation sets. Once the model parameters are determined, the model is evaluated by Acc and the Kappa coefficient k on the test set; the computed values are Acc = 86.3% and Kappa coefficient k = 0.82.
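As a rough illustration of step 3), the following Python sketch builds word n-gram counts and trains a small multinomial naive Bayes classifier with additive smoothing α. It is a minimal sketch under stated assumptions, not the inventors' implementation: names are assumed to be segmented already, the vocabulary is built from the training samples only (the text builds the corpus from both datasets), n-grams unseen in training are ignored at prediction time, and the class names and toy data are hypothetical. The 10-fold grid search over N and α is not shown.

```python
import math
from collections import Counter, defaultdict

def ngram_features(tokens, n_max):
    """Count every word n-gram of length 1..n_max in a segmented enterprise name."""
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams

class MultinomialNB:
    """Multinomial naive Bayes with additive smoothing parameter alpha."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, feature_counts, labels):
        self.vocab = set()
        self.class_totals = defaultdict(Counter)
        for feats, label in zip(feature_counts, labels):
            self.class_totals[label].update(feats)
            self.vocab.update(feats)
        total = len(labels)
        self.log_prior = {c: math.log(k / total)
                          for c, k in Counter(labels).items()}
        return self

    def predict(self, feats):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, counts in self.class_totals.items():
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = self.log_prior[label]
            for gram, k in feats.items():
                if gram in self.vocab:  # ignore n-grams unseen in training
                    score += k * math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data: (segmented name, industry category).
train = [
    (["metal", "products", "factory"], "Metal products"),
    (["hardware", "metal", "works"], "Metal products"),
    (["textile", "mill"], "Textile"),
    (["cotton", "textile", "factory"], "Textile"),
]
model = MultinomialNB(alpha=1.0).fit(
    [ngram_features(tokens, 2) for tokens, _ in train],
    [label for _, label in train],
)
pred = model.predict(ngram_features(["metal", "hardware", "plant"], 2))
```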
The indices are calculated as follows:

$$Acc = \frac{n_c}{n} \tag{2}$$

$$k = \frac{Acc - p_e}{1 - p_e} \tag{3}$$

$$p_e = \frac{\sum_{i=1}^{m} C_i P_i}{n^2} \tag{4}$$

In formula (2), Acc is the classification accuracy, i.e. the proportion of samples the model predicts correctly among all samples, where n is the total number of samples and $n_c$ is the number of correctly predicted samples. In formula (3), k is the Kappa coefficient and Acc is computed by formula (2), while $p_e$ is computed as in formula (4). In formula (4), m is the number of categories, $C_i$ is the number of samples whose true category is i, $P_i$ is the number of samples the model predicts as category i, and n is the total number of samples.
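The two evaluation indices can be computed as in this short Python sketch, which follows the definitions given above (accuracy as correct predictions over all samples, and the Kappa coefficient from the accuracy and the chance agreement built from per-class true and predicted counts). The function names and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted category equals the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def kappa(y_true, y_pred):
    """Kappa coefficient: agreement beyond what class frequencies alone give."""
    n = len(y_true)
    acc = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)   # samples per true category
    pred_counts = Counter(y_pred)   # samples per predicted category
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (acc - p_e) / (1 - p_e)

y_true = ["A", "A", "B", "B"]
y_pred = ["A", "A", "B", "A"]
acc_value = accuracy(y_true, y_pred)
kappa_value = kappa(y_true, y_pred)
```

In this toy case three of four predictions are correct, so the accuracy is 0.75; the chance agreement is (2·3 + 2·1)/16 = 0.5, which gives a Kappa coefficient of 0.5.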
Step 4) Spatial analysis: The study-area POI enterprise data obtained in step 1) are segmented and their text features extracted in the same way as for the model of step 3), and the naive Bayes model is then used to predict the type of each POI enterprise; after prediction, the data are converted to spatial point data (vector data). Using a kernel density model, the point data of each enterprise category are input with an output pixel size of 1 km and a search radius of 10 km, giving spatial distribution density maps of the different enterprise types. A regular grid of 1 km × 1 km is generated from the administrative division topology of the province, and the soil heavy metal pollution data are loaded into the province's grid. Metal products enterprises and the soil heavy metal element Cd are taken as the example here: the number of metal products enterprise points and the mean pollution index of soil Cd in each grid cell are computed, and the statistics are assigned to each cell. Spatial autocorrelation analysis is performed on the grid cells using the improved bivariate Moran's index method, with Queen contiguity as the spatial adjacency relation. In the resulting spatial cluster map, the spatial distribution relationship between pollution and enterprises is judged from the spatial relationship type of each region, and the spatial distribution characteristics of point-source and area-source pollution are analysed.
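The grid statistics and the Queen adjacency used in step 4) can be sketched in Python as follows. The sketch assumes that point coordinates have already been projected to a planar system in kilometres (the source data are in WGS84 longitude/latitude) and that the grid is laid over the bounding rectangle of the study area; the function names and sample coordinates are hypothetical. Points lying exactly on the upper or right boundary would need extra handling, which is omitted here.

```python
import math
from collections import defaultdict

def cell_of(x, y, min_x, min_y, cell_size):
    """Return (column, row) of the grid cell containing a point."""
    return (math.floor((x - min_x) / cell_size),
            math.floor((y - min_y) / cell_size))

def aggregate_to_grid(enterprise_points, soil_samples, bounds, cell_size=1.0):
    """Count enterprise points and average the pollution index within each cell.

    enterprise_points: list of (x, y); soil_samples: list of (x, y, index);
    bounds: (min_x, min_y, max_x, max_y) of the bounding rectangle.
    """
    min_x, min_y, max_x, max_y = bounds
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)
    counts = defaultdict(int)
    index_values = defaultdict(list)
    for x, y in enterprise_points:
        counts[cell_of(x, y, min_x, min_y, cell_size)] += 1
    for x, y, value in soil_samples:
        index_values[cell_of(x, y, min_x, min_y, cell_size)].append(value)
    grid = {}
    for col in range(n_cols):
        for row in range(n_rows):
            values = index_values.get((col, row))
            grid[(col, row)] = {
                "enterprise_count": counts.get((col, row), 0),
                "mean_index": sum(values) / len(values) if values else None,
            }
    return grid

def queen_neighbors(col, row, n_cols, n_rows):
    """Cells sharing an edge or a corner with (col, row)."""
    return [(c, r)
            for c in range(col - 1, col + 2)
            for r in range(row - 1, row + 2)
            if (c, r) != (col, row) and 0 <= c < n_cols and 0 <= r < n_rows]

# Toy study area of 3 km x 2 km with 1 km cells.
enterprises = [(0.2, 0.3), (0.8, 0.9), (2.5, 1.5)]
soil = [(0.5, 0.5, 1.4), (0.1, 0.9, 0.6), (2.2, 1.1, 2.0)]
grid = aggregate_to_grid(enterprises, soil, bounds=(0, 0, 3, 2), cell_size=1.0)
```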
Step 5) Source identification: The result is shown in the spatial cluster map of the pollution level of soil heavy metal Cd against the agglomeration level of metal products enterprises in the study area. The map is analysed for the spatial distribution characteristics of point-source and area-source pollution. If a region of the cluster map is High-High, soil heavy metal Cd in that region is judged to be point-source pollution caused by metal products enterprises. If a region is High-Low, soil heavy metal Cd there is not point-source pollution caused by metal products enterprises; if other enterprise types give the same result, the soil heavy metal in that region is judged to be possibly area-source pollution caused by factors such as livestock and poultry farming, atmospheric dry and wet deposition, and fertilizer application.
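The decision rule in step 5) can be written as a small Python function. This is a sketch of the rule as worded above, assuming that the cluster type of one region has been determined separately for each industry category; the function name, the return strings and the "undetermined" fallback are hypothetical, and any result would still need confirmation by on-site investigation.

```python
def judge_source(cluster_by_industry):
    """Suggest a likely source type for one region and one heavy metal.

    cluster_by_industry maps an industry name to the cluster type observed in
    the region for that industry's enterprise counts (for example "High-High").
    """
    point_sources = [industry for industry, label in cluster_by_industry.items()
                     if label == "High-High"]
    if point_sources:
        return "possible point source", sorted(point_sources)
    if cluster_by_industry and all(label == "High-Low"
                                   for label in cluster_by_industry.values()):
        return "possible area source", []
    return "undetermined", []

# Cd pollution in a region, checked against two industry categories.
result = judge_source({"Metal products": "High-High", "Textile": "High-Low"})
```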
The embodiment described above is only a preferred solution of the present invention and is not intended to limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, all technical solutions obtained by equivalent substitution or equivalent transformation fall within the scope of protection of the present invention.

Claims (6)

1. A method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning, characterized by comprising the following steps:
Step 1) Data acquisition: obtain the polluting enterprise data, enterprise POI data and soil heavy metal pollution data of the area to be studied; the polluting enterprise data include enterprise names and their corresponding industry classifications; the enterprise POI data include the names and latitude/longitude information of all enterprises in the area to be studied; the soil heavy metal pollution data are the soil survey data of the area to be studied, including the pollution index of each soil heavy metal and latitude/longitude information;
Step 2) Enterprise data preprocessing: perform a descriptive analysis of the polluting enterprise data obtained in step 1) and, according to the analysis results, adjust the distribution of enterprise industry categories in the dataset so that the category distribution of the enterprise samples is balanced; then segment the enterprise names into words and remove place-name words; finally, split the data into a training dataset and a test dataset in proportion;
Step 3) Enterprise data classification: from the results of step 2), first extract the set of all words or phrases occurring in the training dataset and the test dataset as the corpus; based on this corpus, count the frequency of the words occurring in the enterprise name of each sample and use it as the text feature of that sample; train a multinomial naive Bayes model on the training-set samples to obtain the optimal model parameters; and assess the model by its scores on the test dataset;
Step 4) Spatial analysis: segment the POI enterprise data obtained in step 1) into words, remove place-name words, and input the result into the multinomial naive Bayes model trained in step 3) to predict the industry category of each enterprise in the dataset; then use the kernel density method to analyse the spatial density of the different enterprise types; meanwhile, generate a regular grid of specified size from the topology of the area to be studied, and count the number of enterprises of each industry category and each soil heavy metal pollution index in every grid cell; then perform spatial analysis using the bivariate spatial autocorrelation method;
Step 5) Source identification: analyse the spatial relationship between soil heavy metal pollution and polluting enterprises, determine the distribution characteristics of point-source and area-source pollution in the area to be studied, and identify the enterprise pollution sources.
2. The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning according to claim 1, characterized in that, in step 2), the distribution of enterprise industry categories in the dataset is adjusted as follows: according to the Pareto principle, the analysis results are sorted by industry category frequency from high to low, the leading categories whose cumulative share exceeds a threshold are selected as representative categories, and all remaining industry categories are merged into one class, so that the industry category distribution of the samples is balanced.
3. The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning according to claim 1, characterized in that, in step 2), the word segmentation engine used to segment the enterprise names is jieba, and the removed place-name words include place names at or above the township/town level of administrative division.
4. The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning according to claim 1, characterized in that step 3) is specifically as follows:
3.1) Extract text features: first, find the set of words or n-gram phrases in the training dataset and the test dataset, N words or phrases in total; then number these words or phrases from 1 to N and use the numbered words as the corpus; then, for any sample in the training dataset or the test dataset, construct an N-dimensional vector in which the value of the m-th dimension is the frequency in that sample of the word numbered m; the N-dimensional vector so constructed is the extracted text feature;
3.2) Train the multinomial naive Bayes model: using the text features of the training-set data, tune the text feature parameter n and the smoothing parameter α of the multinomial naive Bayes model by grid search based on 10-fold cross-validation, with classification accuracy as the cross-validation metric; finally, select the parameters with the highest average classification accuracy as the optimal parameters;
3.3) After the optimal model parameters are determined, assess the model by the classification accuracy Acc and the Kappa coefficient on the test dataset.
5. The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning according to claim 1, characterized in that, in step 4), the regular grid of specified size is generated from the topology of the area to be studied as follows: from the topology of the area to be studied, compute the extent of its minimum bounding rectangle, then, starting from one vertex of the minimum bounding rectangle, divide the grid according to the preset cell size to obtain the grid cells; the number of enterprises of each industry category and each soil heavy metal pollution index in every grid cell are counted as follows: count the number of enterprise POI points of each industry category falling in each grid cell, the count representing the degree of enterprise agglomeration in that cell; at the same time, compute each soil heavy metal pollution index in each grid cell, and if a cell contains several survey points, then for a given soil heavy metal element the mean pollution index of that element over all survey points in the cell is taken as the cell's pollution index for that element; bivariate spatial correlation analysis is performed on the grid cells separately for each industry category and each soil heavy metal element, using the following formula:
In formula (2), the a-attribute term is the binarized value of attribute a in grid cell i, where attribute a is the pollution index of a given soil heavy metal in the cell; the binarization is as follows: when the soil heavy metal pollution index is less than or equal to 1, it is redefined as 0, and when the soil heavy metal pollution index is greater than 1, it is redefined as 1; the b-attribute term is the value of attribute b in grid cell i after z-score standardization, where attribute b is the number of enterprise POI points of a given industry category in the cell; $w_{ij}$ is the spatial weight matrix, and the resulting index is the local spatial correlation index between attribute a and attribute b at grid cell i; if the index is significantly positive, the degree of soil heavy metal pollution at grid cell i is positively correlated with the degree of enterprise agglomeration in the neighbouring area; if it is significantly negative, the two are negatively correlated; if it is not significant, there is no apparent association between them; the corresponding spatial cluster map is formed from the index values obtained for all grid cells.
6. The method for identifying enterprise pollution sources of soil heavy metals based on source-sink spatial variable reasoning according to claim 1, characterized in that, in step 5), the spatial distribution relationship between soil heavy metal pollution and polluting enterprises is judged from the spatial cluster map; if, in the spatial cluster map, a given soil heavy metal pollution index attribute and the count attribute of enterprise POI points of a given industry category are High-High in a region, the pollution source of this soil heavy metal in the region is judged to be possibly point-source pollution from such enterprises; if a soil heavy metal pollution index attribute is High-Low in a region for the count attributes of all enterprise POI points, the pollution source of the soil heavy metal in the region is judged to be possibly area-source pollution.
CN201810239430.7A 2018-03-22 2018-03-22 Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning Active CN108595414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810239430.7A CN108595414B (en) 2018-03-22 2018-03-22 Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Publications (2)

Publication Number Publication Date
CN108595414A true CN108595414A (en) 2018-09-28
CN108595414B CN108595414B (en) 2020-07-10

Family

ID=63626992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810239430.7A Active CN108595414B (en) 2018-03-22 2018-03-22 Soil heavy metal enterprise pollution source identification method based on source-sink space variable reasoning

Country Status (1)

Country Link
CN (1) CN108595414B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138668A (en) * 2015-09-06 2015-12-09 中山大学 Urban business center and retailing format concentrated area identification method based on POI data
CN105844301A (en) * 2016-04-05 2016-08-10 北华航天工业学院 Soil heavy metal pollution source analysis method based on Bayes source identification
CN107657284A (en) * 2017-10-11 2018-02-02 宁波爱信诺航天信息有限公司 A kind of trade name sorting technique and system based on Semantic Similarity extension

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yunfeng Xie, Tong-Bin Chen, Mei Lei, Jun Yang, Qing-Jun Guo, Bo Song, Xiao-Y: 《Spatial distribution of soil heavy metal pollution estimated by different interpolation methods: Accuracy and uncertainty analysis》, 《CHEMOSPHERE》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785912A (en) * 2019-02-13 2019-05-21 中国科学院大气物理研究所 A kind of factor method for quickly identifying and device for target contaminant source resolution
CN110175739A (en) * 2019-04-12 2019-08-27 广东省生态环境技术研究所 A kind of heavy industries pollution Source Apportionment, system and storage medium
CN110175647A (en) * 2019-05-28 2019-08-27 北华航天工业学院 A kind of pollution source discrimination clustered based on principal component analysis and K-means
CN110706004A (en) * 2019-06-27 2020-01-17 华南农业大学 Farmland heavy metal pollutant tracing method based on hierarchical clustering
CN110706004B (en) * 2019-06-27 2022-03-29 华南农业大学 Farmland heavy metal pollutant tracing method based on hierarchical clustering
CN111310803A (en) * 2020-01-20 2020-06-19 江苏神彩科技股份有限公司 Environment data processing method and device
CN112084286B (en) * 2020-09-14 2021-06-29 智慧足迹数据科技有限公司 Spatial data processing method and device, computer equipment and storage medium
CN112084286A (en) * 2020-09-14 2020-12-15 智慧足迹数据科技有限公司 Spatial data processing method and device, computer equipment and storage medium
CN112288247A (en) * 2020-10-20 2021-01-29 浙江大学 Soil heavy metal risk identification method based on space interaction relation
CN112288247B (en) * 2020-10-20 2024-04-09 浙江大学 Soil heavy metal risk identification method based on space interaction relationship
CN112903660A (en) * 2021-03-11 2021-06-04 广西大学 Method for judging current situation and source of pollution of watershed water body
CN113902249A (en) * 2021-09-02 2022-01-07 北京市农林科学院信息技术研究中心 Method and device for analyzing soil heavy metal influence factors
CN113902249B (en) * 2021-09-02 2022-07-22 北京市农林科学院信息技术研究中心 Method and device for analyzing soil heavy metal influence factors
CN116662853A (en) * 2023-05-29 2023-08-29 新禾数字科技(无锡)有限公司 Method and system for automatically identifying analysis result of pollution source
CN116662853B (en) * 2023-05-29 2024-04-30 新禾数字科技(无锡)有限公司 Method and system for automatically identifying analysis result of pollution source

Also Published As

Publication number Publication date
CN108595414B (en) 2020-07-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant