CN113011478A - Pollution source identification method and system based on data fusion - Google Patents

Pollution source identification method and system based on data fusion

Info

Publication number
CN113011478A
Authority
CN
China
Prior art keywords
data
pollution source
sample
module
ultraviolet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110246412.3A
Other languages
Chinese (zh)
Inventor
吴静
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110246412.3A
Publication of CN113011478A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/17Systems in which incident light is modified in accordance with the properties of the material investigated
    • G01N21/25Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
    • G01N21/31Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/17Systems in which incident light is modified in accordance with the properties of the material investigated
    • G01N21/25Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
    • G01N21/31Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
    • G01N21/33Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using ultraviolet light
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/62Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light
    • G01N21/63Systems in which the material investigated is excited whereby it emits light or causes a change in wavelength of the incident light optically excited
    • G01N21/64Fluorescence; Phosphorescence
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/18Water
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention discloses a pollution source identification method and a system based on data fusion, wherein the method comprises the following steps: carrying out pollution index test after preprocessing a pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data; preprocessing test data and then extracting features; splicing the extracted characteristic data to construct fusion data; establishing a pollution source identification model according to the fusion data and a classification algorithm and training; and identifying the pollution source through the trained pollution source identification model. The system comprises a sampling module, a sample preprocessing module, a sample introduction module, an analysis testing module, a data transmission module and a display and system control module. The method performs data fusion on the conventional water quality, the ultraviolet-visible absorption spectrum and the three-dimensional fluorescence spectrum and is applied to pollution source identification, and compared with the traditional pollution source identification method, the method is accurate, intelligent, low in cost and strong in operability, and has important significance for tracing the pollution source.

Description

Pollution source identification method and system based on data fusion
Technical Field
The invention relates to the technical field of environmental supervision, in particular to a pollution source identification method and system based on data fusion.
Background
At present, tracing covert, illegal, and leaking wastewater discharges back to the responsible discharging unit relies mainly on manual investigation. After a pollution accident occurs, investigators check the water quality of the discharging units step by step from the accident site upstream in order to locate the illegal discharge. This approach is time-consuming, labor-intensive, and inefficient, and its results often arrive too late to be useful.
In recent years, manual investigation assisted by characteristic pollutant databases has emerged. Comparing pollutant data from the polluted water body, soil, or atmosphere with the characteristic pollutant database of candidate pollution sources narrows the scope of investigation, reduces the workload, and improves efficiency. Earlier, Wan Pingyu et al. of Beijing University of Chemical Technology proposed a chemical watermark information database for tracing water pollution, containing pollution source information such as anion species, organic species, metal element species, and fluorescence information. The related art also provides a highly operable database of water pollution emission sources comprising three sub-databases: a pollution source basic information database, a conventional water quality database, and a water quality fingerprint database. These databases are complex and costly, and they are difficult to use for online early-warning tracing.
In addition, these approaches lack an automatic pollution source comparison method and a complete system. In practice, the pollution source is often determined by manual comparison, which requires staff with strong professional knowledge and experience; the result is highly subjective and lacks scientific, quantitative data support. Under complex conditions, the misjudgment rate is high and identification is still considerably delayed. An automatic pollution source comparison method and system is therefore needed to improve the accuracy and timeliness of pollution source tracing.
With a data-driven approach, a discrimination model built with chemometric analysis methods can identify pollution sources intelligently, accurately, and efficiently. The pollutant composition of a pollution source is generally complex, and a single index characterizes it only partially. For similar pollution sources (e.g., different enterprises in the same industry), identification models based on a single index tend to be less accurate. A data fusion strategy integrates different kinds of pollution information and reflects the characteristics of a pollution source more comprehensively. An identification model based on data fusion can therefore be expected to outperform one based on a single data source and to improve the accuracy of pollution source identification.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide a pollution source identification method based on data fusion, which is accurate, intelligent, low in cost, strong in operability, beneficial to large-scale popularization and significant in pollution source tracing.
Another objective of the present invention is to provide a pollution source identification system based on data fusion.
In order to achieve the above object, an embodiment of an aspect of the present invention provides a pollution source identification method based on data fusion, including the following steps:
collecting a pollution source sample;
pretreating the pollution source sample;
carrying out pollution index test on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data;
preprocessing the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data;
performing characteristic extraction on the preprocessed conventional water quality data, the preprocessed ultraviolet-visible absorption spectrum data and the preprocessed three-dimensional fluorescence spectrum data;
splicing the characteristic data extracted from the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data to construct fused data;
establishing a pollution source identification model according to the fusion data and a classification algorithm and training;
and identifying the pollution source through the trained pollution source identification model.
In addition, the pollution source identification method based on data fusion according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, the pollution source sample includes a water sample, a soil sample, and an atmospheric sample, and pretreating the pollution source sample includes: filtering the water sample through a 0.2-10.0 μm filter membrane; leaching the soil sample with ultrapure water and filtering the soil leachate through a 0.2-10.0 μm filter membrane; and dissolving the gas sample in ultrapure water and then filtering it through a 0.2-10.0 μm filter membrane.
Further, in one embodiment of the present invention, the pollution index test includes conventional water quality, ultraviolet-visible absorption spectrum, and three-dimensional fluorescence spectrum;
the conventional water quality includes but is not limited to pH value, conductivity, chemical oxygen demand, total nitrogen, ammonia nitrogen and total phosphorus;
the scanning range of the ultraviolet-visible absorption spectrum is 200-800 nm, and the scanning interval is 0.1-10 nm;
the scanning range of the excitation wavelength of the three-dimensional fluorescence spectrum is 200-600 nm, the scanning range of the emission wavelength is 230-700 nm, and the scanning interval is 1-10 nm.
Further, in an embodiment of the present invention, preprocessing the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data includes:
preprocessing the conventional water quality data: eliminating abnormal data using the Pauta criterion (3σ criterion);
preprocessing the ultraviolet-visible absorption spectrum: removing invalid data from the spectral data and then applying a standard normal variate (SNV) transformation to the spectral data;
preprocessing the three-dimensional fluorescence spectrum: converting the fluorescence intensity of the raw fluorescence fingerprint into Raman units (R.U.) using the integral of the Raman scattering intensity of ultrapure water at an excitation wavelength of 350 nm over emission wavelengths of 370-430 nm.
Further, in an embodiment of the present invention, feature extraction is performed on the preprocessed test data, including feature extraction on the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data;
the feature extraction methods for the conventional water quality data and the ultraviolet-visible absorption spectrum data include principal component analysis, non-negative matrix factorization and independent component analysis;
the feature extraction of the three-dimensional fluorescence spectrum data extracts the main fluorescence components of the three-dimensional fluorescence spectrum using parallel factor analysis.
Further, in one embodiment of the present invention, the classification algorithm includes, but is not limited to, partial least squares discriminant analysis, support vector machine, and the K-nearest neighbor algorithm.
Further, in an embodiment of the present invention, establishing and training the pollution source identification model according to the fused data and the classification algorithm further includes:
model initialization: selecting 75-95% of the sample data as a training set, establishing the pollution source identification model using cross-validation, and selecting the optimal number of latent variables by minimizing the cross-validation error;
model training: setting the number of latent variables to the optimal number and fitting the pollution source identification model again;
model prediction: predicting the remaining 5-25% of the sample data with the fitted pollution source identification model and evaluating the performance of the model from the prediction results, wherein the evaluation parameters are the sensitivity, specificity, precision and accuracy of the pollution source identification model.
In order to achieve the above object, another embodiment of the present invention provides a pollution source identification system based on data fusion, including:
the device comprises a sampling module, a sample preprocessing module, a sample introduction module, an analysis testing module, a data transmission module, a display module and a system control module;
the sampling module is used for collecting a pollution source sample;
the sample pretreatment module is used for pretreating the pollution source sample;
the sample introduction module is used for conveying the pretreated sample to the analysis and test module;
the analysis testing module is used for carrying out pollution index testing on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data;
the data transmission module is used for transmitting data among the modules;
the display module is used for displaying data;
the system control module embeds the pollution source identification model and controls the system: it obtains the conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data through the data transmission module; displays the results on the display module in real time using configuration software; passes the data collected in the configuration software to the pollution source identification model, which identifies the type of the sample; and returns the identification result to the display module for display.
Further, in one embodiment of the present invention, the analysis testing module includes, but is not limited to, an online pH meter, an online conductivity meter, an online COD analyzer, an online ammonia nitrogen analyzer, an online total nitrogen analyzer, an online total phosphorus analyzer, an ultraviolet-visible absorption spectrophotometer, and a fluorescence spectrophotometer.
The pollution source identification method and system based on data fusion in the embodiment of the invention have the following advantages:
(1) the data fusion technology is applied to pollution source identification, and a pollution source identification model is established, so that the dependence on technical experts is broken through, and the tracing efficiency is improved; the source tracing accuracy is improved, particularly the source tracing accuracy under the complex pollution emission scene;
(2) the pollution index analysis and test method selected by data fusion is mature and reliable, simple and convenient to operate, low in cost, rich in information and beneficial to large-scale popularization;
(3) manual comparison usually takes hours, whereas the pollution source identification system of the invention performs the comparison in real time, which improves the timeliness of pollution source tracing.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a pollution source identification method based on data fusion according to an embodiment of the present invention;
FIG. 2 is a flow diagram of pollution source identification model building according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of the cross-validation error rate as a function of the number of latent variables, according to one embodiment of the present invention;
FIG. 4 shows the calculated response values of the model for class 1 (ZQ) in the training and prediction phases, according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a pollution source identification system based on data fusion according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a pollution source identification method and system based on data fusion according to an embodiment of the present invention with reference to the accompanying drawings.
First, a pollution source identification method based on data fusion proposed according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a flow chart of a pollution source identification method based on data fusion according to an embodiment of the present invention.
FIG. 2 is a flow chart of pollution source identification model building according to one embodiment of the present invention.
As shown in fig. 1 and 2, the pollution source identification method based on data fusion includes the following steps:
Step S1: a pollution source sample is collected.
Step S2: the pollution source sample is pretreated.
The pollution source samples include water samples, soil samples and atmospheric samples. Pretreating the pollution source sample includes: filtering the water sample through a 0.2-10.0 μm filter membrane; leaching the soil sample with ultrapure water and filtering the soil leachate through a 0.2-10.0 μm filter membrane; and dissolving the gas sample in ultrapure water and then filtering it through a 0.2-10.0 μm filter membrane.
Step S3: a pollution index test is carried out on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data.
Further, the pollution index analytical tests include, but are not limited to, conventional water quality, ultraviolet-visible absorption spectra, and three-dimensional fluorescence spectra.
Conventional water qualities include, but are not limited to, pH, conductivity, chemical oxygen demand, total nitrogen, ammonia nitrogen, and total phosphorus; the scanning range of the ultraviolet-visible absorption spectrum is 200-800 nm, and the scanning interval is 0.1-10 nm; the scanning range of the excitation wavelength of the three-dimensional fluorescence spectrum is 200-600 nm, the scanning range of the emission wavelength is 230-700 nm, and the scanning interval is 1-10 nm.
Step S4: the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data are preprocessed.
The data preprocessing covers the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data.
Preprocessing of the conventional water quality data means removing abnormal data using the Pauta criterion (3σ criterion).
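The 3σ rule described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `pauta_filter`, the single-pass application, the use of the sample standard deviation, and the pH readings are all assumptions made for the example.

```python
from statistics import mean, stdev

def pauta_filter(values):
    """Split values into (kept, rejected) using one pass of the 3-sigma rule."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1 denominator)
    kept = [v for v in values if abs(v - mu) <= 3 * sigma]
    rejected = [v for v in values if abs(v - mu) > 3 * sigma]
    return kept, rejected

# Hypothetical pH readings: 20 plausible values plus one gross error.
ph_readings = [6.9, 7.1] * 10 + [12.0]
kept, rejected = pauta_filter(ph_readings)
```

Note that the rule only becomes meaningful with enough samples: for very small data sets no single point can lie more than three sample standard deviations from the mean.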
Preprocessing of the ultraviolet-visible absorption spectrum means removing the part of the spectral data that carries no useful information (i.e., the region that is essentially all solvent background absorption) and then applying a standard normal variate (SNV) transformation to the spectrum to reduce the influence of scattering. The calculation formula is:
$$S_{SNV} = \frac{S_k - \bar{S}}{\sqrt{\dfrac{\sum_{k=1}^{m} \left(S_k - \bar{S}\right)^2}{m - 1}}}$$
where $S_{SNV}$ is the transformed data, $S_k$ is the raw data at the k-th wavelength point, $\bar{S} = \frac{1}{m}\sum_{k=1}^{m} S_k$ is the mean over all wavelength points of the raw spectrum, and $m$ is the number of wavelength points in the spectrum.
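As a rough sketch of the SNV transformation, each spectrum is centered on its own mean and scaled by its own standard deviation. The function name `snv` and the absorbance values are hypothetical, and the code assumes the sample standard deviation (m - 1 denominator) that SNV conventionally uses.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: center one spectrum and scale it by its own standard deviation."""
    s_bar = mean(spectrum)
    s_std = stdev(spectrum)  # assumption: m - 1 denominator
    return [(s_k - s_bar) / s_std for s_k in spectrum]

# Hypothetical absorbance values at five wavelength points.
absorbance = [0.82, 0.61, 0.40, 0.22, 0.10]
transformed = snv(absorbance)
```

After the transformation, every spectrum has zero mean and unit standard deviation, so multiplicative and additive scattering offsets between samples are suppressed.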
Preprocessing of the three-dimensional fluorescence spectrum means converting the fluorescence intensity of the raw fluorescence fingerprint into Raman units (R.U.) using the integral of the Raman scattering intensity of ultrapure water at an excitation wavelength of 350 nm over emission wavelengths of 370-430 nm.
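One way to picture the Raman-unit conversion is to integrate the ultrapure-water emission scan over 370-430 nm and divide every intensity of the sample spectrum by that area. The sketch below assumes trapezoidal integration and uses made-up numbers; the names `trapezoid_area` and `to_raman_units` are illustrative only, and any blank subtraction an actual workflow might apply is left out.

```python
def trapezoid_area(x, y):
    """Integrate y over x with the trapezoidal rule."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1))

def to_raman_units(eem, water_em_nm, water_intensity, lo_nm=370, hi_nm=430):
    """Divide every EEM intensity by the water Raman peak area (Ex 350 nm, Em 370-430 nm)."""
    window = [(w, v) for w, v in zip(water_em_nm, water_intensity) if lo_nm <= w <= hi_nm]
    area = trapezoid_area([w for w, _ in window], [v for _, v in window])
    return [[value / area for value in row] for row in eem]

# Hypothetical ultrapure-water emission scan at Ex = 350 nm (flat, for easy checking).
water_em_nm = list(range(360, 445, 5))       # 360, 365, ..., 440 nm
water_intensity = [2.0] * len(water_em_nm)   # arbitrary instrument units
raw_eem = [[240.0, 120.0], [60.0, 0.0]]      # tiny stand-in for a sample spectrum
eem_ru = to_raman_units(raw_eem, water_em_nm, water_intensity)
```

With the flat water scan above, the integrated area is 120, so a raw intensity of 240 becomes 2.0 R.U.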
Step S5: feature extraction is performed on the preprocessed conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data.
The feature extraction covers the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data.
Feature extraction methods for the conventional water quality data and the ultraviolet-visible absorption spectrum data include, but are not limited to, principal component analysis, non-negative matrix factorization and independent component analysis. As a preferred scheme, the invention adopts principal component analysis, whose basic principle is:
$$X = t_1 p_1^{T} + t_2 p_2^{T} + \cdots + t_F p_F^{T} + E = T P^{T} + E = \hat{X} + E$$
where $X$ represents the conventional water quality data or the ultraviolet-visible absorption spectrum data, $F$ is the number of principal components, $t_f$ is the score vector of the f-th principal component, and $p_f$ is the loading vector of the f-th principal component. The score vector $t_f$ gives the coordinates of the samples on the f-th principal component, i.e., their coordinates in the new variable. The loading vector $p_f$ represents the correlation between the original variables and the f-th principal component; the larger the loading, the more fully that principal component explains the variable. The F score vectors form the score matrix $T$, and the F loading vectors form the loading matrix $P$. Multiplying $T$ by $P^{T}$ gives the part of the data modeled by the F principal components, $\hat{X}$, and $E$ is the model residual.
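The score/loading decomposition can be illustrated with NIPALS, a classic iterative PCA algorithm from chemometrics. The source does not say which PCA algorithm is used, so NIPALS is an assumption here, as are the function name `nipals_pca`, the column centering, and the toy data; in practice a library routine would likely be used instead.

```python
import math

def nipals_pca(X, n_components, tol=1e-12, max_iter=500):
    """Return (T, P, E) such that column-centered X = T P^T + E.

    T[f] is the score vector t_f and P[f] is the unit-length loading vector p_f.
    """
    n, m = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(m)]
    E = [[row[j] - col_means[j] for j in range(m)] for row in X]
    T, P = [], []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            p_norm = math.sqrt(sum(v * v for v in p))
            p = [v / p_norm for v in p]
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        T.append(t)
        P.append(p)
        # Deflate: subtract the part explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
    return T, P, E

# Hypothetical data: 4 samples x 3 variables.
X = [[2.0, 1.0, 0.5], [4.0, 3.0, 1.0], [6.0, 2.0, 4.0], [8.0, 6.0, 2.5]]
T, P, E = nipals_pca(X, n_components=2)
```

The rows of the score matrix (one value per retained component for each sample) are what would be carried forward as features; the residual E shrinks as more components are kept.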
Feature extraction of the three-dimensional fluorescence spectrum data means using parallel factor analysis to obtain the main fluorescence components of the three-dimensional fluorescence spectrum. The basic principle is:
$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + \varepsilon_{ijk}$$
where $x_{ijk}$ is the fluorescence intensity of the i-th sample at emission wavelength j and excitation wavelength k; $F$ is the number of factors; $a_{if}$, $b_{jf}$ and $c_{kf}$ are elements of the loading matrices $A$, $B$ and $C$, respectively; and $\varepsilon_{ijk}$ is the model residual, representing the part that cannot be explained by the model.
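To make the trilinear structure concrete, the sketch below evaluates the model part of parallel factor analysis for given loading matrices. It does not fit the model (fitting is usually done by alternating least squares in dedicated software); the function name `parafac_model` and all numbers are hypothetical.

```python
def parafac_model(A, B, C):
    """Evaluate x_hat[i][j][k] = sum over f of A[i][f] * B[j][f] * C[k][f] (no residual term)."""
    n_factors = len(A[0])
    return [[[sum(A[i][f] * B[j][f] * C[k][f] for f in range(n_factors))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Hypothetical loadings: 1 sample, 2 emission points, 2 excitation points, F = 2 factors.
A = [[1.0, 2.0]]              # sample-mode loadings (relative amount of each component)
B = [[1.0, 0.0], [0.0, 1.0]]  # emission-mode loadings
C = [[3.0, 4.0], [5.0, 6.0]]  # excitation-mode loadings
x_hat = parafac_model(A, B, C)
```

Each row of A describes one sample in terms of the F fluorescence components, which is presumably what serves as the per-sample feature vector in the later fusion step.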
Step S6: the feature data extracted from the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data are concatenated to construct the fused data.
The fused data are obtained by concatenating the feature data matrices extracted from the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data into a new matrix:
F=[A,B,C]
where F is the fused data matrix, and A, B and C are the feature data matrices extracted from the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data, respectively.
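The concatenation F=[A,B,C] amounts to joining the feature blocks column-wise, sample by sample. A minimal sketch follows; the function name `fuse_features` and the placeholder values are assumptions, while the block widths (3, 6 and 6) follow the worked example in this description.

```python
def fuse_features(*blocks):
    """Concatenate feature blocks row by row; every block must have one row per sample."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all feature blocks must describe the same samples")
    return [[value for block in blocks for value in block[i]] for i in range(n_samples)]

# Hypothetical feature blocks for two samples (values are placeholders).
water_pcs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # 3 water-quality principal components
uv_pcs = [[1.0] * 6, [2.0] * 6]                   # 6 UV-vis principal components
fluor_components = [[3.0] * 6, [4.0] * 6]         # 6 fluorescence component scores
fused = fuse_features(water_pcs, uv_pcs, fluor_components)
```

Each fused row then has 3 + 6 + 6 = 15 features per sample.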
Step S7: a pollution source identification model is established and trained according to the fused data and a classification algorithm.
Step S8: the pollution source is identified by the trained pollution source identification model.
Establishing the pollution source identification model means building a classification model on the feature-level fused data using a classification algorithm. Classification algorithms include, but are not limited to, partial least squares discriminant analysis (PLS-DA), support vector machine (SVM), and the K-nearest neighbor (KNN) algorithm. As a preferred scheme, the method adopts partial least squares discriminant analysis as the classification algorithm of the identification model. Establishing the identification model comprises model initialization, training, prediction, and performance evaluation.
Model initialization means selecting 75%-95% of the sample data as the training set, building the model with cross-validation, and choosing the optimal number of latent variables by minimizing the cross-validation error.
Model training means setting the number of latent variables to that optimal number and fitting the model again.
Model prediction means predicting the remaining 5%-25% of the samples with the trained model and evaluating the performance of the model from the prediction results.
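The initialization, training and prediction workflow can be sketched without any external library by swapping in a simpler classifier. The example below deliberately uses K-nearest neighbor (one of the alternative algorithms listed above) with k as the tuned parameter, standing in for PLS-DA and its number of latent variables; the data, the 80/20 split, the five folds and all function names are assumptions for illustration.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to query (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, test, k):
    """Fraction of test samples that the k-NN model built on train misclassifies."""
    return sum(knn_predict(train, x, k) != y for x, y in test) / len(test)

def cv_error(samples, k, n_folds=5):
    """Mean validation error over n_folds interleaved folds of an already shuffled list."""
    folds = [samples[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, fold in enumerate(folds):
        rest = [s for j, other in enumerate(folds) if j != i for s in other]
        errors.append(error_rate(rest, fold, k))
    return sum(errors) / n_folds

# Hypothetical fused features: 30 samples each from sources "A" and "B", well separated.
samples = [([0.01 * i, 0.0], "A") for i in range(30)]
samples += [([10 + 0.01 * i, 0.0], "B") for i in range(30)]
random.Random(0).shuffle(samples)

n_train = len(samples) * 4 // 5                  # 80 %, inside the 75-95 % range
train, holdout = samples[:n_train], samples[n_train:]

candidates = [1, 3, 5]
best_k = min(candidates, key=lambda k: cv_error(train, k))  # initialization: minimize CV error
holdout_error = error_rate(train, holdout, best_k)          # refit on all training data, then predict
```

With PLS-DA the structure would be the same: loop over candidate numbers of latent variables, keep the one with the lowest cross-validation error, refit on the full training set, and score the held-out samples.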
The model performance evaluation parameters include sensitivity (sn), specificity (sp), precision (pr), and accuracy (Acc). sn, sp, and pr are single-class performance parameters.
Sensitivity represents the ability of the classifier to correctly identify a certain class. Assume that there are two classes, with class 1 positive and class 2 negative. Taking class 1 as an example, the sensitivity of class 1 is the proportion of samples whose true value is positive that the model predicts correctly. The calculation formula is:
sn=TPN/(TPN+FNN)
where TPN is the number of samples whose true value is positive and whose classification result is positive (true positives), and FNN is the number of samples whose true value is positive but whose classification result is negative (false negatives).
Specificity represents the ability of the classifier to reject samples of the other class (class 2), i.e., to correctly identify negative samples. It is the proportion of samples whose true value is negative that the model predicts correctly. The calculation formula is:
sp=TNN/(FPN+TNN)
where TNN is the number of samples whose true value is negative and whose classification result is negative (true negatives), and FPN is the number of samples whose true value is negative but whose classification result is positive (false positives).
Precision represents the ability of the classifier to avoid assigning samples of other classes to a given class. Taking class 1 as an example, precision is the proportion of samples predicted as positive that are truly positive. The calculation formula is:
pr=TPN/(TPN+FPN)
Accuracy describes the proportion of correctly classified samples among all classified samples and carries no information about the classification performance for any single class. For class 1 and class 2 above, the accuracy calculation formula is:
Acc=(TPN+TNN)/(TPN+FNN+TNN+FPN)
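The four formulas can be checked with a small helper. The function name `classification_metrics` and the counts are hypothetical; the formulas are those given above.

```python
def classification_metrics(tpn, fnn, tnn, fpn):
    """Compute sn, sp, pr and Acc from the four confusion-matrix counts for one class."""
    return {
        "sn": tpn / (tpn + fnn),
        "sp": tnn / (fpn + tnn),
        "pr": tpn / (tpn + fpn),
        "Acc": (tpn + tnn) / (tpn + fnn + tnn + fpn),
    }

# Hypothetical counts for class 1 on a 20-sample prediction set.
metrics = classification_metrics(tpn=8, fnn=2, tnn=9, fpn=1)
```

For these counts, sn = 8/10, sp = 9/10, pr = 8/9 and Acc = 17/20, which shows how a model can have different sensitivity and precision for the same class.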
The invention is further illustrated below with reference to a specific embodiment and the accompanying drawings.
1) Contamination Source sample Collection
A and B are two pollution sources; 30 water samples were collected from each.
2) Sample pretreatment
The sample was filtered through a 0.45 μm filter.
3) Sample analysis test
The samples were tested for pH, conductivity, chemical oxygen demand, ammonia nitrogen, total nitrogen and total phosphorus. The results show that the conventional water quality difference between A and B is not significant: the pH value is 6-8, the conductivity is 900-1800 mu S/cm, and the total phosphorus is 1-11 mg/L. The chemical oxygen demand of A is between 150 and 1300mg/L, and the chemical oxygen demand of B is between 70 and 600 mg/L. The ammonia nitrogen of A is 50-150 mg/L, and the ammonia nitrogen of B is 30-130 mg/L. The total nitrogen of A is between 60 and 130mg/L, and the total nitrogen of B is between 20 and 110 mg/L.
The sample was tested for uv-visible absorption spectra and three-dimensional fluorescence spectra. The scanning range of the ultraviolet-visible absorption spectrum is 200-800 nm, and the scanning interval is 0.2 nm. The result shows that the ultraviolet-visible absorption spectra of A and B basically show a single exponential decline trend, a weak absorption peak exists near 260-280 nm, and the difference is not obvious.
The three-dimensional fluorescence spectra of the samples were measured with excitation wavelengths of 220-600 nm, emission wavelengths of 230-650 nm, and a scanning interval of 5 nm. Typical three-dimensional fluorescence spectra of both A and B show two fluorescence peaks, near excitation/emission wavelengths (denoted Ex/Em) of 225/340 nm and 275/340 nm, and the difference is not significant.
4) Data pre-processing
The conventional water quality data were screened with the Pauta criterion (3σ rule); no abnormal samples were found, so the conventional water quality data of all samples can be used for modeling.
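The outlier screening of the conventional water quality data can be sketched as follows, assuming the criterion is the usual 3σ (Pauta) rule: a value is flagged when it lies more than three sample standard deviations from the mean. The function name and the pH readings are hypothetical.

```python
from statistics import mean, stdev


def pauta_outliers(values):
    """Return indices of values more than 3 sample standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]


# Hypothetical pH readings: 29 ordinary values plus one gross error at index 29.
readings = [7.0 + 0.1 * ((i % 5) - 2) for i in range(29)] + [12.0]
flagged = pauta_outliers(readings)
```

The rule only becomes meaningful with enough samples; with very few readings no single value can exceed three standard deviations, so nothing would ever be flagged.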
The absorbance of the raw ultraviolet-visible absorption spectra is almost zero above 500 nm, so this part of the spectrum was discarded. The remaining spectra were then subjected to a standard normal variate (SNV) transformation.
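A minimal sketch of this spectral preprocessing, assuming the transformation meant here is the standard normal variate (SNV) transform, in which each spectrum is centered on its own mean and scaled by its own standard deviation. The function names and the coarse seven-point spectrum are illustrative only.

```python
from statistics import mean, stdev


def truncate_spectrum(wavelengths, absorbances, max_nm=500.0):
    """Keep only the points at or below max_nm."""
    kept = [(w, a) for w, a in zip(wavelengths, absorbances) if w <= max_nm]
    return [w for w, _ in kept], [a for _, a in kept]


def snv(absorbances):
    """Center one spectrum on its own mean and scale by its own standard deviation."""
    mu = mean(absorbances)
    sigma = stdev(absorbances)
    if sigma == 0:
        raise ValueError("a flat spectrum cannot be SNV-scaled")
    return [(a - mu) / sigma for a in absorbances]


# Hypothetical coarse spectrum; the embodiment scans 200-800 nm at 0.2 nm.
wl = [200, 300, 400, 500, 600, 700, 800]
ab = [1.2, 0.6, 0.3, 0.1, 0.0, 0.0, 0.0]
kept_wl, kept_ab = truncate_spectrum(wl, ab)
scaled = snv(kept_ab)
```

Because SNV is applied per spectrum, it needs no statistics from the training set and can be applied identically to training and prediction samples.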
The fluorescence intensity of the raw three-dimensional fluorescence spectra was converted into Raman units (R.U.) using the integral of the Raman scattering intensity of ultrapure water at an excitation wavelength of 350 nm over emission wavelengths of 370-430 nm.
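The Raman-unit conversion can be sketched as dividing every fluorescence intensity by the area under the ultrapure-water Raman peak. The sketch below integrates the water emission scan at 350 nm excitation over 370-430 nm with the trapezoidal rule. The integration rule, the function names and all numbers are assumptions made for illustration.

```python
def trapezoid_area(x, y):
    """Trapezoidal integral of y over x (x must be ascending)."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1))


def raman_area(emission_nm, water_intensity, lo=370.0, hi=430.0):
    """Area of the water Raman peak between lo and hi nm (scan taken at Ex = 350 nm)."""
    pts = [(w, v) for w, v in zip(emission_nm, water_intensity) if lo <= w <= hi]
    if len(pts) < 2:
        raise ValueError("need at least two points inside the integration window")
    xs, ys = zip(*pts)
    return trapezoid_area(xs, ys)


def to_raman_units(eem, area):
    """Divide every intensity of an excitation-emission matrix by the Raman area."""
    return [[v / area for v in row] for row in eem]


# Hypothetical water scan and a tiny 2 x 2 excitation-emission matrix.
em = [360, 370, 380, 390, 400, 410, 420, 430, 440]
water = [0.0, 0.0, 10.0, 30.0, 50.0, 30.0, 10.0, 0.0, 0.0]
area = raman_area(em, water)
eem_ru = to_raman_units([[1300.0, 650.0], [0.0, 2600.0]], area)
```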
5) Feature extraction
Principal component analysis was performed on the conventional water quality data. The cumulative variance contribution rate of the first three principal components reaches 95.53%, covering most of the information in the original data. Therefore, three principal components, PC1 to PC3, were extracted from the six conventional water quality indicators.
Principal component analysis was performed on the preprocessed ultraviolet-visible absorption spectra. The cumulative variance contribution rate of the first six principal components reaches 98.78%, covering most of the information in the original data. Therefore, six principal components, UVPC1 to UVPC6, were extracted from the ultraviolet-visible absorption spectrum data.
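Choosing the number of principal components from the cumulative variance contribution rate can be sketched as below. The text reports only the cumulative figures (95.53% and 98.78%); the 0.95 threshold and the individual per-component ratios are hypothetical values chosen so that three components reach 95.53%.

```python
def components_needed(explained_ratio, threshold):
    """Return (k, cumulative) for the smallest k whose cumulative ratio reaches threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k, cumulative
    return len(explained_ratio), cumulative


# Hypothetical variance ratios for the six conventional water quality indicators.
ratios = [0.62, 0.24, 0.0953, 0.03, 0.01, 0.0047]
k, cum = components_needed(ratios, threshold=0.95)
```

In practice the ratios would come from the principal component analysis itself, for example the normalized eigenvalues of the covariance matrix of the standardized data.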
Parallel factor analysis (PARAFAC) was performed on the preprocessed three-dimensional fluorescence spectrum data, yielding six main fluorescence components, F1 to F6.
6) Data fusion
PC1 to PC3, UVPC1 to UVPC6, and F1 to F6 were concatenated to construct the fused data FD = [PC1, PC2, PC3, UVPC1, UVPC2, UVPC3, UVPC4, UVPC5, UVPC6, F1, F2, F3, F4, F5, F6].
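The concatenation into FD can be sketched as a per-sample join of the three feature blocks, giving 3 + 6 + 6 = 15 features per sample. The function name and the placeholder scores are assumptions for illustration.

```python
def fuse(pc_scores, uv_scores, fluor_scores):
    """Concatenate the three feature blocks sample by sample into fused rows."""
    if not (len(pc_scores) == len(uv_scores) == len(fluor_scores)):
        raise ValueError("all blocks must describe the same samples in the same order")
    return [list(p) + list(u) + list(f)
            for p, u, f in zip(pc_scores, uv_scores, fluor_scores)]


# Placeholder scores for two samples: 3 PCs, 6 UVPCs and 6 fluorescence components.
pc = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]
uv = [[0.0] * 6, [1.0] * 6]
fl = [[0.5] * 6, [1.5] * 6]
fd = fuse(pc, uv, fl)
```

The blocks have different scales, which is one reason the modeling steps normalize the fused data before fitting.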
7) Establishing a pollution source identification model
a) The samples of A are treated as class 1 and the samples of B as class 2. 23 samples of A and 23 samples of B are used as training samples, and the remaining samples are used as prediction samples;
b) The training samples are imported and the data are normalized;
c) A preliminary PLS-DA classification model is built, and the optimal number of latent variables is selected by minimizing the cross-validation error rate. As shown in FIG. 3, the cross-validation error rate takes its minimum value of 0 for 1 to 5 latent variables, so in theory any number from 1 to 5 could be chosen. However, too few latent variables may carry too little information and degrade prediction performance, while too many may introduce redundant information and add noise during prediction. The intermediate value 3 is therefore preferred as the optimal number of latent variables in this embodiment;
d) The number of latent variables is set to 3, and the model is refitted and saved;
e) The prediction samples are imported and the data are normalized;
f) The prediction samples are classified and identified with the established model;
g) The prediction performance of the model is checked. The results show that the established identification model predicts A and B perfectly: no sample is misclassified in either the training or the prediction phase, see FIG. 4. The single-class performance parameters sensitivity, specificity and precision are all 1, and the overall performance parameter accuracy is also 1.
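Two of the steps above can be sketched without any modeling library: the 23/7 split per class in step a) and the tie-breaking rule of step c), which takes the middle value among the candidates sharing the minimum cross-validation error. The random selection of training samples, the sample identifiers and the error rates for 6 and 7 variables are assumptions; the text does not say how the 23 samples were chosen, and fitting the PLS-DA model itself is not shown.

```python
import random


def split_per_class(sample_ids_by_class, n_train, seed=0):
    """Randomly pick n_train samples of each class for training; the rest are for prediction."""
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in sorted(sample_ids_by_class.items()):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        train += [(s, label) for s in shuffled[:n_train]]
        test += [(s, label) for s in shuffled[n_train:]]
    return train, test


def pick_latent_variables(cv_error_by_n):
    """Among the counts with minimal CV error, return the middle one (upper middle if even)."""
    best = min(cv_error_by_n.values())
    tied = sorted(n for n, err in cv_error_by_n.items() if err == best)
    return tied[len(tied) // 2]


samples = {1: [f"A{i}" for i in range(1, 31)], 2: [f"B{i}" for i in range(1, 31)]}
train, test = split_per_class(samples, n_train=23)

# CV error is 0 for 1 to 5 variables as in FIG. 3; the values for 6 and 7 are hypothetical.
cv_errors = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.02, 7: 0.04}
n_lv = pick_latent_variables(cv_errors)
```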
The pollution source identification method based on data fusion provided by the embodiment of the invention comprises sample collection, sample analysis and testing, data preprocessing, feature extraction, data fusion, establishment of the identification model and construction of the pollution source identification system. The conventional water quality, ultraviolet-visible absorption spectrum and three-dimensional fluorescence spectrum data are fused and applied to pollution source identification. Compared with traditional pollution source identification methods, the method is accurate, intelligent, low in cost and easy to operate, lends itself to large-scale deployment, and is of significant value for pollution source tracing.
Next, a pollution source identification system based on data fusion proposed according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 5 is a schematic structural diagram of a pollution source identification system based on data fusion according to an embodiment of the present invention.
As shown in FIG. 5, the pollution source identification system based on data fusion includes a sampling module, a sample pretreatment module, a sample introduction module, an analysis testing module, a data transmission module, a display module and a system control module;
the sampling module is used for collecting a pollution source sample;
the sample pretreatment module is used for pretreating a pollution source sample;
the sample introduction module is used for conveying the pretreated sample to the analysis and test module;
the analysis testing module is used for carrying out pollution index testing on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data;
the data transmission module is used for transmitting data among the modules;
the display module is used for displaying data;
the system control module is used for embedding the pollution source identification model and controlling the system: it obtains the conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data through the data transmission module and displays the results on the display module in real time by means of configuration software; the conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data collected in the configuration software are transmitted to the pollution source identification model, which determines the pollution source type of the sample and returns the result to the display module for display.
The analysis testing module includes, but is not limited to, an online pH meter, an online conductivity meter, an online COD analyzer, an online ammonia nitrogen analyzer, an online total nitrogen analyzer, an online total phosphorus analyzer, an ultraviolet-visible absorption spectrophotometer and a fluorescence spectrophotometer.
The method for building the pollution source identification system comprises the following steps:
1) The sampling module, sample pretreatment module, sample introduction module, analysis testing module, data transmission module, display module and system control module are connected together, wherein the analysis testing module comprises an online pH meter, an online conductivity meter, an online COD (chemical oxygen demand) analyzer, an online ammonia nitrogen analyzer, an online total nitrogen analyzer, an online total phosphorus analyzer, an ultraviolet-visible absorption spectrophotometer and a fluorescence spectrophotometer;
2) The pollution source identification model is embedded into the system control module;
3) The system control module starts the sampling module, and the sample is delivered to the analysis testing module through the sample introduction module; the measured data are then transmitted to the system control module through the data transmission module, and the results are displayed on the display module in real time using the configuration software MCGS;
4) Data are exchanged between the pollution source identification model and the MCGS industrial-PC configuration software using OPC technology. The water quality data collected in the configuration software are transmitted to the pollution source identification model, which determines the sample class and returns the result to the display module for display;
5) Steps 3) and 4) are repeated continuously to achieve real-time comparison of pollution sources.
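The repeated acquire, identify and display cycle of steps 3) to 5) can be sketched as a plain control loop. Everything here is hypothetical: the callables stand in for the analysis testing module, the embedded identification model and the display module, and the OPC exchange with MCGS is not modeled. The threshold rule in the example is only a placeholder, not the PLS-DA model of the embodiment.

```python
from typing import Callable, Dict, List

Measurement = Dict[str, List[float]]


def run_cycles(acquire: Callable[[], Measurement],
               identify: Callable[[Measurement], str],
               display: Callable[[str], None],
               cycles: int) -> List[str]:
    """Repeat: acquire one measurement, classify it, show the result."""
    results = []
    for _ in range(cycles):
        measurement = acquire()        # step 3): sample, analyze, transmit
        label = identify(measurement)  # step 4): model judges the sample class
        display(label)                 # step 4): result returned to the display
        results.append(label)
    return results


# Stub modules with made-up readings; a real system would poll the instruments.
shown: List[str] = []
labels = run_cycles(
    acquire=lambda: {"water_quality": [7.2, 1200.0],
                     "uv_vis": [0.8, 0.4],
                     "fluorescence": [1.5, 0.9]},
    identify=lambda m: "A" if m["water_quality"][1] > 1000.0 else "B",
    display=shown.append,
    cycles=3,
)
```

Passing the three stages in as callables keeps the loop independent of any particular instrument driver or communication layer.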
It should be noted that the foregoing explanation of the method embodiment is also applicable to the system of this embodiment, and is not repeated here.
The pollution source identification system based on data fusion provided by the embodiment of the invention covers sample collection, sample analysis and testing, data preprocessing, feature extraction, data fusion, establishment of the identification model and construction of the pollution source identification system. The conventional water quality, ultraviolet-visible absorption spectrum and three-dimensional fluorescence spectrum data are fused and applied to pollution source identification. Compared with traditional pollution source identification methods, the system is accurate, intelligent, low in cost and easy to operate, lends itself to large-scale deployment, and is of significant value for pollution source tracing.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (9)

1. A pollution source identification method based on data fusion is characterized by comprising the following steps:
collecting a pollution source sample;
pretreating the pollution source sample;
carrying out pollution index test on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data;
preprocessing the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data;
performing characteristic extraction on the preprocessed conventional water quality data, the preprocessed ultraviolet-visible absorption spectrum data and the preprocessed three-dimensional fluorescence spectrum data;
splicing the characteristic data extracted from the conventional water quality data, the ultraviolet-visible absorption spectrum data and the three-dimensional fluorescence spectrum data to construct fused data;
establishing a pollution source identification model according to the fusion data and a classification algorithm and training;
and identifying the pollution source through the trained pollution source identification model.
2. The method of claim 1, wherein the pollution source sample comprises a water sample, a soil sample and an atmospheric sample, and pretreating the pollution source sample comprises: filtering the water sample through a 0.2-10.0 μm filter membrane; dissolving the soil sample in ultrapure water and filtering the soil leachate through a 0.2-10.0 μm filter membrane; and dissolving the gas sample in ultrapure water and then filtering it through a 0.2-10.0 μm filter membrane.
3. The method of claim 1, wherein the pollution index test comprises conventional water quality, ultraviolet-visible absorption spectrum and three-dimensional fluorescence spectrum tests;
the conventional water quality includes but is not limited to pH value, conductivity, chemical oxygen demand, total nitrogen, ammonia nitrogen and total phosphorus;
the scanning range of the ultraviolet-visible absorption spectrum is 200-800 nm, and the scanning interval is 0.1-10 nm;
the scanning range of the excitation wavelength of the three-dimensional fluorescence spectrum is 200-600 nm, the scanning range of the emission wavelength is 230-700 nm, and the scanning interval is 1-10 nm.
4. The method of claim 3, wherein pre-processing the regular water quality data, the ultraviolet-visible absorption spectrum data, and the three-dimensional fluorescence spectrum data comprises:
preprocessing the conventional water quality data: eliminating abnormal data by adopting the Pauta criterion (3σ rule);
preprocessing the ultraviolet-visible absorption spectrum: eliminating invalid data in the spectral data, and then performing a standard normal variate (SNV) transformation on the spectral data;
preprocessing the three-dimensional fluorescence spectrum: converting the fluorescence intensity of the original fluorescence fingerprint into Raman units (R.U.) by utilizing the integral of the Raman scattering intensity of ultrapure water at an excitation wavelength of 350 nm over emission wavelengths of 370-430 nm.
5. The method of claim 1, wherein the feature extraction of the pre-processed test data comprises feature extraction of conventional water quality data, uv-vis absorption spectrum data, and three-dimensional fluorescence spectrum data;
the characteristic extraction method of the conventional water quality data and the ultraviolet-visible absorption spectrum data comprises but is not limited to principal component analysis, nonnegative matrix decomposition and independent component analysis;
the characteristic extraction of the three-dimensional fluorescence spectrum data is to extract main fluorescence components of the three-dimensional fluorescence spectrum by utilizing parallel factor analysis.
6. The method of claim 1, wherein the classification algorithm includes, but is not limited to, partial least squares discriminant analysis, a support vector machine, and a K-nearest neighbor algorithm.
7. The method of claim 1, wherein building and training a pollution source recognition model based on the fused data and classification algorithm further comprises:
model initialization: selecting 75-95% of the sample data as a training set, establishing the pollution source identification model by adopting a cross-validation method, and selecting the optimal number of latent variables according to the principle of minimizing the cross-validation error;
model training: setting the number of variables to the optimal number of latent variables and refitting the pollution source identification model;
model prediction: predicting the remaining 5-25% of the sample data with the fitted pollution source identification model, and evaluating the performance of the model according to the prediction result, wherein the evaluation parameters are the sensitivity, specificity, precision and accuracy of the pollution source identification model.
8. A pollution source identification system based on data fusion, comprising:
a sampling module, a sample pretreatment module, a sample introduction module, an analysis testing module, a data transmission module, a display module and a system control module;
the sampling module is used for collecting a pollution source sample;
the sample pretreatment module is used for pretreating the pollution source sample;
the sample introduction module is used for conveying the pretreated sample to the analysis and test module;
the analysis testing module is used for carrying out pollution index testing on the pretreated pollution source sample to obtain conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data;
the data transmission module is used for transmitting data among the modules;
the display module is used for displaying data;
the system control module is used for embedding a pollution source identification model and controlling the system, wherein the conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data are obtained through the data transmission module, the results are displayed on the display module in real time by means of configuration software, the conventional water quality data, ultraviolet-visible absorption spectrum data and three-dimensional fluorescence spectrum data collected in the configuration software are transmitted to the pollution source identification model, the pollution source identification model identifies the type of the sample, and the identification result is returned to the display module for display.
9. The system of claim 8, wherein the analysis testing module includes, but is not limited to, an online pH meter, an online conductivity meter, an online COD analyzer, an online ammonia nitrogen analyzer, an online total nitrogen analyzer, an online total phosphorus analyzer, an ultraviolet-visible absorption spectrophotometer, and a fluorescence spectrophotometer.
CN202110246412.3A 2021-03-05 2021-03-05 Pollution source identification method and system based on data fusion Pending CN113011478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110246412.3A CN113011478A (en) 2021-03-05 2021-03-05 Pollution source identification method and system based on data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110246412.3A CN113011478A (en) 2021-03-05 2021-03-05 Pollution source identification method and system based on data fusion

Publications (1)

Publication Number Publication Date
CN113011478A true CN113011478A (en) 2021-06-22

Family

ID=76407160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110246412.3A Pending CN113011478A (en) 2021-03-05 2021-03-05 Pollution source identification method and system based on data fusion

Country Status (1)

Country Link
CN (1) CN113011478A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113029994A (en) * 2021-03-31 2021-06-25 扬州大学 Microcystin concentration inversion method based on multi-source characteristic spectrum of extracellular organic matter
CN113376114A (en) * 2021-06-24 2021-09-10 北京市生态环境监测中心 Water pollution tracing method based on ultraviolet-visible spectrum data
CN113588617A (en) * 2021-08-02 2021-11-02 清华大学 Water quality multi-feature early warning traceability system and method
CN114166747A (en) * 2021-11-29 2022-03-11 浙江大学 Discrete three-dimensional fluorescence/visible light absorption spectrum detection device for distinguishing water pollution
CN115219472A (en) * 2022-08-12 2022-10-21 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Method and system for quantitatively identifying multiple pollution sources of mixed water body
CN115239354A (en) * 2022-08-11 2022-10-25 中国科学院大气物理研究所 Pollutant tracing method and system applied to unsteady multi-point source
CN116363440A (en) * 2023-05-05 2023-06-30 北京建工环境修复股份有限公司 Deep learning-based identification and detection method and system for colored microplastic in soil
CN117633706A (en) * 2023-11-30 2024-03-01 众悦(威海)信息技术有限公司 Data processing method for information system data fusion

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009080049A1 (en) * 2007-12-21 2009-07-02 Dma Sorption Aps Monitoring oil condition and/or quality, on-line or at-line, based on chemometric data analysis of flourescence and/or near infrared spectra
CN107480839A (en) * 2017-10-13 2017-12-15 深圳市博安达信息技术股份有限公司 The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest
CN109470667A (en) * 2018-11-14 2019-03-15 华东理工大学 A kind of combination water quality parameter and three-dimensional fluorescence spectrum carry out the method that pollutant is traced to the source
CN109711547A (en) * 2018-12-24 2019-05-03 武汉邦拓信息科技有限公司 A kind of pollution sources disorder data recognition method based on deep learning algorithm
CN110083585A (en) * 2019-03-15 2019-08-02 清华大学 A kind of water pollution discharge source database and its method for building up
CN111222575A (en) * 2020-01-07 2020-06-02 北京联合大学 KLXS multi-model fusion method and system based on HRRP target recognition
CN111426668A (en) * 2020-04-28 2020-07-17 华夏安健物联科技(青岛)有限公司 Method for tracing, classifying and identifying polluted water body by using three-dimensional fluorescence spectrum characteristic information
US20200348232A1 (en) * 2020-04-28 2020-11-05 Chinese Research Academy Of Environmental Sciences Rapid detection method for condition of landfill leachate polluting groundwater and application thereof
CN111982878A (en) * 2020-08-24 2020-11-24 安徽思环科技有限公司 Water pollution analysis method based on ultraviolet visible spectrum and three-dimensional fluorescence spectrum

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
武晓莉等;: "多源光谱信息融合在水质分析中的应用", 分析化学, vol. 35, no. 12, pages 1716 - 1720 *
鲍灵利;: "紫外可见光谱检测水体COD算法研究", 信息通信, no. 03, pages 6 - 8 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113029994A (en) * 2021-03-31 2021-06-25 扬州大学 Microcystin concentration inversion method based on multi-source characteristic spectrum of extracellular organic matter
CN113376114A (en) * 2021-06-24 2021-09-10 北京市生态环境监测中心 Water pollution tracing method based on ultraviolet-visible spectrum data
CN113588617A (en) * 2021-08-02 2021-11-02 清华大学 Water quality multi-feature early warning traceability system and method
CN114166747A (en) * 2021-11-29 2022-03-11 浙江大学 Discrete three-dimensional fluorescence/visible light absorption spectrum detection device for distinguishing water pollution
CN114166747B (en) * 2021-11-29 2023-12-15 浙江大学 Discrete three-dimensional fluorescence/visible light absorption spectrum detection device for distinguishing water pollution
CN115239354A (en) * 2022-08-11 2022-10-25 中国科学院大气物理研究所 Pollutant tracing method and system applied to unsteady multi-point source
CN115239354B (en) * 2022-08-11 2024-03-22 中国科学院大气物理研究所 Pollutant tracing method and system applied to unsteady multi-point source
CN115219472A (en) * 2022-08-12 2022-10-21 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Method and system for quantitatively identifying multiple pollution sources of mixed water body
NL2034211A (en) 2022-08-12 2024-02-16 South China Institute Of Environmental Sciences Mee Res Institute Of Eco Environmental Emergency Mee Method and system for quantitatively identifying multi-pollution sources of mixed water body
CN116363440A (en) * 2023-05-05 2023-06-30 北京建工环境修复股份有限公司 Deep learning-based identification and detection method and system for colored microplastic in soil
CN116363440B (en) * 2023-05-05 2023-12-19 北京建工环境修复股份有限公司 Deep learning-based identification and detection method and system for colored microplastic in soil
CN117633706A (en) * 2023-11-30 2024-03-01 众悦(威海)信息技术有限公司 Data processing method for information system data fusion

Similar Documents

Publication Publication Date Title
CN113011478A (en) Pollution source identification method and system based on data fusion
CN105631203A (en) Method for recognizing heavy metal pollution source in soil
CN108665119B (en) Water supply pipe network abnormal working condition early warning method
CN105334186A (en) Infrared spectral analysis method
CN115389439B (en) River pollutant monitoring method and system based on big data
CN111783616B (en) Nondestructive testing method based on data-driven self-learning
CN114202243A (en) Engineering project management risk early warning method and system based on random forest
CN100445732C (en) Burning evaluation method for machining surface based on CCD image characteristics
CN113311081B (en) Pollution source identification method and device based on three-dimensional liquid chromatography fingerprint
CN111724290A (en) Environment-friendly equipment identification method and system based on deep hierarchical fuzzy algorithm
CN116359285A (en) Oil gas concentration intelligent detection system and method based on big data
CN116858822A (en) Quantitative analysis method for sulfadiazine in water based on machine learning and Raman spectrum
CN115598164A (en) Machine learning integrated soil heavy metal concentration detection method and system
CN111476363A (en) Stable learning method and device for distinguishing decorrelation of variables
CN115728290A (en) Method, system, equipment and storage medium for detecting chromium element in soil
CN115219472A (en) Method and system for quantitatively identifying multiple pollution sources of mixed water body
CN115186935A (en) Electromechanical device nonlinear fault prediction method and system
CN113916817A (en) Spectroscopy chromaticity online measurement method for urban drinking water
JPH09120455A (en) Feature discriminating method using neural network
CN112508946B (en) Cable tunnel anomaly detection method based on antagonistic neural network
CN114897835B (en) Image-based real-time detection system for ash content of coal products
CN117349777B (en) Intelligent identification system and method for online monitoring data of water environment
CN102692395B (en) Light interference gas detection device and working condition detection method thereof
CN112906793B (en) Monitoring data repairing method and system for bridge health monitoring system
CN112508946A (en) Cable tunnel abnormity detection method based on antagonistic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination