CN117591506B - Site soil and groundwater environment monitoring data cleaning method based on fusion model - Google Patents
- Publication number
- CN117591506B (application number CN202410046720.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- data set
- cleaning
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2433—Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
Abstract
The invention discloses a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, belonging to the technical field of contaminated-site soil and groundwater data processing. The method comprises the following steps: acquiring site soil and groundwater pollutant data and classifying it to obtain a classified data set; training a plurality of reference deep learning models on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of that network model according to its loss values to obtain a standard fusion model; inputting the data to be cleaned into the standard fusion model to detect and repair abnormal values; and merging the cleaned data and storing it in a database or data warehouse. By fusing a plurality of reference deep learning models, the method improves the accuracy and efficiency of data cleaning.
Description
Technical Field
The invention relates to the technical field of polluted site soil and groundwater data processing, in particular to a site soil and groundwater environment monitoring data cleaning method based on a fusion model.
Background
Over the full life cycle of a contaminated site (investigation, risk assessment, risk management and long-term monitoring), a large amount of monitoring data on soil and groundwater pollutants is collected. These data cover many samples and many monitored parameters, have a complex structure, and implicitly carry a large amount of feature, relationship and classification information. At the same time, soil and groundwater data inevitably contain dirty data such as redundant, missing, uncertain and inconsistent records. Such data seriously reduce the usability of the data set as a whole, and abnormal data often bias soil and groundwater environmental quality assessments, which in turn affects subsequent management decisions.
At present, data cleaning relies on outlier identification and outlier filling, and the related methods are mainly formulated in terms of attribute values, spatial scale, time series and the like. Smoothing-based, statistics-based and constraint-based data cleaning methods, for example, are applicable only to single-dimensional data. Few data cleaning methods combine the temporal and spatial scales. Even among spatio-temporal methods, most studies treat the two scales independently: outliers are first detected separately on the time scale and on the spatial scale, and the results are then combined to judge spatio-temporal outliers. Such an approach separates the temporal and spatial correlations and does not take their interaction into account.
Moreover, the massive data of a contaminated site are continuously updated rather than forming a fixed data set, so the demand for automatic extraction and analysis has grown greatly, and analysis rules that require manually inspecting each record are no longer practical. Existing data cleaning algorithms were proposed for static data of a single type, and most techniques clean only static data with a single kind of problem, such as outliers or missing values. They cannot incorporate prior information into the cleaning process, do not integrate anomaly detection with data repair, and struggle to meet the comprehensive processing requirements of massive data.
Disclosure of Invention
The invention aims to provide a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, which builds reference deep learning models by learning the feature distribution of the data and fuses a plurality of reference deep learning models to improve the accuracy and efficiency of data cleaning.
To achieve this purpose, the invention provides a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, comprising the following steps:
S1, acquiring site soil and groundwater pollutant data as an original data set, and classifying the original data set to obtain a classified data set;
S2, training a plurality of reference deep learning models on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model according to its loss values to obtain a standard fusion model;
S3, inputting the data to be cleaned into the standard fusion model, and detecting and repairing abnormal values to obtain cleaned data;
S4, merging the cleaned data and storing it in a database or data warehouse.
Preferably, in step S1, the original data set comprises sample features and corresponding classification labels; the sample features are the sample number and the sample batch; the corresponding classification labels comprise soil monitoring indicators and groundwater monitoring indicators.
Preferably, in step S1, classifying the original data set comprises the following steps:
S11, performing feature extraction and feature preprocessing on the original data set according to the different features of the original data to obtain a standardized data set, which inherits the sample features and corresponding classification labels of the original data set;
S12, classifying the site soil and groundwater pollutant data in the standardized data set, using the detection indicator as the classification label, to obtain the classified data set.
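Steps S11 and S12 can be illustrated with a minimal Python sketch. The record layout (`medium`, `indicator`, `batch`, `point`, `value`, `unit`) and the unit conversion table are assumptions made for illustration only; the patent does not specify them.

```python
from collections import defaultdict

# Hypothetical conversions to a common unit per medium (not from the patent).
CONVERSIONS = {"ug/L": ("mg/L", 1e-3), "mg/L": ("mg/L", 1.0), "mg/kg": ("mg/kg", 1.0)}

def standardize(record):
    """S11 sketch: normalize the indicator name and the concentration unit."""
    out = dict(record)
    out["indicator"] = record["indicator"].strip().lower()
    unit, factor = CONVERSIONS[record["unit"]]
    value = record.get("value")
    # Missing values stay None so that later detection can flag them.
    out["value"] = None if value is None else float(value) * factor
    out["unit"] = unit
    return out

def classify(records):
    """S12 sketch: group standardized records by (medium, indicator)."""
    groups = defaultdict(list)
    for rec in map(standardize, records):
        groups[(rec["medium"], rec["indicator"])].append(rec)
    return dict(groups)

raw = [
    {"medium": "groundwater", "indicator": " Ammonia nitrogen", "batch": 1,
     "point": "GW01", "value": "850", "unit": "ug/L"},
    {"medium": "groundwater", "indicator": "ammonia nitrogen", "batch": 2,
     "point": "GW01", "value": None, "unit": "mg/L"},
    {"medium": "soil", "indicator": "Arsenic", "batch": 1,
     "point": "S07", "value": 12.5, "unit": "mg/kg"},
]
classified = classify(raw)
```

Each group keeps the sample features (batch, point) of its records, which mirrors the requirement that the standardized data set inherit the sample features and labels of the original data.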
Preferably, in step S2, training a plurality of reference deep learning models on the classified data set and constructing a multi-model fused data cleaning network model comprises the following steps:
S21, dividing the classified data set obtained in step S1 into a training data set, a verification data set and a test data set;
S22, for abnormal value detection and abnormal value repair, training deep learning models on the classified data set so that they automatically learn the features and patterns of the data, and introducing physical prior knowledge of soil and groundwater, in the form of partial differential equations (PDEs), into the loss function to constrain the solution space of the deep learning models, thereby establishing a plurality of reference deep learning models;
S23, training each reference deep learning model with the training data set to obtain a training result;
S24, verifying each reference deep learning model with the verification data set to obtain a verification result;
S25, testing each reference deep learning model with the test data set to obtain a test result;
S26, evaluating the test result of each converged reference deep learning model and taking it as a network weight, fusing the plurality of reference deep learning models with the network weights to construct the multi-model fused data cleaning network model, training the multi-model fused data cleaning network model with the training data set, and verifying it with the verification data set;
S27, obtaining the loss value of the multi-model fused data cleaning network model with the test data set, adjusting the parameters of the multi-model fused data cleaning network model according to the loss value, and outputting the standard fusion model.
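The weighting idea in S26 can be sketched as follows. The patent states only that each reference model's test result serves as a network weight; normalizing test scores so that they sum to 1 and taking a weighted average of the model outputs is an assumption made for this sketch, and the model names and scores are hypothetical.

```python
def fusion_weights(test_scores):
    """Normalize per-model test scores so that the weights sum to 1."""
    total = sum(test_scores.values())
    return {name: score / total for name, score in test_scores.items()}

def fuse(predictions, weights):
    """Weighted average of the per-model predictions for one sample."""
    return sum(weights[name] * predictions[name] for name in weights)

# Hypothetical test scores for two reference models.
test_scores = {"lstm": 0.9, "pinn": 0.6}
weights = fusion_weights(test_scores)

# Hypothetical predictions of both models for the same sample (mg/L).
fused = fuse({"lstm": 1.0, "pinn": 2.0}, weights)
```

In a trainable fusion network these weights would serve only as initial values that S26 and S27 then adjust against the loss.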
Preferably, in step S3, inputting the data to be cleaned into the standard fusion model and detecting and repairing abnormal values to obtain cleaned data comprises the following steps:
S31, acquiring the data to be cleaned and inputting it into the standard fusion model to obtain a data cleaning result;
S32, if an abnormal value deviates positively from the outlier standard value, carrying out resampling and laboratory analysis; data that deviate positively from the outlier standard value undergo outlier detection only and are not repaired;
the cleaned data set is subdivided into confirmed clean data, repaired abnormal data and uncertain data, where the uncertain data hold the abnormal values that deviate positively from the outlier standard value and are judged manually;
S33, repeating steps S21 to S26 on the confirmed clean data and the repaired abnormal data in the cleaned data set to iteratively train the standard fusion model; after each iteration the standard fusion model obtains newly corrected parameters and is fine-tuned, and the new standard fusion model is updated and saved.
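The three-way subdivision in S32 can be sketched in a few lines. This sketch assumes that "positive deviation" means a reading above the upper outlier bound, and that the repair value is supplied by the fusion model; both the bounds and the repair value below are hypothetical.

```python
def triage(observed, lower, upper, repaired):
    """Return (category, value) for one reading against an outlier band.

    Assumption: 'positive deviation' means the reading exceeds the upper bound.
    """
    if observed is None:
        return "repaired", repaired    # missing value is filled
    if observed > upper:
        return "uncertain", observed   # detected only, kept for manual review
    if observed < lower:
        return "repaired", repaired    # low-side anomaly is replaced
    return "clean", observed

readings = [0.52, None, 3.10, 0.05]    # hypothetical concentrations, mg/L
results = [triage(r, lower=0.2, upper=1.0, repaired=0.5) for r in readings]
```

Keeping the original value for the "uncertain" category reflects the rule that high readings are not repaired automatically, since they may indicate real contamination that resampling and laboratory analysis should confirm.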
Preferably, in step S22, abnormal value detection includes missing value detection, duplicate value detection, outlier detection, data format consistency detection, erroneous data detection and unreasonable data detection; abnormal value repair includes missing value filling, duplicate value replacement, outlier smoothing, erroneous data replacement and unreasonable data replacement.
Preferably, in step S22, the deep learning model includes one or more of RNN, LSTM, Transformer and variants of these deep learning models.
Preferably, in step S32, the outlier standard value is calculated using the torch.nn module of the PyTorch framework, which implements the fitting of the monitoring data and the calculation of the outlier standard value.
Preferably, in step S4, the database is MongoDB, a document-oriented database, with data stored in JSON/BSON format; the data warehouse is the HDFS distributed file system.
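As a small illustration of the JSON document form mentioned for S4, the sketch below serializes one cleaned record using only the Python standard library. The field names and the `cleaning_status` field are hypothetical and are not taken from the patent.

```python
import json

def to_document(record, status):
    """Attach the cleaning category to a record before storage."""
    doc = dict(record)
    doc["cleaning_status"] = status    # hypothetical field name
    return doc

record = {"point": "GW01", "batch": 3, "indicator": "ammonia nitrogen",
          "value": 0.85, "unit": "mg/L"}
doc = to_document(record, "clean")
payload = json.dumps(doc, sort_keys=True)
```

With a MongoDB driver such as pymongo, a dictionary like `doc` would typically be passed to a collection's `insert_one` method; that step is omitted here so that the sketch runs without a database.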
Therefore, the fusion-model-based method for cleaning site soil and groundwater environment monitoring data has the following technical effects:
(1) With multi-model fusion, the cooperation among several reference models, each with its own data processing characteristics, makes it possible to find various abnormal values accurately while fully retaining valid data, which improves the accuracy of data cleaning, greatly reduces the amount of computation and shortens the cleaning time;
(2) Considering the correlation between pollutant data in soil and in groundwater, a physics-informed neural network that is both data-driven and knowledge-driven is constructed to simulate the migration of pollutants in soil and groundwater, improving the reasonableness and interpretability of abnormal data repair;
(3) The multi-model fused data cleaning network model is highly adaptable and can handle a variety of complex data cleaning tasks; by fusing a plurality of reference deep learning models it achieves better performance than a single model.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of the fusion-model-based method for cleaning site soil and groundwater environment monitoring data;
FIG. 2 shows the long short-term memory (LSTM) deep learning model used for outlier detection in the present invention;
FIG. 3 shows the physics-informed deep learning model used for outlier repair in the present invention;
FIG. 4 shows the outlier detection and outlier repair scenario of the first embodiment (after manual review the uncertain data are found to be abnormal and are repaired);
FIG. 5 shows the outlier detection and outlier repair scenario of the second embodiment (after manual review the uncertain data are classified as confirmed clean data).
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
Example 1
As shown in FIG. 1, the fusion-model-based method for cleaning site soil and groundwater environment monitoring data comprises the following steps:
S1, classifying the acquired site soil and groundwater pollutant data
S11, acquiring long time series of soil and groundwater pollutant data from the contaminated site of an enterprise in production, the data including detection batch, monitoring point number, pollutant type, pollutant concentration, unit and detection limit;
S12, standardizing the format of the site soil and groundwater pollutant data with an algorithm written in Python to obtain a standardized data set, which inherits the original sample features and corresponding classification labels;
S13, classifying the standardized data set into soil and groundwater pollutant data, using the pollutant type as the classification label. In this embodiment, soil and groundwater were monitored monthly from January 2019 to December 2021, giving 36 batches in total; there are 362 soil monitoring points with 85 detection indicators and 556 groundwater monitoring points with 232 detection indicators, and the total data volume exceeds the $5\times10^{6}$ level.
S2, training a plurality of reference deep learning models on the classified data, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model according to its loss values to obtain a standard fusion model
S21, dividing the classified data set obtained in S1 into a training data set, a verification data set and a test data set. This embodiment takes the groundwater detection indicator ammonia nitrogen as an example and uses cross-validation to split the data into training, verification and test sets at 70%, 10% and 20%.
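A minimal sketch of the 70% / 10% / 20% split follows. The patent mentions cross-validation without giving details, so the single seeded shuffle used here is an assumption, and the test set simply receives whatever remains after integer rounding.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split into 70% training, 10% verification, remainder test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_val = len(items) * 10 // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

batches = list(range(1, 37))           # 36 monthly batches, as in this embodiment
train, val, test = split_dataset(batches)
```

With 36 batches the integer arithmetic yields 25, 3 and 8 items, so the proportions are only approximately 70/10/20. For sequence models such as an LSTM, one would more likely split contiguous time windows than individual shuffled batches.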
S22-S25, the migration of contaminants in groundwater forms a time series, so in this embodiment outlier detection is based on an LSTM neural network. FIG. 2 shows the structure of the LSTM cell at time step $t$ in one layer of the network; the arrows in the figure represent information flows, and the three dashed boxes represent the three gates: the forget gate, the input gate and the output gate. The three gates are computed with a sigmoid activation function. Using the current input $X_t$, the state $h_{t-1}$ output by the previous step and the information carrier $C_{t-1}$ of the previous step, the cell produces the state $h_t$ output at this step and the information carrier $C_t$ output at this step. Here $\sigma$ is the gate activation function and $\tanh$ is the activation function generally used for the cell state; $f_t$, $i_t$ and $o_t$ are the values of the forget gate, input gate and output gate, calculated as follows:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, X_t] + b_f\right) \tag{1}$$
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, X_t] + b_i\right) \tag{2}$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, X_t] + b_o\right) \tag{3}$$
$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, X_t] + b_c\right) \tag{4}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{5}$$
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
where $W_f$, $W_i$ and $W_o$ are the input weights of the forget gate, input gate and output gate; $b_f$, $b_i$ and $b_o$ are the biases of the forget gate, input gate and output gate; $\tilde{C}_t$ is the cell state variable, with weight $W_c$ and bias $b_c$; and $\odot$ denotes element-wise multiplication.
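The gate computation described above can be sketched for a scalar state as follows. This assumes the standard LSTM formulation (sigmoid gates, tanh on the candidate and output cell state); the parameter layout `p[gate] = (w_h, w_x, b)` is a hypothetical convenience, not the patent's notation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a gate name to (w_h, w_x, b)."""
    def pre(gate):
        w_h, w_x, b = p[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    o_t = sigmoid(pre("o"))            # output gate
    c_tilde = math.tanh(pre("c"))      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde # new information carrier
    h_t = o_t * math.tanh(c_t)         # new output state
    return h_t, c_t

# All-zero parameters: every gate equals 0.5 and the candidate state equals 0.
zero = {g: (0.0, 0.0, 0.0) for g in "fioc"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.3, c_prev=2.0, p=zero)
```

With zero parameters the cell halves the previous carrier, which makes the forget-gate path easy to check by hand. A real detector would use vector states and trained weights, for example through PyTorch's `torch.nn.LSTM`.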
In this embodiment the loss function tends to converge once the number of iterations reaches 500; the number of iterations is set to 600 in this training to improve model performance, and the reference deep learning model for outlier detection is constructed. In this embodiment abnormal values are identified accurately, with an identification accuracy of 99.5%.
S22-S25, pollutants enter the liquid-phase groundwater environment through the solid-phase soil, so the distributions of pollutants in soil and in groundwater are related yet differ. In this embodiment outlier repair is based on a neural network built with PyTorch. As shown in FIG. 3, the groundwater flow equation and the groundwater pollutant migration equation are used as physical prior knowledge to construct a physics-prior-driven loss function: a loss term penalizing predictions that violate the physical laws is added to the loss function, imposing physical constraints on the deep learning model.
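The groundwater flow PDE is:
$$\frac{\partial}{\partial x}\left(k_x\frac{\partial h}{\partial x}\right) + \frac{\partial}{\partial y}\left(k_y\frac{\partial h}{\partial y}\right) + \frac{\partial}{\partial z}\left(k_z\frac{\partial h}{\partial z}\right) + \omega = \mu_s\frac{\partial h}{\partial t} \tag{7}$$
where $h$ is the hydraulic head, m; $k_x$, $k_y$, $k_z$ are the permeability coefficients in the x, y and z directions, m/d; $\omega$ is the source/sink term, d$^{-1}$; $\mu_s$ is the specific storage, m$^{-1}$; $t$ is time, d.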
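To show how a flow equation of this kind becomes a residual that a loss term can penalize, the sketch below evaluates a one-dimensional, homogeneous form, $k\,\partial^2 h/\partial x^2 + \omega = \mu_s\,\partial h/\partial t$, with central finite differences. The reduction to 1D, the sign convention for $\omega$, and the step sizes are assumptions for illustration; a physics-informed network would normally obtain these derivatives by automatic differentiation.

```python
def flow_residual_1d(h, x, t, k, omega, mu_s, dx=1e-2, dt=1e-2):
    """Residual k*h_xx + omega - mu_s*h_t of the assumed 1D flow equation.

    h is a callable h(x, t); derivatives use central finite differences.
    """
    d2h_dx2 = (h(x + dx, t) - 2.0 * h(x, t) + h(x - dx, t)) / dx ** 2
    dh_dt = (h(x, t + dt) - h(x, t - dt)) / (2.0 * dt)
    return k * d2h_dx2 + omega - mu_s * dh_dt

def steady_head(x, t):
    """Hypothetical steady head field h = x^2, so that h_xx = 2 and h_t = 0."""
    return x * x

# With k = 1 the field balances a sink of strength omega = -2.
balanced = flow_residual_1d(steady_head, x=1.0, t=0.0, k=1.0, omega=-2.0, mu_s=1e-4)
# Without the sink the same field violates the equation by k * h_xx = 2.
violated = flow_residual_1d(steady_head, x=1.0, t=0.0, k=1.0, omega=0.0, mu_s=1e-4)
```

A residual near zero means the candidate head field respects the physics at that point; the squared residuals over many collocation points are what a physical-law loss term averages.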
The groundwater contaminant migration PDE is:
$$\frac{\partial\left(\theta C^k\right)}{\partial t} = \frac{\partial}{\partial x_i}\left(\theta D_{ij}\frac{\partial C^k}{\partial x_j}\right) - \frac{\partial}{\partial x_i}\left(\theta v_i C^k\right) + q_s C_s^k + \sum R_n \tag{8}$$
where $\theta$ is the porosity of the medium; $C^k$ is the concentration of solute component $k$, mg/L; $D_{ij}$ is the hydrodynamic dispersion coefficient tensor, m$^2$/d; $v_i$ is the actual pore-water flow velocity, m/d; $q_s$ is the volume of fluid supplied to or received from the aquifer per unit volume and represents the source/sink term (positive values denote sources, negative values denote sinks), d$^{-1}$; $C_s^k$ is the concentration of solute component $k$ in the source/sink term, mg/L; $\sum R_n$ is the sum of the chemical reaction terms, mg/(L·d).
A loss term penalizing predictions that violate the physical laws is added to the loss function to impose physical constraints on the deep learning model. As shown in FIG. 3, the loss comprises two parts:
$$MSE = MSE_f + MSE_u \tag{9}$$
$$MSE_f = \frac{1}{N_f}\sum_{i=1}^{N_f}\left|f\left(x_f^i, t_f^i\right)\right|^2 \tag{10}$$
$$MSE_u = \frac{1}{N_u}\sum_{i=1}^{N_u}\left|\hat{u}\left(x_u^i, t_u^i\right) - u^i\right|^2 \tag{11}$$
where $MSE_f$ is the physical-law loss term, the mean square error between the neural network's fitted value $f$ and the true physical-law constraint, evaluated at $N_f$ interior collocation points; when the physical-law equations are fitted well at the interior collocation points, $MSE_f$ approaches zero, and when this loss is 0 the network value at every interior point approaches the true value. $MSE_u$ is the training-data loss term, the mean square error between the network's fitted value $\hat{u}$ at the initial and boundary conditions and the true value $u$ over $N_u$ training data points; for any training data point, $MSE_u$ approaches zero.
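The two-part loss described above can be sketched without any deep learning framework. The function below assumes the usual physics-informed formulation, in which the total loss is the sum of a data term and a PDE-residual term; equal weighting of the two terms is an assumption, and the input numbers are hypothetical.

```python
def mse(values):
    """Mean of squared values."""
    values = list(values)
    return sum(v * v for v in values) / len(values)

def physics_informed_loss(data_errors, pde_residuals):
    """Return (total, mse_u, mse_f).

    data_errors:   fitted minus true values at initial/boundary training points
    pde_residuals: PDE residuals at interior collocation points
    """
    mse_u = mse(data_errors)
    mse_f = mse(pde_residuals)
    return mse_u + mse_f, mse_u, mse_f

# Hypothetical errors at two training points and residuals at two collocation points.
total, mse_u, mse_f = physics_informed_loss([1.0, -1.0], [0.0, 2.0])
```

Because both terms are non-negative, the total can reach zero only when the network matches the data and satisfies the equation at every collocation point, which is what constrains the solution space.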
In FIG. 3, x represents the input data; t represents time; θ represents the set of parameters of the neural network, including weights and biases; NN(x, t; θ) represents the function by which the neural network makes an inference or prediction from the input x and time t; ε denotes an arbitrarily chosen value, taken to tend to 0 in this embodiment. In this embodiment the loss function tends to converge once the number of iterations reaches 10000; the number of iterations is set to 10500 in this training to improve model performance, and the resulting pollutant simulation is used as the reference deep learning model for outlier repair.
S26-S27, in this embodiment one reference deep learning model is constructed for outlier detection and one for outlier repair. A multi-model fused data cleaning network model is built from them using a deep learning model fusion strategy; the fusion model is trained with the training data set (70%), verification data set (10%) and test data set (20%), its parameters are adjusted step by step according to the loss function, and the standard fusion model is output. In this embodiment the training result meets the convergence requirement once the fusion model has run 5000 iterations, and the trained fusion model is saved as the standard fusion model.
S3, inputting the data to be cleaned into the standard fusion model, and detecting and repairing abnormal values to obtain cleaned data
S31, inputting the data to be cleaned into the multi-model fused data cleaning network model trained in S2, and detecting and repairing abnormal values;
S32, in this embodiment the outlier standard value of the change in ammonia nitrogen concentration is calculated. The torch.nn module of the PyTorch framework is used to fit the monitoring data; over a number of training epochs the model gradually approaches the training data, and the trained model is then used to generate a prediction band whose upper and lower bounds define the range of outlier boundary values. As shown in FIG. 4, of the 36 groups of monitoring data, 33 groups are {confirmed clean data}; 2 groups are {repaired abnormal data}, for which missing value filling and unreasonable data replacement were performed respectively; and 1 group is {uncertain data}, which manual review determined to be abnormal, so it was repaired and the original abnormal data replaced.
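The prediction band idea can be sketched as follows. The patent does not state how the band width is derived from the fitted model, so the rule used here (prediction plus or minus k standard deviations of the training residuals, with k = 3) is purely an assumption, as are the numbers.

```python
import statistics

def prediction_band(predictions, residuals, k=3.0):
    """Return (lower, upper) bounds per prediction as prediction +/- k * sigma."""
    sigma = statistics.pstdev(residuals)
    return [(p - k * sigma, p + k * sigma) for p in predictions]

def outside(value, band):
    """True when an observation falls outside its band."""
    lower, upper = band
    return value < lower or value > upper

residuals = [0.1, -0.1, 0.1, -0.1]     # hypothetical training residuals, mg/L
bands = prediction_band([1.0, 1.2], residuals)
flags = [outside(v, b) for v, b in zip([1.1, 2.0], bands)]
```

Here the first observation lies inside its band and the second lies above it; under S32 a reading above the band would be kept as uncertain data rather than repaired.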
S33, steps S21 to S26 are repeated on the 35 groups of cleaned data, comprising {confirmed clean data} and {repaired abnormal data}, and the iteratively trained model is updated and saved as the new standard fusion model for subsequent data cleaning.
S4, the cleaned data are merged and stored in a MongoDB database to facilitate subsequent analysis and modeling.
Example 2
The second embodiment uses the same fusion model generated by steps S1 to S2 as the first embodiment. The difference, as shown in FIG. 5, is that of the 36 groups of monitoring data, 32 groups are {confirmed clean data}; 3 groups are {repaired abnormal data}, for which missing value filling, unreasonable data replacement and outlier smoothing were performed respectively; and 1 group is {uncertain data}, which manual review determined to contain no detection anomaly. Further investigation found a new pollution source, so the model needs further training, as follows: the data preceding the {uncertain data} are discarded, step S2 is repeated using the {uncertain data} and the subsequent monitoring data, and a new fusion model for data cleaning is obtained by training.
Therefore, the fusion-model-based method for cleaning site soil and groundwater environment monitoring data uses multi-model fusion: the cooperation among several reference models, each with its own data processing characteristics, finds various abnormal values accurately while fully retaining valid data, which improves the accuracy of data cleaning, greatly reduces the amount of computation and shortens the cleaning time. Considering the correlation between pollutant data in soil and in groundwater, a physics-informed neural network that is both data-driven and knowledge-driven is constructed to simulate the migration of pollutants in soil and groundwater, improving the accuracy of abnormal data repair. The multi-model fused data cleaning network model is highly adaptable and can handle a variety of complex data cleaning tasks; by fusing a plurality of reference deep learning models it achieves better performance than a single model.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution may still be modified or equivalently substituted, and that such modifications or substitutions do not cause the modified technical solution to depart from the spirit and scope of the technical solution of the invention.
Claims (8)
1. A site soil and groundwater environment monitoring data cleaning method based on a fusion model, characterized by comprising the following steps:
S1, acquiring site soil and groundwater pollutant data as an original data set, and classifying the original data set to obtain a classified data set;
S2, training a plurality of reference deep learning models based on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model according to its loss value to obtain a standard fusion model;
S3, inputting the data to be cleaned into the standard fusion model, and performing abnormal value detection and abnormal value repair to obtain cleaned data;
S4, merging the cleaned data and storing the merged data in a database or a data warehouse;
wherein in step S2, training a plurality of reference deep learning models based on the classified data set and constructing a multi-model fused data cleaning network model comprises the following steps:
S21, dividing the classified data set obtained in step S1 into a training data set, a verification data set and a test data set;
S22, for abnormal value detection and abnormal value repair, training deep learning models on the classified data set so that they automatically learn the features and patterns of the data, and establishing a plurality of reference deep learning models by introducing soil and groundwater physical prior knowledge (a PDE) as a loss function to constrain the solution space of the deep learning models;
S23, training each reference deep learning model with the training data set to obtain a training result;
S24, verifying each reference deep learning model with the verification data set to obtain a verification result;
S25, testing each reference deep learning model with the test data set to obtain a test result;
S26, evaluating each converged reference deep learning model and taking its test result as a network weight, fusing the plurality of reference deep learning models with the network weights to construct the multi-model fused data cleaning network model, training the multi-model fused data cleaning network model with the training data set, and verifying it with the verification data set;
S27, obtaining the loss value of the multi-model fused data cleaning network model with the test data set, adjusting the parameters of the multi-model fused data cleaning network model according to the loss value, and outputting the standard fusion model.
2. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S1, the original data set includes sample features and corresponding classification labels; the sample features are the sample number and the sample batch; the corresponding classification labels comprise soil monitoring indexes and groundwater monitoring indexes.
3. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 2, wherein in step S1, classifying the original data set comprises the following steps:
S11, performing feature extraction and feature preprocessing on the original data set according to the different features of the original data to obtain a standardized data set, wherein the standardized data set inherits the sample features and corresponding classification labels of the original data set;
S12, classifying the site soil pollutant data and the groundwater pollutant data in the standardized data set separately, using the detection index as the classification label, to obtain the classified data set.
4. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 3, wherein in step S3, inputting the data to be cleaned into the standard fusion model and performing abnormal value detection and abnormal value repair to obtain cleaned data comprises the following steps:
S31, acquiring the data to be cleaned, and inputting the data to be cleaned into the standard fusion model to obtain a data cleaning result;
S32, if an abnormal value deviates positively from the outlier standard value, carrying out resampling and laboratory analysis; data that deviate positively from the outlier standard value undergo outlier detection only, and no repair is performed;
the cleaned data set is subdivided into confirmed clean data, repaired abnormal data and uncertain data, wherein the uncertain data hold the abnormal values that deviate positively from the outlier standard value, and these abnormal values are judged manually;
S33, repeating steps S21-S26 on the confirmed clean data and the repaired abnormal data in the cleaned data set to iteratively train the standard fusion model, wherein after each iteration the standard fusion model obtains newly corrected parameters and is fine-tuned, and the new standard fusion model is updated and stored.
5. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S22, abnormal value detection comprises missing value detection, duplicate value detection, outlier detection, data format consistency detection, erroneous data detection and unreasonable data detection; abnormal value repair comprises missing value filling, duplicate value replacement, outlier smoothing, erroneous data replacement and unreasonable data replacement.
6. The method according to claim 1, wherein in step S22, the deep learning model comprises one or more of RNN, LSTM, Transformer, and variants of these deep learning models.
7. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 4, wherein in step S32, the torch.nn module of the PyTorch framework is used to fit the monitoring data and calculate the outlier standard value.
8. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S4, the database is a MongoDB document database, and the data are stored in JSON/BSON format; the data warehouse is an HDFS distributed file system.
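The weighting in step S26 and the three-way split in step S32 can be sketched as follows. This is only one possible reading of the claims, under stated assumptions: test results are treated as scores where higher is better and are normalized to sum to one, the fused output is a weighted average, missing or negative readings stand in for repairable abnormal values, and the names `fusion_weights`, `fused_estimate`, and `triage` are hypothetical. Simple callables stand in for trained reference deep learning models.

```python
from typing import Callable, Dict, List, Optional, Sequence

Model = Callable[[float], float]  # stand-in for a trained reference model


def fusion_weights(test_scores: Sequence[float]) -> List[float]:
    """Normalize per-model test scores into fusion weights (assumed scheme)."""
    total = sum(test_scores)
    if total <= 0:
        raise ValueError("test scores must have a positive sum")
    return [score / total for score in test_scores]


def fused_estimate(models: Sequence[Model], weights: Sequence[float], x: float) -> float:
    """Weighted average of the reference models' outputs for input x."""
    return sum(w * m(x) for m, w in zip(models, weights))


def triage(
    observed: Optional[float], estimate: float, outlier_standard: float
) -> Dict[str, object]:
    """Sort one reading into clean, repaired, or uncertain.

    Missing or negative readings are repaired with the fused estimate
    (an assumed repair rule). Readings above the outlier standard value are
    only flagged: they are kept unrepaired for resampling, laboratory
    analysis, and manual judgment.
    """
    if observed is None or observed < 0:
        return {"status": "repaired", "value": estimate}
    if observed > outlier_standard:
        return {"status": "uncertain", "value": observed}
    return {"status": "clean", "value": observed}


# Usage with two toy "models" and made-up test scores.
models = [lambda x: 2 * x, lambda x: x + 4]
weights = fusion_weights([3.0, 1.0])
estimate = fused_estimate(models, weights, 2.0)
records = [triage(r, estimate, outlier_standard=10.0) for r in (4.0, None, 25.0)]
```

Keeping high readings unrepaired reflects the claim's caution that a positive deviation may be genuine contamination rather than a measurement error, so smoothing it away could hide a real exceedance.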
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410046720.5A CN117591506B (en) | 2024-01-12 | 2024-01-12 | Site soil and groundwater environment monitoring data cleaning method based on fusion model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410046720.5A CN117591506B (en) | 2024-01-12 | 2024-01-12 | Site soil and groundwater environment monitoring data cleaning method based on fusion model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117591506A CN117591506A (en) | 2024-02-23 |
CN117591506B CN117591506B (en) | 2024-03-22 |
Family
ID=89922206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410046720.5A Active CN117591506B (en) | 2024-01-12 | 2024-01-12 | Site soil and groundwater environment monitoring data cleaning method based on fusion model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117591506B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110378480A (en) * | 2019-06-14 | 2019-10-25 | 平安科技(深圳)有限公司 | Model training method, device and computer readable storage medium |
CN111199343A (en) * | 2019-12-24 | 2020-05-26 | 上海大学 | Multi-model fusion tobacco market supervision abnormal data mining method |
CN112950047A (en) * | 2021-03-18 | 2021-06-11 | 京师天启(北京)科技有限公司 | Progressive identification method for suspected contaminated site |
CN116128417A (en) * | 2022-12-28 | 2023-05-16 | 上海龙照电子有限公司 | Method and system for identifying and early warning missing part risk of computer hardware inventory |
CN116522566A (en) * | 2023-07-05 | 2023-08-01 | 南京大学 | Groundwater monitoring network optimization method based on physical information driven deep learning model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230403588A1 (en) * | 2022-06-10 | 2023-12-14 | Qualcomm Incorporated | Machine learning data collection, validation, and reporting configurations |
Non-Patent Citations (4)
Title |
---|
A robust framework for identification of PDEs from noisy data;Zhang Zhiming et al.;《Journal of Computational Physics》;20211231;1-11 * |
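Threaded hole detection method based on guided filtering and a neural network algorithm; Ma Xiaofeng et al.;《制造技术与机床》;20220131;165-170 * |
Risk assessment of soil and groundwater at a vacated riverside chemical-industry contaminated site in southern China; Zhou Meichun et al.;《环境生态学》;20230731;33-38 * |
Development of diatomite-based water treatment agents and their application in advanced treatment of urban sewage; Xu Yuan;《中国优秀硕士学位论文全文数据库 工程科技Ⅰ辑》;20220315;B016-1232 * |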
Also Published As
Publication number | Publication date |
---|---|
CN117591506A (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Behmel et al. | Water quality monitoring strategies—A review and future perspectives | |
Aytek et al. | A genetic programming approach to suspended sediment modelling | |
Zhao et al. | Water quality evolution mechanism modeling and health risk assessment based on stochastic hybrid dynamic systems | |
Okafor et al. | Missing data imputation on IoT sensor networks: Implications for on-site sensor calibration | |
Reed et al. | Save now, pay later? Multi-period many-objective groundwater monitoring design given systematic model errors and uncertainty | |
CN111325403B (en) | Method for predicting residual life of electromechanical equipment of highway tunnel | |
CN106127242A (en) | Year of based on integrated study Extreme Precipitation prognoses system and Forecasting Methodology thereof | |
Singh et al. | Groundwater pollution source identification and simultaneous parameter estimation using pattern matching by artificial neural network | |
CN108334943A (en) | The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model | |
Hanea et al. | Drill and learn: a decision-making work flow to quantify value of learning | |
CN107798431A (en) | A kind of Medium-and Long-Term Runoff Forecasting method based on Modified Elman Neural Network | |
DiRenzo et al. | A practical guide to understanding and validating complex models using data simulations | |
CN110334478A (en) | Machinery equipment abnormality detection model building method, detection method and model | |
Chang et al. | Reinforcement learning for improving the accuracy of pm2. 5 pollution forecast under the neural network framework | |
Padberg et al. | Using machine learning for estimating the defect content after an inspection | |
CN111898673A (en) | Dissolved oxygen content prediction method based on EMD and LSTM | |
Reddy et al. | The prediction of quality of the air using supervised learning | |
CN117591506B (en) | Site soil and groundwater environment monitoring data cleaning method based on fusion model | |
Sharma et al. | Hybrid Software Reliability Model for Big Fault Data and Selection of Best Optimizer Using an Estimation Accuracy Function | |
Ardimento et al. | Using deep temporal convolutional networks to just-in-time forecast technical debt principal | |
CN116502539A (en) | VOCs gas concentration prediction method and system | |
Luo et al. | Groundwater pollution source identification using Metropolis-Hasting algorithm combined with Kalman filter algorithm | |
CN116386756A (en) | Soft measurement modeling method based on integrated neural network reliability estimation and weighted learning | |
Saitta et al. | Feature selection using stochastic search: An application to system identification | |
Knüsel | Epistemological issues in data-driven modeling in climate research |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||