CN117591506B - Site soil and groundwater environment monitoring data cleaning method based on fusion model - Google Patents

Site soil and groundwater environment monitoring data cleaning method based on fusion model

Info

Publication number
CN117591506B
Authority
CN
China
Prior art keywords
data
model
data set
cleaning
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410046720.5A
Other languages
Chinese (zh)
Other versions
CN117591506A (en
Inventor
黄蕾
任富天
梅丹兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202410046720.5A priority Critical patent/CN117591506B/en
Publication of CN117591506A publication Critical patent/CN117591506A/en
Application granted granted Critical
Publication of CN117591506B publication Critical patent/CN117591506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Processing Of Solid Wastes (AREA)

Abstract

The invention discloses a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, belonging to the technical field of data processing for contaminated-site soil and groundwater. The method comprises the following steps: acquiring site soil and groundwater pollutant data and classifying it to obtain a classified data set; training a plurality of reference deep learning models on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of that network model according to its loss value to obtain a standard fusion model; inputting the data to be cleaned into the standard fusion model to detect and repair abnormal values; and merging the cleaned data and storing it in a database or data warehouse. By fusing a plurality of reference deep learning models, the method improves the accuracy and efficiency of data cleaning.

Description

Site soil and groundwater environment monitoring data cleaning method based on fusion model
Technical Field
The invention relates to the technical field of polluted site soil and groundwater data processing, in particular to a site soil and groundwater environment monitoring data cleaning method based on a fusion model.
Background
Over the full life cycle of investigation, risk assessment, risk management and long-term monitoring, a contaminated site yields a large amount of monitoring data on soil and groundwater pollutants. These data involve many samples and many monitored items, have a complex structure, and carry a large amount of implicit feature, relationship and classification information. At the same time, soil and groundwater data inevitably contain dirty data such as redundant, missing, uncertain and inconsistent records. Such data seriously reduce the usability of the data set as a whole, and abnormal data often bias the quality evaluation of the soil and groundwater environment, which in turn affects subsequent management decisions.
At present, data cleaning is based on outlier identification and outlier filling, and the related methods mainly work from the perspective of attribute values, spatial scale or time series. Smoothing-based, statistics-based and constraint-based data cleaning methods, for example, are applicable only to single-dimensional data. Few data cleaning methods combine the temporal and spatial scales. Even among studies on the spatio-temporal scale, most treat time and space independently: outliers are first detected separately on the temporal scale and on the spatial scale, and the results are then used to judge spatio-temporal outliers. Such an approach separates temporal and spatial correlation and does not account for the interaction between time and space.
Moreover, the mass data of a contaminated site are continuously updated rather than forming a fixed data set, so the demand for automatic extraction and analysis has grown greatly, and manually defining parsing rules for each record is no longer practical. Existing data cleaning algorithms are designed for static data of a single type; most handle only a single problem such as abnormal points or missing values, cannot incorporate prior information into the cleaning process, and do not organically combine anomaly detection with data repair, so they struggle to meet the comprehensive processing requirements of mass data.
Disclosure of Invention
The invention aims to provide a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, which establishes reference deep learning models by learning the feature distribution of the data and fuses a plurality of such models to improve the accuracy and efficiency of data cleaning.
To achieve the above purpose, the invention provides a fusion-model-based method for cleaning site soil and groundwater environment monitoring data, comprising the following steps:
S1, acquiring site soil and groundwater pollutant data as an original data set, and classifying the original data set to obtain a classified data set;
S2, training a plurality of reference deep learning models based on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model using its loss value to obtain a standard fusion model;
S3, inputting the data to be cleaned into the standard fusion model, and detecting and repairing abnormal values to obtain cleaned data;
S4, merging the cleaned data and storing the merged data in a database or a data warehouse.
Preferably, in step S1, the original data set comprises sample features and corresponding classification labels; the sample features are the sample number and the sample batch; the corresponding classification labels comprise soil monitoring indexes and groundwater monitoring indexes.
Preferably, in step S1, classifying the original data set comprises the following steps:
S11, performing feature extraction and feature preprocessing on the original data set according to the different features of the original data to obtain a standardized data set, wherein the standardized data set inherits the sample features and corresponding classification labels of the original data set;
S12, classifying the standardized data set into site soil pollutant data and groundwater pollutant data, using the detection index as the classification label, to obtain the classified data set.
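Steps S11 and S12 can be illustrated with a minimal Python sketch. The record layout, the field names (`batch`, `point`, `medium`, `index`, `value`, `unit`) and the sample values are hypothetical and are not taken from the patent; the sketch only shows standardization that preserves labels, followed by grouping on the detection index.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"batch": "2019-01", "point": "GW-001", "medium": "groundwater",
     "index": "ammonia nitrogen", "value": "0.52", "unit": "mg/L"},
    {"batch": "2019-01", "point": "S-014", "medium": "soil",
     "index": "arsenic", "value": " 11.3 ", "unit": "mg/kg"},
    {"batch": "2019-02", "point": "GW-001", "medium": "groundwater",
     "index": "ammonia nitrogen", "value": "", "unit": "mg/L"},
]


def standardize(record):
    """Return a copy with a numeric value (None if missing); labels are kept."""
    out = dict(record)
    text = record["value"].strip()
    out["value"] = float(text) if text else None
    return out


def classify(rows):
    """Group standardized rows by (medium, detection index)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["medium"], row["index"])].append(row)
    return dict(groups)


classified = classify(standardize(r) for r in records)
```

Each group then holds the time series of one detection index for one medium, which is the unit on which a model would be trained in the later steps.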
Preferably, in step S2, training a plurality of reference deep learning models based on the classified data set and constructing a multi-model fused data cleaning network model comprises the following steps:
S21, dividing the classified data set obtained in step S1 into a training data set, a verification data set and a test data set;
S22, for abnormal value detection and abnormal value repair, training deep learning models with the classified data set so that they automatically learn the features and patterns of the data, and establishing a plurality of reference deep learning models by introducing the physical prior knowledge of soil and groundwater, in the form of partial differential equations (PDEs), into the loss function to constrain the solution space of the deep learning models;
S23, training each reference deep learning model with the training data set to obtain a training result;
S24, verifying each reference deep learning model with the verification data set to obtain a verification result;
S25, testing each reference deep learning model with the test data set to obtain a test result;
S26, evaluating the test result of each converged reference deep learning model and taking it as a network weight, fusing the plurality of reference deep learning models with the network weights to construct the multi-model fused data cleaning network model, training the multi-model fused data cleaning network model with the training data set, and verifying it with the verification data set;
S27, obtaining the loss value of the multi-model fused data cleaning network model with the test data set, adjusting the parameters of the multi-model fused data cleaning network model according to the loss value, and outputting the standard fusion model.
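Step S26 uses the test result of each reference model as a network weight. The patent does not state how a test result is mapped to a weight, so the following Python sketch makes an assumption: higher test scores are better, the weights are the scores normalized to sum to 1, and the models being fused output the same quantity. The names `fusion_weights`, `fuse`, `lstm` and `pinn` are illustrative.

```python
def fusion_weights(test_scores):
    """Normalize per-model test scores (higher is better) so they sum to 1."""
    total = sum(test_scores.values())
    if total <= 0:
        raise ValueError("test scores must have a positive sum")
    return {name: score / total for name, score in test_scores.items()}


def fuse(predictions, weights):
    """Weighted average of per-model predictions for one sample."""
    return sum(weights[name] * predictions[name] for name in predictions)


# Two hypothetical reference models with test scores 0.9 and 0.6.
weights = fusion_weights({"lstm": 0.9, "pinn": 0.6})
fused = fuse({"lstm": 1.0, "pinn": 2.0}, weights)
```

In a trained fusion network these weights would only be a starting point, since steps S26 and S27 go on to train and adjust the fused model.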
Preferably, in step S3, inputting the data to be cleaned into the standard fusion model and performing abnormal value detection and abnormal value repair to obtain cleaned data comprises the following steps:
S31, acquiring the data to be cleaned and inputting it into the standard fusion model to obtain a data cleaning result;
S32, if an abnormal value deviates in the positive direction from the outlier standard value, carrying out resampling and laboratory analysis; data that deviate in the positive direction from the outlier standard value undergo outlier detection only and are not repaired;
the cleaned data set is subdivided into confirmed clean data, repaired abnormal data and uncertain data, wherein the uncertain data hold the abnormal values that deviate in the positive direction from the outlier standard value and are to be judged manually;
S33, repeating steps S21 to S26 with the confirmed clean data and the repaired abnormal data in the cleaned data set to iteratively train the standard fusion model; after each iteration the standard fusion model obtains newly corrected parameters and is fine-tuned, and the new standard fusion model is updated and saved.
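The three-way split in step S32 can be sketched as a small decision function. This is a minimal sketch under assumptions: the outlier standard value is represented as a lower and an upper bound, a missing or below-range value is repaired with a model-supplied estimate, and only values above the upper bound are routed to manual review. The function name and arguments are hypothetical.

```python
def triage(observed, lower, upper, repaired_value):
    """Return (category, value) for one measurement.

    Assumed rule set: values above `upper` (positive deviation) are flagged
    as uncertain and left unrepaired; missing or below-range values take the
    repaired estimate; everything else is confirmed clean.
    """
    if observed is None:
        return "repaired", repaired_value
    if observed > upper:
        # Kept as measured: resampling and manual review decide its fate.
        return "uncertain", observed
    if observed < lower:
        return "repaired", repaired_value
    return "clean", observed
```

Only the "clean" and "repaired" outputs would feed the retraining loop of step S33; "uncertain" values wait for the outcome of resampling and laboratory analysis.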
Preferably, in step S22, abnormal value detection comprises missing value detection, duplicate value detection, outlier detection, data format consistency detection, erroneous data detection and unreasonable data detection; abnormal value repair comprises missing value filling, duplicate value replacement, outlier smoothing, erroneous data replacement and unreasonable data replacement.
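Two of the listed operations, duplicate value detection and missing value filling, can be illustrated with simple rule-based Python. This is a non-learning baseline given for illustration only: the patent performs detection and repair with deep learning models, and the tuple layout `(batch, point, value)` is an assumption.

```python
def find_duplicates(rows):
    """Return indices of rows whose (batch, point) key was already seen."""
    seen, duplicates = set(), []
    for i, (batch, point, _value) in enumerate(rows):
        key = (batch, point)
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return duplicates


def fill_missing(series):
    """Fill interior None values by linear interpolation between known neighbors.

    Leading and trailing None values have no neighbor on one side and are
    left unchanged.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled
```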
Preferably, in step S22, the deep learning model comprises one or more of RNN, LSTM, Transformer and variants of these deep learning models.
Preferably, in step S32, the outlier standard value is calculated using the torch.nn module of the PyTorch framework, which implements the fitting of the monitoring data and the calculation of the outlier standard value.
Preferably, in step S4, the database is a MongoDB document database, and the data are stored in JSON/BSON format; the data warehouse is the HDFS distributed file system.
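A cleaned record stored in JSON form might look like the following Python sketch. The document schema (`batch`, `point`, `index`, `value`, `status`) is hypothetical; the patent specifies only that MongoDB stores the data as JSON/BSON. The sketch uses the standard `json` module and does not connect to a database.

```python
import json


def to_document(batch, point, index, value, status):
    """Build a JSON-serializable document for one cleaned measurement."""
    if status not in {"clean", "repaired", "uncertain"}:
        raise ValueError(f"unknown status: {status}")
    return {
        "batch": batch,
        "point": point,
        "index": index,
        "value": value,
        "status": status,
    }


doc = to_document("2019-01", "GW-001", "ammonia nitrogen", 0.52, "clean")
payload = json.dumps(doc, ensure_ascii=False, sort_keys=True)
```

Keeping the cleaning status in each document lets later analysis separate measured values from repaired ones.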
The fusion-model-based method for cleaning site soil and groundwater environment monitoring data therefore has the following technical effects:
(1) With multi-model fusion, the cooperation among several reference models, each with its own data processing characteristics, allows the various kinds of abnormal values to be found accurately while valid data are fully retained, which improves the accuracy of data cleaning, greatly reduces the amount of calculation and shortens the data cleaning time;
(2) Taking into account the correlation between pollutant data in soil and in groundwater, a neural network with embedded physical knowledge, driven by both data and knowledge, is constructed to simulate the migration of pollutants in soil and groundwater, which improves the rationality and interpretability of abnormal data repair;
(3) The multi-model fused data cleaning network model is highly adaptable and can handle a variety of complex data cleaning tasks; by fusing a plurality of reference deep learning models it achieves better performance than a single model.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of the fusion-model-based method for cleaning site soil and groundwater environment monitoring data;
FIG. 2 shows the long short-term memory (LSTM) deep learning model used for outlier detection in the present invention;
FIG. 3 shows the physical-information-driven deep learning model used for outlier repair in the present invention;
FIG. 4 shows the outlier detection and outlier repair scenario of the first embodiment (after manual review the uncertain data are found to be abnormal and are repaired);
FIG. 5 shows the outlier detection and outlier repair scenario of the second embodiment (after manual review the uncertain data are classified as confirmed clean data).
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
Example 1
As shown in FIG. 1, the fusion-model-based method for cleaning site soil and groundwater environment monitoring data specifically comprises the following steps:
S1, classifying the acquired site soil and groundwater pollutant data
S11, acquiring long time series of soil and groundwater pollutant data from the contaminated site of a production enterprise, the pollutant data comprising detection batch, detection point number, pollutant type, pollutant concentration, unit and detection limit;
S12, standardizing the format of the site soil and groundwater pollutant data with an algorithm written in Python to obtain a standardized data set, which inherits the sample features and corresponding classification labels of the original data;
S13, classifying the standardized data set into soil pollutant data and groundwater pollutant data, using the pollutant type as the classification label. In this embodiment, soil and groundwater were monitored monthly from January 2019 to December 2021, giving 36 batches in total; there are 362 soil monitoring points with 85 soil detection indexes and 556 groundwater monitoring points with 232 groundwater detection indexes, and the total data volume exceeds the 5×10^6 level.
S2, training a plurality of reference deep learning models based on the classified data, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model using its loss value to obtain a standard fusion model
S21, dividing the classified data set obtained in S1 into a training data set, a verification data set and a test data set. Taking the groundwater detection index ammonia nitrogen as an example, this embodiment uses a cross-validation method and divides the data into training, verification and test sets in proportions of 70%, 10% and 20%.
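The 70%/10%/20% division can be sketched in Python as follows. The sketch shows a single chronological hold-out split rather than a full cross-validation scheme, and the rounding rule is an assumption, since the embodiment does not say how fractional batch counts are handled.

```python
def split_indices(n, train=0.7, val=0.1):
    """Split range(n) chronologically into train/validation/test index lists.

    The test set receives whatever remains after rounding the other two.
    """
    n_train = round(n * train)
    n_val = round(n * val)
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 36 monthly batches, as in this embodiment.
train_idx, val_idx, test_idx = split_indices(36)
```

With 36 monthly batches this rounding gives 25, 4 and 7 batches, so the realized proportions only approximate 70%, 10% and 20%.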
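S22-S25, the migration of contaminants in groundwater forms a time series, so in this embodiment outlier detection is based on an LSTM neural network. FIG. 2 shows the structure of the LSTM cell at time step $t$ in one layer of the neural network; the arrows in the figure represent information flows, and the three dashed boxes represent the three gates: the forget gate, the input gate and the output gate. The three gates are calculated with a sigmoid activation function. Using the currently input information $X_t$, the state quantity $h_{t-1}$ output in the previous round and the information carrier $C_{t-1}$ of the previous round, the cell obtains the state quantity $h_t$ output in the current round and the information carrier $C_t$ output in the current round. Here $\sigma$ is the activation function of the gates and $\tanh$ is the hyperbolic tangent function; $f_t$, $i_t$ and $o_t$ are the calculated values of the forget gate, the input gate and the output gate, calculated as follows:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, X_t] + b_f\right) \tag{1}$$
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, X_t] + b_i\right) \tag{2}$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, X_t] + b_C\right) \tag{3}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, X_t] + b_o\right) \tag{5}$$
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
where $W_f$, $W_i$ and $W_o$ are the input weights of the forget gate, the input gate and the output gate respectively; $b_f$, $b_i$ and $b_o$ are the biases of the forget gate, the input gate and the output gate respectively; $\tilde{C}_t$ is the cell state variable, with weight $W_C$ and bias $b_C$.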
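As a worked illustration of a single LSTM cell update, the Python sketch below implements the textbook scalar form of the three gates and the cell state. It assumes the standard LSTM formulation (sigmoid gates, tanh candidate state); the parameter names in the dictionary `p` are hypothetical, and a real model would use vector-valued weights, for example through `torch.nn.LSTM`.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps names such as 'wf_h', 'wf_x', 'bf' to floats."""
    f_t = sigmoid(p["wf_h"] * h_prev + p["wf_x"] * x_t + p["bf"])      # forget gate
    i_t = sigmoid(p["wi_h"] * h_prev + p["wi_x"] * x_t + p["bi"])      # input gate
    c_hat = math.tanh(p["wc_h"] * h_prev + p["wc_x"] * x_t + p["bc"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat                                   # new cell state
    o_t = sigmoid(p["wo_h"] * h_prev + p["wo_x"] * x_t + p["bo"])      # output gate
    h_t = o_t * math.tanh(c_t)                                         # new hidden state
    return h_t, c_t


# With all parameters zero, every gate equals sigmoid(0) = 0.5.
names = ("wf_h", "wf_x", "bf", "wi_h", "wi_x", "bi",
         "wc_h", "wc_x", "bc", "wo_h", "wo_x", "bo")
zero_params = {name: 0.0 for name in names}
h1, c1 = lstm_step(0.3, 0.0, 2.0, zero_params)
```

In the zero-parameter case the cell state is simply halved (from 2.0 to 1.0), which makes the role of the forget gate easy to check by hand.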
In this embodiment, the loss function tends to converge once the number of iterations reaches 500; the number of iterations is set to 600 in this training to improve model performance, and the reference deep learning model for outlier detection is thereby constructed. In this embodiment abnormal values are identified accurately, with an identification accuracy of 99.5%.
S22-S25, pollutants enter the groundwater liquid phase through the soil solid phase, so the distributions of pollutants in soil and in groundwater share common features while also differing. In this embodiment outlier repair is based on a PyTorch neural network. As shown in FIG. 3, the groundwater flow equation and the groundwater pollutant migration equation are used as physical prior knowledge to construct a physics-prior-driven loss function: a loss term penalizing predictions that violate the physical laws is added to the loss function, which imposes a physical constraint on the deep learning model.
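The groundwater flow PDE is:
$$\frac{\partial}{\partial x}\left(k_x \frac{\partial h}{\partial x}\right) + \frac{\partial}{\partial y}\left(k_y \frac{\partial h}{\partial y}\right) + \frac{\partial}{\partial z}\left(k_z \frac{\partial h}{\partial z}\right) + \omega = \mu_s \frac{\partial h}{\partial t} \tag{7}$$
where $h$ is the hydraulic head, m; $k_x$, $k_y$ and $k_z$ are the permeability coefficients in the x, y and z directions, m/d; $\omega$ is the source/sink term, d$^{-1}$; $\mu_s$ is the specific storage, m$^{-1}$; $t$ is time, d.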
The groundwater contaminant migration PDE is:
$$\frac{\partial (\theta C^k)}{\partial t} = \frac{\partial}{\partial x_i}\left(\theta D_{ij} \frac{\partial C^k}{\partial x_j}\right) - \frac{\partial}{\partial x_i}\left(\theta v_i C^k\right) + q_s C_s^k + \sum R_n \tag{8}$$
where $\theta$ is the porosity of the medium; $C^k$ is the concentration of solute component $k$, mg/L; $D_{ij}$ is the hydrodynamic dispersion coefficient tensor, m$^2$/d; $v_i$ is the actual pore water flow velocity, m/d; $q_s$ is the volume of fluid supplied to or removed from a unit volume of aquifer, representing the source and sink terms, with positive values for sources and negative values for sinks, d$^{-1}$; $C_s^k$ is the concentration of solute component $k$ in the source or sink term, mg/L; $\sum R_n$ is the sum of the chemical reaction terms, mg/(L·d).
A penalty term for predictions that violate the physical laws is added to the loss function to impose a physical constraint on the deep learning model. As shown in FIG. 3, the loss comprises two parts:
$$MSE = MSE_u + MSE_f \tag{9}$$
$$MSE_f = \frac{1}{N_f}\sum_{i=1}^{N_f}\left| f\left(x_f^i, t_f^i\right) \right|^2 \tag{10}$$
$$MSE_u = \frac{1}{N_u}\sum_{i=1}^{N_u}\left| \hat{u}\left(x_u^i, t_u^i\right) - u^i \right|^2 \tag{11}$$
where $MSE_f$ is the physical-law loss term, representing the mean square error between the neural network fitted value $\hat{u}$ and the constraint of the real physical laws, with $f$ the residual of the governing PDE evaluated on $\hat{u}$ at the $N_f$ interior collocation points. When the physical-law equations are fitted well at the interior collocation points, $MSE_f$ approaches zero; when the value of this loss term is 0, the neural network value at every interior point approaches the true value. $MSE_u$ is the training data loss term, representing the mean square error between the neural network fitted value $\hat{u}$ at the initial and boundary conditions and the true value $u$ over the $N_u$ training data points; for any training data point, $MSE_u$ approaches zero.
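The two-part loss can be illustrated with a small Python sketch that sums a data term and a physics-residual term. Several simplifications are assumptions made for illustration: the physics is reduced to the one-dimensional, source-free flow equation with constant coefficients, k·∂²h/∂x² = μ_s·∂h/∂t; derivatives are taken by finite differences instead of the automatic differentiation a PyTorch model would use; and the candidate function stands in for the neural network.

```python
def residual(h, x, t, k=1.0, mu_s=1.0, d=1e-3):
    """Finite-difference residual of k*h_xx - mu_s*h_t at the point (x, t)."""
    h_xx = (h(x + d, t) - 2.0 * h(x, t) + h(x - d, t)) / d ** 2
    h_t = (h(x, t + d) - h(x, t - d)) / (2.0 * d)
    return k * h_xx - mu_s * h_t


def pinn_loss(h, data_points, collocation_points):
    """Return (total, mse_u, mse_f): data loss plus physics-residual loss."""
    mse_u = sum((h(x, t) - u) ** 2 for x, t, u in data_points) / len(data_points)
    mse_f = sum(residual(h, x, t) ** 2
                for x, t in collocation_points) / len(collocation_points)
    return mse_u + mse_f, mse_u, mse_f


def steady(x, t):
    """Linear head profile: satisfies the 1D equation exactly."""
    return 10.0 - 2.0 * x


def drifting(x, t):
    """Same profile plus a time drift: fits the data at t = 0 but violates the PDE."""
    return 10.0 - 2.0 * x + t


data = [(0.0, 0.0, 10.0), (1.0, 0.0, 8.0)]       # (x, t, observed head)
colloc = [(0.25, 0.5), (0.5, 0.5), (0.75, 0.5)]  # interior collocation points
total_ok, mse_u_ok, mse_f_ok = pinn_loss(steady, data, colloc)
total_bad, mse_u_bad, mse_f_bad = pinn_loss(drifting, data, colloc)
```

Both candidates match the observations, so the data term alone cannot separate them; only the physics term penalizes the drifting profile, which is the role the physical prior plays in constraining the repair model.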
In FIG. 3, x represents the input data; t represents time; θ represents the set of parameters of the neural network, including weights and biases; NN(x, t; θ) represents the neural network function that makes an inference or prediction from the input x and time t; ε denotes an arbitrarily chosen value, taken as tending to 0 in this embodiment. In this embodiment the loss function tends to converge once the number of iterations reaches 10000; the number of iterations is set to 10500 in this training to improve model performance, and the resulting pollutant simulation is used as the reference deep learning model for outlier repair.
S26-S27, in this embodiment one reference deep learning model is constructed for outlier detection and one for outlier repair, and the multi-model fused data cleaning network model is constructed from them using a deep learning model fusion strategy. The fusion model is trained with the training data set (70%), the verification data set (10%) and the test data set (20%); its parameters are adjusted step by step according to the loss function, and the standard fusion model is output. In this embodiment the training result meets the convergence requirement once the fusion model has run 5000 iterations, and the trained fusion model is saved as the standard fusion model.
S3, inputting the data to be cleaned into the standard fusion model, and detecting and repairing abnormal values to obtain cleaned data
S31, inputting the data to be cleaned into the multi-model fused data cleaning network model trained in S2, and detecting and repairing abnormal values;
S32, in this embodiment the outlier standard value for the change in ammonia nitrogen concentration is calculated. The torch.nn module of the PyTorch framework is used to fit the monitoring data: over multiple training epochs the model gradually approaches the training data, and after training the model is used to generate a prediction band whose upper and lower bounds define the range of outlier boundary values. As shown in FIG. 4, of the 36 groups of monitoring data, 33 groups are {confirmed clean data}; 2 groups are {repaired abnormal data}, for which missing value filling and unreasonable data replacement were performed respectively; and 1 group is {uncertain data}, which manual review determined to be a data detection anomaly and which was repaired, with the repaired value replacing the original abnormal value.
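One way to turn fitted predictions into upper and lower outlier bounds is sketched below in Python. The embodiment does not state how the prediction band is derived from the fitted model, so the rule used here, prediction plus or minus `z` times the standard deviation of the training residuals, is an assumption, as are the function name and the default `z = 3.0`.

```python
import statistics


def prediction_band(predicted, residuals, z=3.0):
    """Return (lower, upper) lists: prediction +/- z * residual standard deviation."""
    spread = z * statistics.pstdev(residuals)
    return [p - spread for p in predicted], [p + spread for p in predicted]


# Hypothetical fitted concentrations (mg/L) and training residuals.
lower, upper = prediction_band([1.0, 1.2], residuals=[-0.1, 0.1, -0.1, 0.1])
```

A measurement above the upper bound would then count as a positive deviation and be routed to the uncertain data for resampling and manual review.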
S33, repeating steps S21 to S26 with the 35 groups of cleaned data, comprising {confirmed clean data} and {repaired abnormal data}, and updating and saving the iteratively trained model as the new standard fusion model for subsequent data cleaning.
S4, merging the cleaned data and storing it in a MongoDB database to facilitate subsequent analysis and modeling.
Example two
The second embodiment uses the same fusion model, generated by steps S1 to S2, as the first embodiment. The difference, as shown in FIG. 5, is that of the 36 groups of monitoring data, 32 groups are {confirmed clean data}; 3 groups are {repaired abnormal data}, for which missing value filling, unreasonable data replacement and outlier smoothing were performed respectively; and 1 group is {uncertain data}, which manual review determined to involve no anomaly in the data detection process. Further investigation found a new pollution source, so the model needs further training, as follows: the data before the {uncertain data} are discarded, step S2 is repeated using the {uncertain data} and the subsequent monitoring data, and a new fusion model for data cleaning is obtained by training.
The fusion-model-based method for cleaning site soil and groundwater environment monitoring data thus adopts multi-model fusion: the cooperation among several reference models, each with its own data processing characteristics, allows the various kinds of abnormal values to be found accurately while valid data are fully retained, which improves the accuracy of data cleaning, greatly reduces the amount of calculation and shortens the data cleaning time. Taking into account the correlation between pollutant data in soil and in groundwater, a neural network with embedded physical knowledge, driven by both data and knowledge, is constructed to simulate the migration of pollutants in soil and groundwater, which improves the accuracy of abnormal data repair. The multi-model fused data cleaning network model is highly adaptable and can handle a variety of complex data cleaning tasks; by fusing a plurality of reference deep learning models it achieves better performance than a single model.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may be modified or equivalently replaced, provided that the modified technical solution does not depart from the spirit and scope of the technical solution of the invention.

Claims (8)

1. A method for cleaning site soil and groundwater environment monitoring data based on a fusion model, characterized by comprising the following steps:
S1, acquiring site soil and groundwater pollutant data as an original data set, and classifying the original data set to obtain a classified data set;
S2, training a plurality of reference deep learning models based on the classified data set, constructing a multi-model fused data cleaning network model, and adjusting the parameters of the multi-model fused data cleaning network model using its loss value to obtain a standard fusion model;
S3, inputting the data to be cleaned into the standard fusion model, and detecting and repairing abnormal values to obtain cleaned data;
S4, merging the cleaned data and storing the merged data in a database or a data warehouse;
wherein in step S2, training a plurality of reference deep learning models based on the classified data set and constructing a multi-model fused data cleaning network model comprises the following steps:
S21, dividing the classified data set obtained in step S1 into a training data set, a verification data set and a test data set;
S22, for abnormal value detection and abnormal value repair, training deep learning models with the classified data set so that they automatically learn the features and patterns of the data, and establishing a plurality of reference deep learning models by introducing the physical prior knowledge of soil and groundwater, in the form of partial differential equations (PDEs), into the loss function to constrain the solution space of the deep learning models;
S23, training each reference deep learning model with the training data set to obtain a training result;
S24, verifying each reference deep learning model with the verification data set to obtain a verification result;
S25, testing each reference deep learning model with the test data set to obtain a test result;
S26, evaluating the test result of each converged reference deep learning model and taking it as a network weight, fusing the plurality of reference deep learning models with the network weights to construct the multi-model fused data cleaning network model, training the multi-model fused data cleaning network model with the training data set, and verifying it with the verification data set;
S27, obtaining the loss value of the multi-model fused data cleaning network model with the test data set, adjusting the parameters of the multi-model fused data cleaning network model according to the loss value, and outputting the standard fusion model.
2. The method for cleaning field soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S1, the original dataset includes sample features and corresponding classification labels; sample characteristics are sample number and sample lot; the corresponding classification labels comprise soil monitoring indexes and underground water monitoring indexes.
3. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 2, wherein in step S1, classifying the original data set comprises the following steps:
S11, performing feature extraction and feature preprocessing on the original data set according to the different features of the original data to obtain a standardized data set, wherein the standardized data set inherits the sample features and corresponding classification labels of the original data set;
S12, classifying the site soil and groundwater pollutant data in the standardized data set, using the detection index as the classification label, to obtain a classified data set.
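Steps S11 and S12 can be pictured with a minimal sketch. The claims do not specify the preprocessing, so z-score standardisation per detection index is an assumption, as are the record fields sample_no, batch, medium, index and value and the function names standardize and classify.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Record = Dict[str, object]
Label = Tuple[str, str]  # (medium, detection index), e.g. ("soil", "As")


def standardize(records: List[Record]) -> List[Record]:
    """S11 (sketch): add a z-score per detection index, keeping all original fields."""
    values = defaultdict(list)
    for rec in records:
        values[(rec["medium"], rec["index"])].append(rec["value"])
    stats = {key: (mean(v), pstdev(v)) for key, v in values.items()}
    standardized = []
    for rec in records:
        mu, sigma = stats[(rec["medium"], rec["index"])]
        # A constant series has sigma == 0; its z-score is defined as 0 here.
        z = (rec["value"] - mu) / sigma if sigma > 0 else 0.0
        standardized.append({**rec, "z": z})
    return standardized


def classify(standardized: List[Record]) -> Dict[Label, List[Record]]:
    """S12 (sketch): group records by medium and detection index."""
    classified = defaultdict(list)
    for rec in standardized:
        classified[(rec["medium"], rec["index"])].append(rec)
    return dict(classified)
```

Because each output record is a copy of the input record with one added field, the sample number, sample batch and classification label are inherited unchanged, as step S11 requires.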
4. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 3, wherein in step S3, inputting the data to be cleaned into the standard fusion model and performing abnormal value detection and abnormal value repair to obtain cleaned data comprises the following steps:
S31, acquiring the data to be cleaned, and inputting the data to be cleaned into the standard fusion model to obtain a data cleaning result;
S32, if an abnormal value deviates positively from the outlier standard value, carrying out resampling and laboratory analysis; data that deviate positively from the outlier standard value are subjected to outlier detection only and are not repaired;
the cleaned data set is divided into confirmed clean data, repaired abnormal data and uncertain data, wherein the uncertain data hold the abnormal values that deviate positively from the outlier standard value, which are left to manual judgment;
S33, repeating steps S21-S26 on the confirmed clean data and the repaired abnormal data in the cleaned data set to iteratively train the standard fusion model, wherein after each iteration the standard fusion model obtains new corrected parameters and is fine-tuned, and the new standard fusion model is updated and stored.
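The three-way split of step S32 and the selection of retraining data in step S33 might look like the sketch below. It assumes that "deviates positively" means the measured value exceeds the outlier standard value, and that each cleaning result carries hypothetical fields value, is_abnormal and repaired_value. The names triage and retraining_set are likewise illustrative.

```python
from typing import Dict, List, Optional

Result = Dict[str, object]


def triage(results: List[Result], outlier_standard: float) -> Dict[str, List[Result]]:
    """Split cleaning results into clean, repaired and uncertain records."""
    buckets: Dict[str, List[Result]] = {"clean": [], "repaired": [], "uncertain": []}
    for res in results:
        value: Optional[float] = res["value"]
        if value is not None and value > outlier_standard:
            # Detected only, never repaired: flag for resampling and manual review.
            buckets["uncertain"].append({**res, "action": "resample"})
        elif res["is_abnormal"]:
            # Other abnormal values take the model's repaired value.
            buckets["repaired"].append({**res, "value": res["repaired_value"]})
        else:
            buckets["clean"].append(res)
    return buckets


def retraining_set(buckets: Dict[str, List[Result]]) -> List[Result]:
    """S33 (sketch): only clean and repaired records feed the next training round."""
    return buckets["clean"] + buckets["repaired"]
```

Keeping the uncertain records out of the retraining set means a possibly genuine contamination signal is neither smoothed away nor used to teach the model until a laboratory result confirms it.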
5. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S22, abnormal value detection comprises missing value detection, repeated value detection, outlier detection, data format consistency detection, erroneous data detection and unreasonable data detection; abnormal value repair comprises missing value filling, repeated value replacement, outlier smoothing, erroneous data replacement and unreasonable data replacement.
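To make the detection and repair categories concrete, here is a rule-based sketch covering four of them (repeated, missing, unreasonable and outlier values). It is not the claimed deep learning model: detection uses a median absolute deviation rule, repair fills with the median of the unflagged values, and repeated samples are dropped rather than replaced. The threshold k, the lower bound and the function names are assumptions, and the series must contain at least one valid value.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[str, Optional[float]]]  # (sample number, measured value)


def detect(series: Series, lower: float = 0.0, k: float = 3.0) -> Dict[int, str]:
    """Return {position: issue} for repeated, missing, unreasonable and outlier values."""
    valid = [v for _, v in series if v is not None and v >= lower]
    med = median(valid)
    mad = median(abs(v - med) for v in valid)  # median absolute deviation
    issues: Dict[int, str] = {}
    seen = set()
    for pos, (sample_no, value) in enumerate(series):
        if sample_no in seen:
            issues[pos] = "duplicate"
        elif value is None:
            issues[pos] = "missing"
        elif value < lower:
            issues[pos] = "unreasonable"  # e.g. a negative concentration
        elif mad > 0 and abs(value - med) / mad > k:
            issues[pos] = "outlier"
        seen.add(sample_no)
    return issues


def repair(series: Series, issues: Dict[int, str]) -> Series:
    """Drop repeated samples and fill every other flagged value with the clean median."""
    fill = median(v for pos, (_, v) in enumerate(series) if pos not in issues)
    cleaned: Series = []
    for pos, (sample_no, value) in enumerate(series):
        if issues.get(pos) == "duplicate":
            continue
        cleaned.append((sample_no, fill if pos in issues else value))
    return cleaned
```

In the claimed method the repaired value would come from the trained model, and outliers above the outlier standard value would be left unrepaired; the median fill here only stands in for that step.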
6. The method of claim 1, wherein in step S22, the deep learning model comprises one or more of RNN, LSTM, Transformer and variants of these deep learning models.
7. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 4, wherein in step S32, the outlier standard value is calculated using the torch.nn module of the PyTorch framework, which performs both the monitoring data fitting and the outlier standard value calculation.
8. The method for cleaning site soil and groundwater environment monitoring data based on a fusion model according to claim 1, wherein in step S4, the database is a MongoDB document database in which the data are stored in JSON/BSON format; the data repository is an HDFS distributed file system.
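As a rough picture of the JSON/BSON storage in claim 8, a cleaned record could be shaped into a document as sketched below. The field names and the to_document helper are hypothetical, and only the standard json module is exercised; with a MongoDB driver such as pymongo, the same dictionary could be passed to a collection's insert_one method.

```python
import json
from datetime import date
from typing import Dict, Optional


def to_document(sample_no: str, batch: str, medium: str, index: str,
                value: Optional[float], status: str, sampled_on: date) -> Dict[str, object]:
    """Build one JSON-serialisable document for a cleaned monitoring record."""
    return {
        "sample_no": sample_no,
        "batch": batch,
        "medium": medium,        # "soil" or "groundwater"
        "index": index,          # detection index used as the classification label
        "value": value,
        "status": status,        # "clean", "repaired" or "uncertain"
        "sampled_on": sampled_on.isoformat(),  # dates stored as ISO 8601 strings
    }


doc = to_document("S1", "B1", "soil", "As", 12.5, "clean", date(2024, 1, 12))
payload = json.dumps(doc, sort_keys=True)
```

A document model suits this data because soil and groundwater samples carry different sets of detection indexes, so records do not need to share a fixed column layout.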
CN202410046720.5A 2024-01-12 2024-01-12 Site soil and groundwater environment monitoring data cleaning method based on fusion model Active CN117591506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410046720.5A CN117591506B (en) 2024-01-12 2024-01-12 Site soil and groundwater environment monitoring data cleaning method based on fusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410046720.5A CN117591506B (en) 2024-01-12 2024-01-12 Site soil and groundwater environment monitoring data cleaning method based on fusion model

Publications (2)

Publication Number Publication Date
CN117591506A CN117591506A (en) 2024-02-23
CN117591506B CN117591506B (en) 2024-03-22

Family

ID=89922206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410046720.5A Active CN117591506B (en) 2024-01-12 2024-01-12 Site soil and groundwater environment monitoring data cleaning method based on fusion model

Country Status (1)

Country Link
CN (1) CN117591506B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378480A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Model training method, device and computer readable storage medium
CN111199343A (en) * 2019-12-24 2020-05-26 上海大学 Multi-model fusion tobacco market supervision abnormal data mining method
CN112950047A (en) * 2021-03-18 2021-06-11 京师天启(北京)科技有限公司 Progressive identification method for suspected contaminated site
CN116128417A (en) * 2022-12-28 2023-05-16 上海龙照电子有限公司 Method and system for identifying and early warning missing part risk of computer hardware inventory
CN116522566A (en) * 2023-07-05 2023-08-01 南京大学 Groundwater monitoring network optimization method based on physical information driven deep learning model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230403588A1 (en) * 2022-06-10 2023-12-14 Qualcomm Incorporated Machine learning data collection, validation, and reporting configurations

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A robust framework for identification of PDEs from noisy data;Zhang Zhiming et al.;《Journal of Computational Physics》;20211231;1-11 *
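Threaded hole detection method based on guided filtering and neural network algorithms; Ma Xiaofeng et al.; 《制造技术与机床》 (Manufacturing Technology & Machine Tool); 20220131; 165-170 *
Risk assessment of soil and groundwater at a vacated riverside chemical-industry contaminated site in southern China; Zhou Meichun et al.; 《环境生态学》 (Environmental Ecology); 20230731; 33-38 *
Development of diatomite-based water treatment agents and their application in the advanced treatment of urban sewage; Xu Yuan; 《中国优秀硕士学位论文全文数据库 工程科技Ⅰ辑》 (China Master's Theses Full-text Database, Engineering Science and Technology I); 20220315; B016-1232 *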

Also Published As

Publication number Publication date
CN117591506A (en) 2024-02-23

Similar Documents

Publication Publication Date Title
Behmel et al. Water quality monitoring strategies—A review and future perspectives
Aytek et al. A genetic programming approach to suspended sediment modelling
Zhao et al. Water quality evolution mechanism modeling and health risk assessment based on stochastic hybrid dynamic systems
Okafor et al. Missing data imputation on IoT sensor networks: Implications for on-site sensor calibration
Reed et al. Save now, pay later? Multi-period many-objective groundwater monitoring design given systematic model errors and uncertainty
CN111325403B (en) Method for predicting residual life of electromechanical equipment of highway tunnel
CN106127242A (en) Year of based on integrated study Extreme Precipitation prognoses system and Forecasting Methodology thereof
Singh et al. Groundwater pollution source identification and simultaneous parameter estimation using pattern matching by artificial neural network
CN108334943A (en) The semi-supervised soft-measuring modeling method of industrial process based on Active Learning neural network model
Hanea et al. Drill and learn: a decision-making work flow to quantify value of learning
CN107798431A (en) A kind of Medium-and Long-Term Runoff Forecasting method based on Modified Elman Neural Network
DiRenzo et al. A practical guide to understanding and validating complex models using data simulations
CN110334478A (en) Machinery equipment abnormality detection model building method, detection method and model
Chang et al. Reinforcement learning for improving the accuracy of PM2.5 pollution forecast under the neural network framework
Padberg et al. Using machine learning for estimating the defect content after an inspection
CN111898673A (en) Dissolved oxygen content prediction method based on EMD and LSTM
Reddy et al. The prediction of quality of the air using supervised learning
CN117591506B (en) Site soil and groundwater environment monitoring data cleaning method based on fusion model
Sharma et al. Hybrid Software Reliability Model for Big Fault Data and Selection of Best Optimizer Using an Estimation Accuracy Function
Ardimento et al. Using deep temporal convolutional networks to just-in-time forecast technical debt principal
CN116502539A (en) VOCs gas concentration prediction method and system
Luo et al. Groundwater pollution source identification using Metropolis-Hasting algorithm combined with Kalman filter algorithm
CN116386756A (en) Soft measurement modeling method based on integrated neural network reliability estimation and weighted learning
Saitta et al. Feature selection using stochastic search: An application to system identification
Knüsel Epistemological issues in data-driven modeling in climate research

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant