CN113723679B - Drinking water quality prediction method and system based on cost-sensitive deep cascade forests - Google Patents


Info

Publication number
CN113723679B
CN113723679B CN202110992331.8A CN202110992331A CN113723679B CN 113723679 B CN113723679 B CN 113723679B CN 202110992331 A CN202110992331 A CN 202110992331A CN 113723679 B CN113723679 B CN 113723679B
Authority
CN
China
Prior art keywords
cost
water quality
data
sensitive
drinking water
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110992331.8A
Other languages
Chinese (zh)
Other versions
CN113723679A (en
Inventor
陈达
邓永锋
陈兴国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN202110992331.8A priority Critical patent/CN113723679B/en
Publication of CN113723679A publication Critical patent/CN113723679A/en
Application granted granted Critical
Publication of CN113723679B publication Critical patent/CN113723679B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00 Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152 Water filtration

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Strategic Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a drinking water quality prediction method and system based on a cost-sensitive deep cascade forest. The method comprises the following steps: a data acquisition step: collecting raw drinking water quality data, which comprise water quality parameters; a data preprocessing step: performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data; and a prediction step: inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified. The water quality prediction model is obtained by machine learning training on multiple groups of data, each of which comprises drinking water quality training data and label information identifying whether that training data is qualified. By setting an unbalanced cost matrix, the invention introduces cost-sensitive factors that improve the model's ability to predict on imbalanced data, so that the water quality prediction method achieves higher accuracy.

Description

Drinking water quality prediction method and system based on cost-sensitive deep cascade forests
Technical Field
The invention relates to the technical field of environmental quality monitoring and forecasting, in particular to a drinking water quality forecasting method and system based on a cost-sensitive deep cascade forest.
Background
Fresh water is a primary resource for human survival, and drinking water quality is closely tied to human health. Countries around the world therefore take various measures to ensure water quality safety, and among these measures, using artificial intelligence to monitor and forecast water quality is an important step in protecting drinking water safety.
Although many artificial intelligence models, such as support vector machines and random forests, have been used to predict surface water quality, traditional models of this kind are poorly suited to predicting drinking water quality. One of the main reasons is that the drinking water quality data sets used to train these models are typically extremely imbalanced, whereas such models predict better on balanced data; the imbalance can significantly reduce their predictive power. Consequently, drinking water quality cannot be effectively monitored and predicted with traditional models alone: traditional learning models such as LR and SVM perform poorly on drinking water quality, and in particular their accuracy and stability in predicting unqualified water quality are low.
There are two general classes of methods for handling imbalanced data. One class resamples the data (oversampling, undersampling, or mixed sampling) to change the proportion of minority-class samples from the data-set side. The other class uses ensemble models to predict imbalanced data, improving prediction from the model side by increasing the model's adaptability.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the first aim of the invention is to provide a drinking water quality prediction method based on a cost-sensitive deep cascade forest, which improves the unbalanced data prediction capability of a model by introducing cost-sensitive factors, so as to accurately predict the drinking water quality condition and ensure the drinking water safety.
A second object of the invention is to propose a cost-sensitive deep cascade forest based drinking water quality prediction system.
In order to achieve the first object, the present invention adopts the following technical scheme:
a drinking water quality prediction method based on a cost-sensitive deep cascade forest, comprising the following steps:
a data acquisition step: collecting raw drinking water quality data, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
a data preprocessing step: performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data;
a prediction step: inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified;
the water quality prediction model is obtained by machine learning training on multiple groups of data, and each group of data comprises drinking water quality training data and label information identifying whether that training data is qualified;
the water quality prediction model introduces cost-sensitive factors by setting an unbalanced cost matrix.
As a preferred technical solution, obtaining the water quality prediction model by machine learning training on multiple groups of data specifically comprises the following steps:
step S100, constructing cost-sensitive base classifiers: introducing cost-sensitive factors to construct the cost-sensitive base classifiers, specifically by directly introducing a cost matrix to represent the cost sensitivity of each base classifier;
the cost matrix is an asymmetric matrix that sets the cost of misclassifying minority-class samples far higher than the cost of misclassifying majority-class samples;
step S200, constructing the cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has a multi-layer structure, each layer has a plurality of estimators, the types and number of estimators are the same in every layer, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, preprocessing the drinking water quality data: cleaning and standardizing the raw drinking water quality data;
step S400, training the cost-sensitive deep cascade forest: dividing the preprocessed water quality data into a training set and a validation set by K-fold partitioning, setting a prediction objective function, training on the training set, optimizing the prediction objective function and fixing the hyperparameters, wherein K-1 folds are used as the training set to train the model and the remaining fold is used as the validation set to validate the model;
step S500, validating the cost-sensitive deep cascade forest: validating the model with the water quality data of the validation set, comparing the predictive ability of the cost-sensitive deep cascade forest under different cost matrices, and screening the optimal cost matrix to obtain the water quality prediction model.
As a preferred technical solution, the cost-sensitive base classifier is a random tree, a completely-random tree, or both.
As a preferred technical solution, the optimal cost matrix is screened by finding an initial cost matrix and then the optimal cost matrix through a heuristic method and a grid method.
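The grid part of this screening can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the heuristic search for the initial matrix is not shown, `grid_search_cost` and `toy_evaluate` are hypothetical names, the grid values and validation data are invented, and `toy_evaluate` merely re-thresholds fixed probabilities instead of retraining a forest for every candidate matrix.

```python
def grid_search_cost(c01_grid, evaluate, c10=1.0):
    """Return (best_matrix, best_score) over candidate c01 values.

    `evaluate` is a caller-supplied function mapping a 2x2 cost matrix to a
    validation score where higher is better (for example, F1-score).
    """
    best_matrix, best_score = None, float("-inf")
    for c01 in c01_grid:
        # c00 = c11 = 0; only the two misclassification costs vary.
        matrix = [[0.0, c01], [c10, 0.0]]
        score = evaluate(matrix)
        if score > best_score:
            best_matrix, best_score = matrix, score
    return best_matrix, best_score

# Toy stand-in for "train the model with this matrix and validate it":
# invented validation probabilities of the unqualified class and true labels.
P_UNQUALIFIED = [0.05, 0.10, 0.30, 0.20, 0.60, 0.02, 0.15, 0.40]
Y_TRUE = [0, 0, 1, 0, 1, 0, 0, 1]

def toy_evaluate(matrix):
    """F1-score of the unqualified class under a minimum-expected-cost rule."""
    c01, c10 = matrix[0][1], matrix[1][0]
    # Predict unqualified when that choice has the lower expected cost.
    y_pred = [1 if c01 * p > c10 * (1 - p) else 0 for p in P_UNQUALIFIED]
    tp = sum(1 for t, y in zip(Y_TRUE, y_pred) if t == 1 and y == 1)
    fp = sum(1 for t, y in zip(Y_TRUE, y_pred) if t == 0 and y == 1)
    fn = sum(1 for t, y in zip(Y_TRUE, y_pred) if t == 1 and y == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best_matrix, best_score = grid_search_cost([1, 2, 3, 5, 10, 20], toy_evaluate)
```

On this toy data the search settles on a moderate penalty: too small a value of c01 misses unqualified samples, while too large a value floods the result with false alarms.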
As a preferred technical solution, in step S400, the prediction objective function is defined over the following quantities (the formula itself is not reproduced in this text): $y$ is the label information indicating whether the water quality is qualified; the probability term is the probability that the $j$-th class of label information is predicted correctly; $i$ indexes the $i$-th data sample; $j$ indexes the $j$-th class of label information; $n$ is the total number of data samples; and $C_{ij}$ is the cost-matrix entry for the $i$-th data sample and the $j$-th class of label information.
As a preferred technical solution, in step S400, the cost-sensitive deep cascade forest is trained with the CS-DCF algorithm, where CS-DCF denotes the cost-sensitive deep cascade forest algorithm;
training on the training set proceeds as follows: the input of the first layer is the original feature vector, and the input of every other layer is the original feature vector together with the output probability vector of the preceding layer; each layer computes its cost, and if the cost has decreased, the original feature vector is combined with the output of that layer to form the input of the next layer; each subsequent layer follows the same procedure until the prediction objective function no longer decreases, and the result of the corresponding layer is then the output of the cost-sensitive deep cascade forest.
As a preferred technical solution, the cost-sensitive deep cascade forest algorithm specifically comprises the following steps:
step S401, initialization:
inputting a feature sampling matrix, a cost matrix and a number of iterations; initializing the current processing layer number to 1 and the current iteration counter to 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \ldots, X_p)$
wherein $p$ is the dimension of the feature sampling matrix, the cost matrix is denoted $C$, and the number of iterations is $s$;
step S402, loop training step: performing the following operations until all processing layers have been traversed;
constructing cost-sensitive base classifiers according to the cost matrix, and using them to obtain the cost value $c$ of the current processing layer;
taking the output of the current processing layer as a new feature matrix $F$;
generating a new sampling matrix: when the current processing layer number is not 1, concatenating the new feature matrix $F$ of the current processing layer with the sampling matrix $X$ to obtain a new sampling matrix $X'$;
if the current processing layer number is 1, or the cost value $c'$ of the adjacent next processing layer satisfies $c' - c > 0$, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the current iteration counter by 1; if the counter equals the number of iterations $s$, exiting; otherwise, incrementing the current processing layer number by 1;
step S403, output step:
outputting the cost-sensitive deep cascade forest.
As a preferred technical solution, the plurality of estimators are random forests, completely-random forests, or a combination of both.
As a preferred technical solution, the original feature vector is one or any combination of water quality parameters such as pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In order to achieve the second object, the present invention adopts the following technical scheme:
a drinking water quality prediction system based on a cost-sensitive deep cascade forest, comprising a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for collecting raw drinking water quality data, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data;
the prediction module is used for inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified;
the water quality prediction model is obtained by machine learning training on multiple groups of data, and each group of data comprises drinking water quality training data and label information identifying whether that training data is qualified;
the water quality prediction model introduces cost-sensitive factors by setting an unbalanced cost matrix;
obtaining the water quality prediction model by machine learning training on multiple groups of data specifically comprises the following steps:
step S100, constructing cost-sensitive base classifiers: introducing cost-sensitive factors to construct the cost-sensitive base classifiers, specifically by directly introducing a cost matrix to represent the cost sensitivity of each base classifier;
the cost matrix is an asymmetric matrix that sets the cost of misclassifying minority-class samples far higher than the cost of misclassifying majority-class samples;
step S200, constructing the cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has a multi-layer structure, each layer has a plurality of estimators, the types and number of estimators are the same in every layer, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, preprocessing the drinking water quality data: cleaning and standardizing the raw drinking water quality data;
step S400, training the cost-sensitive deep cascade forest: dividing the preprocessed water quality data into a training set and a validation set by K-fold partitioning, setting a prediction objective function, training on the training set, optimizing the prediction objective function and fixing the hyperparameters, wherein K-1 folds are used as the training set to train the model and the remaining fold is used as the validation set to validate the model;
step S500, validating the cost-sensitive deep cascade forest: validating the model with the water quality data of the validation set, comparing the predictive ability of the cost-sensitive deep cascade forest under different cost matrices, and screening the optimal cost matrix to obtain the water quality prediction model;
the unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
where $c_{01}$ is the cost of misclassifying unqualified water quality data as qualified; $c_{10}$ is the cost of misclassifying qualified water quality as unqualified; and $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified and unqualified drinking water quality, respectively; unqualified water quality is the minority class and qualified water quality is the majority class.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention introduces cost-sensitive factors to construct a cost-sensitive deep cascade forest model for a prediction target that is an extremely imbalanced environmental problem. Setting an unbalanced cost matrix markedly raises the cost of misclassification, so that it is significantly higher than the cost of correct classification. In terms of prediction results, introducing cost-sensitive factors further improves the model's ability to predict on imbalanced data. Compared with traditional learning models such as LR and SVM, the method significantly improves the accuracy of drinking water quality prediction and is superior to surface water prediction; it predicts unqualified water quality, the minority class, with higher accuracy and stability, and it offers a useful reference for meeting the water quality monitoring needs of future drinking water plants.
Drawings
FIG. 1 is a flow chart showing the steps of a method for predicting the quality of drinking water based on a cost-sensitive deep cascade forest according to the embodiment 1 of the present invention;
FIG. 2 is a training flow chart of the water quality prediction model in the embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the training process of the water quality prediction model in embodiment 1 of the present invention;
FIG. 4 is a flow chart of the CS-DCF algorithm in embodiment 1 of the present invention;
FIG. 5 is a schematic diagram showing the effect of the F1 value of the water quality prediction model in the training process in the embodiment 2 of the present invention;
FIG. 6 is a schematic diagram showing the effect of the water quality prediction model in the verification process in example 2 of the present invention.
Detailed Description
In the description of the present disclosure, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present disclosure and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present disclosure.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item appearing before the word encompasses the elements or items recited after the word and their equivalents, and that other elements or items are not excluded. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
In the description of the present disclosure, it should be noted that the terms "mounted," "connected," and "coupled" are to be construed broadly, unless otherwise specifically defined and limited. For example, a connection can be fixed, detachable or integral; it can be mechanical or electrical; and it can be direct, indirect through an intermediate medium, or an internal communication between two elements. The specific meaning of these terms in this disclosure will be understood by those of ordinary skill in the art in the specific context. In addition, technical features of the different embodiments of the present disclosure described below may be combined with each other as long as they do not conflict.
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
Example 1
As shown in FIG. 1, this embodiment provides a method for predicting drinking water quality based on a cost-sensitive deep cascade forest, which includes the following steps:
A data acquisition step: collecting raw drinking water quality data with electronic sensors, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In practical application, the electronic sensors comprise a pH sensor, a temperature sensor, a turbidity sensor, a conductivity sensor, a heavy metal detection sensor, a chloride ion sensor, a sulfate sensor and a dissolved oxygen sensor, which respectively measure the pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen of the water.
A data preprocessing step: performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data.
A prediction step: inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified. The water quality prediction model is obtained by machine learning training on multiple sets of data, each of which comprises drinking water quality training data and label information identifying whether that training data is qualified.
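The preprocessing step above can be sketched in a few lines. This is only an illustration under assumptions the source does not state: cleaning is taken to mean dropping records with missing or non-finite values, standardization is taken to mean z-scoring each parameter, and the function names and sensor readings are hypothetical.

```python
import math
from statistics import mean, pstdev

def clean(records):
    """Drop records containing a missing (None) or non-finite value."""
    return [
        r for r in records
        if all(v is not None and math.isfinite(v) for v in r)
    ]

def standardize(records):
    """Z-score each column: (x - column mean) / column standard deviation."""
    columns = list(zip(*records))
    means = [mean(c) for c in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [
        [(v - m) / s for v, m, s in zip(r, means, stds)]
        for r in records
    ]

# Hypothetical readings: (pH, temperature in deg C, turbidity in NTU)
raw = [
    (7.1, 18.0, 0.4),
    (6.9, None, 0.5),           # missing temperature, removed by clean()
    (7.3, 20.0, 0.6),
    (7.5, 22.0, float("nan")),  # invalid turbidity, removed by clean()
    (6.7, 16.0, 0.2),
]
preprocessed = standardize(clean(raw))
```

In a real pipeline the means and standard deviations would be computed on the training folds only and then reused for validation data and new samples, so that no information leaks from the validation set.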
Referring to FIG. 2 and FIG. 3, obtaining the water quality prediction model by machine learning training on multiple sets of data specifically includes the following steps:
Step S100, constructing cost-sensitive base classifiers: cost-sensitive factors are introduced to construct the cost-sensitive base classifiers, specifically by directly introducing a cost matrix to represent the cost sensitivity of each base classifier. The cost matrix is an asymmetric matrix, that is, an unbalanced cost matrix, which sets the cost of misclassifying minority-class samples far higher than the cost of misclassifying majority-class samples. Depending on the practical situation, a person skilled in the art may use a random tree, a completely-random tree, or both as the base classifier; this embodiment is not limited in this respect.
Step S200, constructing the cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has multiple layers, each layer is composed of multiple estimators, the types and number of estimators are the same in every layer, and each estimator is composed of multiple cost-sensitive base classifiers.
Step S300, preprocessing the drinking water quality data: the raw drinking water quality data are cleaned and standardized.
Step S400, training the cost-sensitive deep cascade forest: the preprocessed water quality data are divided into K folds, a prediction objective function is set, the model is trained on the training set, the prediction objective function is optimized and the hyperparameters are fixed. K-1 folds are used as the training set to train the model, and the remaining fold is used as the validation set to validate the model.
Step S500, validating the cost-sensitive deep cascade forest: the model is validated with the water quality data of the validation set, the predictive ability of the cost-sensitive deep cascade forest under different cost matrices is compared, and the optimal cost matrix is screened to obtain the water quality prediction model, so that subsequent water quality predictions are more accurate and stable. In practical application, screening the optimal cost matrix means finding an initial cost matrix and then the optimal cost matrix through a heuristic method and a grid method, which yields the optimal prediction objective function. The initial cost matrix is the first cost matrix found, and all cost matrices examined afterwards take it as their starting point.
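The K-fold division used in steps S400 and S500 can be sketched as below. This is a plain shuffled split written for illustration; `k_fold_splits`, the seed and the sample count are hypothetical, and the source does not say whether its folds are stratified.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        validation = folds[f]
        # The other K-1 folds together form the training set.
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, validation

splits = list(k_fold_splits(n_samples=10, k=5))
```

With extremely imbalanced labels, a purely random split can leave a fold with no unqualified samples at all, so a stratified split that preserves the class ratio in every fold is often the safer choice.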
In this embodiment, in step S400, the cost-sensitive deep cascade forest is trained with the CS-DCF algorithm, where CS-DCF denotes the cost-sensitive deep cascade forest algorithm. In practical application, training on the training set proceeds as follows: the input of the first layer is the original feature vector, and the input of every other layer is the original feature vector together with the output probability vector of the preceding layer; each layer computes its cost, and if the cost has decreased, the original feature vector is combined with the output of that layer to form the input of the next layer; each subsequent layer follows the same procedure until the prediction objective function no longer decreases. The result of the corresponding layer is then the output of the cost-sensitive deep cascade forest.
As shown in FIG. 4, the cost-sensitive deep cascade forest algorithm specifically includes the following steps:
step S401, initialization:
inputting a feature sampling matrix, a cost matrix and a number of iterations; initializing the current processing layer number to 1 and the current iteration counter to 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \ldots, X_p)$
wherein $p$ is the dimension of the feature sampling matrix, the cost matrix is denoted $C$, and the number of iterations is $s$;
step S402, loop training step: performing the following operations until all processing layers have been traversed;
constructing cost-sensitive base classifiers according to the cost matrix, and using them to obtain the cost value $c$ of the current processing layer;
taking the output of the current processing layer as a new feature matrix $F$;
generating a new sampling matrix: when the current processing layer number is not 1, concatenating the new feature matrix $F$ of the current processing layer with the sampling matrix $X$ to obtain a new sampling matrix $X'$;
if the current processing layer number is 1, or the cost value $c'$ of the adjacent next processing layer satisfies $c' - c > 0$, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the current iteration counter by 1; if the counter equals the number of iterations $s$, exiting; otherwise, incrementing the current processing layer number by 1.
Step S403, output step:
outputting the cost-sensitive deep cascade forest.
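The layer-growing loop of steps S401 to S403 can be sketched as follows. This is one possible reading of the translated steps, not the patent's code: `patience` plays the role of the iteration number s, the cascade is cut back to the layer with the lowest cost, and every name is hypothetical. Forest training is replaced by a scripted stand-in so that only the control flow and the feature concatenation are shown.

```python
# Illustrative cost matrix: COST[j][k] = cost of predicting j when truth is k.
COST = [[0.0, 10.0], [1.0, 0.0]]

def mean_expected_cost(probas, y, cost=COST):
    """Average expected misclassification cost of a layer's probabilities."""
    total = 0.0
    for proba, k in zip(probas, y):
        total += sum(proba[j] * cost[j][k] for j in range(len(proba)))
    return total / len(y)

def grow_cascade(X, y, train_layer, layer_cost, max_layers=100, patience=2):
    """Add layers until the cost fails to decrease `patience` times in a row.

    train_layer(features, y) returns a predict function that yields one class
    probability vector per row; layer_cost(probas, y) returns a float.
    """
    layers, best_cost, best_depth, stale = [], float("inf"), 0, 0
    features = [list(row) for row in X]
    for _ in range(max_layers):
        predict = train_layer(features, y)
        probas = predict(features)
        layers.append(predict)
        cost = layer_cost(probas, y)
        if cost < best_cost:
            best_cost, best_depth, stale = cost, len(layers), 0
        else:
            stale += 1
            if stale >= patience:
                break
        # Next input = original features + this layer's probability vector.
        features = [list(row) + list(p) for row, p in zip(X, probas)]
    # Keep only the layers up to the one with the lowest cost.
    return layers[:best_depth], best_cost

seen_widths = []  # input width seen by each layer, recorded for inspection

def make_scripted_trainer(p_unqualified_per_layer):
    """Hypothetical stand-in for forest training: the t-th layer outputs a
    fixed probability of the unqualified class for every row."""
    script = iter(p_unqualified_per_layer)
    def train_layer(features, y):
        seen_widths.append(len(features[0]))
        p1 = next(script)
        return lambda rows: [[1.0 - p1, p1] for _ in rows]
    return train_layer

X = [[7.1, 18.0, 0.4], [7.3, 20.0, 0.6], [6.7, 16.0, 0.2], [5.9, 25.0, 3.1]]
y = [0, 0, 0, 1]
layers, best_cost = grow_cascade(
    X, y, make_scripted_trainer([0.2, 0.5, 0.4, 0.3, 0.9]), mean_expected_cost
)
```

In this run the second layer has the lowest cost, the third and fourth fail to improve on it, and the loop exits before the fifth scripted layer is ever trained. A real implementation would compute each layer's cost on held-out folds rather than on the training rows.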
In this embodiment, the cost-sensitive deep cascade forest has 100 layers, and the estimators are random forests, completely-random forests, or a combination of both; the original feature vector is any combination of water quality parameters such as pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In addition, a person skilled in the art can apply the water quality prediction model constructed by this method to actual drinking water quality prediction in order to achieve higher accuracy.
Example 2
Example 2 builds on Example 1 and uses actual drinking water quality data from a water service group in Germany as the data set. In this example, the model was run on Python 3.6, and the model evaluation index was the F1-score.
Random trees and completely random trees are selected as base classifiers and an unbalanced cost matrix is set, so that cost-sensitive random trees and cost-sensitive completely random trees serve as the cost-sensitive base classifiers. The unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
where $c_{01}$ is the cost of misclassifying unqualified water quality as qualified; $c_{10}$ is the cost of misclassifying qualified water quality as unqualified; and $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified and unqualified drinking water quality, respectively. Unqualified water quality is the minority class and qualified water quality is the majority class. Thus $c_{00}$ and $c_{11}$ are 0, and $c_{01}$ is greater than $c_{10}$.
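To illustrate how an unbalanced cost matrix of this kind shifts a decision, the sketch below picks the class with the lowest expected cost. Several points are assumptions made for illustration only: the index convention (row = predicted class, column = true class, 0 = qualified, 1 = unqualified), the cost values 1 and 30, and the example probabilities. The expected-cost rule used here is the standard cost-sensitive decision rule, not a formula taken from this document.

```python
from typing import List, Sequence

# Assumed convention: COST[i][j] is the cost of predicting class i when the
# true class is j, with 0 = qualified and 1 = unqualified water.
COST: List[List[float]] = [
    [0.0, 30.0],  # c00 = 0; c01: unqualified water predicted as qualified
    [1.0, 0.0],   # c10: qualified water predicted as unqualified; c11 = 0
]


def expected_costs(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> List[float]:
    """Expected cost of each possible prediction given class probabilities."""
    return [
        sum(cost[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(cost))
    ]


def min_cost_class(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> int:
    """Return the predicted class with the lowest expected cost."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


# A sample that is 90% likely to be qualified is still assigned class 1
# (unqualified), because a missed unqualified sample is weighted 30 times
# more heavily than a false alarm.
print(min_cost_class([0.9, 0.1], COST))
```

With these illustrative values, a sample is only labelled qualified when the estimated probability of being unqualified falls below 1/31.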
Given the unbalanced cost matrix C, the optimal prediction objective function for correctly predicting the label y of a sample is:
In the formula, the predicted-label term gives the predicted probability of label y; y is the label information indicating whether the water quality is qualified; the probability term represents the predicted probability of the j-th class of label information; i denotes the i-th data item; j denotes the j-th class of label information; n denotes the total number of data items; and $C_{ij}$ denotes the cost-matrix entry for the i-th data item and the j-th class of label information.
The cost-sensitive deep cascade forest is constructed using 2 cost-sensitive random forest estimators and 2 cost-sensitive completely random forest estimators, each comprising 200 trees. Each estimator uses 5-fold cross-validation.
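The 5-fold scheme can be illustrated with a plain index split. This is a sketch only: the function name, the shuffling and the seed are assumptions, and the source does not say whether the folds were shuffled or stratified.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for held_out in range(k):
        validation = folds[held_out]
        # The other k-1 folds together form the training set.
        train = [
            idx
            for f, fold in enumerate(folds)
            if f != held_out
            for idx in fold
        ]
        yield train, validation
```

For heavily unbalanced data a stratified split, which keeps the class ratio roughly equal across folds, may be preferable in practice.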
After preprocessing, the actual drinking water data contain 133212 samples, of which unqualified water quality samples account for only 0.18%, so the data are extremely unbalanced. Each sample contains pH, conductivity (Cond), turbidity (Turb), spectral absorption coefficient (SAC), temperature (Tp), and pulse frequency modulation (PFM); changes in the first four parameters are what mark drinking water as unqualified.
In this embodiment, precision, recall, and the F1 value (F1-score) are computed after model prediction is completed.
Precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of positive-class water quality data items that are classified correctly, and FP is the number of negative-class data items that are wrongly predicted as positive.
Recall:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of positive samples predicted as positive, and FN is the number of positive samples wrongly predicted as negative.
F1-score:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
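The three indices can be computed from true and predicted labels as in the sketch below. The function name and the choice of label 1 as the positive class are assumptions made for illustration; zero denominators are mapped to 0.0 by convention.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 positive samples, of which 2 are found, plus 1 false alarm.
print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                             [1, 1, 0, 1, 0, 0, 0, 0]))
```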
Referring to Figs. 5 and 6, the cost matrix ratio that yields the highest F1-score is determined: during training, the initial cost matrix ratio is 1:30; when the training process is validated, the optimal cost matrix ratio is 1: the optimally predicted F1-score value was 94.14 ± 1.73%.
Example 3
The embodiment provides a drinking water quality prediction system based on a cost-sensitive deep cascade forest, which comprises a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for acquiring raw data of the drinking water quality, wherein the raw data of the drinking water quality comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for carrying out data cleaning and data standardization on the original drinking water quality data to obtain water quality preprocessing data;
the prediction module is used for inputting the water quality pretreatment data into the water quality prediction model to predict whether the water quality is qualified or not.
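A minimal sketch of what the data preprocessing module might do is given below. The source only states "data cleaning and data standardization"; dropping records with missing or non-finite readings and applying z-score scaling are assumptions chosen for illustration, as are the function names.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def clean(rows: Sequence[Row]) -> List[List[float]]:
    """Drop records that contain missing or non-finite readings."""
    return [
        [float(v) for v in row]
        for row in rows
        if all(v is not None and math.isfinite(v) for v in row)
    ]


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every column to zero mean and unit variance (z-score)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 in that case.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy readings (two parameters per record); two records are incomplete.
raw = [[7.0, 10.0], [None, 12.0], [8.0, 14.0], [9.0, float("nan")]]
print(standardize(clean(raw)))
```

In a deployed system the means and deviations would normally be estimated on the training data and reused unchanged at prediction time.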
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (8)

1. A drinking water quality prediction method based on a cost-sensitive deep cascade forest is characterized by comprising the following steps of:
and a data acquisition step: collecting raw data of drinking water quality, wherein the raw data of drinking water quality comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
a data preprocessing step: carrying out data cleaning and data standardization on the original drinking water quality data to obtain water quality pretreatment data;
and a prediction step: inputting the water quality pretreatment data into a water quality prediction model to predict whether the water quality is qualified or not;
the water quality prediction model is obtained by machine learning training through multiple groups of data, and each group of data in the multiple groups of data comprises drinking water quality training data and label information for identifying whether the water quality training data is qualified or not;
the water quality prediction model is obtained through machine learning training by using a plurality of groups of data, and specifically comprises the following steps:
step S100, constructing a cost sensitive base classifier: introducing cost sensitive factors to construct a cost sensitive base classifier, and specifically, directly introducing a cost matrix form to express the cost sensitivity of the cost sensitive base classifier;
the cost matrix is an unbalanced cost matrix and is used for setting the cost of the misclassification errors of the minority class samples to be far higher than the misclassification cost of the majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive depth cascade forest is of a multi-layer structure, each layer is provided with a plurality of estimators, the types and the numbers of the estimators of each layer are the same, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, pretreatment of drinking water quality data: cleaning and standardizing the original data of the drinking water quality;
step S400, cost-sensitive deep cascade forest training: k-fold dividing a training set and a verification set are carried out on the pre-processed water quality data, a prediction objective function is set, training is carried out on the basis of the training set, the prediction objective function is optimized, the super-parameters are fixed, K-1 fold data are used as the training set and used for training a model, and the rest 1 fold data are used as the verification set and used for verifying the model;
in step S400, the prediction objective function is expressed as:
in the formula, y is the label information indicating whether the water quality is qualified; the probability term represents the predicted probability of the j-th class of label information; i denotes the i-th data item; j denotes the j-th class of label information; n denotes the total number of data items; and $C_{ij}$ denotes the cost-matrix entry for the i-th data item and the j-th class of label information;
step S500, verifying a cost-sensitive deep cascade forest: using the water quality data of the verification set to verify the model, comparing the prediction capacities of the cost-sensitive depth cascade forests under different cost matrixes, and screening an optimal cost matrix to obtain a water quality prediction model;
the water quality prediction model introduces cost sensitivity factors by setting an unbalanced cost matrix.
2. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest according to claim 1, wherein the cost-sensitive base classifier adopts one or a combination of a plurality of random trees and completely random trees.
3. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest according to claim 1, wherein the optimal cost matrix is specifically selected by searching an initial cost matrix and an optimal cost matrix through a heuristic method and a grid method.
4. The method for predicting drinking water quality based on a cost-sensitive deep cascade forest according to claim 1, wherein in step S400, the cost-sensitive deep cascade forest training is specifically performed by using a CS-DCF algorithm, and the CS-DCF algorithm is a cost-sensitive deep cascade forest algorithm;
the training based on the training set specifically comprises the following steps: the first layer input is an original feature vector, the other layers input is the original feature vector and the output probability vector of the adjacent previous layer, each layer calculates cost, and if the cost is reduced, the original feature vector is combined with the output of the layer to be used as the input of the next layer; each layer then follows the same procedure until the predicted objective function is no longer decreasing, and finally the result of the corresponding layer is the output of the cost-sensitive depth cascade forest.
5. The method for predicting the quality of drinking water based on a cost-sensitive depth cascade forest according to claim 4, wherein the cost-sensitive depth cascade forest algorithm specifically comprises the following steps:
step S401, initializing:
inputting a feature sampling matrix, a cost matrix and a number of iterations, initializing the current processing layer number to 1, and initializing the serial number of the current iteration to 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \ldots, X_p)$
wherein p is the dimension of the feature sampling matrix, the cost matrix is denoted C, and the number of iterations is s;
step S402, a cyclic training step: the following operations are executed until all the processing layers are traversed;
constructing a cost sensitive base classifier according to the cost matrix, and obtaining a cost value c of the current processing layer by using the cost sensitive base classifier;
taking the output of the current processing layer as a new feature matrix F;
generating a new feature sampling matrix: for the case when the current processing layer number is not 1, connecting a new feature matrix F of the current processing layer with the feature sampling matrix X to obtain a new feature sampling matrix X';
if the current processing layer number is 1, or the cost value c' of the adjacent next processing layer satisfies c' - c > 0, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the serial number of the current iteration by 1; if the serial number of the current iteration equals the number of iterations s, exiting; otherwise, incrementing the current processing layer number by 1;
step S403, output step:
and outputting the cost sensitive deep cascade forest.
6. The method for predicting drinking water quality based on cost sensitive deep cascade forests as recited in claim 4, wherein the plurality of estimators employ any one or a combination of a plurality of random forests, completely random forests.
7. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest of claim 4, wherein the original eigenvector is one or any combination of a plurality of water quality parameters of pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
8. A drinking water quality prediction system based on a cost-sensitive deep cascade forest is characterized by comprising a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for acquiring raw data of the drinking water quality, wherein the raw data of the drinking water quality comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for carrying out data cleaning and data standardization on the original drinking water quality data to obtain water quality preprocessing data;
the prediction module is used for inputting the water quality pretreatment data into a water quality prediction model to predict whether the water quality is qualified or not;
the water quality prediction model is obtained by machine learning training through multiple groups of data, and each group of data in the multiple groups of data comprises drinking water quality training data and label information for identifying whether the water quality training data is qualified or not;
the water quality prediction model introduces cost sensitivity factors by setting an unbalanced cost matrix;
the water quality prediction model is obtained through machine learning training by using a plurality of groups of data, and specifically comprises the following steps:
step S100, constructing a cost sensitive base classifier: introducing cost sensitive factors to construct a cost sensitive base classifier, and specifically, directly introducing a cost matrix form to express the cost sensitivity of the cost sensitive base classifier;
the cost matrix is an unbalanced cost matrix and is used for setting the cost of the misclassification errors of the minority class samples to be far higher than the misclassification cost of the majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive depth cascade forest is of a multi-layer structure, each layer is provided with a plurality of estimators, the types and the numbers of the estimators of each layer are the same, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, pretreatment of drinking water quality data: cleaning and standardizing the original data of the drinking water quality;
step S400, cost-sensitive deep cascade forest training: k-fold dividing a training set and a verification set are carried out on the pre-processed water quality data, a prediction objective function is set, training is carried out on the basis of the training set, the prediction objective function is optimized, the super-parameters are fixed, K-1 fold data are used as the training set and used for training a model, and the rest 1 fold data are used as the verification set and used for verifying the model;
in step S400, the prediction objective function is expressed as:
in the formula, y is the label information indicating whether the water quality is qualified; the probability term represents the predicted probability of the j-th class of label information; i denotes the i-th data item; j denotes the j-th class of label information; n denotes the total number of data items; and $C_{ij}$ denotes the cost-matrix entry for the i-th data item and the j-th class of label information;
step S500, verifying a cost-sensitive deep cascade forest: using the water quality data of the verification set to verify the model, comparing the prediction capacities of the cost-sensitive depth cascade forests under different cost matrixes, and screening an optimal cost matrix to obtain a water quality prediction model;
the unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
wherein $c_{01}$ is the cost of misclassifying unqualified water quality data as qualified water quality data; $c_{10}$ is the cost of misclassifying qualified water quality as unqualified water quality; $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified drinking water quality and unqualified drinking water quality, respectively; the unqualified water quality is the minority class, and the qualified water quality is the majority class.
CN202110992331.8A 2021-08-27 2021-08-27 Drinking water quality prediction method and system based on cost-sensitive deep cascade forests Active CN113723679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110992331.8A CN113723679B (en) 2021-08-27 2021-08-27 Drinking water quality prediction method and system based on cost-sensitive deep cascade forests

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110992331.8A CN113723679B (en) 2021-08-27 2021-08-27 Drinking water quality prediction method and system based on cost-sensitive deep cascade forests

Publications (2)

Publication Number Publication Date
CN113723679A CN113723679A (en) 2021-11-30
CN113723679B true CN113723679B (en) 2024-04-16

Family

ID=78678447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110992331.8A Active CN113723679B (en) 2021-08-27 2021-08-27 Drinking water quality prediction method and system based on cost-sensitive deep cascade forests

Country Status (1)

Country Link
CN (1) CN113723679B (en)

Also Published As

Publication number Publication date
CN113723679A (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant