CN113723679B - Drinking water quality prediction method and system based on cost-sensitive deep cascade forests - Google Patents
- Publication number
- CN113723679B (application CN202110992331.8A)
- Authority
- CN
- China
- Prior art keywords
- cost
- water quality
- data
- sensitive
- drinking water
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A20/00—Water conservation; Efficient water supply; Efficient water use
- Y02A20/152—Water filtration
Abstract
The invention discloses a drinking water quality prediction method and system based on a cost-sensitive deep cascade forest. The method comprises: a data acquisition step of collecting raw drinking water quality data, the raw data comprising water quality parameters; a data preprocessing step of performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data; and a prediction step of inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified. The water quality prediction model is obtained by machine learning training on multiple groups of data, each group comprising drinking water quality training data and label information identifying whether that training data is qualified. By setting an unbalanced cost matrix, the invention introduces cost-sensitive factors that improve the model's ability to predict on imbalanced data, so that the water quality prediction method achieves higher accuracy.
Description
Technical Field
The invention relates to the technical field of environmental quality monitoring and forecasting, in particular to a drinking water quality forecasting method and system based on a cost-sensitive deep cascade forest.
Background
Fresh water is a primary resource for human survival, and the quality of drinking water is closely related to human health. Countries around the world therefore take various measures to ensure water quality safety, and using artificial intelligence technology to monitor and forecast water quality is an important part of protecting drinking water safety.
Although many artificial intelligence models, such as support vector machines and random forests, have been used to predict surface water quality, traditional models are poorly suited to predicting drinking water quality. One main reason is that the drinking water quality data sets used to train these models are typically extremely imbalanced, whereas such models predict better on balanced data, so the imbalance significantly reduces their predictive power. Consequently, traditional models alone cannot effectively monitor and predict drinking water quality: traditional learning models such as LR and SVM perform poorly on drinking water quality, and the accuracy and stability of their predictions for unqualified water are especially low.
There are two broad classes of methods for handling imbalanced data. The first resamples the data (oversampling, undersampling or mixed sampling) and changes the proportion of minority class samples from the perspective of the data set. The second uses ensemble models to predict on imbalanced data and improves prediction from the perspective of increasing the model's adaptability.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the first aim of the invention is to provide a drinking water quality prediction method based on a cost-sensitive deep cascade forest, which improves the unbalanced data prediction capability of a model by introducing cost-sensitive factors, so as to accurately predict the drinking water quality condition and ensure the drinking water safety.
A second object of the invention is to propose a cost-sensitive deep cascade forest based drinking water quality prediction system.
In order to achieve the first object, the present invention adopts the following technical scheme:
a drinking water quality prediction method based on a cost-sensitive deep cascade forest comprises the following steps:
a data acquisition step: collecting raw drinking water quality data, the raw data comprising water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
a data preprocessing step: performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data;
a prediction step: inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified;
the water quality prediction model is obtained by machine learning training on multiple groups of data, each group comprising drinking water quality training data and label information identifying whether that training data is qualified;
the water quality prediction model introduces cost-sensitive factors by setting an unbalanced cost matrix.
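The preprocessing step can be sketched as follows. This is a minimal illustration that assumes cleaning means discarding records with missing or non-finite readings and that standardization means per-parameter z-scoring; the text fixes neither choice, and the function names and sample readings are hypothetical.

```python
import math
from statistics import mean, pstdev


def clean(rows):
    """Drop records that contain a missing (None) or non-finite reading."""
    return [r for r in rows
            if all(v is not None and math.isfinite(v) for v in r)]


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: divide by 1
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


# Hypothetical readings: (pH, temperature in deg C, turbidity in NTU)
raw = [
    (7.0, 20.0, 1.0),
    (7.4, None, 0.8),          # missing temperature, removed by clean()
    (6.6, 22.0, 3.0),
    (7.2, 21.0, float("nan")),  # invalid turbidity, removed by clean()
]
pre = standardize(clean(raw))
```

In practice the column means and standard deviations would be computed on the training data only and reused for new samples, so that the prediction step sees inputs on the same scale as the model was trained on.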
As a preferred technical scheme, the water quality prediction model is obtained through machine learning training on multiple groups of data, specifically comprising the following steps:
step S100, constructing a cost-sensitive base classifier: introducing cost-sensitive factors to construct the cost-sensitive base classifier, specifically by directly introducing a cost matrix to represent the cost sensitivity of the base classifier;
the cost matrix is an asymmetric matrix that sets the misclassification cost of minority class samples far higher than the misclassification cost of majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has a multi-layer structure, each layer has a plurality of estimators, the types and numbers of estimators are the same in every layer, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, preprocessing the drinking water quality data: cleaning and standardizing the raw drinking water quality data;
step S400, training the cost-sensitive deep cascade forest: dividing the preprocessed water quality data into a training set and a verification set by K-fold division, setting a prediction objective function, training on the training set, optimizing the prediction objective function and fixing the hyperparameters, wherein K-1 folds of data serve as the training set for training the model and the remaining 1 fold serves as the verification set for verifying the model;
step S500, verifying the cost-sensitive deep cascade forest: verifying the model with the water quality data of the verification set, comparing the prediction ability of the cost-sensitive deep cascade forest under different cost matrices, and screening the optimal cost matrix to obtain the water quality prediction model.
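The K-fold division in step S400 can be illustrated with a short sketch. The shuffling, the seed and the function name are assumptions made for the example; the text only states that K-1 folds train the model and the remaining fold verifies it.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for K-fold division."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]      # k nearly equal folds
    for i in range(k):
        val = folds[i]                          # the 1 held-out fold
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, val                        # the other K-1 folds train


splits = list(k_fold_splits(n_samples=10, k=5))
```

With extremely imbalanced labels, a stratified split that keeps the share of unqualified samples roughly equal in every fold may be a safer choice; the text does not say whether stratification is used.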
As a preferred technical scheme, the cost-sensitive base classifier is a random tree, a completely random tree, or both.
As a preferred technical scheme, the optimal cost matrix is screened by searching for an initial cost matrix and then the optimal cost matrix using a heuristic method and a grid method.
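One way to read the heuristic-plus-grid search is sketched below. The heuristic (starting from the majority/minority ratio), the grid of scaling factors and the `score` callable are all assumptions for illustration; the text does not specify them. The matrix layout `C[predicted][actual]` with 0 for qualified and 1 for unqualified is likewise a convention chosen for the example.

```python
def initial_cost_matrix(labels):
    """Heuristic starting point (an assumption, not taken from the patent):
    weight a missed unqualified sample by the majority/minority ratio.
    Requires at least one unqualified (label 1) sample."""
    n_min = sum(1 for y in labels if y == 1)
    n_maj = len(labels) - n_min
    # Layout C[predicted][actual]; 0 = qualified, 1 = unqualified.
    return [[0.0, n_maj / n_min], [1.0, 0.0]]


def grid_search(initial, factors, score):
    """Scale c01 of the initial matrix by each factor and keep the
    candidate with the highest score (e.g. a validation metric)."""
    best, best_score = None, float("-inf")
    for f in factors:
        cand = [[initial[0][0], initial[0][1] * f], list(initial[1])]
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score


labels = [0] * 9 + [1]                      # toy 9:1 imbalance
start = initial_cost_matrix(labels)         # c01 starts at 9.0
toy_score = lambda C: -abs(C[0][1] - 18.0)  # stand-in for a validation metric
best, _ = grid_search(start, [0.5, 1, 2, 4], toy_score)
```

In a real run, `score` would train the cascade under the candidate matrix and return its performance on the verification set.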
As a preferred technical scheme, in step S400, the prediction objective function is expressed as:
[formula not reproduced in the source text]
where $y$ is the label information indicating whether the water quality is qualified, [symbol not reproduced] represents the predicted probability of the $j$-th class of label information, $i$ indexes the $i$-th data, $j$ indexes the $j$-th class of label information, $n$ is the total number of data, and $C_{ij}$ is the cost-matrix entry for the $i$-th data and the $j$-th class of label information.
As a preferred technical scheme, in step S400, the cost-sensitive deep cascade forest is trained with the CS-DCF algorithm, that is, the cost-sensitive deep cascade forest algorithm;
the training on the training set proceeds as follows: the input of the first layer is the original feature vector, and the input of every other layer is the original feature vector together with the output probability vector of the preceding layer; each layer computes a cost, and if the cost has decreased, the original feature vector is combined with the output of that layer to form the input of the next layer; every layer follows the same procedure until the prediction objective function no longer decreases, and the result of the corresponding layer is the output of the cost-sensitive deep cascade forest.
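The way each layer's input is assembled can be shown with a small sketch. It assumes each estimator in a layer emits one class-probability vector per sample; the function name and the numbers are hypothetical.

```python
def augment(X, per_estimator_probs):
    """Build the next layer's input: original features followed by the
    class-probability vectors produced by every estimator of this layer.

    per_estimator_probs[e][i] is estimator e's probability vector for sample i.
    """
    out = []
    for i, x in enumerate(X):
        row = list(x)                    # original feature vector
        for probs in per_estimator_probs:
            row.extend(probs[i])         # append this estimator's output
        out.append(row)
    return out


X = [[7.0, 1.0], [6.5, 3.0]]             # e.g. (pH, turbidity) per sample
layer_out = [
    [[0.9, 0.1], [0.3, 0.7]],            # estimator 1: P(qualified), P(unqualified)
    [[0.8, 0.2], [0.4, 0.6]],            # estimator 2
]
X_next = augment(X, layer_out)
```

Each sample grows from 2 features to 2 + 2 x 2 = 6 values here. Because the original features are re-attached at every layer, the input width stays constant from the second layer onward.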
As a preferred technical scheme, the cost-sensitive deep cascade forest algorithm specifically includes the following steps:
step S401, initializing:
inputting a feature sampling matrix, a cost matrix and an iteration number; initializing the current processing layer number to 1 and the current iteration counter to 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \ldots, X_p)$
wherein p is the dimension of the feature sampling matrix, the cost matrix is represented as C, and the iteration number is s;
step S402, a cyclic training step: the following operations are executed until all the processing layers are traversed;
constructing a cost sensitive base classifier according to the cost matrix, and obtaining a cost value c of the current processing layer by using the cost sensitive base classifier;
taking the output of the current processing layer as a new feature matrix F;
generating a new sampling matrix: when the current processing layer number is not 1, concatenating the new feature matrix F of the current processing layer with the sampling matrix X to obtain a new sampling matrix X';
if the current processing layer number is 1, or the cost value c' of the adjacent next processing layer satisfies c' - c > 0, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the current iteration counter by 1; if the counter equals the iteration number, exiting; otherwise incrementing the current processing layer number by 1;
step S403, output step:
outputting the cost-sensitive deep cascade forest.
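Steps S401 to S403 can be read as a layer-growing loop with early stopping, sketched below. Several points are interpretations rather than statements of the text: c' is taken to be the lowest cost recorded so far, the counter is not reset after an improvement (the text mentions no reset), and the cascade is truncated at the best layer. `fit_layer` is a hypothetical callable that trains one layer and returns the layer, its per-sample probability vectors F and its cost c; `max_layers` is also an assumed parameter.

```python
def train_cs_dcf(X, y, C, s, fit_layer, max_layers=100):
    """Grow cascade layers until the cost has failed to improve s times."""
    layers, best_cost, best_depth = [], None, 0
    stalls = 0                               # S401: iteration counter = 0
    X_in = X                                 # layer 1 sees the raw features
    for depth in range(1, max_layers + 1):   # S402: loop over layers
        layer, F, c = fit_layer(X_in, y, C)
        layers.append(layer)
        if best_cost is None or best_cost - c > 0:
            best_cost, best_depth = c, depth  # cost decreased: record layer
        else:
            stalls += 1                       # no improvement
            if stalls == s:
                break
        # New sampling matrix X': original features joined with output F.
        X_in = [list(x) + list(f) for x, f in zip(X, F)]
    return layers[:best_depth], best_cost    # S403: output the cascade


# Demo with a stub layer whose cost follows a fixed sequence.
costs = iter([5, 4, 4.5, 3, 3.5, 3.2])
widths = []

def stub_layer(X_in, y, C):
    widths.append(len(X_in[0]))              # record input width per layer
    return "layer", [[0.5, 0.5] for _ in X_in], next(costs)

forest, final_cost = train_cs_dcf([[1.0], [2.0]], [0, 1],
                                  [[0, 5], [1, 0]], 3, stub_layer)
```

In the demo the cost improves at layers 1, 2 and 4 and fails to improve at layers 3, 5 and 6, so training stops after the third failure and the first four layers are kept.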
As a preferred technical scheme, the estimators are random forests, completely random forests, or a combination of both.
As a preferred technical scheme, the original feature vector is one or any combination of water quality parameters such as pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In order to achieve the second object, the present invention adopts the following technical scheme:
a drinking water quality prediction system based on a cost-sensitive deep cascade forest comprises a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for collecting raw drinking water quality data, the raw data comprising water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for performing data cleaning and data standardization on the raw drinking water quality data to obtain preprocessed water quality data;
the prediction module is used for inputting the preprocessed water quality data into a water quality prediction model to predict whether the water quality is qualified;
the water quality prediction model is obtained by machine learning training on multiple groups of data, each group comprising drinking water quality training data and label information identifying whether that training data is qualified;
the water quality prediction model introduces cost-sensitive factors by setting an unbalanced cost matrix;
the water quality prediction model is obtained through machine learning training on multiple groups of data, specifically comprising the following steps:
step S100, constructing a cost-sensitive base classifier: introducing cost-sensitive factors to construct the cost-sensitive base classifier, specifically by directly introducing a cost matrix to represent the cost sensitivity of the base classifier;
the cost matrix is an asymmetric matrix that sets the misclassification cost of minority class samples far higher than the misclassification cost of majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has a multi-layer structure, each layer has a plurality of estimators, the types and numbers of estimators are the same in every layer, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, preprocessing the drinking water quality data: cleaning and standardizing the raw drinking water quality data;
step S400, training the cost-sensitive deep cascade forest: dividing the preprocessed water quality data into a training set and a verification set by K-fold division, setting a prediction objective function, training on the training set, optimizing the prediction objective function and fixing the hyperparameters, wherein K-1 folds of data serve as the training set for training the model and the remaining 1 fold serves as the verification set for verifying the model;
step S500, verifying the cost-sensitive deep cascade forest: verifying the model with the water quality data of the verification set, comparing the prediction ability of the cost-sensitive deep cascade forest under different cost matrices, and screening the optimal cost matrix to obtain the water quality prediction model;
the unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
where $c_{01}$ is the cost of misclassifying unqualified water quality data as qualified; $c_{10}$ is the cost of misclassifying qualified water quality data as unqualified; $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified and unqualified drinking water quality, respectively; unqualified water quality is the minority class and qualified water quality is the majority class.
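To make the role of the four entries concrete, the sketch below applies the standard minimum-expected-cost decision rule and sums the cost of a set of predictions. The rule is a common way to use a cost matrix and is offered as an illustration, not as the patent's stated procedure. The indexing `C[predicted][actual]` with 0 for qualified and 1 for unqualified follows the definitions of the entries above, and the cost values 10 and 1 are hypothetical.

```python
def make_cost_matrix(c01, c10, c00=0.0, c11=0.0):
    """C[predicted][actual]; index 0 = qualified, 1 = unqualified."""
    return [[c00, c01], [c10, c11]]


def min_cost_class(p_unqualified, C):
    """Pick the prediction with the lowest expected cost."""
    p = [1.0 - p_unqualified, p_unqualified]        # p[actual]
    expected = [sum(C[pred][actual] * p[actual] for actual in (0, 1))
                for pred in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1


def total_cost(y_true, y_pred, C):
    """Sum the cost-matrix entries selected by each (prediction, truth) pair."""
    return sum(C[p][t] for t, p in zip(y_true, y_pred))


C_unbalanced = make_cost_matrix(c01=10.0, c10=1.0)  # missed bad water costs 10x
C_balanced = make_cost_matrix(c01=1.0, c10=1.0)

flag_unbalanced = min_cost_class(0.2, C_unbalanced)  # expected costs 2.0 vs 0.8
flag_balanced = min_cost_class(0.2, C_balanced)      # expected costs 0.2 vs 0.8
cost = total_cost([0, 1, 1], [0, 0, 1], C_unbalanced)
```

With the unbalanced matrix, a sample that is only 20% likely to be unqualified is still flagged as unqualified, whereas the balanced matrix labels it qualified. This is how a large $c_{01}$ shifts predictions toward the minority class.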
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention introduces cost-sensitive factors to construct a cost-sensitive deep cascade forest model for a prediction target that is an extremely imbalanced environmental problem. Setting an unbalanced cost-sensitive matrix markedly raises the cost of misclassification, so that it is significantly higher than the cost of correct classification. In terms of prediction results, introducing the cost-sensitive factors further improves the model's ability to predict on imbalanced data. Compared with traditional learning models such as LR and SVM, the method significantly improves the accuracy of drinking water quality prediction, is superior to surface water prediction, identifies unqualified water quality (the minority class) with higher accuracy and stability, and offers considerable reference value for meeting the water quality monitoring needs of future drinking water plants.
Drawings
FIG. 1 is a flow chart of the steps of the drinking water quality prediction method based on a cost-sensitive deep cascade forest in Example 1 of the present invention;
FIG. 2 is a training flow chart of the water quality prediction model in Example 1 of the present invention;
FIG. 3 is a schematic diagram of the training process of the water quality prediction model in Example 1 of the present invention;
FIG. 4 is a flow chart of the CS-DCF algorithm in Example 1 of the present invention;
FIG. 5 is a schematic diagram of the F1 value of the water quality prediction model during training in Example 2 of the present invention;
FIG. 6 is a schematic diagram of the performance of the water quality prediction model during verification in Example 2 of the present invention.
Detailed Description
In the description of the present disclosure, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present disclosure and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present disclosure.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item appearing before the word encompasses the elements or items recited after the word and their equivalents, and does not exclude other elements or items. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
In the description of the present disclosure, it should be noted that the terms "mounted," "coupled," and "connected" are to be construed broadly, unless otherwise specifically defined and limited. For example, a connection can be fixed, detachable or integral; it can be mechanical or electrical; it can be direct, or indirect through an intermediate medium, and it can be communication between two elements. The specific meaning of these terms in this disclosure will be understood by those of ordinary skill in the art in the specific context. In addition, technical features of the different embodiments of the present disclosure described below may be combined with each other as long as they do not conflict.
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
Example 1
As shown in FIG. 1, this embodiment provides a drinking water quality prediction method based on a cost-sensitive deep cascade forest, comprising the following steps:
A data acquisition step: raw drinking water quality data are collected with electronic sensors; the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In practical application, the electronic sensors comprise a pH sensor, a temperature sensor, a turbidity sensor, a conductivity sensor, a heavy metal detection sensor, a chloride ion sensor, a sulfate sensor and a dissolved oxygen sensor, which detect the pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen of the water, respectively.
A data preprocessing step: data cleaning and data standardization are performed on the raw drinking water quality data to obtain preprocessed water quality data.
A prediction step: the preprocessed water quality data are input into a water quality prediction model to predict whether the water quality is qualified. The water quality prediction model is obtained through machine learning training on multiple groups of data, each group comprising drinking water quality training data and label information identifying whether that training data is qualified.
Referring to fig. 2 and 3, the water quality prediction model is obtained through machine learning training by using multiple sets of data, and specifically includes the following steps:
Step S100, constructing a cost-sensitive base classifier: cost-sensitive factors are introduced to construct the cost-sensitive base classifier, specifically by directly introducing a cost matrix to represent the cost sensitivity of the base classifier. The cost matrix is an asymmetric matrix, namely an unbalanced cost matrix, which sets the misclassification cost of minority class samples far higher than the misclassification cost of majority class samples. Those skilled in the art may use a random tree, a completely random tree, or both as the base classifier according to the actual situation; this embodiment imposes no limitation.
Step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest has multiple layers, each layer is composed of multiple estimators, the type and number of estimators are the same in every layer, and each estimator is composed of multiple cost-sensitive base classifiers.
Step S300, preprocessing the drinking water quality data: the raw drinking water quality data are cleaned and standardized.
Step S400, training the cost-sensitive deep cascade forest: the preprocessed water quality data are divided into K folds, a prediction objective function is set, the model is trained on the training set, the prediction objective function is optimized and the hyperparameters are fixed. K-1 folds of data serve as the training set for training the model, and the remaining 1 fold serves as the verification set for verifying the model.
Step S500, verifying the cost-sensitive deep cascade forest: the model is verified with the water quality data of the verification set, the prediction ability of the cost-sensitive deep cascade forest under different cost matrices is compared, and the optimal cost matrix is screened to obtain the water quality prediction model, so that subsequent water quality prediction has higher precision and stability and yields an optimal prediction result. In practical application, screening the optimal cost matrix means finding an initial cost matrix and then the optimal cost matrix by a heuristic method and a grid method, thereby obtaining the optimal prediction objective function. The initial cost matrix is the first cost matrix found, and every cost matrix found afterwards takes the initial cost matrix as its starting point.
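Comparing prediction ability under different cost matrices requires a metric that is not dominated by the majority class. The drawings mention an F1 value, so the sketch below scores each candidate by the F1 of the unqualified class on the verification set. Treating unqualified (label 1) as the positive class, and the toy labels and candidate names, are assumptions made for the example.

```python
def f1_unqualified(y_true, y_pred, positive=1):
    """F1 score of the unqualified (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0                       # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_val = [0, 0, 0, 0, 1, 1]               # toy verification labels
# Hypothetical predictions obtained under two candidate cost matrices.
preds = {
    "low c01": [0, 0, 0, 0, 0, 0],       # never flags unqualified water
    "high c01": [0, 0, 0, 1, 1, 0],      # one hit, one false alarm, one miss
}
scores = {name: f1_unqualified(y_val, p) for name, p in preds.items()}
best_name = max(scores, key=scores.get)
```

Note that the first candidate reaches 4/6 accuracy while detecting no unqualified samples at all, which is why plain accuracy is a poor basis for choosing among cost matrices on imbalanced data.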
In this embodiment, in step S400, the cost-sensitive deep cascade forest training specifically uses a CS-DCF algorithm to train, where the CS-DCF algorithm is a cost-sensitive deep cascade forest algorithm. In practical application, training based on the training set is specifically as follows: the first layer input is an original feature vector, the other layers input is the original feature vector and the output probability vector of the adjacent previous layer, each layer calculates cost, and if the cost is reduced, the original feature vector is combined with the output of the layer to be used as the input of the next layer; each layer then follows the same procedure until the predicted objective function is no longer decreasing. And finally, outputting a result of the corresponding layer as a cost-sensitive deep cascade forest.
As shown in fig. 4, the cost-sensitive deep cascade forest algorithm specifically includes the following steps:
step S401, initializing:
inputting a feature sampling matrix, a cost matrix and iteration times, initializing the number of the current processing layers to be 1, and initializing the serial number value of the current iteration times to be 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \dots, X_p)$
wherein p is the dimension of the feature sampling matrix, the cost matrix is represented as C, and the iteration number is s;
step S402, a cyclic training step: the following operations are executed until all the processing layers are traversed;
constructing a cost sensitive base classifier according to the cost matrix, and obtaining a cost value c of the current processing layer by using the cost sensitive base classifier;
taking the output of the current processing layer as a new feature matrix F;
generating a new feature sampling matrix: when the current processing layer number is not 1, connecting the new feature matrix F of the current processing layer with the feature sampling matrix X to obtain a new feature sampling matrix X';
if the current processing layer number is 1 or the cost value c' of the adjacent next processing layer satisfies c' - c > 0, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the current iteration counter by 1; if the counter equals the iteration number s, exiting; otherwise, incrementing the current processing layer number by 1.
Step S403, output step:
and outputting the cost sensitive deep cascade forest.
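The layer-by-layer loop of steps S401 to S403 can be sketched roughly as below. This is an interpretation, not the patented implementation: `fit_layer` is a hypothetical callable that trains one layer and returns per-sample class probabilities plus a layer cost, `patience` plays the role of the iteration number s, and a layer is treated as an improvement when its cost is lower than the best cost so far (the direction of the inequality in the translated text is ambiguous, so this is an assumption).

```python
def grow_cascade(X, fit_layer, max_layers=100, patience=2):
    """Add layers until the cost has failed to decrease `patience` times.

    X is a list of feature rows. Returns (best_depth, best_cost).
    """
    best_depth, best_cost, stalls = 0, None, 0
    inputs = X
    for depth in range(1, max_layers + 1):
        probs, cost = fit_layer(inputs, depth)
        if best_cost is None or cost < best_cost:
            # First layer, or the cost went down: remember this layer.
            best_depth, best_cost = depth, cost
        else:
            stalls += 1
            if stalls == patience:
                break
        # The next layer sees the original features plus this layer's output.
        inputs = [list(x) + list(p) for x, p in zip(X, probs)]
    return best_depth, best_cost

def toy_fit_layer(inputs, depth):
    # Stand-in for training a layer of forests; the costs are made up.
    costs = {1: 10.0, 2: 8.0, 3: 7.0, 4: 7.5, 5: 7.2}
    probs = [[0.5, 0.5] for _ in inputs]
    return probs, costs.get(depth, 9.0)

X = [[7.1, 0.3], [6.8, 0.9]]
best_depth, best_cost = grow_cascade(X, toy_fit_layer, patience=2)
```

In the toy run the cost falls for three layers and then fails to improve twice, so growth stops and layer 3 is kept.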
In this embodiment, the number of layers of the cost-sensitive deep cascade forest is 100, and the estimators are random forests, completely random forests, or a combination of the two; the original feature vector is any combination of water quality parameters such as pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
In addition, a person skilled in the art can apply the water quality prediction model constructed by this method to actual drinking water quality prediction and thereby achieve higher accuracy.
Example 2
Example 2 builds on Example 1 and uses actual drinking water quality data from a water utility group in Germany as the data set for further illustration. In this example, the model runs on Python 3.6, and the model evaluation metric is the F1-score.
A random tree and a completely-random tree are selected as base classifiers, and an unbalanced cost matrix is set, so that a cost-sensitive random tree and a cost-sensitive completely-random tree serve as the cost-sensitive base classifiers. The unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
where $c_{01}$ is the cost of misclassifying unqualified water quality as qualified; $c_{10}$ is the cost of misclassifying qualified water quality as unqualified; $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified and unqualified drinking water quality, respectively. Unqualified water quality is the minority class, and qualified water quality is the majority class. Thus $c_{00}$ and $c_{11}$ are 0, and $c_{01}$ is greater than $c_{10}$.
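To make the effect of an unbalanced cost matrix concrete, the sketch below sums the misclassification cost of two sets of predictions. The numeric costs (30 for a missed unqualified sample, 1 for a false alarm) and the label strings are illustrative assumptions; the text only requires that correct decisions cost 0 and that c01 exceeds c10.

```python
# Keys are (true label, predicted label); values are illustrative costs.
COST = {
    ("qualified", "qualified"): 0.0,      # c00: correct decision
    ("unqualified", "unqualified"): 0.0,  # c11: correct decision
    ("unqualified", "qualified"): 30.0,   # c01: unqualified water passed as qualified
    ("qualified", "unqualified"): 1.0,    # c10: qualified water flagged as unqualified
}

def total_cost(y_true, y_pred, cost=COST):
    """Sum the cost-matrix entry for every (true, predicted) pair."""
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred))

y_true = ["qualified", "qualified", "unqualified", "qualified"]
missed = ["qualified", "qualified", "qualified", "qualified"]      # misses the bad sample
alarm = ["qualified", "unqualified", "unqualified", "qualified"]   # one false alarm

cost_missed = total_cost(y_true, missed)
cost_alarm = total_cost(y_true, alarm)
```

Both prediction sets make exactly one mistake, yet the missed unqualified sample is charged far more than the false alarm, which is the behaviour the unbalanced matrix is meant to encode.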
Given an unbalanced cost matrix C, the optimal prediction objective function for correctly predicting the label y of a sample is:
In the formula, y is the label information indicating whether the water quality is qualified; the formula involves the predicted probability of the label y and the probability that the j-th class of label information is predicted correctly; i denotes the i-th data item, j denotes the j-th class of label information, n denotes the total number of data items, and C ij denotes the cost-matrix entry for data item i and label class j.
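As background, the textbook minimum-expected-cost decision rule shows how a cost matrix changes predictions. The sketch below is that general rule and is not claimed to be the objective function of this patent; the function name and the default costs are assumptions for illustration.

```python
def min_cost_label(p_unqualified, c01=30.0, c10=1.0):
    """Pick the label with the lower expected cost (correct decisions cost 0)."""
    # Saying "qualified" is wrong with probability p_unqualified and costs c01.
    cost_if_say_qualified = p_unqualified * c01
    # Saying "unqualified" is wrong with probability 1 - p_unqualified and costs c10.
    cost_if_say_unqualified = (1.0 - p_unqualified) * c10
    if cost_if_say_unqualified < cost_if_say_qualified:
        return "unqualified"
    return "qualified"

flagged = min_cost_label(0.10)                       # 0.9 < 3.0
passed = min_cost_label(0.01)                        # 0.99 > 0.3
symmetric = min_cost_label(0.10, c01=1.0, c10=1.0)   # equal costs
```

With equal costs a 10% chance of unqualified water is ignored, while with a 30:1 cost ratio the same sample is flagged; under this rule the decision threshold moves to c10 / (c01 + c10).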
The cost-sensitive deep cascade forest is built from 2 cost-sensitive random forest estimators and 2 cost-sensitive completely random forest estimators, each comprising 200 trees. Each estimator uses 5-fold cross-validation.
After preprocessing, the actual drinking water data contains 133212 samples, of which unqualified water quality samples account for only 0.18%, so the data is extremely imbalanced. Each sample contains pH, conductivity (Cond), turbidity (Turb), spectral absorption coefficient (SAC), temperature (Tp), and pulse frequency modulation (PFM); changes in the first four parameters are what render the drinking water unqualified.
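A short calculation shows why plain accuracy is a poor metric for data this imbalanced. The sample count and the 0.18% share come from the text; the unqualified count is only approximate because the percentage is rounded, and the "always qualified" baseline is a hypothetical classifier introduced for illustration.

```python
n_samples = 133212
minority_share = 0.0018  # 0.18% unqualified samples, as reported

# Approximate number of unqualified samples implied by the rounded share.
n_unqualified = round(n_samples * minority_share)
n_qualified = n_samples - n_unqualified

# A baseline that labels every sample "qualified" never finds an unqualified one.
baseline_accuracy = n_qualified / n_samples
baseline_recall_unqualified = 0 / n_unqualified
```

The baseline is right on more than 99.8% of samples while detecting no unqualified water at all, which is why the example reports F1-score instead.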
In this embodiment, precision, recall, and the F1 value (F1-score) are computed after model prediction is completed.
Precision:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of correctly classified qualified water quality data items, and FP is the number of data items incorrectly classified as qualified.
Recall:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative.
F1-score:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
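The three metrics can be computed from the confusion counts as in the sketch below, which uses the standard definitions Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 as their harmonic mean. The function name and the example counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); a ratio with a zero denominator is reported as 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

For these counts precision is 0.9, recall is 0.75, and F1 is 9/11, roughly 0.818.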
Referring to fig. 5 and 6, the cost matrix ratio giving the highest F1-score is found: for training, the initial cost matrix ratio is 1:30; when the training process is validated, the optimal cost matrix ratio is 1:; the F1-score of the optimal prediction is 94.14 ± 1.73%.
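Screening cost matrices by F1-score can be pictured as a simple grid search. In the sketch below, `evaluate` is a hypothetical callable that would train and cross-validate the model for a cost ratio 1:ratio and return the mean validation F1-score; the toy scoring curve and the candidate list are invented purely so the example runs and say nothing about the values reported in the patent.

```python
def search_cost_ratio(candidates, evaluate):
    """Return (best_ratio, best_score) over a grid of candidate cost ratios."""
    best_ratio = max(candidates, key=evaluate)
    return best_ratio, evaluate(best_ratio)

def toy_evaluate(ratio):
    # Stand-in for "train with cost ratio 1:ratio, return mean validation F1".
    # The peak location is arbitrary and purely illustrative.
    return 1.0 - abs(ratio - 40) / 100.0

coarse_grid = [10, 20, 30, 40, 50, 60]
best_ratio, best_score = search_cost_ratio(coarse_grid, toy_evaluate)
```

A finer grid around the best coarse value could then refine the choice, with the heuristic starting point serving as the centre of the first grid.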
Example 3
The embodiment provides a drinking water quality prediction system based on a cost-sensitive deep cascade forest, which comprises a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for acquiring raw drinking water quality data, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for carrying out data cleaning and data standardization on the original drinking water quality data to obtain water quality preprocessing data;
the prediction module is used for inputting the water quality pretreatment data into the water quality prediction model to predict whether the water quality is qualified or not.
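The three modules can be pictured as a small pipeline. The sketch below is only an illustration: dropping rows with missing readings as the cleaning step, z-score scaling as the standardization step, the two-column sample layout, and the threshold `toy_model` are all assumptions, since the text does not specify these details.

```python
from statistics import mean, pstdev

def clean(rows):
    """Cleaning (assumed): drop rows that contain a missing reading."""
    return [r for r in rows if all(v is not None for v in r)]

def standardize(rows):
    """Standardization (assumed): z-score each column."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]

def predict(rows, model):
    """Prediction module: apply a trained model to each preprocessed row."""
    return [model(r) for r in rows]

# Hypothetical readings: [pH, turbidity]; None marks a missing value.
raw = [[7.0, 0.2], [7.4, None], [6.6, 0.4], [7.0, 0.9]]
features = standardize(clean(raw))

def toy_model(row):
    # Placeholder for the trained cascade forest.
    return "unqualified" if row[1] > 1.0 else "qualified"

labels = predict(features, toy_model)
```

Keeping the stages as separate functions mirrors the module boundaries in the system description, so the trained model can be swapped in without touching acquisition or preprocessing.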
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.
Claims (8)
1. A drinking water quality prediction method based on a cost-sensitive deep cascade forest, characterized by comprising the following steps:
a data acquisition step: collecting raw drinking water quality data, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
a data preprocessing step: carrying out data cleaning and data standardization on the raw drinking water quality data to obtain water quality preprocessing data;
a prediction step: inputting the water quality preprocessing data into a water quality prediction model to predict whether the water quality is qualified or not;
the water quality prediction model is obtained by machine learning training through multiple groups of data, and each group of data in the multiple groups of data comprises drinking water quality training data and label information for identifying whether the water quality training data is qualified or not;
the water quality prediction model is obtained through machine learning training by using a plurality of groups of data, and specifically comprises the following steps:
step S100, constructing a cost sensitive base classifier: introducing cost sensitive factors to construct a cost sensitive base classifier, and specifically, directly introducing a cost matrix form to express the cost sensitivity of the cost sensitive base classifier;
the cost matrix is an unbalanced cost matrix and is used for setting the cost of the misclassification errors of the minority class samples to be far higher than the misclassification cost of the majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest is of a multi-layer structure, each layer is provided with a plurality of estimators, the types and the numbers of the estimators of each layer are the same, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, pretreatment of drinking water quality data: cleaning and standardizing the original data of the drinking water quality;
step S400, cost-sensitive deep cascade forest training: the preprocessed water quality data is divided by K-fold into a training set and a validation set, a prediction objective function is set, training is performed on the training set, the prediction objective function is optimized, and the hyperparameters are fixed, wherein K-1 folds are used as the training set to train the model and the remaining fold is used as the validation set to validate the model;
in step S400, the prediction objective function is expressed as:
in the formula, y is the label information indicating whether the water quality is qualified; the formula involves the probability that the j-th class of label information is predicted correctly; i denotes the i-th data item, j denotes the j-th class of label information, n denotes the total number of data items, and C ij denotes the cost-matrix entry for data item i and label class j;
step S500, verifying the cost-sensitive deep cascade forest: validating the model with the water quality data of the validation set, comparing the prediction performance of the cost-sensitive deep cascade forest under different cost matrices, and selecting an optimal cost matrix to obtain the water quality prediction model;
the water quality prediction model introduces cost sensitivity factors by setting an unbalanced cost matrix.
2. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest according to claim 1, wherein the cost-sensitive base classifier adopts random trees, completely random trees, or a combination of the two.
3. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest according to claim 1, wherein the optimal cost matrix is specifically selected by searching an initial cost matrix and an optimal cost matrix through a heuristic method and a grid method.
4. The method for predicting drinking water quality based on a cost-sensitive deep cascade forest according to claim 1, wherein in step S400, the cost-sensitive deep cascade forest training is specifically performed by using a CS-DCF algorithm, and the CS-DCF algorithm is a cost-sensitive deep cascade forest algorithm;
the training based on the training set specifically comprises the following steps: the input of the first layer is the original feature vector, and the input of every other layer is the original feature vector together with the output probability vector of the preceding layer; each layer computes a cost, and if the cost decreases, the original feature vector is combined with that layer's output as the input of the next layer; each layer then follows the same procedure until the prediction objective function no longer decreases, and finally the result of the corresponding layer is the output of the cost-sensitive deep cascade forest.
5. The method for predicting the quality of drinking water based on a cost-sensitive deep cascade forest according to claim 4, wherein the cost-sensitive deep cascade forest algorithm specifically comprises the following steps:
step S401, initializing:
inputting a feature sampling matrix, a cost matrix and iteration times, initializing the number of the current processing layers to be 1, and initializing the serial number value of the current iteration times to be 0;
the feature sampling matrix is expressed as:
$X = (X_1, X_2, \dots, X_p)$
wherein p is the dimension of the feature sampling matrix, the cost matrix is represented as C, and the iteration number is s;
step S402, a cyclic training step: the following operations are executed until all the processing layers are traversed;
constructing a cost sensitive base classifier according to the cost matrix, and obtaining a cost value c of the current processing layer by using the cost sensitive base classifier;
taking the output of the current processing layer as a new feature matrix F;
generating a new feature sampling matrix: for the case when the current processing layer number is not 1, connecting a new feature matrix F of the current processing layer with the feature sampling matrix X to obtain a new feature sampling matrix X';
if the current processing layer number is 1 or the cost value c' of the adjacent next processing layer satisfies c' - c > 0, assigning the current processing layer number to the next processing layer number, and assigning the cost value of the current processing layer to the cost value of the adjacent next processing layer;
otherwise, incrementing the current iteration counter by 1; if the counter equals the iteration number s, exiting; otherwise, incrementing the current processing layer number by 1;
step S403, output step:
and outputting the cost sensitive deep cascade forest.
6. The method for predicting drinking water quality based on cost sensitive deep cascade forests as recited in claim 4, wherein the plurality of estimators employ any one or a combination of a plurality of random forests, completely random forests.
7. The method for predicting the quality of drinking water based on the cost-sensitive deep cascade forest of claim 4, wherein the original feature vector is one or any combination of the water quality parameters pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen.
8. A drinking water quality prediction system based on a cost-sensitive deep cascade forest is characterized by comprising a data acquisition module, a data preprocessing module and a prediction module;
the data acquisition module is used for acquiring raw drinking water quality data, wherein the raw data comprise water quality parameters including pH, temperature, turbidity, conductivity, heavy metals, chlorides, sulfates and dissolved oxygen;
the data preprocessing module is used for carrying out data cleaning and data standardization on the original drinking water quality data to obtain water quality preprocessing data;
the prediction module is used for inputting the water quality pretreatment data into a water quality prediction model to predict whether the water quality is qualified or not;
the water quality prediction model is obtained by machine learning training through multiple groups of data, and each group of data in the multiple groups of data comprises drinking water quality training data and label information for identifying whether the water quality training data is qualified or not;
the water quality prediction model introduces cost sensitivity factors by setting an unbalanced cost matrix;
the water quality prediction model is obtained through machine learning training by using a plurality of groups of data, and specifically comprises the following steps:
step S100, constructing a cost sensitive base classifier: introducing cost sensitive factors to construct a cost sensitive base classifier, and specifically, directly introducing a cost matrix form to express the cost sensitivity of the cost sensitive base classifier;
the cost matrix is an unbalanced cost matrix and is used for setting the cost of the misclassification errors of the minority class samples to be far higher than the misclassification cost of the majority class samples;
step S200, constructing a cost-sensitive deep cascade forest: the cost-sensitive deep cascade forest is of a multi-layer structure, each layer is provided with a plurality of estimators, the types and the numbers of the estimators of each layer are the same, and each estimator comprises a plurality of cost-sensitive base classifiers;
step S300, pretreatment of drinking water quality data: cleaning and standardizing the original data of the drinking water quality;
step S400, cost-sensitive deep cascade forest training: the preprocessed water quality data is divided by K-fold into a training set and a validation set, a prediction objective function is set, training is performed on the training set, the prediction objective function is optimized, and the hyperparameters are fixed, wherein K-1 folds are used as the training set to train the model and the remaining fold is used as the validation set to validate the model;
in step S400, the prediction objective function is expressed as:
in the formula, y is the label information indicating whether the water quality is qualified; the formula involves the probability that the j-th class of label information is predicted correctly; i denotes the i-th data item, j denotes the j-th class of label information, n denotes the total number of data items, and C ij denotes the cost-matrix entry for data item i and label class j;
step S500, verifying the cost-sensitive deep cascade forest: validating the model with the water quality data of the validation set, comparing the prediction performance of the cost-sensitive deep cascade forest under different cost matrices, and selecting an optimal cost matrix to obtain the water quality prediction model;
the unbalanced cost matrix is set as follows:
$$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
where $c_{01}$ is the cost of misclassifying unqualified water quality data as qualified water quality data; $c_{10}$ is the cost of misclassifying qualified water quality as unqualified; $c_{00}$ and $c_{11}$ are the costs of correctly classifying qualified and unqualified drinking water quality, respectively; unqualified water quality is the minority class, and qualified water quality is the majority class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110992331.8A CN113723679B (en) | 2021-08-27 | 2021-08-27 | Drinking water quality prediction method and system based on cost-sensitive deep cascade forests |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113723679A (en) | 2021-11-30 |
CN113723679B (en) | 2024-04-16 |
Family
ID=78678447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110992331.8A Active CN113723679B (en) | 2021-08-27 | 2021-08-27 | Drinking water quality prediction method and system based on cost-sensitive deep cascade forests |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113723679B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107748932A (en) * | 2017-10-20 | 2018-03-02 | 杭州尚青科技有限公司 | A kind of air quality grade Forecasting Methodology of fusion sequence mode excavation and cost sensitive learning |
WO2019033636A1 (en) * | 2017-08-16 | 2019-02-21 | 哈尔滨工业大学深圳研究生院 | Method of using minimized-loss learning to classify imbalanced samples |
CN109446393A (en) * | 2018-09-12 | 2019-03-08 | 北京邮电大学 | A kind of Web Community's topic classification method and device |
CN111128372A (en) * | 2019-12-02 | 2020-05-08 | 重庆邮电大学 | Disease prediction method based on RF-LR improved algorithm |
WO2020199345A1 (en) * | 2019-04-02 | 2020-10-08 | 广东石油化工学院 | Semi-supervised and heterogeneous software defect prediction algorithm employing github |
CN111881159A (en) * | 2020-08-05 | 2020-11-03 | 长沙理工大学 | Fault detection method and device based on cost-sensitive extreme random forest |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||