CN112185484A

CN112185484A - AdaBoost model-based water quality characteristic mineral water classification method

Info

Publication number: CN112185484A
Application number: CN202011090983.4A
Authority: CN
Inventors: 单耀; 王艺岚; 王旭锋
Original assignee: China University of Mining and Technology CUMT; North China Institute of Science and Technology
Current assignee: China University of Mining and Technology CUMT; North China Institute of Science and Technology
Priority date: 2020-10-13
Filing date: 2020-10-13
Publication date: 2021-01-05

Abstract

The invention discloses a method for classifying mineral water by water quality characteristics based on an AdaBoost model, comprising the following steps: collecting water samples at mineral water sources; testing the water quality information of each group of water samples; building an Excel table from the multiple groups of water quality information and importing it into R; reducing the dimensionality of the data by principal component analysis to obtain dimension-reduced data; classifying the dimension-reduced data with a Gaussian mixture model to obtain classified data; labeling the classified data and selecting several groups of labeled data that can be effectively distinguished; dividing the groups of labeled data into a training data set and a test data set at a ratio of 7:3; performing feature selection on the training data set; establishing an AdaBoost model; and applying the AdaBoost model to the test data set to evaluate its accuracy. The method improves the rationality and scientific soundness of mineral water classification.

Description

AdaBoost model-based water quality characteristic mineral water classification method

Technical Field

The invention relates to the technical field of mineral water classification, in particular to a water quality characteristic mineral water classification method based on an AdaBoost model.

Background

Mineral water is a precious water resource. It is beneficial to the human body and suitable for long-term drinking, so it has great resource-protection value and economic value. The substances dissolved in mineral water are mainly inorganic: the macroelements include K⁺, Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, and HCO₃⁻, and the important trace elements include Se, Sr, Li, Zn, etc. These components reflect differences in water quality on the one hand and differences in formation conditions and formation processes on the other, and they are related to the geological conditions of the underground aquifer in which the water occurs.

Global mineral water sales have grown at a rate of 6.4% per year over the last five years. The Chinese government has promulgated the national standard 'Natural Mineral Water for Drinking' (GB 8537-2018), which sets out requirements for the water quality of mineral water. According to current research and general practice, mineral water is roughly classified into metasilicic acid mineral water, strontium mineral water, zinc mineral water, lithium mineral water, selenium mineral water, bromine mineral water, iodine mineral water, carbonated mineral water, and the like. This classification considers only a single characteristic of the mineral water, so the assessment is not comprehensive and the classification is not reasonable.

Disclosure of Invention

The invention provides a water quality characteristic mineral water classification method based on an AdaBoost model, which can improve the rationality and scientificity of mineral water classification.

The water quality characteristic mineral water classification method based on the AdaBoost model comprises the following steps:

step S1: selecting more than three mineral water sources at a certain distance from one another, and collecting water samples at the sources, wherein there are at least 60 groups of water samples, with no fewer than 20 groups from each source;

step S2: testing the water quality information of each group of water samples, wherein the water quality information comprises the content of macroelements, the content of trace elements, the pH value, total soluble solids, isotope values, and hardness;

step S3: establishing an Excel table from the multiple groups of water quality information, converting the Excel table into a CSV table, and importing the CSV table into R;

step S4: reducing the dimensionality of the data by principal component analysis to obtain dimension-reduced data;

step S5: classifying the dimension-reduced data with a Gaussian mixture model to obtain classified data;

step S6: labeling the classified data, and selecting several groups of labeled data that can be effectively distinguished;

step S7: importing the groups of labeled data into R and dividing them into a training data set and a test data set at a ratio of 7:3;

step S8: performing feature selection on the training data set by a random forest method, and selecting 3-6 parameters;

step S9: applying the AdaBoost model framework to the feature-selected training data set for training, and establishing an AdaBoost model;

step S10: applying the AdaBoost model to the test data set, evaluating the correctness of the AdaBoost model, and improving the AdaBoost model.

The method according to the invention takes into account that the discrimination parameters differ in importance. Principal component analysis and a Gaussian mixture model are first used to select data, so that more representative data are chosen from the perspective of the samples. The data are then divided in R into a training data set and a test data set at a ratio of 7:3. The training data set is used to establish an AdaBoost model, and the AdaBoost model is applied to the test data set to evaluate its accuracy and to improve the model. The accuracy of the AdaBoost model is thereby improved, which in turn improves the rationality and scientific soundness of mineral water classification.

According to some embodiments of the invention, after the step S2 and before the step S3, the method further comprises: converting the content of the macroelements into equivalent concentration percentages, and converting the content of the trace elements into equivalent concentrations.

According to some embodiments of the invention, the step S4 is performed with the psych package in the R language.

In some embodiments of the invention, the dimension of the dimension reduction data is 2-4.

According to some embodiments of the invention, the step S5 is performed with the mclust package in the R language.

According to some embodiments of the invention, 3-5 groups of labeled data are selected in the step S6.

According to some embodiments of the invention, the AdaBoost model is established with the adabag package of the R language.

According to some embodiments of the invention, after the step S10, the method further comprises: the AdaBoost model is used for the verification of actual mineral water.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

Fig. 1 is a flowchart of a water quality characteristic mineral water classification method based on an AdaBoost model according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit the present invention. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. In addition, the present invention provides examples of various specific processes and materials, but one of ordinary skill in the art may recognize the applicability of other processes and/or the use of other materials.

A water quality characteristic mineral water classifying method based on an AdaBoost model according to an embodiment of the present invention will be described with reference to the accompanying drawings.

As shown in fig. 1, a method for classifying water quality characteristic mineral water based on an AdaBoost model according to an embodiment of the present invention includes: step S1, step S2, step S3, step S4, step S5, step S6, step S7, step S8, step S9, and step S10.

Specifically, as shown in fig. 1, in step S1, more than three mineral water sources at a certain distance from one another are selected, and water samples are collected at the sources; there are at least 60 groups of water samples, with no fewer than 20 groups from each source. It is understood that the number of water samples may be 60, 70, 80 or more. A larger number of samples improves the accuracy of the model.

As shown in fig. 1, in step S2, water quality information of each group of water samples is tested, and the water quality information includes content of macroelements, content of trace elements, pH value, total soluble solids, and isotope value and hardness.

It can be understood that different types of water samples differ in macroelement content, trace element content, pH value, total soluble solids, isotope values, and hardness, and that analyzing these quantities provides more bases for classifying the water samples.

As shown in fig. 1, in step S3, an Excel table is created from the multiple groups of water quality information, the Excel table is converted into a CSV table, and the CSV table is imported into R.

As shown in fig. 1, in step S4, the dimensionality of the data is reduced by principal component analysis to obtain dimension-reduced data; in step S5, the dimension-reduced data are classified with a Gaussian mixture model to obtain classified data; in step S6, the classified data are labeled, and several groups of labeled data that can be effectively distinguished are selected; in step S7, the groups of labeled data are imported into R and divided into a training data set and a test data set at a ratio of 7:3; in step S8, feature selection is performed on the training data set by a random forest method, and 3-6 parameters are selected; in step S9, the AdaBoost model framework is applied to the feature-selected training data set for training, and an AdaBoost model is established; in step S10, the AdaBoost model is applied to the test data set, the correctness of the AdaBoost model is evaluated, and the AdaBoost model is improved. In one example of the present invention, the AdaBoost model is built with the adabag package of the R language.
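The 7:3 division of step S7 can be illustrated with a short sketch. It is written in Python rather than in the R workflow the method describes, and the helper name `split_7_3`, the fixed random seed, and the plain shuffle (with no stratification by label) are assumptions made only for this example.

```python
import random

def split_7_3(records, seed=42):
    """Shuffle labeled records and divide them 7:3 (hypothetical helper)."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = list(records)         # copy; the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10    # integer arithmetic for the 70% boundary
    return shuffled[:cut], shuffled[cut:]

# Example: 60 labeled samples (the minimum number from step S1), ids 0..59.
train, test = split_7_3(range(60))
```

With 60 records this gives 42 training records and 18 test records. If the groups are unevenly sized, a split made separately within each labeled group may be preferable.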

The resource distribution of mineral water is regular. By establishing a classification model of mineral water, the method is helpful for more scientifically and effectively managing mineral water resources, guiding the public to reasonably identify and select the mineral water required by the public, and simultaneously has a basic guiding function for further analysis of the mineral water.

It can be understood that the discrimination parameters differ in importance. Principal component analysis and a Gaussian mixture model are therefore first used to select data, so that more representative data are chosen from the perspective of the samples. The data are then divided in R into a training data set and a test data set at a ratio of 7:3. The training data set is used to establish an AdaBoost model, and the AdaBoost model is applied to the test data set to evaluate its accuracy and to improve the model. The accuracy of the AdaBoost model is thereby improved, which in turn improves the rationality and scientific soundness of mineral water classification.

According to some embodiments of the invention, after step S2, and before step S3, the method further comprises: the content of the macroelements is converted into the percentage of equivalent concentration, and the content of the microelements is converted into the equivalent concentration. Therefore, the calculation difficulty can be reduced, the calculation efficiency is improved, and the calculation time is saved. The pH, total soluble solids, isotope values and hardness remain unchanged from the original units.
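As an illustration of this conversion, the sketch below turns mass concentrations (mg/L) of the macro-ions into milliequivalents per litre and then into equivalent percentages. The text does not spell out the formula, so two points are assumptions: the usual hydrochemical relation meq/L = mg/L × |charge| / molar mass, and the convention of computing percentages separately within the cations and within the anions. The function names and the sample values are hypothetical.

```python
# Molar mass (g/mol), absolute charge, and group of each macro-ion named in the text.
IONS = {
    "K":    (39.098, 1, "cation"),
    "Na":   (22.990, 1, "cation"),
    "Ca":   (40.078, 2, "cation"),
    "Mg":   (24.305, 2, "cation"),
    "Cl":   (35.453, 1, "anion"),
    "SO4":  (96.06,  2, "anion"),
    "HCO3": (61.02,  1, "anion"),
}

def to_meq_per_l(mg_per_l):
    """mg/L -> meq/L: concentration times |charge| divided by molar mass."""
    return {ion: c * IONS[ion][1] / IONS[ion][0] for ion, c in mg_per_l.items()}

def to_meq_percent(meq):
    """Share of each ion within its own group (assumed convention), in percent."""
    totals = {"cation": 0.0, "anion": 0.0}
    for ion, value in meq.items():
        totals[IONS[ion][2]] += value
    return {ion: 100.0 * value / totals[IONS[ion][2]] for ion, value in meq.items()}

# Hypothetical sample in mg/L (illustrative values, not measured data).
sample = {"K": 2.0, "Na": 15.0, "Ca": 40.078, "Mg": 12.0,
          "Cl": 10.0, "SO4": 30.0, "HCO3": 180.0}
meq = to_meq_per_l(sample)
pct = to_meq_percent(meq)
```

Expressing each ion as a share of its group puts the macroelements on a common 0-100 scale, which is one plausible reason the conversion simplifies the later calculations.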

According to some embodiments of the invention, step S4 is done in the psych package in the R language. Therefore, time can be saved, and accuracy of results can be improved.

Specifically, in one example of the present invention, the CSV table data are imported into R and the dimensionality of the data is reduced by principal component analysis, using the psych package in the R language. The calculation proceeds as follows.

(1) Center the data X by columns.

(2) Calculate the covariance matrix C of the sample matrix.

(3) Solve for the eigenvalues and eigenvectors of the covariance matrix C of the sample matrix X.

(4) Construct the dimension-reduction transformation matrix U. Suppose the original data contain n variables and are to be reduced to m variables (m < n), with m chosen according to need (geochemical expert analysis) or according to the calculation result (covariance matrix). The eigenvalues are sorted by size, and the eigenvectors corresponding to the m largest eigenvalues form the transformation matrix U.

(5) Obtain the dimension-reduced matrix Z of X from the transformation formula Z = XU. This matrix represents the information of the original data set with fewer dimensions (variables).
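Steps (1) to (5) can be sketched in plain Python as follows. This is not the psych-based R computation of the example. The eigenpairs are found here by power iteration with deflation, a technique swapped in so that the sketch needs no external library, and it assumes that m does not exceed the rank of the covariance matrix. All function names and the data values are hypothetical.

```python
import random

def center_columns(X):
    """Step (1): subtract each column's mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def covariance(Xc):
    """Step (2): sample covariance matrix of centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
            for a in range(d)]

def matvec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def top_eigenpairs(C, m, iters=1000, seed=0):
    """Step (3): the m largest eigenpairs by power iteration with deflation."""
    rng = random.Random(seed)
    C = [row[:] for row in C]                  # work on a copy
    d = len(C)
    pairs = []
    for _ in range(m):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = matvec(C, v)
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]          # keep v at unit length
        lam = sum(a * b for a, b in zip(v, matvec(C, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for a in range(d):                     # deflate: remove the found component
            for b in range(d):
                C[a][b] -= lam * v[a] * v[b]
    return pairs

def pca(X, m):
    Xc = center_columns(X)
    pairs = top_eigenpairs(covariance(Xc), m)
    # Step (4): U has one column per retained eigenvector.
    U = [[vec[j] for _, vec in pairs] for j in range(len(X[0]))]
    # Step (5): Z = XU, computed on the centered data.
    Z = [[sum(x * u for x, u in zip(row, col)) for col in zip(*U)] for row in Xc]
    return Z, [lam for lam, _ in pairs]

# Hypothetical data: 6 samples, 3 variables (illustrative values only).
X = [[2.5, 2.4, 0.5], [0.5, 0.7, 0.4], [2.2, 2.9, 0.6],
     [1.9, 2.2, 0.5], [3.1, 3.0, 0.7], [2.3, 2.7, 0.4]]
Z, eigenvalues = pca(X, 2)
```

The variance of each column of Z equals the corresponding eigenvalue, so the eigenvalues show how much of the original variation each retained dimension carries.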

In some embodiments of the invention, the dimensionality of the dimension reduction data is 2-4. Therefore, the calculation difficulty can be reduced while the result accuracy is ensured, the calculation efficiency is improved, and the calculation time is saved. For example, in one example of the invention, the dimensionality of the dimension reduction data is 2, 3, or 4.

According to some embodiments of the invention, step S5 is done with the mclust package in the R language. Therefore, time can be saved, and accuracy of results can be improved.

Specifically, in one example of the present invention, the data after dimension reduction is classified by a Gaussian Mixture Model (GMM), and the calculation method is as follows:

(1) Each group of water sample data is assumed to follow a Gaussian distribution, with the distribution function shown in formula (2-a):

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{2-a}$$

where K is the number of classes, π_k is the mixing coefficient, representing the probability that a sample belongs to the k-th group (k = 1, 2, …, K), μ_k is the expectation, and Σ_k is the covariance matrix.

(2) The joint probability of the sample set X (n sample points) is shown in formula (2-b):

$$p(X) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{2-b}$$

(3) The log-likelihood function is shown in formula (2-c):

$$\ln p(X) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{2-c}$$

The values of μ_k, Σ_k, and π_k in the model are solved for with the EM algorithm.
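To make the EM step concrete, the sketch below fits a Gaussian mixture to one-dimensional data in plain Python and evaluates the log-likelihood of formula (2-c). It is a deliberate simplification: the method works on 2-4 dimensional data with an R package, whereas this example uses a single variable, a fixed number of iterations, and user-supplied starting means. The function names, the variance floor of 1e-6, and the data values are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """Log-likelihood of the mixture, as in formula (2-c), for 1-D data."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_gmm_1d(xs, init_mus, iters=50):
    """Estimate mixing coefficients, means, and variances with EM."""
    k = len(init_mus)
    pis, mus, variances = [1.0 / k] * k, list(init_mus), [1.0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)   # floor avoids a collapsed component
    return pis, mus, variances

# Hypothetical one-dimensional data with two visible groups.
xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
start = log_likelihood(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
pis, mus, variances = em_gmm_1d(xs, init_mus=[0.0, 6.0])
final = log_likelihood(xs, pis, mus, variances)
```

Each sample can then be assigned to the component with the largest responsibility, which gives the classified data that step S6 goes on to label.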

According to some embodiments of the invention, 3-5 groups of labeled data are selected in step S6. Therefore, the calculation difficulty can be reduced while the result accuracy is ensured, the calculation efficiency is improved, and the calculation time is saved. For example, in one example of the present invention, there are 3, 4, or 5 groups of labeled data.

It should be noted that several parameters must be set and optimized during modeling. The more important ones are the maximum number of features considered when splitting and the maximum depth of the decision tree. Others that may need to be considered include the minimum number of samples required to split an internal node, the minimum number of samples at a leaf node, the minimum sample weight at a leaf node, and the maximum number of leaf nodes. For example, when there are 3-6 variables in the model, these parameters can be optimized to 2 or 3. The optimization of specific parameters must also be determined according to the discrimination performance of the model. The data can be substituted back into the AdaBoost model and the misjudged data analyzed. It should be noted that data in the training data set are not deleted unless an obvious error is found, and if some data are deleted, the model must be trained again.

In an example of the present invention, in step S8, for convenience of use, macroelements are used as the characteristic parameters for modeling, and trace elements with distinctive characteristics may also be used as characteristic parameters. The steps of feature selection are as follows:

(1) Let the data set X contain N samples. Using the bootstrap method, N samples are drawn at random with replacement from the data set to form a training data set (bagging). In this process, the probability that a given sample is not selected is p = (1 − 1/N)^N. As N tends to +∞, p ≈ 0.37. This indicates that about 37% of the samples are not selected during bootstrap sampling; these are called out-of-bag (OOB) data. The in-bag data are used to train the model and the out-of-bag data are used to evaluate it.

(2) The extraction is performed k times, yielding k training data sets. A decision tree is built from each training data set without pruning. At each node, m features are randomly selected from the total of M features, and the Gini index of each of the m features is calculated. The smaller the Gini index, the better the feature discriminates, and the optimal feature is selected as the branch node. A complete decision tree is built according to this strategy.

(3) The k data sets yield k decision trees, which form a random forest model. The quality of the model can be evaluated by the prediction accuracy on the out-of-bag (OOB) data. The mean square error of the out-of-bag data (MSE_OOB) and the coefficient of determination (R²_RF) are given by formulas (1-a) and (1-b); the smaller the mean square error and the larger the coefficient of determination, the better the model.

$$\mathrm{MSE}_{OOB} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{1-a}$$

$$R_{RF}^{2} = 1 - \frac{\mathrm{MSE}_{OOB}}{\hat{\sigma}_y^{2}} \tag{1-b}$$

where n is the number of out-of-bag data, y_i is the observed value of the out-of-bag data, ŷ_i is the predicted value of the model, and σ̂_y² is the out-of-bag data prediction variance.

(4) Important predictive features are selected using the mean decrease in impurity. At each node of each tree, the Gini index of each feature is calculated with formula (1-c). The Gini indices are then averaged per feature over all nodes of all trees to obtain the mean decrease in impurity. The features are ranked accordingly, so that their importance in the model can be scored and appropriate features selected for modeling.

$$I_{Gini} = 1 - \sum_{i=1}^{N} p_i^{2} \tag{1-c}$$

where p_i is the probability that a sample belongs to the i-th branch, N is the total number of branches at the node, and I_Gini is the Gini index. The important variables for modeling are determined by combining the random forest analysis with geochemical analysis. They are selected mainly from the macroelements, supplemented by trace elements, isotopes and other parameters, and there are generally 3-6 of them.
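Two quantities from the steps above are easy to check numerically: the share of samples left out of a bootstrap draw, and the Gini index of formula (1-c). The sketch below does so in plain Python. The function names are hypothetical, and the Gini index is taken to be the standard form 1 − Σ p_i², which is an assumption about the formula referred to in the text.

```python
import math
import random

LIMIT = math.exp(-1)   # limit of (1 - 1/N)^N as N grows, about 0.368

def prob_not_selected(n):
    """Probability that a given sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

def bootstrap_indices(n, seed=0):
    """Draw n indices with replacement; return in-bag and out-of-bag indices."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def gini_index(probs):
    """Gini index of a node from its branch probabilities: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

in_bag, out_of_bag = bootstrap_indices(60)   # 60 samples, as in step S1
pure = gini_index([1.0])                     # a node with a single branch
mixed = gini_index([0.5, 0.5])               # an evenly split node
```

A pure node has a Gini index of 0 and an evenly split two-branch node has 0.5, which is why a smaller index indicates a feature that separates the groups better.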

The steps of applying the AdaBoost algorithm to establish a machine learning model are as follows:

(1) Let the training data set D contain N records, and initialize the weight of each record to 1/N:

$$W_1 = \{W_{11}, \ldots, W_{1i}, \ldots, W_{1N}\}, \quad W_{1i} = 1/N, \quad i = 1, 2, \ldots, N$$

(2) For m = 1, 2, …, M (M is the number of training rounds):

(2.1) Sample D with replacement according to the record weights W_m to obtain the training data set D_m.

(2.2) Train on D_m with the decision tree algorithm to obtain the model G_m(x).

(2.3) Calculate the misclassification rate of G_m(x):

$$e_m = \sum_{j=1}^{N} W_{mj} \, I\big(G_m(x_j) \neq y_j\big)$$

where G_m(x_j) is the predicted result, y_j is the actual result, I returns 1 for a misjudgment and 0 otherwise, and e_m is the weighted sum of the misclassifications over all records.

(2.4) If e_m > 0.5, return to step (2.2).

(2.5) Update the weights: multiply the weight of each correctly classified record by e_m/(1 − e_m), then normalize the weights of all records to obtain W_{m+1}.

(2.6) Set the weight α_m of the model G_m(x).

(3) This yields M models G_m(x) (m = 1, 2, …, M) and their weights α_m (m = 1, 2, …, M). The data to be measured are evaluated with the M models, and the results are weighted and combined to obtain the final result, which can be expressed as

$$G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right)$$

where G_m(x) is the result computed for feature x by the m-th prediction model, and sign is the sign function.
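The procedure can be sketched as a small two-class example in Python. Several choices are assumptions and not taken from the method: the weak learner is a one-split decision stump rather than a full decision tree; the record weights are passed directly to the weak learner instead of resampling as in step (2.1); training stops rather than retrying when e_m ≥ 0.5; and, because the model-weight formula of step (2.6) is not reproduced in the text, α_m = ln((1 − e_m)/e_m) is assumed, the AdaBoost.M1 choice that pairs with the update factor e_m/(1 − e_m). The data are a hypothetical toy set.

```python
import math

def train_stump(X, y, w):
    """Weighted decision stump: best (error, feature, threshold, polarity)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] <= thr else -pol

def adaboost_fit(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                          # step (1): uniform weights
    models = []
    for _ in range(rounds):                    # step (2)
        stump = train_stump(X, y, w)           # step (2.2), weights used directly
        e = stump[0]                           # step (2.3): weighted error e_m
        if e >= 0.5:                           # step (2.4): no better than chance
            break
        if e == 0:                             # perfect stump: use it alone
            return [(1.0, stump)]
        beta = e / (1.0 - e)
        w = [wi * beta if stump_predict(stump, r) == yi else wi
             for wi, r, yi in zip(w, X, y)]    # step (2.5): shrink correct records
        total = sum(w)
        w = [wi / total for wi in w]           # normalize to get W_{m+1}
        models.append((math.log(1.0 / beta), stump))  # step (2.6), assumed alpha_m
    return models

def adaboost_predict(models, row):
    """Step (3): sign of the alpha-weighted vote."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in models)
    return 1 if score >= 0 else -1

# Hypothetical toy data: one feature, labels that no single stump separates.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
models = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(models, row) for row in X]
```

On this toy set each single stump misclassifies at least three records, while the weighted vote of three stumps reproduces all ten labels. Mineral water classification involves 3-5 groups, so a multi-class formulation would be needed in practice.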

It should be noted that, in step S10, the data are substituted back into the model and the misjudged data are analyzed. Data in the training data set are not deleted unless an obvious error is found, and if some data are deleted, the model must be trained again.
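The evaluation in step S10 can be illustrated with a minimal sketch that reports the accuracy on a test data set together with the indices of the misjudged records, so that those records can be inspected. The function name, the class names, and the label values are hypothetical placeholders.

```python
def evaluate(y_true, y_pred):
    """Return the accuracy and the indices of the misjudged records."""
    wrong = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
    return 1.0 - len(wrong) / len(y_true), wrong

# Hypothetical labels for six test records (class names are placeholders).
y_true = ["A", "B", "C", "A", "B", "C"]
y_pred = ["A", "B", "C", "A", "C", "C"]
accuracy, misjudged = evaluate(y_true, y_pred)
```

Listing the misjudged indices, rather than only the overall rate, supports the analysis of individual records described above.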

According to some embodiments of the invention, after step S10, the method further comprises: the AdaBoost model is used for verification of actual mineral water (step S11). Thus, the accuracy of the AdaBoost model may be further verified using actual mineral water. In an example of the present invention, the model is adaptively modified according to the detection result, so that the reliability of the detection result can be further improved.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A water quality characteristic mineral water classification method based on an AdaBoost model is characterized by comprising the following steps:

step S1: selecting more than three mineral water source places with certain distances, and collecting water samples in the mineral water source places, wherein the number of the water samples is at least 60 groups, and each water source place is not less than 20 groups;

step S2: testing the water quality information of each group of water samples, wherein the water quality information comprises the content of macroelements, the content of trace elements, the pH value, total soluble solids, isotope values, and hardness;

step S3: establishing an Excel table by utilizing a plurality of groups of water quality information, converting the Excel table into a CSV table, and importing the CSV table into the R language;

step S4: reducing the dimension of the data by using a principal component analysis method to obtain dimension reduction data;

step S5: classifying the dimensionality reduction data by using a Gaussian mixture model to obtain classified data;

step S6: labeling the classified data, and selecting a plurality of groups of labeled data which can be effectively distinguished;

step S7: importing the plurality of groups of labeled data into the R language and dividing them into a training data set and a test data set at a ratio of 7:3;

step S8: selecting characteristics of the training data set by adopting a random forest method, and selecting 3-6 parameters;

step S9: applying an AdaBoost model framework to the training data set subjected to the feature selection for training, and establishing an AdaBoost model;

step S10: applying the AdaBoost model to the test data set, evaluating the correctness of the AdaBoost model, and improving the AdaBoost model.

2. The AdaBoost model-based water quality characteristic mineral water classification method according to claim 1, wherein after the step S2 and before the step S3, the method further comprises: converting the content of the macroelements into equivalent concentration percentages, and converting the content of the trace elements into equivalent concentrations.

3. A water quality characteristic mineral water classification method based on AdaBoost model as claimed in claim 1, characterized in that said step S4 is completed with psych package in said R language.

4. A water quality characteristic mineral water classification method based on an AdaBoost model according to claim 3, characterized in that the dimensionality of the dimensionality reduction data is 2-4.

5. A water quality characteristic mineral water classification method based on AdaBoost model as claimed in claim 1, characterized in that said step S5 is completed with the mclust package of said R language.

6. The method for classifying water quality characteristic mineral water based on the AdaBoost model as claimed in claim 1, wherein the labeled data selected in the step S6 is 3-5 groups.

7. The method for classifying water quality characteristic mineral water based on the AdaBoost model according to claim 1, wherein the establishing of the AdaBoost model is completed by using the adabag package of the R language.

8. A water quality characteristic mineral water classification method based on AdaBoost model according to claim 1, characterized in that after said step S10, the method further comprises: the AdaBoost model is used for the verification of actual mineral water.