CN114913980A - Vegetable-based model for predicting heavy metal enrichment coefficient and human health risk exposed by oral diet and application - Google Patents

Publication number
CN114913980A
Authority
CN
China
Prior art keywords
model
vegetable
vegetables
human health
heavy metal
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210576755.0A
Other languages
Chinese (zh)
Inventor
李峰
梅延成
葛飞
颜侃轩
武晨
易盛炜
魏铭
朱中南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Application filed by Xiangtan University
Priority to CN202210576755.0A
Publication of CN114913980A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Game Theory and Decision Science (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model, based on the heavy metal enrichment coefficient of vegetables, for predicting the human health risk of oral dietary exposure, and an application thereof. The model takes the enrichment coefficient of heavy metals in vegetables as feature data, also considers the influence of soil heavy metal concentration, planting area and vegetable type on the enrichment coefficient, and uses a random forest classification model to relate the heavy metal enrichment coefficient of vegetables to the human health risk of oral dietary exposure, thereby predicting that risk. The model has a simple structure and its parameters are easy to obtain. It can accurately predict the human health risk of oral dietary exposure to vegetables, and its predictions allow potential risks in the planting and eating of vegetables to be identified quickly and accurately, which provides important guidance for vegetable production and consumption.

Description

Vegetable-based model for predicting heavy metal enrichment coefficient and human health risk exposed by oral diet and application
Technical Field
The invention relates to a human health risk prediction model, in particular to a model for predicting the human health risk of oral dietary exposure based on the heavy metal enrichment coefficient of vegetables, and to an application thereof, and belongs to the technical field of environmental protection.
Background
Heavy metals are the main pollutants affecting the soil quality of agricultural land, and cadmium is the foremost among them. Agricultural products such as food crops and vegetables grown on contaminated soil may pose a public health hazard. Cadmium is a non-essential metal element that is highly toxic to organisms. It can reduce the yield and quality of agricultural products, endanger the safety of the human living environment and threaten ecological safety.
China is both a major producer and a major consumer of vegetables. Vegetables come in a wide variety and provide key nutrients for human health, such as carbohydrates, proteins, vitamins, minerals and fiber. Cadmium contamination of vegetables has been widely reported in different regions of China, and vegetables are the main route of dietary exposure to heavy metals, so evaluating the human health risk of heavy metals in vegetables is very important.
The enrichment coefficient is the ratio of the concentration of a pollutant in plant tissue to its concentration in soil. It reflects the differing enrichment capacities of different vegetables and is an important parameter for the migration, accumulation and bioavailability of heavy metals in soil. Most importantly, it describes the initial and key link through which heavy metals in the soil-vegetable system are transferred to humans via the food chain. Differences in the enrichment coefficient are influenced by many factors, such as the nature and concentration of the pollutant, the physicochemical properties of the soil (pH, organic matter, cation exchange capacity, etc.), the climate and the vegetable species. The enrichment coefficient therefore integrates these differences to some extent and serves as a comprehensive index. On the one hand, it can guide vegetable planting and consumption and inform the corresponding strategies; on the other hand, it is significant for exploring the health risk posed by heavy metals transferred from the soil-vegetable system to the human body through the food chain, and helps to establish and correct methods and models for human health risk evaluation.
Human health risk evaluation in China currently follows a four-step method: hazard identification, hazard evaluation, exposure evaluation and risk characterization. Mature models in use include the RBCA model developed by the American Society for Testing and Materials, the CSOIL model developed by the Dutch Ministry of Housing, and the RAGS model developed by the United States Environmental Protection Agency. These models evaluate human health risk mainly through three exposure routes: dermal contact, inhalation and oral intake. Oral dietary exposure can account for more than 90% of total exposure and is the main route. Although these mature models exist, they require many parameters, which makes data acquisition difficult, and a simplified model that uses the enrichment coefficient to predict the human health risk of oral dietary exposure is lacking. It is therefore necessary to establish a dietary evaluation method and a health risk evaluation model that are corrected by the enrichment coefficient and suited to China's national conditions.
Disclosure of Invention
In view of the problems in the prior art, the first object of the invention is to provide a model, based on the heavy metal enrichment coefficient of vegetables, for predicting the human health risk of oral dietary exposure. The model is built with a random forest algorithm; after limited training it can accurately predict the effect of heavy metals in vegetables on human health from the enrichment coefficient, and it has the advantages of a simple structure, accurate prediction, easy data acquisition and a wide range of application.
The second object of the invention is to provide an application of the model: the model is used to judge the level of harm to human health associated with the heavy metal enrichment coefficient of vegetables in different seasons and regions, and vegetable planting and consumption are guided according to the risk level, so that the property and health of producers and consumers are effectively protected.
To achieve the above technical objects, the invention provides a model, based on the heavy metal enrichment coefficient of vegetables, for predicting the human health risk of oral dietary exposure, which is built by the following steps:
step 1): acquiring feature data of vegetables related to heavy metal enrichment;
step 2): cleaning and preprocessing the feature data related to heavy metal enrichment;
step 3): acquiring the parameters of oral vegetable intake, and calculating the human health risk value of oral vegetable intake;
step 4): establishing a machine learning model relating the vegetable feature data to the human health risk value of oral vegetable intake;
step 5): evaluating the performance of the resulting machine learning model.
The invention models in five steps and uses machine learning to establish a mapping between vegetable feature data and the human health risk of oral vegetable intake. After limited training, the model can accurately predict the risk a vegetable poses to human health from its feature data, with an AUC above 0.97 and an overall classification accuracy above 91%.
As a preferred embodiment, the heavy metal enrichment-related characteristic data includes: the enrichment coefficient of heavy metals in vegetables, the heavy metal concentration of vegetable planting soil, a planting area and the variety of vegetables.
Preferably, the enrichment coefficient of the heavy metals in the vegetables is the ratio of the concentration of the heavy metals in edible parts of the vegetables to the concentration of the heavy metals in the vegetable planting soil.
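The ratio defined above can be written as a one-line function. The following is a minimal sketch; the function name and the sample concentrations are illustrative assumptions and are not taken from the patent.

```python
def enrichment_coefficient(c_edible_mg_per_kg: float, c_soil_mg_per_kg: float) -> float:
    """Ratio of heavy metal concentration in the edible part to that in the planting soil."""
    if c_soil_mg_per_kg <= 0:
        raise ValueError("soil concentration must be positive")
    return c_edible_mg_per_kg / c_soil_mg_per_kg


# Made-up example: 0.06 mg/kg cadmium in the edible part, 0.30 mg/kg in the soil.
example_coefficient = enrichment_coefficient(0.06, 0.30)
```

Both concentrations must be expressed in the same unit so that the coefficient is dimensionless.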
The invention uses the enrichment coefficient of heavy metals in vegetables to reflect the enrichment capacity of different vegetables. It is a key parameter of the initial and key link through which heavy metals in the soil-vegetable system are transferred to the human body via the food chain, and it plays a key role in evaluating the human health risk of oral dietary exposure. Differences in the heavy metal enrichment coefficient of vegetables are influenced by many factors (such as the nature and concentration of the pollutant, the physicochemical properties of the soil and the climate), so the coefficient is a comprehensive index that can stand in for these factors when predicting the effect of heavy metals in vegetables on human health. Using the enrichment coefficient and related feature data thus improves prediction accuracy.
As a preferred scheme, the planting areas are divided by geographical location into North China, East China, Central China, South China and Southwest China.
As a preferred embodiment, the heavy metal includes at least one of chromium, cadmium, mercury, arsenic, nickel, copper and zinc.
The invention describes the enrichment coefficient of heavy metals in vegetables along two dimensions, the type of heavy metal and the region. Because vegetables are strongly seasonal crops, a seasonal dimension can be added to help describe the enrichment coefficient.
As a preferred solution, the data cleaning and preprocessing process includes: I) verifying the relevance of the feature data and classifying it; II) cleaning the format of the feature data; III) deleting outliers from the feature data; IV) interpolating the missing values in the feature data.
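Cleaning steps II) to IV) might look as follows for a single numeric column. The patent does not state which outlier rule or interpolation method is used, so the 1.5 × IQR rule and median filling below are assumptions swapped in for illustration, as are the function name and the sample values.

```python
import statistics


def clean_feature_column(raw_values):
    """Parse, drop outliers (1.5 * IQR rule, an assumption), fill gaps with the median (an assumption)."""
    # II) format cleaning: strip whitespace and convert to float
    parsed = []
    for value in raw_values:
        try:
            parsed.append(float(str(value).strip()))
        except ValueError:
            parsed.append(None)  # unparseable entries are treated as missing

    # III) outlier bounds from the quartiles of the values that are present
    present = [x for x in parsed if x is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread

    # IV) missing values are filled with the median of the retained values
    inliers = [x for x in present if low <= x <= high]
    fill_value = statistics.median(inliers)

    cleaned = []
    for x in parsed:
        if x is None:
            cleaned.append(fill_value)
        elif low <= x <= high:
            cleaned.append(x)
        # values outside [low, high] are deleted
    return cleaned


# Made-up enrichment coefficients with one missing entry and one implausible value.
raw = ["0.12", " 0.15 ", "NA", "0.11", "9.80", "0.14", "0.13", "0.16", "0.10", "0.12"]
cleaned = clean_feature_column(raw)
```

In a real data set the outlier rule would normally be applied per vegetable type or per region, since enrichment coefficients differ systematically between groups.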
As a preferred embodiment, the vegetable oral diet parameters include: the concentration of contaminants in the edible parts of vegetables C_f, the daily oral vegetable intake IR, the exposure frequency EF, the exposure duration ED, the body weight BW, the average exposure time AT, the digestive tract absorption efficiency factor ABS_o and the oral reference dose RfD_o.
As a preferred scheme, the equation for calculating the human health risk of vegetable oral diet is as follows:
formula 1:
$$HQ = \frac{C_f \times IR \times EF \times ED \times ABS_o}{BW \times AT \times RfD_o}$$
wherein: HQ is the human health risk value of oral vegetable intake, dimensionless; C_f is the concentration of pollutants in the edible parts of vegetables, in mg/kg; IR is the daily oral vegetable intake, in mg/d; EF is the exposure frequency, in days/year; ED is the exposure duration, in years; BW is the body weight, in kg; AT is the average exposure time, in days; ABS_o is the digestive tract absorption efficiency factor, dimensionless; RfD_o is the oral reference dose, in mg/kg-day.
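As a worked example, the sketch below assumes the hazard quotient takes the usual non-carcinogenic form HQ = (C_f × IR × EF × ED × ABS_o) / (BW × AT × RfD_o), built from the parameters listed above. The text gives IR in mg/d; the sketch takes the intake in kg/d so that C_f × IR comes out in mg/d, which is an assumption about the intended units. All numeric values are made up for illustration.

```python
def hazard_quotient(c_f, ir_kg_per_day, ef, ed, bw, at, abs_o, rfd_o):
    """Non-carcinogenic hazard quotient for oral vegetable intake (assumed standard form)."""
    if min(bw, at, rfd_o) <= 0:
        raise ValueError("BW, AT and RfD_o must be positive")
    average_daily_dose = (c_f * ir_kg_per_day * ef * ed * abs_o) / (bw * at)  # mg/kg-day
    return average_daily_dose / rfd_o


# Illustrative inputs only: 0.05 mg/kg in the edible part, 0.3 kg/d intake,
# 365 days/year for 24 years, 60 kg body weight, AT = ED * 365 days.
hq = hazard_quotient(
    c_f=0.05, ir_kg_per_day=0.3, ef=365, ed=24,
    bw=60, at=24 * 365, abs_o=1.0, rfd_o=0.001,
)
print(round(hq, 4))  # 0.25
```

Computing separate values for urban and rural populations only requires passing the corresponding intake and body weight parameters.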
As a preferable scheme, the human health risk of oral vegetable intake is divided, according to the living characteristics of the population, into the risk for urban populations and the risk for rural populations.
As a preferable scheme, the machine learning model of the vegetable characteristic data and the human health risk value of the vegetable oral diet adopts a random forest classification model, and the establishing process comprises the following steps:
i) dividing the human health risk value of oral vegetable intake into four risk levels: no risk, 0 < HQ ≤ 0.5; low risk, 0.5 < HQ ≤ 1; medium risk, 1 < HQ ≤ 2; high risk, HQ > 2;
ii) dividing the feature data into training set data and test set data, wherein the training set data accounts for 50-80% of the total amount of feature data;
iii) determining the main parameters of the random forest classification model and fitting the model to the training set data and the risk levels, wherein the main parameters comprise the number of random decision trees, ntree, and the number of predictor variables randomly sampled by each decision tree, mtry;
iv) analyzing the importance of the feature variables in the training set data, expressed by MeanDecreaseGini, wherein the Gini index is calculated as:
formula 2:
$$Gini = 1 - \sum_{i=1}^{n} p(i)^2$$
wherein: n is the number of categories; p(i) is the proportion of the ith category in the current node;
v) performing model testing by using the test set data.
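The Gini index used in step iv) can be computed directly from the class labels at a node. The sketch below assumes the standard definition, one minus the sum of squared class proportions; the function name and the sample labels are illustrative.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions p(i)."""
    total = len(labels)
    if total == 0:
        raise ValueError("a node must contain at least one sample")
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# A pure node has index 0; a mixed node has a larger index.
pure_node = gini_index(["no risk"] * 4)
mixed_node = gini_index(["no risk", "no risk", "low risk", "high risk"])
```

MeanDecreaseGini for a variable is then the decrease in this index produced by splits on that variable, averaged over the trees of the forest.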
The method takes the training set data as the input of the random forest classification model. For each decision tree, bootstrap sampling leaves some samples out of the construction of that tree; when the sample size is large enough, about 36.8% of the data is not drawn for a given tree, and this data is called out-of-bag data. The number of predictor variables randomly sampled by each decision tree and the number of decision trees can therefore be chosen to minimize the average out-of-bag error over all decision trees, which serves as the parameter optimization process.
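The 36.8% figure follows from sampling n items with replacement: the probability that a given sample is never drawn is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The short sketch below checks this both analytically and by simulation; the function names and the seed are illustrative choices.

```python
import math
import random


def expected_oob_fraction(n: int) -> float:
    """Probability that one sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n: int, seed: int = 0) -> float:
    """Fraction of samples left out of one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


analytic = expected_oob_fraction(10_000)
simulated = simulated_oob_fraction(10_000)
limit = math.exp(-1)  # about 0.3679
```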
As a preferred scheme, the main performance evaluation process is as follows: a confusion matrix is used to check the accuracy of the model and to evaluate its performance; the evaluation indexes comprise the area under the curve, the recall and the overall classification accuracy, wherein the recall and the overall classification accuracy are calculated by the following formulas:
formula 3:
$$TPR = R = \frac{TP}{TP + FN}$$
formula 4:
$$FPR = \frac{FP}{FP + TN}$$
formula 5:
$$OA = \frac{TP + TN}{TP + FN + FP + TN}$$
in formulas 3 to 5: TP is the number of positive samples classified correctly by the model; FN is the number of positive samples classified incorrectly; FP is the number of negative samples classified incorrectly; TN is the number of negative samples classified correctly; TPR is the true positive rate; R is the recall; FPR is the false positive rate; OA is the overall classification accuracy. All of these quantities are dimensionless.
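The three rates can be computed from the four counts as follows. This is a minimal sketch assuming the standard definitions TPR = R = TP/(TP+FN), FPR = FP/(FP+TN) and OA = (TP+TN)/(TP+FN+FP+TN); the function name and the counts are made up for illustration.

```python
def binary_rates(tp: int, fn: int, fp: int, tn: int):
    """Return (TPR, FPR, OA) from the four cells of a binary confusion matrix."""
    tpr = tp / (tp + fn)                   # true positive rate, equal to the recall R
    fpr = fp / (fp + tn)                   # false positive rate
    oa = (tp + tn) / (tp + fn + fp + tn)   # overall classification accuracy
    return tpr, fpr, oa


# Made-up counts for one risk level treated as the positive class.
tpr, fpr, oa = binary_rates(tp=45, fn=5, fp=10, tn=40)
```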
The confusion matrix is a way of representing prediction accuracy. In the invention, the enrichment coefficient of heavy metals in vegetables is used as an input to predict the human health risk of oral vegetable intake, so the confusion matrix used here is the error matrix of the predicted human health risk levels. Based on the confusion matrix, the performance of the model is further described by the area under the curve (AUC), the recall (R) and the overall classification accuracy (OA). Although recall has no necessary relationship with precision, in practical tests the two mostly show a negative correlation, so recall can be used to reflect the precision of the model. In addition, to better assess the overall performance of the model, its AUC is calculated. AUC is the area under the ROC curve, but no ROC curve needs to be drawn in the invention: because the feature data of the vegetables are all discrete values, the AUC can be obtained directly by a programmed counting method. AUC measures the generalization ability of a classification model; the closer the area under the curve is to 1, the better the generalization ability and the performance of the model. These checks allow the overall performance of the model to be analyzed accurately and comprehensively, and the parameters to be optimized in a targeted way according to the results.
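The text does not spell out the counting method. One common way to obtain AUC without drawing the ROC curve is to count, over all positive-negative pairs, how often the positive sample receives the higher score, with ties counted as one half. The sketch below implements that pairwise counting as an assumption about what is meant; the function name, scores and labels are made up.

```python
def auc_by_counting(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities for one risk level versus the rest.
example_auc = auc_by_counting([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```

For a four-level classifier, this one-vs-rest value would be computed per risk level and then averaged; how the patent aggregates the per-class values is not stated.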
The invention also provides an application of the model to guide the choice of vegetables for planting and eating. According to the model result, if the HQ of a vegetable corresponds to high risk, planting and eating are not recommended; if it corresponds to low risk or no risk, planting and eating are recommended; and if it corresponds to medium risk, planting and eating in small amounts are recommended.
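The recommendation rule above is a simple lookup from predicted risk level to advice. The sketch below shows one way to encode it; the dictionary name, function name and wording of the advice strings are illustrative.

```python
RECOMMENDATION_BY_LEVEL = {
    "no risk": "planting and eating recommended",
    "low risk": "planting and eating recommended",
    "medium risk": "plant and eat in small amounts",
    "high risk": "planting and eating not recommended",
}


def recommend(risk_level: str) -> str:
    """Map a predicted risk level to the planting and eating advice described in the text."""
    try:
        return RECOMMENDATION_BY_LEVEL[risk_level]
    except KeyError:
        raise ValueError(f"unknown risk level: {risk_level!r}") from None


advice = recommend("medium risk")
```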
Compared with the prior art, the invention has the following beneficial technical effects:
1) The model relating the heavy metal enrichment coefficient of vegetables to the human health risk of oral dietary exposure uses a random forest algorithm; after limited training, it can accurately predict the effect of heavy metals in vegetables on human health from the enrichment coefficient. In addition, the model can adjust itself as it is used more, so that its results become increasingly accurate.
2) In the technical scheme provided by the invention, the data sources are broad and the model structure is simple, and the applicability of the model is further improved by classifying along both time and space. The predictions of the model allow potential risks in the planting and eating of vegetables to be identified quickly and accurately, which provides important guidance for vegetable production and consumption.
Drawings
FIG. 1 is a schematic flow chart of the method for building the model relating the heavy metal enrichment coefficient of vegetables to the human health risk of oral dietary exposure.
FIG. 2 shows the importance of the independent variables of the model: (a) urban population; (b) rural population.
FIG. 3 shows the AUC values of the random forest classification model on the training set and the test set.
FIG. 4 shows the AUC values of the random forest classification model on the validation set.
FIG. 5 is a recommendation map of vegetable planting areas based on the cadmium enrichment coefficients of vegetables and the model predictions.
Detailed Description
The following specific examples are intended to further illustrate the present disclosure, but not to limit the scope of the claims.
The embodiment takes the human health risk model for cadmium in edible vegetables as an example; other heavy metals such as chromium, arsenic and copper can be modeled by the same method.
Example 1
A method for building a model relating the heavy metal enrichment coefficient of vegetables to the human health risk of oral dietary exposure, as shown in FIG. 1, comprises the following steps:
Step S1: collecting from the literature data such as the cadmium enrichment coefficient of vegetables, the cadmium concentration in soil, the planting area and the vegetable name.
Specifically, the literature data come from publications indexed in databases such as Web of Science, CNKI (China National Knowledge Infrastructure) and the Wanfang database. The cadmium enrichment coefficient is the ratio of the cadmium concentration in the edible part of the vegetable to the cadmium concentration in the soil. The planting areas are divided into North China, East China, Central China, South China and Southwest China, and the vegetable types are leaf vegetables, root vegetables and fruit vegetables.
Specifically, the leaf vegetables include Chinese cabbage, spinach, crown daisy, pak choi, leek, leaf lettuce, water spinach, cabbage, green Chinese onion, mustard, lettuce, amaranth, red flowering cabbage and the like; the root vegetables include asparagus lettuce, taro, radish, lotus root, celery, sweet potato, garlic, ginger and the like; and the fruit vegetables include hot pepper, tomato, beans, towel gourd, eggplant, wax gourd, cucumber, pumpkin, balsam pear and the like.
Step S2: cleaning the collected feature data, removing outliers and interpolating missing values.
Step S3: calculating the human health risk of oral vegetable intake for each province, with urban and rural populations as the receptor groups, using the non-carcinogenic hazard quotient model published by the United States Environmental Protection Agency:
$$HQ = \frac{C_f \times IR \times EF \times ED \times ABS_o}{BW \times AT \times RfD_o}$$
wherein C_f is the concentration of contaminants in the edible parts of vegetables (mg/kg), IR is the daily oral vegetable intake (mg/d), EF is the exposure frequency (days/year), ED is the exposure duration (years), BW is the body weight (kg), AT is the average exposure time (days), ABS_o is the digestive tract absorption efficiency factor, and RfD_o is the oral reference dose (mg/kg-d). The parameter values come from the National Bureau of Statistics, the Exposure Factors Handbook of Chinese Population (Adults volume) and the Technical Guidelines for Risk Assessment of Soil Contamination of Land for Construction (HJ 25.3-2019).
Specifically, four risk levels are defined from the hazard quotients calculated for the urban and rural populations: no risk (0 < HQ ≤ 0.5), low risk (0.5 < HQ ≤ 1), medium risk (1 < HQ ≤ 2) and high risk (HQ > 2).
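The four intervals translate directly into a binning function that turns a hazard quotient into the class label used as the dependent variable. The following is a minimal sketch; the function name and label strings are illustrative.

```python
def risk_level(hq: float) -> str:
    """Assign one of the four risk levels from the hazard quotient HQ."""
    if hq <= 0:
        raise ValueError("HQ must be positive")
    if hq <= 0.5:
        return "no risk"
    if hq <= 1:
        return "low risk"
    if hq <= 2:
        return "medium risk"
    return "high risk"


# The upper bound of each interval belongs to the lower level.
levels = [risk_level(hq) for hq in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.2)]
```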
Step S4: according to the data after the risk grades are divided, a random forest classification model is established by taking a cadmium enrichment coefficient, soil heavy metal cadmium concentration, a planting area and vegetable types as independent variables and taking a classified grade of the level of the harm commodity generated by the divided oral diet as a dependent variable;
specifically, the main parameters of the random forest classification model are determined as follows: (i) the number of predictor variables randomly sampled at each node split in a decision tree, the parameter mtry; for a classification model, the default value is the square root of the total number of predictor variables; (ii) the number of decision trees (classification trees), the parameter ntree; to optimize the number of node variables and decision trees used by the final model, a parameter optimization process is applied, and the final ntree is set to 500 and mtry to 4; (iii) the model type: classification is selected so that the model performs classification prediction;
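For reference, the default mtry mentioned in (i) can be computed as the integer part of the square root of the number of predictors; taking the floor is an assumption here, based on the usual convention for classification forests. With the four independent variables of this example, the default would be 2, whereas the tuned value reported above is 4, which means every predictor is considered at each split.

```python
import math


def default_mtry(n_predictors: int) -> int:
    """Default number of predictors sampled per split for classification: floor(sqrt(p)), at least 1."""
    if n_predictors < 1:
        raise ValueError("at least one predictor is required")
    return max(1, math.isqrt(n_predictors))


# Four predictors: enrichment coefficient, soil cadmium concentration, planting area, vegetable type.
mtry_default = default_mtry(4)
```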
variable importance is evaluated with the random forest classification model: for a given predictor variable, the importance is calculated as the mean difference between the prediction error after the variable is permuted and the original prediction error. It can also be expressed by MeanDecreaseGini, the mean decrease in the Gini index (node impurity); the larger the value, the more important the variable. The Gini index is calculated as:
$$Gini = 1 - \sum_{i=1}^{n} p(i)^2$$
wherein n is the number of categories, and p(i) is the proportion of category i in the current node, that is, the number of samples of category i divided by the total number of samples at the current node.
Specifically, the training data set is the input of the random forest classification model, and for each decision tree the out-of-bag error is calculated using the out-of-bag data of that tree. The training data set is 70% of the risk-classified data and the remaining 30% is the test data set. Out-of-bag data are the samples that bootstrap sampling leaves out when a single decision tree is built. The parameter optimization process determines the number of predictor variables randomly sampled by each decision tree and the number of decision trees by minimizing the average out-of-bag error over all decision trees. The random forest classification model can be built with R language software.
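The 70/30 split and the selection of parameters by minimum out-of-bag error can be sketched as follows. The helper names, the seed and the error table are hypothetical; in particular, the out-of-bag errors are placeholder numbers standing in for values that a fitted forest would report, not results from the patent.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_by_min_oob(oob_errors):
    """Return the (ntree, mtry) pair with the smallest mean out-of-bag error."""
    return min(oob_errors, key=oob_errors.get)


train, test = split_train_test(range(100))

# Placeholder out-of-bag errors for a small parameter grid (illustrative values only).
placeholder_oob = {(100, 2): 0.12, (500, 2): 0.10, (500, 4): 0.08}
best_ntree, best_mtry = pick_by_min_oob(placeholder_oob)
```

A stratified split that preserves the proportion of each risk level may be preferable when some levels are rare; the text does not say whether stratification was used.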
Step S5 checks the model performance: checking the performance of the model by using the divided test data set and the verification set; evaluating the performance of the model by using the precision evaluation index; and the precision evaluation index is obtained by performing precision inspection on the random forest based population risk prediction model by using a confusion matrix generated by the test set and the verification set.
Specifically, the verification set is data obtained by performing on-site cooperative collection on soil and vegetable samples, determining the concentration of heavy metals, and calculating and classifying; the confusion matrix (also called error matrix) for evaluating the accuracy of the crowd risk prediction is a standard format for representing accuracy evaluation, and is represented in a matrix form of n rows and n columns. The confusion matrix is generated by testing the test set and the verification set, specific evaluation indexes comprise Area Under the Curve (AUC), Recall rate (Recall, R), Overall classification precision (OA) and the like, and the precision indexes reflect the classification precision from different sides.
Specifically, the area under the curve (AUC) is the area under the receiver operating characteristic (ROC) curve and measures the generalization ability of the classification model: the closer the AUC value is to 1, the better the generalization ability and hence the performance of the model. The ROC curve is drawn with the true positive rate (TPR) on the vertical axis and the false positive rate (FPR) on the horizontal axis, and the area under it is the AUC value. The recall rate (Recall, R) is the proportion of correctly classified samples of a given category among all samples of that category. The overall classification accuracy (OA) is the proportion of correctly classified samples among all samples.
The true positive rate (TPR), the false positive rate (FPR), the recall rate (R) and the overall classification accuracy (OA) are calculated as follows:
$$TPR = R = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
$$OA = \frac{TP + TN}{TP + FN + FP + TN}$$
where TP is the number of positive samples that the model classifies correctly, FN is the number of positive samples that the model classifies incorrectly, FP is the number of negative samples that the model classifies incorrectly, and TN is the number of negative samples that the model classifies correctly. A positive sample belongs to the classification category under consideration, and a negative sample belongs to any of the other categories.
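As a minimal sketch of how these indexes follow from an n-by-n confusion matrix, the functions below derive TP, FN, FP and TN for one category in a one-versus-rest fashion. The matrix values are hypothetical and are not taken from Tables 1 and 2; the row/column orientation (rows are true categories, columns are predicted categories) is an assumption. For more than two categories, OA is computed here as the sum of the diagonal divided by the total, which matches the verbal definition of OA given above.

```python
def one_vs_rest_counts(matrix, k):
    """Return TP, FN, FP, TN for category k (rows: true category, columns: predicted)."""
    n = len(matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(matrix[r][k] for r in range(n)) - tp
    total = sum(sum(row) for row in matrix)
    tn = total - tp - fn - fp
    return tp, fn, fp, tn

def recall(matrix, k):
    """Recall (equal to TPR) of category k: TP / (TP + FN)."""
    tp, fn, _, _ = one_vs_rest_counts(matrix, k)
    return tp / (tp + fn)

def false_positive_rate(matrix, k):
    """FPR of category k: FP / (FP + TN)."""
    _, _, fp, tn = one_vs_rest_counts(matrix, k)
    return fp / (fp + tn)

def overall_accuracy(matrix):
    """Proportion of correctly classified samples among all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Hypothetical confusion matrix for four risk levels (no, low, medium, high risk).
example = [
    [50, 2, 0, 0],
    [3, 20, 2, 0],
    [0, 3, 10, 1],
    [0, 0, 1, 8],
]
no_risk_recall = recall(example, 0)
no_risk_fpr = false_positive_rate(example, 0)
oa = overall_accuracy(example)
```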
The effect of the classification model is verified with the divided test set data. The importance of each independent variable to the construction of the model is shown in figure 2: the enrichment coefficient plays the major role in the prediction results, which indicates that it is reasonable to select the enrichment coefficient for evaluating human health risk. The AUC (Area Under the Curve) values of the training set and the test set are shown in figure 3. In the training set, the AUC values of the risk prediction models for the urban and rural populations are both 1, indicating that the models reach the best possible performance on the training set. In the test set, the AUC values for the urban and rural populations are 0.9857 and 0.9765 respectively, both close to 1, which shows that the models have good generalization ability and good performance. Tables 1 and 2 show the test-set confusion matrices for the urban and rural populations output by the random forest classification model.
Table 1. Risk prediction results for the urban population exposed to vegetables through oral diet
[Table 1 is provided as an image in the original publication.]
Table 2. Risk prediction results for the rural population exposed to vegetables through oral diet
[Table 2 is provided as an image in the original publication.]
The prediction results show that the recall rates for the no-risk, low-risk, medium-risk and high-risk levels are 96.50%, 82.71%, 76.40% and 95.04% respectively for the urban population, and 97.84%, 79.84%, 73.33% and 95.04% respectively for the rural population. The overall classification accuracies for the urban and rural populations are 91.75% and 91.77% respectively. The accuracy is high, which indicates that the model generalizes well.
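The AUC values reported in Example 1 are areas under ROC curves. The sketch below shows one common way to obtain such a value: sweep the decision threshold to get (FPR, TPR) points, then integrate with the trapezoidal rule. The scores and labels are hypothetical. The source does not state how AUC was computed for the four risk levels, so a binary (one category versus the rest) setting is assumed here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities for one risk level (1 = sample belongs to it).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
example_auc = auc(roc_points(scores, labels))

# A classifier that ranks every positive above every negative reaches AUC = 1.
perfect_auc = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

An AUC of 1 on the training set, as reported above, therefore means that every training sample of a category was scored above every sample outside it.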
Example 2
To further illustrate the performance of the model in Example 1, this example validates it with measured data:
1. Measured data acquisition: 101 vegetable samples and the soil samples in which they were grown were collected and analyzed for cadmium concentration in accordance with the national standards GB 5009.15-2014 and HJ 803-2016.
2. Enrichment coefficient calculation and hazard quotient grading
The enrichment coefficient of each vegetable is calculated as the ratio of the cadmium concentration in the vegetable to the cadmium concentration in the soil in which it was grown. The hazard quotient is then calculated from the cadmium concentration of the vegetable and graded into risk levels. The calculation results are shown in Table 3:
Table 3. Cadmium enrichment coefficients of vegetables and hazard quotient risk levels for urban and rural populations
[Table 3 is provided as images in the original publication.]
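The enrichment coefficient calculation in step 2 is a simple ratio. A minimal sketch follows; the sample names and concentrations are hypothetical and are not the measured values of Table 3.

```python
def enrichment_coefficient(c_vegetable, c_soil):
    """Ratio of Cd in the edible part (mg/kg) to Cd in the planting soil (mg/kg)."""
    if c_soil <= 0:
        raise ValueError("soil concentration must be positive")
    return c_vegetable / c_soil

# Hypothetical paired vegetable and soil measurements, in mg/kg.
samples = [
    {"vegetable": "spinach", "c_vegetable": 0.12, "c_soil": 0.40},
    {"vegetable": "tomato", "c_vegetable": 0.02, "c_soil": 0.50},
]
coefficients = {
    s["vegetable"]: enrichment_coefficient(s["c_vegetable"], s["c_soil"])
    for s in samples
}
```

Because the coefficient is dimensionless, vegetables grown in soils with different cadmium levels can be compared on the same scale.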
3. Checking the effects of the model
The AUC values of the model on the verification set are shown in figure 4. The AUC values of the risk prediction models for the urban and rural populations are 0.9789 and 0.9740 respectively, both close to 1, which shows that the model also performs well on the verification set and that the classification model can be used in practice. Tables 4 and 5 show the verification-set confusion matrices for the urban and rural populations output by the random forest classification model.
Table 4. Risk prediction results for the urban population exposed to vegetables through oral diet (verification set)
[Table 4 is provided as an image in the original publication.]
Table 5. Risk prediction results for the rural population exposed to vegetables through oral diet (verification set)
[Table 5 is provided as an image in the original publication.]
The prediction results show that the recall rates for the no-risk, low-risk, medium-risk and high-risk levels are 100%, 78.95%, 72.22% and 100% respectively for the urban population, and 100%, 84.21%, 82.35% and 93.75% respectively for the rural population. The overall classification accuracies for the urban and rural populations are 91.09% and 93.07% respectively. The accuracy is high and the model performs well on measured values, so the simplified model for predicting the human health risk of oral vegetable intake based on the enrichment coefficient is shown to be effective.
Example 3
A vegetable planting and consumption strategy based on the model of enrichment coefficient and human health risk of oral dietary exposure: existing data are used to analyze differences in the cadmium enrichment coefficient of vegetables, including seasonal and regional differences, and the cadmium enrichment capacities of the same type of vegetable in different seasons and regions are compared. At the same time, the model established in Example 1 is used to calculate and compare the human health risk levels of eating the same kind of vegetable in different seasons and regions. Based on the differences in the cadmium enrichment coefficient and in the human health risk level, the season and region in which planting and eating a given vegetable carries the lower health risk are determined, and the vegetables with the lower risk level are recommended for planting and eating. This yields a vegetable planting and eating strategy based on the enrichment coefficient, comprising a seasonal (spring-summer versus autumn-winter) strategy (Table 6), a regional (south versus north) strategy (Table 6) and a recommended vegetable planting area map (fig. 5).
Table 6. Seasonal and regional vegetable planting and eating strategy
[Table 6 is provided as an image in the original publication.]
As shown in Table 6, leeks, scallions, garlic and eggplants are recommended for planting and eating in spring and summer, because eating these vegetables in spring and summer carries a lower human health risk than eating them in autumn and winter; similarly, lettuce, spinach, pakchoi, lettuce, celery and tomatoes are recommended for planting and eating in autumn and winter. For the north-south recommendations, spinach, lettuce, water spinach and coriander are recommended for planting and eating in the southern area, because eating these vegetables there carries a lower human health risk than in the northern area; similarly, vegetables such as young rape, amaranth, pakchoi, lettuce, celery, tomato and hot pepper are recommended for planting and eating in the northern area. In this way, by adjusting the planting pattern and reasonably allocating the vegetables grown in each area, the harm caused by the heavy metal cadmium transferred to the human body through the food chain can be reduced.
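The comparison behind such recommendations can be sketched as a grouping step: for each vegetable, pick the season (or region) with the lower risk. The sketch assumes that the comparison is made on the mean hazard quotient per group, which the source does not specify, and all HQ values below are hypothetical.

```python
from statistics import mean

def recommend(records):
    """For each vegetable, return the group (season or region) with the lowest mean HQ."""
    grouped = {}
    for vegetable, group, hq in records:
        grouped.setdefault(vegetable, {}).setdefault(group, []).append(hq)
    return {
        vegetable: min(groups, key=lambda g: mean(groups[g]))
        for vegetable, groups in grouped.items()
    }

# Hypothetical (vegetable, season, HQ) records.
records = [
    ("leek", "spring-summer", 0.4),
    ("leek", "spring-summer", 0.6),
    ("leek", "autumn-winter", 0.9),
    ("celery", "spring-summer", 1.2),
    ("celery", "autumn-winter", 0.7),
]
recommendation = recommend(records)
```

The same function applies unchanged to regional comparisons if the second field holds a region label instead of a season.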
The recommended vegetable planting area map is shown in fig. 5. By comparing the planting and eating risks of vegetables in different areas, six recommended vegetable planting areas are determined, as shown in the map: leaf, rhizome and fruit recommended planting areas; leaf and rhizome recommended planting areas; fruit recommended planting areas; leaf and fruit recommended planting areas; root and fruit recommended planting areas; and root and fruit recommended planting areas. Planting according to the six recommended areas can reduce the human health risk caused by eating vegetables.
While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various equivalent modifications, alternatives, and equivalents.

Claims (8)

1. A vegetable-based model for predicting heavy metal enrichment coefficients and human health risks of oral dietary exposure, characterized by comprising the following steps:
step 1): acquiring characteristic data of vegetables related to heavy metal enrichment;
step 2): cleaning and preprocessing the characteristic data related to the heavy metal;
step 3): acquiring vegetable oral diet parameters, and calculating the human health risk value of the vegetable oral diet;
step 4): establishing a machine learning model relating the vegetable characteristic data to the human health risk values of vegetable oral diet;
step 5): and evaluating the performance of the obtained machine learning model.
2. The vegetable-based model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure of claim 1, wherein: the heavy metal enrichment related characteristic data comprises: the enrichment coefficient of heavy metals in vegetables, the concentration of heavy metals in vegetable planting soil, a planting area and the variety of vegetables.
3. The vegetable-based model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure of claim 2, wherein:
the enrichment coefficient of the heavy metal in the vegetables is the ratio of the concentration of the heavy metal at the edible part of the vegetables to the concentration of the heavy metal in the vegetable planting soil;
the planting area is divided into the following parts according to geographical positions: northern, eastern, middle, southern, and southwest regions;
the heavy metals include: at least one of chromium, cadmium, mercury, arsenic, nickel, copper and zinc.
4. The vegetable-based model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure of claim 1, wherein: the data cleansing and pre-processing process comprises: I) performing relevance verification on the characteristic data, and classifying; II) carrying out format cleaning on the characteristic data; III) deleting and cleaning abnormal values in the feature data; IV) carrying out interpolation processing on the missing values in the characteristic data.
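Steps III) and IV) of the cleaning process can be sketched as follows. The claim does not say which outlier rule or interpolation method is used, so the 1.5 × IQR rule and median filling below are assumptions made only for illustration, and the concentration values are hypothetical.

```python
from statistics import median, quantiles

def clean(values):
    """Drop IQR outliers, then fill missing entries (None) with the median of the rest."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low = q1 - 1.5 * (q3 - q1)
    high = q3 + 1.5 * (q3 - q1)
    # Keep missing entries for now; remove only observed values outside the fences.
    kept = [v for v in values if v is None or low <= v <= high]
    fill = median(v for v in kept if v is not None)
    return [fill if v is None else v for v in kept]

# Hypothetical cadmium concentrations (mg/kg) with one missing value and one outlier.
raw = [0.10, 0.12, None, 0.11, 0.13, 0.09, 0.14, 0.10, 5.0]
cleaned = clean(raw)
```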
5. The vegetable-based model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure of claim 1, wherein: the vegetable oral diet parameters include: the concentration of contaminants in the edible parts of vegetables C_f, the daily oral vegetable intake IR, the exposure frequency EF, the exposure duration ED, the body weight BW, the average exposure time AT, the digestive tract absorption efficiency factor ABS_o and the oral reference dose RfD_o; the equation for calculating the human health risk of vegetable oral diet is as follows:
formula 1:
$$HQ = \frac{C_f \times IR \times EF \times ED \times ABS_o}{BW \times AT \times RfD_o} \times 10^{-6}$$
wherein: HQ is the human health risk value of vegetable oral diet, dimensionless; C_f is the concentration of contaminants in the edible parts of vegetables, in mg/kg; IR is the daily oral vegetable intake, in mg/d; EF is the exposure frequency, in days/year; ED is the exposure duration, in years; BW is the body weight, in kg; AT is the average exposure time, in days; ABS_o is the digestive tract absorption efficiency factor, dimensionless; RfD_o is the oral reference dose, in mg/kg-day.
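A minimal sketch of the hazard quotient calculation from the parameters listed in this claim is given below. The original equation is an image, so the form used here is an assumption reconstructed from the listed parameters and their units; in particular, the factor 1e-6 (mg to kg) is what makes HQ dimensionless when C_f is in mg/kg and IR is in mg/d. All parameter values are hypothetical.

```python
def hazard_quotient(c_f, ir, ef, ed, bw, at, abs_o, rfd_o, unit_factor=1e-6):
    """HQ = (C_f * IR * EF * ED * ABS_o * unit_factor) / (BW * AT * RfD_o).

    c_f in mg/kg, ir in mg/d, ef in days/year, ed in years, bw in kg,
    at in days, abs_o dimensionless, rfd_o in mg/kg-day.
    unit_factor converts the mg of vegetable in IR to kg (assumed, see lead-in).
    """
    return (c_f * ir * ef * ed * abs_o * unit_factor) / (bw * at * rfd_o)

# Hypothetical parameter values chosen only to exercise the function.
hq = hazard_quotient(
    c_f=0.1,        # mg/kg
    ir=300_000,     # mg/d, i.e. 300 g of vegetables per day
    ef=365,         # days/year
    ed=24,          # years
    bw=60,          # kg
    at=365 * 24,    # days
    abs_o=1.0,
    rfd_o=0.001,    # mg/kg-day
)
```

With these inputs the exposure frequency and duration cancel against the averaging time, so the result depends only on concentration, intake, body weight and reference dose.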
6. The model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure in vegetables according to claim 1 or 5, wherein:
the machine learning model of the vegetable characteristic data and the vegetable oral diet human health risk value adopts a random forest classification model, and the establishment process comprises the following steps:
i) the human health risk value of vegetable oral diet is divided into the following four risk levels: no risk when 0 < HQ < 0.5, low risk when 0.5 < HQ < 1, medium risk when 1 < HQ < 2, and high risk when HQ > 2;
ii) dividing the characteristic data into test set data and training set data, wherein the training set data accounts for 50-80% of the total amount of the characteristic data;
iii) determining the main parameters of the random forest classification model and fitting the model to the training set data and the risk levels, wherein the main parameters of the random forest classification model comprise: the number ntree of random decision trees and the number mtry of predictor variables randomly sampled by each decision tree;
iv) performing variable importance analysis on the data used for model fitting, expressed as MeanDecreaseGini, wherein the Gini index is calculated as follows:
formula 2:
$$Gini = 1 - \sum_{i=1}^{n} p(i)^2$$
wherein: n is the number of categories in the data set; p(i) is the proportion of the i-th category in the current node;
v) performing model testing by using the test set data.
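Steps i) and iv) of this claim can be illustrated with two small functions: one maps a hazard quotient to a risk level, and one computes the Gini index of a node from its category labels. The claim as extracted leaves the boundary values 0.5, 1 and 2 unassigned, so treating each upper bound as inclusive is an assumption. MeanDecreaseGini, as commonly defined for random forests, accumulates the decrease of this index over all splits on a variable and averages it over the trees; that aggregation is not shown here.

```python
from collections import Counter

def risk_level(hq):
    """Map HQ to one of four risk levels (upper bounds treated as inclusive: an assumption)."""
    if hq <= 0:
        raise ValueError("HQ must be positive")
    if hq <= 0.5:
        return "no risk"
    if hq <= 1:
        return "low risk"
    if hq <= 2:
        return "medium risk"
    return "high risk"

def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared category proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Hypothetical hazard quotients and the resulting node impurity.
levels = [risk_level(hq) for hq in (0.3, 0.75, 1.5, 3.0)]
mixed_node = gini_index(["no risk", "no risk", "low risk", "low risk"])
pure_node = gini_index(["high risk"] * 4)
```

A pure node has a Gini index of 0, so a split that separates the risk levels well produces a large decrease and raises the importance of the variable used for it.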
7. The vegetable-based model for predicting heavy metal enrichment factor and human health risk of oral dietary exposure of claim 1, wherein: the main process of the performance evaluation is as follows: carrying out precision inspection and model performance evaluation on the model by adopting a confusion matrix; the model performance evaluation comprises an area under a curve, a recall rate and overall classification precision, wherein the recall rate and the overall classification precision are calculated according to the following formulas:
formula 3:
$$TPR = R = \frac{TP}{TP + FN}$$
formula 4:
$$FPR = \frac{FP}{FP + TN}$$
formula 5:
$$OA = \frac{TP + TN}{TP + FN + FP + TN}$$
in formulas 3 to 5: TP is the number of positive samples that the model classifies correctly, dimensionless; FN is the number of positive samples that the model classifies incorrectly, dimensionless; FP is the number of negative samples that the model classifies incorrectly, dimensionless; TN is the number of negative samples that the model classifies correctly, dimensionless; TPR is the true positive rate, dimensionless; R is the recall rate, dimensionless; FPR is the false positive rate, dimensionless; OA is the overall classification accuracy, dimensionless.
8. Use of the vegetable-based model for predicting heavy metal enrichment coefficients and human health risks of oral dietary exposure according to any one of claims 1 to 7, wherein: the model is applied to the selection of vegetables for planting and eating.
CN202210576755.0A 2022-05-25 2022-05-25 Vegetable-based model for predicting heavy metal enrichment coefficient and human health risk exposed by oral diet and application Pending CN114913980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210576755.0A CN114913980A (en) 2022-05-25 2022-05-25 Vegetable-based model for predicting heavy metal enrichment coefficient and human health risk exposed by oral diet and application

Publications (1)

Publication Number Publication Date
CN114913980A true CN114913980A (en) 2022-08-16

Family

ID=82769063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210576755.0A Pending CN114913980A (en) 2022-05-25 2022-05-25 Vegetable-based model for predicting heavy metal enrichment coefficient and human health risk exposed by oral diet and application

Country Status (1)

Country Link
CN (1) CN114913980A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116227692A (en) * 2023-02-06 2023-06-06 中国科学院生态环境研究中心 Crop heavy metal enrichment risk quantification method, system and storable medium
CN116227692B (en) * 2023-02-06 2023-09-26 中国科学院生态环境研究中心 Crop heavy metal enrichment risk quantification method, system and storable medium
CN117854721A (en) * 2023-12-15 2024-04-09 兰州大学 Method, equipment and storage medium for evaluating human input health risk of heavy metal in soil

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination