CN118039031A

CN118039031A - Method for judging regional ore-forming potential based on machine learning of apatite components

Info

Publication number: CN118039031A
Application number: CN202311652244.3A
Authority: CN
Inventors: 许博; 郑育宇; 温子豪
Original assignee: China University of Geosciences Beijing
Current assignee: China University of Geosciences Beijing
Priority date: 2023-12-05
Filing date: 2023-12-05
Publication date: 2024-05-14
Anticipated expiration: 2043-12-05
Also published as: CN118039031B

Abstract

The invention provides a method for judging regional ore-forming potential based on machine learning of apatite components, which comprises compiling three global datasets of apatite major-element and/or trace-element compositions from ore-forming and non-ore-forming rock samples, and training a series of XGBoost models to determine the ore-forming potential of a deposit. Compared with traditional binary diagrams, the new classification method is considerably more accurate and efficient in distinguishing whether apatite comes from an ore-rich (fertile) or an ore-poor (barren) rock body. In addition, feature importance analysis shows that the V/Y and Cl/F ratios and the S content are critical to metal enrichment and mineralization.

Description

Method for judging regional ore-forming potential based on machine learning of apatite components

Technical Field

The invention relates to the technical field of geological investigation and mineral exploration, in particular to a method for distinguishing regional ore potential based on machine learning of apatite components.

Background

Apatite (Ca₅[PO₄]₃[F, Cl, OH]) is a widespread accessory mineral in most igneous and metamorphic rocks and in clastic sedimentary deposits, and it is highly resistant to weathering. Because its chemical composition is sensitive to the crystallization environment, apatite is considered an ideal indicator mineral. The trace-element and volatile compositions and the isotopic characteristics of apatite can characterize different crystallization environments, including magmatic systems, low-grade metamorphic systems, and sedimentary environments. Therefore, the trace-element chemistry of apatite is widely used to reflect the lithology of source rocks, including tracing the provenance of clastic rocks, and to constrain petrogenetic processes, particularly the origin and evolution of magma. In addition, the major- and trace-element chemistry of apatite is used in mineral prospecting, including various chemical indices such as Sr/Y, Mn, Eu/Eu*, Th/U, La/Sm and (Ce/Yb)_N, and binary classification schemes such as Sr vs. F (Mn, Y, (La/Yb)_N, Eu/Eu*), F/Cl vs. F, Cl vs. Eu/Eu*, V/Y vs. REE+Y, Cl vs. SO₃ and ⁸⁷Sr/⁸⁶Sr vs. Cl/F, which are commonly used to determine the ore-forming potential of magmas.

The field of machine learning (ML) involves the use of computer programs to identify patterns in a dataset, which are then applied to make predictions. Machine learning provides a powerful toolkit for decoding latent information in high-dimensional data. In the past few years, there has been a great deal of interest in the application of ML in solid earth science. ML has been widely used for seismic phase detection and seismic event classification, geophysical data processing and image interpretation, geophysical inversion, and the integration of multi-physics and multi-disciplinary information. Given the complexity and diversity of geochemical data, ML-based classification methods have become a promising alternative to traditional methods, particularly for large-scale geological problems such as predicting global mantle metasomatism, revealing the source components of intraplate basalts, identifying primary water concentrations in mantle pyroxene, determining quartz-forming environments, and classifying the source rocks of detrital zircon. In the field of mineral deposit exploration, two studies have attempted to characterize magma fertility using ML based on zircon composition data, with the aim of determining the copper mineralization potential of porphyries. Tan et al. (2023) applied partial least squares discriminant analysis (PLS-DA) to an apatite trace-element dataset (4,298 analyses) to distinguish apatite from different types of deposits and rocks. Their approach cannot directly distinguish magmatic apatite from hydrothermal apatite, but it shows great potential for classifying barren and fertile apatite from granite-related deposits and underscores the role of V, Eu and Sr in classification.

Here, the present invention compiles three global datasets of apatite major-element and/or trace-element compositions from ore-forming and non-ore-forming rock samples and trains a series of XGBoost models to determine the ore-forming potential of a deposit. Compared with traditional binary diagrams, the new classification method is considerably more accurate and efficient in distinguishing whether apatite comes from an ore-rich (fertile) or an ore-poor (barren) rock body. In addition, feature importance analysis shows that the V/Y and Cl/F ratios and the S content are critical to metal enrichment and mineralization.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention provides a method for distinguishing regional mineral potential based on machine learning of apatite components.

The invention is realized by the following technical scheme: the method for distinguishing regional ore potential based on machine learning of apatite components specifically comprises the following steps:

S1, database construction: the original dataset used for modeling contains 13382 pieces of apatite component data; deposit types are classified according to their value, morphology, alteration, ore mineralogy, and host-rock association; analyses collected from apatite formed with a deposit are labeled "mineralized", and analyses of apatite in unmineralized rock are labeled "unmineralized"; according to these criteria, 9104 and 4278 pieces of data are labeled "mineralized" and "unmineralized", respectively;

S2, dividing the sub-databases: the original dataset is divided into three subsets, wherein analyses of samples containing CaO, P₂O₅, SO₃, Cl and F are selected as the "major" dataset, and analyses of samples containing trace elements are selected as the "trace" dataset; analyses containing both major and trace elements form the "major and trace" dataset;

S3, preprocessing the data collected in the S2:

S31, processing missing values: any element (column) in which more than 60% of the values are missing is removed; after filtering, the "major" dataset includes 5618 pieces of data, and the "trace" dataset includes 9979 pieces of data;

S32, calculating geochemical indices, including LREE, HREE, Sr/Y, V/Y, Ce/Nd, Eu*, Ce*_N, Eu_N/Eu*_N, Ce/Ce*, Eu/Eu*/Y, REE+Y, (La/Yb)_N and La/Sm, and adding them to the "trace" dataset as new features; the "major and trace" dataset includes 2448 pieces of data and 43 features;

S4, a machine learning method: an XGBoost model is adopted; training is additive, with each new tree added to fit the residual of the previous prediction; the results of all the trees are summed to obtain the final prediction; given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ ($|\mathcal{D}| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$) with n examples and m features, the output of a tree ensemble model using K additive functions is predicted as the sum of K scores:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

wherein $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ ($q: \mathbb{R}^m \to \{1, \dots, T\}$, $w \in \mathbb{R}^T$) represents the space of regression trees, the function q represents the structure of each tree, which maps an example to the corresponding leaf index, T is the number of leaves in the tree, each $f_k$ corresponds to an independent tree structure q and leaf weights w, and $w_i$ represents the score on the i-th leaf;

S5, model hyperparameter tuning: five-fold cross-validation is combined with a grid search strategy, wherein the grid search strategy exhaustively generates candidate parameter combinations from a grid of parameter values, and selects and outputs the candidate with the highest score according to a predefined evaluation metric;

S6, machine learning classification results: a total of 14 XGBoost models are trained on the three apatite component datasets; five models are trained using the "major and trace" dataset, with 43, 35, 22, 12 and 6 selected features, respectively; two models are trained using the "major" dataset, with all ten major-element features used to train model M-1 and four selected elements used to train model M-2; seven models are trained using the "trace" dataset, with the number of features set to 33, 28, 21, 14, 7, 3 and 2 in order; the classification result of each XGBoost model is displayed as a confusion matrix;

The relative importance of all features used in each model is obtained from the XGBoost algorithm to determine the elements in apatite that are highly correlated with mineralization.

Preferably, the deposits in step S1 include porphyry-type deposits, skarn-type deposits, epithermal Au-Ag deposits, orogenic Au deposits, iron oxide copper-gold (IOCG) deposits, iron oxide-apatite (IOA) deposits, orogenic Ni-Cu±platinum group element (PGE) deposits, and carbonatite deposits.

Preferably, the features of the "major" dataset in step S31 include CaO, P₂O₅, SO₃, F, Cl, FeO, MnO, Na₂O, SiO₂ and Cl/F, and the features of the "trace" dataset include V, Mn, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu; the features of the "major and trace" dataset include CaO, P₂O₅, SO₃, F, Cl, FeO, Cl/F, SiO₂, Na₂O, MgO, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Th, U, Sr/Y, V/Y, Ce/Nd, Eu*, Ce*_N, Eu_N/Eu*_N, Ce/Ce*, Eu/Eu*/Y, REE+Y, (La/Yb)_N, La/Sm, LREE and HREE.

Preferably, the grid search in step S5 determines the optimal combination of hyperparameters, including eta, gamma, maximum depth and alpha, by generating 3600 candidate models and selecting the optimal model.

Preferably, in step S6, across all models of the "trace" dataset, V, Sr, Y, Eu, Ce and Rb occur most often among the top ten features of the relative importance ranking, with V ranking as the most important; among the five models in which V is selected, the relative importance of V is the highest in four models and the second highest in the remaining one; in the two models of the "major" dataset, the SO₃ content has the highest relative importance, and the contributions of the individual features are fairly even; the features that play a key role in the "major and trace" dataset models are similar to those in the "trace" dataset models.

By adopting the above technical scheme, the invention has the following beneficial effects compared with the prior art: the performance of several conventional apatite fertility indicators was evaluated using the raw dataset (Fig. 4). For example, Xu et al. (2021) proposed three apatite indices that can effectively distinguish fertile from barren porphyries. However, when applied to the dataset of the present study, their best accuracy was only 0.553 (Fig. 4a). More precisely, the classification based on the Cl/F ratio (Fig. 4a) had a true positive rate (TPR) of 0.421 for fertile apatite and a true negative rate (TNR) of 0.580 for barren apatite. For the V versus Y binary diagram (Fig. 4b), the accuracy, TPR and TNR were 0.261, 0.866 and 0.026, respectively, indicating that it was able to identify fertile apatite but not barren apatite. In addition, on a global dataset, conventional discrimination diagrams show low accuracy (from 0.242 to 0.553); when applied to mineral exploration, they may lead to erroneous assessment of mineralization potential and unreliable localization of mineralized zones.
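The TPR, TNR and accuracy figures quoted above come from applying a cut-off to an index and comparing the outcome with the known labels. A minimal sketch of that calculation follows, assuming a hypothetical rule of the form "index above a threshold means mineralized". The function name, the threshold and the toy values are illustrative only and are not taken from the invention or from Xu et al. (2021).

```python
def threshold_rule_metrics(values, labels, threshold):
    """Score the rule 'value > threshold => mineralized' against known labels.

    values: index values (for example a Cl/F ratio), one per analysis
    labels: True for "mineralized", False for "unmineralized"
    Returns (accuracy, TPR, TNR).
    """
    tp = fp = tn = fn = 0
    for value, is_mineralized in zip(values, labels):
        predicted = value > threshold
        if predicted and is_mineralized:
            tp += 1
        elif predicted and not is_mineralized:
            fp += 1
        elif not predicted and is_mineralized:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of mineralized analyses recovered
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # share of unmineralized analyses recovered
    return accuracy, tpr, tnr


# Toy values, invented for illustration only.
toy_values = [0.9, 0.2, 0.7, 0.1, 0.8, 0.3]
toy_labels = [True, True, True, False, False, False]
accuracy, tpr, tnr = threshold_rule_metrics(toy_values, toy_labels, threshold=0.5)
```

A rule with a high TPR and a very low TNR, as reported for the V versus Y diagram, labels nearly everything "mineralized", which is why all three numbers are reported together.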

As the volume of apatite geochemical data grows, the limitations of conventional research methods become increasingly prominent. One of the main limitations is that fertility indices derived from porphyries in one region cannot be accurately applied to evaluate ore-forming potential in other areas. In addition, traditional methods that rely on only a limited number of indices cannot comprehensively take into account the ore-forming information carried by the various elements, so the potential for metal enrichment cannot be effectively assessed.

ML models capable of processing high-dimensional geochemical data are considered powerful tools for mineral exploration. Compared with traditional two-dimensional element diagrams, the XGBoost models in this study are markedly more accurate and efficient, with accuracy ranging from 0.8507 to 0.9918, which indicates a higher success rate in prospecting and exploration. In addition, ML can integrate all the trace-element characteristics of apatite at the same time and directly capture the relationship between geochemical data and mineralization. The advantage of this approach is that the results are applicable to any geological environment. As the amount of apatite geochemical data from various deposit types increases, ML models trained on such datasets may become more sophisticated and accurate.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a box plot of major and trace elements and geochemical indices of the global apatite samples, expressed in weight percent (a) and ppm (b). Boxes represent the interquartile range (IQR) and mark the upper quartile (75%) and the lower quartile (25%). Whiskers extend to 1.5 times the IQR. The horizontal line within each colored box represents the median (50%). Black square symbols and circular symbols represent means and outliers, respectively;

Fig. 2 is a confusion matrix (left) and feature importance ranking (right) for four representative XGBoost models. The confusion matrix displays the prediction result of each category;

FIG. 3 shows the relationship between feature selection and XGBoost model performance;

Fig. 4 is a scatter plot of elemental ratios of rich ("mineralized") and lean ("unmineralized") apatite in the raw dataset.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

The method for discriminating regional mineral potential based on machine learning of apatite components according to the embodiment of the present invention will be specifically described with reference to fig. 1 to 3.

The invention provides a method for distinguishing regional mineralization potential based on machine learning of apatite components, which specifically comprises the following steps:

S1, database construction: all apatite composition data used for modeling were collected and compiled from the existing literature and cover 241 sampling sites in 27 countries worldwide. Each site includes a plurality of samples and analyses. The raw dataset contains 13382 pieces of apatite component data, including point analyses and, for publications that do not report point analyses, averages. FIG. 1 shows the elements and geochemical indices contained in the dataset;

Deposit types are classified according to their value, morphology, alteration, ore mineralogy, and host-rock association; the deposits include porphyry-type deposits, skarn-type deposits, epithermal Au-Ag deposits, orogenic Au deposits, iron oxide copper-gold (IOCG) deposits, iron oxide-apatite (IOA) deposits, orogenic Ni-Cu±platinum group element (PGE) deposits, and carbonatite deposits. Analyses collected from apatite formed with a deposit are labeled "mineralized", and analyses of apatite in unmineralized rock are labeled "unmineralized"; according to these criteria, 9104 and 4278 pieces of data are labeled "mineralized" and "unmineralized", respectively;

S2, dividing the sub-databases: to further distinguish the effects of the major and trace elements, the original dataset is divided into three subsets, with analyses of samples containing CaO, P₂O₅, SO₃, Cl and F selected as the "major" dataset and analyses of samples containing trace elements selected as the "trace" dataset; analyses containing both major and trace elements form the "major and trace" dataset;
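The labeling of step S1 and the subset division of step S2 can be sketched as follows. This is only an illustration under stated assumptions: each analysis is held as a dictionary, the analytes required for the first subset are those named in step S2, and `TRACE_EXAMPLE` is a hypothetical stand-in for "contains trace elements", since the text does not say which trace elements must be present. The toy records are invented.

```python
MAJOR_REQUIRED = ("CaO", "P2O5", "SO3", "Cl", "F")  # analytes named in step S2
TRACE_EXAMPLE = ("V", "Sr", "Y")                    # hypothetical stand-in for "contains trace elements"


def has_all(record, keys):
    """True when every listed analyte is present (not None) in the record."""
    return all(record.get(key) is not None for key in keys)


def split_subsets(records):
    """Assign each analysis to the three subsets described in step S2."""
    subsets = {"major": [], "trace": [], "major and trace": []}
    for record in records:
        in_major = has_all(record, MAJOR_REQUIRED)
        in_trace = has_all(record, TRACE_EXAMPLE)
        if in_major:
            subsets["major"].append(record)
        if in_trace:
            subsets["trace"].append(record)
        if in_major and in_trace:
            subsets["major and trace"].append(record)
    return subsets


# Invented toy records; "label" carries the S1 class of each analysis.
toy_records = [
    {"label": "mineralized", "CaO": 54.1, "P2O5": 41.8, "SO3": 0.3, "Cl": 0.5, "F": 2.9},
    {"label": "unmineralized", "V": 12.0, "Sr": 310.0, "Y": 450.0},
    {"label": "mineralized", "CaO": 54.5, "P2O5": 41.2, "SO3": 0.4, "Cl": 0.8, "F": 2.5,
     "V": 30.0, "Sr": 520.0, "Y": 210.0},
]
subsets = split_subsets(toy_records)
sizes = {name: len(rows) for name, rows in subsets.items()}

# Class counts stated in step S1; together they make up the 13382 analyses.
N_MINERALIZED, N_UNMINERALIZED = 9104, 4278
n_total = N_MINERALIZED + N_UNMINERALIZED
```

An analysis that reports both major and trace elements appears in all three subsets, which is why the subset sizes do not add up to the size of the raw dataset.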

S3, preprocessing the data collected in the S2:

S31, processing missing values: any element (column) in which more than 60% of the values are missing is removed; after filtering, the "major" dataset comprises 5618 pieces of data, and its features comprise CaO, P₂O₅, SO₃, F, Cl, FeO, MnO, Na₂O, SiO₂ and Cl/F; the "trace" dataset comprises 9979 pieces of data, and its features comprise V, Mn, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu;

S32, calculating geochemical indices that are considered significant for ore formation and magma evolution, and adding them to the "trace" dataset as new features. These indices include LREE, HREE, Sr/Y, V/Y, Ce/Nd, Eu*, Ce*_N, Eu_N/Eu*_N, Ce/Ce*, Eu/Eu*/Y, REE+Y, (La/Yb)_N and La/Sm; the "major and trace" dataset includes 2448 pieces of data and 43 features; the features of the "major and trace" dataset include CaO, P₂O₅, SO₃, F, Cl, FeO, Cl/F, SiO₂, Na₂O, MgO, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Th, U, Sr/Y, V/Y, Ce/Nd, Eu*, Ce*_N, Eu_N/Eu*_N, Ce/Ce*, Eu/Eu*/Y, REE+Y, (La/Yb)_N, La/Sm, LREE and HREE.
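The two preprocessing steps above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the invention: rows are plain dictionaries with `None` for missing values, the helper names are hypothetical, and only three of the simple ratio indices (Sr/Y, V/Y, La/Sm) are computed, because the normalized indices would require reference values that the text does not specify.

```python
def drop_sparse_columns(rows, max_missing_fraction=0.60):
    """Remove every column whose share of missing (None) values exceeds the limit."""
    columns = sorted({key for row in rows for key in row})
    kept = []
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing / len(rows) <= max_missing_fraction:
            kept.append(column)
    return [{column: row.get(column) for column in kept} for row in rows]


def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or None if a value is missing or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def add_ratio_indices(rows):
    """Append three of the ratio indices named in step S32 as new features."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row["Sr/Y"] = safe_ratio(row.get("Sr"), row.get("Y"))
        new_row["V/Y"] = safe_ratio(row.get("V"), row.get("Y"))
        new_row["La/Sm"] = safe_ratio(row.get("La"), row.get("Sm"))
        enriched.append(new_row)
    return enriched


# Invented toy analyses (ppm); Zr is missing in two of three rows and is dropped.
toy_rows = [
    {"V": 20.0, "Sr": 400.0, "Y": 200.0, "La": 900.0, "Sm": 300.0, "Zr": None},
    {"V": 5.0, "Sr": 150.0, "Y": 500.0, "La": 600.0, "Sm": 200.0, "Zr": None},
    {"V": None, "Sr": 300.0, "Y": 100.0, "La": 450.0, "Sm": 150.0, "Zr": 1.2},
]
prepared = add_ratio_indices(drop_sparse_columns(toy_rows))
```

Missing values that survive the column filter are left as `None` here, on the assumption that the downstream model handles them, which matches the statement that XGBoost can work with sparse data.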

S4, a Machine Learning (ML) method:

XGBoost is an ML system based on gradient tree boosting that can solve real-world-scale problems with minimal resources. It is a distributed gradient boosting library optimized for efficiency and flexibility. Its flexibility is reflected in its ability to handle data that are sparse for a variety of reasons, including missing values and frequently occurring zero values. In addition, its parallel and distributed computing capabilities speed up learning, thereby enabling faster model exploration. As a highly scalable end-to-end tree boosting system, it can be efficiently extended to larger datasets with minimal cluster resources. In addition, the tree structure of XGBoost can identify important features, which improves the interpretability of the results, helps clarify the relationship between apatite composition and mineralization, and allows its geochemical significance to be explored.

The XGBoost model is adopted; XGBoost is an ML algorithm that runs under the gradient boosting framework. Training is additive, with each new tree added to fit the residual of the previous prediction; the results of all the trees are summed to obtain the final prediction; given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ ($|\mathcal{D}| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$) with n examples and m features, the output of a tree ensemble model using K additive functions is predicted as the sum of K scores:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where $\mathcal{F}$ is the space of regression trees.
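The additive prediction described above, in which the scores of K trees are summed, can be illustrated without any library. The sketch below is not XGBoost itself: the trees are written by hand as nested dictionaries, the feature names, thresholds and leaf weights are invented, and the logistic link used to turn the summed score into a probability is an assumption, since the text does not state the objective function.

```python
from math import exp


def tree_score(tree, x):
    """Walk one regression tree and return the weight w of the leaf that x falls into."""
    node = tree
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def ensemble_score(trees, x):
    """Additive prediction: the sum of the K per-tree leaf scores."""
    return sum(tree_score(tree, x) for tree in trees)


def to_probability(score):
    """Logistic link (an assumption here) mapping a summed score to a value in (0, 1)."""
    return 1.0 / (1.0 + exp(-score))


# Two hand-written toy trees; features, thresholds and weights are invented.
toy_trees = [
    {"feature": "V/Y", "threshold": 0.05,
     "left": {"leaf": -0.4}, "right": {"leaf": 0.6}},
    {"feature": "Cl/F", "threshold": 0.10,
     "left": {"leaf": -0.2}, "right": {"leaf": 0.3}},
]
sample = {"V/Y": 0.08, "Cl/F": 0.04}
score = ensemble_score(toy_trees, sample)  # first tree gives 0.6, second gives -0.2
probability = to_probability(score)
```

During training, each additional tree is fitted to what the current sum still gets wrong, so adding a tree to `toy_trees` corresponds to one boosting round.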

S5, model hyperparameter tuning: five-fold cross-validation is combined with a grid search strategy to optimize the XGBoost models. The grid search strategy exhaustively generates candidate parameter combinations from a grid of parameter values, and selects and outputs the candidate with the highest score according to a predefined evaluation metric; the grid search determines the best combination of hyperparameters, including eta, gamma, maximum depth and alpha, by generating 3600 candidate models, from which the best model is selected.
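The combination of grid search and five-fold cross-validation can be sketched as below. Several things are assumptions made for illustration: the grid values are invented and chosen only so that the number of combinations is 10 × 6 × 10 × 6 = 3600, the text does not give the actual grid, and `toy_score` is a stand-in for training and scoring a real model on each fold.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds.

    Assumes the samples were shuffled beforehand.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def grid_search(param_grid, n_samples, fit_and_score, k=5):
    """Try every combination in param_grid; return the best by mean fold score.

    fit_and_score(params, train_idx, val_idx) is supplied by the caller: it trains
    on the training indices and returns a validation score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score, n_candidates = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        n_candidates += 1
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score, n_candidates


# Hypothetical grid: values invented so that 10 * 6 * 10 * 6 = 3600 candidates.
toy_grid = {
    "eta": [0.01 * i for i in range(1, 11)],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "max_depth": list(range(3, 13)),
    "alpha": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}


def toy_score(params, train_idx, val_idx):
    """Stand-in for model training: peaks at max_depth=6 and gamma=0.2, ignores the data."""
    return -abs(params["max_depth"] - 6) - abs(params["gamma"] - 0.2)


best_params, best_score, n_candidates = grid_search(
    toy_grid, n_samples=100, fit_and_score=toy_score)
```

In practice `fit_and_score` would fit a boosted-tree classifier with the given parameters on the training fold and return, for example, its accuracy on the validation fold.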

S6, machine learning classification results: based on the three apatite component datasets, a total of 14 XGBoost models were trained with different feature selections; five models were trained using the "major and trace" dataset, with 43, 35, 22, 12 and 6 selected features, respectively; two models were trained using the "major" dataset, with all ten major-element features used to train model M-1 and four selected elements used to train model M-2; the "trace" dataset was considered very important for recognizing mineralization and was therefore used to train seven models, with the number of features set to 33, 28, 21, 14, 7, 3 and 2 in sequence; the classification result of each XGBoost model is displayed as a confusion matrix; Fig. 2 shows the prediction results of four representative models.
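A confusion matrix of the kind mentioned above, together with the accuracy and F1 score derived from it, can be computed as in the following sketch. It assumes that "mineralized" is treated as the positive class when computing F1, which the text does not state, and the toy labels are invented.

```python
def confusion_matrix(true_labels, predicted_labels, positive="mineralized"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def accuracy_and_f1(counts):
    """Derive accuracy and the F1 score of the positive class from the counts."""
    total = sum(counts.values())
    accuracy = (counts["TP"] + counts["TN"]) / total
    predicted_positive = counts["TP"] + counts["FP"]
    actual_positive = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted_positive if predicted_positive else 0.0
    recall = counts["TP"] / actual_positive if actual_positive else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, f1


# Invented toy labels for five test analyses.
toy_truth = ["mineralized", "mineralized", "mineralized", "unmineralized", "unmineralized"]
toy_predicted = ["mineralized", "mineralized", "unmineralized", "unmineralized", "mineralized"]
toy_counts = confusion_matrix(toy_truth, toy_predicted)
toy_accuracy, toy_f1 = accuracy_and_f1(toy_counts)
```

Because the two classes are unequal in size (9104 against 4278 analyses), the F1 score is a useful companion to accuracy when comparing models.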

The relative importance of all features used in each model is obtained from the XGBoost algorithm to determine the elements in apatite that are highly correlated with mineralization; across all models of the "trace" dataset, V, Sr, Y, Eu, Ce and Rb occur most often among the top ten features of the relative importance ranking, with V ranking as the most important; among the five models in which V is selected, the relative importance of V is the highest in four models and the second highest in the remaining one; some geochemical indices also influence the ranking, including Sr/Y, V/Y, Eu*, (La/Yb)_N and La/Sm. In the two models of the "major" dataset, the SO₃ content has the highest relative importance; however, the contributions of the individual features are fairly even. The features that play a key role in the "major and trace" dataset models are similar to those in the "trace" dataset models. In addition, Cl, F and Cl/F are also notable.
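The ranking comparison described above, counting how often a feature appears among the most important features across several models, can be sketched as follows. The raw importance values would in practice be read from each trained model; here they are invented numbers for two hypothetical models, and the helper names are illustrative.

```python
from collections import Counter


def relative_importance(raw_importance):
    """Scale raw per-feature importance values so that they sum to 1."""
    total = sum(raw_importance.values())
    return {name: value / total for name, value in raw_importance.items()}


def top_features(raw_importance, n=10):
    """Feature names sorted by decreasing importance, truncated to the first n."""
    ranked = sorted(raw_importance, key=raw_importance.get, reverse=True)
    return ranked[:n]


def top_n_frequency(importance_by_model, n=10):
    """Count in how many models each feature appears among the n most important."""
    counts = Counter()
    for raw_importance in importance_by_model.values():
        counts.update(top_features(raw_importance, n))
    return counts


# Invented importance values for two hypothetical models (not results of the invention).
toy_importance = {
    "model_a": {"V": 30.0, "Sr": 20.0, "Y": 10.0, "Eu": 5.0},
    "model_b": {"V": 12.0, "Ce": 18.0, "Sr": 6.0},
}
frequency = top_n_frequency(toy_importance, n=2)
shares = relative_importance(toy_importance["model_a"])
```

The same ranking can also drive feature selection: keeping only the first few names returned by `top_features` yields the reduced feature sets used for the smaller models.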

Feature selection: the classification results also indicate a positive correlation between the number of features and model performance. As shown in FIG. 3, the XGBoost models score higher when trained on more elements and geochemical indices. For example, the accuracy and F1 score increase from 0.9146 and 0.8507 for model T-7 (feature number=2) to 0.9682 and 0.9474 for model T-5 (feature number=7), and to 0.9939 and 0.9900 for model T-3 (feature number=33).

Overall, of the 12 models, 10 correctly classified more than 90% of the samples in the test set (accuracy greater than 0.9), indicating that the models in this study perform well in distinguishing "mineralized" from "unmineralized" apatite. Of all 14 models, model M-T-1 obtained the highest score on both the training set and the test set. With this model, all samples in the training set were correctly classified (accuracy=1), and more than 99% of the samples in the test set were correctly classified (accuracy=0.9918). The elemental data obtained in practice may not be sufficient to meet the requirements of model M-T-1; however, model M-T-4 can achieve similar performance with only 9 elements (12 features), with accuracy and F1 scores of 0.9878 and 0.900, respectively. This suggests that the classification models in the present study can function in various situations. However, when the number of selected features is reduced to 2, the performance of the XGBoost model drops sharply (Fig. 3). Overall, the classification results show that the XGBoost models in this study achieve excellent performance after appropriate feature selection and are applicable to various situations.

In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The method for distinguishing regional ore potential based on machine learning of apatite components is characterized by comprising the following steps:

S1, database construction: the original dataset used for modeling contained 13382 pieces of apatite component data;

Deposit types are classified according to their value, morphology, alteration, ore mineralogy, and host-rock association; analyses collected from apatite formed with a deposit are labeled "mineralized", and analyses of apatite in unmineralized rock are labeled "unmineralized"; according to these criteria, 9104 and 4278 pieces of data are labeled "mineralized" and "unmineralized", respectively;

S3, preprocessing the data collected in the S2:

S4, a machine learning method: an XGBoost model is adopted; training is additive, with each new tree added to fit the residual of the previous prediction; the results of all the trees are summed to obtain the final prediction; given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ ($|\mathcal{D}| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$) with n examples and m features, the output of a tree ensemble model using K additive functions is predicted as the sum of K scores:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where $\mathcal{F}$ is the space of regression trees;

the relative importance of all features used in each model was obtained from the XGBoost algorithm to determine the elements in apatite that are highly correlated with the mineralisation.

2. The method for judging regional ore-forming potential based on machine learning of apatite components according to claim 1, wherein the deposits in step S1 include porphyry-type deposits, skarn-type deposits, epithermal Au-Ag deposits, orogenic Au deposits, iron oxide copper-gold (IOCG) deposits, iron oxide-apatite (IOA) deposits, orogenic Ni-Cu±platinum group element (PGE) deposits, and carbonatite deposits.

3. The method for judging regional ore-forming potential based on machine learning of apatite components according to claim 1, wherein the features of the "major" dataset in step S31 include CaO, P₂O₅, SO₃, F, Cl, FeO, MnO, Na₂O, SiO₂ and Cl/F, and the features of the "trace" dataset include V, Mn, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu; the features of the "major and trace" dataset include CaO, P₂O₅, SO₃, F, Cl, FeO, Cl/F, SiO₂, Na₂O, MgO, Rb, Sr, Y, Zr, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Th, U, Sr/Y, V/Y, Ce/Nd, Eu*, Ce*_N, Eu_N/Eu*_N, Ce/Ce*, Eu/Eu*/Y, REE+Y, (La/Yb)_N, La/Sm, LREE and HREE.

4. The method according to claim 1, wherein the grid search in step S5 determines the optimal combination of hyperparameters, including eta, gamma, maximum depth and alpha, by generating 3600 candidate models and selecting the optimal model.

5. The method of claim 1, wherein, in step S6, across all models of the "trace" dataset, V, Sr, Y, Eu, Ce and Rb occur most often among the top ten features of the relative importance ranking, with V ranking as the most important; among the five models in which V is selected, the relative importance of V is the highest in four models and the second highest in the remaining one; in the two models of the "major" dataset, the SO₃ content has the highest relative importance, and the contributions of the individual features are fairly even; the features that play a key role in the "major and trace" dataset models are similar to those in the "trace" dataset models.