CN117405652A

CN117405652A - Qualitative analysis method, system and equipment for aluminum alloy of small sample stacking model

Info

Publication number: CN117405652A
Application number: CN202311245619.4A
Authority: CN
Inventors: 戴宇佳; 刘子源; 马庆
Original assignee: Zhejiang A&F University ZAFU
Current assignee: Zhejiang A&F University ZAFU
Priority date: 2023-09-25
Filing date: 2023-09-25
Publication date: 2024-01-16

Abstract

The invention discloses a qualitative analysis method, system and equipment for aluminum alloys based on a small-sample stacking model, and relates to the field of qualitative analysis of aluminum alloys. Aluminum alloy spectral data are acquired. In the first layer of the small-sample stacking model, a random forest reduces the dimensionality of the spectral data and characteristic bands are obtained according to the Gini impurity criterion; an automatic peak-searching algorithm determines the spectral peaks of the data, and the full width at half maximum of each peak is calculated from a fitted curve. In the second layer, different heterogeneous learners each make predictions from the characteristic bands and the full widths at half maximum of the spectral peaks. In the third layer, a classifier determined by logistic regression makes the final prediction from the second-layer prediction results and the characteristic bands. The invention can improve the accuracy and stability of the qualitative analysis of aluminum alloys with a small-sample stacking model.

Description

Qualitative analysis method, system and equipment for aluminum alloy of small sample stacking model

Technical Field

The invention relates to the field of aluminum alloy qualitative analysis, in particular to a small sample stacking model aluminum alloy qualitative analysis method, a system and equipment.

Background

Since its advent, laser-induced breakdown spectroscopy (LIBS) has been popular as a rapid and simple elemental analysis technique. It can detect solids, liquids and gases in real time and online, can analyze the elements of interest without complex sample pretreatment, and is widely applied in aerospace, environmental monitoring, mineral exploitation, metallurgical analysis, atmospheric monitoring and other fields. LIBS ablates the sample surface with high-energy laser pulses to generate a plasma, and the composition of the sample is analyzed qualitatively and quantitatively by collecting and processing the spectral signals emitted by the plasma. The classification accuracy of LIBS depends to a large extent on the number of samples and on the information extracted from the acquired spectra. In general, richer sample data improve classification precision and give the established model better generalization, and richer LIBS spectral line information provides more representative sample features. However, large, high-dimensional sample data also aggravate data redundancy, and as the number of collected samples increases, the required computational resources and computation time increase exponentially, which makes model building more difficult for researchers. Meanwhile, LIBS spectral lines are affected by chemical and physical matrix effects, background and other factors, which introduce uncertainty into the spectral signals. In addition, for some detection objects it is often difficult to obtain sufficient sample data. Therefore, how to obtain effective information from the original spectrum and thereby build a more reliable predictive model is a problem that must be faced.

With small-sample data, the computational burden is reduced because there are few samples, but effective information is harder to acquire because the data are less representative of the whole. A data-driven model often risks overfitting when there is too much feature information, and underfitting can occur at the same time because of the small amount of data in a small-sample learning task. Obtaining correct results from sparse data is therefore a very challenging problem in LIBS. An Li et al., working with fewer than 20 samples, used a resampling technique to repeatedly and randomly draw data from the original spectra, building a large number of training sets to reduce the risk of overfitting during learning, and established a model in combination with PCA-PLS to analyze coal accurately and quantitatively. An Li et al. also addressed the unreliability of small-sample estimation by selecting effective spectra with a statistical correction strategy and extracting, from the selected spectra, features of emission peak intensity and of intensity related to the emission line shape, which reduces the influence of noise and enhances the characteristic information of the spectra; a model built in combination with PCA-PLS then accurately measured the detonation heat of high-energy materials. How to extract valid features from a small amount of LIBS sample data and build a robust model deserves further study.

Disclosure of Invention

The invention aims to provide a qualitative analysis method, a qualitative analysis system and qualitative analysis equipment for aluminum alloy of a small sample stacking model, which can improve the accuracy and the stability of qualitative analysis for aluminum alloy of the small sample stacking model.

In order to achieve the above object, the present invention provides the following solutions:

a qualitative analysis method for aluminum alloy of a small sample stacking model comprises the following steps:

constructing a small sample stacking model;

acquiring aluminum alloy spectrum data by using a laser-induced breakdown spectroscopy experimental device;

carrying out qualitative analysis on the aluminum alloy spectrum data by using a small sample stacking model;

the qualitative analysis of the small sample stacking model comprises the following steps:

in the first layer of the small sample stacking model, reducing the dimensionality of the aluminum alloy spectrum data with a random forest, and obtaining characteristic bands according to the Gini impurity criterion; determining the spectral peaks of the aluminum alloy spectrum data with an automatic peak-searching algorithm, and calculating the full width at half maximum of each spectral peak from a fitted curve;

in the second layer of the small sample stacking model, making predictions from the characteristic bands and the full widths at half maximum of the spectral peaks with different heterogeneous learners, respectively, to obtain corresponding prediction results;

and in the third layer of the small sample stacking model, making a prediction from the prediction results and the characteristic bands with a classifier determined by logistic regression to obtain a qualitative analysis result.

Optionally, the peaks of the aluminum alloy spectral data are fitted using a Voigt function.

Optionally, the different heterogeneous learners include: a KNN model, an XGBoost model, and an SVM model.

Optionally, the logistic regression method is L2 regularized logistic regression.

A small sample stacking model aluminum alloy qualitative analysis system, comprising:

the small sample stacking model building module is used for building a small sample stacking model;

the aluminum alloy spectrum data acquisition module is used for acquiring aluminum alloy spectrum data by utilizing the laser-induced breakdown spectroscopy experimental device;

the qualitative analysis module is used for carrying out qualitative analysis on the aluminum alloy spectrum data by utilizing the small sample stacking model;

A small sample stacking model aluminum alloy qualitative analysis apparatus, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory which, when executed by the processor, implement the above small sample stacking model aluminum alloy qualitative analysis method.

Optionally, the memory is a computer readable storage medium.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

According to the small sample stacking model aluminum alloy qualitative analysis method, system and equipment provided by the invention, a small-sample stacking model is constructed as a three-layer stacked machine learning model. The first layer obtains reliable sample features through the resampling mechanism of the random forest. An automatic peak-searching algorithm selects spectral peaks with good independence; to account for the broadening mechanisms present in LIBS spectra, the selected peaks are fitted and their full widths at half maximum are calculated. This adds prior knowledge of the spectral line full width at half maximum, so that line intensity and peak broadening serve together as spectral features and enrich the feature dimensions of the LIBS spectrum. In the second layer, different heterogeneous learners are integrated to learn the diversity and complexity of the features in different feature spaces of the samples, and, to compensate for the small sample size, cross-validation within each base learner further enhances the reliability of the prediction results. The outputs of the three learners, together with the output of the first layer, form a new data set that is the input to the third layer. Logistic regression serves as the classifier of the third layer: it combines the learned features and outputs the probability of each category for the final prediction. The invention can improve the accuracy and stability of the qualitative analysis of aluminum alloys with a small-sample stacking model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a qualitative analysis method for aluminum alloy of a small sample stacking model provided by the invention;

FIG. 2 is a schematic diagram of a laser induced breakdown spectroscopy experimental apparatus;

FIG. 3 is a chart of the femtosecond LIBS spectra of five samples;

FIG. 4 is a small sample stacking model frame diagram;

FIG. 5 is a spectral distribution of LIBS spectra after visualization;

FIG. 6 is a schematic diagram of classification accuracy for each prediction set;

FIG. 7 is a schematic diagram of class prediction results for each model (parts (a) and (b): KNN and PCA-KNN; parts (c) and (d): SVM and PCA-SVM; parts (e) and (f): RF and PCA-RF);

FIG. 8 is a schematic view of a selected spectral peak fit;

FIG. 9 is a schematic illustration of a voigt fit;

FIG. 10 is a schematic diagram of feature importance (part (a): random forest feature importance; part (b): feature importance of the random forest combined with PCA);

FIG. 11 is a graph showing the classification results of the models improved with FWHM (part (a): FWHM-RF; part (b): PCA-FWHM-RF);

FIG. 12 is a schematic diagram of a Stacking prediction result;

FIG. 13 is a diagram showing training spectral number and discrimination.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

As shown in FIG. 1, the qualitative analysis method for the aluminum alloy of the small sample stacking model provided by the invention comprises the following steps:

S101, constructing a small sample stacking model, as shown in FIG. 4;

S102, acquiring aluminum alloy spectrum data by using a laser-induced breakdown spectroscopy experimental device;

As shown in FIG. 2, the laser-induced breakdown spectroscopy experimental apparatus consists of a femtosecond laser (Libra, Coherent, USA), an energy attenuation system, a spectrometer (Mechelle 5000, Andor), a pulse delay trigger, a three-dimensional translation stage, optical lenses, and a computer. The femtosecond laser has a wavelength of 800 nm, a pulse width of 50 fs, a repetition frequency of 1.0 kHz and an energy stability of 0.5%. Throughout the experiment, the laser energy was set to 1.8 mJ. The femtosecond laser-induced plasma emission was collected by a fused silica lens L2 (f = 75 mm) and coupled into a fiber probe (core diameter 200) connected to an echelle grating spectrometer equipped with an ICCD (1024 × 1024 pixels, DH334T), with an acquisition wavelength range of 200-975 nm, a precision of 0.05 nm, a resolution of λ/Δλ = 5000, a gate width of 5 μs and a delay time of 4 μs. To avoid excessive ablation of the target, the aluminum alloy sample was mounted on the XYZ 3D translation stage so that each laser pulse acted on a fresh position on the target surface. To avoid interference from air plasma in the spectral analysis, the aluminum alloy sample surface was placed 1 mm in front of the focal point. To reduce random errors in spectral detection, each spectrum was the average of 50 integrated pulses, and 20 sets of data were collected per sample. LIBS classification research was carried out on 5 Al-Cu-Mg-Fe-Ni aluminum alloys (1060, 6061, 5052, 2024, 7075). Table 1 lists the exact concentrations of the elements Cu, Mg, Fe, Ni, Mn, Si, Zn and Ti in the 5 samples. FIG. 3 shows the femtosecond LIBS spectra of the five samples.

TABLE 1 sample element content

S103, carrying out qualitative analysis on aluminum alloy spectrum data by using a small sample stacking model;

S301, in the first layer of the small sample stacking model, reducing the dimensionality of the aluminum alloy spectrum data with a random forest, and obtaining characteristic bands according to the Gini impurity criterion; determining the spectral peaks of the aluminum alloy spectrum data with an automatic peak-searching algorithm, and calculating the full width at half maximum of each spectral peak from a fitted curve;
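The impurity criterion named in this step is read here as Gini impurity, the quantity a random forest's decision trees reduce at each split. The following is a minimal sketch of that quantity only, not the patent's implementation; the function names, the single-band split and the toy intensities are all illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(intensities, labels, threshold):
    """Weighted Gini decrease obtained by splitting one band at `threshold`."""
    left = [y for x, y in zip(intensities, labels) if x <= threshold]
    right = [y for x, y in zip(intensities, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - weighted


# Hypothetical intensities of one band for four spectra of two alloy types.
band = [0.10, 0.12, 0.80, 0.85]
alloy = ["1060", "1060", "2024", "2024"]
print(impurity_decrease(band, alloy, threshold=0.5))  # 0.5
```

A band whose splits produce large decreases across many trees accumulates a high importance score, which is the basis on which characteristic bands would be ranked.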

As a specific example, 461 characteristic bands are output with 95% as the threshold.

The full width at half maximum of each selected spectral peak is calculated from the fitted curve. Seven spectral peaks were selected, and their full widths at half maximum are output together with the intensity data of the 461 bands as the output of the first layer.

Output of the first layer of the stacking model = $(x_0, x_1, \ldots, x_{460}, F_0, \ldots, F_6)$;

where $x_i$ ($i = 0, 1, \ldots, 460$) denotes the intensity features selected by the random forest based on Gini impurity; 95% is used as the threshold in operation, and 461 wavelengths are selected. $F_j$ ($j = 0, 1, \ldots, 6$) denotes the full width at half maximum of each of the 7 selected spectral peaks, calculated after fitting with the Voigt function.
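How the 95% threshold is applied is not spelled out in the text. One plausible reading, assumed here and not confirmed by the source, is that bands are ranked by importance and kept until their cumulative share of the total importance reaches 95%. The sketch below illustrates that reading and the concatenation into the first-layer output; all names and toy numbers are hypothetical.

```python
def select_by_cumulative_importance(importances, threshold=0.95):
    """Indices of the top-ranked features whose summed share first reaches `threshold`.

    Assumption: the 95% threshold is a cumulative share of total importance.
    """
    total = sum(importances)
    order = sorted(range(len(importances)),
                   key=lambda i: importances[i], reverse=True)
    selected, running = [], 0.0
    for i in order:
        selected.append(i)
        running += importances[i] / total
        if running >= threshold:
            break
    return sorted(selected)


def first_layer_output(spectrum, selected, fwhm):
    """Concatenate the selected band intensities x_i with the peak widths F_j."""
    return [spectrum[i] for i in selected] + list(fwhm)


# Toy example with 5 bands and 2 fitted peaks (the text uses 461 bands and 7 peaks).
importances = [0.50, 0.02, 0.30, 0.02, 0.16]
selected = select_by_cumulative_importance(importances)
spectrum = [10.0, 11.0, 12.0, 13.0, 14.0]
fwhm = [0.08, 0.11]
print(selected)                                      # [0, 2, 4]
print(first_layer_output(spectrum, selected, fwhm))  # [10.0, 12.0, 14.0, 0.08, 0.11]
```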

S302, in a second layer architecture of the small sample stacking model, predicting the full width at half maximum of a characteristic wave band and a spectrum peak by using different heterogeneous learners respectively to obtain a corresponding prediction result;

The different heterogeneous learners include a KNN model, an XGBoost model, and an SVM model. Each learner is referred to as a base learner. The heterogeneous learners map the original data into different feature spaces, and the different representation capabilities of these feature spaces can effectively improve prediction accuracy. In this layer, each base learner is cross-validated to reduce the risk of overfitting. Five-fold cross-validation divides the training set into five subsets; in each fold, every base model is trained on four of the subsets and applied to the remaining subset, which serves as the validation set, to obtain the validation-set prediction $p_{ij}$ (i is the base model (1, 2, 3), j is the fold number (1, 2, 3, 4, 5)). The second layer thus outputs the predictions $p_1, p_2, p_3$ of the three base learners, which enter the third-layer input given below.
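The cross-validated prediction step can be sketched as out-of-fold prediction: every training sample is predicted by a model that did not see it during fitting. This is a minimal illustration under stated assumptions. The `NearestCentroid` class is a toy stand-in and is not KNN, XGBoost or SVM; the interleaved fold assignment and all names are hypothetical.

```python
def kfold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into n_folds interleaved validation folds."""
    return [list(range(f, n_samples, n_folds)) for f in range(n_folds)]


class NearestCentroid:
    """Toy stand-in for a base learner (not KNN, XGBoost or SVM)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for d, value in enumerate(row):
                acc[d] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {k: [v / counts[k] for v in acc]
                          for k, acc in sums.items()}
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda k: dist(row, self.centroids[k]))
                for row in X]


def out_of_fold_predictions(make_learner, X, y, n_folds=5):
    """Predict each sample with a model trained on the other folds only."""
    oof = [None] * len(X)
    for val_idx in kfold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_learner().fit([X[i] for i in train_idx],
                                   [y[i] for i in train_idx])
        for i, pred in zip(val_idx, model.predict([X[i] for i in val_idx])):
            oof[i] = pred
    return oof


# Toy one-dimensional data for two well-separated classes.
X = [[0.1 * i] for i in range(10)] + [[5.0 + 0.1 * i] for i in range(10)]
y = [0] * 10 + [1] * 10
p = out_of_fold_predictions(NearestCentroid, X, y)
print(p == y)  # True
```

Running the same routine once per base learner would give one prediction column per learner, matching the three prediction features that are passed to the third layer.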

S303, in the third layer of the small sample stacking model, making a prediction from the prediction results and the characteristic bands with a classifier determined by logistic regression to obtain a qualitative analysis result.

Input of the third layer of the stacking model = $(x_0, x_1, \ldots, x_{460}, F_0, \ldots, F_6, p_1, p_2, p_3)$.

L2-regularized logistic regression is selected as the classifier of the third layer; regularization reduces the overfitting risk brought by the rich features learned in the first two layers, and the qualitative analysis of the sample is finally achieved.
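To make the third-layer classifier concrete, the sketch below trains a multinomial logistic regression with an L2 penalty by plain batch gradient descent and outputs per-class probabilities. It is an illustration only: the optimizer, the hyperparameters `l2`, `lr` and `epochs`, the unpenalized bias, and the toy rows (with base-learner predictions encoded as numbers) are assumptions that do not come from the source.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_l2_logistic(X, y, n_classes, l2=0.01, lr=0.5, epochs=500):
    """Gradient descent on mean cross-entropy + (l2 / 2) * ||W||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[l2 * W[k][j] for j in range(d)] for k in range(n_classes)]
        gb = [0.0] * n_classes
        for row, label in zip(X, y):
            probs = softmax([sum(w * v for w, v in zip(W[k], row)) + b[k]
                             for k in range(n_classes)])
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == label else 0.0)) / n
                gb[k] += err
                for j in range(d):
                    gW[k][j] += err * row[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k]
            for j in range(d):
                W[k][j] -= lr * gW[k][j]
    return W, b


def predict_proba(W, b, row):
    """Probability of each class for one input row."""
    return softmax([sum(w * v for w, v in zip(Wk, row)) + bk
                    for Wk, bk in zip(W, b)])


# Hypothetical third-layer rows: two band intensities, one FWHM, then p1, p2, p3.
X = [
    [0.10, 0.20, 0.05, 0, 0, 0],
    [0.20, 0.10, 0.06, 0, 0, 1],
    [0.90, 1.00, 0.12, 1, 1, 1],
    [1.00, 0.90, 0.11, 1, 1, 0],
]
y = [0, 0, 1, 1]
W, b = fit_l2_logistic(X, y, n_classes=2)
predicted = [max(range(2), key=lambda k: predict_proba(W, b, row)[k]) for row in X]
print(predicted)
```

The penalty shrinks the weights of the many intensity features, which is the mechanism by which regularization limits overfitting when the number of features is large compared with the number of spectra.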

When an aluminum alloy sample is analyzed qualitatively, the plasma is induced by laser ablation, the analysis signal is collected by the spectrometer, and the raw spectral data are obtained after computer processing. The spectral data are first reconstructed by the first layer of the stacking model, and the reconstructed data serve as the input to the three heterogeneous learners of the second layer. Each heterogeneous learner of the second layer predicts on the reconstructed data separately; the prediction results are appended as features to the reconstructed data, so that the output of the second layer together with the output of the first layer forms the input of the third layer. The logistic regression in the third layer is trained on this input and performs the final qualitative analysis.

In current learning tasks, richer sample data are usually obtained from a sample through repeated experiments. When the amount of data is small, the collected sample data may be distributed fairly uniformly in the corresponding feature space, may be concentrated with distinct classification boundaries, or may have more complex, overlapping boundaries. This uncertainty about the sample distribution means that learning from small-sample data can easily deviate greatly from the real situation, which is the core problem of small-sample learning: the unreliability of the prediction results. An object of the present invention is therefore to improve the accuracy of the prediction results by deriving, as far as possible, the features that best characterize the spectrum from the collected spectral data. To understand the approximate distribution of the data, the LIBS spectra are first visualized by PCA, as shown in FIG. 5: projection1 is the projection onto the PCA-1/PCA-2 plane, projection2 is the projection onto the PCA-1/PCA-3 plane, and projection3 is the projection onto the PCA-2/PCA-3 plane. FIG. 5 shows that the spatial distribution of the five types of aluminum alloy samples is fairly complex. Types 1060 and 5052 have distinct classification boundaries with respect to the other three types, whereas types 2024, 6061 and 7075 cluster together in space, so it is difficult to identify a good classification boundary by observation; it is therefore hoped that these three types can be better distinguished in a higher-dimensional space.

The data set is divided into a 70% training set and a 30% test set. Algorithms that perform well in LIBS classification, namely KNN, support vector machines (SVM) and random forests (RF), are used for comparison, and prediction models are built both on the full spectrum and on the reconstructed spectrum after dimensionality reduction. PCA, the method most commonly used in LIBS spectral data processing, is chosen for dimensionality reduction, and the first ten principal components are the input to the dimension-reduced models. Model performance is compared by accuracy, precision, recall and F1 score. As shown in FIG. 6, among the models built on the full spectrum the random forest performs best. Compared with SVM and KNN, the random forest generalizes significantly better when predicting multiple classes, and because it builds feature subsets through feature selection based on information content, it can extract information from the original spectrum well without preprocessing. After the spectrum is reconstructed by PCA, the learning performance of the SVM improves, and PCA-SVM performs clearly better than RF. PCA is also applied to RF for comparison, and PCA-RF shows the best performance among the models based on the full spectrum and on the dimension-reduced reconstructed spectrum.
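The 70%/30% division can be illustrated with the data volume given earlier (5 alloys with 20 spectra each). Whether the split was stratified by alloy type is not stated, so the per-class split below is an assumption, as are the function name and the fixed seed.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train_idx += indices[:cut]
        test_idx += indices[cut:]
    return sorted(train_idx), sorted(test_idx)


# Five alloy types with 20 spectra each, as described for the data collection.
labels = [alloy for alloy in ("1060", "6061", "5052", "2024", "7075")
          for _ in range(20)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 70 30
```

Under this assumption each alloy contributes 6 test spectra, so a single misclassified spectrum corresponds to 16.7% of that alloy's test set.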

As shown in FIG. 7, aluminum alloy types 1060 and 5052 are spatially separated from the other three types, so these two types are well distinguished by all the classifiers compared; only RF (part (e) of FIG. 7) predicts type 1060 as type 2024, for 16.7% of the test samples (5 test spectra). As observed in FIG. 5, it is difficult to find obvious decision boundaries among the 6061, 7075 and 2024 distributions, and these three types tend to be misclassified as one another by the various models. The type with the highest classification error rate is 2024, whose spatial distribution always lies close to those of 7075 and 6061. Only the random forest achieves the best result when predicting type 2024, because its resampling mechanism and the way its decision trees extract information based on Gini impurity help it avoid overfitting when it is trained on the training set.
Because of the matrix effect and the self-absorption effect in LIBS, the collected spectra often contain noise, and the data of different samples overlap in the same band. Unlike the random forest, whose feature screening is based on information content, SVM and KNN are sensitive to noise and data overlap when facing high-dimensional data; after PCA dimensionality reduction, the reconstructed spectrum extracts the characteristic information effectively and improves the recognition results of both. Although the random forest can select features effectively through the Bootstrap technique and its voting strategy, LIBS spectra always carry redundant feature information; PCA dimensionality reduction removes some redundant or irrelevant features, which reduces the influence of noise and improves the generalization of the model, so the recognition capability of RF is also effectively improved after PCA. In summary, the comparison of the different classification models shows that PCA-RF is superior to the other predictors on all evaluation indexes. Compared with the traditional models, PCA-RF distinguishes the aluminum alloy types better, but completely accurate prediction remains difficult; in particular, the hard-to-identify types 2024 and 6061 remain a problem that model construction must face.

PCA-based dimensionality reduction handles LIBS data well and improves classifier accuracy, but the features obtained after PCA are difficult to interpret physically. Meanwhile, computing on the high-dimensional information of the LIBS spectrum incurs a large computational load and noise interference, so learning on the full spectrum is seldom adopted and some means of dimensionality reduction is indispensable. Sometimes spectra can be told apart by eye even though a trained model cannot distinguish them; manual observation mostly compares the shape of spectral lines, the height of spectral peaks and the presence or absence of peaks. Some researchers have therefore introduced machine vision into LIBS classification to provide features in a machine-vision dimension. For example, Jiujiang Yan et al. converted the original spectrum into an image and extracted gradient histograms by image processing, thereby indirectly extracting classification features from the multidimensional spectral intensity.

Broadening exists widely in all kinds of spectra, and the shape of a spectral peak is closely related to it. In contrast to the machine-vision approach, aluminum alloy LIBS spectral data already have rich feature vectors, so the principle for adding feature dimensions is to avoid placing further burden on features that are already redundant. The full width at half maximum (FWHM) is the width of a spectral peak, that is, the distance between the two points at half the peak height. In spectroscopy, the FWHM is commonly used to represent the width of a peak and reflects important properties such as the resolution and accuracy of the peak. Spectral peaks in LIBS can generally be divided into two types: those mainly affected by Stark broadening and those mainly affected by Doppler broadening. Doppler broadening arises from the thermal motion of the emitting atoms or ions and produces a Gaussian line shape. The full width at half maximum of the Gaussian line shape is proportional to the center frequency and to the square root of the temperature, so the higher the temperature or frequency, the greater the full width at half maximum due to Doppler broadening. Stark broadening, a type of pressure broadening, is the broadening of a spectral line caused by the effect of an electric field on the atomic energy levels, and it gives the line a Lorentzian shape. The full width at half maximum of Stark broadening depends on the electron density and the electric field strength, so the higher the electron density or field strength, the greater the full width at half maximum due to Stark broadening. Describing the broadening of spectral lines and the shape of spectral peaks by the full width at half maximum therefore makes fuller use of the spectral information of LIBS.
The Voigt line shape, the convolution of a Gaussian function and a Lorentzian function, can describe a spectral line shaped jointly by Stark broadening and Doppler broadening, so that a spectral line affected by self-absorption can be approximated more closely by the intensity of the theoretical line shape:

where A is the amplitude of the function, μ is the center of the function, σ is the standard deviation of the Gaussian component, γ is the half-width of the Lorentzian component, and ω(z) is the Faddeeva function (the scaled complex error function), where z is a complex variable defined as:
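The two expressions referred to in this passage are not reproduced in the text. One widely used parameterization of the Voigt profile that matches the symbols defined above is given here as an assumption; in particular, the normalization by σ√(2π) is a common convention and may differ from the formula in the original filing.

```latex
V(x; A, \mu, \sigma, \gamma) = \frac{A \,\operatorname{Re}\!\left[\omega(z)\right]}{\sigma\sqrt{2\pi}},
\qquad
z = \frac{x - \mu + i\gamma}{\sigma\sqrt{2}},
\qquad
\omega(z) = e^{-z^{2}}\operatorname{erfc}(-iz)
```

In the limit γ → 0 this reduces to a Gaussian of standard deviation σ, and in the limit σ → 0 to a Lorentzian of half-width γ, which correspond to the Doppler-dominated and Stark-dominated cases described above.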

Therefore, after the peak-searching algorithm, several spectral peaks with strong independence and little interference are selected for fitting, and their full widths at half maximum are calculated; the selected spectral peaks are shown in FIG. 8, and the Voigt fit is shown in FIG. 9.
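Once a peak has been fitted, its full width at half maximum can be obtained from the fitted Gaussian and Lorentzian parameters. The sketch below uses a well-known closed-form approximation for the Voigt width (commonly attributed to Olivero and Longbothum); the source does not say how the width was computed from the fit, so this choice and the example parameter values are assumptions.

```python
import math


def gaussian_fwhm(sigma):
    """FWHM of a Gaussian with standard deviation sigma."""
    return 2.0 * sigma * math.sqrt(2.0 * math.log(2.0))


def lorentzian_fwhm(gamma):
    """FWHM of a Lorentzian with half-width gamma."""
    return 2.0 * gamma


def voigt_fwhm(sigma, gamma):
    """Approximate FWHM of a Voigt profile from its Gaussian and Lorentzian parts."""
    f_g = gaussian_fwhm(sigma)
    f_l = lorentzian_fwhm(gamma)
    return 0.5346 * f_l + math.sqrt(0.2166 * f_l ** 2 + f_g ** 2)


# Hypothetical fitted parameters for one emission peak, in nm.
sigma_fit, gamma_fit = 0.02, 0.03
width = voigt_fwhm(sigma_fit, gamma_fit)
print(f"FWHM ~ {width:.4f} nm")
```

The approximation reduces to the pure Gaussian width when `gamma` is zero and, to within rounding of the coefficients, to the pure Lorentzian width when `sigma` is zero.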

To explore the influence of the broadening mechanism on the model, the random forest, which performed best in the comparison, is selected. FWHM features are introduced into RF and into PCA-RF, the better-performing models in the classification comparison, to explore whether the broadening features affect models built on the full spectrum and on the dimension-reduced reconstructed spectrum. Introducing the broadening mechanism improves the classification accuracy of the random forest: relative to part (e) of FIG. 7, the classification accuracy for type 6061 rises from 83.3% to 100%, and introducing the broadening mechanism into the random forest after PCA dimensionality reduction raises the prediction accuracy for type 2024 from 83.3% to 100%. The effects of the improved models are compared in Table 2.

Table 2 Comparison of model effects after improvement

Model	Accuracy	Precision	Recall	F1 score
RF	0.8667	0.8762	0.8667	0.8690
FWHM-RF	0.9	0.9048	0.9	0.8998
PCA-RF	0.9333	0.9380	0.9333	0.9331
PCA-FWHM-RF	0.9667	0.9714	0.9667	0.9664
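The four columns of Table 2 can be computed from true and predicted labels as sketched below. The averaging scheme is not stated in the text; macro averaging over classes is assumed here, and the labels in the example are toy values rather than the patent's predictions.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: one 1060 spectrum is misclassified as 2024.
y_true = ["1060", "1060", "1060", "2024", "2024", "2024"]
y_pred = ["1060", "1060", "2024", "2024", "2024", "2024"]
scores = macro_scores(y_true, y_pred)
print(scores)
```

With equally sized classes, macro-averaged recall equals accuracy, which agrees with the identical Accuracy and Recall columns in Table 2. The accuracy values in the table are also consistent with 26, 27, 28 and 29 correct predictions out of 30 test spectra.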

The ten highest feature-selection scores of the random forest with full width at half maximum added are then visualized, as shown in FIG. 10. In the conventional random forest, feature importance is scored on the full spectrum; among this large number of dimensions (27398 dimensions), the added full-width-at-half-maximum information of Mg I 285.21 nm and of Cu I 324.75 nm ranks in the top ten in importance, which indicates that the introduced broadening descriptors effectively enhance the spectral data. Note also that the spectral intensity at 324.8251 nm is the second most important feature of the conventional random forest, which is one of the reasons why the Cu I 324.75 nm full-width-at-half-maximum information is considered important. The recognition results are shown as confusion matrices in FIG. 11. As part (a) of FIG. 11 shows, whereas 16.7% of the type 6061 samples were misrecognized as 7075 when the full width at half maximum was not added, after it is introduced the performance of the random forest improves: the copper and magnesium contents differ markedly between the two types, so the full widths at half maximum of the emission peaks of these elements carry good discriminating information.
In PCA-RF, the full-width-at-half-maximum information of Cu I 324.75 nm is regarded as the most important discriminating information and that of Mn II 279.52 nm as the second most important, apart from the principal components. The confusion matrix shows improved recognition of type 2024: the copper content of the 2024 aluminum alloy differs markedly from that of the other types, so adding the full-width-at-half-maximum information of the Cu I 324.75 nm emission peak to PCA-RF effectively enhances the spectral features of type 2024 and improves the model. The manganese content of 2024 is specified as a range, unlike in the other aluminum alloys, so the peak corresponding to 2024 may vary with the content, changing the excited spectral intensity and the broadening of the peak and thereby helping to identify the type. Therefore, in the qualitative identification of aluminum alloys, introducing the broadening mechanism can improve the model in a targeted way, and different models differ in their sensitivity to the broadening of the selected spectral peaks, so further analysis is needed on choosing suitable peak broadening for a given model in order to improve classification accuracy.

The study of the various models on the full spectrum and on the dimension-reduced reconstructed spectrum shows that, even when the number of samples is sufficient, the conventional models still have difficulty in accurately predicting the test spectra after learning the training spectra. The enhanced models improve by introducing the broadening feature, a descriptor that captures the continuity of the spectrum, but they still do not predict the samples accurately, and reliable prediction cannot be ensured when the number of training spectra is small. Therefore, the respective advantages of the above models are compared, fused and screened, and a three-layer Stacking model is established. The random forest, a tree-structured learner that integrates a plurality of decision trees, is used to construct the first layer; effective spectrum information is obtained by feature selection based on the amount of information, which reduces the risk of under-fitting caused by the small amount of data and the complexity of the data. Meanwhile, spectral peaks with good independence are selected by an automatic peak searching algorithm, the selected peaks are fitted with a Voigt function, and the full width at half maximum of each peak is calculated; adding these effective features further reduces the possibility of under-fitting. The experiments have shown that the random forest, as a tree learner, classifies LIBS spectra well, so when the heterogeneous learners of the second layer are constructed it is desirable to include a tree learner as one of them, in order to learn the diversity and complexity of the features in different feature spaces of the samples.
XGBoost is a tree learner integrated in the Boosting manner and, like the random forest, has excellent resistance to over-fitting and good generalization capability. Together with the other two learners (KNN and SVM), it forms the heterogeneous learners of the second layer, and cross validation is used in each base learner to reduce the risk of over-fitting caused by insufficient data. The outputs of the three learners and the output of the first layer form new training data that serve as the input of the third layer. Logistic regression is used as the classifier of the third layer to effectively combine the rich features obtained by the first two layers; L2 regularization further reduces the risk of over-fitting, and the probability of each category is then output for the final prediction.
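The first-layer feature selection described above ranks spectral channels by how much they reduce Gini impurity. The following is a minimal sketch of that criterion for a single threshold split, under the assumption that each spectrum is a plain list of intensities; the names `gini`, `gini_decrease`, `best_decrease` and `rank_channels` are hypothetical. A random forest averages such decreases over many trees and bootstrap samples, which this sketch does not attempt to reproduce.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting one channel at a threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_decrease(values, labels):
    """Largest decrease over all midpoints between distinct channel values."""
    pts = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return max((gini_decrease(values, labels, t) for t in thresholds), default=0.0)


def rank_channels(spectra, labels, top_k):
    """Indices of the top_k channels with the largest impurity decrease."""
    n_channels = len(spectra[0])
    scores = [
        best_decrease([s[j] for s in spectra], labels) for j in range(n_channels)
    ]
    order = sorted(range(n_channels), key=lambda j: scores[j], reverse=True)
    return order[:top_k]
```

The selected channel indices, together with the peak widths, would then be passed to the second-layer learners; the out-of-fold outputs of those learners are concatenated with the selected channels to form the third-layer input.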

To verify the robustness of the Stacking model, its results were compared with those of the other models. The comparison results are shown in table 3, and the category prediction results of Stacking are shown in fig. 12.

Table 3 comparison of model effects

Model	Accuracy	Precision	Recall	F1-score
KNN	0.6667	0.6405	0.6667	0.6317
SVM	0.7667	0.7667	0.7667	0.7610
RF	0.8667	0.8762	0.8667	0.8690
PCA-KNN	0.7667	0.7667	0.7667	0.7610
PCA-SVM	0.8667	0.8750	0.8667	0.8629
PCA-RF	0.9333	0.9380	0.9333	0.9331
FWHM-RF	0.9000	0.9048	0.9000	0.8998
PCA-FWHM-RF	0.9667	0.9714	0.9667	0.9664
Stacking	1.0000	1.0000	1.0000	1.0000

As can be seen from table 3, the differences in accuracy reflect the prediction capability of the models at a macroscopic level, and the model performance agrees with the results of the experiments above. In practical applications, however, attention is usually paid to how often a sample predicted to be a certain model really is that model (precision) and how often the samples of a certain model are correctly predicted (recall). Evaluating a model by precision and recall is therefore a reasonable standard, and because robustness requires both to be considered together, the model can be judged intuitively by the F1-score. RF and PCA-SVM have the same accuracy and the same recall, but the precision of RF is higher than that of PCA-SVM, so its F1-score is also higher; with the same accuracy, the robustness of RF is therefore slightly better than that of PCA-SVM. The Stacking model is the highest on all four evaluation indexes, and as can be observed from fig. 12, it accurately predicts every type of sample. In the qualitative identification of aluminum alloys of multiple models, accuracy can be used as a macroscopic evaluation standard that reflects the quality of the prediction results, while comparing the other three evaluation indexes allows the robustness of the model to be measured more precisely. Stacking integrates the advantages of the various learners, addresses the problems of the modeling process in a targeted manner, enhances the generalization capability of models based on LIBS spectra, and provides an effective modeling method for LIBS analysis.
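The four evaluation indexes compared above can be computed from a confusion matrix as in the following minimal sketch. The averaging scheme is an assumption: the description does not state how the per-class values are combined, but in table 3 the recall equals the accuracy in every row, which is consistent with support-weighted averaging, so that scheme is used here. The function name `weighted_metrics` is hypothetical.

```python
def weighted_metrics(confusion):
    """Accuracy and support-weighted precision, recall and F1-score.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(k)) / total

    precision = recall = f1 = 0.0
    for i in range(k):
        tp = confusion[i][i]
        support = sum(confusion[i])                       # true samples of class i
        predicted = sum(confusion[r][i] for r in range(k))  # samples predicted as i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                          # assumed weighting scheme
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1
```

With this weighting the recall always equals the accuracy, so differences between two models of equal accuracy, such as RF and PCA-SVM, show up only in the precision and the F1-score.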

A data-driven model can readily be built when a large amount of data is available, whereas the significance of small sample learning is to build a robust model when data are scarce in practical problems. The final effect of a model is determined to a great extent by the features selected during modeling. In this work, features of different dimensions are given to the spectrum layer by layer: the reconstructed spectrum comprises the original spectral lines selected on the basis of Gini impurity, the broadening of the manually selected spectral lines, and the prediction labels of the three heterogeneous classifiers. These multi-dimensional features give the qualitative analysis model a reliable basis for classification, so that the established model can perform more accurate qualitative analysis when data are lacking. In this work, the numbers of spectra in the training set and the test set are re-divided. When the training sample size is reduced to 25 spectra, the cross validation is changed from 5-fold to 3-fold; when the number of training spectra is 10, 2-fold cross validation is used, and the minimum number of training spectra is set to 10. In quantitative analysis, increasing the number of sample spectra improves the accuracy with which a model estimates the content of a certain element and therefore gives a more reliable result. In qualitative analysis, the model establishes a mapping from an input spectrum to a qualitative result after learning the spectra of the samples of each model, which is a difficult task when the number of samples is extremely small and the number of categories is large.
In the preceding discussion, PCA-RF was improved by combining it with the FWHM. PCA-RF and PCA-FWHM-RF are therefore both included in the comparison, and the relationship between the qualitative recognition accuracy and the number of training spectra is shown in fig. 13.
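The fold schedule described above can be sketched as follows. The description fixes only three points (5-fold by default, 3-fold at 25 training spectra, 2-fold at 10 training spectra, with 10 as the minimum); the behavior for sizes between 10 and 25 is an assumption of this sketch, and the names `choose_folds` and `stratified_folds` are hypothetical.

```python
def choose_folds(n_train):
    """Number of cross validation folds for a given number of training spectra."""
    if n_train < 10:
        raise ValueError("the minimum number of training spectra is 10")
    if n_train == 10:
        return 2
    if n_train <= 25:
        return 3  # assumption: sizes between 10 and 25 also use 3-fold
    return 5


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds, round-robin within each class,
    so that every fold contains spectra of every model where possible."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds
```

With 10 training spectra and two spectra per model, 2-fold cross validation is the largest fold count for which each fold can still hold one spectrum of every model.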

In the graph, the accuracy of PCA-RF and PCA-FWHM-RF fluctuates continuously. When the sample size is 20, the recognition success rate of neither method reaches 75%. As the sample size increases, the accuracy of both rises steadily until the number of training spectra reaches 35, where the two accuracies become equal. The trends of the two methods change as the number approaches 50 spectra: at 45 training spectra the recognition accuracy of PCA-RF continues to increase while that of PCA-FWHM-RF decreases, and at 50 training spectra the trends of both change again. When the number of spectra increases to 70, the recognition accuracy of both increases, and the recognition rate of PCA-FWHM-RF is higher than that of PCA-RF. The trends of the two methods show that the distribution of the sample points changes continuously as the number of spectra increases, so the model weights are adjusted continuously, and the accuracy decreases and fluctuates after the best recognition accuracy is first reached. As the number of training spectra keeps increasing, the reliability of the model improves and the recognition accuracy rises again. This is the situation most models face in practical problems: although the random forest is a powerful learner for high-dimensional data, a model of the PCA-reconstructed spectra still has difficulty in accurate qualitative recognition when the training amount is small, even if as much feature information that benefits the qualitative recognition result as possible is extracted from the spectra when the model is established.
Although adding the FWHM slightly improves the recognition accuracy over PCA-RF when there are fewer than 35 training spectra, more samples are still needed for better performance. Meanwhile, the recognition accuracy of both methods fluctuates continuously between 45 and 70 spectra, which shows that increasing the number of training spectra also makes the model harder to establish, causes over-fitting and makes accurate prediction difficult. The results of the two methods show that obtaining richer spectrum description features of more dimensions from the original spectrum helps the establishment of the model, and that effective feature information gives the model greater reliability and eases the difficulty of small sample identification. The Stacking model, however, reaches with only 10 training spectra the prediction level that the improved random forest model reaches with 35 training spectra. With 10 training spectra, where each model of aluminum alloy has only two spectra, the accuracy of Stacking is over 85%, whereas the accuracy of PCA-RF and PCA-FWHM-RF is lower than 65%. The model built on 15 spectra reaches the effect that the other two models reach with 70 training spectra, and its qualitative recognition accuracy fluctuates little as the number of training spectra increases, always remaining above 96.25%. The Stacking model is built layer by layer to address the problems of complex data and scarce samples faced in learning small sample LIBS spectra.
The method effectively solves the under-fitting caused by complex data and the lack of samples and the over-fitting caused by the redundancy of spectrum features. In summary, compared with the improved RF method and other conventional algorithms, the Stacking model shows better robustness and more accurate prediction results when qualitatively analyzing aluminum alloy samples, and it can perform more accurate qualitative analysis on more of the models when modeling with small-scale spectrum samples.

The invention enhances the qualitative analysis of aluminum alloy by extracting and reconstructing spectrum feature information layer by layer. A prediction model based on a reconstructed spectrum that integrates intensity, full width at half maximum and machine learning features is studied; compared with other modeling methods such as the improved random forest, the algorithm significantly improves the performance of the prediction model. With only 3 training spectra per model, the model built by the algorithm performs comparably to the improved random forest algorithm with 14 training spectra per model. With 15 training spectra, the error rate of qualitative recognition of the new prediction model is less than 4%. With 10 training spectra (2 per model), the accuracy of the aluminum alloy qualitative analysis is about 20% higher than that of the improved random forest. When the training spectra are sufficient, the accuracy, precision, recall and F1-score of the new model all reach 1.0, improvements of 6.7%, 6.19%, 6.7% and 6.69% respectively over PCA-RF, which performed best among the conventional models in the experiments. This is mainly due to the following: 1. The new algorithm extracts richer feature information from the spectrum during modeling and describes the spectrum with features of different dimensions. 2. The different levels of the stacking model respectively target the under-fitting in small samples caused by the lack of samples and complex data, and the over-fitting caused by excessive dimensional information.
This method for accurately modeling small-scale LIBS spectra makes reliable predictions overall when the training sample size is small and provides an effective qualitative analysis method for experimental objects whose spectra are difficult to acquire or limited in number; in the future its performance can be verified more widely on different data sets.

Corresponding to the method, the small sample stacking model aluminum alloy qualitative analysis system provided by the invention comprises the following components:

In order to execute the method of the above embodiment and realize the corresponding functions and technical effects, the invention provides a small sample stacking model aluminum alloy qualitative analysis device, which comprises: at least one processor, at least one memory, and computer program instructions stored in the memory, which, when executed by the processor, implement the small sample stacking model aluminum alloy qualitative analysis method.

The memory is a computer-readable storage medium.

Based on the above description, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or a part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. And the aforementioned computer storage medium includes: various media capable of storing program codes, such as a U disk, a mobile hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical and similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of these examples is intended only to assist in understanding the method of the present invention and its core ideas. Those of ordinary skill in the art may also, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. The qualitative analysis method for the aluminum alloy of the small sample stacking model is characterized by comprising the following steps of:

constructing a small sample stacking model;

2. The small sample stacking model aluminum alloy qualitative analysis method according to claim 1, wherein a Voigt function is used to fit the spectral peaks of the aluminum alloy spectrum data.

3. The small sample stacking model aluminum alloy qualitative analysis method according to claim 1, wherein the different heterogeneous learners comprise: a KNN model, an XGBoost model and an SVM model.

4. The small sample stacking model aluminum alloy qualitative analysis method according to claim 1, wherein the logistic regression method is L2 regularized logistic regression.

5. A small sample stacking model aluminum alloy qualitative analysis system, comprising:

6. A small sample stacking model aluminum alloy qualitative analysis device, characterized by comprising: at least one processor, at least one memory, and computer program instructions stored in the memory, which, when executed by the processor, implement the small sample stacking model aluminum alloy qualitative analysis method according to any one of claims 1-4.

7. The small sample stacking model aluminum alloy qualitative analysis device of claim 6, wherein the memory is a computer readable storage medium.