CN117789038B - Training method of data processing and recognition model based on machine learning - Google Patents
- Publication number: CN117789038B (application CN202410205784.5A)
- Authority: CN (China)
- Legal status: Active (granted)
Abstract
The invention provides a training method of a machine-learning-based data processing and recognition model, belonging to the technical field of data processing. The method first collects soil information and labels training samples for model training, performs a dimension-reduction operation on the data, and then expands the samples with a SMOTE sample generation method based on rapid clustering. Features are then extracted from the data by a neural network, whose neuron parameters are optimized by a neural network model based on a search operator algorithm, avoiding the gradient vanishing and gradient explosion phenomena caused by traditional neural network parameter optimization methods. Finally, the hyperspectral data are classified by a machine learning classification model based on an improved random forest, in which the classification performance of each decision tree is evaluated during the decision-tree training stage, effectively improving the classification precision of the classifier. The algorithm designed by the invention has higher detection precision as well as stronger robustness and generalization capability.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a training method of a machine-learning-based data processing and recognition model.
Background
Heavy metal elements in soil are difficult to degrade in the natural environment, and heavy metal pollution is hard to remediate and highly harmful. Monitoring the heavy metal pollution of soil in real time therefore makes it possible to prevent the spread of a polluted area in time and to keep soil heavy metal pollution from worsening. The traditional method for identifying soil heavy metal pollution is to collect soil in the field and judge the pollution situation of an area through laboratory chemical analysis; although its identification precision is accurate, it has the defects of a long analysis period, heavy consumption of manpower and material resources, and difficulty in meeting the real-time monitoring requirement of a macroscopic region. The development of hyperspectral remote sensing and related fields offers a solution for rapidly monitoring soil heavy metal pollution over macroscopic regions. Hyperspectral remote sensing is rapid, dynamic and non-destructive, and applying it to soil heavy metal pollution can meet the need for large-scale real-time monitoring. The spatial information of a hyperspectral image reflects the physical spatial structure of the object, such as texture features and geometric features, while the spectral information reflects changes in the chemical composition within the object. Whether single-point high-resolution spectral bands measured in a laboratory or hyperspectral images obtained by satellite or airborne means, the spectral bands contain a large amount of information about the measured object; however, adjacent bands are highly correlated, and the resulting information redundancy increases the difficulty of feature extraction. Moreover, soil composition is complex and the heavy metal content is low, so the response of heavy metals in the soil spectrum is weak. How to effectively extract the important characteristic information from the many complex spectral bands is an important research topic in the hyperspectral field.
Prior-art invention patent CN202110651965.7 provides a method for predicting carbon components of an undisturbed soil profile based on hyperspectral imaging and support vector machine technology. Based on acquiring hyperspectral images of soil profile samples of preset depth at each sampling position, the method takes each characteristic spectral band of the target sample spectral region corresponding to a soil carbon component type as input and the soil carbon component data of that region as output, and obtains, through training, a soil carbon component prediction model for each soil carbon component type, thereby predicting the soil profile carbon components of the target region. The overall design can rapidly and accurately predict the contents of components such as organic carbon, soluble carbon, readily oxidizable carbon and soil microbial biomass carbon in an undisturbed soil profile, achieve fine mapping of their spatial distribution over the soil profile, and make up for the shortcomings of traditional laboratory chemical analysis methods.
Prior-art invention patent CN201910717696.2 provides a soil quality monitoring method based on airborne hyperspectral data, comprising the following steps: step 1, acquiring airborne hyperspectral data of a soil quality monitoring area, and collecting field samples of the area to analyse the heavy metal element content; step 2, preprocessing the airborne hyperspectral data; step 3, reconstructing the airborne hyperspectral spectra to eliminate the radiation distortion of the ground-object spectra caused by various atmospheric components; step 4, extracting the spectra of the sampling points from the airborne hyperspectral images in the remote sensing data; step 5, performing spectral transformation and correlation coefficient analysis to obtain the correlation coefficients between the soil contents and the soil spectral parameters and to find the sensitive bands of the characteristic spectra; and step 6, establishing an inversion-based soil quality monitoring model from the airborne hyperspectral data to obtain the monitored soil nutrient and metal element content data. When applied, the method can accurately obtain large-scale basic soil data, reduce the workload, shorten the soil quality monitoring period and lower the cost.
Prior-art invention patent CN201510119440.3 provides a technical method for hyperspectral identification of soil attributes, relating to the technical field of soil survey. The method comprises the following steps: S1, acquiring soil hyperspectral images at different times based on remote sensing satellite data; S2, after image preprocessing, obtaining bare soil through supervised classification, extracting the surface reflectivity of the bare soil, and establishing a bare-soil surface reflectivity inversion model from it; S3, designing an indoor soil erosion test and acquiring soil erosion data corresponding to the acquisition times of the soil hyperspectral images; S4, obtaining the soil classification and calculating the soil K value from the soil erodibility data obtained in step S3; and S5, establishing a hyperspectral model of the soil erodibility attribute K from the soil K value and the spectral data of the surface reflectivity inversion model. The invention solves the problem that hyperspectral remote sensing technology could not be used to measure soil erodibility.
Although the above prior art can identify the degree of soil pollution, the existing methods still need further improvement in model design and data processing, in particular further optimization of the detection precision, robustness and generalization capability of the model.
Disclosure of Invention
In view of the above technical problems, the invention adopts the following technical scheme: a training method of a machine-learning-based data processing and recognition model, comprising the following steps:
S1, acquiring soil data, and marking training samples for model training;
S2, reducing the dimension of the data, recombining the high-dimensional characteristic variables with large correlation coefficients to form a group of low-dimensional, linearly independent variables;
S3, sample expansion, namely generating new samples within the minority classes to reduce the class-imbalance phenomenon of the samples;
S4, extracting features from the data of step S3, wherein a neural network model optimized by a search operator algorithm is provided to optimize the parameters of the neurons; the neural network adopted in this step has 2 layers, and in the neural network model optimized by the search operator algorithm the parameters are searched by the search operator algorithm;
S5, training a classifier machine learning model;
S6, applying the trained model to identify the degree of soil heavy metal pollution: the model is trained with the marked samples, and after model training is completed the data to be detected and identified are detected and identified.
Further, the dimension reduction method adopted in S2 is a principal component analysis method, and includes the following steps:
S201, data standardization, wherein the standardized value is calculated as:
Z = (x − μ) / σ;
wherein Z represents the standardized value, x is the original value of a variable, and μ and σ are the mean and standard deviation of that variable; through this step all variables are scaled to a comparable range;
s202, calculating a covariance matrix, wherein the covariance matrix is defined as one mathematically Matrix/>Representing the dimensionality of the acquired data, each element in the matrix representing the covariance of the corresponding variable, for a vector with variable/>And the hyperspectral band scene of variable b, the covariance of which is a2 x 2 matrix, as follows:
;
Wherein, Representing covariance matrix,/>Representing the covariance of the variable with itself, i.e., variable/>Is a variance of (2); /(I)Representing the variable/>The covariance with the variable b is given by,Representing the variable/>Is a variance of (2);
S203, calculating eigenvectors and eigenvalues:
The eigenvectors and eigenvalues are calculated from the covariance matrix; they are calculated in pairs, i.e. each eigenvector has a corresponding eigenvalue, and the number of eigenvectors to be calculated determines the dimension of the data;
The eigenvectors are used to capture, via the covariance matrix, the directions of maximum variance in the data; since more variance in the hyperspectral data represents more information about the data, the eigenvectors are used to identify and calculate the principal components; the eigenvalues, on the other hand, are merely the scalars associated with each eigenvector, so the eigenvectors and eigenvalues together are used to calculate the principal components of the data;
S204, calculating principal components:
After the eigenvectors and eigenvalues are calculated, the eigenvectors are sorted in descending order of their eigenvalues; the eigenvector corresponding to a higher eigenvalue is more important, the eigenvector with the highest eigenvalue becomes the first principal component, and the selected principal components then form the feature matrix;
s205, reducing the dimension of the data set:
The raw data are rearranged using the final principal components, which represent the largest and most important information of the data set; to replace the original data set with the newly formed principal components, the feature matrix is simply multiplied with the transpose of the original data, and the result is used as the dimension-reduced data.
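The following is a minimal NumPy sketch of steps S201–S205 as described above; the array shapes, the number of retained components and the random example data are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def pca_reduce(X, n_components=5):
    """Sketch of S201-S205: standardize, build the covariance matrix,
    eigen-decompose, sort components, and project the data."""
    # S201: z-score standardization of every band (column)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # S202: covariance matrix of the standardized bands (d x d)
    C = np.cov(Z, rowvar=False)
    # S203: eigenvectors and eigenvalues of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    # S204: sort eigenvectors by descending eigenvalue and keep the leading ones
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]   # feature matrix
    # S205: project the standardized data onto the retained components
    return Z @ components                            # (n_samples, n_components)

# Illustrative usage with random data standing in for hyperspectral pixels
X = np.random.rand(200, 30)                          # 200 pixels, 30 bands (made-up sizes)
X_reduced = pca_reduce(X, n_components=5)
```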
Further, in S3, sample expansion adopts a SMOTE sample generation method based on fast clustering, which includes the following steps:
S301, obtaining the k nearest-neighbour samples of each minority-class sample by calculating the Euclidean distance from that sample to the other minority-class samples, and generating a new minority-class sample by randomly selecting one of the neighbours and linearly interpolating between the sample and the selected neighbour, the specific process being:
x_new = x + rand(0,1) × (x_k − x);
wherein x_k represents one sample among the k nearest neighbours, rand(0,1) is a random number, x is the input sample, and x_new is the new sample generated;
S302, rapidly clustering the generated samples; first, the distance between objects is calculated, wherein x_i and x_j are 2 of the samples and d(x_i, x_j) is the distance between them; in order to accelerate the clustering, a distance threshold λ is set, computed from the minimum inter-class distance d_min and the maximum inter-class distance d_max together with a manually set scaling factor ε that is typically greater than 0 and less than 1;
Within the generated sample set, the samples generated in each category are screened to improve the quality of the generated samples; the screened sample set is then combined with the original data set to obtain an equalized sample set for subsequent feature extraction.
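As an illustration of the fast-clustering SMOTE expansion in S301–S302, a simplified Python sketch follows; since the exact screening condition and threshold formula are given by the patent's formulas, the threshold construction and screening rule below (keeping generated samples that lie close enough to existing minority samples) are assumptions of this sketch, not the patent's exact definitions.

```python
import numpy as np

def smote_fast_cluster(X_min, k=5, n_new=100, eps=0.5, rng=None):
    """Sketch of S301-S302: interpolate new minority samples, then screen them
    with a distance threshold built from the minimum/maximum distances."""
    rng = np.random.default_rng(rng)
    new_samples = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        # S301: k nearest minority neighbours by Euclidean distance
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        x_k = X_min[rng.choice(neighbours)]
        new_samples.append(x + rng.random() * (x_k - x))     # linear interpolation
    new_samples = np.array(new_samples)
    # S302: distance of each generated sample to its closest original minority sample
    d_all = np.linalg.norm(new_samples[:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    # threshold from the min/max distances scaled by eps (assumed form)
    threshold = d_all.min() + eps * (d_all.max() - d_all.min())
    kept = new_samples[d_all <= threshold]                   # screened sample set
    return np.vstack([X_min, kept])                          # equalized sample set

X_min = np.random.rand(20, 5)                                # made-up minority-class samples
X_balanced = smote_fast_cluster(X_min, k=3, n_new=50, eps=0.4, rng=0)
```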
Further, in S4, a neural network model optimized by a search operator algorithm is adopted to optimize the neuron parameters w and b, wherein w denotes the weight parameters of the neurons and b denotes the threshold parameters of the neurons;
The number of layers of the neural network adopted in this step is 2, and in the neural network model optimized by the search operator algorithm the parameters w and b are searched by the following method:
S401, defining the search operators and setting the search conditions:
n search operators are set in the search operator population, the individual state of a search operator being expressed as X = (x_1, x_2, …, x_n), wherein x_i is the state of the i-th search operator, i.e. a free variable of the parameter-optimization problem; the objective function value at a state is denoted Y = f(X); the distance between search operators i and j is d_ij = ||X_i − X_j||; the search radius of a search operator is Visual; the search step length is Step; the crowding factor is δ; at a certain moment t, search operator i explores a position X_j within its search radius Visual, and if the state at X_j is better than the state at X_i, it moves a step further in the direction of X_j, i.e. arrives at the next position; otherwise it continues to search other locations within its field of view; the process is expressed as:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
wherein rand is a random number between 0 and 1;
Before acting, each search operator trial-executes the searching behavior, the aggregation behavior, the rear-end (following) behavior and the random behavior in turn, and then selects and executes the best of these behaviors, so that the search operator population moves to positions closer to the optimal solution:
(1) Search behavior
Assume the state of the i-th search operator at a certain moment is X_i; a state X_j is randomly selected within its search range, satisfying:
X_j = X_i + rand · Visual;
Y_i and Y_j respectively denote the preferred-solution concentrations (objective function values) of states X_i and X_j; if Y_j is better than Y_i, this search operator moves one step in this direction, namely:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
If the forward condition is not met, another state is selected within the search range and the moving condition is checked again; after the preset number of repeated selections, if the moving condition is still not met, the operator moves randomly;
(2) Aggregation behavior
Assume the state of the i-th search operator at a certain moment is X_i; the number of other search operators found within its search range in the current state is n_f, and their central position is X_c; the judgment criterion is:
Y_c / n_f > δ · Y_i;
wherein δ is the crowding factor, and Y_c and Y_i respectively represent the preferred-solution concentrations of the central position and the current position;
If the above formula holds, the centre has a higher preferred-solution concentration and is not overcrowded, and the operator moves one step toward the centre; if not, the searching behavior is executed;
(3) Rear-end collision behavior
Assume the state of the i-th search operator at a certain moment is X_i; it searches the other search operators nearby in the current state and finds the one with the maximum preferred-solution concentration Y_j, whose position is X_j; the judgment criterion is:
Y_j / n_f > δ · Y_i;
If the above formula holds, the position X_j has a denser preferred solution and is not overcrowded, and the search operator moves one step in that direction; if not, the searching behavior is executed;
(4) Random behavior
This behavior is the default of the searching behavior, i.e. a position within the field of view is randomly selected and moved to; the position of the next state is:
X_i(t+1) = X_i(t) + rand · Visual;
By the above method, the optimal solution set of the neural network parameters is obtained.
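A compact Python sketch of the search-operator optimization described in S401 is given below, written as a generic swarm-style search over a parameter vector. The fitness function, population size, stopping rule and the way the four behaviors are trial-evaluated are illustrative assumptions; in the patent the fitness would be the 2-layer network's training objective over the weights w and thresholds b.

```python
import numpy as np

def search_operator_optimize(fitness, dim, n=20, visual=1.0, step=0.3,
                             delta=0.6, tries=5, iters=100, seed=0):
    """Sketch of S401: each operator trial-evaluates the searching, aggregation,
    rear-end (following) and random behaviors, then executes the best move."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, dim))        # operator states (free variables)
    Y = np.array([fitness(x) for x in X])        # preferred-solution concentrations

    def move_towards(x, target):
        direction = target - x
        return x + rng.random() * step * direction / (np.linalg.norm(direction) + 1e-12)

    for _ in range(iters):
        for i in range(n):
            candidates = []
            # searching behavior: try up to `tries` random states inside the visual range
            for _ in range(tries):
                xj = X[i] + rng.uniform(-1, 1, dim) * visual
                if fitness(xj) > Y[i]:
                    candidates.append(move_towards(X[i], xj))
                    break
            else:                                 # random behavior as the default fallback
                candidates.append(X[i] + rng.uniform(-1, 1, dim) * visual)
            mask = np.linalg.norm(X - X[i], axis=1) < visual
            if mask.sum() > 1:
                # aggregation behavior: move toward the centre if it is better and uncrowded
                centre = X[mask].mean(axis=0)
                if fitness(centre) / mask.sum() > delta * Y[i]:
                    candidates.append(move_towards(X[i], centre))
                # rear-end behavior: follow the best neighbour if it is better and uncrowded
                j = int(np.argmax(np.where(mask, Y, -np.inf)))
                if Y[j] / mask.sum() > delta * Y[i]:
                    candidates.append(move_towards(X[i], X[j]))
            best = max(candidates, key=fitness)   # execute the best trial behavior
            if fitness(best) > Y[i]:
                X[i], Y[i] = best, fitness(best)
    return X[np.argmax(Y)]

# Toy usage: maximize a placeholder fitness standing in for the network's training score
best_params = search_operator_optimize(lambda w: 1.0 / (1.0 + np.sum((w - 0.5) ** 2)), dim=4)
```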
Further, in S5, hyperspectral data is classified by a machine learning classification model based on an improved random forest, and the degree of heavy metal pollution is identified, and the improved random forest algorithm is as follows:
In the training stage of the decision trees, the classification performance of each decision tree is evaluated and a higher weight is given to the trees that classify minority-class samples accurately; the final prediction is obtained by weighted voting, and the prediction result of the random forest is defined as:
H(x) = argmax_y Σ_{t=1}^{T} w_t · I(h_t(x) = y);
wherein H(x) represents the prediction result of the random forest, argmax is the maximum-index function, N is the test set, T is the number of decision trees, I(·) is the indicator function, h_t(x) is the prediction result of the t-th decision tree, y represents the category, and w_t is the voting weight of the t-th decision tree; when the prediction result of the decision tree equals y, the indicator function takes the value 1, and 0 otherwise;
When the improved random forest algorithm works, a confusion matrix is first constructed, in which TP denotes a stable sample judged as stable, FN a stable sample judged as unstable, FP an unstable sample judged as stable, and TN an unstable sample judged as unstable;
The harmonic mean F1_t of each decision tree's precision P_t and recall R_t in classifying destabilized samples is used as that tree's voting weight w_t, defined as:
w_t = F1_t = 2 · P_t · R_t / (P_t + R_t);
The larger this value, the better the decision tree's classification performance on minority-class samples; the degree of heavy metal pollution is then identified by the machine learning classification model of the improved random forest.
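A small numeric illustration of this weighting, using made-up confusion-matrix counts and treating the destabilized (unstable) class as the positive class — an assumption of this sketch:

```python
# Toy confusion matrix for one decision tree, following the text's convention:
# TP: stable judged stable, FN: stable judged unstable,
# FP: unstable judged stable, TN: unstable judged unstable.
TP, FN, FP, TN = 80, 5, 10, 25                       # made-up counts

# Precision and recall of the destabilized (unstable) class for this tree
precision = TN / (TN + FN)                           # predicted-unstable samples that are truly unstable
recall = TN / (TN + FP)                              # truly unstable samples caught by the tree

# Voting weight w_t = harmonic mean (F1) of precision and recall
w_t = 2 * precision * recall / (precision + recall)
print(round(w_t, 3))
```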
Compared with the prior art, the invention has the following beneficial effects: the designed algorithm reduces the dimensionality of the high-dimensional original data and introduces a fast-clustering SMOTE sample generation method to expand the samples, thereby reducing class imbalance; an optimal solution set of the neural network parameters is obtained with a neural network model optimized by the search operator algorithm; in the decision-tree training stage of the improved random forest algorithm, the classification performance of each decision tree is evaluated, a higher weight is given to trees that classify minority-class samples accurately, and the final prediction is obtained by weighted voting, improving the classification performance of the model; the resulting algorithm model has higher detection precision and stronger robustness and generalization capability.
Drawings
Fig. 1 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
Examples: referring to fig. 1, a training method of a machine learning-based data processing and recognition model includes the steps of:
S1, acquiring soil data, and marking a training sample for model training; the collected data is derived from hyperspectral remote sensing images or sensor data, and in this embodiment, hyperspectral remote sensing images are taken as an example for illustration.
S2, performing dimension reduction operation on high-dimension hyperspectral data
The original hyperspectral data have many bands, high dimensionality, a large data volume and considerable redundancy. To reduce the influence of the "curse of dimensionality" and to lose as little information as possible while reducing the dimensionality of the data, the proposed soil heavy metal pollution identification and classification framework first constrains the spectral dimension of the original hyperspectral remote sensing image, retaining several principal components so as to reduce the dimensionality of the data and eliminate redundant information.
The dimension reduction method adopted in this step is principal component analysis: the high-dimensional hyperspectral remote sensing image is projected into a low-dimensional subspace, and the high-dimensional characteristic variables with large correlation coefficients are recombined into a group of low-dimensional, linearly independent variables; when the principal component analysis method processes the original hyperspectral remote sensing image, it mainly comprises the following steps:
S201, data standardization, wherein standardization brings all variables and values in the hyperspectral data into a similar range; without this standardization step the results would be biased. The standardized value is calculated as:
Z = (x − μ) / σ;
where Z represents the standardized value, x is the original value of a variable, and μ and σ are the mean and standard deviation of that variable; through this step all variables are scaled proportionally.
S202, calculating a covariance matrix, wherein the principal component analysis method helps to identify the correlation and dependence among elements in the hyperspectral data set, and the covariance matrix expresses the correlation among the different variables in the data set; mathematically the covariance matrix is a d×d matrix, where in a hyperspectral remote sensing image d represents the dimensionality of the image, and each element of the matrix represents the covariance of the corresponding pair of variables; for a hyperspectral band scene with variables a and b, the covariance matrix is the 2×2 matrix:
C = [ cov(a,a)  cov(a,b) ; cov(b,a)  cov(b,b) ];
wherein C represents the covariance matrix, cov(a,a) represents the covariance of variable a with itself, i.e. the variance of a; cov(a,b) represents the covariance of variable a with variable b, and cov(b,b) represents the variance of variable b; in the covariance matrix, the covariance value indicates the degree to which two variables depend on each other: a negative covariance indicates that the variables vary inversely, while a positive covariance indicates that they vary in direct proportion.
S203, calculating eigenvectors and eigenvalues:
The eigenvectors and eigenvalues are calculated from the covariance matrix. The principal components are obtained by transforming the original vectors and re-expressing them in partially transformed form: during principal component extraction, most of the information originally scattered across the original vectors is compressed and re-integrated; if, for example, the first 5 spatial dimensions of the hyperspectral data are retained, 5 principal components are calculated, the 1st principal component storing the largest possible amount of information, the 2nd storing the largest share of the remaining information, and so on. The eigenvectors and eigenvalues are calculated in pairs, i.e. each eigenvector has a corresponding eigenvalue, and the number of eigenvectors to be calculated determines the dimensionality of the data.
The hyperspectral remote sensing image is a 3-dimensional data set, so the number of eigenvectors and eigenvalues is 3; the eigenvectors are used, via the covariance matrix, to find the directions of maximum variance in the data, and since greater variance in the hyperspectral data represents more information about the data, the eigenvectors are used to identify and calculate the principal components. The eigenvalues, on the other hand, are merely the scalars associated with the respective eigenvectors; the eigenvectors and eigenvalues together are therefore used to calculate the principal components of the hyperspectral data.
S204, calculating principal components:
After the eigenvectors and eigenvalues are calculated, the eigenvectors are sorted in descending order of their eigenvalues; the eigenvector corresponding to a higher eigenvalue is more important, the eigenvector with the highest eigenvalue becomes the first principal component, and so on, so that principal components of lower importance can be discarded to reduce the size of the data; the selected principal components then form the feature matrix, which contains all the important data variables carrying the largest amount of data information.
S205, reducing the dimension of the data set:
The raw data are rearranged using the final principal components, which represent the largest and most important information of the data set; to replace the original data set with the newly formed principal components, the feature matrix is simply multiplied with the transpose of the original data, and the obtained data are used as the dimension-reduced data.
S3, sample expansion:
Because data acquisition often exhibits sample class imbalance, i.e. the numbers of samples in different classes differ greatly and classes with few samples are difficult to distinguish effectively during classification, the invention provides a SMOTE sample generation method based on rapid clustering, which generates new samples within the minority classes and reduces the class-imbalance phenomenon.
A SMOTE sample generation method based on rapid clustering is adopted, and comprises the following steps:
S301, obtaining the k nearest-neighbour samples of each minority-class sample by calculating the Euclidean distance from that sample to the other minority-class samples, and generating a new minority-class sample by randomly selecting one of the neighbours and linearly interpolating between the sample and the selected neighbour, the specific process being:
x_new = x + rand(0,1) × (x_k − x);
wherein x_k represents one sample among the k nearest neighbours, rand(0,1) is a random number, x is the input sample, and x_new is the new sample generated;
S302, rapidly clustering the generated samples; first, the distance between objects is calculated, wherein x_i and x_j are 2 of the samples and d(x_i, x_j) is the distance between them; in order to accelerate the clustering, a distance threshold λ is set, computed from the minimum inter-class distance d_min and the maximum inter-class distance d_max together with a manually set scaling factor ε that is typically greater than 0 and less than 1;
Within the generated sample set, the samples generated in each category are screened to improve the quality of the generated samples; the screened sample set is then combined with the original data set to obtain an equalized sample set for subsequent feature extraction.
S4, extracting characteristics of the hyperspectral data:
The data obtained through the preceding steps are subjected to feature extraction using a neural network. Unlike a traditional neural network model, the optimization algorithm of the network is improved in this step: a neural network model optimized by the search operator algorithm is proposed to optimize the neuron parameters w and b, wherein w denotes the weight parameters of the neurons and b denotes the threshold parameters of the neurons; the number of layers of the neural network adopted in this step is 2, and in the neural network model optimized by the search operator algorithm the parameters w and b are searched by the following method:
S401, defining the search operators and setting the search conditions:
n search operators are set in the search operator population, the individual state of a search operator being expressed as X = (x_1, x_2, …, x_n), wherein x_i is the state of the i-th search operator, i.e. a free variable of the parameter-optimization problem; the objective function value at a state is denoted Y = f(X); the distance between search operators i and j is d_ij = ||X_i − X_j||; the search radius of a search operator is Visual; the search step length is Step; the crowding factor is δ; at a certain moment t, search operator i explores a position X_j within its search radius Visual, and if the state at X_j is better than the state at X_i, it moves a step further in the direction of X_j, i.e. arrives at the next position; otherwise it continues to search other locations within its field of view; the process is expressed as:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
wherein rand is a random number between 0 and 1.
Before acting, each search operator trial-executes the searching behavior, the aggregation behavior, the rear-end (following) behavior and the random behavior in turn, and then selects and executes the best of these behaviors, so that the search operator population moves to positions closer to the optimal solution:
(1) Search behavior
Assume the state of the i-th search operator at a certain moment is X_i; a state X_j is randomly selected within its search range, satisfying:
X_j = X_i + rand · Visual;
Y_i and Y_j respectively denote the preferred-solution concentrations (objective function values) of states X_i and X_j; if Y_j is better than Y_i, this search operator moves one step in this direction, namely:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
If the forward condition is not met, another state is selected within the search range and the moving condition is checked again; after the preset number of repeated selections, if the moving condition is still not met, the operator moves randomly.
(2) Aggregation behavior
Assume the state of the i-th search operator at a certain moment is X_i; the number of other search operators found within its search range in the current state is n_f, and their central position is X_c; the judgment criterion is:
Y_c / n_f > δ · Y_i;
wherein δ is the crowding factor, and Y_c and Y_i respectively represent the preferred-solution concentrations of the central position and the current position.
If the above formula holds, the centre has a higher preferred-solution concentration and is not overcrowded, and the operator moves one step toward the centre; if not, the searching behavior is executed.
(3) Rear-end collision behavior
Assume the state of the i-th search operator at a certain moment is X_i; it searches the other search operators nearby in the current state and finds the one with the maximum preferred-solution concentration Y_j, whose position is X_j; the judgment criterion is:
Y_j / n_f > δ · Y_i;
If the above formula holds, the position X_j has a denser preferred solution and is not overcrowded, and the search operator moves one step in that direction; if not, the searching behavior is executed.
(4) Random behavior
This behavior is the default of the searching behavior, i.e. a position within the field of view is randomly selected and moved to; the position of the next state is:
X_i(t+1) = X_i(t) + rand · Visual;
By the above method, the optimal solution set of the neural network parameters is obtained.
S5, training a classifier machine learning model;
After feature extraction, the invention provides a machine learning classification model based on an improved random forest to classify hyperspectral data and identify the heavy metal pollution degree.
In order to improve the recognition capability of the random forest for minority-class samples, the invention provides an improved random forest algorithm: in the training stage of the decision trees, the classification performance of each decision tree is evaluated, a higher weight is given to the trees that classify minority-class samples accurately, and the final prediction is obtained by weighted voting, the prediction result of the random forest being defined as:
H(x) = argmax_y Σ_{t=1}^{T} w_t · I(h_t(x) = y);
wherein H(x) represents the prediction result of the random forest, argmax is the maximum-index function, N is the test set, T is the number of decision trees, I(·) is the indicator function, h_t(x) is the prediction result of the t-th decision tree, y represents the category, and w_t is the voting weight of the t-th decision tree; when the prediction result of the decision tree equals y, the indicator function takes the value 1, and 0 otherwise.
When the improved random forest algorithm works, a confusion matrix is first constructed, in which TP denotes a stable sample judged as stable, FN a stable sample judged as unstable, FP an unstable sample judged as stable, and TN an unstable sample judged as unstable;
The harmonic mean F1_t of each decision tree's precision P_t and recall R_t in classifying destabilized samples is used as that tree's voting weight w_t, defined as:
w_t = F1_t = 2 · P_t · R_t / (P_t + R_t);
The larger this value, the better the decision tree's classification performance on minority-class samples; the degree of heavy metal pollution is then identified by the machine learning classification model of the improved random forest.
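The following is a sketch of the F1-weighted voting scheme on top of scikit-learn decision trees; the bootstrap sampling, the use of out-of-bag samples to estimate each tree's F1 score, and the choice of which label denotes the minority (destabilized) class are assumptions of this sketch rather than details given in the patent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

class WeightedRandomForest:
    """Sketch of the improved random forest: each tree votes with a weight equal
    to its F1 score (harmonic mean of precision and recall) on the minority class."""

    def __init__(self, n_trees=50, minority_label=1, seed=0):
        self.n_trees, self.minority_label = n_trees, minority_label
        self.rng = np.random.default_rng(seed)
        self.trees, self.weights = [], []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, n)                     # bootstrap sample
            tree = DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])
            oob = np.setdiff1d(np.arange(n), idx)                # out-of-bag samples
            w = 1.0 if len(oob) == 0 else f1_score(
                y[oob], tree.predict(X[oob]),
                pos_label=self.minority_label, zero_division=0)  # voting weight w_t
            self.trees.append(tree)
            self.weights.append(w)
        return self

    def predict(self, X):
        classes = np.unique(np.concatenate([t.classes_ for t in self.trees]))
        votes = np.zeros((len(X), len(classes)))
        for tree, w in zip(self.trees, self.weights):
            pred = tree.predict(X)
            for c_idx, c in enumerate(classes):
                votes[:, c_idx] += w * (pred == c)               # weighted indicator I(h_t(x)=y)
        return classes[np.argmax(votes, axis=1)]                 # argmax over weighted votes

# Illustrative usage on random stand-in features and binary pollution labels
X_train = np.random.rand(200, 5)
y_train = (np.random.rand(200) > 0.8).astype(int)
model = WeightedRandomForest(n_trees=20).fit(X_train, y_train)
labels = model.predict(np.random.rand(10, 5))
```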
S6, the trained model is applied to identify the degree of soil heavy metal pollution: the model is trained with the marked samples, and after model training is completed the data to be detected and identified are detected and identified.
Claims (2)
1. A training method of a data processing and recognition model based on machine learning, characterized by comprising the following steps:
S1, acquiring soil data, and marking training samples for model training;
S2, reducing the dimension of the data, recombining the high-dimensional characteristic variables to form a group of low-dimensional, linearly independent variables;
S3, sample expansion, namely generating new samples within the minority classes to reduce the class-imbalance phenomenon of the samples;
S4, extracting features from the data of step S3, wherein a neural network model optimized by a search operator algorithm is provided to optimize the parameters of the neurons; the neural network adopted in this step has 2 layers, and in the neural network model optimized by the search operator algorithm the parameters are searched by the search operator algorithm;
In S4, the neural network model optimized by the search operator algorithm is adopted to optimize the neuron parameters w and b, wherein w denotes the weight parameters of the neurons and b denotes the threshold parameters of the neurons;
The number of layers of the neural network adopted in this step is 2, and in the neural network model optimized by the search operator algorithm the parameters w and b are searched by the following method:
S401, defining the search operators and setting the search conditions:
n search operators are set in the search operator population, and the individual state of a search operator can be expressed as X = (x_1, x_2, …, x_n), wherein x_i is the state of the i-th search operator, i.e. a free variable of the parameter-optimization problem; the objective function value at a state is denoted Y = f(X); the distance between search operators i and j is d_ij = ||X_i − X_j||; the search radius of a search operator is Visual; the search step length is Step; the crowding factor is δ; at a certain moment t, search operator i explores a position X_j within its search radius Visual, and if the state at X_j is better than the state at X_i, it moves a step further in the direction of X_j, i.e. arrives at the next position; otherwise it continues to search other locations within its field of view; the process is expressed as:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
wherein rand is a random number between 0 and 1;
Before acting, each search operator trial-executes the searching behavior, the aggregation behavior, the rear-end (following) behavior and the random behavior in turn, and then selects and executes the best of these behaviors, so that the search operator population moves to positions closer to the optimal solution:
(1) Search behavior
Assume the state of the i-th search operator at a certain moment is X_i; a state X_j is randomly selected within its search range, satisfying:
X_j = X_i + rand · Visual;
Y_i and Y_j respectively denote the preferred-solution concentrations (objective function values) of states X_i and X_j; if Y_j is better than Y_i, this search operator moves one step in this direction, namely:
X_i(t+1) = X_i(t) + rand · Step · (X_j − X_i(t)) / ||X_j − X_i(t)||;
If the forward condition is not met, another state is selected within the search range and the moving condition is checked again; after the preset number of repeated selections, if the moving condition is still not met, the operator moves randomly;
(2) Aggregation behavior
Assume the state of the i-th search operator at a certain moment is X_i; the number of other search operators found within its search range in the current state is n_f, and their central position is X_c; the judgment criterion is:
Y_c / n_f > δ · Y_i;
wherein δ is the crowding factor, and Y_c and Y_i respectively represent the preferred-solution concentrations of the central position and the current position;
If the above formula holds, the centre has a higher preferred-solution concentration and is not overcrowded, and the operator moves one step toward the centre; if not, the searching behavior is executed;
(3) Rear-end collision behavior
Assume the state of the i-th search operator at a certain moment is X_i; it searches the other search operators nearby in the current state and finds the one with the maximum preferred-solution concentration Y_j, whose position is X_j; the judgment criterion is:
Y_j / n_f > δ · Y_i;
If the above formula holds, the position X_j has a denser preferred solution and is not overcrowded, and the search operator moves one step in that direction; if not, the searching behavior is executed;
(4) Random behavior
This behavior is the default of the searching behavior, i.e. a position within the field of view is randomly selected and moved to; the position of the next state is:
X_i(t+1) = X_i(t) + rand · Visual;
The optimal solution set of the neural network parameters is acquired through the searching, aggregation, rear-end and random behaviors;
S5, training a classifier machine learning model;
S6, applying the trained model to identify the degree of soil heavy metal pollution, wherein the model is trained with the marked samples, and after model training is completed the data to be detected and identified are detected and identified;
S3, sample expansion adopts a SMOTE sample generation method based on rapid clustering, and the method comprises the following steps:
S301, obtaining the k neighbour samples of each minority-class sample by calculating the Euclidean distances from the minority-class sample to the other minority-class samples, and generating a new minority-class sample by randomly selecting a neighbour and performing linear interpolation between the sample and the selected neighbour sample, the specific process being shown in the following formula:
x_new = x + rand(0,1) × (x_k − x);
wherein x_k represents one sample among the k neighbours, rand(0,1) is a random number, x is the input sample, and x_new is the new sample generated;
S302, rapidly clustering the generated samples; first, the distance between objects is calculated, wherein x_i and x_j are 2 of the samples and d(x_i, x_j) is the distance between them; in order to accelerate the clustering, a distance threshold λ is set, computed from the minimum inter-class distance d_min and the maximum inter-class distance d_max together with a manually set scaling factor ε whose value range is greater than 0 and less than 1;
Within the generated sample set, the samples generated in each category are screened to improve the quality of the generated samples; the screened sample set is then combined with the original data set to obtain an equalized sample set for subsequent feature extraction.
2. The training method of a machine learning-based data processing and recognition model according to claim 1, wherein the dimension reduction method adopted in S2 is a principal component analysis method, comprising the steps of:
S201, data standardization, wherein the standardized value is calculated as:
Z = (x − μ) / σ;
wherein Z represents the standardized value, x is the original value of a variable, and μ and σ are the mean and standard deviation of that variable; through this step all variables are scaled to a comparable range;
s202, calculating a covariance matrix, wherein the covariance matrix is defined as one mathematically Matrix/>Representing the dimensionality of the acquired data, each element in the matrix representing the covariance of the corresponding variable, for a vector with variable/>And the hyperspectral band scene of variable b, the covariance of which is a2 x 2 matrix, as follows:;
Wherein, Representing covariance matrix,/>Representing the covariance of the variable with itself, i.e., variable/>Is a variance of (2); /(I)Representing the variable/>The covariance with the variable b is given by,Representing the variable/>Is a variance of (2);
S203, calculating eigenvectors and eigenvalues:
The eigenvectors and eigenvalues are calculated from the covariance matrix; they are calculated in pairs, i.e. each eigenvector has a corresponding eigenvalue, and the number of eigenvectors to be calculated determines the dimension of the data;
The eigenvectors are used to capture, via the covariance matrix, the directions of maximum variance in the data; since more variance in the hyperspectral data represents more information about the data, the eigenvectors are used to identify and calculate the principal components; the eigenvalues, on the other hand, are merely the scalars associated with each eigenvector, so the eigenvectors and eigenvalues together are used to calculate the principal components of the data;
S204, calculating principal components:
After the eigenvectors and eigenvalues are calculated, the eigenvectors are sorted in descending order of their eigenvalues; the eigenvector corresponding to a higher eigenvalue is more important, the eigenvector with the highest eigenvalue becomes the first principal component, and the selected principal components then form the feature matrix;
s205, reducing the dimension of the data set:
The raw data are rearranged using the final principal components, which represent the largest and most important information of the data set; to replace the original data set with the newly formed principal components, the feature matrix is simply multiplied with the transpose of the original data, and the resulting data are the dimension-reduced data; the prediction result of the random forest is defined as:
H(x) = argmax_y Σ_{t=1}^{T} w_t · I(h_t(x) = y);
wherein H(x) represents the prediction result of the random forest, argmax is the maximum-index function, N is the test set, T is the number of decision trees, I(·) is the indicator function, h_t(x) is the prediction result of the t-th decision tree, y represents the category, and w_t is the voting weight of the t-th decision tree; when the prediction result of the decision tree equals y, the indicator function takes the value 1, and 0 otherwise;
When the improved random forest algorithm works, a confusion matrix is first constructed, in which TP denotes a stable sample judged as stable, FN a stable sample judged as unstable, FP an unstable sample judged as stable, and TN an unstable sample judged as unstable;
The harmonic mean F1_t of each decision tree's precision P_t and recall R_t in classifying destabilized samples is used as that tree's voting weight w_t, defined as:
w_t = F1_t = 2 · P_t · R_t / (P_t + R_t);
The larger this value, the better the decision tree's classification performance on minority-class samples; the degree of heavy metal pollution is identified by the machine learning classification model of the improved random forest.
Priority Applications (1)
- CN202410205784.5A — priority date 2024-02-26, filing date 2024-02-26: Training method of data processing and recognition model based on machine learning (granted as CN117789038B)
Publications (2)
- CN117789038A — published 2024-03-29
- CN117789038B — granted 2024-05-10
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant