CN117789038B - Training method of data processing and recognition model based on machine learning - Google Patents

Training method of data processing and recognition model based on machine learning

Publication number
CN117789038B
Authority
CN
China
Prior art keywords
sample
search
data
samples
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410205784.5A
Other languages
Chinese (zh)
Other versions
CN117789038A (en)
Inventor
张镇
靖婉琦
刘晨甲
王兆信
谢东明
宋光恒
孙德润
徐如明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuju Shandong Intelligent Technology Co ltd
Liaocheng Laike Intelligent Robot Co ltd
Original Assignee
Shuju Shandong Intelligent Technology Co ltd
Liaocheng Laike Intelligent Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuju Shandong Intelligent Technology Co ltd, Liaocheng Laike Intelligent Robot Co ltd
Priority to CN202410205784.5A
Publication of CN117789038A
Application granted
Publication of CN117789038B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a training method for a machine-learning-based data processing and recognition model, belonging to the technical field of data processing. The method first collects soil information and labels training samples for model training, then reduces the dimensionality of the data, and then expands the samples with a SMOTE sample generation method based on fast clustering. Features are extracted from the data by a neural network whose neuron parameters are optimized by a search operator algorithm, which avoids the vanishing-gradient and exploding-gradient problems caused by conventional neural network parameter optimization methods. Finally, the hyperspectral data are classified by a machine learning classification model based on an improved random forest; evaluating the classification performance of each decision tree during the decision tree training stage effectively improves the classification accuracy of the classifier. The algorithm designed by the invention has high detection accuracy, robustness, and generalization capability.

Description

Training method of data processing and recognition model based on machine learning
Technical Field
The invention relates to the technical field of data processing, and in particular to a training method for a machine-learning-based data processing and recognition model.
Background
Heavy metal elements in soil are difficult to degrade in the natural environment, and heavy metal pollution is difficult to treat and highly hazardous. Monitoring the heavy metal pollution of soil in real time therefore makes it possible to stop a polluted area from spreading in time and to prevent the pollution from worsening. The traditional method for identifying soil heavy metal pollution is to collect soil in the field and have a laboratory analyze it chemically to judge the pollution condition in a given area. Although its identification precision is high, it has a long analysis period and consumes considerable manpower and material resources, so it can hardly meet the real-time monitoring requirements of a macroscopic region. The development of hyperspectral remote sensing and related fields has brought a solution for rapidly monitoring soil heavy metal pollution over such regions. Hyperspectral remote sensing is rapid, dynamic, and non-destructive, and applying it to soil heavy metal pollution can meet the requirement of large-scale real-time monitoring. The spatial information of a hyperspectral image reflects the external physical structure of an object, such as texture and geometric features; the spectral information reflects changes in the chemical composition inside the object. Whether a single-point high-resolution spectrum measured in a laboratory or a hyperspectral image obtained by satellite or airborne means, the spectral bands contain a large amount of information about the measured object, but adjacent bands are highly correlated, and the resulting information redundancy increases the difficulty of feature extraction. Moreover, soil is complex in composition and low in heavy metal content, so the response of heavy metals in the soil spectrum is weak. How to effectively extract the important feature information from a large number of complex spectral bands is an important research topic in the hyperspectral field.
The prior-art invention patent with the patent number CN202110651965.7 provides a method for predicting the carbon components of an undisturbed soil profile based on hyperspectral imaging and support vector machine technology. Hyperspectral images of soil profile samples of a preset depth are acquired at each sampling position; each characteristic spectral band of the target sample spectral region corresponding to a soil carbon component type is taken as input and the soil carbon component data of that region as output, and a soil carbon component prediction model for that component type is obtained by training, which in turn enables prediction of the soil profile carbon components of the target region. The overall scheme can rapidly and accurately predict the contents of organic carbon, soluble carbon, readily oxidizable carbon, soil microbial biomass carbon, and other components in the undisturbed soil profile and map their spatial distribution on the profile in fine detail, making up for the shortcomings of traditional laboratory chemical analysis.
The prior-art invention patent with the patent number CN201910717696.2 provides a soil quality monitoring method based on airborne hyperspectral data, comprising the following steps: step 1, acquiring airborne hyperspectral data of the soil quality monitoring area, and collecting field samples from the area to analyze their heavy metal element content; step 2, preprocessing the airborne hyperspectral data; step 3, reconstructing the airborne hyperspectral data spectrum to eliminate the radiation distortion of the ground-object spectrum caused by various atmospheric components; step 4, extracting the spectra of the sampling points from the airborne hyperspectral remote sensing image; step 5, performing spectral transformation and correlation coefficient analysis to obtain the correlation coefficients between soil content and soil spectral parameters and to find the sensitive bands of the characteristic spectrum; and step 6, establishing an inversion model for soil quality monitoring from the airborne hyperspectral data to obtain soil nutrient and metal element content data. When applied, the method can accurately obtain basic soil data over a large area, reduce the workload, shorten the soil quality monitoring period, and reduce cost.
The prior-art invention patent with the patent number CN201510119440.3 provides a hyperspectral method for identifying soil attributes and relates to the technical field of soil exploration. The method comprises the following steps: S1, acquiring soil hyperspectral images at different times based on remote sensing satellite data; S2, after image preprocessing, obtaining bare soil through supervised classification, extracting the surface reflectance of the bare soil, and establishing a bare-soil surface reflectance inversion model from it; S3, designing an indoor soil erosion test and acquiring soil erosion data corresponding to the acquisition times of the soil hyperspectral images; S4, obtaining the soil classification and calculating the soil K value from the soil erosion data obtained in step S3; and S5, establishing a hyperspectral model of the soil attributes affecting the erodibility K from the soil K value and the spectral data in the surface reflectance inversion model. That invention solves the problem that hyperspectral remote sensing technology could not be used to measure soil erodibility.
Although the above prior art can identify the degree of soil pollution, the existing methods still need further improvement in model design and data processing, and in particular in the detection precision, robustness, and generalization capability of the model.
Disclosure of Invention
To address the above technical problems, the invention adopts the following technical scheme: a training method for a machine-learning-based data processing and recognition model, comprising the following steps:
S1, acquiring soil data and labeling training samples for model training;
S2, reducing the dimensionality of the data by recombining highly correlated high-dimensional feature variables into a set of low-dimensional, linearly independent variables;
S3, sample expansion, namely generating new samples in the minority classes to reduce sample class imbalance;
S4, extracting features from the data of step S3 with a neural network model optimized by a search operator algorithm, which optimizes the parameters of the neurons; the neural network used in this step has 2 layers, and its parameters are searched by the search operator algorithm;
S5, training the classifier machine learning model;
S6, applying the trained model to identify the degree of soil heavy metal pollution: the model is trained with the labeled samples, and after training is completed it detects and identifies the data to be detected and identified.
Further, the dimension reduction method adopted in S2 is principal component analysis, which comprises the following steps:
S201, data standardization, calculated as:
$$Z = \frac{x - \mu}{\sigma}$$
where $Z$ represents the standardized value, $x$ is the original value, and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding variable; this step scales all variables to a common range;
S202, calculating the covariance matrix, which is defined mathematically as a $d \times d$ matrix, where $d$ represents the dimensionality of the acquired data and each element of the matrix is the covariance of the corresponding pair of variables; for a hyperspectral band scene with variables $a$ and $b$, the covariance matrix is a 2 × 2 matrix:
$$C = \begin{bmatrix} \mathrm{Cov}(a,a) & \mathrm{Cov}(a,b) \\ \mathrm{Cov}(b,a) & \mathrm{Cov}(b,b) \end{bmatrix}$$
where $C$ represents the covariance matrix; $\mathrm{Cov}(a,a)$ represents the covariance of variable $a$ with itself, i.e., the variance of $a$; $\mathrm{Cov}(a,b)$ represents the covariance of variable $a$ with variable $b$; and $\mathrm{Cov}(b,b)$ represents the variance of variable $b$;
S203, calculating the eigenvectors and eigenvalues:
The eigenvectors and eigenvalues are calculated from the covariance matrix. They are calculated in pairs, that is, each eigenvector has a corresponding eigenvalue, and the number of eigenvectors to be calculated is determined by the dimensionality of the data;
The eigenvectors of the covariance matrix indicate the directions of maximum variance in the data. Since more variance in the hyperspectral data means more information about the data, the eigenvectors are used to identify and calculate the principal components; each eigenvalue, in turn, is simply the scalar associated with its eigenvector. The eigenvectors and eigenvalues are therefore used together to calculate the principal components of the data;
S204, calculating the principal components:
After the eigenvectors and eigenvalues are calculated, the eigenvectors are sorted in descending order of eigenvalue; an eigenvector with a higher eigenvalue is more important, and the eigenvector with the highest eigenvalue is taken as the first principal component. The selected principal components then form a feature matrix;
S205, reducing the dimensionality of the data set:
The original data are re-expressed with the final principal components, which represent the largest and most important information of the data set. To replace the original data set with the newly formed principal components, the transpose of the feature matrix is multiplied by the transpose of the original data, and the result is used as the dimension-reduced data.
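Steps S201 to S205 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the patent's implementation: the function name `pca_reduce`, the synthetic three-band data, and the guard for constant bands are assumptions introduced here.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Reduce X (n_samples x n_bands) to n_components principal components."""
    X = np.asarray(X, dtype=float)
    # S201: z-score standardization per band, Z = (x - mean) / std
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant bands unscaled
    Z = (X - X.mean(axis=0)) / std
    # S202: d x d covariance matrix of the standardized bands
    C = np.cov(Z, rowvar=False)
    # S203: eigenvalue/eigenvector pairs; eigh is meant for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(C)
    # S204: sort by descending eigenvalue and keep the leading eigenvectors
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]  # feature matrix, d x n_components
    # S205: project the standardized data onto the retained components
    return Z @ W, eigvals[order]

# Synthetic example data (hypothetical): three bands, two of them correlated
rng = np.random.default_rng(0)
bands = rng.normal(size=(200, 3))
bands[:, 1] = 0.9 * bands[:, 0] + 0.1 * bands[:, 1]
scores, variances = pca_reduce(bands, n_components=2)
print(scores.shape, variances)
```

The projection `Z @ W` is the row-wise equivalent of multiplying the transposed feature matrix by the transposed data; the retained components come out mutually uncorrelated, which is the redundancy removal the step aims at.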
Further, in S3, sample expansion uses a SMOTE sample generation method based on fast clustering, which comprises the following steps:
S301, for each minority class sample $x_i$, obtaining its $k$ nearest neighbors by calculating the Euclidean distance from the sample to the other minority class samples, randomly selecting one of these neighbors, and performing linear interpolation between the sample and the selected neighbor to generate a new minority class sample, as follows:
$$x_{new} = x_i + \mathrm{rand}(0,1) \times (\hat{x}_i - x_i)$$
where $\hat{x}_i$ represents one sample among the $k$ nearest neighbors of $x_i$, $\mathrm{rand}(0,1)$ is a random number, $x_i$ is the input sample, and $x_{new}$ is the newly generated sample;
S302, rapidly clustering the generated samples: first, the distance between each pair of generated samples is calculated; then, to accelerate clustering, a threshold is set from a manually chosen scaling factor, typically greater than 0 and less than 1, and the minimum and maximum distances between categories;
Among the generated samples, the samples generated in each category are screened against the screening condition to improve the quality of the generated samples. The samples that pass form the screened sample set, which is combined with the original data set to obtain a balanced sample set for subsequent feature extraction.
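The interpolation of step S301 can be illustrated with a short pure-Python sketch. It covers only the neighbor selection and linear interpolation; the clustering threshold and screening condition of S302 are not reproduced, and the function name `smote`, the default `k=3`, and the sample values are assumptions made for illustration.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by Euclidean distance
        others = [s for s in minority if s is not x]
        neighbours = sorted(others, key=lambda s: math.dist(x, s))[:k]
        x_nn = rng.choice(neighbours)
        lam = rng.random()  # random number in [0, 1)
        # Linear interpolation: x_new = x + lam * (x_nn - x)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, x_nn)])
    return synthetic

# Hypothetical minority-class samples with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_samples = smote(minority, n_new=5)
print(new_samples)
```

Because each new sample lies on the segment between two existing minority samples, it always stays inside the region already spanned by the minority class.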
Further, in S4, a neural network model optimized by a search operator algorithm is used to optimize the parameter $\omega$ and the parameter $\theta$ of the neurons, where $\omega$ denotes the weight parameters of the neurons and $\theta$ denotes the threshold parameters of the neurons;
The neural network used in this step has 2 layers, and the method by which the search operator algorithm searches for the parameters $\omega$ and $\theta$ comprises the following steps:
S401, defining the search operators and setting the search conditions:
The search operator population contains $n$ search operators, and the individual state of the search operators is expressed as $X = (x_1, x_2, \dots, x_n)$, where $x_i$ ($i = 1, 2, \dots, n$) are the states of the search operators, namely the free variables of the parameter optimization problem; the objective function is denoted $Y = f(X)$; the distance between search operators $X_i$ and $X_j$ is $d_{ij} = \lVert X_i - X_j \rVert$; the search radius of a search operator is Visual; the search step length is Step; and the crowding factor is $\delta$. At a given moment, a search operator in state $X_i$ searches an arbitrary position $X_v$ within the search radius Visual. If the state at $X_v$ is better than that at $X_i$, it advances one step toward $X_v$ and arrives at position $X_{next}$; otherwise it continues to search other positions within the field of view. The process is expressed as:
$$X_v = X_i + \mathrm{Visual} \times \mathrm{rand}()$$
$$X_{next} = X_i + \frac{X_v - X_i}{\lVert X_v - X_i \rVert} \times \mathrm{Step} \times \mathrm{rand}()$$
where $\mathrm{rand}()$ is a random number between 0 and 1;
Before acting, each search operator evaluates the search behavior, the clustering behavior, the following (rear-end) behavior, and the random behavior in turn and then executes the best one, so that the search operator population reaches a position closer to the optimal solution:
(1) Search behavior
Assume the state of the $i$-th search operator at a given moment is $X_i$, and a state $X_j$ is selected at random within its search range such that:
$$X_j = X_i + \mathrm{Visual} \times \mathrm{rand}()$$
$Y_i$ and $Y_j$ denote the preferred-solution concentrations, i.e., the objective values, in states $X_i$ and $X_j$, respectively. If $Y_i < Y_j$, the search operator moves one step in that direction, namely:
$$X_i^{t+1} = X_i^{t} + \frac{X_j - X_i^{t}}{\lVert X_j - X_i^{t} \rVert} \times \mathrm{Step} \times \mathrm{rand}()$$
If the forward condition is not met, another state is selected within the search range and the moving condition is checked again; if the condition is still not met after the set number of attempts, the search operator moves randomly;
(2) Clustering behavior
Assume the state of the $i$-th search operator at a given moment is $X_i$, the number of other search operators found in the current state is $n_f$, and their center position is $X_c$. The criterion is:
$$\frac{Y_c}{n_f} > \delta \, Y_i$$
where $\delta$ is the crowding factor, and $Y_c$ and $Y_i$ denote the preferred-solution concentrations at the center position and the current position, respectively;
If the criterion holds, the center has a higher preferred-solution concentration and is not crowded, so the search operator moves one step toward the center; otherwise, it executes the search behavior;
(3) Following (rear-end) behavior
Assume the state of the $i$-th search operator at a given moment is $X_i$. It searches for other search operators nearby in the current state and finds, among these peers, the one with the maximum preferred-solution concentration $Y_{max}$, whose position is $X_{max}$. The criterion is:
$$\frac{Y_{max}}{n_f} > \delta \, Y_i$$
If the criterion holds, the position $X_{max}$ has a higher preferred-solution concentration and is less crowded, so the search operator $X_i$ moves one step in that direction; otherwise, it executes the search behavior;
(4) Random behavior
This behavior is the default of the search behavior: a position within the field of view is selected at random and the search operator moves to it. The position of the next state is:
$$X_i^{t+1} = X_i^{t} + \mathrm{Visual} \times \mathrm{rand}()$$
In this way, the optimal solution set of the neural network parameters is obtained.
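The four behaviors can be sketched as a small population loop. The sketch below is a simplified illustration under stated assumptions, not the patented procedure: the function name `search_operator_optimize`, the default parameter values, the symmetric random draw in `random_move`, and the toy objective are all hypothetical, and the objective is assumed to be positive so that the crowding criterion is meaningful.

```python
import math
import random

def search_operator_optimize(f, population, visual=1.0, step=0.3, delta=0.6,
                             tries=5, iters=50, seed=0):
    """Maximise a positive objective f; returns the best state seen."""
    rng = random.Random(seed)
    pop = [list(x) for x in population]

    def toward(x, target):
        # One step from x toward target: Step * rand() along the unit direction.
        d = math.dist(x, target)
        if d == 0:
            return list(x)
        r = rng.random()
        return [a + (b - a) / d * step * r for a, b in zip(x, target)]

    def random_move(x):
        # Random behavior; the symmetric draw in [-1, 1] is an assumption.
        return [a + visual * rng.uniform(-1, 1) for a in x]

    def search(x):
        # Search behavior: probe up to `tries` random states in the radius.
        for _ in range(tries):
            cand = random_move(x)
            if f(cand) > f(x):
                return toward(x, cand)
        return random_move(x)

    def cluster(x, peers):
        # Clustering behavior: go to the peers' center if Y_c / n_f > delta * Y_i.
        if peers:
            centre = [sum(col) / len(peers) for col in zip(*peers)]
            if f(centre) / len(peers) > delta * f(x):
                return toward(x, centre)
        return search(x)

    def follow(x, peers):
        # Following behavior: go to the best peer if Y_max / n_f > delta * Y_i.
        if peers:
            best_peer = max(peers, key=f)
            if f(best_peer) / len(peers) > delta * f(x):
                return toward(x, best_peer)
        return search(x)

    best = max(pop, key=f)
    for _ in range(iters):
        new_pop = []
        for i, x in enumerate(pop):
            peers = [y for j, y in enumerate(pop)
                     if j != i and math.dist(x, y) < visual]
            candidates = [search(x), cluster(x, peers), follow(x, peers)]
            new_pop.append(max(candidates, key=f))
        pop = new_pop
        best = max(pop + [best], key=f)
    return best

def toy_objective(x):
    # Hypothetical positive objective with its maximum at (1, -2).
    return 1.0 / (1.0 + (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)

start_rng = random.Random(1)
start = [[start_rng.uniform(-3, 3), start_rng.uniform(-3, 3)] for _ in range(10)]
best_state = search_operator_optimize(toy_objective, start)
print(best_state, toy_objective(best_state))
```

Because the best state is tracked across iterations, the returned state is never worse than the best member of the starting population. No gradient is computed anywhere, which is why this kind of search sidesteps vanishing and exploding gradients.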
Further, in S5, the hyperspectral data are classified by a machine learning classification model based on an improved random forest to identify the degree of heavy metal pollution. The improved random forest algorithm is as follows:
In the decision tree training stage, the classification performance of each decision tree is evaluated, a higher weight is given to decision trees that accurately classify minority class samples, and the final prediction is obtained by weighted voting. The prediction result of the random forest is defined as:
$$H(x) = \arg\max_{c} \sum_{t=1}^{T} w_t \, I\big(h_t(x) = c\big), \quad x \in N$$
where $H(x)$ represents the prediction result of the random forest, $\arg\max$ is the maximum index function, $N$ is the test set, $T$ is the number of decision trees, $I(\cdot)$ is the indicator function, $h_t(x)$ is the prediction result of the $t$-th decision tree, $c$ represents a category, and $w_t$ is the voting weight of the $t$-th decision tree; the indicator function $I(\cdot)$ takes the value 1 when $h_t(x) = c$ holds and 0 otherwise;
When the improved random forest algorithm runs, a confusion matrix is first constructed, in which TP denotes a stable sample judged as stable, FN a stable sample judged as unstable, FP an unstable sample judged as stable, and TN an unstable sample judged as unstable;
The harmonic mean $F_t$ of the precision $P_t$ and the recall $R_t$ with which each decision tree classifies unstable samples is taken as the weight of that tree, so the voting weight $w_t$ of each tree is defined as:
$$w_t = F_t = \frac{2 P_t R_t}{P_t + R_t}$$
The larger $w_t$ is, the better the classification performance of the decision tree on minority class samples. The degree of heavy metal pollution is identified by the machine learning classification model based on the improved random forest.
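The weighted vote can be illustrated with a small worked example. The sketch assumes that precision and recall for the unstable (minority) class follow from the confusion-matrix convention above as TN / (TN + FN) and TN / (TN + FP); the function names, class labels, and counts are hypothetical.

```python
def f_weight(tn, fn, fp):
    """Harmonic mean of precision and recall for the unstable (minority) class."""
    precision = tn / (tn + fn) if tn + fn else 0.0  # predicted unstable = TN + FN
    recall = tn / (tn + fp) if tn + fp else 0.0     # actual unstable = TN + FP
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_vote(tree_predictions, weights, classes):
    """Return the class c maximizing the sum of w_t over trees predicting c."""
    totals = {c: 0.0 for c in classes}
    for pred, w in zip(tree_predictions, weights):
        totals[pred] += w
    return max(classes, key=lambda c: totals[c])

# Hypothetical per-tree confusion counts on the minority class
weights = [f_weight(tn=8, fn=2, fp=2),
           f_weight(tn=3, fn=6, fp=7),
           f_weight(tn=4, fn=6, fp=6)]
label = weighted_vote(["polluted", "clean", "clean"], weights, ["clean", "polluted"])
print(weights, label)
```

Here two of the three trees vote "clean", but the single tree that handles minority samples well carries a weight of 0.8 against a combined weight of roughly 0.72, so the weighted vote returns "polluted". An unweighted majority vote would have returned "clean".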
Compared with the prior art, the invention has the following beneficial effects: the algorithm reduces the dimensionality of the high-dimensional original data and expands the samples with the fast-clustering-based SMOTE sample generation method, thereby reducing sample class imbalance; it obtains the optimal solution set of the neural network parameters with a neural network model optimized by the search operator algorithm; and with the improved random forest algorithm, it evaluates the classification performance of each decision tree during the training stage, gives a higher weight to decision trees that accurately classify minority class samples, and obtains the final prediction by weighted voting, which improves the classification performance of the model. The resulting algorithm model has high detection accuracy, robustness, and generalization capability.
Drawings
Fig. 1 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
Embodiment: referring to Fig. 1, a training method for a machine-learning-based data processing and recognition model comprises the following steps:
S1, acquiring soil data and labeling training samples for model training; the collected data come from hyperspectral remote sensing images or sensor data, and this embodiment takes hyperspectral remote sensing images as an example.
S2, performing dimension reduction on the high-dimensional hyperspectral data:
The original hyperspectral data have many bands, high dimensionality, a large data volume, and data redundancy. To reduce the impact of the "curse of dimensionality" while losing as little information as possible when the dimensionality is reduced, the proposed soil heavy metal pollution identification and classification framework first constrains the spectral dimension of the original hyperspectral remote sensing image; dimension reduction and the removal of redundant information are achieved by retaining several principal components.
The dimension reduction method used in this step is principal component analysis, which projects the high-dimensional hyperspectral remote sensing image into a low-dimensional subspace and recombines highly correlated high-dimensional feature variables into a set of low-dimensional, linearly independent variables. When principal component analysis processes the original hyperspectral remote sensing image, it mainly comprises the following steps:
S201, data standardization. Standardization brings all variables and values in the hyperspectral data into a similar range; without it, the results can be biased. The standardized value is calculated as:
$$Z = \frac{x - \mu}{\sigma}$$
where $Z$ represents the standardized value, $x$ is the original value, and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding variable; this step scales all variables to a common range.
S202, calculating the covariance matrix. Principal component analysis helps identify the correlation and dependence among the elements of the hyperspectral data set, and the covariance matrix expresses the correlation between the different variables in the data set. The covariance matrix is defined mathematically as a $d \times d$ matrix, where, for a hyperspectral remote sensing image, $d$ represents the dimensionality of the image and each element of the matrix is the covariance of the corresponding pair of variables. For a hyperspectral band scene with variables $a$ and $b$, the covariance matrix is a 2 × 2 matrix:
$$C = \begin{bmatrix} \mathrm{Cov}(a,a) & \mathrm{Cov}(a,b) \\ \mathrm{Cov}(b,a) & \mathrm{Cov}(b,b) \end{bmatrix}$$
where $C$ represents the covariance matrix; $\mathrm{Cov}(a,a)$ represents the covariance of variable $a$ with itself, i.e., the variance of $a$; $\mathrm{Cov}(a,b)$ represents the covariance of variable $a$ with variable $b$; and $\mathrm{Cov}(b,b)$ represents the variance of variable $b$. In the covariance matrix, a covariance value indicates the degree to which two variables depend on each other: a negative value indicates that the variables vary in opposite directions, and a positive value that they vary in the same direction.
S203, calculating the eigenvectors and eigenvalues:
The eigenvectors and eigenvalues are calculated from the covariance matrix. The principal components are obtained by transforming the original vectors and re-expressing them with a subset of the transformed vectors; in extracting the principal components, most of the information originally scattered across the original vectors is compressed and re-integrated. For example, if the first 5 dimensions of the hyperspectral data are retained, 5 principal components are calculated, such that the 1st principal component stores the maximum possible information, the 2nd stores the maximum of the remaining information, and so on. The eigenvectors and eigenvalues are calculated in pairs, that is, each eigenvector has a corresponding eigenvalue, and the number of eigenvectors to be calculated is determined by the dimensionality of the data.
The hyperspectral remote sensing image is a 3-dimensional data set, so the number of eigenvectors and eigenvalues is 3. The eigenvectors of the covariance matrix indicate the directions of maximum variance in the data; since more variance in the hyperspectral data means more information about the data, the eigenvectors are used to identify and calculate the principal components. Each eigenvalue, in turn, is simply the scalar associated with its eigenvector. The eigenvectors and eigenvalues are therefore used together to calculate the principal components of the hyperspectral data.
S204, calculating the principal components:
After the eigenvectors and eigenvalues are calculated, the eigenvectors are sorted in descending order of eigenvalue; an eigenvector with a higher eigenvalue is more important, the eigenvector with the highest eigenvalue is taken as the first principal component, and so on. Principal components of lower importance can then be deleted to reduce the size of the data, and the selected principal components form a feature matrix that contains all the important data variables carrying the most information.
S205, reducing the dimensionality of the data set:
The original data are re-expressed with the final principal components, which represent the largest and most important information of the data set. To replace the original data set with the newly formed principal components, the transpose of the feature matrix is multiplied by the transpose of the original data, and the result is used as the dimension-reduced data.
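The sign interpretation of the covariance in step S202 can be checked on two small bands. The sketch below builds the 2 × 2 covariance matrix by hand; the band values and the helper name `covariance` are made up for illustration, and the sample covariance (division by n - 1) is an assumption.

```python
def covariance(u, v):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

# Hypothetical reflectance values for two bands at four pixels
band_a = [0.10, 0.20, 0.30, 0.40]
band_b = [0.80, 0.60, 0.40, 0.20]  # falls as band_a rises

# 2 x 2 covariance matrix: variances on the diagonal, covariance off it
C = [[covariance(band_a, band_a), covariance(band_a, band_b)],
     [covariance(band_b, band_a), covariance(band_b, band_b)]]
print(C)
```

The off-diagonal entries are negative because the second band decreases whenever the first increases, and the matrix is symmetric because the covariance does not depend on the order of its arguments.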
S3, sample expansion:
Data acquisition often suffers from sample class imbalance, that is, the numbers of samples in different classes differ greatly, and classes with few samples are difficult to distinguish effectively during classification. The invention therefore provides a SMOTE sample generation method based on fast clustering, which generates new samples in the minority classes and reduces sample class imbalance.
The SMOTE sample generation method based on fast clustering comprises the following steps:
S301, for each minority class sample $x_i$, obtaining its $k$ nearest neighbors by calculating the Euclidean distance from the sample to the other minority class samples, randomly selecting one of these neighbors, and performing linear interpolation between the sample and the selected neighbor to generate a new minority class sample, as follows:
$$x_{new} = x_i + \mathrm{rand}(0,1) \times (\hat{x}_i - x_i)$$
where $\hat{x}_i$ represents one sample among the $k$ nearest neighbors of $x_i$, $\mathrm{rand}(0,1)$ is a random number, $x_i$ is the input sample, and $x_{new}$ is the newly generated sample;
S302, rapidly clustering the generated samples: first, the distance between each pair of generated samples is calculated; then, to accelerate clustering, a threshold is set from a manually chosen scaling factor, typically greater than 0 and less than 1, and the minimum and maximum distances between categories;
Among the generated samples, the samples generated in each category are screened against the screening condition to improve the quality of the generated samples. The samples that pass form the screened sample set, which is combined with the original data set to obtain a balanced sample set for subsequent feature extraction.
S4, extracting characteristics of the hyperspectral data:
The data obtained through the steps are subjected to feature extraction, the hyperspectral data is subjected to feature extraction by adopting a neural network, and the hyperspectral data is different from a traditional neural network model, the neural network optimization algorithm is improved in the step, and the parameters of the neural network model to the neurons based on the search operator algorithm optimization are provided Sum parametersOptimization is performed in which/>Weight parameters of neuronsThreshold parameters of neurons; the number of layers of the neural network adopted in the step is 2, and parameters/>, based on a search operator algorithm in a neural network model optimized by the search operator algorithmSum parameter/>The searching method of (2) comprises the following steps:
s401, defining a search operator, and setting search conditions:
setting n search operators in the search operator population, wherein the individual states of the search operators are expressed as follows Wherein/>For/>The states of the search operators, namely free variables in the parameter optimizing problem; for objective function/>A representation; search operator/>、/>The distance between them is/>; The searching radius of the searching operator is Visual; the Step length of searching is Step; the crowding factor is/>; At a certain moment/>Search operatorsSearching for any position/>, within a search radius VisualIf/>Position status is better than/>Location, then go to/>Further forward in the direction of position, i.e. arrival/>A location; otherwise, continuing to search for other locations within the field of view, the process is expressed as:
In the method, in the process of the invention, Is a random number between 0 and 1.
Before the action, each search operator sequentially executes the searching action, the clustering action, the rear-end collision action and the random action, and then selects the optimal action to execute, so that the search operator population can reach a position closer to the optimal solution:
(1) Search behavior
Assume that the state of the i-th search operator at a given moment is $X_i$, and that a state $X_j$ is selected at random within its search range, satisfying the following formula:
$Y_i$ and $Y_j$ denote the preferred-solution concentration in states $X_i$ and $X_j$, respectively. If $Y_i < Y_j$, the search operator moves one step in that direction, namely:
If the advance condition is not met, another state is selected within the search range and the condition is checked again. If the condition is still not met after the preset number of retries, the search operator moves randomly.
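The search behavior described above can be sketched as follows. This is a minimal illustration assuming a maximization objective and the usual swarm-search move; the function name `search_behavior` and the parameters `try_number` and `rng` are hypothetical and do not come from the patent.

```python
import numpy as np

def search_behavior(x_i, objective, visual, step, try_number, rng):
    """One search-behavior move for a single search operator (maximization)."""
    y_i = objective(x_i)
    for _ in range(try_number):
        # Pick a random candidate state within the search radius (assumed box-shaped).
        x_j = x_i + visual * rng.uniform(-1.0, 1.0, size=x_i.shape)
        if objective(x_j) > y_i:
            direction = x_j - x_i
            norm = np.linalg.norm(direction)
            if norm > 0:
                # Advance a random fraction of one step toward the better state.
                return x_i + direction / norm * step * rng.random()
    # No better state found within the allowed retries: move randomly.
    return x_i + visual * rng.uniform(-1.0, 1.0, size=x_i.shape)
```

A call such as `search_behavior(np.array([1.0, 1.0]), f, 0.5, 0.1, 50, np.random.default_rng(0))` returns the operator's next position for an objective `f`.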
(2) Aggregation behavior
Assume that the state of the i-th search operator at a given moment is $X_i$, that the number of other search operators found within its search range in the current state is n, and that their central position is $X_c$. The judgment criterion is:
where $\delta$ is the crowding factor, and $Y_c$ and $Y_i$ denote the preferred-solution concentration at the central position and at the current position, respectively.
If the above formula holds, the center has a higher preferred-solution concentration and is not crowded, so the search operator moves one step toward the center; otherwise, the search behavior is executed.
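A possible reading of the aggregation test is sketched below. The criterion used here, center concentration divided by neighbor count compared with the crowding factor times the current concentration, is an assumption borrowed from the usual swarm-search formulation; the names `aggregation_target` and `n_f` are hypothetical.

```python
import numpy as np

def aggregation_target(x_i, population, objective, visual, delta):
    """Return the neighbor center if moving there looks worthwhile, else None."""
    dists = np.linalg.norm(population - x_i, axis=1)
    # Neighbors: operators within the search radius, excluding x_i's own position.
    mask = (dists > 0) & (dists < visual)
    n_f = int(mask.sum())
    if n_f == 0:
        return None
    x_c = population[mask].mean(axis=0)
    # Assumed criterion: Y_c / n_f > delta * Y_i (center is better and not crowded).
    if objective(x_c) / n_f > delta * objective(x_i):
        return x_c
    return None
```

When the function returns `None`, the caller would fall back to the search behavior, as the text specifies.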
(3) Following behavior
Assume that the state of the i-th search operator at a given moment is $X_i$. It searches for other search operators nearby in the current state and finds, among these peers, the one with the maximum preferred-solution concentration $Y_j$, whose position is $X_j$. The judgment criterion is:
If the above formula holds, the position of the other search operator $X_j$ has a higher preferred-solution concentration and is less crowded, so search operator $X_i$ moves one step in that direction; otherwise, the search behavior is executed.
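The following behavior can be sketched in the same style. The criterion (best neighbor's concentration divided by neighbor count compared with the crowding factor times the current concentration) is again an assumption based on the usual swarm-search formulation, and `following_target` is a hypothetical name.

```python
import numpy as np

def following_target(x_i, population, objective, visual, delta):
    """Return the best neighbor's position if following it looks worthwhile, else None."""
    dists = np.linalg.norm(population - x_i, axis=1)
    # Neighbors: operators within the search radius, excluding x_i's own position.
    mask = (dists > 0) & (dists < visual)
    n_f = int(mask.sum())
    if n_f == 0:
        return None
    neighbors = population[mask]
    values = np.array([objective(x) for x in neighbors])
    best = int(np.argmax(values))
    # Assumed criterion: Y_j / n_f > delta * Y_i (best peer is better and not crowded).
    if values[best] / n_f > delta * objective(x_i):
        return neighbors[best]
    return None
```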
(4) Random behavior
This behavior is the default of the search behavior: a position within the field of view is selected at random and the search operator moves to it. The position of the next state is:
The optimal solution set of the neural network parameters is obtained by the above method.
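To connect the search to the network, one option is to treat each search operator's state as a flat vector holding all weights and thresholds of the 2-layer network. The sketch below illustrates that encoding only; the layer sizes, the tanh activation, the convention of subtracting thresholds, and the names `unpack` and `forward` are assumptions rather than details given in the patent.

```python
import numpy as np

def unpack(state, n_in, n_hidden, n_out):
    """Split a flat state vector into weights and thresholds of a 2-layer network."""
    shapes = [(n_in, n_hidden), (n_hidden,), (n_hidden, n_out), (n_out,)]
    parts, pos = [], 0
    for shape in shapes:
        count = int(np.prod(shape))
        parts.append(state[pos:pos + count].reshape(shape))
        pos += count
    return parts

def forward(state, x, n_in, n_hidden, n_out):
    """Forward pass using the parameters encoded in one search operator's state."""
    w1, b1, w2, b2 = unpack(state, n_in, n_hidden, n_out)
    hidden = np.tanh(x @ w1 - b1)  # thresholds are subtracted (assumed convention)
    return hidden @ w2 - b2
```

An objective for the search could then score `forward(state, x, ...)` against labeled samples, so that a better state corresponds to a better-fitting network.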
S5, training a classifier machine learning model;
After feature extraction, the invention provides a machine learning classification model based on an improved random forest to classify the hyperspectral data and identify the degree of heavy metal pollution.
To improve the ability of the random forest to recognize minority-class samples, the invention provides an improved random forest algorithm. In the training stage, the classification performance of each decision tree is evaluated, a higher weight is given to decision trees that classify minority-class samples accurately, and the final prediction is obtained by weighted voting. The prediction of the random forest is defined as:
where $H(x)$ denotes the prediction of the random forest, $\arg\max$ denotes the maximum-index function, N is the test set, T is the number of decision trees, $I(\cdot)$ is the indicator function, $h_t(x)$ is the prediction of the t-th decision tree, $y$ denotes the category, and $w_t$ is the voting weight of the t-th decision tree; the indicator function $I(\cdot)$ takes the value 1 when the prediction of the decision tree is true, and 0 otherwise.
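The weighted vote described here can be illustrated with a short sketch: each class accumulates the weights of the trees that predict it, and the class with the largest total wins. The function name `weighted_vote` and the array layout are illustrative assumptions.

```python
import numpy as np

def weighted_vote(tree_predictions, tree_weights, classes):
    """Weighted majority vote.

    tree_predictions: array of shape (T, n_samples), one row per decision tree.
    tree_weights: array of shape (T,), the voting weight of each tree.
    """
    tree_predictions = np.asarray(tree_predictions)
    tree_weights = np.asarray(tree_weights, dtype=float)
    # scores[c, s] = sum over trees of weight * indicator(tree predicts class c for sample s)
    scores = np.array([
        (tree_weights[:, None] * (tree_predictions == c)).sum(axis=0)
        for c in classes
    ])
    return np.asarray(classes)[scores.argmax(axis=0)]
```

With three trees weighted 0.2, 0.2 and 0.9, the single heavily weighted tree outvotes the other two whenever they disagree with it.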
When the improved random forest algorithm runs, a confusion matrix is constructed first. In the confusion matrix, TP means that a stable sample is judged to be stable, FN means that a stable sample is judged to be unstable, FP means that an unstable sample is judged to be stable, and TN means that an unstable sample is judged to be unstable;
The harmonic mean $F_1$ of the precision $P$ and the recall $R$ with which each decision tree classifies unstable samples is used as the voting weight $w_t$ of that tree, defined as follows:
The larger this weight is, the better the decision tree classifies minority-class samples. The degree of heavy metal pollution is then identified by the machine learning classification model based on the improved random forest.
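A minimal sketch of the per-tree weight follows, assuming the weight is the standard F1 score computed for the unstable (minority) class under the confusion-matrix convention stated above, where TN counts unstable samples judged unstable. The function name `f1_weight` is hypothetical.

```python
def f1_weight(tp, fn, fp, tn):
    """F1 score of the unstable (minority) class, used as a tree's voting weight.

    Convention from the text: FN = stable judged unstable, FP = unstable judged
    stable, TN = unstable judged unstable. tp is unused but kept for completeness.
    """
    if tn == 0:
        return 0.0
    precision = tn / (tn + fn)  # of samples judged unstable, the share truly unstable
    recall = tn / (tn + fp)     # of truly unstable samples, the share judged unstable
    return 2 * precision * recall / (precision + recall)
```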
S6, applying the trained model to identify the degree of soil heavy metal pollution: the model is trained with labeled samples, and after training is complete, the data to be examined are detected and identified.

Claims (2)

1. A training method for a machine-learning-based data processing and recognition model, characterized by comprising the following steps:
S1, acquiring soil data and labeling training samples for model training;
S2, reducing the dimensionality of the data by recombining the high-dimensional feature variables into a set of low-dimensional, linearly independent variables;
S3, sample expansion: generating new samples in the minority-class categories to reduce class imbalance among the samples;
S4, extracting features from the data of step S3, wherein a neural network model optimized by a search operator algorithm is provided to optimize the parameters of the neurons, the neural network adopted in this step has 2 layers, and the parameters are searched by the search operator algorithm;
in S4, the neural network model optimized by the search operator algorithm is used to optimize the weight parameters and the threshold parameters of the neurons;
the neural network adopted in this step has 2 layers, and the method by which the search operator algorithm searches for the weight parameters and the threshold parameters comprises the following steps:
S401, defining the search operators and setting the search conditions:
let the search operator population contain n search operators; the state of an individual search operator can be expressed as $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ ($i = 1, 2, \ldots, n$) are the state components of the search operator, namely the free variables of the parameter optimization problem; the objective function is denoted $Y = f(X)$; the distance between search operators $X_i$ and $X_j$ is $d_{ij} = \lVert X_i - X_j \rVert$; the search radius of a search operator is Visual; the search step length is Step; the crowding factor is $\delta$; at a given moment, search operator $X_i$ examines an arbitrary position $X_j$ within its search radius Visual; if the state at $X_j$ is better than the state at $X_i$, the search operator advances one step toward $X_j$, arriving at position $X_{next}$; otherwise, it continues to examine other positions within its field of view; the process is expressed as:
in the formula, the random factor is a random number between 0 and 1;
before moving, each search operator evaluates the search behavior, the aggregation behavior, the following behavior and the random behavior in turn, and then executes the best of them, so that the search operator population reaches a position closer to the optimal solution:
(1) Search behavior
assume that the state of the i-th search operator at a given moment is $X_i$, and that a state $X_j$ is selected at random within its search range, satisfying the following formula:
$Y_i$ and $Y_j$ denote the preferred-solution concentration in states $X_i$ and $X_j$, respectively; if $Y_i < Y_j$, the search operator moves one step in that direction, namely:
if the advance condition is not met, another state is selected within the search range and the condition is checked again; if the condition is still not met after the preset number of retries, the search operator moves randomly;
(2) Aggregation behavior
assume that the state of the i-th search operator at a given moment is $X_i$, that the number of other search operators found within its search range in the current state is n, and that their central position is $X_c$; the judgment criterion is:
where $\delta$ is the crowding factor, and $Y_c$ and $Y_i$ denote the preferred-solution concentration at the central position and at the current position, respectively;
if the above formula holds, the center has a higher preferred-solution concentration and is not crowded, so the search operator moves one step toward the center; otherwise, the search behavior is executed;
(3) Following behavior
assume that the state of the i-th search operator at a given moment is $X_i$; it searches for other search operators nearby in the current state and finds, among these peers, the one with the maximum preferred-solution concentration $Y_j$, whose position is $X_j$; the judgment criterion is:
if the above formula holds, the position of the other search operator $X_j$ has a higher preferred-solution concentration and is less crowded, so search operator $X_i$ moves one step in that direction; otherwise, the search behavior is executed;
(4) Random behavior
this behavior is the default of the search behavior: a position within the field of view is selected at random and the search operator moves to it; the position of the next state is:
the optimal solution set of the neural network parameters is obtained through the search behavior, the aggregation behavior, the following behavior and the random behavior;
S5, training a classifier machine learning model;
S6, applying the trained model to identify the degree of soil heavy metal pollution, wherein the model is trained with labeled samples and, after training is complete, the data to be examined are detected and identified;
in S3, sample expansion adopts a SMOTE sample generation method based on rapid clustering, comprising the following steps:
S301, obtaining the k nearest-neighbor samples of each minority-class sample by calculating the Euclidean distance from that sample to the other minority-class samples, randomly selecting one of the neighbors, and generating a new minority-class sample by linear interpolation between the sample and the selected neighbor, as shown in the following formula:
where the formula involves one sample drawn from the k neighbors, a random number, the input sample, and the newly generated sample;
S302, rapidly clustering the generated samples: first, the distance between objects is calculated according to the following formula:
where the two arguments are two of the generated samples and the result is the distance between them; to accelerate clustering, a threshold is set according to the following formula:
where the threshold is determined by a proportionality coefficient, which is set manually and is greater than 0 and less than 1, and by the minimum and maximum distances between categories;
among the generated samples, the samples generated in each category are screened to improve the quality of the generated samples, the screening condition being:
the samples that pass screening form the screened sample set, which is combined with the original data set to obtain a balanced sample set for subsequent feature extraction.
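The interpolation in S301 can be sketched as follows, assuming the standard SMOTE rule in which a new sample lies on the segment between a minority sample and one of its k nearest minority neighbors. The function name `smote_samples` and the parameters `n_new` and `rng` are hypothetical; the clustering and screening of S302 are not sketched because their formulas are not given in the text.

```python
import numpy as np

def smote_samples(minority, k, n_new, rng):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances among minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # requires k <= len(minority) - 1
    new = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbors[i, rng.integers(k)]  # randomly chosen neighbor of sample i
        lam = rng.random()                 # random number in [0, 1)
        new.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(new)
```

Every generated point lies between two existing minority samples, so it stays inside the region those samples already occupy.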
2. The training method for a machine-learning-based data processing and recognition model according to claim 1, wherein the dimensionality reduction method adopted in S2 is principal component analysis, comprising the following steps:
S201, data standardization, calculated as follows:
where Z denotes the standardized value; this step scales all variables proportionally;
S202, calculating the covariance matrix: mathematically, the covariance matrix is defined as a square matrix whose size equals the dimensionality of the acquired data, each element of the matrix being the covariance of the corresponding pair of variables; for a hyperspectral band scene with variable $a$ and variable $b$, the covariance is a 2 × 2 matrix, as follows:
where $C$ denotes the covariance matrix; $\mathrm{cov}(a, a)$ denotes the covariance of variable $a$ with itself, that is, the variance of variable $a$; $\mathrm{cov}(a, b)$ denotes the covariance of variable $a$ with variable $b$; and $\mathrm{cov}(b, b)$ denotes the variance of variable $b$;
S203, calculating the eigenvectors and eigenvalues:
the eigenvectors and eigenvalues are calculated from the covariance matrix; they are computed in pairs, each eigenvector having a corresponding eigenvalue, and the number of eigenvectors to be calculated corresponds to the dimensionality of the data;
the eigenvectors of the covariance matrix are used to find the directions of maximum variance in the data; because greater variance in the hyperspectral data carries more information about the data, the eigenvectors are used to identify and compute the principal components, while each eigenvalue is simply a scalar associated with its eigenvector; the eigenvectors and eigenvalues are therefore used together to compute the principal components of the data;
S204, calculating the principal components:
after the eigenvectors and eigenvalues have been calculated, the eigenvectors are sorted in descending order of eigenvalue; an eigenvector with a higher eigenvalue is more significant, the eigenvector with the highest eigenvalue is taken as the first principal component, and the selected principal components then form a feature matrix;
S205, reducing the dimensionality of the data set:
the original data are re-expressed in terms of the final principal components, which represent the largest and most important information in the data set; to replace the original data set with the newly formed principal components, the feature matrix is multiplied by the transpose of the original data, and the resulting data are the dimensionality-reduced data;
where $H(x)$ denotes the prediction of the random forest, $\arg\max$ denotes the maximum-index function, N is the test set, T is the number of decision trees, $I(\cdot)$ is the indicator function, $h_t(x)$ is the prediction of the t-th decision tree, $y$ denotes the category, and $w_t$ is the voting weight of the t-th decision tree; the indicator function $I(\cdot)$ takes the value 1 when the prediction of the decision tree is true, and 0 otherwise;
when the improved random forest algorithm runs, a confusion matrix is constructed first, in which TP means that a stable sample is judged to be stable, FN means that a stable sample is judged to be unstable, FP means that an unstable sample is judged to be stable, and TN means that an unstable sample is judged to be unstable;
the harmonic mean $F_1$ of the precision $P$ and the recall $R$ with which each decision tree classifies unstable samples is used as the voting weight $w_t$ of that tree, defined as follows:
the larger this weight is, the better the decision tree classifies minority-class samples, and the degree of heavy metal pollution is identified by the machine learning classification model based on the improved random forest.
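The principal component analysis steps S201 to S205 of claim 2 can be sketched with NumPy as follows. The standardization is assumed to be the usual zero-mean, unit-variance scaling, since the formula is not reproduced in the text; the function name `pca_reduce` and the parameter `n_components` are hypothetical.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Reduce data of shape (n_samples, n_variables) to n_components columns."""
    data = np.asarray(data, dtype=float)
    # S201: standardize each variable (assumed: subtract mean, divide by std).
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    # S202: covariance matrix of the standardized variables.
    cov = np.cov(z, rowvar=False)
    # S203: eigenvalues and eigenvectors (eigh is intended for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # S204: sort by descending eigenvalue and keep the leading eigenvectors.
    order = np.argsort(eigvals)[::-1][:n_components]
    feature_matrix = eigvecs[:, order]
    # S205: project the data; equal to (feature_matrix.T @ z.T).T.
    return z @ feature_matrix
```

The returned columns are ordered by decreasing variance and are mutually uncorrelated, which matches the low-dimensional variables described in step S2.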
CN202410205784.5A 2024-02-26 2024-02-26 Training method of data processing and recognition model based on machine learning Active CN117789038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410205784.5A CN117789038B (en) 2024-02-26 2024-02-26 Training method of data processing and recognition model based on machine learning

Publications (2)

Publication Number Publication Date
CN117789038A CN117789038A (en) 2024-03-29
CN117789038B true CN117789038B (en) 2024-05-10

Family

ID=90392988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410205784.5A Active CN117789038B (en) 2024-02-26 2024-02-26 Training method of data processing and recognition model based on machine learning

Country Status (1)

Country Link
CN (1) CN117789038B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974221B (en) * 2024-04-01 2024-09-13 国网江西省电力有限公司南昌供电分公司 Electric vehicle charging station location selection method and system based on artificial intelligence

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819745A (en) * 2012-07-04 2012-12-12 杭州电子科技大学 Hyper-spectral remote sensing image classifying method based on AdaBoost
CN104021396A (en) * 2014-06-23 2014-09-03 哈尔滨工业大学 Hyperspectral remote sensing data classification method based on ensemble learning
CN108596246A (en) * 2018-04-23 2018-09-28 浙江科技学院 The method for building up of soil heavy metal content detection model based on deep neural network
CN112001788A (en) * 2020-08-21 2020-11-27 东北大学 Credit card default fraud identification method based on RF-DBSCAN algorithm
CN112784907A (en) * 2021-01-27 2021-05-11 安徽大学 Hyperspectral image classification method based on spatial spectral feature and BP neural network
CN113256066A (en) * 2021-04-23 2021-08-13 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
WO2021189830A1 (en) * 2020-03-26 2021-09-30 平安科技(深圳)有限公司 Sample data optimization method, apparatus and device, and storage medium
WO2022001159A1 (en) * 2020-06-29 2022-01-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Latent low-rank projection learning based unsupervised feature extraction method for hyperspectral image
CN115436407A (en) * 2022-08-16 2022-12-06 电子科技大学 Element content quantitative analysis method combining random forest regression with principal component analysis
CN115526298A (en) * 2022-10-18 2022-12-27 安徽工业大学 High-robustness comprehensive prediction method for concentration of atmospheric pollutants
CN115965119A (en) * 2022-12-01 2023-04-14 北方工业大学 Method for power prediction optimization of distributed energy storage system
CN116187543A (en) * 2023-01-10 2023-05-30 中南大学 Machine learning-based soil heavy metal content prediction method and application thereof
CN116776245A (en) * 2023-06-09 2023-09-19 常州大学 Three-phase inverter equipment fault diagnosis method based on machine learning
CN116881451A (en) * 2023-06-28 2023-10-13 华迪计算机集团有限公司 Text classification method based on machine learning
CN117272999A (en) * 2023-09-05 2023-12-22 联通(广东)产业互联网有限公司 Model training method and device based on class incremental learning, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
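Empirical Analysis of Sampling Methods on Imbalanced Data; Disha N et al.; 2022 IEEE North Karnataka Subsection Flagship International Conference (NKCon); 20230526; full text *
Anomalous traffic detection technology based on data augmentation and model updating; 张浩; 陈龙; 魏志强; 信息网络安全; 20200210 (No. 02); full text *
Multi-cascade forest deep network classification algorithm for hyperspectral remote sensing imagery; 武复宇; 王雪; 丁建伟; 杜培军; 谭琨; 遥感学报; 20200425 (No. 04); full text *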

Also Published As

Publication number Publication date
CN117789038A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN110533631B (en) SAR image change detection method based on pyramid pooling twin network
CN109117883B (en) SAR image sea ice classification method and system based on long-time memory network
CN117789038B (en) Training method of data processing and recognition model based on machine learning
Mokhtari et al. Comparison of supervised classification techniques for vision-based pavement crack detection
CN112613536B (en) Near infrared spectrum diesel fuel brand recognition method based on SMOTE and deep learning
TW201350836A (en) Optimization of unknown defect rejection for automatic defect classification
CN103714148B (en) SAR image search method based on sparse coding classification
TW201407154A (en) Integration of automatic and manual defect classification
CN113191926B (en) Method and system for identifying grain and oil crop supply chain hazard based on deep integrated learning network
CN110751209B (en) Intelligent typhoon intensity determination method integrating depth image classification and retrieval
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
Kaur et al. Computer vision-based tomato grading and sorting
CN115699040A (en) Method and system for training a machine learning model for classifying components in a material stream
CN108875118A (en) A kind of blast furnace molten iron silicon content prediction model accuracy estimating method and apparatus
CN114997501A (en) Deep learning mineral resource classification prediction method and system based on sample unbalance
CN113344045A (en) Method for improving SAR ship classification precision by combining HOG characteristics
DB et al. Classification of oil palm female inflorescences anthesis stages using machine learning approaches
CN115147615A (en) Rock image classification method and device based on metric learning network
Wang et al. Classification and extent determination of rock slope using deep learning
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN104463207A (en) Knowledge self-encoding network and polarization SAR image terrain classification method thereof
CN110675382A (en) Aluminum electrolysis superheat degree identification method based on CNN-LapseLM
CN117372144A (en) Wind control strategy intelligent method and system applied to small sample scene
CN112465821A (en) Multi-scale pest image detection method based on boundary key point perception
CN112801204A (en) Hyperspectral classification method with lifelong learning ability based on automatic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant