CN112686296A - Octane loss value prediction method based on particle swarm optimization random forest parameters

Octane loss value prediction method based on particle swarm optimization random forest parameters

Info

Publication number
CN112686296A
CN112686296A
Authority
CN
China
Prior art keywords
random forest
data
particle swarm
value
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011587477.6A
Other languages
Chinese (zh)
Other versions
CN112686296B (en)
Inventor
杨春曦
陈瑞
韩世昌
范升序
李一鸣
陈锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011587477.6A priority Critical patent/CN112686296B/en
Publication of CN112686296A publication Critical patent/CN112686296A/en
Application granted granted Critical
Publication of CN112686296B publication Critical patent/CN112686296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an octane loss value prediction method based on particle swarm optimization of random forest parameters, which comprises: step 1, calculating the information gain value of each feature related to the octane loss value, and deleting the features with little influence on octane number loss; step 2, preprocessing the remaining data; step 3, training a random forest algorithm on the training data set to obtain a training model; step 4, initializing the particle swarm algorithm parameters; step 5, using the root mean square error as the fitness function of the particle swarm algorithm, continuously searching for the optimal values of the number of decision trees and the tree depth in the training model, and importing the optimal parameters into the training model to obtain the optimal prediction model; and step 6, inputting a new test set, importing it into the optimal prediction model for testing, and obtaining the prediction result. The method can be effectively used for predicting the octane loss value.

Description

Octane loss value prediction method based on particle swarm optimization random forest parameters
Technical Field
The invention relates to an octane loss value prediction method based on particle swarm optimization random forest parameters, and belongs to the technical field of octane loss value prediction in a gasoline catalytic cracking process flow.
Background
In the gasoline catalytic cracking process, meeting the gasoline sulfur content requirement of the new national standard demands further desulfurization treatment, but excessive processing during desulfurization reduces the octane number of the gasoline. Octane number is the most important index of gasoline combustion performance; controlling the octane loss in this process can effectively improve the economic benefit of production. Traditional chemical process modeling is mostly based on data association and mechanism modeling, but the actual oil refining process is highly complex, the control variables are highly nonlinear and strongly coupled, the requirements on raw material analysis are high, and process optimization responds slowly, so the effect is not ideal.
At present, the prediction of octane number in process production has been widely studied with good results, mainly focusing on predicting the octane component ratio in the finished oil: the data collected from the product oil are analyzed with machine learning methods, and a machine learning model then performs the prediction.
Disclosure of Invention
The invention provides an octane loss value prediction method based on particle swarm optimization random forest parameters, which is used for predicting an octane loss value.
The technical scheme of the invention is as follows: a particle swarm optimization random forest parameter-based octane loss value prediction method comprises the following steps:
step 1, calculating the information gain value of each feature related to the octane loss value, and deleting the features with small influence on octane number loss;
step 2, preprocessing the data remaining after the features with small influence on octane number loss are deleted, and dividing the preprocessed data into a training data set and a test data set;
step 3, training the random forest algorithm by adopting a training data set to obtain a training model, and verifying the training model by adopting a test data set;
step 4, initializing particle swarm algorithm parameters;
step 5, adopting the root mean square error of the verified random forest training model as the fitness function of the particle swarm algorithm, continuously searching through the particle swarm algorithm for the optimal values of the number of decision trees n_estimators and the tree depth max_depth in the verified training model, and importing the optimal parameters into the verified training model to obtain the optimal prediction model.
Further comprising:
step 6, inputting data processed as in step 1 and step 2 again as a new test set, importing it into the optimal prediction model, and testing to obtain the prediction result.
In step 1, the deletion condition is whether a feature's information gain value is smaller than the average information gain value of all features; the features whose information gain is smaller than the average are deleted.
In step 2, the preprocessing specifically comprises: filling null values and normalization.
The null filling is specifically: in the sample data set, when a single feature of a sample is null, the null is filled with the mean of the preceding and following data at that position; when two or more features of a sample are null, the sample is deleted.
The normalization specifically adopts min-max normalization, mapping the result values into [0, 1].
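The preprocessing described above (null filling followed by min-max normalization) can be sketched as follows. This is an illustrative sketch, not the patent's code; it assumes the neighbouring rows of a single null are themselves non-null, and that no feature column is constant.

```python
# Illustrative sketch of the step-2 preprocessing: fill a single null with the
# mean of the preceding and following data at that position, delete samples
# with two or more nulls, then min-max normalize each feature into [0, 1].

def preprocess(samples):
    """samples: list of rows; None marks a null value."""
    filled = []
    for r, row in enumerate(samples):
        nulls = [c for c, v in enumerate(row) if v is None]
        if len(nulls) >= 2:          # two or more nulls: delete the sample
            continue
        row = list(row)
        for c in nulls:              # single null: mean of previous and next rows
            row[c] = (samples[r - 1][c] + samples[r + 1][c]) / 2
        filled.append(row)
    # min-max normalization per feature column
    cols = list(zip(*filled))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - lo[c]) / (hi[c] - lo[c]) for c, v in enumerate(row)]
            for row in filled]
```

On a toy set such as `[[1.0, 2.0], [None, 4.0], [3.0, 6.0], [None, None]]`, the last sample is deleted, the single null becomes `(1.0 + 3.0) / 2`, and every column is scaled to [0, 1].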
In step 4, the parameters to set are: population size, particle position inertia weight, particle learning factors, and particle dimension. The population size, inertia weight and learning factors are the main parameters influencing the particle swarm algorithm, and the particle dimension is the number of random forest parameters to be optimized.
The invention has the beneficial effects that:
(1) The method performs information gain calculation on the collected data, making it convenient to observe the size and distribution interval of each feature's information gain value and its degree of association with the octane loss value. Feature data with low association in the original data set are deleted while the effective information of the remaining features is retained, which reduces the time required for model training and avoids the overfitting caused by excessive feature data. Step 1 thus ensures that the feature data used for model training carries highly associated effective information and reduces the economic and time cost of training.
(2) Null values are handled by different means. Data with two or more nulls are deleted rather than filled, which effectively removes abnormal data and avoids filled values that differ from normal data and interfere with model training. A single null is replaced by the mean of the data before and after it; this mean reflects the local trend and effectively reduces the deviation from the actual true value. Normalization is also applied: it puts all indexes on the same order of magnitude, eliminates the influence of different dimensions and dimensional units across evaluation indexes, and makes the data indexes comparable, so that convergence in the subsequent parameter optimization is faster and the optimal solution is easier to reach. The normalized data are divided into a training data set and a test data set by 1:1 random sampling, avoiding large differences between the two sets.
(3) After steps 1 and 2, the random forest algorithm is applied. The processed data largely avoid the weaknesses of random forests (sensitivity to high noise, overfitting, long training time), so the algorithm's strengths are further amplified: training the random forest on the processed training data set yields a good training effect, and the test data set then demonstrates the model's prediction performance.
(4) Using the initialized particle swarm algorithm to search for the optimal parameter values avoids the uncertainty and deviation of setting algorithm parameters from human experience. Inputting the optimal parameters found by the particle swarm into the training model promotes it to the optimal prediction model: choosing suitable values of the number of decision trees n_estimators and the tree depth max_depth effectively enhances the model's prediction ability while reducing training time and improving generalization. Furthermore, selecting only these two parameters as the particle dimensions avoids the high-dimensional search that many target values would cause, which would greatly increase the particle swarm's search time and reduce its efficiency, while also avoiding the instability of optimizing a single parameter. The two-dimensional search space ensures that the parameter values required by the training model can be found with a short search time and good algorithm efficiency.
In conclusion, performing step 3 after steps 1 and 2 improves the training effect, and the particle swarm parameter optimization further improves the prediction ability from the parameter side, reducing training time and improving prediction performance. Experiments also show that the optimal prediction model predicts new feature data collected from the process well, is stable, and can be effectively used for predicting the octane loss value.
Drawings
FIG. 1 shows a flow chart of the present invention;
FIG. 2 shows a comparison experimental verification diagram of superiority of the random forest algorithm to data in the scene of the invention;
FIG. 3 is a scatter plot of relative data distribution according to the method of the present invention;
FIG. 4 is a graph showing the prediction capability of the method of the present invention in real short-term data.
Detailed Description
Example 1: as shown in fig. 1, a method for predicting octane loss value based on particle swarm optimization random forest parameters comprises the following steps:
step 1, calculating the information gain value of each feature related to the octane loss value, and deleting the features with small influence on octane number loss;
step 2, preprocessing the data remaining after the features with small influence on octane number loss are deleted, and dividing the preprocessed data into a training data set and a test data set;
step 3, training the random forest algorithm by adopting a training data set to obtain a training model, and verifying the training model by adopting a test data set;
step 4, initializing particle swarm algorithm parameters;
step 5, adopting the root mean square error of the verified random forest training model as the fitness function of the particle swarm algorithm, continuously searching through the particle swarm algorithm for the optimal values of the number of decision trees n_estimators and the tree depth max_depth in the verified training model, and importing the optimal parameters into the verified training model to obtain the optimal prediction model.
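A minimal sketch of the step-5 fitness is given below. It is not the patent's code: it assumes scikit-learn is available, and the data, function name `rmse_fitness`, and parameter values are illustrative only. The fitness trains a random forest with a candidate (n_estimators, max_depth) pair and returns the test-set RMSE that the particle swarm then minimizes.

```python
# Sketch (assuming scikit-learn) of the step-5 fitness function: test-set RMSE
# of a random forest trained with a candidate parameter pair. Synthetic data.
import math
import random

from sklearn.ensemble import RandomForestRegressor

def rmse_fitness(n_estimators, max_depth, X_train, y_train, X_test, y_test):
    model = RandomForestRegressor(n_estimators=int(n_estimators),
                                  max_depth=int(max_depth),
                                  random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y_test)) / len(y_test))

random.seed(0)
X = [[random.random(), random.random()] for _ in range(80)]
y = [a + 2 * b for a, b in X]
fit = rmse_fitness(10, 4, X[:60], y[:60], X[60:], y[60:])
```

A particle swarm can call `rmse_fitness` with each particle's rounded position and keep the pair with the smallest returned value.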
Optionally, the method further includes: step 6, inputting data processed as in step 1 and step 2 again as a new test set, importing it into the optimal prediction model, and testing to obtain the prediction result.
Further, in step 1, the deletion condition may be whether a feature's information gain value is smaller than the average information gain value of all features; the features whose information gain is smaller than the average are deleted.
Further, in step 2, the preprocessing specifically includes: filling null values and normalization.
Further, the null filling may be specifically set as: in the sample data set, when a single feature of a sample is null, the null is filled with the mean of the preceding and following data at that position; when two or more features of a sample are null, the sample is deleted.
Further, the normalization may specifically adopt min-max normalization, mapping the result values into [0, 1].
Further, in step 4, the parameters may be set as: population size, particle position inertia weight, particle learning factors, and particle dimension. The population size, inertia weight and learning factors are the main parameters influencing the particle swarm algorithm, and the particle dimension is the number of random forest parameters to be optimized.
In step 1, the features with small influence on octane number loss are deleted to avoid overfitting. The specific deletion condition is whether a feature's information gain value is smaller than the average information gain value of all features in the original data set; features below the average are deleted. The information gain formula is:
IG(t_j) = H(C) − H(C | t_j) = −Σ_{i=1}^{n} P(C_i) log₂ P(C_i) + P(t_j) Σ_{i=1}^{n} P(C_i | t_j) log₂ P(C_i | t_j) + P(t̄_j) Σ_{i=1}^{n} P(C_i | t̄_j) log₂ P(C_i | t̄_j)
In the formula, the sample set is assumed to have n classes of labels, with label set C = (C_1, C_2, ..., C_n), i = 1, 2, ..., n; C_i is the i-th class label in C. The sample set is assumed to have m classes of features, with feature set T = (t_1, t_2, ..., t_m), j = 1, 2, ..., m; t_j is the j-th feature in T. P(t_j) is the probability that feature t_j occurs and P(t̄_j) the probability that it does not. P(C_i) is the proportion of class-C_i data in the total data; P(C_i | t_j) is the probability of class C_i occurring when t_j occurs, and P(C_i | t̄_j) the probability of class C_i occurring when t_j does not. H(C) is the information entropy of the label set C; the smaller the entropy, the lower the randomness of the information. The conditional entropy H(C | t_j) of feature t_j in T is the randomness of label C_i given that t_j is known; the lower its value, the higher the correlation between C_i and t_j. The information gain IG(t_j) is the information entropy minus the conditional entropy, so the larger the gain, the higher the correlation between the labels and feature t_j, the greater t_j's ability to reduce the randomness of the total data, and the greater its value for classification.
An original data set is constructed for the features related to the octane loss value, each sample consisting of m features and the corresponding label, and the information gain value of each feature is calculated. Information gain is chosen for the subsequent judgment (rather than other measures) because it displays each feature numerically, making it convenient to observe the size and distribution interval of each feature's gain value and providing a convenient numerical basis for keeping or rejecting features. Using whether a feature's information gain is below the average gain of all features in the original data set as the deletion condition effectively removes the data weakly associated with the label categories, reduces computation, avoids model overfitting, and speeds up model training in the subsequent steps.
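The information-gain screening of step 1 can be sketched as follows. This is an illustrative sketch, not the patent's code; it treats each feature as a binary occurs/does-not-occur indicator, matching the P(t_j) / P(t̄_j) formulation above, and the function names are hypothetical.

```python
# Minimal sketch of step 1: compute each feature's information gain
# IG(t_j) = H(C) - H(C|t_j) and delete features whose gain is below the
# average gain of all features.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_present, labels):
    """feature_present: list of bools (t_j occurs / does not occur)."""
    h_c = entropy(labels)
    cond = 0.0
    for flag in (True, False):
        subset = [l for f, l in zip(feature_present, labels) if f == flag]
        if subset:  # weight each branch by P(t_j) or P(t̄_j)
            cond += len(subset) / len(labels) * entropy(subset)
    return h_c - cond

def select_features(features, labels):
    """features: dict name -> list of bools; keep gains >= mean gain."""
    gains = {name: info_gain(col, labels) for name, col in features.items()}
    mean_gain = sum(gains.values()) / len(gains)
    return [name for name, g in gains.items() if g >= mean_gain]
```

A feature perfectly aligned with the labels has gain equal to H(C), while an independent feature has gain 0, so the mean-gain threshold keeps only the more informative half of the features.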
When nulls are filled in step 2, each null is replaced by the mean of the data before and after its position. This fits the small fluctuation between adjacent data in continuous time and effectively reduces deviation.
Step 2 also normalizes the feature data. Normalization standardizes the data so that all indexes are on the same order of magnitude, eliminating the influence of different dimensions and dimensional units across evaluation indexes and making the data indexes comparable. The normalization specifically adopts min-max normalization, also called dispersion normalization, a linear transformation of the original data that maps the result values into [0, 1]; the conversion formula is:
A* = (A − min) / (max − min)
where max is the maximum value of the feature data, min is the minimum value, A is the data before normalization, and A* is the normalized data.
In step 5, the particle swarm optimizes the random forest parameters. The parameters that determine the model's performance are mainly the number of decision trees n_estimators and the tree depth max_depth; the optimal parameter combination found by the particle swarm search effectively improves prediction precision, reducing the deviation of the predicted value and bringing it closer to the true value, whereas most parameter settings of the traditional random forest algorithm are manual and rely on empirical values. In the particle swarm iteration, the position of the selected parameters is a two-dimensional vector S_q = (S_1, S_2), with S_1 = n_estimators and S_2 = max_depth. Each particle has two attributes, velocity and position, which are continuously updated in the search; the best solution found by an individual particle is Pbest and the best solution found by the swarm is Gbest. During iterative optimization each particle updates its velocity and position through Pbest and Gbest:
v_{λq}^{k+1} = α·v_{λq}^{k} + c_1 r_1 (Pbest_{λq} − x_{λq}^{k}) + c_2 r_2 (Gbest_{λq} − x_{λq}^{k})
x_{λq}^{k+1} = x_{λq}^{k} + v_{λq}^{k+1}
where v_{λq}^{k} is the q-th component of the velocity vector of particle λ at the k-th iteration and x_{λq}^{k} is the q-th component of its position vector; α is the inertia weight of the particle position, q is the particle dimension, and r_1 and r_2 are random values in [0, 1] that increase search randomness. The particle positions and velocities are generally limited to [X_MIN, X_MAX] and [V_MIN, V_MAX], where V_MIN and V_MAX are the minimum and maximum search velocities of the particle and X_MIN and X_MAX the minimum and maximum search positions, ensuring that the particles do not search blindly.
In step 5, the random forest parameter values must be integers, so the particle position and velocity values are rounded with a rounding function; when the model reaches the optimal solution, the output optimal parameter combination is a positive integer.
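The update equations and the integer rounding above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the fitness here is a toy function standing in for the random forest test-set RMSE, symmetric velocity bounds are used for simplicity, and all names and default values are illustrative.

```python
# Illustrative particle swarm sketch: particles search the 2-D space
# (n_estimators, max_depth); velocities/positions are clamped to their bounds
# and positions are rounded to integers before evaluating the fitness.
import random

def pso(fitness, x_min=1, x_max=20, v_min=-5, v_max=5,
        swarm=20, iters=50, alpha=0.8, c1=2.0, c2=2.0, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(x_min, x_max) for _ in range(2)] for _ in range(swarm)]
    vel = [[rng.uniform(v_min, v_max) for _ in range(2)] for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(round(p[0]), round(p[1])) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(swarm):
            for q in range(2):
                r1, r2 = rng.random(), rng.random()  # randomness in [0, 1]
                vel[i][q] = (alpha * vel[i][q]
                             + c1 * r1 * (pbest[i][q] - pos[i][q])
                             + c2 * r2 * (gbest[q] - pos[i][q]))
                vel[i][q] = max(v_min, min(v_max, vel[i][q]))
                pos[i][q] = max(x_min, min(x_max, pos[i][q] + vel[i][q]))
            f = fitness(round(pos[i][0]), round(pos[i][1]))
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return round(gbest[0]), round(gbest[1]), gbest_f

# Toy fitness with minimum at (12, 7), standing in for the RF test-set RMSE.
best = pso(lambda n, d: (n - 12) ** 2 + (d - 7) ** 2)
```

In the patent's setting the lambda passed to `pso` would train the random forest with the rounded pair and return the verification RMSE.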
Example 2: aiming at a method for predicting octane loss value based on particle swarm optimization random forest parameters, the invention provides the following experimental data process:
step 1, extracting characteristic data influencing octane loss in a gasoline catalytic cracking process flow to serve as an original data set, calculating information gain values of all characteristics in the original data set, and deleting the characteristics with small influence on octane number loss to avoid the problem of overfitting. The method comprises the following specific steps:
1.1, in the gasoline catalytic cracking process, octane loss is caused by hydrodesulfurization: excessive olefin substances generated during desulfurization consume octane through reaction. Therefore, the hourly data collected by the sensors of the hydrodesulfurization section is selected as the original data set, and only the section data causing octane number loss in the process flow is extracted, which effectively lowers the data-collection cost and makes it convenient to extract effective information from the feature data. The information gain values of the features in the original data set are calculated, and the features whose gain is below the average gain of all features in the original data set are deleted; the deleted feature variables are shown in Table 1.
TABLE 1 delete characteristic information gain Table
Name of variable Information gain
Feed unit feedstock sulfur content 0.342
Stabilizing column pressure 0.356
Recycle hydrogen to lockhopper dipleg flow 1.073
Cumulative flow of waste hydrogen discharge 1.039
Pressure of reducer 1.98
Reactor top pressure 1.919
Flow rate of light hydrocarbon out of device 1.174
Fuel gas inlet pressure 1.664
Flow of light naphtha into the device 1.984
Regenerator pressure 1.958
Flow of refined gasoline to feeding buffer tank 1.846
Pressure of outlet mixed hydrogen point of circulating hydrogen compressor 1.899
Middle temperature of R-101 bed 1.927
Average information gain value of all features 1.987674
The remaining features include: saturated hydrocarbon (alkane + cyclane) content, olefin content, aromatic hydrocarbon content, bromine number, raw material sulfur content, spent adsorbent coke content, spent adsorbent sulfur content, hydrogen-oil ratio, reducer fluidization hydrogen flow, reactor upper temperature, reactor bottom temperature, reactor top-bottom pressure difference, back-flushing hydrogen temperature, back-flushing hydrogen pressure, dry gas outlet device temperature, refined gasoline outlet device flow, refined gasoline outlet device sulfur content, steam inlet device pressure, steam inlet device flow, dry gas outlet device flow, fuel gas inlet device temperature, fuel gas inlet device flow, 1.0MPa steam inlet device temperature, D107 converter line pressure difference, D107 lift nitrogen flow, catalytic gasoline inlet device total flow, 2# catalytic gasoline inlet device flow, 3# catalytic gasoline inlet device flow, raw material pump outlet flow, raw material inlet device flow, Hydrogen flow of hydrogen mixing point, inlet temperature of heating furnace, exhaust temperature of heating furnace, outlet temperature of circulating hydrogen of heating furnace, inlet temperature of reactor, D104 destabilizing tower flow, reducer temperature, regeneration air flow, R102 regenerator lifting nitrogen flow, regenerator top and bottom differential pressure, regenerator top flue gas temperature, regenerator temperature, regeneration flue gas oxygen content, raw material inlet device flow, D-123 steam outlet flow, D-110 steam coil inlet flow, stabilizing tower lower temperature, stabilizing tower top outlet temperature, stabilizing tower bottom outlet temperature, regenerator top/regenerator receiver differential pressure, emergency hydrogen main pipe flow, emergency hydrogen R-101 flow, blocking hopper hydrocarbon content, blocking hopper charging line pressure, R-101 bed lower temperature, D-121 sulfur-containing sewage discharge capacity, hydrocracking 
light naphtha inlet device accumulated flow, the flow of 8.0 MPa hydrogen to the recycle hydrogen compressor inlet, and the flow of 8.0 MPa hydrogen to the back-flushing hydrogen compressor outlet.
Step 2, preprocessing the data remaining after the features with small influence on octane number loss are deleted, specifically: filling null values, normalizing, dividing into a training data set and a test data set, and extracting a data set for the comparison and verification experiment. The specific steps are as follows:
2.1, when two or more features of a sample have null values, the sample is deleted; when a single feature of a sample is null, the null is replaced by the mean of the preceding and following data at that position, which effectively reduces deviation.
2.2, normalizing all the feature data to eliminate dimensional influence among the data.
2.3, dividing the 2017–2019 data into a training data set and a test data set at a random 1:1 ratio. The data were acquired over a long period; building the octane loss value feature data from long-term historical data makes the model's prediction more accurate.
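The random 1:1 split in step 2.3 can be sketched as follows; this is an illustrative sketch, not the patent's code, and the function name and seed are hypothetical.

```python
# Sketch of a random 1:1 train/test split of the preprocessed samples.
import random

def split_half(data, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)   # shuffle indices, not the data itself
    half = len(idx) // 2
    train = [data[i] for i in idx[:half]]
    test = [data[i] for i in idx[half:]]
    return train, test
```

Shuffling before splitting avoids the two halves coming from different time periods, which would make the training and test distributions differ.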
Step 3, training the random forest algorithm on the training data set to obtain a training model, and verifying the training model on the test data set. The specific steps are as follows:
and 3.1, inputting a training data set and a test data set, performing comparison and verification experiments of different regression algorithms, wherein prediction results are shown in fig. 2 and fig. 3, the prediction capability of the different regression algorithms on the test data set is shown in fig. 2, solid dots with dotted lines are true values of each serial number data in the test data set, solid inverted triangles are predicted values of each serial number data in the test data set by the regression algorithms, and the closer the predicted value corresponding to each serial number in the graph is to the true value, the better the prediction capability of the regression algorithms is. Fig. 3 shows relative positional deviation values between the true values of the serial number data in the test data set and the predicted values of the serial number data in the test data set in different regression algorithms, where if the data points in the graph are linearly concentrated, the better the performance of the regression algorithm is. The regression models in the positions of fig. 3 and fig. 2 correspond one to verify the sample data processing effect of different algorithms in the present invention. The comparative evaluation index selects a correlation coefficient, a root mean square error, a mean square error and a mean absolute error, wherein the formula is as follows:
R² = 1 − Σ_{I=1}^{M} (y_I − h(x_I))² / Σ_{I=1}^{M} (y_I − ȳ)²
MSE = (1/M) Σ_{I=1}^{M} (h(x_I) − y_I)²
RMSE = √MSE
MAE = (1/M) Σ_{I=1}^{M} |h(x_I) − y_I|
In the above formulas, h(x_I) is the predicted value of the I-th data point, y_I is its true value, ȳ is the mean of the true values, and M is the number of data points. In the correlation coefficient, the numerator of the ratio is the sum of squared deviations between the true and predicted values and the denominator is the sum of squared deviations between the true values and their mean; the value range is [0, 1], and the closer the value is to 1, the better the model fit. The mean square error is the mean of the squared differences between the actual and predicted values in the test set, and the root mean square error is its square root; for both, the closer the value is to 0, the higher the model's prediction precision. The mean absolute error is the average of the absolute differences between predicted and actual values; it avoids errors cancelling each other out and better reflects the actual prediction error. The evaluation indexes of the comparative experiment results are shown in Table 2.
TABLE 2 evaluation index table for different regression models
[Table 2 appears as an image in the original publication.]
The performance indexes of decision tree (DT), logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), AdaBoost, Bagging, BP neural network (BP) and other models are all weaker than those of the random forest (RF) model.
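The four evaluation indexes above can be sketched as follows; this is an illustrative sketch, not the patent's code, and the function name is hypothetical.

```python
# Sketch of the comparison-experiment metrics: correlation coefficient (R²
# form), mean square error, root mean square error, and mean absolute error.
import math

def metrics(y_true, y_pred):
    m = len(y_true)
    mean = sum(y_true) / m
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot          # closer to 1 = better fit
    mse = ss_res / m                  # closer to 0 = higher precision
    rmse = math.sqrt(mse)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / m
    return r2, mse, rmse, mae
```

A perfect prediction gives R² = 1 and zero MSE, RMSE, and MAE; MAE avoids positive and negative errors cancelling each other, which is why it complements the squared-error measures.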
Step 4, initialize the particle swarm algorithm parameters. The main parameters are the population size, the particle position inertia weight, the particle learning factors and the particle dimension; the population size, inertia weight and learning factors are the main parameters influencing the particle swarm algorithm, while the particle dimension equals the number of random forest parameters being optimized. The specific steps are as follows:
4.1, among the parameters of the established random forest, the number of decision trees n_estimators and the tree depth max_depth are the core parameters to be optimized. In theory, increasing the number of decision trees effectively reduces the variance of the prediction but lengthens the training time; a deeper decision tree gives the model stronger prediction capability, but also lengthens training and makes overfitting more likely. Choosing suitable values of n_estimators and max_depth therefore effectively strengthens the model's prediction capability while keeping the training time down. Because these parameters are set manually, empirical values usually do not yield the best model; instead, the particle swarm algorithm continuously updates the particle positions to determine the final optimal parameter combination.
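A minimal sketch of the two core parameters, assuming scikit-learn's RandomForestRegressor as the random forest implementation (the patent does not name a library) and toy stand-in data in place of the plant features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the process features (hypothetical values).
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = X[:, 0] * 2.0 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.01, 60)

# The two core parameters the particle swarm later tunes.
model = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0)
model.fit(X, y)
pred = model.predict(X[:5])  # predictions for the first 5 samples
```

Larger `n_estimators` lowers prediction variance at the cost of training time; larger `max_depth` raises capacity at the cost of overfitting risk, which is exactly the trade-off the swarm searches over.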
4.2, initialize the population: population size w = 100, iteration count k = 100, optimization dimension q = 2, inertia weight α = 0.8, learning factors c1 = c2 = 2, particle search velocity bounds V_MIN = 1 and V_MAX = 5, and particle search position bounds X_MIN = 1 and X_MAX = 20.
4.3, set a fitness function as the criterion for judging the best position found by the particles; this model selects the root mean square error of the prediction model as the fitness function.
Step 5, adopt the root mean square error of the random forest training model as the fitness function of the particle swarm algorithm. The root mean square error is the square root of the mean of the squared differences between the true and predicted values on the test set; the smaller it is, the stronger the model's prediction performance. The particle swarm algorithm minimizes this fitness, so that when the search terminates it yields the parameter values at which the random forest regression model performs best. The swarm thus continuously searches for the optimal values of the number of decision trees n_estimators and the tree depth max_depth, and importing these optimal parameters into the random forest algorithm produces the optimal prediction model. The specific steps are as follows:
5.1, for the trained random forest prediction model, calculate the fitness value of each particle with the fitness function established in step 4.3.
5.2, over successive iterations, evaluate the fitness of every particle, select the particles with smaller fitness, record their positions, and keep iterating over a continuously narrowed range. When the iteration limit is reached, the best fitness value and the position and velocity of the corresponding particle are obtained; the corresponding parameter values are output, and this optimal parameter combination is fed into the random forest prediction model to obtain the optimal prediction model.
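Steps 5.1–5.2 can be sketched as a generic particle swarm minimizer. This is an illustration only: the random-forest RMSE fitness is replaced by a cheap quadratic surrogate so the sketch runs standalone, and the symmetric velocity clamp is an assumption (the embodiment lists V_MIN = 1, V_MAX = 5 without stating how velocities are bounded):

```python
import random

def pso_minimize(fitness, dim=2, n_particles=100, n_iter=100,
                 alpha=0.8, c1=2.0, c2=2.0,
                 x_min=1.0, x_max=20.0, v_min=-5.0, v_max=5.0):
    """Minimal particle swarm optimizer; returns the best position and fitness."""
    random.seed(0)
    # Initialize positions and velocities inside the search box.
    pos = [[random.uniform(x_min, x_max) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[random.uniform(v_min, v_max) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global best so far
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (alpha * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(v_min, min(v_max, vel[i][d]))
                pos[i][d] = max(x_min, min(x_max, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f < pbest_fit[i]:                # keep the smaller fitness
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Surrogate fitness: pretend RMSE is minimized at n_estimators=12, max_depth=7.
surrogate = lambda p: (p[0] - 12) ** 2 + (p[1] - 7) ** 2
best, best_fit = pso_minimize(surrogate)
```

In the method itself, `surrogate` would be replaced by a function that trains a random forest with the rounded `(n_estimators, max_depth)` candidate and returns its test-set RMSE.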
Step 6, process the latest data in the same way and extract it as a new test data set, import it into the optimal prediction model for testing, obtain the prediction result, and verify the stability of the model. The specific steps are as follows:
6.1, process the data from January–February 2020 according to step 1 and step 2 but without dividing it; all of the processed data serves as the new test data set.
6.2, import the new test data set into the optimal prediction model based on particle swarm optimization of random forest parameters (PSO-RF). The resulting prediction distribution is shown in Fig. 4: the solid dots joined by a line are the true values of each numbered sample in the new test data set, and the solid pentagons joined by a line are the corresponding predicted values. The evaluation indices of the prediction model are listed in Table 3. As Fig. 4 and Table 3 show, the optimal prediction model predicts the new test data set well and stays within the acceptable error range.

Table 3: Evaluation of the prediction results on the new test data

Evaluation index    Value
MSE                 0.01881
MAE                 0.10302
RMSE                0.13716
The experiments in this embodiment ran on a single machine with an Intel(R) Core(TM) i5-4590 CPU @ 3.3 GHz, 12 GB of RAM and a 64-bit Windows 7 Ultimate operating system; the programs were written in Python.
The working principle of the invention is as follows. Guided by the relevant literature, equipment characteristic parameters related to the core desulfurization reaction are selected from the production process, narrowing the feature-extraction range to equipment such as the reactor, the regenerator, raw-material-related equipment, hydrogenation-related equipment and the catalyst, and the factors influencing the octane loss value in each step and piece of equipment are identified from the literature. For example, Zheng Yunfeng et al. ("Influence of raw material oil on the octane number of catalytic cracking gasoline") studied the influence on the octane loss value of the hydrogen-oil ratio, the mass space velocity, and the sulfur-holding and carbon-holding rates of the spent adsorbent, among other factors; Shang Bao et al. ("Strengthening process management and reducing octane number loss of refined gasoline of an S Zorb device") studied the influence of steam pressure, stabilizer-column top and bottom temperatures, catalyst circulation capacity, and olefin and bromine content, among other factors. In summary, the characteristic factors causing octane loss are extracted from an analysis of the principle of catalytic cracking of gasoline. Because the operating variables are mutually coupled in complex ways, the method computes the information gain of each candidate operating-variable feature selected from the plant data, taking the octane loss value as the target variable, and deletes the features with little influence on the octane number loss so as to avoid overfitting.
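The information-gain screening described above can be sketched for discretized features. This is a toy illustration only: the plant data and the discretization scheme are not reproduced, and the feature names are hypothetical.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of discrete labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy discretized data: feature "A" tracks the target, "B" is mostly noise.
label = [0, 0, 1, 1, 0, 1, 0, 1]
feats = {"A": [0, 0, 1, 1, 0, 1, 0, 1],   # informative
         "B": [0, 1, 0, 1, 0, 1, 0, 1]}   # uninformative
gains = {name: info_gain(col, label) for name, col in feats.items()}
avg = sum(gains.values()) / len(gains)
# Delete features whose gain falls below the average gain of all features.
kept = [name for name, g in gains.items() if g >= avg]
```

With this data, feature "A" (gain 1.0) is kept and feature "B" falls below the average gain and is deleted, mirroring the deletion condition of the method.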
After the characteristic variables are determined and the remaining data preprocessed, the octane number loss trend is effectively predicted by a random forest prediction model whose parameters are optimized by the particle swarm algorithm. Because the random forest is insensitive to high-dimensional features, has strong sample-handling capability, is not prone to overfitting and is insensitive to noisy data, comparison experiments verify that the method is superior to other traditional regression algorithms; on this basis, the octane loss value prediction method based on particle swarm optimization of random forest parameters is designed.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (7)

1. An octane loss value prediction method based on particle swarm optimization of random forest parameters, characterized by comprising the following steps:
step 1, calculating the information gain value of the characteristics related to the octane loss value, and deleting the characteristics with little influence on the octane number loss;
step 2, preprocessing the data remaining after the deletion, and dividing the preprocessed data into a training data set and a test data set;
step 3, training the random forest algorithm with the training data set to obtain a training model, and verifying the training model with the test data set;
step 4, initializing the particle swarm algorithm parameters;
step 5, adopting the root mean square error of the verified random forest training model as the fitness function of the particle swarm algorithm, continuously searching with the particle swarm algorithm for the optimal values of the number of decision trees n_estimators and the tree depth max_depth in the verified model, and importing the optimal parameters into the verified model to obtain the optimal prediction model.
2. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 1, characterized by further comprising:
step 6, processing the latest data according to step 1 and step 2 and taking it as a new test set, importing the new test set into the optimal prediction model, and testing to obtain the prediction result.
3. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 1 or 2, characterized in that in step 1 the deletion condition is: judging whether the information gain value of a characteristic is smaller than the average information gain value of all characteristics, and deleting every characteristic whose information gain value is smaller than that average.
4. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 1 or 2, characterized in that in step 2 the preprocessing specifically comprises: filling null values and normalization.
5. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 3, characterized in that the filling of null values is specifically: in the sample data set, when a single characteristic of a sample is null, the null is filled with the mean of the values of that characteristic in the previous and next samples; when two or more characteristics of a sample are null, the sample is deleted.
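The null-filling rule of claim 5 can be sketched in plain Python. This is a minimal illustration only; the handling of boundary rows is an assumption (the claim does not specify it), so the sketch assumes a null's neighbors exist and are themselves non-null:

```python
def fill_nulls(rows):
    """Fill a single missing feature with the mean of the same feature in the
    previous and next rows; drop any row with two or more missing features."""
    out = []
    for i, row in enumerate(rows):
        missing = [j for j, v in enumerate(row) if v is None]
        if len(missing) >= 2:
            continue                               # drop samples with >= 2 nulls
        if len(missing) == 1:
            j = missing[0]
            row = row[:]                           # do not mutate the input
            row[j] = (rows[i - 1][j] + rows[i + 1][j]) / 2
        out.append(row)
    return out
```

For example, `[[1.0, 2.0], [None, 4.0], [3.0, 6.0], [None, None]]` becomes `[[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]`: the single null is filled with (1.0 + 3.0) / 2 and the doubly null row is dropped.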
6. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 3, characterized in that the normalization is specifically performed by min-max normalization, mapping the resulting values into [0, 1].
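The min-max normalization of claim 6 maps each feature column linearly onto [0, 1]; a minimal sketch (the constant-column fallback is an assumption, as the claim does not address it):

```python
def min_max_scale(values):
    """Map a feature column linearly onto [0, 1]; constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]       # avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_scale([2, 4, 6])` returns `[0.0, 0.5, 1.0]`.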
7. The octane loss value prediction method based on particle swarm optimization of random forest parameters according to claim 1 or 2, characterized in that in step 4 the parameters are set as: population size, particle position inertia weight, particle learning factor and particle dimension; the population size, particle position inertia weight and particle learning factor are the main parameters influencing the particle swarm algorithm, and the particle dimension is the number of random forest parameters being optimized.
CN202011587477.6A 2020-12-29 2020-12-29 Octane loss value prediction method based on particle swarm optimization random forest parameters Active CN112686296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011587477.6A CN112686296B (en) 2020-12-29 2020-12-29 Octane loss value prediction method based on particle swarm optimization random forest parameters


Publications (2)

Publication Number Publication Date
CN112686296A true CN112686296A (en) 2021-04-20
CN112686296B CN112686296B (en) 2022-07-01

Family

ID=75454768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011587477.6A Active CN112686296B (en) 2020-12-29 2020-12-29 Octane loss value prediction method based on particle swarm optimization random forest parameters

Country Status (1)

Country Link
CN (1) CN112686296B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254435A (en) * 2021-07-15 2021-08-13 北京电信易通信息技术股份有限公司 Data enhancement method and system
CN113408187A (en) * 2021-05-15 2021-09-17 西安石油大学 Optimization method for reducing gasoline octane number loss based on random forest
CN116306321A (en) * 2023-05-18 2023-06-23 湖南工商大学 Particle swarm-based adsorbed water treatment scheme optimization method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017098862A1 (en) * 2015-12-08 2017-06-15 国立研究開発法人物質・材料研究機構 Fuel oil discrimination sensor equipped with receptor layer composed of hydrocarbon-group-modified microparticles, and fuel oil discrimination method
CN109668856A (en) * 2017-10-17 2019-04-23 中国石油化工股份有限公司 The method and apparatus for predicting hydrocarbon system's composition of LCO hydrogenating materials and product
CN110059852A (en) * 2019-03-11 2019-07-26 杭州电子科技大学 A kind of stock yield prediction technique based on improvement random forests algorithm
CN110766222A (en) * 2019-10-22 2020-02-07 太原科技大学 Particle swarm parameter optimization and random forest based PM2.5 concentration prediction method
CN111797674A (en) * 2020-04-10 2020-10-20 成都信息工程大学 MI electroencephalogram signal identification method based on feature fusion and particle swarm optimization algorithm



Also Published As

Publication number Publication date
CN112686296B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN112686296B (en) Octane loss value prediction method based on particle swarm optimization random forest parameters
CN110379463B (en) Marine algae cause analysis and concentration prediction method and system based on machine learning
CN112489733B (en) Octane number loss prediction method based on particle swarm algorithm and neural network
CN109034260B (en) Desulfurization tower oxidation fan fault diagnosis system and method based on statistical principle and intelligent optimization
US11820947B2 (en) Method of reducing octane loss in catalytic cracking of gasoline in S-zorb plant
Alvarez et al. An evolutionary algorithm to discover quantitative association rules from huge databases without the need for an a priori discretization
CN111144609A (en) Boiler exhaust emission prediction model establishing method, prediction method and device
CN112835570A (en) Machine learning-based visual mathematical modeling method and system
CN112435720A (en) Prediction method based on self-attention mechanism and multi-drug characteristic combination
CN115188429A (en) Catalytic cracking unit key index modeling method integrating time sequence feature extraction
CN105740960B (en) A kind of optimization method of industry hydrocracking reaction condition
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN114239400A (en) Multi-working-condition process self-adaptive soft measurement modeling method based on local double-weighted probability hidden variable regression model
CN113111588B (en) NO of gas turbine X Emission concentration prediction method and device
CN112420132A (en) Product quality optimization control method in gasoline catalytic cracking process
Guo et al. Optimization Modeling and Empirical Research on Gasoline Octane Loss Based on Data Analysis
CN112342050B (en) Method and device for optimizing light oil yield of catalytic cracking unit and storage medium
CN113408187A (en) Optimization method for reducing gasoline octane number loss based on random forest
CN116449691A (en) Raw oil processing control method and device
Divine et al. Enhancing biomass Pyrolysis: Predictive insights from process simulation integrated with interpretable Machine learning models
CN110389948A (en) A kind of tail oil prediction technique of the hydrocracking unit based on data-driven
Hamedi et al. Integrating artificial immune genetic algorithm and metaheuristic ant colony optimizer with two-dose vaccination and modeling for residual fluid catalytic cracking process
Hasibuan et al. Bootstrap aggregating of classification and regression trees in identification of single nucleotide polymorphisms
CN117434911B (en) Equipment running state monitoring method and device and electronic equipment
CN115497573B (en) Carbon-based biological and geological catalytic material property prediction and preparation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant