WO2023202345A1 - Hierarchical group construction-based method for predicting pure component refining properties - Google Patents

Hierarchical group construction-based method for predicting pure component refining properties

Info

Publication number
WO2023202345A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
component
groups
hierarchical
model
Prior art date
Application number
PCT/CN2023/085001
Other languages
French (fr)
Chinese (zh)
Inventor
王耀宗
陈松航
陈豪
王森林
张剑铭
连明昌
钟浪
刘哲夫
Original Assignee
泉州装备制造研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 泉州装备制造研究所
Publication of WO2023202345A1 publication Critical patent/WO2023202345A1/en

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a hierarchical group construction-based method for predicting the refining properties of pure components. The method comprises: predicting the octane number and cetane number of pure-component compounds in a petroleum product; building the component feature sets hierarchically, with hierarchical group construction introduced to prevent feature-set redundancy; and introducing Bayes' rule when the third-level component descriptors are added to the feature sets, so that a posterior probability distribution can be estimated for the feature sets and the sets with higher posterior probability are selected, rather than focusing solely on prediction precision. On this basis, Bayes' rule is applied again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model. The invention can be applied to crude-oil and product-blending units in the petrochemical industry and can effectively improve petroleum refining precision.

Description

A method for predicting the refining properties of pure components based on hierarchical group construction

Technical Field
The invention relates to the field of data analysis in the refining and petrochemical industry, and in particular to a method for predicting the refining properties of pure components based on hierarchical group construction.
Background
Owing to the limitations of analytical chemistry and computer hardware, traditional refining unit models mostly use lumped kinetic models, in which feedstocks and products are divided into lumps according to macroscopic properties such as boiling point or solubility; examples are the ten-lump and eleven-lump models widely used for catalytic cracking units. Lumps defined at the macroscopic level are inherently multi-component and cannot characterize component information in detail, which makes such lumped models difficult to extend to new feedstocks and catalyst systems. A molecular-level composition model, by contrast, can calculate the composition and properties of the feedstock at the pure-component level, build a reaction network, and then accurately predict the properties of the products of each refining unit. Combined with pure-component property prediction models and mixing-rule models, a molecular-level kinetic model can not only predict the product distribution of a refining unit (qualitative analysis) but also quantitatively predict the corresponding refining properties of the products. This allows decision makers to design the chemical structure of the pure components in the products in a targeted way and to optimize unit operating conditions, thereby guiding refining research and industrial production. The accuracy with which the refining properties of pure components are predicted directly determines the accuracy of product quality assessment and, in turn, the optimization direction of each operating unit; it is a key point of the molecular-level kinetic model and decides whether molecular management technology can be successfully applied to refinery optimization.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a method for predicting the refining properties of pure components based on hierarchical group construction. For pure components in petroleum products, such as the octane number of each component of a gasoline product and the cetane number of each component of a diesel product, the prediction is made by building the component feature set hierarchically. When component descriptors are added to the feature set, Bayes' rule is introduced so that a posterior probability distribution can be estimated for it; on this basis, hierarchical group construction is introduced and the group fragments are built in levels, avoiding the risk of overfitting in the final prediction.
The present invention specifically includes the following steps:
Step 10: Using SMILES, an encodable simplified component representation, express complex component structures as two-dimensional codes and build a predefined group fragment library comprising primary groups, secondary groups and tertiary groups. The primary groups are the basic groups contained in a component structure; the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors describing the topological structure of the component.

Step 20: According to the molecular structure of the target component, screen primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component that is to be predicted, screen multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combine any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screen these to obtain the component feature set with the largest posterior probability.

Step 30: Combine the groups of the different levels with a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model.

Step 40: Generate multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screen out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
Further, in step 20, screening for the component feature set with the largest posterior probability specifically includes the following.
A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m); its value is approximated by Markov chain Monte Carlo random sampling over the space in which (m, β_m) lies, the candidate models being represented through a binomial distribution over the features, where p is the total number of features.
Further, step 30 includes a data preprocessing procedure and a modeling and verification procedure.
The data preprocessing procedure is: transform the data set toward a normal distribution by a probabilistic and statistical method, then use an unsupervised learning method to perform cluster analysis directly on the data set, approximately locating the sparse holes in its feature space, and thereby obtain the training set.
In the modeling and verification procedure, the hierarchical groups are modeled with a linear accumulation function of the form

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure is used to regress them in sequence: C_i is obtained by regression on the training set, the secondary-group contributions D_j are then obtained by regression, and f(Y*) is computed directly from the component descriptors without any regression. Finally, a unified regression yields the group coefficients δ, w and λ, i.e. the weights, which represent the influence of the group fragments of each level on the given property.
The invention has the following advantages:

Selecting group fragments on a mechanistic basis and combining them with component descriptors whose values do not depend on regressed coefficients reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time the method gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited. On this basis, Bayes' rule is introduced again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model.
Brief Description of the Drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.
Figure 1 is a schematic flow chart of the method of the present invention;

Figure 2 is a schematic diagram of the hierarchical groups of the present invention;

Figure 3 is a schematic diagram of the construction and screening of the component feature set of the present invention;

Figure 4 is a schematic diagram of the modeling procedure for the hierarchical groups of the present invention;

Figure 5 is the first schematic diagram of the uncertainty analysis procedure for the candidate models of the present invention;

Figure 6 is the second schematic diagram of the uncertainty analysis procedure for the candidate models of the present method.
Detailed Description

The embodiments of the present invention provide a method for predicting the refining properties of pure components based on hierarchical group construction. For pure components in petroleum products, such as the octane number of each component of a gasoline product and the cetane number of each component of a diesel product, the prediction is made by building the component feature set hierarchically. When component descriptors are added to the feature set, Bayes' rule is introduced so that a posterior probability distribution can be estimated for it; on this basis, hierarchical group construction is introduced and the group fragments are built in levels, avoiding the risk of overfitting in the final prediction.
As shown in Figure 1, the general idea of the present invention is as follows.

S1: Construct the predefined group fragment library.
To address the shortcomings of existing group fragment construction methods, a new set of component features is proposed to characterize the refining properties of the components in petroleum products. This component feature set combines mechanistically derived characteristic groups with component descriptors screened by machine learning. As the groups and descriptors are constructed, they are divided into levels: the higher the level, the more detailed the description of the component.
The primary groups comprise the basic groups of a component structure, such as -CH, -CH3 and -CO; components with simple structures, such as paraffins, can be decomposed and characterized with this level of group alone. However, this level can only represent the basic composition of a component and cannot represent the positions at which the groups are linked, while the linking position has a decisive influence on the refining properties of a component.
Therefore, the secondary groups focus on building group blocks, i.e. combinations of basic groups, to distinguish aromatics from paraffins and the corresponding isomers. As shown in Figure 2, the primary basic groups include the R group representing the aromatic ring (A6) and the CH2 group. Because the -CH2 group is linked to the benzene ring, it forms, together with the carbon of the benzene ring to which it is attached, a new group block aC-CH, which is represented among the secondary group blocks to characterize the component.
The tertiary groups use component descriptors. Because there are very many component descriptors, and the accuracy of descriptors based on quantum-chemical calculation is still debated in the scientific community, the focus is placed on descriptors that describe the topological structure of the component, such as the connectivity index.
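As an illustration of the kind of topological descriptor meant here, the minimal sketch below computes the first-order Randic connectivity index of a molecule from its SMILES string. It assumes RDKit is available; the example molecule (toluene) and the choice of this particular index are illustrative only and are not prescribed by the text.

```python
# Sketch: first-order connectivity (Randic) index as an example tertiary descriptor.
# Assumes RDKit is installed; the example molecule is illustrative only.
from rdkit import Chem

def randic_index(smiles: str) -> float:
    """Sum of 1/sqrt(deg(u)*deg(v)) over all bonds of the heavy-atom graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    total = 0.0
    for bond in mol.GetBonds():
        du = bond.GetBeginAtom().GetDegree()   # heavy-atom degree of one endpoint
        dv = bond.GetEndAtom().GetDegree()     # heavy-atom degree of the other endpoint
        total += 1.0 / (du * dv) ** 0.5
    return total

print(randic_index("Cc1ccccc1"))  # toluene, a simple aromatic example
```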
S2: Construct and screen the component feature set for the target component.
As shown in Figure 3, when the refining properties of a target component need to be predicted, the encodable simplified component representation SMILES is used to express the complex component structure as a two-dimensional code, and the molecular structure of the given component is automatically decomposed into group fragments that match the group library, so that it can be analyzed quantitatively.
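A minimal sketch of this automatic disassembly step follows, assuming RDKit is available. The SMARTS patterns stand in for a handful of primary groups and one secondary group block; they are a tiny illustrative subset, not the patent's full predefined group library.

```python
# Sketch: count predefined group fragments in a molecule given as SMILES.
# The pattern set is a small illustrative subset, not the full group library.
from rdkit import Chem

GROUP_SMARTS = {
    "-CH3":   "[CX4H3]",       # primary group: methyl
    "-CH2-":  "[CX4H2]",       # primary group: methylene
    "aCH":    "[cH]",          # primary group: aromatic CH
    "aC-CH2": "[c][CX4H2]",    # secondary group block: CH2 attached to an aromatic carbon
}

def count_groups(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    counts = {}
    for name, smarts in GROUP_SMARTS.items():
        patt = Chem.MolFromSmarts(smarts)
        counts[name] = len(mol.GetSubstructMatches(patt))
    return counts

# ethylbenzene: expect 1 x -CH3, 1 x -CH2-, 5 x aCH, 1 x aC-CH2
print(count_groups("CCc1ccccc1"))
```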
First, the primary and secondary groups are screened from the group library according to the molecular structure of the target component; global optimization algorithms such as simulated annealing and genetic algorithms can be used for this screening. Tertiary groups are then added to the screened feature set. However, the added tertiary groups inevitably overlap with the primary and secondary groups, making the feature set redundant. Therefore, combining information theory and machine learning, the concept of minimum correlation and maximum information is introduced: each added tertiary group must keep the minimum correlation with the existing lower-level groups while carrying the maximum amount of information about the property to be predicted, i.e. it must represent that property as fully as possible.
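One way the minimum-correlation / maximum-information screening could be realized is sketched below: each candidate descriptor is scored by its mutual information with the target property minus its mean absolute correlation with the groups already kept. The greedy scheme, the specific score, and the scikit-learn call are assumptions for illustration, not the exact criterion claimed.

```python
# Sketch: greedy minimum-redundancy / maximum-relevance screening of tertiary descriptors.
# X_low: primary/secondary group counts (n_samples x n_low), X_desc: candidate descriptors,
# y: property to predict (e.g. octane number). All inputs are assumed, for illustration only.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def screen_tertiary(X_low, X_desc, y, n_select=5):
    """Pick descriptors with high information about y and low correlation with kept columns."""
    relevance = mutual_info_regression(X_desc, y)              # information content w.r.t. y
    kept_cols = [X_low[:, k] for k in range(X_low.shape[1])]   # columns a pick must not duplicate
    selected = []
    for _ in range(n_select):
        best_j, best_score = None, -np.inf
        for j in range(X_desc.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X_desc[:, j], c)[0, 1]) for c in kept_cols])
            score = relevance[j] - redundancy                  # max information, min correlation
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        kept_cols.append(X_desc[:, best_j])
    return selected                                            # indices of the chosen descriptors
```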
Bayes' rule is then introduced for feature selection, to calculate the posterior probability of each candidate model. A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, which can be computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m). In most cases this integral has no analytical solution, so its value is approximated by Markov chain Monte Carlo (MCMC) random sampling over the space in which (m, β_m) lies.

Feature selection is a sub-problem of model selection: the candidate models are represented through a binomial distribution over the features, where p is the total number of features. This yields, for the model represented by each feature subset, its posterior distribution probability based on the known data set Y, thereby realizing a "soft" constraint. The core of this Bayes-rule feature selection method is the MCMC sampling procedure.
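The sketch below illustrates the flavor of this step: candidate models are binary inclusion vectors over the p features, and a Metropolis-Hastings walk estimates each visited model's posterior probability. The uniform model prior and the BIC-based approximation to the marginal likelihood f(y|m) are stand-ins of mine for the integral above; the patent itself only specifies MCMC sampling over the (m, β_m) space.

```python
# Sketch: Metropolis-Hastings sampling over feature-inclusion vectors m in {0,1}^p.
# The BIC approximation to f(y|m) and the implicit uniform prior f(m) are illustrative.
import numpy as np
from collections import Counter

def log_marginal_bic(X, y, mask):
    """Approximate log f(y|m) of a linear model on the masked features via -BIC/2."""
    if mask.sum() == 0:
        resid, k = y - y.mean(), 1
    else:
        Xm = X[:, mask.astype(bool)]
        beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
        resid, k = y - Xm @ beta, int(mask.sum()) + 1
    n = len(y)
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * k * np.log(n)

def mcmc_feature_selection(X, y, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    mask = rng.integers(0, 2, size=p)            # random starting model
    current = log_marginal_bic(X, y, mask)
    visits = Counter()
    for _ in range(n_steps):
        j = rng.integers(p)                      # propose flipping one feature in or out
        proposal = mask.copy()
        proposal[j] ^= 1
        cand = log_marginal_bic(X, y, proposal)
        if np.log(rng.random()) < cand - current:  # accept with probability min(1, ratio)
            mask, current = proposal, cand
        visits[tuple(mask)] += 1
    total = sum(visits.values())
    return {m: c / total for m, c in visits.items()}  # empirical posterior per visited model
```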
Selecting group fragments on a mechanistic basis and combining them with component descriptors whose values do not depend on regressed coefficients reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time the method gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited.
S3: Perform hierarchical group modeling and solve for the coefficients.
The procedure for hierarchical group modeling and coefficient solution is shown in Figure 4 and can be divided into two parts: data preprocessing, and modeling and verification. Because the existing data sets of component refining properties are very sparse, advanced statistical and machine-learning methods have to be introduced at the data preprocessing stage to improve the accuracy of modeling on small samples.
The distributions of the component feature values and of the experimental values in the database rarely satisfy the requirement of normality, which degrades the model during modeling; they therefore need to be transformed toward normality by a probabilistic and statistical method, namely the Box-Cox log-likelihood method. Because the feature space is sparse, a randomly selected training set can hardly cover the feature space of the test set, so a model built on such a training set extrapolates too far and its predictive performance drops. The second step therefore uses an unsupervised learning method: focusing only on the feature set of the data, without evaluating the modeling result, cluster analysis is performed directly on the data set and the sparse holes in its feature space are approximately located. A training set selected on this basis covers the feature space of the test-set samples to the greatest possible extent and improves the predictive performance of the model.
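A minimal preprocessing sketch along these lines, assuming scipy and scikit-learn are available: Box-Cox transforms each column toward normality, and k-means cluster centres pick a training set spread over the feature space. The library choices, the positivity shift, and the one-sample-per-cluster rule are assumptions, not specified in the text.

```python
# Sketch: Box-Cox normalization followed by cluster-based training-set selection.
# Columns are assumed non-constant; k-means and one representative per cluster are illustrative.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def boxcox_normalize(X):
    """Apply Box-Cox column-wise; values are shifted to be strictly positive first."""
    Xt = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j] - X[:, j].min() + 1.0      # Box-Cox requires positive data
        Xt[:, j], _ = stats.boxcox(col)
    return Xt

def select_training_indices(X, n_train):
    """Pick the sample closest to each k-means centre so training covers the feature space."""
    km = KMeans(n_clusters=n_train, n_init=10, random_state=0).fit(X)
    picks = []
    for c in range(n_train):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])
    return sorted(picks)
```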
For the modeling of the hierarchical groups, the traditional linear accumulation function is preferred, because it is computationally light and yields contribution coefficients for the corresponding groups, which to some extent provides richer mechanistic information. Its form is

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure regresses them in sequence. C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; because f(Y*) is computed directly from the component descriptors, it needs no regression, which greatly reduces the required size of the training set. Finally, a unified regression yields the group coefficients δ, w and λ; these weights represent the influence of the group fragments of each level on the given property.
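The sequential solution can be sketched as below: regress the primary-group contributions C_i first, regress the secondary-group contributions D_j on what the primary level leaves unexplained, take f(Y*) directly from the descriptors, and finally fit the level weights δ, w, λ in one ordinary least-squares step. The residual-based staging is an assumption about how "regress in sequence" is realized.

```python
# Sketch: hierarchical solution of f(Y) = delta*sum(Ci*Ni) + w*sum(Dj*Mj) + lambda*f(Y*).
# N1, N2 are occurrence-count matrices of primary/secondary groups; fY_star is the descriptor
# contribution computed without regression. The residual-based staging is an assumption.
import numpy as np

def fit_hierarchical(N1, N2, fY_star, y):
    # Stage 1: primary-group contributions Ci from the training set.
    C, *_ = np.linalg.lstsq(N1, y, rcond=None)
    r1 = y - N1 @ C
    # Stage 2: secondary-group contributions Dj on the residual of the primary level.
    D, *_ = np.linalg.lstsq(N2, r1, rcond=None)
    # Stage 3: level coefficients delta, w, lambda in a single unified regression.
    Z = np.column_stack([N1 @ C, N2 @ D, fY_star])
    (delta, w, lam), *_ = np.linalg.lstsq(Z, y, rcond=None)
    return C, D, (delta, w, lam)

def predict(N1, N2, fY_star, C, D, coeffs):
    delta, w, lam = coeffs
    return delta * (N1 @ C) + w * (N2 @ D) + lam * fY_star
```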
S4: Perform the uncertainty analysis.
As shown in Figures 5 and 6, uncertainty analysis of the predicted values, i.e. estimation of confidence intervals, is crucial for the practical application of the model. Because the hierarchical group model has an explicit mathematical expression and also carries the probability distribution over the candidate models, applying Bayes' rule once more yields the confidence intervals of all candidate models; combining these with the accuracy of each model, and weighing accuracy against practicality, gives octane-number and cetane-number models better suited to the actual situation of the refinery.
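As a hedged illustration of how the candidate models' posterior probabilities and residual spreads could be combined into an interval, the sketch below forms a posterior-weighted mixture of Gaussian predictive distributions and reads off a 95% interval by sampling. The Gaussian form of each predictive distribution, the sampling shortcut, and the example numbers are assumptions of this sketch, not statements of the patent.

```python
# Sketch: 95% prediction interval from a posterior-weighted mixture of candidate models.
# Each candidate supplies a point prediction and a residual standard deviation; the
# Gaussian form of each predictive distribution is an illustrative assumption.
import numpy as np

def prediction_interval(preds, sigmas, post_probs, n_draws=20000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    preds, sigmas = np.asarray(preds, float), np.asarray(sigmas, float)
    w = np.asarray(post_probs, float)
    w = w / w.sum()
    which = rng.choice(len(preds), size=n_draws, p=w)   # pick a candidate model per draw
    draws = rng.normal(preds[which], sigmas[which])     # sample its predictive Gaussian
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(draws.mean()), (float(lo), float(hi))

# e.g. three candidate octane-number models for one component (illustrative numbers):
print(prediction_interval(preds=[92.1, 91.4, 93.0], sigmas=[1.2, 0.8, 1.5],
                          post_probs=[0.5, 0.3, 0.2]))
```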
It should be noted that those skilled in the art can, in the course of the prediction calculations, make appropriate adaptations and corresponding parameter settings according to the relevant principles. The embodiment described above expresses only one implementation of the present invention; although its description is relatively specific and detailed, it should not be understood as limiting the scope of the invention patent.
A specific embodiment of the present invention is as follows.
Step 10: Using SMILES, an encodable simplified component representation, express complex component structures as two-dimensional codes and build a predefined group fragment library comprising primary groups, secondary groups and tertiary groups. The primary groups are the basic groups contained in a component structure; the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors describing the topological structure of the component.

Step 20: According to the molecular structure of the target component, screen primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component that is to be predicted, screen multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combine any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screen these to obtain the component feature set with the largest posterior probability.

Step 30: Combine the groups of the different levels with a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model.

Step 40: Generate multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screen out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
In step 20, screening for the component feature set with the largest posterior probability specifically includes the following.
A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m); its value is approximated by Markov chain Monte Carlo random sampling over the space in which (m, β_m) lies, the candidate models being represented through a binomial distribution over the features, where p is the total number of features.
Step 30 includes a data preprocessing procedure and a modeling and verification procedure.

The data preprocessing procedure is: transform the data set toward a normal distribution by a probabilistic and statistical method, then use an unsupervised learning method to perform cluster analysis directly on the data set, approximately locating the sparse holes in its feature space, and thereby obtain the training set.
In the modeling and verification procedure, the hierarchical groups are modeled with a linear accumulation function of the form

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure regresses them in sequence: C_i is obtained by regression on the training set, the secondary-group contributions D_j are then obtained by regression, and f(Y*) is computed directly from the component descriptors without any regression. Finally, a unified regression yields the group coefficients δ, w and λ, i.e. the weights, which represent the influence of the group fragments of each level on the given property.
The present invention selects group fragments on a mechanistic basis and combines them with component descriptors whose values do not depend on regressed coefficients, which reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time it gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited. On this basis, Bayes' rule is introduced again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the specific embodiments described are only illustrative and are not intended to limit the scope of the present invention; equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall all fall within the scope protected by the claims of the present invention.

Claims (3)

  1. A method for predicting the refining properties of pure components based on hierarchical group construction, characterized by comprising:
    Step 10: using SMILES, an encodable simplified component representation, expressing complex component structures as two-dimensional codes and building a predefined group fragment library comprising primary groups, secondary groups and tertiary groups, wherein the primary groups are the basic groups contained in a component structure, the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers, and the tertiary groups are descriptors describing the topological structure of the component;
    Step 20: according to the molecular structure of the target component, screening primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component to be predicted, screening multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combining any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screening these to obtain the component feature set with the largest posterior probability;
    Step 30: combining the groups of the different levels with a linear accumulation function for modeling and then solving for the coefficients on the training set to obtain a hierarchical group model;
    Step 40: generating multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, applying Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screening out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
  2. The method according to claim 1, wherein in step 20, screening out the component feature set with the largest posterior probability specifically comprises:
    A single model m belongs to the candidate model set M, and each model obeys a distribution over the known data set Y, f(y|m, βm), where the parameter vector βm ∈ Bm and Bm is the set of possible values of the coefficients of model m; letting the prior probability of model m be f(m), the posterior probability is:

    f(m|y) = f(y|m) f(m) / Σm'∈M f(y|m') f(m')
    where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, βm) f(βm|m) dβm together with f(βm|m); its value is approximated by Markov-chain Monte Carlo random sampling, the sampling range being the space in which (m, βm) lies.
    In this estimate, p denotes the total number of features.
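For illustration only (not part of the claims), the sketch below approximates the marginal term f(y|m) by plain Monte Carlo sampling of βm from its prior and then normalizes f(y|m)f(m) over the candidate set under a uniform model prior. This is a simplification of the Markov-chain Monte Carlo sampling recited in claim 2, and the priors, noise level and candidate models are illustrative assumptions.

```python
# A minimal sketch of model-posterior computation via Monte Carlo estimation
# of the marginal likelihood f(y|m) = ∫ f(y|m,beta) f(beta|m) dbeta.
import numpy as np

def log_marginal_likelihood(X, y, n_samples=5000, prior_scale=2.0, noise_sd=1.0, seed=0):
    """Estimate log f(y|m) by averaging the likelihood over beta drawn from its prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = rng.normal(0.0, prior_scale, size=(n_samples, p))       # beta ~ f(beta|m)
    resid = y[None, :] - betas @ X.T                                  # residuals per sample
    log_lik = -0.5 * np.sum(resid**2, axis=1) / noise_sd**2 \
              - n * np.log(noise_sd * np.sqrt(2 * np.pi))             # Gaussian log-likelihoods
    return np.logaddexp.reduce(log_lik) - np.log(n_samples)           # log of the mean likelihood

def model_posteriors(candidate_models, X_full, y):
    """Posterior f(m|y) ∝ f(y|m) f(m) with a uniform prior f(m) over candidates."""
    log_ml = np.array([log_marginal_likelihood(X_full[:, cols], y) for cols in candidate_models])
    w = np.exp(log_ml - log_ml.max())
    return w / w.sum()

rng = np.random.default_rng(1)
X_full = rng.normal(size=(40, 4))
y = X_full[:, :2] @ np.array([1.0, -2.0]) + rng.normal(0, 1.0, 40)
models = [[0, 1], [0, 1, 2], [2, 3]]          # candidate feature subsets (component feature sets)
print(model_posteriors(models, X_full, y))    # the subset that generated y should dominate
```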
  3. The method according to claim 1, wherein step 30 comprises a data preprocessing process and a modeling and validation process;
    In the data preprocessing process, the data set is transformed toward normality by probability-statistical methods; an unsupervised learning method is then applied directly to the data set for cluster analysis, the sparse voids in the feature space of the data set are approximately estimated, and the training set is obtained;
    In the modeling and validation process, the hierarchical groups are modeled with a linear accumulation function of the form:

    f(Y) = δ Σi Ni Ci + w Σj Mj Dj + λ f(Y*)

    where f(Y) is the function of the property to be predicted; Ci is the contribution of the i-th group among the first-level groups, Ni is the number of occurrences of group i, and δ is the first-level group coefficient; w is the second-level group coefficient, Dj is the contribution of group j among the second-level groups, and Mj is its number of occurrences; λ is the coefficient of the third-level component descriptors, and f(Y*) is the total contribution of the third-level descriptors to the given property;
    When computing the hierarchical group coefficients δ, w, λ and the group contributions Ci, Dj, a hierarchical approach is used to regress them in sequence: Ci is obtained by regression on the training set, after which the second-level group contributions Dj are obtained by regression; f(Y*) is calculated from the component descriptors and requires no regression; finally, the group coefficients δ, w, λ, that is, the magnitudes of the weights, are obtained by a unified regression and represent the influence of the group fragments at each level on the given property.
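For illustration only (not part of the claims), the following sketch mirrors the sequential regression of claim 3: the first-level contributions Ci are fitted on the training set, the second-level contributions Dj are fitted on the remaining residual, the descriptor term f(Y*) is taken as given without regression, and the level weights δ, w, λ are then obtained in a final unified regression. The use of ordinary least squares and the synthetic data are illustrative assumptions.

```python
# A minimal sketch of hierarchical (sequential) regression for
# f(Y) = delta * sum_i N_i C_i + w * sum_j M_j D_j + lambda * f(Y*).
import numpy as np

def fit_hierarchical_model(N, M, fY_star, y):
    """N: (n, p1) first-level counts, M: (n, p2) second-level counts,
    fY_star: (n,) descriptor term (no regression needed), y: (n,) targets."""
    # 1) first-level contributions C_i from the training set
    C, *_ = np.linalg.lstsq(N, y, rcond=None)
    # 2) second-level contributions D_j from what the first level leaves unexplained
    D, *_ = np.linalg.lstsq(M, y - N @ C, rcond=None)
    # 3) unified regression of the level weights delta, w, lambda
    level_terms = np.column_stack([N @ C, M @ D, fY_star])
    delta, w, lam = np.linalg.lstsq(level_terms, y, rcond=None)[0]
    return C, D, (delta, w, lam)

def predict(N, M, fY_star, C, D, coeffs):
    delta, w, lam = coeffs
    return delta * (N @ C) + w * (M @ D) + lam * fY_star

rng = np.random.default_rng(2)
N = rng.integers(0, 5, size=(50, 6)).astype(float)       # first-level group counts
M = rng.integers(0, 3, size=(50, 4)).astype(float)       # second-level group counts
fY_star = rng.normal(size=50)                             # third-level descriptor term
y = N @ rng.normal(size=6) + 0.5 * (M @ rng.normal(size=4)) + 0.3 * fY_star

C, D, coeffs = fit_hierarchical_model(N, M, fY_star, y)
print("level weights (delta, w, lambda):", np.round(coeffs, 3))
print("first prediction vs target:", round(predict(N, M, fY_star, C, D, coeffs)[0], 3), round(y[0], 3))
```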
PCT/CN2023/085001 2022-04-19 2023-03-30 Hierarchical group construction-based method for predicting pure component refining properties WO2023202345A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210411133.2A CN114708930A (en) 2022-04-19 2022-04-19 Method for predicting refining property of pure components based on hierarchical group construction
CN202210411133.2 2022-04-19

Publications (1)

Publication Number Publication Date
WO2023202345A1 true WO2023202345A1 (en) 2023-10-26

Family

ID=82174562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085001 WO2023202345A1 (en) 2022-04-19 2023-03-30 Hierarchical group construction-based method for predicting pure component refining properties

Country Status (2)

Country Link
CN (1) CN114708930A (en)
WO (1) WO2023202345A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708930A (en) * 2022-04-19 2022-07-05 泉州装备制造研究所 Method for predicting refining property of pure components based on hierarchical group construction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899795A (en) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Molecular-level oil refining processing full-flow optimization method, device and system and storage medium
CN111899793A (en) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Real-time optimization method, device and system of molecular-level device and storage medium
WO2021234065A1 (en) * 2020-05-22 2021-11-25 Basf Coatings Gmbh Prediction of properties of a chemical mixture
CN113707240A (en) * 2021-07-30 2021-11-26 浙江大学 Component parameter robust soft measurement method based on semi-supervised nonlinear variational Bayes mixed model
CN114708930A (en) * 2022-04-19 2022-07-05 泉州装备制造研究所 Method for predicting refining property of pure components based on hierarchical group construction

Also Published As

Publication number Publication date
CN114708930A (en) 2022-07-05

Similar Documents

Publication Publication Date Title
Xu et al. Small data machine learning in materials science
Schaid Genomic similarity and kernel methods II: methods for genomic information
WO2023040512A1 (en) Catalytic cracking unit simulation and prediction method based on molecular-level mechanism model and big data technology
Wang et al. A two‐layer ensemble learning framework for data‐driven soft sensor of the diesel attributes in an industrial hydrocracking process
Pyl et al. Molecular reconstruction of complex hydrocarbon mixtures: An application of principal component analysis
Song et al. Modeling the hydrocracking process with deep neural networks
WO2023202345A1 (en) Hierarchical group construction-based method for predicting pure component refining properties
Tan et al. Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system
Castro et al. Significant motifs in time series
CN116802741A (en) Inverse synthesis system and method
Ma et al. MIDIA: exploring denoising autoencoders for missing data imputation
Chen et al. Adaptive modeling strategy integrating feature selection and random forest for fluid catalytic cracking processes
Mei et al. Molecular-based bayesian regression model of petroleum fractions
WO2023129955A1 (en) Inter-model prediction score recalibration
Wang et al. Layer-wise residual-guided feature learning with deep learning networks for industrial quality prediction
Capel et al. ProteinGLUE multi-task benchmark suite for self-supervised protein modeling
CN109801681B (en) SNP (Single nucleotide polymorphism) selection method based on improved fuzzy clustering algorithm
Bondugula et al. MUPRED: a tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction
Manfredi et al. ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences
Luo et al. Developing soft sensors using hybrid soft computing methodology: a neurofuzzy system based on rough set theory and genetic algorithms
Yang et al. A carbon price hybrid forecasting model based on data multi-scale decomposition and machine learning
Zhou et al. TransVAE-DTA: Transformer and variational autoencoder network for drug-target binding affinity prediction
Guan et al. Dual‐objective optimization for petroleum molecular reconstruction based on property and composition similarities
Nguyen et al. Evaluating causal‐based feature selection for fuel property prediction models
Shi et al. Interpretable reconstruction of naphtha components using property-based extreme gradient boosting and compositional-weighted Shapley additive explanation values

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791003

Country of ref document: EP

Kind code of ref document: A1