WO2023202345A1 - Hierarchical group construction-based method for predicting pure component refining properties - Google Patents

Hierarchical group construction-based method for predicting pure component refining properties

Info

Publication number
WO2023202345A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
component
groups
hierarchical
model
Prior art date
Application number
PCT/CN2023/085001
Other languages
French (fr)
Chinese (zh)
Inventor
王耀宗
陈松航
陈豪
王森林
张剑铭
连明昌
钟浪
刘哲夫
Original Assignee
泉州装备制造研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 泉州装备制造研究所
Publication of WO2023202345A1 publication Critical patent/WO2023202345A1/en

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a hierarchical group construction-based method for predicting the refining properties of pure components. The method comprises: predicting the octane number and cetane number of pure-component compounds in a petroleum product; building the component feature sets hierarchically, with hierarchical group construction introduced to prevent feature-set redundancy; and introducing Bayes' rule when the third-level component descriptors are added to the feature sets, so that a posterior probability distribution can be estimated for the feature sets and the sets with higher posterior probability are selected, rather than focusing solely on prediction precision. On this basis, Bayes' rule is applied again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model. The invention can be applied to crude-oil and product-blending units in the petrochemical industry and can effectively improve petroleum refining precision.

Description

A method for predicting the refining properties of pure components based on hierarchical group construction

Technical Field
The invention relates to the field of data analysis in the refining and petrochemical industry, and in particular to a method for predicting the refining properties of pure components based on hierarchical group construction.
Background
Owing to the limitations of analytical chemistry and computer hardware, traditional refining unit models mostly use lumped kinetic models, in which feedstocks and products are divided into lumps according to macroscopic properties such as boiling point or solubility; examples are the ten-lump and eleven-lump models widely used for catalytic cracking units. Lumps defined at the macroscopic level are inherently multi-component and cannot characterize component information in detail, which makes such lumped models difficult to extend to new feedstocks and catalyst systems. A molecular-level composition model, by contrast, can calculate the composition and properties of the feedstock at the pure-component level, build a reaction network, and then accurately predict the properties of the products of each refining unit. Combined with pure-component property prediction models and mixing-rule models, a molecular-level kinetic model can not only predict the product distribution of a refining unit (qualitative analysis) but also quantitatively predict the corresponding refining properties of the products. This allows decision makers to design the chemical structure of the pure components in the products in a targeted way and to optimize unit operating conditions, thereby guiding refining research and industrial production. The accuracy with which the refining properties of pure components are predicted directly determines the accuracy of product quality assessment and, in turn, the optimization direction of each operating unit; it is a key point of the molecular-level kinetic model and decides whether molecular management technology can be successfully applied to refinery optimization.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a method for predicting the refining properties of pure components based on hierarchical group construction. For pure components in petroleum products, such as the octane number of each component of a gasoline product and the cetane number of each component of a diesel product, the prediction is made by building the component feature set hierarchically. When component descriptors are added to the feature set, Bayes' rule is introduced so that a posterior probability distribution can be estimated for it; on this basis, hierarchical group construction is introduced and the group fragments are built in levels, avoiding the risk of overfitting in the final prediction.
The present invention specifically includes the following steps:
Step 10: Using SMILES, an encodable simplified component representation, express complex component structures as two-dimensional codes and build a predefined group fragment library comprising primary groups, secondary groups and tertiary groups. The primary groups are the basic groups contained in a component structure; the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors describing the topological structure of the component.

Step 20: According to the molecular structure of the target component, screen primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component that is to be predicted, screen multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combine any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screen these to obtain the component feature set with the largest posterior probability.

Step 30: Combine the groups of the different levels with a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model.

Step 40: Generate multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screen out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
Further, in step 20, screening for the component feature set with the largest posterior probability specifically includes the following.
A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m); its value is approximated by Markov chain Monte Carlo random sampling over the space in which (m, β_m) lies, the candidate models being represented through a binomial distribution over the features, where p is the total number of features.
Further, step 30 includes a data preprocessing procedure and a modeling and verification procedure.
The data preprocessing procedure is: transform the data set toward a normal distribution by a probabilistic and statistical method, then use an unsupervised learning method to perform cluster analysis directly on the data set, approximately locating the sparse holes in its feature space, and thereby obtain the training set.
In the modeling and verification procedure, the hierarchical groups are modeled with a linear accumulation function of the form

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure is used to regress them in sequence: C_i is obtained by regression on the training set, the secondary-group contributions D_j are then obtained by regression, and f(Y*) is computed directly from the component descriptors without any regression. Finally, a unified regression yields the group coefficients δ, w and λ, i.e. the weights, which represent the influence of the group fragments of each level on the given property.
The invention has the following advantages:

Selecting group fragments on a mechanistic basis and combining them with component descriptors whose values do not depend on regressed coefficients reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time the method gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited. On this basis, Bayes' rule is introduced again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model.
Brief Description of the Drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.
Figure 1 is a schematic flow chart of the method of the present invention;

Figure 2 is a schematic diagram of the hierarchical groups of the present invention;

Figure 3 is a schematic diagram of the construction and screening of the component feature set of the present invention;

Figure 4 is a schematic diagram of the modeling procedure for the hierarchical groups of the present invention;

Figure 5 is the first schematic diagram of the uncertainty analysis procedure for the candidate models of the present invention;

Figure 6 is the second schematic diagram of the uncertainty analysis procedure for the candidate models of the present method.
Detailed Description

The embodiments of the present invention provide a method for predicting the refining properties of pure components based on hierarchical group construction. For pure components in petroleum products, such as the octane number of each component of a gasoline product and the cetane number of each component of a diesel product, the prediction is made by building the component feature set hierarchically. When component descriptors are added to the feature set, Bayes' rule is introduced so that a posterior probability distribution can be estimated for it; on this basis, hierarchical group construction is introduced and the group fragments are built in levels, avoiding the risk of overfitting in the final prediction.
As shown in Figure 1, the general idea of the present invention is as follows.

S1: Construct the predefined group fragment library.
To address the shortcomings of existing group fragment construction methods, a new set of component features is proposed to characterize the refining properties of the components in petroleum products. This component feature set combines mechanistically derived characteristic groups with component descriptors screened by machine learning. As the groups and descriptors are constructed, they are divided into levels: the higher the level, the more detailed the description of the component.
The primary groups comprise the basic groups of a component structure, such as -CH, -CH3 and -CO; components with simple structures, such as paraffins, can be decomposed and characterized with this level of group alone. However, this level can only represent the basic composition of a component and cannot represent the positions at which the groups are linked, while the linking position has a decisive influence on the refining properties of a component.
Therefore, the secondary groups focus on building group blocks, i.e. combinations of basic groups, to distinguish aromatics from paraffins and the corresponding isomers. As shown in Figure 2, the primary basic groups include the R group representing the aromatic ring (A6) and the CH2 group. Because the -CH2 group is linked to the benzene ring, it forms, together with the carbon of the benzene ring to which it is attached, a new group block aC-CH, which is represented among the secondary group blocks to characterize the component.
The tertiary groups use component descriptors. Because there are very many component descriptors, and the accuracy of descriptors based on quantum-chemical calculation is still debated in the scientific community, the focus is placed on descriptors that describe the topological structure of the component, such as the connectivity index.
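As an illustration of the kind of topological descriptor meant here, the minimal sketch below computes the first-order Randic connectivity index of a molecule from its SMILES string. It assumes RDKit is available; the example molecule (toluene) and the choice of this particular index are illustrative only and are not prescribed by the text.

```python
# Sketch: first-order connectivity (Randic) index as an example tertiary descriptor.
# Assumes RDKit is installed; the example molecule is illustrative only.
from rdkit import Chem

def randic_index(smiles: str) -> float:
    """Sum of 1/sqrt(deg(u)*deg(v)) over all bonds of the heavy-atom graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    total = 0.0
    for bond in mol.GetBonds():
        du = bond.GetBeginAtom().GetDegree()   # heavy-atom degree of one endpoint
        dv = bond.GetEndAtom().GetDegree()     # heavy-atom degree of the other endpoint
        total += 1.0 / (du * dv) ** 0.5
    return total

print(randic_index("Cc1ccccc1"))  # toluene, a simple aromatic example
```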
S2: Construct and screen the component feature set for the target component.
As shown in Figure 3, when the refining properties of a target component need to be predicted, the encodable simplified component representation SMILES is used to express the complex component structure as a two-dimensional code, and the molecular structure of the given component is automatically decomposed into group fragments that match the group library, so that it can be analyzed quantitatively.
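A minimal sketch of this automatic disassembly step follows, assuming RDKit is available. The SMARTS patterns stand in for a handful of primary groups and one secondary group block; they are a tiny illustrative subset, not the patent's full predefined group library.

```python
# Sketch: count predefined group fragments in a molecule given as SMILES.
# The pattern set is a small illustrative subset, not the full group library.
from rdkit import Chem

GROUP_SMARTS = {
    "-CH3":   "[CX4H3]",       # primary group: methyl
    "-CH2-":  "[CX4H2]",       # primary group: methylene
    "aCH":    "[cH]",          # primary group: aromatic CH
    "aC-CH2": "[c][CX4H2]",    # secondary group block: CH2 attached to an aromatic carbon
}

def count_groups(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    counts = {}
    for name, smarts in GROUP_SMARTS.items():
        patt = Chem.MolFromSmarts(smarts)
        counts[name] = len(mol.GetSubstructMatches(patt))
    return counts

# ethylbenzene: expect 1 x -CH3, 1 x -CH2-, 5 x aCH, 1 x aC-CH2
print(count_groups("CCc1ccccc1"))
```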
First, the primary and secondary groups are screened from the group library according to the molecular structure of the target component; global optimization algorithms such as simulated annealing and genetic algorithms can be used for this screening. Tertiary groups are then added to the screened feature set. However, the added tertiary groups inevitably overlap with the primary and secondary groups, making the feature set redundant. Therefore, combining information theory and machine learning, the concept of minimum correlation and maximum information is introduced: each added tertiary group must keep the minimum correlation with the existing lower-level groups while carrying the maximum amount of information about the property to be predicted, i.e. it must represent that property as fully as possible.
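One way the minimum-correlation / maximum-information screening could be realized is sketched below: each candidate descriptor is scored by its mutual information with the target property minus its mean absolute correlation with the groups already kept. The greedy scheme, the specific score, and the scikit-learn call are assumptions for illustration, not the exact criterion claimed.

```python
# Sketch: greedy minimum-redundancy / maximum-relevance screening of tertiary descriptors.
# X_low: primary/secondary group counts (n_samples x n_low), X_desc: candidate descriptors,
# y: property to predict (e.g. octane number). All inputs are assumed, for illustration only.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def screen_tertiary(X_low, X_desc, y, n_select=5):
    """Pick descriptors with high information about y and low correlation with kept columns."""
    relevance = mutual_info_regression(X_desc, y)              # information content w.r.t. y
    kept_cols = [X_low[:, k] for k in range(X_low.shape[1])]   # columns a pick must not duplicate
    selected = []
    for _ in range(n_select):
        best_j, best_score = None, -np.inf
        for j in range(X_desc.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X_desc[:, j], c)[0, 1]) for c in kept_cols])
            score = relevance[j] - redundancy                  # max information, min correlation
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        kept_cols.append(X_desc[:, best_j])
    return selected                                            # indices of the chosen descriptors
```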
Bayes' rule is then introduced for feature selection, to calculate the posterior probability of each candidate model. A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, which can be computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m). In most cases this integral has no analytical solution, so its value is approximated by Markov chain Monte Carlo (MCMC) random sampling over the space in which (m, β_m) lies.

Feature selection is a sub-problem of model selection: the candidate models are represented through a binomial distribution over the features, where p is the total number of features. This yields, for the model represented by each feature subset, its posterior distribution probability based on the known data set Y, thereby realizing a "soft" constraint. The core of this Bayes-rule feature selection method is the MCMC sampling procedure.
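The sketch below illustrates the flavor of this step: candidate models are binary inclusion vectors over the p features, and a Metropolis-Hastings walk estimates each visited model's posterior probability. The uniform model prior and the BIC-based approximation to the marginal likelihood f(y|m) are stand-ins of mine for the integral above; the patent itself only specifies MCMC sampling over the (m, β_m) space.

```python
# Sketch: Metropolis-Hastings sampling over feature-inclusion vectors m in {0,1}^p.
# The BIC approximation to f(y|m) and the implicit uniform prior f(m) are illustrative.
import numpy as np
from collections import Counter

def log_marginal_bic(X, y, mask):
    """Approximate log f(y|m) of a linear model on the masked features via -BIC/2."""
    if mask.sum() == 0:
        resid, k = y - y.mean(), 1
    else:
        Xm = X[:, mask.astype(bool)]
        beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
        resid, k = y - Xm @ beta, int(mask.sum()) + 1
    n = len(y)
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * k * np.log(n)

def mcmc_feature_selection(X, y, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    mask = rng.integers(0, 2, size=p)            # random starting model
    current = log_marginal_bic(X, y, mask)
    visits = Counter()
    for _ in range(n_steps):
        j = rng.integers(p)                      # propose flipping one feature in or out
        proposal = mask.copy()
        proposal[j] ^= 1
        cand = log_marginal_bic(X, y, proposal)
        if np.log(rng.random()) < cand - current:  # accept with probability min(1, ratio)
            mask, current = proposal, cand
        visits[tuple(mask)] += 1
    total = sum(visits.values())
    return {m: c / total for m, c in visits.items()}  # empirical posterior per visited model
```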
Selecting group fragments on a mechanistic basis and combining them with component descriptors whose values do not depend on regressed coefficients reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time the method gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited.
S3: Perform hierarchical group modeling and solve for the coefficients.
The procedure for hierarchical group modeling and coefficient solution is shown in Figure 4 and can be divided into two parts: data preprocessing, and modeling and verification. Because the existing data sets of component refining properties are very sparse, advanced statistical and machine-learning methods have to be introduced at the data preprocessing stage to improve the accuracy of modeling on small samples.
The distributions of the component feature values and of the experimental values in the database rarely satisfy the requirement of normality, which degrades the model during modeling; they therefore need to be transformed toward normality by a probabilistic and statistical method, namely the Box-Cox log-likelihood method. Because the feature space is sparse, a randomly selected training set can hardly cover the feature space of the test set, so a model built on such a training set extrapolates too far and its predictive performance drops. The second step therefore uses an unsupervised learning method: focusing only on the feature set of the data, without evaluating the modeling result, cluster analysis is performed directly on the data set and the sparse holes in its feature space are approximately located. A training set selected on this basis covers the feature space of the test-set samples to the greatest possible extent and improves the predictive performance of the model.
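A minimal preprocessing sketch along these lines, assuming scipy and scikit-learn are available: Box-Cox transforms each column toward normality, and k-means cluster centres pick a training set spread over the feature space. The library choices, the positivity shift, and the one-sample-per-cluster rule are assumptions, not specified in the text.

```python
# Sketch: Box-Cox normalization followed by cluster-based training-set selection.
# Columns are assumed non-constant; k-means and one representative per cluster are illustrative.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def boxcox_normalize(X):
    """Apply Box-Cox column-wise; values are shifted to be strictly positive first."""
    Xt = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j] - X[:, j].min() + 1.0      # Box-Cox requires positive data
        Xt[:, j], _ = stats.boxcox(col)
    return Xt

def select_training_indices(X, n_train):
    """Pick the sample closest to each k-means centre so training covers the feature space."""
    km = KMeans(n_clusters=n_train, n_init=10, random_state=0).fit(X)
    picks = []
    for c in range(n_train):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])
    return sorted(picks)
```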
For the modeling of the hierarchical groups, the traditional linear accumulation function is preferred, because it is computationally light and yields contribution coefficients for the corresponding groups, which to some extent provides richer mechanistic information. Its form is

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure regresses them in sequence. C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; because f(Y*) is computed directly from the component descriptors, it needs no regression, which greatly reduces the required size of the training set. Finally, a unified regression yields the group coefficients δ, w and λ; these weights represent the influence of the group fragments of each level on the given property.
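The sequential solution can be sketched as below: regress the primary-group contributions C_i first, regress the secondary-group contributions D_j on what the primary level leaves unexplained, take f(Y*) directly from the descriptors, and finally fit the level weights δ, w, λ in one ordinary least-squares step. The residual-based staging is an assumption about how "regress in sequence" is realized.

```python
# Sketch: hierarchical solution of f(Y) = delta*sum(Ci*Ni) + w*sum(Dj*Mj) + lambda*f(Y*).
# N1, N2 are occurrence-count matrices of primary/secondary groups; fY_star is the descriptor
# contribution computed without regression. The residual-based staging is an assumption.
import numpy as np

def fit_hierarchical(N1, N2, fY_star, y):
    # Stage 1: primary-group contributions Ci from the training set.
    C, *_ = np.linalg.lstsq(N1, y, rcond=None)
    r1 = y - N1 @ C
    # Stage 2: secondary-group contributions Dj on the residual of the primary level.
    D, *_ = np.linalg.lstsq(N2, r1, rcond=None)
    # Stage 3: level coefficients delta, w, lambda in a single unified regression.
    Z = np.column_stack([N1 @ C, N2 @ D, fY_star])
    (delta, w, lam), *_ = np.linalg.lstsq(Z, y, rcond=None)
    return C, D, (delta, w, lam)

def predict(N1, N2, fY_star, C, D, coeffs):
    delta, w, lam = coeffs
    return delta * (N1 @ C) + w * (N2 @ D) + lam * fY_star
```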
S4: Perform the uncertainty analysis.
As shown in Figures 5 and 6, uncertainty analysis of the predicted values, i.e. estimation of confidence intervals, is crucial for the practical application of the model. Because the hierarchical group model has an explicit mathematical expression and also carries the probability distribution over the candidate models, applying Bayes' rule once more yields the confidence intervals of all candidate models; combining these with the accuracy of each model, and weighing accuracy against practicality, gives octane-number and cetane-number models better suited to the actual situation of the refinery.
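As a hedged illustration of how the candidate models' posterior probabilities and residual spreads could be combined into an interval, the sketch below forms a posterior-weighted mixture of Gaussian predictive distributions and reads off a 95% interval by sampling. The Gaussian form of each predictive distribution, the sampling shortcut, and the example numbers are assumptions of this sketch, not statements of the patent.

```python
# Sketch: 95% prediction interval from a posterior-weighted mixture of candidate models.
# Each candidate supplies a point prediction and a residual standard deviation; the
# Gaussian form of each predictive distribution is an illustrative assumption.
import numpy as np

def prediction_interval(preds, sigmas, post_probs, n_draws=20000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    preds, sigmas = np.asarray(preds, float), np.asarray(sigmas, float)
    w = np.asarray(post_probs, float)
    w = w / w.sum()
    which = rng.choice(len(preds), size=n_draws, p=w)   # pick a candidate model per draw
    draws = rng.normal(preds[which], sigmas[which])     # sample its predictive Gaussian
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(draws.mean()), (float(lo), float(hi))

# e.g. three candidate octane-number models for one component (illustrative numbers):
print(prediction_interval(preds=[92.1, 91.4, 93.0], sigmas=[1.2, 0.8, 1.5],
                          post_probs=[0.5, 0.3, 0.2]))
```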
It should be noted that those skilled in the art can, in the course of the prediction calculations, make appropriate adaptations and corresponding parameter settings according to the relevant principles. The embodiment described above expresses only one implementation of the present invention; although its description is relatively specific and detailed, it should not be understood as limiting the scope of the invention patent.
A specific embodiment of the present invention is as follows.
Step 10: Using SMILES, an encodable simplified component representation, express complex component structures as two-dimensional codes and build a predefined group fragment library comprising primary groups, secondary groups and tertiary groups. The primary groups are the basic groups contained in a component structure; the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors describing the topological structure of the component.

Step 20: According to the molecular structure of the target component, screen primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component that is to be predicted, screen multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combine any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screen these to obtain the component feature set with the largest posterior probability.

Step 30: Combine the groups of the different levels with a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model.

Step 40: Generate multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screen out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
In step 20, screening for the component feature set with the largest posterior probability specifically includes the following.
A single model m belongs to the candidate model set M. Each model obeys the distribution of the known data set Y, f(y|m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Let the prior probability of model m be f(m); the posterior probability is then

f(m|y) = f(y|m)·f(m) / Σ_{m'∈M} f(y|m')·f(m')

where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, β_m) f(β_m|m) dβ_m and f(β_m|m); its value is approximated by Markov chain Monte Carlo random sampling over the space in which (m, β_m) lies, the candidate models being represented through a binomial distribution over the features, where p is the total number of features.
Step 30 includes a data preprocessing procedure and a modeling and verification procedure.

The data preprocessing procedure is: transform the data set toward a normal distribution by a probabilistic and statistical method, then use an unsupervised learning method to perform cluster analysis directly on the data set, approximately locating the sparse holes in its feature space, and thereby obtain the training set.
In the modeling and verification procedure, the hierarchical groups are modeled with a linear accumulation function of the form

f(Y) = δ·Σ_i C_i·N_i + w·Σ_j D_j·M_j + λ·f(Y*)

where f(Y) is a function of the property to be predicted; C_i is the contribution of the i-th primary group, N_i its number of occurrences and δ the primary-group coefficient; D_j is the contribution of the j-th secondary group, M_j its number of occurrences and w the secondary-group coefficient; λ is the tertiary (component descriptor) group coefficient and f(Y*) the total contribution of the tertiary descriptors to the given property.

When calculating the level coefficients δ, w, λ and the group contributions C_i and D_j, a hierarchical procedure regresses them in sequence: C_i is obtained by regression on the training set, the secondary-group contributions D_j are then obtained by regression, and f(Y*) is computed directly from the component descriptors without any regression. Finally, a unified regression yields the group coefficients δ, w and λ, i.e. the weights, which represent the influence of the group fragments of each level on the given property.
The present invention selects group fragments on a mechanistic basis and combines them with component descriptors whose values do not depend on regressed coefficients, which reduces the number of coefficients that must be obtained by regression and considerably lowers the dependence on data-set size; at the same time it gives the posterior distribution probability of each feature-subset model, realizing a "soft" constraint, which makes it suitable for predicting the refining properties of pure components when the amount of data is limited. On this basis, Bayes' rule is introduced again so that a posterior probability distribution can be estimated for the final model, avoiding the risk of overfitting in the final prediction model.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the specific embodiments described are only illustrative and are not intended to limit the scope of the present invention; equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall all fall within the scope protected by the claims of the present invention.

Claims (3)

  1. A method for predicting the refining properties of pure components based on hierarchical group construction, characterized by comprising:
    Step 10: using SMILES, an encodable simplified component representation, expressing complex component structures as two-dimensional codes and building a predefined group fragment library comprising primary groups, secondary groups and tertiary groups, wherein the primary groups are the basic groups contained in a component structure, the secondary groups are combinations of basic groups at their linking positions, used to distinguish aromatics from paraffins and the corresponding isomers, and the tertiary groups are descriptors describing the topological structure of the component;
    Step 20: according to the molecular structure of the target component, screening primary groups and secondary groups out of the predefined group fragment library; then, according to the refining property of the target component to be predicted, screening multiple tertiary groups on the principle of keeping minimum correlation with the selected primary and secondary groups while keeping maximum information content with respect to the property to be predicted; randomly combining any number of these tertiary groups with the selected primary and secondary groups to form multiple component feature sets, and screening these to obtain the component feature set with the largest posterior probability;
    Step 30: combining the groups of the different levels with a linear accumulation function for modeling and then solving for the coefficients on the training set to obtain a hierarchical group model;
    Step 40: generating multiple candidate models from the component feature set with the largest posterior probability; based on the hierarchical group model, applying Bayes' rule again to obtain the confidence intervals of all candidate models and, combining these with the accuracy of each candidate model, screening out octane-number and cetane-number models suited to the actual conditions of the refinery according to the principle of multi-objective optimization.
  2. The method according to claim 1, wherein in step 20, screening out the component feature set with the largest posterior probability specifically comprises:
    A single model m belongs to the candidate model set M, and each model obeys a distribution over the known data set Y, f(y|m, βm), where the parameter vector βm ∈ Bm and Bm is the set of possible values of the coefficients of model m; letting the prior probability of model m be f(m), the posterior probability is:

    f(m|y) = f(y|m) f(m) / Σm'∈M f(y|m') f(m')
    where f(y|m) is the marginal likelihood, computed from f(y|m) = ∫ f(y|m, βm) f(βm|m) dβm together with f(βm|m); its value is approximated by Markov-chain Monte Carlo random sampling, the sampling range being the space in which (m, βm) lies.
    In this estimate, p denotes the total number of features.
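For illustration only (not part of the claims), the sketch below approximates the marginal term f(y|m) by plain Monte Carlo sampling of βm from its prior and then normalizes f(y|m)f(m) over the candidate set under a uniform model prior. This is a simplification of the Markov-chain Monte Carlo sampling recited in claim 2, and the priors, noise level and candidate models are illustrative assumptions.

```python
# A minimal sketch of model-posterior computation via Monte Carlo estimation
# of the marginal likelihood f(y|m) = ∫ f(y|m,beta) f(beta|m) dbeta.
import numpy as np

def log_marginal_likelihood(X, y, n_samples=5000, prior_scale=2.0, noise_sd=1.0, seed=0):
    """Estimate log f(y|m) by averaging the likelihood over beta drawn from its prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = rng.normal(0.0, prior_scale, size=(n_samples, p))       # beta ~ f(beta|m)
    resid = y[None, :] - betas @ X.T                                  # residuals per sample
    log_lik = -0.5 * np.sum(resid**2, axis=1) / noise_sd**2 \
              - n * np.log(noise_sd * np.sqrt(2 * np.pi))             # Gaussian log-likelihoods
    return np.logaddexp.reduce(log_lik) - np.log(n_samples)           # log of the mean likelihood

def model_posteriors(candidate_models, X_full, y):
    """Posterior f(m|y) ∝ f(y|m) f(m) with a uniform prior f(m) over candidates."""
    log_ml = np.array([log_marginal_likelihood(X_full[:, cols], y) for cols in candidate_models])
    w = np.exp(log_ml - log_ml.max())
    return w / w.sum()

rng = np.random.default_rng(1)
X_full = rng.normal(size=(40, 4))
y = X_full[:, :2] @ np.array([1.0, -2.0]) + rng.normal(0, 1.0, 40)
models = [[0, 1], [0, 1, 2], [2, 3]]          # candidate feature subsets (component feature sets)
print(model_posteriors(models, X_full, y))    # the subset that generated y should dominate
```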
  3. The method according to claim 1, wherein step 30 comprises a data preprocessing process and a modeling and validation process;
    In the data preprocessing process, the data set is transformed toward normality by probability-statistical methods; an unsupervised learning method is then applied directly to the data set for cluster analysis, the sparse voids in the feature space of the data set are approximately estimated, and the training set is obtained;
    In the modeling and validation process, the hierarchical groups are modeled with a linear accumulation function of the form:

    f(Y) = δ Σi Ni Ci + w Σj Mj Dj + λ f(Y*)

    where f(Y) is the function of the property to be predicted; Ci is the contribution of the i-th group among the first-level groups, Ni is the number of occurrences of group i, and δ is the first-level group coefficient; w is the second-level group coefficient, Dj is the contribution of group j among the second-level groups, and Mj is its number of occurrences; λ is the coefficient of the third-level component descriptors, and f(Y*) is the total contribution of the third-level descriptors to the given property;
    When computing the hierarchical group coefficients δ, w, λ and the group contributions Ci, Dj, a hierarchical approach is used to regress them in sequence: Ci is obtained by regression on the training set, after which the second-level group contributions Dj are obtained by regression; f(Y*) is calculated from the component descriptors and requires no regression; finally, the group coefficients δ, w, λ, that is, the magnitudes of the weights, are obtained by a unified regression and represent the influence of the group fragments at each level on the given property.
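For illustration only (not part of the claims), the following sketch mirrors the sequential regression of claim 3: the first-level contributions Ci are fitted on the training set, the second-level contributions Dj are fitted on the remaining residual, the descriptor term f(Y*) is taken as given without regression, and the level weights δ, w, λ are then obtained in a final unified regression. The use of ordinary least squares and the synthetic data are illustrative assumptions.

```python
# A minimal sketch of hierarchical (sequential) regression for
# f(Y) = delta * sum_i N_i C_i + w * sum_j M_j D_j + lambda * f(Y*).
import numpy as np

def fit_hierarchical_model(N, M, fY_star, y):
    """N: (n, p1) first-level counts, M: (n, p2) second-level counts,
    fY_star: (n,) descriptor term (no regression needed), y: (n,) targets."""
    # 1) first-level contributions C_i from the training set
    C, *_ = np.linalg.lstsq(N, y, rcond=None)
    # 2) second-level contributions D_j from what the first level leaves unexplained
    D, *_ = np.linalg.lstsq(M, y - N @ C, rcond=None)
    # 3) unified regression of the level weights delta, w, lambda
    level_terms = np.column_stack([N @ C, M @ D, fY_star])
    delta, w, lam = np.linalg.lstsq(level_terms, y, rcond=None)[0]
    return C, D, (delta, w, lam)

def predict(N, M, fY_star, C, D, coeffs):
    delta, w, lam = coeffs
    return delta * (N @ C) + w * (M @ D) + lam * fY_star

rng = np.random.default_rng(2)
N = rng.integers(0, 5, size=(50, 6)).astype(float)       # first-level group counts
M = rng.integers(0, 3, size=(50, 4)).astype(float)       # second-level group counts
fY_star = rng.normal(size=50)                             # third-level descriptor term
y = N @ rng.normal(size=6) + 0.5 * (M @ rng.normal(size=4)) + 0.3 * fY_star

C, D, coeffs = fit_hierarchical_model(N, M, fY_star, y)
print("level weights (delta, w, lambda):", np.round(coeffs, 3))
print("first prediction vs target:", round(predict(N, M, fY_star, C, D, coeffs)[0], 3), round(y[0], 3))
```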
PCT/CN2023/085001 2022-04-19 2023-03-30 Hierarchical group construction-based method for predicting pure component refining properties WO2023202345A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210411133.2A CN114708930A (en) 2022-04-19 2022-04-19 Method for predicting refining property of pure components based on hierarchical group construction
CN202210411133.2 2022-04-19

Publications (1)

Publication Number Publication Date
WO2023202345A1 true WO2023202345A1 (en) 2023-10-26

Family

ID=82174562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085001 WO2023202345A1 (en) 2022-04-19 2023-03-30 Hierarchical group construction-based method for predicting pure component refining properties

Country Status (2)

Country Link
CN (1) CN114708930A (en)
WO (1) WO2023202345A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708930A (en) * 2022-04-19 2022-07-05 泉州装备制造研究所 Method for predicting refining property of pure components based on hierarchical group construction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899795A (en) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Molecular-level oil refining processing full-flow optimization method, device and system and storage medium
CN111899793A (en) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Real-time optimization method, device and system of molecular-level device and storage medium
WO2021234065A1 (en) * 2020-05-22 2021-11-25 Basf Coatings Gmbh Prediction of properties of a chemical mixture
CN113707240A (en) * 2021-07-30 2021-11-26 浙江大学 Component parameter robust soft measurement method based on semi-supervised nonlinear variational Bayes mixed model
CN114708930A (en) * 2022-04-19 2022-07-05 泉州装备制造研究所 Method for predicting refining property of pure components based on hierarchical group construction

Also Published As

Publication number Publication date
CN114708930A (en) 2022-07-05

Similar Documents

Publication Publication Date Title
Xu et al. Small data machine learning in materials science
Schaid Genomic similarity and kernel methods II: methods for genomic information
WO2023040512A1 (en) Catalytic cracking unit simulation and prediction method based on molecular-level mechanism model and big data technology
Wang et al. A two‐layer ensemble learning framework for data‐driven soft sensor of the diesel attributes in an industrial hydrocracking process
Pyl et al. Molecular reconstruction of complex hydrocarbon mixtures: An application of principal component analysis
Song et al. Modeling the hydrocracking process with deep neural networks
WO2023202345A1 (en) Hierarchical group construction-based method for predicting pure component refining properties
Tan et al. Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system
Castro et al. Significant motifs in time series
CN116802741A (en) Inverse synthesis system and method
Ma et al. MIDIA: exploring denoising autoencoders for missing data imputation
Chen et al. Adaptive modeling strategy integrating feature selection and random forest for fluid catalytic cracking processes
Mei et al. Molecular-based bayesian regression model of petroleum fractions
WO2023129955A1 (en) Inter-model prediction score recalibration
Wang et al. Layer-wise residual-guided feature learning with deep learning networks for industrial quality prediction
Capel et al. ProteinGLUE multi-task benchmark suite for self-supervised protein modeling
CN109801681B (en) SNP (Single nucleotide polymorphism) selection method based on improved fuzzy clustering algorithm
Bondugula et al. MUPRED: a tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction
Manfredi et al. ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences
Luo et al. Developing soft sensors using hybrid soft computing methodology: a neurofuzzy system based on rough set theory and genetic algorithms
Yang et al. A carbon price hybrid forecasting model based on data multi-scale decomposition and machine learning
Zhou et al. TransVAE-DTA: Transformer and variational autoencoder network for drug-target binding affinity prediction
Guan et al. Dual‐objective optimization for petroleum molecular reconstruction based on property and composition similarities
Nguyen et al. Evaluating causal‐based feature selection for fuel property prediction models
Shi et al. Interpretable reconstruction of naphtha components using property-based extreme gradient boosting and compositional-weighted Shapley additive explanation values

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791003

Country of ref document: EP

Kind code of ref document: A1