CN117332676A

CN117332676A - Fatigue performance prediction method based on self-adaptive feature selection

Info

Publication number: CN117332676A
Application number: CN202311164299.XA
Authority: CN
Inventors: 武川; 姚磊; 黎振; 蔡玉俊; 王琳宁; 王浩
Original assignee: Tianjin University of Technology and Education China Vocational Training Instructor Training Center
Current assignee: Tianjin University of Technology and Education China Vocational Training Instructor Training Center
Priority date: 2023-09-11
Filing date: 2023-09-11
Publication date: 2024-01-02

Abstract

The invention relates to a fatigue performance prediction method based on adaptive feature selection, comprising the following steps: S1, collecting material-related data and dividing the data set into a training set and a test set; S2, combining the correlations computed on the training set and on the test set into feature weights, and taking the sensitive features selected by those weights as sample data; S3, training a support vector machine regression model on the sample data and evaluating the trained model on a sample test set; S4, training an ensemble regression model on the combined training set and evaluating the trained ensemble on the combined test set. The method helps identify the features most relevant to fatigue performance prediction, which can improve the accuracy of the prediction model; by selecting informative features and eliminating irrelevant ones, the model concentrates on the key factors that influence fatigue behavior.

Description

A fatigue performance prediction method based on adaptive feature selection

Technical field

The invention belongs to the technical field of material fatigue performance evaluation, and specifically relates to a fatigue performance prediction method based on adaptive feature selection.

Background art

In modern industrial production, the fatigue performance of materials and parts is a very important physical quantity that plays a vital role in ensuring product quality, service life and safety. In the past, fatigue performance was usually predicted from engineering experience and test data, an approach that is highly dependent on experience, has low reliability and is costly. In recent years, the widespread application of machine learning methods has brought new opportunities for fatigue performance prediction.

However, several technical problems remain when machine learning is applied to fatigue performance prediction: how to preprocess the input data and reduce noise and errors in it; how to design and select appropriate feature extraction and processing techniques that capture the important features of materials or parts; how to select and optimize appropriate machine learning algorithms to improve the accuracy, precision and robustness of the prediction; and how to evaluate and validate the prediction model to determine whether its quality and performance meet requirements.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art and provide a fatigue performance prediction method based on adaptive feature selection.

The invention solves its technical problem through the following technical solution:

A fatigue performance prediction method based on adaptive feature selection, characterized in that the prediction method comprises the following steps:

S1. Collect material-related data, normalize the feature data to form a data set, and divide the data set into a training set and a test set;

S2. Use mutual information to extract the correlation between each feature and the target variable; combine the correlations of the training set and the test set by adaptive weighting to obtain feature weights; use the feature weights as the input of adaptive feature selection to screen out sensitive features; and take the sensitive features as sample data;

S3. Divide the sample data into a sample training set and a sample test set, train a support vector machine regression model on the sample training set, input the sample test set into the trained model, and evaluate model performance with the coefficient of determination $R^2$;

S4. Combine non-sensitive features with the sensitive features by enumeration, divide the combined data into a combined training set and a combined test set, use BaggingRegressor to build an SVR-based ensemble regression model, train the ensemble on the combined training set, input the combined test set into the trained ensemble, and evaluate model performance with the coefficient of determination $R^2$.
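The enumeration in step S4 can be pictured as iterating over every subset of the non-sensitive features and appending it to the fixed set of sensitive features. The following Python sketch illustrates that reading; the function name `enumerate_feature_sets` and the feature names in the example are hypothetical and are not taken from the source.

```python
from itertools import combinations


def enumerate_feature_sets(sensitive, non_sensitive):
    """Yield candidate feature lists: all sensitive features plus one
    subset of the non-sensitive features (the empty subset included)."""
    for size in range(len(non_sensitive) + 1):
        for extra in combinations(non_sensitive, size):
            yield list(sensitive) + list(extra)


# Hypothetical feature names, for illustration only.
candidates = list(enumerate_feature_sets(["C", "Cr"], ["P", "S", "Cu"]))
print(len(candidates))  # one candidate per subset of the three extra features
```

The number of candidates doubles with each additional non-sensitive feature, so exhaustive enumeration is practical only when the feature selection step leaves a modest number of non-sensitive features.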

Moreover, the specific steps of S1 are:

The collected material-related data comprise chemical composition, process parameters, upstream processing features and the corresponding fatigue strength. The resulting data set is expressed as $F = \{f_{i,j}, y_i;\ i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k\}$,

where: $n$ is the number of samples;

$k$ is the number of features;

$f_{i,j}$ is the $j$-th feature of the $i$-th sample;

$y_i$ is the fatigue strength of the $i$-th sample;

The features of the data are normalized, and the normalized data set is divided into a training set and a test set;

The normalization is expressed as:

where: $f'_{i,j}$ is the normalized value of the $j$-th feature of the $i$-th sample;

$x_j$ denotes all values of the $j$-th feature.
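The normalization formula itself is not reproduced in this text. A common choice that matches the symbols defined above is min-max scaling, $f'_{i,j} = (f_{i,j} - \min x_j) / (\max x_j - \min x_j)$. The Python sketch below assumes that form; the function name `min_max_normalize` and the sample values are illustrative only.

```python
def min_max_normalize(rows):
    """Scale each column of `rows` (a list of equal-length lists) to [0, 1].

    Assumes min-max scaling; the source does not state the formula.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Two made-up features per sample: a composition value and a temperature.
data = [[0.20, 850.0], [0.40, 900.0], [0.30, 875.0]]
print(min_max_normalize(data))
```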

Moreover, the specific steps of S2 are:

Mutual information is used to extract the correlation between the feature variables and the fatigue strength in the training set and in the test set. Mutual information measures the degree of association between two random variables and is not limited to linear relationships. It is an information-theoretic quantity that describes the statistical dependence between two variables, that is, how much information they share. Its calculation involves the joint probability distribution of the two random variables and their respective marginal distributions. For continuous random variables it is calculated as:

$$I(f_j; Y) = \iint p(f_j, Y)\,\log\frac{p(f_j, Y)}{p(f_j)\,p(Y)}\,\mathrm{d}f_j\,\mathrm{d}Y$$

where: $p(f_j, Y)$ is the joint probability density function of $f_j$ and $Y$;

$p(f_j)$ and $p(Y)$ are the marginal probability density functions of $f_j$ and $Y$, respectively;

$I(f_j; Y)$ is the mutual information of $f_j$ and $Y$, which measures the correlation between the two random variables $f_j$ and $Y$;
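As an illustration of how a mutual information value might be estimated from finite samples, the following Python sketch uses a simple histogram (plug-in) estimator. The source does not state which estimator is used; the function name `mutual_information` and the `bins` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=4):
    """Histogram (plug-in) estimate of I(X;Y) in nats for two equal-length sequences."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint = Counter(zip(bx, by))      # counts of joint bin occupancy
    px, py = Counter(bx), Counter(by)  # counts of marginal bin occupancy
    return sum(
        (c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in joint.items()
    )


feature = [0, 1, 2, 3, 4, 5, 6, 7]
print(mutual_information(feature, feature))               # identical variables
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1], 2))  # independent pattern
```

Histogram estimates depend strongly on the bin count and are biased for small samples. Nearest-neighbour estimators, such as the one behind scikit-learn's `mutual_info_regression`, are a common alternative for continuous features.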

The correlations of the training set and the test set are combined by adaptive weighting to form the feature weights:

$$W_j = \alpha I_1(f_j; Y) + \beta I_2(f_j; Y)$$

where: $W_j$ is the feature weight;

$I_1(f_j; Y)$ is the correlation on the training set;

$I_2(f_j; Y)$ is the correlation on the test set;

$\alpha$ and $\beta$ are weight coefficients, with $\alpha + \beta = 1$;

The threshold $l_\alpha$ is calculated from the feature weights and the significance level. Features whose weight $W_j$ exceeds the threshold $l_\alpha$ are selected as sensitive features and used as sample data. The calculation formula is as follows:

where: $\mu$ is the mean of the combined weights;

$\sigma$ is the variance of the combined weights;

$\alpha$ is the significance level of the chi-square distribution, and $2\mu^2/\sigma$ is its number of degrees of freedom.
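The threshold formula is not reproduced in this text. The stated degrees of freedom, $2\mu^2/\sigma$, match the moment-matched scaled chi-square approximation $l = \frac{\sigma}{2\mu}\,\chi^2_{h}$ with $h = 2\mu^2/\sigma$, so the Python sketch below assumes that form. It also assumes that the quantile is taken at one minus the significance level and that the population variance is meant. The chi-square quantile is computed with the Wilson-Hilferty approximation rather than an exact inverse CDF (`scipy.stats.chi2.ppf` would be the exact alternative). The names `chi2_quantile` and `select_sensitive` are hypothetical, and the weight coefficient is called `a` here to keep it apart from the significance level.

```python
from statistics import NormalDist, mean, pvariance


def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile at probability p."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def select_sensitive(mi_train, mi_test, a=0.65, significance=0.01):
    """Combine W_j = a*I1 + (1-a)*I2 and keep the features above the threshold."""
    weights = [a * i1 + (1.0 - a) * i2 for i1, i2 in zip(mi_train, mi_test)]
    mu, var = mean(weights), pvariance(weights)
    dof = 2.0 * mu ** 2 / var  # degrees of freedom as stated in the text
    # Assumed form of the threshold: (variance / (2 * mean)) * chi-square quantile.
    threshold = var / (2.0 * mu) * chi2_quantile(1.0 - significance, dof)
    selected = [j for j, w in enumerate(weights) if w > threshold]
    return weights, threshold, selected


# Made-up mutual information values for five features.
mi_train = [0.90, 0.10, 0.12, 0.08, 0.11]
mi_test = [0.85, 0.12, 0.10, 0.09, 0.10]
weights, threshold, sensitive = select_sensitive(mi_train, mi_test)
print(threshold, sensitive)
```

Under these assumptions a small significance level gives a high threshold, so only features whose weight stands far above the mean are kept.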

The advantages and beneficial effects of the present invention are:

1. The fatigue performance prediction method based on adaptive feature selection helps identify the features most relevant to fatigue performance prediction, which can improve the accuracy of the prediction model. By selecting informative features and eliminating irrelevant ones, the model concentrates on the key factors that influence fatigue behavior.

2. The method uses data to identify important features, allowing the model to adapt to and learn from the available information. When dealing with complex fatigue mechanisms that traditional models cannot fully capture, this data-driven approach offers higher accuracy and adaptability. Enumerating the non-sensitive features takes the interactions and mutual influence between features into account, giving a more complete picture of each feature's contribution to the target variable.

Description of drawings

Figure 1 is a flow chart of the present invention;

Figure 2 is a schematic diagram of feature screening in the present invention;

Figure 3 compares the prediction results for the original data and for the sensitive features;

Figure 4 is a schematic diagram of the prediction results for the combined features.

Detailed description

The present invention is described in further detail below through a specific embodiment. The embodiment is descriptive only, not restrictive, and does not limit the scope of protection of the invention.

As shown in Figure 1, a fatigue performance prediction method based on adaptive feature selection comprises the following steps:

Step S1: The collected material data comprise chemical composition, process parameters, upstream processing features and the corresponding fatigue strength. The data set is expressed as $F = \{f_{i,j}, y_i;\ i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k\}$, where $n$ is the number of samples, here 437, and $k$ is the number of features, here 25. The features comprise chemical composition variables (C, Si, Mn, P, S, Ni, Cr, Cu, Mo), process parameter variables (normalizing temperature, through-hardening temperature, through-hardening time, through-hardening cooling rate, carburizing temperature, carburizing time, diffusion temperature, diffusion time, quenching medium temperature, tempering temperature, tempering time, tempering cooling rate) and upstream processing variables (ingot-to-bar reduction ratio, area ratio of inclusions deformed by plastic work, area ratio of inclusions in discontinuous arrays, area ratio of isolated inclusions). $f_{i,j}$ is the $j$-th feature of the $i$-th sample, and $y_i$ is the rotating bending fatigue strength of the $i$-th sample.

The features of the data are normalized, and the normalized data set is randomly divided into an 80% training set and a 20% test set. The normalization is expressed as:

where: $f'_{i,j}$ is the normalized value of the $j$-th feature of the $i$-th sample;

$f_j$ denotes all values of the $j$-th feature column.
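A random 80/20 division of the 437 samples can be sketched in Python as follows. The seed, the rounding rule and the function name `split_80_20` are assumptions; the source only states the proportions, so the exact set sizes it used may differ by one sample.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of `samples` and return (train, test) in an 80/20 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Sample indices stand in for the 437 rows of the data set.
train, test = split_80_20(range(437))
print(len(train), len(test))
```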

Step S2: Mutual information is used to evaluate the feature weight between each feature variable and the fatigue strength. For continuous random variables it is calculated as:

$$I(f_j; Y) = \iint p(f_j, Y)\,\log\frac{p(f_j, Y)}{p(f_j)\,p(Y)}\,\mathrm{d}f_j\,\mathrm{d}Y$$

The correlations of the training set and the test set are then combined with adaptively assigned weight coefficients to compute the feature weight $W_j$:

$$W_j = \alpha I_1(f_j; Y) + \beta I_2(f_j; Y)$$

where: $W_j$ is the feature weight;

$\alpha$ and $\beta$ are weight coefficients, with $\alpha + \beta = 1$;

The threshold $l_\alpha$ is calculated from the feature weights and the significance level, which is set to 0.01. Features whose weight $W_j$ exceeds the threshold $l_\alpha$ are selected as sensitive features and used as sample data, as shown in Figure 2. The calculation formula is as follows:

where $\mu$ is the mean of the combined weights, $\sigma$ is the variance of the combined weights, $\alpha$ is the significance level of the chi-square distribution, and $2\mu^2/\sigma$ is its number of degrees of freedom.

Step S3: The sample data are divided into an 80% sample training set and a 20% sample test set. The sample training set is fed to the support vector regression model to fit its parameters, and the sample test set is then fed to the trained model, whose performance is evaluated with the coefficient of determination $R^2$. The optimal combination is obtained by iterating the model repeatedly.

$R^2$ is calculated as:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the true fatigue strength, $\hat{y}_i$ is the predicted fatigue strength, and $\bar{y}$ is the mean of the true fatigue strengths.
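The coefficient of determination can be computed directly from its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The short Python sketch below does this; the function name `coefficient_of_determination` is chosen for this example only.

```python
def coefficient_of_determination(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


print(coefficient_of_determination([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit
print(coefficient_of_determination([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # mean predictor
```

A perfect prediction gives 1, and always predicting the mean of the true values gives 0.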

The optimal weight coefficients are $\alpha = 0.65$ and $\beta = 0.35$. On the test set, the model trained on the original data reaches $R^2 = 0.974$ and the model trained on the sensitive features reaches $R^2 = 0.961$, as shown in Figure 3.

Although a large number of features are removed, the selected sensitive features retain good predictive ability. This indicates that these features are highly relevant and important for predicting the target variable, and it further supports the effectiveness of the adaptive feature selection.
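The text does not describe how the weight coefficients are iterated. One plausible reading is a grid search over the training-set coefficient, scoring each candidate by the $R^2$ of the resulting model. The Python sketch below assumes that reading; `search_alpha`, the step size and the callback `score_fn` (which would wrap feature selection, SVR training and evaluation) are hypothetical.

```python
def search_alpha(mi_train, mi_test, score_fn, step=0.05):
    """Try a = 0, step, ..., 1 with b = 1 - a and return (best_a, best_score).

    `score_fn` is a caller-supplied function mapping the list of feature
    weights to a score such as the test-set R^2 (an assumption of this sketch).
    """
    best_a, best_score = None, float("-inf")
    steps = int(round(1.0 / step))
    for k in range(steps + 1):
        a = k * step
        weights = [a * i1 + (1.0 - a) * i2 for i1, i2 in zip(mi_train, mi_test)]
        score = score_fn(weights)
        if score > best_score:
            best_a, best_score = a, score
    return best_a, best_score


# Toy scoring function that peaks when the single weight equals 0.65.
best_a, best_score = search_alpha([1.0], [0.0], lambda w: -abs(w[0] - 0.65))
print(round(best_a, 2))
```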

Step S4: Non-sensitive features are combined with the sensitive features by enumeration. BaggingRegressor draws random samples with replacement from the combined training set to build 10 SVR base models, and averages the predictions of the base models to obtain the final ensemble prediction. The combined test set is then fed to the trained ensemble model, and model performance is evaluated with the coefficient of determination $R^2$.
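A bagged SVR ensemble of this kind can be sketched with scikit-learn as follows. The data are synthetic stand-ins for the normalized combined features, the SVR hyperparameters are library defaults, and the random seeds are arbitrary; none of these values come from the source.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data: 200 samples, 5 features in [0, 1], noisy linear target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + 0.05 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# SVR is passed positionally because its keyword is `estimator` in recent
# scikit-learn releases and `base_estimator` in older ones.
model = BaggingRegressor(SVR(), n_estimators=10, bootstrap=True, random_state=0)
model.fit(X_train, y_train)  # each base SVR sees a bootstrap sample

score = r2_score(y_test, model.predict(X_test))  # predictions are averaged
print(f"R^2 on the held-out set: {score:.3f}")
```

Bagging mainly reduces the variance of the base model, so any gain over a single SVR depends on how sensitive the SVR is to the particular training sample.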

The optimal feature combination is shown in Figure 2. Combining sensitive and non-sensitive features appropriately captures higher-level feature interactions and patterns and gives a coefficient of determination $R^2 = 0.983$, as shown in Figure 4, an improvement over the $R^2$ values for the original data (0.974) and the sensitive features (0.961). These methods can help the model explain and predict the data better and improve its performance in practical applications.

Although the embodiment and drawings of the present invention have been disclosed for illustrative purposes, those skilled in the art will understand that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The scope of the invention is therefore not limited to what is disclosed in the embodiment and drawings.

Claims

1. A fatigue performance prediction method based on adaptive feature selection, characterized in that the prediction method comprises the following steps:

S1, collecting material-related data, normalizing the feature data to form a data set, and dividing the data set into a training set and a test set;

S2, extracting the correlation between the features and the target variable by using mutual information, combining the correlations of the training set and the test set by an adaptive weighting method to serve as feature weights, taking the feature weights as the input of adaptive feature selection, screening sensitive features, and taking the sensitive features as sample data;

S3, dividing the sample data into a sample training set and a sample test set, inputting the sample training set into a support vector machine regression model for training, inputting the sample test set into the trained support vector machine regression model, and using the coefficient of determination $R^2$ to evaluate model performance;

S4, combining the non-sensitive features and the sensitive features by an enumeration method, dividing the combined data into a combined training set and a combined test set, constructing an SVR-based ensemble regression model by using BaggingRegressor, inputting the combined training set into the ensemble regression model for training, inputting the combined test set into the trained ensemble regression model, and using the coefficient of determination $R^2$ to evaluate model performance.

2. The fatigue performance prediction method based on adaptive feature selection according to claim 1, wherein the specific steps of S1 are as follows:

the collected material-related data comprise chemical composition, process parameters, upstream processing features and the corresponding fatigue strength, and the resulting data set is denoted as $F = \{f_{i,j}, y_i;\ i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k\}$,

wherein: $n$ is the number of samples;

$k$ is the number of features;

$f_{i,j}$ is the $j$-th feature of the $i$-th sample;

$y_i$ is the fatigue strength of the $i$-th sample;

the features of the data are normalized, and the normalized data set is divided into a training set and a test set;

the normalization is expressed as:

wherein: $f'_{i,j}$ is the normalized value of the $j$-th feature of the $i$-th sample;

$x_j$ denotes all values of the $j$-th feature.

3. The fatigue performance prediction method based on adaptive feature selection according to claim 1, wherein the specific steps of S2 are as follows:

mutual information is used to extract the correlation between the feature variables and the fatigue strength in the training set and in the test set. Mutual information measures the degree of association between two random variables and is not limited to linear relationships. It is an information-theoretic quantity that describes the statistical dependence between two variables, that is, how much information they share. Its calculation involves the joint probability distribution of the two random variables and their respective marginal distributions. For continuous random variables it is calculated as:

$$I(f_j; Y) = \iint p(f_j, Y)\,\log\frac{p(f_j, Y)}{p(f_j)\,p(Y)}\,\mathrm{d}f_j\,\mathrm{d}Y$$

wherein: $p(f_j, Y)$ is the joint probability density function of $f_j$ and $Y$;

$p(f_j)$ and $p(Y)$ are the marginal probability density functions of $f_j$ and $Y$, respectively;

$I(f_j; Y)$ is the mutual information of $f_j$ and $Y$, which measures the correlation between the two random variables $f_j$ and $Y$;

the correlations of the training set and the test set are combined by an adaptive weighting method to form the feature weights, calculated as:

$$W_j = \alpha I_1(f_j; Y) + \beta I_2(f_j; Y)$$

wherein: $W_j$ is the feature weight;

$I_1(f_j; Y)$ is the correlation on the training set;

$I_2(f_j; Y)$ is the correlation on the test set;

$\alpha$ and $\beta$ are weight coefficients, with $\alpha + \beta = 1$;

the threshold $l_\alpha$ is calculated from the feature weights and the significance level, features whose weight $W_j$ exceeds the threshold $l_\alpha$ are selected as sensitive features and used as sample data, and the calculation formula is as follows:

wherein: $\mu$ is the mean of the combined weights;

$\sigma$ is the variance of the combined weights;

$\alpha$ is the significance level of the chi-square distribution, and $2\mu^2/\sigma$ is its number of degrees of freedom.