CN113222045B

CN113222045B - Semi-supervised fault classification method based on weighted feature alignment self-encoder

Info

Publication number: CN113222045B
Application number: CN202110575307.4A
Authority: CN
Inventors: 张新民; 张宏毅
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2022-06-24
Anticipated expiration: 2041-05-26
Also published as: CN113222045A

Abstract

The invention discloses a semi-supervised fault classification method based on a weighted feature alignment autoencoder. The method first uses labeled data to pre-train a stacked autoencoder for reconstruction and estimates the probability density distribution of the reconstruction error. Then, the weights of the unlabeled samples are calculated from the probability density function of the training-data reconstruction error. Next, a semi-supervised classification model based on the weighted feature alignment autoencoder is built from the labeled sample set, the unlabeled sample set, and the corresponding weights. The classification model uses a cross-entropy training loss function based on the weighted Sinkhorn distance. This loss lets the model use both labeled and unlabeled data in the fine-tuning stage, which enables deeper mining of the data and improves the generalization ability of the network model. In addition, the weighting strategy significantly improves the robustness of the model.

Description

A Semi-Supervised Fault Classification Method Based on Weighted Feature Alignment Autoencoder

Technical Field

The invention belongs to the field of industrial process control, and in particular relates to a semi-supervised fault classification method based on a weighted feature alignment autoencoder.

Background

Modern industrial processes are becoming larger and more complex. Ensuring the safety of the production process is one of the key problems in industrial process control. Fault diagnosis is a key technology for the safe operation of industrial processes and is important for improving product quality and production efficiency. Fault classification is one part of fault diagnosis: by learning from historical fault information, it identifies fault types automatically, helping production personnel locate and repair faults quickly and avoid further losses.

With the continuous progress of modern measurement techniques, industrial production processes have accumulated large amounts of data. These data describe the actual state of each production stage, provide a valuable resource for understanding, analyzing, and optimizing the manufacturing process, and are a source of intelligence for intelligent manufacturing. How to use the accumulated data to build data-driven analysis models that better serve intelligent decision-making and quality control is therefore a topic of considerable interest in industry. Data-driven fault classification methods use machine learning, deep learning, and related techniques to mine, model, and analyze industrial data and to provide data-driven fault diagnosis. Most existing data-driven fault classification methods are supervised, and they perform very well when sufficient labeled data are available. In some industrial scenarios, however, large amounts of labeled data are difficult to obtain, so there is often a large amount of unlabeled data and only a small amount of labeled data. To use the unlabeled data to improve classification performance, fault classification methods based on semi-supervised learning have gradually attracted attention.

However, most existing semi-supervised fault classification methods rely on certain assumptions about the data. Semi-supervised methods based on statistical learning, graph-based semi-supervised methods, and methods that assign labels to unlabeled data, such as co-training and self-training, all assume that labeled and unlabeled samples come from the same distribution. This assumption has limitations. Data collected from industrial processes often contain considerable noise and outliers, and operating conditions may drift. Labeled data have usually been screened and annotated manually by process experts, whereas unlabeled samples have not been screened, so the unlabeled data are likely to contain abnormal samples whose distribution differs from that of the labeled data. When the distributions of the unlabeled and labeled data are inconsistent, the performance of semi-supervised algorithms degrades and can even fall below that of supervised algorithms trained only on labeled data. There is therefore an urgent need for a robust semi-supervised learning method that classifies faults accurately even when the labeled and unlabeled data are distributed differently.

Summary of the Invention

In view of the deficiencies of the prior art, the purpose of the present invention is to provide a semi-supervised fault classification method based on a weighted feature alignment autoencoder. The method comprises the following steps:

Step 1: Collect normal operating condition data and various fault data from the industrial process to obtain the training data set for modeling: a labeled sample set containing m input-label pairs (x, y) and an unlabeled sample set containing n inputs x, where x denotes an input sample, y denotes a sample label, m is the number of labeled samples, and n is the number of unlabeled samples;

Step 2: Build a stacked autoencoder model for reconstruction and train it with the labeled sample set;

Step 3: Estimate the probability density distribution of the reconstruction error of the training data, calculate the weights of the unlabeled samples, and then build the weighted feature alignment autoencoder classification model;

Step 4: Collect field operating data, feed them into the weighted feature alignment autoencoder classification model, and output the corresponding fault category.

Further, Step 2 comprises the following sub-steps:

(2.1) Build a stacked autoencoder model for reconstruction, comprising a multi-layer encoder and decoder, whose output is a reconstruction of the input. In the model's formulas, x denotes the input, z_k denotes the features extracted at the k-th layer, k indexes the layers of the stacked autoencoder, and the encoder and the decoder each have their own weight vector and bias vector;

(2.2) Using the labeled samples from Step 1, train the stacked autoencoder model with the stochastic gradient descent algorithm. The training loss function is defined as the reconstruction error of the input, measured between the i-th labeled input sample and the stacked autoencoder's reconstruction of it;

(2.3) Use the trained stacked autoencoder model to compute the reconstruction errors E_l of the labeled samples, where the reconstruction error of a single sample is given by formula (4).
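Sub-steps (2.1) to (2.3) can be illustrated with a minimal sketch of the forward pass and the per-sample reconstruction error. The formulas themselves are not reproduced in this text, so the sigmoid activation, the squared-Euclidean error, and all function names below are assumptions made for illustration and not the patent's definitions. Training by stochastic gradient descent is omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One layer is a (weight matrix, bias vector) pair; the matrix has one row per output unit.
Layer = Tuple[List[List[float]], List[float]]


def sigmoid(v: float) -> float:
    """Logistic activation (an assumed choice; the source does not name one)."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(x: Sequence[float], layer: Layer,
          activation: Callable[[float], float]) -> List[float]:
    """Apply one fully connected layer: activation(W x + b)."""
    weights, bias = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def reconstruct(x: Sequence[float], encoder: Sequence[Layer],
                decoder: Sequence[Layer],
                activation: Callable[[float], float] = sigmoid) -> List[float]:
    """Run the input through every encoder layer, then every decoder layer."""
    z = list(x)
    for layer in encoder:   # z_1, z_2, ..., z_K: features extracted layer by layer
        z = dense(z, layer, activation)
    x_hat = z
    for layer in decoder:   # map the deepest features back to input space
        x_hat = dense(x_hat, layer, activation)
    return x_hat


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Squared Euclidean distance between a sample and its reconstruction (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

Applying `reconstruction_error` to every labeled sample would give the collection of errors that the text calls E_l.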

Further, Step 3 comprises the following sub-steps:

(3.1) Assuming that the reconstruction error E_l of the labeled samples follows the χ² distribution E_l ~ g·χ²_h, compute the distribution parameters g and h from

g·h = mean(E_l) (5)

2g²·h = variance(E_l) (6)
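Equations (5) and (6) can be solved in closed form: dividing (6) by (5) gives g = variance(E_l) / (2·mean(E_l)), and substituting back gives h = 2·mean(E_l)² / variance(E_l). The sketch below applies this moment matching. The function name is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the source does not say which is meant.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def fit_scaled_chi2(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (g, h) such that g*h equals the mean and 2*g^2*h equals the variance."""
    mu = fmean(errors)
    var = pvariance(errors, mu)  # population variance: an assumption
    if mu <= 0.0 or var <= 0.0:
        raise ValueError("errors must have positive mean and variance")
    g = var / (2.0 * mu)         # from (6) divided by (5)
    h = 2.0 * mu * mu / var      # from (5): h = mean / g
    return g, h
```

For example, errors with mean 3.2 and variance 2.96 give g = 0.4625 and h ≈ 6.92, and both moment equations are recovered when these values are substituted back.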

(3.2) Compute the reconstruction errors E_u of the unlabeled samples; the reconstruction error of a single sample is calculated in the same way as in formula (4);

(3.3) Compute the probability P_u that the reconstruction error E_u of the unlabeled samples occurs under the distribution of E_l, and normalize P_u to obtain the weights of the unlabeled samples;
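One possible reading of sub-step (3.3) is sketched below. It assumes that P_u is the density of the fitted distribution g·χ²_h evaluated at each unlabeled reconstruction error, and that normalization means dividing by the largest value so that the weights lie in [0, 1]. Neither choice is confirmed by the text, whose formulas are not reproduced here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def scaled_chi2_pdf(e: float, g: float, h: float) -> float:
    """Density of X = g * Y with Y ~ chi-square(h), evaluated at e."""
    if e <= 0.0:
        return 0.0
    y = e / g
    k = h / 2.0
    # log of y^(k-1) * exp(-y/2) / (2^k * Gamma(k)), plus the 1/g change-of-variable factor
    log_pdf = ((k - 1.0) * math.log(y) - y / 2.0
               - k * math.log(2.0) - math.lgamma(k) - math.log(g))
    return math.exp(log_pdf)


def unlabeled_weights(errors_u: Sequence[float], g: float, h: float) -> List[float]:
    """Weight each unlabeled sample by how plausible its error is under the labeled fit."""
    p_u = [scaled_chi2_pdf(e, g, h) for e in errors_u]
    p_max = max(p_u)
    if p_max == 0.0:
        return [0.0] * len(p_u)
    return [p / p_max for p in p_u]  # max-normalization: an assumption
```

Under these assumptions, an unlabeled sample whose reconstruction error lies far in the tail of the labeled-error distribution receives a weight close to zero.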

(3.4) Build the weighted feature alignment autoencoder classification model and train it with the labeled sample set, the unlabeled sample set, and the corresponding weights. Training consists of unsupervised pre-training followed by supervised fine-tuning. In the unsupervised pre-training stage, a stacked autoencoder is trained on the labeled and unlabeled samples together, in the same way as in steps (2.1) to (2.3). For supervised fine-tuning, a fully connected neural network layer is added on top of the pre-trained stacked autoencoder as the class output. This yields the deep extracted features and predicted class label of each labeled sample i, and the deep extracted features and predicted class label outputs of the unlabeled samples. In the corresponding formulas, {w_c, b_c} denote the weight vector and bias vector of the fully connected neural network layer;
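The class output layer described in (3.4) might be sketched as follows. The softmax activation is an assumption, since the source formula is not reproduced in this text, and the function names are hypothetical. The parameters `w_c` and `b_c` correspond to the {w_c, b_c} named above, with one row of `w_c` per class.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def classify(z: Sequence[float], w_c: Sequence[Sequence[float]],
             b_c: Sequence[float]) -> Tuple[List[float], int]:
    """Map deep features z to class probabilities and the most likely class index."""
    logits = [sum(w * zi for w, zi in zip(row, z)) + b
              for row, b in zip(w_c, b_c)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label
```

The same function would serve at inference time in Step 4, applied to the deep features of a newly collected sample.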

(3.7) Let the number of classes be F. For each class f ∈ F, obtain the deep extracted features of the labeled samples and of the unlabeled samples belonging to that class, together with the weights of those unlabeled samples;

(3.8) Compute the training loss function of the weighted feature alignment autoencoder classification model. The loss combines a cross-entropy term, a weighted Sinkhorn distance term weighted by α, and an L₂ regularization penalty on the network parameters weighted by β. Here crossentropy denotes the cross-entropy loss function. The weighted Sinkhorn distance measures the distance between the feature distributions of labeled and unlabeled data belonging to the same class, and it also down-weights abnormal unlabeled samples with large reconstruction errors. For class f, p_ij denotes the transition probability from the feature of labeled sample i to the feature of unlabeled sample j, d_ij denotes the distance between those two features, each unlabeled sample j has its own weight, and m_f and n_f denote the numbers of labeled and unlabeled samples of class f, respectively.
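A weighted Sinkhorn distance of the kind described in (3.8) can be sketched as follows for one class f. This is an illustration under stated assumptions, not the patent's formula, which is not reproduced in this text: each labeled feature is given mass 1/m_f, each unlabeled feature is given mass proportional to its weight, d_ij is the squared Euclidean distance, and the entropic regularization strength `eps` and the iteration count are hypothetical parameters. The additive form of `total_loss` follows the description of α and β above; summing the Sinkhorn term over classes is also an assumption.

```python
import math
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_sinkhorn(features_l: Sequence[Sequence[float]],
                      features_u: Sequence[Sequence[float]],
                      weights_u: Sequence[float],
                      eps: float = 0.5,
                      n_iter: int = 200) -> Tuple[float, List[List[float]]]:
    """Return (sum_ij p_ij * d_ij, transport plan p) for one class.

    Assumes strictly positive weights and moderate d_ij / eps, so that the
    kernel entries below do not underflow to zero.
    """
    m, n = len(features_l), len(features_u)
    a = [1.0 / m] * m                      # uniform mass on labeled features
    total = sum(weights_u)
    b = [w / total for w in weights_u]     # mass on unlabeled features follows the weights
    d = [[squared_distance(zl, zu) for zu in features_u] for zl in features_l]
    k = [[math.exp(-dij / eps) for dij in row] for row in d]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iter):                # alternate row and column scaling
        u = [a[i] / sum(k[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(k[i][j] * u[i] for i in range(m)) for j in range(n)]
    p = [[u[i] * k[i][j] * v[j] for j in range(n)] for i in range(m)]
    distance = sum(p[i][j] * d[i][j] for i in range(m) for j in range(n))
    return distance, p


def total_loss(cross_entropy: float, sinkhorn_per_class: Sequence[float],
               l2_penalty: float, alpha: float, beta: float) -> float:
    """cross-entropy + alpha * (Sinkhorn terms summed over classes) + beta * L2 penalty."""
    return cross_entropy + alpha * sum(sinkhorn_per_class) + beta * l2_penalty
```

With this construction, an unlabeled feature that lies far from the labeled features contributes little to the distance once its weight is small, which is the down-weighting effect the text describes.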

The beneficial effects of the present invention are as follows:

To address the performance degradation of traditional semi-supervised classification models when labeled and unlabeled data are distributed differently, the invention proposes a robust semi-supervised fault classification method based on a weighted feature alignment autoencoder. The method designs a model training loss function based on weighting and feature alignment strategies. The weighting strategy improves the robustness of the semi-supervised classification model and reduces the performance degradation caused by inconsistent sample distributions. The feature alignment strategy lets the model use both labeled and unlabeled data in the fine-tuning stage, which enables deeper mining of the data and improves the generalization ability and classification performance of the network model.

Brief Description of the Drawings

Figure 1 is a schematic diagram of a stacked autoencoder;

Figure 2 is a flow chart of the TE process;

Figure 3 is a schematic diagram of the logarithmic reconstruction error of the data;

Figure 4 is a schematic diagram of the classification accuracy of different algorithms.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and preferred embodiments, from which its purpose and effects will become clearer. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.

In the semi-supervised fault classification method based on the weighted feature alignment autoencoder of the present invention, labeled data are first used to pre-train a stacked autoencoder for reconstruction, and the probability density distribution of the reconstruction error is estimated. The weights of the unlabeled samples are then calculated from the probability density function of the training-data reconstruction error. Next, a semi-supervised classification model based on the weighted feature alignment autoencoder is built from the labeled sample set, the unlabeled sample set, and the corresponding weights. The classification model uses a cross-entropy training loss function based on the weighted Sinkhorn distance. This loss lets the model use both labeled and unlabeled data in the fine-tuning stage, which enables deeper mining of the data and improves the generalization ability of the network model. In addition, the weighting strategy significantly improves the robustness of the model.

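The specific steps of the method of the present invention are as follows:

Step 1: Collect normal operating condition data and various fault data from the industrial process to obtain the training data set for modeling, consisting of a labeled sample set and an unlabeled sample set;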

Step 2: Build a stacked autoencoder model for reconstruction and train it with the labeled sample set. This step comprises the following sub-steps:

(2.1) Build the stacked autoencoder model, in which the encoder and the decoder each have their own weight vector and bias vector and the model output is a reconstruction of the input;

(2.2) Using the labeled sample set from Step 1, train the stacked autoencoder model with the stochastic gradient descent algorithm. The training loss function is defined as the reconstruction error of the input, measured between the i-th labeled input sample and the stacked autoencoder's reconstruction of it;

Step 3 comprises the following sub-steps:

(3.1) Assuming that the reconstruction error E_l of the labeled samples follows the χ² distribution E_l ~ g·χ²_h, compute the distribution parameters g and h from

g·h = mean(E_l) (5)

2g²·h = variance(E_l) (6)

(3.2) Compute the reconstruction errors E_u of the unlabeled samples;

(3.3) Compute the probability P_u that the reconstruction error E_u of the unlabeled samples occurs under the distribution of E_l, and normalize P_u to obtain the weights of the unlabeled samples;

(3.4) Build the weighted feature alignment autoencoder classification model and train it with the labeled sample set, the unlabeled sample set, and the corresponding weights. Training consists of unsupervised pre-training followed by supervised fine-tuning.

In the unsupervised pre-training stage, a stacked autoencoder is trained on the labeled and unlabeled samples together. The procedure is the same as in steps (2.1) to (2.3): first build a stacked autoencoder model for reconstruction, then train it on the labeled and unlabeled samples together.

For supervised fine-tuning, a fully connected neural network layer is added on top of the pre-trained stacked autoencoder as the class output. This yields the deep extracted features and class label of each labeled sample i, and the deep extracted features and predicted class label outputs of the unlabeled samples;

(3.7) Let the number of classes be F. For each class f ∈ F, obtain the deep extracted features of the labeled samples and of the unlabeled samples belonging to that class, together with the weights of those unlabeled samples;

In the training loss function, crossentropy denotes the cross-entropy loss function and α is the weight of the weighted Sinkhorn distance. For class f, p_ij denotes the transition probability from the feature of labeled sample i to the feature of unlabeled sample j, d_ij denotes the distance between those two features, each unlabeled sample j has its own weight, and m_f and n_f denote the numbers of labeled and unlabeled samples of class f, respectively. The newly designed training loss function based on the weighted Sinkhorn distance serves two main purposes. First, in the fine-tuning stage it aligns the features that the stacked autoencoder extracts from labeled and unlabeled data of the same class, so that their distributions become close. Second, because the Sinkhorn feature distance is weighted by the unlabeled-sample weights, abnormal unlabeled samples with large reconstruction errors are down-weighted.

The effectiveness of the method is verified below on a specific industrial process example. All data were collected on the Tennessee Eastman (TE) chemical process simulation platform, a typical chemical process benchmark that is widely used in fault diagnosis and fault classification research. The flow chart of the TE process is shown in Figure 2. Its main equipment comprises a continuous stirred reactor, a gas-liquid separation tower, a centrifugal compressor, a partial condenser, and a reboiler. The modeling data contain 16 process variables and 10 fault categories. The process variables and the faults are described in Table 1 and Table 2, respectively.

Table 1

Table 2

Fault number | Description | Fault type
1 | A/C feed flow ratio change (stream 4) | Step
5 | Condenser cooling water inlet temperature change | Step
7 | Material C pressure loss (stream 4) | Step
10 | Temperature change of material C (stream 4) | Random variation
14 | Reactor cooling water valve | Sticking

The collected data comprise 3600 samples from 6 classes, with 600 samples per class. The data are divided into training data (300 labeled samples and 3000 unlabeled samples) and test data (300 labeled samples). To simulate unlabeled data whose distribution is inconsistent with the labeled data, Gaussian noise is added to a given proportion of the original unlabeled data.
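The contamination procedure can be sketched as follows. The text does not give the noise level or how the affected samples are chosen, so the standard deviation `noise_std`, the random selection of samples, the fixed seed, and the function name are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple


def contaminate(unlabeled: Sequence[Sequence[float]], ratio: float,
                noise_std: float = 1.0,
                seed: int = 0) -> Tuple[List[List[float]], Set[int]]:
    """Add zero-mean Gaussian noise to a given proportion of the unlabeled samples.

    Returns the new sample list and the indices of the samples that were perturbed.
    """
    rng = random.Random(seed)
    n_noisy = round(ratio * len(unlabeled))
    noisy_idx = set(rng.sample(range(len(unlabeled)), n_noisy))
    out = []
    for i, x in enumerate(unlabeled):
        if i in noisy_idx:
            out.append([v + rng.gauss(0.0, noise_std) for v in x])
        else:
            out.append(list(x))  # untouched copy
    return out, noisy_idx
```

With 3000 unlabeled samples of 16 variables each and `ratio=0.5`, for instance, 1500 samples are perturbed and the other 1500 are returned unchanged.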

Figure 3 shows the logarithmic reconstruction errors, under the stacked autoencoder reconstruction model, of the labeled data, the normal unlabeled data, and the abnormal unlabeled data whose distribution is inconsistent with the labeled data. The reconstruction errors of the labeled data and the normal unlabeled data are close to each other, whereas the reconstruction error of the abnormal unlabeled data is clearly larger than both. This difference is the basis on which the weighted feature alignment autoencoder detects unlabeled data with an abnormal distribution.

Figure 4 shows the classification accuracy of three algorithms at different proportions of unlabeled data whose distribution is inconsistent with the labeled data. MLP is a supervised neural network classification model, Tri-training is a neural network classification model obtained by co-training, and Weighted FA-SAE is the weighted feature alignment autoencoder classification model proposed by the present invention. Tri-training and Weighted FA-SAE are semi-supervised deep learning models. The figure shows that in most cases the semi-supervised algorithms classify better than the supervised algorithm. As the proportion of inconsistently distributed unlabeled data grows, the performance of the semi-supervised algorithms declines; when the inconsistency rate reaches 90%, the classification accuracy of Tri-training even falls below that of the supervised MLP. By contrast, the proposed Weighted FA-SAE method outperforms both MLP and Tri-training at the different inconsistency rates.

Those of ordinary skill in the art will understand that the above are only preferred examples of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art may still modify the technical solutions described in those examples or replace some of their technical features with equivalents. All modifications and equivalent replacements made within the spirit and principle of the invention shall fall within its scope of protection.

Claims

1. A semi-supervised fault classification method based on a weighted feature alignment autoencoder, characterized by comprising the following steps:

step one: collecting normal operating condition data and various fault data of an industrial process to obtain a training data set for modeling, comprising a labeled sample set and an unlabeled sample set, wherein x represents an input sample, y represents a sample label, m represents the number of labeled samples, and n represents the number of unlabeled samples;

step two: constructing a stacked autoencoder model for reconstruction, and training the stacked autoencoder model with the labeled sample set;

step three: estimating the probability density distribution of the reconstruction error of the training data, calculating the weights of the unlabeled samples, and further constructing a weighted feature alignment autoencoder classification model;

step three is specifically divided into the following sub-steps:

(3.1) calculating the distribution parameters g and h of the χ² distribution E_l ~ g·χ²_h followed by the reconstruction error E_l of the labeled samples:

g·h = mean(E_l)

2g²·h = variance(E_l)

(3.2) calculating the reconstruction error E_u of the unlabeled samples, wherein the reconstruction error of a single sample is calculated from the input and the model's reconstruction of the input;

(3.3) calculating the probability P_u that the reconstruction error E_u of the unlabeled samples occurs under the distribution of E_l, and normalizing P_u to obtain the weights of the unlabeled samples;

(3.4) constructing the weighted feature alignment autoencoder classification model, and training it with the labeled sample set, the unlabeled sample set, and the corresponding weights; the training process comprises unsupervised pre-training and supervised fine-tuning; in the unsupervised pre-training stage, a stacked autoencoder is trained with the labeled samples and the unlabeled samples together; in the supervised fine-tuning, a fully connected neural network layer is added to the stacked autoencoder obtained by unsupervised pre-training and serves as the class output, so as to obtain the deep extracted features and predicted class label of the i-th labeled sample and the deep extracted features and predicted class label outputs of the unlabeled samples, wherein {w_c, b_c} represent the weight vector and bias vector of the fully connected neural network layer;

(3.5) the number of classes being F, obtaining, for each class f ∈ F, the deep extracted features of the labeled samples and of the unlabeled samples corresponding to that class, and the weights of the unlabeled samples;

(3.6) calculating the training loss function of the weighted feature alignment autoencoder classification model, wherein crossentropy represents the cross-entropy loss function; the weighted Sinkhorn distance function measures the distance between the feature distribution of the labeled data and the feature distribution of the unlabeled data belonging to the same class, and also reduces the weight of abnormal unlabeled samples with large reconstruction errors; α is the weight of the Sinkhorn distance; the L₂ regularization penalty term applies to the network parameters and β is its weight; p_ij represents the transition probability from the feature of labeled sample i corresponding to class f to the feature of unlabeled sample j; d_ij represents the distance from the feature of labeled sample i corresponding to class f to the feature of unlabeled sample j; each unlabeled sample j corresponding to class f has a weight; and m_f and n_f represent the numbers of labeled and unlabeled samples corresponding to class f, respectively;

step four: collecting field operating data, inputting the data into the weighted feature alignment autoencoder classification model, and outputting the corresponding fault category.

2. The semi-supervised fault classification method based on a weighted feature alignment autoencoder according to claim 1, wherein step two is specifically divided into the following sub-steps:

(2.1) constructing a stacked autoencoder model for reconstruction, comprising a multi-layer encoder and decoder, wherein the output of the model is the reconstruction of the input, x represents the input, z_k represents the extracted k-th layer features, k represents the k-th layer of the stacked autoencoder, and the encoder and the decoder each have a weight vector and a bias vector;

(2.2) training the stacked autoencoder model with the labeled samples constructed in step one using a stochastic gradient descent algorithm, wherein the model training loss function is defined as the reconstruction error of the input, measured between the i-th labeled input sample and the stacked autoencoder's reconstruction of it;

(2.3) calculating the reconstruction error of the labeled samples using the trained stacked autoencoder model, wherein the reconstruction error of a single sample is calculated with reference to the following formula: