CN117474119A

CN117474119A - Fault prediction method and system based on federated learning

Info

Publication number: CN117474119A
Application number: CN202311368228.1A
Authority: CN
Inventors: 孙银银; 赵宁; 兰春嘉
Original assignee: Shanghai Lingshuzhonghe Information Technology Co ltd
Current assignee: Shanghai Lingshuzhonghe Information Technology Co ltd
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2024-01-30

Abstract

The invention discloses a fault prediction method and system based on federated learning. The method comprises the following steps: acquiring a first data set of a detection target at an initiator and a second data set at a partner, calculating the ID intersection of the first data set and the second data set, and obtaining an original data set of the initiator and an original data set of the partner according to the ID intersection and each party's local data set; performing data processing on the original data sets of the initiator and the partner to obtain a standby data set of the initiator and a standby data set of the partner; performing correlation analysis and feature selection on the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set; and performing model training on the initiator training data set and the partner training data set to obtain a fault prediction model, then using the fault prediction model to predict faults of the detection target and obtain a prediction result. The method increases the number of modeling dimensions while protecting user privacy, and thereby improves fault prediction accuracy.

Description

Fault prediction method and system based on federated learning

Technical Field

The invention relates to the field of privacy-preserving computation, and in particular to a federated learning-based fault prediction method and system.

Background

At present, when a power plant builds a fault diagnosis system for its power generation equipment, it deploys the remote monitoring systems of various manufacturers in order to eliminate data islands, and deploying, operating and maintaining these systems consumes a great deal of manpower and material resources. Moreover, manufacturers cannot disclose key equipment data that involves trade secrets, which limits the accuracy of equipment fault diagnosis. How to eliminate data islands so that data is available but invisible, reduce the operation and maintenance cost of the power plant, and build a safe and reliable fault diagnosis system is therefore a problem that urgently needs to be solved.

In view of the above problems in the prior art, no effective solution exists at present.

Disclosure of Invention

To solve the above problems, the invention provides a fault prediction method and system based on federated learning. With the data available but invisible, the features held by a thermal power plant and an equipment manufacturer are fused, data processing and feature engineering are performed on those features, an equipment fault prediction model is trained on the processed data, and the online model is used to predict the subsequent running condition of the equipment. This eliminates the data island and improves the accuracy of equipment fault diagnosis.

In order to achieve the above object, the present invention provides a fault prediction method based on federated learning, comprising: acquiring a first data set of a detection target at an initiator and a second data set at a partner, calculating the ID intersection of the first data set and the second data set, and obtaining an original data set of the initiator and an original data set of the partner according to the ID intersection and each party's local data set; performing data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner; performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set; and performing model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and performing fault prediction on the detection target with the fault prediction model to obtain a fault result of the detection target.

Further optionally, performing data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner includes: computing statistics on the features in the original data set of the initiator to obtain a first statistical result; computing statistics on the features in the original data set of the partner to obtain a second statistical result, mapping the local features into anonymous features, and then sending the anonymous features and the second statistical result to the initiator; the initiator filtering the features in its original data set according to the variances in the first statistical result and a preset variance threshold to obtain an initiator filtered data set, determining the anonymous features to be deleted according to the variances in the second statistical result and the preset variance threshold, and sending a deletion instruction for those anonymous features to the partner, the partner deleting the local features corresponding to the anonymous features according to the deletion instruction to obtain a partner filtered data set; and performing outlier processing and missing value processing on the initiator filtered data set and the partner filtered data set to obtain the initiator standby data set and the partner standby data set respectively.

Further optionally, performing outlier processing on the initiator filtered data set and the partner filtered data set includes: the initiator and the partner each screening out abnormal features by comparing the features with their reasonable intervals; the partner mapping its abnormal local features into anonymous features and sending the anonymous features together with the second statistical result to the initiator; the initiator filtering or filling its own abnormal local features and sending the anonymous features and an anomaly handling instruction to the partner; and the partner mapping the anonymous features back to local features and filtering or filling the abnormal local features according to the anomaly handling instruction; for continuous features, each party performing anomaly detection using a box plot and filling the outliers of a feature with the median of that feature from the statistical result; and for discrete features, the initiator performing anomaly detection using the feature histogram, deleting the outliers of the feature, and sending the anonymous features that have outliers together with an anomaly handling instruction to the partner, the partner then deleting the outliers in the local features to which the anonymous features map, or filling them with the mode of the corresponding feature from the statistical result, according to the anomaly handling instruction.

Further optionally, performing missing value processing on the initiator filtered data set and the partner filtered data set includes: for discrete features, each party calculating, from the feature histogram, the ratio of the number of samples whose category is missing to the number of all samples, and deleting the missing samples when the ratio is smaller than a preset proportion threshold, or filling them with the mode from the statistical result, or filling them with the median from the statistical result; for continuous features, each party calculating the missing rate; deleting the missing samples when the missing rate is smaller than a first preset missing rate threshold; deleting the corresponding feature when the missing rate is greater than a second preset missing rate threshold; and, when the missing rate lies between the first and second preset missing rate thresholds, filling the missing values in the missing samples with the mean of the feature from the statistical result.

Further optionally, performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set includes: performing Pearson correlation analysis on the features in the initiator standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from among a plurality of features whose correlation coefficients are larger than a preset coefficient threshold as the representative feature of that plurality, to obtain an initiator update data set; performing Pearson correlation analysis on the features in the partner standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from among a plurality of features whose correlation coefficients are larger than the preset coefficient threshold as the representative feature of that plurality, to obtain a partner update data set; sharding the initiator update data set to obtain a first shard data set and a second shard data set, and sending the second shard data set to the partner; sharding the partner update data set to obtain a third shard data set and a fourth shard data set, and sending the third shard data set to the initiator; calculating common parameters from the triples and each party's current data set, and calculating, by means of the common parameters, the correlation coefficient between each feature in each party's current data set and the features of the other party, wherein the triples are generated by a trusted third party; and the initiator selecting one feature from among a plurality of features whose mutual correlation coefficients are larger than the preset coefficient threshold as the representative feature of that plurality, to obtain the initiator training data set, and sending the selected anonymous features to the partner, the partner mapping the selected anonymous features to local features to obtain the partner training data set.
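The text does not say how the triples and common parameters are used. One reading, offered here only as an assumption, is Beaver-style multiplication triples over additive secret shares: the trusted third party hands out shares of random values a, b and c = a*b, the parties open the masked values d = x - a and e = y - b (the "common parameters"), and from these each party derives a share of the product x*y. Summing such products gives the cross-party term needed for a Pearson coefficient. The modulus `P`, the function names and the integer-valued data below are all hypothetical.

```python
import random

P = 2**61 - 1  # assumed modulus for additive sharing


def share(value, rng):
    """Split an integer into two additive shares modulo P."""
    s0 = rng.randrange(P)
    return s0, (value - s0) % P


def make_triple(rng):
    """Trusted third party: random a, b and c = a*b, each returned as two shares."""
    a, b = rng.randrange(P), rng.randrange(P)
    return share(a, rng), share(b, rng), share(a * b % P, rng)


def beaver_multiply(x_sh, y_sh, triple):
    """Return shares of x*y; only the masked values d and e are opened."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    d = (x_sh[0] - a0 + x_sh[1] - a1) % P  # d = x - a, revealed to both parties
    e = (y_sh[0] - b0 + y_sh[1] - b1) % P  # e = y - b, revealed to both parties
    z0 = (c0 + d * b0 + e * a0 + d * e) % P  # party 0 also adds the public d*e term
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1


rng = random.Random(0)
xs = [3, 1, 4, 1, 5]  # one initiator feature (integers for simplicity)
ys = [2, 7, 1, 8, 2]  # one partner feature
total0 = total1 = 0
for x, y in zip(xs, ys):
    # In a real run the initiator would share x and the partner would share y,
    # each sending one share to the other side; here both happen in one process.
    z0, z1 = beaver_multiply(share(x, rng), share(y, rng), make_triple(rng))
    total0, total1 = (total0 + z0) % P, (total1 + z1) % P
dot = (total0 + total1) % P  # reconstructed sum of x*y
```

Real-valued features would need a fixed-point encoding before sharing, which this sketch leaves out.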

In another aspect, the present invention further provides a federated learning-based fault prediction system, including: a data fusion module, used for acquiring a first data set of a detection target at an initiator and a second data set at a partner, calculating the ID intersection of the first data set and the second data set, and obtaining an original data set of the initiator and an original data set of the partner according to the ID intersection and each party's local data set; a data processing module, used for performing data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner; a feature engineering module, used for performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set; and a prediction module, used for performing model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and performing fault prediction on the detection target with the fault prediction model to obtain a fault result of the detection target.

Further optionally, the data processing module includes: the first statistics sub-module is used for counting the characteristics in the original data set of the initiator to obtain a first statistics result; the second statistics sub-module is used for carrying out statistics on the features in the original data set of the partner to obtain a second statistical result, mapping the local features into anonymous features and then sending the anonymous features together with the second statistical result to the initiator; the filtering sub-module is used for filtering the features in the original data set of the initiator by the initiator according to the variance in the first statistical result and a preset variance threshold value to obtain a filtered data set of the initiator, determining anonymous features to be deleted by the initiator according to the variance in the second statistical result and the preset variance threshold value, sending a deletion instruction of the anonymous features to the partner, and deleting the local features corresponding to the anonymous features by the partner according to the deletion instruction to obtain the filtered data set of the partner; and the exception processing sub-module is used for performing exception value processing and missing value processing on the initiator filtered data set and the partner filtered data set to respectively obtain an initiator standby data set and a partner standby data set.

Further optionally, the exception handling sub-module includes: an abnormal feature identification unit, used for the initiator and the partner each to screen out abnormal features by comparing the features with their reasonable intervals, for the partner to map its abnormal local features into anonymous features and send them together with the second statistical result to the initiator, for the initiator to filter or fill its own abnormal local features and send the anonymous features and an anomaly handling instruction to the partner, and for the partner to map the anonymous features back to local features and filter or fill the abnormal local features according to the anomaly handling instruction; a continuous feature detection unit, used for each party to perform anomaly detection on continuous features using a box plot and to fill the outliers of a feature with the median of that feature from the statistical result; and a discrete feature detection unit, used for the initiator to perform anomaly detection on discrete features using the feature histogram, delete the outliers of the feature, and send the anonymous features that have outliers together with an anomaly handling instruction to the partner, the partner then deleting the outliers in the local features to which the anonymous features map, or filling them with the mode of the corresponding feature from the statistical result, according to the anomaly handling instruction.

Further optionally, the exception handling sub-module further includes: a first missing value processing unit, used for each party to calculate, for discrete features and from the feature histogram, the ratio of the number of samples whose category is missing to the number of all samples, and to delete the missing samples when the ratio is smaller than a preset proportion threshold, or fill them with the mode from the statistical result, or fill them with the median from the statistical result; and a second missing value processing unit, used for each party to calculate the missing rate of continuous features, to delete the missing samples when the missing rate is smaller than a first preset missing rate threshold, to delete the corresponding feature when the missing rate is greater than a second preset missing rate threshold, and, when the missing rate lies between the first and second preset missing rate thresholds, to fill the missing values in the missing samples with the mean of the feature from the statistical result.

Further optionally, the feature engineering module includes: a first correlation analysis sub-module, used for performing Pearson correlation analysis on the features in the initiator standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from among a plurality of features whose correlation coefficients are larger than a preset coefficient threshold as the representative feature of that plurality, to obtain an initiator update data set; a second correlation analysis sub-module, used for performing Pearson correlation analysis on the features in the partner standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from among a plurality of features whose correlation coefficients are larger than the preset coefficient threshold as the representative feature of that plurality, to obtain a partner update data set; a first sharding sub-module, used for sharding the initiator update data set to obtain a first shard data set and a second shard data set, and sending the second shard data set to the partner; a second sharding sub-module, used for sharding the partner update data set to obtain a third shard data set and a fourth shard data set, and sending the third shard data set to the initiator; a federated correlation coefficient calculation sub-module, used for calculating common parameters from the triples and each party's current data set, and calculating, by means of the common parameters, the correlation coefficient between each feature in each party's current data set and the features of the other party, wherein the triples are generated by a trusted third party; and a feature selection sub-module, used for the initiator to select one feature from among a plurality of features whose mutual correlation coefficients are larger than the preset coefficient threshold as the representative feature of that plurality, to obtain the initiator training data set, and to send the selected anonymous features to the partner, the partner mapping the selected anonymous features to local features to obtain the partner training data set.

The above technical scheme has the following beneficial effects: by adopting federated learning, that is, by giving the initiator and the partner a common sample space and different feature spaces, the data remains invisible during model training, which improves data security; in addition, the data island is eliminated during training and the data dimensionality is enlarged, which improves the accuracy of the final model's predictions.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a federated learning-based fault prediction method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of an outlier processing method in a data processing process according to an embodiment of the present invention;

FIG. 4 is a flowchart of a missing value processing method in a data processing process according to an embodiment of the present invention;

FIG. 5 is a flow chart of a federated feature engineering method provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a federated learning-based fault prediction system according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a data processing module according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of the outlier processing units of the exception handling sub-module according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of the missing value processing units of the exception handling sub-module according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a feature engineering module according to an embodiment of the present invention.

Reference numerals: 100-data fusion module; 200-data processing module; 2001-first statistics sub-module; 2002-second statistics sub-module; 2003-filtering sub-module; 2004-exception handling sub-module; 20041-abnormal feature identification unit; 20042-continuous feature detection unit; 20043-discrete feature detection unit; 20044-first missing value processing unit; 20045-second missing value processing unit; 300-feature engineering module; 3001-first correlation analysis sub-module; 3002-second correlation analysis sub-module; 3003-first sharding sub-module; 3004-second sharding sub-module; 3005-federated correlation coefficient calculation sub-module; 3006-feature selection sub-module; 400-prediction module.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

To solve the prior-art problem that data islands impair the accuracy of equipment fault diagnosis, the invention provides a fault prediction method based on federated learning. FIG. 1 is a flowchart of the federated learning-based fault prediction method provided by an embodiment of the invention. As shown in FIG. 1, the method includes the following steps:

S1, acquiring a first data set of a detection target at an initiator and a second data set at a partner, calculating the ID intersection of the first data set and the second data set, and obtaining an original data set of the initiator and an original data set of the partner according to the ID intersection and each party's local data set;

Assume that the initiator is a thermal power plant, the detection target is a boiler of the thermal power plant, and the partner is the boiler manufacturer. The thermal power plant holds operation data, maintenance data and the like for the boiler and its auxiliary equipment. Assuming that the smallest unit of the boiler is a component, the first data set X of the initiator has m features. The boiler manufacturer holds important operation data under typical working conditions, design parameters, fault mode data and the like for the boiler; its second data set is Y, with t features.

Because each device and component carries a KKS code, which serves as a unique identification code (ID), the first data set X and the second data set Y are intersected on the KKS code. If the intersection contains n samples, the original data set of the initiator is X(m, n) and the original data set of the partner is Y(t, n).
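Step S1 can be pictured with a minimal sketch, assuming each party stores its samples in a dictionary keyed by KKS code. The function name, the codes and the feature values are hypothetical. The intersection is computed in the clear here; the text does not specify the intersection protocol, and a privacy-preserving deployment would presumably use a private set intersection scheme instead.

```python
def align_by_id(initiator_rows, partner_rows):
    """Keep only the samples whose ID appears in both parties' data sets."""
    common_ids = sorted(set(initiator_rows) & set(partner_rows))
    x = {i: initiator_rows[i] for i in common_ids}  # initiator original data set
    y = {i: partner_rows[i] for i in common_ids}    # partner original data set
    return common_ids, x, y


# Hypothetical KKS-like codes and values, for illustration only.
plant = {
    "HAD10": {"steam_temp": 540.0},
    "HAD20": {"steam_temp": 538.5},
    "HAH30": {"steam_temp": 541.2},
}
maker = {
    "HAD20": {"design_pressure": 17.5},
    "HAH30": {"design_pressure": 17.5},
    "HAJ40": {"design_pressure": 18.0},
}
ids, X, Y = align_by_id(plant, maker)
```

After alignment both parties hold the same n samples in the same order, with disjoint feature columns.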

S2, carrying out data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner;

The original data set of the initiator and the original data set of the partner undergo data interaction, and both data sets are then processed, for example by filtering abnormal data and filling in missing data, to obtain the standby data set of the initiator and the standby data set of the partner and to improve data quality.

S3, carrying out correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set;

Correlation analysis is performed on the standby data set of the initiator and the standby data set of the partner. First, correlation analysis is performed on all features in the initiator's standby data set: highly correlated features are placed in feature groups, one feature is selected from each group, and the others are deleted to update the initiator's standby data set. The same analysis is performed on all features in the partner's standby data set: highly correlated features are placed in feature groups, one feature is selected from each group, and the others are deleted to update the partner's standby data set.

A joint correlation analysis is then performed across the features of the updated initiator standby data set and the updated partner standby data set. One feature is selected from each group of highly correlated features and the others are deleted, yielding the updated data sets, namely the initiator training data set and the partner training data set used for modeling.
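The local part of this step, in which each party thins out its own correlated features, might look like the sketch below. It assumes a greedy pass that keeps a feature only if it is not highly correlated with a feature already kept. The 0.9 threshold, the use of the absolute value of the coefficient, and the feature names are assumptions and are not stated in the text.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Zero-variance features are assumed to have been removed earlier,
    # so the denominator is non-zero.
    return cov / math.sqrt(var_a * var_b)


def select_representatives(features, threshold=0.9):
    """Keep a feature unless it correlates above the threshold with a kept one."""
    kept = []
    for name, values in features.items():
        # Assumption: strong negative correlation also counts as redundant.
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "steam_temp": [1, 2, 3, 4, 5],
    "steam_temp_backup": [2, 4, 6, 8, 10.5],  # nearly a copy of steam_temp
    "pressure": [5, 1, 4, 2, 3],
}
kept = select_representatives(features)
```

Which member of a correlated group is kept depends on iteration order in this sketch; the text leaves that choice open.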

And S4, performing model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and performing fault prediction on the detection target by adopting the fault prediction model to obtain a fault result of the detection target.

Joint model training is carried out across the nodes on the training data set of the initiator and the training data set of the partner. Different classification models can be selected for different prediction purposes. For example, if the only question is whether the boiler has a fault, a federated logistic regression algorithm can be selected to train the fault prediction model, and the trained model is then used to predict faults of the detection target. A prediction result of 0 means the detection target is operating normally; a prediction result of 1 means the detection target is about to fail, and an early warning can be issued on that basis.

If the fault category of the detection target needs to be predicted, a tree-based multi-class classification algorithm can be selected to train the model; the model then predicts either normal operation or one of several fault categories.

The nodes train the model jointly, and the rules for data processing, feature processing and model training are saved at the end. When the model goes online, each node only needs to load the model together with the model parameters of the whole pipeline; new data at each node is processed in sequence by the loaded rules, and the fault prediction result is finally obtained at the initiator.
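For the binary case, online prediction with a vertically partitioned logistic regression can be sketched as each party computing a partial score over its own features and the initiator combining them. The weights, feature names and the 0.5 cutoff are hypothetical, and the protection of the exchanged partial score (for example by encryption) is omitted because the text does not describe it.

```python
import math


def partial_logit(weights, row):
    """One party's contribution: weighted sum over its own features."""
    return sum(weights[name] * row[name] for name in weights)


def predict_fault(initiator_part, partner_part, bias, cutoff=0.5):
    """Initiator side: combine partial scores and map to 0 (normal) or 1 (fault)."""
    z = initiator_part + partner_part + bias
    prob = 1.0 / (1.0 + math.exp(-z))
    return (1 if prob >= cutoff else 0), prob


# Hypothetical trained parameters and one new, already preprocessed sample.
w_initiator = {"steam_temp_z": 1.2}
w_partner = {"wall_thickness_z": -0.8}
part_a = partial_logit(w_initiator, {"steam_temp_z": 2.0})
part_b = partial_logit(w_partner, {"wall_thickness_z": -1.0})
label, prob = predict_fault(part_a, part_b, bias=-1.0)
```

Neither party sees the other's raw feature values in this arrangement, only the combined score reaches the initiator.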

As an optional implementation manner, FIG. 2 is a flowchart of a data processing method provided by an embodiment of the present invention. As shown in FIG. 2, performing data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner includes:

S201, computing statistics on the features in the original data set of the initiator to obtain a first statistical result;

S202, computing statistics on the features in the original data set of the partner to obtain a second statistical result, mapping the local features into anonymous features, and sending the anonymous features and the second statistical result to the initiator;

S203, the initiator filters the features in the original data set of the initiator according to the variance in the first statistical result and a preset variance threshold to obtain a filtered data set of the initiator, the initiator determines the anonymous features to be deleted according to the variance in the second statistical result and the preset variance threshold, and sends a deletion instruction for the anonymous features to the partner, and the partner deletes the local features corresponding to the anonymous features according to the deletion instruction to obtain a filtered data set of the partner;

S204, performing outlier processing and missing value processing on the initiator filtered data set and the partner filtered data set to respectively obtain an initiator standby data set and a partner standby data set.

Each party (initiator and partner) computes statistics for the features in its local data set, including: maximum, minimum, mean, variance, skewness, kurtosis, mode, median, histogram, and missing rate.

The partner anonymizes its local features, for example by encrypting them with a public/private key pair, and sends the anonymous features and the second statistical result to the initiator. The initiator then performs data processing according to the first statistical result for its own local features and the second statistical result for the partner's anonymous features.

First, the initiator deletes local features whose variance is 0, or compares each feature's variance with a preset variance threshold and removes the features whose variance is below the threshold. For anonymous features belonging to the partner, the initiator notifies the partner; the partner then maps the anonymous features whose variance is 0 or below the preset variance threshold back to local features and deletes them. In this way each participant performs a preliminary update of its own data set.
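The exchange in S202 and S203 can be sketched as follows. The text suggests public/private-key encryption for anonymizing feature names; this sketch substitutes a salted hash as a simpler stand-in, which is an assumption, as are the function names, the salt and the sample data.

```python
import hashlib
import statistics


def anonymize(names, salt):
    """Partner side: build a map from anonymous token to local feature name."""
    return {hashlib.sha256((salt + n).encode()).hexdigest()[:8]: n for n in names}


def low_variance(variances, threshold):
    """Initiator side: keys whose variance is zero or not above the threshold."""
    return [key for key, var in variances.items() if var <= threshold]


# Partner: compute variances and publish them under anonymous tokens.
partner_data = {
    "design_pressure": [17.5, 17.5, 17.5],  # constant, carries no information
    "wall_thickness": [40.0, 42.0, 38.0],
}
token_to_name = anonymize(partner_data, salt="partner-secret")
name_to_token = {name: token for token, name in token_to_name.items()}
second_stats = {name_to_token[n]: statistics.pvariance(v) for n, v in partner_data.items()}

# Initiator: sees only tokens and variances, and returns a deletion instruction.
delete_instruction = low_variance(second_stats, threshold=0.0)

# Partner: map tokens back to local names and delete those features.
for token in delete_instruction:
    del partner_data[token_to_name[token]]
```

The initiator applies the same `low_variance` rule directly to its own first statistical result, where no anonymization is needed.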

Each participant then performs outlier processing on its updated data set to filter out abnormal samples, followed by missing value processing, in which features are filled in appropriately or samples are deleted, so as to improve data quality.
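For continuous features, the disclosure states a three-way rule based on the missing rate: delete the missing samples below a first threshold, delete the whole feature above a second threshold, and fill with the feature mean in between. A minimal sketch of that rule follows; the threshold values 0.05 and 0.5, the use of `None` to mark a missing value, and the returned action labels are assumptions.

```python
def handle_missing(values, low=0.05, high=0.5):
    """Decide how to treat one continuous feature based on its missing rate."""
    missing = [i for i, v in enumerate(values) if v is None]
    rate = len(missing) / len(values)
    if rate == 0:
        return "keep", values
    if rate < low:
        return "drop_samples", missing  # indices of the samples to delete
    if rate > high:
        return "drop_feature", None
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return "fill_mean", [mean if v is None else v for v in values]


action, result = handle_missing([1.0, None, 3.0, 5.0])  # missing rate 0.25
```

Because deleting samples changes the shared sample space, the indices returned for `drop_samples` would have to be applied by both parties; the text shown here does not spell out that coordination.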

As an optional implementation, fig. 3 is a flowchart of the outlier processing method in the data processing procedure provided by an embodiment of the present invention. As shown in fig. 3, performing outlier processing on both the initiator filtered data set and the partner filtered data set includes:

S2041, the initiator and the partner each screen for abnormal features according to the features and their reasonable intervals; the partner maps its abnormal local features to anonymous features and sends them to the initiator together with the second statistical result; the initiator filters or fills its own abnormal local features and sends the anonymous features and an abnormal processing instruction to the partner; after mapping the anonymous features back to local features, the partner filters or fills the abnormal local features according to the abnormal processing instruction;

The initiator and the partner can check whether the data fall within a reasonable range by looking at the maximum and minimum values of each feature. Taking temperature and pressure as examples, the temperature and pressure of boiler steam generally have a reasonable interval; a feature sample value outside that interval is regarded as an abnormal value and needs to be filtered.

After comparing its local features with the reasonable intervals of the corresponding features, the initiator directly filters or fills the abnormal features; specifically, it can fill them with the mean, mode or median of the corresponding feature.

The partner converts its abnormal local features into anonymous features and sends them, together with the corresponding statistical results, to the initiator. The initiator selects an abnormal processing instruction for each anonymous feature, such as a filling method or a deletion instruction, and sends the anonymous features and the corresponding instructions back to the partner. After mapping the abnormal anonymous features back to local features, the partner performs the corresponding operations according to the instructions.
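As an illustration of the final step, the function below shows how a party might apply an abnormal processing instruction to one feature once the reasonable interval is known. The instruction names, the interval bounds and the choice to compute the fill value from the in-range samples are assumptions of this sketch; the patent only states that the fill value is the mean, mode or median of the feature.

```python
from statistics import median

def apply_instruction(values, low, high, instruction):
    """Handle values outside the reasonable interval [low, high].

    instruction: "delete" drops the abnormal samples,
                 "fill_median" replaces them with the median of the in-range samples.
    """
    in_range = [v for v in values if low <= v <= high]
    if instruction == "delete":
        return in_range
    if instruction == "fill_median":
        fill = median(in_range)
        return [v if low <= v <= high else fill for v in values]
    raise ValueError(f"unknown instruction: {instruction}")

# Hypothetical boiler steam temperatures with one out-of-range reading.
steam_temp = [535.0, 540.0, 9999.0, 545.0]
deleted = apply_instruction(steam_temp, 400.0, 650.0, "delete")
filled = apply_instruction(steam_temp, 400.0, 650.0, "fill_median")
```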

S2042, for continuous features, each party performs anomaly detection using a box plot and fills the abnormal values in a feature with the median of that feature from the statistical result;

Each participant detects and processes outliers in continuous features using a box plot. Specifically, each participant runs box plot detection on the features, identifies the features that contain abnormal values, and fills those abnormal values with the median of the corresponding feature.
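A box plot check of this kind can be sketched as below. The 1.5 × IQR whisker rule is the conventional box plot definition and is assumed here, since the patent does not state which fences are used; the sample values are made up.

```python
from statistics import median, quantiles

def fill_boxplot_outliers(values, whisker=1.5):
    """Replace values outside the box plot fences with the feature median."""
    q1, _, q3 = quantiles(values, n=4)      # quartiles (default "exclusive" method)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    fill = median(values)                   # median from the feature statistics
    return [v if lower <= v <= upper else fill for v in values]

readings = [10, 11, 12, 13, 14, 15, 100]
cleaned = fill_boxplot_outliers(readings)
```

Here the quartiles are 11 and 15, so the fences are 5 and 21, and the reading of 100 is replaced by the median 13.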

S2043, for discrete features, the initiator performs anomaly detection using the feature histogram and deletes the abnormal values in its own features; it sends the anonymous features that contain abnormal values, together with an abnormal processing instruction, to the partner; according to the instruction, the partner either deletes the abnormal values in the local features to which the anonymous features map or fills them with the mode of the corresponding feature from the statistical result.

The initiator inspects the discrete features through feature histograms and selects the features in which some sample value accounts for less than 1% of the histogram. Among the selected features, the initiator directly filters the abnormal values of its own local features and sends the partner's features to the partner. The partner receives the abnormal anonymous features and the corresponding abnormal processing instructions, maps the anonymous features to local features, and either deletes the values that account for less than 1% of the samples in those discrete features or fills them with the mode.
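The 1% histogram rule for discrete features might look like the following sketch. The function names, the instruction strings and the sample data are hypothetical; only the 1% proportion and the delete-or-fill-with-mode choice come from the text.

```python
from collections import Counter

def rare_values(values, min_share=0.01):
    """Category values whose share of the samples is below min_share."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c / len(values) < min_share}

def handle_rare(values, instruction, min_share=0.01):
    """Delete rare category values or replace them with the mode."""
    rare = rare_values(values, min_share)
    if instruction == "delete":
        return [v for v in values if v not in rare]
    if instruction == "fill_mode":
        mode = Counter(values).most_common(1)[0][0]
        return [mode if v in rare else v for v in values]
    raise ValueError(f"unknown instruction: {instruction}")

# 200 made-up valve states; "eror" appears once (0.5% of the samples).
valve_state = ["open"] * 120 + ["closed"] * 79 + ["eror"]
after_delete = handle_rare(valve_state, "delete")
after_fill = handle_rare(valve_state, "fill_mode")
```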

As an optional implementation, fig. 4 is a flowchart of the missing value processing method in the data processing procedure provided by an embodiment of the present invention. As shown in fig. 4, performing missing value processing on the initiator filtered data set and the partner filtered data set includes:

S2044, for discrete features, each party calculates, from the feature histogram, the ratio of the number of samples whose category is missing to the number of all samples, and, when the ratio is smaller than a preset proportion threshold, deletes those samples, or fills them with the mode from the statistical result, or fills them with the median from the statistical result;

For discrete features, the ratio of samples whose category is missing is read from the histogram; if the ratio is smaller than the preset proportion threshold (1% in this embodiment), the samples are deleted or filled with the mode or median of the feature.

S2045, for continuous features, each party calculates the missing rate; deletes the missing samples when the missing rate is smaller than a first preset missing rate threshold; deletes the corresponding feature when the missing rate is greater than a second preset missing rate threshold; and, when the missing rate is between the first and second preset missing rate thresholds, fills the missing values in the missing samples with the mean of the feature from the statistical result.

For continuous features the missing rate is calculated directly. If the missing rate is smaller than the first preset missing rate threshold (1% in this embodiment), the missing samples are deleted; if it is greater than the second preset missing rate threshold (50% in this embodiment), the feature is deleted outright; if it lies between the two thresholds (1% to 50%), the missing values are filled with the mean of the feature.
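The three-way rule for continuous features can be written as a small decision function. This is a sketch under stated assumptions: missing values are represented by None, the function works on a single column (in a full data set, deleting a sample would remove the whole row), and the returned action labels are hypothetical.

```python
from statistics import mean

def handle_missing(values, low=0.01, high=0.50):
    """Apply the missing-rate rule to one continuous feature.

    Returns (action, new_values); new_values is None when the feature is dropped.
    """
    missing = sum(v is None for v in values)
    if missing == 0:
        return "keep", list(values)
    rate = missing / len(values)
    if rate < low:                      # below 1%: delete the missing samples
        return "delete_samples", [v for v in values if v is not None]
    if rate > high:                     # above 50%: delete the feature
        return "delete_feature", None
    fill = mean(v for v in values if v is not None)
    return "fill_mean", [fill if v is None else v for v in values]

action_a, col_a = handle_missing([1.0, None, 3.0, 5.0])       # rate 25%
action_b, col_b = handle_missing([None, None, None, 1.0])     # rate 75%
action_c, col_c = handle_missing([2.0] * 199 + [None])        # rate 0.5%
```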

As an optional implementation, fig. 5 is a flowchart of the federal feature engineering method provided by an embodiment of the present invention. As shown in fig. 5, performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain the initiator training data set and the partner training data set includes:

S301, performing Pearson correlation analysis on the features in the initiator standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from each group of features whose mutual correlation coefficients are greater than a preset coefficient threshold as the representative feature of that group, to obtain the initiator update data set;

S302, performing Pearson correlation analysis on the features in the partner standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain the partner update data set;

S303, slicing the initiator update data set to obtain a first sliced data set and a second sliced data set, and sending the second sliced data set to the partner;

S304, slicing the partner update data set to obtain a third sliced data set and a fourth sliced data set, and sending the third sliced data set to the initiator;

S305, calculating common parameters according to the triples and the current data sets of all the parties, and calculating, through the common parameters, the correlation coefficient between each feature in a party's current data set and each feature of the other party; wherein the triples are generated by a trusted third party;

S306, the initiator selects one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain the initiator training data set, and sends the selected anonymous features of the partner to the partner; the partner maps the selected anonymous features to local features to obtain the partner training data set.

Each party performs Pearson correlation analysis on the features within its own data set and then performs feature selection. Specifically, the correlation coefficient between each feature and every other feature in the data set is calculated. Several features may turn out to be highly correlated with one another; in that case one of them is selected as the representative feature, and every other such group of features is handled in the same way. The representative features finally form the update data set.
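The local correlation analysis and representative selection can be sketched as follows. The greedy keep-the-first-seen rule, the use of the absolute value of the coefficient and the 0.9 threshold are assumptions of this example; the patent only requires that one feature be kept per highly correlated group. Zero-variance features are assumed to have been removed already, so the denominator is non-zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_representatives(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Made-up data: temp_f is a linear rescaling of temp_c, flow is only loosely related.
local = {
    "temp_c": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_f": [33.8, 35.6, 37.4, 39.2, 41.0],
    "flow":   [2.0, 1.0, 4.0, 3.0, 5.0],
}
representatives = select_representatives(local)
```

With these values the coefficient between temp_c and flow is 0.8, so flow survives, while temp_f is dropped as redundant.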

The initiator update data set and the partner update data set must be sliced before transmission. Denote the initiator update data set by X; the initiator slices it to obtain a first sliced data set X0 and a second sliced data set X1, and sends X1 to the partner. Denote the partner update data set by Y; the partner slices it to obtain a third sliced data set Y0 and a fourth sliced data set Y1, and sends Y0 to the initiator.
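The patent does not say how the slicing is done. One common construction that matches the description is additive secret sharing, sketched below for a single feature column; the modulus, the fixed-point scale and all function names are assumptions of this example, not details taken from the patent.

```python
import secrets

P = 2**61 - 1      # hypothetical prime modulus
SCALE = 10**6      # hypothetical fixed-point scale for real-valued features

def encode(v):
    """Map a real value to a residue modulo P using fixed-point scaling."""
    return round(v * SCALE) % P

def decode(r):
    """Inverse of encode; residues above P // 2 represent negative values."""
    if r > P // 2:
        r -= P
    return r / SCALE

def slice_column(values):
    """Split a column X into two random-looking slices X0 and X1 with X0 + X1 = X (mod P)."""
    x1 = [secrets.randbelow(P) for _ in values]
    x0 = [(encode(v) - r) % P for v, r in zip(values, x1)]
    return x0, x1

def rebuild_column(x0, x1):
    """Recombine two slices; neither slice reveals the values on its own."""
    return [decode((a + b) % P) for a, b in zip(x0, x1)]

X0, X1 = slice_column([540.25, -1.5, 0.0])   # X1 would be sent to the partner
restored = rebuild_column(X0, X1)
```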

The trusted third party generates the triples (a0, b0, c0) and (a1, b1, c1) and sends them to the parties. Using the triples and its local sliced data, each participant calculates the common parameters x+a and y+b, and then uses the common parameters to calculate the correlation coefficient between each feature in its data set and each feature of the other party. Several features may again turn out to be highly correlated; in that case one of them is selected as the representative feature, and every other such group of features is handled in the same way. The initiator sends the partner's anonymous features that remain after feature selection to the partner, and each participant updates its data set according to the selection result, so that the initiator training data set and the partner training data set are obtained respectively.
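The triples and the common parameters x+a and y+b correspond to the multiplication-triple (Beaver triple) technique from secure multiparty computation. The sketch below shows how two parties could obtain slices of a product, and from that a cross-party dot product, which is the building block of a Pearson numerator. It assumes additive slices modulo a prime and integer inputs, and it simulates both parties and the third party in one process; none of the names come from the patent.

```python
import secrets

P = 2**61 - 1  # hypothetical prime modulus

def share(v):
    """Additively slice v into two residues that sum to v modulo P."""
    s1 = secrets.randbelow(P)
    return (v - s1) % P, s1

def make_triple():
    """Trusted third party: random a, b and c = a*b, each sliced for two parties."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    (a0, a1), (b0, b1), (c0, c1) = share(a), share(b), share(a * b % P)
    return (a0, b0, c0), (a1, b1, c1)

def beaver_multiply(x_shares, y_shares, triples):
    """Return slices z0, z1 with z0 + z1 = x*y (mod P)."""
    # Common parameters: each party publishes its masked slices, both learn e and f.
    e = sum(x_shares[i] + triples[i][0] for i in (0, 1)) % P   # e = x + a
    f = sum(y_shares[i] + triples[i][1] for i in (0, 1)) % P   # f = y + b
    z = []
    for i in (0, 1):
        a_i, b_i, c_i = triples[i]
        z_i = (c_i - e * b_i - f * a_i) % P    # local part of (e-a)(f-b)
        if i == 0:
            z_i = (z_i + e * f) % P            # public term e*f added by one party only
        z.append(z_i)
    return z

def secure_dot(xs, ys):
    """Dot product of an initiator column xs and a partner column ys via triples."""
    total = [0, 0]
    for x, y in zip(xs, ys):
        z = beaver_multiply(share(x), share(y), make_triple())
        total = [(total[i] + z[i]) % P for i in (0, 1)]
    return sum(total) % P                      # slices are combined only at the end

product = sum(beaver_multiply(share(7), share(6), make_triple())) % P
dot = secure_dot([1, 2, 3], [4, 5, 6])
```

The identity behind the code is x·y = (e − a)(f − b) = e·f − e·b − f·a + c, so each party only needs e, f and its own triple slices. A fresh triple is used for every multiplication.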

As an alternative implementation manner, the embodiment of the present invention further provides a federal learning-based fault prediction system, and fig. 6 is a schematic structural diagram of the federal learning-based fault prediction system provided by the embodiment of the present invention, as shown in fig. 6, where the system includes:

the data fusion module 100 is configured to obtain a first data set of the detection target at the initiator and a second data set of the partner, calculate an ID intersection of the first data set and the second data set, and obtain an original data set of the initiator and an original data set of the partner according to the ID intersection and the local data sets of the parties, respectively;

Because equipment and parts carry KKS codes, which serve as unique identification codes, the intersection of the first data set and the second data set is computed on the KKS codes. If the intersection contains n samples, the initiator original data set is X (m, n) and the partner original data set is Y (t, n).
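The alignment step can be pictured with the plain, non-private sketch below, which only shows what the result of the ID intersection looks like. The codes and values are made up, and the function name is hypothetical; a real deployment would need to compute the intersection without revealing non-matching IDs, which is not shown here.

```python
def align_on_ids(first, second):
    """Keep only the samples whose ID appears in both data sets.

    first, second: dicts mapping an ID (here a KKS-style code) to a feature row.
    Returns the sorted common IDs and the two aligned lists of rows.
    """
    common = sorted(first.keys() & second.keys())
    return common, [first[i] for i in common], [second[i] for i in common]

# Made-up IDs and values, for illustration only.
initiator_rows = {"1LAB10": [540.0, 16.5], "1LAB20": [538.0, 16.1], "1MAA10": [535.0, 15.9]}
partner_rows = {"1LAB20": [0.11], "1MAA10": [0.14], "1MAV30": [0.09]}
ids, X, Y = align_on_ids(initiator_rows, partner_rows)
```

After alignment both parties hold the same n samples in the same order, each with its own features.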

The data processing module 200 is configured to perform data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner;

Data interaction is carried out between the initiator original data set and the partner original data set, and the data of both parties are then processed, for example by filtering abnormal data and supplementing missing data, to obtain the initiator standby data set and the partner standby data set and thereby improve data quality.

The feature engineering module 300 is configured to perform correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set;

The prediction module 400 is configured to perform model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and perform fault prediction on the detection target by using the fault prediction model to obtain a fault result of the detection target.

As an alternative implementation manner, fig. 7 is a schematic structural diagram of a data processing module provided by an embodiment of the present invention, and as shown in fig. 7, a data processing module 200 includes:

A first statistics sub-module 2001, configured to perform statistics on features in the original data set of the initiator to obtain a first statistics result;

the second statistics sub-module 2002 is configured to perform statistics on features in the original dataset of the partner to obtain a second statistical result, map the local features to anonymous features, and send the anonymous features together with the second statistical result to the initiator;

the filtering submodule 2003 is used for filtering the features in the original data set of the initiator by the initiator according to the variance in the first statistical result and a preset variance threshold value to obtain a filtered data set of the initiator, determining anonymous features to be deleted by the initiator according to the variance in the second statistical result and the preset variance threshold value, sending a deleting instruction of the anonymous features to the partner, and deleting the local features corresponding to the anonymous features by the partner according to the deleting instruction to obtain a filtered data set of the partner;

the exception handling submodule 2004 is used for performing exception value handling and missing value handling on the initiator filtered data set and the partner filtered data set to obtain an initiator standby data set and a partner standby data set respectively.

As an alternative implementation, fig. 8 is a schematic structural diagram of the outlier processing units of the exception handling submodule provided by an embodiment of the present invention. As shown in fig. 8, the exception handling submodule 2004 includes:

The abnormal feature identifying unit 20041 is configured such that the initiator and the partner each screen for abnormal features according to the features and their reasonable intervals; the partner maps its abnormal local features to anonymous features and sends them to the initiator together with the second statistical result; the initiator filters or fills its own abnormal local features and sends the anonymous features and an abnormal processing instruction to the partner; and the partner, after mapping the anonymous features to local features, filters or fills the abnormal local features according to the abnormal processing instruction;

The continuous feature detection unit 20042 is configured such that, for continuous features, each party performs anomaly detection using a box plot and fills the abnormal values in a feature with the median of that feature from the statistical result;

The discrete feature detection unit 20043 is used for performing anomaly detection on the discrete features according to the feature histogram by the initiator, deleting the outlier in the feature, sending the anonymous feature with the outlier and the anomaly handling instruction to the partner, and deleting the outlier in the local feature mapped by the anonymous feature by the partner according to the anomaly handling instruction or filling the mode corresponding to the feature in the statistical result.

As an alternative implementation, fig. 9 is a schematic structural diagram of the missing value processing units of the exception handling submodule provided by an embodiment of the present invention. As shown in fig. 9, the exception handling submodule 2004 includes:

The first missing value processing unit 20044 is configured such that, for discrete features, each party calculates, from the feature histogram, the ratio of the number of samples whose category is missing to the number of all samples, and, when the ratio is smaller than a preset proportion threshold, deletes those samples, or fills them with the mode from the statistical result, or fills them with the median from the statistical result;

The second missing value processing unit 20045 is configured such that, for continuous features, each party calculates the missing rate; deletes the missing samples when the missing rate is smaller than a first preset missing rate threshold; deletes the corresponding feature when the missing rate is greater than a second preset missing rate threshold; and, when the missing rate is between the first and second preset missing rate thresholds, fills the missing values in the missing samples with the mean of the feature from the statistical result.

For continuous features the missing rate is calculated directly. If the missing rate is smaller than the first preset missing rate threshold (1% in this embodiment), the missing samples are deleted; if it is greater than the second preset missing rate threshold (50% in this embodiment), the feature is deleted outright; if it lies between the two thresholds (1% to 50%), the missing values are filled with the mean of the feature.

As an alternative implementation manner, fig. 10 is a schematic structural diagram of a feature engineering module provided by an embodiment of the present invention, and as shown in fig. 10, a feature engineering module 300 includes:

the first correlation analysis submodule 3001 is configured to perform Pearson correlation analysis on the features in the initiator standby data set to obtain the correlation coefficient between each feature and every other feature, and to select one feature from each group of features whose mutual correlation coefficients are greater than a preset coefficient threshold as the representative feature of that group, to obtain the initiator update data set;

the second correlation analysis submodule 3002 is configured to perform Pearson correlation analysis on the features in the partner standby data set to obtain the correlation coefficient between each feature and every other feature, and to select one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain the partner update data set;

A first slicing sub-module 3003, configured to slice the initiator update data set to obtain a first sliced data set and a second sliced data set, and send the second sliced data set to the partner;

a second slicing sub-module 3004, configured to slice the updated data set of the partner to obtain a third sliced data set and a fourth sliced data set, and send the third sliced data set to the initiator;

the federal correlation coefficient calculation submodule 3005 is used for calculating common parameters according to the triples and the current data sets of all parties, and calculating the correlation coefficient between each feature and the other feature in the current data sets of all parties through the common parameters; wherein the triples are generated by a trusted third party;

the feature selection submodule 3006 is configured such that the initiator selects one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain the initiator training data set, and sends the selected anonymous features of the partner to the partner; the partner maps the selected anonymous features to local features to obtain the partner training data set.

The embodiments described above further explain the objects, technical solutions and advantages of the present invention in detail. It should be understood that the foregoing is only illustrative of embodiments of the present invention and is not intended to limit its scope; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A federal learning-based fault prediction method, comprising:

acquiring a first data set of a detection target at an initiator and a second data set of a partner, calculating an ID intersection of the first data set and the second data set, and respectively acquiring an original data set of the initiator and an original data set of the partner according to the ID intersection and the local data sets of the parties;

Performing data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a standby data set of the initiator and a standby data set of the partner;

performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set;

and performing model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and performing fault prediction on the detection target by adopting the fault prediction model to obtain a fault result of the detection target.

2. The federal learning-based fault prediction method according to claim 1, wherein the performing data interaction and data processing on the initiator original data set and the partner original data set to obtain an initiator standby data set and a partner standby data set includes:

computing statistics on the features in the original data set of the initiator to obtain a first statistical result;

computing statistics on the features in the original data set of the partner to obtain a second statistical result, mapping the local features into anonymous features, and then sending the anonymous features and the second statistical result to the initiator;

the initiator filters the features in the original data set of the initiator according to the variances in the first statistical result and a preset variance threshold to obtain an initiator filtered data set; the initiator determines the anonymous features to be deleted according to the variances in the second statistical result and the preset variance threshold and sends a deletion instruction for the anonymous features to the partner; and the partner deletes the local features corresponding to the anonymous features according to the deletion instruction to obtain the partner filtered data set;

and performing outlier processing and missing value processing on the initiator filtered data set and the partner filtered data set to respectively obtain an initiator standby data set and a partner standby data set.

3. The federal learning-based fault prediction method according to claim 2, wherein the outlier processing of both the initiator filtered data set and the partner filtered data set comprises:

the initiator and the partner each screen for abnormal features according to the features and their reasonable intervals; the partner maps its abnormal local features into anonymous features and sends them to the initiator together with the second statistical result; the initiator filters or fills its own abnormal local features and sends the anonymous features and an abnormal processing instruction to the partner; and the partner, after mapping the anonymous features to local features, filters or fills the abnormal local features according to the abnormal processing instruction;

for continuous features, each party performs anomaly detection using a box plot and fills the abnormal values in a feature with the median of that feature from the statistical result;

and for discrete features, the initiator performs anomaly detection using the feature histogram and deletes the abnormal values in its own features, sends the anonymous features that contain abnormal values and an abnormal processing instruction to the partner, and the partner, according to the abnormal processing instruction, deletes the abnormal values in the local features to which the anonymous features map or fills them with the mode of the corresponding feature from the statistical result.

4. The federal learning-based fault prediction method according to claim 2, wherein the performing missing value processing on the initiator filtered data set and partner filtered data set comprises:

for discrete features, each party calculates, from the feature histogram, the ratio of the number of samples whose category is missing to the number of all samples, and, when the ratio is smaller than a preset proportion threshold, deletes those samples, or fills them with the mode from the statistical result, or fills them with the median from the statistical result;

for continuous features, each party calculates the missing rate; deletes the missing samples when the missing rate is smaller than a first preset missing rate threshold; deletes the corresponding feature when the missing rate is greater than a second preset missing rate threshold; and, when the missing rate is between the first and second preset missing rate thresholds, fills the missing values in the missing samples with the mean of the feature from the statistical result.

5. The federal learning-based fault prediction method according to claim 1, wherein the performing correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set includes:

performing Pearson correlation analysis on the features in the initiator standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from each group of features whose mutual correlation coefficients are greater than a preset coefficient threshold as the representative feature of that group, to obtain an initiator update data set;

performing Pearson correlation analysis on the features in the partner standby data set to obtain the correlation coefficient between each feature and every other feature, and selecting one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain a partner update data set;

slicing the initiator update data set to obtain a first sliced data set and a second sliced data set, and sending the second sliced data set to the partner;

slicing the partner update data set to obtain a third sliced data set and a fourth sliced data set, and sending the third sliced data set to the initiator;

calculating common parameters according to the triples and the current data sets of all the parties, and calculating, through the common parameters, the correlation coefficient between each feature in a party's current data set and each feature of the other party; wherein the triples are generated by a trusted third party;

the initiator selects one feature from each group of features whose mutual correlation coefficients are greater than the preset coefficient threshold as the representative feature of that group, to obtain the initiator training data set, and sends the selected anonymous features of the partner to the partner; and the partner maps the selected anonymous features to local features to obtain the partner training data set.

6. A federal learning-based fault prediction system, comprising:

the data fusion module is used for acquiring a first data set of a detection target at an initiator and a second data set of a partner, calculating an ID intersection of the first data set and the second data set, and respectively acquiring an original data set of the initiator and an original data set of the partner according to the ID intersection and the local data sets of the parties;

the data processing module is used for carrying out data interaction and data processing on the original data set of the initiator and the original data set of the partner to obtain a stand-by data set of the initiator and a stand-by data set of the partner;

The feature engineering module is used for carrying out correlation analysis and feature selection according to the initiator standby data set and the partner standby data set to obtain an initiator training data set and a partner training data set;

and the prediction module is used for carrying out model training according to the initiator training data set and the partner training data set to obtain a fault prediction model, and carrying out fault prediction on the detection target by adopting the fault prediction model to obtain a fault result of the detection target.

7. The federal learning-based fault prediction system according to claim 6, wherein the data processing module comprises:

the first statistics sub-module is used for counting the characteristics in the original data set of the initiator to obtain a first statistics result;

the second statistics sub-module is used for carrying out statistics on the features in the original data set of the partner to obtain a second statistical result, mapping the local features into anonymous features and then sending the anonymous features together with the second statistical result to the initiator;

the filtering sub-module is used for filtering the features in the original data set of the initiator by the initiator according to the variance in the first statistical result and a preset variance threshold value to obtain a filtered data set of the initiator, determining anonymous features to be deleted by the initiator according to the variance in the second statistical result and the preset variance threshold value, sending a deletion instruction of the anonymous features to the partner, and deleting the local features corresponding to the anonymous features by the partner according to the deletion instruction to obtain the filtered data set of the partner;

And the exception processing sub-module is used for performing exception value processing and missing value processing on the initiator filtered data set and the partner filtered data set to respectively obtain an initiator standby data set and a partner standby data set.

8. The federal learning-based fault prediction system according to claim 7, wherein the exception handling submodule comprises:

the abnormal feature identification unit is used for the initiator and the partner to each screen for abnormal features according to the features and their reasonable intervals, for the partner to map its abnormal local features into anonymous features and send them to the initiator together with a second statistical result, for the initiator to filter or fill its own abnormal local features and send the anonymous features and an abnormal processing instruction to the partner, and for the partner, after mapping the anonymous features to local features, to filter or fill the abnormal local features according to the abnormal processing instruction;

the continuous feature detection unit is used for each party to perform anomaly detection on continuous features using a box plot and to fill the abnormal values in a feature with the median of that feature from the statistical result;

the discrete feature detection unit is used for carrying out anomaly detection on the discrete features by the initiator according to the feature histogram, deleting the abnormal value in the feature, sending the anonymous feature with the abnormal value and the anomaly processing instruction to the partner, and deleting the abnormal value in the local feature mapped by the anonymous feature by the partner according to the anomaly processing instruction or adopting mode filling corresponding to the feature in the statistical result.

9. The federally-learning-based fault prediction system according to claim 7, wherein the exception handling submodule includes:

the first missing value processing unit is used for calculating the ratio of the number of missing samples with the type of missing to the number of all samples in the characteristic histogram for the discrete characteristics, deleting the samples with the ratio smaller than a preset duty ratio threshold, or filling the samples with the mode in the statistical result, or filling the samples with the median in the statistical result;

a second missing value processing unit for calculating a missing rate for each of the continuous features; deleting the missing samples when the missing rate is smaller than a first preset missing rate threshold value; deleting the corresponding feature when the deletion rate is greater than a second preset deletion rate threshold; and when the deletion rate is between the first preset deletion rate threshold value and the second preset deletion rate threshold value, filling the deletion value in the deletion sample with the average number of the features in the statistical result.

10. The federal learning-based fault prediction system according to claim 6, wherein the feature engineering module comprises:

the first correlation analysis submodule is used for carrying out Pearson correlation analysis on the features in the standby data set of the initiator to obtain correlation coefficients between each feature and the other features, and selecting one feature from a plurality of features whose correlation coefficients are larger than a preset coefficient threshold as a representative feature of the plurality of features to obtain an updated data set of the initiator;

the second correlation analysis submodule is used for carrying out Pearson correlation analysis on the features in the standby data set of the partner to obtain correlation coefficients between each feature and the other features, and selecting one feature from a plurality of features whose correlation coefficients are larger than a preset coefficient threshold as a representative feature of the plurality of features to obtain an updated data set of the partner;

the first slicing submodule is used for slicing the updated data set of the initiator to obtain a first sliced data set and a second sliced data set, and the second sliced data set is sent to the partner;

the second slicing submodule is used for slicing the updated data set of the partner to obtain a third sliced data set and a fourth sliced data set, and the third sliced data set is sent to the initiator;

the federal correlation coefficient calculation submodule is used for calculating common parameters according to the triples and the current data sets of all the parties, and calculating, through the common parameters, the correlation coefficient between each feature in the current data set of each party and each feature of the other party; wherein the triples are generated by a trusted third party;

the feature selection submodule is used for the initiator to select one feature from a plurality of features whose mutual correlation coefficients are larger than a preset coefficient threshold and take the selected feature as a representative feature of the plurality of features to obtain the initiator training data set, and to send the selected anonymous features of the partner to the partner, and for the partner to map the selected anonymous features to local features to obtain the partner training data set.
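
The slicing and federal correlation steps can be pictured with the following single-process simulation. It rests on several assumptions that the claims do not spell out: the slices are taken to be additive secret shares modulo a prime, the triples are taken to be Beaver multiplication triples, the common parameters are taken to be the opened masked values `e` and `f`, and each party is assumed to standardize its feature locally so that the Pearson coefficient reduces to an inner product. All names, the modulus `P` and the fixed-point `SCALE` are hypothetical, and `random.Random` is used only for reproducibility; it is not a cryptographically secure source.

```python
import math
import random

P = 2**61 - 1   # prime modulus for additive sharing (assumed)
SCALE = 10**6   # fixed-point scale for encoding real values (assumed)


def encode(v):
    return round(v * SCALE) % P


def decode(v, scale):
    v %= P
    if v > P // 2:          # map back to a signed value
        v -= P
    return v / scale


def share(v, rng):
    """Split v into two additive shares modulo P (the two slices)."""
    s0 = rng.randrange(P)
    return s0, (v - s0) % P


def beaver_triple(rng):
    """Role of the trusted third party: shares of a, b and c = a * b."""
    a, b = rng.randrange(P), rng.randrange(P)
    return share(a, rng), share(b, rng), share(a * b % P, rng)


def shared_multiply(x_sh, y_sh, triple):
    """Multiply two shared values; e and f are the opened common parameters."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    e = (x_sh[0] - a0 + x_sh[1] - a1) % P   # x - a, revealed to both parties
    f = (y_sh[0] - b0 + y_sh[1] - b1) % P   # y - b, revealed to both parties
    z0 = (c0 + e * b0 + f * a0 + e * f) % P
    z1 = (c1 + e * b1 + f * a1) % P
    return z0, z1                            # shares of x * y


def standardize(col):
    """Center and scale so that Pearson r equals the inner product."""
    mean = sum(col) / len(col)
    norm = math.sqrt(sum((v - mean) ** 2 for v in col))
    return [(v - mean) / norm for v in col]


def federated_pearson(x, y, seed=0):
    """x is held by the initiator, y by the partner (simulated in one process)."""
    rng = random.Random(seed)
    total0 = total1 = 0
    for xv, yv in zip(standardize(x), standardize(y)):
        z0, z1 = shared_multiply(share(encode(xv), rng),
                                 share(encode(yv), rng),
                                 beaver_triple(rng))
        total0, total1 = (total0 + z0) % P, (total1 + z1) % P
    return decode(total0 + total1, SCALE * SCALE)


r = federated_pearson([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Only the masked values `e` and `f` and the final sum are reconstructed in this sketch; the individual feature values stay in shared form, which is one way to read the requirement that the correlation be computed without either party seeing the other's raw data.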