CN117252287A

CN117252287A - Index prediction method and system based on federal pearson correlation analysis

Info

Publication number: CN117252287A
Application number: CN202310981568.5A
Authority: CN
Inventors: 孙银银; 兰春嘉
Original assignee: Shanghai Lingshuzhonghe Information Technology Co ltd
Current assignee: Shanghai Lingshuzhonghe Information Technology Co ltd
Priority date: 2023-08-04
Filing date: 2023-08-04
Publication date: 2023-12-19
Anticipated expiration: 2043-08-04

Abstract

The invention discloses an index prediction method and system based on federal Pearson correlation analysis. The method comprises the following steps: normalizing the data sets; slicing the standard data sets; computing a multiplication pair in a trusted execution environment; calculating a first common parameter and a second common parameter; calculating a first slicing correlation coefficient and a second slicing correlation coefficient according to the common parameters; calculating a federal correlation coefficient according to the first and second slicing correlation coefficients; performing federal Pearson correlation analysis according to the federal correlation coefficient; determining model training data according to the analysis result, and training a federal learning model with the model training data; and performing index prediction through the federal learning model, the index comprising a fault performance index and a profit index. Because the multiplication pair is computed in the trusted execution environment, the communication overhead of ciphertext computation is reduced and performance is improved; in addition, the computing tasks can be partitioned and executed in parallel, which greatly improves computational efficiency.

Description

Index prediction method and system based on federal pearson correlation analysis

Technical Field

The invention relates to the technical field of privacy computation, in particular to an index prediction method and system based on federal pearson correlation analysis.

Background

In longitudinal federal modeling, the data sets of the task initiator and the partner share a common sample space but have different feature spaces, and an encryption algorithm must be used to guarantee data privacy and security. Correlation analysis is performed between each continuous feature of the local node and the features of the partner node, and features with high correlation are removed to improve modeling efficiency and modeling accuracy. Existing federal correlation analysis can be converted into secret matrix multiplication, which has poor performance, high communication overhead, a complex calculation process and low efficiency.

Aiming at the problems of poor federal correlation analysis performance, high communication overhead, complex calculation process and low efficiency in the prior art, no effective solution is proposed at present.

Disclosure of Invention

The embodiment of the invention provides an index prediction method and system based on federal pearson correlation analysis, which are used for solving the problems of poor federal correlation analysis performance, high communication overhead, complex calculation process and low efficiency in the prior art.

To achieve the above object, in one aspect, the present invention provides an index prediction method based on federal Pearson correlation analysis, the method comprising:

S1, normalizing a first data set X of an initiator in longitudinal federal learning to obtain a first standard data set X', and normalizing a second data set Y of a partner to obtain a second standard data set Y';

S2, the initiator slices X' according to a first random data set R0 to obtain a first sliced data set X0', and the partner takes the shared R0 as a second sliced data set X1'; the partner slices Y' according to a second random data set R1 to obtain a third sliced data set Y0', and the initiator takes the shared R1 as a fourth sliced data set Y1';

S3, the trusted execution environment calculates a product data set c according to a third random data set a0, a fourth random data set b0, a fifth random data set a1 and a sixth random data set b1; randomly generates a first generated data set c0 with the same size as the matrix c; and calculates a second generated data set c1 according to c and c0; the initiator shares a0 and b0 and acquires c0 sent by the trusted execution environment; the partner shares a1 and b1 and acquires c1 sent by the trusted execution environment;

S4, the partner calculates a first common parameter according to X1', a1 and the sum of X0' and a0 shared by the initiator; the initiator calculates a second common parameter according to Y1', b0 and the sum of Y0' and b1 shared by the partner;

S5, the initiator calculates a first slicing correlation coefficient according to a0, b0, c0, the second common parameter and the shared first common parameter; the partner calculates a second slicing correlation coefficient according to a1, b1, c1, the first common parameter and the shared second common parameter;

S6, the two parties each calculate the federal correlation coefficient according to their own slicing correlation coefficient and the slicing correlation coefficient shared by the counterpart;

S7, performing federal Pearson correlation analysis according to the federal correlation coefficient, determining model training data according to the analysis result, and training a federal learning model with the model training data;

S8, performing index prediction through the federal learning model; the index comprises a fault performance index and a profit index.

Optionally, the S2 includes: the method comprises the steps that a first random seed generated by an initiator is sent to a partner, and the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator fragments the first standard data set X 'according to the first random data set R0 to obtain a first fragmented data set X0'; the partner takes the first random data set R0 as a second fragment data set X1'; the second random seed generated by the partner is sent to the initiator, and the initiator and the partner generate a second random data set R1 according to the second random seed; the partner fragments the second standard data set Y 'according to the second random data set R1 to obtain a third fragmented data set Y0'; the initiator takes the second random data set R1 as a fourth sliced data set Y1'.

Optionally, the product data set c is calculated according to the following formula:

c＝(a0+a1)×(b0+b1)；

the second generated dataset c1 is calculated according to the following formula:

c1＝c-c0；

wherein a0 is the third random data set, a1 is the fifth random data set, b0 is the fourth random data set, b1 is the sixth random data set, c is the product data set, c0 is the first generated data set, and c1 is the second generated data set.

Optionally, the first common parameter is calculated according to the following formula:

X’+a＝X0’+a0+X1’+a1；

wherein X ' +a is a first common parameter, X0' is a first sliced data set, a0 is a third random data set, X1' is a second sliced data set, and a1 is a fifth random data set;

the second common parameter is calculated according to the following formula:

Y’+b＝Y0’+b1+Y1’+b0；

wherein Y ' +b is a second common parameter, Y0' is a third sliced data set, b1 is a sixth random data set, Y1' is a fourth sliced data set, and b0 is a fourth random data set.

Optionally, the first slice correlation coefficient is calculated according to the following formula:

corr0＝c0-a0*(Y’+b)-(X’+a)*b0+(X’+a)*(Y’+b)；

wherein corr0 is a first slicing correlation coefficient, c0 is a first generated data set, a0 is a third random data set, b0 is a fourth random data set, X '+a is a first common parameter, and Y' +b is a second common parameter;

the second slice correlation coefficient is calculated according to the following formula:

corr1＝c1-a1*(Y’+b)-(X’+a)*b1；

wherein corr1 is a second slice correlation coefficient, c1 is a second generated data set, a1 is a fifth random data set, b1 is a sixth random data set, X '+a is a first common parameter, and Y' +b is a second common parameter.
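The two formulas above can be checked by adding them. The short derivation below is a sketch of that check and is not part of the original text; it writes a = a0 + a1, b = b0 + b1 and c = c0 + c1 = a × b, and treats each product as a matrix product with whatever transposition makes the dimensions agree (an assumption, since the source writes the products without transposes).

```latex
\begin{aligned}
corr_0 + corr_1
  &= (c_0 + c_1) - (a_0 + a_1)(Y' + b) - (X' + a)(b_0 + b_1) + (X' + a)(Y' + b) \\
  &= ab - a(Y' + b) - (X' + a)b + (X' + a)(Y' + b) \\
  &= ab - aY' - ab - X'b - ab + X'Y' + X'b + aY' + ab \\
  &= X'Y'
\end{aligned}
```

All terms containing the random masks cancel, so the sum of the two slicing correlation coefficients depends only on the standardized data X' and Y', while neither party sees the other party's data unmasked.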

In another aspect, the present invention provides an index prediction system based on federal Pearson correlation analysis, the system comprising:

the normalization unit is used for normalizing the first data set X of the initiator in longitudinal federal learning to obtain a first standard data set X', and normalizing the second data set Y of the partner to obtain a second standard data set Y';

the slicing unit is used for the initiator to slice X' according to the first random data set R0 to obtain a first sliced data set X0', and for the partner to take the shared R0 as a second sliced data set X1'; the partner slices Y' according to the second random data set R1 to obtain a third sliced data set Y0', and the initiator takes the shared R1 as a fourth sliced data set Y1';

the generated data set calculation unit is used for the trusted execution environment to calculate a product data set c according to the third random data set a0, the fourth random data set b0, the fifth random data set a1 and the sixth random data set b1; to randomly generate a first generated data set c0 with the same size as the matrix c; and to calculate a second generated data set c1 according to c and c0; the initiator shares a0 and b0 and acquires c0 sent by the trusted execution environment; the partner shares a1 and b1 and acquires c1 sent by the trusted execution environment;

the common parameter calculation unit is used for the partner to calculate a first common parameter according to X1', a1 and the sum of X0' and a0 shared by the initiator; the initiator calculates a second common parameter according to Y1', b0 and the sum of Y0' and b1 shared by the partner;

the slicing correlation coefficient calculation unit is used for the initiator to calculate a first slicing correlation coefficient according to a0, b0, c0, the second common parameter and the shared first common parameter; the partner calculates a second slicing correlation coefficient according to a1, b1, c1, the first common parameter and the shared second common parameter;

the federal correlation coefficient calculation unit is used for the two parties to each calculate the federal correlation coefficient according to their own slicing correlation coefficient and the slicing correlation coefficient shared by the counterpart;

the analysis unit is used for performing federal Pearson correlation analysis according to the federal correlation coefficient, determining model training data according to the analysis result, and training a federal learning model with the model training data;

the prediction unit is used for performing index prediction through the federal learning model; the index comprises a fault performance index and a profit index.

Optionally, the slicing unit includes: the first segmentation subunit is used for sending a first random seed generated by the initiator to the partner, and the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator fragments the first standard data set X 'according to the first random data set R0 to obtain a first fragmented data set X0'; the partner takes the first random data set R0 as a second fragment data set X1'; the second segmentation subunit is used for sending a second random seed generated by the partner to the initiator, and the initiator and the partner generate a second random data set R1 according to the second random seed; the partner fragments the second standard data set Y 'according to the second random data set R1 to obtain a third fragmented data set Y0'; the initiator takes the second random data set R1 as a fourth sliced data set Y1'.

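Optionally, the product data set c is calculated according to the following formula:

c＝(a0+a1)×(b0+b1)；

the second generated data set c1 is calculated according to the following formula:

c1＝c-c0；

Optionally, the first common parameter is calculated according to the following formula:

X’+a＝X0’+a0+X1’+a1；

the second common parameter is calculated according to the following formula:

Y’+b＝Y0’+b1+Y1’+b0；

Optionally, the first slicing correlation coefficient is calculated according to the following formula:

corr0＝c0-a0*(Y’+b)-(X’+a)*b0+(X’+a)*(Y’+b)；

the second slicing correlation coefficient is calculated according to the following formula:

corr1＝c1-a1*(Y’+b)-(X’+a)*b1；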

The invention has the beneficial effects that:

The invention provides an index prediction method and system based on federal Pearson correlation analysis, in which the data sets are normalized and the standard data sets are sliced. Computing the multiplication pair in the trusted execution environment reduces the communication overhead of ciphertext computation and improves performance; in addition, the computing tasks can be partitioned and executed in parallel, which greatly improves computational efficiency. Selecting the model training data from the result of the federal Pearson correlation analysis greatly improves the relevance of the model training data, which in turn improves the efficiency and accuracy of index prediction.

Drawings

FIG. 1 is a flowchart of an index prediction method based on Federal pearson correlation analysis provided by an embodiment of the present invention;

FIG. 2 is a flow chart of standard data set sharding provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an index prediction system based on federal Pearson correlation analysis according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a slicing unit according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Therefore, the index prediction method based on the federal pearson correlation analysis can reduce communication overhead, improve performance and computational efficiency, and improve the correlation of the determined model training data through the federal pearson correlation analysis, thereby improving the prediction efficiency and accuracy of the federal learning model on the user performance index. The federal learning model in the embodiment of the invention can be a wind power equipment fault model, a power supply coal consumption clean profit model and the like.

Fig. 1 is a flowchart of an index prediction method based on federal Pearson correlation analysis according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S1, normalizing a first data set X of an initiator in longitudinal federal learning to obtain a first standard data set X', and normalizing a second data set Y of a partner to obtain a second standard data set Y';

The initiator refers to the initiating party in longitudinal federal learning, and the partner refers to the cooperating party in longitudinal federal learning. Longitudinal federal learning is a federal learning scenario applicable to participants whose data sets have the same sample space and different feature spaces. Through longitudinal federal learning, different participants can jointly train a machine learning model. The initiator and the partner may be different enterprises that need to cooperate. The initiator data set contains user data of the initiator, which is sample data that the initiator acquires with the user's permission and that can represent the user's performance indexes at the initiator; the partner data set contains user data of the partner, which is sample data that the partner acquires with the user's permission and that can represent the user's performance indexes at the partner.

For example, the initiator may be a wind farm and the partner may be a manufacturer of the wind power equipment. The data set of the wind farm comprises SCADA operation data, maintenance ledgers, test data, meteorological data and the like of the wind power equipment; the data set of the manufacturer comprises the manufacturer's design parameters and the like of the wind power equipment. The data sets of the wind farm and of the manufacturer are each standardized and sliced, and a model training data set is determined according to the federal Pearson analysis result, so that the fault performance index of the wind power equipment can be accurately predicted through the federal learning model. For another example, the initiator may be a group power company and the partner may be a subsidiary power company. The data set of the group power company comprises operating financial data (such as the actual net profit value of power supply coal consumption); the data set of the subsidiary power company comprises power supply coal consumption data. By processing the data sets of the group power company and the subsidiary power company with the federal Pearson analysis method, feature data strongly correlated with power supply coal consumption can be obtained; a power supply coal consumption net profit model is determined from this strongly correlated feature data, and the model enables effective prediction of the power supply coal consumption profit.

Specifically, in the longitudinal federal modeling task, users of the initiator and the partner have intersections and different features. Assuming that the number of samples of an intersection is n after intersection based on user id, the initiator has m features, and the partner has t features; the first data set of the initiator is X (m, n) and the second data set of the partner is Y (t, n);

The features of the first data set X (m, n) of the initiator are xi, i = 1, 2, …, m; each feature is an n-dimensional vector with components xij, j = 1, 2, …, n. The average value of feature xi is ui = (xi1 + xi2 + … + xin)/n; the standard deviation of feature xi is δi = (((xi1 - ui)^2 + (xi2 - ui)^2 + … + (xin - ui)^2)/n)^0.5; the normalized feature is xij' = (xij - ui)/δi, where i = 1, 2, …, m and j = 1, 2, …, n. Normalizing each feature of the first data set X yields the first standard data set X'; similarly, the second data set Y of the partner is normalized to obtain the second standard data set Y'.
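The normalization in S1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `standardize` and the toy values are assumptions, and the data set is stored with one feature per row to match the X (m, n) layout described above.

```python
import math

def standardize(dataset):
    """Standardize each feature (row) to zero mean and unit population std.

    Follows the formulas of S1: ui is the mean of feature i, and the
    standard deviation divides by n (population form), not n - 1.
    A constant feature would give a zero deviation and is not handled here.
    """
    standardized = []
    for feature in dataset:
        n = len(feature)
        mean = sum(feature) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in feature) / n)
        standardized.append([(v - mean) / std for v in feature])
    return standardized

# Toy first data set X with m = 2 features and n = 4 samples (hypothetical values).
X = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 10.5, 9.5, 12.0]]
X_std = standardize(X)
```

After this step every feature has mean 0 and mean square 1, which is what later allows a correlation to be obtained from a product of standardized features.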

S2, the initiator fragments X ' according to the first random data set R0 to obtain a first fragmented data set X0', and the partner takes the shared R0 as a second fragmented data set X1'; the partner fragments Y ' according to the second random data set R1 to obtain a third fragment data set Y0', and the initiator takes the shared R1 as a fourth fragment data set Y1';

Fig. 2 is a flowchart of standard data set slicing provided by an embodiment of the present invention, as shown in fig. 2, where S2 includes:

S21, a first random seed generated by the initiator is sent to the partner, and the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator slices the first standard data set X' according to the first random data set R0 to obtain a first sliced data set X0'; the partner takes the first random data set R0 as a second sliced data set X1';

The initiator generates a first random seed seed0 = k, where k is any value selected from [1, 2, …, 1000000], and sends the first random seed to the partner. The initiator and the partner each generate the first random data set R0 = [r0i] = [r01, r02, …, r0m] according to the first random seed, where i = 1, 2, …, m and each r0i is an n-dimensional vector with components indexed by j = 1, 2, …, n. With the first standard data set of the initiator X' = [x1', x2', …, xm'], the initiator acquires the first sliced data set X0' = X' - R0, and the partner acquires the second sliced data set X1' = R0.

S22, a second random seed generated by the partner is sent to the initiator, and the initiator and the partner generate a second random data set R1 according to the second random seed; the partner fragments the second standard data set Y 'according to the second random data set R1 to obtain a third fragmented data set Y0'; the initiator takes the second random data set R1 as a fourth sliced data set Y1'.

The partner generates a second random seed seed1 = k, where k is any value selected from [1, 2, …, 1000000], and sends the second random seed to the initiator. The initiator and the partner each generate the second random data set R1 = [r1i] = [r11, r12, …, r1t] according to the second random seed, where i = 1, 2, …, t and each r1i is an n-dimensional vector with components indexed by j = 1, 2, …, n. With the second standard data set of the partner Y' = [y1', y2', …, yt'], the partner acquires the third sliced data set Y0' = Y' - R1, and the initiator acquires the fourth sliced data set Y1' = R1.

Each party sends only the random seed to the other party, not the random data set itself. Because the random data set is large and would take a long time to transmit, sending only the seed saves transmission time and improves computational efficiency.
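The seed-based slicing of S2 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the source does not say which generator or value distribution is used, so Python's `random.Random` with uniform values is a stand-in, and the helper names are hypothetical. `random.Random` is not a cryptographically secure generator.

```python
import random

def random_dataset(seed, rows, cols):
    """Derive a random data set deterministically from a shared seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Initiator's standardized data X' with m = 2 features and n = 3 samples (toy values).
X_std = [[-1.2, 0.0, 1.2],
         [0.7, -1.4, 0.7]]

seed0 = 424242  # chosen by the initiator from [1, 1000000] and sent to the partner

# Both parties expand the same seed locally; only the seed crosses the network.
R0_initiator = random_dataset(seed0, 2, 3)
R0_partner = random_dataset(seed0, 2, 3)

X0 = subtract(X_std, R0_initiator)  # first sliced data set X0' = X' - R0 (initiator)
X1 = R0_partner                     # second sliced data set X1' = R0 (partner)

reconstructed = add(X0, X1)         # equals X' up to floating-point rounding
```

The partner's slicing of Y' with seed1 and R1 is symmetric. Sending one integer instead of an m × n matrix is what produces the transmission saving described above.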

S3, the trusted execution environment calculates a product data set c according to the third random data set a0, the fourth random data set b0, the fifth random data set a1 and the sixth random data set b 1; randomly generating a first generated data set c0 with the same size as the c matrix according to the c; calculating a second generated data set c1 according to the c and the c0; the initiator shares a0 and b0 and acquires c0 sent by the trusted execution environment; the partner shares a1 and b1 and acquires c1 sent by the trusted execution environment;

Specifically, the trusted execution environment and the initiator generate a third random data set a0 based on the same random seed, a0 = [a0i] = [a01, a02, …, a0m], where i = 1, 2, …, m; each a0i is an n-dimensional vector with components indexed by j = 1, 2, …, n;

the trusted execution environment and the partner generate a fifth random data set a1 based on the same random seed, a1 = [a1i] = [a11, a12, …, a1m], where i = 1, 2, …, m; each a1i is an n-dimensional vector with components indexed by j = 1, 2, …, n;

the trusted execution environment and the initiator generate a fourth random data set b0 based on the same random seed, b0 = [b0i] = [b01, b02, …, b0t], where i = 1, 2, …, t; each b0i is an n-dimensional vector with components indexed by j = 1, 2, …, n;

the trusted execution environment and the partner generate a sixth random data set b1 based on the same random seed, b1 = [b1i] = [b11, b12, …, b1t], where i = 1, 2, …, t; each b1i is an n-dimensional vector with components indexed by j = 1, 2, …, n;

The trusted execution environment knows a0, a1, b0 and b1, and first calculates the product data set c = (a0+a1)×(b0+b1). The matrix c has size m×t: c = [ci] = [c1, c2, …, ct], i = 1, 2, …, t, where each ci is an m-dimensional vector with components indexed by k = 1, 2, …, m;

the trusted execution environment generates a random integer matrix c0 of m rows and t columns, i.e. the first generated data set, c0 = [c0i] = [c01, c02, …, c0t], i = 1, 2, …, t, where each c0i is an m-dimensional vector with components indexed by k = 1, 2, …, m; the trusted execution environment sends c0 to the initiator;

the trusted execution environment calculates the second generated data set c1 = c - c0; c1 = [c1i] = [c11, c12, …, c1t], i = 1, 2, …, t, where each c1i is an m-dimensional vector with components indexed by k = 1, 2, …, m; the trusted execution environment sends c1 to the partner.
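The work done inside the trusted execution environment in S3 can be sketched as below. Several details are assumptions made for illustration: the function and parameter names are hypothetical, the random values are small integers, and the product is taken as (a0+a1) times the transpose of (b0+b1), which is inferred from the stated sizes (m × n, t × n, and m × t for c) rather than written in the source.

```python
import random

def random_matrix(rng, rows, cols):
    return [[rng.randint(-100, 100) for _ in range(cols)] for _ in range(rows)]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul_transposed(A, B):
    """Return A x B^T for A of shape (m, n) and B of shape (t, n)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in B] for ra in A]

def tee_generate_triple(seed_initiator, seed_partner, m, t, n):
    """Simulate the trusted execution environment producing the multiplication pair."""
    # a0, b0 come from the seed shared with the initiator, so the initiator can
    # regenerate them locally from the same seed and call order.
    rng_i = random.Random(seed_initiator)
    a0, b0 = random_matrix(rng_i, m, n), random_matrix(rng_i, t, n)
    # a1, b1 come from the seed shared with the partner.
    rng_p = random.Random(seed_partner)
    a1, b1 = random_matrix(rng_p, m, n), random_matrix(rng_p, t, n)
    # Product data set c, of size m x t.
    c = matmul_transposed(add(a0, a1), add(b0, b1))
    # c0 is drawn at random inside the environment; c1 is whatever completes c.
    c0 = random_matrix(random.Random(), m, t)
    c1 = sub(c, c0)
    return (a0, b0, c0), (a1, b1, c1)

share_initiator, share_partner = tee_generate_triple(11, 22, m=3, t=2, n=5)
```

Only c0 and c1 need to be transmitted, since each party can derive its own a and b from the seed it shares with the environment.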

S4, the partner calculates a first public parameter according to the sum of X1', a1 and X0' and a0 of the shared initiator; the initiator calculates a second public parameter according to the Y1', the b0 and the sum of Y0' and b1 of the shared partner;

The initiator calculates the sum of X0' and a0 and sends it to the partner; the partner calculates the first common parameter according to X1', a1 and the received sum of X0' and a0; the first common parameter is X'+a = X0'+a0+X1'+a1;

the partner calculates the sum of Y0' and b1 and sends it to the initiator; the initiator calculates the second common parameter according to Y1', b0 and the received sum of Y0' and b1; the second common parameter is Y'+b = Y0'+b1+Y1'+b0;

in another embodiment, the partner calculates a first common parameter from the sum of X1', a1, and X0' and a0 of the shared initiator; the partner calculates a second public parameter according to the sum of Y0', b1 and Y1' and b0 of the shared initiator; the initiator calculates a first public parameter according to the sum of X0', a0 and X1' and a1 of the shared partner; the initiator calculates a second common parameter according to the sum of Y1', b0 and Y0' and b1 of the shared partner.
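A small numeric sketch of S4 follows, using hand-picked toy values (all numbers are hypothetical and chosen so the arithmetic is exact). It shows that the exchanged sums reveal only the masked quantities X'+a and Y'+b.

```python
def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# Slices of X' = [[2.0, 1.0]]: the initiator holds X0', the partner holds X1'.
X0 = [[0.5, -2.0]]
X1 = [[1.5, 3.0]]
# Slices of Y' = [[0.5, 1.0]]: the partner holds Y0', the initiator holds Y1'.
Y0 = [[1.0, 0.25]]
Y1 = [[-0.5, 0.75]]
# Random masks from the multiplication pair.
a0, b0 = [[4, -1]], [[2, 3]]   # held by the initiator
a1, b1 = [[-3, 6]], [[1, -4]]  # held by the partner

# Initiator -> partner: X0' + a0. The partner adds its own X1' + a1.
first_common = add(add(X0, a0), add(X1, a1))   # X' + a

# Partner -> initiator: Y0' + b1. The initiator adds its own Y1' + b0.
second_common = add(add(Y0, b1), add(Y1, b0))  # Y' + b

print(first_common)   # [[3.0, 6.0]]
print(second_common)  # [[3.5, 0.0]]
```

Here a = a0 + a1 = [[1, 5]] and b = b0 + b1 = [[3, -1]], so the printed values are X' and Y' shifted by masks that neither party knows in full.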

S5, the initiator calculates a first slicing correlation coefficient according to the a0, the b0, the c0, the second common parameter and the shared first common parameter; the partner calculates a second fragment correlation coefficient according to the a1, the b1 and the c1, the first public parameter and the shared second public parameter;

the partner sends the first public parameter to the initiator; the initiator sends the second public parameter to the partner;

the first slicing correlation coefficient is calculated according to the following formula:

corr0＝c0-a0*(Y’+b)-(X’+a)*b0+(X’+a)*(Y’+b)；

the second slicing correlation coefficient is calculated according to the following formula:

corr1＝c1-a1*(Y’+b)-(X’+a)*b1；

In another embodiment, the initiator calculates the first slicing correlation coefficient according to a0, b0, c0, the first common parameter and the second common parameter; and the partner calculates the second slicing correlation coefficient according to a1, b1, c1, the first common parameter and the second common parameter.

S6, the two parties respectively calculate and obtain the federal correlation coefficient according to the respective slicing correlation coefficient and the slicing correlation coefficient of the shared counterpart;

the initiator sends the first fragment correlation coefficient to the partner; the partner sends the second fragment correlation coefficient to the initiator; both sides calculate to obtain a federal correlation coefficient corr=corr0+corr1 according to the first slicing correlation coefficient and the second slicing correlation coefficient; where corr is the federal correlation coefficient.
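Steps S2 to S6 can be simulated end to end in one process to confirm that corr0 + corr1 reproduces the product of the standardized data. The sketch below is illustrative only and makes several assumptions that are not in the source: a single random generator stands in for all parties and the trusted execution environment, products are taken as A times the transpose of B so that the sizes agree, and the final division by the sample count n (which turns the product of standardized features into a Pearson coefficient) is added here because the source does not state that scaling.

```python
import math
import random

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def mul_t(A, B):
    """Matrix product A x B^T for A of shape (m, n) and B of shape (t, n)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in B] for ra in A]

def rand_matrix(rng, rows, cols):
    return [[rng.randint(-10, 10) for _ in range(cols)] for _ in range(rows)]

def standardize(data):
    out = []
    for row in data:
        n = len(row)
        mean = sum(row) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
        out.append([(v - mean) / std for v in row])
    return out

def federal_correlation(X_std, Y_std, seed=2023):
    m, n, t = len(X_std), len(X_std[0]), len(Y_std)
    rng = random.Random(seed)  # one generator stands in for every party here
    # S2: slice the standardized data sets.
    R0, R1 = rand_matrix(rng, m, n), rand_matrix(rng, t, n)
    X0, X1 = sub(X_std, R0), R0
    Y0, Y1 = sub(Y_std, R1), R1
    # S3: multiplication pair from the trusted execution environment.
    a0, a1 = rand_matrix(rng, m, n), rand_matrix(rng, m, n)
    b0, b1 = rand_matrix(rng, t, n), rand_matrix(rng, t, n)
    c = mul_t(add(a0, a1), add(b0, b1))
    c0 = rand_matrix(rng, m, t)
    c1 = sub(c, c0)
    # S4: common parameters X' + a and Y' + b.
    e = add(add(X0, a0), add(X1, a1))
    f = add(add(Y0, b1), add(Y1, b0))
    # S5: slicing correlation coefficients of the initiator and the partner.
    corr0 = add(sub(sub(c0, mul_t(a0, f)), mul_t(e, b0)), mul_t(e, f))
    corr1 = sub(sub(c1, mul_t(a1, f)), mul_t(e, b1))
    # S6: federal correlation coefficient; the 1/n scaling is an assumption.
    total = add(corr0, corr1)
    return [[v / n for v in row] for row in total]

X_std = standardize([[1.0, 2.0, 3.0, 4.0]])
Y_std = standardize([[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]])
corr = federal_correlation(X_std, Y_std)
```

In this toy run the initiator's single feature is perfectly correlated with the partner's first feature and perfectly anti-correlated with the second, so `corr` comes out close to 1 and -1 up to floating-point rounding.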

S7, carrying out Federal pearson correlation analysis according to the Federal correlation coefficient, determining model training data according to an analysis result, and training a Federal learning model by adopting the model training data;

S8, index prediction is carried out through the federal learning model; the index comprises a fault performance index and a profit index.

Specifically, a new (current) first data set and a new (current) second data set are input into the federal learning model for index prediction.

The method of the invention is illustrated below by scenario one:

Fault diagnosis of wind farm fan equipment requires the equipment's SCADA operation data, maintenance ledgers, test data, manufacturer design parameters, meteorological data and the like; for equipment that has just been put into operation and has no historical data, related equipment data from other wind farms is needed. Part of the manufacturer's design data has a great influence on the prediction result, but because commercial confidentiality is involved, this core data cannot conveniently be exposed. Federal learning solves this problem: the data never leaves the local party, which protects data security while enabling federal modeling with rich features and improving the prediction precision of the model. Federal correlation analysis is used to mine, from the mass data of the wind farm and the partner (manufacturer), the features that are strongly related to a certain fan fault; federal correlation analysis is then performed among these features, one feature is kept from each group of strongly correlated features and the others are removed, reducing the linear coupling among the features. In this way a feature set is established for each fault mode, and the efficiency and accuracy of federal modeling are improved.

Take correlation analysis of the feature set for the fan gearbox fault model as an example. Modeling features are generally selected according to expert experience, so important features may be omitted, and the selection also depends on the accuracy of the data acquired by the sensors. Using the result of data correlation analysis to assist expert modeling is therefore more scientific.

Step one, the data are intersected longitudinally, that is, aligned on time. The wind farm mainly acquires SCADA (supervisory control and data acquisition) data, including features such as wind speed, active power, generator rotating speed, wind alignment angle, fan low-speed bearing temperature, fan high-speed bearing temperature, fan drive-end bearing temperature, fan free-end bearing temperature and gearbox oil temperature; the manufacturer's data mainly comprise temperature curves of normal gearbox operation under different working conditions. The different feature spaces at the same time points are taken as the respective data sets: the data set of the wind farm is X, and the data set of the manufacturer under different working conditions is Y;

Step two, the data sets X and Y are standardized to obtain x and y;

step three, calculating the correlation between the features of the local data set x or y, screening in the feature set with strong correlation, and reducing the linear coupling between the features;

Taking x as an example, x has n features, each feature being a vector of m dimensions: x = [xi] = [x1, x2, …, xn], xi = [xi1, xi2, …, xim], i = 1, 2, …, n. The Pearson correlation coefficients between the n features are calculated to obtain a correlation coefficient matrix P (n, n), each element being P (i, j) = correlation coefficient (xi, xj), where i, j = 1, 2, …, n. If P (i, j) > 0.95, only one of the two features is kept as a feature for subsequent modeling. Assuming k features are screened out in this way, k < n, the remaining n-k features compose a new data set x'. In the same way, correlation analysis is performed on data set y, features with a correlation coefficient > 0.95 are screened, the number of features is reduced, and a new data set y' is generated.
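The local screening in step three can be sketched as below. The source does not say which feature of a strongly correlated pair is kept, so this sketch assumes a simple greedy rule (keep the earlier feature, drop later ones that exceed the threshold); the function names and toy values are also hypothetical. The comparison follows the text literally and tests P (i, j) > 0.95 without taking an absolute value.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def screen_features(features, threshold=0.95):
    """Return indices of kept features (greedy rule, an assumption).

    A feature is kept unless its correlation with an already kept
    feature exceeds the threshold.
    """
    kept = []
    for idx, feature in enumerate(features):
        if all(pearson(feature, features[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept

# Toy local data set: feature 1 is almost a multiple of feature 0.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
kept = screen_features(features)
x_new = [features[i] for i in kept]  # the reduced data set x'
print(kept)  # [0, 2]
```

Features 0 and 1 have a correlation of roughly 0.999, so feature 1 is dropped, while feature 2 (correlation -0.3 with feature 0) is retained.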

Step four, slicing the data sets x' and y';

The wind farm generates a first random seed and sends it to the manufacturer; the wind farm and the manufacturer generate a first random data set r0 according to the first random seed; the wind farm slices the data set x' according to r0 to obtain x0' = x' - r0, and the manufacturer holds the slice x1' = r0 of x'. The manufacturer generates a second random seed and sends it to the wind farm; the manufacturer and the wind farm generate a second random data set r1 according to the second random seed; the manufacturer slices the data set y' according to r1 to obtain y0' = y' - r1, and the wind farm holds the slice y1' = r1 of y'.

Step five, generating multiplication pairs [ (a 0, b0, c 0), (a 1, b1, c 1) ];

sharing random number seeds with an initiator (wind farm) and a partner (manufacturer) in a trusted execution environment respectively;

the wind farm sends the sample number s and its feature number f0 for the triplet to the trusted execution environment, and the manufacturer sends its feature number f1 to the trusted execution environment;

the wind farm generates a random seed and sends it to the trusted execution environment; the wind farm and the trusted execution environment generate a0 according to the random seed, where a0 is a matrix of s rows and f0 columns, and generate b0 by the same method, where b0 is a matrix of s rows and f1 columns;

the manufacturer generates a random seed and sends it to the trusted execution environment; the manufacturer and the trusted execution environment generate a1 according to the random seed, where a1 is a matrix of s rows and f0 columns, and generate b1 by the same method, where b1 is a matrix of s rows and f1 columns;

the trusted execution environment calculates a = a0 + a1 and b = b0 + b1;

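the trusted execution environment calculates c = a × b, where the c matrix has f0 rows and f1 columns (since a has s rows and f0 columns and b has s rows and f1 columns, these dimensions imply that the transpose of a is multiplied by b);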

the trusted execution environment generates a random matrix c0 of the same size as the matrix c and sends c0 to the wind farm; it calculates c1 = c - c0 and sends c1 to the manufacturer; c0 and c1 are both matrices of f0 rows and f1 columns.
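The generation of the multiplication pair in step five can be sketched as below. The function names, seed values and matrix sizes are hypothetical, the product is taken as the transpose of a times b so that c has f0 rows and f1 columns (an interpretation of the stated dimensions), and the code runs in an ordinary process: it does not model enclave isolation, attestation or secure channels, and `random` is not a cryptographically secure generator.

```python
import random

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def add(p, q):
    return [[u + v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def sub(p, q):
    return [[u - v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def tmul(p, q):
    """Return p^T q for p of shape (s, f0) and q of shape (s, f1)."""
    return [[sum(p[k][i] * q[k][j] for k in range(len(p)))
             for j in range(len(q[0]))] for i in range(len(p[0]))]

def expand(seed, s, f0, f1):
    """Derive an (s x f0) matrix and an (s x f1) matrix from one seed.

    A party and the trusted execution environment both call this with
    the same seed, so the matrices themselves are never transmitted.
    """
    rng = random.Random(seed)
    return rand_matrix(rng, s, f0), rand_matrix(rng, s, f1)

def tee_generate_pair(seed_farm, seed_maker, s, f0, f1):
    """Role of the trusted execution environment in step five."""
    a0, b0 = expand(seed_farm, s, f0, f1)
    a1, b1 = expand(seed_maker, s, f0, f1)
    c = tmul(add(a0, a1), add(b0, b1))           # c = a x b, f0 rows, f1 columns
    c0 = rand_matrix(random.Random(), f0, f1)    # fresh randomness for c0
    c1 = sub(c, c0)
    return c0, c1                                # c0 to wind farm, c1 to manufacturer

s, f0, f1 = 4, 2, 3
c0, c1 = tee_generate_pair(seed_farm=11, seed_maker=22, s=s, f0=f0, f1=f1)
a0, b0 = expand(11, s, f0, f1)   # regenerated locally by the wind farm
a1, b1 = expand(22, s, f0, f1)   # regenerated locally by the manufacturer
```

In this arrangement only c0 and c1 are sent out by the trusted execution environment; each party rebuilds its own a and b slices from its seed.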

Step six, calculating the public parameters x' + a and y' + b;

The wind farm holds the slice x0' = x' - r0 of x' and a0 of the multiplication pair, calculates its slice of (x' + a) as (x' + a)0 = x' - r0 + a0, and sends this slice to the manufacturer. The manufacturer holds the slice x1' = r0 of x' and a1 of the multiplication pair, calculates its slice of (x' + a) as (x' + a)1 = r0 + a1, and sends this slice to the wind farm;

each participant calculates the sum of the slices of (x' + a): (x' + a)0 + (x' + a)1 = x' - r0 + a0 + r0 + a1 = x' + a, so that the wind farm and the manufacturer both obtain the public parameter x' + a;

in the same way, the wind farm and the manufacturer obtain the public parameter y' + b.
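The exchange in step six can be illustrated as follows. The sketch generates all matrices in one process with made-up random values and hypothetical helper names, so it only demonstrates the arithmetic: the random data sets r0 and r1 cancel, and what becomes public is x' and y' masked by a and b, not the data sets themselves.

```python
import random

rng = random.Random(3)

def rand(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def add(p, q):
    return [[u + v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def sub(p, q):
    return [[u - v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def open_masked(farm_slice, maker_slice):
    """Each party adds its own slice to the slice received from the other."""
    return add(farm_slice, maker_slice)

s, f0, f1 = 4, 2, 3
x, y = rand(s, f0), rand(s, f1)            # stand-ins for x' and y'
r0, r1 = rand(s, f0), rand(s, f1)          # random data sets from step four
a0, a1 = rand(s, f0), rand(s, f0)          # a slices of the multiplication pair
b0, b1 = rand(s, f1), rand(s, f1)          # b slices of the multiplication pair

# x' + a: the wind farm holds x0' = x' - r0 and a0, the manufacturer x1' = r0 and a1.
x_plus_a = open_masked(add(sub(x, r0), a0), add(r0, a1))

# y' + b: the wind farm holds y1' = r1 and b0, the manufacturer y0' = y' - r1 and b1.
y_plus_b = open_masked(add(r1, b0), add(sub(y, r1), b1))
```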

Step seven, calculating the federal correlation coefficient corr;

The wind farm calculates the first sliced correlation coefficient:

corr0 = c0 - a0 × (y' + b) - (x' + a) × b0 + (x' + a) × (y' + b), and sends the first sliced correlation coefficient corr0 to the manufacturer;

the manufacturer calculates the second sliced correlation coefficient:

corr1 = c1 - a1 × (y' + b) - (x' + a) × b1, and sends the second sliced correlation coefficient corr1 to the wind farm;

the wind farm and the manufacturer thereby both obtain the federal correlation coefficient corr = corr0 + corr1. Because c0 + c1 = a × b, the terms containing a and b cancel in the sum, leaving corr = x' × y'.
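The two formulas of step seven can be checked numerically with the sketch below. It uses made-up random matrices and hypothetical helper names, computes the multiplication pair and the public parameters directly instead of through message exchange, and reads every product as "transpose of the left factor times the right factor", which is an interpretation consistent with c having f0 rows and f1 columns.

```python
import random

rng = random.Random(7)

def rand(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def add(p, q):
    return [[u + v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def sub(p, q):
    return [[u - v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def tmul(p, q):
    """Return p^T q for p of shape (s, f0) and q of shape (s, f1)."""
    return [[sum(p[k][i] * q[k][j] for k in range(len(p)))
             for j in range(len(q[0]))] for i in range(len(p[0]))]

s, f0, f1 = 5, 2, 3
x, y = rand(s, f0), rand(s, f1)            # stand-ins for x' and y'

# Multiplication pair as produced in step five.
a0, a1 = rand(s, f0), rand(s, f0)
b0, b1 = rand(s, f1), rand(s, f1)
c = tmul(add(a0, a1), add(b0, b1))
c0 = rand(f0, f1)
c1 = sub(c, c0)

# Public parameters from step six.
xa = add(x, add(a0, a1))                   # x' + a
yb = add(y, add(b0, b1))                   # y' + b

# Wind farm: corr0 = c0 - a0(y'+b) - (x'+a)b0 + (x'+a)(y'+b)
corr0 = add(sub(sub(c0, tmul(a0, yb)), tmul(xa, b0)), tmul(xa, yb))
# Manufacturer: corr1 = c1 - a1(y'+b) - (x'+a)b1
corr1 = sub(sub(c1, tmul(a1, yb)), tmul(xa, b1))

corr = add(corr0, corr1)                   # federal result
direct = tmul(x, y)                        # plaintext product, for checking only
```

Up to floating-point rounding, `corr` equals `direct`. Whether this product still has to be divided by the sample number to give a pearson coefficient depends on the normalization used in step two, which this section does not restate.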

Step eight, acquiring the feature set related to the fan gearbox fault model through the federal correlation analysis, which improves accuracy compared with a model built on features selected by expert experience while protecting data security. A wind power equipment fault model is trained according to the selected feature values, and fault performance indexes are predicted through the wind power equipment fault model.

Specifically, the new (current) screened data sets x' and y' are input into the wind power equipment fault model to predict the fault performance index, and the fault data of the wind power equipment are obtained.

The method of the invention is described below by way of scenario two:

The scenario involves the plant-level monitoring system (SIS) and the management information system (MIS) of power plants. The MIS depends on the SIS and can grasp the operating condition of every power plant under a group, so that it can scientifically assist the operation and management of those power plants. The traditional approach is to upload the data of the power plants of each subsidiary to a cloud server of the group, which solves the data island problem and enables big data mining and analysis. However, the subsidiaries compete with one another, and the group company wants to apply the operation model of a benchmark power plant to the other power plants, so uploading the data to the same server creates a potential security risk. Federal learning enables big data mining and analysis while ensuring data privacy and security. Taking the operation analysis of a certain power company as an example, the company needs to analyze the factors influencing real-time power supply coal consumption, that is, the influence of the measurement point data in the SIS of each of its power plants on the power supply coal consumption. Suppose that the power company is p0 and has two power plants, p1 and p2. p0 holds the company's operating financial data, such as the real-time net profit value y of power supply coal consumption, and needs to perform federal learning with p1 and p2 to analyze which indexes of the p1 and p2 power plants are strongly correlated with the target y, so that a power supply coal consumption analysis model can be built from these features to predict the net profit of power supply coal consumption in real time;

Step one, p0 acquires data sets for a certain period of time from the MIS system and from p1 and p2 respectively, vertically intersects the data on time, and, based on the intersecting time, takes the different feature spaces at the same time points as the respective data sets, wherein the data set of p0 is Y and the data sets of p1 and p2 are X1 and X2;

Step two, normalizing the X1, X2 and Y data sets to obtain x1, x2 and y;
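This section does not restate how the normalization in step two is carried out. One common choice, assumed here purely for illustration, is z-score standardization: when every feature is scaled to zero mean and unit population variance, the pearson coefficient of two features reduces to their inner product divided by the sample number, which is why a product of normalized data sets can serve as a correlation measure. The function name `zscore` and the sample values are hypothetical.

```python
import math

def zscore(column):
    """Scale one feature to zero mean and unit population variance."""
    m = len(column)
    mean = sum(column) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / m)
    return [(v - mean) / std for v in column]

# Made-up values standing in for one SIS measurement point and the target.
x1_raw = [310.0, 305.0, 320.0, 298.0, 315.0]
y_raw = [1.20, 1.10, 1.45, 0.95, 1.30]

x1 = zscore(x1_raw)
y = zscore(y_raw)

# With z-scored inputs, the inner product divided by the sample number
# is the pearson coefficient of the raw columns.
m = len(x1)
corr_from_product = sum(a * b for a, b in zip(x1, y)) / m
```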

Step three, calculating the correlation between the features within the local data set x1 or x2, and screening the features whose correlation coefficient is > 0.95 to reduce the linear coupling between features, the screened feature sets being x1' and x2';

Step four, slicing the data sets x1', x2' and y: the slices of x1' and y are calculated, and the slices of x2' and y are calculated, respectively;

Step five, generating the multiplication pairs for x1' and y and for x2' and y, namely [(a0, b0, c0), (a1, b1, c1)] and [(a0', b0', c0'), (a1', b1', c1')], where a = a0 + a1 and b = b0 + b1, and likewise a' = a0' + a1' and b' = b0' + b1';

Step six, calculating the public parameters x1' + a and y + b for x1' and y, and the public parameters x2' + a' and y + b' for x2' and y, by the same method as above;

Step seven, calculating the federal correlation coefficient corr of x1' and y and the federal correlation coefficient corr' of x2' and y, by the same method as above;

Step eight, the correlation coefficient matrix affecting the net profit of the power supply coal consumption of the power company is [corr, corr']; the features with a correlation coefficient larger than 0.95 are screened, so that the feature set related to the net profit model of the power supply coal consumption is obtained through the federal correlation analysis, and the data of these features are used for federal modeling to obtain the power supply coal consumption profit model. The model can predict the net profit of the power supply coal consumption online, which facilitates scientific decision analysis by managers.

Specifically, the new (current) screened feature sets x1', x2' and y are input into the power supply coal consumption profit model to predict the profit index, and the power supply coal consumption profit data are obtained.
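The screening in step eight of scenario two amounts to concatenating corr and corr' and keeping the features above the threshold. A minimal sketch follows; the measurement point names and coefficient values are made up for illustration and do not come from the patent.

```python
def select_features(names, coefficients, threshold=0.95):
    """Return the names of the features whose coefficient with y exceeds threshold."""
    return [name for name, r in zip(names, coefficients) if r > threshold]

# Hypothetical SIS measurement points of the two power plants.
names_p1 = ["p1_main_steam_temp", "p1_feedwater_flow", "p1_ambient_temp"]
names_p2 = ["p2_main_steam_temp", "p2_flue_gas_o2"]

# Made-up federal correlation coefficients against the target y.
corr = [0.97, 0.62, 0.11]          # x1' against y
corr_prime = [0.96, 0.40]          # x2' against y

matrix = corr + corr_prime         # the combined matrix [corr, corr']
selected = select_features(names_p1 + names_p2, matrix)
```

The selected names identify the columns of x1' and x2' that would then be used for federal modeling.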

Fig. 3 is a schematic structural diagram of an index prediction system based on federal pearson correlation analysis according to an embodiment of the present invention. As shown in Fig. 3, the system includes:

a normalizing unit 201, configured to normalize a first data set X of an initiator in longitudinal federal learning to obtain a first standard data set X', and to normalize a second data set Y of a partner to obtain a second standard data set Y';

a slicing unit 202, configured so that the initiator slices X' according to the first random data set R0 to obtain a first sliced data set X0', and the partner takes the shared R0 as a second sliced data set X1'; the partner slices Y' according to the second random data set R1 to obtain a third sliced data set Y0', and the initiator takes the shared R1 as a fourth sliced data set Y1';

Fig. 4 is a schematic structural diagram of a slicing unit according to an embodiment of the present invention. As shown in Fig. 4, the slicing unit 202 includes:

a first slicing subunit 2021, configured to send a first random seed generated by the initiator to the partner, where the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator slices the first standard data set X' according to the first random data set R0 to obtain a first sliced data set X0'; the partner takes the first random data set R0 as a second sliced data set X1';

a second slicing subunit 2022, configured to send a second random seed generated by the partner to the initiator, where the initiator and the partner generate a second random data set R1 according to the second random seed; the partner slices the second standard data set Y' according to the second random data set R1 to obtain a third sliced data set Y0'; the initiator takes the second random data set R1 as a fourth sliced data set Y1';

a generated data set calculating unit 203, configured so that the trusted execution environment calculates a product data set c according to the third random data set a0, the fourth random data set b0, the fifth random data set a1 and the sixth random data set b1, randomly generates a first generated data set c0 of the same size as the matrix c, and calculates a second generated data set c1 according to c and c0; the initiator shares a0 and b0 with the trusted execution environment and acquires c0 sent by the trusted execution environment; the partner shares a1 and b1 with the trusted execution environment and acquires c1 sent by the trusted execution environment;

a common parameter calculation unit 204, configured so that the partner calculates a first common parameter according to X1', a1, and the sum of X0' and a0 shared by the initiator; the initiator calculates a second common parameter according to Y1', b0, and the sum of Y0' and b1 shared by the partner;

a slicing correlation coefficient calculating unit 205, configured so that the initiator calculates a first sliced correlation coefficient according to a0, b0, c0, the second common parameter and the shared first common parameter; the partner calculates a second sliced correlation coefficient according to a1, b1, c1, the first common parameter and the shared second common parameter;

a federal correlation coefficient calculation unit 206, configured so that the two parties each calculate the federal correlation coefficient according to their own sliced correlation coefficient and the sliced correlation coefficient shared by the other party;

an analysis unit 207, configured to perform federal pearson correlation analysis according to the federal correlation coefficient, determine model training data according to an analysis result, and train a federal learning model using the model training data;

a prediction unit 208, configured to perform index prediction through the federal learning model; the index comprises: fault performance index, and profit index.

The index prediction system based on federal pearson correlation analysis provided by the invention corresponds to the method described above and is not described again here.

The invention has the beneficial effects that:

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An index prediction method based on federal pearson correlation analysis, comprising:

S2, the initiator slices X' according to the first random data set R0 to obtain a first sliced data set X₀'; the partner takes the shared R0 as a second sliced data set X₁'; the partner slices Y' according to the second random data set R1 to obtain a third sliced data set Y₀'; the initiator takes the shared R1 as a fourth sliced data set Y₁';

S4, the partner calculates a first common parameter according to X₁', a1, and the sum of X₀' and a0 shared by the initiator; the initiator calculates a second common parameter according to Y₁', b0, and the sum of Y₀' and b1 shared by the partner;

2. The method according to claim 1, wherein S2 comprises:

a first random seed generated by the initiator is sent to the partner, and the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator slices the first standard data set X' according to the first random data set R0 to obtain a first sliced data set X₀'; the partner takes the first random data set R0 as a second sliced data set X₁';

a second random seed generated by the partner is sent to the initiator, and the initiator and the partner generate a second random data set R1 according to the second random seed; the partner slices the second standard data set Y' according to the second random data set R1 to obtain a third sliced data set Y₀'; the initiator takes the second random data set R1 as a fourth sliced data set Y₁'.

3. The method according to claim 1, characterized in that:

the product data set c is calculated according to the following formula:

c = (a0 + a1) × (b0 + b1);

c1 = c - c0;

4. The method according to claim 1, characterized in that:

the first common parameter is calculated according to the following formula:

X' + a = X₀' + a0 + X₁' + a1;

wherein X' + a is the first common parameter, X₀' is the first sliced data set, a0 is the third random data set, X₁' is the second sliced data set and a1 is the fifth random data set;

the second common parameter is calculated according to the following formula:

Y' + b = Y₀' + b1 + Y₁' + b0;

wherein Y' + b is the second common parameter, Y₀' is the third sliced data set, b1 is the sixth random data set, Y₁' is the fourth sliced data set and b0 is the fourth random data set.

5. The method according to claim 1, characterized in that:

corr₀ = c0 - a0 * (Y' + b) - (X' + a) * b0 + (X' + a) * (Y' + b);

wherein corr₀ is the first sliced correlation coefficient, c0 is the first generated data set, a0 is the third random data set, b0 is the fourth random data set, X' + a is the first common parameter, and Y' + b is the second common parameter;

corr₁ = c1 - a1 * (Y' + b) - (X' + a) * b1;

wherein corr₁ is the second sliced correlation coefficient, c1 is the second generated data set, a1 is the fifth random data set, b1 is the sixth random data set, X' + a is the first common parameter, and Y' + b is the second common parameter.

6. An index prediction system based on federal pearson correlation analysis, comprising:

the normalization unit is used for normalizing the first data set X of the initiator in longitudinal federal learning to obtain a first standard data set X ', and normalizing the second data set Y of the partner to obtain a second standard data set Y';

the slicing unit is used for the initiator to slice X' according to the first random data set R0 to obtain a first sliced data set X₀', and for the partner to take the shared R0 as a second sliced data set X₁'; the partner slices Y' according to the second random data set R1 to obtain a third sliced data set Y₀', and the initiator takes the shared R1 as a fourth sliced data set Y₁';

The generated data set calculation unit is used for calculating a product data set c according to the third random data set a0, the fourth random data set b0, the fifth random data set a1 and the sixth random data set b1 by the trusted execution environment; randomly generating a first generated data set c0 with the same size as the c matrix according to the c; calculating a second generated data set c1 according to the c and the c0; the initiator shares a0 and b0 and acquires c0 sent by the trusted execution environment; the partner shares a1 and b1 and acquires c1 sent by the trusted execution environment;

the common parameter calculation unit is used for the partner to calculate a first common parameter according to X₁', a1, and the sum of X₀' and a0 shared by the initiator; the initiator calculates a second common parameter according to Y₁', b0, and the sum of Y₀' and b1 shared by the partner;

The slicing correlation coefficient calculation unit is used for calculating a first slicing correlation coefficient according to the a0, b0, c0, the second common parameter and the shared first common parameter by the initiator; the partner calculates a second fragment correlation coefficient according to the a1, the b1 and the c1, the first public parameter and the shared second public parameter;

the federal correlation coefficient calculation unit is used for calculating the federal correlation coefficient by two parties according to the respective slicing correlation coefficient and the slicing correlation coefficient of the shared counterpart;

the analysis unit is used for carrying out federal pearson correlation analysis according to the federal correlation coefficient, determining model training data according to analysis results, and training a federal learning model by adopting the model training data;

the prediction unit is used for performing index prediction through the federal learning model; the index comprises: fault performance index, and profit index.

7. The system of claim 6, wherein the slicing unit comprises:

the first slicing subunit is used for sending a first random seed generated by the initiator to the partner, and the initiator and the partner generate a first random data set R0 according to the first random seed; the initiator slices the first standard data set X' according to the first random data set R0 to obtain a first sliced data set X₀'; the partner takes the first random data set R0 as a second sliced data set X₁';

the second slicing subunit is used for sending a second random seed generated by the partner to the initiator, and the initiator and the partner generate a second random data set R1 according to the second random seed; the partner slices the second standard data set Y' according to the second random data set R1 to obtain a third sliced data set Y₀'; the initiator takes the second random data set R1 as a fourth sliced data set Y₁'.

8. The system according to claim 6, wherein:

the product data set c is calculated according to the following formula:

c = (a0 + a1) × (b0 + b1);

c1 = c - c0;

9. The system according to claim 6, wherein:

the first common parameter is calculated according to the following formula:

X' + a = X₀' + a0 + X₁' + a1;

the second common parameter is calculated according to the following formula:

Y' + b = Y₀' + b1 + Y₁' + b0;

10. The system according to claim 6, wherein:

corr₀ = c0 - a0 * (Y' + b) - (X' + a) * b0 + (X' + a) * (Y' + b);

wherein corr₀ is the first sliced correlation coefficient, c0 is the first generated data set, a0 is the third random data set, b0 is the fourth random data set, X' + a is the first common parameter, and Y' + b is the second common parameter;

corr₁ = c1 - a1 * (Y' + b) - (X' + a) * b1;