CN114329300B

CN114329300B - Multi-party projection method based on data security and multi-party production data analysis method

Info

Publication number: CN114329300B
Application number: CN202210244755.0A
Authority: CN
Inventors: 夏佳志; 林伟星
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-05-20
Anticipated expiration: 2042-03-14
Also published as: CN114329300A

Abstract

The invention discloses a multi-party projection method based on data security, which comprises: acquiring a server and a set of clients; the server constructs a global model and an initial global dictionary and issues them to the clients; each client initializes a local model, obtains the dictionary of the other party, trains a new local model, projects its local data, and uploads the local model and part of the projection results to the server; the server aggregates these to obtain a new global model and a new global dictionary and issues them to the clients; these steps are repeated until the server obtains a final projection model; the server issues the projection model to the clients; each client projects its local data with the projection model and uploads the result to the server; and the server draws all the received projection results on a scatter diagram to complete the multi-party projection. The invention also provides a multi-party production data analysis method comprising the multi-party projection method based on data security. The method offers good projection quality, high security, and high efficiency.

Description

Multi-party projection method based on data security and multi-party production data analysis method

Technical Field

The invention belongs to the field of data processing, and particularly relates to a multi-party projection method and a multi-party production data analysis method based on data security.

Background

With the development of technology and rising living standards, intelligent big-data technology is widely applied in production and daily life. Today's data are generally high-dimensional, so processing high-dimensional data is very important.

A high-dimensional data projection method is a commonly used data analysis method. The method projects high-dimensional data into a low-dimensional space, so that a data analyst is supported to analyze high-dimensional data features from a low-dimensional projection result. In the past, data analysts typically collected data from multiple data providers on a single device and then projected. However, with the increased awareness of people's privacy and the advent of privacy protection policies, it has become increasingly difficult to collect and share data, especially data with sensitive information, among data providers. Therefore, how to obtain the global data projection result on the premise of not collecting data of all parties becomes a common difficulty faced by the data analyst at present; and this problem is also known as the secure multiparty projection problem.

For the projection result to reflect the high-dimensional data distribution faithfully, data proximity relationships must be preserved. This poses two challenges for the secure multi-party projection problem. First, how to preserve cross-party data proximity in the projection result: high-dimensional neighbors scattered across different parties must be projected to nearby low-dimensional positions. Second, how to preserve data proximity when the data are not independent and identically distributed. In the above scenario, the data of each party are usually not independent and identically distributed, and under this condition the projection results of the parties easily overlap, destroying the data proximity relationships.

Traditional projection methods must gather the data in one place before projecting, which violates data confidentiality requirements. Some existing methods can compute a multi-party projection result. SMAP (Secure Multi-party Projection), a joint t-SNE (t-Distributed Stochastic Neighbor Embedding) projection method based on homomorphic encryption, can compute a joint projection whose quality is consistent with that of a single-party projection; however, homomorphic encryption adds a large computational overhead, which makes the method difficult to put into practice. The MSDSNE (Multi-shot Decentralized Data Stochastic Neighbor Embedding) projection method performs joint t-SNE projection among data parties based on shared anchor data; however, it has several limitations: first, it requires sharing an additional data set, which does not satisfy the requirements of the problem; second, its projection quality is highly random; finally, it only approximately preserves cross-party data proximity by using the additional data set as anchor points, so its preservation ability is limited by the size of that data set. Therefore, current methods cannot effectively solve the secure multi-party projection problem.

Disclosure of Invention

The invention aims to provide a multi-party projection method based on data security that offers good projection quality, high security, and high efficiency.

The invention also aims to provide a multi-party production data analysis method comprising the multi-party projection method based on data security.

The invention provides a multi-party projection method based on data security, which comprises the following steps:

S1. acquiring a server and a set of clients;

S2. the server constructs a global model and an initial global dictionary and issues the global model and the global dictionary to each client;

S3. each client initializes its current local model according to the received global model, and filters the global dictionary to obtain the dictionary of the other party;

S4. each client trains a new local model from the local model obtained in step S3 and the dictionary of the other party, and takes the new local model as its current local model;

S5. each client projects its local data with the current local model obtained in step S4 to obtain a projection result, and uploads the current local model and a randomly selected part of the projection result to the server;

S6. the server aggregates the received local models and projection results to obtain a new global model and a new global dictionary, and issues them to each client;

S7. steps S3 to S6 are repeated until a set condition is met, whereupon the server obtains the final projection model;

S8. the server sends the final projection model obtained in step S7 to each client;

S9. each client projects its local data with the received projection model and uploads the projection result to the server;

S10. the server draws all the received projection results on a scatter diagram to complete the secure multi-party projection based on data security.

The server building the global model and the initial global dictionary in step S2 includes the following steps:

A. the server selects a model architecture and generates model parameters, constructs a global model and sends the global model to each client;

B. each client projects respective local data by adopting the received global model to obtain respective local projection results;

C. each client randomly extracts a part of its local projection result and uploads that part to the server;

D. and the server constructs an initial global dictionary according to the received projection result.
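Steps A to D can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the linear `project` function stands in for the global model, the sample size of 4 is arbitrary, all function names are hypothetical, and concatenating the uploads in client-number order is borrowed from the merging rule the document gives for step S6.

```python
import random

def project(weights, point):
    """Hypothetical stand-in for the global model: a linear map to 2-D."""
    return tuple(sum(w * x for w, x in zip(row, point)) for row in weights)

def sample_projection(projection, m, rng):
    """Step C: draw m projected points at random, without repetition."""
    return rng.sample(projection, min(m, len(projection)))

def build_global_dictionary(client_samples):
    """Step D (assumed rule): concatenate uploads in client-number order."""
    dictionary = []
    for client_id in sorted(client_samples):
        dictionary.extend(client_samples[client_id])
    return dictionary

rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]   # step A: model parameters
clients = {                                       # raw data never leaves a client
    0: [[rng.random() for _ in range(3)] for _ in range(10)],
    1: [[rng.random() for _ in range(3)] for _ in range(8)],
}
uploads = {}
for cid, data in clients.items():
    local_projection = [project(weights, x) for x in data]       # step B
    uploads[cid] = sample_projection(local_projection, 4, rng)   # step C
global_dictionary = build_global_dictionary(uploads)             # step D
```

Only the sampled 2-D points reach the server in this sketch; the high-dimensional rows in `clients` are never uploaded.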

Each client side stated in step S3 initializes its respective current local model according to the received global model, and filters the global dictionary to obtain the dictionary of the other party, including the following steps:

a. each client generates a local model according to the received topological structure and parameters of the global model;

b. and each client selects the interval where the projection data of the client is located in the received global dictionary according to the sequence number of the client, and constructs the data outside the interval as the dictionary of the other party.
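Step b amounts to slicing the client's own interval out of the global dictionary. The sketch below assumes that every client contributes the same fixed number of entries (`sample_size`, a hypothetical name) and that clients are numbered from 0; neither detail is fixed by the text above.

```python
def other_party_dictionary(global_dictionary, client_index, sample_size):
    """Drop the client's own interval and keep every other entry."""
    start = client_index * sample_size
    end = start + sample_size
    return global_dictionary[:start] + global_dictionary[end:]

# Three clients, two dictionary entries each, stored in client order.
global_dictionary = [
    (0.0, 0.1), (0.2, 0.3),   # client 0
    (1.0, 1.1), (1.2, 1.3),   # client 1
    (2.0, 2.1), (2.2, 2.3),   # client 2
]
others = other_party_dictionary(global_dictionary, client_index=1, sample_size=2)
```

Here `others` holds only the entries of clients 0 and 2, which is what client 1 would use as the dictionary of the other party.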

Each client side in the step S4 trains and obtains a respective new local model according to the local model obtained in the step S3 and the other party dictionary, and uses the new local model as the current local model, which specifically includes the following steps:

(1) the client acquires a neighborhood graph of its local data and computes a weighted graph from it;

(2) sampling data pairs according to the weight values of the edges in the weighted graph obtained in the step (1) and generating a training data set;

(3) projecting each data pair in the training data set by using the local model obtained in the step S3 to obtain a projection pair;

(4) randomly sampling n high-dimensional vectors for each projection pair obtained in the step (3) and setting the n high-dimensional vectors as non-neighbor vectors so as to calculate projection results of the n high-dimensional vectors;

(5) repeating the step (3) to the step (4) until a set condition is reached, and optimizing parameters of the local model by adopting a cross entropy loss function in the repeating process;

(6) and obtaining a final optimized new local model, and using the final optimized new local model as the current local model.
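The sampling in steps (2) and (4) can be illustrated as below. This is a sketch under assumptions: the weighted graph is given as a hypothetical dictionary from index pairs to edge weights, pairs are drawn with probability proportional to weight, and the non-neighbor candidates exclude the two endpoints of the pair (the text above only says the n vectors are sampled at random).

```python
import random

def sample_training_pairs(weighted_edges, num_pairs, rng):
    """Step (2): draw edges with probability proportional to their weight."""
    edges = list(weighted_edges)
    weights = [weighted_edges[e] for e in edges]
    return rng.choices(edges, weights=weights, k=num_pairs)

def sample_non_neighbors(num_points, pair, n, rng):
    """Step (4): pick n random points to treat as non-neighbors of the pair."""
    candidates = [idx for idx in range(num_points) if idx not in pair]
    return rng.sample(candidates, n)

rng = random.Random(42)
# Hypothetical weighted graph from step (1): (i, j) -> edge weight in (0, 1].
weighted_edges = {(0, 1): 0.9, (1, 2): 0.6, (2, 3): 0.3, (3, 4): 0.1}
pairs = sample_training_pairs(weighted_edges, num_pairs=6, rng=rng)
negatives = [sample_non_neighbors(5, pair, n=2, rng=rng) for pair in pairs]
```

Each sampled pair would then be projected by the local model (step (3)) and scored against its non-neighbors during optimization (step (5)).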

The cross-entropy loss function in step (5) is specifically as follows:

[Formula images for the loss and its terms are not preserved in this text.]

In the formula, Loss(X, Y, D) is the cross-entropy loss function; X is the high-dimensional data; Y is the projection result; D is the projection dictionary of the other party. The loss includes a hyper-parameter that controls the repulsive force. R(Y, D) is the loss term that implements the other-party rejection strategy by introducing a repulsive force between the party's own projection results and the other party's projection results; it is defined through the low-dimensional similarity between the i-th element of Y and the k-th element of D, where a and b are the parameters of the umap (Uniform Manifold Approximation and Projection) algorithm for calculating low-dimensional similarity, preferably a = 1.93 and b = 0.79, Y_i is the i-th element of Y, and D_k is the k-th element of D. CE(X, Y) is the difference between the projected distribution and the high-dimensional distribution of the data pairs; it is defined through the similarity function, calculated by the umap algorithm, between the i-th and j-th elements of the high-dimensional data X, and the corresponding similarity function between the i-th and j-th elements of the low-dimensional data Y. log is the logarithm to base e.
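The formulas of this passage were published as images and are not recoverable from the text alone. One reading that is consistent with the definitions above, assuming the standard UMAP similarity and fuzzy cross-entropy forms, is sketched below. The symbols $\lambda$, $p$, and $q$, and in particular the exact form of $R(Y, D)$, are assumptions introduced here for illustration and are not taken from the source.

```latex
\begin{aligned}
\mathrm{Loss}(X, Y, D) &= CE(X, Y) + \lambda\, R(Y, D), \\
R(Y, D) &= -\sum_{i}\sum_{k} \log\bigl(1 - q(Y_i, D_k)\bigr), \\
q(Y_i, D_k) &= \bigl(1 + a\,\lVert Y_i - D_k \rVert_2^{2b}\bigr)^{-1}, \\
CE(X, Y) &= \sum_{i \neq j} \Bigl[\, p(X_i, X_j)\,\log\frac{p(X_i, X_j)}{q(Y_i, Y_j)}
  + \bigl(1 - p(X_i, X_j)\bigr)\,\log\frac{1 - p(X_i, X_j)}{1 - q(Y_i, Y_j)} \Bigr].
\end{aligned}
```

Under this reading, setting $\lambda = 0$ removes the repulsion term and leaves the ordinary UMAP objective, while a larger $\lambda$ pushes a party's projected points away from the entries of the other party's dictionary.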

In step S5, specifically, each client projects its own local data with the current local model obtained in step S4 to obtain a projection result, randomly draws a fixed-length subset of the projection result without repetition, based on the length of the projection result, and uploads this subset together with the current local model to the server.

The server in step S6 obtains a new global model and a new global dictionary by aggregation according to the received local model and projection result, and issues the new global model and the new global dictionary to each client, and specifically includes the following steps:

1) the server receives the local models and the projection results uploaded by the clients;

2) the server aggregates the local models of the clients received in step 1) with the federated averaging algorithm to obtain a new global model;

3) the server combines the projection results of the clients received in the step 1) according to the numbering sequence of the clients, so as to obtain a new global dictionary.

In step 2), the local models are aggregated with the federated averaging algorithm, specifically by the following formula:

[Formula image not preserved in this text.]

In the formula, f(w) denotes the parameters of the aggregated model; n_k is the amount of data owned by the k-th client; n is the total amount of data; the per-client term denotes the parameters of the k-th local model; and K is the number of clients.
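Federated averaging is conventionally a data-size-weighted mean of the client parameters, $w = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k$, which matches the definitions of $n_k$, $n$, and $K$ above. The sketch below assumes each local model is flattened into a plain list of floats; the symbol $w_k$ and the function name are illustrative choices, not the source's notation.

```python
def federated_average(local_params, data_counts):
    """Weighted mean of client parameter vectors with weights n_k / n."""
    total = sum(data_counts)
    length = len(local_params[0])
    return [
        sum(n_k / total * params[i] for params, n_k in zip(local_params, data_counts))
        for i in range(length)
    ]

local_params = [[1.0, 2.0], [3.0, 6.0]]   # flattened parameters of K = 2 local models
data_counts = [100, 300]                  # n_k for each client
global_params = federated_average(local_params, data_counts)
```

With these inputs the second client holds three quarters of the data, so the aggregated parameters lie three quarters of the way toward its values.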

The invention also provides a multi-party production data analysis method comprising the multi-party projection method based on data security, which comprises the following steps:

SA. the headquarters server serves as the server in the above multi-party projection method based on data security, and the data centers of the enterprise's factories serve as the clients in that method;

SB. the data centers of the factories and the headquarters server perform projection using the above multi-party projection method based on data security;

SC. the enterprise headquarters server draws all the received projection results on a scatter diagram to complete the secure multi-party projection based on data security;

SD. headquarters personnel analyze the multi-party production data based on the scatter diagram obtained in step SC.

The multi-party projection method based on data security and the multi-party production data analysis method of the invention combine a federated learning framework with a deep dimensionality reduction method in a novel way, which preserves cross-party data proximity relationships and provides a new solution to the secure multi-party projection problem. The invention also provides a new technique for preserving data proximity relationships when the data are not independent and identically distributed, which effectively solves the projection overlap problem under that condition. The method therefore offers good projection quality, high security, and high efficiency.

Drawings

FIG. 1 is a schematic flow chart of a projection method according to the present invention.

FIG. 2 is a graph comparing the performance of the method of the present invention with that of the existing MSDSNE method under IID (Independent and Identically Distributed) conditions.

FIG. 3 is a diagram showing the quantitative validation of the effectiveness of the other-party rejection strategy under NonIID (Non-Independent and Identically Distributed) conditions.

FIG. 4 is a schematic diagram of the qualitative validation of the effectiveness of the other-party rejection strategy on the small_washion data set with 2 clients under NonIID conditions.

FIG. 5 is a schematic method flow diagram of the analysis method of the present invention.

Detailed Description

FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a multi-party projection method based on data security, which comprises the following steps:

S1. acquiring a server and a set of clients;

S2. the server constructs a global model and an initial global dictionary and issues the global model and the global dictionary to each client; this specifically comprises the following steps:

A. the server selects a model architecture, generates model parameters, constructs a global model and sends the global model to each client;

D. the server constructs an initial global dictionary according to the received projection result;

S3. each client initializes its current local model according to the received global model, and filters the global dictionary to obtain the dictionary of the other party; this specifically comprises the following steps:

b. each client locates, according to its own sequence number, the interval of the received global dictionary that holds its own projection data, and constructs the data outside that interval as the dictionary of the other party;

S4. each client trains a new local model from the local model obtained in step S3 and the dictionary of the other party, and takes the new local model as its current local model; this specifically comprises the following steps:

(4) randomly sampling n high-dimensional vectors for each projection pair obtained in the step (3) and setting the n high-dimensional vectors as non-neighbor vectors so as to calculate the projection results of the n high-dimensional vectors;

in a specific implementation, the following cross-entropy loss function is adopted:

[Formula images for the loss and its terms are not preserved in this text.]

where the loss involves the low-dimensional similarity between the i-th element of Y and the k-th element of D; a and b are the parameters of the umap algorithm for calculating low-dimensional similarity, preferably a = 1.93 and b = 0.79; Y_i is the i-th element of Y, and D_k is the k-th element of D; CE(X, Y) is the difference between the projected distribution and the high-dimensional distribution of the data pairs, and involves the similarity function, calculated by the umap algorithm, between the i-th and j-th elements of the low-dimensional data Y; log is the logarithm to base e;

(6) obtaining a new local model after final optimization, and taking the new local model as a current local model;

S5. each client projects its local data with the current local model obtained in step S4 to obtain a projection result, and uploads the current local model and a randomly selected part of the projection result to the server; specifically, each client randomly draws a fixed-length subset of the projection result without repetition, based on the length of the projection result, and uploads this subset together with the current local model to the server;

S6. the server aggregates the received local models and projection results to obtain a new global model and a new global dictionary, and issues them to each client; this specifically comprises the following steps:

2) the server aggregates the local models of the clients received in step 1) with the federated averaging algorithm to obtain a new global model; specifically, the aggregation is carried out by the following formula:

[Formula image not preserved in this text.]

where the per-client term denotes the parameters of the k-th local model, and K is the number of clients;

3) the server combines the projection results of the clients received in the step 1) according to the numbering sequence of the clients, so as to obtain a new global dictionary;

S9. each client projects its local data with the received projection model and uploads the projection result to the server;

FIG. 2 is a graphical representation of the comparison of the performance of the method of the present invention with the existing MSDSNE method under IID conditions.

The performance of the method of the invention was compared with that of the existing MSDSNE method under IID conditions, as shown in FIG. 2. The left graph (FIG. 2(a)) shows the experimental results on the mnst_test data set, and the right graph (FIG. 2(b)) shows the experimental results on the small_washion data set. umap and pumap (Parametric Uniform Manifold Approximation and Projection) are both centralized projection methods and serve as controls in this experiment. FP is the method of the invention. MSDSNE is the prior-art method described in the background; the percentages in parentheses indicate the proportion of shared data used by MSDSNE. FIG. 2 compares performance using KNN (K-Nearest Neighbors) classification accuracy and the degree of neighborhood preservation as indexes. As the figure shows, the method of the invention is substantially superior to the existing MSDSNE method in both KNN classification accuracy and neighborhood preservation.
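The two indexes can be computed along the following lines. The text does not give their exact definitions, so this sketch assumes common ones: neighborhood preservation as the mean fraction of each point's k nearest high-dimensional neighbors that remain among its k nearest neighbors in the projection, and KNN accuracy as leave-one-out majority voting on the projected points. The toy data and function names are hypothetical.

```python
import math
from collections import Counter

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i itself."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return order[:k]

def neighborhood_preservation(high, low, k):
    """Mean fraction of high-dimensional k-neighbors kept in the projection."""
    kept = 0
    for i in range(len(high)):
        kept += len(set(knn_indices(high, i, k)) & set(knn_indices(low, i, k)))
    return kept / (k * len(high))

def knn_accuracy(low, labels, k):
    """Leave-one-out k-NN classification accuracy on the projected points."""
    correct = 0
    for i in range(len(low)):
        votes = Counter(labels[j] for j in knn_indices(low, i, k))
        correct += votes.most_common(1)[0][0] == labels[i]
    return correct / len(low)

high = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (9, 9, 9), (9, 8, 9), (8, 9, 9)]
low = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]
preservation = neighborhood_preservation(high, low, k=2)
accuracy = knn_accuracy(low, labels, k=2)
```

In this toy case the projection simply drops one coordinate that barely varies within each cluster, so both indexes reach their maximum value.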

Fig. 3 is a diagram illustrating the quantitative validation of the effectiveness of the other projection exclusion strategy under the NonIID condition.

The effectiveness of the other-party rejection strategy was assessed quantitatively under NonIID conditions, as shown in FIG. 3. The left graph (FIG. 3(a)) shows the experimental results on the mnst_test data set, and the right graph (FIG. 3(b)) shows the experimental results on the small_washion data set. LR denotes the label_ratio index, which represents KNN classification accuracy; IR denotes the index_ratio index, which represents the degree of neighborhood preservation. FP is the method of the invention without the other-party rejection strategy, and FP (R) is the method of the invention with it; the two differ in the setting of the repulsion hyper-parameter in the loss function (the specific settings are not preserved in this text). The MSDSNE method was used as a control. The 6 curves in FIG. 3 therefore correspond to: curve 1, the LR index of the MSDSNE method; curve 2, the LR index of the method of the invention without the other-party rejection strategy; curve 3, the LR index of the method of the invention with the other-party rejection strategy; curve 4, the IR index of the MSDSNE method; curve 5, the IR index of the method of the invention without the other-party rejection strategy; curve 6, the IR index of the method of the invention with the other-party rejection strategy. In FIG. 3, curve 3 lies above curve 2, which shows, in terms of KNN classification accuracy, that the other-party rejection strategy is effective in solving the projection overlap problem. Curve 6 lies above curve 5, which shows that neighborhood preservation is better maintained once the projection overlap problem is solved.
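To see why a repulsion term penalizes projection overlap, the sketch below evaluates the UMAP-style low-dimensional similarity with a = 1.93 and b = 0.79 and a repulsion term built from it. The form of the term, minus the sum of log(1 - q) over all pairs of own points and dictionary points, is an assumption borrowed from UMAP's repulsive loss, as is the `eps` clamp; the source does not preserve the actual formula.

```python
import math

A, B = 1.93, 0.79  # UMAP curve parameters stated in the document

def low_dim_similarity(u, v, a=A, b=B):
    """UMAP-style similarity 1 / (1 + a * dist^(2b)) between two points."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(u, v))
    return 1.0 / (1.0 + a * dist_sq ** b)

def repulsion(projection, other_dictionary, eps=1e-4):
    """Assumed repulsion term: -sum of log(1 - q) over own/other point pairs."""
    total = 0.0
    for y in projection:
        for d in other_dictionary:
            q = low_dim_similarity(y, d)
            total -= math.log(max(1.0 - q, eps))  # clamp avoids log(0)
    return total

own = [(0.0, 0.0), (0.1, 0.0)]
overlapping_other = [(0.05, 0.0), (0.0, 0.1)]
separated_other = [(5.0, 5.0), (5.1, 5.0)]
r_overlap = repulsion(own, overlapping_other)
r_apart = repulsion(own, separated_other)
```

The term is much larger when the other party's dictionary entries sit on top of the client's own projection than when they are far away, so minimizing it pushes the two projections apart.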

The effectiveness of the other-party rejection strategy was also evaluated qualitatively on the small_washion data set with 2 clients under NonIID conditions, as shown in FIG. 4. FIG. 4(a) shows the projection result of the centralized projection method pumap for comparison, with an LR (label_ratio) result of 98.2% and an IR (index_ratio) result of 22.3%. In FIG. 4(b), the method of the present invention without the other-party rejection strategy exhibits projection overlap, with an LR result of 76.8% and an IR result of 17.5%. In FIG. 4(c), the method of the present invention with the other-party rejection strategy resolves the projection overlap, with an LR result of 100% and an IR result of 22.5%; this demonstrates the effectiveness of the other-party rejection strategy in solving the projection overlap problem. FIG. 4(d) shows the projection result of the MSDSNE method, which exhibits severe projection overlap, with an LR result of 78.6% and an IR result of 1.6%. This indicates that the MSDSNE method does not solve the projection overlap problem under NonIID conditions.

The difference between the method of the present invention (FP for short) and the SMAP method of the prior art is:

1. In the SMAP method, each data party must transmit encrypted data to two central servers, which then cooperate to compute the projection result. In the method of the invention (the FP method), data need not be encrypted and never leaves the data party; only the model parameters and projection results are transmitted to the central server. By contrast, the SMAP method still carries the risk that the encrypted data are cracked: if its two central servers collude, the encrypted data may be decrypted. The FP method carries no risk of the original data being cracked. Although methods for estimating original data from model parameters exist, they are subject to many constraints and estimate the data poorly.

The difference between the method (FP for short) of the invention and the MSDSNE method in the prior art is as follows:

1. the MSDSNE method needs to share an additional data set among the data parties, whereas the method of the invention (the FP method) does not; sharing a data set does not comply with current data privacy protection policies;

2. the projection quality of the method of the invention (the FP method) is superior to that of the MSDSNE method: first, under IID conditions, it outperforms the MSDSNE method on both the class separation index and the proximity preservation index; second, under NonIID conditions, the FP method with the other-party rejection strategy mitigates the projection overlap problem, so that both indexes improve.

FIG. 5 is a schematic flow chart of the analysis method of the present invention: the invention also provides a multi-party production data analysis method comprising the multi-party projection method based on data security, which comprises the following steps:

For example, the above multi-party production data analysis method can be used to analyze abnormal production data within an enterprise. More specifically, suppose the headquarters of an automobile manufacturer is located at site A and the manufacturer has several plants across the country, each of which operates independently and produces type-B automobiles for the manufacturer. If researchers at headquarters need to analyze abnormal data in the production process of the type-B automobile in order to optimize the production flow, headquarters must be able to obtain the production process data of every plant.

The conventional approach is to collect the production data of each plant and analyze them at headquarters. However, production data may be intercepted by hackers during transmission, and leaked production data can cause significant harm to the company. Instead, the headquarters and the plants can adopt the multi-party production data analysis method provided by the invention: the headquarters server acts as the server, the data center of each plant acts as a client, and together they run the multi-party projection method based on data security provided by the invention, so that headquarters obtains the data projection results of each plant while data security is preserved and draws them on a scatter diagram. Headquarters researchers can then analyze the multi-party production data from the scatter diagram, that is, analyze abnormal data in the production process of each plant.

The data security-based multi-party projection method provided by the invention can be particularly applied to the Internet industry and the industrial Internet of things industry.

In the Internet industry, suppose an Internet company wants to analyze users' mobile browsing behavior patterns; this browsing behavior information is private data and cannot leave the user's mobile phone. The company can then use the method of the invention. In this application scenario, the server of the invention is the company's server, and each user's mobile phone is a client. With the help of the company's server, multiple users cooperatively train the projection model on their local mobile browsing behavior information. Finally, the company obtains the projection results of the users' mobile browsing behavior and can analyze users' browsing behavior patterns from the clusters that appear in the projection.

In the industrial Internet of Things scenario, the industrial Internet of Things delivers massive industrial data to the industrial chain at very high speed, so data-driven machine learning methods are widely applied in industrial manufacturing. In the industrial field, however, data resources cannot be shared among enterprises for reasons of competition or user privacy. For example, a manufacturing company may want to analyze the production data of a product. The traditional approach is to aggregate the production data from multiple plants for analysis. However, production data may be obtained through hacker attacks during transmission, and leaked production data can seriously harm the company's product marketing strategy. It is therefore very important to analyze data while protecting the privacy of the enterprise's production data. In this scenario, the server of the invention is the server at company headquarters, and each local factory data storage center is a client. Multiple factory data storage centers co-train the projection model with the help of the headquarters server. Finally, headquarters obtains the projection results of the production data from every factory, and the company can analyze problems in the production process at each site from the abnormal data in those results.

Claims

1. A multi-party projection method based on data security, characterized by comprising the following steps:

S1. acquiring a server and a set of clients;

S2. the server constructs a global model and an initial global dictionary and issues the global model and the global dictionary to each client; the server constructing the global model and the initial global dictionary specifically comprises the following steps:

(5) repeating steps (3) to (4) until a set condition is reached, and optimizing the parameters of the local model with a cross-entropy loss function during the repetition; specifically, the following cross-entropy loss function is adopted:

[Formula images for the loss and its terms are not preserved in this text.]

where a and b are the parameters of the umap algorithm for calculating low-dimensional similarity, a = 1.93, b = 0.79, Y_i is the i-th element of Y, and D_k is the k-th element of D; CE(X, Y) is the difference between the projected distribution and the high-dimensional distribution of the data pairs, and involves the similarity function, calculated by the umap algorithm, between the i-th and j-th elements of the high-dimensional data X;

2) the server aggregates the local models of the clients received in step 1) with the federated averaging algorithm to obtain a new global model; specifically, the aggregation is performed by the following formula:

[Formula image not preserved in this text.]

where the per-client term denotes the parameters of the k-th local model, and K is the number of clients;

2. A multi-party production data analysis method comprising the data security-based multi-party projection method of claim 1, characterized by comprising the steps of:

SC. the enterprise headquarters server draws all the received projection results on a scatter diagram to complete the secure multi-party projection based on data security;