CN114357521A - Data protection method and system - Google Patents

Data protection method and system

Info

Publication number
CN114357521A
Authority
CN
China
Prior art keywords
data
client
server
shuffle
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210031150.3A
Other languages
Chinese (zh)
Inventor
李泽远
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210031150.3A priority Critical patent/CN114357521A/en
Publication of CN114357521A publication Critical patent/CN114357521A/en
Priority to PCT/CN2022/090192 priority patent/WO2023134076A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/20 - Ensemble learning
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The application relates to the technical field of data security and provides a data protection method and a data protection system. In the method, a client obtains first data stored locally and encodes it to obtain multiple groups of second data, which prevents an attacker from obtaining the plaintext of the first data by compromising the client and improves the security of the first data. A shuffle server eliminates the client information from the second data, preventing an attacker from using that information to further steal the data stored in the client and improving the security of both the client information and the data stored in the client. By eliminating the timing information of the second data, an attacker is prevented from inferring the correspondence between the second data and the clients from that timing information, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.

Description

Data protection method and system
Technical Field
The application belongs to the technical field of data security, and relates to a data protection method and system.
Background
Federated learning is a distributed learning paradigm in which multiple clients that store data locally share model parameters with a server hosting a deep learning model, so that the data of the multiple clients can jointly be used to train the deep learning model without the data ever leaving the clients. Federated learning uses data efficiently, improves model performance across different data sets, and preserves the data privacy of the clients, so more and more deep learning models are being trained with federated learning.
In order to ensure data security, a defense mechanism against network attacks must be set up during training. Traditional defense mechanisms can identify and defend against poisoning attacks that aim to paralyze the federated learning training process. However, output attacks have recently appeared that obtain client data by capturing model parameters and reasoning over them in reverse. Because an output attack follows the normal federated learning training process, it can bypass traditional defense mechanisms and easily leak the identity information and data of the other clients participating in training, creating a potential security hazard. How to improve federated learning's defense against output attacks has therefore become a problem that urgently needs to be solved.
Disclosure of Invention
In view of this, the embodiments of the present application provide a data protection method and system to address the poor defense capability of existing federated learning against output attacks.
A first aspect of an embodiment of the present application provides a data protection method, which is applied to a data protection system, where the data protection system includes a shuffle server and a client, and the shuffle server is connected to the client, and the method includes:
acquiring multiple groups of first data stored locally through the client;
coding the multiple groups of first data through the client to obtain multiple groups of second data, wherein the multiple groups of second data correspond to the multiple groups of first data one to one;
shuffling, by the shuffle server, the plurality of sets of second data to eliminate client information in the plurality of sets of second data and eliminate timing information of the plurality of sets of second data, the timing information reflecting a time and an order in which the shuffle server acquires the plurality of sets of second data.
In the data protection method provided by the first aspect of the embodiments of the present application, the client obtains first data stored locally and encodes it to obtain multiple groups of second data, preventing an attacker from obtaining the plaintext of the first data by compromising the client and improving the security of the first data; the shuffle server eliminates the client information of the second data, preventing an attacker from using that information to further steal the data stored in the client and improving the security of both the client information and the data stored in the client; and by eliminating the timing information of the second data, an attacker is prevented from inferring the correspondence between the second data and the clients from that timing information, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.
A second aspect of an embodiment of the present application provides a data protection system, including a shuffle server and a client, where the shuffle server is connected to the client;
the client is used for acquiring multiple groups of first data stored locally;
the client is used for coding the multiple groups of first data to obtain multiple groups of second data, and the multiple groups of second data correspond to the multiple groups of first data one to one;
the shuffle server is configured to shuffle the multiple sets of second data to eliminate client information in the multiple sets of second data and eliminate timing information of the multiple sets of second data, where the timing information is used to reflect a time and an order in which the shuffle server acquires the multiple sets of second data.
It can be understood that, for the beneficial effects of the second aspect, reference may be made to the related description of the first aspect; details are not repeated here.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a first schematic structural diagram of a data protection system according to an embodiment of the present application;
fig. 2 is a schematic diagram of a second structure of a data protection system according to an embodiment of the present application;
fig. 3 is a first flowchart of a data protection method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a third data protection system according to an embodiment of the present application;
fig. 5 is a second flowchart of a data protection method according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In application, to ensure data security, a defense mechanism against network attacks must be set up during training. Traditional defense mechanisms can identify and defend against poisoning attacks that aim to paralyze the federated learning training process. However, output attacks have recently appeared that obtain client data by capturing model parameters and reasoning over them in reverse. Because an output attack follows the normal federated learning training process, it can bypass traditional defense mechanisms and easily leak the identity information and data of the other clients participating in training, creating a potential security hazard. How to improve federated learning's defense against output attacks has therefore become a problem that urgently needs to be solved.
To solve the above technical problem, embodiments of the present application provide a data protection method in which a client obtains first data stored locally and encodes it to obtain multiple groups of second data, preventing an attacker from obtaining the plaintext of the first data by compromising the client and improving the security of the first data; a shuffle server eliminates the client information of the second data, preventing an attacker from using that information to further steal the data stored in the client and improving the security of both the client information and the data stored in the client; and by eliminating the timing information of the second data, an attacker is prevented from inferring the correspondence between the second data and the clients from that timing information, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.
The data protection method provided by the embodiments of the present application can be applied to a data protection system, and the data protection system can be built into a federated learning system or any other type of distributed learning system.
Fig. 1 schematically shows a structural diagram of a data protection system 100, where the data protection system 100 includes a client 110 and a Shuffle (Shuffle) server 120, and the Shuffle server 120 is connected to the client 110;
the client 110 is configured to obtain multiple sets of first data stored locally;
the client 110 is configured to encode multiple sets of first data to obtain multiple sets of second data, where the multiple sets of second data correspond to the multiple sets of first data one to one;
the shuffle server 120 is configured to shuffle the plurality of sets of second data to eliminate the client 110 information in the plurality of sets of second data and to eliminate timing information of the plurality of sets of second data, the timing information reflecting a time and an order in which the shuffle server 120 acquires the plurality of sets of second data.
In application, the client 110 may be a terminal device with data storage capability, and the data of the client 110 may be stored in at least one database. The types of databases supported by the client 110 are as follows: in terms of data storage structure, the client 110 can support databases with relational and non-relational storage structures; in terms of system architecture, the client 110 can support databases with distributed and centralized architectures; and different database products such as Oracle, MySQL, MongoDB, SQL Server, IBM Db2 and Dameng (DM) can be supported.
In application, the terminal device may be a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the specific type of the terminal device is not limited in this embodiment.
In application, an encoding (Encode) module 130 may be built in the client 110, the encoding module 130 is configured to Encode multiple sets of first data to obtain multiple sets of second data, and before the encoding module 130 encodes the first data, the first data may be encrypted to improve data security of the first data in an encoding process.
In application, the shuffle server 120 may be deployed on an independent server. When a third-party user other than the client accesses the shuffle server, identity authentication is required; the authentication may be implemented based on an encryption algorithm such as the RSA (Rivest-Shamir-Adleman) public-key cryptosystem. The shuffle server 120 can shuffle the second data without reading it; specific shuffling methods are described in the data protection methods corresponding to fig. 3 and fig. 5 below.
Fig. 2 exemplarily shows a schematic structural diagram of a data protection system 100, which includes a client 110, a shuffle server 120, and an analysis (Analyze) server 140 connected in sequence, where the shuffle server 120 is connected to the analysis server 140;
the analysis server 140 is configured to integrate multiple sets of mixed second data according to the shuffling rule of the shuffling server 120 to obtain third data;
the analysis server 140 is further configured to decode the third data according to the encoding rule of the client 110, where the decoded third data is used to train a deep learning model;
in an application, the analysis server 140 may be provided in one independent server, and when the client 110 is provided in one independent server, the client 110, the shuffle server 120, and the analysis server 140 may be provided in three mutually independent servers. The analysis server 140 is configured to integrate and decode the shuffled second data to obtain third data, and use the third data to train the deep learning model.
In application, the analysis server 140 contains the deep learning model. In the data protection system 100, the analysis server 140 receives the second data obtained through encoding and shuffling and trains the deep learning model on it. Because the analysis server 140 must aggregate and analyze all the data in the data protection system 100 to implement federated learning, the server hosting it is highly exposed and only weakly protected against attacks. An attacker therefore usually targets the analysis server 140, analyzes the identities of the clients participating in federated learning from the second or third data it holds, and then tries to compromise those clients and obtain the information they store locally.
In application, the client 110 and the shuffle server 120 form the data protection system 100 built into the federated learning system. They encode and shuffle the acquired first data to obtain second data, so that an attacker cannot derive, from the second data, the identity information or data of the client that sent the corresponding first data. This improves federated learning's defense against output attacks and the security of the identity information and data of the clients participating in training. The specific encoding method of the client 110 and the shuffling method of the shuffle server 120 are described below.
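To make the division of labor concrete, the following minimal sketch (in Python, which the patent does not prescribe) walks one batch of records through the three roles. The names Record, encode_record, shuffle_batch and analyze, and the use of Base64 as the encoding, are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative end-to-end sketch only; all names and the Base64 encoding are assumptions.
from dataclasses import dataclass
from typing import List, Optional
import base64
import random

@dataclass
class Record:
    payload: bytes                    # encoded "second data"
    client_id: Optional[str] = None   # metadata the shuffle server must remove
    arrival_index: Optional[int] = None

def encode_record(first_data: str, client_id: str, idx: int) -> Record:
    """Client side: encode locally stored first data into second data."""
    return Record(base64.b64encode(first_data.encode("utf-8")), client_id, idx)

def shuffle_batch(records: List[Record]) -> List[Record]:
    """Shuffle server: strip client and timing information and randomize order."""
    stripped = [Record(payload=r.payload) for r in records]
    random.shuffle(stripped)
    return stripped

def analyze(records: List[Record]) -> List[str]:
    """Analysis server: integrate and decode the shuffled second data."""
    return [base64.b64decode(r.payload).decode("utf-8") for r in records]
```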
It can be understood that the structure shown in this embodiment does not constitute a specific limitation on the data protection system 100. In other embodiments of the present application, the data protection system 100 may include more or fewer components than shown, some components may be combined, or the components may be arranged differently; for example, the system may also include input/output devices, network access devices, and the like. The illustrated components may be implemented in hardware, in software, or in a combination of software and hardware.
As shown in fig. 3, the data protection method provided in the embodiment of the present application is applied to a data protection system, and includes the following steps S301 to S303:
step S301, multiple groups of first data stored locally are obtained through a client
In application, the first data of a client is part of the plaintext data generated or acquired locally by the client. The client can select, from its locally stored plaintext data, the first data to be used for federated learning according to actual needs. After acquiring the first data, the client inputs it into the encoding module, which may be built into the client or deployed on a server connected to the client.
In one embodiment, step S301 is followed by:
the first data is encrypted by the client.
In application, after acquiring the first data, the client can encrypt it to obtain the ciphertext (Cipher Text) corresponding to the first data. This prevents an attacker from obtaining the plaintext (Plain Text) of the first data by compromising the client or the encoding module, and improves the security of the first data both while it is stored in the client and while it is being encoded. The attacker may be a user participating in federated learning or a third-party user not participating in it.
In application, the encryption algorithm used to encrypt the first data may be a symmetric encryption algorithm or an asymmetric encryption algorithm. Specifically, the symmetric encryption algorithm may include RC4 (Rivest Cipher 4, a stream cipher), RC2 (Rivest Cipher 2, a block cipher), DES (Data Encryption Standard), AES (Advanced Encryption Standard), and the like; the asymmetric encryption algorithm may include RSA, ECC (Elliptic Curve Cryptography), DSA (Digital Signature Algorithm), and the like. The embodiment of the present application does not limit the specific type of encryption algorithm used to encrypt the first data.
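As a concrete illustration only, the snippet below uses Fernet from the `cryptography` package, an AES-based symmetric scheme, to encrypt the first data before it reaches the encoding module; the patent does not mandate Fernet or any other particular algorithm.

```python
# Hedged example: Fernet (AES-based symmetric encryption) from the `cryptography`
# package stands in for whichever symmetric or asymmetric algorithm is chosen.
from cryptography.fernet import Fernet

key = Fernet.generate_key()               # kept locally by the client
cipher = Fernet(key)

first_data = b"locally stored plaintext record"
ciphertext = cipher.encrypt(first_data)   # what the encoding module then receives
assert cipher.decrypt(ciphertext) == first_data
```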
Step S302, a plurality of groups of first data are coded through the client to obtain a plurality of groups of second data, and the plurality of groups of second data correspond to the plurality of groups of first data one to one.
In application, the client can encode the multiple groups of first data through the encoding module to obtain multiple groups of second data, with the groups of second data corresponding one to one to the groups of first data. The multiple groups of first data may come from multiple clients, and each client may provide one or more groups of first data. The encoding module encodes the first data into second data in a specific encoding format, which may be, for example, ASCII (American Standard Code for Information Interchange), ANSI (an extended ASCII encoding), or Unicode. The embodiment of the present application does not limit the specific type of encoding format.
In application, encoding the multiple groups of first data through the encoding module to obtain multiple groups of second data prevents an attacker from obtaining the plaintext of the first data by compromising the encoding module, which improves the security of the first data. Encoding can also compress the data size of the first data, so the data protection system carries a lighter processing load and processes the second data faster.
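A minimal sketch of the encoding step follows, assuming UTF-8 as the "specific encoding format" and zlib compression to illustrate the size reduction mentioned above; both choices are examples, not requirements of the method.

```python
# Sketch only: UTF-8 and zlib are stand-ins for the unspecified encoding format.
import zlib
from typing import List

def encode_first_data(records: List[str]) -> List[bytes]:
    """Encode each group of first data into a corresponding group of second data."""
    return [zlib.compress(r.encode("utf-8")) for r in records]

def decode_second_data(encoded: List[bytes]) -> List[str]:
    """Inverse operation, applied later by the analysis server."""
    return [zlib.decompress(b).decode("utf-8") for b in encoded]
```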
In one embodiment, step S302 includes:
and coding the first data stored locally through each sub-client to obtain corresponding second data.
In application, the client may consist of multiple sub-clients, each with a built-in sub-encoding unit. Each sub-encoding unit encodes the first data of its corresponding sub-client to obtain one group of encoded second data. Having the multiple sub-encoding units process the first data of their sub-clients in parallel improves the speed at which the encoding module acquires and encodes the first data.
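One way to realize the parallel processing described here is sketched below; the thread pool and the trivial UTF-8 encoder are assumptions, since the patent does not specify how the sub-encoding units run concurrently.

```python
# Assumed parallelization of the sub-encoding units with a thread pool.
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def encode_group(records: List[str]) -> List[bytes]:
    """Stand-in for one sub-encoding unit encoding its sub-client's first data."""
    return [r.encode("utf-8") for r in records]

def encode_all_sub_clients(local_data: Dict[str, List[str]]) -> Dict[str, List[bytes]]:
    """Run the sub-encoding units of all sub-clients in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(encode_group, data)
                   for name, data in local_data.items()}
        return {name: future.result() for name, future in futures.items()}
```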
Fig. 4 exemplarily shows a schematic structural diagram of the data protection system 100 when the client 110 includes a plurality of sub-clients 111, wherein each sub-client 111 is embedded with one sub-encoding unit 131.
Step S303, shuffling, by the shuffle server, the plurality of sets of second data to eliminate client information in the plurality of sets of second data, and eliminating timing information of the plurality of sets of second data, where the timing information is used to reflect a time and an order in which the shuffle server acquires the plurality of sets of second data.
In application, the shuffle server may shuffle the multiple groups of second data to eliminate the client information in them. The client information may be metadata of the second data, which may include the data source address of the client, the physical topology of the client, the client's system version information, the client's domain name, the name of the database in which the client stores the second data, and so on. The data source address may include the client's IP address (Internet Protocol address), interface address, MAC address (Media Access Control address), and the like; the physical topology of the client reflects all devices included in or connected to the client.
In application, shuffling the multiple groups of second data through the shuffle server also eliminates their timing information, which reflects the time at which, and the order in which, the shuffle server acquired each group of second data.
In application, eliminating the client information of the second data prevents an attacker from acquiring that information and from using it to further steal the data stored in the client, which improves the security of both the client information and the data stored in the client. Eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the clients from that timing information, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.
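The shuffle step can be pictured as below; the metadata field names (source_ip, domain, received_at) are hypothetical examples of the client and timing information that gets discarded.

```python
# Sketch of the shuffle: keep only payloads, discard metadata, destroy arrival order.
import random
from typing import Dict, List

def shuffle_second_data(batch: List[Dict]) -> List[bytes]:
    """Drop client metadata (source_ip, domain, ...) and timing (received_at)."""
    payloads = [item["payload"] for item in batch]
    random.shuffle(payloads)          # arrival time and order become unrecoverable
    return payloads
```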
In one embodiment, step S303 includes:
acquiring the data volume of the plurality of groups of second data through the shuffle server;
adding noise to the data volume through the shuffle server so that the data volume satisfies differential privacy; or deleting a preset data amount from the plurality of groups of second data through the shuffle server so that the data volume satisfies differential privacy.
In application, the data amount of the plurality of groups of second data may be obtained by the shuffle server, and specifically, when the second data is an SQL (Structured Query Language) statement, the number of SQL statements in the plurality of groups of second data may be determined by a Count function, so as to determine the data amount of the plurality of groups of second data.
In application, after the data volume of the multiple groups of second data is determined, the shuffle server may add noise to it. The noise may be Laplace noise or Gaussian noise, among others; it perturbs the data volume so that it satisfies differential privacy, preventing an attacker from obtaining the true data volume of the second data or ensuring that any data volume the attacker does obtain is incorrect.
In application, after the data volume of the multiple groups of second data is determined, the shuffle server may instead delete a preset amount of data from the multiple groups of second data. The preset amount may be generated by a randomized algorithm and differs each time data is deleted, so that the data volume satisfies differential privacy.
In application, adding noise to the data volume through the shuffle server, or deleting a preset amount of data from the multiple groups of second data through the shuffle server, makes the data volume of the second data satisfy differential privacy. This prevents an attacker from obtaining the data volume of the second data and from using it to infer the number of clients or the correspondence between the second data and the clients, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.
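The two options can be sketched as follows; the epsilon parameter, the unit sensitivity of the count, and the cap on deleted records are assumptions made for illustration.

```python
# Hedged sketch of the two ways to make the batch size differentially private.
import random
from typing import List
import numpy as np

def noisy_count(batch: List[bytes], epsilon: float = 1.0) -> int:
    """Report the data volume with Laplace noise (count query, sensitivity 1)."""
    return max(0, round(len(batch) + np.random.laplace(scale=1.0 / epsilon)))

def drop_random_records(batch: List[bytes], max_drop: int = 5) -> List[bytes]:
    """Delete a randomly generated preset amount of records before forwarding."""
    k = random.randint(0, min(max_drop, len(batch)))
    keep = sorted(random.sample(range(len(batch)), len(batch) - k))
    return [batch[i] for i in keep]
```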
In one embodiment, step S303 is followed by:
integrating multiple groups of mixed second data through an analysis server to obtain third data;
and decoding the third data according to the coding rule of the client by the analysis server, wherein the decoded third data is used for training the deep learning model.
In application, after the shuffle server has finished shuffling the multiple groups of second data, the analysis server may first acquire and integrate the shuffled groups of second data to obtain third data and then decode the third data; alternatively, it may first acquire and decode the shuffled groups of second data and then integrate the decoded second data. The embodiment of the present application does not limit the order in which integration and decoding are performed after the analysis server obtains the second data.
In application, the decoded third data can be used to train the deep learning model. Because the client information, and the timing information of each group of second data contained in the third data, have been eliminated by the shuffle server, even if an attacker compromises the analysis server, the attacker cannot analyze the identities of the clients participating in federated learning from the third data and therefore cannot go on to obtain the information stored locally in those clients according to their identities. This improves the security of the clients' identity information and data.
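A compact sketch of the analysis-server side follows, assuming the zlib/UTF-8 encoding from the earlier example; the training call is left as a placeholder because the patent does not fix a model or framework.

```python
# Sketch: integrate the shuffled second data into third data, then decode it.
import zlib
from typing import List

def integrate_and_decode(shuffled_batches: List[List[bytes]]) -> List[str]:
    third_data = [item for batch in shuffled_batches for item in batch]  # integrate
    return [zlib.decompress(b).decode("utf-8") for b in third_data]      # decode

# decoded = integrate_and_decode(all_shuffled_batches)
# model.fit(build_features(decoded), ...)   # placeholder for deep-learning training
```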
According to the data protection method provided by the embodiments of the present application, the client obtains the first data stored locally and encodes it to obtain multiple groups of second data, which prevents an attacker from obtaining the plaintext of the first data by compromising the client and improves the security of the first data; the shuffle server eliminates the client information of the second data, which prevents an attacker from using that information to further steal the data stored in the client and improves the security of both the client information and the data stored in the client; and eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the clients from that timing information, which improves the privacy of the clients participating in federated learning and strengthens federated learning's defense against output attacks.
As shown in fig. 5, in an embodiment, based on the embodiment corresponding to fig. 3, the method includes the following steps S501 to S510:
step S501, multiple groups of first data stored locally are obtained through the client.
In application, the data protection method provided in step S501 is the same as that in step S301, and is not described herein again.
Step S502, each subdata of the first data stored locally is coded through each sub-client to obtain a group of corresponding second data, and the second data comprises a plurality of subdata and corresponds to the plurality of subdata of the first data one by one.
In application, the sub-encoding unit built into a sub-client may provide a first encoding mode and a second encoding mode. In the first encoding mode, the sub-encoding unit may encode each piece of sub-data of the first data one by one, each encoding producing the corresponding piece of sub-data of the second data, or it may encode multiple pieces of sub-data of the first data in parallel to obtain one encoded group of second data. Each piece of sub-data in the second data corresponds to a piece of sub-data of the first data, and the numbers of pieces of sub-data in the first and second data are the same.
In application, when the sub-encoding unit encodes each piece of sub-data of the first data in the first encoding mode, all pieces of sub-data are encoded indiscriminately, which improves the discreteness of the second data obtained after encoding.
Step S503, obtaining, by the sub-client, multiple groups of sub-data of the first data according to the data attribute of each piece of sub-data of the first data stored locally, where each group of sub-data of the first data includes at least one piece of sub-data with the same data attribute.
In application, when the sub-encoding unit adopts the second encoding mode, multiple groups of sub-data of the first data can be obtained by determining the data attribute of each piece of sub-data of the first data, with each group containing at least one piece of sub-data having the same data attribute. The data attribute may specifically be a data type (e.g., character, integer, or floating point), a record (Tuple), a field (Field), a primary key (Primary Key), or a foreign key (Foreign Key). By obtaining the multiple groups of sub-data of the first data, the pieces of sub-data with the same data attribute are classified together, so each group of sub-data of the first data contains pieces of sub-data that share the same data attribute.
Step S504, each group of subdata of the first data is coded through the sub-client to obtain a group of corresponding second data, and the second data comprises a plurality of groups of subdata and corresponds to the plurality of groups of subdata of the first data one by one.
In application, after the sub-encoding unit, using the second encoding mode, has grouped the first data into multiple groups of sub-data, it can encode each group of sub-data of the first data. Specifically, it may encode the groups of sub-data of the first data one by one, each encoding producing the corresponding group of sub-data of the second data, or it may encode multiple groups of sub-data of the first data in parallel to obtain one encoded group of second data. Each group of sub-data in the second data corresponds to a group of sub-data of the first data, and the numbers of groups in the first and second data are the same.
In application, when the sub-encoding unit uses the second encoding mode, the pieces of sub-data of the first data that share a data attribute are classified together and each group is encoded, so the encoded second data consists of multiple groups of sub-data. Sub-data of the first data with the same data attribute is thus gathered together, which improves the aggregation of each group of sub-data in the encoded second data.
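As an illustration of the second encoding mode, the sketch below groups sub-data by its Python type, which stands in for the "data attribute" (character, integer, floating point, and so on); the grouping key and the UTF-8 encoding are assumptions.

```python
# Second encoding mode, sketched: group sub-data by attribute, then encode per group.
from collections import defaultdict
from typing import Dict, List

def encode_by_attribute(first_data: List[object]) -> Dict[str, List[bytes]]:
    groups: Dict[str, List[object]] = defaultdict(list)
    for item in first_data:
        groups[type(item).__name__].append(item)      # e.g. 'int', 'float', 'str'
    return {attr: [str(value).encode("utf-8") for value in values]
            for attr, values in groups.items()}       # one encoded group per attribute
```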
In application, a user may select either the first encoding mode or the second encoding mode of the sub-encoding unit to encode the first data according to actual needs; the embodiment of the present application does not limit the encoding mode or encoding method of the sub-encoding unit.
Step S505, each piece of sub-data of the multiple groups of second data is shuffled by the shuffle server.
In application, the shuffle server may provide a first shuffle mode and a second shuffle mode. The first shuffle mode can be used to shuffle second data produced by either the first or the second encoding mode. Specifically, when the sub-encoding units encode the first data in the first encoding mode and the resulting second data contains multiple pieces of sub-data, the shuffle server may acquire the groups of second data output by all sub-encoding units and shuffle every piece of sub-data indiscriminately; when the sub-encoding units encode the first data in the second encoding mode and the resulting second data contains multiple groups of sub-data, the shuffle server may likewise acquire the groups of second data output by all sub-encoding units and shuffle every piece of sub-data indiscriminately.
In application, the second shuffle mode is used to shuffle second data produced by the second encoding mode. The embodiment of the present application does not limit the shuffling manner of the shuffle server. The shuffling method of the first shuffle mode is the same as that provided in step S303 above and is not repeated here; the shuffling method of the second shuffle mode is described below in steps S506 and S507.
Step S506, the shuffle server recombines the multiple groups of second data to obtain multiple groups of fourth data, where each group of fourth data includes multiple groups of subdata with the same data attribute.
In application, after the sub-encoding units encode the first data in the second encoding mode, the shuffle server may use the second shuffle mode. Specifically, the shuffle server determines the data attribute of each group of sub-data contained in every group of second data and generates multiple groups of fourth data according to those attributes, each group of fourth data containing the groups of sub-data that share the same data attribute. In this way the multiple groups of second data are integrated, and sub-data with the same data attribute across the groups of second data is merged into one group of fourth data.
For example, suppose the shuffle server receives two groups of second data: the first group contains a first and a second group of sub-data, and the second group contains a third and a fourth group of sub-data, where the first and third groups of sub-data are floating-point and the second and fourth groups are integer. The shuffle server then generates two groups of fourth data: the first contains the first and third groups of sub-data, whose data attributes are both floating-point, and the second contains the second and fourth groups of sub-data, whose data attributes are both integer.
In application, by recombining the multiple groups of second data, the shuffle server integrates the groups of sub-data that share the same data attribute, so that each group of fourth data contains only sub-data with the same data attribute, which improves the identifiability and usability of the fourth data.
In step S507, the shuffle server shuffles the fourth data of each group.
In application, the second shuffle mode differs from the first in that the first shuffle mode shuffles all groups of second data together and outputs the shuffled groups to the analysis server as a single data set, whereas the second shuffle mode shuffles each group of fourth data separately and outputs each shuffled group to the analysis server as its own data set. Because every shuffled group of fourth data contains sub-data with the same data attribute, the identifiability and usability of the fourth data are improved, and the analysis server can train the deep learning model in a targeted way according to data attribute. Compared with the second shuffle mode, the groups of second data mixed by the first shuffle mode carry no correlation with one another and therefore offer higher privacy.
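The second shuffle mode, mirroring the two-client example above, might look like the sketch below: groups from different clients that share a data attribute are merged into one group of fourth data, and each group is then shuffled on its own. The dictionary representation is an assumption.

```python
# Second shuffle mode, sketched: merge same-attribute groups into fourth data,
# then shuffle each attribute group separately.
import random
from collections import defaultdict
from typing import Dict, List

def regroup_and_shuffle(second_data_sets: List[Dict[str, List[bytes]]]
                        ) -> Dict[str, List[bytes]]:
    fourth_data: Dict[str, List[bytes]] = defaultdict(list)
    for per_client in second_data_sets:               # each client's encoded groups
        for attr, group in per_client.items():
            fourth_data[attr].extend(group)           # same attribute -> same set
    for group in fourth_data.values():
        random.shuffle(group)                         # each set shuffled and output separately
    return dict(fourth_data)
```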
Step S508, integrating the mixed multiple groups of second data through an analysis server to obtain third data;
step S509, decoding the third data by the analysis server according to the coding rule of the client, where the decoded third data is used to train a deep learning model.
In application, the analysis method of step S508 and step S509 is identical to the analysis method provided in step S303, and is not described herein again.
And step S510, decoding each group of mixed fourth data through the analysis server according to the coding rule of the client, wherein each group of decoded fourth data is used for training a deep learning model.
In application, the decoding method in step S510 is consistent with the decoding method provided in step S303, and is not described herein again. The difference is that the analysis server can obtain the data attributes of the subdata included in each group of fourth data, so that the analysis server can perform targeted training on the deep learning model according to the fourth data with different data attributes, and the training efficiency of the analysis server is improved.
In application, the encoding module of the client or the sub-encoding unit of a sub-client may encode the first data in either the first or the second encoding mode. The first encoding mode encodes the first data indiscriminately, which improves the discreteness of the encoded second data; the second encoding mode encodes according to the data attributes of the sub-data of the first data, which improves the aggregation of each group of sub-data in the second data. The shuffle server may shuffle the second data in either the first or the second shuffle mode. The first shuffle mode shuffles the second data indiscriminately, which improves the privacy of the shuffled second data; the second shuffle mode recombines the groups of second data by data attribute into fourth data, each group of which contains sub-data with the same attribute, which improves the identifiability and usability of the fourth data. Through these selectable encoding and shuffle modes, a user can, according to actual needs, favor the discreteness or the aggregation of the encoded second data, and the privacy of the shuffled second data or the usability of the shuffled fourth data, which makes data processing more flexible.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the above-described terminal device embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions when actually implementing, for example, a plurality of modules or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A data protection method applied to a data protection system, the data protection system including a shuffle server and a client, the shuffle server being connected to the client, the method comprising:
acquiring multiple groups of first data stored locally through the client;
coding the multiple groups of first data through the client to obtain multiple groups of second data, wherein the multiple groups of second data correspond to the multiple groups of first data one to one;
shuffling, by the shuffle server, the plurality of sets of second data to eliminate client information in the plurality of sets of second data and eliminate timing information of the plurality of sets of second data, the timing information reflecting a time and an order in which the shuffle server acquires the plurality of sets of second data.
2. The data protection method of claim 1, wherein the data protection system further comprises an analysis server, the analysis server being connected to the shuffle server;
after the shuffling of the plurality of sets of second data by the shuffling server, the method further includes:
integrating the multiple groups of mixed second data through the analysis server to obtain third data;
and decoding the third data according to the coding rule of the client by the analysis server, wherein the decoded third data is used for training a deep learning model.
3. The data protection method of claim 1, wherein the client comprises a plurality of sub-clients, each of the sub-clients being respectively connected to the shuffle server, each of the sub-clients locally storing a set of first data;
the encoding, by the client, the multiple sets of first data to obtain multiple sets of second data includes:
and coding the first data stored locally through each sub-client to obtain corresponding second data.
4. The data protection method of claim 3, wherein the encoding, by the sub-client, the first data stored locally to obtain corresponding second data comprises:
and coding each subdata of the first data stored locally through each sub-client to obtain a group of corresponding second data, wherein the second data comprises a plurality of subdata and corresponds to the plurality of subdata of the first data one by one.
5. The data protection method of claim 3, wherein the encoding, by the sub-client, the first data stored locally to obtain corresponding second data comprises:
acquiring multiple groups of subdata of the first data through the sub client according to the data attribute of each subdata of the first data stored locally, wherein each group of subdata of the first data comprises at least one subdata with the same data attribute;
and coding each sub-data group of the first data through the sub-client to obtain a group of corresponding second data, wherein the second data comprises a plurality of sub-data groups and corresponds to the plurality of sub-data groups of the first data one by one.
6. The data protection method of claim 1, wherein the shuffling, by the shuffle server, the plurality of sets of second data, further comprises:
acquiring, by the shuffle server, data volumes of the plurality of sets of second data;
adding, by the shuffle server, noise to the amount of data such that the amount of data satisfies differential privacy; or deleting a preset data amount in the plurality of groups of second data through the shuffle server to enable the data amount to satisfy differential privacy.
7. The data protection method of claim 1, wherein the shuffling, by the shuffle server, the plurality of sets of second data comprises:
shuffling, by the shuffle server, each sub data of the plurality of sets of second data.
8. The data protection method of claim 1, wherein the shuffling, by the shuffle server, the plurality of sets of second data comprises:
recombining the multiple groups of second data through the shuffling server to obtain multiple groups of fourth data, wherein each group of fourth data comprises multiple groups of subdata with the same data attribute;
shuffling, by the shuffle server, each group of the fourth data.
9. The data protection method of claim 8, wherein after the shuffling, by the shuffle server, each set of fourth data, the method further comprises:
and decoding each group of mixed fourth data according to the coding rule of the client by the analysis server, wherein each group of decoded fourth data is used for training a deep learning model.
10. A data protection system comprising a client and a shuffle server, the shuffle server being connected to the client;
the client is used for acquiring multiple groups of first data stored locally;
the client is used for coding the multiple groups of first data to obtain multiple groups of second data, and the multiple groups of second data correspond to the multiple groups of first data one to one;
the shuffle server is configured to shuffle the multiple sets of second data to eliminate client information in the multiple sets of second data and eliminate timing information of the multiple sets of second data, where the timing information is used to reflect a time and an order in which the shuffle server acquires the multiple sets of second data.
CN202210031150.3A 2022-01-12 2022-01-12 Data protection method and system Pending CN114357521A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210031150.3A CN114357521A (en) 2022-01-12 2022-01-12 Data protection method and system
PCT/CN2022/090192 WO2023134076A1 (en) 2022-01-12 2022-04-29 Data protection method and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210031150.3A CN114357521A (en) 2022-01-12 2022-01-12 Data protection method and system

Publications (1)

Publication Number Publication Date
CN114357521A true CN114357521A (en) 2022-04-15

Family

ID=81109382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210031150.3A Pending CN114357521A (en) 2022-01-12 2022-01-12 Data protection method and system

Country Status (2)

Country Link
CN (1) CN114357521A (en)
WO (1) WO2023134076A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034328B (en) * 2023-10-09 2024-03-19 国网信息通信产业集团有限公司 Improved abnormal electricity utilization detection system and method based on federal learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695674B (en) * 2020-05-14 2024-04-09 平安科技(深圳)有限公司 Federal learning method, federal learning device, federal learning computer device, and federal learning computer readable storage medium
CN113052333A (en) * 2021-04-02 2021-06-29 中国科学院计算技术研究所 Method and system for data analysis based on federal learning
CN113434873A (en) * 2021-06-01 2021-09-24 内蒙古大学 Federal learning privacy protection method based on homomorphic encryption
CN114357521A (en) * 2022-01-12 2022-04-15 平安科技(深圳)有限公司 Data protection method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023134076A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Data protection method and system, and storage medium
WO2024001808A1 (en) * 2022-06-27 2024-01-04 华为技术有限公司 Information processing method, chip, electronic device and computer-readable storage medium

Also Published As

Publication number Publication date
WO2023134076A1 (en) 2023-07-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination