WO2023134076A1 - Data protection method and system, and storage medium - Google Patents

Data protection method and system, and storage medium

Info

Publication number
WO2023134076A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
shuffling
multiple sets
client
Prior art date
Application number
PCT/CN2022/090192
Other languages
French (fr)
Chinese (zh)
Inventor
李泽远
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023134076A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present application belongs to the technical field of data security, and in particular relates to a data protection method, a system, and a storage medium.
  • Federated learning is a distributed learning paradigm. Model parameters are shared between multiple clients that store data locally and a server equipped with a deep learning model, so that the data from the multiple clients is used to train the deep learning model. Federated learning makes efficient use of data, which can improve the performance of models facing different data sets, and it can protect the data privacy of the clients. Therefore, more and more deep learning models are trained using federated learning.
  • The embodiments of the present application provide a data protection method, a system, and a storage medium to solve the problem that existing federated learning defends poorly against output attacks.
  • A first aspect of the present application provides a data protection method applied to a data protection system.
  • The data protection system includes a shuffling server and a client, the shuffling server is connected to the client, and the method includes:
  • obtaining multiple sets of first data stored locally through the client;
  • encoding the multiple sets of first data by the client to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
  • shuffling the multiple sets of second data by the shuffling server to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  • A second aspect of the present application provides a data protection system, the system comprising:
  • a shuffling server and a client, the shuffling server being connected to the client;
  • the client is used to acquire multiple sets of first data stored locally;
  • the client is configured to encode the multiple sets of first data to obtain multiple sets of second data, and the multiple sets of second data are in one-to-one correspondence with the multiple sets of first data;
  • the shuffling server is used for shuffling the multiple sets of second data, so as to eliminate the client information in the multiple sets of second data and eliminate the timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  • A third aspect of the present application provides a computer-readable storage medium. The computer-readable storage medium stores at least one computer-readable instruction, and when the at least one computer-readable instruction is executed by a processor, the following steps are implemented:
  • obtaining multiple sets of first data stored locally through the client;
  • encoding the multiple sets of first data by the client to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
  • shuffling the multiple sets of second data by the shuffling server to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  • In the data protection method, system, and storage medium described in this application, the client obtains and encodes the first data stored locally to obtain multiple sets of second data, which can prevent attackers from obtaining the plaintext corresponding to the first data by cracking the client and thereby improves the security of the first data. The shuffling server eliminates the client information of the second data, which can prevent an attacker from further stealing the data stored in the client by obtaining the client information of the second data, thereby improving the security of the client information and of the data stored in the client. Eliminating the timing information of the second data also prevents an attacker from inferring the correspondence between the second data and the client based on the timing information of the second data.
  • This application can promote the construction of smart cities and be applied in fields such as smart buildings, smart security, smart communities, smart life, and the Internet of Things. It improves the privacy of clients participating in federated learning and improves the defense capability of federated learning against output attacks.
  • FIG. 1 is a first schematic structural diagram of a data protection system provided by an embodiment of the present application;
  • FIG. 2 is a second schematic structural diagram of the data protection system provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a first data protection method provided by an embodiment of the present application;
  • FIG. 4 is a third schematic structural diagram of the data protection system provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a second data protection method provided by an embodiment of the present application.
  • The term "if" may be construed, depending on the context, as "when", "once", "in response to determining", or "in response to detecting".
  • The phrase "if determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once determined", "in response to the determination", "once [the described condition or event] is detected", or "in response to detection of [the described condition or event]".
  • References to "one embodiment" or "some embodiments" or the like in the specification of the present application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • Appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically stated otherwise.
  • The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless specifically stated otherwise.
  • The embodiment of the present application provides a data protection method.
  • The client obtains and encodes the first data stored locally to obtain multiple sets of second data, which can prevent attackers from obtaining the plaintext corresponding to the first data by cracking the client and thereby improves the security of the first data. The shuffling server eliminates the client information of the second data, which can prevent an attacker from further stealing the data stored in the client by obtaining the client information of the second data, and improves the security of the client information and of the data stored in the client. Eliminating the timing information of the second data also prevents attackers from inferring the correspondence between the second data and the client based on the timing information of the second data. Together, these measures improve the privacy of clients participating in federated learning and the defense capability of federated learning against output attacks.
  • the data protection method provided in the embodiment of the present application can be applied to a data protection system, and the data protection system can be installed in a federated learning system or any other type of distributed learning system.
  • FIG. 1 exemplarily shows a schematic structural diagram of a data protection system 100. The data protection system 100 includes a client 110 and a shuffling (Shuffle) server 120, and the shuffling server 120 is connected to the client 110;
  • the client 110 is used to acquire multiple sets of first data stored locally;
  • the client 110 is configured to encode the multiple sets of first data to obtain multiple sets of second data, and the multiple sets of second data are in one-to-one correspondence with the multiple sets of first data;
  • the shuffling server 120 is used for shuffling the multiple sets of second data, so as to eliminate the information of the client 110 in the multiple sets of second data and eliminate the timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server 120 acquires the multiple sets of second data.
  • The client 110 can be a terminal device with data storage capability, and the data of the client 110 can be stored in at least one database.
  • The database types supported by the client 110 are as follows. According to the data storage structure of the database, the client 110 can support databases with relational and non-relational data storage structures; according to the system architecture of the database, the client 110 can support databases with both distributed and centralized system architectures. Specifically, it can support different types of databases such as Oracle, MySQL, MongoDB, SQL Server, IBM Db2, and Dameng.
  • Terminal devices can be mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), etc.
  • The client 110 may have a built-in encoding (Encode) module 130, which is used to encode multiple sets of first data to obtain multiple sets of second data.
  • Before the encoding module 130 encodes the first data, the first data may be encrypted, so as to improve the data security of the first data during the encoding process.
  • The shuffling server 120 can be set in an independent server. A third-party user other than the client needs to perform identity authentication when accessing the shuffling server, and the identity authentication can be based on the RSA (Rivest-Shamir-Adleman) public key cryptosystem or other encryption algorithms.
  • The shuffling server 120 can complete the shuffling of the second data without reading the second data.
  • FIG. 2 exemplarily shows a schematic structural diagram of a data protection system 100 that includes a client 110, a shuffling server 120, and an analysis (Analyze) server 140 connected in sequence, the shuffling server 120 being connected to the analysis server 140;
  • the analysis server 140 is configured to integrate the multiple sets of shuffled second data according to the shuffling rules of the shuffling server 120 to obtain third data;
  • the analysis server 140 is also used to decode the third data according to the encoding rule of the client 110, and the decoded third data is used for training the deep learning model.
  • The analysis server 140 can be set in an independent server. When the client 110 is also set in an independent server, the client 110, the shuffling server 120, and the analysis server 140 can be set in three mutually independent servers.
  • the analysis server 140 is configured to integrate and decode the shuffled second data to obtain third data, and use the third data for training a deep learning model.
  • the analysis server 140 includes a deep learning model.
  • the analysis server 140 is configured to receive the second data obtained through encoding and shuffling, and train the deep learning model based on the second data.
  • The analysis server 140 in the data protection system 100 needs to aggregate and analyze all the data to achieve federated learning, and the server where the analysis server 140 is located is highly open and weakly protected against attacks.
  • An attacker may therefore launch an attack on the analysis server 140 and analyze the identity of the clients participating in federated learning according to the second data or the third data in the analysis server 140, in an attempt to further crack the client and obtain the local information stored in the client.
  • The client 110 and the shuffling server 120 constitute the data protection system 100 built into the federated learning system. The second data is obtained by encoding and shuffling the acquired first data, so that an attacker cannot obtain the identity information and data of the client that sent the first data corresponding to the second data. This improves the defense capability of federated learning against output attacks and the security of the identity information and data of the clients participating in training. The specific encoding method and shuffling method of the client 110 and the shuffling server 120 are described below.
  • The structure shown in the embodiment of the present application does not constitute a specific limitation on the data protection system 100.
  • The data protection system 100 may include more or fewer components than shown in the figure, combine certain components, or use different components; for example, it may also include input and output devices, network access devices, etc.
  • The illustrated components can be realized in hardware, software, or a combination of software and hardware.
  • the data protection method provided by the embodiment of the present application is applied to a data protection system, including the following steps S301 to S303:
  • Step S301: Obtain multiple sets of first data stored locally through the client.
  • The first data of the client is part of the plaintext data generated or obtained locally by the client. The client can select the first data from the locally stored plaintext data according to actual needs, so as to use the first data for federated learning. After obtaining the first data, the client may input it into the encoding module, which may be built into the client or set in a server connected to the client.
  • Step S301 includes:
  • encrypting the first data by the client.
  • After the client obtains the first data, it can encrypt the first data to obtain the ciphertext (Cipher Text) corresponding to the first data, so as to prevent an attacker from obtaining the plaintext (Plain Text) corresponding to the first data by cracking the client or the encoding module. This improves the security of the first data both while it is stored on the client and during the encoding process.
  • The attacker can be a user who participates in federated learning, or a third-party user who does not participate in federated learning.
  • The encryption algorithm for encrypting the first data may include a symmetric encryption algorithm (Symmetric Encryption Algorithm) or an asymmetric encryption algorithm (Asymmetric Cryptographic Algorithm).
  • The symmetric encryption algorithm may include RC4 (Rivest Cipher 4, a stream cipher), RC2 (Rivest Cipher 2, a block cipher), DES (Data Encryption Standard), AES (Advanced Encryption Standard), etc.
  • Asymmetric encryption algorithms may include RSA, ECC (Elliptic Curve Cryptography), DSA (Digital Signature Algorithm), etc.
  • The embodiment of the present application does not limit the specific type of encryption algorithm used to encrypt the first data in any way.
  • Step S302: Encoding the multiple sets of first data by the client to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data.
  • The client can encode multiple sets of first data through the encoding module to obtain multiple sets of second data, and the multiple sets of second data correspond to the multiple sets of first data one-to-one.
  • Multiple sets of first data may come from multiple clients, and each client may provide one or more sets of first data.
  • The encoding module can encode the first data into the second data in a specified encoding format, and the specified encoding format can be one of different types of encoding formats such as ASCII (American Standard Code for Information Interchange), ANSI (an extended ASCII code), or Unicode.
  • Encoding the multiple sets of first data through the encoding module to obtain multiple sets of second data can prevent attackers from obtaining the plaintext corresponding to the first data by cracking the encoding module, which improves the security of the first data.
  • Encoding can also compress the data size of the first data, so that the data protection system bears a lower processing load and achieves a higher processing speed when it subsequently processes the second data.
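  • As a non-limiting sketch of step S302, the following Python fragment encodes each set of first data into a set of second data in a specified encoding format while preserving the one-to-one correspondence. The function name `encode_sets`, the sample records, and the choice of UTF-8 as the Unicode encoding are assumptions made for illustration only and are not prescribed by the present application.

```python
from typing import List


def encode_sets(first_data: List[List[str]], encoding: str = "utf-8") -> List[List[bytes]]:
    """Encode every record of every set; output set i corresponds to input set i."""
    return [[record.encode(encoding) for record in data_set] for data_set in first_data]


# Hypothetical first data: two sets of locally stored plaintext records.
first_data = [["age=42", "city=Shenzhen"], ["score=0.87"]]
second_data = encode_sets(first_data)

# A party that knows the encoding rule can reverse the encoding.
decoded = [[record.decode("utf-8") for record in data_set] for data_set in second_data]
```

  • Because the mapping is applied set by set and record by record, the number of sets and the number of records per set are unchanged by encoding.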
  • step S302 includes:
  • Each sub-client encodes the locally stored first data to obtain corresponding second data.
  • a client may consist of multiple sub-clients, and each sub-client has a built-in sub-coding unit.
  • Each sub-encoding unit is used to encode the first data of a corresponding sub-client to obtain a set of encoded second data.
  • FIG. 4 exemplarily shows a schematic structural diagram of the data protection system 100 when the client 110 includes multiple sub-clients 111 , wherein each sub-client 111 has a built-in sub-encoding unit 131 .
  • Step S303: Shuffling the multiple sets of second data by the shuffling server to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  • The client information can be metadata of the second data, and the metadata can include the data source address of the client, the physical topology of the client, the system version information of the client, the domain name (Domain Name) of the client, the library name of the database used by the client to store the second data, etc.
  • The data source address can specifically include the client's IP address (Internet Protocol Address), interface address, MAC address (Media Access Control Address), etc. The physical topology of the client reflects all the devices included in the client or all the devices connected to the client.
  • The timing information of the multiple sets of second data can also be eliminated, and the timing information reflects the time and order in which the shuffling server acquires each set of second data.
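  • As a non-limiting sketch of step S303, the following Python fragment drops the client metadata and arrival time attached to each set of second data and then randomizes the order of the payloads. The `Submission` record layout, the field names, and the use of the standard `random` module are assumptions for illustration; the application does not specify how the shuffling server represents or permutes the data.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    """One set of second data as received by the shuffling server (hypothetical layout)."""
    payload: bytes      # encoded second data; not inspected by the shuffler
    client_ip: str      # client information (metadata)
    received_at: float  # timing information: arrival time


def shuffle_submissions(received: List[Submission],
                        rng: Optional[random.Random] = None) -> List[bytes]:
    """Drop client and timing metadata, then randomize the order of the payloads."""
    rng = rng or random.Random()
    payloads = [s.payload for s in received]  # metadata is discarded here
    rng.shuffle(payloads)                     # arrival order is discarded here
    return payloads


received = [
    Submission(b"set-A", "10.0.0.1", 1.0),
    Submission(b"set-B", "10.0.0.2", 2.0),
    Submission(b"set-C", "10.0.0.3", 3.0),
]
shuffled = shuffle_submissions(received, random.Random(0))
```

  • The payloads are treated as opaque bytes, which is consistent with shuffling the second data without reading it. A deployment concerned with unpredictability would likely pass `random.SystemRandom()` instead of a seeded generator.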
  • Step S303 includes:
  • The data volume of the multiple sets of second data can be obtained by the shuffling server.
  • When the second data is a SQL (Structured Query Language) statement, the Count function can be used to determine the number of SQL statements in the multiple sets of second data, so as to determine the data volume of the multiple sets of second data.
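  • As a non-limiting sketch of determining the data volume with a count function, the following Python fragment loads the SQL statements of several sets of second data into an in-memory SQLite table and counts them with `COUNT(*)`. The table name `second_data`, its columns, and the sample statements are assumptions for illustration; the application does not state where the statements are stored or which database performs the count.

```python
import sqlite3
from typing import List


def count_statements(sets_of_second_data: List[List[str]]) -> int:
    """Return the total number of SQL statements across all sets using COUNT(*)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE second_data (set_id INTEGER, statement TEXT)")
        rows = [(i, stmt)
                for i, data_set in enumerate(sets_of_second_data)
                for stmt in data_set]
        conn.executemany("INSERT INTO second_data VALUES (?, ?)", rows)
        (total,) = conn.execute("SELECT COUNT(*) FROM second_data").fetchone()
        return total
    finally:
        conn.close()


# Hypothetical second data: the statements are stored as text and never executed.
sets_of_second_data = [
    ["SELECT 1", "SELECT 2"],
    ["INSERT INTO t VALUES (1)"],
]
data_volume = count_statements(sets_of_second_data)
```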
  • Noise can be added to the data volume by the shuffling server, and the noise type can include Laplace noise (Laplace Noise) or Gaussian noise (Gaussian Noise).
  • A preset amount of data in the multiple sets of second data can also be deleted by the shuffling server. The preset amount can be generated by a randomized algorithm (Randomized Algorithm), so that the amount deleted from the multiple sets of second data differs each time and the data volume satisfies differential privacy.
  • By adding noise to the data volume through the shuffling server, or by deleting a preset amount of data from the multiple sets of second data through the shuffling server, the data volume of the second data satisfies differential privacy. This prevents an attacker from obtaining the true data volume of the second data and from inferring the number of clients and the correspondence between the second data and the clients based on that data volume, which improves the privacy of the clients participating in federated learning and the defense capability of federated learning against output attacks.
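  • As a non-limiting sketch of the two options described above, the following Python fragment perturbs a data volume with Laplace noise and, alternatively, deletes a randomly chosen number of records. The privacy parameter `epsilon`, the bound `max_drop`, and all function names are assumptions for illustration; whether a particular parameter choice actually satisfies differential privacy depends on an analysis that is outside this sketch.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_volume(true_volume: int, epsilon: float, rng: random.Random) -> float:
    """Report the data volume plus Laplace noise; a count changes by at most 1 per record."""
    return true_volume + laplace_noise(1.0 / epsilon, rng)


def drop_random_records(records: List[bytes], max_drop: int,
                        rng: random.Random) -> List[bytes]:
    """Delete a random number of records (between 0 and max_drop), chosen at random."""
    n_drop = rng.randint(0, min(max_drop, len(records)))
    keep = set(rng.sample(range(len(records)), len(records) - n_drop))
    return [r for i, r in enumerate(records) if i in keep]


rng = random.Random(1)
records = [bytes([i]) for i in range(10)]
reported = noisy_volume(len(records), epsilon=1.0, rng=rng)
kept = drop_random_records(records, max_drop=3, rng=rng)
```

  • A smaller `epsilon` gives a larger noise scale and therefore hides the true data volume more strongly, at the cost of a less accurate reported volume.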
  • After step S303, the method further includes:
  • decoding the third data by the analysis server according to the encoding rules of the client, the decoded third data being used for training the deep learning model.
  • The analysis server can first obtain and integrate the multiple sets of shuffled second data to obtain third data, and then decode the third data;
  • alternatively, the analysis server may first obtain and decode the multiple sets of shuffled second data, and then integrate the decoded second data.
  • The embodiment of the present application does not impose any limitation on the order in which the analysis server integrates and decodes the second data after acquiring it.
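  • As a non-limiting sketch of the two processing orders available to the analysis server, the following Python fragment integrates and then decodes, or decodes and then integrates, the same shuffled second data. Treating integration as concatenation and the encoding rule as UTF-8 are assumptions for illustration only.

```python
from typing import List


def decode(record: bytes) -> str:
    """Reverse the client's encoding rule (assumed here to be UTF-8)."""
    return record.decode("utf-8")


def integrate_then_decode(shuffled_sets: List[List[bytes]]) -> List[str]:
    # Integrate the sets into third data first, then decode every record.
    third_data = [record for data_set in shuffled_sets for record in data_set]
    return [decode(record) for record in third_data]


def decode_then_integrate(shuffled_sets: List[List[bytes]]) -> List[str]:
    # Decode every set first, then integrate the decoded sets.
    decoded_sets = [[decode(record) for record in data_set] for data_set in shuffled_sets]
    return [record for data_set in decoded_sets for record in data_set]


shuffled_sets = [[b"x=3", b"y=1"], [b"z=2"]]
result_a = integrate_then_decode(shuffled_sets)
result_b = decode_then_integrate(shuffled_sets)
```

  • Under these assumptions both orders yield the same decoded third data, which is consistent with the order not being limited.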
  • The decoded third data can be used to train the deep learning model.
  • The decoded third data no longer carries the client information, nor the timing information with which the shuffling server acquired each set of second data included in the third data. Even after the analysis server is cracked by an attacker, this prevents the attacker from analyzing the identity of the clients participating in federated learning based on the third data, and from further obtaining the local information stored in the client based on that identity, which improves the security of the client's identity information and data.
  • In summary, the client obtains and encodes the first data stored locally to obtain multiple sets of second data, which can prevent an attacker from obtaining the plaintext corresponding to the first data by cracking the client and thereby improves the security of the first data. The shuffling server eliminates the client information of the second data, which can prevent the attacker from further stealing the data stored in the client by obtaining the client information of the second data, and improves the security of the client information and of the data stored by the client. Eliminating the timing information of the second data also prevents attackers from inferring the correspondence between the second data and the client based on the timing information of the second data, which improves the privacy of the clients participating in federated learning and the defense capability of federated learning against attacks.
  • Step S501: Obtain multiple sets of first data stored locally through the client.
  • The method of step S501 is consistent with step S301 above and is not repeated here.
  • Each sub-client encodes each piece of sub-data of the first data stored locally to obtain a corresponding set of second data. The second data includes multiple pieces of sub-data that are in one-to-one correspondence with the multiple pieces of sub-data of the first data.
  • The built-in sub-encoding unit of the sub-client can include a first encoding mode and a second encoding mode.
  • When the sub-encoding unit adopts the first encoding mode, each piece of sub-data of the first data can be encoded one by one, with each encoded piece correspondingly generating one piece of sub-data of the second data; multiple pieces of sub-data of the first data can also be encoded in parallel to obtain a set of encoded second data. Each piece of sub-data included in the second data has a corresponding piece of sub-data of the first data, and the number of pieces of sub-data of the second data is the same as that of the first data. The difference is that each piece of sub-data of the first data is unencoded plaintext data, and each piece of sub-data of the second data is encoded ciphertext data.
  • In the first encoding mode, the sub-encoding unit encodes each piece of sub-data of the first data without distinction, thereby improving the discreteness of the second data obtained after encoding.
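  • As a non-limiting sketch of the first encoding mode, the following Python fragment encodes the pieces of sub-data of the first data either one by one or in parallel and checks that both paths give the same ordered result. The helper names, the worker count, and the use of UTF-8 as a stand-in for the encoding rule are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def encode_piece(piece: str) -> bytes:
    """Encode one piece of sub-data (UTF-8 stands in for the actual encoding rule)."""
    return piece.encode("utf-8")


def encode_sequential(first_data: List[str]) -> List[bytes]:
    """Encode the pieces one by one."""
    return [encode_piece(piece) for piece in first_data]


def encode_parallel(first_data: List[str], workers: int = 4) -> List[bytes]:
    """Encode the pieces in parallel; Executor.map preserves the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_piece, first_data))


first_data = ["id=7", "temp=21.5", "name=alice"]
second_sequential = encode_sequential(first_data)
second_parallel = encode_parallel(first_data)
```

  • Because the output order follows the input order in both paths, piece i of the second data still corresponds to piece i of the first data.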
  • Step S503: Obtaining, by the sub-client, multiple groups of sub-data of the first data according to the data attribute of each piece of sub-data stored locally, each group of sub-data of the first data including at least one piece of sub-data with the same data attribute.
  • Multiple groups of sub-data of the first data can be obtained by determining the data attribute of each piece of sub-data of the first data, and each group of sub-data includes at least one piece of sub-data with the same data attribute. The data attribute may specifically be a data type (such as a character type, an integer type, a floating-point type, etc.), a record (Tuple), a field (Field), a primary key (Primary Key), a foreign key (Foreign Key), etc.
  • The sub-data with the same data attribute in the first data can be classified together, so that each group of sub-data of the first data includes multiple pieces of sub-data with the same data attribute.
  • Step S504: Encoding each group of sub-data of the first data by the sub-client to obtain a corresponding set of second data, the second data including multiple groups of sub-data in one-to-one correspondence with the multiple groups of sub-data of the first data.
  • When the sub-encoding unit adopts the second encoding mode, the groups of sub-data of the first data can be encoded group by group, with each encoded group correspondingly generating one group of sub-data of the second data; multiple groups of sub-data of the first data can also be encoded in parallel to obtain a set of encoded second data. Each group of sub-data included in the second data has a corresponding group of sub-data of the first data, and the number of groups of sub-data of the second data is the same as that of the first data. The difference is that each group of sub-data of the first data is unencoded plaintext data, and each group of sub-data of the second data is encoded ciphertext data.
  • When the sub-encoding unit adopts the second encoding mode, the sub-data with the same data attribute in the first data is classified together and each group of sub-data of the first data is encoded, so that the encoded second data, which includes multiple groups of sub-data, gathers the sub-data of the first data that shares a data attribute and improves the aggregation of each group of sub-data in the encoded second data.
  • The user can select the first encoding mode or the second encoding mode of the sub-encoding unit to encode the first data according to actual needs, and the embodiment of the present application does not impose any restriction on the encoding mode of the encoding module and its sub-encoding units.
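  • As a non-limiting sketch of the second encoding mode, the following Python fragment groups the pieces of sub-data of the first data by data attribute and then encodes each group. Using the Python type name as the data attribute, `repr` plus UTF-8 as the encoding rule, and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Union

Value = Union[int, float, str]


def group_by_attribute(first_data: List[Value]) -> Dict[str, List[Value]]:
    """Group pieces of sub-data by data attribute (here: the Python type name)."""
    groups: Dict[str, List[Value]] = defaultdict(list)
    for piece in first_data:
        groups[type(piece).__name__].append(piece)
    return dict(groups)


def encode_groups(groups: Dict[str, List[Value]]) -> Dict[str, List[bytes]]:
    """Encode each group; every output group corresponds to exactly one input group."""
    return {attr: [repr(p).encode("utf-8") for p in pieces]
            for attr, pieces in groups.items()}


first_data = [3, 1.5, "abc", 7, 2.25]
groups = group_by_attribute(first_data)
second_data = encode_groups(groups)
```

  • The number of groups is unchanged by encoding, so each group of sub-data of the second data can be traced back to one group of sub-data of the first data.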
  • Step S505: Shuffling each piece of sub-data of the multiple sets of second data by the shuffling server.
  • The shuffling server may include a first shuffling mode and a second shuffling mode.
  • The first shuffling mode may be used for shuffling the second data encoded by either the first encoding mode or the second encoding mode.
  • When the sub-encoding unit adopts the first encoding mode, the shuffling server can obtain the multiple sets of second data output by all sub-encoding units and shuffle each piece of sub-data of the multiple sets of second data indiscriminately; when the sub-encoding unit adopts the second encoding mode, the shuffling server can likewise obtain the multiple sets of second data output by all sub-encoding units and shuffle each piece of sub-data of the multiple sets of second data indiscriminately.
  • the second shuffling mode may be used for shuffling the second data encoded by the second encoding mode.
  • the embodiment of the present application does not impose any limitation on the shuffling mode of the shuffling server.
  • the shuffling method in the first shuffling mode is consistent with the shuffling method provided in step S303 above, and will not be repeated here.
  • the shuffling method in the second shuffling mode will be described below based on steps S506 and S507.
  • Step S506: Reorganizing the multiple sets of second data by the shuffling server to obtain multiple sets of fourth data, each set of fourth data including multiple groups of sub-data with the same data attribute.
  • The shuffling server can adopt the second shuffling mode. Specifically, the shuffling server can determine the data attributes of the multiple groups of sub-data included in each set of second data and, according to the data attribute of each group of sub-data, generate multiple sets of fourth data, each of which includes multiple groups of sub-data with the same data attribute. In this way the multiple sets of second data are integrated, and sub-data with the same data attribute across the multiple sets of second data is gathered into one set of fourth data.
  • For example, the shuffling server receives two sets of second data. The first set of second data includes a first group of sub-data and a second group of sub-data, and the second set of second data includes a third group of sub-data and a fourth group of sub-data, where the data attribute of the first group is floating-point type, that of the second group is integer type, that of the third group is floating-point type, and that of the fourth group is integer type. The shuffling server can then generate two sets of fourth data: the first set of fourth data includes the first and third groups of sub-data, whose data attributes are both floating-point type, and the second set of fourth data includes the second and fourth groups of sub-data, whose data attributes are both integer type.
  • Because each set of fourth data includes multiple pieces of sub-data with the same data attribute, the identifiability and the data availability of the fourth data are improved.
  • Step S507: shuffle each set of fourth data through the shuffling server.
  • The difference between the second shuffling mode and the first shuffling mode is as follows. The first shuffling mode shuffles the multiple sets of second data together and outputs the shuffled sets of second data to the analysis server. The second shuffling mode shuffles each set of fourth data separately and outputs each shuffled set of fourth data to the analysis server as a data set.
  • Each shuffled set of fourth data includes sub-data with the same data attribute, which improves the identifiability and data availability of the fourth data and helps the analysis server train the deep learning model in a targeted manner according to the data attribute.
  • By contrast, the multiple sets of second data shuffled in the first shuffling mode have no correlation with one another and therefore have higher privacy.
  • Step S508: integrate the multiple sets of shuffled second data through the analysis server to obtain third data.
  • Step S509: decode the third data through the analysis server according to the encoding rules of the client; the decoded third data is used for training the deep learning model.
  • Steps S508 and S509 are consistent with the analysis method provided in step S303 above, and will not be repeated here.
  • Step S510: decode each set of shuffled fourth data through the analysis server according to the encoding rules of the client; each decoded set of fourth data is used for training the deep learning model.
  • The decoding method in step S510 is consistent with the decoding method provided in step S303 above, and will not be repeated here.
  • The difference is that the analysis server can obtain the data attributes of the sub-data included in each set of fourth data, so that it can train the deep learning model in a targeted manner according to fourth data with different data attributes, which improves the training efficiency of the analysis server.
  • The encoding module of the client or the sub-encoding unit of the sub-client can select either the first encoding mode or the second encoding mode to encode the first data. The first encoding mode encodes the first data in a way that can improve the discreteness of the encoded second data, while the second encoding mode encodes according to the data attributes of the sub-data of the first data, which can improve the aggregation of each set of sub-data of the second data. The shuffling server can select either the first shuffling mode or the second shuffling mode to shuffle the second data. The first shuffling mode shuffles the second data indiscriminately, which improves the privacy of the shuffled second data.
  • The second shuffling mode can reorganize the multiple sets of second data according to data attributes to obtain fourth data.
  • The fourth data includes multiple sets of sub-data with the same data attribute, which improves its identifiability and data availability.
  • Users can therefore, according to actual needs, improve either the discreteness or the aggregation of the encoded second data, and either the privacy of the shuffled second data or the data availability of the shuffled fourth data, which increases the flexibility of data processing.
  • The modules and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • The disclosed terminal device and method may be implemented in other ways.
  • The terminal device embodiments described above are only illustrative.
  • The division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • For example, multiple modules or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
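The second shuffling mode described in steps S506 and S507 above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the application: the function names `attribute_of`, `reorganize`, and `shuffle_each` are hypothetical, the Python type name of the elements stands in for the "data attribute", and plain numbers stand in for encoded sub-data.

```python
import random
from collections import defaultdict


def attribute_of(sub_data):
    """Return a label for the data attribute of one set of sub-data.

    Assumption: the attribute is the Python type name shared by all elements.
    """
    kinds = {type(value).__name__ for value in sub_data}
    if len(kinds) != 1:
        raise ValueError("sub-data must be non-empty and have a single attribute")
    return kinds.pop()


def reorganize(second_data_sets):
    """Step S506 (sketch): merge sub-data with the same attribute into one set of fourth data."""
    fourth = defaultdict(list)
    for second_data in second_data_sets:
        for sub_data in second_data:
            fourth[attribute_of(sub_data)].append(sub_data)
    return dict(fourth)


def shuffle_each(fourth_data, rng):
    """Step S507 (sketch): shuffle every set of fourth data separately."""
    shuffled = {}
    for attribute, sub_data_sets in fourth_data.items():
        copy = list(sub_data_sets)
        rng.shuffle(copy)  # order inside one attribute group is randomized
        shuffled[attribute] = copy
    return shuffled


# Two sets of second data, each with one floating-point and one integer set of sub-data.
second_data_sets = [
    [[1.5, 2.5], [1, 2]],
    [[3.5, 4.5], [3, 4]],
]
fourth_data = reorganize(second_data_sets)
shuffled_fourth_data = shuffle_each(fourth_data, random.Random(0))
```

With this input, `fourth_data` holds one floating-point set and one integer set, mirroring the two-set example given for step S506; each set is then shuffled on its own, so sub-data never moves between attribute groups.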

Abstract

The present application is applicable to the technical field of data security and provides a data protection method and system, and a storage medium. In the method, a client acquires locally stored first data and encodes it to obtain a plurality of groups of second data, which prevents an attacker from obtaining the plaintext corresponding to the first data by cracking the client and thereby improves the security of the first data. A shuffle server removes client information from the second data, which prevents an attacker from using that client information to further steal data stored in the client, thereby improving the security of the client information and of the data stored in the client. Time sequence information of the second data is also removed, which prevents an attacker from inferring the correspondence between the second data and the client from that information, thereby improving the privacy of clients participating in federated learning and the defense capability of federated learning against output attacks.

Description

Data protection method, system and storage medium
This application claims priority to Chinese patent application No. 202210031150.3, entitled "Data Protection Method and System", filed with the China Patent Office on January 12, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of data security, and in particular relates to a data protection method, a system and a storage medium.
Background
Federated learning is a distributed learning paradigm. Multiple clients that store data locally share model parameters with a server hosting a deep learning model, so that the data of the multiple clients can be used jointly to train the deep learning model without the data leaving the clients. Federated learning uses data efficiently, can improve a model's performance across different data sets, and can protect the data privacy of clients, so more and more deep learning models are trained with federated learning.
To ensure data security, a defense mechanism against network attacks needs to be set up during training. Traditional defense mechanisms can identify and defend against poisoning attacks, which aim to paralyze the federated learning training process. An output attack has now emerged that obtains client data by acquiring model parameters and reasoning backward from them. The inventors found that an output attack follows the federated learning training process and can therefore bypass traditional defense mechanisms, which easily leaks the identity information and data of the other clients participating in training and poses a security risk. How to improve the defense capability of federated learning against output attacks has therefore become an urgent problem.
Summary
In view of this, the embodiments of the present application provide a data protection method, system, and storage medium to solve the problem that existing federated learning defends poorly against output attacks.
A first aspect of the present application provides a data protection method applied to a data protection system. The data protection system includes a shuffling server and a client, the shuffling server is connected to the client, and the method includes:
acquiring, through the client, multiple sets of first data stored locally;
encoding, through the client, the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
shuffling, through the shuffling server, the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
A second aspect of the present application provides a data protection system, the system including:
a shuffling server and a client, the shuffling server being connected to the client;
the client is configured to acquire multiple sets of first data stored locally;
the client is configured to encode the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
the shuffling server is configured to shuffle the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
acquiring, through the client, multiple sets of first data stored locally;
encoding, through the client, the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
shuffling, through the shuffling server, the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
With the data protection method, system and storage medium described in this application, the client acquires locally stored first data and encodes it to obtain multiple sets of second data, which prevents an attacker from obtaining the plaintext corresponding to the first data by cracking the client and thus improves the security of the first data. The shuffling server eliminates the client information of the second data, which prevents an attacker from using that information to further steal data stored in the client and improves the security of the client information and of the data stored in the client. Eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the client from that timing information. The application can promote the construction of smart cities and can be applied in fields such as smart buildings, smart security, smart communities, smart living, and the Internet of Things; it improves the privacy of clients participating in federated learning and the defense capability of federated learning against output attacks.
Brief Description of the Drawings
FIG. 1 is a first schematic structural diagram of the data protection system provided by an embodiment of the present application;
FIG. 2 is a second schematic structural diagram of the data protection system provided by an embodiment of the present application;
FIG. 3 is a first schematic flowchart of the data protection method provided by an embodiment of the present application;
FIG. 4 is a third schematic structural diagram of the data protection system provided by an embodiment of the present application;
FIG. 5 is a second schematic flowchart of the data protection method provided by an embodiment of the present application.
Detailed Description
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so that the embodiments of the present application can be thoroughly understood. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be construed, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
In addition, in the description of this specification and the appended claims, the terms "first", "second", "third" and the like are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
Reference in this specification to "one embodiment", "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments" and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "including", "comprising", "having" and their variations all mean "including but not limited to", unless specifically emphasized otherwise.
In application, to ensure data security, a defense mechanism against network attacks needs to be set up during training. Traditional defense mechanisms can identify and defend against poisoning attacks, which aim to paralyze the federated learning training process. An output attack has now emerged that obtains client data by acquiring model parameters and reasoning backward from them. Because an output attack follows the federated learning training process, it can bypass traditional defense mechanisms, which easily leaks the identity information and data of the other clients participating in training and poses a security risk. How to improve the defense capability of federated learning against output attacks has therefore become an urgent problem.
In view of the above technical problem, an embodiment of the present application provides a data protection method. The client acquires locally stored first data and encodes it to obtain multiple sets of second data, which prevents an attacker from obtaining the plaintext corresponding to the first data by cracking the client and thus improves the security of the first data. The shuffling server eliminates the client information of the second data, which prevents an attacker from using that information to further steal data stored in the client and improves the security of the client information and of the data stored in the client. Eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the client from that timing information, which improves the privacy of clients participating in federated learning and the defense capability of federated learning against output attacks.
The data protection method provided in the embodiment of the present application can be applied to a data protection system, and the data protection system can be installed in a federated learning system or in any other type of distributed learning system.
FIG. 1 shows an exemplary schematic structural diagram of a data protection system 100. The data protection system 100 includes a client 110 and a shuffling (Shuffle) server 120, and the shuffling server 120 is connected to the client 110;
the client 110 is configured to acquire multiple sets of first data stored locally;
the client 110 is configured to encode the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
the shuffling server 120 is configured to shuffle the multiple sets of second data to eliminate the information of the client 110 in the multiple sets of second data and to eliminate the timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server 120 acquires the multiple sets of second data.
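The role of the shuffling server 120 can be illustrated with a minimal sketch. It is written under stated assumptions rather than taken from the application: the `Submission` record, its `client_id` and `received_at` fields, and the function `shuffle_submissions` are hypothetical names, and each set of second data is treated as an opaque byte string that the shuffler never inspects.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    client_id: str      # hypothetical client information attached on arrival
    received_at: float  # hypothetical arrival time (timing information)
    payload: bytes      # one set of second data, treated as opaque


def shuffle_submissions(submissions, rng=None):
    """Drop client and timing metadata, then randomize the output order.

    The payloads are only moved around, never decoded or read.
    """
    if rng is None:
        rng = random.SystemRandom()  # OS-backed randomness for unpredictable order
    payloads = [s.payload for s in submissions]  # client_id and received_at are discarded
    rng.shuffle(payloads)  # arrival order no longer determines output order
    return payloads


submissions = [
    Submission("client-a", 1.0, b"a-1"),
    Submission("client-a", 2.0, b"a-2"),
    Submission("client-b", 3.0, b"b-1"),
    Submission("client-c", 4.0, b"c-1"),
]
# A seeded generator is used here only to make the demonstration repeatable.
shuffled_payloads = shuffle_submissions(submissions, random.Random(42))
```

The output contains exactly the submitted payloads but carries neither the sender nor the arrival time, so the downstream server has nothing to link a payload back to a specific client by those fields.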
In application, the client 110 may be a terminal device with data storage capability, and the data of the client 110 may be stored in at least one database. The database types supported by the client 110 are as follows. By data storage structure, the client 110 can support both relational and non-relational databases; by system architecture, it can support both distributed and centralized databases. Specifically, it can support different types of databases such as Oracle, MySQL, MongoDB, SQL Server, IBM Db2, and Dameng.
In application, the terminal device may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR) or virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), and so on. The embodiment of the present application does not impose any limitation on the specific type of the terminal device.
In application, the client 110 may have a built-in encoding (Encode) module 130, which encodes the multiple sets of first data to obtain the multiple sets of second data. Before the encoding module 130 encodes the first data, the first data may be encrypted to improve its security during encoding.
In application, the shuffling server 120 may be deployed on an independent server. Third-party users other than the client must pass identity authentication to access the shuffling server, and the identity authentication may be implemented with an encryption algorithm such as the RSA (Rivest-Shamir-Adleman) public-key cryptosystem. The shuffling server 120 can shuffle the second data without reading it; for the specific shuffling method, refer to the data protection method corresponding to FIG. 3 or FIG. 5 below.
FIG. 2 shows an exemplary schematic structural diagram of a data protection system 100, which includes a client 110, a shuffling server 120, and an analysis (Analyze) server 140 connected in sequence, the shuffling server 120 being connected to the analysis server 140;
the analysis server 140 is configured to integrate the shuffled multiple sets of second data according to the shuffling rules of the shuffling server 120 to obtain third data;
the analysis server 140 is further configured to decode the third data according to the encoding rules of the client 110, the decoded third data being used for training the deep learning model.
In application, the analysis server 140 may be deployed on an independent server. When the client 110 is also deployed on an independent server, the client 110, the shuffling server 120, and the analysis server 140 may be deployed on three mutually independent servers. The analysis server 140 integrates and decodes the shuffled second data to obtain the third data, and uses the third data for training the deep learning model.
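The integrate-then-decode behavior of the analysis server 140 can be sketched as below. This is an assumption-laden illustration only: the application does not spell out the integration rule here, so `integrate` simply flattens several shuffled batches into one collection, `decode` assumes the client's encoding rule is a character encoding such as UTF-8, and both function names are hypothetical.

```python
def integrate(shuffled_batches):
    """Combine the shuffled sets of second data from several batches into one collection (third data)."""
    third_data = []
    for batch in shuffled_batches:
        third_data.extend(batch)
    return third_data


def decode(third_data, encoding="utf-8"):
    """Decode every element with the (assumed) encoding rule shared by the clients."""
    return [item.decode(encoding) for item in third_data]


# Two shuffled batches as they might arrive from the shuffling server.
shuffled_batches = [[b"x=1", b"y=2"], [b"z=3"]]
third_data = integrate(shuffled_batches)
training_records = decode(third_data)
```

Note that neither function needs to know which client produced which element; only the shared encoding rule is required to recover usable training records.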
In application, the analysis server 140 includes a deep learning model. In the data protection system 100, the analysis server 140 receives the second data obtained through encoding and shuffling and trains the deep learning model based on it. To realize federated learning, the analysis server 140 needs to aggregate and analyze all of the data in the data protection system 100. The server hosting the analysis server 140 is highly open and weakly protected against attacks, so an attacker will usually target the analysis server 140 and use the second data or third data in it to analyze the identities of the clients participating in federated learning, in an attempt to further crack a client and obtain the local information stored in it.
In application, the client 110 and the shuffling server 120 constitute the data protection system 100, which is built into the federated learning system. The acquired first data is encoded and shuffled to obtain the second data, so that an attacker cannot use the second data to obtain the identity information and data of the client that sent the corresponding first data. This improves the defense capability of federated learning against output attacks and the security of the identity information and data of the clients participating in training. The specific encoding method of the client 110 and the shuffling method of the shuffling server 120 are described below.
It can be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the data protection system 100. In other embodiments of the present application, the data protection system 100 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input and output devices, network access devices, and so on. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
As shown in FIG. 3, the data protection method provided by the embodiment of the present application is applied to a data protection system and includes the following steps S301 to S303:
Step S301: acquire, through the client, multiple sets of first data stored locally.
In application, the first data of the client is part of the plaintext data generated or acquired locally by the client. The client can select the first data from the locally stored plaintext data according to actual needs, so that the first data can be used for federated learning. After acquiring the first data, the client can input it into the encoding module, which may be built into the client or deployed on a server and connected to the client.
In one embodiment, after step S301 the method includes:
encrypting the first data through the client.
In application, after acquiring the first data, the client can encrypt it to obtain the corresponding ciphertext (Cipher Text). This prevents an attacker from obtaining the plaintext (Plain Text) corresponding to the first data by cracking the client or the encoding module, and improves the security of the first data while it is stored on the client and during encoding. The attacker may be a user participating in federated learning or a third-party user not participating in federated learning.
In application, the encryption algorithm used to encrypt the first data may be a symmetric encryption algorithm (Symmetric Encryption Algorithm) or an asymmetric encryption algorithm (Asymmetric Cryptographic Algorithm). Specifically, the symmetric encryption algorithm may include RC4 (Rivest Cipher 4, a stream cipher), RC2 (Rivest Cipher 2, a block cipher), DES (Data Encryption Standard), or AES (Advanced Encryption Standard); the asymmetric encryption algorithm may include RSA, ECC (Elliptic Curve Cryptography), DSA (Digital Signature Algorithm), and so on. The embodiment of the present application does not impose any limitation on the specific type of encryption algorithm used to encrypt the first data.
Step S302: encoding the multiple sets of first data by the client to obtain multiple sets of second data, where the multiple sets of second data correspond one-to-one to the multiple sets of first data.
In application, the client may encode the multiple sets of first data through the encoding module to obtain multiple sets of second data, and the multiple sets of second data correspond one-to-one to the multiple sets of first data. The multiple sets of first data may come from multiple clients, and each client may provide one or more sets of first data. The encoding module may encode the first data into second data in a specified encoding format. The specified encoding format may be ASCII (American Standard Code for Information Interchange), ANSI (an extended ASCII code), Unicode, or another type of encoding format. The embodiments of the present application do not limit the specific type of the specified encoding format.
In application, encoding the multiple sets of first data through the encoding module to obtain multiple sets of second data can prevent an attacker from obtaining the plaintext corresponding to the first data by cracking the encoding module, which improves the security of the first data. It can also compress the data size of the first data, so that the data protection system has a lower processing load and a higher processing speed when it subsequently processes the second data.
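The one-to-one correspondence described in step S302 can be sketched as follows. This is a minimal illustration only: the function name `encode_groups`, the sample records, and the choice of UTF-8 as the "specified encoding format" are assumptions, and a plain character encoding by itself provides neither the confidentiality nor the compression mentioned above.

```python
from typing import List


def encode_groups(first_data: List[List[str]], encoding: str = "utf-8") -> List[List[bytes]]:
    """Encode every group of first data into a group of second data.

    Group i of the result corresponds to group i of the input, and record j
    inside a group corresponds to record j of the input group.
    """
    return [[record.encode(encoding) for record in group] for group in first_data]


# Hypothetical first data: two groups held by the client.
first_data = [["age=31", "city=Shenzhen"], ["age=45"]]
second_data = encode_groups(first_data)
```

Because the mapping keeps group and record positions, a party that knows the encoding rule can later decode each item back to its original form.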
In one embodiment, step S302 includes:
encoding, by each sub-client, the first data stored locally to obtain the corresponding second data.
In application, the client may consist of multiple sub-clients, and each sub-client has a built-in sub-encoding unit. Each sub-encoding unit encodes the first data of its corresponding sub-client to obtain one set of encoded second data. Having multiple sub-encoding units process the first data of their corresponding sub-clients in parallel can increase the speed at which the encoding module acquires and encodes the first data.
FIG. 4 exemplarily shows a schematic structural diagram of the data protection system 100 when the client 110 includes multiple sub-clients 111, where each sub-client 111 has a built-in sub-encoding unit 131.
Step S303: shuffling the multiple sets of second data by the shuffling server to eliminate the client information in the multiple sets of second data and to eliminate the timing information of the multiple sets of second data, where the timing information reflects the time and order in which the shuffling server acquires the multiple sets of second data.
In application, the multiple sets of second data may be shuffled by the shuffling server to eliminate the client information in the multiple sets of second data. The client information may be the metadata (Metadata) of the second data. The metadata may include the data source address of the client, the physical topology of the client, the system version information of the client, the domain name (Domain Name) of the client, the name of the database the client uses to store the second data, etc. The data source address may specifically include the client's IP address (Internet Protocol Address), interface address, MAC address (Media Access Control Address), etc. The physical topology of the client reflects all devices included in the client or all devices connected to the client.
In application, shuffling the multiple sets of second data by the shuffling server can also eliminate the timing information of the multiple sets of second data, where the timing information reflects the time and order in which the shuffling server acquires each set of second data.
In application, eliminating the client information of the second data prevents an attacker from obtaining that client information and from using it to further steal data stored on the client, which improves the security of the client information and of the client's data. Eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the clients from the timing information, which improves the privacy of the clients participating in the federated learning and improves the ability of the federated learning to defend against attacks.
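The shuffling in step S303 can be sketched as follows. This is a minimal sketch under stated assumptions, not the shuffling server of the application: the report layout and the field names `payload`, `source_ip`, and `received_at` are hypothetical, and a uniform random permutation stands in for whatever shuffling procedure an implementation actually uses.

```python
import random
from typing import Any, Dict, List, Optional


def shuffle_reports(reports: List[Dict[str, Any]],
                    rng: Optional[random.Random] = None) -> List[bytes]:
    """Strip client metadata from each report and randomize the order.

    Only the encoded payload survives, so neither the sender's metadata nor
    the arrival time or arrival order can be read from the output.
    """
    rng = rng or random.SystemRandom()  # OS-backed randomness by default
    payloads = [report["payload"] for report in reports]  # drop all other fields
    rng.shuffle(payloads)  # in-place uniform permutation
    return payloads


# Hypothetical reports as they might arrive at the shuffling server.
reports = [
    {"payload": b"a", "source_ip": "10.0.0.1", "received_at": 1},
    {"payload": b"b", "source_ip": "10.0.0.2", "received_at": 2},
    {"payload": b"c", "source_ip": "10.0.0.3", "received_at": 3},
]
shuffled = shuffle_reports(reports)
```

The output contains the same payloads as the input, in an order that no longer depends on when each report was received.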
In one embodiment, step S303 includes:
obtaining, by the shuffling server, the data volume of the multiple sets of second data;
adding noise to the data volume by the shuffling server so that the data volume satisfies differential privacy; or deleting a preset amount of data from the multiple sets of second data by the shuffling server so that the data volume satisfies differential privacy.
In application, the data volume of the multiple sets of second data may be obtained by the shuffling server. Specifically, when the second data consists of SQL (Structured Query Language) statements, the Count function may be used to determine the number of SQL statements in the multiple sets of second data, and thereby the data volume of the multiple sets of second data.
In application, after the data volume of the multiple sets of second data is determined, the shuffling server may add noise to the data volume. The noise type may include Laplace noise (Laplace Noise), Gaussian noise (Gaussian Noise), etc. The noise perturbs the data volume so that it satisfies differential privacy (Differential Privacy), which prevents an attacker from obtaining the data volume of the second data, or causes the attacker to obtain an incorrect data volume.
In application, after the data volume of the multiple sets of second data is determined, the shuffling server may instead delete a preset amount of data from the multiple sets of second data. The preset amount may be generated by a randomized algorithm (Randomized Algorithm) so that the amount deleted from the multiple sets of second data differs each time, and the data volume thus satisfies differential privacy.
In application, adding noise to the data volume by the shuffling server, or deleting a preset amount of data from the multiple sets of second data by the shuffling server, makes the data volume of the second data satisfy differential privacy. This prevents an attacker from obtaining the data volume of the second data and from using it to infer the number of clients and the correspondence between the second data and the clients, which improves the privacy of the clients participating in the federated learning and improves the ability of the federated learning to defend against output attacks.
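The noise-addition option can be sketched with Laplace noise as follows. The function name `noisy_count`, the privacy parameter `epsilon`, and the assumption that one record changes the count by at most 1 (sensitivity 1) are not taken from the application; they are illustrative choices. Production systems also need to guard against floating-point artifacts in the sampled noise, which this sketch does not attempt.

```python
import random
from typing import Optional


def noisy_count(true_count: int, epsilon: float,
                rng: Optional[random.Random] = None) -> int:
    """Return the record count perturbed with Laplace noise of scale 1/epsilon.

    Assumes a sensitivity of 1: adding or removing one record changes the
    true count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.SystemRandom()
    # The difference of two independent Exponential(rate=epsilon) draws
    # follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    # Rounding and clamping are post-processing and do not weaken the guarantee.
    return max(0, round(true_count + noise))


reported = noisy_count(1000, epsilon=0.5)
```

A smaller `epsilon` gives a larger noise scale, so the reported data volume reveals less about the true one at the cost of accuracy.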
In one embodiment, after step S303, the method further includes:
integrating, by the analysis server, the multiple sets of shuffled second data to obtain third data;
decoding, by the analysis server, the third data according to the encoding rule of the client, where the decoded third data is used to train a deep learning model.
In application, after the shuffling server finishes shuffling the multiple sets of second data, the analysis server may first obtain and integrate the multiple sets of shuffled second data to obtain the third data, and then decode the third data. Alternatively, the analysis server may first obtain and decode the multiple sets of shuffled second data, and then integrate the decoded second data. The embodiments of the present application do not limit the order in which the analysis server integrates and decodes the second data after obtaining it.
In application, the decoded third data may be used to train a deep learning model. At this point the client information has been eliminated from the decoded third data, as has the timing information with which the shuffling server acquired each set of second data included in the third data. If the analysis server is cracked by an attacker, this prevents the attacker from using the third data to determine the identities of the clients participating in the federated learning, and from using those identities to further obtain the local information stored on the clients, which improves the security of the clients' identity information and data.
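The statement that integration and decoding may happen in either order can be illustrated as follows. This is a sketch only: UTF-8 stands in for the client's encoding rule, and the function names are hypothetical.

```python
from itertools import chain
from typing import List


def integrate_then_decode(groups: List[List[bytes]], encoding: str = "utf-8") -> List[str]:
    """Merge the shuffled groups into third data, then decode the third data."""
    third_data = list(chain.from_iterable(groups))
    return [item.decode(encoding) for item in third_data]


def decode_then_integrate(groups: List[List[bytes]], encoding: str = "utf-8") -> List[str]:
    """Decode each shuffled group first, then merge the decoded groups."""
    decoded_groups = [[item.decode(encoding) for item in group] for group in groups]
    return list(chain.from_iterable(decoded_groups))


# Hypothetical shuffled second data received by the analysis server.
shuffled_groups = [[b"x=1", b"y=2"], [b"z=3"]]
result_a = integrate_then_decode(shuffled_groups)
result_b = decode_then_integrate(shuffled_groups)
```

With an item-wise decoding rule such as this one, both orders yield the same training data.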
In the data protection method provided by the embodiments of the present application, the client obtains the first data stored locally and encodes it to obtain multiple sets of second data, which prevents an attacker from obtaining the plaintext corresponding to the first data by cracking the client and improves the security of the first data. The shuffling server eliminates the client information of the second data, which prevents an attacker from obtaining that client information and using it to further steal data stored on the client, and improves the security of the client information and of the data stored on the client. Eliminating the timing information of the second data prevents an attacker from inferring the correspondence between the second data and the clients from the timing information, which improves the privacy of the clients participating in the federated learning and improves the ability of the federated learning to defend against attacks.
As shown in FIG. 5, one embodiment, based on the embodiment corresponding to FIG. 3, includes the following steps S501 to S510:
Step S501: obtaining, by the client, multiple sets of first data stored locally.
In application, the data protection method provided in step S501 is consistent with step S301 above and is not repeated here.
Step S502: encoding, by each sub-client, each piece of sub-data of the first data stored locally to obtain one corresponding set of second data, where the second data includes multiple pieces of sub-data that correspond one-to-one to the multiple pieces of sub-data of the first data.
In application, the sub-encoding unit built into a sub-client may have a first encoding mode and a second encoding mode. When the sub-encoding unit uses the first encoding mode, it may encode the pieces of sub-data of the first data one by one, with each encoding producing one piece of sub-data of the second data; it may also encode multiple pieces of sub-data of the first data in parallel to obtain one set of encoded second data. Each piece of sub-data in the second data has a corresponding piece of sub-data in the first data, and the second data has the same number of pieces of sub-data as the first data. The difference is that each piece of sub-data of the first data is unencoded plaintext data, whereas each piece of sub-data of the second data is encoded ciphertext data.
In application, when the sub-encoding unit encodes each piece of sub-data of the first data in the first encoding mode, it encodes every piece of sub-data without distinction, which increases the discreteness of the second data obtained after encoding.
Step S503: obtaining, by the sub-client, multiple groups of sub-data of the first data according to the data attribute of each piece of sub-data of the first data stored locally, where each group of sub-data of the first data includes at least one piece of sub-data with the same data attribute.
In application, when the sub-encoding unit uses the second encoding mode, it may determine the data attribute of each piece of sub-data of the first data and thereby obtain multiple groups of sub-data of the first data, where each group of sub-data includes at least one piece of sub-data with the same data attribute. The data attribute may specifically be a data type (for example character, integer, or floating point), a record (Tuple), a field (Field), a primary key (Primary Key), a foreign key (Foreign Key), etc. Obtaining multiple groups of sub-data of the first data classifies the sub-data in the first data by data attribute, so that each group of sub-data of the first data contains pieces of sub-data with the same data attribute.
Step S504: encoding, by the sub-client, each group of sub-data of the first data to obtain one corresponding set of second data, where the second data includes multiple groups of sub-data that correspond one-to-one to the multiple groups of sub-data of the first data.
In application, after the sub-encoding unit uses the second encoding mode to group the first data into multiple groups of sub-data, it may encode the groups of sub-data of the first data one by one. Specifically, it may encode the first data group by group, with each group producing one corresponding group of sub-data of the second data; it may also encode multiple groups of sub-data of the first data in parallel to obtain one set of encoded second data. Each group of sub-data in the second data has a corresponding group of sub-data in the first data, and the second data has the same number of groups of sub-data as the first data. The difference is that each group of sub-data of the first data is unencoded plaintext data, whereas each group of sub-data of the second data is encoded ciphertext data.
In application, when the sub-encoding unit uses the second encoding mode, classifying the sub-data of the first data by data attribute and encoding each group of sub-data of the first data yields encoded second data that includes multiple groups of sub-data. This gathers together the sub-data of the first data that share a data attribute and increases the cohesion of each group of sub-data in the encoded second data.
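The second encoding mode of steps S503 and S504 can be sketched as grouping followed by per-group encoding. The sketch assumes the data attribute is the data type, uses the Python type name as the group key, and uses UTF-8 text as a stand-in encoding; the function names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_attribute(sub_data: List[Any]) -> Dict[str, List[Any]]:
    """Step S503 (sketch): group pieces of sub-data that share a data type."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for item in sub_data:
        groups[type(item).__name__].append(item)
    return dict(groups)


def encode_by_attribute(groups: Dict[str, List[Any]],
                        encoding: str = "utf-8") -> Dict[str, List[bytes]]:
    """Step S504 (sketch): encode each group, keeping groups in one-to-one correspondence."""
    return {
        attribute: [str(value).encode(encoding) for value in items]
        for attribute, items in groups.items()
    }


first_data = [1, 2.5, "a", 3]
grouped = group_by_attribute(first_data)
second_data = encode_by_attribute(grouped)
```

The encoded result has exactly as many groups as the grouped first data, and each encoded group keeps the attribute label of the group it came from.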
In application, the user may choose the first encoding mode or the second encoding mode of the sub-encoding unit to encode the first data according to actual needs. The embodiments of the present application do not limit the encoding method of the encoding module or its sub-encoding units.
Step S505: shuffling, by the shuffling server, each piece of sub-data of the multiple sets of second data.
In application, the shuffling server may have a first shuffling mode and a second shuffling mode. The first shuffling mode may be used to shuffle second data obtained with either the first encoding mode or the second encoding mode. Specifically, when the sub-encoding units encode the first data in the first encoding mode and the resulting second data includes multiple pieces of sub-data, the shuffling server may obtain the multiple sets of second data output by all sub-encoding units and shuffle every piece of sub-data of the multiple sets of second data without distinction. When the sub-encoding units encode the first data in the second encoding mode and the resulting second data includes multiple groups of sub-data, the shuffling server may likewise obtain the multiple sets of second data output by all sub-encoding units and shuffle every piece of sub-data of the multiple sets of second data without distinction.
In application, the second shuffling mode may be used to shuffle second data obtained with the second encoding mode. The embodiments of the present application do not limit the shuffling method of the shuffling server. The shuffling method of the first shuffling mode is consistent with the shuffling method provided in step S303 above and is not repeated here. The shuffling method of the second shuffling mode is described below with reference to steps S506 and S507.
Step S506: reorganizing, by the shuffling server, the multiple sets of second data to obtain multiple sets of fourth data, where each set of fourth data includes multiple groups of sub-data with the same data attribute.
In application, after the sub-encoding units encode the first data in the second encoding mode, the shuffling server may use the second shuffling mode. Specifically, the shuffling server may determine the data attribute of each group of sub-data included in each set of second data and, according to those data attributes, generate multiple sets of fourth data, where each set of fourth data includes multiple groups of sub-data with the same data attribute. In this way the multiple sets of second data are integrated, and the sub-data with the same data attribute across the multiple sets of second data are combined into one set of fourth data.
For example, suppose the shuffling server receives two sets of second data. The first set of second data includes a first group of sub-data and a second group of sub-data, and the second set of second data includes a third group of sub-data and a fourth group of sub-data. The data attribute of the first group is floating point, that of the second group is integer, that of the third group is floating point, and that of the fourth group is integer. The shuffling server can then generate two sets of fourth data: the first set of fourth data includes the first and third groups of sub-data, both of which are floating point, and the second set of fourth data includes the second and fourth groups of sub-data, both of which are integer.
In application, reorganizing the multiple sets of second data by the shuffling server combines the groups of sub-data that share a data attribute, so that each set of fourth data contains sub-data with the same data attribute, which improves the identifiability and data usability of the fourth data.
Step S507: shuffling, by the shuffling server, each set of fourth data.
In application, the second shuffling mode differs from the first shuffling mode as follows. The first shuffling mode shuffles the multiple sets of second data together, and the shuffled sets of second data are output to the analysis server as a single data set. The second shuffling mode shuffles each set of fourth data separately, and each shuffled set of fourth data is output to the analysis server as its own data set. Because each shuffled set of fourth data contains sub-data with the same data attribute, the identifiability and data usability of the fourth data are improved, which helps the analysis server train the deep learning model in a targeted manner according to data attribute. Compared with the second shuffling mode, the sets of second data shuffled in the first shuffling mode have no correlation with one another and therefore offer higher privacy.
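The second shuffling mode of steps S506 and S507 can be sketched using the floating-point and integer example given for step S506. The sketch assumes that each group of sub-data arrives labeled with its data attribute, and it flattens the groups that share an attribute into one list before shuffling; the function name, the labels, and the sample values are hypothetical, and plain numbers stand in for encoded sub-data.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

# One set of second data: a list of (data attribute, group of sub-data) pairs.
SecondData = List[Tuple[str, List[Any]]]


def regroup_and_shuffle(second_data_sets: List[SecondData],
                        rng: Optional[random.Random] = None) -> Dict[str, List[Any]]:
    """Merge groups with the same attribute into fourth data, then shuffle each set."""
    rng = rng or random.SystemRandom()
    fourth_data: Dict[str, List[Any]] = defaultdict(list)
    for second_data in second_data_sets:        # one set per sub-client
        for attribute, group in second_data:    # step S506: regroup by attribute
            fourth_data[attribute].extend(group)
    for items in fourth_data.values():          # step S507: shuffle each set separately
        rng.shuffle(items)
    return dict(fourth_data)


received = [
    [("float", [1.5, 2.5]), ("int", [1, 2])],   # first set: groups 1 and 2
    [("float", [3.5]), ("int", [3])],           # second set: groups 3 and 4
]
fourth = regroup_and_shuffle(received)
```

Each set of fourth data is shuffled on its own, so items never move between attributes; that is what preserves usability at the cost of some privacy relative to the first shuffling mode.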
Step S508: integrating, by the analysis server, the multiple sets of shuffled second data to obtain third data;
Step S509: decoding, by the analysis server, the third data according to the encoding rule of the client, where the decoded third data is used to train a deep learning model.
In application, the analysis methods of steps S508 and S509 are consistent with the analysis method provided in step S303 above and are not repeated here.
Step S510: decoding, by the analysis server, each set of shuffled fourth data according to the encoding rule of the client, where each decoded set of fourth data is used to train a deep learning model.
In application, the decoding method in step S510 is consistent with the decoding method provided in step S303 above and is not repeated here. The difference is that the analysis server can obtain the data attribute of the sub-data included in each set of fourth data, so it can train the deep learning model in a targeted manner using fourth data with different data attributes, which improves the training efficiency of the analysis server.
In application, the encoding module of the client, or the sub-encoding unit of a sub-client, may use either the first encoding mode or the second encoding mode to encode the first data. The first encoding mode encodes the first data without distinction, which increases the discreteness of the encoded second data. The second encoding mode encodes according to the data attributes of the sub-data of the first data, which increases the cohesion of each group of sub-data in the second data. The shuffling server may use either the first shuffling mode or the second shuffling mode to shuffle the second data. The first shuffling mode shuffles the second data without distinction, which increases the privacy of the shuffled second data. The second shuffling mode reorganizes the multiple sets of second data according to data attribute to obtain fourth data that includes multiple groups of sub-data with the same data attribute, which improves the identifiability and data usability of the fourth data. With selectable encoding and shuffling modes, the user can, according to actual needs, favor the discreteness or the cohesion of the encoded second data, and the privacy of the shuffled second data or the data usability of the shuffled fourth data, which makes data processing more flexible.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed or recorded in one embodiment, refer to the relevant descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the terminal device embodiments described above are only illustrative. The division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
In the embodiments provided in the present application, the computer-readable storage medium may be non-volatile or volatile.
In the embodiments provided in the present application, the computer-readable storage medium may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, etc.; the data storage area may store data created through the use of blockchain nodes, etc.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of their technical features. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A data protection method applied to a data protection system, the data protection system comprising a shuffling server and a client, the shuffling server being connected to the client, wherein the method comprises:
    obtaining, by the client, multiple sets of first data stored locally;
    encoding, by the client, the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data corresponding one-to-one to the multiple sets of first data;
    shuffling, by the shuffling server, the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  2. The data protection method according to claim 1, wherein the data protection system further comprises an analysis server, the analysis server being connected to the shuffling server;
    after the shuffling, by the shuffling server, of the multiple sets of second data, the method further comprises:
    integrating, by the analysis server, the shuffled multiple sets of second data to obtain third data;
    decoding, by the analysis server, the third data according to an encoding rule of the client, the decoded third data being used for training a deep learning model.
  3. The data protection method according to claim 1, wherein the client comprises a plurality of sub-clients, each of the sub-clients being separately connected to the shuffling server, and each of the sub-clients locally storing one set of first data;
    the encoding, by the client, of the multiple sets of first data to obtain multiple sets of second data comprises:
    encoding, by each of the sub-clients, the first data stored locally to obtain corresponding second data.
  4. The data protection method according to claim 3, wherein the encoding, by the sub-client, of the first data stored locally to obtain corresponding second data comprises:
    encoding, by each of the sub-clients, each piece of sub-data of the first data stored locally to obtain one corresponding set of second data, the second data comprising multiple pieces of sub-data in one-to-one correspondence with the multiple pieces of sub-data of the first data.
  5. The data protection method according to claim 3, wherein the encoding, by the sub-client, of the first data stored locally to obtain corresponding second data comprises:
    acquiring, by the sub-client, multiple sets of sub-data of the first data according to the data attribute of each piece of sub-data of the first data stored locally, each set of sub-data of the first data comprising at least one piece of sub-data having the same data attribute;
    encoding, by the sub-client, each set of sub-data of the first data to obtain one corresponding set of second data, the second data comprising multiple sets of sub-data in one-to-one correspondence with the multiple sets of sub-data of the first data.
  6. The data protection method according to claim 1, wherein the shuffling, by the shuffling server, of the multiple sets of second data comprises:
    acquiring, by the shuffling server, the data volume of the multiple sets of second data;
    adding, by the shuffling server, noise to the data volume so that the data volume satisfies differential privacy; or deleting, by the shuffling server, a preset amount of data from the multiple sets of second data so that the data volume satisfies differential privacy.
  7. The data protection method according to claim 1, wherein the shuffling, by the shuffling server, of the multiple sets of second data comprises:
    shuffling, by the shuffling server, each piece of sub-data of the multiple sets of second data.
  8. The data protection method according to claim 1, wherein the shuffling, by the shuffling server, of the multiple sets of second data comprises:
    reorganizing, by the shuffling server, the multiple sets of second data to obtain multiple sets of fourth data, each set of fourth data comprising multiple sets of sub-data having the same data attribute;
    shuffling, by the shuffling server, each set of fourth data.
  9. The data protection method according to claim 8, wherein, after the shuffling, by the shuffling server, of each set of fourth data, the method further comprises:
    decoding, by an analysis server, each shuffled set of fourth data according to an encoding rule of the client, each decoded set of fourth data being used for training a deep learning model.
  10. A data protection system, wherein the system comprises:
    a client and a shuffling server, the shuffling server being connected to the client;
    the client is configured to acquire multiple sets of first data stored locally;
    the client is configured to encode the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
    the shuffling server is configured to shuffle the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  11. The data protection system according to claim 10, wherein the data protection system further comprises an analysis server, the analysis server being connected to the shuffling server;
    after the shuffling server shuffles the multiple sets of second data, further comprising:
    the analysis server integrates the shuffled multiple sets of second data to obtain third data;
    the analysis server decodes the third data according to an encoding rule of the client, the decoded third data being used for training a deep learning model.
  12. The data protection system according to claim 10, wherein the client comprises a plurality of sub-clients, each of the sub-clients being separately connected to the shuffling server, and each of the sub-clients locally storing one set of first data;
    the encoding, by the client, of the multiple sets of first data to obtain multiple sets of second data comprises:
    each of the sub-clients encodes the first data stored locally to obtain corresponding second data.
  13. The data protection system according to claim 12, wherein the encoding, by the sub-client, of the first data stored locally to obtain corresponding second data comprises:
    each of the sub-clients encodes each piece of sub-data of the first data stored locally to obtain one corresponding set of second data, the second data comprising multiple pieces of sub-data in one-to-one correspondence with the multiple pieces of sub-data of the first data.
  14. The data protection system according to claim 12, wherein the encoding, by the sub-client, of the first data stored locally to obtain corresponding second data comprises:
    the sub-client acquires multiple sets of sub-data of the first data according to the data attribute of each piece of sub-data of the first data stored locally, each set of sub-data of the first data comprising at least one piece of sub-data having the same data attribute;
    the sub-client encodes each set of sub-data of the first data to obtain one corresponding set of second data, the second data comprising multiple sets of sub-data in one-to-one correspondence with the multiple sets of sub-data of the first data.
  15. The data protection system according to claim 10, wherein the shuffling, by the shuffling server, of the multiple sets of second data comprises:
    the shuffling server acquires the data volume of the multiple sets of second data;
    the shuffling server adds noise to the data volume so that the data volume satisfies differential privacy; or the shuffling server deletes a preset amount of data from the multiple sets of second data so that the data volume satisfies differential privacy.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    acquiring, by a client, multiple sets of first data stored locally;
    encoding, by the client, the multiple sets of first data to obtain multiple sets of second data, the multiple sets of second data being in one-to-one correspondence with the multiple sets of first data;
    shuffling, by a shuffling server, the multiple sets of second data to eliminate client information in the multiple sets of second data and to eliminate timing information of the multiple sets of second data, the timing information reflecting the time and order in which the shuffling server acquires the multiple sets of second data.
  17. The storage medium according to claim 16, wherein the data protection system further comprises an analysis server, the analysis server being connected to the shuffling server;
    after the shuffling, by the shuffling server, of the multiple sets of second data, the at least one computer-readable instruction, when executed by the processor, further implements the following steps:
    integrating, by the analysis server, the shuffled multiple sets of second data to obtain third data;
    decoding, by the analysis server, the third data according to an encoding rule of the client, the decoded third data being used for training a deep learning model.
  18. The storage medium according to claim 16, wherein the client comprises a plurality of sub-clients, each of the sub-clients being separately connected to the shuffling server, and each of the sub-clients locally storing one set of first data;
    when the at least one computer-readable instruction is executed by the processor to implement the encoding, by the client, of the multiple sets of first data to obtain multiple sets of second data, the following is specifically included:
    encoding, by each of the sub-clients, the first data stored locally to obtain corresponding second data.
  19. The storage medium according to claim 18, wherein, when the at least one computer-readable instruction is executed by the processor to implement the encoding, by the sub-client, of the first data stored locally to obtain corresponding second data, the following is specifically included:
    acquiring, by the sub-client, multiple sets of sub-data of the first data according to the data attribute of each piece of sub-data of the first data stored locally, each set of sub-data of the first data comprising at least one piece of sub-data having the same data attribute;
    encoding, by the sub-client, each set of sub-data of the first data to obtain one corresponding set of second data, the second data comprising multiple sets of sub-data in one-to-one correspondence with the multiple sets of sub-data of the first data.
  20. The storage medium according to claim 16, wherein, when the at least one computer-readable instruction is executed by the processor to implement the shuffling, by the shuffling server, of the multiple sets of second data, the following is specifically included:
    acquiring, by the shuffling server, the data volume of the multiple sets of second data;
    adding, by the shuffling server, noise to the data volume so that the data volume satisfies differential privacy; or deleting, by the shuffling server, a preset amount of data from the multiple sets of second data so that the data volume satisfies differential privacy.
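
Illustrative sketch (not part of the claims). The flow recited in claims 1, 2 and 6 can be pictured with the short Python example below. It is a minimal sketch under stated assumptions, not the applicant's implementation: the claims do not specify the encoding rule, the record layout, or the noise distribution, so the base64 encoding, the hypothetical `Report` record, the function names, and the Laplace noise on the record count are all placeholders chosen for illustration.

```python
import base64
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Report:
    """What the shuffling server sees on arrival (hypothetical layout)."""
    client_id: str      # identifies the sending sub-client
    received_at: float  # arrival time at the shuffling server
    payload: bytes      # encoded sub-data ("second data")


def encode(record: str) -> bytes:
    # Assumed encoding rule: reversible base64. The claims only require
    # that the analysis server can decode using the client's rule.
    return base64.b64encode(record.encode("utf-8"))


def decode(payload: bytes) -> str:
    # Inverse of encode(), as used by the analysis server in claim 2.
    return base64.b64decode(payload).decode("utf-8")


def shuffle_reports(reports: List[Report],
                    rng: Optional[random.Random] = None) -> List[bytes]:
    """Strip client and timing metadata, then permute the payloads."""
    rng = rng or random.SystemRandom()
    payloads = [r.payload for r in reports]  # drops client_id and received_at
    rng.shuffle(payloads)                    # destroys arrival order
    return payloads


def noisy_count(n: int, epsilon: float,
                rng: Optional[random.Random] = None) -> float:
    """Record count plus Laplace noise of scale 1/epsilon (assumed mechanism).

    The difference of two independent Exp(epsilon) draws follows a
    Laplace distribution with scale 1/epsilon.
    """
    rng = rng or random.SystemRandom()
    return n + rng.expovariate(epsilon) - rng.expovariate(epsilon)
```

In this sketch the analysis server would receive only the output of `shuffle_reports` and call `decode` on each payload, so it never sees which sub-client sent a record or when it arrived. The alternative in claim 6 of deleting a preset amount of data is not shown.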
PCT/CN2022/090192 2022-01-12 2022-04-29 Data protection method and system, and storage medium WO2023134076A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210031150.3A CN114357521A (en) 2022-01-12 2022-01-12 Data protection method and system
CN202210031150.3 2022-01-12

Publications (1)

Publication Number Publication Date
WO2023134076A1 true WO2023134076A1 (en) 2023-07-20

Family

ID=81109382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090192 WO2023134076A1 (en) 2022-01-12 2022-04-29 Data protection method and system, and storage medium

Country Status (2)

Country Link
CN (1) CN114357521A (en)
WO (1) WO2023134076A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034328A (en) * 2023-10-09 2023-11-10 国网信息通信产业集团有限公司 Improved abnormal electricity utilization detection system and method based on federal learning

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357521A (en) * 2022-01-12 2022-04-15 平安科技(深圳)有限公司 Data protection method and system
CN117349841A (en) * 2022-06-27 2024-01-05 华为技术有限公司 Information processing method, chip, electronic device, and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695674A (en) * 2020-05-14 2020-09-22 平安科技(深圳)有限公司 Federal learning method and device, computer equipment and readable storage medium
CN113052333A (en) * 2021-04-02 2021-06-29 中国科学院计算技术研究所 Method and system for data analysis based on federal learning
CN113434873A (en) * 2021-06-01 2021-09-24 内蒙古大学 Federal learning privacy protection method based on homomorphic encryption
CN114357521A (en) * 2022-01-12 2022-04-15 平安科技(深圳)有限公司 Data protection method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034328A (en) * 2023-10-09 2023-11-10 国网信息通信产业集团有限公司 Improved abnormal electricity utilization detection system and method based on federal learning
CN117034328B (en) * 2023-10-09 2024-03-19 国网信息通信产业集团有限公司 Improved abnormal electricity utilization detection system and method based on federal learning

Also Published As

Publication number Publication date
CN114357521A (en) 2022-04-15

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22919717

Country of ref document: EP

Kind code of ref document: A1