CN114490704A

CN114490704A - Data processing method, device, equipment and storage medium

Info

Publication number: CN114490704A
Application number: CN202011271019.1A
Authority: CN
Inventors: 林江淼; 黄启军; 黄铭毅; 陈瑞钦; 刘玉德
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2022-05-13

Abstract

The invention discloses a data processing method, apparatus, device and storage medium. The method comprises: acquiring a database script statement, wherein the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of a first client terminal; determining a second client terminal according to the database script statement, wherein the second sample data is sample data in a local database of the second client terminal; and performing, according to the database script statement and an encryption algorithm, data alignment of the first sample data and the second sample data with the second client terminal to obtain intersection data of the first sample data and the second sample data, wherein the intersection data is used for federated learning. The invention can align sample data held on different client terminals in federated learning, reducing the complexity and improving the efficiency of sample data alignment in federated learning.

Description

Data processing method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and storage medium.

Background

Federated machine learning (also called federated learning) allows multiple parties to use data and build models collaboratively without the data leaving its local environment, and has become a common method in privacy-preserving computation.

In federated learning, the multiple participants in machine learning model training hold different but alignable data. For the modeling effect of federated learning to be close to that of pooling the data owned by all participants, data alignment is required between the participants before model training.

Because the data of each participant in federated learning is stored locally, data alignment among different participants is cross-platform or cross-network data alignment, and its complexity is high.

Disclosure of Invention

The main object of the invention is to provide a data processing method, apparatus, device and storage medium, aiming to solve the technical problem that data alignment between different client terminals in federated learning is highly complex.

In order to achieve the above object, the present invention provides a data processing method applied to a first client terminal, the method including:

acquiring a database script statement, wherein the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of the first client terminal;

determining a second client terminal according to the database script statement, wherein the second sample data is sample data in a local database of the second client terminal;

and performing, according to the database script statement and an encryption algorithm, data alignment of the first sample data and the second sample data with the second client terminal to obtain intersection data of the first sample data and the second sample data, wherein the intersection data is used for federated learning.

Optionally, the database script statement includes identification information of the first sample data, and before performing data alignment of the first sample data and the second sample data with the second client terminal, the method further includes:

and acquiring the first sample data from a local database of the first client terminal according to the identification information of the first sample data.

Optionally, the database script statement further includes identification information of the second sample data, and the determining a second client terminal according to the database script statement includes:

and determining the second client terminal according to the identification information of the second sample data and preset sample data distribution information, wherein the sample data distribution information is used for indicating the corresponding relation between the identification information of the sample data and the client terminal to which the sample data belongs.

Optionally, the database script statement further includes sample alignment reference information, where the sample alignment reference information includes one or more of the following: a sample ID and a sample characteristic; and the performing, according to the database script statement and the encryption algorithm, data alignment of the first sample data and the second sample data with the second client terminal includes:

according to the encryption algorithm, carrying out data alignment of at least one first element value and at least one second element value with the second client terminal to obtain the intersection data;

wherein the first element value is an element value corresponding to the sample alignment reference information in the first sample data, and the second element value is an element value corresponding to the sample alignment reference information in the second sample data.

Optionally, the performing, according to the encryption algorithm, data alignment of at least one first element value and at least one second element value with the second client terminal to obtain the intersection data includes:

encrypting each first element value to obtain first encrypted data;

sending the first encrypted data to the second client terminal, and receiving second encrypted data returned by the second client terminal, wherein the second encrypted data is associated with each encrypted second element value;

and according to the first encrypted data and the second encrypted data, performing data alignment on each encrypted first element value and each encrypted second element value to obtain the intersection data.

Optionally, before determining the second client terminal according to the database script statement, the method further includes:

compiling the database script statement to obtain compiled grammar units;

and obtaining the identification information of the first sample data, the identification information of the second sample data and the sample alignment reference information according to the compiled grammar units.

Optionally, the database script statement is a structured query language SQL statement, and the structured query language SQL statement includes the file name of the first sample data, the file name of the second sample data, and the sample alignment reference information.
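As a rough illustration of how the identification information and the sample alignment reference information might be recovered from such an SQL statement, the following Python sketch uses a regular expression rather than a real compiler producing grammar units. The statement form, the table names `sample_a` and `sample_b`, and the function `parse_alignment_statement` are assumptions made for illustration; they do not come from the text above.

```python
import re

# Hypothetical statement form; the text above does not prescribe a concrete syntax.
STATEMENT = "SELECT a.id FROM sample_a a INNER JOIN sample_b b ON a.id = b.id"

# Matches only the single assumed form:
# FROM <table1> <alias1> INNER JOIN <table2> <alias2> ON <alias1>.<col> = <alias2>.<col>
PATTERN = re.compile(
    r"FROM\s+(\w+)\s+(\w+)\s+INNER\s+JOIN\s+(\w+)\s+(\w+)\s+"
    r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)


def parse_alignment_statement(statement):
    """Extract both sample identifiers and the alignment column from the statement."""
    match = PATTERN.search(statement)
    if match is None:
        raise ValueError("statement does not match the assumed form")
    first_table, _, second_table, _, _, left_col, _, right_col = match.groups()
    if left_col != right_col:
        raise ValueError("this sketch only handles one shared alignment column")
    return {
        "first_sample": first_table,         # identification of the first sample data
        "second_sample": second_table,       # identification of the second sample data
        "alignment_reference": [left_col],   # e.g. the sample ID column
    }


parsed = parse_alignment_statement(STATEMENT)
```

A production implementation would more likely rely on a proper SQL parser so that every statement form the script language allows is handled.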

Optionally, the method further includes:

and sending the intersection data to the second client terminal.

Optionally, before performing data alignment between the first sample data and the second sample data with the second client terminal, the method further includes:

and sending the database script statement to the second client terminal.

The present invention also provides a data processing apparatus, comprising:

the acquisition module is used for acquiring a database script statement, wherein the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of the first client terminal;

the determining module is used for determining a second client terminal according to the database script statement, wherein the second sample data is sample data in a local database of the second client terminal;

and the intersection module is used for performing data alignment of the first sample data and the second sample data with the second client terminal according to the database script statement and the encryption algorithm to obtain intersection data of the first sample data and the second sample data, wherein the intersection data is used for federated learning.

The present invention also provides a data processing device, comprising: a memory, a processor, and a data processing program stored on the memory and executable on the processor, wherein the data processing program, when executed by the processor, implements the steps of the data processing method described in any of the above.

The invention also provides a computer readable storage medium having stored thereon a data processing program which, when executed by a processor, implements the steps of the data processing method described in any of the above.

In the invention, after the first client terminal acquires the database script statement associated with the first sample data and the second sample data, it determines, according to the database script statement, the second client terminal where the second sample data is located, and aligns the first sample data with the second sample data according to the database script statement and an encryption algorithm to obtain the intersection data of the first sample data and the second sample data. Thus, while meeting the data security requirements of federated learning, database script statements are used to realize data alignment between different client terminals in federated learning, which effectively reduces the complexity and improves the efficiency of data alignment between different client terminals.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present invention;

Fig. 4 is an exemplary diagram of data alignment for organization A and organization B provided by an embodiment of the invention;

Fig. 5 is a schematic structural diagram of a data processing apparatus according to the present invention;

Fig. 6 is a schematic structural diagram of a data processing apparatus according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 is an exemplary diagram of an application scenario provided in an embodiment of the present invention.

As shown in Fig. 1, the participants in federated learning include a server and K client terminals. During federated learning, the server issues a global model to each client terminal; each client terminal trains the global model using its local data to obtain trained model parameters and uploads them to the server; and the server aggregates the model parameters uploaded by the client terminals to obtain an updated global model. This process is repeated until the aggregated global model converges.

Each client terminal may comprise a terminal device and/or a server.

Federated learning includes two modes: horizontal federated learning and vertical federated learning.
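The training loop described above can be sketched in Python as follows. This is a toy illustration only: the "model" is a list of floats, `local_update` is a hypothetical stand-in for local training, and the unweighted average in `aggregate` is one common aggregation choice, not necessarily the one the server in Fig. 1 uses.

```python
def local_update(global_weights, local_data, lr=0.5):
    """Hypothetical local training: move each weight toward the mean of the local data."""
    target = sum(local_data) / len(local_data)
    return [w + lr * (target - w) for w in global_weights]


def aggregate(client_weights):
    """Server-side aggregation: plain unweighted average of the client parameters."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


def federated_rounds(global_weights, clients_data, tol=1e-6, max_rounds=100):
    """Repeat issue -> local training -> upload -> aggregate until the model stops changing."""
    for round_no in range(1, max_rounds + 1):
        # Each client trains the issued global model on its own local data.
        updates = [local_update(global_weights, data) for data in clients_data]
        # The server aggregates the uploaded parameters into a new global model.
        new_weights = aggregate(updates)
        delta = max(abs(a - b) for a, b in zip(new_weights, global_weights))
        global_weights = new_weights
        if delta < tol:  # simple convergence check
            break
    return global_weights, round_no


# Two clients (K = 2) whose local data never leaves the client-side function.
clients_data = [[1.0, 3.0], [5.0, 7.0]]
final_weights, rounds = federated_rounds([0.0], clients_data)
```

Only model parameters cross the boundary between `local_update` and `aggregate`; the raw `local_data` lists stay with their owners, which mirrors the privacy property described above.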

In horizontal federated learning, the local data of the different client terminals participating in federated learning contains user data of different users, but the user data overlaps substantially in user characteristics. For example, client terminal A has user data of user a, user b, user c and user d, including each user's age, occupation, income and the like, and client terminal B has user data of user e, user f and user g, including each user's age, income, consumption records and the like. Client terminal A and client terminal B thus hold user data of different users, but the user data in both includes the user characteristics age and income.

In vertical federated learning, the local data of the different client terminals participating in federated learning contains user data of overlapping users, but the user characteristics in the user data differ. For example, client terminal A has user data of user a, user b, user c and user d, whose user characteristics include the user's age, occupation, income and the like, and client terminal C has user data of user b, user c, user d and user f, whose user characteristics include the user's consumption records, travel records and the like. Both client terminal A and client terminal C thus hold user data for user b, user c and user d, but the user characteristics of the user data in client terminal A differ from those in client terminal C.
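The two examples above can be restated with Python sets, which makes the difference between the two modes explicit. The dictionary layout and the function `overlap` are illustrative assumptions; the users and characteristics are those named in the examples.

```python
# Users and user characteristics held by each client terminal in the examples above.
client_a = {"users": {"a", "b", "c", "d"}, "features": {"age", "occupation", "income"}}
client_b = {"users": {"e", "f", "g"}, "features": {"age", "income", "consumption_record"}}
client_c = {"users": {"b", "c", "d", "f"}, "features": {"consumption_record", "travel_record"}}


def overlap(x, y):
    """Return the users and the characteristics that two client terminals have in common."""
    return {
        "users": x["users"] & y["users"],
        "features": x["features"] & y["features"],
    }


horizontal = overlap(client_a, client_b)  # different users, shared characteristics
vertical = overlap(client_a, client_c)    # shared users, different characteristics
```

Here `horizontal` has no common users but shares age and income, while `vertical` shares users b, c and d but no characteristics.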

In both horizontal and vertical federated learning, data alignment between different client terminals, also referred to as "library collision", must be performed before federated modeling to obtain the intersection data between the client terminals; horizontal or vertical federated modeling is then carried out based on the intersection data.

For example, in the application scenario shown in Fig. 1, before the server issues the global model to each client terminal, or before each client terminal trains the global model issued by the server using its local data, the client terminals need to perform data alignment (library collision) to obtain the intersection data between them. Each client terminal then trains the global model issued by the server based on its local data and the intersection data.

For example, client terminal A and client terminal B may perform data alignment based on user characteristics, in which case the intersection data of client terminal A and client terminal B is the user characteristics age and income. Client terminal A can train the global model issued by the server on the age and income of the users it holds, and client terminal B can do the same with the users it holds, so that the effect of the global model finally aggregated by the server approaches the effect of modeling on the pooled user data of client terminal A and client terminal B.

Because the user data in each client terminal used for federated learning is usually stored in a database, each client terminal has a mature database scripting language available for data analysis and data processing, and the data of a client terminal does not leave that terminal. Computing an intersection of data with a database scripting language is simple and efficient, but aligning the user data of different client terminals requires database operations among multiple parties and involves data encryption. The traditional approach of performing data statistics and analysis with a database scripting language is usually only suitable for a single client terminal computing intersections over its own local user data, and is difficult to apply directly to data alignment in federated learning.

In view of this, an embodiment of the present invention provides a data processing method in which a database script statement associated with first sample data and second sample data instructs a first client terminal and a second client terminal to align the first sample data and the second sample data. After the first client terminal obtains the database script statement, the first client terminal determines the second client terminal according to the database script statement, and aligns the first sample data and the second sample data with the second client terminal according to the database script statement and an encryption algorithm to obtain the final intersection data. The first client terminal and the second client terminal are different client terminals in federated learning. The embodiment thus exploits the fact that client terminals in federated learning store user data in databases: by using database script statements together with an encryption algorithm, it aligns sample data between different client terminals while keeping the user data on each client terminal secure, which reduces the complexity and improves the efficiency of sample data alignment between different client terminals in federated learning.

Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present invention. The method is applied to a first client terminal and, as shown in Fig. 2, may include:

Step 201, obtaining a database script statement, where the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of the first client terminal.

A database scripting language is a non-procedural programming language for operating a database (for example, creating a data table in the database, querying the database, or updating the database).

The database script statement may include identification information of the first sample data (e.g., a database name of a database where the first sample data is located, a table name of a data table where the first sample data is located) and identification information of the second sample data (e.g., a database name of a database where the second sample data is located, a table name of a data table where the second sample data is located), or may include identification information of the first client terminal (e.g., a device identifier of the first client terminal, a network address) and identification information of the second client terminal (e.g., a device identifier of the second client terminal, a network address) for instructing the first client terminal and the second client terminal to perform data alignment of the first sample data and the second sample data.

The first sample data is sample data in a local database of the first client terminal, and the second sample data is sample data in a local database of the second client terminal. The first sample data comprises sample data of one or more samples, wherein the sample data comprises a sample ID and sample characteristics of the samples, and the sample ID of each sample is unique. Similarly, the second sample data comprises sample data of one or more samples, the sample data comprising a sample ID and sample characteristics of the sample, the sample ID of each sample being unique.

In horizontal federated learning, the sample IDs in the first sample data differ from those in the second sample data, but the sample characteristics overlap; for example, the first sample data includes the sample feature a1 of sample a, and the second sample data includes the sample feature a1 of sample e. In vertical federated learning, the sample IDs in the first sample data overlap with those in the second sample data; for example, the first sample data includes the sample feature a1 of sample d, and the second sample data includes the sample feature b1 of sample d.

Optionally, the sample may be a user, the sample data of the sample may be user data of the user, the user data includes a user ID and a user characteristic, and the user ID is unique. The user ID includes, for example, one or more of the following: the user number, the identification number, the bank card number, the terminal device number, etc., and the user characteristics include, for example, one or more of the following: the user's name, age, occupation, income, consumption records, etc.

Optionally, a preset database script statement is obtained.

Optionally, a database script statement input by the user is obtained. For example, a user inputs a database script statement on a terminal device of a first client terminal, and the terminal device of the first client terminal sends the database script statement input by the user to a server of the first client terminal, so that the server of the first client terminal and the server of a second client terminal perform alignment of first sample data and second sample data.

Optionally, the database script statement sent by the second client terminal is received.

For example, before performing data alignment between the first sample data and the second sample data, the first client terminal may send a request for performing data alignment to the second client terminal, and the second client terminal sends a database script statement preset on the second client terminal to the first client terminal in response to the received data alignment request. For another example, when it is detected that the current time is the preset time, the second client terminal actively sends the database script statement to the first client terminal.

Step 202, determining the second client terminal according to the database script statement.

Specifically, after the database script statement is obtained, if the database script statement includes the identification information of the second sample data, the second client terminal where the second sample data is located may be determined according to the identification information of the second sample data; or if the database script statement comprises the identification information of the second client terminal, determining the second client terminal where the second sample data is located according to the identification information of the second client terminal; alternatively, if the database script statement is transmitted from the second client terminal to the first client terminal, the second client terminal may be determined according to the transmitting device of the database script statement.

Optionally, when the second client terminal where the second sample data is located is determined according to the identification information of the second sample data, the second client terminal is determined according to the identification information of the second sample data and preset sample data distribution information, where the sample data distribution information is used to indicate a correspondence between the identification information of the sample data and the client terminal to which the sample data belongs. Specifically, the identification information of the second sample data is used as an index to look up the corresponding client terminal in the sample data distribution information, and that client terminal is determined as the second client terminal. Because the sample data distribution information records which client terminal each piece of sample data resides on, the other client terminal can be determined accurately and quickly when data alignment is performed.

Step 203, aligning the first sample data and the second sample data with the second client terminal according to the database script statement and the encryption algorithm to obtain intersection data of the first sample data and the second sample data, wherein the intersection data is used for federated learning.

In order to ensure data security, the encryption algorithm is an asymmetric encryption algorithm. Examples of asymmetric encryption algorithms include the RSA algorithm, the Digital Signature Algorithm (DSA), Elliptic Curve Cryptography (ECC), and the like.
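The lookup in the sample data distribution information amounts to an indexed mapping. A minimal Python sketch follows, assuming the distribution information is held as a dictionary; the identifiers `sample_a`, `sample_b` and the client terminal names are hypothetical.

```python
# Hypothetical preset sample data distribution information:
# identification of sample data -> client terminal that holds it.
SAMPLE_DISTRIBUTION = {
    "sample_a": "client_terminal_A",
    "sample_b": "client_terminal_B",
}


def find_client_terminal(sample_identifier, distribution=SAMPLE_DISTRIBUTION):
    """Use the sample identification as an index to find the owning client terminal."""
    try:
        return distribution[sample_identifier]
    except KeyError:
        raise LookupError(
            f"no client terminal recorded for {sample_identifier!r}"
        ) from None


second_client = find_client_terminal("sample_b")
```

In practice the mapped value would more likely be a device identifier or network address of the client terminal, as mentioned for the identification information in step 201.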

Specifically, after the database script statement is obtained, the first sample data is obtained in the first client terminal according to the database script statement, and the first client terminal may encrypt the first sample data according to the encryption algorithm. The second client terminal can determine the second sample data according to the same database script statement, encrypt the second sample data using the same encryption algorithm, and send the encrypted second sample data to the first client terminal. The first client terminal then performs data alignment on the encrypted first sample data and the encrypted second sample data to obtain their intersection, and, using the correspondence between the first sample data and the encrypted first sample data, maps this encrypted intersection back to the intersection data of the first sample data and the second sample data, for example the user IDs present in both the first sample data and the second sample data.
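The flow above (protect each side with the same transformation, exchange only the protected values, intersect, then map back) can be sketched in Python as follows. Note the substitution: for brevity this sketch uses a keyed hash (HMAC-SHA256 from the standard library) as a stand-in for the asymmetric encryption algorithm described in the text. The shared key, the sample IDs and the function names are assumptions for illustration.

```python
import hashlib
import hmac

# Hypothetical key agreed by both client terminals; a stand-in for the
# encryption parameters that both parties would share in the described scheme.
SHARED_KEY = b"demo-key-agreed-by-both-parties"


def protect(value, key=SHARED_KEY):
    """Deterministically transform one value so equal inputs give equal outputs."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_all(values):
    """Return a mapping protected value -> original value (kept locally only)."""
    return {protect(v): v for v in values}


# Sample IDs held locally by each client terminal (illustrative data).
first_ids = ["u001", "u002", "u003", "u004"]
second_ids = ["u002", "u003", "u004", "u006"]

# First client terminal: protects its IDs and keeps the correspondence.
first_lookup = protect_all(first_ids)

# Second client terminal: protects its IDs and sends only the protected values.
second_protected = set(protect_all(second_ids))

# First client terminal: intersect protected values, then map back to plain IDs.
common = first_lookup.keys() & second_protected
intersection_ids = sorted(first_lookup[p] for p in common)
```

The first client terminal never sees `u006`, and it recovers plain IDs only for values it already holds, via its own `first_lookup` table.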

When the first client terminal encrypts the first sample data according to the encryption algorithm, part of or all of the data in the first sample data may be encrypted, for example, a sample ID of each sample in the first sample data is encrypted, or a sample characteristic of each sample in the first sample data is encrypted, or a sample ID and a sample characteristic of each sample in the first sample data are encrypted. Similarly, when the second client terminal encrypts the second sample data according to the encryption algorithm, part of or all of the data in the second sample data may be encrypted.

If the sample ID of each sample in the first sample data is encrypted and the sample ID of each sample in the second sample data is encrypted, the first client terminal and the second client terminal may obtain intersection data of the sample ID of the first sample data and the sample ID of the second sample data. If the sample characteristics of each sample in the first sample data are encrypted and the sample characteristics of each sample in the second sample data are encrypted, the first client terminal and the second client terminal can obtain intersection data of the sample characteristics of the first sample data and the sample characteristics of the second sample data.

Optionally, the first client terminal and the second client terminal may each align the encrypted first sample data with the encrypted second sample data, so that each obtains the intersection data; or the alignment may be performed on the first client terminal, which then sends the resulting intersection data to the second client terminal; or the alignment may be performed on the second client terminal, which then sends the intersection data to the first client terminal.

Optionally, when the first sample data is obtained in the first client terminal according to the database script statement, all local sample data in the local database on the first client terminal may be obtained, or local sample data for data alignment preset in the local database by the user on the first client terminal may also be obtained. Or when the first sample data is acquired in the first client terminal according to the database script statement, the sample data corresponding to the identification information of the first sample data may be searched in the local database in the first client terminal according to the identification information of the first sample data, that is, the first sample data is acquired, so that the first sample data for data alignment may be specified by the database script statement.

The data processing method provided by this embodiment exploits the fact that client terminals in federated learning store user data in databases, and realizes sample data alignment between the first client terminal and the second client terminal by using a database script statement and an encryption algorithm. Thus, while ensuring the security of the user data on each client terminal, sample data is aligned between different client terminals in federated learning, the complexity of this alignment is reduced, and its efficiency is improved.

In some embodiments, the database script statement further includes sample alignment reference information, which includes one or more of the following: a sample ID and a sample characteristic. The sample alignment reference information is used to determine the range of data alignment between the first sample data and the second sample data. If the sample alignment reference information is the sample ID, the database script statement indicates that the sample ID of each sample in the first sample data is to be aligned with the sample ID of each sample in the second sample data. If the sample alignment reference information is the sample characteristic, the database script statement indicates that the sample characteristic of each sample in the first sample data is to be aligned with the sample characteristic of each sample in the second sample data. Alignment of the sample IDs and/or sample characteristics of the first sample data and the second sample data is thus realized according to the database script statement.

And under the condition that the database script statement comprises sample alignment reference information and the sample alignment reference information comprises a sample ID and/or a sample characteristic, performing data alignment of at least one first element value and at least one second element value by the first client terminal and the second client terminal according to an encryption algorithm to obtain intersection data. The first element value is an element value corresponding to the sample alignment reference information in the first sample data, and the second element value is an element value corresponding to the sample alignment reference information in the second sample data.

Specifically, the first client terminal obtains sample alignment reference information in a database script statement, and obtains a first element value corresponding to the sample alignment reference information in the first sample data. And the first client terminal encrypts each first element value through an encryption algorithm. And the second client terminal acquires the sample alignment reference information in the database script statement, acquires second element values corresponding to the sample alignment reference information in second sample data, and encrypts each second element value through an encryption algorithm. The first client terminal may determine intersection data between the first sample data and the second sample data based on intersection data of the first element value and the second element value subjected to the same encryption processing.
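How the sample alignment reference information narrows the comparison to particular element values can be sketched as below. This is one possible reading, in which the reference information names the columns whose values are compared; the encryption step is deliberately omitted here to keep the focus on element selection. The row layout, column names and the function `element_values` are illustrative assumptions.

```python
def element_values(rows, reference_columns):
    """Collect the element values named by the alignment reference information."""
    return {tuple(row[col] for col in reference_columns) for row in rows}


# Illustrative local sample data of the two client terminals.
first_sample = [
    {"id": "d", "age": 30, "income": 5000},
    {"id": "a", "age": 41, "income": 7200},
]
second_sample = [
    {"id": "d", "age": 30, "consumption": 120},
    {"id": "f", "age": 41, "consumption": 80},
]

# Alignment reference = sample ID only.
by_id = element_values(first_sample, ["id"]) & element_values(second_sample, ["id"])

# Alignment reference = sample ID and one sample characteristic.
by_id_and_age = (
    element_values(first_sample, ["id", "age"])
    & element_values(second_sample, ["id", "age"])
)
```

In the described method each element value would be encrypted before being compared, so that neither client terminal sees the other's plain values.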

Fig. 3 is a schematic flow diagram of a data processing method according to an embodiment of the present invention, in which the database script statement includes identification information of the first sample data, identification information of the second sample data, and sample alignment reference information, and the sample alignment reference information includes a sample ID and/or a sample feature. As shown in Fig. 3, the method, as applied to the first client terminal, may include:

Step 301, the first client terminal obtains a database script statement.

Step 302, the second client terminal obtains the database script statement.

Optionally, the first client terminal obtains a database script statement preset at the first client terminal, and the second client terminal obtains a database script statement preset at the second client terminal, where the database script statement of the first client terminal is consistent with the database script statement of the second client terminal.

Optionally, the first client terminal obtains a database script statement input by a user, and the second client terminal obtains a database script statement input by a user. For example, the user of the first client terminal and the user of the second client terminal agree to input the same database script statement to the first client terminal and the second client terminal, respectively.

Optionally, the first client terminal obtains a database script statement input by a user and sends the database script statement to the second client terminal; or the second client terminal obtains a database script statement input by a user and sends the database script statement to the first client terminal. This ensures that the database script statement obtained by the first client terminal is identical to the database script statement obtained by the second client terminal.

For the contents of the database script statement, the first sample data, and the second sample data, reference may be made to the foregoing embodiments, which are not repeated herein.

It should be noted that step 301 and step 302 may be executed synchronously or asynchronously, and the order of execution of step 301 and step 302 is not limited herein.

Step 303, the first client terminal acquires at least one first element value in the first sample data according to the database script statement, and encrypts each first element value to obtain first encrypted data.

Specifically, the first client terminal may obtain the identification information of the first sample data and the sample alignment reference information from the database script statement. According to the identification information of the first sample data, the first client terminal acquires the first sample data from the local database, and then acquires from the first sample data at least one element value corresponding to the sample alignment reference information, thereby obtaining at least one first element value.

If the sample alignment reference information includes the sample ID, the element values corresponding to the sample ID are acquired from the first sample data. For example, if the first sample data includes sample a, sample b, sample c, and sample d, whose sample IDs are 1, 2, 3, and 4 respectively, the element values corresponding to the sample ID in the first sample data are 1, 2, 3, and 4.

If the sample alignment reference information includes a sample feature, an element value corresponding to the sample feature is acquired from the first sample data. For example, if the sample alignment reference information includes the user age, the user age of each sample needs to be obtained from the first sample data.
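The acquisition of element values described above can be sketched as follows. This is a minimal illustration only: the in-memory list of dictionaries, the field names `sample_id` and `user_age`, and the helper `extract_element_values` are assumptions made for the example and do not appear in the embodiment.

```python
from typing import Any, Dict, List


def extract_element_values(samples: List[Dict[str, Any]], reference_key: str) -> List[Any]:
    """Return the element value stored under reference_key for every sample."""
    return [sample[reference_key] for sample in samples]


# Hypothetical first sample data: samples a to d with sample IDs 1 to 4.
first_sample_data = [
    {"sample_id": 1, "user_age": 31},
    {"sample_id": 2, "user_age": 45},
    {"sample_id": 3, "user_age": 27},
    {"sample_id": 4, "user_age": 52},
]

# Alignment on the sample ID uses the ID column as the first element values.
id_values = extract_element_values(first_sample_data, "sample_id")

# Alignment on a sample feature (here the user age) uses that column instead.
age_values = extract_element_values(first_sample_data, "user_age")
```

Whichever column is selected, the resulting list plays the role of the first element values that are subsequently encrypted.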

Specifically, since the asymmetric encryption algorithm employs a public key and a private key, in order to ensure data security, the first client terminal and the second client terminal may respectively hold one of the public key and the private key. For example, a first client terminal holds a public key, and a second client terminal holds a private key; or the first client terminal holds the private key and the second client terminal holds the public key. Taking the example that the first client terminal holds the public key, the first client terminal may encrypt each first element value according to the public key and an encryption algorithm to obtain first encrypted data.

Step 304, the first client terminal determines a second client terminal according to the database script statement.

Specifically, the first client terminal may obtain identification information of second sample data in the database script statement, and determine the second client terminal according to the identification information of the second sample data and preset sample data distribution information. Reference may be made to the related contents of the foregoing embodiments, and details are not repeated.
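As a rough sketch of this lookup, the preset sample data distribution information can be pictured as a mapping from the identification information of sample data to the client terminal that holds it. The dictionary contents, the terminal names, and the function `determine_client_terminal` are hypothetical and serve only to illustrate the correspondence.

```python
from typing import Dict

# Hypothetical sample data distribution information:
# identification information of sample data -> client terminal holding it.
SAMPLE_DATA_DISTRIBUTION: Dict[str, str] = {
    "A": "client-terminal-1",
    "B": "client-terminal-2",
}


def determine_client_terminal(data_identifier: str, distribution: Dict[str, str]) -> str:
    """Look up the client terminal to which the identified sample data belongs."""
    try:
        return distribution[data_identifier]
    except KeyError:
        raise LookupError(
            f"no client terminal registered for sample data {data_identifier!r}"
        ) from None


# The statement names second sample data "B", so the peer is its owner.
second_client_terminal = determine_client_terminal("B", SAMPLE_DATA_DISTRIBUTION)
```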

Optionally, after obtaining the database script statement, the first client terminal inputs the database script statement into the compiler to compile the database script statement to obtain a compiled syntax unit, and obtains the identification information of the first sample data, the identification information of the second sample data, and the sample alignment reference information according to the compiled syntax unit. The syntax unit is a machine language which can be understood by a computer, and the identification information of the first sample data, the identification information of the second sample data and the sample alignment reference information can be directly read from the compiled syntax unit.

Step 305, the first client terminal sends the first encrypted data to the second client terminal.

Specifically, after the first encrypted data is obtained, the first client terminal sends the first encrypted data to the second client terminal, and the first encrypted data is used in a data encryption process of the second client terminal.

Step 306, the second client terminal acquires at least one second element value in the second sample data according to the database script statement, and obtains second encrypted data according to the first encrypted data and each second element value.

The second encrypted data is associated with each encrypted second element value.

Specifically, the process of the second client terminal obtaining at least one second element value in the second sample data according to the database script statement may refer to the process of the first client terminal obtaining at least one first element value in the first sample data according to the database script statement, and is not repeated.

Specifically, after obtaining the first encrypted data, the second client terminal may process the first encrypted data with the private key and the encryption algorithm to obtain the first encrypted data encrypted by the private key. The second client terminal also processes each second element value with the private key and the encryption algorithm to obtain each second element value encrypted by the private key. The second encrypted data is formed from the first encrypted data encrypted by the private key and each second element value encrypted by the private key. Because the first client terminal encrypts the first sample data and the second client terminal encrypts the second sample data, the data security of both the first sample data and the second sample data is ensured.

Optionally, after obtaining the database script statement, the second client terminal inputs the database script statement into the compiler to compile the database script statement to obtain a compiled syntax unit, and obtains the identification information of the first sample data, the identification information of the second sample data, and the sample alignment reference information according to the compiled syntax unit.

Step 307, the second client terminal sends the second encrypted data to the first client terminal.

Step 308, the first client terminal obtains each encrypted first element value and each encrypted second element value according to the first encrypted data and the second encrypted data, and performs data alignment to obtain intersection data.

Specifically, the second encrypted data obtained by the first client terminal includes the first encrypted data encrypted by the private key and each second element value encrypted by the private key, and the first encrypted data consists of the first element values encrypted by the public key. Based on the first encrypted data and the first encrypted data encrypted by the private key, the first client terminal therefore removes the part contributed by the public key and obtains the first element values encrypted only by the private key. The first client terminal then performs an intersection operation on the first element values encrypted by the private key and the second element values encrypted by the private key to obtain their intersection data. Because the first client terminal holds the first element values, it can derive from this encrypted intersection the intersection data of the first element values and the second element values, that is, the intersection data of the first sample data and the second sample data obtained based on the database script statement. Throughout the data alignment performed by the first client terminal and the second client terminal, the data security of the first sample data, the second sample data, and their intersection data is thus ensured, and the alignment between the first sample data and the second sample data is completed.

The data processing method provided by this embodiment uses an encryption algorithm together with a database script statement that includes the identification information of the first sample data, the identification information of the second sample data, and the sample alignment reference information. It thereby aligns the first sample data and the second sample data between the first client terminal and the second client terminal while ensuring the data security of both. By combining the database script language with data encryption, the method exploits the simplicity and efficiency of the database script language in computing intersections, and so improves the efficiency of data alignment between the first client terminal and the second client terminal in federated learning.

In some embodiments, when the first client terminal encrypts each first element value, a corresponding random number may be generated for each first element value in order to prevent the second client terminal from recovering the first element value with the private key; the random numbers corresponding to different first element values are different. The first client terminal encrypts each random number with the public key to obtain the obfuscation factor of the corresponding first element value, and obtains the hash value of each first element value through a hash algorithm. The public-key encrypted value corresponding to each first element value is then obtained from the obfuscation factor and the hash value of that first element value. The public-key encrypted values corresponding to the first element values form the first encrypted data.

Correspondingly, after receiving the first encrypted data, the second client terminal may process each public-key encrypted value in the first encrypted data with the private key to obtain the first encrypted data encrypted by the private key. In each public-key encrypted value, only the obfuscation factor is encrypted by the public key; the hash value of the first element value is not. Processing a public-key encrypted value with the private key is therefore equivalent to decrypting the obfuscation factor in that value while simultaneously encrypting the hash value of the first element value with the private key.

Correspondingly, when the second client terminal encrypts each second element value with the private key, it may obtain the hash value of each second element value through the same hash algorithm as the first client terminal, and encrypt that hash value with the private key to obtain the private-key encrypted value corresponding to each second element value. The second encrypted data therefore includes the private-key encrypted value of each second element value and the first encrypted data encrypted by the private key, in which each value is a first element value protected by both the obfuscation factor and the private key.

Correspondingly, after the first client terminal receives the second encrypted data, because the first client terminal stores the obfuscation factors of the first element values, the first client terminal can process the private key encrypted values in the first encrypted data encrypted by the private key to remove the obfuscation factors in the private key encrypted values, and obtain the first element values encrypted only by the private key. Therefore, the first client terminal obtains the first element value encrypted by the private key and the second element value encrypted by the private key, and can further obtain intersection data of the first element value encrypted by the private key and the second element value encrypted by the private key.
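The round trip for a single first element value can be sketched as below, assuming the encryption algorithm is textbook RSA. The key built from two Mersenne primes, the SHA-256 hash, and all helper names are assumptions chosen so that the example runs with the standard library alone; such a key is not secure and is used for illustration only.

```python
import hashlib
import math
import secrets

# Toy RSA key built from two Mersenne primes; illustration only, NOT secure.
P, Q = 2**107 - 1, 2**127 - 1
N = P * Q                             # modulus n
E = 65537                             # public exponent e
D = pow(E, -1, (P - 1) * (Q - 1))     # private exponent d (Python 3.8+)


def hash_to_int(value: object) -> int:
    """Hash an element value to an integer modulo n (assumed hash: SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def random_invertible() -> int:
    """Draw a random number that has a modular inverse modulo n."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return r


# First client terminal: obfuscate the hash of one element value.
element_value = "sample-42"
r = random_invertible()
obfuscation_factor = pow(r, E, N)     # random number encrypted by the public key
public_key_encrypted = (obfuscation_factor * hash_to_int(element_value)) % N

# Second client terminal: process the received value with the private key.
private_key_encrypted = pow(public_key_encrypted, D, N)

# First client terminal: remove the obfuscation factor using the stored r.
unblinded = (private_key_encrypted * pow(r, -1, N)) % N

# Value obtained by encrypting the hash directly with the private key.
direct = pow(hash_to_int(element_value), D, N)
```

After unblinding, the first client terminal holds the same value that direct private-key encryption of the hash would give, although the second client terminal never saw the hash itself.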

As an example, the first client terminal first generates a corresponding random number for each first element value, and encrypts the random number with the public key to obtain the obfuscation factor of each first element value. The first client terminal then obtains the hash value of each first element value and obfuscates it with the obfuscation factor to obtain an obfuscation result for each first element value. A modulo operation is performed on each obfuscation result, and the obfuscation results after the modulo operation form the first encrypted data. For example, the first encrypted data $Y_A$ can be expressed as:

$$Y_A = \{\, (r_i^{e} \bmod n) \cdot H(u_i) \bmod n \,\}$$

where the public key is expressed as $(n, e)$, $r_i$ is the random number corresponding to the $i$-th first element value, $u_i \in X_A$ denotes the $i$-th first element value, $X_A$ denotes the set of all first element values, and $H(u_i)$ denotes the hash value of $u_i$.

As an example, when encrypting each second element value with the preset private key, the second client terminal first obtains the hash value of each second element value, encrypts that hash value with the private key, and hashes the result again to obtain the private-key encrypted value corresponding to each second element value. For example, the set $Z_B$ of private-key encrypted values corresponding to the second element values can be expressed as:

$$Z_B = \{\, H\big(H(u_j)^{d} \bmod n\big) \,\}$$

where $u_j \in X_B$ denotes the $j$-th second element value, $X_B$ denotes the set of all second element values, and the private key is expressed as $(n, d)$.

As an example, the second client terminal encrypts the first encrypted data with the private key, and the resulting first encrypted data encrypted by the private key, $Z_A$, can be expressed as:

$$Z_A = \{\, \big(r_i^{e} \cdot H(u_i)\big)^{d} \bmod n \,\} = \{\, r_i \cdot H(u_i)^{d} \bmod n \,\}$$

As an example, after receiving the second encrypted data, the first client terminal processes the first encrypted data encrypted by the private key, and obtains the set of first element values encrypted by the private key as:

$$D_A = \{\, H\big(r_i \cdot H(u_i)^{d} / r_i \bmod n\big) \,\} = \{\, H\big(H(u_i)^{d} \bmod n\big) \,\}$$

Thus, the first client terminal can intersect $D_A$ with $Z_B$ to obtain the intersection data. In the formula, the first client terminal performs a modular operation on $Z_A$ (the modular division by $r_i$ in the formula) to remove the influence of the random number $r_i$, that is, of the obfuscation factor, and thereby obtains the first element values encrypted only by the private key.
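The exchange of $Y_A$, $Z_A$, $Z_B$ and $D_A$ can be put together in a short sketch. It assumes textbook RSA with a toy key built from two Mersenne primes and SHA-256 for both the inner and the outer hash; the function names and the sample ID lists are hypothetical, and the key is far too weak for real use.

```python
import hashlib
import math
import secrets
from typing import List, Set, Tuple

# Toy RSA key from two Mersenne primes; for illustration only, NOT secure.
P, Q = 2**107 - 1, 2**127 - 1
N, E = P * Q, 65537                   # public key (n, e)
D = pow(E, -1, (P - 1) * (Q - 1))     # private exponent d (Python 3.8+)


def inner_hash(value: object) -> int:
    """H(u): hash an element value to an integer modulo n."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def outer_hash(number: int) -> str:
    """Outer H(.): hash a private-key encrypted integer to a comparable tag."""
    return hashlib.sha256(str(number).encode("utf-8")).hexdigest()


def random_invertible() -> int:
    """Random number r_i that has a modular inverse modulo n."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return r


def first_blind(x_a: List[object]) -> Tuple[List[int], List[int]]:
    """First client terminal: Y_A = { r_i^e * H(u_i) mod n }; keeps every r_i."""
    rs = [random_invertible() for _ in x_a]
    y_a = [(pow(r, E, N) * inner_hash(u)) % N for r, u in zip(rs, x_a)]
    return y_a, rs


def second_respond(y_a: List[int], x_b: List[object]) -> Tuple[List[int], Set[str]]:
    """Second client terminal: Z_A = { y^d mod n }, Z_B = { H(H(u_j)^d mod n) }."""
    z_a = [pow(y, D, N) for y in y_a]
    z_b = {outer_hash(pow(inner_hash(u), D, N)) for u in x_b}
    return z_a, z_b


def first_intersect(x_a: List[object], rs: List[int],
                    z_a: List[int], z_b: Set[str]) -> List[object]:
    """First client terminal: D_A = { H(z / r_i mod n) }, then intersect with Z_B."""
    d_a = [outer_hash((z * pow(r, -1, N)) % N) for z, r in zip(z_a, rs)]
    return [u for u, tag in zip(x_a, d_a) if tag in z_b]


x_a = [1, 2, 3, 4]   # hypothetical sample IDs held by the first client terminal
x_b = [3, 4, 5, 6]   # hypothetical sample IDs held by the second client terminal

y_a, rs = first_blind(x_a)                 # sent to the second client terminal
z_a, z_b = second_respond(y_a, x_b)        # second encrypted data, sent back
intersection = first_intersect(x_a, rs, z_a, z_b)
```

In this sketch only `y_a`, `z_a` and `z_b` cross between the two parties; the element values themselves and the random numbers `rs` stay on the first client terminal.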

In some embodiments, the database script statement is a structured query language, SQL, statement that includes a file name of the first sample data, a file name of the second sample data, and sample alignment reference information. For example, an SQL statement may be expressed as:

"select featureA, featureB from A join B on ID_A = ID_B".

where A denotes the first sample data and B denotes the second sample data. ID_A and ID_B are the sample alignment reference information: ID_A represents the sample ID in the first sample data, and ID_B represents the sample ID in the second sample data. featureA represents the sample features in the first sample data, and featureB represents the sample features in the second sample data; because the latter are unknown to the first client terminal, featureB may be understood as all sample features in the second sample data.
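To illustrate which rows such a statement selects, the sketch below runs it against a single in-memory SQLite database. This is an assumption made purely for illustration: in the embodiments the two tables reside on different client terminals and the join is carried out through the encrypted alignment, not by a local database engine. The table contents are invented.

```python
import sqlite3

# Both tables are placed in one local database only to show the join semantics.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (ID_A INTEGER PRIMARY KEY, featureA TEXT);
    CREATE TABLE B (ID_B INTEGER PRIMARY KEY, featureB TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3'), (4, 'a4');
    INSERT INTO B VALUES (3, 'b3'), (4, 'b4'), (5, 'b5');
    """
)

# The statement from the text, with an ORDER BY added for a stable row order.
rows = conn.execute(
    "select featureA, featureB from A join B on ID_A = ID_B order by ID_A"
).fetchall()
conn.close()
```

Only the samples whose IDs appear in both tables (3 and 4) contribute rows, which is exactly the intersection that the alignment is meant to produce.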

Specifically, since the SQL statement is written in a database script language, after the first client terminal obtains the SQL statement, the SQL statement can be input into the compiler and converted into machine language to obtain a syntax unit that can be understood by the machine. The identification information of the first sample data, the identification information of the second sample data, and the sample alignment reference information are then obtained from the syntax unit.
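A very reduced stand-in for this compilation step is sketched below. It is not a SQL compiler: it only recognizes statements of the single shape shown above by means of a regular expression, and the `SyntaxUnit` type, its field names, and `compile_statement` are hypothetical names introduced for the example.

```python
import re
from typing import NamedTuple, Tuple


class SyntaxUnit(NamedTuple):
    """Hypothetical parsed form of a 'select ... from X join Y on P = Q' statement."""
    features: Tuple[str, ...]
    first_data_id: str       # identification information of the first sample data
    second_data_id: str      # identification information of the second sample data
    first_reference: str     # alignment reference column in the first sample data
    second_reference: str    # alignment reference column in the second sample data


_STATEMENT = re.compile(
    r"^\s*select\s+(?P<features>.+?)\s+from\s+(?P<first>\w+)\s+join\s+(?P<second>\w+)"
    r"\s+on\s+(?P<ref1>\w+)\s*=\s*(?P<ref2>\w+)\s*;?\s*$",
    re.IGNORECASE,
)


def compile_statement(statement: str) -> SyntaxUnit:
    """Extract the data identifiers and alignment reference from the statement."""
    match = _STATEMENT.match(statement)
    if match is None:
        raise ValueError("unsupported database script statement")
    features = tuple(name.strip() for name in match.group("features").split(","))
    return SyntaxUnit(
        features,
        match.group("first"),
        match.group("second"),
        match.group("ref1"),
        match.group("ref2"),
    )


unit = compile_statement("select featureA,featureB from A join B on ID_A = ID_B")
```

The fields of `unit` then supply the identification information and the sample alignment reference information used in the later steps.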

By way of example, Fig. 4 is an exemplary diagram of data alignment between organization A and organization B. Here, organization A may be understood as the first client terminal, and organization B as the second client terminal. Data table A is the first sample data, and data table B is the second sample data. The target SQL represents the database script statement. The Federated AI Technology Enabler (FATE) system is a computing framework for federated learning; it includes a compiler, a distributed computing system, and a storage system, and supports operations such as compiling, computation, and storage in the federated learning process.

As shown in Fig. 4, organization A and organization B compile the target SQL to obtain a syntax unit. From the syntax unit and the sample data distribution information, they obtain the distribution information of data table A and data table B, that is, the client terminal where data table A is located and the client terminal where data table B is located; the sample alignment reference information can also be obtained from the syntax unit. The data encryption and alignment operations described in any of the above method embodiments are then performed on data table A and data table B in the distributed computing system according to the sample alignment reference information.

The FATE system may be deployed on a server of organization A and on a server of organization B. A user of organization A and a user of organization B can input the target SQL statement on their respective terminal devices, and each terminal device sends the target SQL statement to the server of its own organization.

In some embodiments, the second client terminal stores the public key and the private key and sends the public key to the first client terminal; alternatively, the first client terminal stores the public key and the private key and sends the public key to the second client terminal. In either case, the above data encryption operations between the first client terminal and the second client terminal are carried out with the same public key and private key pair.
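A minimal sketch of this key arrangement is given below, assuming an RSA key pair. The helper `generate_toy_key_pair`, the two named tuples, and the dictionary standing in for the message to the peer are hypothetical, and the Mersenne primes are chosen only so that the example runs without external libraries; they do not yield a secure key.

```python
import math
from typing import NamedTuple, Tuple


class PublicKey(NamedTuple):
    n: int
    e: int


class PrivateKey(NamedTuple):
    n: int
    d: int


def generate_toy_key_pair(p: int, q: int, e: int = 65537) -> Tuple[PublicKey, PrivateKey]:
    """Build an RSA key pair from two given primes (illustration only, NOT secure)."""
    phi = (p - 1) * (q - 1)
    if math.gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p - 1) * (q - 1)")
    d = pow(e, -1, phi)               # modular inverse, Python 3.8+
    return PublicKey(p * q, e), PrivateKey(p * q, d)


# The terminal that stores both keys generates the pair locally ...
public_key, private_key = generate_toy_key_pair(2**107 - 1, 2**127 - 1)

# ... and sends only the public part to the other terminal.
message_to_peer = {"n": public_key.n, "e": public_key.e}
```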

Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 5, the data processing apparatus may include:

an obtaining module 501, configured to obtain a database script statement, where the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of a first client terminal;

a determining module 502, configured to determine, according to the database script statement, a second client terminal, where second sample data is sample data in a local database of the second client terminal;

and an intersection module 503, configured to perform, with the second client terminal, data alignment of the first sample data and the second sample data according to the database script statement and the encryption algorithm, so as to obtain intersection data of the first sample data and the second sample data, where the intersection data is used for federated learning.

The data processing apparatus provided in this embodiment may be configured to execute the technical solution provided in any of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.

In a possible implementation manner, the database script statement includes identification information of the first sample data, and the obtaining module 501 is further configured to: and acquiring the first sample data from a local database of the first client terminal according to the identification information of the first sample data.

In a possible implementation manner, the database script statement further includes identification information of second sample data, and the determining module 502 is specifically configured to: and determining the second client terminal according to the identification information of the second sample data and preset sample data distribution information, wherein the sample data distribution information is used for indicating the corresponding relation between the identification information of the sample data and the client terminal to which the sample data belongs.

In one possible implementation, the database script statement further includes sample alignment reference information, and the sample alignment reference information includes one or more of the following: a sample ID and a sample feature. The intersection module 503 is specifically configured to: perform, according to an encryption algorithm, data alignment of at least one first element value and at least one second element value with the second client terminal to obtain the intersection data; the first element value is an element value corresponding to the sample alignment reference information in the first sample data, and the second element value is an element value corresponding to the sample alignment reference information in the second sample data.

In a possible implementation manner, the intersection module 503 is specifically configured to: encrypt each first element value to obtain first encrypted data; send the first encrypted data to the second client terminal, and receive second encrypted data returned by the second client terminal, wherein the second encrypted data is associated with each encrypted second element value; and perform, according to the first encrypted data and the second encrypted data, data alignment on each encrypted first element value and each encrypted second element value to obtain the intersection data.

In one possible implementation, the data processing apparatus further includes: a compiling unit, configured to compile the database script statement to obtain a compiled syntax unit. The obtaining module 501 is further configured to: obtain the identification information of the first sample data, the identification information of the second sample data, and the sample alignment reference information according to the compiled syntax unit.

In one possible implementation manner, the database script statement is a structured query language SQL statement, and the structured query language SQL statement includes a file name of the first sample data, a file name of the second sample data, and sample alignment reference information.

In one possible implementation, the data processing apparatus further includes: a transceiver module 504, configured to send the intersection data to the second client terminal.

In a possible implementation manner, the transceiver module 504 is further configured to: and sending the database script statement to the second client terminal.

The data processing apparatus provided in any of the foregoing embodiments is configured to execute the technical solution of any of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
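The division into modules can be pictured with the following structural sketch. The class, its method names, and in particular the `fake_peer_alignment` function are hypothetical: the latter merely stands in for the encrypted exchange with the second client terminal and performs no encryption at all.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the encrypted alignment with the peer: it receives
# the peer's identifier and the local element values and returns the intersection.
PeerAlignment = Callable[[str, List[int]], List[int]]


class DataProcessingApparatus:
    def __init__(self, local_database: Dict[str, List[int]],
                 distribution: Dict[str, str],
                 align_with_peer: PeerAlignment) -> None:
        self.local_database = local_database
        self.distribution = distribution
        self.align_with_peer = align_with_peer

    def obtain_first_sample_data(self, first_data_id: str) -> List[int]:
        """Role of the obtaining module 501: read local sample data."""
        return self.local_database[first_data_id]

    def determine_second_client_terminal(self, second_data_id: str) -> str:
        """Role of the determining module 502: find the owner of the peer data."""
        return self.distribution[second_data_id]

    def intersect(self, first_data_id: str, second_data_id: str) -> List[int]:
        """Role of the intersection module 503: align with the peer terminal."""
        peer = self.determine_second_client_terminal(second_data_id)
        return self.align_with_peer(peer, self.obtain_first_sample_data(first_data_id))


def fake_peer_alignment(peer: str, local_ids: List[int]) -> List[int]:
    # Placeholder only: pretends the peer holds IDs 3 to 6; no encryption is done.
    peer_ids = {3, 4, 5, 6}
    return [i for i in local_ids if i in peer_ids]


apparatus = DataProcessingApparatus(
    local_database={"A": [1, 2, 3, 4]},
    distribution={"B": "client-terminal-2"},
    align_with_peer=fake_peer_alignment,
)
result = apparatus.intersect("A", "B")
```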

Fig. 6 is a schematic structural diagram of a data processing device according to an embodiment of the present invention. As shown in Fig. 6, the device may include: a memory 601, a processor 602, and a data processing program stored on the memory 601 and executable on the processor 602, the data processing program implementing the steps of the data processing method according to any of the previous embodiments when executed by the processor 602.

Alternatively, the memory 601 may be separate or integrated with the processor 602.

For the implementation principle and the technical effect of the device provided by this embodiment, reference may be made to the foregoing embodiments, and details are not described here.

An embodiment of the present invention further provides a computer-readable storage medium, where a data processing program is stored on the computer-readable storage medium, and when the data processing program is executed by a processor, the data processing program implements the steps of the data processing method according to any of the foregoing embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical division, and other divisions are possible in practice; a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.

The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A data processing method, applied to a first client terminal, comprising:

2. The method according to claim 1, wherein the database script statement includes identification information of the first sample data, and before the second client terminal performs data alignment of the first sample data and the second sample data, the method further comprises:

3. The method according to claim 2, wherein the database script statement further includes identification information of the second sample data, and the determining a second client terminal according to the database script statement comprises:

4. The method of claim 3, wherein the database script statement further comprises sample alignment reference information, wherein the sample alignment reference information comprises one or more of: a sample ID, a sample characteristic, and the performing, according to the database script statement and the encryption algorithm, data alignment of the first sample data and the second sample data with the second client terminal, includes:

5. The method according to claim 4, wherein said performing data alignment of at least one first element value and at least one second element value with the second client terminal according to the encryption algorithm to obtain the intersection data comprises:

encrypting each first element value to obtain first encrypted data;

6. The method of claim 4, wherein prior to determining a second client terminal from the database script statement, the method further comprises:

compiling the database script statements to obtain compiled grammar units;

7. The method of claim 4, wherein the database script statement is a Structured Query Language (SQL) statement that includes a file name of the first sample data, a file name of the second sample data, and the sample alignment reference information.

8. The method according to any one of claims 1-7, further comprising:

and sending the intersection data to the second client terminal.

9. The method according to any one of claims 1-7, wherein before performing data alignment of the first sample data and the second sample data with the second client terminal, the method further comprises:

and sending the database script statement to the second client terminal.

10. A data processing apparatus, comprising:

an acquisition module, configured to acquire a database script statement, wherein the database script statement is associated with first sample data and second sample data, and the first sample data is sample data in a local database of a first client terminal;

11. A data processing apparatus, characterized in that the data processing apparatus comprises: a memory, a processor, and a data processing program stored in the memory and executable on the processor, wherein the data processing program, when executed by the processor, carries out the steps of the data processing method according to any one of claims 1 to 9.

12. A computer-readable storage medium, on which a data processing program is stored, wherein the data processing program, when executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 9.
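For illustration only, the idea in claims 6 and 7 of compiling an SQL statement into grammar units that name the first sample data, the second sample data, and the sample alignment reference information can be sketched as follows. The statement shape (`SELECT * FROM a JOIN b ON a.id = b.id`), the regular expression, and the names `AlignmentRequest` and `parse_statement` are all assumptions made for this sketch; the claims do not fix a concrete syntax, and a production implementation would more likely use a real SQL parser.

```python
import re
from typing import NamedTuple


class AlignmentRequest(NamedTuple):
    """Hypothetical container for the grammar units of one statement."""
    first_name: str       # file/table name of the first sample data
    second_name: str      # file/table name of the second sample data
    reference_field: str  # alignment reference, e.g. a sample ID column


# Assumed statement shape; backreferences require both sides of the ON
# clause to use the names given after FROM and JOIN and the same field.
_PATTERN = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(?P<first>\w+)\s+"
    r"(?:INNER\s+)?JOIN\s+(?P<second>\w+)\s+"
    r"ON\s+(?P=first)\.(?P<field>\w+)\s*=\s*(?P=second)\.(?P=field)\s*;?\s*$",
    re.IGNORECASE,
)


def parse_statement(statement: str) -> AlignmentRequest:
    """Extract the grammar units, or raise ValueError if the shape differs."""
    match = _PATTERN.match(statement)
    if match is None:
        raise ValueError(f"unsupported alignment statement: {statement!r}")
    return AlignmentRequest(match["first"], match["second"], match["field"])


if __name__ == "__main__":
    print(parse_statement(
        "SELECT * FROM guest_data JOIN host_data ON guest_data.id = host_data.id"
    ))
```

The second name extracted here is what a first client terminal could use to look up the second client terminal, and the reference field indicates which element values are to be aligned.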