CN112287396A - Data processing method and device based on privacy protection - Google Patents

Data processing method and device based on privacy protection

Info

Publication number: CN112287396A
Authority: CN (China)
Prior art keywords: data flow graph, machine learning algorithm, programming model
Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Application number: CN202011547030.6A
Other languages: Chinese (zh)
Other versions: CN112287396B (en)
Inventors: 徐世真, 朱晓芳, 倪裕芳, 王鲲鹏, 刘荔园, 唐家渝, 田天
Current assignee: Beijing Real AI Technology Co Ltd
Original assignee: Beijing Real AI Technology Co Ltd
Application filed by: Beijing Real AI Technology Co Ltd
Priority date / Filing date: 2020-12-24
Publication date: 2021-01-29 (CN112287396A); 2021-03-30 (CN112287396B, grant)
Current legal status: Active

Classifications

    • G06F21/6245 — Protecting personal data, e.g. for financial or medical purposes (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity; G06F21/60 Protecting data; G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules; G06F21/6218 to a system of files or objects, e.g. local or distributed file system or database)
    • G06F21/602 — Providing cryptographic facilities or services (G06F21/60 Protecting data)

Abstract

The disclosure relates to a data processing method and device based on privacy protection. The method comprises the following steps: acquiring a machine learning algorithm to be trained; acquiring multiple groups of feature data that need to be called repeatedly from the machine learning algorithm, and converting the acquired feature data into tuple variable parameters of a composite data structure such as an array, a dictionary or a set; inputting the tuple variable parameters into a programming model so that the programming model converts the machine learning algorithm, based on the tuple variable parameters and a preset data flow graph generating tool, into a data flow graph corresponding to the machine learning algorithm, the data flow graph comprising a series of operators; segmenting the data flow graph into a plurality of subgraphs, and scheduling the subgraphs to a target participant for execution; and compiling the subgraphs into a new data flow graph, and obtaining a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph. The method and the device reduce the number of compilations and improve parallelism.

Description

Data processing method and device based on privacy protection
Technical Field
The present disclosure relates to the field of data encryption technologies, and in particular, to a data processing method and apparatus based on privacy protection.
Background
In a machine learning scenario that requires multi-party participation, multi-party data must not only take part in information interaction to complete model training, but data security must also be ensured and data privacy protected. XGBoost (eXtreme Gradient Boosting) is a commonly used algorithm in such scenarios. For XGBoost, data privacy protection during training mainly means that the X data (feature data) and the Y data (label data) are held by two different parties, and the two parties need to complete XGBoost training under the requirement that neither the X data nor the Y data is leaked.
At present, an XGBoost algorithm with dynamic multi-party interaction can be converted by a programming model into a statically executed data flow graph, which is then used for security verification of the privacy protection protocol. However, in the process of converting the XGBoost algorithm into the data flow graph, the programming model needs to be called many times, which leads to problems such as many compilations, many communication rounds and low parallelism.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present disclosure provides a data processing method and apparatus based on privacy protection.
The present disclosure provides a data processing method based on privacy protection, the method comprising: acquiring a machine learning algorithm to be trained; acquiring a plurality of groups of feature data that need to be called repeatedly from the machine learning algorithm, and converting the acquired feature data into tuple variable parameters of a composite data structure, wherein the composite data structure comprises an array, a dictionary or a set; inputting the tuple variable parameters into a programming model so that the programming model converts the machine learning algorithm, based on the tuple variable parameters and a preset data flow graph generating tool, into a data flow graph corresponding to the machine learning algorithm, the data flow graph comprising a series of operators; segmenting the data flow graph into a plurality of subgraphs, and scheduling the subgraphs to a target participant for execution; and compiling the subgraphs into a new data flow graph, and obtaining a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph.
Further, the type of the default input parameter of the programming model is a composite data structure, and the selection judgment function in the programming model is configured to access each variable within the tuple variable parameters in an indexed manner.
Further, segmenting the data flow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution comprises: evaluating the placement cost of each operator in the data flow graph when executed by different participants, the instruction in the programming model being treated, during the evaluation, as a single-party instruction executed by one participant; and, according to the evaluated placement cost, segmenting the data flow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution, the instruction in the programming model being treated, during the segmentation and scheduling of the data flow graph, as a two-party instruction executed by two participants simultaneously.
Further, the method further comprises: caching the data flow graph, the subgraphs and the new data flow graph through a preset operation, wherein the preset operation is: deleting the data shape of the tuple variable parameters from the cache signature of the data flow graph when the tuple variable parameters are cached.
Further, generating the computation operation of each operator in the new data flow graph comprises: compiling a first operator in the new data flow graph into a computation operation with semantics corresponding to NumPy of Python, wherein the first operator is an operator to be subjected to dynamic plaintext computation.
Further, generating the computation operation of each operator in the new data flow graph comprises: compiling a second operator in the new data flow graph into a computation operation with the corresponding semantics using a Google-XLA compile framework, wherein the second operator is an operator to be subjected to static plaintext computation.
Further, generating the computation operation of each operator in the new data flow graph comprises: determining the intersection between two or more indexes in the programming model as a base index; and compiling, optimizing and executing a third operator in the new data flow graph based on the base index, wherein the third operator is an operator with specific semantics in the computation primitive that implements the programming model.
The present disclosure also provides a data processing apparatus based on privacy protection, the apparatus comprising: an algorithm acquisition module, configured to acquire a machine learning algorithm to be trained; a parameter conversion module, configured to acquire a plurality of groups of feature data that need to be called repeatedly from the machine learning algorithm and convert the acquired feature data into tuple variable parameters of a composite data structure such as an array, a dictionary or a set; a data flow graph generating module, configured to input the tuple variable parameters into a programming model so that the programming model converts the machine learning algorithm, based on the tuple variable parameters and a preset data flow graph generating tool, into a data flow graph corresponding to the machine learning algorithm, the data flow graph comprising a series of operators; a segmentation scheduling module, configured to segment the data flow graph into a plurality of subgraphs and schedule the subgraphs to a target participant for execution; and a compiling module, configured to compile the subgraphs into a new data flow graph and obtain a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph.
The present disclosure also provides an electronic device, including: a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs the method described above.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the data processing method based on privacy protection provided by the embodiments of the present disclosure first converts a plurality of groups of feature data, which are acquired from the machine learning algorithm and need to be called repeatedly, into tuple variable parameters of a composite data structure such as an array, a dictionary or a set; the tuple variable parameters are then input into a programming model so that the programming model converts the machine learning algorithm, based on the tuple variable parameters and a preset data flow graph generating tool, into a data flow graph corresponding to the machine learning algorithm; the data flow graph is then segmented into a plurality of subgraphs, and the subgraphs are scheduled to a target participant for execution; finally, the subgraphs are compiled into a new data flow graph, and a privacy-preserving machine learning algorithm is obtained by generating the computation operation of each operator in the new data flow graph. In the process of converting the XGBoost algorithm into the data flow graph, the multiple groups of feature data can be executed in parallel as one set of tuple variable parameters, which reduces the number of compilations and improves parallelism.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is an example of an implementation function of a programming model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a data processing method based on privacy protection according to an embodiment of the present disclosure;
FIG. 3 is an example of an implementation function of another programming model according to embodiments of the present disclosure;
FIG. 4 is a schematic structural diagram of a calculation compiling module according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a base index according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a calculation process based on a base index according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Currently, to achieve privacy protection, the XGBoost algorithm may be transformed by a programming model to generate a statically executed data flow graph. For example, in some embodiments, dynamic multi-party interaction in a programming model may be implemented by designing a computation primitive "select_and_reduce" as shown in FIG. 1. The XGBoost algorithm may then be transformed by the programming model into a statically executed data flow graph for use in security verification of the privacy protection protocol. In the feature split-point step of the XGBoost algorithm, the computation of the information gain of every split point of every feature can in principle be executed in batches. However, with the "select_and_reduce" function interface shown in FIG. 1, the judgment condition selected by the selector function can only complete the split-point gain judgment of one feature at a time; multiple conditions (for example, judging several features simultaneously) cannot be completed in parallel and must instead be processed sequentially in a for loop, so the parallel opportunity cannot be exploited. The "select_and_reduce" function is therefore called many times, which results in many compilations, many communication rounds, low parallelism and similar problems.
Meanwhile, the execution back end of this scheme depends on compile-and-execute components such as Google-XLA that need to know the shape of an array in advance and do not allow it to change once determined, which limits the parallel execution of a dynamic number of conditions.
Still other schemes can exploit the parallel opportunity described above, but are not designed for the privacy protection scenario (i.e., the scenario in which the X data and the Y data are distributed across two participants). Such a scheme is based on a purely dynamic setting (implemented with C++ and OpenMP); although it can exploit the parallel opportunity, it cannot be converted into a statically executed data flow graph, and the visual display of the data flow graph is an important basis for the subsequent security verification of the privacy protection protocol.
In view of the above problem that the data flow graph and sufficient parallelism cannot be obtained at the same time, embodiments of the present disclosure provide a data processing method and apparatus based on privacy protection, which are described in detail below.
The first embodiment is as follows:
fig. 2 is a flowchart of a data processing method based on privacy protection according to an embodiment of the present disclosure, where the method includes the following steps S202 to S210:
step S202, a machine learning algorithm to be trained is obtained, and the machine learning algorithm is an algorithm comprising dynamic multi-party interaction and static multi-party interaction. The XGBoost algorithm is taken as an example of the machine learning algorithm of the present embodiment.
Dynamic multiparty interaction: the interactive calculation process of the first party and the second party comprises dynamic instructions such as branch dynamic instructions and loop dynamic instructions, and when the data content of the first party and/or the second party changes, the calculation processes such as the calculation times and the calculation sequence also change. Static multiparty interactions: the computing process of the interaction between the first party and the second party is static, contains no dynamic instructions, and the computing process remains unchanged when the data content of the first party and/or the second party changes.
Step S204, obtaining a plurality of groups of feature data needing to be called repeatedly from a machine learning algorithm, and converting the obtained feature data into tuple variable parameters of a composite data structure; wherein the composite data structure includes but is not limited to: an array, dictionary, or collection.
The characteristic data includes, for example: test data, Threshold data and feature _ i data, wherein the feature _ i represents the number of features, and the Threshold represents a specific parameter value of the feature data.
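To make step S204 concrete, the following is a minimal sketch, in Python with NumPy, of packing the feature data into one tuple variable parameter built from composite data structures; the function name pack_tuple_variable_params and the dict layout are illustrative assumptions, not an interface required by the patent.

```python
# A minimal sketch (hypothetical names) of step S204: the feature data that a split-finding
# loop would otherwise pass to the programming model one call at a time are packed into a
# single tuple variable parameter built from composite data structures.
import numpy as np

def pack_tuple_variable_params(test_data, candidate_splits):
    """candidate_splits: list of (feature_index, threshold) pairs collected from the
    XGBoost split-finding step; all of them are packed into one composite parameter."""
    feature_record = np.array([f for f, _ in candidate_splits])    # features to be split
    threshold_record = np.array([t for _, t in candidate_splits])  # matching split points
    # One dict (a composite data structure) replaces many per-condition calls.
    return {"test": test_data,
            "feature_record": feature_record,
            "threshold_record": threshold_record}

params = pack_tuple_variable_params(np.random.rand(100, 8),
                                    [(0, 0.5), (0, 0.7), (3, 15.0)])
```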
Step S206, inputting the tuple variable parameters into the programming model so that the programming model converts the machine learning algorithm based on the tuple variable parameters and a preset data flow graph generating tool to obtain a data flow graph corresponding to the machine learning algorithm; a dataflow graph includes a series of operators.
In this embodiment, the type of the default input parameter of the programming model is a composite data structure such as an array, a dictionary or a set, and the selection judgment function in the programming model is configured to access each variable within the tuple variable parameters in an indexed manner. The data flow graph generating tool is a generating tool with a front-end interface and a preset data flow graph format, where the front-end interface is used to acquire the machine learning algorithm; an example of such a data flow graph generating tool is the Google-JAX computing framework.
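As an illustration of how a data flow graph generating tool such as Google-JAX turns a traced computation into a static graph of operators, the sketch below applies jax.make_jaxpr to a simplified, stand-in gain computation; the function gain and its arguments are assumptions for demonstration and do not reproduce the patent's conversion pipeline.

```python
# Illustration only: Google-JAX (the data flow graph generating tool named in the text)
# can trace a Python function into a static dataflow graph of operators (a "jaxpr").
# The gain computation below is a simplified stand-in, not the patent's actual conversion.
import jax
import jax.numpy as jnp

def gain(test, feature_record, threshold_record):
    # One vectorised pass evaluates every (feature, threshold) condition at once.
    selected = test[:, feature_record] < threshold_record   # shape: (n_samples, n_conditions)
    return jnp.sum(selected, axis=0)

graph = jax.make_jaxpr(gain)(jnp.ones((100, 8)),
                             jnp.array([0, 0, 3]),
                             jnp.array([0.5, 0.7, 15.0]))
print(graph)  # the dataflow graph: a sequence of operators over the tuple variable parameters
```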
And step S208, dividing the dataflow graph into a plurality of subgraphs, and scheduling the subgraphs to a target participant for execution.
Step S210, the subgraphs are compiled into a new data flow graph, and a privacy-preserving machine learning algorithm is obtained by generating the computation operation of each operator in the new data flow graph. The privacy-preserving machine learning algorithm supports machine learning based on the data privacy protection algorithm through a preset machine learning framework.
The data processing method based on privacy protection provided by the embodiments of the present disclosure first converts a plurality of groups of feature data, which are acquired from the machine learning algorithm and need to be called repeatedly, into tuple variable parameters of a composite data structure such as an array, a dictionary or a set; the tuple variable parameters are then input into a programming model so that the programming model converts the machine learning algorithm, based on the tuple variable parameters and a preset data flow graph generating tool, into a data flow graph corresponding to the machine learning algorithm; the data flow graph is then segmented into a plurality of subgraphs, and the subgraphs are scheduled to a target participant for execution; finally, the subgraphs are compiled into a new data flow graph, and a privacy-preserving machine learning algorithm is obtained by generating the computation operation of each operator in the new data flow graph. In the process of converting the XGBoost algorithm into the data flow graph, the multiple groups of feature data can be executed in parallel as one set of tuple variable parameters, which reduces the number of compilations and improves parallelism.
For ease of understanding, the data processing method based on privacy protection described above is described below.
When a machine learning algorithm is converted into a data flow graph through an existing programming model (such as the model implemented by the "select_and_reduce" function in FIG. 1), the inventors found that the test data in the machine learning algorithm is the same parameter under different selector judgment conditions, while the i-th feature feature_i and its corresponding parameter value threshold are called many times under different judgment conditions. Although the specific feature and parameter value expressed under different judgment conditions differ, the two called parameters feature_i and threshold themselves are unchanged; that is, the feature data feature_i and threshold are variable parameters whose values vary. For example, under one judgment condition the feature data feature_i and threshold need to be called and represent age and 15 years respectively, while under another judgment condition the same feature data feature_i and threshold are called but represent age and 18 years respectively. Based on this, in this embodiment the feature data feature_i and threshold can be regarded as being called only once while multiple selector judgment conditions exist; further, the feature data feature_i and threshold can be determined as a judgment variable set.
This embodiment provides a judgment variable set and an example of a new programming interface layer design, as shown in FIG. 3. The feature data shown in FIG. 3 include test data, feature_record and threshold_record, where feature_record represents the features to be split and the corresponding threshold_record represents the feature split points; the two are lists of the same length. The judgment variable set consisting of the feature data feature_record and threshold_record, together with the test data, forms the tuple variable parameters of a composite data structure such as an array, a dictionary or a set.
The new programming model in this embodiment is implemented by the select_and_reduce function shown in FIG. 3, i.e., the select_and_reduce provided in FIG. 3 is the computation primitive that implements the new programming model. The type of the default input parameter of the programming model is a composite data structure such as an array, a dictionary or a set, and the selection judgment function in the programming model is configured to access each variable within the tuple variable parameters in an indexed manner. It can be understood that, in the select_and_reduce function, the data input to the selector are tuple variable parameters containing multiple groups of feature data, and the tuple variable parameters are a composite data structure such as an array, a dictionary or a set; the selector converts the data of one participant (denoted TEST) into internal indexes, and the reducer selects part of the data of the other participant (denoted APPLY) according to the internal indexes and performs a computation operation such as a user-specified reduction. The implementation therefore contains a data-driven computation flow, i.e., the computation behavior of the reducer depends heavily on the result output by the selector. In this embodiment, the data TEST of one participant may be the tuple variable parameters generated from the feature data including the test data, feature_record and threshold_record.
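The following sketch illustrates the batched selector/reducer semantics described above in plain NumPy; the signature of select_and_reduce, the selector body and the variable names are assumptions based on the description of FIG. 3, not the actual interface of the patent.

```python
# A sketch of the batched select_and_reduce semantics: the selector maps the TEST data of
# one participant to one internal index per judgment condition; the reducer applies a
# user-specified reduction to the APPLY data of the other participant for each index.
import numpy as np

def select_and_reduce(selector, reducer, test_params, apply_data):
    indexes = selector(test_params)                 # list of internal indexes, one per condition
    return [reducer(apply_data[idx]) for idx in indexes]

def selector(p):
    test, feats, thrs = p["test"], p["feature_record"], p["threshold_record"]
    # Each (feature, threshold) pair yields the indexes of samples satisfying the condition.
    return [np.nonzero(test[:, f] < t)[0] for f, t in zip(feats, thrs)]

test_params = {"test": np.random.rand(100, 8),
               "feature_record": np.array([0, 0, 3]),
               "threshold_record": np.array([0.5, 0.7, 15.0])}
g = np.random.rand(100)                             # e.g. per-sample values held by the other party
sums = select_and_reduce(selector, lambda x: x.sum(), test_params, g)
```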
It should be noted that the selector function in this embodiment semantically expresses the participation of all variables in the tuple variable parameters of the select_and_reduce interface, but it can actually be interpreted by the subsequent conversion layer or compile-execution layer as the participation of any slice of those variables; that is, the selector function imposes no requirement on the shape of the participating variables, and the participating variables and their shapes are determined by the conversion layer/compile-execution layer according to an internal performance optimization scheme. The conversion layer is used to implement step S208, and the compile-execution layer is used to implement step S210.
The present embodiment provides a process for implementing step S208 through the conversion layer, as follows: the placement cost of each operator in the data flow graph when executed by different participants is evaluated, and during the evaluation the instruction in the programming model is treated as a single-party instruction executed by one participant. Then, according to the evaluated placement cost, the data flow graph is segmented into a plurality of subgraphs and the subgraphs are scheduled to a target participant for execution; during the segmentation and scheduling of the data flow graph, the instruction in the programming model is treated as a two-party instruction executed by two participants simultaneously.
Specifically, in the vertical federated learning scenario, one operator executes on only one participant. Based on this, in this embodiment, during the evaluation of operator placement, the instruction in the programming model (i.e., select_and_reduce) is treated as a single-party instruction executed by the holder of the data APPLY; during data flow graph segmentation and scheduling, select_and_reduce is treated as an instruction executed by two participants simultaneously and is scheduled into a subgraph of the data flow graph executed by both participants.
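A highly simplified sketch of the conversion-layer idea follows: per-operator placement costs are compared across participants and the linearised graph is cut into subgraphs of consecutive operators assigned to the same participant. The cost model and the greedy partition below are illustrative assumptions, not the patent's scheduling algorithm.

```python
# Estimate, per operator, the placement cost on each participant, then cut the (linearised)
# dataflow graph into subgraphs of consecutive operators assigned to the same participant.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    cost: dict          # participant -> estimated placement cost

def partition(ops):
    subgraphs, current, owner = [], [], None
    for op in ops:
        best = min(op.cost, key=op.cost.get)        # cheapest participant for this operator
        if owner is not None and best != owner:
            subgraphs.append((owner, current))      # close the current subgraph
            current = []
        owner, current = best, current + [op.name]
    if current:
        subgraphs.append((owner, current))
    return subgraphs                                # each entry is scheduled to its target participant

ops = [Op("select", {"A": 1, "B": 5}),
       Op("reduce", {"A": 4, "B": 1}),
       Op("sum_g",  {"A": 3, "B": 1})]
print(partition(ops))   # [('A', ['select']), ('B', ['reduce', 'sum_g'])]
```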
In an embodiment, the data flow graph, the subgraphs and the new data flow graph can be cached through a preset operation, wherein the preset operation is: when the tuple variable parameters are cached, the data shape of the tuple variable parameters is deleted from the cache signature of the data flow graph; the cache signature is obtained when a dynamic cache mark is added to the tuple variable parameters based on the default behavior of Google-JAX.
Considering that caching of the data flow graph is an important basis for just-in-time compilation, this embodiment may treat the data flow graph produced by conversion, the subgraphs obtained by segmentation and the new data flow graph obtained by compilation as intermediate data, and cache this intermediate data so that a second call avoids re-running the conversion and compilation of the data flow graph. Since these operations are generally time-consuming, caching the intermediate data effectively reduces data processing time. In this embodiment a dynamic cache mark is set, and when a dynamic batch is executed, the shape of the input tuple variable parameters is removed from the cache signature of the data flow graph, thereby completing flow graph caching for dynamically shaped tuple variable parameters.
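The caching idea can be sketched as follows: the cache signature is built from the structure and dtypes of the tuple variable parameters while their shapes are deliberately excluded, so dynamically sized batches hit the same cached graph. The signature format and function names below are assumptions for illustration.

```python
# A minimal sketch of the cache whose key (the "cache signature") intentionally omits the
# data shape of the tuple variable parameters, so a change of batch size does not trigger
# a new convert/split/compile pass.
import numpy as np

_graph_cache = {}

def cache_signature(tuple_params):
    # shape is intentionally excluded from the key (the "preset operation" above)
    return tuple(sorted((k, str(v.dtype)) for k, v in tuple_params.items()))

def get_or_build_graph(tuple_params, build_fn):
    key = cache_signature(tuple_params)
    if key not in _graph_cache:
        _graph_cache[key] = build_fn(tuple_params)   # expensive convert/split/compile path
    return _graph_cache[key]

params = {"test": np.ones((100, 8)), "threshold_record": np.array([0.5, 0.7])}
graph = get_or_build_graph(params, lambda p: "compiled-graph")
# A second call with a different batch size reuses the cached graph:
params2 = {"test": np.ones((250, 8)), "threshold_record": np.array([0.1, 0.9, 2.0])}
assert get_or_build_graph(params2, lambda p: "recompiled") == "compiled-graph"
```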
The present embodiment provides a process for implementing step S210 through the compile-execution layer. In this embodiment, the compile-execution layer includes, but is not limited to, a calculation compiling module, and the calculation compiling module is configured to generate the computation operations corresponding to the operators in the new data flow graph that are to be computed in plaintext. As shown in FIG. 4, the calculation compiling module includes a dynamic calculation unit, a static compilation calculation unit and a primitive-specific calculation unit. Wherein:
The dynamic calculation unit is configured to compile a first operator in the new data flow graph into a computation operation with semantics corresponding to NumPy of Python (hereinafter numpy); the first operator is an operator to be subjected to dynamic plaintext computation, such as the computation formula of y_g in FIG. 3. This embodiment uses the numpy library as the execution back end to complete the dynamic execution of the first operators. Specifically, each first operator in the new data flow graph is compiled into one or more numpy computation operations with the corresponding semantics. A numpy computation operation imposes no static requirement on the shape of the input data, whereas conventional compiler execution such as Google-XLA requires that the shape of the input data not change once determined; this embodiment can therefore complete the dynamic execution of the first operators with numpy library functions.
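A minimal sketch of such a numpy execution back end follows; the operator table and operator names are hypothetical and only illustrate the idea of mapping a first operator to numpy calls of matching semantics.

```python
# Sketch of the dynamic calculation unit: each first operator (dynamic plaintext
# computation) is mapped to one or more NumPy calls with matching semantics and executed
# eagerly, since NumPy places no static requirement on input shapes.
import numpy as np

NUMPY_BACKEND = {          # hypothetical operator names for illustration
    "add":    np.add,
    "mul":    np.multiply,
    "sum":    lambda x: np.sum(x, axis=0),
    "divide": np.divide,
}

def run_dynamic_op(op_name, *inputs):
    return NUMPY_BACKEND[op_name](*inputs)

# e.g. a y_g-style plaintext computation on inputs whose batch size may vary between calls:
y = run_dynamic_op("add", np.random.rand(37), np.random.rand(37))
```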
The static compilation calculation unit is configured to compile a second operator in the new data flow graph into a computation operation with the corresponding semantics using the Google-XLA compile framework, where the second operator is an operator to be subjected to static plaintext computation.
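One common way to obtain XLA-compiled computation operations for shape-stable code is jax.jit, which hands the traced function to the Google-XLA compile framework; the sketch below is an illustration under that assumption, not the patent's exact compiler integration, and the example function is a stand-in.

```python
# Sketch of the static compilation path for second operators (static plaintext computation):
# jax.jit traces the function once per input shape and compiles it through Google-XLA.
import jax
import jax.numpy as jnp

@jax.jit
def second_operator(g, h):
    # a static, shape-stable computation, e.g. an XGBoost-style structure score g^2 / (h + 1)
    return jnp.square(jnp.sum(g)) / (jnp.sum(h) + 1.0)

score = second_operator(jnp.ones(128), jnp.ones(128))
```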
The primitive-specific calculation unit is configured to determine the intersection between two or more indexes in the programming model as a base index, and to compile, optimize and execute a third operator in the new data flow graph based on the base index so as to obtain the computation operation corresponding to each index; the third operator is an operator with specific semantics in the computation primitive that implements the new programming model, that is, an operator with specific semantics in the select_and_reduce shown in FIG. 3.
In this embodiment, a large number of selector judgment conditions arise during the execution of the select_and_reduce function, which produces multiple groups of indexes; some indexes are identical and some have intersections. In this case, at least the intersection between indexes is determined as a base index; in other words, a base index consists of the intersection elements of at least two indexes. Referring to FIG. 5, the intersection between index 1 and index 2 consists of element 3, element 4 and element 5, so element 3, element 4 and element 5 may be determined as base index 2. It should be understood that FIG. 5 is only an illustration: base index 1 is the intersection of index 1 with other indexes not shown, and base index 3 is the intersection of index 2 with other indexes not shown. Considering that too many elements in a base index harms the reuse of intermediate results, while too few elements makes the cost of assembling indexes from base indexes relatively high, the number of elements contained in a base index can be preset or adjusted so that the base index better fits actual use.
According to the base index, the relevant semantic operations such as summation are performed on the third operator in the new data flow graph, and the results of the base indexes are combined to obtain the result corresponding to each index. For example, FIG. 6 shows one way of performing the semantic operations on a third operator, where the third operator involves the encrypted g and h of the second participant and performs summation over the ciphertext according to the pre-generated base indexes, including the computation of sum_g and sum_h on the base indexes and the computation of sum_g and sum_h of the indexes obtained by subsequent combination.
The base index alleviates the redundancy caused by identical indexes, and in the process of compiling and optimizing the third operator according to the base indexes, the computation of sum_g and sum_h on the original indexes can be reduced by computing on the base indexes and reusing the base-index results.
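The reuse structure can be sketched as follows, using the FIG. 5 example in which the shared elements of two indexes form the base index; plaintext NumPy stands in for the ciphertext computation purely to show how base-index partial sums are computed once and reused.

```python
# A simplified sketch of the base-index optimisation: the intersection of two indexes is
# computed once as a base index, its partial sums of g and h are reused, and each original
# index only pays for its remainder. Real use would operate on ciphertext (encrypted g and h).
import numpy as np

def sums_with_base_index(g, h, index_1, index_2):
    base = np.intersect1d(index_1, index_2)              # base index = shared elements
    base_g, base_h = g[base].sum(), h[base].sum()        # computed once, reused below
    results = {}
    for name, idx in (("index_1", index_1), ("index_2", index_2)):
        rest = np.setdiff1d(idx, base)                   # only the non-shared part is summed
        results[name] = (base_g + g[rest].sum(), base_h + h[rest].sum())
    return results

g, h = np.random.rand(10), np.random.rand(10)
print(sums_with_base_index(g, h,
                           index_1=np.array([1, 2, 3, 4, 5]),
                           index_2=np.array([3, 4, 5, 8, 9])))   # base index: elements 3, 4, 5
```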
It should be noted that, except for the module that receives the encrypted g and h in FIG. 6, the other modules can be executed in a multi-threaded or multi-process manner, so that dynamic batch execution is achieved and the training speed of the machine learning algorithm is markedly improved.
In summary, in the above embodiments, multiple groups of feature data are used together as tuple variable parameters, and the programming model converts the machine learning algorithm into the data flow graph based on the tuple variable parameters, which avoids repeated processing and compilation of the data flow graph. During the compilation of the new data flow graph, the execution layer can run encryption and decryption, index (or base index) computation and intermediate-result computation in parallel; the base index also reduces redundant computation and maximizes the reuse of intermediate results. In addition, by setting base indexes and caching intermediate nodes, this embodiment allows the selector judgment conditions to be executed in batches efficiently. Therefore, in practical applications, the method provided by this embodiment can improve the training speed of the machine learning algorithm.
The second embodiment is as follows:
the embodiment provides a data processing device based on privacy protection, which comprises:
the algorithm acquisition module is used for acquiring a machine learning algorithm to be trained;
the parameter conversion module is used for acquiring a plurality of groups of feature data needing to be called repeatedly from a machine learning algorithm and converting the acquired feature data into tuple variable parameters of a composite data structure; wherein the composite data structure comprises: an array, dictionary, or collection;
the data flow graph generating module is used for inputting the tuple variable parameters into the programming model so that the programming model converts the machine learning algorithm based on the tuple variable parameters and a preset data flow graph generating tool to obtain a data flow graph corresponding to the machine learning algorithm; the dataflow graph includes a series of operators;
the segmentation scheduling module is used for segmenting the dataflow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution;
and the compiling module is used for compiling the subgraphs into a new data flow graph and obtaining a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph.
The device provided by the embodiment of the present disclosure has the same implementation principle and technical effect as the foregoing method embodiment; for brevity, where the device embodiment is not described in detail, reference may be made to the corresponding content in the foregoing method embodiment.
An embodiment of the present disclosure provides an electronic device, which includes: a processor and a storage device; the storage device has stored thereon a computer program which, when executed by the processor, performs the data processing method based on privacy protection as described in embodiment one.
The disclosed embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the data processing method based on privacy protection in the first embodiment are executed.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method based on privacy protection, the method comprising:
acquiring a machine learning algorithm to be trained;
acquiring a plurality of groups of feature data needing to be called repeatedly from the machine learning algorithm, and converting the acquired feature data into tuple variable parameters of a composite data structure; wherein the composite data structure comprises: an array, dictionary, or collection;
inputting the tuple variable parameters into a programming model so that the programming model converts the machine learning algorithm based on the tuple variable parameters and a preset data flow graph generating tool to obtain a data flow graph corresponding to the machine learning algorithm; the dataflow graph includes a series of operators;
segmenting the dataflow graph into a plurality of subgraphs, and scheduling the subgraphs to a target participant for execution;
compiling the subgraphs into a new data flow graph, and obtaining a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph.
2. The method of claim 1, wherein the type of the default input parameter of the programming model is a composite data structure, and the selection judgment function in the programming model is configured to: access each variable within the tuple variable parameters in an indexed manner.
3. The method of claim 1, wherein segmenting the data flow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution comprises:
evaluating the placement cost of each operator in the data flow graph when executed by different participants, and treating the instruction in the programming model, during the evaluation, as a single-party instruction executed by one participant;
and, according to the evaluated placement cost, segmenting the data flow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution, and treating the instruction in the programming model, during the segmentation and scheduling of the data flow graph, as a two-party instruction executed by two participants simultaneously.
4. The method of claim 1, further comprising:
caching the data flow graph, the subgraphs and the new data flow graph through a preset operation; wherein the preset operation is: deleting the data shape of the tuple variable parameters from the cache signature of the data flow graph when the tuple variable parameters are cached.
5. The method of claim 1, wherein generating the computation operation of each operator in the new data flow graph comprises:
compiling a first operator in the new data flow graph into a computation operation with semantics corresponding to NumPy of Python; wherein the first operator is an operator to be subjected to dynamic plaintext computation.
6. The method of claim 1 or 5, wherein generating the computation operation of each operator in the new data flow graph comprises:
compiling a second operator in the new data flow graph into a computation operation with the corresponding semantics by using a Google-XLA compile framework; wherein the second operator is an operator to be subjected to static plaintext computation.
7. The method of claim 1 or 5, wherein generating the computation operation of each operator in the new data flow graph comprises:
determining an intersection between two or more indexes in the programming model as a base index;
compiling, optimizing and executing a third operator in the new data flow graph based on the base index; wherein the third operator is an operator with specific semantics in a computation primitive that implements the programming model.
8. A data processing apparatus based on privacy protection, the apparatus comprising:
the algorithm acquisition module is used for acquiring a machine learning algorithm to be trained;
the parameter conversion module is used for acquiring a plurality of groups of feature data needing to be called repeatedly from the machine learning algorithm and converting the acquired feature data into tuple variable parameters of a composite data structure; wherein the composite data structure comprises: an array, dictionary, or collection;
the data flow graph generating module is used for inputting the tuple variable parameters into a programming model so that the programming model converts the machine learning algorithm based on the tuple variable parameters and a preset data flow graph generating tool to obtain a data flow graph corresponding to the machine learning algorithm; the dataflow graph includes a series of operators;
the segmentation scheduling module is used for segmenting the dataflow graph into a plurality of subgraphs and scheduling the subgraphs to a target participant for execution;
and the compiling module is used for compiling the subgraphs into a new data flow graph and obtaining a privacy-preserving machine learning algorithm by generating the computation operation of each operator in the new data flow graph.
9. An electronic device, comprising: a processor and a storage device;
the storage device has stored thereon a computer program which, when executed by the processor, performs the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 7.
Priority Applications (1)

Application number: CN202011547030.6A — Priority date: 2020-12-24 — Filing date: 2020-12-24 — Title: Data processing method and device based on privacy protection — Granted publication: CN112287396B — Status: Active

Publications (2)

CN112287396A (application publication) — 2021-01-29
CN112287396B (granted publication) — 2021-03-30

Family

ID: 74425857

Country Status (1)

CN — CN112287396B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151575A1 (en) * 2018-11-13 2020-05-14 Teradata Us, Inc. Methods and techniques for deep learning at scale over very large distributed datasets
CN111831979A (en) * 2020-07-06 2020-10-27 北京瑞莱智慧科技有限公司 Method and device for analyzing data privacy protection protocol
CN111783124A (en) * 2020-07-07 2020-10-16 矩阵元技术(深圳)有限公司 Data processing method and device based on privacy protection and server
CN112016703A (en) * 2020-10-15 2020-12-01 北京瑞莱智慧科技有限公司 Conversion system and method of machine learning algorithm and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051543A (en) * 2021-04-01 2021-06-29 郭洪铜 Cloud service security verification method and cloud service system in big data environment

Also Published As

CN112287396B (en) — Publication date: 2021-03-30

Legal Events

Code: PB01 — Publication
Code: SE01 — Entry into force of request for substantive examination
Code: GR01 — Patent grant