CN115169594A - Privacy-preserving distributed machine learning debugging method and debugging system - Google Patents

Privacy-preserving distributed machine learning debugging method and debugging system

Info

Publication number
CN115169594A
CN115169594A
Authority
CN
China
Prior art keywords
debugging
training
server
metadata
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211100671.6A
Other languages
Chinese (zh)
Inventor
刘川意
段少明
何田雨
韩培义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211100671.6A
Publication of CN115169594A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/36 Preventing errors by testing or debugging software
    • G06F 11/362 Software debugging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention belongs to the field of distributed machine learning debugging and provides a privacy-preserving distributed machine learning debugging method and debugging system. The debugging method comprises the following steps: step S1: a data analyst builds a distributed machine learning pipeline; step S2: the clients train the model locally; step S3: the server receives the locally trained models and the computed debugging intermediate values; step S4: based on the debugging metadata collected by the server and the debugging intermediate values transmitted by each client, the server debugging module checks the current round of federated training according to a distributed machine learning debugging algorithm; step S5: after the federated training finishes, a debugging report of the training is output. The invention aims to solve the technical problems described in the background art.

Description

Privacy-preserving distributed machine learning debugging method and debugging system
Technical Field
The invention belongs to the field of distributed machine learning debugging, and in particular relates to a privacy-preserving distributed machine learning debugging method and debugging system.
Background
Regarding federated learning training, patent CN 114638377A provides a model training method, apparatus, and electronic device based on federated learning, wherein the method comprises: processing external intermediate data and local feature data with a key to obtain initial secret-state data, where the key is obtained from a central node and the external intermediate data is obtained from other training nodes; sending an initial secret-state difference value to the central node and receiving a screened index sent by the central node, where the initial secret-state data is used to determine the screened index; determining a secret-state gradient value using the updated secret-state difference value and its corresponding training subset; and sending the secret-state gradient value to the central node, receiving the target gradient value sent by the central node, and updating the current model with the target gradient value. However, this federated learning algorithm only considers how to train the model in a federated environment; it does not consider how to debug the model when training problems arise, nor how to improve the model's effectiveness.
Patent CN 111796811A provides a flow-control engine system supporting breakpoint debugging in federated learning, which includes a flow-graph generation module that generates a logic flow graph at the back end; a scheduling queue module that traverses the nodes of the logic flow graph; a running-container module that creates a task ID for each task and passes it to the lifecycle manager; a breakpoint management module that sets the state of the running container and triggers the breakpoint's pause and resume states; a lifecycle manager that manages the node running state in the container and controls the complete lifecycle of the container's run; and a context manager that saves the currently running intermediate results in the node state. With this flow-control engine system, execution pauses when it reaches a breakpoint and the current intermediate results can be inspected, which is convenient for federated learning modelers. However, this style of federated learning debugging can only pause training at a breakpoint so that a debugger directly inspects static data during model training; on the one hand it cannot automatically detect training problems, and on the other hand there is no way to analyze and trace a model's problems after training has finished.
Regarding non-independent and identically distributed (non-IID) data in distributed training, an embodiment of patent CN 114580651A discloses a federated learning method, apparatus, device, system, and computer-readable storage medium, where the method includes: first, each second device obtains data distribution information and sends it to the first device. Then, the first device receives the data distribution information sent by the multiple second devices participating in federated learning and selects a matching federated learning strategy accordingly. Next, the first device sends the parameter-reporting policy corresponding to that strategy to at least one of the second devices. A second device receiving the parameter-reporting policy obtains second gain information according to the policy and the current training sample, and the second gain information is used to obtain the second device's second model. The method reduces the interference caused by non-IID data and can obtain a better-performing second model. It reduces the interference of non-IID data on federated model training by acquiring the data distribution and selecting a suitable federated learning strategy; however, the problems arising in federated learning debugging go far beyond non-IID data, and this method cannot detect other problems in data preprocessing, feature engineering, or the training process.
Regarding machine learning debugging, patent CN 114503132A discloses a method, system, and computer-readable medium for debugging and profiling of machine learning model training. A machine learning analysis system receives data associated with the training of a machine learning model; the data is collected by a machine learning training cluster. The analysis system analyzes the data, detects one or more conditions associated with the training based at least in part on the analysis, and generates one or more alerts describing those conditions. This patent automatically detects problems and raises alarms by collecting relevant data during machine learning model training, but it cannot be applied directly to federated learning.
Regarding data collection, patent CN 114066314A provides a data tracking method, apparatus, server, and storage medium. The method includes: when a configuration-start operation is detected, receiving the index to be tracked and its index type as input by the user; displaying the main interactive interface corresponding to that index type; receiving early-warning information for the index entered by the user on the main interactive interface; acquiring multiple target data items according to the early-warning information; computing over the target data according to their corresponding target operation relationship to obtain an operation result; determining an early-warning trigger state from the result; and, when the trigger state indicates that an early warning is needed, tracking the index in the early-warning manner specified in the early-warning information. The method tracks the specified index through information configuration, improving tracking efficiency; it realizes tracking, collecting, and analyzing variables according to a user-input specification, but it is not designed for federated learning.
Disclosure of Invention
The invention aims to provide a privacy-preserving distributed machine learning debugging method and debugging system, so as to solve the technical problems described in the background art.
The invention is realized as follows. A privacy-preserving distributed machine learning debugging method comprises the following steps:
step S1: first, a data analyst builds a distributed machine learning pipeline and distributes it to all clients in the federation to be trained with their local data;
step S2: during local client training, the client's non-intrusive metadata collection module automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging; the client debugging module asynchronously reads the debugging metadata, computes intermediate values according to a distributed machine learning debugging algorithm, and, after a round of local training, transmits the intermediate values to the server along with the model;
step S3: the server receives the locally trained models and the computed debugging intermediate values; on the one hand it aggregates all client models using the federated learning algorithm and tests the effectiveness of the current round's global model, and on the other hand it collects debugging metadata during this process using the server non-intrusive metadata collection module;
step S4: based on the debugging metadata collected by the server and the debugging intermediate values transmitted by each client, the server debugging module checks the current round of federated training according to the distributed machine learning debugging algorithm; if a training problem is detected, a warning is issued to the data analyst through the debugging feedback module and the analyst debugs the distributed machine learning pipeline according to the warning information; if no training problem is detected, the next round of federated training begins;
step S5: after the federated training finishes, a debugging report of the training is output.
In a further technical scheme of the invention, the client non-intrusive metadata collection module in step S2 automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging therein through the following steps:
step S21: a code analysis stage;
step S22: a metadata acquisition stage.
In a further technical scheme of the invention, step S21 further comprises the steps of:
step S211: obtaining an abstract syntax tree from the source code;
step S212: traversing the abstract syntax tree, obtaining the program's function calls and variable information, and forming a data flow graph.
In a further technical scheme of the invention, step S22 further comprises the steps of:
step S221: identifying target data;
step S222: acquiring target data.
In a further technical scheme of the invention, step S211 specifically comprises representing the code through the structure of an abstract syntax tree and obtaining the abstract syntax tree object through an interface of the programming language.
In a further technical scheme of the invention, step S212 specifically comprises traversing the abstract syntax tree object through the corresponding interface; after the traversal, the function calls and variable information in the program are obtained, the variables and functions have input-output relationships, and the functions and variables are interleaved and connected according to these relationships to form a data flow graph.
In a further technical scheme of the invention, step S221 specifically comprises dividing the target data into two types: one type is the model parameters, inputs, and outputs closely related to the specific machine learning framework, termed framework-related metadata; the other type is customized by the model developer.
In a further technical scheme of the invention, step S222 specifically comprises: a variable has a unique variable name while the code runs; if the variable name is in the target metadata list, the variable is added to the metadata collection dictionary, and after multiple rounds of training the dictionary will have collected multiple rounds of the variable's data values.
Another aim of the invention is to provide a debugging system for the privacy-preserving distributed machine learning debugging method, the debugging system comprising a server, a plurality of clients connected to the server, and a debugging feedback module connected to the server.
In a further technical scheme of the invention, the server comprises a model aggregation module, a server non-intrusive metadata collection module connected to the model aggregation module, a server metadata store connected to the server non-intrusive metadata collection module, and a server debugging module connected to the server metadata store, with the debugging feedback module connected to the server debugging module; each client comprises a local training module connected to the model aggregation module, a client non-intrusive metadata collection module connected to the local training module, a client metadata store connected to the client non-intrusive metadata collection module, and a client debugging module connected respectively to the client metadata store and the server debugging module.
The beneficial effects of the invention are as follows: the method automatically detects training problems in privacy-preserving distributed machine learning training processes such as federated learning, helping the data analyst debug problems and improve model effectiveness; through the non-intrusive metadata collection method, no data collection code needs to be embedded in the machine learning training script; and, targeting common distributed training problems, a series of distributed machine learning debugging rules is provided for detecting and debugging a machine learning model during distributed training and improving model effectiveness.
Drawings
FIG. 1 is a block diagram of a system provided by an embodiment of the invention;
FIG. 2 is an overall flow diagram of the non-intrusive metadata collection provided by an embodiment of the present invention;
FIG. 3 is example code for deep learning provided by embodiments of the present invention;
FIG. 4 is an exemplary code data flow diagram provided by an embodiment of the present invention;
FIG. 5 is a distributed machine learning debugging diagram provided by an embodiment of the present invention.
Detailed Description
Data, as a new factor of production, has great value, and data security has become a major roadblock to releasing that value. Federated learning has been studied extensively because it can release the value of data while protecting data privacy. Federated learning is essentially a distributed machine learning framework: multiple client nodes cooperatively train a deep learning model under the coordination of a central server, the model is trained locally at each participant, and only model gradients or parameters are shared after local training, without sharing data, thereby realizing data value mining under privacy protection. However, federated learning, with its philosophy that data remains usable but invisible, brings new problems while protecting data: during machine learning training, when the model's test performance drops or training fails to converge, how to debug the machine learning under privacy protection becomes a challenge, since neither the training data nor the training process can be touched.
A data analyst trains machine learning tasks on a distributed machine learning training system such as federated learning, and the machine learning pipeline is distributed to the clients to be trained on real data. When a client performs local training, the non-intrusive metadata collection module parses the machine learning pipeline script, and training metadata (model weights, gradients, inputs, outputs, and so on) are obtained while the machine learning training script runs. The local client debugging module computes debugging intermediate values from the collected metadata, and these intermediate results are sent to the server for aggregation. When the server aggregates and tests the global model, its non-intrusive metadata collection module collects the metadata in this process (model weights or gradients), and the server debugging module performs debugging-problem detection on this round of distributed training based on the collected metadata and each client's intermediate values. If a training problem is found, a warning is issued to the data analyst, and the analyst debugs the machine learning pipeline according to the reported problem.
Fig. 1 shows a debugging system for the privacy-preserving distributed machine learning debugging method according to the present invention. The debugging system comprises a server, a plurality of clients connected to the server, and a debugging feedback module connected to the server.
The server comprises a model aggregation module, a server non-intrusive metadata collection module connected to the model aggregation module, a server metadata store connected to the server non-intrusive metadata collection module, and a server debugging module connected to the server metadata store, with the debugging feedback module connected to the server debugging module; each client comprises a local training module connected to the model aggregation module, a client non-intrusive metadata collection module connected to the local training module, a client metadata store connected to the client non-intrusive metadata collection module, and a client debugging module connected respectively to the client metadata store and the server debugging module.
A privacy-preserving distributed machine learning debugging method comprises the following steps:
Step S1: first, a data analyst builds a distributed machine learning pipeline and distributes it to all clients in the federation to be trained with their local data; the clients' local training process is entirely invisible to the data analyst.
Step S2: during local client training, the client non-intrusive metadata collection module automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging; the client debugging module asynchronously reads the debugging metadata, computes intermediate values according to the distributed machine learning debugging algorithm, and transmits the intermediate values to the server along with the model after a round of local training.
In step S2, the client non-intrusive metadata collection module automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging therein through the following steps:
step S21: a code analysis stage;
the step S21 further includes the steps of:
step S211: obtaining an abstract syntax tree according to a source code; the specific steps are that the codes are presented through the structure of the abstract syntax tree, and abstract syntax tree objects are obtained through an interface of a programming language.
Step S212: traversing the abstract syntax tree, acquiring function call and variable information of a program, and forming a data flow graph; the method specifically comprises the steps of traversing an abstract syntax tree object through a corresponding interface, obtaining function call and variable information in a program after traversing, wherein the variables and the functions have input and output relations, and connecting the functions and the variables in a staggered manner according to the relation to form a data flow graph.
Step S22: and a metadata acquisition stage.
The step S22 further includes the steps of:
step S221: identifying target data; the specific steps are that target data are divided into two types: one is the model parameters and inputs and outputs that are closely related to the specific framework of machine learning, becoming framework-related metadata; another type is customized by the model developer.
Step S222: target data is acquired. The method specifically comprises the steps that a variable has a unique variable name in the running process of the code, if the variable name is in a target metadata list, the variable is added into a metadata collection dictionary, and the dictionary collects multiple rounds of data values of variable data after multiple rounds of training.
And step S3: the server receives the models from local training and the calculated debugging intermediate values, on one hand, the client models are aggregated by using a federal learning algorithm, and the global model effect of the training in the current round is tested, and on the other hand, the server collects debugging metadata in the process by using a server non-invasive metadata collection model.
And step S4: and the server debugging module detects the federal training of the current round according to the distributed machine learning debugging algorithm based on debugging metadata collected by the server and debugging intermediate values transmitted by each client, if the training problem is detected, the debugging feedback module gives an alarm to a data analyst, the data analyst debugs the distributed machine learning pipeline according to the alarm information, and otherwise, the next round of federal training is started.
Step S5: and after the federal training is finished, outputting a debugging report of the training.
Take as an example the task in which a data analyst uses federated learning to train on tuberculosis gene data and predict, from a patient's gene data, whether the patient has tuberculosis:
1. The data analyst writes a deep learning training script using a deep learning framework such as PyTorch, then trains it on a federated learning system together with the gene data of multiple hospitals.
2. During training, the debugging feedback module raises alarms: it detects that the global model has an overfitting error, that several client models also have overfitting errors, and that the data distribution across participants has a non-IID (non-independent and identically distributed) error.
3. The data analyst stops the federated training, handles the detected problems, adjusts the machine learning pipeline, and runs the distributed training again.
Non-intrusive metadata collection. To let data analysts concentrate on machine learning without embedding any debugging-information collection code in the machine learning code, the invention provides a non-intrusive metadata collection method that automatically collects the target debugging metadata (such as model weights, gradients, inputs, and outputs) while the machine learning training script runs. The non-intrusive metadata collection method consists of two parts, code analysis and metadata acquisition: first the machine learning training code is analyzed, with the source code parsed by the syntax-analysis module to build an abstract syntax tree, which is traversed to generate the machine learning training data flow graph; the target metadata is then located in the data flow graph using the grammar of the corresponding machine learning framework; finally, the values of these metadata are collected while the program runs.
The overall flow of non-intrusive metadata collection is shown in Fig. 2.
In the code analysis phase:
1. Obtain the abstract syntax tree from the source code. Modern programming languages represent code through the structure of an abstract syntax tree during compilation and execution, and the abstract syntax tree object can be obtained through an interface of the programming language.
2. Traverse the abstract syntax tree, obtain the program's function calls and variable information, and form a data flow graph. The abstract syntax tree object can be traversed through the corresponding interface; after the traversal, the function calls and variable information in the program are obtained. The variables and functions have input-output relationships, and the functions and variables are interleaved and connected according to these relationships to form the data flow graph. Both steps are sketched below.
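A minimal Python sketch of these two code-analysis steps, assuming the training script is itself Python and using the standard library's ast module; the file name train_script.py and the DataFlowGraph/DataFlowBuilder names are illustrative assumptions, not identifiers from the patent:

```python
import ast

class DataFlowGraph:
    """Bipartite graph whose nodes are function calls and variables."""
    def __init__(self):
        self.edges = []                        # (source, target) pairs

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

class DataFlowBuilder(ast.NodeVisitor):
    """For each assignment whose right-hand side is a function call, record
    edges from the call's argument variables into the call, and from the
    call to the assigned variable."""
    def __init__(self, graph):
        self.graph = graph

    def visit_Assign(self, node):
        if isinstance(node.value, ast.Call):
            call_name = ast.unparse(node.value.func)   # e.g. "self.conv"
            for arg in node.value.args:                # inputs feed the call
                if isinstance(arg, ast.Name):
                    self.graph.add_edge(arg.id, call_name)
            for target in node.targets:               # the call feeds the output
                if isinstance(target, ast.Name):
                    self.graph.add_edge(call_name, target.id)
        self.generic_visit(node)

source = open("train_script.py").read()    # the machine learning pipeline script
tree = ast.parse(source)                   # step 1: abstract syntax tree
graph = DataFlowGraph()
DataFlowBuilder(graph).visit(tree)         # step 2: traversal -> data flow graph
```

Interleaving the call nodes and variable nodes through these edges yields exactly the function-variable connection described above.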
In the metadata acquisition phase:
1. Identify the target data. Target data falls into two categories: one is closely related to the specific machine learning framework, e.g., model parameters, inputs, and outputs, and is termed framework-related metadata; the other is metadata customized by the model developer, e.g., training accuracy. The following process identifies the names of the data to be collected within these two categories: (1) build a search list for the data flow graph according to a depth-first search algorithm; (2) in search-list order, judge whether each data node in the data flow graph is metadata the developer has designated for collection, or is the output data of a corresponding framework function, and if so add it to the target metadata list; (3) return the target metadata list.
2. Acquire the target data. A variable has a unique variable name while the code runs; if the variable name is in the target metadata list, the variable is added to the metadata collection dictionary, and after multiple rounds of training the dictionary will have collected multiple rounds of the variable's data values. Both steps of this phase are sketched below.
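A minimal Python sketch of this metadata acquisition phase, under assumed encodings: the data flow graph is an adjacency dictionary, producers maps each variable to the function call that produced it, and the concrete name sets and the record() helper are illustrative assumptions rather than an API from the patent:

```python
from collections import defaultdict

FRAMEWORK_FUNCS = {"conv", "fc", "loss_fn"}   # framework calls whose outputs matter
USER_SPECIFIED = {"output", "train_acc"}      # metadata designated by the developer

def identify_targets(adjacency, producers, start):
    """Step 1: (1) build the search list by depth-first search over the data
    flow graph; (2) keep developer-designated nodes and the outputs of
    framework functions; (3) return the target metadata list."""
    order, stack, seen = [], [start], set()
    while stack:                               # (1) depth-first search list
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(adjacency.get(node, []))
    return [n for n in order                   # (2) filter, (3) return
            if n in USER_SPECIFIED or producers.get(n) in FRAMEWORK_FUNCS]

metadata_store = defaultdict(list)             # variable name -> value per round

def record(name, value, targets):
    """Step 2: while the code runs, a variable whose unique name is in the
    target metadata list is appended to the metadata collection dictionary."""
    if name in targets:
        metadata_store[name].append(value)
```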
Take, as an example, collecting metadata from a convolutional neural network training script written in PyTorch:
As shown in Fig. 3, a simplest convolutional neural network comprises a convolutional layer, a fully connected layer, and an activation layer; the process in which data entering the neural network passes through these layers in a fixed order to produce an output is called forward propagation. Besides forward propagation, a loss function is needed to drive model training, and the loss function is customized by the model developer. A sketch of such a network appears below.
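The following PyTorch sketch shows such a network; the layer sizes, the 28×28 single-channel input, and the cross-entropy loss are illustrative assumptions consistent with the description, not values from the patent:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """One convolutional layer, one activation layer, one fully connected layer."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3)      # convolutional layer
        self.act = nn.ReLU()                            # activation layer
        self.fc = nn.Linear(8 * 26 * 26, num_classes)   # fully connected layer

    def forward(self, x):
        conv_out = self.act(self.conv(x))   # fixed-order forward propagation
        return self.fc(conv_out.flatten(1))

model = SimpleCNN()
loss_fn = nn.CrossEntropyLoss()             # loss function chosen by the developer
output = model(torch.randn(4, 1, 28, 28))   # forward propagation on a dummy batch
```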
In the code analysis phase:
A syntax tree is built for the source code of the neural network part, and the programming language's built-in interface is called to traverse the syntax tree and form a data flow graph; the data flow graph of this simple deep convolutional neural network is shown in Fig. 4. In Fig. 4, black blocks correspond to function calls in the code, white blocks correspond to variables, and the function calls are linked together through the variables to form the complete neural network workflow.
In the metadata acquisition phase:
The result of the convolution has an important impact on the model's training effect, so the model developer designates the convolutional-layer output and the neural network output as metadata to be inspected. The metadata identification process is started to determine the names of the data to acquire during training: a search list is built for the data flow graph according to a depth-first search algorithm, the data nodes in the data flow graph are traversed in list order, the output data node is found to be metadata designated by the developer, and the output variable's information is added to the target metadata list. Meanwhile, the convolutional layer is a user-designated target function, so its output data (which does not have a real variable name) is also added to the target data list. After the target metadata list is obtained by traversal, training of the convolutional neural network starts, and whenever the convolutional layer runs and produces an output during training, the program adds the corresponding data to the corresponding position of the metadata collection dictionary, as sketched below.
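Continuing the SimpleCNN sketch above, PyTorch forward hooks are one plausible way to realize this run-time collection without modifying the training script; the hook mechanism and all names here are assumptions, not necessarily the patent's exact implementation:

```python
# metadata_store has one slot per entry of the target metadata list
metadata_store = {"conv_out": [], "output": []}

def make_hook(name):
    def hook(module, inputs, output):
        # append this run's value at the corresponding dictionary position
        metadata_store[name].append(output.detach().cpu())
    return hook

model.conv.register_forward_hook(make_hook("conv_out"))  # convolutional output
model.register_forward_hook(make_hook("output"))         # network output
_ = model(torch.randn(4, 1, 28, 28))   # each forward pass adds one entry per key
```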
In distributed machine learning training, the machine learning model is trained with private data at each client, and the model parameters are then transmitted to the server for aggregation. To effectively debug the various problems that arise in the machine learning pipeline during distributed training, the method mainly comprises a client debugging module and a server debugging module. Restricted by privacy requirements, the debugging metadata collected at a client cannot be gathered directly at the server for analysis; therefore the local debugging metadata must be analyzed at the client, a set of intermediate values computed, and these privacy-free intermediate values sent to the server for training-problem detection. The server debugging module detects training problems of the global model by combining the server's debugging metadata with the clients' debugging intermediate values; example problems are shown in Table 1. When a training problem is detected, it is fed back to the data analyst.
The distributed debugging method is shown in Fig. 5. The client debugging module mainly analyzes the local debugging metadata and computes privacy-free debugging intermediate values, such as the local model's training accuracy and training loss value. These debugging intermediate values are sent to the server; the server debugging module contains built-in and pre-customized distributed debugging rules and computes, from the global model's debugging metadata and the intermediate values transmitted by the clients, whether any debugging problem in Table 1 is matched. If a debugging problem is matched, it is fed back to the data analyst through the debugging feedback module, and the data analyst then adjusts the machine learning pipeline according to the reported problem. A sketch of such a rule engine follows.
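A minimal Python sketch of the rule engine; the two rules, their thresholds, and the dictionary message formats are illustrative assumptions:

```python
def overfitting_rule(global_md, client_vals, eps=0.05):
    """Built-in rule: training accuracy exceeds test accuracy by a threshold."""
    if global_md["train_acc"] - global_md["test_acc"] > eps:
        return "global model overfitting"

def non_iid_rule(global_md, client_vals, spread=0.3):
    """Built-in rule: widely diverging client accuracies hint at non-IID data."""
    accs = [c["acc"] for c in client_vals]
    if max(accs) - min(accs) > spread:
        return "participant data may be non-IID"

def run_debug_rules(rules, global_md, client_vals):
    """Evaluate every built-in or customized rule; matched problems are
    forwarded to the debugging feedback module."""
    return [msg for rule in rules
            if (msg := rule(global_md, client_vals)) is not None]

warnings = run_debug_rules(
    [overfitting_rule, non_iid_rule],
    {"train_acc": 0.97, "test_acc": 0.78},     # server-side debugging metadata
    [{"acc": 0.99}, {"acc": 0.52}],            # clients' intermediate values
)
```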
TABLE 1 Examples of debugging problems (rendered as an image in the original publication)
Take the debugging of the overfitting problem during distributed training as an example:
the overfitting of the machine learning model means that the training error is much lower than the testing error, and the formalization is expressed as:
Figure 118888DEST_PATH_IMAGE002
or
Figure 836309DEST_PATH_IMAGE003
. Wherein
Figure 903622DEST_PATH_IMAGE004
The value of the training loss is represented,
Figure 655677DEST_PATH_IMAGE005
representing training accuracy, i.e. the training loss value is much lower than the test loss value, or the training test accuracy exceeds the test accuracy by a certain threshold
Figure 212561DEST_PATH_IMAGE006
. In the distributed training process, the model is trained at each client, and the model is tested at the server, so that the working flow of the distributed debugging method provided by the invention is as follows:
1. The client debugging module of each client $k$ computes the local training accuracy $Acc_{train}^{(k)}$ and the local training loss value $L_{train}^{(k)}$, counts the number of local data samples $n_k$, and sends these intermediate results to the server;
2. The server computes the model's test loss value $L_{test}$ and test accuracy $Acc_{test}$; meanwhile, the training loss value and training accuracy are computed from the intermediate values transmitted by the clients, weighted by sample count: $L_{train} = \sum_k \frac{n_k}{N} L_{train}^{(k)}$ and $Acc_{train} = \sum_k \frac{n_k}{N} Acc_{train}^{(k)}$, where $N = \sum_k n_k$;
3. The server debugging module checks whether the current round of training is overfitting according to the overfitting rule: $L_{train} \ll L_{test}$ or $Acc_{train} - Acc_{test} > \epsilon$. If the condition is satisfied, the global model is overfitted and this is reported to the data analyst; otherwise, the next round of distributed training begins.
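A worked sketch of this three-step check, with the clients' intermediate values aggregated by sample count as above; the threshold eps, the margin standing in for "much lower", and the sample numbers are illustrative assumptions:

```python
def detect_overfitting(client_vals, test_loss, test_acc, eps=0.05):
    """Steps 1-3: aggregate the per-client intermediate values by sample
    count, then apply the overfitting rule."""
    n_total = sum(c["n"] for c in client_vals)
    train_loss = sum(c["loss"] * c["n"] for c in client_vals) / n_total
    train_acc = sum(c["acc"] * c["n"] for c in client_vals) / n_total
    # L_train much lower than L_test, or Acc_train exceeds Acc_test by eps
    if test_loss - train_loss > eps or train_acc - test_acc > eps:
        return "global model overfitting: report to the data analyst"
    return None                     # no problem: enter the next training round

clients = [{"loss": 0.08, "acc": 0.97, "n": 120},   # step-1 values from client 1
           {"loss": 0.10, "acc": 0.95, "n": 80}]    # step-1 values from client 2
print(detect_overfitting(clients, test_loss=0.45, test_acc=0.78))
```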
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (10)

1. A privacy-preserving distributed machine learning debugging method, characterized by comprising the following steps:
step S1: first, a data analyst builds a distributed machine learning pipeline and distributes it to all clients in the federation to be trained with their local data;
step S2: during local client training, a client non-intrusive metadata collection module automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging; a client debugging module asynchronously reads the debugging metadata, computes intermediate values according to a distributed machine learning debugging algorithm, and, after a round of local training finishes, transmits the intermediate values to the server along with the model;
step S3: the server receives the locally trained models and the computed debugging intermediate values; on the one hand it aggregates the client models using a federated learning algorithm and tests the effectiveness of the current round's global model, and on the other hand it collects debugging metadata during this process using a server non-intrusive metadata collection module;
step S4: based on the debugging metadata collected by the server and the debugging intermediate values transmitted by each client, a server debugging module checks the current round of federated training according to the distributed machine learning debugging algorithm; if a training problem is detected, a debugging feedback module warns the data analyst, who debugs the distributed machine learning pipeline according to the warning information; if no training problem is detected, the next round of federated training begins;
step S5: after the federated training finishes, outputting a debugging report of the training.
2. The debugging method of claim 1, wherein in step S2 the client non-intrusive metadata collection module automatically parses the distributed machine learning pipeline script and collects the metadata related to debugging therein through the following steps:
step S21: a code analysis stage;
step S22: a metadata acquisition stage.
3. The debugging method of claim 2, wherein step S21 further comprises the steps of:
step S211: obtaining an abstract syntax tree from the source code;
step S212: traversing the abstract syntax tree, obtaining the program's function calls and variable information, and forming a data flow graph.
4. The debugging method of claim 2, wherein step S22 further comprises the steps of:
step S221: identifying target data;
step S222: acquiring target data.
5. The debugging method of claim 3, wherein step S211 specifically comprises representing the code through the structure of an abstract syntax tree and obtaining the abstract syntax tree object through an interface of the programming language.
6. The debugging method of claim 5, wherein step S212 specifically comprises traversing the abstract syntax tree object through the corresponding interface; after the traversal, the function calls and variable information in the program are obtained, the variables and functions have input-output relationships, and the functions and variables are interleaved and connected according to these relationships to form the data flow graph.
7. The debugging method of claim 4, wherein step S221 specifically comprises dividing the target data into two types: one type is the model parameters, inputs, and outputs closely related to the specific machine learning framework, termed framework-related metadata; the other type is customized by the model developer.
8. The debugging method of claim 7, wherein step S222 specifically comprises: a variable has a unique variable name while the code runs; if the variable name is in the target metadata list, the variable is added to the metadata collection dictionary, and after multiple rounds of training the dictionary will have collected multiple rounds of the variable's data values.
9. A debugging system for the debugging method of any one of claims 1 to 8, characterized in that the debugging system comprises a server, a plurality of clients connected to the server, and a debugging feedback module connected to the server.
10. The debugging system of claim 9, wherein the server comprises a model aggregation module, a server non-intrusive metadata collection module connected to the model aggregation module, a server metadata store connected to the server non-intrusive metadata collection module, and a server debugging module connected to the server metadata store, the debugging feedback module being connected to the server debugging module; and each client comprises a local training module connected to the model aggregation module, a client non-intrusive metadata collection module connected to the local training module, a client metadata store connected to the client non-intrusive metadata collection module, and a client debugging module connected respectively to the client metadata store and the server debugging module.
CN202211100671.6A 2022-09-09 2022-09-09 Privacy-preserving distributed machine learning debugging method and debugging system Pending CN115169594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211100671.6A CN115169594A (en) Privacy-preserving distributed machine learning debugging method and debugging system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211100671.6A CN115169594A (en) Privacy-preserving distributed machine learning debugging method and debugging system

Publications (1)

Publication Number Publication Date
CN115169594A true CN115169594A (en) 2022-10-11

Family

ID=83482416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211100671.6A Pending CN115169594A (en) Privacy-preserving distributed machine learning debugging method and debugging system

Country Status (1)

Country Link
CN (1) CN115169594A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204787A (en) * 2021-05-06 2021-08-03 广州大学 Block chain-based federated learning privacy protection method, system, device and medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204787A (en) * 2021-05-06 2021-08-03 广州大学 Block chain-based federated learning privacy protection method, system, device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANRAN LI et al.: "Efficient Federated-Learning Model Debugging", 2021 IEEE 37th International Conference on Data Engineering *
LUO Pengshuai: "Research on Key Technologies of Neural Network Model Debugging under Privacy Protection", China Masters' Theses Full-text Database, Information Science and Technology Series *

Similar Documents

Publication Publication Date Title
Di Nucci et al. A developer centered bug prediction model
CN110008288A (en) The construction method in the knowledge mapping library for Analysis of Network Malfunction and its application
CN105095048B (en) A kind of monitoring system alarm association processing method based on business rule
CN111985561A (en) Fault diagnosis method and system for intelligent electric meter and electronic device
CN109800127A (en) A kind of system fault diagnosis intelligence O&M method and system based on machine learning
CN107111540A (en) Dynamic telemetry message is dissected and adjusted
CN111162949A (en) Interface monitoring method based on Java byte code embedding technology
CN111782460A (en) Large-scale log data anomaly detection method and device and storage medium
Gutiérrez‐Madroñal et al. Evolutionary mutation testing for IoT with recorded and generated events
CN116681250A (en) Building engineering progress supervisory systems based on artificial intelligence
CN110020687A (en) Abnormal behaviour analysis method and device based on operator's Situation Awareness portrait
CN110011990A (en) Intranet security threatens intelligent analysis method
CN115640159A (en) Micro-service fault diagnosis method and system
CN107111609A (en) Lexical analyzer for neural language performance identifying system
CN115801369A (en) Data processing method and server based on cloud computing
CN116996325A (en) Network security detection method and system based on cloud computing
Pan et al. Refactoring packages of object–oriented software using genetic algorithm based community detection technique
CN115145751A (en) Method, device, equipment and storage medium for positioning fault root cause of micro-service system
CN115169594A (en) Privacy-preserving distributed machine learning debugging method and debugging system
CN115643153A (en) Alarm correlation analysis method based on graph neural network
CN111221704B (en) Method and system for determining running state of office management application system
JI et al. Log Anomaly Detection Through GPT-2 for Large Scale Systems
CN113242213A (en) Power communication backbone network node vulnerability diagnosis method
CN110780660A (en) Tobacco production industry control system fault diagnosis method based on production state
Khoshgoftaar et al. Data mining of software development databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221011