WO2023216900A1 - Method, apparatus, device and storage medium for model performance evaluation - Google Patents

Method, apparatus, device and storage medium for model performance evaluation

Info

Publication number
WO2023216900A1
WO2023216900A1 · PCT/CN2023/091156 · CN2023091156W
Authority
WO
WIPO (PCT)
Prior art keywords
category
data samples
perturbation
predicted
group
Prior art date
Application number
PCT/CN2023/091156
Other languages
English (en)
French (fr)
Inventor
孙建凯
杨鑫
王崇
解浚源
吴迪
Original Assignee
北京字节跳动网络技术有限公司
脸萌有限公司
Priority date
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 and 脸萌有限公司
Publication of WO2023216900A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning

Definitions

  • Example embodiments of the present disclosure relate generally to the field of computers, and in particular to methods, apparatus, devices and computer-readable storage media for model performance evaluation.
  • Federated learning can achieve performance consistent with traditional machine learning algorithms in an encrypted environment without the data leaving the local node.
  • Federated learning refers to using the data of each node to achieve joint modeling and improve the effect of machine learning models on the basis of ensuring data privacy and security.
  • Federated learning allows each node's data to remain local, thereby achieving data protection. In federated learning, solutions that better protect data privacy are desirable, including protecting the privacy of the label data corresponding to data samples.
  • a scheme for model performance evaluation is provided.
  • In a first aspect of the present disclosure, a method for model performance evaluation is provided. The method includes, at a client node, determining a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores, output by the machine learning model for the plurality of data samples, with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples are predicted to belong to the first category or the second category.
  • the method also includes determining values of a plurality of metric parameters related to predetermined performance indicators of the machine learning model based on differences between the plurality of predicted classification results and the plurality of true value classification results corresponding to the plurality of data samples.
  • the method also includes applying perturbations to the values of multiple metric parameters to obtain perturbation values of the multiple metric parameters.
  • the method also includes sending the perturbation values of the plurality of metric parameters to the service node.
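The client-side steps above (comparing prediction scores against the score threshold and counting the classification outcomes) can be sketched in Python. This is an illustrative sketch, not code from the disclosure: the function name, the label encoding (1 for the first category, 0 for the second) and the strict greater-than comparison are assumptions.

```python
def confusion_counts(scores, labels, threshold):
    """Return (FP, FN, TP, TN) for binary predictions at `threshold`.

    `labels` uses 1 for the first category (positive) and 0 for the
    second category (negative), matching the label encoding described
    in this disclosure.
    """
    fp = fn = tp = tn = 0
    for score, label in zip(scores, labels):
        # Scores above the threshold are predicted as the first category.
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return fp, fn, tp, tn
```

These four counts are the metric parameter values that are subsequently perturbed and sent to the service node.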
  • In a second aspect of the present disclosure, a method for model performance evaluation is provided. The method includes receiving, at the service node, perturbation values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model from at least one group of client nodes, respectively.
  • the method further includes, for each group in the at least one group, aggregating the perturbation values of the plurality of metric parameters from the client nodes of the group, metric parameter by metric parameter, to obtain aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
  • the method further includes determining a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
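The server-side aggregation above can be sketched as follows, under the assumption that each client reports a perturbed (FP, FN, TP, TN) tuple and that accuracy (ACC) is the predetermined performance indicator; the function names and the placement of the ACC formula are illustrative, not from the disclosure.

```python
def aggregate_by_metric(perturbed_tuples):
    """Sum a group's client tuples metric-parameter by metric-parameter.

    Zero-mean noise added independently at each client tends to cancel
    in the sum, so the aggregate approximates the group's true totals.
    """
    return tuple(sum(values) for values in zip(*perturbed_tuples))

def accuracy_from_aggregate(fp, fn, tp, tn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (fp + fn + tp + tn)
```

For example, aggregating three client tuples and passing the result to `accuracy_from_aggregate` yields the group-level accuracy estimate.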
  • In a third aspect of the present disclosure, an apparatus for model performance evaluation is provided. The apparatus includes a classification determination module configured to determine a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores, output by the machine learning model for the plurality of data samples, with a score threshold.
  • the plurality of prediction classification results respectively indicate that the plurality of data samples are predicted to belong to the first category or the second category.
  • the device further includes a metric parameter determination module configured to determine values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples.
  • the device also includes a perturbation module configured to apply perturbations to the values of multiple metric parameters to obtain perturbation values of the multiple metric parameters.
  • the device also includes a perturbation value sending module configured to send perturbation values of a plurality of metric parameters to the service node.
  • In a fourth aspect of the present disclosure, an apparatus for model performance evaluation is provided. The apparatus includes a perturbation value receiving module configured to respectively receive perturbation values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model from at least one group of client nodes.
  • the device further includes an aggregation module configured to, for each group in the at least one group, aggregate the perturbation values of the plurality of metric parameters from the client nodes of the group, metric parameter by metric parameter, to obtain aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
  • the device also includes an indicator determination module configured to determine a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
  • In a fifth aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit.
  • the instructions when executed by at least one processing unit, cause the device to perform the method of the first aspect.
  • In a sixth aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit.
  • the instructions when executed by at least one processing unit, cause the device to perform the method of the second aspect.
  • In a seventh aspect of the present disclosure, a computer-readable storage medium is provided.
  • a computer program is stored on the medium which, when executed by a processor, implements the method of the first aspect.
  • In an eighth aspect of the present disclosure, a computer-readable storage medium is provided.
  • a computer program is stored on the medium which, when executed by a processor, implements the method of the second aspect.
  • Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be applied;
  • Figure 2 illustrates a schematic diagram of a signaling flow for model performance evaluation in accordance with some embodiments of the present disclosure;
  • Figure 3 illustrates a flowchart of a process of applying a perturbation in accordance with some embodiments of the present disclosure;
  • Figure 4 shows a schematic diagram of an ROC curve according to some embodiments of the present disclosure;
  • Figure 5 illustrates a flowchart of a process for model performance evaluation according to some embodiments of the present disclosure;
  • Figure 6 illustrates a flowchart of another process for model performance evaluation in accordance with some embodiments of the present disclosure;
  • Figure 7 shows a block diagram of an apparatus for model performance evaluation according to some embodiments of the present disclosure;
  • Figure 8 shows a block diagram of another apparatus for model performance evaluation according to some embodiments of the present disclosure;
  • Figure 9 illustrates a block diagram of a computing device/system capable of implementing one or more embodiments of the present disclosure.
  • a prompt message is sent to the user to clearly remind the user that the requested operation will require the acquisition and use of the user's personal information. Based on the prompt message, users can thus autonomously choose whether to provide personal information to the software or hardware, such as electronic devices, applications, servers or storage media, that performs the operations of the technical solution of the present disclosure.
  • the method of sending prompt information to the user can be, for example, a pop-up window, and the prompt information can be presented in the form of text in the pop-up window.
  • the pop-up window can also host a selection control for the user to choose "agree” or "disagree” to provide personal information to the electronic device.
  • a "model" can learn the association between corresponding inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input. Model generation can be based on machine learning techniques. Deep learning is a machine learning algorithm that uses multiple layers of processing units to process inputs and provide corresponding outputs. Neural network models are an example of deep learning-based models. Herein, a "model" may also be called a "machine learning model," "learning model," "machine learning network," or "learning network," and these terms are used interchangeably herein.
  • a "neural network” is a machine learning network based on deep learning. Neural networks are capable of processing inputs and providing corresponding outputs, and typically include an input layer and an output layer and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications often include many hidden layers, thereby increasing the depth of the network.
  • the layers of a neural network are connected in sequence such that the output of the previous layer is provided as the input of the subsequent layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network.
  • Each layer of a neural network consists of one or more nodes (also called processing nodes or neurons), each processing input from the previous layer.
  • machine learning can roughly include three stages, namely a training stage, a testing stage, and an application stage (also called the inference stage).
  • in the training stage, a given model can be trained using a large amount of training data, iteratively updating parameter values until the model can obtain, from the training data, consistent inferences that meet the expected goals.
  • the model can be thought of as being able to learn the association between inputs and outputs (also known as input-to-output mapping) from the training data.
  • the parameter values of the trained model are determined.
  • in the testing stage, test inputs are applied to the trained model to test whether the model can provide the correct output, thereby determining the performance of the model.
  • in the application stage, the model can be used to process an actual input and determine the corresponding output based on the parameter values obtained through training.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented.
  • Client nodes 110-1, ..., 110-k, ..., 110-N can maintain respective local data sets 112-1, ..., 112-k, ..., 112-N.
  • client nodes 110-1, ..., 110-k, ..., 110-N may be collectively or individually referred to as client nodes 110
  • local data sets 112-1, ..., 112-k, ..., 112-N may be referred to collectively or individually as local data sets 112.
  • the client node 110 and/or the service node 120 may be implemented at a terminal device or a server.
  • the terminal device can be any type of mobile terminal, fixed terminal or portable terminal, including a mobile phone, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, personal communication system (PCS) device, personal navigation device, personal digital assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio receiver, e-book device, gaming device, or any combination of the foregoing, including accessories and peripherals for these devices or any combination thereof.
  • the terminal device is also able to support any type of interface to the user (such as "wearable" circuitry, etc.).
  • Servers are various types of computing systems/servers capable of providing computing capabilities, including but not limited to mainframes, edge computing nodes, computing devices in cloud environments, and so on.
  • client nodes refer to nodes that provide part of the training data for machine learning models. Client nodes may also be called clients, terminal nodes, terminal devices, user equipment, etc.
  • a service node refers to a node that aggregates training results at client nodes.
  • N client nodes 110 jointly participate in training the machine learning model 130 and aggregate the intermediate results of the training to the service node 120, so that the service node 120 updates the parameter set of the machine learning model 130.
  • the union of the local data sets of these client nodes 110 constitutes the complete training data set for the machine learning model 130. Therefore, according to the federated learning mechanism, the service node 120 will generate a global machine learning model 130.
  • the local data set 112 at the client node 110 may include data samples and ground truth labels.
  • Figure 1 specifically illustrates a local data set 112-k at a client node 110-k, which includes a set of data samples and a set of ground truth labels.
  • Each data sample 102 may be annotated with a corresponding ground truth label 105 .
  • Data sample 102 may correspond to an input to machine learning model 130, with ground truth label 105 indicating the true output of data sample 102.
  • Ground truth labels are an important part of supervised machine learning.
  • the machine learning model 130 may be built based on various machine learning or deep learning model architectures, and may be configured to implement various prediction tasks, such as various classification tasks, recommendation tasks, and so on.
  • the machine learning model 130 may also be called a prediction model, a recommendation model, a classification model, etc.
  • Data samples 102 may include input information related to a specific task of the machine learning model 130, with truth labels 105 related to the desired output of the task.
  • the machine learning model 130 may be configured to predict whether an input data sample belongs to the first category or the second category, and the ground truth label marks whether the data sample actually belongs to the first category or the second category.
  • Many practical applications can be formulated as such binary classification tasks, for example, predicting in a recommendation task whether a recommended item is converted (for example, clicked, purchased, registered, or another demand behavior).
  • Figure 1 only shows an example federated learning environment. Depending on the federated learning algorithm and actual application needs, the environment can also be different.
  • in addition to serving as a central node, the service node 120 may also serve as a client node, providing part of the data for model training, model performance evaluation, etc. Embodiments of the present disclosure are not limited in this regard.
  • the client node 110 does not need to disclose local data samples or label data, but sends gradient data calculated based on local training data to the service node 120, so that the service node 120 can update the parameter set of the machine learning model 130.
  • the performance of the machine learning model 130 may be measured by one or more performance metrics. Different performance indicators can measure the difference between the predicted output given by the machine learning model 130 for the data sample set and the real output indicated by the true value label set from different angles. Generally, if the difference between the predicted output and the real output given by the machine learning model 130 is small, it means that the performance of the machine learning model is better. It can be seen that it is usually necessary to determine the performance index of the machine learning model 130 based on the set of ground-truth labels of the data samples.
  • a model performance evaluation solution which can protect label data local to a client node.
  • multiple predicted classification results corresponding to the multiple data samples are determined by comparing multiple predicted scores output by the machine learning model for the multiple data samples with score thresholds received from the service node.
  • the client node determines values of a plurality of measurement parameters related to predetermined performance indicators of the machine learning model based on differences between the plurality of predicted classification results and the plurality of true value classification results corresponding to the plurality of data samples.
  • the client node applies perturbations to the values of the multiple metric parameters to obtain perturbation values for the multiple metric parameters.
  • the client node sends the perturbation values of multiple metric parameters to the service node.
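This excerpt does not specify which noise distribution the perturbation uses (the perturbation process is described with reference to Figure 3). As an assumption, the sketch below uses the Laplace mechanism, a common choice for differentially-private counting queries; the function names and the 1/epsilon scale are illustrative, not from the disclosure.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sampling of a zero-mean Laplace distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_counts(counts, epsilon=1.0):
    """Apply independent Laplace noise to each metric parameter value.

    A counting query changes by at most 1 when a single sample changes,
    so noise with scale 1/epsilon is a standard choice for epsilon-DP.
    """
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale) for c in counts]
```

The client would then send `perturb_counts((fp, fn, tp, tn))` to the service node instead of the raw counts.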
  • the service node determines a predetermined performance indicator based on the perturbation values of the plurality of metric parameters received from the respective client nodes.
  • each client node does not need to expose its local ground-truth label set, nor does it need to expose its local predicted classification results (i.e., predicted label information).
  • the service node can still calculate the value of the performance indicator based on feedback information from the client nodes (for example, the perturbation values of the multiple metric parameters). In this way, the performance indicator of the machine learning model is determined while achieving privacy protection of the client nodes' local label data.
  • FIG. 2 shows a schematic diagram of signaling flow 200 for model performance evaluation in accordance with some embodiments of the present disclosure.
  • Signaling flow 200 involves service node 120 and multiple client node groups 202-1, 202-2, ... 202-L in environment 100.
  • the client node groups 202-1, 202-2, ..., 202-L are collectively or individually referred to herein as client node groups 202, where L is an integer greater than or equal to 1.
  • Client node group 202 may include multiple client nodes 110 .
  • client node group 202-1 may include client nodes 110-1, 110-2, ..., 110-J, where J is an integer greater than or equal to 1 and less than or equal to N. It should be understood that the signaling flow 200 may involve any number of service nodes 120 and any number of client node groups 202.
  • each client node group 202 may include any number of client nodes 110.
  • the number of client nodes 110 included in each client node group 202 may be the same or different.
  • the N client nodes 110 may be divided evenly or approximately evenly into L client node groups 202, where each client node group 202 includes N/L (rounded to an integer) client nodes 110.
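The even or approximately even division of N client nodes into L groups could, for instance, be done round-robin; this is an illustrative sketch under that assumption, not the disclosure's prescribed method.

```python
def split_into_groups(client_ids, num_groups):
    """Partition client ids into groups of equal or near-equal size.

    Round-robin assignment: client i goes to group i mod num_groups,
    so group sizes differ by at most one.
    """
    return [client_ids[i::num_groups] for i in range(num_groups)]
```

Each resulting group would then be associated with one of the L score thresholds generated by the service node.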
  • the machine learning model 130 to be evaluated may be a global machine learning model determined based on the training process of federated learning.
  • the client node 110 and the service node 120 participate in the training process of the machine learning model 130 .
  • the machine learning model 130 may also be a model obtained in any other manner, and the client node 110 and the service node 120 may not participate in the training process of the machine learning model 130. The scope of the present disclosure is not limited in this regard.
  • service node 120 sends (not shown) machine learning model 130 to client nodes 110 in respective client node groups 202 . After receiving the machine learning model 130, each client node 110 may perform a subsequent evaluation process based on the machine learning model 130. In some embodiments, the machine learning model 130 to be evaluated may also be provided to the client node 110 in any other suitable manner.
  • the service node 120 may send multiple score thresholds to the client nodes 110 in at least one client node group 202 respectively.
  • the service node 120 may randomly generate L score thresholds and send the L score thresholds to each client node 110 of the L client node groups 202 respectively.
  • Each score threshold is a value between 0 and 1.
  • the value of L (ie, the number of score thresholds or the number of client node groups 202 ) may be predetermined by the service node 120 .
  • the value of L may be determined based on the number of client nodes 110 .
  • the value of L may also be determined according to the type of predetermined performance indicator to be determined by the service node 120 . For example, if the predetermined performance index to be determined by the service node 120 is the accuracy rate (ACC) of the prediction result, the service node 120 may determine the value of L to be 1.
  • if the predetermined performance indicator to be determined by the service node 120 is the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, the service node 120 may determine the value of L as an integer greater than 1. It should be understood that when the predetermined performance indicator is ACC or the AUC of the ROC curve, the value of L can also be determined as another appropriate integer value. Embodiments of the present disclosure are not limited in this respect.
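For the AUC case, each group's score threshold yields one ROC operating point from that group's aggregate (FP, FN, TP, TN). The exact computation is not given in this excerpt; the trapezoidal-rule sketch below is an assumption, with illustrative names, and it assumes each group's aggregate contains both positive and negative samples.

```python
def auc_from_group_aggregates(aggregates):
    """Approximate AUC from per-group (FP, FN, TP, TN) aggregate values.

    Each group's aggregate, obtained at that group's score threshold,
    yields one ROC point (FPR, TPR); the curve is anchored at (0, 0)
    and (1, 1) and integrated with the trapezoidal rule.
    """
    points = [(0.0, 0.0), (1.0, 1.0)]
    for fp, fn, tp, tn in aggregates:
        fpr = fp / (fp + tn)  # false positive rate at this threshold
        tpr = tp / (tp + fn)  # true positive rate at this threshold
        points.append((fpr, tpr))
    points.sort()
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Using more groups (a larger L) adds more ROC points and tightens the area approximation, which is consistent with choosing L greater than 1 for AUC.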
  • service node 120 may send (205-1) the first score threshold to the client nodes 110 in its associated client node group 202-1. Similarly, the service node 120 may send (205-2/.../205-L) the second/.../L-th score threshold to the client nodes 110 in its associated client node groups 202-2/.../202-L.
  • the client nodes 110 in each client node group 202 receive (210-1/210-2/.../210-L) their respective score thresholds.
  • the client node 110 can apply each data sample 102 to the machine learning model 130 as an input to the model, and obtain the prediction score output by the machine learning model 130.
  • the client node 110 determines (215-1) multiple predicted classification results corresponding to the multiple data samples 102 by comparing the multiple predicted scores output by the machine learning model 130 for the multiple data samples 102 with the score threshold.
  • the plurality of prediction classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category.
  • Each prediction score may indicate a predicted probability that the corresponding data sample 102 belongs to the first category or the second category. The two categories can be configured according to actual task needs.
  • the value range of the prediction score output by the machine learning model 130 can be set arbitrarily.
  • the prediction score can be a value in a continuous value interval (for example, a value between 0 and 1), or it can be one of multiple discrete values (for example, one of the discrete values 0, 1, 2, 3, 4, 5).
  • a higher prediction score may indicate that the data sample 102 has a greater predicted probability of belonging to the first category and a smaller predicted probability of belonging to the second category.
  • the opposite setting is also possible.
  • a higher prediction score may indicate a greater prediction probability that the data sample 102 belongs to the second category, and a smaller prediction probability that the data sample 102 belongs to the first category.
  • if the prediction score output by the machine learning model 130 for a data sample 102 exceeds the score threshold, the client node 110 may determine the predicted classification result corresponding to the data sample 102 as indicating that the data sample belongs to the first category. Conversely, if the prediction score of the machine learning model 130 for the data sample 102 does not exceed the score threshold, the client node 110 may determine the predicted classification result corresponding to the data sample 102 as indicating that the data sample belongs to the second category.
  • each data sample 102 has a ground truth label 105 .
  • the truth label 105 is used to label whether the corresponding data sample 102 belongs to the first category or the second category.
  • data samples belonging to the first category are sometimes called positive samples or positive examples, and data samples belonging to the second category are sometimes called negative samples or negative examples.
  • each truth label 105 may have one of two values, indicating the first category or the second category respectively.
  • the value of the true value label 105 corresponding to the first category may be set to “1”, which indicates that the corresponding data sample 102 belongs to the first category and is a positive sample.
  • the value of the true value label 105 corresponding to the second category can be set to "0", which indicates that the corresponding data sample 102 belongs to the second category and is a negative sample.
  • the first category and the second category can be any categories in the binary classification problem. Taking the two-classification problem of determining whether the content of an image is a cat as an example, the first category can indicate that the content of the image is a cat category, while the second category can indicate that the content of the image is a non-cat category. Taking the evaluation of the quality of an item as an example, the first category can mean that the quality of the item meets the standard, while the second category can mean that the quality of the item does not meet the standard.
  • the binary classification problems listed above are only exemplary, and the model performance evaluation method described herein is applicable to all types of binary classification problems. Embodiments of the present disclosure are not limited in this regard. In some example embodiments below, for convenience of discussion, image classification is mainly used as an example for explanation, but it should be understood that this does not imply that those embodiments can only be applied to that particular binary classification problem.
  • the client node 110 in the client node group 202-1 determines (220-1), based on the differences between the multiple predicted classification results and the multiple ground-truth classification results corresponding to the multiple data samples 102, the values of multiple metric parameters related to the predetermined performance indicator. The multiple ground-truth classification results may be respectively labeled by the multiple ground-truth labels of the multiple data samples 102, indicating whether each of the data samples 102 belongs to the first category or the second category.
  • the plurality of metric parameters may include a first number of data samples of the first type in the plurality of data samples 102 .
  • the predicted classification result and the ground-truth classification result corresponding to a first-type data sample both indicate the first category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) of the data sample 102 indicates that the data sample 102 belongs to the first category (for example, an image of a cat), and the predicted classification result given by the machine learning model 130 also indicates that the data sample 102 belongs to the first category, then the data sample 102 belongs to the first type of data sample, also called a true positive (TP) sample.
  • the plurality of metric parameters may include a second number of data samples of the second type in the plurality of data samples 102 .
  • the predicted classification result and the ground-truth classification result corresponding to a second-type data sample both indicate the second category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) of the data sample 102 indicates that the data sample 102 belongs to the second category (for example, not an image of a cat), and the predicted classification result given by the machine learning model 130 also indicates that the data sample 102 belongs to the second category, then the data sample 102 belongs to the second type of data sample, also called a true negative (TN) sample.
  • the plurality of metric parameters may include a third number of third types of data samples in the plurality of data samples 102 .
  • the predicted classification result corresponding to the third type data sample indicates the first category and the corresponding ground truth classification result indicates the second category.
  • For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) of the data sample 102 indicates that the data sample 102 belongs to the second category (for example, not an image of a cat), while the predicted classification result given by the machine learning model 130 indicates that the data sample 102 belongs to the first category (for example, an image of a cat), then the data sample 102 belongs to the third type of data sample, also called a false positive (FP) sample.
  • the plurality of metric parameters may include a fourth number of data samples of a fourth type in the plurality of data samples.
  • the predicted classification result corresponding to a fourth-category data sample indicates the second category and the corresponding ground-truth classification result indicates the first category. For example, for a given data sample 102, if the ground-truth classification result (or ground-truth label 105) indicates that the data sample 102 belongs to the first category (for example, an image of a cat), while the predicted classification result predicted by the machine learning model 130 indicates that the data sample 102 belongs to the second category (for example, not an image of a cat), then the data sample 102 is a fourth-category data sample, also called a false negative sample (False Negative, FN).
  • the client node 110 can determine (220-1) the value of at least one of the above metric parameters based on the differences between the multiple predicted classification results and the multiple true value classification results corresponding to the multiple data samples 102.
  • the client node 110 may determine (220-1) the values of the numbers of TP, TN, FP, and FN described above based on the above differences.
  • client node 110 may also determine values for other additional metric parameters.
  • the client node 110 may represent these four values as a quadruple, namely (FP, FN, TP, TN). Additionally, in some embodiments, the four values may be saved together with the score threshold of the client node 110, for example, represented as (k_i, FP, FN, TP, TN), where k_i represents the score threshold of the i-th client node group.
  • each client node 110 receives only one score threshold.
  • each client node 110 determines the value of each metric parameter based on only a single score threshold. In this way, leakage of information from the client node 110 (such as predicted classification results or predicted classification labels) can be avoided.
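The per-threshold tallying described above can be sketched as follows (a minimal illustration rather than the patent's implementation; the scores, labels, and function name are hypothetical):

```python
def confusion_counts(scores, labels, k_i):
    """Compare each prediction score with the score threshold k_i and
    tally the four metric parameters, returned as (k_i, FP, FN, TP, TN)."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > k_i  # predicted classification result
        if predicted_positive and label == 1:
            tp += 1  # true positive
        elif not predicted_positive and label == 0:
            tn += 1  # true negative
        elif predicted_positive and label == 0:
            fp += 1  # false positive
        else:
            fn += 1  # false negative
    return (k_i, fp, fn, tp, tn)

# Hypothetical prediction scores and ground-truth labels at one client node
print(confusion_counts([0.9, 0.4, 0.7, 0.2], [1, 0, 0, 1], 0.5))  # (0.5, 1, 1, 1, 1)
```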
  • the client nodes 110 in the client node group 202-1 apply perturbations to the values of the multiple metric parameters to obtain (225-1) perturbation values of the multiple metric parameters. For example, for at least one of TP, TN, FP, and FN, the client node 110 may add random perturbation to one or more of the values through, for example, a Gaussian mechanism or a Laplace mechanism.
  • Figure 3 illustrates a flow diagram of a process 300 of applying a perturbation in accordance with some embodiments of the present disclosure.
  • Process 300 may be implemented at client node 110.
  • the client node 110 is configured to determine a sensitivity value related to the perturbation.
  • the sensitivity ⁇ can be set to 1. That is, every time the label of a data sample 102 is changed, the maximum impact on the statistics is 1.
  • the sensitivity value can also be set to other appropriate values.
  • the client node 110 is configured to determine a random perturbation distribution based on the sensitivity value ⁇ and the label differential privacy mechanism.
  • the random response mechanism is one of the differential privacy (Differential Privacy, DP) mechanisms.
  • To define differential privacy, assume that ε and δ are real numbers greater than or equal to 0, that is, ε ≥ 0 and δ ≥ 0, and let M be a random mechanism (randomized algorithm).
  • A random mechanism means that, for a specific input, the output of the mechanism is not a fixed value but obeys a certain distribution.
  • The random mechanism M can be considered to have (ε, δ)-differential privacy if the following condition is met: for any two adjacent training data sets D, D′, and for an arbitrary subset S of the possible outputs of M, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.
  • When δ = 0, the random mechanism M can also be considered to have ε-differential privacy (ε-DP).
  • for a random mechanism M with (ε, δ)-differential privacy or ε-differential privacy, it is expected that the distributions of the two outputs obtained after M acts on two adjacent data sets are nearly indistinguishable. In this case, an observer can hardly detect a small change in the input data set of the algorithm by observing the output results, thus achieving the purpose of protecting privacy. In other words, if applying the random mechanism M to any pair of adjacent data sets yields almost the same probability of producing a specific output S, the algorithm is considered to achieve the effect of differential privacy.
  • similarly, label differential privacy can be defined. Specifically, assume that ε ≥ 0 and δ ≥ 0, and let M be a random mechanism (randomized algorithm). The random mechanism M can be considered to have (ε, δ)-label differential privacy if the following condition is met: for any two adjacent training data sets D, D′ whose only difference is the label of a single data sample, and for an arbitrary subset S of the possible outputs of M, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.
  • When δ = 0, the random mechanism M can also be considered to have ε-label differential privacy.
  • the random response mechanism is a random mechanism applied for the purpose of differential privacy protection.
  • the random response mechanism is defined as follows: assume ε is a parameter, and y ∈ {0, 1} is the known value of the ground-truth label in the random response mechanism. For the value y of the ground-truth label, the random response mechanism derives a random value ỹ from the following probability distribution: Pr[ỹ = y] = e^ε / (e^ε + 1), and Pr[ỹ = 1 − y] = 1 / (e^ε + 1).
  • after applying the random response mechanism, the random value ỹ is equal to y with a certain probability and not equal to y with a certain probability.
  • the random response mechanism defined above satisfies ε-differential privacy.
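A minimal sketch of the random response mechanism described above (the function name is illustrative; labels are assumed binary):

```python
import math
import random

def randomized_response(y, epsilon):
    """Keep the binary label y with probability e^eps / (e^eps + 1) and
    flip it with probability 1 / (e^eps + 1); this distribution satisfies
    epsilon-label differential privacy for y in {0, 1}."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return y if random.random() < p_keep else 1 - y

# With a large epsilon the label is almost always kept unchanged
print(randomized_response(1, 8.0))
```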
  • the client node 110 can add random perturbations to the values of the multiple metric parameters to prevent the service node 120 from obtaining private information of the client node 110 (e.g., predicted label information).
  • the label differential privacy mechanism may be a Gaussian mechanism.
  • For (ε, δ)-DP, the standard deviation σ of the random perturbation distribution of the Gaussian mechanism (i.e., the standard deviation of the added noise) may be determined as σ = (Δ/ε) · √(2 · ln(1.25/δ)),
  • where Δ represents the sensitivity value,
  • and ε and δ are any values between 0 and 1 (excluding 0 and 1).
  • the label differential privacy mechanism may be the Laplace mechanism.
  • The Gaussian mechanism and Laplace mechanism listed above are only illustrative and not restrictive. Embodiments of the present disclosure may use other suitable label differential privacy mechanisms to determine random perturbation distributions. Embodiments of the present disclosure are not limited in this regard.
  • the client node 110 is configured to apply a perturbation to the at least one number based on the random perturbation distribution. For example, in the example in which the client node 110 determines the four values TP, TN, FP, and FN, the client node 110 can apply random perturbations to these four values respectively to obtain the perturbation values (FP', FN', TP', TN'), which can also be expressed as (k_i, FP', FN', TP', TN'). Alternatively, in some embodiments, the client node 110 may apply random perturbations to only one or more of the four values.
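Assuming the Gaussian mechanism with σ = (Δ/ε)·√(2·ln(1.25/δ)) (the standard calibration for (ε, δ)-DP; the names below are illustrative), the perturbation step might look like:

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation of the Gaussian mechanism for (eps, delta)-DP."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def perturb_counts(counts, epsilon, delta, sensitivity=1.0):
    """Add independent Gaussian noise to each metric parameter value."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return tuple(c + random.gauss(0.0, sigma) for c in counts)

# Hypothetical (FP, FN, TP, TN) counts at a client node
fp_p, fn_p, tp_p, tn_p = perturb_counts((3, 2, 10, 15), epsilon=1.0, delta=1e-5)
```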
  • the information of the client node 110 can be prevented from being leaked.
  • this way of applying perturbation can prevent the service node 120 from guessing the predicted labels of the client node 110.
  • Without perturbation, the service node could determine that all samples with scores greater than the threshold k_i are predicted as positive samples,
  • and that all samples with scores smaller than the threshold k_i are predicted as negative samples.
  • Applying perturbation to the values prevents such inference.
  • Client nodes 110 in client node group 202-1 send (230-1) perturbation values of a plurality of metric parameters to service node 120 for use in determining predetermined performance indicators at service node 120.
  • perturbation is sometimes also called noise, interference, etc.
  • each client node 110 in client node group 202-1 may send the quadruple (FP', FN', TP', TN') or (k_i, FP', FN', TP', TN') to the service node 120.
  • alternatively, the client node 110 in the client node group 202-1 may send only one or more of the above four perturbation values to the service node 120. The process by which the service node 120 determines the predetermined performance indicator will be described below.
  • similarly, the client nodes 110 in the client node groups 202-2/.../202-L determine (215-2/.../215-L) multiple predicted classification results corresponding to the multiple data samples 102.
  • the client nodes 110 in the client node groups 202-2/.../202-L determine (220-2/.../220-L) values of the multiple metric parameters related to the predetermined performance indicators of the machine learning model 130.
  • the client nodes 110 in the client node groups 202-2/.../202-L apply perturbations to the values of the multiple metric parameters to obtain (225-2/.../225-L) perturbation values of the multiple metric parameters.
  • the client nodes 110 in the client node groups 202-2/.../202-L send (230-2/.../230-L) the perturbation values of the multiple metric parameters to the service node 120 for use in determining the predetermined performance indicators at the service node 120.
  • the above processes are similar to the corresponding processes of the client node group 202-1 and will not be described again here.
  • the service node 120 receives (235-1, or 235-2/.../235-L respectively, collectively referred to as 235) from the at least one client node group 202-1 (or 202-2/.../202-L) perturbation values of the multiple metric parameters related to the predetermined performance indicators of the machine learning model 130.
  • for example, the service node 120 may respectively receive (235) from the at least one client node group 202 the quadruple (FP', FN', TP', TN') or (k_i, FP', FN', TP', TN').
  • the service node 120 may receive (235) one or more of the four perturbation values described above, which may depend on the performance metric to be calculated.
  • TP′ represents a first perturbation number of first-category data samples among the plurality of data samples 102 at a given client node 110, where a first-category data sample is annotated as the first category by the ground-truth label and also predicted as the first category by the machine learning model 130.
  • TN' represents the second perturbation number of the second type of data sample in the plurality of data samples 102, wherein the second type of data sample is labeled as the second category by the ground truth label and is also predicted as the second category by the model.
  • FP' represents a third perturbation number of third type data samples in the plurality of data samples 102, where the third type data samples are labeled as the second category and predicted as the first category.
  • FN' represents the fourth perturbation number of the fourth type of data sample in the plurality of data samples 102, where the fourth type of data sample is labeled as the first category and predicted as the second category.
  • for each client node group in the at least one client node group 202, the service node 120 aggregates the perturbation values of the multiple metric parameters from the client nodes 110 of the client node group by metric parameter, and obtains aggregate values of the multiple metric parameters respectively corresponding to the at least one client node group 202.
  • the service node 120 aggregates (240-1) the perturbation values (FP', FN', TP', TN') of multiple metric parameters from the client nodes 110 of the client node group 202-1 by metric parameters to obtain An aggregate value of a plurality of metric parameters corresponding to client node group 202-1.
  • similarly, the service node 120 aggregates (240-2/.../240-L) by metric parameter the perturbation values (FP', FN', TP', TN') of the multiple metric parameters from the client nodes 110 of the client node groups 202-2/.../202-L, and obtains aggregate values of the multiple metric parameters respectively corresponding to the client node groups 202-2/.../202-L.
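The per-metric aggregation can be sketched as follows (a minimal illustration; integer quadruples are used for readability, though perturbation values are generally real-valued):

```python
def aggregate_by_metric(quadruples):
    """Sum (FP', FN', TP', TN') quadruples reported by the client nodes
    of one group, metric parameter by metric parameter."""
    fp = sum(q[0] for q in quadruples)
    fn = sum(q[1] for q in quadruples)
    tp = sum(q[2] for q in quadruples)
    tn = sum(q[3] for q in quadruples)
    return (fp, fn, tp, tn)

# Perturbed quadruples from two client nodes in the same group (hypothetical)
print(aggregate_by_metric([(1, 1, 9, 15), (2, 1, 8, 16)]))  # (3, 2, 17, 31)
```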
  • the service node 120 may calculate, based on the following equations (7) and (8), aggregate values TPR (True Positive Rate) and FPR (False Positive Rate) of the multiple metric parameters corresponding to a client node group 202:
  • TPR represents the proportion of the samples that are actually positive (positive samples) that are correctly judged as positive.
  • FPR represents the proportion of the samples that are actually negative (negative samples) that are incorrectly judged as positive.
  • TPR = TP′/(TP′+FN′) (7)
  • FPR = FP′/(FP′+TN′) (8)
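Equations (7) and (8) can be applied to the aggregated perturbation values as follows (a sketch with hypothetical aggregate counts):

```python
def tpr_fpr(tp, fn, fp, tn):
    """Compute TPR and FPR from aggregate (possibly perturbed) counts,
    per equations (7) and (8)."""
    tpr = tp / (tp + fn)  # equation (7)
    fpr = fp / (fp + tn)  # equation (8)
    return tpr, fpr

print(tpr_fpr(tp=80, fn=20, fp=10, tn=90))  # (0.8, 0.1)
```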
  • service node 120 may use other methods to derive aggregate values for multiple metric parameters corresponding to client node group 202 .
  • the service node 120 determines (245) a value of the predetermined performance indicator based on a plurality of score thresholds respectively associated with the at least one client node group 202 and the aggregate values of the multiple metric parameters respectively corresponding to the at least one client node group 202.
  • the predetermined performance indicator includes at least the area under the curve (AUC) of the receiver operating characteristic (ROC) curve.
  • the service node 120 may determine the ROC of the machine learning model 130 based on at least one score threshold and an aggregate value of a plurality of metric parameters.
  • the value of L (ie, the number of score thresholds or the number of client node groups 202) may be greater than one.
  • at least one group may include multiple groups and at least one score threshold may include multiple score thresholds.
  • the service node 120 can calculate multiple (FPR, TPR) coordinate points based on each score threshold, and connect these points into lines to fit the ROC curve of the machine learning model 130.
  • the service node 120 may then determine the AUC of the ROC.
  • AUC refers to the area under the ROC curve.
  • the AUC can be calculated according to its definition by computing the area under the ROC curve using an approximation algorithm.
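One such approximation is the trapezoidal rule over the (FPR, TPR) coordinate points (a minimal sketch; the coordinates below are hypothetical, not those of Figure 4):

```python
def trapezoidal_auc(points):
    """Approximate the area under an ROC curve from (FPR, TPR) points
    sorted by increasing FPR."""
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid between adjacent points
    return auc

roc_points = [(0.0, 0.0), (0.1, 0.4), (0.3, 0.7), (0.6, 0.9), (1.0, 1.0)]
print(trapezoidal_auc(roc_points))  # approximately 0.75
```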
  • FIG. 4 shows a schematic diagram of a ROC curve 410 in accordance with some embodiments of the present disclosure.
  • the ROC curve 410 is plotted by calculating multiple (FPR, TPR) coordinate points based on multiple score thresholds (in this example, the value of L is greater than 1).
  • the multiple coordinate points are (0,0), (0,0.2), (0.2,0.2), (0.2,0.4), (0.4,0.4), (0.4,0.6), (0.6 ,0.6), (0.6,0.8), (0.8,0.8), (0.8,1) and (1,1).
  • Figure 4 also shows a curve 420 dividing the ROC plane into two parts.
  • based on the ROC curve 410, it can be determined that the AUC of the ROC curve is 0.7. Alternatively, it can be determined that the area between the ROC curve 410 and the curve 420 is 0.2, from which the AUC of the ROC curve 410 is determined to be 0.7.
  • ROC curve 410 depicted in Figure 4 is only schematic, and the coordinate values of each coordinate point in Figure 4 are for illustrative purposes only and are not limiting.
  • the service node 120 can determine coordinate points with other values, and can draw ROC curves of other shapes.
  • the results of the model performance evaluation can be more precise.
  • the AUC may also be determined from a probabilistic perspective.
  • the AUC can be thought of as: the probability that, for a randomly selected positive sample and a randomly selected negative sample, the machine learning model gives the positive sample a higher prediction score than the negative sample. That is to say, positive and negative samples in the data sample set are combined into positive-negative sample pairs, and the AUC measures the fraction of pairs in which the prediction score of the positive sample is greater than that of the negative sample. If the model gives more positive samples a higher prediction score than negative samples, the AUC is higher and the model performance is better.
  • the value range of AUC is between 0.5 and 1. The closer the AUC is to 1, the better the performance of the model.
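Under this probabilistic view, the AUC can be estimated by counting concordant positive-negative pairs (a sketch; the scores are hypothetical, and ties are counted as one half):

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher prediction score than a randomly chosen negative
    sample; tied scores contribute 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical prediction scores for positive and negative samples
print(pairwise_auc([0.9, 0.8, 0.4], [0.5, 0.3]))  # 5 of 6 pairs concordant
```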
  • the service node 120 may determine the AUC from a probabilistic perspective based on a corresponding score threshold of at least one client node group 202 and an aggregate value of a plurality of metric parameters.
  • the value of L (i.e., the number of score thresholds or the number of client node groups 202) may be 1 or an integer greater than 1.
  • the predetermined performance indicator may include an accuracy (ACC) of the prediction results.
  • the performance indicators of the machine learning model 130 may also include a precision, which represents the probability that a data sample in the subset of data samples predicted as positive samples is labeled as a positive sample by the ground-truth label.
  • the performance indicators of the machine learning model 130 may also include a PR curve with recall on the horizontal axis and precision on the vertical axis. The closer the PR curve is to the upper right corner, the better the performance of the model. The area under the PR curve is called the AP score (Average Precision Score).
  • performance indicators such as the AUC of the ROC listed above are only exemplary and not limiting.
  • Examples of performance metrics used in this disclosure include, but are not limited to, AUC of ROC, correct rate, error rate (Error-rate), precision rate, recall rate, AP score, etc.
  • the service node 120 can determine a predetermined performance index based on the perturbation values of a plurality of metric parameters received from at least one client node group, thereby performing model performance evaluation.
  • the client node does not need to expose its local ground-truth label set or its local predicted classification results (i.e., predicted label information).
  • the service node can still calculate the value of the performance indicator based on the feedback information of the client nodes (e.g., the perturbation values of the multiple metric parameters). In this way, the performance indicator of the machine learning model is determined while the privacy of the local label data of the client nodes is protected.
  • FIG. 5 illustrates a flow diagram of a process 500 for model performance evaluation in accordance with some embodiments of the present disclosure.
  • Process 500 may be implemented at client node 110.
  • the client node 110 determines a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing a plurality of predicted scores output by the machine learning model 130 for the plurality of data samples 102 with a score threshold.
  • the plurality of prediction classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category.
  • process 500 further includes receiving a score threshold from service node 120 .
  • the client node 110 may determine multiple predicted classification results corresponding to the multiple data samples 102 by comparing multiple predicted scores output by the machine learning model 130 for the multiple data samples 102 with score thresholds received from the service node 120 .
  • the client node 110 determines values of a plurality of metric parameters related to the predetermined performance indicators of the machine learning model 130 based on differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102.
  • the client node 110 may determine at least one of the following based on the above-mentioned differences: a first number of first-category data samples among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a first-category data sample indicate the first category; a second number of second-category data samples among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a second-category data sample indicate the second category; a third number of third-category data samples among the plurality of data samples, where the predicted classification result corresponding to a third-category data sample indicates the first category and the corresponding ground-truth classification result indicates the second category; and a fourth number of fourth-category data samples among the plurality of data samples, where the predicted classification result corresponding to a fourth-category data sample indicates the second category and the corresponding ground-truth classification result indicates the first category.
  • the client node 110 applies perturbations to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters.
  • the client node 110 may apply a perturbation to at least one of the first number, the second number, the third number, and the fourth number by: determining a sensitivity value related to the perturbation; determining a random perturbation distribution based on the sensitivity value and a label differential privacy mechanism; and applying a perturbation to the at least one number based on the random perturbation distribution.
  • the client node 110 sends the perturbation values of the plurality of metric parameters to the service node 120 for use in determining a predetermined performance indicator at the service node 120 .
  • the predetermined performance metric includes at least the area under the curve (AUC) of the receiver operating characteristic curve (ROC).
  • Figure 6 illustrates a flow diagram of a process 600 for model performance evaluation at service node 120, in accordance with some embodiments of the present disclosure.
  • Process 600 may be implemented at service node 120.
  • the service node 120 receives perturbation values of a plurality of metric parameters related to the predetermined performance indicators of the machine learning model 130 from at least one group of client nodes 110, respectively.
  • the client nodes 110 of each group in at least one group are different from the client nodes 110 of other groups.
  • process 600 further includes sending at least one score threshold to client nodes 110 in respective associated groups.
  • the perturbation values of the plurality of metric parameters include at least one of the following: a first perturbation number of first-category data samples among the plurality of data samples at a given client node 110, where a first-category data sample is labeled as the first category and predicted as the first category; a second perturbation number of second-category data samples among the plurality of data samples, where a second-category data sample is labeled as the second category and predicted as the second category; a third perturbation number of third-category data samples among the plurality of data samples, where a third-category data sample is labeled as the second category and predicted as the first category; and a fourth perturbation number of fourth-category data samples among the plurality of data samples, where a fourth-category data sample is labeled as the first category and predicted as the second category.
  • the predictions described above are based on a comparison of the predicted score output by the machine learning model 130 with a score threshold associated with the group in which a given client node 110 is located.
  • for each group in the at least one group, the service node 120 aggregates the perturbation values of the multiple metric parameters from the client nodes 110 of the group by metric parameter to obtain aggregate values of the multiple metric parameters respectively corresponding to the at least one group.
  • the service node 120 determines a value of the predetermined performance indicator based on a plurality of score thresholds respectively associated with the at least one group and the aggregate values of the multiple metric parameters respectively corresponding to the at least one group.
  • at least one group may include multiple groups and at least one score threshold may include multiple score thresholds.
  • the service node 120 may determine the receiver operating characteristic (ROC) curve of the machine learning model 130 based on the plurality of score thresholds and the aggregate values of the multiple metric parameters, and determine the area under the curve (AUC) of the ROC.
  • Figure 7 illustrates a block diagram of an apparatus 700 for model performance evaluation at a client node, in accordance with some embodiments of the present disclosure.
  • Apparatus 700 may be implemented as or included in client node 110 .
  • Each module/component in the device 700 may be implemented by hardware, software, firmware, or any combination thereof.
  • the apparatus 700 includes a classification determination module 710 configured to determine a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing a plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category.
  • the apparatus 700 further includes a receiving module configured to receive the score threshold from the service node 120 . The apparatus 700 may determine multiple predicted classification results corresponding to the multiple data samples 102 by comparing multiple predicted scores output by the machine learning model 130 for the multiple data samples 102 with score thresholds received from the service node 120 .
  • the apparatus 700 further includes a metric parameter determination module 720 configured to determine values of a plurality of metric parameters related to the predetermined performance indicators of the machine learning model 130 based on differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102.
  • the metric parameter determination module 720 is configured to determine at least one of the following based on the above-mentioned differences: a first number of first-category data samples among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a first-category data sample indicate the first category; a second number of second-category data samples among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a second-category data sample indicate the second category; a third number of third-category data samples among the plurality of data samples, where the predicted classification result corresponding to a third-category data sample indicates the first category and the corresponding ground-truth classification result indicates the second category; and a fourth number of fourth-category data samples among the plurality of data samples, where the predicted classification result corresponding to a fourth-category data sample indicates the second category and the corresponding ground-truth classification result indicates the first category.
  • the apparatus 700 further includes a perturbation module 730 configured to apply perturbations to the values of multiple metric parameters to obtain perturbation values of the multiple metric parameters.
  • the perturbation module 730 may be configured to, for at least one of the first number, the second number, the third number, and the fourth number, apply a perturbation to the at least one number by: determining a sensitivity value associated with the perturbation; determining a random perturbation distribution based on the sensitivity value and a label differential privacy mechanism; and applying a perturbation to the at least one number based on the random perturbation distribution.
  • the apparatus 700 further includes a perturbation value sending module 740 configured to send perturbation values of a plurality of metric parameters to the service node 120 for determining a predetermined performance index at the service node 120 .
  • the predetermined performance metric includes at least the area under the curve (AUC) of the receiver operating characteristic curve (ROC).
  • Figure 8 shows a block diagram of an apparatus 800 for model performance evaluation at a service node in accordance with some embodiments of the present disclosure.
  • Apparatus 800 may be implemented as or included in service node 120 .
  • Each module/component in the device 800 may be implemented by hardware, software, firmware, or any combination thereof.
  • the apparatus 800 includes a perturbation value receiving module 810 configured to receive perturbation values of a plurality of metric parameters related to predetermined performance indicators of the machine learning model 130 from at least one group of client nodes 110 respectively.
  • the client nodes 110 of each group in at least one group are different from the client nodes 110 of other groups.
  • the apparatus 800 may further include a sending module configured to send at least one score threshold individually to the client nodes 110 in the respective associated groups.
  • the perturbation values of the plurality of metric parameters include at least one of the following: a first perturbation number of first-category data samples among the plurality of data samples at a given client node 110, where a first-category data sample is labeled as the first category and predicted as the first category; a second perturbation number of second-category data samples among the plurality of data samples, where a second-category data sample is labeled as the second category and predicted as the second category; a third perturbation number of third-category data samples among the plurality of data samples, where a third-category data sample is labeled as the second category and predicted as the first category; and a fourth perturbation number of fourth-category data samples among the plurality of data samples, where a fourth-category data sample is labeled as the first category and predicted as the second category.
  • the predictions described above are based on a comparison of the predicted score output by the machine learning model 130 with a score threshold associated with the group in which a given client node 110 is located.
  • the apparatus 800 further includes an aggregation module 820 configured to, for each group in the at least one group, aggregate the perturbation values of the multiple metric parameters from the client nodes 110 of the group by metric parameter, to obtain aggregate values of the multiple metric parameters respectively corresponding to the at least one group.
  • the apparatus 800 further includes an indicator determination module 830 configured to determine a value of a predetermined performance indicator based on at least one score threshold respectively associated with at least one group and an aggregate value of a plurality of metric parameters respectively corresponding to at least one group.
  • at least one group may include multiple groups and at least one score threshold may include multiple score thresholds.
  • the metric determination module 830 includes a receiver operating characteristic curve (ROC) determination module configured to determine the ROC of the machine learning model 130 based on a plurality of score thresholds and an aggregate value of a plurality of metric parameters.
  • the metric determination module 830 also includes an area under the curve (AUC) determination module configured to determine the AUC of the ROC.
  • Figure 9 illustrates a block diagram of a computing device/system 900 capable of implementing one or more embodiments of the present disclosure. It should be understood that the computing device/system 900 shown in Figure 9 is exemplary only and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device/system 900 shown in FIG. 9 may be used to implement the client node 110 or the service node 120 of FIG. 1 .
  • computing device/system 900 is in the form of a general purpose computing device.
  • Components of computing device/system 900 may include, but are not limited to, one or more processors or processing units 910, memory 920, storage device 930, one or more communication units 940, one or more input devices 950, and one or more output devices 960.
  • the processing unit 910 may be a real or virtual processor and can perform various processes according to a program stored in the memory 920 . In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of the computing device/system 900.
  • Computing device/system 900 typically includes a plurality of computer storage media. Such media may be any available media that is accessible to computing device/system 900, including, but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 920 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 930 may be a removable or non-removable medium and may include machine-readable media, such as a flash drive, a magnetic disk, or any other medium, that can be used to store information and/or data (such as training data for training) and that can be accessed within computing device/system 900.
  • Computing device/system 900 may further include additional removable/non-removable, volatile/non-volatile storage media.
  • a disk drive may be provided for reading from or writing to a removable, non-volatile disk (such as a "floppy disk"), and an optical disc drive may be provided for reading from or writing to a removable, non-volatile optical disc.
  • each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 920 may include a computer program product 925 having one or more program modules configured to perform various methods or actions of various embodiments of the disclosure.
  • the communication unit 940 implements communication with other computing devices through communication media. Additionally, the functionality of the components of computing device/system 900 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over a communications connection. Accordingly, computing device/system 900 may operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.
  • Input device 950 may be one or more input devices, such as a mouse, keyboard, trackball, etc.
  • Output device 960 may be one or more output devices, such as a display, speakers, a printer, etc.
  • the computing device/system 900 may also, as needed, communicate via the communication unit 940 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device/system 900, or with any device (e.g., a network card, a modem, etc.) that enables the computing device/system 900 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium is provided having computer-executable instructions or a computer program stored thereon, wherein the computer-executable instructions or the computer program are executed by a processor to implement the method described above.
  • a computer program product is also provided, the computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create an apparatus that implements the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer, programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to embodiments of the present disclosure, a method, apparatus, device, and storage medium for model performance evaluation are provided. The method includes, at a client node, determining a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples are predicted to belong to a first category or a second category. The method further includes determining values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples. The method further includes applying perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. The method further includes sending the perturbed values of the plurality of metric parameters to a service node.

Description

Method, apparatus, device, and storage medium for model performance evaluation
This application claims priority to Chinese invention patent application No. 202210524000.6, filed on May 13, 2022 and entitled "Method, apparatus, device, and storage medium for model performance evaluation".
Technical Field
Example embodiments of the present disclosure relate generally to the field of computing, and in particular to a method, apparatus, device, and computer-readable storage medium for model performance evaluation.
Background
As data privacy protection receives increasing attention, it has become difficult to further improve today's centralized machine learning systems, and federated learning has therefore emerged. Federated learning can achieve performance consistent with that of traditional machine learning algorithms in an encrypted environment without data leaving the local nodes. Federated learning refers to jointly building models using the data of individual nodes while guaranteeing data privacy and security, so as to improve the effectiveness of machine learning models. Federated learning allows data to remain on each node, achieving the goal of data protection. In federated learning, it is desirable to better protect data privacy, including the privacy of the label data corresponding to data samples.
Summary
According to example embodiments of the present disclosure, a scheme for model performance evaluation is provided.
In a first aspect of the present disclosure, a method for model performance evaluation is provided. The method includes, at a client node, determining a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples are predicted to belong to a first category or a second category. The method further includes determining values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples. The method further includes applying perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. The method further includes sending the perturbed values of the plurality of metric parameters to a service node.
In a second aspect of the present disclosure, a method for model performance evaluation is provided. The method includes, at a service node, receiving, from client nodes of at least one group respectively, perturbed values of a plurality of metric parameters related to a predetermined performance indicator of a machine learning model. The method further includes, for each group of the at least one group, aggregating by metric parameter the perturbed values of the plurality of metric parameters from the client nodes of that group, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the at least one group. The method further includes determining a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
In a third aspect of the present disclosure, an apparatus for model performance evaluation is provided. The apparatus includes a classification determination module configured to determine a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold, the plurality of predicted classification results respectively indicating that the plurality of data samples are predicted to belong to a first category or a second category. The apparatus further includes a metric parameter determination module configured to determine values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples. The apparatus further includes a perturbation module configured to apply perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. The apparatus further includes a perturbed value sending module configured to send the perturbed values of the plurality of metric parameters to a service node.
In a fourth aspect of the present disclosure, an apparatus for model performance evaluation is provided. The apparatus includes a perturbed value receiving module configured to receive, from client nodes of at least one group respectively, perturbed values of a plurality of metric parameters related to a predetermined performance indicator of a machine learning model. The apparatus further includes an aggregation module configured to, for each group of the at least one group, aggregate by metric parameter the perturbed values of the plurality of metric parameters from the client nodes of that group, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the at least one group. The apparatus further includes an indicator determination module configured to determine a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
In a fifth aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.
In a sixth aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the second aspect.
In a seventh aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the computer program is executed by a processor to implement the method of the first aspect.
In an eighth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the computer program is executed by a processor to implement the method of the second aspect.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. In the drawings, identical or similar reference numerals denote identical or similar elements, in which:
Figure 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be applied;
Figure 2 illustrates a signaling flow of model performance evaluation according to some embodiments of the present disclosure;
Figure 3 illustrates a flowchart of a process of applying perturbation according to some embodiments of the present disclosure;
Figure 4 illustrates a schematic diagram of an ROC curve according to some embodiments of the present disclosure;
Figure 5 illustrates a flowchart of a process of model performance evaluation according to some embodiments of the present disclosure;
Figure 6 illustrates a flowchart of another process of model performance evaluation according to some embodiments of the present disclosure;
Figure 7 illustrates a block diagram of another apparatus for model performance evaluation according to some embodiments of the present disclosure;
Figure 8 illustrates a block diagram of an apparatus for model performance evaluation according to some embodiments of the present disclosure; and
Figure 9 illustrates a block diagram of a computing device/system capable of implementing one or more embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
In the description of the embodiments of the present disclosure, the term "include" and its similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.
It can be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of the corresponding laws, regulations, and relevant provisions.
It can be understood that, before using the technical solutions disclosed in the embodiments of the present disclosure, the user shall be informed of the type, scope of use, usage scenarios, etc. of the personal information involved in the present disclosure in an appropriate manner in accordance with relevant laws and regulations, and the user's authorization shall be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation requested to be performed will require obtaining and using the user's personal information. Thus, the user can autonomously choose, according to the prompt information, whether to provide personal information to software or hardware such as an electronic device, application, server, or storage medium that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user in, for example, the form of a pop-up window, in which the prompt information may be presented in text. In addition, the pop-up window may also carry a selection control for the user to choose to "agree" or "disagree" to provide personal information to the electronic device.
It can be understood that the above process of notifying and obtaining user authorization is only illustrative and does not limit the implementations of the present disclosure; other methods satisfying relevant laws and regulations may also be applied to the implementations of the present disclosure.
As used herein, the term "model" can learn the association between corresponding inputs and outputs from training data, so that after training is completed, a corresponding output can be generated for a given input. The generation of a model can be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs by using multiple layers of processing units. A neural network model is an example of a deep-learning-based model. Herein, a "model" may also be referred to as a "machine learning model", "learning model", "machine learning network", or "learning network", and these terms are used interchangeably herein.
A "neural network" is a machine learning network based on deep learning. A neural network can process an input and provide a corresponding output, and typically includes an input layer and an output layer as well as one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, thereby increasing the depth of the network. The layers of a neural network are connected in sequence, so that the output of a previous layer is provided as the input of a subsequent layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), and each node processes the input from the previous layer.
Generally, machine learning can roughly include three phases: a training phase, a testing phase, and an application phase (also called an inference phase). In the training phase, a given model can be trained using a large amount of training data, iteratively updating parameter values until the model can obtain consistent inferences from the training data that meet the expected objective. Through training, the model can be considered to learn the association from input to output (also called the input-to-output mapping) from the training data. The parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model. In the application phase, the model can be used to process actual inputs based on the parameter values obtained through training and determine the corresponding outputs.
Figure 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The environment 100 involves a federated learning environment, which includes N client nodes 110-1 … 110-k, … 110-N (where N is an integer greater than 1, k = 1, 2, … N) and a service node 120. The client nodes 110-1 … 110-k, … 110-N may respectively maintain their own local data sets 112-1 … 112-k, … 112-N. For ease of discussion, the client nodes 110-1 … 110-k, … 110-N may be collectively or individually referred to as client node 110, and the local data sets 112-1 … 112-k, … 112-N may be collectively or individually referred to as local data set 112.
In some embodiments, the client node 110 and/or the service node 120 may be implemented at a terminal device or a server. The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, personal communication system (PCS) device, personal navigation device, personal digital assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, e-book device, gaming device, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof. In some embodiments, the terminal device can also support any type of user-oriented interface (such as "wearable" circuitry, etc.). A server is any of various types of computing systems/servers capable of providing computing capabilities, including but not limited to mainframes, edge computing nodes, computing devices in cloud environments, and so on.
In federated learning, a client node refers to a node that provides part of the training data of a machine learning model; a client node may also be called a client, terminal node, terminal device, user equipment, etc. In federated learning, a service node refers to a node that aggregates the training results at the client nodes.
In the example of Figure 1, it is assumed that the N client nodes 110 jointly participate in the training of the machine learning model 130 and report intermediate training results to the service node 120, so that the service node 120 updates the parameter set of the machine learning model 130. The full collection of the local data of these client nodes 110 constitutes the complete training data set of the machine learning model 130. Thus, according to the federated learning mechanism, the service node 120 will generate a global machine learning model 130.
For the machine learning model 130, the local data set 112 at a client node 110 may include data samples and ground-truth labels. Figure 1 specifically shows the local data set 112-k at the client node 110-k, which includes a data sample set and a ground-truth label set. The data sample set includes a plurality of (M) data samples 102-1, … 102-i, … 102-M (collectively or individually referred to as data samples 102), and the ground-truth label set includes a corresponding plurality of (M) ground-truth labels 105-1, … 105-i, … 105-M (collectively or individually referred to as ground-truth labels 105), where M is an integer greater than 1 and i = 1, 2, … M. Each data sample 102 may be annotated with a corresponding ground-truth label 105. The data sample 102 may correspond to an input of the machine learning model 130, and the ground-truth label 105 indicates the true output for the data sample 102. Ground-truth labels are an important part of supervised machine learning.
In embodiments of the present disclosure, the machine learning model 130 may be built based on various machine learning or deep learning model architectures and may be configured to implement various prediction tasks, such as various classification tasks, recommendation tasks, and so on. The machine learning model 130 may also be referred to as a prediction model, recommendation model, classification model, etc.
The data sample 102 may include input information related to the specific task of the machine learning model 130, and the ground-truth label 105 relates to the desired output of the task. As an example, in a binary classification task, the machine learning model 130 may be configured to predict whether an input data sample belongs to the first category or the second category, and the ground-truth label is used to annotate whether the data sample actually belongs to the first category or the second category. Many practical applications can be categorized as such binary classification tasks, for example, whether or not a recommended item is converted (e.g., clicked, purchased, registered, or another desired behavior) in a recommendation task, and so on.
It should be understood that Figure 1 merely illustrates an example federated learning environment. The environment may differ depending on the federated learning algorithm and the needs of the actual application. For example, although shown as a separate node, in some applications the service node 120 may serve not only as a central node but also as a client node, providing part of the data for model training, model performance evaluation, etc. Embodiments of the present disclosure are not limited in this respect.
In the training phase of the machine learning model 130, mechanisms already exist to protect the local data of each client node 110 from leakage. For example, during model training, a client node 110 does not have to disclose its local data samples or label data; instead, it sends to the service node 120 gradient data computed from the local training data, for the service node 120 to update the parameter set of the machine learning model 130.
In some cases, it is also desirable to globally evaluate the performance of the trained machine learning model 130. The performance of the machine learning model 130 can be measured by one or more performance indicators. Different performance indicators measure, from different angles, the difference between the predicted outputs given by the machine learning model 130 for a data sample set and the true outputs indicated by the ground-truth label set. Generally, a small difference between the predicted outputs given by the machine learning model 130 and the true outputs means that the performance of the machine learning model is good. It can be seen that determining the performance indicators of the machine learning model 130 generally requires the ground-truth label set of the data samples.
As data regulation systems continue to strengthen, requirements for data privacy protection are becoming higher and higher, including protecting the ground-truth labels of data samples from leakage. Therefore, determining the performance indicators of the machine learning model 130 while protecting the client nodes' local label data from leakage is a challenging task, and there is currently no very effective solution to this problem.
According to embodiments of the present disclosure, a model performance evaluation scheme is provided that can protect the local label data of client nodes. Specifically, at a client node, a plurality of predicted classification results corresponding to a plurality of data samples are determined by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold received from a service node. The client node determines values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples. The client node applies perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. The client node sends the perturbed values of the plurality of metric parameters to the service node. The service node determines the predetermined performance indicator based on the perturbed values of the plurality of metric parameters received from the individual client nodes.
According to embodiments of the present disclosure, each client node needs to expose neither its local ground-truth label set nor its local predicted classification results (i.e., predicted label information), while the service node can still compute the value of the performance indicator based on the feedback information from the client nodes (e.g., the perturbed values of the plurality of metric parameters). In this way, privacy protection of the client nodes' local label data is achieved while determining the performance indicator of the machine learning model.
Some example embodiments of the present disclosure will now be described with reference to the accompanying drawings.
Figure 2 illustrates a schematic diagram of a signaling flow 200 of model performance evaluation according to some embodiments of the present disclosure. The signaling flow 200 involves the service node 120 and a plurality of client node groups 202-1, 202-2, … 202-L in the environment 100. For ease of discussion, these client node groups are herein collectively or individually referred to as client node group 202, where L is an integer greater than or equal to 1.
A client node group 202 may include a plurality of client nodes 110. For example, taking the client node group 202-1 as an example, the client node group 202-1 may include client nodes 110-1, 110-2, … 110-J, where J is an integer greater than or equal to 1 and less than or equal to N. It should be understood that the signaling flow 200 may involve any number of service nodes 120 and any number of client node groups 202.
It should be understood that each client node group 202 may include any number of client nodes 110. The numbers of client nodes 110 included in the individual client node groups 202 may be the same or different. In some embodiments, the N client nodes 110 may be divided evenly or approximately evenly into L client node groups 202, where each client node group 202 includes N/L (rounded to an integer) client nodes 110.
In embodiments of the present disclosure, it is assumed that the performance of the machine learning model 130 is to be evaluated. In some embodiments, the machine learning model 130 to be evaluated may be a global machine learning model determined through a federated learning training process; for example, the client nodes 110 and the service node 120 participated in the training process of the machine learning model 130. In some embodiments, the machine learning model 130 may also be a model obtained in any other manner, and the client nodes 110 and the service node 120 may not have participated in the training process of the machine learning model 130. The scope of the present disclosure is not limited in this respect.
In some embodiments, the service node 120 sends (not shown) the machine learning model 130 to the client nodes 110 in each client node group 202. After receiving the machine learning model 130, each client node 110 may perform the subsequent evaluation process based on the machine learning model 130. In some embodiments, the machine learning model 130 to be evaluated may also be provided to the client nodes 110 in any other suitable manner.
In some embodiments, the service node 120 may send a plurality of score thresholds respectively to the client nodes 110 in the at least one client node group 202. For example, the service node 120 may randomly generate L score thresholds and send the L score thresholds respectively to the individual client nodes 110 of the L client node groups 202. Each score threshold is a value between 0 and 1.
In some embodiments, the value of L (i.e., the number of score thresholds or the number of client node groups 202) may be predetermined by the service node 120. For example, the value of L may be determined according to the number of client nodes 110. In some embodiments, the value of L may also be determined according to the type of the predetermined performance indicator to be determined by the service node 120. For example, if the predetermined performance indicator to be determined by the service node 120 is the accuracy (ACC) of the prediction results, the service node 120 may determine the value of L to be 1. As another example, if the predetermined performance indicator to be determined by the service node 120 is the area under the curve (AUC) of the receiver operating characteristic curve (ROC), the service node 120 may determine the value of L to be an integer greater than 1. It should be understood that for the cases where the predetermined performance indicator is ACC or the AUC of the ROC, the value of L may also be determined to be another appropriate integer value. Embodiments of the present disclosure are not limited in this respect.
As shown in Figure 2, the service node 120 may send (205-1) the first score threshold to the client nodes 110 in the client node group 202-1 associated with it. Similarly, the service node 120 may send (205-2/…/205-L) the second/…/L-th score threshold to the client nodes 110 in the associated client node group 202-2/…/202-L.
The client nodes 110 in each client node group 202 receive (210-1/210-2/…/210-L) their respective score thresholds.
Taking a client node 110 in the client node group 202-1 as an example, the client node 110 may apply each data sample 102 to the machine learning model 130 as the model's input and obtain the prediction score output by the machine learning model 130. The client node 110 determines (215-1) a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing the plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with the score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category.
Embodiments of the present disclosure pay particular attention to performance indicators of machine learning models implementing binary classification tasks. Each prediction score may indicate the predicted probability that the corresponding data sample 102 belongs to the first category or the second category. The two categories can be configured according to the needs of the actual task.
The value range of the prediction scores output by the machine learning model 130 may be set arbitrarily. For example, a prediction score may be a value in some continuous value interval (e.g., a value between 0 and 1), or may be one of a plurality of discrete values (e.g., one of the discrete values 0, 1, 2, 3, 4, 5, etc.). In some examples, a higher prediction score may indicate a larger predicted probability that the data sample 102 belongs to the first category and a smaller predicted probability that it belongs to the second category. Of course, the opposite setting is also possible; for example, a higher prediction score may indicate a larger predicted probability that the data sample 102 belongs to the second category and a smaller predicted probability that it belongs to the first category.
In some embodiments, if the prediction score of the machine learning model 130 for a data sample 102 exceeds the score threshold, the client node 110 may determine the predicted classification result corresponding to that data sample 102 to indicate that the data sample belongs to the first category. Conversely, if the prediction score of the machine learning model 130 for the data sample 102 does not exceed the score threshold, the client node 110 may determine the predicted classification result corresponding to that data sample 102 to indicate that the data sample belongs to the second category.
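As a minimal sketch of the thresholding rule just described (the function name is illustrative and not part of the disclosure), prediction scores can be mapped to predicted categories, with 1 denoting the first category and 0 denoting the second:

```python
def classify_by_threshold(scores, threshold):
    """Predict the first category (1) for scores strictly above the threshold,
    and the second category (0) otherwise."""
    return [1 if score > threshold else 0 for score in scores]
```

For example, with a threshold of 0.5, the scores [0.9, 0.2, 0.5] map to [1, 0, 0]: only scores strictly above the threshold are predicted as the first category.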
In some embodiments, each data sample 102 has a ground-truth label 105. The ground-truth label 105 is used to annotate whether the corresponding data sample 102 belongs to the first category or the second category. Hereinafter, for ease of discussion, a data sample belonging to the first category is sometimes called a positive sample, positive example, or positive-class sample, and a data sample belonging to the second category is sometimes called a negative sample, negative example, or negative-class sample. In some embodiments, each ground-truth label 105 may take one of two values, respectively indicating the first category or the second category. In some embodiments below, for ease of discussion, the value of the ground-truth label 105 corresponding to the first category may be set to "1", indicating that the corresponding data sample 102 belongs to the first category and is a positive sample. In addition, the value of the ground-truth label 105 corresponding to the second category may be set to "0", indicating that the corresponding data sample 102 belongs to the second category and is a negative sample.
It should be understood that the first category and the second category can be any categories in a binary classification problem. Taking the binary classification problem of judging whether the content of an image is a cat as an example, the first category may represent the category that the content of the image is a cat, and the second category may represent the category that the content of the image is not a cat. Taking the evaluation of item quality as an example, the first category may represent that the item quality meets the standard, and the second category may represent that the item quality does not meet the standard. It should be understood that the binary classification problems enumerated above are merely exemplary, and the model performance evaluation method described herein is applicable to all kinds of binary classification problems. Embodiments of the present disclosure are not limited in this respect. In some example embodiments below, for ease of discussion, image classification is mainly used as an example for explanation, but it should be understood that this does not imply that those embodiments can only be applied to that binary classification problem.
The client nodes 110 in the client node group 202-1 determine (220-1), based on the differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102, values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130. The plurality of ground-truth classification results may be respectively annotated by the plurality of ground-truth labels of the plurality of data samples 102, indicating that the plurality of data samples 102 belong to the first category or the second category.
In some embodiments, the plurality of metric parameters may include a first number of data samples of a first type among the plurality of data samples 102, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the first type indicate the first category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) of the data sample 102 indicates that the data sample 102 belongs to the first category (e.g., an image of a cat), and the predicted classification result predicted by the machine learning model 130 also indicates that the data sample 102 belongs to the first category, then the data sample 102 belongs to the first type of data sample, also known as a True Positive (TP).
In some embodiments, the plurality of metric parameters may include a second number of data samples of a second type among the plurality of data samples 102, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the second type indicate the second category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) indicates that the data sample 102 belongs to the second category (e.g., not an image of a cat), and the predicted classification result predicted by the machine learning model 130 also indicates that the data sample 102 belongs to the second category, then the data sample 102 belongs to the second type of data sample, also known as a True Negative (TN).
In some embodiments, the plurality of metric parameters may include a third number of data samples of a third type among the plurality of data samples 102, where the predicted classification result corresponding to a data sample of the third type indicates the first category and the corresponding ground-truth classification result indicates the second category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) indicates that the data sample 102 belongs to the second category (e.g., not an image of a cat), and the predicted classification result predicted by the machine learning model 130 indicates that the data sample 102 belongs to the first category (e.g., an image of a cat), then the data sample 102 belongs to the third type of data sample, also known as a False Positive (FP).
In some embodiments, the plurality of metric parameters may include a fourth number of data samples of a fourth type among the plurality of data samples, where the predicted classification result corresponding to a data sample of the fourth type indicates the second category and the corresponding ground-truth classification result indicates the first category. For example, for a certain data sample 102, if the ground-truth classification result (or ground-truth label 105) indicates that the data sample 102 belongs to the first category (e.g., an image of a cat), and the predicted classification result predicted by the machine learning model 130 indicates that the data sample 102 belongs to the second category (e.g., not an image of a cat), then the data sample 102 belongs to the fourth type of data sample, also known as a False Negative (FN).
The four kinds of results described above are summarized in Table 1 below.
Table 1
                              Predicted as first category    Predicted as second category
Labeled as first category     True Positive (TP)             False Negative (FN)
Labeled as second category    False Positive (FP)            True Negative (TN)
The above enumerates examples of the plurality of metric parameters, taking the first number (the number of TPs), the second number (the number of TNs), the third number (the number of FPs), and the fourth number (the number of FNs) as examples. It should be understood that the client node 110 may determine (220-1) the value of at least one of the above metric parameters based on the differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102. Alternatively, the client node 110 may determine (220-1) the values of the numbers of TP, TN, FP, and FN described above based on the above differences. Additionally, the client node 110 may also determine the values of other additional metric parameters.
In the example where the client node 110 determines the above values of TP, TN, FP, and FN, the client node 110 may represent these four values as a four-tuple, namely (FP, FN, TP, TN). Additionally, in some embodiments, the above four values may be stored together with the score threshold of this client node 110, for example represented as (k_i, FP, FN, TP, TN), where k_i denotes the score threshold of the i-th client node group.
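The four counts and the four-tuple above can be sketched as follows (a hypothetical helper, with labels and predictions encoded as 1 for the first category and 0 for the second):

```python
def confusion_tuple(true_labels, predicted_labels):
    """Count TP, TN, FP, FN and return them in the (FP, FN, TP, TN) order used above."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1  # true positive: labeled and predicted first category
        elif truth == 0 and pred == 0:
            tn += 1  # true negative: labeled and predicted second category
        elif truth == 0 and pred == 1:
            fp += 1  # false positive: labeled second, predicted first
        else:
            fn += 1  # false negative: labeled first, predicted second
    return (fp, fn, tp, tn)
```

For example, labels [1, 1, 0, 0] with predictions [1, 0, 1, 0] yield one sample of each type, i.e., the tuple (1, 1, 1, 1).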
There are no overlapping client nodes 110 between the client node groups 202 divided above. In other words, the client nodes 110 of each client node group 202 among the plurality of client node groups 202 are different from the client nodes of the other client node groups 202. In this way, each client node 110 receives only one score threshold, and each client node 110 determines the values of the individual metric parameters based on only one score threshold. In this manner, leakage of information of the client node 110 (such as predicted classification results or predicted classification labels) can be avoided.
The client nodes 110 in the client node group 202-1 apply perturbation to the values of the plurality of metric parameters, obtaining (225-1) perturbed values of the plurality of metric parameters. For example, for at least one of the numbers TP, TN, FP, and FN, the client node 110 may add random perturbation to one or more of these values through, for example, a Gaussian mechanism or a Laplace mechanism.
Figure 3 illustrates a flowchart of a process 300 of applying perturbation according to some embodiments of the present disclosure. The process 300 may be implemented at the client node 110.
Specifically, at block 310, the client node 110 is configured to determine a sensitivity value related to the perturbation. For example, the sensitivity Δ may be set to 1; that is, changing the label of one data sample 102 affects each statistic by at most 1. Alternatively, the sensitivity value may be set to another appropriate value.
At block 320, the client node 110 is configured to determine a random perturbation distribution based on the sensitivity value Δ and a label differential privacy mechanism.
The randomized response mechanism is one kind of differential privacy (DP) mechanism. For a better understanding of the embodiments of the present disclosure, differential privacy and the randomized response mechanism are first briefly introduced below.
Assume ε, δ are real numbers greater than or equal to 0, and M is a random mechanism (randomized algorithm). A random mechanism is one whose output for a particular input is not a fixed value but follows some distribution. The random mechanism M is considered to have (ε, δ)-differential privacy if, for any two adjacent training data sets D, D′, and for any subset S of the possible outputs of M:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ      (1)
Furthermore, if δ = 0, the random mechanism M is considered to have ε-differential privacy (ε-DP). In a differential privacy mechanism, for a random mechanism with (ε, δ)-differential privacy or ε-differential privacy, it is expected that the distributions of the two outputs obtained by applying the mechanism to two adjacent data sets are hard to distinguish. In this way, an observer can hardly perceive small changes in the input data set of the algorithm by observing the output, thereby achieving the goal of privacy protection. If the random mechanism, when applied to any adjacent data sets, yields a particular output S with roughly equal probabilities, the algorithm is considered to achieve the effect of differential privacy.
The embodiments herein focus on differential privacy for the labels of data samples, where a label indicates a binary classification result. Therefore, following the setting of differential privacy, label differential privacy can be defined. Specifically, assume ε, δ are real numbers greater than or equal to 0, and M is a random mechanism (randomized algorithm). The random mechanism M is considered to have (ε, δ)-label differential privacy if, for any two adjacent training data sets D, D′ that differ only in the label of a single data sample, and for any subset S of the possible outputs of M:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ      (2)
Furthermore, if δ = 0, the random mechanism M is considered to have ε-label differential privacy. That is, after the label of a data sample is changed, the change in the distribution of the output of the random mechanism is expected to remain small, so that an observer can hardly perceive the label change.
The randomized response mechanism is a random mechanism applied to achieve the goal of differential privacy protection. The randomized response mechanism is defined as follows: assume ε is a parameter, and y ∈ {0, 1} is the known value of the ground-truth label in the randomized response mechanism. For the ground-truth label value y, the randomized response mechanism derives a random value ỹ from the following probability distribution:
Pr[ỹ = y] = e^ε / (e^ε + 1),  Pr[ỹ = 1 − y] = 1 / (e^ε + 1)      (3)
That is, after applying the randomized response mechanism, the random value ỹ equals y with a certain probability and differs from y with a certain probability. The above randomized response mechanism is considered to have label differential privacy with δ = 0 ((ε, 0)-label differential privacy), because:
Pr[ỹ = y] / Pr[ỹ = 1 − y] = e^ε      (4)
That is, the randomized response mechanism satisfies ε-differential privacy.
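A minimal sketch of the randomized response mechanism of equation (3), keeping the true binary label with probability e^ε/(e^ε + 1) (the function name is illustrative, not from the disclosure):

```python
import math
import random

def randomized_response(label, epsilon):
    """Return the true binary label with probability e^eps / (e^eps + 1),
    and the flipped label 1 - label otherwise."""
    keep_probability = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return label if random.random() < keep_probability else 1 - label
```

A larger ε keeps the label more often (weaker privacy), while ε = 0 flips the label with probability 1/2.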
Differential privacy and the randomized response mechanism have been discussed above. Through such a mechanism, the client node 110 can add random perturbation to the values of the plurality of metric parameters, so as to prevent the service node 120 from obtaining private information at the client node 110 (e.g., predicted label information).
In some embodiments, the label differential privacy mechanism may be a Gaussian mechanism. The standard deviation σ of the random perturbation distribution of the (ε, δ)-DP Gaussian mechanism (i.e., the standard deviation of the added noise) can be computed as the following equation (5):
σ = Δ · sqrt(2 · ln(1.25/δ)) / ε      (5)
where Δ denotes the sensitivity value, and δ and ε are arbitrary values between 0 and 1 (exclusive).
Alternatively, in some embodiments, the label differential privacy mechanism may be a Laplace mechanism. The random perturbation distribution of the Laplace mechanism satisfies (ε, 0)-DP; the scale of this Laplace distribution is b = Δ/ε, and the standard deviation σ of the added random noise can be computed as the following equation (6):
σ = sqrt(2) · Δ / ε      (6)
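The noise scales of equations (5) and (6) can be sketched as follows. These are the standard forms from the differential privacy literature; since the equations were dropped from this text's typesetting, treating them as the usual Gaussian and Laplace mechanism definitions is an assumption here:

```python
import math

def gaussian_noise_std(sensitivity, epsilon, delta):
    """Standard deviation of Gaussian-mechanism noise for (eps, delta)-DP,
    assuming the standard form sigma = delta_f * sqrt(2 ln(1.25/delta)) / eps."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def laplace_noise_std(sensitivity, epsilon):
    """Laplace mechanism: scale b = sensitivity / eps; the standard deviation
    of Laplace(b) is sqrt(2) * b."""
    return math.sqrt(2.0) * sensitivity / epsilon
```

Both scales grow with the sensitivity and shrink as ε increases, i.e., less noise is added when a weaker privacy guarantee is acceptable.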
It should be understood that the Gaussian mechanism and the Laplace mechanism listed above are merely illustrative rather than restrictive. Embodiments of the present disclosure may use other suitable label differential privacy mechanisms to determine the random perturbation distribution. Embodiments of the present disclosure are not limited in this respect.
At block 330, the client node 110 is configured to apply perturbation to the at least one number based on the random perturbation distribution. For example, in the example where the client node 110 determines the four values TP, TN, FP, and FN, the client node 110 may apply random perturbation to each of these four values to obtain the perturbed values (FP', FN', TP', TN'), which may also be represented as (k_i, FP', FN', TP', TN'). Alternatively, in some embodiments, the client node 110 may apply random perturbation to one or more of the above four values.
By applying perturbation to the values of the metric parameters, leakage of information of the client node 110, especially positive-label information, can be avoided. Specifically, applying perturbation in this way prevents the service node 120 from inferring the predicted labels of the client node 110. For example, without perturbation of the metric parameter values, when TP/(TP+FP) is too large, the service node might conclude that the samples scoring above the threshold k_i are all positive samples. As another example, without perturbation of the metric parameter values, when FN/(FN+TN) is too large, the service node might conclude that the samples scoring below the threshold k_i are all negative samples. Applying perturbation to the metric parameter values avoids these and other potential information leakage situations.
Referring back to Figure 2, the client nodes 110 in the client node group 202-1 send (230-1) the perturbed values of the plurality of metric parameters to the service node 120 for determining the predetermined performance indicator at the service node 120. Herein, "perturbation" is sometimes also referred to as noise, interference, etc.
In some embodiments, each client node 110 in the client node group 202-1 may send the four-tuple (FP', FN', TP', TN') or (k_i, FP', FN', TP', TN') to the service node 120. Alternatively, the client nodes 110 in the client node group 202-1 may send one or more of the above four perturbed values to the service node 120. The process by which the service node 120 determines the predetermined performance indicator will be described below.
The above description takes the client node group 202-1 as an example. Similarly, the client nodes 110 in the client node groups 202-2/…/202-L determine (215-2/…/215-L) a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing the plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with their score thresholds. The client nodes 110 in the client node groups 202-2/…/202-L determine (220-2/…/220-L) the values of the plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130 based on the differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102. The client nodes 110 in the client node groups 202-2/…/202-L apply (225-2/…/225-L) perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. The client nodes 110 in the client node groups 202-2/…/202-L send (230-2/…/230-L) the perturbed values of the plurality of metric parameters to the service node 120 for determining the predetermined performance indicator at the service node 120. These processes are similar to the corresponding processes of the client node group 202-1 and are not repeated here.
At the service node 120, perturbed values of the plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130 are received (235-1, or further 235-2/…/235-L, collectively 235) from the at least one client node group 202-1 (or 202-2/…/202-L) respectively. For example, the service node 120 may receive (235) the four-tuples (FP', FN', TP', TN') or (k_i, FP', FN', TP', TN') from the at least one client node group 202 respectively. Alternatively, the service node 120 may receive (235) one or more of the above four perturbed values, depending on the performance indicator to be computed.
TP' denotes the first perturbation number of data samples of the first type among the plurality of data samples 102 at a given client node 110, where the first type of data sample is labeled as the first category by its ground-truth label and predicted as the first category by the machine learning model 130. TN' denotes the second perturbation number of data samples of the second type among the plurality of data samples 102, where the second type of data sample is labeled as the second category and also predicted as the second category by the model. FP' denotes the third perturbation number of data samples of the third type among the plurality of data samples 102, where the third type of data sample is labeled as the second category and predicted as the first category. FN' denotes the fourth perturbation number of data samples of the fourth type among the plurality of data samples 102, where the fourth type of data sample is labeled as the first category and predicted as the second category.
For each of the at least one client node group 202, the service node 120 aggregates, by metric parameter, the perturbed values of the plurality of metric parameters from the client nodes 110 of that client node group, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the at least one client node group 202. For example, the service node 120 aggregates (240-1), by metric parameter, the perturbed values (FP', FN', TP', TN') of the plurality of metric parameters from the client nodes 110 of the client node group 202-1, obtaining aggregate values of the plurality of metric parameters corresponding to the client node group 202-1. Similarly, the service node 120 aggregates (240-2/…/240-L), by metric parameter, the perturbed values (FP', FN', TP', TN') of the plurality of metric parameters from the client nodes 110 of the client node groups 202-2/…/202-L, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the client node groups 202-2/…/202-L.
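The per-group, per-metric aggregation can be sketched as an element-wise sum of the reported four-tuples. This helper is hypothetical: the disclosure does not fix the aggregation to summation, but summing the counts is one plausible reading consistent with the rate formulas that follow:

```python
def aggregate_group(perturbed_tuples):
    """Sum the (FP', FN', TP', TN') tuples reported by the clients of one group,
    metric by metric."""
    fp = sum(t[0] for t in perturbed_tuples)
    fn = sum(t[1] for t in perturbed_tuples)
    tp = sum(t[2] for t in perturbed_tuples)
    tn = sum(t[3] for t in perturbed_tuples)
    return (fp, fn, tp, tn)
```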
In some embodiments, for each client node group 202, the service node 120 may compute the aggregate values TPR (true positive rate) and FPR (false positive rate) of the plurality of metric parameters corresponding to that client node group 202 based on the following equations (7) and (8). TPR denotes the proportion of actually positive samples (positive samples) that are correctly judged as positive. FPR denotes the proportion of actually negative samples (negative samples) that are incorrectly judged as positive.
TPR = TP′/(TP′+FN′)      (7)
FPR = FP′/(FP′+TN′)      (8)
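Equations (7) and (8) translate directly into code (function names are illustrative):

```python
def true_positive_rate(tp, fn):
    """TPR = TP' / (TP' + FN'), equation (7)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """FPR = FP' / (FP' + TN'), equation (8)."""
    return fp / (fp + tn)
```

For example, with TP' = 3 and FN' = 1, three of the four actually positive samples were judged positive, so the TPR is 0.75.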
It should be understood that the aggregate values TPR and FPR described above are merely exemplary rather than restrictive. In some embodiments, the service node 120 may use other methods to derive the aggregate values of the plurality of metric parameters corresponding to a client node group 202.
The service node 120 determines (245) the value of the predetermined performance indicator based on the plurality of score thresholds respectively associated with the at least one client node group 202 and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one client node group 202.
In some embodiments, the predetermined performance indicator includes at least the area under the curve (AUC) of the receiver operating characteristic curve (ROC). The service node 120 may determine the ROC of the machine learning model 130 based on the at least one score threshold and the aggregate values of the plurality of metric parameters.
In some embodiments, the value of L (i.e., the number of score thresholds or the number of client node groups 202) may be greater than 1. In other words, the at least one group may include a plurality of groups and the at least one score threshold may include a plurality of score thresholds. The service node 120 may compute the coordinate points of a plurality of (FPR, TPR) pairs, one for each threshold score, and connect these points into a line to fit the ROC curve of the machine learning model 130.
The service node 120 may then determine the AUC of this ROC. By definition, the AUC is the area under the ROC curve. In some embodiments, the AUC may be computed according to its definition by using an approximation algorithm to calculate the area under the ROC curve.
Figure 4 illustrates a schematic diagram of an ROC curve 410 according to some embodiments of the present disclosure. The ROC curve 410 is drawn from the coordinate points of a plurality of (FPR, TPR) pairs computed from a plurality of threshold scores (in this example, the value of L is greater than 1). In the example of Figure 4, the plurality of coordinate points are (0,0), (0,0.2), (0.2,0.2), (0.2,0.4), (0.4,0.4), (0.4,0.6), (0.6,0.6), (0.6,0.8), (0.8,0.8), (0.8,1) and (1,1). Figure 4 also shows a curve 420 that divides the ROC plane into two parts. The ROC curve 410 can be used to determine that the AUC of this ROC curve is 0.7. Alternatively, the area between the ROC curve 410 and the curve 420 can be determined to be 0.2, and the AUC of the ROC curve 410 can then be determined to be 0.7.
It should be understood that the ROC curve 410 depicted in Figure 4 is merely illustrative, and the coordinate values of the individual points in Figure 4 are for illustration purposes only rather than restrictive. The service node 120 may determine coordinate values of other magnitudes and may draw ROC curves of other shapes.
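A minimal sketch of the approximation mentioned above: sort the (FPR, TPR) points and integrate the area under the piecewise-linear curve with the trapezoid rule (the function name is illustrative, not part of the disclosure):

```python
def auc_from_points(points):
    """Approximate the area under an ROC curve given (fpr, tpr) points,
    using the trapezoid rule between consecutive sorted points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Vertical segments (x1 == x0) contribute zero area.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfect classifier's curve (0,0) → (0,1) → (1,1) yields an AUC of 1.0, and the diagonal (0,0) → (1,1) yields 0.5.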
Dividing the client nodes 110 into two or more groups and deriving the AUC of the ROC by computing the coordinate points of a plurality of (FPR, TPR) pairs from a plurality of threshold scores can make the result of the model performance evaluation more accurate.
Additionally or alternatively, when computing the AUC, the AUC may also be determined from a probabilistic perspective. The AUC can be regarded as the probability that, for a randomly selected positive sample and a randomly selected negative sample, the machine learning model gives the positive sample a higher prediction score than the negative sample. That is, among all positive-negative sample pairs formed by pairing the positive and negative samples in the data sample set, it is the proportion of pairs in which the positive sample's prediction score is greater than the negative sample's prediction score. If the model can output higher prediction scores for more positive samples than for negative samples, the AUC is considered higher and the model's performance better. The AUC takes values between 0.5 and 1; the closer the AUC is to 1, the better the performance of the model. In this example, the service node 120 may determine the AUC from a probabilistic perspective based on the respective score thresholds of the at least one client node group 202 and the aggregate values of the plurality of metric parameters. In this example, the value of L (i.e., the number of score thresholds or the number of client node groups 202) may be 1 or an integer greater than 1.
In the above AUC computations, the values of the required metric parameters all need to be determined based on the label data of the data samples 102.
Alternatively or additionally, in some embodiments, the predetermined performance indicator may include the ACC of the prediction results. In this example, for each client node group 202, the service node 120 may determine (TP'+TN') and (TP'+FP'+FN'+TN') respectively as aggregate values of the plurality of metric parameters corresponding to that client node group 202. The service node 120 may then determine the value of ACC based on the above aggregate values using the following equation (9):
ACC = (TP'+TN')/(TP'+FP'+FN'+TN')      (9)
In addition to AUC and ACC, the performance indicators of the machine learning model 130 may also include precision, expressed as Precision = TP'/(TP'+FP'), i.e., among the subset of data samples predicted as positive samples, the probability of being labeled as a positive sample. The performance indicators of the machine learning model 130 may also include recall, expressed as Recall = TP'/(TP'+FN'), i.e., the probability that a positive sample is predicted as positive. The performance indicators of the machine learning model 130 may also include the P-R curve, with recall on the horizontal axis and precision on the vertical axis. The closer the P-R curve is to the upper-right corner, the better the performance of the model. The area under this curve is called the AP score (Average Precision Score).
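Equation (9) and the precision and recall formulas above can be sketched together (hypothetical helpers operating on the aggregated perturbed counts):

```python
def accuracy(fp, fn, tp, tn):
    """ACC = (TP' + TN') / (TP' + FP' + FN' + TN'), equation (9)."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(fp, tp):
    """Precision = TP' / (TP' + FP')."""
    return tp / (tp + fp)

def recall(fn, tp):
    """Recall = TP' / (TP' + FN')."""
    return tp / (tp + fn)
```

For example, with (FP', FN', TP', TN') = (1, 1, 3, 5), eight of the ten samples are classified correctly, so the accuracy is 0.8.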
It should be understood that the performance indicators listed above, such as the AUC of the ROC, are merely exemplary rather than restrictive. Examples of performance indicators usable in the present disclosure include, but are not limited to, the AUC of the ROC, accuracy, error rate, precision, recall, AP score, and so on.
In the above manner, the service node 120 can determine the predetermined performance indicator based on the perturbed values of the plurality of metric parameters received from the at least one client node group, thereby performing model performance evaluation. In this way, a client node needs to expose neither its local ground-truth label set nor its local predicted classification results (i.e., predicted label information), while the service node can still compute the value of the performance indicator based on the feedback information from the client nodes (e.g., the perturbed values of the plurality of metric parameters). In this way, privacy protection of the client nodes' local label data is achieved while determining the performance indicator of the machine learning model.
Figure 5 illustrates a flowchart of a process 500 for model performance evaluation according to some embodiments of the present disclosure. The process 500 may be implemented at the client node 110.
At block 510, the client node 110 determines a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing a plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category. In some embodiments, the process 500 further includes receiving the score threshold from the service node 120. The client node 110 may determine the plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing the plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with the score threshold received from the service node 120.
At block 520, the client node 110 determines values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130 based on differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102.
In some embodiments, to determine the values of the plurality of metric parameters, the client node 110 may determine, based on the above differences, at least one of the following: a first number of data samples of a first type among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the first type indicate the first category; a second number of data samples of a second type among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the second type indicate the second category; a third number of data samples of a third type among the plurality of data samples, where the predicted classification result corresponding to a data sample of the third type indicates the first category and the corresponding ground-truth classification result indicates the second category; and a fourth number of data samples of a fourth type among the plurality of data samples, where the predicted classification result corresponding to a data sample of the fourth type indicates the second category and the corresponding ground-truth classification result indicates the first category.
At block 530, the client node 110 applies perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. For example, to apply perturbation to the values of the plurality of metric parameters, the client node 110 may, for at least one of the first number, the second number, the third number, and the fourth number, apply perturbation to the at least one number by: determining a sensitivity value related to the perturbation; determining a random perturbation distribution based on the sensitivity value and a label differential privacy mechanism; and applying perturbation to the at least one number based on the random perturbation distribution.
At block 540, the client node 110 sends the perturbed values of the plurality of metric parameters to the service node 120 for determining the predetermined performance indicator at the service node 120. For example, the predetermined performance indicator includes at least the area under the curve (AUC) of the receiver operating characteristic curve (ROC).
Figure 6 illustrates a flowchart of a process 600 for model performance evaluation at the service node 120 according to some embodiments of the present disclosure. The process 600 may be implemented at the service node 120.
At block 610, the service node 120 receives, from the client nodes 110 of at least one group respectively, perturbed values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130. In some embodiments, the client nodes 110 of each group in the at least one group are different from the client nodes 110 of the other groups. In some embodiments, the process 600 further includes sending the at least one score threshold respectively to the client nodes 110 in the respectively associated groups.
In some embodiments, for a given client node 110, the perturbed values of the plurality of metric parameters include at least one of the following: a first perturbation number of data samples of a first type among the plurality of data samples at the given client node 110, the first type of data sample being labeled as the first category and predicted as the first category; a second perturbation number of data samples of a second type among the plurality of data samples, the second type of data sample being labeled as the second category and predicted as the second category; a third perturbation number of data samples of a third type among the plurality of data samples, the third type of data sample being labeled as the second category and predicted as the first category; and a fourth perturbation number of data samples of a fourth type among the plurality of data samples, the fourth type of data sample being labeled as the first category and predicted as the second category. The above predictions are based on comparing the prediction scores output by the machine learning model 130 with the score threshold associated with the group in which the given client node 110 is located.
At block 620, the service node 120, for each group in the at least one group, aggregates by metric parameter the perturbed values of the plurality of metric parameters from the client nodes 110 of that group, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
At block 630, the service node 120 determines the value of the predetermined performance indicator based on the plurality of score thresholds respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group. In some embodiments, the at least one group may include a plurality of groups and the at least one score threshold may include a plurality of score thresholds. In such an example, to determine the value of the predetermined performance indicator, the service node 120 may determine the receiver operating characteristic curve (ROC) of the machine learning model 130 based on the plurality of score thresholds and the aggregate values of the plurality of metric parameters, and determine the area under the curve (AUC) of the ROC.
Figure 7 illustrates a block diagram of an apparatus 700 for model performance evaluation at a client node according to some embodiments of the present disclosure. The apparatus 700 may be implemented as or included in the client node 110. The individual modules/components in the apparatus 700 may be implemented by hardware, software, firmware, or any combination thereof.
As shown, the apparatus 700 includes a classification determination module 710 configured to determine a plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing a plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with a score threshold. The plurality of predicted classification results respectively indicate that the plurality of data samples 102 are predicted to belong to the first category or the second category. In some embodiments, the apparatus 700 further includes a receiving module configured to receive the score threshold from the service node 120. The apparatus 700 may determine the plurality of predicted classification results corresponding to the plurality of data samples 102 by comparing the plurality of prediction scores output by the machine learning model 130 for the plurality of data samples 102 with the score threshold received from the service node 120.
The apparatus 700 further includes a metric parameter determination module 720 configured to determine values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130 based on differences between the plurality of predicted classification results and the plurality of ground-truth classification results corresponding to the plurality of data samples 102.
In some embodiments, the metric parameter determination module 720 is configured to determine, based on the above differences, at least one of the following: a first number of data samples of a first type among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the first type indicate the first category; a second number of data samples of a second type among the plurality of data samples, where both the predicted classification result and the ground-truth classification result corresponding to a data sample of the second type indicate the second category; a third number of data samples of a third type among the plurality of data samples, where the predicted classification result corresponding to a data sample of the third type indicates the first category and the corresponding ground-truth classification result indicates the second category; and a fourth number of data samples of a fourth type among the plurality of data samples, where the predicted classification result corresponding to a data sample of the fourth type indicates the second category and the corresponding ground-truth classification result indicates the first category.
The apparatus 700 further includes a perturbation module 730 configured to apply perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters. For example, the perturbation module 730 may be configured to, for at least one of the first number, the second number, the third number, and the fourth number, apply perturbation to the at least one number by: determining a sensitivity value related to the perturbation; determining a random perturbation distribution based on the sensitivity value and a label differential privacy mechanism; and applying perturbation to the at least one number based on the random perturbation distribution.
The apparatus 700 further includes a perturbed value sending module 740 configured to send the perturbed values of the plurality of metric parameters to the service node 120 for determining the predetermined performance indicator at the service node 120. For example, the predetermined performance indicator includes at least the area under the curve (AUC) of the receiver operating characteristic curve (ROC).
Figure 8 illustrates a block diagram of an apparatus 800 for model performance evaluation at a service node according to some embodiments of the present disclosure. The apparatus 800 may be implemented as or included in the service node 120. The individual modules/components in the apparatus 800 may be implemented by hardware, software, firmware, or any combination thereof.
As shown, the apparatus 800 includes a perturbed value receiving module 810 configured to receive, from the client nodes 110 of at least one group respectively, perturbed values of a plurality of metric parameters related to the predetermined performance indicator of the machine learning model 130. In some embodiments, the client nodes 110 of each group in the at least one group are different from the client nodes 110 of the other groups. In some embodiments, the apparatus 800 may further include a sending module configured to send the at least one score threshold respectively to the client nodes 110 in the respectively associated groups.
In some embodiments, for a given client node 110, the perturbed values of the plurality of metric parameters include at least one of the following: a first perturbation number of data samples of a first type among the plurality of data samples at the given client node 110, the first type of data sample being labeled as the first category and predicted as the first category; a second perturbation number of data samples of a second type among the plurality of data samples, the second type of data sample being labeled as the second category and predicted as the second category; a third perturbation number of data samples of a third type among the plurality of data samples, the third type of data sample being labeled as the second category and predicted as the first category; and a fourth perturbation number of data samples of a fourth type among the plurality of data samples, the fourth type of data sample being labeled as the first category and predicted as the second category. The above predictions are based on comparing the prediction scores output by the machine learning model 130 with the score threshold associated with the group in which the given client node 110 is located.
The apparatus 800 further includes an aggregation module 820 configured to, for each group in the at least one group, aggregate by metric parameter the perturbed values of the plurality of metric parameters from the client nodes 110 of that group, obtaining aggregate values of the plurality of metric parameters respectively corresponding to the at least one group.
The apparatus 800 further includes an indicator determination module 830 configured to determine the value of the predetermined performance indicator based on the at least one score threshold respectively associated with the at least one group and the aggregate values of the plurality of metric parameters respectively corresponding to the at least one group. In some embodiments, the at least one group may include a plurality of groups and the at least one score threshold may include a plurality of score thresholds. In such embodiments, the indicator determination module 830 includes a receiver operating characteristic curve (ROC) determination module configured to determine the ROC of the machine learning model 130 based on the plurality of score thresholds and the aggregate values of the plurality of metric parameters. The indicator determination module 830 further includes an area under the curve (AUC) determination module configured to determine the AUC of the ROC.
Figure 9 illustrates a block diagram of a computing device/system 900 capable of implementing one or more embodiments of the present disclosure. It should be understood that the computing device/system 900 shown in Figure 9 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device/system 900 shown in Figure 9 may be used to implement the client node 110 or the service node 120 of Figure 1.
As shown in Figure 9, the computing device/system 900 is in the form of a general-purpose computing device. Components of the computing device/system 900 may include, but are not limited to, one or more processors or processing units 910, a memory 920, a storage device 930, one or more communication units 940, one or more input devices 950, and one or more output devices 960. The processing unit 910 may be a real or virtual processor and can perform various processing according to programs stored in the memory 920. In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the computing device/system 900.
The computing device/system 900 typically includes multiple computer storage media. Such media may be any available media accessible to the computing device/system 900, including but not limited to volatile and non-volatile media, removable and non-removable media. The memory 920 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 930 may be a removable or non-removable medium and may include machine-readable media, such as a flash drive, a magnetic disk, or any other medium, that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device/system 900.
The computing device/system 900 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 9, a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 920 may include a computer program product 925 having one or more program modules configured to perform the various methods or acts of the various embodiments of the present disclosure.
The communication unit 940 implements communication with other computing devices through communication media. Additionally, the functions of the components of the computing device/system 900 may be implemented as a single computing cluster or multiple computing machines capable of communicating over communication connections. Thus, the computing device/system 900 may operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.
The input device 950 may be one or more input devices, such as a mouse, keyboard, trackball, etc. The output device 960 may be one or more output devices, such as a display, speakers, a printer, etc. The computing device/system 900 may also, as needed, communicate via the communication unit 940 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device/system 900, or with any device (e.g., a network card, a modem, etc.) that enables the computing device/system 900 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions or a computer program are stored, where the computer-executable instructions or the computer program are executed by a processor to implement the methods described above.
According to an exemplary implementation of the present disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the methods described above.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create an apparatus that implements the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus, and/or other device to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
The implementations of the present disclosure have been described above; the above description is exemplary rather than exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The choice of terms used herein is intended to best explain the principles of the implementations, their practical applications, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (20)

  1. A method of model performance evaluation, comprising:
    at a client node, determining a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold, the plurality of predicted classification results respectively indicating that the plurality of data samples are predicted to belong to a first category or a second category;
    determining values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples;
    applying perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters; and
    sending the perturbed values of the plurality of metric parameters to a service node.
  2. 根据权利要求1所述的方法,还包括:
    从所述服务节点接收所述得分阈值。
  3. 根据权利要求1或2所述的方法,其中确定所述多个度量参数的值包括:基于所述差异来确定以下至少一项:
    所述多个数据样本中第一类数据样本的第一数目,所述第一类数据样本对应的预测分类结果和真值分类结果均指示所述第一类别;
    所述多个数据样本中第二类数据样本的第二数目,所述第二类数据样本对应的预测分类结果和真值分类结果均指示所述第二类别;
    所述多个数据样本中第三类数据样本的第三数目,所述第三类数据样本对应的预测分类结果指示所述第一类别并且对应的真值分类结果指示所述第二类别;以及
    所述多个数据样本中第四类数据样本的第四数目,所述第四类数据样本对应的预测分类结果指示所述第二类别并且对应的真值分类结果指示所述第一类别。
  4. 根据权利要求3所述的方法,其中对所述多个度量参数的所述值施加扰动包括:
    对于所述第一数目、所述第二数目、所述第三数目和所述第四数目中的至少一个数目,通过以下来对所述至少一个数目施加扰动:
    确定与扰动相关的灵敏度值;
    基于所述灵敏度值和标签差分隐私机制,来确定随机扰动分布;以及
    基于所述随机扰动分布,对所述至少一个数目施加扰动。
  5. 根据权利要求1至4中任一项所述的方法,其中所述预定性能度量指标至少包括受试者工作特征曲线ROC的曲线下面积AUC。
  6. A method for model performance evaluation, comprising:
    at a serving node, receiving, respectively from client nodes of at least one group, perturbed values of a plurality of metric parameters related to a predetermined performance indicator of a machine learning model;
    for each of the at least one group, aggregating, per metric parameter, the perturbed values of the plurality of metric parameters from the client nodes of the group, to obtain aggregated values of the plurality of metric parameters respectively corresponding to the at least one group; and
    determining a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregated values of the plurality of metric parameters respectively corresponding to the at least one group.
  7. The method of claim 6, further comprising:
    sending the at least one score threshold respectively to the client nodes in the respectively associated groups.
  8. The method of claim 6 or 7, wherein, for a given client node, the perturbed values of the plurality of metric parameters comprise at least one of the following:
    a first perturbed number of first-type data samples among a plurality of data samples at the given client node, the first-type data samples being labeled as a first category and predicted as the first category;
    a second perturbed number of second-type data samples among the plurality of data samples, the second-type data samples being labeled as a second category and predicted as the second category;
    a third perturbed number of third-type data samples among the plurality of data samples, the third-type data samples being labeled as the second category and predicted as the first category; and
    a fourth perturbed number of fourth-type data samples among the plurality of data samples, the fourth-type data samples being labeled as the first category and predicted as the second category; and
    wherein the prediction is based on a comparison of the prediction scores output by the machine learning model with the score threshold associated with the group in which the given client node is located.
  9. The method of claim 8, wherein the at least one group comprises a plurality of groups and the at least one score threshold comprises a plurality of score thresholds, and wherein determining the value of the predetermined performance indicator comprises:
    determining a receiver operating characteristic (ROC) curve of the machine learning model based on the plurality of score thresholds and the aggregated values of the plurality of metric parameters; and
    determining an area under curve (AUC) of the ROC curve.
  10. The method of any of claims 6 to 9, wherein the client nodes of each of the at least one group are distinct from the client nodes of every other group.
  11. An apparatus for model performance evaluation, comprising:
    a classification determining module configured to determine a plurality of predicted classification results corresponding to a plurality of data samples by comparing a plurality of prediction scores output by a machine learning model for the plurality of data samples with a score threshold, the plurality of predicted classification results respectively indicating that the plurality of data samples are predicted to belong to a first category or a second category;
    a metric parameter determining module configured to determine, based on differences between the plurality of predicted classification results and a plurality of ground-truth classification results corresponding to the plurality of data samples, values of a plurality of metric parameters related to a predetermined performance indicator of the machine learning model;
    a perturbation module configured to apply perturbation to the values of the plurality of metric parameters to obtain perturbed values of the plurality of metric parameters; and
    a perturbed value sending module configured to send the perturbed values of the plurality of metric parameters to a serving node.
  12. The apparatus of claim 11, further comprising:
    a receiving module configured to receive the score threshold from the serving node.
  13. The apparatus of claim 11 or 12, wherein the metric parameter determining module is configured to determine, based on the differences, at least one of the following:
    a first number of first-type data samples among the plurality of data samples, the predicted classification result and the ground-truth classification result corresponding to the first-type data samples both indicating the first category;
    a second number of second-type data samples among the plurality of data samples, the predicted classification result and the ground-truth classification result corresponding to the second-type data samples both indicating the second category;
    a third number of third-type data samples among the plurality of data samples, the predicted classification result corresponding to the third-type data samples indicating the first category and the corresponding ground-truth classification result indicating the second category; and
    a fourth number of fourth-type data samples among the plurality of data samples, the predicted classification result corresponding to the fourth-type data samples indicating the second category and the corresponding ground-truth classification result indicating the first category.
  14. The apparatus of claim 13, wherein the perturbation module is configured to:
    for at least one of the first number, the second number, the third number, and the fourth number, apply perturbation to the at least one number by:
    determining a sensitivity value related to the perturbation;
    determining a random perturbation distribution based on the sensitivity value and a label differential privacy mechanism; and
    applying perturbation to the at least one number based on the random perturbation distribution.
  15. An apparatus for model performance evaluation, comprising:
    a perturbed value receiving module configured to receive, respectively from client nodes of at least one group, perturbed values of a plurality of metric parameters related to a predetermined performance indicator of a machine learning model;
    an aggregating module configured to, for each of the at least one group, aggregate, per metric parameter, the perturbed values of the plurality of metric parameters from the client nodes of the group, to obtain aggregated values of the plurality of metric parameters respectively corresponding to the at least one group; and
    an indicator determining module configured to determine a value of the predetermined performance indicator based on at least one score threshold respectively associated with the at least one group and the aggregated values of the plurality of metric parameters respectively corresponding to the at least one group.
  16. The apparatus of claim 15, further comprising:
    a sending module configured to send the at least one score threshold respectively to the client nodes in the respectively associated groups.
  17. The apparatus of claim 15 or 16, wherein, for a given client node, the perturbed values of the plurality of metric parameters comprise at least one of the following:
    a first perturbed number of first-type data samples among a plurality of data samples at the given client node, the first-type data samples being labeled as a first category and predicted as the first category;
    a second perturbed number of second-type data samples among the plurality of data samples, the second-type data samples being labeled as a second category and predicted as the second category;
    a third perturbed number of third-type data samples among the plurality of data samples, the third-type data samples being labeled as the second category and predicted as the first category; and
    a fourth perturbed number of fourth-type data samples among the plurality of data samples, the fourth-type data samples being labeled as the first category and predicted as the second category; and wherein the prediction is based on a comparison of the prediction scores output by the machine learning model with the score threshold associated with the group in which the given client node is located.
  18. The apparatus of claim 17, wherein the at least one group comprises a plurality of groups and the at least one score threshold comprises a plurality of score thresholds, and wherein the indicator determining module comprises:
    a receiver operating characteristic (ROC) curve determining module configured to determine an ROC curve of the machine learning model based on the plurality of score thresholds and the aggregated values of the plurality of metric parameters; and
    an area under curve (AUC) determining module configured to determine an AUC of the ROC curve.
  19. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform the method of any of claims 1 to 5 or the method of any of claims 6 to 10.
  20. A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the method of any of claims 1 to 5 or the method of any of claims 6 to 10.
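The claims above leave the concrete label differential privacy mechanism of claim 4 open. As a minimal illustrative sketch (not part of the patent text), the client-side perturbation could be instantiated with a Laplace mechanism whose noise scale is calibrated from the sensitivity value and a privacy budget `epsilon`; the function name and parameter names are assumptions for illustration only:

```python
import numpy as np

def perturb_counts(counts, sensitivity=1.0, epsilon=1.0, rng=None):
    """Apply Laplace noise to a list of confusion-matrix counts
    (e.g., the first/second/third/fourth numbers of claim 4).

    Hypothetical sketch: the Laplace scale b = sensitivity / epsilon
    is one common way to derive a random perturbation distribution
    from a sensitivity value and a privacy budget.
    """
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon  # Laplace scale from sensitivity and budget
    # Add independent Laplace noise to each count.
    return [c + rng.laplace(0.0, scale) for c in counts]
```

A client node would then send `perturb_counts([tp, tn, fp, fn], ...)` to the serving node instead of the raw counts, so that individual labels are not directly recoverable from any single report.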
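The serving-node computation recited in claims 6 and 9 — building an ROC curve from the per-threshold aggregated counts and taking the area under it — can be sketched as follows. The trapezoidal integration, the anchoring of the endpoints (0, 0) and (1, 1), and the function name are illustrative assumptions, not details fixed by the claims:

```python
def roc_auc_from_counts(counts_per_threshold):
    """Given aggregated (TP, TN, FP, FN) counts at each score threshold,
    build ROC points (FPR, TPR) and integrate the area under the curve
    with the trapezoidal rule."""
    points = []
    for tp, tn, fp, fn in counts_per_threshold:
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
        points.append((fpr, tpr))
    points += [(0.0, 0.0), (1.0, 1.0)]  # anchor the ends of the ROC curve
    points.sort()
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return points, auc
```

Each group of client nodes contributes the aggregated counts for one threshold, so the number of ROC points grows with the number of groups; because the counts are perturbed before aggregation, the resulting AUC is an estimate rather than an exact value.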
PCT/CN2023/091156 2022-05-13 2023-04-27 Method, apparatus, device and storage medium for model performance evaluation WO2023216900A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210524000.6A CN117114145A (zh) 2022-05-13 2022-05-13 Method, apparatus, device and storage medium for model performance evaluation
CN202210524000.6 2022-05-13

Publications (1)

Publication Number Publication Date
WO2023216900A1 true WO2023216900A1 (zh) 2023-11-16

Family

ID=88729636

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091156 WO2023216900A1 (zh) Method, apparatus, device and storage medium for model performance evaluation 2022-05-13 2023-04-27

Country Status (2)

Country Link
CN (1) CN117114145A (zh)
WO (1) WO2023216900A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434280A (zh) * 2020-12-17 2021-03-02 浙江工业大学 A blockchain-based federated learning defense method
WO2021144803A1 (en) * 2020-01-16 2021-07-22 Telefonaktiebolaget Lm Ericsson (Publ) Context-level federated learning
CN113379071A (zh) * 2021-06-16 2021-09-10 中国科学院计算技术研究所 A noisy-label correction method based on federated learning
CN113626866A (zh) * 2021-08-12 2021-11-09 中电积至(海南)信息技术有限公司 A localized differential privacy protection method, system, computer device and storage medium for federated learning


Also Published As

Publication number Publication date
CN117114145A (zh) 2023-11-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802672

Country of ref document: EP

Kind code of ref document: A1