CN113657617A - Method and system for model joint training - Google Patents

Method and system for model joint training

Info

Publication number
CN113657617A
CN113657617A (Application CN202111077337.9A)
Authority
CN
China
Prior art keywords
gradient
gradients
model
joint training
credible
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111077337.9A
Other languages
Chinese (zh)
Inventor
陈超超
曹绍升
王力
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202111077337.9A priority Critical patent/CN113657617A/en
Publication of CN113657617A publication Critical patent/CN113657617A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of this specification disclose a method and system for model joint training. The method comprises: a plurality of participant terminals in the joint training each perform model training jointly, based on the private data held by the terminal itself, and each generates its own gradient using a gradient-based optimization algorithm; the participant terminals each send their gradients to a server; the server selects a credible gradient from the plurality of gradients and updates the parameters of the joint training model according to the selected credible gradient; the sample data is text data, voice data, or image data.

Description

Method and system for model joint training
Description of divisional application
This application is a divisional application of Chinese application No. 202010326265.6, filed on April 23, 2020.
Technical Field
The present disclosure relates to the field of machine learning, and more particularly, to a method and system for model joint training.
Background
In multi-party joint modeling, a machine learning model is jointly established by multiple participants while each participant's private data is protected. In this scenario, however, one or more of the participants may poison the training data for their own benefit, so that the finally trained model is biased; for example, the model may make false judgments on certain samples so that the offending participant benefits from them.
Therefore, a method and system for model joint training are desired that can resist poisoning of the training data by one or more of the participants in a multi-party joint modeling scenario.
Disclosure of Invention
One of the embodiments of the present specification provides a method for model joint training, including:
a plurality of participant terminals in the joint training each perform model joint training based on the sample data held by the terminal itself, and each participant terminal generates its own gradient using a gradient-based optimization algorithm; the participant terminals each send their gradients to a server; the server selects a credible gradient from the plurality of gradients and updates the parameters of the joint training model according to the selected credible gradient; the sample data is text data, voice data, or image data.
One of the embodiments of the present specification provides a system for model joint training, the system including:
a gradient generation module, configured to cause a plurality of participant terminals in the joint training to each perform model joint training based on the sample data held by the terminal itself, each participant terminal generating its own gradient using a gradient-based optimization algorithm; a gradient sending module, configured to cause the participant terminals to each send their gradients to a server; and a parameter updating module, configured to cause the server to select a credible gradient from the plurality of gradients and update the parameters of the joint training model according to the selected credible gradient; the sample data is text data, voice data, or image data.
One of the embodiments of the present specification provides an apparatus for model joint training, the apparatus including:
at least one processor and at least one memory; the at least one memory is configured to store computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement the method of model joint training.
One of the embodiments of the present specification provides a computer-readable storage medium storing computer instructions, and when the computer instructions in the storage medium are read by a computer, the computer executes at least a part of the instructions to implement a method for model joint training.
Drawings
The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a block diagram of a system for model joint training in accordance with some embodiments described herein;
FIG. 2 is an exemplary flow diagram of a method of model joint training in accordance with some embodiments shown herein;
FIG. 3 is a diagram of an application scenario for model joint training in accordance with some embodiments of the present description; and
FIG. 4 is a schematic diagram illustrating updating parameters of a model based on gradients in accordance with some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that the terms "system", "device", "unit" and/or "module" as used herein are a way of distinguishing different components, elements, parts, portions, or assemblies at different levels. These words may, however, be replaced by other expressions that achieve the same purpose.
As used in this specification and the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this specification to illustrate operations performed by a system according to embodiments of the specification. It should be understood that the operations need not be performed exactly in the order shown; instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
FIG. 1 is a block diagram of a system for model joint training in accordance with some embodiments described herein.
As shown in FIG. 1, the system for model joint training may include a generation module 110, a sending module 120, and an updating module 130.
The generation module 110 may be configured to cause a plurality of participant terminals in the joint training to each perform model joint training based on the sample data held by the terminal itself, with each participant terminal generating its own gradient using a gradient-based optimization algorithm. For a detailed description of how the participant terminals generate their respective gradients using a gradient-based optimization algorithm, refer to FIG. 2; it is not repeated here.
The sending module 120 may be configured to cause the participant terminals to each send their gradients to a server. For a detailed description of how the participant terminals send the gradients to the server, refer to FIG. 2; it is not repeated here.
The updating module 130 may be configured to cause the server to select a credible gradient from the plurality of gradients and update the parameters of the joint training model according to the selected credible gradient. For a detailed description of how the server selects a credible gradient and updates the parameters of the joint training model accordingly, refer to FIG. 2; it is not repeated here.
It should be understood that the system and its modules shown in FIG. 1 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).
It should be noted that the above description of the system for model joint training and its modules is for convenience of description only and does not limit this specification to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, having understood the principle of the system, the modules may be combined arbitrarily, or a subsystem may be formed and connected to other modules, without departing from this principle. For example, in some embodiments the generation module 110, the sending module 120, and the updating module 130 disclosed in FIG. 1 may be different modules in one system, or a single module may implement the functions of two or more of the modules described above. For example, the generation module 110 and the sending module 120 may be two modules, or one module may have both the function of generating a gradient and the function of sending it. Such variations are within the scope of protection of this specification.
FIG. 2 is an exemplary flow diagram of a method of model joint training in accordance with some embodiments shown herein.
Step 210: a plurality of participant terminals in the joint training each perform model joint training based on the sample data held by the terminal itself, and each participant terminal generates its own gradient using a gradient-based optimization algorithm. Specifically, this step may be performed by the generation module 110.
In some embodiments, multiple participant terminals need to jointly train a machine learning model; the training data held by the participant terminals share the same feature space but contain different samples. For example, one participant terminal is a social platform and another is an e-commerce platform. Because some of their services are similar, their feature spaces may be the same (for example, both collect features such as user preferences and historical consumption records), but because the customer groups of the two platforms are different, the sample data they collect differ.
In some embodiments, the server initializes the parameters of the joint training model. For example, for a logistic regression model, the weight parameters may be initialized to normally distributed random numbers with a mean of 0 and a standard deviation of 0.01, and the bias parameter may be initialized to zero. In some embodiments, the participant terminals each obtain the joint training model from the server, and the model is trained based on the sample data held by the terminals themselves. In some embodiments, the sample data may be text data, voice data, or image data. For example, if the model is used for text recognition, the sample data may be text data in the form of phrases or sentences. As another example, if the model is used for image classification, the sample data may be pictures of animals such as cats and dogs, or of plants such as flowers, grasses, and trees. As a further example, if the model is used for speech recognition, the sample data may be speech data.
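For illustration only (Python and NumPy are assumptions of this sketch, not part of this specification), the server-side initialization described above, with weights drawn from a normal distribution of mean 0 and standard deviation 0.01 and the bias cleared to zero, could look like the following; the function name and feature count are hypothetical.

```python
import numpy as np

def init_model_parameters(n_features: int, seed: int = 0):
    """Initialize weights from N(0, 0.01) and clear the bias to zero,
    as described above for the logistic regression example."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(loc=0.0, scale=0.01, size=n_features)
    bias = 0.0
    return weights, bias

# The server would initialize once and distribute the same starting
# parameters to every participant terminal.
w0, b0 = init_model_parameters(n_features=5)
```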
In some embodiments, the joint training model may be any model whose parameters are updated using a gradient-based optimization algorithm, including but not limited to a Logistic Regression (LR) model, a Gradient Boosting Decision Tree (GBDT) model, a Convolutional Neural Network (CNN) model, and the like.
The gradient is a vector that points in the direction in which the directional derivative of a function at a point attains its maximum; that is, at that point the function changes fastest along the direction of the gradient, and the rate of change (the modulus of the gradient) is largest there. Concretely, taking the partial derivative of a multivariate function with respect to each of its variables yields several partial derivative functions, and the vector formed by their values is the gradient. For example, for a two-variable function $F(\theta)=F(\theta_1,\theta_2)$, the gradient may be:

$$\nabla F(\theta)=\left(\frac{\partial F(\theta)}{\partial \theta_1},\ \frac{\partial F(\theta)}{\partial \theta_2}\right)$$

i.e., the gradient of the function F(θ) is the vector composed of the two elements ∂F(θ)/∂θ₁ and ∂F(θ)/∂θ₂. In some embodiments, a gradient-based optimization algorithm may be used to generate the gradient corresponding to the parameters of the model; the gradient has the same dimension as the model parameters. For example, if the model has 10 parameters, the gradient is a vector of 10 elements. Gradient-based optimization algorithms are commonly used in machine learning. For example, when solving for the minimum of a loss function, the Gradient Descent algorithm can be used to solve iteratively, step by step, for the minimized loss function and the final parameter values of the model. According to the amount of data used in each training iteration, gradient descent algorithms can further be classified into Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent (MBGD), and the like.
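As a purely illustrative aside (the concrete function below and the use of Python/NumPy are assumptions, not part of this specification), the relationship between a function, its parameters, and its gradient can be checked numerically; the gradient has one element per parameter, as noted above.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Approximate the gradient of f at theta by central finite differences.
    The result has the same dimension as theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Illustrative two-variable function (hypothetical, not from this specification):
# F(theta) = theta_1 ** 2 + 3 * theta_2, whose exact gradient is (2 * theta_1, 3).
F = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(F, [1.0, 2.0]))  # approximately [2.0, 3.0]
```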
How the participant terminals generate their respective gradients is described below, taking a multiple linear regression model as the joint training model as an example.

For convenience of description, assume that the linear regression model has five weight parameters, θ₁ to θ₅, and one bias parameter, θ₀. The model can be expressed as:

$$h_\theta\!\left(x^{(i)}\right)=\theta_0+\theta_1 x_1^{(i)}+\theta_2 x_2^{(i)}+\theta_3 x_3^{(i)}+\theta_4 x_4^{(i)}+\theta_5 x_5^{(i)} \qquad (1)$$

where $x^{(i)}$ denotes the input vector of the i-th sample $(x^{(i)}, y^{(i)})$, $x_1^{(i)}$ denotes the first feature of the input vector, ..., and $x_5^{(i)}$ denotes the fifth feature of the input vector. The label corresponding to the input vector $x^{(i)}$ is $y^{(i)}$, which denotes the result the model is expected to output after $x^{(i)}$ is input. In some embodiments, during training the model may be called the hypothesis function; when $x^{(i)}$ is input into the hypothesis function, the obtained predicted value may not be consistent with the label $y^{(i)}$. Therefore, in some embodiments, a loss function needs to be established. The loss function, also called the cost function, is used to evaluate the degree to which the predicted value of the hypothesis function is inconsistent with the corresponding label: the smaller the value of the loss function, the closer the prediction of the model is to the expected value. Model training is thus the process of continually adjusting the parameters of the model so as to minimize the loss function; the relationship among the hypothesis function, the loss function, and the parameters of the model is as shown in FIG. 4. In this example, the mean squared error function is chosen as the loss function:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right)-y^{(i)}\right)^{2} \qquad (2)$$

In equation (2), m is the number of samples involved in the calculation. For the batch gradient descent algorithm, all samples are involved; for example, if participant terminal C1 has 5000 samples, then m may be 5000. The factor 1/2 in equation (2) is a constant used to cancel the exponent 2 when the gradient is subsequently calculated; it simplifies the calculation without affecting the result. In this example, the gradient of the loss function J(θ) can be calculated using the batch gradient descent algorithm. From the definition of the gradient:

$$\nabla J(\theta)=\left(\frac{\partial J(\theta)}{\partial \theta_0},\ \frac{\partial J(\theta)}{\partial \theta_1},\ \ldots,\ \frac{\partial J(\theta)}{\partial \theta_5}\right)$$

i.e., the gradient ∇J(θ) is a vector of six elements, ∂J(θ)/∂θ₀ to ∂J(θ)/∂θ₅, which correspond to the parameters θ₀ to θ₅ of the model. Specifically, the element ∂J(θ)/∂θ₀ is the value obtained by taking the partial derivative of J(θ) with θ₀ as the variable and the other parameters (θ₁ to θ₅) as constants, ..., and the element ∂J(θ)/∂θ₅ is the value obtained by taking the partial derivative of J(θ) with θ₅ as the variable and the other parameters (θ₀ to θ₄) as constants. In some embodiments, equation (2) may be decomposed into the two functions

$$J=\frac{1}{2m}\sum_{i=1}^{m}u^{2} \quad\text{and}\quad u=h_\theta\!\left(x^{(i)}\right)-y^{(i)},$$

and, according to the chain rule for composite functions, one obtains:

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right)-y^{(i)}\right)\frac{\partial h_\theta\!\left(x^{(i)}\right)}{\partial \theta_j}$$

Taking θ₀, ..., θ₅ in turn as the variable and the other parameters as constants, the partial derivatives of J(θ) are:

$$\frac{\partial J(\theta)}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right)-y^{(i)}\right)$$

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\!\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)},\qquad j=1,\ldots,5$$

Assume that there are five participant terminals in total, C1 to C5. In this example, the gradients calculated by the participant terminals C1 to C5 may be denoted g1, g2, g3, g4, and g5, respectively.
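By way of a non-authoritative sketch (Python/NumPy and all names below are assumptions for illustration, not part of this specification), each participant terminal's batch gradient for the linear regression example above, i.e. the six partial derivatives of J(θ), could be computed as follows.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Batch gradient of the mean-squared-error loss J(theta) for the linear
    hypothesis h_theta(x) = theta_0 + theta_1 * x_1 + ... + theta_5 * x_5.

    theta : array of shape (6,)   - bias theta_0 followed by theta_1..theta_5
    X     : array of shape (m, 5) - the terminal's own sample features
    y     : array of shape (m,)   - the corresponding labels
    Returns an array of shape (6,), one partial derivative per parameter.
    """
    m = X.shape[0]
    X_ext = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s for theta_0
    residual = X_ext @ theta - y              # h_theta(x^(i)) - y^(i)
    return (X_ext.T @ residual) / m           # (1/m) * sum over i of residual * x_j^(i)

# Each participant terminal C1..C5 would run this on its own private samples
# (random data here purely for illustration) to obtain its gradient g1..g5.
rng = np.random.default_rng(1)
gradients = [local_gradient(np.zeros(6), rng.normal(size=(5000, 5)), rng.normal(size=5000))
             for _ in range(5)]   # g1..g5
```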
Step 220: the participant terminals each send their gradients to the server. Specifically, this step may be performed by the sending module 120.
In some embodiments, each participant terminal may send the gradient calculated in step 210 to the server. The transmission modes include, but are not limited to, network transmission, console push, hard disk copy, and the like.
Step 230: the server selects a credible gradient from the plurality of gradients and updates the parameters of the joint training model according to the selected credible gradient. Specifically, this step may be performed by the updating module 130.
In some embodiments, one or more of the participant terminals may have poisoned their data. For example, if the model is used for image recognition, a participant may slightly alter some of the picture data during training, e.g., change a certain pixel value from "000" to "010" or "100", so that the recognition result of the trained model changes for certain samples. To prevent the model from being poisoned in this way, in some embodiments the server needs to select a credible gradient from the plurality of gradients. In some embodiments, a credible gradient is a gradient that is determined, by some rule obtained from theoretical derivation or from a large number of experiments, to have been computed from training data that has not been poisoned. In some embodiments, it may be assumed that the gradient calculated by a participant terminal using poisoned data has a larger value than the gradients calculated by the other participant terminals using normal training data. Thus, in some embodiments, the steps of selecting a credible gradient may be:
(1) Calculate a first average of the plurality of gradients. The term "first average" is used in some embodiments of this specification to distinguish this average from other averages described later, such as the second average. In some embodiments, the server may calculate the average of the plurality of gradients sent by the participant terminals in step 220 as the first average. Specifically, the gradients may be added together and divided by their number to obtain the average. For example, the first average of the gradients g1-g5 obtained in the example of step 210 may be: g_bar = (g1 + g2 + g3 + g4 + g5) ÷ 5.
(2) Compare each gradient with the first average to obtain a plurality of comparison results. Specifically, subtract the first average calculated in step (1) from each of the gradients received in step 220 and take the modulus of the result, obtaining a plurality of difference values. For example, the difference values between the gradients g1-g5 obtained in the example of step 210 and the first average g_bar obtained in step (1) are: diff1 = |g1 - g_bar|, ..., diff5 = |g5 - g_bar|.
(3) Sort the comparison results to obtain the credible gradients. Specifically, arrange the difference values obtained above in ascending order, and take the gradients corresponding to the first L difference values as the credible gradients. In some embodiments, the gradient that deviates most from the first average may be rejected as suspicious, and the remaining gradients are taken as credible gradients. For example, if the five difference values calculated in the example of step (2) are arranged in ascending order as diff5, diff2, diff4, diff1, diff3, then the gradients corresponding to the first four difference values, namely g5, g2, g4, and g1, may be regarded as credible gradients. In some embodiments, the number of rejected gradients may also be 2; for example, in the above example the gradients corresponding to the first three difference values, namely g5, g2, and g4, may be used as credible gradients. In some embodiments, the number of rejected gradients may also be 3 or more, which may be determined according to the number of participant terminals or other circumstances and is not limited by this specification. In some embodiments, if a threshold can be determined such that a gradient whose deviation from the first average exceeds it can be considered suspicious, this step may be replaced by the following step:
(3_1) Compare each comparison result with the preset threshold to obtain the credible gradients. Specifically, the gradients corresponding to the K difference values that are smaller than the preset threshold are used as the credible gradients. For example, if the preset threshold is 0.2 and diff1 to diff5 obtained in step (2) are 0.16, 0.12, 0.23, 0.15, and 0.18, respectively, then there are 4 credible gradients: g1, g2, g4, and g5. For another example, if the preset threshold is 0.17, there are 3 credible gradients: g1, g2, and g4.
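The selection of credible gradients in steps (1) to (3) and (3_1) above can be sketched as follows; this is an illustrative interpretation (Python/NumPy, function and parameter names are assumptions), using the Euclidean norm as the modulus of the difference.

```python
import numpy as np

def select_credible_gradients(gradients, num_reject=1, threshold=None):
    """Step (1): compute the first average of all received gradients.
    Step (2): take the modulus of each gradient's difference from that average.
    Step (3): drop the num_reject largest deviations (ranking variant), or
    Step (3_1): keep only deviations below a preset threshold (threshold variant).
    Returns the indices of the credible gradients and the first average."""
    grads = [np.asarray(g, dtype=float) for g in gradients]
    first_average = sum(grads) / len(grads)                          # g_bar
    deviations = [np.linalg.norm(g - first_average) for g in grads]  # diff1..diffN

    if threshold is not None:                    # step (3_1)
        keep = [i for i, d in enumerate(deviations) if d < threshold]
    else:                                        # step (3): ascending deviation order
        order = np.argsort(deviations)
        keep = sorted(order[: len(grads) - num_reject].tolist())
    return keep, first_average

# Server side, with the gradients g1..g5 received in step 220:
# credible_idx, g_bar = select_credible_gradients([g1, g2, g3, g4, g5], num_reject=1)
```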
In some embodiments, the server is not trusted by the individual participant terminals, and therefore the gradients sent in step 220 are encrypted using an encryption algorithm (e.g., a homomorphic encryption algorithm). The server may return the first average to each participant terminal, and each participant terminal locally calculates the difference between its own gradient and the first average. The server can then determine, by a secure extremum-finding method, which participant terminal has the largest calculated difference, reject the gradient calculated by that participant terminal, and use the other gradients as credible gradients.
In some embodiments, a second average of the obtained credible gradients may be calculated. Specifically, the credible gradients may be added together and divided by their number to obtain their average. For example, the second average of the credible gradients g1, g2, g4, and g5 obtained in the example of step (3_1) may be: g_bar1 = (g1 + g2 + g4 + g5) ÷ 4.
In some embodiments, the parameters of the joint training model may be updated using the gradient-based optimization algorithm used in step 210. For example, the parameters of the model are updated with the gradient descent algorithm:

$$\theta_j := \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j} \qquad (3)$$

where θⱼ is the j-th parameter of the model and α is called the learning rate, which is used to scale the gradient and determines whether and when the loss function can converge to a local minimum. The value of the learning rate may be adjusted during training; if an appropriate value is chosen, the value of the loss function decreases. In some existing embodiments, the first average obtained in step (1) may be used as the gradient corresponding to the parameters of the joint training model to update the parameters of the model. In order to prevent the model from being poisoned during training, in the embodiments described in this specification the second average may be used as the gradient corresponding to the parameters of the joint training model to update them. Since the second average is computed from the credible gradients after removing the suspicious gradients, the embodiments described herein can avoid model poisoning caused by one or more of the participant terminals poisoning their data. How the model parameters are updated is illustrated below with the example of step 210. For convenience of description, the second average g_bar1 (or the first average g_bar, if some existing embodiments are used) is denoted ⟨aver0, aver1, aver2, aver3, aver4, aver5⟩, where aver0 corresponds to ∂J(θ)/∂θ₀, aver1 corresponds to ∂J(θ)/∂θ₁, ..., and aver5 corresponds to ∂J(θ)/∂θ₅. The parameters of the model are updated as follows:

θ₀ = θ₀ − α · aver0;
θ₁ = θ₁ − α · aver1;
θ₂ = θ₂ − α · aver2;
θ₃ = θ₃ − α · aver3;
θ₄ = θ₄ − α · aver4;
θ₅ = θ₅ − α · aver5.
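A minimal sketch of this update, assuming Python/NumPy and illustrative names (not part of this specification): the credible gradients are averaged into the second average ⟨aver0, ..., aver5⟩ and one gradient-descent step is applied to every parameter at once.

```python
import numpy as np

def update_parameters(theta, credible_gradients, learning_rate=0.05):
    """Average the credible gradients (the second average) and apply
    theta_j = theta_j - alpha * aver_j for every parameter.
    The learning rate value here is illustrative."""
    second_average = sum(np.asarray(g, dtype=float) for g in credible_gradients) / len(credible_gradients)
    return np.asarray(theta, dtype=float) - learning_rate * second_average

# For example, with the credible gradients g1, g2, g4, g5 selected above:
# theta = update_parameters(theta, [g1, g2, g4, g5], learning_rate=0.05)
```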
In some embodiments, in the next training round the participant terminals download the updated parameters of the joint training model from the server to their local devices and update the parameters for the next round according to steps 210 to 230, as shown in FIG. 4, until the gradient value is smaller than a threshold, for example 10⁻⁵. At that point the loss function converges on the training set, i.e., its value essentially no longer decreases, and the model training ends.
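Putting steps 210 to 230 together, a hedged end-to-end sketch of the training rounds might look like the following; it reuses the illustrative helpers local_gradient and select_credible_gradients sketched earlier, ignores the transmission and encryption aspects of step 220, and stops once the gradient falls below the threshold (e.g. 1e-5). All names and values are assumptions for illustration.

```python
import numpy as np

def train(theta, terminal_datasets, learning_rate=0.05, num_reject=1,
          tol=1e-5, max_rounds=10_000):
    """terminal_datasets: list of (X, y) pairs, one per participant terminal."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(max_rounds):
        # Step 210: every terminal computes its local gradient.
        gradients = [local_gradient(theta, X, y) for X, y in terminal_datasets]
        # Step 230: the server keeps only the credible gradients ...
        credible_idx, _ = select_credible_gradients(gradients, num_reject=num_reject)
        second_average = sum(gradients[i] for i in credible_idx) / len(credible_idx)
        # ... and stops once the gradient is smaller than the threshold.
        if np.linalg.norm(second_average) < tol:
            break
        # Otherwise it updates the parameters for the next round.
        theta = theta - learning_rate * second_average
    return theta
```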
The beneficial effects that may be brought by the embodiments of this specification include, but are not limited to, the following: before updating the parameters of the model, the server rejects the gradient that deviates most from the average gradient, or rejects the gradients whose deviation from the average gradient exceeds a set threshold, so that poisoning of the training data by one or more of the participant terminals can be resisted. It should be noted that different embodiments may produce different advantages; in different embodiments, any one or a combination of the above advantages, or any other advantages, may be obtained.
It should be noted that the above description of the flow 200 is only for illustration and does not limit the scope of application of this specification. Various modifications and alterations to the flow 200 will be apparent to those skilled in the art in light of this description; such modifications and variations remain within the scope of this specification. For example, step 230 may be split into two steps, 230_1 and 230_2, where the credible gradients are selected in step 230_1 and the parameters of the model are updated in step 230_2.
FIG. 3 is a diagram of an application scenario for model joint training in accordance with some embodiments of the present description.
As shown in FIG. 3, each of the participant terminals 1 to 4 is an e-commerce platform; the data features held by each terminal are the same, but the samples are different. For example, each terminal collects features such as the age, gender, and historical consumption records of its users, but the user populations of the terminals are different. Participant terminals 1 to 4 need to jointly establish a risk identification model using the data they each hold, and in order to prevent one or more parties from poisoning the training data during training, the joint training is performed using the method described in this specification. For the detailed joint training method, refer to FIG. 2; it is not repeated here.
The method described in this specification can also be applied to other application scenarios, and is not limited by the description of this specification.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, this specification uses specific words to describe its embodiments. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be embodied as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including object oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.
For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (10)

1. A method for model joint training, the method comprising:
obtaining a plurality of gradients, wherein the gradients are obtained by a plurality of participant terminals in joint training each performing model joint training based on sample data held by the terminal itself;
calculating a first average value of the gradients, and comparing each of the gradients with the first average value to obtain a plurality of deviation results;
selecting a credible gradient from the gradients based on the deviation results, and updating parameters of the joint training model according to the selected credible gradient, wherein the gradients other than the credible gradient are suspicious gradients that are not used in this update of the parameters of the joint training model;
wherein the sample data is text data, voice data, or image data.
2. The method of claim 1, wherein the selecting a credible gradient from the plurality of gradients comprises:
selecting, from the plurality of gradients, a gradient whose deviation is smaller than a preset threshold as a credible gradient.
3. The method of claim 1, wherein the selecting a credible gradient from the plurality of gradients comprises:
determining a ranking of the deviation results from small to large deviation, and selecting, from the gradients, a gradient whose rank is smaller than a preset threshold as a credible gradient.
4. The method of claim 1, wherein the updating parameters of the joint training model according to the selected credible gradient comprises:
calculating a second average value of the plurality of credible gradients;
taking the second average value as the gradient corresponding to the parameters of the joint training model, and updating the parameters of the joint training model using the gradient-based optimization algorithm.
5. A system for model joint training, the system comprising:
a generation module configured to:
obtain a plurality of gradients, wherein the gradients are obtained by a plurality of participant terminals in joint training each performing model joint training based on sample data held by the terminal itself;
calculate a first average value of the gradients, and compare each of the gradients with the first average value to obtain a plurality of deviation results; and
select a credible gradient from the gradients based on the deviation results, and update parameters of the joint training model according to the selected credible gradient, wherein the gradients other than the credible gradient are suspicious gradients that are not used in this update of the parameters of the joint training model;
wherein the sample data is text data, voice data, or image data.
6. The system of claim 5, wherein, to select a credible gradient from the plurality of gradients, the update module is further configured to:
select, from the plurality of gradients, a gradient whose deviation is smaller than a preset threshold as a credible gradient.
7. The system of claim 5, wherein, to select a credible gradient from the plurality of gradients, the update module is further configured to:
determine a ranking of the deviation results from small to large deviation, and select, from the gradients, a gradient whose rank is smaller than a preset threshold as a credible gradient.
8. The system of claim 5, wherein, to update the parameters of the joint training model according to the selected credible gradient, the update module is further configured to:
calculate a second average value of the plurality of credible gradients; and
take the second average value as the gradient corresponding to the parameters of the joint training model, and update the parameters of the joint training model using the gradient-based optimization algorithm.
9. An apparatus for model co-training, wherein the apparatus comprises at least one processor and at least one memory;
the at least one memory is for storing computer instructions;
the at least one processor is configured to execute at least some of the computer instructions to implement the method of any of claims 1-4.
10. A computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of any one of claims 1 to 4.
CN202111077337.9A 2020-04-23 2020-04-23 Method and system for model joint training Pending CN113657617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111077337.9A CN113657617A (en) 2020-04-23 2020-04-23 Method and system for model joint training

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010326265.6A CN111523686B (en) 2020-04-23 2020-04-23 Method and system for model joint training
CN202111077337.9A CN113657617A (en) 2020-04-23 2020-04-23 Method and system for model joint training

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202010326265.6A Division CN111523686B (en) 2020-04-23 2020-04-23 Method and system for model joint training

Publications (1)

Publication Number Publication Date
CN113657617A true CN113657617A (en) 2021-11-16

Family

ID=71910811

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111077337.9A Pending CN113657617A (en) 2020-04-23 2020-04-23 Method and system for model joint training
CN202010326265.6A Active CN111523686B (en) 2020-04-23 2020-04-23 Method and system for model joint training

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202010326265.6A Active CN111523686B (en) 2020-04-23 2020-04-23 Method and system for model joint training

Country Status (1)

Country Link
CN (2) CN113657617A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547643A (en) * 2022-01-20 2022-05-27 华东师范大学 Linear regression longitudinal federated learning method based on homomorphic encryption

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016632B (en) * 2020-09-25 2024-04-26 北京百度网讯科技有限公司 Model joint training method, device, equipment and storage medium
CN112015749B (en) 2020-10-27 2021-02-19 支付宝(杭州)信息技术有限公司 Method, device and system for updating business model based on privacy protection
CN112182633B (en) * 2020-11-06 2023-03-10 支付宝(杭州)信息技术有限公司 Model joint training method and device for protecting privacy
CN112508907B (en) * 2020-12-02 2024-05-14 平安科技(深圳)有限公司 CT image detection method and related device based on federal learning
CN112765559A (en) * 2020-12-29 2021-05-07 平安科技(深圳)有限公司 Method and device for processing model parameters in federal learning process and related equipment
CN113408747A (en) * 2021-06-28 2021-09-17 淮安集略科技有限公司 Model parameter updating method and device, computer readable medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213148A1 (en) * 2016-01-26 2017-07-27 Microsoft Technology Licensing, Llc Machine learning through parallelized stochastic gradient descent
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection approach, device and storage medium based on federation's training
CN109919313A (en) * 2019-01-31 2019-06-21 华为技术有限公司 A kind of method and distribution training system of gradient transmission
CN110189372A (en) * 2019-05-30 2019-08-30 北京百度网讯科技有限公司 Depth map model training method and device
CN111027717A (en) * 2019-12-11 2020-04-17 支付宝(杭州)信息技术有限公司 Model training method and system
CN113689006A (en) * 2020-04-23 2021-11-23 支付宝(杭州)信息技术有限公司 Method and system for model joint training

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7852964B2 (en) * 2006-06-19 2010-12-14 Mayflower Communications Company, Inc. Antijam filter system and method for high fidelity high data rate wireless communication
US10152676B1 (en) * 2013-11-22 2018-12-11 Amazon Technologies, Inc. Distributed training of models using stochastic gradient descent
US20150324686A1 (en) * 2014-05-12 2015-11-12 Qualcomm Incorporated Distributed model learning
US10410118B2 (en) * 2015-03-13 2019-09-10 Deep Genomics Incorporated System and method for training neural networks
CN111052155B (en) * 2017-09-04 2024-04-16 华为技术有限公司 Distribution of asynchronous gradient averages random gradient descent method
US11080886B2 (en) * 2017-11-15 2021-08-03 Qualcomm Incorporated Learning disentangled invariant representations for one shot instance recognition
US11526745B2 (en) * 2018-02-08 2022-12-13 Intel Corporation Methods and apparatus for federated training of a neural network using trusted edge devices
US11556730B2 (en) * 2018-03-30 2023-01-17 Intel Corporation Methods and apparatus for distributed use of a machine learning model
US11651227B2 (en) * 2018-10-05 2023-05-16 Sri International Trusted neural network system
CN110298185A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Model training method and device, electronic equipment and storage medium
CN110378487B (en) * 2019-07-18 2021-02-26 深圳前海微众银行股份有限公司 Method, device, equipment and medium for verifying model parameters in horizontal federal learning
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
CN110460600B (en) * 2019-08-13 2021-09-03 南京理工大学 Joint deep learning method capable of resisting generation of counterattack network attacks
CN110610242B (en) * 2019-09-02 2023-11-14 深圳前海微众银行股份有限公司 Method and device for setting weights of participants in federal learning
CN110619317B (en) * 2019-09-26 2022-11-18 联想(北京)有限公司 Model training method, model training device and electronic equipment
CN110908893A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Sandbox mechanism for federal learning
CN110782044A (en) * 2019-10-29 2020-02-11 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of neural network of graph
CN110968660B (en) * 2019-12-09 2022-05-06 四川长虹电器股份有限公司 Information extraction method and system based on joint training model
CN111008709A (en) * 2020-03-10 2020-04-14 支付宝(杭州)信息技术有限公司 Federal learning and data risk assessment method, device and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213148A1 (en) * 2016-01-26 2017-07-27 Microsoft Technology Licensing, Llc Machine learning through parallelized stochastic gradient descent
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection approach, device and storage medium based on federation's training
CN109919313A (en) * 2019-01-31 2019-06-21 华为技术有限公司 A kind of method and distribution training system of gradient transmission
CN110189372A (en) * 2019-05-30 2019-08-30 北京百度网讯科技有限公司 Depth map model training method and device
CN111027717A (en) * 2019-12-11 2020-04-17 支付宝(杭州)信息技术有限公司 Model training method and system
CN113689006A (en) * 2020-04-23 2021-11-23 支付宝(杭州)信息技术有限公司 Method and system for model joint training

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANTOINE BORDES: "Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing", Proceedings of Machine Learning Research, 31 December 2012 (2012-12-31) *
张淑军; 张群; 李辉: "A Survey of Sign Language Recognition Based on Deep Learning" (in Chinese), Journal of Electronics & Information Technology, no. 04, 15 April 2020 (2020-04-15) *
王萍; 潘跃: "A Large Hail Identification Model Based on Salient Features" (in Chinese), Acta Physica Sinica, no. 06, 23 March 2013 (2013-03-23) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547643A (en) * 2022-01-20 2022-05-27 华东师范大学 Linear regression longitudinal federated learning method based on homomorphic encryption
CN114547643B (en) * 2022-01-20 2024-04-19 华东师范大学 Linear regression longitudinal federal learning method based on homomorphic encryption

Also Published As

Publication number Publication date
CN111523686A (en) 2020-08-11
CN111523686B (en) 2021-08-03
CN113689006A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN111523686B (en) Method and system for model joint training
EP3711000B1 (en) Regularized neural network architecture search
US11941527B2 (en) Population based training of neural networks
US10635975B2 (en) Method and apparatus for machine learning
EP3563306B1 (en) Batch renormalization layers
CN110942248B (en) Training method and device for transaction wind control network and transaction risk detection method
JP2017215898A (en) Machine learning system
US20200257983A1 (en) Information processing apparatus and method
TW201820174A (en) Ensemble learning prediction apparatus and method, and non-transitory computer-readable storage medium
US10762329B2 (en) Inter-object relation recognition apparatus, learned model, recognition method and non-transitory computer readable medium
CN113826125A (en) Training machine learning models using unsupervised data enhancement
US20190228297A1 (en) Artificial Intelligence Modelling Engine
US11915120B2 (en) Flexible parameter sharing for multi-task learning
Chamoso et al. Social computing for image matching
US20210182631A1 (en) Classification using hyper-opinions
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
US20130204818A1 (en) Modeling method of neuro-fuzzy system
CN113689006B (en) Method and system for model joint training
WO2022243570A1 (en) Verifying neural networks
EP4002222A1 (en) Transforming a trained artificial intelligence model into a trustworthy artificial intelligence model
CN113657611A (en) Method and device for jointly updating model
US20240185025A1 (en) Flexible Parameter Sharing for Multi-Task Learning
EP4270271A1 (en) Method and system for classification and/or prediction on unbalanced datasets
CN116542250B (en) Information extraction model acquisition method and system
US20230342672A1 (en) Method and system for classification and/or prediction on unbalanced datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination