GB2625622A - Method and system for federated learning

Info

Publication number: GB2625622A
Application number: GB2314691.3A
Authority: GB (United Kingdom)
Prior art keywords: transformer, training, trained, trainable module, trainable
Legal status: Pending
Other versions: GB202314691D0 (en)
Inventors: Li Da, Hu Xu, Hospedales Timothy
Current Assignee: Samsung Electronics Co Ltd
Original Assignee: Samsung Electronics Co Ltd
Application filed by Samsung Electronics Co Ltd
Priority to: PCT/KR2023/014974 (published as WO2024072074A1)
Publication of GB202314691D0
Publication of GB2625622A

Classifications

    • G06N 20/00 Machine learning
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/098 Distributed learning, e.g. federated learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

A transformer-based foundation model (FM), comprising several transformer blocks, is trained for a specific downstream task using federated learning. A pre-trained FM 10 is received at a client device 200 from a server 100, and has a trainable module 16 that is coupled to the transformer blocks. Whilst the FM is frozen, the trainable module is trained using a local training dataset, and then sent to the server for aggregation. Training may involve processing a data item in the training dataset using the frozen FM, outputting a token from each of its blocks, inputting each token into the trainable module, and training the trainable module to process the tokens to output a prediction for the data item, possibly to minimise a task-specific loss between the prediction and a ground truth of the data item. The trainable module may comprise a self-attention block.

Description

Method and System for Federated Learning
Field
[001] The present application generally relates to a method and system for federated learning. In particular, the present application provides a method for training a machine learning, ML, model using a server and a plurality of client devices without needing to access user data on the client devices, and without needing to repeatedly share large amounts of data (e.g. full models) between the server and client devices.
Background
[002] Federated learning (FL) has been proposed as a privacy-preserving way to enable distributed learning in which training computation is moved from a server to client devices, because the raw data for training resides on client devices. FL enables a server to train machine learning, ML, models across decentralized edge devices (i.e. clients/client devices, such as smartphones) without accessing private user data. FL training is resource-intensive in terms of compute, power, and bandwidth, especially given the limited resources of edge devices. This challenge grows as the scale of deep models grows. Thus, there has been research focused on addressing the emerging challenges that arise due to FL constraints, such as communication efficiency (in terms of at least bandwidth), data heterogeneity, and supporting diverse device hardware. For example, to reduce the communication cost, ideas borrowed from model compression, such as quantization, sparsification and pruning, have been successfully applied; to mitigate the non-IID (where IID means "independent and identically distributed") issue of data heterogeneity, different model training recipes for optimization, model initialization and architecture design have also been proposed. Meanwhile, system heterogeneity imposes latency overheads due to stragglers and may introduce biases if slower clients are consistently precluded from participation.
[003] A distinct trend is the growth of foundation models, FM, where large architectures are trained on huge datasets in the cloud at great computational expense. Benefiting from this pre-training, they are typically then "fine-tuned" to different specific tasks of interest in a comparatively data and compute efficient way. However, fine-tuning the whole pre-trained model can be problematic due to the heavy communication cost of exchanging large numbers of model parameters and the weak capabilities of on-device training for many client devices (particularly the more resource-constrained devices, such as low-tier/budget smartphones).
[4] The present applicant has therefore identified the need for an improved method of performing federated learning of foundation models.
Summary
[5] In a first approach of the present techniques, there is provided a computer-implemented method for training, at a client device, a transformer-based foundation model, FM, for a specific downstream task using federated learning, the method comprising: receiving, from a server, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; training, using a local training dataset comprising a plurality of data items, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trained trainable module for aggregation.
[6] Challenges in FL include: (i) the computing power and time required for updating models on-device, and the communication cost in terms of bandwidth (as well as the cost to a user of a device) of exchanging model parameters, both of which worsen when applied to the typically large architectures used in FMs; (ii) the difficulty of dealing with heterogeneous devices with different compute capabilities; and (iii) the instability of federated learning compared to centralised learning, especially given data heterogeneity. With respect to compute and communication requirements, many efforts, such as model compression/pruning and knowledge distillation, have been made to improve the communication and training efficiency of federated learning. However, these usually need sophisticated designs and often bring some sacrifice of model performance.
[7] The present techniques advantageously overcome these challenges by employing a pre-trained Foundation model (FM) and injecting a trainable module into the FM. The trainable module comprises one or more parameters which can be trained on the client device, while keeping the pre-trained foundation model fixed. (As explained below, the trainable module may be a small machine learning model itself, or may be a learnable vector). This reduces the amount of training that needs to be performed, which reduces the computation requirements. The trainable module is also referred to interchangeably herein as "additional trainable parameters (ATP)". The trainable module can be trained on-device more easily than the FM, because the size of the trainable module is much smaller than the FM. Furthermore, each client device participating in federated learning does not need access to the data used to train the FM. Instead, the trainable module can be trained using data on/of the client device. As only the trainable module is trained on the client device, the federated learning process is more efficient than if the whole FM were to be trained, which means the training can be performed on resource-constrained client devices, such as smartphones and consumer electronic devices. The trained trainable module, or one or more parameters/values thereof, is then returned to the central server to be used in the federated learning process to update the trainable module stored on the server. The present techniques advantageously improve the federated learning process and communication efficiency, while maintaining excellent performance. The present techniques also enable more heterogeneous devices to participate in the federated learning process, and enable training stability, i.e. enable the model to more easily converge.
[8] In other words, the server comprises a foundation model, FM, which has been pre-trained. The server injects a trainable module into the pre-trained foundation model prior to transmitting the foundation model to client devices. The pre-trained foundation model -both on the server and on the client devices -is frozen, i.e. remains static during the federated learning process, while the trainable modules are trained by the client devices. It is the trainable module which is trained by the client devices and which is updated by the server during the federated learning. This allows a FM to be trained or adapted for a specific downstream task in a more computationally-efficient manner.
[9] The trainable module may itself be a transformer-based machine learning, ML, model.
[010] In transformer-based models, inputs are encoded as a sequence of vectors, known as input tokens. The input tokens are then processed via the transformer model and the output from any layer or block of the transformer model is an output token which encodes information, such as a semantic representation, after the processing by the model up to the layer/block. The output token can be decoded to provide a final output from the model. Thus, since the pre-trained FM and the trainable module may be transformer-based, training the trainable module may comprise: processing a data item in the local training dataset using the frozen pre-trained transformer-based FM; outputting a token from each transformer block of the plurality of transformer blocks of the FM; inputting each token into the trainable module; and training the trainable module to process the tokens and output a prediction for the data item using the processing. In other words, the trainable module takes, as inputs, tokens output by the frozen pre-trained foundation model, and processes these to generate a prediction for a data item. Thus, the trainable module uses the pre-trained FM to do some processing, which means the trainable module benefits from the capabilities of the pre-trained FM. As the pre-trained FM is frozen, the knowledge and learning of the FM is not lost during the on-device training due to distribution shift, which is advantageous.
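To make this concrete, the following is a minimal sketch (in PyTorch, using hypothetical interfaces such as `frozen_fm.embed`, `frozen_fm.blocks` and the `adapter`/`head` modules, none of which are specified at this level of detail by the disclosure) of how per-block tokens from the frozen FM could be collected and used to train only the trainable module:

```python
import torch
import torch.nn.functional as F

def local_training_step(frozen_fm, adapter, head, batch, labels, optimiser):
    # The FM stays frozen: its forward pass only supplies one token per block.
    frozen_fm.eval()
    with torch.no_grad():
        tokens = []
        z = frozen_fm.embed(batch)            # hypothetical tokeniser + positional encoding
        for block in frozen_fm.blocks:        # one output token per transformer block
            z = block(z)
            tokens.append(z[:, 0])            # e.g. the [CLS] token of this block
    token_seq = torch.stack(tokens, dim=1)    # (batch, num_blocks, dim)

    # Only the trainable module (adapter) and the task head receive gradients.
    prediction = head(adapter(token_seq))
    loss = F.cross_entropy(prediction, labels)  # task-specific loss vs. ground truth
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

Note that this simplified sketch treats the FM outputs as fixed inputs to the trainable module; the recurrent variant described later additionally feeds the modulated token back into the next FM block.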
[11] When the trainable module is a transformer-based ML model, training the trainable module to process the tokens and output a prediction for the data item using the processing may comprise training the trainable module to minimise a task-specific loss between a ground truth label for the data item (which is known if the client device data items are labelled data items) and a predicted label output by the trainable module for the data item. When the data items are images, the task-specific loss may be a cross-entropy loss for an image processing task such as image classification.
[12] When the trainable module is a transformer-based ML model, training the trainable module may comprise: forming a token sequence using the tokens output by each transformer block of the frozen pre-trained transformer-based FM; and processing, using the trainable module, the token sequence.
[13] In some cases, when the trainable module is a transformer-based ML model, the trainable module may comprise at least one self-attention block. In this case, training the trainable module may comprise training the at least one self-attention block. As noted below, multiple self-attention blocks may be included, but in some cases, one self-attention block is sufficient.
[014] When the trainable module is a transformer-based ML model, the method may comprise: determining at least one training criterion for the client device. For example, the training criterion may be based on the hardware capabilities of the client device. Based on the at least one training criterion, processing a data item using the frozen pre-trained transformer-based FM may comprise processing a data item using a subset of the plurality of transformer blocks of the FM. In other words, a data item may not be processed by the whole pre-trained FM (i.e. the full depth of the FM), but instead a prediction may be output 'early' from the pre-trained FM. In such cases, the trainable module is trained using the outputs from the early layers of the pre-trained FM.
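As an illustration only (reusing the hypothetical interfaces and imports from the earlier sketch), processing a data item through just a subset of the FM's blocks might look like:

```python
def forward_with_budget(frozen_fm, adapter, head, batch, max_depth):
    # Run only the first `max_depth` frozen blocks, as permitted by the
    # client device's training criterion, and predict from that early exit.
    tokens = []
    with torch.no_grad():
        z = frozen_fm.embed(batch)
        for block in frozen_fm.blocks[:max_depth]:   # subset of the plurality of blocks
            z = block(z)
            tokens.append(z[:, 0])
    return head(adapter(torch.stack(tokens, dim=1)))
```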
[015] Determining at least one training criterion may comprise determining any one or more of: a processing capacity of the client device, a memory capacity of the client device, a battery capacity of the client device, and a latency criterion. Thus, a wide variety of client devices may be able to participate in the federated learning process.
[016] Transmitting the trained trainable module to the server may comprise transmitting parameters of the trained transformer-based ML module to the server for aggregation. For example, weights of the trained ML module may be transmitted to the server for aggregation, so that an updated trainable module may be generated by the server based on the local, client device training.
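A sketch of the payload a client might transmit, assuming the trainable module and its task head are ordinary PyTorch modules (the exact serialisation is not prescribed by the disclosure):

```python
def client_update_payload(adapter, head, num_local_examples):
    # Only the trainable module's parameters are uploaded for aggregation;
    # the frozen FM weights never need to be re-communicated.
    return {
        "adapter": adapter.state_dict(),
        "head": head.state_dict(),
        "num_examples": num_local_examples,   # lets the server weight the aggregation
    }
```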
[017] The method may further comprise: receiving, from the server, an updated trainable module, updated using parameters received from client devices; and replacing the trained trainable module with the received updated trainable module. In other words, the client device may receive a centrally updated trainable module back from the server for use, after the aggregation process has been performed by the server. The client device may use the updated trainable module, together with the existing pre-trained FM to perform inference, or may personalise the updated trainable module -on local data -before use for inference.
[18] Thus, the method may further comprise personalising the updated trainable module for use by the client device. The personalisation may involve using the same training data items used by the client device to perform the training of the trainable module, or may involve using different data items. For example, the training of the trainable module may have involved using a first subset of data items in the training dataset and the personalisation may involve using a second subset of data items in the training dataset. Alternatively, personalisation may involve using data items acquired/obtained after the training of the trainable module.
Personalising the updated trainable module may advantageously involve training a client token rather than the whole updated trainable module. This is advantageous because it reduces the computational resources required to personalise the updated trainable module.
[19] In some cases, the trainable module may comprise at least one learnable vector. The learnable vector(s) can be trained, such that the whole pre-trained FM does not need to be retrained. That is, while fine-tuning models pre-trained on large-scale datasets results in a strong supervised classifier, the fine-tuning process severely negatively impacts the ability to process data items that are new/unseen. To alleviate this, the pre-trained FM is kept frozen and only the learnable vector(s) is(are) trained/updated.
[020] When the trainable module comprises at least one learnable parameter, training the trainable module may comprise training the trainable module to fine-tune the at least one learnable parameter for the specific downstream task.
[021] In a second approach of the present techniques, there is provided a client device for training a transformer-based foundation model, FM, for a specific downstream task using federated learning, the client device comprising: storage storing a local training dataset comprising a plurality of data items; and at least one processor coupled to memory, for: receiving, from a server, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; training, using the stored local training dataset, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trainable module for aggregation.
[22] The features described above with respect to the first approach apply equally to the second approach and therefore, for the sake of conciseness, are not repeated.
[23] The plurality of data items in the local training dataset may be a plurality of images. In such cases, training the FM may comprise training the FM to perform an image processing task, such as object detection, image classification, and so on.
[24] The client device may further comprise an image capture device for capturing the plurality of images.
[25] The plurality of data items in the local training dataset may be a plurality of audio data items. In such cases, training the FM may comprise training the FM to perform an audio processing task, such as speech-to-text processing, sound recognition, speaker verification, and so on.
[26] The client device may further comprise an audio capture device for capturing the audio data items.
[27] The client device may be a constrained-resource device which nonetheless has the minimum hardware capabilities to use a trained neural network/ML model. The client device may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example client devices.
[28] In a third approach of the present techniques, there is provided a computer-implemented method for training, at a server, a transformer-based foundation model, FM, using federated learning, the method comprising: transmitting, to a plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; and transmitting, to the plurality of client devices, the updated trainable module.
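One plausible aggregation rule is a FedAvg-style weighted average of the returned trainable-module parameters; the claims only require "aggregating", so the weighting below is an assumption, and the payload format follows the client sketch given earlier:

```python
def aggregate_trainable_modules(client_payloads):
    # `client_payloads` is a list of dicts as produced by client_update_payload().
    total = sum(p["num_examples"] for p in client_payloads)
    aggregated = {}
    for key in client_payloads[0]["adapter"]:
        aggregated[key] = sum(
            p["adapter"][key] * (p["num_examples"] / total) for p in client_payloads
        )
    return aggregated   # state dict of the updated trainable module
```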
[29] The trainable module may be a transformer-based machine learning, ML, model.
[30] Receiving the trainable modules from the client devices may comprise receiving parameters of a trained transformer-based ML model from some or all of the client devices.
[31] In a fourth approach of the present techniques, there is provided a server for training a transformer-based foundation model, FM, using federated learning, the server comprising: at least one processor coupled to memory, for: transmitting, to a plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; transmitting, to the plurality of client devices, the updated trainable module.
[32] The features described above with respect to the third approach apply equally to the fourth approach and therefore, for the sake of conciseness, are not repeated.
[033] In a fifth approach of the present techniques, there is provided a system for training a transformer-based foundation model, FM, for a specific downstream task using federated learning, the system comprising: a plurality of client devices; and a server comprising at least one processor coupled to memory, for: transmitting, to the plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; and transmitting, to the plurality of client devices, the updated trainable module.
[034] In the system, each client device comprises: storage storing a local training dataset comprising a plurality of data items; and at least one processor coupled to memory, for: receiving, from the server, the pre-trained transformer-based foundation model, FM; training, using the stored local training dataset, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trainable module to the server for aggregation.
[035] The features described above with respect to the first to fourth approaches apply equally to the fifth approach and therefore, for the sake of conciseness, are not repeated.
[036] In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, causes the processor to carry out any of the methods described herein.
[37] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[38] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[39] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[40] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[41] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[042] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[43] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
[44] The method described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
[45] As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
[46] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
[47] The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Brief description of the drawings
[48] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[049] Figure 1 is a schematic diagram illustrating a system for training a transformer-based foundation model, FM, for a specific downstream task using federated learning;
[50] Figure 2 shows the model architecture of the present techniques, PETRA, in more detail;
[51] Figure 3 is a schematic diagram illustrating an example use of the present federated learning method;
[052] Figures 4A and 4B are schematic diagrams illustrating another example use of the present federated learning method;
[53] Figure 5 is a flowchart of example steps performed by a client device for training a transformer-based foundation model for a specific downstream task using federated learning;
[54] Figure 6 is a flowchart of example steps performed by a server for training a transformer-based foundation model for a specific downstream task using federated learning;
[55] Figure 7 is a diagram showing the steps performed by the server and client devices to perform federated learning;
[56] Figure 8A shows a flowchart of example steps performed by the server when injecting a trainable module into a foundation model;
[057] Figure 8B shows a flowchart of example steps performed by the client device when training the trainable module;
[58] Figure 9 is a block diagram showing an ATP/trainable module for use in a CLIP-based model;
[59] Figure 10 shows results of experiments on conventional FL;
[60] Figure 11 shows results of experiments on anytime FL;
[061] Figure 12 shows results of experiments to test accuracy of each exit trained on CIFAR100 with anytime federated learning (top) and multi-tier constraint (bottom);
[62] Figure 13 shows results of experiments on multi-tier FL;
[63] Figure 14 shows results of ablation studies on the multi-tier based FL set-up;
[64] Figure 15 shows the number of trainable parameters of different methods that were evaluated in the experiments;
[65] Figure 16 is a table showing actual computational (in FLOPs) and memory budgets across all exit layers;
[66] Figure 17 shows the communication cost of different methods to reach a target test accuracy; and
[067] Figure 18 is a block diagram of a system 300 for training a ML model using federated learning.
Detailed description of the drawings
[068] Broadly speaking, embodiments of the present techniques provide a method for training a machine learning, ML, model using a server and a plurality of client devices using federated learning, FL. Advantageously, the present FL method does not require access to user data on the client devices, and does not require repeatedly sharing large amounts of data (e.g. full models) between the server and client devices. Furthermore, the present FL method enables a large variety of client devices to participate in the FL, which reduces bias in the trained ML model.
[069] As mentioned above, there are a number of challenges in federated learning, FL, which include: (i) the compute cost of updating models on device, and the communication cost for exchanging model parameters (both worse if applied to typically large architectures used in FMs), (ii) difficulty of dealing with heterogeneous devices with different compute capabilities, and (iii) instability of federated learning compared to centralised learning, especially given data heterogeneity. With respect to cost, many efforts, such as model compression/pruning and knowledge distillation, have been made to improve the communication and training efficiency of federated learning. However, these usually need sophisticated designs and often come with some sacrifice in model performance.
[70] Though federated learning is still an emerging topic, there exist two FL settings: cross-device and cross-silo FL. The present techniques focus on cross-device FL. The main research focuses in this setting are designing systems to solve communication efficiency, data heterogeneity and system heterogeneity problems. Researchers have proposed different techniques to improve communication efficiency, e.g. using quantized model parameters or gradients during the FL communication. Similarly, sparsifying and pruning the training model to a smaller cardinality to reduce communication cost has been discussed. Some researchers have tried to address this by progressive model training. Another main focus is on data heterogeneity. In contrast to centralized training, where the learner has access to the whole distribution, each worker in non-IID FL has access to a biased distribution, which negatively affects convergence and final model accuracy. In attempts to alleviate this, it has been proposed to add proximal regularization in the local training. Alternatively, a normalized averaging technique may be used to mitigate the inconsistency between different clients. Interestingly, more recently, researchers have found that model initialization (pre-trained vs random) plays an important role in reducing the detrimental impact of heterogeneity, and so does model architecture (Transformer vs CNN). System heterogeneity is a key concern in cross-device FL, where participants have different hardware resources and are thus unable to perform the same amount of learning. To this end, researchers have leveraged various techniques spanning quantization, ordered pruning or even neural architecture search, NAS, in the federated setting. Orthogonally, heterogeneity has also been mitigated by client selection or asynchronous aggregation schemes.
[71] The present techniques differ from existing solutions to the problems in cross-device FL. Rather than training from scratch, the present techniques focus on the challenges and opportunities of adapting a pre-trained foundation model, FM, (e.g. a transformer model) by federated learning. While this might seem to exacerbate the communication bottlenecks and system heterogeneity issues above, it is shown herein that the present techniques simultaneously ameliorate all the above-mentioned challenges of communication cost, heterogeneous device capabilities, and difficulty of federated learning on non-IID data.
[72] The idea of parameter-efficient learning of adapters was first proposed for adapting a single model to multiple datasets. It has since been extended to various problems, including few-shot learning and ASR. Especially as the pre-trained FMs' model sizes are growing rapidly, parameter-efficient (transfer) learning has become increasingly important. Instead of fine-tuning the full pre-trained model, studies design different small additive modules for adapting the pre-trained FMs to downstream tasks. It has been found that tuning the prompt input of a pre-trained language model enables excellent performance on downstream tasks. It has also been found that fine-tuning some injected adapters can be more effective than fine-tuning the top layers of a pre-trained NLP model. The present techniques provide parameter-efficient adaptation in an FL context, and provide a novel transformer-based adaptation module specifically customized for this task.
[73] Multi-exit networks have been proposed in the literature as a variant of dynamic DNNs to alleviate the problem of redundant computation at inference time when input samples are simple. Various extensions have been made to tie these architectures to the underlying hardware, or to leverage them in a distributed setting or for different tasks and model architectures. Except for some models which feature an auxiliary classifier for enhancing the feedback signal strength during backpropagation, there have been few works that leverage multi-exit networks for training efficiently in the federated setting. However, even in those few existing works, the network is trained by progressively freezing its layers for memory efficiency, leading to increased latency. One work leverages a depth-based pruning setting for clients with constrained resources. However, its self-distillation technique, used as a means to recover the accuracy loss due to unseen distributions in deeper layers, seems to negate the benefits of the proposed method. The present techniques are the first to apply multi-exit networks in a federated setting with a common, weight-shared adapter.
[74] Federated Learning: Consider a typical setting of FL with k participating client devices per round over K addressable client devices (i.e. partial participation, meaning that not all possible client devices participate in FL in a given round of training), where a local device i has $N_i$ private training examples denoted by $D_i := \{(x_j, y_j)\}_{j=1}^{N_i}$, with $x_j$ the input data item (e.g. an image) and $y_j$ the target label or prediction for the input data item (e.g. a class). The learning objective of federated learning aims at finding a global set of model parameters $w_g^{(t)}$ at time step t, such that:

$$w_g^{(t)} = \sum_{i=1}^{k} \frac{N_i}{N}\, w_i^{(t)}, \qquad N = \sum_{i=1}^{k} N_i, \qquad w_i^{(t)} = w_i^{(t-1)} - \eta\, \nabla \ell\big(w_i^{(t-1)}; D_i\big), \quad (1)$$

where $\eta$ is the learning rate and $\ell(w; D_i)$ is the task-specific loss function. As explained in more detail below, the common setup for FL introduces a server to receive and aggregate gradients sent from each participating client device, and therefore brings two major challenges: communication cost and data heterogeneity.
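For concreteness, a worked instance of Equation 1 with two participating clients holding illustrative (not disclosed) dataset sizes $N_1 = 100$ and $N_2 = 300$, so $N = 400$:

```latex
w_g^{(t)} = \frac{100}{400}\,w_1^{(t)} + \frac{300}{400}\,w_2^{(t)}
          = 0.25\,w_1^{(t)} + 0.75\,w_2^{(t)},
\qquad
w_i^{(t)} = w_i^{(t-1)} - \eta\,\nabla\ell\!\left(w_i^{(t-1)};\,D_i\right).
```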
[75] Minimizing Equation 1 explicitly optimizes the generalization of the shared global model across all clients. However, the model performance on individual client devices might still be sub-optimal, especially under client data heterogeneity. Client devices often care more about personalized performance (i.e., overfitting to client data). Therefore, given the outcome of global federated model learning in Eq. 1, each client device may further fine-tune the parameters locally.
[76] Parameter-Efficient Fine-Tuning (PEFT). Parameter-efficient learning is a family of strategies for adapting pre-trained FMs to downstream tasks. The full model size of an FM is often much larger than the size of a downstream task dataset, making the fine-tuning prone to overfitting. Moreover, fine-tuning the full FM on client devices is a costly (in terms of computation time and requirements) and time-consuming process that may not be feasible on many client devices. To this end, multiple PEFT methods have been proposed for fast adaptation of FMs, where the learning objective is typically formulated for a client device with private data $D_i := \{(x_j, y_j)\}_j$ as

$$w_{PE}^{*} = \operatorname*{argmin}_{w_{PE}} \frac{1}{N_i} \sum_{j=1}^{N_i} \ell\big(F_{w_{FM}, w_{PE}}(x_j), y_j\big), \quad (2)$$

where F is the foundation model, composed of both the frozen weights $w_{FM}$ together with the weights associated with the introduced PEFT module $w_{PE}$. The goal of the present techniques is to enable efficient federated fine-tuning of transformer-based FMs by designing a lightweight module with $w_{PE}$ to be learned by FL.
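The division of labour in Equation 2, with $w_{FM}$ frozen and only $w_{PE}$ optimised, can be expressed directly in PyTorch; module names and the choice of optimiser below are illustrative only:

```python
import torch

def build_peft_optimiser(foundation_model, peft_module, lr=1e-3):
    # Freeze every FM parameter (w_FM) so that only the injected PEFT
    # parameters (w_PE) are updated when minimising the task loss of Eq. 2.
    for p in foundation_model.parameters():
        p.requires_grad_(False)
    return torch.optim.SGD(peft_module.parameters(), lr=lr)
```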
[77] PETRA: Parameter-Efficient Transformer-based Recurrent Adapters. The motivation of the present techniques is to perform federated learning of a downstream task of interest by adapting an upstream pre-trained Transformer FM. To this end, the present techniques propose a new recurrent adapter for federated fine-tuning of Transformer models that is both parameter and communication efficient while supporting diverse client capacities/budgets. Given a pre-trained and frozen FM, e.g., ViT for images, AST for audio and Llama for NLP, the present techniques re-wire their feature extraction pathways via PETRA, which tweaks the FM to yield a better feature representation to the private clients. Note that PETRA is the only trainable module to be communicated by FL and has an order of magnitude fewer parameters compared to the full FM.
[78] Figure 1 is a schematic diagram illustrating a system for training a transformer-based foundation model, FM, for a specific downstream task using federated learning. The federated learning system involves the use of a central server 100, and a plurality of client devices 200. The central server 100 comprises at least one processor coupled to memory, for: transmitting, to the plurality of client devices 200, a pre-trained transformer-based foundation model, FM, 10 comprising a plurality of transformer blocks. The pre-trained transformer-based FM 10 may be any foundation model, such as a vision transformer, a data-efficient image transformer, or a contrastive language-image pre-training, CLIP, model. The pre-trained FM 10 is initialised on the server 100 and deployed to each user/client device 200. Once broadcast to client devices 200, the FM model 10 may not need to be communicated again, as indicated by the single-arrow dashed line.
[079] The server 100 has inserted a trainable module 16 into the foundation model, FM, 10, where the trainable module 16 is coupled to the plurality of transformer blocks (see Figure 2). The trainable module 16 (also referred to herein as an additional trainable parameter or ATP), which is injected into the FM model 10, will be the only module trained and communicated by the client devices 200 during the federated learning, as indicated by the double-arrow solid line. That is, after some or all of the client devices 200 have completed a training round on-device, the server 100 receives, from some or all of the plurality of client devices 200, a trained trainable module 16. The server 100 then aggregates the received trained trainable modules to generate an updated trainable module, which is then transmitted to the plurality of client devices 200 for use (inference), further federated learning, or personalisation (to better perform on the client device's data).
[80] The federated learning system shown in Figure 1 works as follows. The server 100 downloads/obtains a pre-trained FM 10 (but may in some cases do the pre-training of the FM itself), and then deploys the FM 10 together with some ATP 16, such as position embedding offsets and adapters, into the connected client devices 200. The ATP 16 injected into each FM 10 may depend on the type of FM or the task performed by the FM, as explained below. Each client device 200 then fixes or freezes the pre-trained FM 10 and trains only the ATP 16 locally. After training, only the trained ATP will be sent by the client devices 200 back to the server. The server 100 then aggregates the received trained ATPs and returns an aggregated ATP to the client devices 200 for further training, use or personalisation.
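The round structure just described might be orchestrated as follows; `local_train` and `aggregate` stand in for the client-side training and server-side aggregation routines sketched elsewhere, and are assumptions rather than the disclosed implementation:

```python
def federated_round(server_atp, clients, local_train, aggregate):
    updates = []
    for client in clients:
        # Broadcast only the current ATP; the frozen FM was deployed once already.
        client.atp.load_state_dict(server_atp.state_dict())
        trained_state, num_examples = local_train(client)   # FM stays frozen on-device
        updates.append({"adapter": trained_state, "num_examples": num_examples})
    # Aggregate the returned ATPs and keep the result for the next round.
    server_atp.load_state_dict(aggregate(updates))
    return server_atp
```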
[81] Figure 2 shows the model architecture of the present techniques, PETRA, in more detail. Given a frozen Transformer model 10, PETRA 16 aggregates the information from the history of some special tokens 14 (e.g., [CLS]) through feature transformation, which yields a better feature representation for early exits at different layers. To understand the interaction between the transformer blocks 12 of the frozen transformer model 10 and PETRA 16, the recurrence is unfolded and each interaction is given a number (0-2 in this example). The input to PETRA 16 is a sequence, which grows after each interaction. For resource-constrained client devices, the computation may exit early, and the task-specific prediction can benefit from collecting both the input and output of PETRA 16. For example, an early-exit classification can be done by MLP(Prev + Post). In addition, the proposed architecture allows client device personalisation by introducing a learnable and dedicated [Client] token 18 to the input sequence of PETRA 16.
[82] The idea is inspired by the recurrent nature of a Transformer model as the Transformer blocks are usually identical in structure. By appending a shared PETRA on top of each block, it offers a parameter-efficient way to modulate the feature representation to be ready for prediction at early exits for low-budget devices. Otherwise, the Transformer will struggle to achieve two contradictory goals: 1) transforming the current features into (better) new features and 2) exploiting the current features for task-specific predictions.
[83] The two main modules of the present techniques, i.e. the pre-trained Transformer and the attention-based adapter, are now explained in more detail, as well as how the adapter or trainable module modulates the outputs of the Transformer to support the early predictions.
[084] In some cases, both the pre-trained frozen FM and the trainable module are transformer-based. A Transformer model typically consists of a sequence of residual blocks of multi-head self-attention (MSA), each followed by a residual block of feed-forward multilayer perceptron (MLP), with LayerNorm (LN) applied to both MSA and MLP blocks. Denoting x as an input data item, p the positional encoding, and $z^l := [z^l_{cls}, \dots, z^l_n]$ the intermediate tokens at layer $l$, the feed-forward pass of a Transformer can be formalized as:

$$z^0 = \mathrm{Tokenizer}(x) + p, \quad (3)$$
$$\hat{z}^l = \mathrm{MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \quad l = 1 \dots L, \quad (4)$$
$$z^l = \mathrm{MLP}(\mathrm{LN}(\hat{z}^l)) + \hat{z}^l, \quad l = 1 \dots L. \quad (5)$$

[085] To adapt a pre-trained Transformer model to the present task, a PETRA 16 is injected into each self-attention block, followed by an MLP head to enable early predictions. A single PETRA 16 handles the outputs from all layers of the pre-trained Transformer model; that is, PETRA shares the same weights across all layers of a pre-trained FM. Using the example of applying PETRA to the [CLS] tokens 14, by collecting the history of CLS tokens, $z^l_{cls}$ can be modulated with

$$h^l = \mathrm{PETRA}_{\phi}\big([z^0_{cls} + p'_0, \dots, z^l_{cls} + p'_l]\big), \quad (6)$$

where the PETRA parameterized by $\phi$ is another Transformer (randomly initialized) with a single self-attention block as defined by Eq. 4 and 5. (A single self-attention block in PETRA 16 may be sufficient for accuracy. Increasing the number of self-attention blocks may increase accuracy but comes with a significantly heavier communication burden.) It is noteworthy that PETRA treats the features from different layers as different token embeddings rather than needing tokenisation as in Eq. 3. Here, $h^l[-1]$ means the last element in $h^l$, namely the modulated CLS token from layer $l$ in the pre-trained Transformer. In PETRA 16, the positional embedding $p' = [p'_0, \dots, p'_L]$ embeds the layer information from the pre-trained Transformer. Note that the modified $z^l_{cls}$ will again be used by the pre-trained Transformer, specifically the next self-attention block, to produce $z^{l+1}$, as per Eq. 4.
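The interaction of Equations 3-6 can be sketched as follows (an illustrative, simplified rendering: the real PETRA hyper-parameters, tokeniser and block interfaces are not specified here, and `nn.TransformerEncoderLayer` merely stands in for a single MSA+MLP block of Eqs. 4-5):

```python
import torch
import torch.nn as nn

class PETRASketch(nn.Module):
    # A single weight-shared adapter applied after every frozen FM block.
    def __init__(self, dim, num_layers, num_heads=8):
        super().__init__()
        self.layer_pos = nn.Parameter(torch.zeros(num_layers + 1, dim))  # p'_0 ... p'_L
        self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, cls_history):                 # (batch, l + 1, dim)
        l = cls_history.size(1)
        return self.block(cls_history + self.layer_pos[:l])   # h^l of Eq. 6

def modulated_forward(frozen_fm, petra, x):
    z = frozen_fm.embed(x)                          # Eq. 3: tokeniser + positional encoding
    history = [z[:, 0]]                             # z^0_cls
    h = None
    for block in frozen_fm.blocks:                  # Eqs. 4-5, weights frozen
        z = block(z)
        history.append(z[:, 0])                     # z^l_cls for the block just run
        h = petra(torch.stack(history, dim=1))      # Eq. 6: modulate the collected history
        z = torch.cat([h[:, -1:], z[:, 1:]], dim=1) # modified [CLS] consumed by the next block
    return h, z
```

An early exit simply stops this loop after the number of blocks permitted by the device's budget.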
[86] Early-exiting. The present techniques enable early exiting, i.e. early predictions to be made, by modulating the features of all layers sequentially and exiting at any preferred layer. In practical terms, a residual connection is made between the vanilla CLS token embedding $z^l_{cls}$ and the modulated feature $h^l[-1]$. When making early-exit predictions, it has been found that using the CLIENT token 18, $z_{client}$, is more effective than utilising $h^l[-1]$. This may be because $z_{client}$ interacts with all traversed CLS tokens, allowing it to capture a richer contextual representation. Specifically, the task-specific prediction at exit $l$ is given by:

$$\hat{y}^l = \text{MLP-head}_{\psi}\big(z_{client} + z^l_{cls}\big), \quad (7)$$

where the MLP parameterized by $\psi$ is the early-exit head that is weight-shared across all layers, and $z^l_{cls}$ is the CLS token of layer $l$ before modulation.
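Correspondingly, the early-exit head of Equation 7 is a small weight-shared MLP applied to the sum of the learnable [Client] token and the pre-modulation [CLS] token at the chosen exit; the head architecture below (LayerNorm plus a linear layer) is an assumption for illustration:

```python
import torch.nn as nn

def make_exit_head(dim, num_classes):
    # One weight-shared early-exit head (psi), reused at every exit layer.
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

def early_exit_prediction(mlp_head, z_client, z_cls_l):
    # Task-specific prediction at exit l: MLP-head_psi(z_client + z^l_cls), Eq. 7.
    return mlp_head(z_client + z_cls_l)
```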
[87] Federated Training of PETRA: Following the PE optimization in Eq. 2, the learnable parameters $w_{PE} = (\phi, \psi)$ reduce to the weights of the PETRA and the MLP head.
[088] Multi-exit Learning. Assuming a client device i has a certain budget (in terms of computation resources), and that the client device can afford to run the pre-trained Transformer 10 up to layer $L_i$, the aim is to solve, on the client device, a modified version of Eq. 2:

$$w_{PE}^{*} = \operatorname*{argmin}_{w_{PE}} \frac{1}{N_i} \sum_{j=1}^{N_i} \ell\big(\hat{y}^{l_i}_j, y_j\big), \quad (8)$$

where $l_i$ is the layer of an early exit, or depth, that is chosen by the client device. There are two schemes for selecting the value of $l_i$: a) fixed budget, i.e. $l_i = L_i$, which is predefined, and b) anytime budget, where $l_i$ is taken uniformly at random from the range $[1, L_i]$, where $L_i$ is the maximum depth of the transformer FM that the client device is able to utilise due to hardware capabilities (i.e. computing and memory capacities).
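The two budget schemes can be reduced to a one-line depth sampler per local step (a sketch, not the claimed method):

```python
import random

def sample_exit_depth(device_max_depth, scheme="anytime"):
    # Fixed budget: always train at the device's maximum affordable depth L_i.
    # Anytime budget: draw l_i uniformly at random from [1, L_i] for each step.
    if scheme == "fixed":
        return device_max_depth
    return random.randint(1, device_max_depth)
```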
[89] With respect to personalisation of PETRA (i.e. the trainable module) after the updated version of PETRA is received from the server, there are two choices: fine-tuning the whole $w_{PE} = (\phi, \psi)$, or fine-tuning the [Client] token $z_{client}$ only. Given that both are lightweight, the personalization is less likely to overfit even under an extremely low data regime.
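Personalising only the [Client] token could be set up as below, assuming the token is held as an `nn.Parameter` (an assumption; the disclosure does not fix the data structure):

```python
import torch

def personalise_client_token_only(petra, mlp_head, client_token, lr=1e-3):
    # Freeze the aggregated PETRA and the shared head; fine-tune only z_client.
    for p in list(petra.parameters()) + list(mlp_head.parameters()):
        p.requires_grad_(False)
    return torch.optim.SGD([client_token], lr=lr)   # client_token: nn.Parameter
```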
[90] With respect to inference, given an FL-trained model on client $i$, it can infer labels $\hat{y}^{L_i}$ not only from block $L_i$ according to its capability, but also labels $\hat{y}^{l_i}$ for any $l_i$ smaller than the budget constraint $L_i$. This property can be very useful when a client device is low on battery, as it enables the user to set a small budget $l_i < L_i$ to allow energy-efficient inference.
[91] Figure 3 is a schematic diagram illustrating an example use of the present federated learning method. The present techniques enable a server to train a single foundation model, FM, that can be used in different clients with different processing capacities (e.g. the different tiers of device shown in Figure 3). For example, a single automatic speech recognition system can be flexibly used in a Samsung smartwatch, smartphone and digital tablet devices. This is due to one of the present PETRA designs, in which the self-attention blocks of the pre-trained vision transformer models are decomposed to enable the addition of a recurrent transformer which can accumulate the outputs of different transformer blocks as a multi-exit classifier and make a prediction flexibly depending on the compute capacity of a device. Moreover, this recurrent classifier can be trained in the usual federated learning way, even with clients with different processing capacities. The trainable module may comprise or be a classifier that can accumulate the class token embeddings inferred from any depth of a transformer model. The depth is predicted by some mapping function, which takes the device spec as input and outputs the depth value, i.e. which layer to exit at. (By having this mapping function, the present techniques can adapt to any hardware flexibly.)
[92] As noted above, the server injects a recurrent self-attention transformer as a multi-exit classifier. The PETRA classifier enables data (e.g. class token embeddings) to be accumulated from any depth of the transformer model, which allows the server to aggregate ATP data from client devices having differing hardware capabilities. In other words, for the pre-trained vision or audio foundation transformer models, the present techniques exploit the nature of self-attention blocks and add a small recurrent transformer as the ATP module. This added recurrent transformer can accumulate the class tokens inferred from different depths of the foundation transformer blocks and can predict based on the accumulated tokens flexibly at any depth. Specifically, for each device the depth value is predicted by a mapping function which takes the device spec as input. This makes the present framework work well with clients with different processing capacities at both training and test time.
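The mapping function from device specification to exit depth could be as simple as the following threshold rule; the thresholds and the choice of memory as the spec are purely illustrative, since the disclosure only requires that the function inputs the device spec and outputs a depth:

```python
def depth_from_spec(device_spec, total_blocks):
    # Map a device's hardware tier to how many frozen FM blocks it runs.
    mem_gb = device_spec.get("memory_gb", 1)
    if mem_gb >= 8:
        return total_blocks            # flagship phone/tablet: full depth
    if mem_gb >= 4:
        return total_blocks // 2       # mid-tier device: exit half way
    return max(1, total_blocks // 4)   # watch / low-tier device: early exit
```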
[93] Figures 4A and 4B are schematic diagrams illustrating another example use of the present federated learning method. The present techniques can be used with any AI-enabled edge device, such as, but not limited to, a smartphone. The present techniques can improve a smartphone with various AI services, e.g. automatic speech recognition and image recognition in Figure 4A, and semantic segmentation and language translation in Figure 4B. The present techniques allow multiple user devices to collaboratively learn to improve AI services. More users of client devices provide more data, and thus better train the AI services, while simultaneously maintaining user privacy. Furthermore, the present techniques enable efficient personalization of user devices according to the environmental conditions / user preferences. Further still, the present techniques enable not only individual device tiers, but individual devices themselves, to scale the inference cost/latency dynamically.
[94] Figure 5 is a flowchart of example steps performed by a client device 200 for training/adapting a transformer-based foundation model, FM, for a specific downstream task using federated learning. The method performed by a client device 200 comprises: receiving, from a server 100, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks (step S100); training, using a local training dataset comprising a plurality of data items, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen (step S102); and transmitting, to the server, the trained trainable module for aggregation (step S104).
[95] Thus, the server 100 is able to benefit from training on a larger training dataset without having to access the training data of the client devices 200, and the client devices 200 do not have to train the whole model, which enables participation in the training process by a wider variety of devices (thereby reducing model bias). That is, the foundation model is fixed and not trained locally. The client device 200 uses the pre-trained foundation model and locally trained trainable module for inference purposes until a new updated trainable module is received from the server, at which stage the client device 200 replaces the trainable module with the updated trainable module (step S106).
[96] The trainable module may itself be a transformer-based machine learning, ML, model. Since the pre-trained FM and the trainable module may be transformer-based, training the trainable module at step S102 may comprise: processing a data item in the local training dataset using the frozen pre-trained transformer-based FM; outputting a token from each transformer block of the plurality of transformer blocks of the FM; inputting each token into the trainable module; and training the trainable module to process the tokens and output a prediction for the data item using the processing. In other words, the trainable module takes, as inputs, tokens output by the frozen pre-trained foundation model, and processes these to generate a prediction for a data item. Thus, the trainable module uses the pre-trained FM to do some processing, which means the trainable module benefits from the capabilities of the pre-trained FM. As the pre-trained FM is frozen, the knowledge and learning of the FM is not lost during the on-device training due to distribution shift, which is advantageous.
[97] When the trainable module is a transformer-based ML model, training the trainable module to process the tokens and output a prediction for the data item using the processing may comprise training the trainable module to minimise a task-specific loss between a ground truth label for the data item (which is known if the client device data items are labelled data items), and a predicted label output by the trainable module for the data item.
[098] When the trainable module is a transformer-based ML model, training the trainable module may comprise: forming a token sequence using the tokens output by each transformer block of the frozen pre-trained transformer-based FM; and processing, using the trainable module, the token sequence.
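As an illustration of forming such a token sequence, the sketch below gathers the class (CLS) token after each transformer block of a ViT-style backbone using forward hooks. The attribute name `blocks` and the convention that the CLS token sits at position 0 are assumptions about the backbone implementation, not requirements of the present techniques.

```python
import torch

def collect_cls_tokens(frozen_fm, x):
    """Sketch: gather the CLS token after each transformer block of a frozen ViT-style FM."""
    cls_history = []

    def hook(_module, _inputs, output):
        cls_history.append(output[:, 0])          # CLS token assumed at position 0

    handles = [blk.register_forward_hook(hook) for blk in frozen_fm.blocks]
    with torch.no_grad():
        frozen_fm(x)                              # one frozen forward pass fires every hook
    for h in handles:
        h.remove()
    return torch.stack(cls_history, dim=1)        # (batch, num_blocks, embed_dim) token sequence
```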
[099] In some cases, when the trainable module is a transformer-based ML model, the trainable module may comprise at least one self-attention block. In this case, training the trainable module may comprise training the at least one self-attention block. As noted above, multiple self-attention blocks may be included, but in some cases, one self-attention block is sufficient.
[100] When the trainable module is a transformer-based ML model, the method may comprise: determining at least one training criterion for the client device. For example, the training criterion may be based on the hardware capabilities of the client device. Based on the at least one training criterion, processing a data item using the frozen pre-trained transformer-based FM may comprise processing a data item using a subset of the plurality of transformer blocks of the FM. In other words, a data item may not be processed by the whole pre-trained FM (i.e. the full depth of the FM), but instead a prediction may be output 'early' from the trainable module. In such cases, the trainable module is trained using the outputs from the early layers of the pre-trained FM.
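A sketch of this early-exit style of processing is shown below. It assumes the backbone exposes `prepare_tokens` and `blocks` attributes (as some ViT implementations do); only the first `exit_depth` blocks, chosen according to the device's training criterion, are executed, and the resulting CLS-token history is handed to the trainable module for an early prediction.

```python
import torch

def forward_to_exit(frozen_fm, x, exit_depth):
    """Sketch: run only the first `exit_depth` transformer blocks of the frozen FM.

    `prepare_tokens` and `blocks` are assumed attribute names; adapt to the actual backbone API.
    """
    with torch.no_grad():
        tokens = frozen_fm.prepare_tokens(x)        # patch embed + CLS token + position embedding
        cls_history = []
        for blk in frozen_fm.blocks[:exit_depth]:   # only a subset of the transformer blocks
            tokens = blk(tokens)
            cls_history.append(tokens[:, 0])        # CLS token after each processed block
    return torch.stack(cls_history, dim=1)          # fed to the trainable module for an early prediction
```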
[101] Determining at least one training criterion may comprise determining any one or more of: a processing capacity of the client device, a memory capacity of the client device, a battery capacity of the client device, and a latency criterion. In this way, not only is user data used to train the trainable module, but also the training takes into account the hardware capabilities of the user device, such as the user device's processing and memory capabilities. This means that a wide variety of user devices are able to participate in the training process.
[102] Transmitting the trained trainable module to the server may comprise transmitting parameters of the trained transformer-based ML model to the server for aggregation. For example, weights of the trained ML model may be transmitted to the server for aggregation and update of the FM model.
[103] As shown in Figure 5, the method performed by the client device 200 may further comprise: receiving, from the server 100, an updated trainable module, updated using parameters received from client devices; and replacing the trained trainable module with the received updated trainable module (step S106). In other words, the client device may receive a centrally updated trainable module back from the server for use, after the aggregation process has been performed by the server. The client device may use the updated trainable module, together with the existing pre-trained FM, to perform inference, or may personalise the updated trainable module on local data before use for inference. Thus, the method may further comprise personalising the updated trainable module for use by the client device.
[104] In some cases, the trainable module may comprise at least one learnable vector. The learnable vector(s) can be trained, such that the whole pre-trained FM does not need to be retrained. That is, while finetuning models pre-trained on large-scale datasets results in a strong supervised classifier, the fine-tuning process severely impairs the ability to process data items that are new/unseen. To alleviate this, the pre-trained FM is kept frozen and only one or more learnable vectors are trained.
[105] When the trainable module comprises at least one learnable parameter, training the trainable module may comprise training the trainable module to fine-tune the at least one learnable parameter for the specific downstream task.
[106] Figure 6 is a flowchart of example steps performed by a central server 100 for training a transformer-based foundation model, FM, using federated learning. The method performed by the server 100 may comprise: transmitting, to a plurality of client devices 200, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks (step S200); receiving, from some or all of the plurality of client devices, a trained trainable module (step S202); aggregating the received trained trainable modules to generate an updated trainable module (step S204); and transmitting, to the plurality of client devices, the updated trainable module (step S206).
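A minimal sketch of the aggregation at step S204 is given below. It assumes the clients return PyTorch state dictionaries of their trained trainable modules and uses a FedAvg-style weighted average (FedAvg is the aggregation scheme used in the experiments described later); weighting by local dataset size is a common choice but not mandated here.

```python
def aggregate_modules(client_state_dicts, client_weights=None):
    """Sketch of step S204: FedAvg-style averaging of the returned trainable modules."""
    n = len(client_state_dicts)
    if client_weights is None:
        client_weights = [1.0 / n] * n   # uniform weights; in practice often proportional to data size
    averaged = {}
    for key in client_state_dicts[0]:
        averaged[key] = sum(w * sd[key].float()
                            for w, sd in zip(client_weights, client_state_dicts))
    return averaged                       # broadcast back to the clients at step S206
```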
[107] Figure 7 is a diagram showing the steps performed by the server and client devices to perform federated learning using the present techniques. The individual processes have been described with reference to Figures 5 and 6 and so are not repeated, but Figure 7 shows the pipeline within the whole system. A key point to note is that when the updated trainable module is received by a client device from the server, the client device may use the updated trainable module together with the existing pre-trained FM to process new data items (i.e. perform inference), may perform further training as part of the federated learning process, or may personalise the updated trainable module. The personalisation may advantageously involve training client token 18 (see Figure 2) rather than the whole updated trainable module, which reduces the computational resources required to personalise the updated trainable module.
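The sketch below illustrates one way such client-token personalisation could be implemented. The attribute name `client_token`, and the reuse of the `collect_cls_tokens` helper sketched earlier, are assumptions for illustration only.

```python
import torch

def personalise_client_token(frozen_fm, trainable_module, local_loader, lr=0.01):
    """Sketch: fine-tune only a per-client token of the updated module, keeping everything else fixed."""
    for name, p in trainable_module.named_parameters():
        p.requires_grad_(name == "client_token")     # assumed parameter name for client token 18
    opt = torch.optim.SGD([trainable_module.client_token], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in local_loader:
        with torch.no_grad():
            cls_history = collect_cls_tokens(frozen_fm, x)   # frozen FM features, as sketched earlier
        loss = loss_fn(trainable_module(cls_history), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```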
[108] Figure 8A shows a flowchart of example steps performed by the server 100 to inject at least one ATP/trainable module into a pre-trained foundation model. The server injects a suitable ATP/trainable module based on the type of the foundation model. As shown, when the foundation model is a vision or audio transformer based model, the server may inject a recurrent self-attention transformer as multi-exit classifiers into the foundation model. Alternatively, when the foundation model is a CLIP model, the server may inject learnable position embedding offsets into the foundation model. When the foundation model is neither of these types of model, the server may request alternative designs of foundation model, which are more amenable to the insertion of ATPs.
[109] Figure 8B shows a flowchart of example steps performed by the client device 200 when training the at least one ATP/trainable module. As shown, the client device may train the ATP differently based on the type of ATP and foundation model. When the foundation model is a vision or audio transformer based model, the client device may train the ATP based on the client device's hardware specifications (e.g. processing capabilities). As noted above, the ATP injected into the model may be a series of classifiers providing different exit points along the foundation model. The ATP may be trained up to a specific exit point based on the capabilities of the client device. Alternatively, when the foundation model is a CLIP model, the client device may train the ATP in a usual federated learning manner.
[110] Figure 9 is a block diagram showing an ATP for use in a CLIP based model. As noted above, the server may inject learnable position embedding offsets (as the trainable module) into the CLIP model. A CLIP model is a language-image model, which comprises an image encoder to generate an image embedding representing image features in an input image, and a text encoder for generating a text embedding representing text features in an input text data item. As shown in Figure 9, the image encoder receives image inputs captured by e.g. an image capture device, and outputs feature representations (e.g. image embeddings) for the image inputs. The text encoder receives tokenised text embeddings plus learnable position embedding offsets. The output of the text encoder is a feature representation/text embedding of a text input. The image embeddings and text embeddings are input into a scoring module which outputs similarities between the image features and text features. That is, for the pre-trained language/vision CLIP model, the present techniques inject some learnable position embedding offsets as the trainable module into the inputs of the textual encoder. Similar to above, these position embedding offsets will be trained in a FL way. This set of position embedding offsets can be formed with an extremely small number of parameters, e.g. 512 parameters, which is beneficial to devices with limited communication bandwidth or latency constraints.
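For illustration, the sketch below shows one way the learnable position embedding offsets could be realised. Reading the "512 parameters" figure as a single offset vector of the text embedding dimension, shared across token positions, is an assumption, as is the default dimensionality.

```python
import torch
import torch.nn as nn

class PositionOffsetPrompt(nn.Module):
    """Sketch of a CLIP-style trainable module: learnable offsets added to the
    (frozen) position embeddings of the tokenised text input."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # One embed_dim-sized offset vector (~512 parameters), broadcast over positions.
        self.offsets = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, token_embeddings: torch.Tensor,
                position_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, context_length, embed_dim); offsets broadcast over positions.
        return token_embeddings + position_embeddings + self.offsets
```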
[111] Before explaining the experiments used to test the present techniques, a summary of the present techniques and key advantages is provided.
[112] The present techniques are based on the key idea of employing pre-trained foundation models (FMs) and injecting a trainable module, such as position embedding offsets or adaptors. FL then only involves training the trainable module on the client devices. This improves FL's training and communication efficiency while maintaining excellent performance, enables more heterogeneous devices to participate, and improves training stability.
[113] Specifically, the present techniques provide a new trainable module suited for FL under the foundation model, FM, regime, where the trainable module is designed for the requirements of adaptation to fit client devices at anytime (under different compute and memory budgets) and anywhere (under severe data heterogeneity).
[114] Given a pre-trained foundation model (e.g. a transformer model such as the vision transformer, ViT, model, an audio spectrogram transformer, AST, model, or a data-efficient image transformer, DeiT, model), the present techniques re-wire the feature extraction pathway of the foundation model to tackle the anytime and anywhere challenges. Specifically, the present techniques keep track of the special classification tokens (CLS tokens) after each self-attention transformation, and make use of the history of previous CLS tokens to revise the current CLS token using a lightweight Transformer, which is also referred to as PETRA herein. PETRA has an order of magnitude fewer parameters than a pre-trained Transformer model, and is the only module trainable during the local training. Therefore, the training and communication efficiencies of the model can both be significantly improved. To show this, a comparison is made between PETRA and a standard early-exit model; this is explained in more detail below with reference to Figures 15-17. The comparison experiments clearly show that the present techniques are more efficient in reaching a certain target performance. In addition, due to the efficient optimization enabled by PETRA, user personalization for a particular client can be conducted efficiently, with even better performance than fine-tuning the whole model.
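The sketch below gives one possible PETRA-style module consistent with the description above and with the depth=1 configuration reported in the ablation later: a single-layer transformer encoder over the CLS-token history, followed by a shared classification head. The layer sizes (e.g. the 384-dimensional embedding of DeiT-small) and the feed-forward width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PETRA(nn.Module):
    """Illustrative sketch: a lightweight transformer that revises the current CLS token
    using the history of CLS tokens collected from earlier blocks of the frozen FM."""

    def __init__(self, embed_dim: int = 384, num_heads: int = 6, num_classes: int = 100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)   # depth=1, as in the ablation
        self.head = nn.Linear(embed_dim, num_classes)               # shared exit head

    def forward(self, cls_history: torch.Tensor) -> torch.Tensor:
        # cls_history: (batch, depth_so_far, embed_dim), i.e. CLS tokens from the blocks seen so far.
        revised = self.encoder(cls_history)
        return self.head(revised[:, -1])   # predict from the revised current CLS token
```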
[115] Some advantages of the present techniques include: better accuracy in downstream tasks after FL due to exploiting FM; significantly reduced communication/bandwidth cost and reduced compute and memory/storage cost compared to standard FL; significantly faster and more stable convergence compared to standard FL; better support for heterogeneous devices of different capability to cooperate vs standard FL; better support for heterogeneous and uncurated data across clients compared to standard FL; better compute and data efficiency of personalisation compared to standard FL; and as a by-product, the PETRA adapter variant of ATP supports "anytime" inference, where the cost of predictions can be scaled dynamically by the edge device (e.g. due to power constraints, or CPU/GPU load constraints, or user-specified latency).
[116] Experimental Setup. As explained above, the present techniques take a different perspective to existing techniques, and redirect the focus of model design in federated learning to take advantage of pre-trained Foundation Models trained on large-scale datasets. In comparison, the existing state-of-the-art methods mainly focus on training federated learning models from scratch. Direct combination of FL and FMs is infeasible due to the huge communication cost. However, the present applicant shows that adding some small sets of parameters allows the pre-trained FM to be efficiently and effectively adapted into a federated learning system.
[117] The present techniques make use of a pre-trained Foundation Model, rather than a model initialized randomly, to benefit the federated learning system in multiple aspects. The pre-trained FMs can potentially mitigate the data heterogeneity between clients, as they see a diversity of data during pre-training, and enable superior system performance. This solves the failure of the naive FL+FM combination by performing FL on a special set of injected ATP modules rather than the FM itself.
[118] During training, the FM parameters are fixed in each client and some small sets of ATP modules are added. These small sets of ATP modules can be efficiently trained and communicated, addressing FL's training and communication bottlenecks. Moreover, according to the type of pre-trained FM, different designs of ATP module can be added, as described above. Furthermore, as it is only necessary to train a small number of ATP parameters in each client device, user personalization can be done efficiently and with less data/higher accuracy.
[119] To verify the efficacy of the present FL framework, experiments were conducted on standard FL benchmarks with the Flower codebase, including CIFAR-100, FEMNIST and SpeechCommandV2, as downstream tasks of pretrained DeiT and AST. The original test set in CIFAR-100 is used as the global test dataset, and 10,000 images are split from the original training set as the personal validation set. Three different data partitions are simulated for CIFAR-100, including one IID data partition (α = 1000.0) and two non-IID data partitions (α = 1.0, 0.1), by LDA partitioning following prior works, with 100 clients. FEMNIST has a total of 80,000 grayscale images of 62 different character classes (10 numeric, 26 lowercase, and 26 uppercase). For FEMNIST, the training data is partitioned in two ways, one IID and one non-IID, with 381 clients. The Speech Commands v2 dataset consists of 105,829 16 kHz, 1-second-long audio clips of a spoken word. The experiments were conducted with the 12-class version, with 10 classes "yes", "no", "up", "down", "left", "right", "on", "off", "stop" and "go", one class "unknown", and a class "silence" which has no spoken word in the clip. The dataset has three disjoint sets of speakers: 2112 for training, 256 for validation and 250 for test. For each round in FL, 10% of clients were sampled for training on CIFAR-100 and FEMNIST, and 1% on SpeechCommandV2. A DeiT-small model with 16 × 16 patches, pretrained on ILSVRC-2012, is used as the foundation model; this backbone can be used for both image and speech recognition, i.e. it is also the backbone of AST. All experiments are conducted with PyTorch on a single Nvidia Tesla V100 GPU, repeated three times, and the mean and standard deviation (std) are reported. All models are trained using SGD with a cosine annealing learning rate schedule.
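For completeness, a generic sketch of the LDA (Dirichlet) partitioning mentioned above is given below; smaller α yields more skewed, non-IID label distributions per client. This is a standard construction and not necessarily the exact partitioning code used in the experiments.

```python
import numpy as np

def lda_partition(labels, num_clients=100, alpha=0.1, seed=0):
    """Sketch: split sample indices across clients with a per-class Dirichlet(alpha) draw."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])          # shuffle samples of class c
        proportions = rng.dirichlet([alpha] * num_clients)       # class share per client
        splits = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

# Example: an IID-like partition uses a large alpha (e.g. 1000.0), a non-IID one a small alpha (e.g. 0.1).
```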
[120] All the FL experiments are run with the FedAvg aggregation scheme. A comparison is made with all the baselines, namely: Fine-tuning (FT), where the whole pretrained vision transformer is finetuned end-to-end in FL; and Layer-wise MLP, also named L.W. MLP or Vanilla Early Exit (EE), where a two-layer MLP with GELU and random dropout is appended after each transformer block with the rest frozen. Note that the layer-wise MLPs are a straightforward combination of standard early exit with FL. Several state-of-the-art adapter architectures are also compared, including the serial adapter, parallel adapter (PA) and LoRA. In each of these cases, the corresponding adapter is inserted into the backbone and the corresponding MLP exit is added for fair comparison. Finally, the present method inserts one shared PETRA and one shared MLP into each transformer block, with the rest frozen.
[121] All the methods are evaluated in four different settings:
1. Conventional FL: clients train the local models using only the final exit of a pretrained transformer, and the trained global model is evaluated on a test set using only the final exit.
2. Anytime FL: clients train the local models using a random exit at each iteration, and the trained global model is tested at each exit.
3. Multi-tier FL: clients train the local models using a specific early exit determined by the tier of each client, and the trained global model is tested at each exit.
4. Personalization: the FL-trained model from setting (3) is finetuned using the local data.
[122] Experimental Results: Conventional Federated Learning. Figure 10 shows results of experiments on conventional FL performance; bold indicates the best result. The results in Figure 10 show the comparison between all competitors. The results show that fine-tuning (FT) works the best among all methods. This is unsurprising, as it has access to the large, combined dataset of all clients and can use this to tune the whole model, but it incurs the highest local training and communication costs. Other parameter-efficient modules are quite effective in enabling federated adaptation of the pretrained Transformer, achieving comparable performance to FT and outperforming the MLP head in all settings. Nevertheless, the present techniques, PETRA, achieve the best result in all cases, with some results even surpassing full network fine-tuning, such as on FEMNIST with IID data. The results demonstrate the efficacy of the proposed PETRA in adapting a pretrained Transformer model into FL under different types of data and system heterogeneity.
[123] Experimental Results: Anytime Federated Learning. In reality, it is necessary to consider early exit situations. For each layer of a pretrained DeiT, an early exit is trained, such as an MLP head or variants with different adapters. At each local training iteration, an exit layer index l ∈ [0, ..., 11] is sampled randomly, and only that exit (and PETRA where relevant) is trained. For fine-tuning, the layer-wise MLP heads are appended and all parameters are trained during the FL training. After training, the global model is evaluated on the test set at all exits. Figure 11 shows results of experiments on anytime FL performance; bold indicates the best result. Figure 12 shows the test accuracy of each exit trained on CIFAR-100 with anytime federated learning (top) and the multi-tier constraint (bottom). The results in Figure 11 and Figure 12 (top) show the average and budget-wise performance over all exits. Unsurprisingly, global fine-tuning achieves the best performance, at a substantial communication cost. Among the more efficient competitors, the present techniques, PETRA, outperform L.W. MLP and other parameter-efficient methods significantly, again demonstrating the effectiveness of the present method.
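A sketch of one such anytime-FL local iteration is shown below, reusing the `collect_cls_tokens` and PETRA sketches from earlier; the exit index l is drawn uniformly from [0, ..., 11] and only the CLS tokens of blocks 0..l are fed to the trainable module. The signature and optimiser handling are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def anytime_training_step(frozen_fm, petra, opt, x, y):
    """Sketch of one anytime-FL local iteration: train PETRA at a randomly sampled exit."""
    l = random.randint(0, len(frozen_fm.blocks) - 1)     # exit layer index l in [0, ..., 11]
    with torch.no_grad():
        cls_history = collect_cls_tokens(frozen_fm, x)   # (batch, num_blocks, embed_dim)
    logits = petra(cls_history[:, : l + 1])              # only the blocks up to the sampled exit
    loss = F.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```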
[124] Experimental Results: Multi-tier Federated Learning. More realistically, there is a certain level of system heterogeneity among client devices. Individual devices in FL training have a certain level of computing capability, and their associated early exit should be persistently fixed in both training and testing. Figure 13 shows results of experiments on multi-tier FL performance; bold indicates the best result. It can be seen that most results drop to some extent compared with Figure 11. This is expected due to the system heterogeneity among clients. Most observations in relation to Figure 13 are similar to Figures 10 and 11. One interesting observation is that in this harder scenario, PETRA outperforms the fine-tuning baseline consistently in all situations on CIFAR-100. Figure 12 (bottom) shows the corresponding results for clients of different tiers on CIFAR-100 after training by different methods. This again shows that PETRA works for anytime inference in a harder and more practical scenario.
[125] Ablation Study. In order to verify the contributions of the different components proposed in PETRA, ablation studies were conducted based on the multi-tier FL setup on CIFAR-100 (α = 0.1), the hardest scenario. In Figure 14, the full PETRA indicates a transformer with depth=1, class token parallel connection, a token replacement mechanism, and an MLP classification head. Among them, it is found that the parallel connection influences performance the most. This is mainly because DeiT's original class tokens are well-learned on ImageNet, and are directly beneficial for model predictions in the downstream tasks. Removing token replacement reduces the effect of modulating the extracted features from DeiT, so the performance drops noticeably. A larger depth, i.e. 3, was also tried in PETRA, but it did not produce much performance gain. More interestingly, PETRA works complementarily with the existing PEFT methods explored, which is another positive sign for the proposed PETRA.
[126] Having evaluated the performance of the proposed technique compared to the competitors and the contribution of each component, the training footprint of PETRA is now quantified. First, the number of trainable parameters for the different methods in Anytime FL evaluated so far is shown in Figure 15, as a proxy for memory consumption and the communication cost of updates per client participation in FL.
[127] Figure 16 is a table showing actual computational (in FLOPs) and memory budgets across all exit layers with DeiT-S as the foundation model. The memory and compute footprint increases linearly with the layer at which the early exit is triggered. During training, the peak memory can be largely reduced if the base model is kept frozen, allowing in this way the participation of more constrained devices. Lastly, Figure 17 shows the communication cost of different methods to reach a target test accuracy, which is set here as the best performance of a Layer-wise Linear head. Specifically, Figure 17 shows the transmitted message size (# communication rounds × # model parameters (M)) required to reach a target performance with multi-tier FL on CIFAR-100 and FEMNIST. It can be seen from the results that PETRA achieves the lowest cost to reach the target, requiring a message size of only 40 × 3.17 and 40 × 2.99 for CIFAR-100 (α = 0.1) and FEMNIST (non-IID), respectively.
[128] Figure 18 is a block diagram of a system 300 for training a transformer-based foundation model, FM, for a specific downstream task using federated learning.
[129] The system 300 comprises a central server 100 and a plurality of client devices 200. Only one client device 200 is shown here for the sake of simplicity, but it will be understood that there may be tens to thousands to millions of client devices in the system.
[130] As shown, the server 100 comprises a foundation model, FM, 110, which has been pre-trained. In some cases, the foundation model 110 may have been trained on the server 100 using a training dataset 106. In other cases, the server 100 may receive the pre-trained foundation model 110 from elsewhere. As explained above, the server 100 injects a trainable module 108 into the pre-trained foundation model 110 prior to transmitting the foundation model 110 to client devices 200. The pre-trained foundation model 110 is frozen, i.e. remains static during the federated learning process, while the trainable modules 108 are trained by the client devices 200. It is the trainable module 108 which is trained by the client devices 200 and which is updated by the server 100 during the federated learning.
[131] The server 100 comprises at least one processor 102 coupled to memory 104, for: transmitting, to the plurality of client devices 200, a pre-trained transformer-based foundation model, FM, 110 comprising a plurality of transformer blocks, wherein a trainable module 108 has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices 200, a trained trainable module 208; aggregating the received trained trainable modules 208 to generate an updated trainable module; and transmitting, to the plurality of client devices 200, the updated trainable module.
[132] In the system, each client device 200 comprises: storage storing a local training dataset 206 comprising a plurality of data items; and at least one processor 202 coupled to memory 204, for: receiving, from the server 100, the pre-trained transformer-based foundation model, FM 110; training, using the stored local training dataset, the trainable module 108 inserted into the pre-trained transformer-based FM 110 while keeping the pre-trained transformer-based FM 110 frozen; and transmitting, to the server 100, the trained trainable module 208 for aggregation.
[133] The client device 200 may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a robot or robotic device, a robotic assistant, image capture system or device, an Internet of Things device, and a smart consumer device. It will be understood that this is a non-limiting and non-exhaustive list of client devices.
[134] The at least one processor 202 may comprise one or more of a microprocessor, a microcontroller, and an integrated circuit. The memory 204 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
[135] References: * Flower codebase - Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane. Flower: A friendly federated learning research framework.
arXiv preprint arXiv:2007.14390, 2020.
* CIFAR-100 - Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
* FEMNIST -Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921-2926. IEEE, 2017.
* SpeechCommandV2 -Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
* DeiT - Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347-10357. PMLR, 18-24 Jul 2021.
* AST -Yuan Gong, Yu-An Chung, and James Glass. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, pages 571-575, 2021.
* ILSVRC-2012 - Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, 2009.
* FedAvg - Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.
* Serial adapter - Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790-2799. PMLR, 2019.
* Parallel adapter - Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
* LoRA -Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.
arXiv preprint arXiv:2106.09685, 2021.
[136] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Claims (24)

  1. A computer-implemented method for training, at a client device, a transformer-based foundation model, FM, for a specific downstream task using federated learning, the method comprising: receiving, from a server, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; training, using a local training dataset comprising a plurality of data items, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trained trainable module for aggregation.
  2. 2. The method as claimed in claim 1 wherein the trainable module is a transformer-based machine learning, ML, model.
  3. 3. The method as claimed in claim 2 wherein training the trainable module comprises: processing a data item in the local training dataset using the frozen pre-trained transformer-based FM; outputting a token from each transformer block of the plurality of transformer blocks of the frozen pre-trained transformer-based FM; inputting each token into the trainable module; and training the trainable module to process the tokens and output a prediction for the data item using the processing.
  4. 4. The method as claimed in claim 3 wherein training the trainable module to process the tokens and output a prediction for the data item using the processing comprises training the trainable module to minimise a task-specific loss between a ground truth of the data item and the prediction output by the trainable module.
  5. The method as claimed in claim 3 or 4 wherein training the trainable module comprises: forming a token sequence using the tokens output by each transformer block; and processing, using the trainable module, the token sequence.
  6. 6. The method as claimed in any of claims 2 to 5 wherein the trainable module comprises at least one self-attention block, and wherein training the trainable module comprises training the at least one self-attention block.
  7. 7. The method as claimed in any of claims 2 to 5 further comprising: determining at least one training criterion for the client device; wherein, based on the at least one training criterion, processing a data item using the frozen pre-trained transformer-based FM comprises processing a data item using a subset of the plurality of transformer blocks of the FM.
  8. 8. The method as claimed in claim 7 wherein determining at least one training criterion comprises determining any one or more of: a processing capacity of the client device, a memory capacity of the client device, a battery capacity of the client device, and a latency criterion.
  9. 9. The method as claimed in any of claims 2 to 8 wherein transmitting the trained trainable module comprises transmitting parameters of the trained transformer-based ML model to the server for aggregation.
  10. 10. The method as claimed in any preceding claim further comprising: receiving, from the server, an updated trainable module, updated using parameters received from client devices; and replacing the trained trainable module with the received updated trainable module.
  11. 11. The method as claimed in claim 10 further comprising personalising the updated trainable module for use by the client device.
  12. 12. The method as claimed in claim 1 wherein the trainable module comprises at least one learnable vector.
  13. 13. The method as claimed in claim 12 wherein training the trainable module comprises training the trainable module to fine-tune the at least one learnable vector for the specific downstream task.
  14. 14. A client device for training a transformer-based foundation model, FM, for a specific downstream task using federated learning, the client device comprising: storage storing a local training dataset comprising a plurality of data items; and at least one processor coupled to memory, for: receiving, from a server, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; training, using the stored local training dataset, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trainable module for aggregation.
  15. 15. The client device as claimed in claim 14 wherein the plurality of data items in the local training dataset are a plurality of images, and training the FM comprises training the FM to perform an image processing task.
  16. 16. The client device as claimed in claim 15 further comprising an image capture device for capturing the plurality of images.
  17. 17. The client device as claimed in claim 14 wherein the plurality of data items in the local training dataset are a plurality of audio data items, and training the FM comprises training the FM to perform an audio processing task.
  18. 18. The client device as claimed in claim 17 further comprising an audio capture device for capturing the audio data items.
  19. 19. A computer-implemented method for training, at a server, a transformer-based foundation model, FM, using federated learning, the method comprising: transmitting, to a plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; transmitting, to the plurality of client devices, the updated trainable module.
  20. 20. The method as claimed in claim 19 wherein the trainable module is a transformer-based machine learning, ML, model.
  21. 21. The method as claimed in claim 20 wherein receiving the trainable modules comprises receiving parameters of a trained transformer-based ML model from some or all of the client devices.
  22. 22. A server for training a transformer-based foundation model, FM, using federated learning, the server comprising: at least one processor coupled to memory, for: transmitting, to a plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; transmitting, to the plurality of client devices, the updated trainable module.
  23. 23. A computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out the method of claims 1 to 13 or 19 to 21.
  24. 24. A system for training a transformer-based foundation model, FM, for a specific downstream task using federated learning, the system comprising: a plurality of client devices; and a server comprising at least one processor coupled to memory, for: transmitting, to the plurality of client devices, a pre-trained transformer-based foundation model, FM, comprising a plurality of transformer blocks, wherein a trainable module has been inserted into the foundation model, FM, and is coupled to the plurality of transformer blocks; receiving, from some or all of the plurality of client devices, a trained trainable module; aggregating the received trained trainable modules to generate an updated trainable module; transmitting, to the plurality of client devices, the updated trainable module.
  25. The system as claimed in claim 24 wherein each client device comprises: storage storing a local training dataset comprising a plurality of data items; and at least one processor coupled to memory, for: receiving, from the server, the pre-trained transformer-based foundation model, FM; training, using the stored local training dataset, the trainable module inserted into the pre-trained transformer-based FM while keeping the pre-trained transformer-based FM frozen; and transmitting, to the server, the trainable module for aggregation.
GB2314691.3A 2022-09-27 2023-09-25 Method and system for federated learning Pending GB2625622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2023/014974 WO2024072074A1 (en) 2022-09-27 2023-09-27 Method and system for federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB2214058.6A GB202214058D0 (en) 2022-09-27 2022-09-27 Method and system for federated learning

Publications (2)

Publication Number Publication Date
GB202314691D0 GB202314691D0 (en) 2023-11-08
GB2625622A true GB2625622A (en) 2024-06-26

Family

ID=83978738

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB2214058.6A Ceased GB202214058D0 (en) 2022-09-27 2022-09-27 Method and system for federated learning
GB2314691.3A Pending GB2625622A (en) 2022-09-27 2023-09-25 Method and system for federated learning

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB2214058.6A Ceased GB202214058D0 (en) 2022-09-27 2022-09-27 Method and system for federated learning

Country Status (2)

Country Link
GB (2) GB202214058D0 (en)
WO (1) WO2024072074A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860116A (en) * 2022-12-02 2023-03-28 广州图灵科技有限公司 Federal learning method based on generative model and deep transfer learning
CN118196231B (en) * 2024-05-16 2024-07-26 电子科技大学 Lifelong learning draft method based on concept segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494700B2 (en) * 2020-09-16 2022-11-08 International Business Machines Corporation Semantic learning in a federated learning system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chen et al., "Vision Transformer Adapter for Dense Predictions". Conference paper presented at ICLR 2023. Available at: https://arxiv.org/pdf/2205.08534.pdf *
Lu et al., "Frozen Pretrained Transformers as Universal Computation Engines", Proceedings of the 36th AAAI Conference on Artificial Intelligence, June 2022. Available at: https://doi.org/10.1609/aaai.v36i7.20729 *
Yang et al, "Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation". 2023 IEEE/CVF International Conference on Computer Vision (ICCV), October 2023. Available at: https://ieeexplore.ieee.org/document/10377922 *

Also Published As

Publication number Publication date
GB202314691D0 (en) 2023-11-08
GB202214058D0 (en) 2022-11-09
WO2024072074A1 (en) 2024-04-04
