WO2021201370A1

WO2021201370A1 - Federated learning resource management apparatus and system, and resource efficiency method therefor

Info

Publication number: WO2021201370A1
Application number: PCT/KR2020/017156
Authority: WO
Inventors: 금승우; 문재원; 김영기
Original assignee: Korea Electronics Technology Institute (한국전자기술연구원)
Priority date: 2020-03-31
Filing date: 2020-11-27
Publication date: 2021-10-07
Also published as: KR20210121915A; KR102567565B1

Abstract

The present invention relates to a method performed by a federated learning (FL) resource management apparatus. An embodiment of the present invention comprises the steps of: receiving, from an FL server, request information for operation of a plurality of FL clients; selecting, on the basis of the request information, one or more FL clients to be operated from among the plurality of FL clients; distributing a basic model and a learning tool to the selected FL clients; and, as the FL server generates an integrated model on the basis of an updated model, receiving, from the FL server, a request for the distribution of the integrated model.

Description

Federated learning resource management device, system and method for resource efficiency thereof

The present invention relates to a federated learning resource management apparatus and system, and a resource efficiency method therefor, and more particularly, to a system and method capable of managing resources efficiently by selectively operating a plurality of FL clients in consideration of their computing resources.

Recently, as cloud technology, artificial intelligence, and machine learning (hereinafter referred to as AI/ML) technology develop, various AI/ML services are being provided with high accuracy.

In order to maintain such a high degree of accuracy, a large amount of data for learning and high-performance computing resources capable of processing it are required.

To this end, in the prior art, learning data was collected in a high-performance server or cloud and learning was carried out there. However, as concerns such as personal information protection have grown, storing large amounts of learning data in the cloud has itself become a problem.

Federated learning (FL) was proposed to solve this problem. Instead of gathering data in one place, FL keeps the data only at the data generator, performs learning at the data generator, and collects only the learning results on an independent server to improve the performance of the AI/ML model.

Because the data is not gathered in one place and the learning itself proceeds at the data generator, federated learning has the effect of fundamentally blocking the leakage of personal information.

However, existing implementations of federated learning make efficient resource management difficult, because the FL client operates by registration: it notifies the FL server that it wants to perform the FL client function.

A current FL client registers itself with the FL server as an FL client as soon as the FL application is installed, regardless of its own capability. If the device cannot actually perform federated learning, for example because memory is insufficient or processor performance is low, the FL client ends up storing an application that cannot be operated, which wastes storage.

In addition, since such an FL client repeatedly runs a process that does not finish in time, computing resources such as CPU and memory are wasted.

An embodiment of the present invention provides a federated learning resource management apparatus, system, and resource efficiency method capable of efficiently managing resources by selectively operating a plurality of FL clients in consideration of the computing resources of the FL clients.

However, the technical task to be achieved by the present embodiment is not limited to the technical task as described above, and other technical tasks may exist.

As a technical means for achieving the above-mentioned technical problem, a method performed by a federated learning (FL) resource management apparatus according to a first aspect of the present invention includes: receiving, from an FL server, request information for operation of a plurality of FL clients; selecting one or more FL clients to be operated from among the plurality of FL clients based on the request information; distributing a basic model and a learning tool to the selected FL clients; and receiving, from the FL server, a distribution request for an integrated model as the FL server generates the integrated model based on updated models.

In some embodiments of the present invention, in the step of receiving request information for operation of a plurality of FL clients from the FL server, the request information may include one or more of computing requirements information, basic model information, and learning tool information for the plurality of FL clients.

In some embodiments of the present invention, the step of selecting one or more FL clients to be operated from among the plurality of FL clients may select the one or more FL clients based on at least one of computing resource information, related service operation information, and local data collection information of the plurality of FL clients.

In some embodiments of the present invention, the step of selecting one or more FL clients to be operated from among the plurality of FL clients may include a step of first selecting, from among the plurality of FL clients, the FL clients on which a related service is being operated, based on the related service operation information.

In some embodiments of the present invention, the step of selecting one or more FL clients to be operated from among the plurality of FL clients may include a step of finally selecting the available FL clients from among the first-selected FL clients based on the computing resource information.

In some embodiments of the present invention, software for federated learning may be installed and executed only on the selected FL clients.

In some embodiments of the present invention, the selected FL client may update the basic model based on the learning tool, and transmit the updated model to the FL server as the basic model update is completed.

In some embodiments of the present invention, the FL server may update the basic model stored in the FL server into an integrated model as it receives more than a preset number of updated models from a plurality of FL clients, and may request, through the FL resource management apparatus, distribution of the integrated model to the plurality of FL clients.

In some embodiments of the present invention, the method may further include registering the plurality of FL clients for management of the FL clients.

In addition, an apparatus for federated learning (FL) resource management according to a second aspect of the present invention includes a communication module for transmitting and receiving data to and from an FL server and a plurality of FL clients, a memory storing a program for resource management of the plurality of FL clients, and a processor for executing the program stored in the memory. When the processor, by executing the program, receives request information for operation of the plurality of FL clients from the FL server through the communication module, it selects one or more FL clients to be operated from among the plurality of FL clients based on the request information and distributes basic model information and learning tool information.

In some embodiments of the present invention, the processor may select one or more FL clients based on at least one of computing resource information, related service operation information, and local data collection information of the plurality of FL clients.

In addition, a system for performing and managing federated learning (FL) according to a third aspect of the present invention includes: a plurality of FL clients, at least one of which is selected and operated to perform federated learning and update a basic model; an FL server that updates an integrated model as it receives more than a preset number of the updated models and requests distribution of the integrated model to the plurality of FL clients; and an FL resource management apparatus that, upon receiving request information for operation of the plurality of FL clients from the FL server, selects one or more FL clients to be operated from among the plurality of FL clients based on the request information, distributes basic model information and learning tool information, and distributes the integrated model to the FL clients at the request of the FL server.

In addition to this, another method for implementing the present invention, another system, and a computer-readable recording medium for recording a computer program for executing the method may be further provided.

Other specific details of the invention are included in the detailed description and drawings.

According to any one of the above-described means for solving the problem, the software for federated learning is not installed and operated unilaterally on all of a plurality of FL clients; instead, only the FL clients selected in consideration of their computing resources are operated, which supports efficient management of resources.

In addition, since the software is installed and operated only on the selected FL client, the efficiency of FL client resource utilization can be increased, and the idle resources of other terminals can be increased.

In addition, the success rate of model updates by the FL clients can be increased, so that federated learning can be operated stably.

Effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a diagram for explaining a federated learning protocol according to the prior art.

FIG. 2 is a view for explaining a system for performing and managing federated learning according to an embodiment of the present invention.

FIG. 3 is a block diagram of a system for performing and managing federated learning according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily implement them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. And in order to clearly explain the present invention in the drawings, parts irrelevant to the description are omitted.

In the entire specification, when a part "includes" a certain element, it means that other elements may be further included, rather than excluding other elements, unless otherwise stated.

The environment for performing federated learning (FL) consists of an FL client that collects local data and updates a local model, and an FL server that collects the locally updated models and updates and distributes the integrated model.

First, the FL client registers itself as a corresponding FL client with the FL server.

Then, the FL server delivers the integrated model and local learning technique to the registered FL client.

If the situation condition is satisfied, the FL client attempts to update the local model by applying the local learning technique to the local data.

Thereafter, when the update of the corresponding model is completed, the FL client delivers the updated model to the FL server, and when a sufficient number of updated models is secured, the FL server updates the integrated model based on this.

Thereafter, as the integrated model is updated, the above process is repeated.

The prior art has a problem in that the FL client arbitrarily registers without considering the computing resource of the FL client.

That is, the prior art gives no consideration to resource management, for example to the case where an FL client cannot function normally as an FL client due to insufficient memory or storage resources.

This results in lowering the resource utilization of the FL client, and also acts as a factor for lowering the update success rate from the FL client, causing a problem in the update performance of the entire integrated model.
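As a purely illustrative aid, the conventional protocol and the problem described above can be sketched as follows. The names `Client`, `PriorArtServer`, and `try_local_update`, the memory figures, and the rule that a local update completes only when free memory meets the requirement are all hypothetical and are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Client:
    client_id: str
    free_memory_mb: int

    def try_local_update(self, required_memory_mb: int) -> bool:
        # Assumed rule: the local update completes only with enough free memory.
        return self.free_memory_mb >= required_memory_mb


class PriorArtServer:
    """Hypothetical FL server following the conventional protocol."""

    def __init__(self) -> None:
        self.registered = []

    def register(self, client: Client) -> None:
        # The client registers itself; its computing resources are never checked.
        self.registered.append(client)

    def run_round(self, required_memory_mb: int) -> float:
        # The model and learning technique go to every registered client,
        # and each one attempts a local update. Returns the update success rate.
        results = [c.try_local_update(required_memory_mb) for c in self.registered]
        return sum(results) / len(results)


server = PriorArtServer()
for client in (Client("a", 4096), Client("b", 128), Client("c", 256)):
    server.register(client)

success_rate = server.run_round(required_memory_mb=1024)
print(f"{len(server.registered)} registered, success rate {success_rate:.2f}")
```

In this sketch all three clients register, but only one of them can complete the update, which corresponds to the reduced update success rate described above.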

In contrast, an embodiment of the present invention is characterized in that the FL resource management apparatus 300 is additionally introduced in order to solve the above-described problem.

In an embodiment of the present invention, the FL resource management apparatus 300 is a kind of resource management tool. From among the various registered resources, it selects as an FL client 100 only a resource capable of performing that role, that is, a resource on which the corresponding federated learning service is actually being operated and which has sufficient available resources. This improves both resource management and the efficiency of the FL process.

Hereinafter, the federated learning performance and management system 1 according to an embodiment of the present invention will be described with reference to FIGS. 2 and 3.

FIG. 2 is a view for explaining the federated learning performance and management system 1 according to an embodiment of the present invention. FIG. 3 is a block diagram of a system 1 for performing and managing federated learning according to an embodiment of the present invention.

Referring to FIG. 2, the federated learning performance and management system 1 according to an embodiment of the present invention includes a plurality of FL clients 100, an FL server 200, and an FL resource management apparatus 300.

At least one FL client 100 selected by the FL resource management apparatus 300 is operated, and the selected FL client 100 performs federated learning to update the basic model. That is, in one embodiment of the present invention, not all registered FL clients 100 are operated; instead, FL clients 100 in an operable state are selected and operated based on the computing resource status of the FL clients 100.

As the FL server 200 receives more than a preset number of updated models from the FL clients 100, it collects the updated models and updates them into the integrated model, and then requests the FL resource management apparatus 300 to distribute the integrated model to the plurality of FL clients 100.

When the FL resource management apparatus 300 receives request information for operation of the plurality of FL clients 100 from the FL server 200, it selects one or more FL clients 100 to be operated from among the plurality of FL clients 100 based on the request information, and distributes basic model information and learning tool information.

At this time, the FL resource management apparatus 300 may select one or more FL clients 100 based on one or more of computing resource information, related service operation information, and local data collection information of the plurality of FL clients 100.

Meanwhile, in an embodiment of the present invention, the FL client 100, the FL server 200, and the FL resource management apparatus 300 may each be configured to include a communication module 410, a memory 420, and a processor 430 that executes a program stored in the memory 420, as shown in FIG. 3.

The communication module 410 is preferably configured as a wireless communication module, but this does not exclude a wired communication module. The wireless communication module may be implemented by wireless LAN (WLAN), Bluetooth, HDR WPAN, UWB, ZigBee, Impulse Radio, 60GHz WPAN, Binary-CDMA, wireless USB technology, wireless HDMI technology, and the like. In addition, the wired communication module may be implemented as a power line communication device, a telephone line communication device, a cable home (MoCA), Ethernet, IEEE1294, an integrated wired home network, and an RS-485 control device.

The memory 420 collectively refers to a non-volatile storage device, which retains stored information even when power is not supplied, and a volatile storage device. For example, the memory 420 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, a solid-state drive (SSD), and a micro SD card; magnetic computer storage devices such as hard disk drives (HDDs); and optical disc drives such as CD-ROMs and DVD-ROMs.

For reference, the components described in the embodiment of the present invention may be implemented in the form of software or hardware such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), and may perform predetermined roles.

However, 'components' are not limited to software or hardware, and each component may be configured to reside in an addressable storage medium or to execute on one or more processors.

Thus, as an example, a component includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

Components and functions provided within the components may be combined into a smaller number of components or further divided into additional components.

Hereinafter, with reference to FIG. 4, a method for improving the resource efficiency of the federated learning process performed by the FL resource management apparatus 300 will be described.

FIG. 4 is a flowchart of a federated learning process method according to an embodiment of the present invention.

First, the FL resource management apparatus 300 receives request information for the operation of the FL client 100 from the FL server 200 (S110).

In this case, according to an embodiment of the present invention, the FL resource management apparatus 300 may perform a process of registering resources of a plurality of FL clients 100 to be managed in order to select the FL client 100 . That is, the unregistered FL client 100 cannot be used as a resource in an embodiment of the present invention unless a separate registration procedure is performed.

In one embodiment, the FL server 200 may transmit, to the FL resource management apparatus 300, request information that includes one or more of computing requirements information, basic model information, and learning tool information for the plurality of FL clients 100.
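By way of illustration only, the request information could be represented as in the following sketch. The class and field names (`OperationRequest`, `ComputingRequirements`, `min_memory_mb`, `min_cpu_cores`, `basic_model_uri`, `learning_tool_uri`) and the example values are assumptions introduced for this example; the embodiment does not prescribe a data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputingRequirements:
    """Hypothetical computing requirements for an FL client."""
    min_memory_mb: int
    min_cpu_cores: int


@dataclass(frozen=True)
class OperationRequest:
    """Request information sent by the FL server to the resource management apparatus."""
    computing_requirements: Optional[ComputingRequirements] = None
    basic_model_uri: Optional[str] = None      # basic model information
    learning_tool_uri: Optional[str] = None    # learning tool information

    def is_valid(self) -> bool:
        # The request carries one or more of the three kinds of information.
        fields = (self.computing_requirements, self.basic_model_uri, self.learning_tool_uri)
        return any(value is not None for value in fields)


request = OperationRequest(
    computing_requirements=ComputingRequirements(min_memory_mb=1024, min_cpu_cores=2),
    basic_model_uri="models/image-classifier/base",   # placeholder location
    learning_tool_uri="tools/trainer",                # placeholder location
)
print(request.is_valid())
```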

Next, the FL resource management apparatus 300 selects one or more FL clients 100 to be operated from among the plurality of FL clients 100 based on the request information (S120).

In one embodiment, the FL resource management apparatus 300 may select one or more FL clients 100 from among the plurality of FL clients 100 based on one or more of the computing resource information, the related service operation information, and the local data collection information of the FL clients 100.

For example, the FL resource management apparatus 300 may first select the FL client 100 in which the related service is being operated from among the plurality of FL clients 100 based on the related service operation information.

Then, the FL resource management apparatus 300 may finally select the available FL clients 100 from among the first-selected FL clients 100 based on the computing resource information.

For example, when an image classification service is to be improved through federated learning, the FL resource management apparatus 300 first selects the FL clients 100 on which a program capable of performing the image classification service is installed. It then determines whether each of the selected FL clients 100 has sufficient available resources, and selects again the FL clients 100 that satisfy this condition. This process may also be performed by searching for a Pod in which a specific application is running, using a tool such as Kubernetes.
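The two-stage selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `ClientStatus` record, its fields, the threshold parameters, and the client names are hypothetical, and an actual deployment might obtain the same information from an orchestration tool such as Kubernetes instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientStatus:
    client_id: str
    running_services: frozenset   # related service operation information
    free_memory_mb: int           # computing resource information
    free_cpu_cores: float


def select_clients(clients, service, min_memory_mb, min_cpu_cores):
    # First selection: clients on which the related service is being operated.
    first_selected = [c for c in clients if service in c.running_services]
    # Final selection: first-selected clients with sufficient available resources.
    return [
        c.client_id
        for c in first_selected
        if c.free_memory_mb >= min_memory_mb and c.free_cpu_cores >= min_cpu_cores
    ]


clients = [
    ClientStatus("edge-1", frozenset({"image-classification"}), 2048, 2.0),
    ClientStatus("edge-2", frozenset({"image-classification"}), 256, 2.0),
    ClientStatus("edge-3", frozenset({"speech-recognition"}), 4096, 4.0),
]
selected = select_clients(clients, "image-classification", min_memory_mb=1024, min_cpu_cores=1.0)
print(selected)
```

Here `edge-3` is dropped in the first selection because it does not run the related service, and `edge-2` is dropped in the final selection because its free memory is below the assumed requirement.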

Next, the FL resource management apparatus 300 distributes the basic model and the learning tool to the selected FL client 100 (S130).

The software for federated learning is installed and executed only on the selected FL client 100.

Thereafter, when the environment for learning is satisfied, the selected FL client 100 updates the basic model based on the learning tool, and transmits the updated model to the FL server 200 as the update of the basic model is completed.

The FL server 200 updates the basic model stored in the FL server 200 into an integrated model as it receives more than a preset number of updated models from the plurality of FL clients 100.
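The server-side behavior described above can be illustrated with the following sketch. The embodiment does not specify how updated models are combined, so element-wise averaging of weight lists is used here purely as an assumed aggregation rule; the `FLServer` class, its method names, and the example values are likewise hypothetical.

```python
class FLServer:
    """Hypothetical FL server that builds an integrated model from updated models."""

    def __init__(self, basic_model, preset_count):
        self.model = list(basic_model)     # basic model stored in the server
        self.preset_count = preset_count   # preset number of updated models
        self.received = []

    def receive_update(self, updated_model):
        """Store an updated model; return True when an integrated model is generated."""
        self.received.append(list(updated_model))
        # Integrate only once more than the preset number of updates has arrived.
        if len(self.received) <= self.preset_count:
            return False
        count = len(self.received)
        # Assumed aggregation rule: element-wise average of the updated models.
        self.model = [sum(weights) / count for weights in zip(*self.received)]
        self.received.clear()
        return True


server = FLServer(basic_model=[0.0, 0.0], preset_count=2)
flags = [server.receive_update(m) for m in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])]
print(flags, server.model)
```

When `receive_update` returns `True`, the server would go on to request distribution of the integrated model from the FL resource management apparatus.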

Next, as the FL server 200 generates an integrated model based on the updated models, the FL server 200 requests the FL resource management apparatus 300 to distribute the integrated model (S140).

For example, if there is an FL client 100 operating an image classification service and the model used in that FL client 100 is updated through federated learning, the FL server 200 may collect the updated model, update it into the integrated model, and then request distribution of the integrated model from the FL resource management apparatus 300. This process can be implemented by searching for a Pod in which a specific application is running using a tool such as Kubernetes, or by defining a separate CRD (Custom Resource Definition).

When the distribution of the integrated model to the FL client 100 is completed, the FL server 200 repeats the above-described process, again transmitting request information for the operation of the FL client 100 to the FL resource management apparatus 300.
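Taken together, steps S110 to S140 can be sketched from the point of view of the FL resource management apparatus as follows. This is only an illustrative outline under stated assumptions: the class and method names, the dictionary-based request, and the memory-only selection criterion are hypothetical, and delivering the integrated model to the previously selected clients is one possible reading of the distribution step.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedClient:
    client_id: str
    free_memory_mb: int
    installed: dict = field(default_factory=dict)   # artifacts deployed to this client


class FLResourceManager:
    """Hypothetical FL resource management apparatus."""

    def __init__(self):
        self.clients = {}
        self.selected = []

    def register(self, client):
        # Only registered clients can be used as resources.
        self.clients[client.client_id] = client

    def handle_operation_request(self, request):
        # S110: request information is received from the FL server.
        # S120: clients meeting the (assumed) memory requirement are selected.
        self.selected = [
            c for c in self.clients.values()
            if c.free_memory_mb >= request["min_memory_mb"]
        ]
        # S130: the basic model and learning tool go to the selected clients only.
        for c in self.selected:
            c.installed = {"model": request["basic_model"], "tool": request["learning_tool"]}
        return [c.client_id for c in self.selected]

    def handle_distribution_request(self, integrated_model):
        # S140: the FL server requests distribution of the integrated model.
        for c in self.selected:
            c.installed["model"] = integrated_model
        return [c.client_id for c in self.selected]


manager = FLResourceManager()
capable, constrained = ManagedClient("a", 2048), ManagedClient("b", 256)
manager.register(capable)
manager.register(constrained)

operated = manager.handle_operation_request(
    {"min_memory_mb": 1024, "basic_model": "base-v0", "learning_tool": "trainer"}
)
manager.handle_distribution_request("integrated-v1")
print(operated, capable.installed, constrained.installed)
```

Nothing is installed on the client that fails the selection, which reflects the storage and computing savings that the embodiment aims for.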

Meanwhile, in the above description, steps S110 to S140 may be further divided into additional steps or combined into fewer steps according to an embodiment of the present invention. In addition, some steps may be omitted as necessary, and the order between steps may be changed. Furthermore, even where description is omitted here, the contents described above with reference to FIGS. 2 and 3 may also be applied to the resource efficiency method of the federated learning process of FIG. 4.

An embodiment of the present invention may also be implemented in the form of a computer program stored in a medium executed by a computer or a recording medium including instructions executable by the computer. Computer-readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, removable and non-removable media. Also, computer-readable media may include both computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Communication media typically includes computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transport mechanism, and includes any information delivery media.

Although the methods and systems of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general purpose hardware architecture.

The above description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed form, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the following claims rather than the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalent concepts should be interpreted as being included in the scope of the present invention.

Claims

1. An AI/ML model operating method in an FL client including an artificial intelligence and machine learning (AI/ML) service processing unit, a local AI/ML model storage, and a federated learning (FL) processing unit, the method comprising:

requesting, by the AI/ML service processing unit, download of an AI/ML model from an AI/ML model repository;

storing the AI/ML model in the local AI/ML model storage as the AI/ML model is downloaded in response to the download request;

providing, by the AI/ML service processing unit, a service by reading and storing the AI/ML model from the local AI/ML model storage;

determining, by the FL processing unit, whether a federated learning performance condition is satisfied;

performing federated learning by reading and storing the AI/ML model from the local AI/ML model storage as the federated learning performance condition is satisfied; and

providing the updated AI/ML model to an FL server as the federated learning is completed.
2. The method of claim 1,

wherein the local AI/ML model storage is stored in a volume shareable between the AI/ML service processing unit and the FL processing unit in the FL client.
3. The method of claim 2,

wherein, when the AI/ML service processing unit and the FL processing unit are configured as Docker containers, the local AI/ML model storage is configured as a shared volume of a host.
4. The method of claim 1,

wherein the determining, by the FL processing unit, of whether the federated learning performance condition is satisfied comprises:

determining whether the condition is satisfied based on at least one of whether the FL client is powered on, whether the FL client has been idle for a preset time, and whether the data collected for updating the AI/ML model is sufficient.
5. The method of claim 1,

wherein the FL server collects the updated AI/ML model from at least one FL client and, when an update condition is satisfied, updates an integrated model based on the collected AI/ML models.
6. The method of claim 5,

wherein the FL server registers the updated integrated model in the AI/ML model repository as the update of the integrated model is completed.
7. The method of claim 5,

further comprising checking, by the AI/ML service processing unit, at every preset period whether the integrated model in the AI/ML model repository has been updated,

wherein the requesting, by the AI/ML service processing unit, of the download of the AI/ML model from the AI/ML model repository comprises:

requesting download of the integrated model from the AI/ML model repository when it is confirmed that the integrated model has been updated.
8. The method of claim 7,

wherein the checking, by the AI/ML service processing unit, at every preset period of whether the integrated model in the AI/ML model repository has been updated comprises:

checking whether the update has been made based on at least one of version information of the integrated model and version information of the AI/ML model updated by the FL processing unit.
9. An artificial intelligence and machine learning (AI/ML) model operating system comprising:

at least one FL client including an AI/ML service processing unit that receives an AI/ML model and provides an AI/ML service, a local AI/ML model storage that stores and provides the AI/ML model, and an FL processing unit that reads and stores the AI/ML model from the local AI/ML model storage and performs federated learning (FL) to update the AI/ML model;

an AI/ML model repository that provides the AI/ML model to the local AI/ML model storage at the request of the AI/ML service processing unit; and

an FL server that collects the updated AI/ML model from the FL client, updates an integrated model from the collected AI/ML models as an update condition is satisfied, and registers the integrated model in the AI/ML model repository.
10. The system of claim 9,

wherein the local AI/ML model storage is stored in a volume shareable between the AI/ML service processing unit and the FL processing unit in the FL client.
11. The system of claim 9,

wherein the AI/ML service processing unit checks at every preset period whether the integrated model in the AI/ML model repository has been updated, and requests download of the integrated model when the check shows that the integrated model has been updated.
12. A federated learning (FL) client for artificial intelligence and machine learning (AI/ML) model operation, comprising:

an AI/ML service processing unit that provides an AI/ML service based on an AI/ML model provided from an AI/ML model repository;

a local AI/ML model storage that receives, from the AI/ML model repository, the AI/ML model requested by the AI/ML service processing unit, stores it, and provides it to the AI/ML service processing unit; and

an FL processing unit that performs federated learning by reading and storing the AI/ML model from the local AI/ML model storage based on whether a federated learning performance condition is satisfied, and provides the updated AI/ML model to an FL server.
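For illustration only, the federated learning performance condition and the periodic update check recited in the claims above might be sketched as follows. The claims leave the policy open ("at least one of"), so requiring all three criteria is an assumption made for this example, as are the function names, the integer version scheme, and the default thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientState:
    powered_on: bool
    idle_seconds: float
    collected_samples: int


def fl_condition_satisfied(state, min_idle_seconds=300.0, min_samples=100):
    # Assumed policy: every criterion must hold before federated learning starts.
    return (
        state.powered_on
        and state.idle_seconds >= min_idle_seconds
        and state.collected_samples >= min_samples
    )


def integrated_model_updated(local_version, repository_version):
    # Assumed scheme: versions are integers that grow with each integrated model.
    return repository_version > local_version


ready = ClientState(powered_on=True, idle_seconds=600.0, collected_samples=250)
busy = ClientState(powered_on=True, idle_seconds=5.0, collected_samples=250)
print(fl_condition_satisfied(ready), fl_condition_satisfied(busy))
print(integrated_model_updated(local_version=3, repository_version=4))
```

An AI/ML service processing unit could call `integrated_model_updated` at every preset period and request a download only when it returns `True`.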