CN114064278A - Heterogeneous acceleration engine and method for federated learning - Google Patents

Heterogeneous acceleration engine and method for federated learning

Info

Publication number
CN114064278A
Authority
CN
China
Prior art keywords
module
hardware system
federal learning
specific
federated learning
Prior art date
Legal status
Pending
Application number
CN202111370530.1A
Other languages
Chinese (zh)
Inventor
彭瑞 (Peng Rui)
王亚玲 (Wang Yaling)
王玮 (Wang Wei)
Current Assignee
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd
Priority to CN202111370530.1A
Publication of CN114064278A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/547 Remote procedure calls [RPC]; Web services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a heterogeneous acceleration engine and method for federated learning. The heterogeneous acceleration engine includes multiple application interfaces, each corresponding to a heterogeneous-computation-related function, and multiple service modules, with each application interface corresponding to at least one service module. A specific federated learning scheme is executed through the service modules corresponding to the application interfaces that are selectively called according to that scheme. The service modules include a device calling module and an operator calling module, which call the devices and operators of the federated learning hardware system. During execution of the scheme, the calls to the device calling module and the operator calling module are determined according to the performance parameters of the scheme and the performance test results of the federated learning hardware system. The performance test results are determined by performance tests carried out in advance according to a plurality of preset performance parameters, and the performance parameters of the scheme correspond to at least one of those preset performance parameters. In this way, heterogeneous acceleration hardware resources are fully utilized, energy consumption is reduced, and computing performance is improved.

Description

Heterogeneous acceleration engine and method for federated learning
Technical Field
The application relates to the technical field of data security and privacy protection, and in particular to a heterogeneous acceleration engine and method for federated learning.
Background
With the development of artificial intelligence and big-data mining and analysis, the demand for data keeps growing. For example, training artificial intelligence models requires large amounts of training data with appropriate labels or feature values. High-quality data often comes from application data generated and accumulated in business activities. However, application data is usually distributed among different organizations and individuals; for example, transaction data is distributed among financial institutions and medical diagnosis data among medical institutions. Application data is also dispersed across industries and domains; for example, social-attribute data and e-commerce transaction data in the internet domain are controlled by different entities. As data ownership, user privacy, and data security receive more attention, and as laws and regulations impose stricter constraints and requirements on data collection and processing, the organizations or individuals that hold application data are often unwilling, or lack appropriate means, to cooperate with each other, so it is difficult to make the application data they each hold work together. This dilemma in data sharing and collaboration is referred to as the data-island problem. To enable cross-industry and cross-organization data cooperation, and in particular to address the key problems of privacy protection and data security, the concept of federated learning (FL) was proposed. In federated learning, each participant that owns data exchanges model-related information in encrypted form, without sharing protected private data and without transmitting its own data externally, so that a federated learning model is collaboratively optimized. Federated learning is a machine learning technique that trains models on multiple decentralized edge devices or servers without exchanging raw data. It allows multiple data parties to build a common, powerful machine learning model without sharing original data, addressing key issues such as data privacy, data security, data access rights, and access to heterogeneous data. According to the distribution of the data feature space and the sample space of the training data, federated learning can be divided into horizontal federated learning (large overlap in the data feature space, small overlap in the sample space), vertical federated learning (small overlap in the data feature space, large overlap in the sample space), and federated transfer learning (small overlap in both the data feature space and the sample space).
Several federated learning frameworks have been built, including TensorFlow Federated, PySyft, and the Federated AI Technology Enabler (FATE), and federated learning implementations based on homomorphic encryption are widely adopted. In a homomorphic encryption system, the original data is encrypted before transmission, specific computations can be performed on the encrypted data while preserving the encrypted state and the correctness of the results, and the original result can be recovered by decrypting the computed ciphertext. However, these federated learning frameworks and homomorphic encryption techniques place high demands on computing power: compared with conventional machine learning, federated learning algorithms must process much larger data and make low-latency decisions under limited computing resources. For example, in the partially homomorphic Paillier cryptosystem, which is based on an asymmetric public/private key pair, the original data is encrypted into integers with very large bit widths; with a 1024-bit public key, the plaintext is encrypted into a 2048-bit ciphertext. Secret addition and secret multiplication are then carried out through modular multiplication and modular exponentiation, respectively, and such large encrypted integers (for example 2048-bit integers) make the corresponding base, modulus, and exponent large-bit-width integers as well, so the time complexity is far greater than that of computing on plaintext integers or floating-point numbers. Although modular exponentiation and other operations common in federated learning algorithms can be optimized, for example through negative-number conversion, divide-and-conquer strategies, and the Montgomery modular multiplication algorithm, the computing-power bottleneck caused by modular exponentiation over large-bit-width data and other common federated learning operations remains.
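Purely as an illustration of why these ciphertext operations are costly (and not as part of the claimed solution), the following minimal Python sketch shows Paillier encryption with toy parameters: a secret addition corresponds to a modular multiplication of ciphertexts, and a plaintext-scalar multiplication corresponds to a modular exponentiation. With a real 1024-bit public key, every operand below becomes a 2048-bit integer.

# Minimal Paillier sketch with toy primes; a real deployment would use
# 1024-bit primes, so the modular arithmetic below acts on 2048-bit integers.
import random
from math import gcd

def keygen(p=101, q=113):
    n = p * q
    g = n + 1                       # common simplification of the generator
    lam = (p - 1) * (q - 1)
    mu = pow(lam, -1, n)            # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)   # modular exponentiations

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen()
n = pk[0]
c1, c2 = encrypt(pk, 12), encrypt(pk, 34)
assert decrypt(pk, sk, (c1 * c2) % (n * n)) == 12 + 34   # secret addition = modular multiplication
assert decrypt(pk, sk, pow(c1, 5, n * n)) == 5 * 12      # scalar multiplication = modular exponentiation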
On the other hand, existing hardware also struggles to meet the computational requirements of federated learning. For example, a central processing unit (CPU) is designed to perform general computing tasks in a wide range of scenarios, but its performance is insufficient in scenarios such as federated learning that require large-scale parallel computation; graphics processing units (GPUs) can perform large-scale, high-intensity parallel mathematical operations but handle complex, control-heavy computational tasks poorly. Therefore, heterogeneous systems or heterogeneous architectures, such as heterogeneous computing with CPUs and GPUs, are often used for acceleration in practice. However, a heterogeneous system composed of processors of different types and with different instruction set architectures must solve problems such as internal load balancing and power-consumption optimization, while also meeting the requirements of diverse and changing federated learning frameworks and algorithms.
Therefore, a heterogeneous acceleration engine and method for federated learning are needed that, on the one hand, can meet the complex and diverse requirements of federated learning frameworks and algorithms and, on the other hand, can exploit the performance of a heterogeneous system, so as to meet the challenges of computing power and computing performance in federated learning scenarios.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides a heterogeneous acceleration engine applied to federated learning and connected between a federated learning software system and a federated learning hardware system. The heterogeneous acceleration engine includes: a plurality of application interfaces, each corresponding to a heterogeneous-computation-related function, where the federated learning software system selectively calls the application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme; and a plurality of service modules, where each of the application interfaces corresponds to at least one of the service modules. The specific federated learning scheme is executed through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to that scheme. The service modules include a device calling module for calling devices of the federated learning hardware system and an operator calling module for calling operators of those devices. During execution of the specific federated learning scheme, the calls to the device calling module and the operator calling module are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system. The performance test results of the federated learning hardware system are determined by performance tests carried out on the hardware system in advance according to a plurality of preset performance parameters, and each performance parameter of the specific federated learning scheme corresponds to at least one of the preset performance parameters.
The technical solution described in the first aspect organically combines the requirements of the federated learning software system with the heterogeneous acceleration hardware of the federated learning hardware system. This allows the heterogeneous acceleration hardware resources of the federated learning hardware system to be fully utilized and provides a matched hardware calling configuration for running the specific federated learning scheme; because the calling configuration is based on performance tests carried out in advance, it is highly reliable, and computing performance and training efficiency are improved while energy consumption is reduced.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the plurality of preset performance parameters include an operator mode, and the performance test result of the federated learning hardware system being determined by performing a performance test on the federated learning hardware system in advance according to the plurality of preset performance parameters includes: the performance test result of the federated learning hardware system is determined by performing performance tests on the federated learning hardware system in advance under different operator modes.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the performance parameter of the specific federated learning scheme is an operator mode, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the operator mode of the specific federated learning scheme and the performance test result of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the multiple preset performance parameters include a data bit width, and a performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the multiple preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data bit widths.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the performance parameter of the specific federated learning scheme is a data bit width, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data bit width of the specific federated learning scheme and the performance test result of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the multiple preset performance parameters include a data volume magnitude, and a performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the multiple preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data volume levels.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the performance parameter of the specific federated learning scheme is a data volume magnitude, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data volume magnitude of the specific federated learning scheme and the performance test result of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the multiple preset performance parameters include an operator mode, a data bit width, and a data volume magnitude, and a performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the multiple preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes, different data bit widths or different data volume magnitudes.
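As an illustration of how such pre-measured results might be produced (the device names, operator modes, and timing helper below are assumptions made for the sketch, not the patent's procedure), an offline benchmarking pass could time each candidate device for every combination of preset performance parameters and record the fastest one:

# Hypothetical offline benchmarking pass over the preset performance parameters
# (operator mode, data bit width, data volume magnitude), keeping the fastest device.
import time
from itertools import product

OPERATOR_MODES = ["modmul", "modexp", "he_encrypt", "secret_matmul"]
BIT_WIDTHS = [1024, 2048, 4096]
MAGNITUDES = [10**4, 10**6, 10**8]        # number of operands per batch
DEVICES = ["cpu", "gpu", "fpga"]          # devices present in the hardware system

def run_benchmark(device, op, bit_width, magnitude):
    """Placeholder: run `op` on `device` with synthetic data and return seconds."""
    start = time.perf_counter()
    # ... operator_call_module.run(device, op, bit_width, magnitude) ...
    return time.perf_counter() - start

def build_perf_table():
    table = {}
    for op, bits, mag in product(OPERATOR_MODES, BIT_WIDTHS, MAGNITUDES):
        timings = {d: run_benchmark(d, op, bits, mag) for d in DEVICES}
        table[(op, bits, mag)] = min(timings, key=timings.get)   # best device wins
    return table                           # consulted later by the device calling module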
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the performance parameter of the specific federated learning scheme at least includes an operator mode, a data bit width, or a data volume magnitude, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to an operator mode, data bit width or data volume magnitude of the specific federated learning scheme and a performance test result of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the invoking of the device calling module and the operator calling module during execution of the specific federated learning scheme additionally takes into account a user-specified device or user preference.
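A minimal sketch of such a runtime dispatch decision is given below; the table entries, magnitude bucketing, and user-preference override are illustrative assumptions and reuse the kind of table sketched earlier, not the patent's actual data or API.

# Hypothetical runtime dispatch: look up the device measured fastest for the
# scheme's (operator mode, bit width, data volume magnitude); a user-specified
# device or preference, when present, overrides the lookup.
PERF_TABLE = {
    ("modexp", 2048, 10**6): "fpga",
    ("modexp", 2048, 10**4): "gpu",
    ("modmul", 1024, 10**6): "gpu",
    ("modmul", 1024, 10**4): "cpu",
}

def nearest_magnitude(n):
    """Bucket an actual data volume to the nearest preset magnitude."""
    return min((10**4, 10**6, 10**8), key=lambda m: abs(m - n))

def choose_device(operator_mode, bit_width, data_volume, user_device=None):
    if user_device is not None:                       # user preference wins
        return user_device
    key = (operator_mode, bit_width, nearest_magnitude(data_volume))
    return PERF_TABLE.get(key, "cpu")                 # fall back to the main processor

# e.g. a 2048-bit modular exponentiation over ~1.5 million operands -> "fpga"
print(choose_device("modexp", 2048, 1_500_000))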
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the plurality of service modules further include: a data transmission module for controlling data interaction between the heterogeneous acceleration engine and the federated learning hardware system; a data storage module for managing the respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system; a data format conversion module for converting between the data format of the federated learning software system and the data format of the federated learning hardware system; and a device return module for returning the calculation results of the federated learning hardware system to the federated learning software system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the invoking of the data transmission module, the data storage module, the data format conversion module, and the device backhaul module in the execution process of the specific federated learning scheme is determined based on configuration information of the specific federated learning scheme, where the configuration information of the specific federated learning scheme indicates whether a specific calculation result of the federated learning hardware system is reusable, whether the specific calculation result is based on a matrix operation, and whether the specific calculation result is required by the federated learning software system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that, during execution of the specific federated learning scheme, the invoking of the data transmission module, the data storage module, the data format conversion module, and the device return module is determined based on the configuration information of the specific federated learning scheme, as follows: when a specific calculation result of the federated learning hardware system is reusable, or is based on a matrix operation, and is not required by the federated learning software system, the specific calculation result is not returned to the federated learning software system; when the specific calculation result is required by the federated learning software system, it is returned to the federated learning software system, and if its data format differs from that of the federated learning software system, it is first converted into the data format of the federated learning software system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the invoking of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme follows a preset policy, so that a specific calculation result of the federated learning hardware system is returned to the federated learning software system only when a preset condition of the preset policy is satisfied, the preset condition including: the specific calculation result is required by the federated learning software system, or the subsequent operations related to the specific calculation result cannot be realized by the heterogeneous-computation-related functions corresponding to the plurality of application interfaces.
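The return-or-retain decision described above can be sketched as follows; the configuration fields and helper names are assumptions made for illustration and are not taken from the patent.

# Hypothetical sketch of the device return policy: reusable or matrix-based
# intermediate results stay on the device unless the software system needs them;
# results that are returned are format-converted first when necessary.
from dataclasses import dataclass

@dataclass
class ResultConfig:
    reusable: bool            # can later steps of the scheme reuse this result?
    matrix_based: bool        # was it produced by a matrix operation?
    needed_by_software: bool  # does the federated learning software system need it?

def handle_result(result, cfg: ResultConfig, hw_format: str, sw_format: str):
    if (cfg.reusable or cfg.matrix_based) and not cfg.needed_by_software:
        return None                           # keep in device memory for later reuse
    if hw_format != sw_format:                # needed by the software system: return it,
        result = convert_format(result, src=hw_format, dst=sw_format)  # converting first
    return result                             # handed back by the device return module

def convert_format(result, src, dst):
    # placeholder for the data format conversion module (e.g. deserialization,
    # hardware integer layout -> the software system's array format)
    return result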
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the plurality of application interfaces include a plurality of cryptographic operation interfaces, including: a homomorphic encryption interface, whose heterogeneous-computation-related function is a homomorphic encryption function; a homomorphic decryption interface, whose heterogeneous-computation-related function is a homomorphic decryption function; a secret addition interface, whose heterogeneous-computation-related function is a secret addition function; a secret multiplication interface, whose heterogeneous-computation-related function is a secret multiplication function; a matrix transposition interface, whose heterogeneous-computation-related function is a matrix transposition function; a secret matrix multiplication interface, whose heterogeneous-computation-related function is a secret matrix multiplication function; a secret summation interface, whose heterogeneous-computation-related function is a secret summation function; a matrix slicing interface, whose heterogeneous-computation-related function is a matrix slicing function; and a matrix splicing interface, whose heterogeneous-computation-related function is a matrix splicing function.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the multiple application interfaces include a plaintext operation interface and a tensor operation interface, the heterogeneous computation related function corresponding to the plaintext operation interface is a plaintext operation function, and the heterogeneous computation related function corresponding to the tensor operation interface is a tensor operation function.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the plurality of application interfaces include a direct management interface, and the federal learning software system sends an instruction to the device call module, the data transmission module, the data storage module, or the data format conversion module by calling the direct management interface.
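Purely as an illustration of what such an application-interface surface could look like (the names and signatures below are assumptions, not a published API of the engine), the cryptographic, plaintext/tensor, and direct management interfaces might be declared along these lines:

# Hypothetical API-layer surface of the heterogeneous acceleration engine.
# Each method corresponds to one heterogeneous-computation-related function and
# is backed by at least one service module; names and types are illustrative.
from typing import Any, Optional, Protocol, Sequence

class HeteroAccelerationAPI(Protocol):
    # cryptographic operation interfaces
    def he_encrypt(self, plaintext: Sequence[int]) -> Any: ...
    def he_decrypt(self, ciphertext: Any) -> Sequence[int]: ...
    def secret_add(self, c1: Any, c2: Any) -> Any: ...
    def secret_mul(self, c: Any, scalar: Sequence[int]) -> Any: ...
    def matrix_transpose(self, m: Any) -> Any: ...
    def secret_matmul(self, cm: Any, pm: Any) -> Any: ...
    def secret_sum(self, c: Any, axis: Optional[int] = None) -> Any: ...
    def matrix_slice(self, m: Any, rows: slice, cols: slice) -> Any: ...
    def matrix_concat(self, parts: Sequence[Any], axis: int = 0) -> Any: ...
    # plaintext and tensor operation interfaces
    def plain_op(self, op: str, *tensors: Any) -> Any: ...
    def tensor_op(self, op: str, *tensors: Any) -> Any: ...
    # direct management interface: pass instructions straight to the device calling,
    # data transmission, data storage, or data format conversion modules
    def manage(self, target_module: str, command: str, **kwargs: Any) -> Any: ...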
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the plurality of service modules further include: a device management module for managing devices of the federated learning hardware system; a memory management module for managing the respective memories of the heterogeneous acceleration engine and the federated learning hardware system; an exception handling module for capturing and responding to exception signals from the heterogeneous acceleration engine and the federated learning hardware system; and a device monitoring module for monitoring devices of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the first aspect, an embodiment of the present application further provides that the device of the federal learning hardware system includes at least one ASIC, at least one FPGA, at least one GPU, and at least one TPU.
In a second aspect, an embodiment of the present application provides a heterogeneous acceleration method applied to federated learning and implemented by a heterogeneous acceleration engine connected between a federated learning software system and a federated learning hardware system. The heterogeneous acceleration engine includes a plurality of application interfaces, each corresponding to a heterogeneous-computation-related function, and a plurality of service modules, with each application interface corresponding to at least one service module. The heterogeneous acceleration method includes: obtaining a specific federated learning scheme; selectively calling, by the federated learning software system, the application interfaces and the corresponding heterogeneous-computation-related functions according to the specific federated learning scheme; and executing the specific federated learning scheme through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to that scheme. The service modules include a device calling module for calling devices of the federated learning hardware system and an operator calling module for calling operators of those devices. During execution of the specific federated learning scheme, the calls to the device calling module and the operator calling module are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system. The performance test results of the federated learning hardware system are determined by performance tests carried out on the hardware system in advance according to a plurality of preset performance parameters, and each performance parameter of the specific federated learning scheme corresponds to at least one of the preset performance parameters.
The technical solution described in the second aspect likewise organically combines the requirements of the federated learning software system with the heterogeneous acceleration hardware of the federated learning hardware system, allowing the heterogeneous acceleration hardware resources of the federated learning hardware system to be fully utilized and providing a matched hardware calling configuration for running the specific federated learning scheme; because the calling configuration is based on performance tests carried out in advance, it is highly reliable, and computing performance and training efficiency are improved while energy consumption is reduced.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that the multiple preset performance parameters include an operator mode, a data bit width, and a data volume magnitude, and a performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the multiple preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes, different data bit widths or different data volume magnitudes.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that the performance parameter of the specific federated learning scheme at least includes an operator mode, a data bit width, or a data volume magnitude, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to an operator mode, data bit width or data volume magnitude of the specific federated learning scheme and a performance test result of the federated learning hardware system.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that the multiple service modules further include: the data transmission module is used for controlling data interaction between the heterogeneous acceleration engine and the federal learning hardware system; a data storage module for managing respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system; the data format conversion module is used for carrying out data format conversion operation between the data format of the federal learning software system and the data format of the federal learning hardware system; and the equipment returning module is used for returning the calculation result of the federal learning hardware system to the federal learning software system.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that the invoking of the data transmission module, the data storage module, the data format conversion module, and the device backhaul module in the execution process of the specific federated learning scheme is determined based on configuration information of the specific federated learning scheme, where the configuration information of the specific federated learning scheme indicates whether a specific calculation result of the federated learning hardware system is reusable, whether the specific calculation result is based on a matrix operation, and whether the specific calculation result is required by the federated learning software system.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that, in an execution process of the specific federated learning scheme, the invoking of the data transmission module, the data storage module, the data format conversion module, and the device backhaul module is determined based on configuration information of the specific federated learning scheme, including: when a specific calculation result of the federal learning hardware system can be reused or is operated based on matrix operation and is not required by the federal learning software system, the specific calculation result is not transmitted back to the federal learning software system; when the specific calculation result is needed by the federal learning software system, the specific calculation result is transmitted back to the federal learning software system, and if the data format of the specific calculation result is different from that of the federal learning software system, the data format of the specific calculation result is converted into that of the federal learning software system.
According to a possible implementation manner of the technical solution of the second aspect, an embodiment of the present application further provides that the invoking of the data transmission module, the data storage module, the data format conversion module, and the device return module in the execution process of the specific federated learning scheme is based on a preset policy, so that the specific calculation result of the federated learning hardware system is returned to the federated learning software system only when a preset condition of the preset policy is satisfied, where the preset condition includes: the subsequent operations of the specific calculation result, which are required by the federal learning software system or are related to the specific calculation result, cannot be realized by the heterogeneous calculation related functions respectively corresponding to the multiple application interfaces.
Drawings
To explain the technical solutions in the embodiments or the background of the present application, the drawings used in the embodiments or the background are described below.
Fig. 1 is a schematic diagram illustrating a federated learning heterogeneous acceleration scenario provided in an embodiment of the present application.
Fig. 2 shows a block diagram of a heterogeneous acceleration engine for federal learning according to an embodiment of the present application.
Fig. 3 shows a flowchart of a heterogeneous acceleration method for federal learning according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a heterogeneous acceleration engine and a heterogeneous acceleration method for federated learning, in order to meet the complex and diverse requirements of federated learning frameworks and algorithms on the one hand, and to exploit the performance of a heterogeneous system on the other hand, so as to address the challenges of computing power and computing performance in federated learning scenarios. The heterogeneous acceleration engine is connected between a federated learning software system and a federated learning hardware system and includes: a plurality of application interfaces, each corresponding to a heterogeneous-computation-related function, where the federated learning software system selectively calls the application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme; and a plurality of service modules, where each of the application interfaces corresponds to at least one of the service modules. The specific federated learning scheme is executed through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to that scheme. The service modules include a device calling module for calling devices of the federated learning hardware system and an operator calling module for calling operators of those devices. During execution of the specific federated learning scheme, the calls to the device calling module and the operator calling module are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system. The performance test results of the federated learning hardware system are determined by performance tests carried out on the hardware system in advance according to a plurality of preset performance parameters, and each performance parameter of the specific federated learning scheme corresponds to at least one of the preset performance parameters. In this way, the requirements of the federated learning software system are organically combined with the heterogeneous acceleration hardware of the federated learning hardware system, allowing the heterogeneous acceleration hardware resources to be fully utilized and providing a matched hardware calling configuration for running the specific federated learning scheme; because the calling configuration is based on performance tests carried out in advance, it is highly reliable, and computing performance and training efficiency are improved while energy consumption is reduced.
Embodiments of the application may be used in application scenarios including, but not limited to, multi-party security computing, federal learning related machine learning model training, data security, privacy protection, or other application scenarios applying a privacy computing framework or algorithm, etc.
The embodiments of the present application may be modified and improved according to specific application environments, and are not limited herein.
To help those skilled in the art better understand the present application, the embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram illustrating a federated learning heterogeneous acceleration scenario provided in an embodiment of the present application. As shown in FIG. 1, a typical federated learning heterogeneous acceleration scenario includes a federated learning software system 100 and a federated learning hardware system 140. Here, the federated learning software system 100 represents the various current and future federated learning frameworks possible in a federated learning scenario, including but not limited to the TensorFlow Federated framework, the PySyft framework, the FATE framework, and similar frameworks designed on the basis of these frameworks. These federated learning frameworks, such as the FATE framework, provide a full machine learning pipeline from data intersection, modeling, and prediction to online publishing, so that data can be used while remaining invisible. The underlying technical support of each of these federated learning frameworks may be based on different programming languages or design ideas, and different logic flows and execution approaches may be employed when implementing federated learning computing tasks. The federated learning software system 100 shown in FIG. 1 should therefore be understood to correspond to any software framework, software system, or federated learning framework for federated learning, and to include the logic flows, software-layer operations, algorithms, models, and so on that each such framework or system employs to accomplish federated learning computing tasks. In addition, the notion of federated learning referred to in the embodiments of this application should be broadly construed as a machine learning technique for training models on multiple decentralized edge devices or servers without exchanging raw data, and this application is intended to cover federated learning frameworks, software frameworks for federated learning, and software systems proposed or published after the filing date of this application.
With continued reference to Fig. 1, from a heterogeneous acceleration perspective, the federated learning software system 100 may be divided into a business layer 102, an algorithm layer 104, a security protocol layer 106, and an operator layer 108. This partitioning is intended to make it easier to understand the complex and varied requirements that the federated learning software system 100 may place on heterogeneous acceleration, such as the different logic flows, software-layer operations, algorithms, and models that the various federated learning frameworks mentioned above may employ for specific federated learning computing tasks. Specifically, the business layer 102 can be understood as the business interface open to users, providing data import, model construction, model training, model evaluation, result prediction, and the like. After a user initiates a task through the business layer 102, the business layer 102 calls the relevant modules of the algorithm layer 104 and receives the results returned by the algorithm layer 104. The algorithm layer 104 includes the various machine learning models used for federated learning, such as linear regression, logistic regression, Poisson regression, neural networks, gradient boosting trees, and the like. If computations that protect data security and data privacy are needed, the algorithm layer 104 calls the relevant modules of the security protocol layer 106 to perform the operations and receives the results returned by the security protocol layer 106. The security protocol layer 106 includes the algorithms specific to federated learning that act on the raw data to protect its privacy, including but not limited to federated learning implementations based on homomorphic encryption, differential privacy, multi-party secure computation, and so on. Homomorphic encryption is an encryption mode that supports specific computations in the encrypted (secret) state; homomorphic cryptosystems based on homomorphic encryption include the Paillier, BFV, and CKKS cryptosystems. The specific security protocol or cryptosystem to be used may be chosen by the user according to the actual application scenario and requirements, determined by industry regulations, business practice, or preset rules, or decided by any other suitable means. The security protocol layer 106 calls the associated operators of the operator layer 108 to perform a series of operations and receives the results returned by the operator layer 108. The operator layer 108 includes the various operators suitable for hardware execution, including homomorphic encryption, homomorphic decryption, secret addition, secret multiplication, and secret matrix operations. For a given federated learning computing task, the task is initiated from the business layer 102, the business layer 102 then calls the relevant modules and the corresponding models of the algorithm layer 104, the algorithm layer 104 then calls the relevant modules of the security protocol layer 106 and adopts the corresponding federated learning implementation (selecting an encryption mode, a cryptosystem, a security protocol, and so on), and finally the security protocol layer 106 calls the relevant operators of the operator layer 108 to execute a series of operations.
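As a purely illustrative sketch of this layered delegation (class and method names are assumptions, not the actual API of FATE or any other framework), the four layers could be wired as follows, with the operator layer offloading to the acceleration engine described below:

# Hypothetical layered call flow of the federated learning software system 100:
# business layer 102 -> algorithm layer 104 -> security protocol layer 106 ->
# operator layer 108, the last of which is served by the acceleration engine.
class OperatorLayer:                      # layer 108: hardware-suitable operators
    def __init__(self, engine):
        self.engine = engine
    def secret_matmul(self, enc_x, weights):
        return self.engine.secret_matmul(enc_x, weights)   # offloaded via the API layer

class SecurityProtocolLayer:              # layer 106: e.g. a Paillier-based protocol
    def __init__(self, operators):
        self.ops = operators
    def protected_forward(self, x, weights):
        enc_x = self.ops.engine.he_encrypt(x)
        return self.ops.secret_matmul(enc_x, weights)

class AlgorithmLayer:                     # layer 104: e.g. a logistic regression model
    def __init__(self, protocol):
        self.protocol = protocol
    def train_step(self, batch, weights):
        return self.protocol.protected_forward(batch, weights)

class BusinessLayer:                      # layer 102: user-facing task entry point
    def __init__(self, algorithm):
        self.algorithm = algorithm
    def run_training_task(self, data, weights):
        return [self.algorithm.train_step(b, weights) for b in data]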
Therefore, from the perspective of heterogeneous acceleration, the different logic flows, software-layer operations, algorithms, models, and so on that various federated learning frameworks may adopt for a specific federated learning computing task (also referred to as a specific federated learning scheme, a specific federated learning implementation, a specific federated learning task, and so on) are divided into the business layer 102, the algorithm layer 104, the security protocol layer 106, and the operator layer 108; moreover, the execution flow of any federated learning computing task can likewise be decomposed from the viewpoint of heterogeneous acceleration into operations on the business layer 102, the algorithm layer 104, the security protocol layer 106, and the operator layer 108. The operator layer 108 corresponds to the process of completing a series of operations by running the various operators suitable for hardware execution within the execution flow of a specific federated learning computing task. In other words, the complex and varied requirements that the federated learning software system 100 may impose, such as the different logic flows, software-layer operations, algorithms, and models that the various federated learning frameworks may adopt for specific federated learning computing tasks, are ultimately embodied as running various hardware-suitable operators at the operator layer 108 to complete a series of operations. These hardware-suitable operators running at the operator layer 108 are executed through the heterogeneous acceleration engine 120.
Continuing with FIG. 1, the heterogeneous acceleration engine 120 includes an API layer 122, a service layer 124, and a driver layer 126. The API layer 122 includes a variety of application interfaces, which may also be referred to as application programming interfaces (APIs), each corresponding to a heterogeneous-computation-related function. It should be appreciated that any function that may be called by the federated learning framework to perform heterogeneous computation may be considered part of the API layer 122, that is, one of the application interfaces. The application interfaces included in the API layer 122 of the heterogeneous acceleration engine 120 may correspond to a homomorphic cryptosystem based on homomorphic encryption, such as the Paillier, BFV, or CKKS cryptosystem, or to any federated learning computing task, scheme, implementation, or task, and the present application is intended to cover federated learning frameworks, software frameworks for federated learning, and software systems proposed or published after the filing date of the present application. The function calls at the API layer 122 may take any form, including passing data variables by pointer, by reference, or by copy. Depending on the actual function and implementation mechanism, function calls at the API layer 122 may also pass computation results and status back to the federated learning software system 100 through parameters, return values, and so on. As described above, the complex and varied requirements that the federated learning software system 100 may impose, such as the different logic flows, software-layer operations, algorithms, and models adopted by various federated learning frameworks for specific federated learning computing tasks, are ultimately embodied as running various hardware-suitable operators at the operator layer 108 to complete a series of operations, and these operators are realized by calling the application interfaces included in the API layer 122 of the heterogeneous acceleration engine 120. That is, the federated learning software system 100 selectively calls the application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme (which may also be referred to as a specific federated learning computing task).
As shown in Fig. 1, the federated learning software system 100 sends information to the heterogeneous acceleration engine 120, including API-call instructions 110, that is, instructions that realize the hardware-suitable operators running at the operator layer 108 by calling the application interfaces included in the API layer 122 of the heterogeneous acceleration engine 120. In the other direction, the heterogeneous acceleration engine 120 sends information to the federated learning software system 100, including the returned computation results 112. The information exchanged between the federated learning software system 100 and the heterogeneous acceleration engine 120 may also include, for example, exception feedback. Thus, the complex and varied requirements that the federated learning software system 100 may impose, such as the different logic flows, software-layer operations, algorithms, and models adopted by various federated learning frameworks for a specific federated learning computing task, are ultimately embodied in the API-call instructions 110 that the federated learning software system 100 sends to the heterogeneous acceleration engine 120; in other words, the application interfaces of the API layer 122 of the heterogeneous acceleration engine 120 and the corresponding heterogeneous-computation-related functions are selectively called according to the specific federated learning scheme (which may also be referred to as a specific federated learning computing task). It should be understood that, in one possible implementation, the federated learning software system 100 and the heterogeneous acceleration engine 120 interact through a non-API pattern, for example by reading and writing files, memory, or registers.
In one possible embodiment, the API layer 122 provides the interfaces related to a homomorphic encryption system, such as homomorphic encryption, homomorphic decryption, secret addition, secret multiplication, matrix transposition, secret matrix multiplication, secret summation, matrix slicing, matrix splicing, and so on. The API layer 122 also provides the related computation interfaces for plaintext (unencrypted) data, and so on.
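In such an embodiment, a caller in the federated learning software system might use these interfaces roughly as follows; the engine object and method names are the hypothetical ones sketched earlier in this document, not a published API.

# Hypothetical call sequence against the API layer 122: encrypt a feature matrix,
# run a secret matrix multiplication on the acceleration hardware, then decrypt
# only the final aggregate on the software side.
def protected_linear_scores(engine, features, weights):
    enc_features = engine.he_encrypt(features)                 # homomorphic encryption interface
    enc_scores = engine.secret_matmul(enc_features, weights)   # secret matrix multiplication interface
    enc_total = engine.secret_sum(enc_scores, axis=0)          # secret summation, stays encrypted
    return engine.he_decrypt(enc_total)                        # decrypt once, at the end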
Referring to Fig. 1, the service layer 124 includes the internal service modules of the heterogeneous acceleration engine 120, which are used to implement the functions of the API layer 122, call the underlying hardware, and keep the heterogeneous acceleration engine 120 running normally. In some exemplary embodiments, the service layer 124 includes the following service modules: a device management module, used to initialize a device, reset a device, close a device, switch devices, obtain the current device id, initialize the device driver, create a device running environment, obtain the current device running environment, release a device running environment, set a device running environment, obtain device parameters, obtain device memory usage information, and so on; a data transmission module, mainly used to control the reading and writing of data on the hard disk and the transfer of data between the host and the device; a memory management module, used for allocation, initialization, release, copying, moving, comparison, and locating of host memory and device memory; a data format conversion module, used for converting data formats, for example serialization and deserialization of transmitted and received data and conversion between the different data structures of different programming languages; a data storage module, used to manage the storage of data, including storage of data in different data structures in host memory, storage of data in device memory, and storage of data on hard disks or other storage media; an exception handling module, used to capture and handle program exceptions at each level (service layer, driver layer, and heterogeneous acceleration hardware) and feed them back to the upper layer; an operator calling module, used to call the relevant operators of the heterogeneous devices through the driver layer to complete computing tasks, such as a modular exponentiation operator, a modular multiplication operator, a homomorphic encryption operator, a homomorphic decryption operator, a secret addition operator, and so on; a device calling module, used to call a specified heterogeneous device to complete a computation; a device monitoring module, used to monitor the running state of a device, including power consumption, memory usage, thread information, program execution progress, error information, and so on, and to hand any abnormality to the exception handling module; and a device return module, used to return the computation results and status of a device. It should be understood that the names and functions of the service modules of the service layer 124 mentioned above may be replaced, adjusted (including added, removed, or recombined), and merged or split according to actual needs, as long as they can be used to implement the functions of the API layer 122, call the underlying hardware, and keep the heterogeneous acceleration engine 120 running normally. Each of the application interfaces included in the API layer 122 corresponds to at least one of the service modules included in the service layer 124.
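A compressed sketch of how a few of these service modules could fit together is shown below; the module boundaries follow the list above, but the class and method names (including the driver object's methods) are assumptions made for illustration.

# Hypothetical wiring of a few service-layer modules: the data transmission path
# moves a batch to the device, the operator calling module runs it through the
# driver layer, the device monitoring module watches for abnormalities, and the
# exception handling module feeds errors back to the upper layer.
class DeviceMonitoringModule:
    def check(self, device):
        stats = device.query_status()              # power, memory, progress, errors
        if stats.get("error"):
            raise RuntimeError(f"device error: {stats['error']}")

class ExceptionHandlingModule:
    def guard(self, fn, *args):
        try:
            return fn(*args)
        except Exception as exc:                   # capture driver/hardware exceptions
            return {"status": "error", "detail": str(exc)}   # feed back to the upper layer

class OperatorCallingModule:
    def __init__(self, driver, monitor, exceptions):
        self.driver, self.monitor, self.exceptions = driver, monitor, exceptions
    def run(self, device, operator, payload):
        def _do():
            self.driver.transfer_to_device(device, payload)   # data transmission module
            out = self.driver.launch(device, operator)        # driver-layer operator call
            self.monitor.check(device)                        # device monitoring module
            return self.driver.transfer_to_host(device, out)  # toward the device return module
        return self.exceptions.guard(_do)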
As mentioned above, the complex and varied requirements that the federated learning software system 100 may present, such as the different logic flows, software layer operations, algorithms and models that various federated learning frameworks may adopt for performing a specific federated learning calculation task, are ultimately embodied as the instructions for calling the API 110 that the federated learning software system 100 sends to the heterogeneous acceleration engine 120, that is, as instructions that selectively call, at the API layer 122 of the heterogeneous acceleration engine 120, the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme (which may also be referred to as a specific federated learning calculation task). Accordingly, the execution of the specific federated learning scheme is performed through the service modules corresponding to the application interfaces that are selectively called by the federated learning software system 100 according to the specific federated learning scheme.
Continuing with FIG. 1, the driver layer 126 includes the code and programs for managing, monitoring, and controlling the heterogeneous acceleration hardware. For example, if the heterogeneous acceleration hardware is a GPU, the driver layer content for the GPU may be implemented based on Compute Unified Device Architecture (CUDA) programming and includes the relevant CUDA code, such as CUDA kernel code, CUDA device code, the CUDA Driver API, the CUDA Runtime API, and the like. In addition, the driver layer content for the GPU also includes CUDA compute libraries or other files. For another example, if the heterogeneous acceleration hardware is an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA), the driver layer content for the ASIC or the FPGA includes modules for device initialization, device reset, device read/write invocation, task management, device monitoring, and the like.
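As a hedged illustration only (the wrapper below and the vendor_lib calls are hypothetical placeholders, not real CUDA or vendor driver APIs), the driver layer as seen by the service layer might be wrapped as:

    # Hypothetical driver-layer wrapper; the underlying calls stand in for vendor-specific drivers.
    class DriverLayer:
        def __init__(self, vendor_lib):
            self._lib = vendor_lib          # e.g. a handle to the device driver library (assumed)

        def init_device(self, device_id: int):
            return self._lib.device_init(device_id)

        def reset_device(self, device_id: int):
            return self._lib.device_reset(device_id)

        def launch_operator(self, op_name: str, args, stream=None):
            # For a GPU this would dispatch work through the CUDA runtime/driver stack;
            # for an FPGA/ASIC it would enqueue a task on the device's task management module.
            return self._lib.submit_task(op_name, args, stream)

        def read_status(self, device_id: int):
            return self._lib.query_device(device_id)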
With continued reference to fig. 1, the federated learning hardware system 140 includes various heterogeneous acceleration hardware. Heterogeneous acceleration hardware is understood here as a device that can be used to perform a specific federated learning calculation task (also referred to as a specific federated learning scheme, a specific federated learning implementation, or a specific federated learning task), and is defined relative to the main processor used by the federated learning software system 100. The different logic flows, software layer operations, algorithms, models, and the like that the federated learning software system 100 may employ to implement the various federated learning frameworks for performing specific federated learning computing tasks are considered to depend on the support of the main processor; the main processor can be understood as the processor on which the federated learning software system 100 runs. However, the main processor is not well suited to large-scale, high-intensity parallel operations, so in practice a heterogeneous architecture (also referred to as a heterogeneous system) in which the main processor is paired with heterogeneous acceleration hardware is often adopted: the operations of the federated learning software system 100 are completed by the main processor (that is, the execution flow of any federated learning calculation task, viewed from the perspective of heterogeneous acceleration, is decomposed into operations on the business layer 102, the algorithm layer 104, the security protocol layer 106, and the calculation layer 108), while the large-scale, high-intensity parallel operations, namely heterogeneous computing or heterogeneous acceleration, are performed by the heterogeneous acceleration hardware of the federated learning hardware system 140. Thus, the various heterogeneous acceleration hardware included in the federated learning hardware system 140 can be understood as any device that differs architecturally from the main processor and that can be used to perform a specific federated learning computing task. In some embodiments, the various heterogeneous acceleration hardware included in the federated learning hardware system 140 can be understood as devices for heterogeneous computing or heterogeneous operations, meaning that, relative to the main processor used by the federated learning software system 100, the system is composed of computing units with different instruction sets or different architectures. In other embodiments, the various heterogeneous acceleration hardware included in the federated learning hardware system 140 may differ from the main processor used by the federated learning software system 100 in aspects including, but not limited to, instruction set, instruction architecture, hardware architecture, control flow, data flow, reconfigurability (in particular dynamic reconfigurability), and any other suitable metric or characteristic.
The present application is also intended to cover metrics or characteristics that fall within the accepted understanding of heterogeneous computing, heterogeneous systems, heterogeneous acceleration hardware, and related concepts, as well as processor technologies or integrated-circuit hardware architectures that appear after the filing or publication of this application.
The federated learning hardware system 140 typically includes two or more types of heterogeneous acceleration hardware. For example, the devices of the federated learning hardware system 140 include at least one ASIC, at least one FPGA, at least one GPU, and at least one TPU. Accordingly, the federated learning hardware system 140 can serve as a heterogeneous system composed of a variety of different instruction set architectures and different kinds of processors. The heterogeneous acceleration hardware that the federated learning hardware system 140 may include is not limited to CPUs, GPUs, ASICs, FPGAs, and tensor processing units (TPUs); any type of processor can be used as long as it can be used for heterogeneous computing or heterogeneous acceleration. The driver layer 126 of the heterogeneous acceleration engine 120 provides the code and programs for managing, monitoring, and controlling the heterogeneous acceleration hardware of the federated learning hardware system 140. For example, when the heterogeneous acceleration hardware included in the federated learning hardware system 140 includes a GPU, the driver layer 126 provides the corresponding GPU driver layer content, such as CUDA code. The heterogeneous acceleration engine 120 sends information to the federated learning hardware system 140, including instructions for calling the device 130. In turn, the federated learning hardware system 140 sends information to the heterogeneous acceleration engine 120, including returning computation results 132. As mentioned above, the complex and varied requirements that the federated learning software system 100 may present, such as the different logic flows, software layer operations, algorithms, models, and the like that various federated learning frameworks may adopt for performing a specific federated learning calculation task, are ultimately embodied as the instructions for calling the API 110 that the federated learning software system 100 sends to the heterogeneous acceleration engine 120, that is, as instructions that selectively call, at the API layer 122 of the heterogeneous acceleration engine 120, the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme (which may also be referred to as a specific federated learning calculation task); and the execution of the specific federated learning scheme is performed through the service modules corresponding to the application interfaces that are selectively called by the federated learning software system 100 according to the specific federated learning scheme. The multiple service modules include a device calling module for calling a device of the federated learning hardware system 140 and an operator calling module for calling an operator of a device of the federated learning hardware system 140.
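Purely for illustration (the device identifiers and fields below are hypothetical), the devices of such a hardware system might be described to the engine by a simple registry:

    # Illustrative registry of the heterogeneous devices a federated learning hardware system might expose.
    HARDWARE_REGISTRY = [
        {"device_id": 0, "kind": "GPU",  "driver": "cuda",   "memory_gb": 32},
        {"device_id": 1, "kind": "FPGA", "driver": "vendor", "memory_gb": 16},
        {"device_id": 2, "kind": "ASIC", "driver": "vendor", "memory_gb": 8},
        {"device_id": 3, "kind": "TPU",  "driver": "vendor", "memory_gb": 16},
    ]

    def devices_of_kind(kind: str):
        return [d for d in HARDWARE_REGISTRY if d["kind"] == kind]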
Therefore, the hardware-executable operators running at the operator layer 108 are adapted to call the multiple application interfaces included in the API layer 122 of the heterogeneous acceleration engine 120, so that the execution of the specific federated learning scheme proceeds through the service modules corresponding to the application interfaces selectively called by the federated learning software system 100 according to the specific federated learning scheme, and ultimately the devices and operators of the federated learning hardware system 140 are called through the device calling module and the operator calling module among those service modules.
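Continuing the hypothetical sketches above, the call path from an application interface down to the device calling and operator calling modules might be organized roughly as follows (a sketch under those assumptions, not a definitive implementation):

    # Illustrative sketch of how the service layer might resolve an API call.
    class ServiceLayer:
        def __init__(self, device_mgmt, driver):
            self.device_mgmt = device_mgmt   # DeviceManagementModule sketched earlier
            self.driver = driver             # DriverLayer sketched earlier

        def run(self, op_name, *args):
            # device calling module: make sure a device is selected and ready
            dev = self.device_mgmt.current_device_id
            if dev < 0:
                self.device_mgmt.init_device(0)
                dev = 0
            # operator calling module: dispatch the operator through the driver layer
            task = self.driver.launch_operator(op_name, args)
            # device return module: hand the result (or a handle to it) back upward
            return task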
It should be understood that, in one possible implementation, the driver layer 126 may be merged into the service layer 124, i.e., the service layer 124 integrates the code and programs of the driver layer 126 for managing, monitoring, and controlling the heterogeneous acceleration hardware. In another possible implementation, the service layer 124 may be merged with the API layer 122, i.e., the respective functions of the service layer 124 and the API layer 122 are integrated. In other words, the internal layering of the heterogeneous acceleration engine 120 may be replaced by a single layer, and that layer can still provide the corresponding operations with reference to the respective functions of the API layer 122, the service layer 124, and the driver layer 126, including selectively calling the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme (which may also be referred to as a specific federated learning calculation task), with the execution of the specific federated learning scheme performed through the service modules corresponding to the application interfaces selectively called by the federated learning software system 100 according to the specific federated learning scheme. In some embodiments, the driver layer 126 may not be part of the heterogeneous acceleration engine 120, but may instead be provided separately or integrated into the federated learning hardware system 140, so that the driver layer 126 is part of the federated learning hardware system 140.
With continued reference to fig. 1, the heterogeneous acceleration engine 120 as a whole is connected between the federated learning software system 100 and the federated learning hardware system 140. The main interactions between the federated learning software system 100, the heterogeneous acceleration engine 120, and the federated learning hardware system 140 are shown as arrows in FIG. 1. Specifically, the federated learning software system 100 calls the heterogeneous acceleration engine 120, for example by calling the API 110; the heterogeneous acceleration engine 120 manages, monitors, and controls the federated learning hardware system 140 and transmits instructions, parameters, and data to it, for example by calling the device 130; during or after the computation, the federated learning hardware system 140 returns return values, computation results, and error information to the heterogeneous acceleration engine 120; the heterogeneous acceleration engine 120 processes the information returned by the federated learning hardware system 140 and then returns it to the federated learning software system 100. It should be understood that in some embodiments, information returned by the federated learning hardware system 140 may not pass through the heterogeneous acceleration engine 120 but may be returned directly to the federated learning software system 100.
It should be appreciated that, although the above-described embodiments refer to the complex and varied requirements that the federated learning software system 100 may present, such as the different logic flows, software layer operations, algorithms, models, and the like that various federated learning frameworks may employ for performing specific federated learning computational tasks, the heterogeneous acceleration engine 120 referred to in the embodiments of the present application is not limited to the context of federated learning, and may also be used in other, non-federated-learning areas within the scope of machine learning.
As described above, in the context of federated learning, the various federated learning frameworks, federated learning algorithms, and privacy computing technologies each use the same, similar, or different logic flows, software layer operations, algorithms, and models for implementing a federated learning calculation task (also referred to as a federated learning scheme, federated learning implementation, or federated learning task), and may also be based on the same or different encryption modes, cryptosystems, security protocols, and the like, thereby presenting complex and variable requirements in terms of algorithms, operators, models, and so on. These requirements, together with operations such as modular exponentiation that are common in federated learning algorithms and the large-integer-bit-width data and parameters involved in ciphertext computation, place high demands on computing power and computing performance. A processor intended for general-purpose computing tasks cannot deliver sufficient performance, so such workloads are well suited to acceleration by heterogeneous acceleration hardware; existing heterogeneous acceleration hardware, however, has diverse architectures and is difficult to integrate so as to improve overall computing performance. Therefore, a heterogeneous acceleration engine and method for federated learning are needed that, on the one hand, can meet the complex and varied requirements of federated learning frameworks and federated learning algorithms and, on the other hand, can exploit the performance of a heterogeneous system.
In one possible implementation, the heterogeneous acceleration engine 120 shown in FIG. 1 is applied to federated learning. The heterogeneous acceleration engine 120 is connected between the federated learning software system 100 and the federated learning hardware system 140. The heterogeneous acceleration engine 120 includes an API layer 122 that includes multiple application interfaces, which respectively correspond to heterogeneous-computation-related functions. The federated learning software system 100 selectively calls the application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme. The heterogeneous acceleration engine 120 also includes a service layer 124 that includes multiple service modules. Each of the application interfaces corresponds to at least one of the service modules. The execution of the specific federated learning scheme is performed through the service modules corresponding to the application interfaces selectively called by the federated learning software system 100 according to the specific federated learning scheme. On the one hand, the complex and varied requirements of federated learning frameworks and federated learning algorithms, such as the requirements that the federated learning software system 100 may present, and further, for example, the different logic flows, software layer operations, algorithms, models, and the like that various federated learning frameworks may adopt for performing specific federated learning calculation tasks, are ultimately embodied as the instructions for calling the API 110 that the federated learning software system 100 sends to the heterogeneous acceleration engine 120, that is, as instructions that selectively call, at the API layer 122 of the heterogeneous acceleration engine 120, the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme (which may also be referred to as a specific federated learning calculation task). Accordingly, the execution of the specific federated learning scheme is performed through the service modules corresponding to the application interfaces selectively called by the federated learning software system 100 according to the specific federated learning scheme. On the other hand, to handle the various heterogeneous acceleration hardware included in the federated learning hardware system 140, the multiple service modules include a device calling module for calling a device of the federated learning hardware system and an operator calling module for calling an operator of a device of the federated learning hardware system. The calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system 140. The performance test results of the federated learning hardware system 140 are determined by performance tests performed on the federated learning hardware system 140 in advance according to multiple preset performance parameters.
The performance parameter of the specific federated learning scheme corresponds to at least one of the multiple preset performance parameters. In this embodiment, on the one hand, the heterogeneous acceleration engine 120 converts the complex and variable requirements of federated learning frameworks and federated learning algorithms (for example, the requirements that the federated learning software system 100 may present) into the selective calling of the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme, and the execution of the specific federated learning scheme then proceeds through the service modules corresponding to the selectively called application interfaces. This means that the federated learning software system 100 only needs to call the application interfaces of the API layer 122, without touching the deeper service layer 124 and without involving the further interaction between the heterogeneous acceleration engine 120 and the federated learning hardware system 140; the various federated learning frameworks or federated learning algorithms represented by the federated learning software system 100 can therefore be flexibly accommodated, and the correspondence between the application interfaces of the API layer 122 and the service modules of the service layer 124 inside the heterogeneous acceleration engine 120 can be optimized without affecting the interaction between the heterogeneous acceleration engine 120 and the federated learning software system 100. On the other hand, in this embodiment, the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system 140; the performance test results are determined by performance tests performed on the federated learning hardware system 140 in advance according to multiple preset performance parameters, and the performance parameter of the specific federated learning scheme corresponds to at least one of those preset performance parameters. This means that the federated learning hardware system 140 is only exposed to calls to devices and operators from the service layer 124 of the heterogeneous acceleration engine 120, without involving the API layer 122 and without involving the interaction between the heterogeneous acceleration engine 120 and the federated learning software system 100; the various heterogeneous acceleration hardware that the federated learning hardware system 140 may include can therefore be flexibly accommodated, and the memory copy and data transmission overhead that would be incurred by returning computation results directly from the federated learning hardware system 140 to the federated learning software system 100 can be saved.
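A minimal sketch of this idea, assuming a hypothetical benchmark table keyed by preset performance parameters (the keys, values, and device ids below are illustrative only), might be:

    # Illustrative dispatch based on pre-measured performance tests.
    BENCHMARKS = {
        # (operator mode, data bit width, data volume magnitude) -> best device id
        ("paillier_encrypt", 2048, "1e6"): 0,   # e.g. a GPU wins for large batches (assumed)
        ("paillier_encrypt", 2048, "1e3"): 1,   # e.g. an FPGA wins for small batches (assumed)
        ("modexp",           1024, "1e6"): 2,
    }

    def select_device(scheme_params, default_device=0):
        """scheme_params: performance parameters of the specific federated learning scheme."""
        key = (scheme_params["operator_mode"],
               scheme_params["bit_width"],
               scheme_params["data_magnitude"])
        return BENCHMARKS.get(key, default_device)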
The calling of the heterogeneous acceleration hardware of the federated learning hardware system 140 is determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system 140; the performance tests are performed according to multiple preset performance parameters, and the performance parameter of the specific federated learning scheme corresponds to at least one of those preset performance parameters. In this way, the heterogeneous acceleration engine 120 organically combines the requirements of the federated learning software system 100 with the heterogeneous acceleration hardware of the federated learning hardware system 140, allowing the heterogeneous acceleration hardware resources of the federated learning hardware system 140 to be fully utilized and providing a matched calling configuration of the heterogeneous acceleration hardware for the operation of the specific federated learning scheme; because this calling configuration is based on performance tests carried out in advance, it is highly reliable, and it improves computing performance and training efficiency while reducing energy consumption.
Fig. 2 shows a block diagram of a heterogeneous acceleration engine for federated learning according to an embodiment of the present application. As shown in fig. 2, the heterogeneous acceleration engine 220 includes an API layer 222 and a service layer 224. For details of the heterogeneous acceleration engine 220, the API layer 222, and the service layer 224 shown in fig. 2, reference may be made to the heterogeneous acceleration engine 120, the API layer 122, and the service layer 124 shown in fig. 1, respectively. The heterogeneous acceleration engine 220 shown in fig. 2 is used for federated learning and is connected between a federated learning software system and a federated learning hardware system. The heterogeneous acceleration engine 220 includes: multiple application interfaces, which respectively correspond to heterogeneous-computation-related functions and which the federated learning software system selectively calls, together with the corresponding functions, according to a specific federated learning scheme; and multiple service modules, wherein each of the multiple application interfaces corresponds to at least one of the multiple service modules. The multiple application interfaces correspond to, or can be understood as part of, the API layer 222. The multiple service modules correspond to, or can be understood as part of, the service layer 224. The execution of the specific federated learning scheme is performed through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to the specific federated learning scheme. The multiple service modules include a device calling module (not shown) for calling a device of the federated learning hardware system and an operator calling module (not shown) for calling an operator of a device of the federated learning hardware system. The calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system. The performance test results of the federated learning hardware system are determined by performance tests performed on the federated learning hardware system in advance according to multiple preset performance parameters. The performance parameter of the specific federated learning scheme corresponds to at least one of the multiple preset performance parameters.
In order to implement a federated learning calculation task (also referred to as a federated learning scheme, federated learning implementation, or federated learning task), the various federated learning frameworks, federated learning algorithms, and privacy computing technologies each adopt the same, similar, or different logic flows, software layer operations, algorithms, models, and the like, and may also be based on the same or different encryption modes, cryptosystems, security protocols, and the like, thereby presenting complex and variable requirements in terms of algorithms, operators, models, and so on. These requirements, together with operations such as modular exponentiation that are common in federated learning algorithms and the large-integer-bit-width data and parameters involved in ciphertext computation, place high demands on computing power and computing performance. A processor intended for general-purpose computing tasks cannot deliver sufficient performance, so such workloads are well suited to acceleration by heterogeneous acceleration hardware; existing heterogeneous acceleration hardware, however, has diverse architectures and is difficult to integrate so as to improve overall computing performance. The heterogeneous acceleration engine 220 shown in fig. 2 converts the complex and variable requirements of federated learning frameworks and federated learning algorithms into selective calling of the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme, and the execution of the specific federated learning scheme then proceeds through the service modules corresponding to the selectively called application interfaces. This means that the federated learning software system only needs to call the application interfaces, without touching the deeper service modules and without involving further interaction between the heterogeneous acceleration engine 220 and the federated learning hardware system; the various federated learning frameworks or federated learning algorithms represented by the federated learning software system can therefore be flexibly accommodated, and the correspondence between the application interfaces and the service modules inside the heterogeneous acceleration engine 220 can be optimized without affecting the interaction between the heterogeneous acceleration engine 220 and the federated learning software system.
The heterogeneous acceleration engine 220 shown in fig. 2 further ensures that the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system; the performance test results are determined by performance tests performed on the federated learning hardware system in advance according to multiple preset performance parameters, and the performance parameter of the specific federated learning scheme corresponds to at least one of those preset performance parameters. This means that the federated learning hardware system only faces calls to devices and operators from the heterogeneous acceleration engine 220, without involving the interaction between the heterogeneous acceleration engine 220 and the federated learning software system; the various heterogeneous acceleration hardware that the federated learning hardware system may include can therefore be flexibly accommodated, and the memory copy and data transmission overhead that would be incurred by returning computation results directly from the federated learning hardware system to the federated learning software system can be saved. Moreover, the calling of the heterogeneous acceleration hardware of the federated learning hardware system is determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system; the performance tests are carried out according to multiple preset performance parameters, and the performance parameters of the specific federated learning scheme correspond to at least one of those preset performance parameters. In this way, the heterogeneous acceleration engine 220 organically combines the requirements of the federated learning software system with the heterogeneous acceleration hardware of the federated learning hardware system, allowing the heterogeneous acceleration hardware resources of the federated learning hardware system to be fully utilized and providing a matched calling configuration of the heterogeneous acceleration hardware for the operation of the specific federated learning scheme; because this calling configuration is based on performance tests carried out in advance, it is highly reliable, and it improves computing performance and training efficiency while reducing energy consumption.
As shown in FIG. 2, the multiple application interfaces of the API layer 222 include various ciphertext operation interfaces, including: a homomorphic encryption interface 230, whose corresponding heterogeneous-computation-related function is a homomorphic encryption function; a homomorphic decryption interface 231, whose corresponding function is a homomorphic decryption function; a ciphertext addition interface 232, whose corresponding function is a ciphertext addition function; a ciphertext multiplication interface 233, whose corresponding function is a ciphertext multiplication function; a matrix transposition interface 234, whose corresponding function is a matrix transposition function; a ciphertext matrix multiplication interface 235, whose corresponding function is a ciphertext matrix multiplication function; a ciphertext summation interface 236, whose corresponding function is a ciphertext summation function; a matrix slicing interface 237, whose corresponding function is a matrix slicing function; and a matrix concatenation interface 238, whose corresponding function is a matrix concatenation function. The multiple application interfaces further include a plaintext operation interface 240 and a tensor operation interface 241; the heterogeneous-computation-related function corresponding to the plaintext operation interface 240 is a plaintext operation function, and that corresponding to the tensor operation interface 241 is a tensor operation function. The multiple application interfaces also include a direct management interface 250, and the federated learning software system sends instructions to the device calling module, the data transmission module 270, the data storage module 271, or the data format conversion module 272 by calling the direct management interface 250.
As shown in FIG. 2, the multiple service modules of the service layer 224 include: a device management module 260 for managing the devices of the federated learning hardware system; a memory management module 261 for managing the respective memories of the heterogeneous acceleration engine and the federated learning hardware system; an exception handling module 262 for capturing and responding to exception signals from the heterogeneous acceleration engine and the federated learning hardware system; and a device monitoring module 263 for monitoring the devices of the federated learning hardware system. The multiple service modules of the service layer 224 also include: a data transmission module 270 for controlling data interaction between the heterogeneous acceleration engine and the federated learning hardware system; a data storage module 271 for managing the respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system; a data format conversion module 272 for converting between the data format of the federated learning software system and the data format of the federated learning hardware system; and a device return module 273 for returning computation results of the federated learning hardware system to the federated learning software system. These service modules of the service layer 224, together with a device calling module (not shown) for calling devices of the federated learning hardware system and an operator calling module (not shown) for calling operators of devices of the federated learning hardware system, enable the execution of the specific federated learning scheme to proceed through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to the specific federated learning scheme.
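As an illustrative sketch only (class and field names are hypothetical and continue the earlier sketches), the device monitoring module and the exception handling module might cooperate as follows:

    # Illustrative cooperation between device monitoring and exception handling.
    class ExceptionHandlingModule:
        def handle(self, source, error):
            # capture an exception from the service layer, driver layer, or hardware
            # and feed it back to the upper layer in a uniform form
            return {"source": source, "error": str(error)}

    class DeviceMonitoringModule:
        def __init__(self, driver, exception_handler):
            self.driver = driver
            self.exceptions = exception_handler

        def poll(self, device_id):
            status = self.driver.read_status(device_id)   # power, memory use, threads, progress, ...
            if status.get("error"):
                return self.exceptions.handle("device", status["error"])
            return status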
In one possible implementation, the multiple preset performance parameters include an operator mode, and the performance test result of the federated learning hardware system is determined by performing a performance test on the federated learning hardware system in advance according to the multiple preset performance parameters, including: the performance test result of the federated learning hardware system is determined by performing performance tests on the federated learning hardware system in advance under different operator modes. Performance testing is thus carried out in advance for different operator modes. In some embodiments, the performance parameter of the specific federated learning scheme is an operator mode, and the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: during execution of the specific federated learning scheme, the calls to the device calling module and the operator calling module are determined according to the operator mode of the specific federated learning scheme and the performance test result of the federated learning hardware system. Thus, based on the operator mode of the specific federated learning scheme, a matched calling configuration of the heterogeneous acceleration hardware is provided for its operation, improving computing performance and training efficiency while reducing energy consumption.
In a possible implementation manner, the plurality of preset performance parameters include data bit width, and the performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the plurality of preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data bit widths. Therefore, performance testing according to different data bit widths is achieved. In some embodiments, the performance parameter of the specific federated learning scheme is a data bit width, and the invoking of the device invocation module and the operator invocation module during the execution of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data bit width of the specific federated learning scheme and the performance test result of the federated learning hardware system. Therefore, the data bit width based on the specific federated learning scheme is used for providing the matched calling configuration of the heterogeneous acceleration hardware for the operation of the specific federated learning scheme, and the calculation performance and the training efficiency are improved while the energy consumption is reduced.
In a possible embodiment, the preset performance parameters include data volume magnitude, and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to the preset performance parameters, where the performance test result includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data volume levels. Therefore, performance testing is performed in advance according to different data volume levels. In some embodiments, the performance parameter of the specific federated learning scheme is a data volume level, and the invoking of the device invocation module and the operator invocation module during the execution of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, including: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data volume magnitude of the specific federated learning scheme and the performance test result of the federated learning hardware system. Therefore, the data volume magnitude based on the specific federated learning scheme is used for providing the matched calling configuration of the heterogeneous acceleration hardware for the operation of the specific federated learning scheme, and the calculation performance and the training efficiency are improved while the energy consumption is reduced.
In one possible implementation, the multiple preset performance parameters include an operator mode, a data bit width, and a data volume magnitude, and the performance test result of the federated learning hardware system is determined by performing a performance test on the federated learning hardware system in advance according to the multiple preset performance parameters, including: the performance test result of the federated learning hardware system is determined by performing performance tests on the federated learning hardware system in advance under different operator modes, different data bit widths, or different data volume magnitudes. Performance testing is thus carried out in advance for different operator modes, different data bit widths, or different data volume magnitudes. In some embodiments, the performance parameters of the specific federated learning scheme include at least an operator mode, a data bit width, or a data volume magnitude, and the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system, including: during execution of the specific federated learning scheme, the calls to the device calling module and the operator calling module are determined according to the operator mode, data bit width, or data volume magnitude of the specific federated learning scheme and the performance test results of the federated learning hardware system. Thus, based on the operator mode, data bit width, or data volume magnitude of the specific federated learning scheme, a matched calling configuration of the heterogeneous acceleration hardware is provided for its operation, improving computing performance and training efficiency while reducing energy consumption.
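A minimal sketch, assuming a hypothetical benchmark driver run_operator and illustrative parameter values, of how such performance test results could be produced in advance by sweeping operator mode, data bit width, and data volume magnitude:

    # Illustrative offline profiling across the preset performance parameters.
    import itertools
    import time

    def profile_hardware(devices, run_operator,
                         operator_modes=("paillier_encrypt", "modexp"),
                         bit_widths=(1024, 2048),
                         magnitudes=("1e3", "1e6")):
        results = {}
        for mode, width, mag in itertools.product(operator_modes, bit_widths, magnitudes):
            timings = {}
            for dev in devices:
                start = time.perf_counter()
                run_operator(dev, mode, width, mag)     # assumed benchmark driver
                timings[dev] = time.perf_counter() - start
            results[(mode, width, mag)] = min(timings, key=timings.get)
        return results   # maps each parameter combination to the fastest device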
In one possible embodiment, the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme further take into account a user-specified device or user preferences. In this way, the matched calling configuration of the heterogeneous acceleration hardware provided for the operation of the specific federated learning scheme also incorporates user preferences and user specifications.
In one possible implementation, the calls to the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme are determined based on configuration information of the specific federated learning scheme, where the configuration information indicates whether a specific calculation result of the federated learning hardware system is reusable, whether it is produced by a matrix operation, and whether it is required by the federated learning software system. Optimization based on the configuration information of the specific federated learning scheme is thereby facilitated. In some embodiments, determining these calls based on the configuration information of the specific federated learning scheme includes: when a specific calculation result of the federated learning hardware system is reusable, or is produced by a matrix operation, and is not required by the federated learning software system, the specific calculation result is not returned to the federated learning software system; when the specific calculation result is required by the federated learning software system, it is returned to the federated learning software system, and if its data format differs from that of the federated learning software system, it is converted into the data format of the federated learning software system. For a specific federated learning scheme, the data format suited to the federated learning software system generally differs from the data format suited to the federated learning hardware system, so transmitting data from the federated learning software system to the federated learning hardware system involves additional memory copies and data format conversions, as does transmitting data from the federated learning hardware system back to the federated learning software system. The improvement provided by this embodiment therefore avoids returning calculation results whenever possible, so that results can be kept temporarily on the heterogeneous acceleration hardware for subsequent calls. In particular, when a specific calculation result of the federated learning hardware system is reusable or produced by a matrix operation, returning it should be avoided unless the federated learning software system actually requires it, because matrix operations generally involve a large number of memory operations.
In addition, data format conversion operations should be avoided as much as possible; therefore, when a specific calculation result does need to be returned to the federated learning software system, it is further determined whether the data format of the result differs from that of the federated learning software system, and the data format conversion operation is performed only when the formats differ. This helps save the memory copy and data transmission overhead of returning calculation results directly from the federated learning hardware system to the federated learning software system, as well as the overhead of data format conversion operations.
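As a hedged sketch of this return policy (the configuration fields, the result object's format attribute, and the convert helper are hypothetical assumptions), the decision logic might look like:

    # Illustrative return policy for a specific calculation result.
    def maybe_return_result(result, config, software_format, convert):
        """config: {'reusable': bool, 'matrix_based': bool, 'needed_by_software': bool}."""
        if (config["reusable"] or config["matrix_based"]) and not config["needed_by_software"]:
            return None                      # keep the result cached on the device side
        if config["needed_by_software"]:
            if result.format != software_format:
                result = convert(result, software_format)   # data format conversion module
            return result                    # device return module sends it back
        return None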
In one possible embodiment, the calls to the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme follow a preset policy, such that a specific calculation result of the federated learning hardware system is returned to the federated learning software system only when a preset condition of the policy is satisfied, the preset condition being: the specific calculation result is required by the federated learning software system, or the subsequent operations related to the specific calculation result cannot be realized through the heterogeneous-computation-related functions corresponding to the multiple application interfaces. For a specific federated learning scheme, the data format suited to the federated learning software system generally differs from the data format suited to the federated learning hardware system, so transmitting data in either direction between them involves additional memory copies and data format conversions. The improvement provided by this embodiment therefore avoids returning calculation results whenever possible: a specific calculation result is returned only when the federated learning software system requires it, or when the subsequent operations related to it cannot be realized through the heterogeneous-computation-related functions corresponding to the multiple application interfaces.
It should be understood that the various heterogeneous acceleration hardware included in the federated learning hardware system can be understood as any device that differs architecturally from the main processor and that can be used to perform a specific federated learning computing task. A federated learning hardware system typically includes two or more types of heterogeneous acceleration hardware. For example, the devices of a federated learning hardware system include at least one ASIC, at least one FPGA, at least one GPU, and at least one TPU.
Fig. 3 shows a flowchart of a heterogeneous acceleration method for federal learning according to an embodiment of the present application. The heterogeneous acceleration method 300 is implemented by a heterogeneous acceleration engine connected between a federal learning software system and a federal learning hardware system, and the heterogeneous acceleration engine includes a plurality of application interfaces and a plurality of service modules, the plurality of application interfaces respectively correspond to heterogeneous computation related functions, and each application interface of the plurality of application interfaces corresponds to at least one service module of the plurality of service modules. As shown in fig. 3, the heterogeneous acceleration method 300 includes the following steps.
Step S302: a specific federated learning scheme is obtained.
Step S304: the federated learning software system selectively calls the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to the specific federated learning scheme.
Step S306: the specific federated learning scheme is executed through the service modules corresponding to the application interfaces selectively called by the federated learning software system according to the specific federated learning scheme.
Referring to step S302 to step S306, the multiple service modules include a device calling module for calling a device of the federal learning hardware system and an operator calling module for calling an operator of the device of the federal learning hardware system. And in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the performance parameters of the specific federated learning scheme and the performance test result of the federated learning hardware system. And the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to a plurality of preset performance parameters, and the performance parameter of the specific federal learning scheme corresponds to at least one preset performance parameter in the plurality of preset performance parameters.
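Putting the hypothetical sketches above together, the flow of steps S302 to S306 might be outlined as follows (illustrative only; the scheme dictionary and its fields are assumptions, not part of the method as claimed):

    # Illustrative end-to-end sketch of the heterogeneous acceleration method (steps S302–S306).
    def accelerate(scheme, api, service, benchmarks):
        # S302: obtain the specific federated learning scheme
        params = scheme["performance_params"]

        # Choose a device from the pre-measured performance test results
        device_id = benchmarks.get((params["operator_mode"],
                                    params["bit_width"],
                                    params["data_magnitude"]), 0)
        service.device_mgmt.switch_device(device_id)

        # S304: the software system selectively calls the application interfaces;
        # S306: execution proceeds through the corresponding service modules,
        # down to the device calling and operator calling modules.
        outputs = [api.plain_op(op, *args) for op, args in scheme["operations"]]
        return outputs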
In order to implement a federated learning calculation task (also referred to as a federated learning scheme, federated learning implementation, or federated learning task), the various federated learning frameworks, federated learning algorithms, and privacy computing technologies each adopt the same, similar, or different logic flows, software layer operations, algorithms, models, and the like, and may also be based on the same or different encryption modes, cryptosystems, security protocols, and the like, thereby presenting complex and variable requirements in terms of algorithms, operators, models, and so on. These requirements, together with operations such as modular exponentiation that are common in federated learning algorithms and the large-integer-bit-width data and parameters involved in ciphertext computation, place high demands on computing power and computing performance. A processor intended for general-purpose computing tasks cannot deliver sufficient performance, so such workloads are well suited to acceleration by heterogeneous acceleration hardware; existing heterogeneous acceleration hardware, however, has diverse architectures and is difficult to integrate so as to improve overall computing performance. The heterogeneous acceleration method 300 shown in fig. 3 converts the complex and variable requirements of federated learning frameworks and federated learning algorithms into selective calling of the multiple application interfaces and the corresponding heterogeneous-computation-related functions according to a specific federated learning scheme, and the execution of the specific federated learning scheme then proceeds through the service modules corresponding to the selectively called application interfaces. This means that the federated learning software system only needs to call the application interfaces, without touching the deeper service modules and without involving further interaction between the heterogeneous acceleration engine and the federated learning hardware system; the various federated learning frameworks or federated learning algorithms represented by the federated learning software system can therefore be flexibly accommodated, and the correspondence between the application interfaces and the service modules inside the heterogeneous acceleration engine can be optimized without affecting the interaction between the heterogeneous acceleration engine and the federated learning software system.
The heterogeneous acceleration method 300 shown in fig. 3 further ensures that the calls to the device calling module and the operator calling module during execution of the specific federated learning scheme are determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system; the performance test results are determined by performance tests performed on the federated learning hardware system in advance according to multiple preset performance parameters, and the performance parameter of the specific federated learning scheme corresponds to at least one of those preset performance parameters. This means that the federated learning hardware system only faces calls to devices and operators from the heterogeneous acceleration engine, without involving the interaction between the heterogeneous acceleration engine and the federated learning software system; the various heterogeneous acceleration hardware that the federated learning hardware system may include can therefore be flexibly accommodated, and the memory copy and data transmission overhead that would be incurred by returning computation results directly from the federated learning hardware system to the federated learning software system can be saved. Moreover, the calling of the heterogeneous acceleration hardware of the federated learning hardware system is determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system; the performance tests are carried out according to multiple preset performance parameters, and the performance parameters of the specific federated learning scheme correspond to at least one of those preset performance parameters. In this way, the heterogeneous acceleration engine organically combines the requirements of the federated learning software system with the heterogeneous acceleration hardware of the federated learning hardware system, allowing the heterogeneous acceleration hardware resources of the federated learning hardware system to be fully utilized and providing a matched calling configuration of the heterogeneous acceleration hardware for the operation of the specific federated learning scheme; because this calling configuration is based on performance tests carried out in advance, it is highly reliable, and it improves computing performance and training efficiency while reducing energy consumption.
In a possible implementation manner, the plurality of preset performance parameters include an operator mode, a data bit width, and a data volume magnitude, and the performance test result of the federal learning hardware system is determined by performing a performance test on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes: and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes, different data bit widths or different data volume magnitudes. The performance parameters of the specific federated learning scheme at least include an operator mode, a data bit width or a data volume magnitude, and the invoking of the device invoking module and the operator invoking module in the execution process of the specific federated learning scheme is determined according to the performance parameters of the specific federated learning scheme and the performance test result of the federated learning hardware system, and includes: and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to an operator mode, data bit width or data volume magnitude of the specific federated learning scheme and a performance test result of the federated learning hardware system.
In one possible implementation, the plurality of service modules further include: a data transmission module for controlling data interaction between the heterogeneous acceleration engine and the federated learning hardware system; a data storage module for managing the respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system; a data format conversion module for performing data format conversion between the data format of the federated learning software system and the data format of the federated learning hardware system; and a device return module for returning calculation results of the federated learning hardware system to the federated learning software system. During execution of the specific federated learning scheme, the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module is determined based on configuration information of the specific federated learning scheme, where the configuration information indicates whether a specific calculation result of the federated learning hardware system is reusable, whether it is based on a matrix operation, and whether it is required by the federated learning software system. Specifically, when the specific calculation result is reusable or is based on a matrix operation, and is not required by the federated learning software system, it is not transmitted back to the federated learning software system; when the specific calculation result is required by the federated learning software system, it is transmitted back, and if its data format differs from that of the federated learning software system, it is first converted into the data format of the federated learning software system. In some embodiments, the invocation of these four modules during execution of the specific federated learning scheme follows a preset policy, so that a specific calculation result of the federated learning hardware system is transmitted back to the federated learning software system only when a preset condition of the preset policy is satisfied, where the preset condition includes: the specific calculation result is required by the federated learning software system, or subsequent operations related to the specific calculation result cannot be realized by the heterogeneous computation related functions respectively corresponding to the plurality of application interfaces.
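The configuration-driven return policy can be pictured with the following non-limiting Python sketch; the configuration field names (reusable, matrix_based, required_by_software), the DeviceResult structure, and the convert callback are assumptions made for illustration rather than the actual configuration format of the embodiments.

```python
from dataclasses import dataclass

@dataclass
class DeviceResult:
    data: bytes
    data_format: str  # e.g. the hardware system's native layout

def handle_result(result, config, software_format, convert):
    """Apply the configuration-driven return policy for one calculation result."""
    reusable_or_matrix = config["reusable"] or config["matrix_based"]
    if reusable_or_matrix and not config["required_by_software"]:
        # not transmitted back: the result stays on the engine/hardware side
        return None
    if config["required_by_software"]:
        # transmit back, converting the data format only when the formats differ
        if result.data_format != software_format:
            result = convert(result, software_format)
        return result
    return None  # nothing the software system needs
```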
The embodiments provided herein may be implemented in any one or combination of hardware, software, firmware, or solid state logic circuitry, and may be implemented in connection with signal processing, control, and/or application specific circuitry. Particular embodiments of the present application provide an apparatus or device that may include one or more processors (e.g., microprocessors, controllers, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), etc.) that process various computer-executable instructions to control the operation of the apparatus or device. Particular embodiments of the present application provide an apparatus or device that can include a system bus or data transfer system that couples the various components together. A system bus can include any of a variety of different bus structures or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. The devices or apparatuses provided in the embodiments of the present application may be provided separately, or may be part of a system, or may be part of other devices or apparatuses.
Particular embodiments provided herein may include or be combined with computer-readable storage media, such as one or more storage devices capable of providing non-transitory data storage. The computer-readable storage medium/storage device may be configured to store data, programs and/or instructions that, when executed by a processor of an apparatus or device provided by embodiments of the present application, cause the apparatus or device to perform operations associated therewith. The computer-readable storage medium/storage device may include one or more of the following features: volatile, non-volatile, dynamic, static, read/write, read-only, random access, sequential access, location addressability, file addressability, and content addressability. In one or more exemplary embodiments, the computer-readable storage medium/storage device may be integrated into a device or apparatus provided in the embodiments of the present application or belong to a common system. The computer-readable storage medium/storage device may include optical, semiconductor, and/or magnetic memory devices, etc., and may also include Random Access Memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a recordable and/or rewritable Compact Disc (CD), a Digital Versatile Disc (DVD), a mass storage media device, or any other form of suitable storage media.
The above is an implementation manner of the embodiments of the present application, and it should be noted that the steps in the method described in the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. It is to be understood that the embodiments of the present application and the structures shown in the drawings are not to be construed as particularly limiting the devices or systems concerned. In other embodiments of the present application, an apparatus or system may include more or fewer components than the specific embodiments and figures, or may combine certain components, or may separate certain components, or may have a different arrangement of components. Those skilled in the art will understand that various modifications and changes may be made in the arrangement, operation, and details of the methods and apparatus described in the specific embodiments without departing from the spirit and scope of the embodiments herein; without departing from the principles of embodiments of the present application, several improvements and modifications may be made, and such improvements and modifications are also considered to be within the scope of the present application.

Claims (26)

1. A heterogeneous acceleration engine for federal learning, the heterogeneous acceleration engine coupled between a federal learning software system and a federal learning hardware system, the heterogeneous acceleration engine comprising:
a plurality of application interfaces, wherein the plurality of application interfaces respectively correspond to heterogeneous computation related functions, and the federated learning software system selectively calls the plurality of application interfaces and the corresponding heterogeneous computation related functions according to a specific federated learning scheme; and
a plurality of service modules, wherein each application interface of the plurality of application interfaces corresponds to at least one service module of the plurality of service modules,
wherein the execution of the particular federated learning scheme is performed through a service module corresponding to an application interface that is selectively invoked by the federated learning software system according to the particular federated learning scheme,
wherein the plurality of service modules comprise a device calling module for calling the device of the federated learning hardware system and an operator calling module for calling the operator of the device of the federated learning hardware system,
wherein the invocation of the equipment invocation module and the operator invocation module during execution of the particular federated learning scheme is determined according to the performance parameters of the particular federated learning scheme and the performance test results of the federated learning hardware system,
the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to a plurality of preset performance parameters, and the performance parameter of the specific federal learning scheme corresponds to at least one preset performance parameter in the plurality of preset performance parameters.
2. The heterogeneous acceleration engine of claim 1, wherein the plurality of preset performance parameters include operator patterns, and the performance test result of the federal learning hardware system is determined by performing performance tests on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes:
and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes.
3. The heterogeneous acceleration engine of claim 2, wherein the performance parameter of the specific federated learning scheme is an operator pattern, and the invocation of the device invocation module and the operator invocation module during execution of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test results of the federated learning hardware system, and comprises:
and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the operator mode of the specific federated learning scheme and the performance test result of the federated learning hardware system.
4. The heterogeneous acceleration engine according to claim 1, wherein the plurality of preset performance parameters include data bit width, and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes:
and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data bit widths.
5. The heterogeneous acceleration engine of claim 4, wherein the performance parameter of the specific federated learning scheme is a data bit width, and the invocation of the device invocation module and the operator invocation module during the execution of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test result of the federated learning hardware system, and comprises:
and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data bit width of the specific federated learning scheme and the performance test result of the federated learning hardware system.
6. The heterogeneous acceleration engine of claim 1, wherein the plurality of preset performance parameters include data volume levels, and the performance test result of the federal learning hardware system is determined by performing performance tests on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes:
and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different data volume levels.
7. The heterogeneous acceleration engine of claim 6, wherein the performance parameter of the specific federated learning scheme is a data volume level, and the invocation of the device invocation module and the operator invocation module during execution of the specific federated learning scheme is determined according to the performance parameter of the specific federated learning scheme and the performance test results of the federated learning hardware system, comprising:
and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to the data volume magnitude of the specific federated learning scheme and the performance test result of the federated learning hardware system.
8. The heterogeneous acceleration engine of claim 1, wherein the plurality of preset performance parameters include operator patterns, data bit widths, and data volume levels, and the performance test result of the federal learning hardware system is determined by performing performance tests on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes:
and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes, different data bit widths or different data volume magnitudes.
9. The heterogeneous acceleration engine of claim 8, wherein the performance parameters of the specific federated learning scheme at least include operator patterns, data bit widths, or data volume levels, and the invocation of the device invocation module and the operator invocation module during the execution of the specific federated learning scheme is determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system, and includes:
and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to an operator mode, data bit width or data volume magnitude of the specific federated learning scheme and a performance test result of the federated learning hardware system.
10. The heterogeneous acceleration engine of claim 9, wherein the invocation of the device invocation module and the operator invocation module during execution of the particular federated learning scheme further references a user-specified device or a user preference.
11. The heterogeneous acceleration engine of claim 1, wherein the plurality of service modules further comprise:
a data transmission module for controlling data interaction between the heterogeneous acceleration engine and the federated learning hardware system;
a data storage module for managing respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system;
a data format conversion module for performing data format conversion operations between the data format of the federated learning software system and the data format of the federated learning hardware system; and
a device return module for returning calculation results of the federated learning hardware system to the federated learning software system.
12. The heterogeneous acceleration engine of claim 11, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the particular federated learning scheme is determined based on configuration information of the particular federated learning scheme, wherein the configuration information of the particular federated learning scheme indicates whether a particular computation result of the federated learning hardware system is reusable, whether it operates based on a matrix operation, and whether it is required by the federated learning software system.
13. The heterogeneous acceleration engine of claim 12, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the particular federated learning scheme is determined based on the configuration information of the particular federated learning scheme, comprising:
when a specific calculation result of the federated learning hardware system is reusable or is based on a matrix operation, and is not required by the federated learning software system, the specific calculation result is not transmitted back to the federated learning software system; and
when the specific calculation result is required by the federated learning software system, the specific calculation result is transmitted back to the federated learning software system, and if the data format of the specific calculation result differs from the data format of the federated learning software system, the data format of the specific calculation result is converted into the data format of the federated learning software system.
14. The heterogeneous acceleration engine of claim 11, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme is based on a preset policy such that a specific computation result of the federated learning hardware system is transmitted back to the federated learning software system only when a preset condition of the preset policy is satisfied, wherein the preset condition includes:
the specific computation result is required by the federated learning software system, or subsequent operations related to the specific computation result cannot be realized by the heterogeneous computation related functions respectively corresponding to the plurality of application interfaces.
15. The heterogeneous acceleration engine of claim 1, the plurality of application interfaces comprising a plurality of cryptographic interfaces, the plurality of cryptographic interfaces comprising:
a homomorphic encryption interface, wherein the heterogeneous computation related function corresponding to the homomorphic encryption interface is a homomorphic encryption function;
a homomorphic decryption interface, wherein the heterogeneous computation related function corresponding to the homomorphic decryption interface is a homomorphic decryption function;
a secret addition interface, wherein the heterogeneous computation related function corresponding to the secret addition interface is a secret addition function;
a dense multiplication interface, wherein the heterogeneous computation related function corresponding to the dense multiplication interface is a dense multiplication function;
a matrix transposition interface, wherein the heterogeneous computation related function corresponding to the matrix transposition interface is a matrix transposition function;
a dense matrix multiplication interface, wherein the heterogeneous computation related function corresponding to the dense matrix multiplication interface is a dense matrix multiplication function;
a dense state summation interface, wherein the heterogeneous computation related function corresponding to the dense state summation interface is a dense state summation function;
a matrix slicing interface, wherein the heterogeneous computation related function corresponding to the matrix slicing interface is a matrix slicing function; and
a matrix splicing interface, wherein the heterogeneous computation related function corresponding to the matrix splicing interface is a matrix splicing function.
16. The heterogeneous acceleration engine of claim 1, the plurality of application interfaces comprising a plaintext operational interface and a tensor operational interface, the heterogeneous computation related functions corresponding to the plaintext operational interface being plaintext operational functions and the heterogeneous computation related functions corresponding to the tensor operational interface being tensor operational functions.
17. The heterogeneous acceleration engine of claim 11, wherein the plurality of application interfaces comprise a direct management interface, and the federated learning software system sends instructions to the device calling module, the data transmission module, the data storage module, or the data format conversion module by calling the direct management interface.
18. The heterogeneous acceleration engine of claim 1, wherein the plurality of service modules further comprise:
a device management module for managing devices of the federated learning hardware system;
a memory management module for managing respective memories of the heterogeneous acceleration engine and the federated learning hardware system;
an exception handling module for capturing and responding to exception signals from the heterogeneous acceleration engine and the federated learning hardware system; and
a device monitoring module for monitoring devices of the federated learning hardware system.
19. The heterogeneous acceleration engine of claim 1, wherein the devices of the federated learning hardware system include at least one ASIC, at least one FPGA, at least one GPU, and at least one TPU.
20. A heterogeneous acceleration method applied to federated learning is realized through a heterogeneous acceleration engine connected between a federated learning software system and a federated learning hardware system, and the heterogeneous acceleration engine comprises a plurality of application interfaces and a plurality of service modules, wherein the plurality of application interfaces respectively correspond to heterogeneous computation related functions, each application interface in the plurality of application interfaces corresponds to at least one service module in the plurality of service modules, and the heterogeneous acceleration method comprises the following steps:
obtaining a specific federal learning scheme;
the federal learning software system selectively calls the multiple application interfaces and corresponding heterogeneous calculation related functions according to the specific federal learning scheme; and
performing an execution process of the particular federated learning scheme through a service module corresponding to an application interface selectively invoked by the federated learning software system according to the particular federated learning scheme,
wherein the plurality of service modules comprise a device calling module for calling the device of the federated learning hardware system and an operator calling module for calling the operator of the device of the federated learning hardware system,
wherein the invocation of the equipment invocation module and the operator invocation module during execution of the particular federated learning scheme is determined according to the performance parameters of the particular federated learning scheme and the performance test results of the federated learning hardware system,
the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to a plurality of preset performance parameters, and the performance parameter of the specific federal learning scheme corresponds to at least one preset performance parameter in the plurality of preset performance parameters.
21. The heterogeneous acceleration method according to claim 20, wherein the plurality of preset performance parameters include operator modes, data bit widths, and data volume levels, and the performance test result of the federal learning hardware system is determined by performing performance tests on the federal learning hardware system in advance according to the plurality of preset performance parameters, and includes:
and the performance test result of the federal learning hardware system is determined by performing performance test on the federal learning hardware system in advance according to different operator modes, different data bit widths or different data volume magnitudes.
22. The heterogeneous acceleration method according to claim 21, wherein the performance parameters of the specific federated learning scheme at least include operator patterns, data bit widths, or data volume levels, and the invocation of the device invocation module and the operator invocation module during the execution of the specific federated learning scheme is determined according to the performance parameters of the specific federated learning scheme and the performance test results of the federated learning hardware system, and includes:
and in the execution process of the specific federated learning scheme, the calling of the equipment calling module and the operator calling module is determined according to an operator mode, data bit width or data volume magnitude of the specific federated learning scheme and a performance test result of the federated learning hardware system.
23. The heterogeneous acceleration method of claim 20, wherein the plurality of service modules further comprise:
a data transmission module for controlling data interaction between the heterogeneous acceleration engine and the federated learning hardware system;
a data storage module for managing respective data storage operations of the heterogeneous acceleration engine and the federated learning hardware system;
a data format conversion module for performing data format conversion operations between the data format of the federated learning software system and the data format of the federated learning hardware system; and
a device return module for returning calculation results of the federated learning hardware system to the federated learning software system.
24. The heterogeneous acceleration method according to claim 23, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme is determined based on configuration information of the specific federated learning scheme, wherein the configuration information of the specific federated learning scheme indicates whether a specific computation result of the federated learning hardware system is reusable, whether it operates based on a matrix operation, and whether it is required by the federated learning software system.
25. The heterogeneous acceleration method of claim 24, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme is determined based on the configuration information of the specific federated learning scheme, comprising:
when a specific calculation result of the federated learning hardware system is reusable or is based on a matrix operation, and is not required by the federated learning software system, the specific calculation result is not transmitted back to the federated learning software system; and
when the specific calculation result is required by the federated learning software system, the specific calculation result is transmitted back to the federated learning software system, and if the data format of the specific calculation result differs from the data format of the federated learning software system, the data format of the specific calculation result is converted into the data format of the federated learning software system.
26. The heterogeneous acceleration method according to claim 23, wherein the invocation of the data transmission module, the data storage module, the data format conversion module, and the device return module during execution of the specific federated learning scheme is based on a preset policy such that a specific computation result of the federated learning hardware system is transmitted back to the federated learning software system only when a preset condition of the preset policy is satisfied, wherein the preset condition includes:
the specific computation result is required by the federated learning software system, or subsequent operations related to the specific computation result cannot be realized by the heterogeneous computation related functions respectively corresponding to the plurality of application interfaces.
CN202111370530.1A 2021-11-18 2021-11-18 Heterogeneous acceleration engine and method for federal learning Pending CN114064278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111370530.1A CN114064278A (en) 2021-11-18 2021-11-18 Heterogeneous acceleration engine and method for federal learning

Publications (1)

Publication Number Publication Date
CN114064278A true CN114064278A (en) 2022-02-18

Family

ID=80277901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111370530.1A Pending CN114064278A (en) 2021-11-18 2021-11-18 Heterogeneous acceleration engine and method for federal learning

Country Status (1)

Country Link
CN (1) CN114064278A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115242660A (en) * 2022-09-21 2022-10-25 之江实验室 Heterogeneous computing power federal system based on centralization, networking and execution method
CN115242660B (en) * 2022-09-21 2022-12-13 之江实验室 Heterogeneous computing power federal system based on centralization, networking and execution method
CN115686845A (en) * 2022-10-31 2023-02-03 中科驭数(北京)科技有限公司 Data calculation acceleration method, device, equipment and computer readable storage medium
CN115829064A (en) * 2023-02-17 2023-03-21 山东浪潮科学研究院有限公司 Method, device and equipment for accelerating federated learning and storage medium
CN116451757A (en) * 2023-06-19 2023-07-18 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116451757B (en) * 2023-06-19 2023-09-08 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116627893A (en) * 2023-07-18 2023-08-22 苏州浪潮智能科技有限公司 Acceleration engine configuration method, device, parallel number estimation system, device and medium
CN116627893B (en) * 2023-07-18 2023-11-03 苏州浪潮智能科技有限公司 Acceleration engine configuration method, device, parallel number estimation system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination