CN112000598B - Processor for federated learning, heterogeneous processing system and private data transmission method - Google Patents


Info

Publication number
CN112000598B
Authority
CN
China
Prior art keywords
data
processing
task
processing device
slave
Prior art date
Legal status
Active
Application number
CN202010664237.5A
Other languages
Chinese (zh)
Other versions
CN112000598A
Inventor
王亚玲
王玮
胡水海
Current Assignee
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd
Priority to CN202010664237.5A
Publication of CN112000598A
Application granted
Publication of CN112000598B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14: Handling requests for interconnection or transfer
    • G06F13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F13/38: Information transfer, e.g. on bus
    • G06F13/42: Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282: Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)

Abstract

Embodiments of the present description provide a heterogeneous processing system applicable to federated learning scenarios. The heterogeneous processing system comprises a master processing device and a slave processing device, each of which includes a PCIe high-speed interface module containing a DMA controller. Task processing source data is transferred between the master and slave processing devices in the DMA mode of the PCIe high-speed interface, while task configuration data is transferred in either DMA or PIO mode. Through the DMA controllers of the PCIe high-speed interface modules, the system opens a direct data transmission path between the master and slave processing devices, so that transfers require little intervention by the master processing device. This reduces both the frequency of interrupt handling and the workload of the master processing device, gives the system the ability to move large amounts of data quickly, and meets the real-time and high-speed data transmission requirements of federated learning.

Description

Processor for federated learning, heterogeneous processing system and private data transmission method
Technical Field
Embodiments of the present description relate generally to the field of privacy-preserving computing, and more particularly, to a processor, a heterogeneous processing system, and a private data transmission method.
Background
Federated Learning is a distributed machine learning technique based on homomorphic encryption, in which the participating parties jointly build a model without revealing their plaintext data. Each participant's own data never leaves its local environment; instead, a shared virtual model is built through parameter exchange under an encryption mechanism, i.e., without violating data privacy regulations, thereby protecting user privacy and data security. Homomorphic encryption involves mathematical computation on high-bit-width large integers, and the massive data generated during federated learning training puts enormous pressure on the data transmission system, greatly raising the demands on data transmission bandwidth.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide a processor, a heterogeneous processing system, and a private data transmission method. By utilizing the processor, the heterogeneous processing system and the data transmission method, efficient private data transmission can be realized.
According to an aspect of embodiments of the present specification, there is provided a first processor including: a first high-speed interface module including a first DMA controller, the first high-speed interface module being configured to receive task processing source data from an external processing device in DMA mode, to receive task configuration data from the external processing device in DMA or PIO mode, and to send task processing result data to the external processing device in DMA mode; a read/write control module configured to control read/write operations on data in the slave memory, to read task processing source data from the slave memory and distribute it to the computing module, and to store the computation result data produced by the computing module into the slave memory; and a computing module configured to perform task computation on the received task processing source data according to a predetermined encryption algorithm to obtain the task processing result data.
Optionally, in an example of the above aspect, the first processor further includes a register into which the task configuration data is stored, and the first DMA controller includes: the storage access module comprises an uplink module and a downlink module, wherein the uplink module is used for processing the received task processing source data and task configuration data, and the downlink module is used for processing the received task processing result data; the receiving engine is used for sending the task processing source data and/or the task configuration data received by the first high-speed interface module to the storage access module; and the sending engine is used for sending the task processing result data received in the storage access module to the first high-speed interface module.
Optionally, in an example of the above aspect, the first high-speed interface module further includes an interrupt controller, and the interrupt controller sends an interrupt message to the external processing device when the computing module completes a predetermined computing task.
Optionally, in an example of the foregoing aspect, the uplink module includes an uplink control unit and an uplink data processing unit, the uplink control unit processes the task configuration data, the uplink data processing unit processes the task processing source data, the downlink module includes a downlink control unit and a downlink data processing unit, the downlink control unit controls the interrupt controller, and the downlink data processing unit processes the task processing result data.
Optionally, in an example of the above aspect, the first processor includes at least one of an FPGA, a GPU, and an ASIC.
Optionally, in an example of the above aspect, the computing module employs a parallel computing architecture including a hierarchy of multiple processing units, each processing unit being a smallest processing unit with independent task processing capability.
According to another aspect of embodiments of the present specification, there is provided a processing apparatus comprising the first processor as described above; and a memory communicably connected with the first processor and configured to store task processing source data received from an external processing apparatus and task processing result data transmitted to the external processing apparatus.
According to another aspect of embodiments herein, there is provided a heterogeneous processing system including: a master processing device including a master processor, the master processor including a second high-speed interface module that includes a second DMA controller; and a slave processing device including the first processor as described above, wherein the master processing device sends task processing source data and task configuration data to the slave processing device and receives task processing result data from the slave processing device.
Optionally, in one example of the above aspect, the heterogeneous processing system is applied to federated learning.
According to another aspect of embodiments of the present specification, there is provided a private data transmission method performed by a slave processing apparatus, a first processor of the slave processing apparatus including a first high-speed interface module, a read/write control module, and a calculation module, the data transmission method including: receiving and storing task processing source data from an external device in a memory in a DMA manner via the first high-speed interface module; reading task processing source data in the memory and distributing the task processing source data to the computing module through a read/write control module; the computing module executes task processing to obtain task processing result data and writes the task processing result data into the memory through the read/write control module; and sending the task processing result data to the external equipment in a DMA mode through the first high-speed interface module.
Optionally, in an example of the above aspect, the transmission method further includes: receiving task configuration data from an external device via the first high speed interface module in a DMA mode or a PIO mode and storing the task configuration data into a memory.
Optionally, in an example of the above aspect, the transmission method further includes: when the computing module completes a predetermined computing task, the first high-speed interface module sends an interrupt message to the external device.
According to another aspect of embodiments of the present specification, there is provided a private data transmission method performed by a heterogeneous processing system, the system including a master processing device and a slave processing device, the master processing device including a master processor with a second high-speed interface module that includes a second DMA controller, and the slave processing device including the processing apparatus as described above. The data transmission method includes: the master processing device sends task processing source data to the slave processing device in DMA mode via the second high-speed interface module; the slave processing device receives the task processing source data in DMA mode via the first high-speed interface module and stores it in the memory; the master processing device sends task configuration data to the slave processing device in DMA or PIO mode via the second high-speed interface module; the slave processing device receives the task configuration data in DMA or PIO mode via the first high-speed interface module and stores it in a register; the computing module of the slave processing device performs task processing to obtain task processing result data; the slave processing device sends the task processing result data in DMA mode via the first high-speed interface module; and the master processing device receives the task processing result data sent by the slave processing device in DMA mode via the second high-speed interface module.
Optionally, in an example of the foregoing aspect, after the slave processing device performs task processing to obtain the task processing result data, the private data processing method further includes: the slave processing device sends an interrupt control request to the master processing device; and after receiving the interrupt control request, the master processing device receives the task processing result data sent by the slave processing device.
Optionally, in an example of the above aspect, the private data processing method further includes: the master processing device detects the state of the slave processing device by polling; when the slave processing device is in the data-receive-ready state, the master processing device sends task processing source data and task configuration data to the slave processing device; and when the slave processing device is in the data-send-ready state, the slave processing device sends the task processing result data to the master processing device.
Optionally, in an example of the above aspect, the private data processing method further includes: determining the state of the slave processing device according to the state value in the register.
Optionally, in an example of the above aspect, the private data transmission method further includes: the master processing device configures the number of parallel transmission channels used by the second high-speed interface module to transfer the task processing data according to the actual task data volume and operation type, and transfers the task processing data in parallel over the configured channels.
Optionally, in an example of the above aspect, the private data transmission method further includes: dividing the task processing data into a plurality of batches, and transferring the data from the master processing device to the slave processing device in a predetermined order according to the divided batches.
According to another aspect of embodiments of the present specification, there is provided a computer-readable storage medium on which a private data transmission program is stored; when executed by a processor, the program implements the steps of the private data transmission method described above.
In the invention, the DMA controllers of the PCIe high-speed interface modules open a direct data transmission path between the master and slave processing devices, so that transfers require little intervention by the master processing device. This reduces the frequency of interrupt handling and the workload of the master processing device, provides the ability to move large amounts of data quickly, and meets the real-time and high-speed data transmission requirements of federated learning.
Drawings
FIG. 1 illustrates a schematic diagram of a heterogeneous processing system of an embodiment of the present description.
FIG. 2 illustrates a schematic diagram of a first high speed interface module, according to embodiments of the present description.
FIG. 3 illustrates an example schematic diagram of a parallel computing architecture in accordance with embodiments of the present description.
FIG. 4 illustrates an example schematic diagram of a parallel computing architecture with a multi-tier data distribution/consolidation mechanism, according to embodiments of the present description.
FIG. 5 illustrates an example schematic diagram of an operation pipeline design of a processing unit in accordance with an embodiment of the present description.
Fig. 6 illustrates a data transmission method between a master processing device and a slave processing device according to an embodiment of the present specification.
Fig. 7 illustrates a data transmission method performed by a slave processing device according to an embodiment of the present description.
Fig. 8 illustrates a PCIe driver flow diagram, for the case where the master processor is a CPU, according to an embodiment of the present specification.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as necessary. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.
As used herein, the term "include" and its variants are open-ended, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly dictates otherwise, the definition of a term is consistent throughout the specification.
Federated learning is an important machine learning framework in the field of Artificial Intelligence (AI). This framework enables different enterprises, organizations, or users to share data, for example for AI training and learning, while guaranteeing data security, privacy, and regulatory compliance, thereby breaking the limitation of data silos.
Data is the basis of machine learning, and in order to share data between different enterprises or users in a secure and private manner, multi-party secure computation must be performed on the data. One example of multi-party secure computation is the homomorphic encryption operation. Homomorphic encryption is a complex mathematical operation on high-bit-width large integers; its computational load is very large, and it carries real-time and performance requirements, so the computing system places very high demands on the data transmission efficiency of the hardware processor.
In order to meet these efficient computing performance requirements, Heterogeneous Computing systems have been proposed. A heterogeneous computing system is a system composed of processing devices with different architectures, and typically adopts a master-slave architecture: the master processing device is responsible for task scheduling and configuration within the system, and provides task configuration information and task processing source data to the slave processing devices through the transport interface for execution. Data transmission efficiency between master and slave processing devices is therefore an important factor in the computational performance of a heterogeneous computing system.
Embodiments of the present specification provide a heterogeneous processing system having a master processing device and a slave processing device, in which the master processing device issues most of the algorithm tasks to the slave processing device, and the slave processing device is responsible for processing them. The interface module in the slave processing device is a PCIe high-speed interface module equipped with a DMA controller. By adopting DMA data transfer within the PCIe interface module, the data transmission capability of the heterogeneous processing system is improved, the reliability requirements of high-speed data transmission are met, and a low-latency, high-transmission-bandwidth data transmission mechanism is provided for existing federated learning systems.
A heterogeneous processing system, processor, data transmission system and method according to the present specification will be described with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of a heterogeneous processing system of an embodiment of the present description. As shown in fig. 1, the heterogeneous processing system 1 includes a slave processing device 100 and a master processing device 200.
The master processing device 200 includes a master processor 210 and a main memory 220; the master processor 210 has a second high-speed interface module 211. The master processing device 200 is responsible for the control and scheduling of overall system tasks in the heterogeneous processing system, while the slave processing device 100 is responsible for the computational processing of the processing tasks (e.g., algorithm tasks).
Data transmission and communication between the master processing device 200 and the slave processing device 100 take place through the PCIe interface modules, completing the data and information interaction between the two. The data transmitted between the slave processing device 100 and the master processing device 200 may include the source data required for task processing (hereinafter "task processing source data") and task configuration data. Because the two processing devices communicate over a local real-time high-speed protocol (such as PCIe) rather than a remote protocol (such as TCP/IP), communication latency is greatly reduced. After the slave processing device 100 receives the task processing source data and task configuration data from the master processing device 200, it performs the corresponding parallel processing to obtain the task processing result data, which is then provided to the master processing device 200.
The structure and operation of the slave processing apparatus 100 according to an embodiment of the present specification will be described in more detail below with reference to the accompanying drawings.
As shown in FIG. 1, the slave processing device 100 includes a slave processor 110 (i.e., the first processor). The slave processor 110 includes a first high-speed interface module 111, a read/write controller 112, and a computing module 113. The first high-speed interface module 111 is configured to receive task processing source data and task configuration data from the master processing device 200 (via the second high-speed interface module 211) for parallel task processing, and to send the task processing result data back to the master processing device 200 once the slave processing device 100 completes the task processing.
In this description, the read/write control module may be the read/write controller 112 of FIG. 1. The slave processor 110 includes the first high-speed interface module 111, which contains a first DMA (direct memory access) controller 1110, through which the slave processing device exchanges data with the interface module of the master processing device 200 in DMA mode. The master processor 210 includes the second high-speed interface module 211, which contains a second DMA controller 2111, through which the master processing device exchanges data with the first high-speed interface module of the slave processing device 100 in DMA mode. The DMA scheme is efficient and bidirectional: the master processor 210 in the master processing device 200 can directly access the slave memory 120 of the slave processing device 100, and the slave processing device 100 can directly access the main memory 220 of the master processing device 200.
The first high-speed interface module 111 and the second high-speed interface module 211 may employ a PCIe interface. The PCIe bus is a high-speed serial computer expansion bus standard that uses high-speed differential signaling and a multi-lane, packet-based transfer model; its main advantages are reliable end-to-end transmission, high bandwidth, low latency, and good scalability and flexibility.
DMA is an efficient data transfer mechanism. In operation, a large amount of data to be computed can be moved directly from the main memory into the source data space (source memory) of the slave processing device's storage (e.g., the slave memory) without much intervention by the master processing device. The slave processing device then fetches the data from its source data space and performs the algorithm computation. When it finishes, it writes the result data into the result data space (result memory) of its memory as a cache and notifies the master processing device that the corresponding task has been computed; the master processing device then moves the data from the result data space of the slave memory directly into the main memory, again in DMA mode, completing the data interaction for the algorithm task.
The invention provides a data transmission mechanism between the master and slave processing devices for the federated learning system, using the DMA function of the PCIe bus for batched asynchronous data transfer, which greatly improves both the efficiency and the stability of data transmission.
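To make the round trip concrete, the sketch below shows the sequence from the master side. The patent does not publish a register map or driver API, so every offset, flag, and function name here is an assumption for illustration; only the ordering (stage source data, host-to-device DMA, compute, device-to-host DMA) follows the description above.

```c
/*
 * Minimal sketch of the DMA round trip, as seen from the host (master)
 * driver. All register offsets, flag bits and helper names are
 * hypothetical; the patent does not define them.
 */
#include <stdint.h>

#define DMA_H2D_ADDR   0x00u  /* hypothetical BAR0 offsets */
#define DMA_H2D_LEN    0x08u
#define DMA_D2H_ADDR   0x10u
#define DMA_D2H_LEN    0x18u
#define DMA_CTRL       0x20u
#define DMA_STATUS     0x24u
#define CTRL_H2D_GO    (1u << 0)
#define CTRL_D2H_GO    (1u << 1)
#define STAT_H2D_DONE  (1u << 0)
#define STAT_D2H_DONE  (1u << 1)

/* BAR0 access wrappers, assumed provided by the platform driver. */
extern void     mmio_write32(uint32_t off, uint32_t val);
extern void     mmio_write64(uint32_t off, uint64_t val);
extern uint32_t mmio_read32(uint32_t off);

/* Move `len` bytes of task source data into the slave's source-data
 * space, then fetch `res_len` bytes of result data once done. */
int dma_round_trip(uint64_t src_bus_addr, uint32_t len,
                   uint64_t dst_bus_addr, uint32_t res_len)
{
    /* 1. Host-to-device: source data -> slave memory, no CPU copy loop. */
    mmio_write64(DMA_H2D_ADDR, src_bus_addr);
    mmio_write32(DMA_H2D_LEN, len);
    mmio_write32(DMA_CTRL, CTRL_H2D_GO);
    while (!(mmio_read32(DMA_STATUS) & STAT_H2D_DONE))
        ;                       /* or sleep / wait for an interrupt */

    /* 2. The slave computes; completion is signalled by interrupt or a
     *    status register (both paths appear later in the description). */

    /* 3. Device-to-host: result data -> main memory. */
    mmio_write64(DMA_D2H_ADDR, dst_bus_addr);
    mmio_write32(DMA_D2H_LEN, res_len);
    mmio_write32(DMA_CTRL, CTRL_D2H_GO);
    while (!(mmio_read32(DMA_STATUS) & STAT_D2H_DONE))
        ;
    return 0;
}
```

Busy-waiting on the status register is shown only for brevity; the interrupt-driven completion path described below avoids it.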
FIG. 2 illustrates a schematic diagram of the first high-speed interface module according to embodiments of the present description. The first high-speed interface module 111 includes an interface 1115 and the first DMA controller 1110. The interface 1115 is the input/output port responsible for connecting to the master processing device, while the first DMA controller controls the sending and receiving of data in DMA mode. The first DMA controller includes a receiving engine 1111, a sending engine 1112, and a storage access module 1113. The receiving engine 1111 is responsible for receiving data and reading data status through the interface 1115, the sending engine 1112 is responsible for writing data and for interrupt control, and the storage access module 1113 is responsible for processing task configuration data and task processing data.
Further, the storage access module 1113 includes an uplink module 11131 and a downlink module 11132, where the uplink module 11131 is responsible for processing the received task processing source data and task configuration data, and the downlink module 11132 is responsible for processing the task processing result data. The task configuration data indicates the configuration of the tasks the slave processing device 100 needs to execute, and the task processing source data is the data consumed when the task algorithm is performed. For example, when performing a federated learning task, the data processing source data that the slave processing device 100 receives from the master processing device includes a plaintext data set M, a key n, and a random number set r. The plaintext data set M contains plaintext items m1, m2, ..., mi, and the random number set r contains random numbers r1, r2, ..., ri; the number of random numbers in r equals the number of plaintext items in M. Further, optionally, in one example the length of each random number may be the same as the length of the corresponding plaintext item; in another example, the lengths may differ.
In one example, the plaintext data set M, the key n, and the random number set r may be received from the master processing device 200. For example, the master processing device 200 provides the plaintext data set M, the key n, and the random number set r to the storage access module 1113 of the slave processing device 100.
In another example, the slave processing device 100 may include a random number generation module (not shown). In this case, the slave processing device 100 receives the plaintext data set M and the key n from the external device, and after receiving the plaintext data set M, the random number generation module generates the random number set r from it. In this embodiment, the data processing source data consists of the plaintext data set M and the key n.
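The description leaves the "predetermined encryption algorithm" open (homomorphic encryption, differential privacy, secret sharing, and so on), but the shape of this source data, a public key n plus one random number ri per plaintext mi, matches a Paillier-style additively homomorphic scheme common in federated learning. Purely as a hedged illustration, assuming Paillier, the per-element encryption the computing module would perform is

$$c_i = g^{m_i} \cdot r_i^{\,n} \bmod n^2, \qquad g = n + 1, \quad r_i \in \mathbb{Z}_n^{*},$$

and the additive homomorphism exploited during federated aggregation is

$$c_1 \cdot c_2 \bmod n^2 = \operatorname{Enc}(m_1 + m_2 \bmod n).$$

The modular exponentiations over n^2, with n typically 1024 or 2048 bits, are exactly the high-bit-width large-integer operations that the Background identifies as the computation and transmission bottleneck. These are standard Paillier formulas; the patent itself does not name the scheme.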
Further, the uplink module 11131 includes an uplink control unit 111311 and an uplink data processing unit 111312, and the downlink module 11132 includes a downlink control unit 111321 and a downlink data processing unit 111322. The first DMA controller also includes an interrupt controller 1114. The uplink control unit 111311 processes task configuration data, the uplink data processing unit 111312 processes task processing source data, the downlink control unit 111321 controls the interrupt controller, and the downlink data processing unit 111322 processes task processing result data.
Specifically, the uplink control unit 111311 sends the received task configuration data to the register 117 for storage; the uplink data processing unit 111312 writes the received task processing source data into the slave memory 120 through the read/write controller 112; and the computing module 113 reads the task processing source data stored in the slave memory 120 through the read/write controller when executing the computing task. The computing module 113 performs task computation on the received task processing source data according to a predetermined encryption algorithm to obtain the task processing result data. The predetermined encryption algorithm may be homomorphic encryption, differential privacy, secret sharing, or the like. When the computing module 113 completes a predetermined computing task, it sends the corresponding state to the register 117 for storage and notifies the downlink control unit 111321, which directs the interrupt controller 1114 to send an interrupt request to the master processing device 200; the master processing device 200 can perform interrupt handling as appropriate after receiving the request. Meanwhile, the computing module 113 writes the task processing result data into the slave memory 120 through the read/write controller 112, and the downlink data processing unit 111322 reads it back through the read/write controller 112 and sends it to the master processing device 200 via the sending engine 1112 and the interface 1115.
Besides the interrupt-based approach described above, when a predetermined computation task completes and the task processing result data is to be transferred, the master processing device 200 can instead detect the state of the slave processing device 100 by polling, monitoring the state of the register set 117. When the state value in the slave device's register indicates that the slave processing device 100 is in the data-receive-ready state, the master processing device 200 sends task processing source data and task configuration data to it; when the state value indicates that the slave processing device 100 is in the data-send-ready state, the slave processing device 100 sends the task processing result data to the master processing device 200.
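A minimal sketch of this polling loop follows, assuming a hypothetical status register layout for register set 117 (the patent gives no offsets or bit assignments):

```c
/* Master-side polling alternative to interrupts. Offsets and bits are
 * illustrative only. */
#include <stdint.h>

#define REG_SLAVE_STATUS   0x40u          /* hypothetical offset in BAR0 */
#define ST_RX_READY        (1u << 0)      /* slave ready to receive      */
#define ST_TX_READY        (1u << 1)      /* result data ready to send   */

extern uint32_t mmio_read32(uint32_t off);
extern void send_source_and_config(void);  /* DMA/PIO senders of Fig. 6 */
extern void fetch_result(void);

void master_poll_loop(void)
{
    for (;;) {
        uint32_t st = mmio_read32(REG_SLAVE_STATUS);
        if (st & ST_RX_READY)
            send_source_and_config();      /* blocks 610-640 */
        if (st & ST_TX_READY)
            fetch_result();                /* blocks 660-670 */
    }
}
```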
Further, the register 117 is also connected to a PIO (Programmed Input/Output) bus, so the task configuration data can be transferred not only in DMA mode but also in PIO mode. In federated learning, task processing source data usually has a large bit width, which places high demands on transfer efficiency, so it is transferred in DMA mode. The bit width of the task configuration data is much smaller, so a more flexible DMA or PIO configuration can be chosen for it as circumstances require. Specifically, while the task processing source data is transferred over DMA, PIO can carry the task configuration data whenever the DMA hardware channels are occupied, allowing the two to be transferred simultaneously; and because PIO transfers do not consume DMA transmission resources, the overall system transfers more efficiently.
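The PIO path might look like the following sketch. The configuration word layout and register window are invented for illustration; the point is that a handful of programmed I/O writes move the small configuration payload without tying up a DMA channel:

```c
/* Sketch of PIO-based task configuration. The struct fields and the
 * config window offset are assumptions, not the patent's layout. */
#include <stdint.h>

#define REG_CFG_BASE   0x100u   /* hypothetical config window in BAR0 */

extern void mmio_write32(uint32_t off, uint32_t val);

struct task_config {
    uint32_t op_code;       /* e.g. homomorphic encrypt / decrypt   */
    uint32_t batch_count;   /* number of operands in this batch     */
    uint32_t operand_bits;  /* bit width of each large integer      */
};

void pio_write_config(const struct task_config *cfg)
{
    /* Three 32-bit PIO writes; cheap for small payloads, and they do
     * not occupy a DMA hardware channel, so task processing source
     * data can stream over DMA in parallel. */
    mmio_write32(REG_CFG_BASE + 0x0, cfg->op_code);
    mmio_write32(REG_CFG_BASE + 0x4, cfg->batch_count);
    mmio_write32(REG_CFG_BASE + 0x8, cfg->operand_bits);
}
```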
Furthermore, the number of DMA parallel transmission channels can be configured as circumstances require: when the task data volume is large, the number of configured channels can be increased; when it is relatively small, the number can be reduced.
Further, the task management module 114 is configured to distribute the task processing source data to the respective processing units in the computing module 113 for parallel processing according to the task configuration data, resulting in task processing result data. After the task processing result data is obtained, the slave processing device transmits the task processing result data to the master processor 210 in the master processing device 200 through the task management module 114.
Further, the computing module 113 may adopt a parallel computing architecture, where the parallel computing architecture is a hierarchical structure composed of a plurality of processing units, and each processing unit is the smallest processing unit with independent task processing capability. In other words, each processing unit is capable of independently performing the full-flow processing of the algorithm. Optionally, in one example, the parallel computing architecture may employ a nested hierarchy.
FIG. 3 illustrates an example schematic diagram of a parallel computing architecture in accordance with embodiments of the present description. As shown in FIG. 3, the parallel computing architecture employs a multi-core (Multi-Kernel) computing architecture, which is a nested, hierarchical computing architecture.
In this specification, a nested hierarchical computing architecture includes multiple computing hierarchies, each of which may be made up of multiple processing units (Kernel-engines), multiple lower-level computing hierarchies, or a combination of both (i.e., lower-level engines described below). Each computation layer or each processing unit can independently complete the algorithm full-flow processing. Here, the processing unit is the smallest component unit of the parallel computing architecture and cannot be further subdivided.
Specifically, as shown in FIG. 3, the Multi-Kernel computing architecture may be divided into multiple layers. The first layer, called the Die_Engine layer, includes all the lower-level engines under a single die inside the processor. Each Die_Engine layer may be subdivided into a plurality of Kernel_Engine_Lvl1 layers (kernel engine level 1), also referred to as the second layer. Each second layer may in turn be subdivided into a plurality of Kernel_Engine_Lvl2 layers (kernel engine level 2), also referred to as the third layer. By analogy, the (n+1)-th layer is called Kernel_Engine_Lvln. Thus, in this specification, each level of the parallel computing architecture may contain multiple lower sub-layers, down to the final sub-layer, which consists of processing units (Kernel_Engine) and is not further subdivided.
In this specification, the nested hierarchy of the parallel computing architecture is configurable. For example, the number of computing levels and the number of processing units per level may be configured according to the task processing algorithm and the internal computing resources of the slave processor: the more complex the algorithm, the more computing levels the architecture contains, although the relationship between algorithm complexity and level count is not linear. More levels are not always better, since too many levels waste slave processor resources without significant performance gain; 3 to 6 levels are typical. The total number of processing units across all levels is determined by the slave processor's internal computing resources (the chip's total resources).
In one example configuration of the nested hierarchy, the number of die engine layers (Die_Engine) is configured once, determined by the number of dies contained in the slave processor model used by the slave processing device 100, and needs no subsequent reconfiguration. Each computing level below the die engine level is configured by the level above it; for example, Kernel_Engine_Lvl1 configures the number of Kernel_Engine_Lvl2 instances.
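This configurability can be modeled roughly as follows; the struct, field names, and resource check are illustrative assumptions, not the patent's design:

```c
/* Illustrative model of the configurable nested hierarchy. The patent
 * names the layers (Die_Engine, Kernel_Engine_Lvl1..Lvln) and says each
 * level sets the fan-out of the level below; the numbers are invented. */
#include <stddef.h>

#define MAX_LEVELS 6   /* the description suggests 3-6 levels in practice */

struct multi_kernel_cfg {
    unsigned die_count;          /* fixed by the FPGA part, set once      */
    unsigned levels;             /* nesting depth below the die layer     */
    unsigned fanout[MAX_LEVELS]; /* fanout[i]: engines per level-i engine */
};

/* Total Kernel_Engine processing units; must fit the slave chip's
 * total resources. */
size_t total_kernel_engines(const struct multi_kernel_cfg *c)
{
    size_t n = c->die_count;
    for (unsigned i = 0; i < c->levels; i++)
        n *= c->fanout[i];
    return n;
}
```

Under this model, a configuration of one die, three levels, and fan-outs {4, 4, 8} would yield 128 processing units, which would then be checked against the slave chip's total resources.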
It is noted that, in one example, the configuration of the nested hierarchy described above may be performed in advance, at design time of the slave processing device (e.g., an FPGA chip), according to a predetermined task processing algorithm; in that case the slave processing device is suited to task processing with that predetermined algorithm. In another example, the nested hierarchy may also be configured in real time while the slave processing device performs task processing.
With this configurable nested hierarchy, when an application scenario involves different types or levels of task processing algorithms, the Multi-Kernel computing architecture can be reconfigured, for example by changing the specific design and parameter configuration of a computing level or processing engine, to meet the task processing requirements of the different algorithms.
Furthermore, in the parallel computing architecture of the embodiments of the present specification, a multi-layer data distribution and merging mechanism may also be provided. FIG. 4 illustrates an example schematic diagram of a parallel computing architecture with a multi-tier data distribution/consolidation mechanism in accordance with embodiments of the present description.
As shown in FIG. 1, the slave processing device 100 includes a data distribution/merging module 115, which comprises a paired data distribution module and data merging module. Together with the task management module 114, the data distribution/merging module 115 distributes the task processing source data to the processing units of the parallel computing architecture for parallel processing, and merges the units' parallel results into the task processing result data.
In this description, the data distribution/merging module 115 may adopt a multi-layer data transmission hierarchy. Specifically, the data distribution module comprises multiple levels of data distribution modules, and the data merging module comprises multiple levels of data merging modules: a first-level data distribution module connects to several second-level data distribution modules, each second-level module connects to several third-level modules, and so on. The data merging modules are connected in the opposite direction.
Specifically, the data distribution and merging mechanism can be divided into multiple layers, with data distribution being "one in, multiple out" and data merging being "multiple in, one out". As shown in FIG. 4, the first layer of data distribution is called Data_Disp_DieEng, the second Data_Disp_Lvl1, the third Data_Disp_Lvl2, and so on up to the (n+1)-th layer, Data_Disp_Lvln. Likewise, the first layer of data merging is called Data_Merg_DieEng, the second Data_Merg_Lvl1, the third Data_Merg_Lvl2, and so on up to Data_Merg_Lvln. The layers relate as follows. For data distribution, a single upper-layer distribution module outputs data onto multiple data channels, each of which feeds one lower-layer distribution module; that is, a single upper-layer module drives multiple lower-layer modules. For data merging, multiple upper-layer merging modules merge into a single lower-layer merging module, until the data is merged into a single stream (the task processing result data) and provided to the task management module 114.
In the architecture of the present specification, the number of levels in the data transmission hierarchy is configurable, and the number of channels for data distribution and merging at each level is also flexibly configurable, for example 8, 16, or 32; in practice, the channel count may be chosen in view of the number of processing units (Kernel_Engine) at each level.
With this multi-layer data distribution/merging mechanism, when the single-task data volume issued by the master processor of the master processing device 200 is relatively large, for example 256 MB or 512 MB, the data distribution modules can spread the task data evenly across all processing units (Kernel_Engine) inside the parallel computing architecture for parallel computation, improving internal data transfer efficiency as well as the overall performance and achievable clock frequency of the architecture. The multi-layer distribution and merging mechanism also accommodates data interaction across multiple layers of computing engines. In addition, because the data distribution and merging modules use a round-robin mechanism, the layered design reduces the number and scope of the round-robin scans, which raises the parallelism of data processing across the processing units and lowers processing latency.
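One layer of the "one in, multiple out" distribution with round-robin channel selection could be sketched as below; the channel-queue API and block-size handling are assumptions, since the patent specifies only the fan-out topology and the round-robin behavior:

```c
/* One level of Data_Disp_Lvl*: split an input stream into blocks and
 * hand them to lower-level modules round-robin. The channel count is
 * configurable (8/16/32 in the text); channel_push() is hypothetical. */
#include <stdint.h>

#define NUM_CHANNELS 8   /* configurable per level */

extern void channel_push(unsigned ch, const uint8_t *blk, uint32_t len);

void distribute_layer(const uint8_t *data, uint32_t total, uint32_t blk)
{
    unsigned ch = 0;
    for (uint32_t off = 0; off < total; off += blk) {
        uint32_t n = (total - off < blk) ? total - off : blk;
        channel_push(ch, data + off, n);   /* feeds a lower-level module */
        ch = (ch + 1) % NUM_CHANNELS;      /* round-robin across channels */
    }
}
```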
Further, in the present specification, each processing unit (Kernel-Engine) may employ an operation PipeLine (PipeLine) design. FIG. 5 illustrates an example schematic diagram of an operation pipeline design of a processing unit according to embodiments of this specification.
As shown in FIG. 5, a processing unit may contain multiple operation stages, each of which can perform various mathematical operations such as addition, subtraction, multiplication, and division. The stages are connected seamlessly by the pipeline, so all stages process in parallel at the same time. The result of one stage is temporarily held in memory, for example memory inside the processor, and the next stage can flexibly select among different upstream results as its input; chained together, the stages can complete very complex algorithmic computations. The pipeline design greatly increases the parallelism of task processing, simplifies each single computation step, and improves operation efficiency.
In addition, in the present specification, the number of pipeline stages in each processing unit is flexibly configurable, for example 5, 6, or 7. In one example, the stage count may be determined by the complexity of the task processing algorithm in the actual application: generally, the more complex the algorithm, the more stages are needed, but more stages also mean that a single processing unit consumes more slave processor resources. Based on these considerations, in one example the pipeline stage count may be configured according to the depth of the nested hierarchy of the parallel computing architecture and the number of processing units.
Further, optionally, in one example, each stage of a processing unit's operation pipeline has a loop operation capability (a feedback or cyclic operation): the result of an operation can be fed back to the input of the same stage for further recursive computation. Examples of loop operations include iterative and/or recursive operations. This design supports the recursive or iterative computations contained in some AI algorithms, improving the utilization of the processing unit and the complexity of the computations it can handle.
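A software model of such a pipeline with per-stage feedback might look like the following. Real hardware would run the stages concurrently on different data items; this sequential C sketch, with invented types, only illustrates the stage chaining and the same-stage feedback dataflow:

```c
/* Software model of the Fig. 5 pipeline: a configurable number of
 * stages chained together, where loops[s] > 1 models the feedback path
 * that re-feeds a stage's result to its own input. */
#include <stdint.h>

#define STAGES 5                       /* configurable: 5, 6, 7, ... */

typedef uint64_t (*stage_fn)(uint64_t in);

uint64_t run_pipeline(const stage_fn stage[STAGES],
                      const unsigned loops[STAGES], uint64_t x)
{
    for (unsigned s = 0; s < STAGES; s++) {
        /* Feedback/loop operation: iterate the same stage on its own
         * output before passing the result downstream. */
        for (unsigned i = 0; i < loops[s]; i++)
            x = stage[s](x);
    }
    return x;
}
```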
Further, in this specification, the number of tasks the parallel computing architecture processes in parallel is configurable, e.g., 64, 128, or 256, and this parameter has no hard upper bound in itself. The practical limit depends on at least two factors. One is the size of the slave memory on the slave processing device 100, which bounds the total buffered data of all tasks combined. The other is the maximum data volume supported by a single task (the batch size), which can also be configured flexibly according to business needs, e.g., 128 MB, 256 MB, or 512 MB. When the batch size is configured larger, the number of parallel tasks should be configured smaller, given the slave memory limit, and vice versa; in all cases the accumulated data volume of all tasks must not exceed the slave memory capacity. For example, with a hypothetical 16 GB of slave memory and a 512 MB batch size, at most 32 tasks could be buffered concurrently.
In addition, the slave processing device 100 may also include a system control/monitoring module 116, disposed between the first high-speed interface module 111 on one side and the task management module 114 and the computing module 113 on the other, and configured to provide the task configuration data received by the first high-speed interface module 111 to the task management module 114 and the computing module 113. The system control/monitoring module 116 can also monitor the task processing state of the computing module 113 online. The processing units inside the computing module 113 are all designed with multiple sets of monitoring registers for tracking the data volume of processing tasks, the computation state, task statistics, and so on; the internal task processing state of the parallel computing architecture can be monitored through register reads and writes. Examples of monitoring registers include, but are not limited to, configuration registers, control registers, status monitoring registers, statistics monitoring registers, and error information registers.
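An illustrative register map for one processing unit is sketched below. The patent names the register classes but not their layout, so the offsets, widths, and the busy bit are assumptions:

```c
/* Hypothetical per-Kernel_Engine monitoring register block, covering
 * the register classes listed above. */
#include <stdint.h>

struct kernel_engine_regs {
    volatile uint32_t config;        /* task parameters written at setup */
    volatile uint32_t control;       /* start/stop/reset bits            */
    volatile uint32_t status;        /* busy/done/ready state of the unit*/
    volatile uint32_t stat_tasks;    /* statistics: tasks completed      */
    volatile uint32_t stat_bytes_lo; /* statistics: data volume processed*/
    volatile uint32_t stat_bytes_hi;
    volatile uint32_t error;         /* sticky error flags               */
};

/* The system control/monitoring module 116 reads these over the
 * internal register bus; the host can reach the same state through
 * register 117. The busy bit position is an assumption. */
static inline int engine_busy(const struct kernel_engine_regs *r)
{
    return r->status & 0x1;
}
```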
With the heterogeneous processing system according to the embodiments of the present specification, by implementing the task processing algorithm in a slave processing device built around the slave processor, and by designing an efficient, high-throughput parallel computing architecture inside that processor, the high parallelism, high bandwidth, and low latency of the slave processor can be fully exploited, greatly improving the performance and efficiency of the heterogeneous processing system.
Practical tests in real business applications (such as homomorphic encryption computation for federated learning) show that the computing performance of the heterogeneous processing system according to the embodiments of this specification is several times that of a traditional CPU-based processing system, and that its performance, power consumption, and price ratio also improve greatly over GPU-based heterogeneous processing systems. The system can meet the functional and performance requirements of federated learning application scenarios, making large-scale commercial deployment of federated learning possible and promoting the development of the industry.
Further, optionally, the slave processing device 100 may further comprise a slave memory 120. The slave memory 120 is configured to store task processing source data received from the master processing device 200 and task processing result data of the calculation module 113.
Further, optionally, the slave processing device 100 may further include the data read/write controller 112, disposed between the task management module 114 and the slave memory 120 and configured to control read/write operations on data in the slave memory 120. The task management module 114 can thus read the source data required for the parallel computation of the computing module 113, and store the parallel computation result data into the slave memory 120, under the control of the data read/write controller 112.
Further, the slave processor 110 may optionally include a cache (not shown), an in-processor cache disposed between the task management module 114 and the data read/write controller 112. The cache is configured to hold data read from the slave memory 120 under the control of the data read/write controller 112, or to cache the task computation results of the computing module 113.
In one example of the present specification, the master processing device 200 may be a CPU-based processing device, i.e., the master processor may be a CPU. The slave processing device 100 may be an FPGA-based processing device, i.e., the slave processor may be implemented with an FPGA chip. Alternatively, in another example, the slave processor may be implemented with, for example, a GPU, an ASIC, or another suitable chip.
Fig. 6 illustrates a data transmission method between a master processing device and a slave processing device according to an embodiment of the present specification. The master processing device and the slave processing device transmit data through a high-speed interface, such as a PCIe interface, and through DMA.
At block 610, the master processing device sends task processing source data to the slave processing device in DMA mode via the second high-speed interface module.
In one specific embodiment, the master processing device sends the task processing source data directly to the slave processing device; optionally, the master processing device may instead send address information for the task processing source data, and the slave processing device actively fetches the data according to that address information.
At block 620, the slave processing device receives the task processing source data in DMA mode via the first high-speed interface module and stores it in the memory.
In one specific implementation, the slave processing device directly receives the task processing source data sent by the master processing device and stores it in the memory; optionally, the slave processing device may actively fetch the task processing source data according to the address information sent by the master processing device.
At block 630, the master processing device sends task configuration data to the slave processing device via the second high speed interface module in a DMA or PIO manner.
At block 640, the slave processing device receives the task configuration data in DMA or PIO mode via the first high-speed interface module and stores it in a register.
At block 650, the computing module of the slave processing device performs task processing to obtain task processing result data.
At block 660, the slave processing device sends the task processing result data in DMA mode via the first high-speed interface module.
At block 670, the master processing device receives the task processing result data sent by the slave processing device in DMA mode via the second high-speed interface module.
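Blocks 610 through 670 can be tied together in a single master-side routine, sketched below with placeholder helpers standing in for the DMA engine, the PIO path, and the completion wait (none of these names come from the patent):

```c
/* End-to-end sketch of Fig. 6 from the master side. */
#include <stddef.h>

extern int dma_send(const void *buf, size_t len);            /* 610/620 */
extern int pio_or_dma_send_cfg(const void *cfg, size_t len); /* 630/640 */
extern int wait_slave_done(void);               /* interrupt or polling */
extern int dma_recv(void *buf, size_t len);                  /* 660/670 */

int run_task(const void *src, size_t src_len,
             const void *cfg, size_t cfg_len,
             void *result, size_t res_len)
{
    if (dma_send(src, src_len))             /* source data, always DMA  */
        return -1;
    if (pio_or_dma_send_cfg(cfg, cfg_len))  /* config data, DMA or PIO  */
        return -1;
    if (wait_slave_done())                  /* block 650 runs on slave  */
        return -1;
    return dma_recv(result, res_len);       /* result data, always DMA  */
}
```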
Fig. 7 illustrates a data transmission method performed by the slave processing device according to an embodiment of the present description. The slave processing device transfers data via a high-speed interface, such as a PCIe interface, using DMA.
At block 710, task processing source data is received from an external device via the first high speed interface module in a DMA manner and stored into a memory.
At block 720, the in-memory task processing source data is read via the read/write controller and distributed to the compute modules.
At block 730, task processing results data are obtained by the computing module executing task processing and written to the memory via the read/write controller.
At block 740, the task processing result data is sent to the external device in DMA mode via the first high-speed interface module.
In a specific implementation, the main processor is a CPU and the slave processor is a heterogeneous computing chip, such as a field-programmable gate array (FPGA), a graphics processing unit (GPU), or an application-specific integrated circuit (ASIC), and a PCIe high-speed interface serves as the high-speed interface between the CPU and the heterogeneous chip. Fig. 8 shows a driver flow diagram of PCIe in which the main processor is a CPU according to this embodiment.
At block 801, the CPU initializes.
At block 802, the DMA read and write parameters are configured, i.e., the task configuration data to be sent to the slave processing device is configured.
At block 803, the CPU sends the task processing source data and the task configuration data to the heterogeneous compute chips.
At block 804, it is determined whether the data batch transmission is complete; if so, the flow proceeds to block 805, and otherwise block 804 continues to check whether the batch transmission is complete.
At block 805, it is determined whether the interaction with the heterogeneous chip is in interrupt mode; if so, the flow proceeds to block 806 to wait for the heterogeneous chip to trigger an interrupt control request, and otherwise to block 807 to check the states of the CPU and the heterogeneous chip.
At block 806, the CPU waits for the heterogeneous chip to trigger an interrupt control request.
At block 807, the states of the CPU and the heterogeneous chip are checked.
At block 808, it is determined whether the CPU is idle and the heterogeneous chip is in a ready state; if so, the flow proceeds to block 811 to receive external data, and otherwise it returns to block 807 to continue checking the states of the CPU and the heterogeneous chip.
At block 809, it is determined whether an interrupt has been triggered; if so, the flow proceeds to block 810, where the CPU starts an interrupt handler, and otherwise it returns to block 806 to continue waiting for the interrupt.
At block 810, the CPU starts an interrupt handler.
At block 811, the CPU receives external data.
At block 812, it is determined whether reception of the data batch is complete.
At block 813, the PCIe DMA read and write operations are terminated.
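The flow of Fig. 8 can be summarized by the following C sketch. Every helper function named here (send_batch, batch_done, irq_mode, wait_irq, run_irq_handler, cpu_idle, chip_ready, recv_batch, all_received) is a hypothetical placeholder standing in for driver internals that the embodiment does not spell out.

    #include <stdbool.h>

    extern void send_batch(void);       /* blocks 801-803 */
    extern bool batch_done(void);       /* block 804      */
    extern bool irq_mode(void);         /* block 805      */
    extern bool wait_irq(void);         /* blocks 806/809 */
    extern void run_irq_handler(void);  /* block 810      */
    extern bool cpu_idle(void);         /* block 808      */
    extern bool chip_ready(void);       /* block 808      */
    extern void recv_batch(void);       /* block 811      */
    extern bool all_received(void);     /* block 812      */

    void pcie_dma_session(void)
    {
        send_batch();                             /* 801-803: init, configure, send */
        while (!batch_done())                     /* 804: wait for batch completion */
            ;
        if (irq_mode()) {                         /* 805: interrupt interaction     */
            while (!wait_irq())                   /* 806/809: wait for the chip     */
                ;
            run_irq_handler();                    /* 810: CPU interrupt handler     */
        } else {                                  /* polling interaction            */
            while (!(cpu_idle() && chip_ready())) /* 807/808: poll states           */
                ;
        }
        do {                                      /* 811/812: receive result data   */
            recv_batch();
        } while (!all_received());
        /* 813: terminate the PCIe DMA read and write here. */
    }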
According to one embodiment of the present description, a program product, such as a machine-readable medium (e.g., a non-transitory machine-readable medium), is provided. The machine-readable medium may have instructions (i.e., the elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with Figs. 1 to 8 in the various embodiments of the present description. Specifically, a system or apparatus may be provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and a computer or processor of the system or apparatus may read out and execute the instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communication network.
The heterogeneous processing system, the processor, and the task processing method according to the embodiments of the present specification have been described above with reference to Figs. 1 to 8. It will be understood by those skilled in the art that various changes and modifications may be made to the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined by the following claims.
It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A first processor for federated learning, comprising:
a first PCIe high-speed interface module comprising a first DMA controller, the first PCIe high-speed interface module being configured to receive task processing source data from an external processing device in a DMA mode, receive task configuration data from the external processing device in a PIO mode, and send task processing result data to the external processing device in a DMA mode;
a read/write control module configured to control read/write operations on data in a memory, to read the task processing source data in the memory and distribute it to the computing module, and to store the result data processed by the computing module into the memory;
and a computing module configured to perform task computing processing on the received task processing source data according to a predetermined encryption algorithm to obtain the task processing result data.
2. The first processor of claim 1, wherein the first processor further comprises a register into which the task configuration data is stored, and the first DMA controller comprises:
a storage access module comprising an uplink module and a downlink module, wherein the uplink module processes the received task processing source data and task configuration data, and the downlink module processes the received task processing result data;
a receiving engine configured to send the task processing source data and/or the task configuration data received by the first PCIe high-speed interface module to the storage access module;
and a sending engine configured to send the task processing result data received by the storage access module to the first PCIe high-speed interface module.
3. The first processor of claim 1 or 2, wherein the first PCIe high-speed interface module further comprises an interrupt controller, the interrupt controller sending an interrupt message to the external processing device when the computing module completes a predetermined computing task.
4. The first processor according to claim 3, wherein the uplink module includes an uplink control unit and an uplink data processing unit, the uplink control unit processes the task configuration data, and the uplink data processing unit processes the task processing source data; the downlink module comprises a downlink control unit and a downlink data processing unit, the downlink control unit controls the interrupt controller, and the downlink data processing unit processes the task processing result data.
5. The first processor of claim 1, wherein the first processor comprises at least one of an FPGA, a GPU, and an ASIC.
6. A federated learning processing apparatus, comprising:
the first processor of any one of claims 1 to 5; and a memory communicably connected with the first processor and configured to store task processing source data received from an external processing apparatus and task processing result data to be transmitted to the external processing apparatus.
7. A federated learning heterogeneous processing system, comprising:
a main processing device comprising a main processor, wherein the main processor comprises a second PCIe high-speed interface module, and the second PCIe high-speed interface module comprises a second DMA controller; and
a slave processing device comprising the first processor of any one of claims 1 to 5,
wherein the main processing device sends task processing source data and task configuration data to the slave processing device, and receives task processing result data from the slave processing device.
8. A federated learning private data transmission method, wherein the transmission method is executed by a slave processing device, a first processor of the slave processing device comprises a first PCIe high-speed interface module, a read/write control module and a computing module, and the transmission method comprises:
receiving task processing source data from an external device in a DMA manner via the first PCIe high-speed interface module and storing the task processing source data into a memory;
receiving, at the same time, task configuration data from the external device in a PIO mode via the first PCIe high-speed interface module and storing the task configuration data into a register;
reading the task processing source data in the memory via the read/write control module and distributing it to the computing module; executing, by the computing module, task processing to obtain task processing result data and writing the task processing result data into the memory via the read/write control module;
and sending the task processing result data to the external device in a DMA manner via the first PCIe high-speed interface module.
9. The private data transmission method according to claim 8, wherein the transmission method further comprises: when the computing module completes a predetermined computing task, the first PCIe high-speed interface module sends an interrupt message to the external device.
10. A federated learning private data transmission method performed by a heterogeneous processing system, the heterogeneous processing system comprising a master processing device and a slave processing device, wherein the master processing device comprises a master processor, the master processor comprises a second PCIe high-speed interface module comprising a second DMA controller, and the slave processing device comprises the processing apparatus of claim 6, the data transmission method comprising:
the master processing device sends task processing source data to the slave processing device in a DMA mode through the second PCIe high-speed interface module;
the master processing device sends task configuration data to the slave processing device in a PIO mode through the second PCIe high-speed interface module;
the slave processing device receives the task processing source data in a DMA mode through the first PCIe high-speed interface module and stores the task processing source data in a memory;
the slave processing device receives the task configuration data in a PIO mode through the first PCIe high-speed interface module and stores the task configuration data into a register;
a computing module of the slave processing device executes task processing to obtain task processing result data;
the slave processing device sends the task processing result data in a DMA mode via the first PCIe high-speed interface module;
and the master processing device receives the task processing result data sent by the slave processing device in a DMA mode through the second PCIe high-speed interface module.
11. The private data transmission method according to claim 10, wherein after the computing module of the slave processing device executes task processing to obtain the task processing result data, the method further comprises:
the slave processing device sends an interrupt control request to the master processing device;
and after receiving the interrupt control request, the master processing device receives the task processing result data sent by the slave processing device.
12. The private data transmission method according to claim 11, wherein the method further comprises: the master processing device detects the state of the slave processing device in a polling mode; when the slave processing device is in a data-receiving ready state, the master processing device sends the task processing source data and the task configuration data to the slave processing device;
and when the slave processing device is in a data-sending ready state, the slave processing device sends the task processing result data to the master processing device.
13. The private data transmission method according to any one of claims 10 to 12, wherein the method further comprises: determining the state of the slave processing device according to a state value in the register.
14. The private data transmission method according to claim 10, further comprising: the master processing device configures, according to the actual task processing data volume and the operation type, the number of parallel transmission channels used by the second PCIe high-speed interface module to transmit the task processing data, and transmits the task processing data in parallel through the configured parallel transmission channels.
15. The private data transmission method according to claim 10, further comprising: dividing the task processing data into a plurality of batches, and transferring the data from the master processing device to the slave processing device in a predetermined order according to the divided batches.
16. A computer-readable storage medium having stored thereon a private data transmission program which, when executed by a processor, performs the steps of the federated learning private data transmission method according to any one of claims 8 to 15.
CN202010664237.5A 2020-07-10 2020-07-10 Processor for federal learning, heterogeneous processing system and private data transmission method Active CN112000598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664237.5A CN112000598B (en) 2020-07-10 2020-07-10 Processor for federal learning, heterogeneous processing system and private data transmission method

Publications (2)

Publication Number Publication Date
CN112000598A (en) 2020-11-27
CN112000598B (en) 2022-06-21

Family

ID=73466784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664237.5A Active CN112000598B (en) 2020-07-10 2020-07-10 Processor for federal learning, heterogeneous processing system and private data transmission method

Country Status (1)

Country Link
CN (1) CN112000598B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685159B (en) * 2020-12-30 2022-11-29 深圳致星科技有限公司 Federal learning calculation task processing scheme based on FPGA heterogeneous processing system
CN113553191B (en) * 2021-09-17 2022-01-04 深圳致星科技有限公司 Heterogeneous processing system for federated learning and privacy computing
CN113900828B (en) * 2021-12-08 2022-03-04 深圳致星科技有限公司 Special processor for federal learning, federal learning processing chip and chip
CN116932451A (en) * 2022-03-31 2023-10-24 华为技术有限公司 Data processing method, host and related equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563808A (en) * 2018-01-05 2018-09-21 中国科学技术大学 The design method of heterogeneous reconfigurable figure computation accelerator system based on FPGA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090183161A1 (en) * 2008-01-16 2009-07-16 Pasi Kolinummi Co-processor for stream data processing
US11475350B2 (en) * 2018-01-22 2022-10-18 Google Llc Training user-level differentially private machine-learned models

Similar Documents

Publication Publication Date Title
CN112000598B (en) Processor for federal learning, heterogeneous processing system and private data transmission method
US20210117249A1 (en) Infrastructure processing unit
CN111813526A (en) Heterogeneous processing system, processor and task processing method for federal learning
US10331595B2 (en) Collaborative hardware interaction by multiple entities using a shared queue
CN112685159B (en) Federal learning calculation task processing scheme based on FPGA heterogeneous processing system
CN111831330B (en) Heterogeneous computing system device interaction scheme for federated learning
US20150127649A1 (en) Efficient implementations for mapreduce systems
CN101158935B (en) South bridge system and method
CN109857542B (en) Calculation resource adjusting method, system and device
US8990451B2 (en) Controller for direct access to a memory for the direct transfer of data between memories of several peripheral devices, method and computer program enabling the implementation of such a controller
US11954528B2 (en) Technologies for dynamically sharing remote resources across remote computing nodes
US20210326177A1 (en) Queue scaling based, at least, in part, on processing load
US11231987B1 (en) Debugging of memory operations
WO2020163327A1 (en) System-based ai processing interface framework
US20200341812A1 (en) Aggregated virtualized compute accelerators
KR20200125389A (en) Method for status monitoring of acceleration kernels in a storage device and storage device employing the same
CN112052483B (en) Data communication system and method of password card
US11880710B2 (en) Adaptive data shipment based on burden functions
JP2019185764A (en) Data-centric computing architecture based on storage server in ndp server data center
US9053092B2 (en) System authorizing direct data transfers between memories of several components of that system
KR20230029760A (en) Power Budget Distribution in Computer Systems
KR102551726B1 (en) Memory system
Hsu et al. Value the recent past: Approximate causal consistency for partially replicated systems
CN115033904A (en) Data processing method, apparatus, system, medium, and product
KR102128832B1 (en) Network interface apparatus and data processing method for network interface apparauts thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant