CN117273162A - Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment - Google Patents

Publication number
CN117273162A
Authority
CN
China
Prior art keywords
data
processed
module
memory
modular multiplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210664921.2A
Other languages
Chinese (zh)
Inventor
王子潇
杜洋
车碧瑶
陈映
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202210664921.2A
Publication of CN117273162A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839 Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F15/7842 Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • G06F15/7846 On-chip cache and off-chip main memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a federal learning acceleration method, a federal learning acceleration device, a storage medium and electronic equipment, and relates to the technical field of computers. The federal learning acceleration method comprises the following steps: controlling a memory reading module to read data to be processed from a global memory, and dividing the data to be processed into a plurality of groups of data streams to be processed; controlling a Montgomery modular multiplication module to acquire each data stream to be processed from the memory reading module, and performing Montgomery modular multiplication calculation on the data in each data stream to be processed to obtain a modular multiplication calculation result; and sending the modular multiplication calculation result to a memory writing module, so that the memory writing module writes the modular multiplication calculation result into the global memory. This solves the technical problem of low data processing efficiency in existing federal learning processing architectures and improves the data processing efficiency of the federal learning processing architecture.

Description

Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a federal learning acceleration method, a federal learning acceleration device, a storage medium, and an electronic apparatus.
Background
Federal learning (more commonly rendered in English as federated learning) enables multiple nodes to build a model jointly from the data each of them holds, while keeping that data secure and legally compliant. It improves the effect of AI models and is widely applied in the fields of big data processing and privacy protection.
The processing architecture of federal learning generally includes a modular exponentiation calculation module, a preprocessing module, a Montgomery module, a confusion calculation module, a controller and the like. Different circuit modules on the chip are selected for execution according to the algorithm in use, and the modules cannot all work at the same time.
Therefore, the data processing efficiency of existing federal learning processing architectures is low.
Disclosure of Invention
The disclosure provides a federal learning acceleration method, a federal learning acceleration device, a storage medium and electronic equipment, so that data processing efficiency of a federal learning processing framework is improved.
In a first aspect, an embodiment of the present disclosure provides a federal learning acceleration method, including:
controlling a memory reading module to read data to be processed from a global memory, and dividing the data to be processed into a plurality of groups of data streams to be processed;
controlling a Montgomery modular multiplication module to acquire each data stream to be processed from the memory reading module, and performing Montgomery modular multiplication calculation on data in each data stream to be processed to obtain a modular multiplication calculation result;
and sending the modular multiplication calculation result to a memory writing module so that the memory writing module writes the modular multiplication calculation result into the global memory.
In an alternative embodiment of the present disclosure, controlling a memory read module to read data to be processed from a global memory and divide the data to be processed into a plurality of groups of data streams to be processed includes:
determining the number of calculation units capable of processing data in unit time in the Montgomery modular multiplication module;
controlling a memory reading module to read data to be processed from a global memory;
dividing the data to be processed into as many data streams to be processed as there are calculation units.
In an alternative embodiment of the present disclosure, controlling a memory read module to read data to be processed from a global memory includes:
determining the number of execution batches in which the memory reading module obtains the data to be processed, according to the size of the data to be processed and the number of calculation units of the Montgomery modular multiplication module;
and sequentially reading the data to be processed from the global memory according to the number of execution batches.
In an alternative embodiment of the present disclosure, sequentially reading data to be processed from a global memory according to the number of execution batches includes:
inquiring a target memory address corresponding to the current execution batch number from a pre-configured memory address library;
and reading target to-be-processed data corresponding to the current execution batch number from the target memory address.
In an alternative embodiment of the present disclosure, the method further comprises:
if the current target memory address is the same as the historical target memory address corresponding to the historical execution batch number, the historical target to-be-processed data corresponding to the historical execution batch number is read from the memory reading module.
In an alternative embodiment of the present disclosure, performing Montgomery modular multiplication on data in each data stream to be processed to obtain a modular multiplication result, including:
carrying out Montgomery modular multiplication parallel calculation on the data in each data stream to be processed to respectively obtain sub-modular multiplication calculation results corresponding to each data stream to be processed;
and sending the sub-modular multiplication calculation results corresponding to the data streams to be processed at the current moment to the memory writing module as a group of modular multiplication calculation results.
In an alternative embodiment of the present disclosure, before performing Montgomery modular multiplication parallel computation on data in each data stream to be processed, the method further includes:
converting the current radix (number base) of each data stream to be processed into a target radix; wherein the target radix is greater than the current radix.
In a second aspect, one embodiment of the present disclosure provides a federal learning acceleration device, the device comprising:
the first control module is used for controlling the memory reading module to read the data to be processed from the global memory and dividing the data to be processed into a plurality of groups of data streams to be processed;
the second control module is used for controlling the Montgomery modular multiplication module to acquire each data stream to be processed from the memory reading module, and carrying out Montgomery modular multiplication calculation on the data in each data stream to be processed to obtain a modular multiplication calculation result;
and the result processing module is used for sending the modular multiplication calculation result to the memory writing module so that the memory writing module can write the modular multiplication calculation result into the global memory module.
In a third aspect, one embodiment of the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as above.
In a fourth aspect, one embodiment of the present disclosure provides an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method as above via execution of the executable instructions.
The technical scheme of the present disclosure has the following beneficial effects:
According to the federal learning acceleration method provided by the embodiments of the present disclosure, the memory reading module, the Montgomery modular multiplication module and the memory writing module are arranged on the FPGA chip, and the modules divide the work and cooperate to complete the reading, processing and writing of the data to be processed. The on-chip data transmission rate is higher than the off-chip data transmission rate. The configured memory reading module reads the data to be processed from the global memory in real time, the Montgomery modular multiplication module performs modular multiplication calculation on that data, and the memory writing module writes the resulting modular multiplication calculation result into the global memory. The modular multiplication calculation and the data reading and writing are executed simultaneously, so one module does not have to wait for another to finish before it starts. The data processing efficiency is therefore higher, which solves the technical problem of low data processing efficiency in existing federal learning processing architectures and improves the data processing efficiency of the federal learning processing architecture.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely some embodiments of the present disclosure and that other drawings may be derived from these drawings without undue effort.
FIG. 1 is a schematic diagram of a federal learning architecture in a federal learning acceleration method according to the present exemplary embodiment;
FIG. 2 is a diagram of a federal learning architecture in a federal learning acceleration method according to an exemplary embodiment;
FIG. 3 illustrates a flow chart of a federal learning acceleration method in accordance with the exemplary embodiment;
FIG. 4 illustrates a flowchart of a federal learning acceleration method in accordance with an exemplary embodiment;
FIG. 5 illustrates a flow chart of a federal learning acceleration method in accordance with the exemplary embodiment;
FIG. 6 illustrates a flowchart of a federal learning acceleration method in accordance with an exemplary embodiment;
FIG. 7 illustrates a flowchart of a federal learning acceleration method in accordance with an exemplary embodiment;
FIG. 8 shows a schematic structural diagram of a federal learning acceleration device in the present exemplary embodiment;
FIG. 9 shows a schematic structural diagram of an electronic device in the present exemplary embodiment.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only and not necessarily all steps are included. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the related art, federal learning enables multiple nodes to build a model jointly from the data each of them holds, while keeping that data secure and legally compliant. It improves the effect of AI models and is widely applied in the fields of big data processing and privacy protection. The processing architecture of federal learning generally includes a modular exponentiation calculation module, a preprocessing module, a Montgomery module, a confusion calculation module, a controller and the like. Different circuit modules on the chip are selected for execution according to the algorithm in use, and the modules cannot all work at the same time. Therefore, the data processing efficiency of existing federal learning processing architectures is low.
In view of the above problems, embodiments of the present disclosure provide a federal learning acceleration method, so as to improve efficiency of data processing in federal learning. The following briefly describes an application environment of the federal learning acceleration method provided by the embodiments of the present disclosure:
Referring to FIG. 1, the federal learning acceleration method provided in the embodiments of the present disclosure is applied to a federal learning acceleration system, which includes an application layer, a computing power scheduling layer, and a hardware layer. The application layer acquires the federal learning requirement and constructs a learning framework corresponding to it, such as the FATE (Federated AI Technology Enabler) framework. The computing power scheduling layer provides the interface between the FPGA hardware circuit and the upper federal learning application layer; it is responsible for the encryption and decryption operators in the federal learning algorithm, such as Paillier operators and RSA operators, and schedules the corresponding operators onto the FPGA hardware circuit for execution. Referring to FIG. 2, the hardware layer includes at least an FPGA (Field-Programmable Gate Array) accelerator and a global memory unit. The FPGA accelerator includes a memory reading module, a memory writing module and a Montgomery modular multiplication module, each provided with a corresponding task control unit for executing the tasks in that module. The task control unit may be a control chip, a microprocessor, etc., and the embodiments of the present disclosure are not particularly limited in this respect. The memory reading module interacts with the global memory: it reads the data to be processed from the global memory, performs preliminary processing to split the data into different data streams, and transmits the data streams to the Montgomery modular multiplication module through different data transmission queues.
The Montgomery modular multiplication module receives each data stream from its data transmission queue, performs modular multiplication calculation on the received data streams in different calculation units, and transmits the calculation results to the memory writing module, which writes them into the global memory, completing the whole data reading and writing process. It should be noted that the global memory is a memory unit outside of, and independent of, the FPGA accelerator and is generally referred to as "off-chip"; correspondingly, the modules in the FPGA accelerator are referred to as "on-chip". The on-chip data transmission rate is far greater than the off-chip data transmission rate.
The following description takes as an example the case in which the federal learning acceleration method is applied to the above federal learning acceleration system to read the data to be processed from the memory unit and write the processed calculation result back into the memory unit. Referring to FIG. 3, the federal learning acceleration method provided by the embodiments of the present disclosure includes the following steps 301 to 303:
Step 301, controlling a memory reading module to read data to be processed from a global memory, and dividing the data to be processed into a plurality of groups of data streams to be processed.
With continued reference to FIG. 2, the memory reading module is an on-chip module belonging to the FPGA accelerator, and the global memory is an off-chip module independent of the FPGA accelerator. The task control unit in the memory reading module automatically reads the required parameters from the global memory. It should be noted that the memory reading module may obtain all the data to be processed from the global memory directly, or may first classify the data to be processed and then receive it through different data transmission queues. After receiving the data, the memory reading module splits it into a plurality of data streams to be processed and transmits them to the Montgomery modular multiplication module through different data transmission queues. For example, in FIG. 2 the memory reading module has four data interfaces, so the data to be processed in the global memory is divided into four data streams, which are read in parallel and written into four different data transmission queues for the Montgomery modular multiplication module to read, which is more efficient.
Step 302, controlling a Montgomery modular multiplication module to acquire each data stream to be processed from the memory reading module, and performing Montgomery modular multiplication calculation on the data in each data stream to be processed to obtain a modular multiplication calculation result.
The task control unit in the Montgomery modular multiplication module automatically reads the required data from each data transmission queue output by the memory reading module, performs Montgomery modular multiplication calculation, and combines the Montgomery calculation results of the data in the same batch to obtain the modular multiplication calculation result of that batch. A batch is the combination of data items taken from the different transmission queues; each data stream is a set of data output by the memory reading module.
Step 303, sending the modular multiplication result to the memory writing module, so that the memory writing module writes the modular multiplication result into the global memory module.
After the Montgomery modular multiplication module obtains the modular multiplication calculation result of one batch of data, each modular multiplication calculation result is sequentially sent to the memory writing module according to the order of obtaining the modular multiplication calculation result, namely, each modular multiplication calculation result forms a data transmission queue between the Montgomery modular multiplication module and the memory writing module. The memory writing module writes the obtained calculation result into the global memory module, so that the reading and writing of the data to be processed can be completed.
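Steps 301 to 303 can be pictured as a three-stage pipeline connected by queues. The following is only a minimal software sketch under stated assumptions: Python threads stand in for the concurrently running on-chip modules, a dictionary stands in for the global memory, and a plain modular multiplication stands in for the hardware Montgomery unit. All names (`reader`, `multiplier`, `writer`, `run_pipeline`, `STOP`) are hypothetical and do not come from the disclosure.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker


def reader(global_memory, out_q):
    # Stand-in for the memory reading module: streams operand pairs out.
    for pair in global_memory["inputs"]:
        out_q.put(pair)
    out_q.put(STOP)


def multiplier(in_q, out_q, modulus):
    # Stand-in for the Montgomery modular multiplication module.
    # A plain modular product is used here instead of a Montgomery unit.
    while (item := in_q.get()) is not STOP:
        x, y = item
        out_q.put((x * y) % modulus)
    out_q.put(STOP)


def writer(in_q, global_memory):
    # Stand-in for the memory writing module: stores results in arrival order.
    while (item := in_q.get()) is not STOP:
        global_memory["outputs"].append(item)


def run_pipeline(pairs, modulus):
    global_memory = {"inputs": list(pairs), "outputs": []}
    q_read = queue.Queue(maxsize=4)   # reading module -> multiplier
    q_write = queue.Queue(maxsize=4)  # multiplier -> writing module
    stages = [
        threading.Thread(target=reader, args=(global_memory, q_read)),
        threading.Thread(target=multiplier, args=(q_read, q_write, modulus)),
        threading.Thread(target=writer, args=(q_write, global_memory)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return global_memory["outputs"]
```

Because the queues are bounded and each stage runs in its own thread, the multiplier can start on the first pair while the reader is still fetching later pairs, which is the overlap of calculation with reading and writing that the method relies on.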
According to the federal learning acceleration method provided by the embodiments of the present disclosure, the memory reading module, the Montgomery modular multiplication module and the memory writing module are arranged on the FPGA chip, and the modules divide the work and cooperate to complete the reading, processing and writing of the data to be processed. The on-chip data transmission rate is higher than the off-chip data transmission rate. The configured memory reading module reads the data to be processed from the global memory in real time, the Montgomery modular multiplication module performs modular multiplication calculation on that data, and the memory writing module writes the resulting modular multiplication calculation result into the global memory. The modular multiplication calculation and the data reading and writing are executed simultaneously, so one module does not have to wait for another to finish before it starts. The data processing efficiency is therefore higher, which solves the technical problem of low data processing efficiency in existing federal learning processing architectures and improves the data processing efficiency of the federal learning processing architecture.
Referring to fig. 4, in an alternative embodiment of the present disclosure, the step 301 of controlling the memory reading module to read the data to be processed from the global memory and divide the data to be processed into a plurality of data streams to be processed includes the following steps 401 to 403:
Step 401, determining the number of calculation units capable of processing data in unit time in the Montgomery modular multiplication module.
The number of calculation units is the number of groups of data that the Montgomery modular multiplication module can process per unit time, or simultaneously, and is determined by the data processing performance of the module. Once the structure of the Montgomery modular multiplication module is fixed, the number of calculation units is fixed as well, so it can be stored in the memory reading module in advance; it can also be obtained through real-time testing and detection. This embodiment is not specifically limited in this respect.
Step 402, the memory reading module is controlled to read the data to be processed from the global memory.
The memory reading module accesses the off-chip memory batch by batch, and the batch size is equal to the number of calculation units.
Step 403, dividing the data to be processed into as many data streams to be processed as there are calculation units.
The memory reading module divides the data to be processed into as many data streams to be processed as there are calculation units. Each data stream contains multiple consecutive groups of data, so that corresponding data can be taken from the different data streams to form one data packet per batch, which is sent to the Montgomery modular multiplication module for calculation.
The embodiment of the present disclosure first determines the number of calculation units capable of processing data per unit time in the Montgomery modular multiplication module, then divides the data to be processed read from the global memory into that number of data streams, and forms each batch of data into one data packet that is sent to the Montgomery modular multiplication module for calculation. This improves the efficiency of the modular multiplication calculation in the Montgomery modular multiplication module and, in turn, the data reading and writing efficiency of the method provided by the embodiment of the present disclosure.
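Steps 401 to 403 can be illustrated with a short sketch. It assumes an interleaved layout, in which stream i receives every element whose index is congruent to i modulo the number of calculation units, so that each batch covers one run of consecutive elements in memory. The disclosure does not spell out the layout, so this choice and the function names `split_into_streams` and `form_batches` are assumptions made for illustration.

```python
def split_into_streams(data, num_units):
    # Assumed layout: stream i holds elements i, i + num_units, i + 2*num_units, ...
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    return [data[i::num_units] for i in range(num_units)]


def form_batches(streams):
    # One batch takes the next element from every stream, so the batch size
    # equals the number of calculation units.
    return list(zip(*streams))
```

With 8 items and 4 calculation units, `split_into_streams(list(range(8)), 4)` gives four streams of two items each, and `form_batches` turns them into two batches of four items, one item per calculation unit.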
Referring to fig. 5, in an alternative embodiment of the present disclosure, the step 402 of controlling the memory reading module to read the data to be processed from the global memory includes the following steps 501 to 502:
Step 501, determining the number of execution batches in which the memory reading module obtains the data to be processed, according to the size of the data to be processed and the number of calculation units of the Montgomery modular multiplication module.
For example, consider implementing a 1024-bit large-number modular multiplication algorithm for federal learning, where the size of the data to be processed is 2048, the number of calculation units of the Montgomery modular multiplication module is 4, and the calculation requirement constructed by the application layer is to compute the modular multiplication Z = X × Y mod M, where X and Y are vectors each formed of 2048 1024-bit large numbers and M is a 1024-bit large number. The computing power scheduling layer resolves this requirement into 4 Montgomery operator executions, which are respectively as follows:
[Formulas (1)-(4) are not reproduced in this text.]
In the above formulas (1)-(4), l is the bit width of M, namely 1024 bits, and X and Y are the data to be processed. The symbols for the remaining quantities are not reproduced in this text: the converted forms of X and Y are the inputs of an intermediate result, and Z is the output, whose input is that intermediate result.
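Because formulas (1)-(4) are not reproduced in this text, the sketch below shows only the textbook way to obtain X × Y mod M with four Montgomery multiplications, which may differ in detail from the formulas of the disclosure. It assumes R = 2^l, an odd modulus M, and operands smaller than M. The function names `mont_mul` and `modmul_via_four_montgomery_calls` are hypothetical, and Python's built-in `pow` is used for the modular inverse.

```python
def mont_mul(a, b, m, l):
    # Montgomery product a * b * R^(-1) mod m, with R = 2**l (REDC reduction).
    # Requires m odd and 0 <= a, b < m.
    r = 1 << l
    m_neg_inv = (-pow(m, -1, r)) % r      # -m^(-1) mod R
    t = a * b
    q = (t * m_neg_inv) % r               # chosen so that t + q*m is divisible by R
    u = (t + q * m) >> l
    return u - m if u >= m else u


def modmul_via_four_montgomery_calls(x, y, m, l):
    r2 = pow(2, 2 * l, m)                 # R^2 mod m, precomputed once per modulus
    x_bar = mont_mul(x, r2, m, l)         # call 1: x * R mod m
    y_bar = mont_mul(y, r2, m, l)         # call 2: y * R mod m
    z_bar = mont_mul(x_bar, y_bar, m, l)  # call 3: x * y * R mod m
    return mont_mul(z_bar, 1, m, l)       # call 4: x * y mod m
```

The first two calls move the operands into Montgomery form, the third multiplies them there, and the fourth converts the intermediate result back, which is consistent with the description of two converted inputs, one intermediate result and one output.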
The computing power scheduling layer first pre-computes the inverse of -M modulo 2^k, denoted -M^(-1), as in the following formula (5):
-M^(-1) = gmpy2.invert(-M, pow(2, 32))    (5)
In formula (5), the modular inverse -M^(-1) is obtained with the gmpy2.invert and pow functions, and it needs to be calculated only once for a given M.
The calculation time of the modular inverse is amortized over all the calculation tasks that share the same M, so it is not a bottleneck of large-number modular multiplication tasks. The modular inverse is therefore calculated by the CPU and the result is transmitted to the FPGA, which reduces the calculation load of the FPGA and further improves the data reading and writing efficiency.
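As a CPU-side illustration of formula (5), the sketch below computes the inverse of -M modulo 2^32 using only Python's built-in three-argument `pow` with exponent -1 (available from Python 3.8), which is swapped in here for `gmpy2.invert` to avoid a third-party dependency. The function name `neg_inverse_mod_word` and the default word size `k=32` are assumptions for illustration.

```python
def neg_inverse_mod_word(m, k=32):
    # Returns n in [0, 2**k) such that (-m) * n is congruent to 1 modulo 2**k.
    # An inverse modulo a power of two exists only for odd m.
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    return pow(-m, -1, 1 << k)
```

The result can be checked directly: multiplying it by M and adding 1 must give a multiple of 2^32. Since the value depends only on M, it can be computed once and reused for every batch that shares the same modulus.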
The computing power scheduling layer calculates the number of execution batches in which the memory reading module obtains the data to be processed as vec_num = 2048/4 = 512, according to the size 2048 of the data to be processed and the number 4 of calculation units of the Montgomery modular multiplication module.
Step 502, sequentially reading data to be processed from the global memory according to the number of execution batches.
After the number of execution batches is obtained, the memory reading module reads the X data stream and the Y data stream in groups of 4 data items, matching the 4 calculation units in the Montgomery modular multiplication module, and binds each group into a batch, obtaining X_vec and Y_vec, which are written into the X transmission queue and the Y transmission queue respectively. M and -M^(-1) are written into the M transmission queue and the -M^(-1) transmission queue respectively, and vec_num is written into the vec_num transmission queue. M, -M^(-1) and vec_num need to be read and written only once, and X_vec and Y_vec need to be read and written only 512 times, which improves the data reading and writing efficiency.
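The batching in steps 501 and 502 can be sketched as follows. This is an illustrative model only: `collections.deque` objects stand in for the hardware transmission queues, and the queue keys and the function name `fill_transmission_queues` are hypothetical.

```python
from collections import deque


def fill_transmission_queues(x, y, m, m_neg_inv, num_units):
    # x and y must have equal length, divisible by the number of calculation units.
    if len(x) != len(y) or len(x) % num_units != 0:
        raise ValueError("operand vectors must match and divide evenly into batches")
    vec_num = len(x) // num_units          # number of execution batches
    queues = {name: deque() for name in ("X", "Y", "M", "M_NEG_INV", "VEC_NUM")}
    # Scalars shared by every batch are written exactly once.
    queues["M"].append(m)
    queues["M_NEG_INV"].append(m_neg_inv)
    queues["VEC_NUM"].append(vec_num)
    # Operands are bound into batches of num_units items (X_vec and Y_vec).
    for b in range(vec_num):
        lo = b * num_units
        queues["X"].append(tuple(x[lo:lo + num_units]))
        queues["Y"].append(tuple(y[lo:lo + num_units]))
    return queues
```

For operand vectors of 2048 items and 4 calculation units, the X and Y queues each receive 512 batches, while the M, -M^(-1) and vec_num queues each receive a single entry, matching the counts given in the example.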
Referring to fig. 6, in an alternative embodiment of the present disclosure, the step 502 of sequentially reading the data to be processed from the global memory according to the number of execution batches includes the following steps 601-602:
Step 601, querying a target memory address corresponding to the current execution batch number from a pre-configured memory address library.
Step 602, reading target to-be-processed data corresponding to the current execution batch number from the target memory address.
Different data are stored at different memory addresses. On entering the data reading stage, the memory address is first queried for each execution batch to determine where the data to be processed in that batch is stored. The execution batches and memory addresses form a correspondence table, such as Table (1) below.
It should be noted that addresses represented by the same symbol in the table occupy the same memory space in the global memory. The embodiment of the present disclosure identifies the pattern of the encryption and decryption algorithms commonly used in federal learning, automatically obtains the number of execution batches of the corresponding algorithm (that is, the number of calls to the Montgomery modular multiplication module), and allocates for each call memory addresses in the global memory that maximize data reuse. This avoids copying data between successive calls to the Montgomery modular multiplication module and improves data processing efficiency.
With the Montgomery modular multiplication module configured as in the embodiment of the present disclosure, the FPGA can run automatically after power-on without waiting to be started by the computing-power scheduling layer, and no interface for interacting with the global memory is needed. This minimizes the on-chip resource overhead of control logic and interfaces, so that more Montgomery modular multiplication calculation units can be implemented in the limited on-chip space, further improving data processing efficiency.
In an alternative embodiment of the present disclosure, the above federal learning acceleration method further includes the following step A:
Step A, if the current target memory address is the same as the historical target memory address corresponding to a historical execution batch number, reading the historical target to-be-processed data corresponding to that historical execution batch number from the memory reading module.
As shown in Table (1), parameter addresses are reused across multiple operator calls. For example, the address of input parameter 1 in the 3rd execution call is the same as the output parameter address of the 1st execution call, which avoids copying and moving the same data in memory. When the memory read-write module detects that an input parameter address in the current call is the same as in the previous call, it skips the memory access for that address and uses the data cached on the FPGA chip directly, which reduces the bandwidth pressure of off-chip memory access. Input parameters 3 and 4 in Table (1) need to be fetched from memory only once, at the 1st execution call; in later calls they are served on chip, which greatly reduces the time spent reading and writing data.
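The address lookup and the skip-on-repeat behavior can be modeled with the small sketch below. The class name, the address labels, and the contents of the address library are hypothetical placeholders and do not reproduce Table (1); only the rule "skip the off-chip access when a parameter's address equals that of the previous call" is taken from the description.

```python
class MemoryReader:
    """Toy model of a reader that skips off-chip reads for unchanged addresses."""

    def __init__(self, global_memory):
        self.global_memory = global_memory  # address -> data (stands in for off-chip memory)
        self.last_address = {}              # parameter slot -> address used by previous call
        self.on_chip = {}                   # parameter slot -> data cached on chip
        self.off_chip_reads = 0

    def read_inputs(self, addresses):
        values = []
        for slot, address in enumerate(addresses):
            if self.last_address.get(slot) != address:
                # Address changed since the previous call: fetch from global memory.
                self.on_chip[slot] = self.global_memory[address]
                self.last_address[slot] = address
                self.off_chip_reads += 1
            values.append(self.on_chip[slot])
        return values

# Hypothetical address library: execution batch number -> input parameter addresses.
ADDRESS_LIBRARY = {
    1: ("addr_x", "addr_y", "addr_m", "addr_minv"),
    2: ("addr_t", "addr_y", "addr_m", "addr_minv"),
    3: ("addr_t", "addr_t", "addr_m", "addr_minv"),
}
GLOBAL_MEMORY = {"addr_x": 11, "addr_y": 22, "addr_t": 55, "addr_m": 33, "addr_minv": 44}

reader = MemoryReader(GLOBAL_MEMORY)
results = [reader.read_inputs(ADDRESS_LIBRARY[batch]) for batch in (1, 2, 3)]
```

In this toy run the first call performs four off-chip reads and each later call performs one, because the addresses in parameter slots 3 and 4 never change after the first call.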
Referring to fig. 7, in an alternative embodiment of the present disclosure, performing Montgomery modular multiplication calculation on the data in each data stream to be processed in step 302 to obtain a modular multiplication calculation result includes the following steps 701-702:
Step 701, performing Montgomery modular multiplication parallel calculation on the data in each data stream to be processed to obtain the sub-modular multiplication calculation result corresponding to each data stream to be processed.
Step 702, sending the sub-modular multiplication calculation results corresponding to the data streams to be processed at the current moment to the memory writing module as a group of modular multiplication calculation results.
The Montgomery modular multiplication module comprises a plurality of calculation units, each of which executes a different calculation. The module takes the data belonging to the same batch, or to the same moment, as one group, performs modular multiplication on them in different calculation units, and combines the resulting sub-modular multiplication calculation results into the modular multiplication calculation result of that group. This makes large numbers of modular multiplication operators more efficient to compute and further improves the data read/write efficiency of the embodiment of the present disclosure.
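Steps 701-702 can be sketched in software as below. A thread pool stands in for the hardware calculation units purely to show the grouping, so it does not reproduce FPGA parallelism. The function names, the reference formula x*y*R^{-1} mod M with R = 2^1024, and the sample modulus and operands are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_UNITS = 4
R_BITS = 1024   # Montgomery radix R = 2^1024 for a 1024-bit modulus

def montgomery_product(x, y, m):
    """Reference Montgomery product x*y*R^-1 mod m (m must be odd)."""
    r_inv = pow(1 << R_BITS, -1, m)
    return (x * y * r_inv) % m

def process_batch(x_vec, y_vec, m, pool):
    """Compute one sub-result per data stream and return them as one group."""
    pairs = zip(x_vec, y_vec)
    # pool.map preserves input order, so result i belongs to stream i.
    return tuple(pool.map(lambda xy: montgomery_product(xy[0], xy[1], m), pairs))

M = (1 << 1023) | 1          # hypothetical odd 1024-bit modulus
X_VEC = (3, 5, 7, 9)         # one batch: one operand per calculation unit
Y_VEC = (2, 4, 6, 8)

with ThreadPoolExecutor(max_workers=NUM_UNITS) as pool:
    group_result = process_batch(X_VEC, Y_VEC, M, pool)
```

The returned tuple plays the role of the group of modular multiplication calculation results that is handed to the memory writing module as a single unit.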
In an alternative embodiment of the present disclosure, before the Montgomery modular multiplication parallel calculation is performed on the data in each data stream to be processed in step 701, the method further includes the following step B:
Step B, converting the current radix of each data stream to be processed into a target radix, wherein the target radix is greater than the current radix.
To reduce the loop count in the Montgomery modular multiplication module, the radix is raised from 2 to 2^k. In this embodiment k = 32 is selected; that is, 33 int32 values are used as the data structure for storing a large integer, 33 × 32 = 1056 bits in total. This is one int32 more than the space required to store 1024 bits, and the extra word is used to handle carry and overflow.
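To illustrate why the larger radix shortens the loop, the sketch below splits a 1024-bit integer into 32-bit words and runs a word-serial Montgomery multiplication that iterates 1024/32 = 32 times instead of 1024 times. It is a simplified software model under stated assumptions: Python integers stand in for the int32 array, and the function names and test operands are hypothetical.

```python
K = 32
WORD_MASK = (1 << K) - 1
MOD_WORDS = 1024 // K          # 32 words hold a 1024-bit value
STORE_WORDS = MOD_WORDS + 1    # 33 words: one spare word for carry/overflow

def to_words(value, count=STORE_WORDS):
    """Split an integer into `count` little-endian K-bit words."""
    return [(value >> (K * i)) & WORD_MASK for i in range(count)]

def from_words(words):
    """Reassemble an integer from little-endian K-bit words."""
    return sum(w << (K * i) for i, w in enumerate(words))

def mont_mul_words(x, y, m, neg_m_inv):
    """Return x*y*2^(-1024) mod m for x, y < m, using MOD_WORDS iterations."""
    t = 0
    for xi in to_words(x, MOD_WORDS):
        t += xi * y
        q = ((t & WORD_MASK) * neg_m_inv) & WORD_MASK   # makes t + q*m divisible by 2^K
        t = (t + q * m) >> K
    return t - m if t >= m else t

M = (1 << 1023) | 0x1234567                    # hypothetical odd 1024-bit modulus
NEG_M_INV = pow(-M % (1 << K), -1, 1 << K)     # inverse of -M modulo 2^32
X = (1 << 1000) + 12345
Y = (1 << 900) + 6789
Z = mont_mul_words(X, Y, M, NEG_M_INV)
```

In this model the running value after each shift stays below 2M, so it fits within the 33-word layout described above, and a single conditional subtraction brings the final result below M.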
Referring to fig. 8, in order to implement the above federal learning acceleration method, one embodiment of the present disclosure provides a federal learning acceleration device. Fig. 8 shows a schematic architecture diagram of a federal learning acceleration device 800, comprising: a first control module 810, a second control module 820, and a result processing module 830, wherein:
the first control module 810 is configured to control the memory read module to read data to be processed from the global memory, and divide the data to be processed into a plurality of groups of data streams to be processed;
the second control module 820 is configured to control the montgomery modular multiplication module to obtain each data stream to be processed from the memory reading module, and perform montgomery modular multiplication calculation on the data in each data stream to be processed to obtain a modular multiplication calculation result;
the result processing module 830 is configured to send the modular multiplication result to the memory writing module, so that the memory writing module writes the modular multiplication result to the global memory module.
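The data flow among the three modules can be sketched as a chain of Python generators. This is only a software analogy: the function names, the dictionary standing in for the global memory, and the small modulus and radix are assumptions chosen to keep the example readable.

```python
def memory_read(global_memory, num_units):
    """Read operands and yield them as batches of num_units (the data streams)."""
    xs, ys = global_memory["X"], global_memory["Y"]
    for lo in range(0, len(xs), num_units):
        yield xs[lo:lo + num_units], ys[lo:lo + num_units]

def montgomery_module(batches, m, r_bits):
    """Yield one group of Montgomery products x*y*R^-1 mod m per batch."""
    r_inv = pow(1 << r_bits, -1, m)
    for x_vec, y_vec in batches:
        yield [(x * y * r_inv) % m for x, y in zip(x_vec, y_vec)]

def memory_write(global_memory, result_groups):
    """Append each group of results to the output region of global memory."""
    out = global_memory.setdefault("Z", [])
    for group in result_groups:
        out.extend(group)

# Small placeholder values; a real deployment would use 1024-bit operands.
M, R_BITS = 1000003, 32
global_memory = {"X": [3, 5, 7, 9, 11, 13, 15, 17],
                 "Y": [2, 4, 6, 8, 10, 12, 14, 16]}
memory_write(global_memory,
             montgomery_module(memory_read(global_memory, 4), M, R_BITS))
```

Because the stages are generators, each batch passes through reading, multiplication and writing before the next batch is read, which loosely mirrors the streaming hand-off between the modules.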
In an alternative embodiment, the first control module 810 is specifically configured to: determine the number of computing units in the Montgomery modular multiplication module that can process data per unit time; control the memory reading module to read data to be processed from the global memory; and divide the data to be processed into as many data streams to be processed as there are computing units.
In an alternative embodiment, the first control module 810 is specifically configured to determine the number of execution batches of the data to be processed obtained by the memory read module according to the size of the capacity of the data to be processed and the number of calculation units of the montgomery modular multiplication module; and sequentially reading the data to be processed from the global memory according to the number of execution batches.
In an alternative embodiment, the first control module 810 is specifically configured to query, from a pre-configured memory address library, a target memory address corresponding to the number of current execution batches; and reading target to-be-processed data corresponding to the current execution batch number from the target memory address.
In an alternative embodiment, the first control module 810 is further configured to read the historical target pending data corresponding to the number of historical execution lots from the memory reading module if the current target memory address is the same as the historical target memory address corresponding to the number of historical execution lots.
In an alternative embodiment, the second control module 820 is specifically configured to perform Montgomery modular multiplication parallel calculation on the data in each data stream to be processed to obtain the sub-modular multiplication calculation result corresponding to each data stream to be processed, and to send the sub-modular multiplication calculation results corresponding to the data streams to be processed at the current moment to the memory writing module as a group of modular multiplication calculation results.
In an alternative embodiment, the first control module 810 is further configured to convert the current radix of each data stream to be processed into a target radix, wherein the target radix is greater than the current radix.
Exemplary embodiments of the present disclosure also provide a computer readable storage medium, which may be implemented in the form of a program product comprising program code for causing an electronic device to carry out the steps according to the various exemplary embodiments of the disclosure described in the "exemplary method" section above when the program product is run on the electronic device. In one embodiment, the program product may be implemented as a portable compact disc read only memory (CD-ROM) that includes program code and may be run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider). In the embodiments of the present disclosure, any of the steps in the above federal learning acceleration method may be implemented when the program code stored in the computer-readable storage medium is executed.
Referring to fig. 9, the exemplary embodiment of the present disclosure further provides an electronic device 900, which may be a background server of the information platform. The electronic device 900 is described below with reference to fig. 9. It should be understood that the electronic device 900 shown in fig. 9 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, and a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910).
The storage unit 920 stores program code that is executable by the processing unit 910, such that the processing unit 910 performs the steps according to the various exemplary embodiments of the present invention described in the "exemplary method" section of this specification. For example, the processing unit 910 may perform the method steps shown in fig. 3, etc.
The storage unit 920 may include volatile storage units such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read only storage unit (ROM) 923.
The storage unit 920 may also include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 930 may include a data bus, an address bus, and a control bus.
The electronic device 900 may also communicate with one or more external devices 2000 (e.g., keyboard, pointing device, bluetooth device, etc.) via an input/output (I/O) interface 940. Electronic device 900 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet through network adapter 950. As shown, the network adapter 950 communicates with other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In the embodiment of the present disclosure, any step in the above federal learning acceleration method may be implemented when the program code stored in the electronic device is executed.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with exemplary embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit," "module," or "system." Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A federal learning acceleration method, comprising:
the method comprises the steps of controlling a memory reading module to read data to be processed from a global memory, and dividing the data to be processed into a plurality of groups of data streams to be processed;
controlling a Montgomery modular multiplication module to acquire each data stream to be processed from the memory reading module, and performing Montgomery modular multiplication calculation on data in each data stream to be processed to obtain a modular multiplication calculation result;
and sending the modular multiplication calculation result to a memory writing module so that the memory writing module writes the modular multiplication calculation result into the global memory module.
2. The federal learning acceleration method of claim 1, wherein the controlling the memory read module to read the data to be processed from the global memory and divide the data to be processed into a plurality of sets of data streams to be processed comprises:
determining the number of calculation units capable of processing data in unit time in the Montgomery modular multiplication module;
controlling a memory reading module to read data to be processed from a global memory;
dividing the data to be processed into as many data streams to be processed as there are computing units.
3. The federal learning acceleration method according to claim 2, wherein the controlling the memory reading module to read the data to be processed from the global memory includes:
determining the execution batch times of the memory reading module for obtaining the data to be processed according to the capacity of the data to be processed and the calculation unit number of the Montgomery modular multiplication module;
and reading the data to be processed from the global memory according to the execution batch number.
4. The federal learning acceleration method according to claim 3, wherein the sequentially reading the data to be processed from the global memory according to the execution lot number includes:
inquiring a target memory address corresponding to the current execution batch number from a pre-configured memory address library;
and reading target to-be-processed data corresponding to the current execution batch number from the target memory address.
5. The federal learning acceleration method of claim 4, further comprising:
and if the current target memory address is the same as the historical target memory address corresponding to the historical execution batch number, reading the historical target to-be-processed data corresponding to the historical execution batch number from the memory reading module.
6. The method of claim 4, wherein performing montgomery modular multiplication on the data in each of the data streams to be processed to obtain a modular multiplication result comprises:
carrying out Montgomery modular multiplication parallel calculation on the data in each data stream to be processed to respectively obtain sub-modular multiplication calculation results corresponding to each data stream to be processed;
and sending the sub-modular multiplication calculation results corresponding to the data streams to be processed at the current moment to the memory writing module as a group of modular multiplication calculation results.
7. The federal learning acceleration method of claim 6, further comprising, prior to the montgomery modular multiplication parallel computing of the data in each of the pending data streams:
converting the current radix of each data stream to be processed into a target radix, wherein the target radix is greater than the current radix.
8. A federal learning acceleration device, the device comprising:
the first control module is used for controlling the memory reading module to read data to be processed from the global memory and dividing the data to be processed into a plurality of groups of data streams to be processed;
the second control module is used for controlling the Montgomery modular multiplication module to acquire all the data streams to be processed from the memory reading module, and carrying out Montgomery modular multiplication calculation on the data in all the data streams to be processed to obtain a modular multiplication calculation result;
and the result processing module is used for sending the modular multiplication calculation result to the memory writing module so that the memory writing module can write the modular multiplication calculation result into the global memory module.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 7 via execution of the executable instructions.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210664921.2A CN117273162A (en) 2022-06-13 2022-06-13 Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210664921.2A CN117273162A (en) 2022-06-13 2022-06-13 Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117273162A true CN117273162A (en) 2023-12-22

Family

ID=89203190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210664921.2A Pending CN117273162A (en) 2022-06-13 2022-06-13 Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117273162A (en)

Similar Documents

Publication Publication Date Title
US9454472B2 (en) Sparsity-driven matrix representation to optimize operational and storage efficiency
US20120166771A1 (en) Agile communication operator
WO2007143122A1 (en) Virtual machine for operating n-core application on m-core processor
US11010056B2 (en) Data operating method, device, and system
US11863469B2 (en) Utilizing coherently attached interfaces in a network stack framework
KR20110028211A (en) Autonomous memory architecture
US20230251979A1 (en) Data processing method and apparatus of ai chip and computer device
US20220027432A1 (en) System, method and computer program product for dense/sparse linear system solver accelerator
US20220405114A1 (en) Method, device and computer program product for resource scheduling
US9473572B2 (en) Selecting a target server for a workload with a lowest adjusted cost based on component values
Sontakke et al. Optimization of hadoop mapreduce model in cloud computing environment
US11100123B2 (en) Sharing intermediate data in map-reduce
CN112799851A (en) Data processing method and related device in multi-party security computing
CN117273162A (en) Federal learning acceleration method, federal learning acceleration device, storage medium and electronic equipment
US8892807B2 (en) Emulating a skip read command
CN111681093B (en) Method and device for displaying resource page and electronic equipment
US11481255B2 (en) Management of memory pages for a set of non-consecutive work elements in work queue designated by a sliding window for execution on a coherent accelerator
US11288046B2 (en) Methods and systems for program optimization utilizing intelligent space exploration
US9176910B2 (en) Sending a next request to a resource before a completion interrupt for a previous request
CN113177211A (en) FPGA chip for privacy computation, heterogeneous processing system and computing method
CN111045959A (en) Complex algorithm variable mapping method based on storage optimization
CN111290701A (en) Data read-write control method, device, medium and electronic equipment
CN116880800A (en) Data processing method, device, central processing unit and system
CN116955490A (en) Data partitioning method, device, equipment and storage medium
RU123996U1 (en) PARALLEL FLOW COMPUTING SYSTEM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination