WO2023108801A1 - Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium - Google Patents

Data processing method based on CPU-GPU heterogeneous architecture, device and storage medium

Info

Publication number
WO2023108801A1
WO2023108801A1 · PCT/CN2021/141312
Authority
WO
WIPO (PCT)
Prior art keywords
stage
data
multiexp
gpu
data processing
Prior art date
Application number
PCT/CN2021/141312
Other languages
French (fr)
Chinese (zh)
Inventor
鲁真妍
杨永魁
喻之斌
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2023108801A1 publication Critical patent/WO2023108801A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of zero-knowledge proof, in particular to a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.
  • Zero-knowledge proof means that the prover can convince the verifier that a statement is correct without revealing any useful information to the verifier. Therefore, problems such as data security and privacy leakage can be solved.
  • current zero-knowledge proofs require a huge amount of computation in the process of generating proofs, and the time and economic costs limit their application.
  • in the homogeneous (CPU-only) computing mode, the CPU cannot meet the intensive computing requirements of zero-knowledge proofs. For example, in a distributed storage system project, blocks must be sealed by completing a zero-knowledge proof and submitted to the chain; using only the CPU, the time to mine a block far exceeds the specified block time, so the block becomes invalid in the blockchain.
  • the GPU not only has powerful floating-point calculation capabilities, but is also suitable for parallel computing of large-scale data. If CPU-GPU heterogeneous computing is used, the efficiency of zero-knowledge proof can be greatly improved. However, on the CPU-GPU heterogeneous architecture, how to coordinate the relationship between devices of different architectures so that they can reach the maximum utilization rate and form the most efficient system is more complicated than the case of the isomorphic method.
  • the present application provides a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.
  • the application provides a data processing method based on CPU-GPU heterogeneous architecture, the data processing method comprising:
  • the calculation task of the zero-knowledge proof is divided into three stages, which are respectively the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the different input data;
  • the output data of the FFT stage is input to the MULTIEXP A stage, and the MULTIEXP A stage outputs the first certification information;
  • a final certification result is generated by combining the first certification information, the second certification information and the third certification information.
  • the data input into the FFT stage is divided into several sub-data
  • while the second piece of sub-data is processed through the FFT stage, the first piece of output sub-data is transmitted to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data are completed.
  • the data processing method also includes:
  • during the CPU's data preprocessing for the FFT stage and the MULTIEXP stage, the reading of reusable parameters is parallelized in advance.
  • the data processing method also includes:
  • the amount of data processed by a single GPU is determined based on the theoretical maximum amount of data processed.
  • calculating the theoretical maximum data processing capacity of a single GPU task includes: obtaining the total video memory of the GPU; calculating the first video memory occupied by threads in the GPU; obtaining the remaining video memory of the GPU based on the difference between the total video memory and the first video memory; and obtaining the theoretical maximum data processing capacity based on the ratio of the remaining video memory to the data volume of the input data.
  • calculating the first video memory occupied by threads in the GPU includes: obtaining the total number of threads of the GPU; calculating the second video memory of one thread based on the storage unit size and the number of storage units in one thread; and calculating the first video memory based on the total number of threads and the second video memory.
  • the data processing method also includes: increasing the total number of threads of the GPU, and increasing the amount of data processed by a single GPU based on the increased total number of threads, so that the number of CPU-GPU transfers is reduced.
  • the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable calculations.
  • the present application also provides a terminal device, where the terminal device includes a memory and a processor, wherein the memory is coupled to the processor;
  • the memory is used to store program data
  • the processor is used to execute the program data to implement the above data processing method.
  • the present application also provides a computer storage medium, the computer storage medium is used to store program data, and the program data is used to implement the above data processing method when executed by a processor.
  • the terminal device obtains the calculation task of the zero-knowledge proof; the calculation task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data; the data of the zero-knowledge proof is input into the SYNTHESIZE stage for processing, and the output data of the SYNTHESIZE stage is input into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively; the output data of the FFT stage is input into the MULTIEXP A stage, which outputs the first proof information; the MULTIEXP B stage and the MULTIEXP C stage are processed in parallel and output the second proof information and the third proof information respectively; and the final proof result is generated by combining the first, second and third proof information.
  • Fig. 1 is a schematic flow chart of an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 2 is a schematic diagram of the zero-knowledge proof calculation data flow based on the CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 3 shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1;
  • FIG. 4 is a schematic flowchart of another embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 5 is a schematic flow chart of the serial execution of the data processing method in the prior art
  • Fig. 6 is a schematic flow diagram of parallel execution of the data processing method provided by the present application.
  • Fig. 7 is the total execution time under different optimization schemes provided by this application.
  • Fig. 8 is the execution time of the MULTIEXP stage under different optimization schemes provided by the present application.
  • FIG. 9 is a schematic structural diagram of an embodiment of a terminal device provided by the present application.
  • Fig. 10 is a schematic structural diagram of an embodiment of a computer storage medium provided by the present application.
  • this application assigns the CPU to logic control and data preprocessing and the GPU to intensive, parallelizable computation, and proposes a zero-knowledge proof performance optimization method that removes the application obstacles caused by performance problems and accelerates the adoption of zero-knowledge proof technology in application scenarios.
  • Figure 1 is a schematic flow chart of an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by this application.
  • Figure 2 is a schematic diagram of the corresponding zero-knowledge proof computing data flow on the CPU-GPU heterogeneous architecture.
  • the data processing method based on the CPU-GPU heterogeneous architecture of the embodiment of the present application specifically includes the following steps:
  • Step S11 Obtain the calculation task of zero-knowledge proof.
  • Step S12 Divide the calculation task of the zero-knowledge proof into three stages, namely the SYNTHESIZE (circuit generation) stage, the FFT (Fast Fourier Transform) stage, and the MULTIEXP (large-number multiply-add) stage, wherein the MULTIEXP stage is divided, according to the input data, into the MULTIEXP A stage (large-number multiply-add A), the MULTIEXP B stage (large-number multiply-add B) and the MULTIEXP C stage (large-number multiply-add C).
  • the present application proposes a parallel execution solution for computing parallelization based on the CPU-GPU heterogeneous architecture.
  • this application divides the calculation of zero-knowledge proof into three stages: SYNTHESIZE stage, FFT stage, and MULTIEXP stage.
  • the calculation of the MULTIEXP stage can be divided into three parts, A, B and C, according to the input data, namely the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage.
  • the output data of the SYNTHESIZE stage is divided into three parts, one part is used as the input of the FFT stage, and the other two parts are used as the input of the MULTIEXP B stage and the MULTIEXP C stage respectively.
  • the output of the FFT stage will serve as the input to the MULTIEXP A stage.
  • the outputs of the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage together form the final proof PROOF.
  • Step S13 Input the zero-knowledge proof data into the SYNTHESIZE stage for processing, and input the output data of the SYNTHESIZE stage into the FFT stage, MULTIEXP B stage and MULTIEXP C stage respectively.
  • the terminal device inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing, and inputs the output data of the SYNTHESIZE stage into the FFT stage, MULTIEXP B stage, and MULTIEXP C stage in parallel for data processing.
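The fan-out described above can be sketched as follows. This is an illustrative sketch only: `synthesize`, `fft_stage` and `multiexp` are hypothetical placeholders standing in for the patent's actual circuit-generation, FFT and multi-exponentiation computations.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(witness):
    # Placeholder for the SYNTHESIZE (circuit generation) stage: here it
    # simply splits the witness into the three per-stage inputs.
    return witness[0::3], witness[1::3], witness[2::3]

def fft_stage(data):
    # Placeholder for the FFT stage.
    return [x * 2 for x in data]

def multiexp(part, data):
    # Placeholder for the MULTIEXP computation on one input part.
    return (part, sum(data))

def prove(witness):
    a, b, c = synthesize(witness)
    # Fan the SYNTHESIZE outputs out to three consumers in parallel:
    # FFT -> MULTIEXP A on one worker, MULTIEXP B and C on the others.
    with ThreadPoolExecutor(max_workers=3) as pool:
        fut_a = pool.submit(lambda: multiexp("A", fft_stage(a)))
        fut_b = pool.submit(multiexp, "B", b)
        fut_c = pool.submit(multiexp, "C", c)
        return [fut_a.result(), fut_b.result(), fut_c.result()]

proof = prove(list(range(6)))
```

The key point is only the dependency structure: B and C consume SYNTHESIZE output directly, while A waits for the FFT result.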
  • this application sets the CPU in charge of logic control and data preprocessing in the zero-knowledge proof process, so as to accelerate disk IO and data preprocessing.
  • during data preprocessing, the same parameters are read repeatedly, so the CPU's preprocessing for the FFT stage and the MULTIEXP stage takes a long time.
  • the CPU can parallelize the reading of reusable parameters in advance, reducing the constant reading and calling of reusable parameters.
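A minimal sketch of this prefetching idea, assuming a hypothetical `loader` callable that stands in for reading the reusable proving parameters from disk (the parameter contents below are invented for illustration):

```python
import threading

class ParameterCache:
    """Loads reusable parameters on a background thread so that they are
    already in memory by the time FFT/MULTIEXP preprocessing needs them."""

    def __init__(self, loader):
        self._loader = loader
        self._params = None
        self._thread = threading.Thread(target=self._load)
        self._thread.start()          # start reading in advance, in parallel

    def _load(self):
        self._params = self._loader()

    def get(self):
        self._thread.join()           # block only if loading has not finished
        return self._params

# Hypothetical loader; in practice this would read parameter files from disk.
cache = ParameterCache(lambda: {"g1_bases": [1, 2, 3]})
# ... SYNTHESIZE work would happen here, overlapping with the load ...
params = cache.get()
```

The load overlaps with earlier pipeline work, so repeated re-reading of the same parameters is avoided.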
  • Step S14 Input the output data of the FFT stage into the MULTIEXP A stage, and the MULTIEXP A stage outputs the first certification information.
  • the terminal device inputs the output of the SYNTHESIZE stage into the FFT stage; the FFT stage processes it and passes the result to the MULTIEXP A stage for further processing.
  • FIG. 3 shows the specific sub-steps of step S14 of the data processing method shown in FIG. 1.
  • the data processing method provided in the embodiment of the present application also includes:
  • Step S141 Divide the data input into the FFT stage into several sub-data.
  • the FFT stage and the MULTIEXP A stage both have the characteristic that their data can be divided into several parts, each of which is computed independently.
  • the CPU-GPU heterogeneous architecture can therefore execute the FFT stage and the MULTIEXP A stage as a two-stage pipeline.
  • Step S142 Process the first sub-data through the FFT stage to obtain the first output sub-data.
  • Step S143 While processing the second sub-data through the FFT stage, transmit the first output sub-data to the MULTIEXP A stage for data processing until the processing and transmission of all sub-data is completed.
  • both the FFT stage and the MULTIEXP A stage divide the data to be processed into N parts.
  • as soon as the thread computing the FFT stage outputs the x-th piece of data, that piece is immediately passed to the thread computing the MULTIEXP A stage, which processes it while the FFT stage starts processing the (x+1)-th piece of data.
  • in this way, the data processing efficiency of the FFT stage and the MULTIEXP A stage can be further improved, and can even match the processing speed of the MULTIEXP B stage and the MULTIEXP C stage.
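The two-stage pipeline described above can be sketched with a bounded queue. `fft` and `multiexp_a` are placeholder callables, not the real stage implementations:

```python
import queue
import threading

def pipeline(chunks, fft, multiexp_a):
    """Two-stage pipeline: while the FFT thread works on chunk x+1,
    the MULTIEXP A thread consumes chunk x."""
    q = queue.Queue(maxsize=1)   # hand-off buffer between the two stages
    results = []

    def producer():
        for chunk in chunks:
            q.put(fft(chunk))    # pass chunk x downstream immediately
        q.put(None)              # sentinel: all N chunks processed

    def consumer():
        while (item := q.get()) is not None:
            results.append(multiexp_a(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

# Toy stand-ins: "FFT" doubles each element, "MULTIEXP A" sums a chunk.
out = pipeline([[1, 2], [3, 4]], lambda c: [x * 2 for x in c], sum)
```

With real stage costs, the total time approaches max(FFT, MULTIEXP A) per chunk instead of their sum.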
  • Step S15 Process the MULTIEXP B phase and the MULTIEXP C phase in parallel, and output the second certification information and the third certification information respectively.
  • since the MULTIEXP B stage and the MULTIEXP C stage each receive their inputs directly from the SYNTHESIZE stage, these two stages can be directly parallelized and computed simultaneously.
  • Step S16 Combine the first certification information, the second certification information and the third certification information to generate a final certification result.
  • the terminal device combines the first proof information calculated in the MULTIEXP A phase, the second proof information calculated in the MULTIEXP B phase, and the third proof information calculated in the MULTIEXP C phase to generate the final proof result PROOF.
  • in summary, the terminal device obtains the calculation task of the zero-knowledge proof, and the task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, the MULTIEXP stage being further divided into the A, B and C stages according to the input data.
  • the data processing method of the present application proposes a zero-knowledge proof performance optimization method to solve application obstacles caused by performance problems and accelerate the implementation of zero-knowledge proof technology in application scenarios.
  • FIG. 4 is a schematic flowchart of another embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application.
  • this embodiment optimizes the performance of the CPU-GPU heterogeneous architecture by reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads, improving the utilization efficiency of the architecture.
  • the data processing method provided in the embodiment of the present application includes:
  • Step S21 Calculate the theoretical maximum data processing capacity of a single GPU task.
  • the terminal device can calculate the theoretical maximum data processing capacity d max of a single GPU task through the following formula:
  • mem is the memory size of the GPU
  • i is the ratio of the actual number of threads to the maximum number of parallel threads of the GPU
  • cores is the number of stream processors in the GPU.
  • k_size is the size of one piece of scalar data;
  • p_size is the size of one piece of vector data. Subtracting the video memory occupied by the buckets of all GPU threads from the total video memory and dividing the remainder by the size of one piece of input data yields the theoretical maximum data processing capacity of a single GPU task.
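A hedged reconstruction of the computation just described, using the symbols defined above. The exact formula is not reproduced in this text, so the per-thread bucket size and all numeric values below are assumptions chosen purely for illustration:

```python
def max_batch_size(mem, i, cores, bucket_mem, k_size, p_size):
    """Hedged reconstruction of d_max: subtract the video memory held by
    the per-thread buckets of all i * cores threads from the total video
    memory, then divide the remainder by the size of one input element
    (one scalar k plus one point p)."""
    threads = i * cores                      # actual number of parallel threads
    remaining = mem - threads * bucket_mem   # video memory left for input data
    return remaining // (k_size + p_size)

# Illustrative numbers only (not from the patent): 8 GiB card, i = 2,
# 2048 stream processors, 256 buckets of 96 bytes per thread,
# 32-byte scalars and 96-byte curve points.
d_max = max_batch_size(mem=8 * 1024**3, i=2, cores=2048,
                       bucket_mem=256 * 96, k_size=32, p_size=96)
```

The structure (total memory, minus fixed per-thread overhead, divided by per-element input size) is what the surrounding text specifies; the actual filing gives the precise formula.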
  • Step S22 Determine the amount of data processed by a single GPU based on the theoretical maximum amount of data processed.
  • the CPU-GPU heterogeneous architecture sets the amount of data processed by a single GPU according to the calculated theoretical maximum data processing capacity, which minimizes the number of transfers and improves processing efficiency.
  • the CPU-GPU heterogeneous architecture can also increase i, for example from 2 to 4, thereby increasing the actual number of threads, raising the amount of data processed by a single GPU, and further reducing the number of transfers.
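The effect on the number of transfers can be illustrated as follows. The figures are invented for illustration, and modeling the increase of i from 2 to 4 as a doubling of d_max is a simplification (in reality the extra threads also consume bucket memory):

```python
import math

def num_transfers(total, d_max):
    """Each GPU task moves at most d_max elements, so pushing the whole
    input requires ceil(total / d_max) CPU-to-GPU transfers."""
    return math.ceil(total / d_max)

# Illustrative: doubling i (2 -> 4) is modeled here as doubling the
# per-task batch size d_max.
before = num_transfers(100_000_000, d_max=16_000_000)   # i = 2
after = num_transfers(100_000_000, d_max=32_000_000)    # i = 4
```

Fewer, larger transfers amortize the fixed PCIe launch and copy overhead, which is the optimization the text describes.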
  • in summary, this application optimizes the performance of zero-knowledge proof on the CPU-GPU heterogeneous architecture, improving it through the three optimizations described above: parallel execution of the stages, parallelized reading of reusable parameters, and reduction of the number of CPU-GPU transfers.
  • the order in which the steps are written does not imply a strict execution order, nor does it constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • FIG. 9 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
  • the terminal device 500 in this embodiment of the present application includes a memory 51 and a processor 52, where the memory 51 and the processor 52 are coupled.
  • the memory 51 is used to store program data
  • the processor 52 is used to execute the program data to implement the data processing method based on the CPU-GPU heterogeneous architecture described in the above embodiments.
  • the processor 52 may also be referred to as a CPU (Central Processing Unit).
  • the processor 52 may be an integrated circuit chip with signal processing capability.
  • the processor 52 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the general purpose processor may be a microprocessor or the processor 52 may be any conventional processor or the like.
  • the present application also provides a computer storage medium.
  • the computer storage medium 600 is used to store program data 61.
  • when the program data 61 is executed by a processor, it is used to implement the data processing method based on the CPU-GPU heterogeneous architecture described above.
  • the present application also provides a computer program product, wherein the computer program product includes a computer program, and the computer program is operable to cause a computer to execute the data processing method based on the CPU-GPU heterogeneous architecture described in the embodiment of the present application.
  • the computer program product may be a software installation package.
  • when the data processing method based on the CPU-GPU heterogeneous architecture described in the above embodiments is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

The present application provides a data processing method based on a CPU-GPU heterogeneous architecture, a device and a storage medium. The data processing method comprises: obtaining a calculation task of a zero-knowledge proof; inputting data of the zero-knowledge proof into a SYNTHESIZE stage for processing, and respectively inputting the output data of the SYNTHESIZE stage into an FFT stage, a MULTIEXP B stage and a MULTIEXP C stage; inputting the output data of the FFT stage into a MULTIEXP A stage, which outputs first proof information; processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, which respectively output second proof information and third proof information; and generating a final proof result by combining the first, second and third proof information. In this way, the data processing method of the present application resolves the application hindrance caused by performance problems by providing a zero-knowledge proof performance optimization scheme, and accelerates the adoption of zero-knowledge proof technology in application scenarios.

Description

Data Processing Method, Device and Storage Medium Based on CPU-GPU Heterogeneous Architecture

Technical Field

The present application relates to the technical field of zero-knowledge proof, and in particular to a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.

Background Art

A zero-knowledge proof means that a prover can convince a verifier that a statement is correct without revealing any useful information to the verifier, which makes it possible to solve problems such as data security and privacy leakage.

At present, zero-knowledge proofs require a huge amount of computation to generate a proof, and the time and economic costs limit their application. In the homogeneous computing mode, the CPU cannot meet the intensive computing requirements of zero-knowledge proofs. For example, in a distributed storage system project, blocks must be sealed by completing a zero-knowledge proof and submitted to the chain; using only the CPU, the time to mine a block far exceeds the specified block time, so the block becomes invalid in the blockchain. The GPU not only has powerful floating-point computing capability but is also suited to parallel computing over large-scale data. If CPU-GPU heterogeneous computing is used, the efficiency of zero-knowledge proof can be greatly improved. However, on the CPU-GPU heterogeneous architecture, coordinating devices of different architectures so that each reaches its maximum utilization and the system as a whole is most efficient is more complicated than in the homogeneous case.
Summary of the Invention

The present application provides a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.

The present application provides a data processing method based on a CPU-GPU heterogeneous architecture, the data processing method comprising:

obtaining the calculation task of a zero-knowledge proof;

dividing the calculation task of the zero-knowledge proof into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data;

inputting the data of the zero-knowledge proof into the SYNTHESIZE stage for processing, and inputting the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively;

inputting the output data of the FFT stage into the MULTIEXP A stage, the MULTIEXP A stage outputting first proof information;

processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, outputting second proof information and third proof information respectively; and

generating a final proof result by combining the first proof information, the second proof information and the third proof information.
Wherein, the data input into the FFT stage is divided into several pieces of sub-data;

the first piece of sub-data is processed through the FFT stage to obtain the first piece of output sub-data; and

while the second piece of sub-data is processed through the FFT stage, the first piece of output sub-data is transmitted to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data are completed.
Wherein, the data processing method further comprises:

during the CPU's data preprocessing for the FFT stage and the MULTIEXP stage, parallelizing the reading of reusable parameters in advance.
Wherein, the data processing method further comprises:

calculating the theoretical maximum data processing capacity of a single GPU task; and

determining the amount of data processed by a single GPU based on the theoretical maximum data processing capacity.
Wherein, calculating the theoretical maximum data processing capacity of a single GPU task comprises:

obtaining the total video memory of the GPU;

calculating the first video memory occupied by threads in the GPU;

obtaining the remaining video memory of the GPU based on the difference between the total video memory and the first video memory; and

obtaining the theoretical maximum data processing capacity based on the ratio of the remaining video memory of the GPU to the data volume of the input data.
Wherein, calculating the first video memory occupied by threads in the GPU comprises:

obtaining the total number of threads of the GPU;

calculating the second video memory of one thread based on the storage unit size and the number of storage units in one thread of the GPU; and

calculating the first video memory based on the total number of threads and the second video memory.
Wherein, the data processing method further comprises:

increasing the total number of threads of the GPU; and

increasing the amount of data processed by a single GPU based on the increased total number of threads of the GPU, so that the number of GPU transfers is reduced.
Wherein, in the CPU-GPU heterogeneous architecture, the CPU is responsible for logic control and data preprocessing, and the GPU is responsible for intensive and parallelizable computation.
The present application further provides a terminal device, the terminal device comprising a memory and a processor, wherein the memory is coupled to the processor;

the memory is configured to store program data, and the processor is configured to execute the program data to implement the above data processing method.

The present application further provides a computer storage medium for storing program data which, when executed by a processor, implements the above data processing method.
The beneficial effects of the present application are as follows: the terminal device obtains the calculation task of the zero-knowledge proof; the calculation task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data; the data of the zero-knowledge proof is input into the SYNTHESIZE stage for processing, and the output data of the SYNTHESIZE stage is input into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively; the output data of the FFT stage is input into the MULTIEXP A stage, which outputs the first proof information; the MULTIEXP B stage and the MULTIEXP C stage are processed in parallel and output the second proof information and the third proof information respectively; and the final proof result is generated by combining the first, second and third proof information. In this way, the data processing method of the present application resolves the application obstacles caused by performance problems by proposing a zero-knowledge proof performance optimization scheme, and accelerates the adoption of zero-knowledge proof technology in application scenarios.
Description of Drawings

In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
Fig. 1 is a schematic flowchart of an embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
Fig. 2 is a schematic diagram of the zero-knowledge proof computation data flow on the CPU-GPU heterogeneous architecture provided by the present application;
Fig. 3 shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1;
Fig. 4 is a schematic flowchart of another embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
Fig. 5 is a schematic flowchart of the serial execution of a prior-art data processing method;
Fig. 6 is a schematic flowchart of the parallel execution of the data processing method provided by the present application;
Fig. 7 shows the total execution time under the different optimization schemes provided by the present application;
Fig. 8 shows the execution time of the MULTIEXP stage under the different optimization schemes provided by the present application;
Fig. 9 is a schematic structural diagram of an embodiment of the terminal device provided by the present application;
Fig. 10 is a schematic structural diagram of an embodiment of the computer storage medium provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Zero-knowledge proofs involve complex algorithms, huge data volumes, and heavy computation, and the CPU-GPU heterogeneous architecture introduces system complexity of its own. To solve the above-mentioned problem of low utilization of the CPU-GPU heterogeneous architecture, the present application arranges the implementation on the CPU-GPU heterogeneous architecture so that the CPU is responsible for logic control and data preprocessing while the GPU is responsible for dense, parallelizable computation, and on this basis proposes a zero-knowledge proof performance optimization method that removes the application obstacles caused by performance problems and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Please refer to Fig. 1 and Fig. 2, where Fig. 1 is a schematic flowchart of an embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application, and Fig. 2 is a schematic diagram of the zero-knowledge proof computation data flow on the CPU-GPU heterogeneous architecture provided by the present application.
As shown in Fig. 1, the data processing method based on a CPU-GPU heterogeneous architecture according to an embodiment of the present application specifically includes the following steps:
Step S11: Obtain a zero-knowledge proof computation task.
Step S12: Divide the zero-knowledge proof computation task into three stages, namely a SYNTHESIZE (circuit generation) stage, an FFT (fast Fourier transform) stage, and a MULTIEXP (large-number multiply-accumulate) stage, where the MULTIEXP stage is further divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage.
In the embodiment of the present application, as shown in Fig. 2, a parallel execution scheme for computation parallelization is proposed based on the CPU-GPU heterogeneous architecture.
Specifically, the present application divides the zero-knowledge proof computation into a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, where the computation of the MULTIEXP stage can in turn be divided into three parts A, B, and C according to its input data, namely the MULTIEXP A stage, the MULTIEXP B stage, and the MULTIEXP C stage. In the computation data flow shown in Fig. 2, the output data of the SYNTHESIZE stage is divided into three parts: one part serves as the input of the FFT stage, and the other two parts serve as the inputs of the MULTIEXP B stage and the MULTIEXP C stage respectively. The output of the FFT stage serves as the input of the MULTIEXP A stage. Finally, the outputs of the MULTIEXP A, MULTIEXP B, and MULTIEXP C stages generate the proof PROOF.
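The dependency structure just described can be sketched as follows. The stage functions here are hypothetical placeholders (simple string transformations standing in for the real SYNTHESIZE, FFT, and MULTIEXP kernels); only the orchestration, with B and C running concurrently with the FFT-to-A chain, reflects the data flow of Fig. 2:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder stages: a real prover would run circuit
# synthesis on the CPU and FFT/MULTIEXP kernels on the GPU.
def synthesize(data):
    return {"fft_in": data + "->s1", "b_in": data + "->s2", "c_in": data + "->s3"}

def fft(x):
    return x + "->fft"

def multiexp(name, x):
    return f"proof_{name}({x})"

def prove(data):
    out = synthesize(data)                       # SYNTHESIZE runs first
    with ThreadPoolExecutor(max_workers=3) as pool:
        # FFT feeds MULTIEXP A; B and C run concurrently on their own inputs.
        fut_a = pool.submit(lambda: multiexp("A", fft(out["fft_in"])))
        fut_b = pool.submit(multiexp, "B", out["b_in"])
        fut_c = pool.submit(multiexp, "C", out["c_in"])
        # Combine the three partial proofs into the final PROOF.
        return (fut_a.result(), fut_b.result(), fut_c.result())

print(prove("witness"))
```

In a real implementation each submitted task would launch GPU work; the executor only expresses which stages may overlap.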
As can be seen from the computation data flow shown in Fig. 2, the operation of the optimized CPU-GPU heterogeneous architecture consists mainly of two parts: the pipelining of the FFT stage with the MULTIEXP A stage, and the parallelization of the MULTIEXP B stage with the MULTIEXP C stage. These two parts are described in detail below.
Step S13: Input the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feed the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively.
In the embodiment of the present application, the terminal device inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feeds the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage in parallel for data processing.
Here, the arrangement in which the CPU is responsible for logic control and data preprocessing during the zero-knowledge proof process enables the acceleration of disk IO and data preprocessing. Specifically, the CPU repeatedly uses the same parameters during data preprocessing, and CPU-side data preprocessing accounts for a large share of the time of the FFT and MULTIEXP stages. To improve CPU efficiency, the CPU can parallelize the reading of these reused parameters ahead of time, reducing the repeated reading and loading of the same parameters.
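A minimal sketch of reading reused parameters ahead of time is given below. The `load_param` reader and the parameter names are assumptions for illustration only; the point is that all reads are kicked off once, in parallel, and later stages consume the cached in-memory copy instead of re-reading from disk:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parameter loader: a real prover would read parameter files
# from disk; load_param merely simulates that work.
def load_param(name):
    return f"<{name} bytes>"

class ParamCache:
    """Prefetch reused parameters in parallel, ahead of the stages that need them."""
    def __init__(self, names):
        self._pool = ThreadPoolExecutor(max_workers=4)
        # Start all reads at once instead of re-reading per stage.
        self._futs = {n: self._pool.submit(load_param, n) for n in names}

    def get(self, name):
        return self._futs[name].result()  # blocks only if not yet loaded

cache = ParamCache(["fft_params", "multiexp_bases"])
# Both the FFT and MULTIEXP stages now reuse the same in-memory copy.
print(cache.get("fft_params"))
```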
Step S14: Input the output data of the FFT stage into the MULTIEXP A stage, and the MULTIEXP A stage outputs first proof information.
In the embodiment of the present application, since the FFT stage and the MULTIEXP A stage have a data dependency, the data processing of these two stages cannot be directly parallelized; instead, the FFT stage and the MULTIEXP A stage are pipelined. Specifically, the terminal device needs to input the output of the SYNTHESIZE stage into the FFT stage, and the data is processed by the FFT stage before being input into the MULTIEXP A stage for processing.
To further improve the utilization of the CPU-GPU heterogeneous architecture, the embodiment of the present application also provides a two-stage pipelining technical solution. Please refer to Fig. 3, which shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1.
As shown in Fig. 3, the data processing method provided by the embodiment of the present application further includes:
Step S141: Divide the data input into the FFT stage into several pieces of sub-data.
In the embodiment of the present application, since both the FFT stage and the MULTIEXP A stage have the property that their data can be divided into several pieces with each piece computed independently, the CPU-GPU heterogeneous architecture can pipeline the FFT stage and the MULTIEXP A stage as a two-stage pipeline.
Step S142: Process the first piece of sub-data through the FFT stage to obtain the first piece of output sub-data.
Step S143: While processing the second piece of sub-data through the FFT stage, transmit the first piece of output sub-data to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data is completed.
Specifically, both the FFT stage and the MULTIEXP A stage divide the data to be processed into N pieces. When the thread computing the FFT stage outputs the x-th piece of data, it immediately passes it to the thread computing the MULTIEXP A stage, which then computes on the x-th piece. At the same time, the FFT stage can begin processing the (x+1)-th piece of data.
Through the above two-stage pipeline, the data processing efficiency of the FFT stage and the MULTIEXP A stage can be further improved, and can even keep pace with the data processing speed of the MULTIEXP B stage and the MULTIEXP C stage.
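The two-stage pipeline of steps S141-S143 can be sketched with a hand-off buffer between two worker threads; `fft_chunk` and `multiexp_a_chunk` are hypothetical stand-ins for the real per-chunk kernels:

```python
import queue
import threading

# Hypothetical per-chunk kernels standing in for the real FFT and
# MULTIEXP A computations.
def fft_chunk(x):
    return x * 2

def multiexp_a_chunk(x):
    return x + 1

def pipelined(chunks):
    q = queue.Queue(maxsize=1)   # hand-off buffer between the two stages
    results = []

    def fft_stage():
        for c in chunks:
            q.put(fft_chunk(c))  # while MULTIEXP A consumes chunk x,
        q.put(None)              # the FFT stage already works on x+1

    def multiexp_stage():
        while (y := q.get()) is not None:
            results.append(multiexp_a_chunk(y))

    t1 = threading.Thread(target=fft_stage)
    t2 = threading.Thread(target=multiexp_stage)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipelined([1, 2, 3]))  # [3, 5, 7]
```

The bounded queue is the design choice that makes this a pipeline rather than a batch: the FFT thread can run at most one chunk ahead of the MULTIEXP A thread, so both stages stay busy without buffering the whole dataset.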
Step S15: Process the MULTIEXP B stage and the MULTIEXP C stage in parallel, and output second proof information and third proof information respectively.
In the embodiment of the present application, since the MULTIEXP B stage and the MULTIEXP C stage have no data dependency and their GPU memory usage is low, the data processing of these two stages can be directly parallelized for simultaneous computation.
Step S16: Combine the first proof information, the second proof information, and the third proof information to generate the final proof result.
Finally, the terminal device combines the first proof information computed in the MULTIEXP A stage, the second proof information computed in the MULTIEXP B stage, and the third proof information computed in the MULTIEXP C stage to generate the final proof result PROOF.
In the embodiment of the present application, the terminal device obtains a zero-knowledge proof computation task; divides the computation task into three stages, namely a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, where the MULTIEXP stage is further divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage; inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing and feeds the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively; inputs the output data of the FFT stage into the MULTIEXP A stage, which outputs first proof information; processes the MULTIEXP B stage and the MULTIEXP C stage in parallel, which output second proof information and third proof information respectively; and combines the first, second, and third proof information to generate the final proof result. In this way, the data processing method of the present application provides a zero-knowledge proof performance optimization scheme that removes the application obstacles caused by performance problems and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Please refer to Fig. 4, which is a schematic flowchart of another embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application. In the data processing method of this embodiment, the CPU-GPU heterogeneous architecture optimizes its performance and improves its utilization efficiency by reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads.
As shown in Fig. 4, the data processing method provided by the embodiment of the present application includes:
Step S21: Calculate the theoretical maximum data throughput of a single GPU task.
In the embodiment of the present application, the terminal device can calculate the theoretical maximum data throughput d_max of a single GPU task by the following formula (the original formula images are not reproduced here; the expression below is reconstructed in symbolic form from the surrounding definitions):

d_max = ( mem − i × cores × n_bucket × buck_size ) / ( k_size + p_size )

where mem is the GPU memory size, i is the ratio of the actual number of threads to the maximum number of parallel GPU threads, and cores is the number of stream processors in the GPU. By equating the maximum number of parallel threads with the number of stream processors, i × cores represents the total number of GPU threads. window_size is the size of one GPU window, n_bucket is the number of buckets owned by one thread (determined by window_size), and buck_size is the size of one GPU bucket. Then n_bucket × buck_size is the maximum GPU memory occupied by the buckets of one thread, and i × cores × n_bucket × buck_size is the GPU memory occupied by the buckets of all threads, referred to as the second GPU memory.

Further, k_size is the size of one scalar datum and p_size is the size of one vector datum. Dividing the GPU memory remaining after subtracting the memory occupied by the buckets of all threads by the size of one input data item yields the theoretical maximum data throughput of a single GPU task.
Step S22: Determine the amount of data processed in a single GPU pass based on the theoretical maximum data throughput.
In the embodiment of the present application, the CPU-GPU heterogeneous architecture sets the amount of data processed in a single GPU pass to the calculated theoretical maximum data throughput when processing the zero-knowledge proof data, which minimizes the number of transfers and improves processing efficiency.
Further, the CPU-GPU heterogeneous architecture can also increase i, for example from 2 to 4, thereby increasing the actual number of threads, raising the amount of data processed in a single GPU pass, and further reducing the number of transfers.
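The calculation of step S21 and its effect on the transfer count can be sketched numerically. All concrete numbers below (an 8 GiB card, 4096 stream processors, 255 buckets of 96 bytes per thread, 32-byte scalars, 96-byte points, 2^28 input items) are illustrative assumptions, not values from the present application:

```python
def d_max(mem, i, cores, n_bucket, buck_size, k_size, p_size):
    """Theoretical maximum data items per GPU task: the memory left after
    the per-thread buckets, divided by the size of one (scalar, point) pair."""
    bucket_mem = i * cores * n_bucket * buck_size  # the "second GPU memory"
    return (mem - bucket_mem) // (k_size + p_size)

# Illustrative parameters only (assumed, not from the application).
mem = 8 * 2**30                                   # 8 GiB of GPU memory
items = d_max(mem, i=2, cores=4096, n_bucket=255,
              buck_size=96, k_size=32, p_size=96)

total = 2**28                                     # items in the whole task
transfers = -(-total // items)                    # ceil: CPU-GPU round trips

# Raising i (e.g. from 2 to 4) adds threads and, per the application,
# raises the per-pass data volume, further cutting the transfer count.
print(items, transfers)
```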
To validate the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application, an optimization-scheme verification was performed on a computation task with 32 GB of input data, with all schemes generating correct proofs. For the verification results, please refer to Figs. 5 to 8, where Fig. 5 is a schematic flowchart of the serial execution of a prior-art data processing method, Fig. 6 is a schematic flowchart of the parallel execution of the data processing method provided by the present application, Fig. 7 shows the total execution time under the different optimization schemes provided by the present application, and Fig. 8 shows the execution time of the MULTIEXP stage under the different optimization schemes provided by the present application.
Comparing the serial execution flowchart of Fig. 5 with the parallel execution flowchart of Fig. 6, it can be seen that with the parallel execution scheme provided by the present application, the memory utilization of the GPU is greatly improved.
As shown in Fig. 7, applying parallelization and data preprocessing acceleration on top of BELLPERSON improves the speed by 9% and 37% respectively. Since parallelization adds CPU-GPU thread scheduling overhead, its speedup is modest. Data preprocessing accounts for a large share of the time, and code analysis revealed many repeated, redundant operations, so the preprocessing acceleration scheme yields the larger speedup.
As shown in Fig. 8, reducing the number of transfers and increasing the number of threads on top of preprocessing acceleration improves the speed by 35%.
In summary, the present application optimizes the performance of zero-knowledge proofs on a CPU-GPU heterogeneous architecture, improving the performance of the architecture through the following three parts:
1) Computation parallelization.
2) Acceleration of disk IO and data preprocessing.
3) Reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads.
Those skilled in the art will understand that in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
To implement the data processing method based on a CPU-GPU heterogeneous architecture of the above embodiments, the present application further provides a terminal device. Please refer to Fig. 9, which is a schematic structural diagram of an embodiment of the terminal device provided by the present application.
The terminal device 500 of the embodiment of the present application includes a memory 51 and a processor 52, where the memory 51 and the processor 52 are coupled.
The memory 51 is used to store program data, and the processor 52 is used to execute the program data to implement the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments.
In this embodiment, the processor 52 may also be called a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip with signal processing capability. The processor 52 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor 52 may be any conventional processor or the like.
The present application further provides a computer storage medium. As shown in Fig. 10, the computer storage medium 600 is used to store program data 61, and the program data 61, when executed by a processor, implements the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments.
The present application further provides a computer program product, where the computer program product includes a computer program operable to cause a computer to execute the data processing method based on a CPU-GPU heterogeneous architecture described in the embodiments of the present application. The computer program product may be a software installation package.
When the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments of the present application is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a device, for example a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above is only an implementation of the present application and does not limit the patent scope of the present application. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (10)

  1. A data processing method based on a CPU-GPU heterogeneous architecture, wherein the data processing method comprises:
    obtaining a zero-knowledge proof computation task;
    dividing the zero-knowledge proof computation task into three stages, namely a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, wherein the MULTIEXP stage is divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage;
    inputting the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feeding output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively;
    inputting output data of the FFT stage into the MULTIEXP A stage, the MULTIEXP A stage outputting first proof information;
    processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, outputting second proof information and third proof information respectively; and
    combining the first proof information, the second proof information, and the third proof information to generate a final proof result.
  2. The data processing method according to claim 1, wherein the data processing method further comprises:
    dividing the data input into the FFT stage into several pieces of sub-data;
    processing the first piece of sub-data through the FFT stage to obtain a first piece of output sub-data; and
    while processing the second piece of sub-data through the FFT stage, transmitting the first piece of output sub-data to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data is completed.
  3. The data processing method according to claim 1, wherein the data processing method further comprises:
    during CPU data preprocessing for the FFT stage and the MULTIEXP stage, parallelizing the reading of reused parameters ahead of time.
  4. The data processing method according to claim 1, wherein the data processing method further comprises:
    calculating a theoretical maximum data throughput of a single GPU task; and
    determining an amount of data processed in a single GPU pass based on the theoretical maximum data throughput.
  5. The data processing method according to claim 4, wherein calculating the theoretical maximum data throughput of the single GPU task comprises:
    obtaining a total GPU memory of the GPU;
    calculating a first GPU memory occupied by threads in the GPU;
    obtaining a remaining GPU memory of the GPU based on a difference between the total GPU memory and the first GPU memory; and
    obtaining the theoretical maximum data throughput based on a ratio of the remaining GPU memory of the GPU to a data volume of the input data.
  6. The data processing method according to claim 5, wherein calculating the first GPU memory occupied by the threads in the GPU comprises:
    obtaining a total number of threads of the GPU;
    calculating a second GPU memory of one thread based on a storage unit size and a number of storage units in one thread of the GPU; and
    calculating the first GPU memory based on the total number of threads and the second GPU memory.
  7. The data processing method according to claim 6, wherein the data processing method further comprises:
    increasing the total number of threads of the GPU; and
    increasing the amount of data processed in a single GPU pass based on the increased total number of threads of the GPU, so as to reduce the number of transfers of the GPU.
  8. The data processing method according to claim 1, wherein the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing dense, parallelizable computation.
  9. A terminal device, wherein the terminal device comprises a memory and a processor, wherein the memory is coupled to the processor; and
    wherein the memory is used to store program data, and the processor is used to execute the program data to implement the data processing method according to any one of claims 1-8.
  10. A computer storage medium, wherein the computer storage medium is used to store program data, and the program data, when executed by a processor, implements the data processing method according to any one of claims 1-8.
PCT/CN2021/141312 2021-12-15 2021-12-24 Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium WO2023108801A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111539679.8 2021-12-15
CN202111539679.8A CN114880109B (en) 2021-12-15 2021-12-15 Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium

Publications (1)

Publication Number Publication Date
WO2023108801A1 true WO2023108801A1 (en) 2023-06-22

Family

ID=82667419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141312 WO2023108801A1 (en) 2021-12-15 2021-12-24 Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium

Country Status (2)

Country Link
CN (1) CN114880109B (en)
WO (1) WO2023108801A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095655A1 (en) * 2013-09-27 2015-04-02 Brent M. Sherman Apparatus and method for implementing zero-knowledge proof security techniques on a computing platform
CN111373694A (en) * 2020-02-21 2020-07-03 香港应用科技研究院有限公司 Zero-knowledge proof hardware accelerator and method thereof
CN111585770A (en) * 2020-01-21 2020-08-25 上海致居信息科技有限公司 Method, device, medium and system for distributed acquisition of zero-knowledge proof
CN113177225A (en) * 2021-03-16 2021-07-27 深圳市名竹科技有限公司 Block chain-based data storage certification method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2010067820A1 (en) * 2008-12-11 2012-05-24 日本電気株式会社 Zero knowledge proof system, zero knowledge proof device, zero knowledge verification device, zero knowledge proof method and program thereof
CN104572587B (en) * 2014-12-23 2017-11-14 中国电子科技集团公司第三十八研究所 The acceleration operation method and device that data matrix is multiplied
JP6724828B2 (en) * 2017-03-15 2020-07-15 カシオ計算機株式会社 Filter calculation processing device, filter calculation method, and effect imparting device
CN112698094B (en) * 2020-12-04 2022-06-24 中山大学 Multi-channel multi-acquisition-mode high-speed acquisition system and method
CN113114377B (en) * 2021-03-05 2022-03-04 北京遥测技术研究所 QPSK signal frequency offset estimation method for spatial coherent laser communication


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, DONGLIN ET AL.: "Data Confidentiality Mechanism of Science and Technology Service Transaction Based on Zero Knowledge Proof", SCIENCE AND TECHNOLOGY MANAGEMENT RESEARCH, no. 20, 20 October 2021 (2021-10-20), pages 80 - 86, XP009547044 *
ZHANG, YINBING ET AL.: "Research on Zero Knowledge Proof Protocol", JOURNAL OF CHIFENG UNIVERSITY (NATURAL SCIENCE EDITION), vol. 30, no. 4, 30 April 2014 (2014-04-30), XP009547045 *

Also Published As

Publication number Publication date
CN114880109B (en) 2023-04-14
CN114880109A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
WO2017124644A1 (en) Artificial neural network compression encoding device and method
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
US20200050918A1 (en) Processing apparatus and processing method
CN113628094B (en) High-throughput SM2 digital signature computing system and method based on GPU
CN112162854A (en) Method, system and medium for scheduling calculation tasks between CPU-GPU
US11977885B2 (en) Utilizing structured sparsity in systolic arrays
US20210166156A1 (en) Data processing system and data processing method
Wang et al. HE-Booster: an efficient polynomial arithmetic acceleration on GPUs for fully homomorphic encryption
CN111859775A (en) Software and hardware co-design for accelerating deep learning inference
CN110704193B (en) Method and device for realizing multi-core software architecture suitable for vector processing
US20150095389A1 (en) Method and system for generating pseudorandom numbers in parallel
WO2023108801A1 (en) Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium
US11546161B2 (en) Zero knowledge proof hardware accelerator and the method thereof
WO2016008317A1 (en) Data processing method and central node
WO2023108800A1 (en) Performance analysis method based on cpu-gpu heterogeneous architecture, and device and storage medium
TWI743648B (en) Systems and methods for accelerating nonlinear mathematical computing
WO2023125463A1 (en) Heterogeneous computing framework-based processing method and apparatus, and device and medium
Zhao et al. Hardware acceleration of number theoretic transform for zk‐SNARK
CN112799637B (en) High-throughput modular inverse computation method and system in parallel environment
Peng et al. MBFQuant: A Multiplier-Bitwidth-Fixed, Mixed-Precision Quantization Method for Mobile CNN-Based Applications
Khan et al. Accelerating SpMV multiplication in probabilistic model checkers using GPUs
CN111796797B (en) Method and device for realizing loop polynomial multiplication calculation acceleration by using AI accelerator
Falcao et al. Heterogeneous implementation of a voronoi cell-based svp solver
Liu et al. IOMRA-A High Efficiency Frequent Itemset Mining Algorithm Based on the MapReduce Computation Model
Phalakarn et al. Vectorized and parallel computation of large smooth-Degree isogenies using precedence-constrained scheduling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21967931

Country of ref document: EP

Kind code of ref document: A1