WO2023108801A1 - Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium - Google Patents

Data processing method based on CPU-GPU heterogeneous architecture, device and storage medium

Info

Publication number
WO2023108801A1
WO2023108801A1 · PCT/CN2021/141312
Authority
WO
WIPO (PCT)
Prior art keywords
stage
data
multiexp
gpu
data processing
Prior art date
Application number
PCT/CN2021/141312
Other languages
French (fr)
Chinese (zh)
Inventor
鲁真妍
杨永魁
喻之斌
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2023108801A1 publication Critical patent/WO2023108801A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of zero-knowledge proof, in particular to a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.
  • Zero-knowledge proof means that the prover can convince the verifier that a statement is correct without revealing any useful information to the verifier. Therefore, problems such as data security and privacy leakage can be solved.
  • current zero-knowledge proofs require a huge amount of computation in the process of generating proofs, and the time and economic costs limit their application.
  • in the homogeneous (CPU-only) computing mode, the CPU cannot meet the intensive computing requirements of zero-knowledge proofs. For example, in a distributed storage system project, blocks must be sealed by completing a zero-knowledge proof and submitted to the chain; using only the CPU, the time to mine a block far exceeds the specified block time, so the block becomes invalid in the blockchain.
  • the GPU not only has powerful floating-point calculation capabilities, but is also suitable for parallel computing of large-scale data. If CPU-GPU heterogeneous computing is used, the efficiency of zero-knowledge proof can be greatly improved. However, on the CPU-GPU heterogeneous architecture, how to coordinate the relationship between devices of different architectures so that they can reach the maximum utilization rate and form the most efficient system is more complicated than the case of the isomorphic method.
  • the present application provides a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.
  • the application provides a data processing method based on CPU-GPU heterogeneous architecture, the data processing method comprising:
  • the calculation task of the zero-knowledge proof is divided into three stages, which are respectively the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the different input data;
  • the output data of the FFT stage is input to the MULTIEXP A stage, and the MULTIEXP A stage outputs the first certification information;
  • a final certification result is generated by combining the first certification information, the second certification information and the third certification information.
  • the data input into the FFT stage is divided into several sub-data
  • while the second piece of sub-data is processed through the FFT stage, the first piece of output sub-data is transmitted to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data are completed.
  • the data processing method also includes:
  • during the CPU's data preprocessing for the FFT stage and the MULTIEXP stage, the reading of reusable parameters is parallelized in advance.
  • the data processing method also includes:
  • the amount of data processed by a single GPU is determined based on the theoretical maximum amount of data processed.
  • calculating the theoretical maximum data processing capacity of a single GPU task includes: obtaining the total video memory of the GPU; calculating the first video memory occupied by threads in the GPU; obtaining the remaining video memory of the GPU based on the difference between the total video memory and the first video memory; and obtaining the theoretical maximum data processing capacity based on the ratio of the remaining video memory to the data volume of the input data.
  • calculating the first video memory occupied by threads in the GPU includes: obtaining the total number of threads of the GPU; calculating the second video memory of one thread based on the storage unit size and the number of storage units in one thread; and calculating the first video memory based on the total number of threads and the second video memory.
  • the data processing method also includes: increasing the total number of threads of the GPU, and increasing the amount of data processed by a single GPU based on the increased total number of threads, so that the number of CPU-GPU transfers is reduced.
  • the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable calculations.
  • the present application also provides a terminal device, where the terminal device includes a memory and a processor, wherein the memory is coupled to the processor;
  • the memory is used to store program data
  • the processor is used to execute the program data to implement the above data processing method.
  • the present application also provides a computer storage medium, the computer storage medium is used to store program data, and the program data is used to implement the above data processing method when executed by a processor.
  • the terminal device obtains the calculation task of the zero-knowledge proof; the calculation task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data; the data of the zero-knowledge proof is input into the SYNTHESIZE stage for processing, and the output data of the SYNTHESIZE stage is input into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively; the output data of the FFT stage is input into the MULTIEXP A stage, which outputs the first proof information; the MULTIEXP B stage and the MULTIEXP C stage are processed in parallel and output the second proof information and the third proof information respectively; and the final proof result is generated by combining the first, second and third proof information.
  • Fig. 1 is a schematic flow chart of an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 2 is a schematic diagram of the zero-knowledge proof calculation data flow based on the CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 3 shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1;
  • FIG. 4 is a schematic flowchart of another embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
  • Fig. 5 is a schematic flow chart of the serial execution of the data processing method in the prior art
  • Fig. 6 is a schematic flow diagram of parallel execution of the data processing method provided by the present application.
  • Fig. 7 is the total execution time under different optimization schemes provided by this application.
  • Fig. 8 is the execution time of the MULTIEXP stage under different optimization schemes provided by the present application.
  • FIG. 9 is a schematic structural diagram of an embodiment of a terminal device provided by the present application.
  • Fig. 10 is a schematic structural diagram of an embodiment of a computer storage medium provided by the present application.
  • this application assigns the CPU to logic control and data preprocessing and the GPU to intensive, parallelizable computation, and proposes a zero-knowledge proof performance optimization method that removes the application obstacles caused by performance problems and accelerates the adoption of zero-knowledge proof technology in application scenarios.
  • Figure 1 is a schematic flow chart of an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by this application.
  • Figure 2 is a schematic diagram of the corresponding zero-knowledge proof computing data flow on the CPU-GPU heterogeneous architecture.
  • the data processing method based on the CPU-GPU heterogeneous architecture of the embodiment of the present application specifically includes the following steps:
  • Step S11 Obtain the calculation task of zero-knowledge proof.
  • Step S12 Divide the calculation task of the zero-knowledge proof into three stages, namely the SYNTHESIZE (circuit generation) stage, the FFT (Fast Fourier Transform) stage, and the MULTIEXP (large-number multiply-add) stage, wherein the MULTIEXP stage is divided, according to the input data, into the MULTIEXP A stage (large-number multiply-add A), the MULTIEXP B stage (large-number multiply-add B) and the MULTIEXP C stage (large-number multiply-add C).
  • the present application proposes a parallel execution solution for computing parallelization based on the CPU-GPU heterogeneous architecture.
  • this application divides the calculation of zero-knowledge proof into three stages: SYNTHESIZE stage, FFT stage, and MULTIEXP stage.
  • the calculation of the MULTIEXP stage can be divided into three parts, A, B and C, according to the input data, namely the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage.
  • the output data of the SYNTHESIZE stage is divided into three parts, one part is used as the input of the FFT stage, and the other two parts are used as the input of the MULTIEXP B stage and the MULTIEXP C stage respectively.
  • the output of the FFT stage will serve as the input to the MULTIEXP A stage.
  • the outputs of the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage together form the final proof PROOF.
  • Step S13 Input the zero-knowledge proof data into the SYNTHESIZE stage for processing, and input the output data of the SYNTHESIZE stage into the FFT stage, MULTIEXP B stage and MULTIEXP C stage respectively.
  • the terminal device inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing, and inputs the output data of the SYNTHESIZE stage into the FFT stage, MULTIEXP B stage, and MULTIEXP C stage in parallel for data processing.
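The fan-out described above can be sketched as follows. This is an illustrative sketch only: `synthesize`, `fft_stage` and `multiexp` are hypothetical placeholders standing in for the patent's actual circuit-generation, FFT and multi-exponentiation computations.

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(witness):
    # Placeholder for the SYNTHESIZE (circuit generation) stage: here it
    # simply splits the witness into the three per-stage inputs.
    return witness[0::3], witness[1::3], witness[2::3]

def fft_stage(data):
    # Placeholder for the FFT stage.
    return [x * 2 for x in data]

def multiexp(part, data):
    # Placeholder for the MULTIEXP computation on one input part.
    return (part, sum(data))

def prove(witness):
    a, b, c = synthesize(witness)
    # Fan the SYNTHESIZE outputs out to three consumers in parallel:
    # FFT -> MULTIEXP A on one worker, MULTIEXP B and C on the others.
    with ThreadPoolExecutor(max_workers=3) as pool:
        fut_a = pool.submit(lambda: multiexp("A", fft_stage(a)))
        fut_b = pool.submit(multiexp, "B", b)
        fut_c = pool.submit(multiexp, "C", c)
        return [fut_a.result(), fut_b.result(), fut_c.result()]

proof = prove(list(range(6)))
```

The key point is only the dependency structure: B and C consume SYNTHESIZE output directly, while A waits for the FFT result.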
  • this application sets the CPU in charge of logic control and data preprocessing in the zero-knowledge proof process, so as to accelerate disk IO and data preprocessing.
  • during data preprocessing, the same parameters are read repeatedly, so the CPU's preprocessing for the FFT stage and the MULTIEXP stage takes a long time.
  • the CPU can parallelize the reading of reusable parameters in advance, reducing the constant reading and calling of reusable parameters.
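A minimal sketch of this prefetching idea, assuming a hypothetical `loader` callable that stands in for reading the reusable proving parameters from disk (the parameter contents below are invented for illustration):

```python
import threading

class ParameterCache:
    """Loads reusable parameters on a background thread so that they are
    already in memory by the time FFT/MULTIEXP preprocessing needs them."""

    def __init__(self, loader):
        self._loader = loader
        self._params = None
        self._thread = threading.Thread(target=self._load)
        self._thread.start()          # start reading in advance, in parallel

    def _load(self):
        self._params = self._loader()

    def get(self):
        self._thread.join()           # block only if loading has not finished
        return self._params

# Hypothetical loader; in practice this would read parameter files from disk.
cache = ParameterCache(lambda: {"g1_bases": [1, 2, 3]})
# ... SYNTHESIZE work would happen here, overlapping with the load ...
params = cache.get()
```

The load overlaps with earlier pipeline work, so repeated re-reading of the same parameters is avoided.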
  • Step S14 Input the output data of the FFT stage into the MULTIEXP A stage, and the MULTIEXP A stage outputs the first certification information.
  • the terminal device inputs the output of the SYNTHESIZE stage into the FFT stage; the FFT stage processes it and passes the result to the MULTIEXP A stage for further processing.
  • FIG. 3 shows the specific sub-steps of step S14 of the data processing method shown in FIG. 1.
  • the data processing method provided in the embodiment of the present application also includes:
  • Step S141 Divide the data input into the FFT stage into several sub-data.
  • the FFT stage and the MULTIEXP A stage both have the characteristic that their data can be divided into several parts, each of which is computed independently.
  • the CPU-GPU heterogeneous architecture can therefore execute the FFT stage and the MULTIEXP A stage as a two-stage pipeline.
  • Step S142 Process the first sub-data through the FFT stage to obtain the first output sub-data.
  • Step S143 While processing the second sub-data through the FFT stage, transmit the first output sub-data to the MULTIEXP A stage for data processing until the processing and transmission of all sub-data is completed.
  • both the FFT stage and the MULTIEXP A stage divide the data to be processed into N parts.
  • as soon as the thread computing the FFT stage outputs the x-th piece of data, that piece is immediately passed to the thread computing the MULTIEXP A stage, which processes it while the FFT stage starts processing the (x+1)-th piece of data.
  • in this way, the data processing efficiency of the FFT stage and the MULTIEXP A stage can be further improved, and can even match the processing speed of the MULTIEXP B stage and the MULTIEXP C stage.
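The two-stage pipeline described above can be sketched with a bounded queue. `fft` and `multiexp_a` are placeholder callables, not the real stage implementations:

```python
import queue
import threading

def pipeline(chunks, fft, multiexp_a):
    """Two-stage pipeline: while the FFT thread works on chunk x+1,
    the MULTIEXP A thread consumes chunk x."""
    q = queue.Queue(maxsize=1)   # hand-off buffer between the two stages
    results = []

    def producer():
        for chunk in chunks:
            q.put(fft(chunk))    # pass chunk x downstream immediately
        q.put(None)              # sentinel: all N chunks processed

    def consumer():
        while (item := q.get()) is not None:
            results.append(multiexp_a(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

# Toy stand-ins: "FFT" doubles each element, "MULTIEXP A" sums a chunk.
out = pipeline([[1, 2], [3, 4]], lambda c: [x * 2 for x in c], sum)
```

With real stage costs, the total time approaches max(FFT, MULTIEXP A) per chunk instead of their sum.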
  • Step S15 Process the MULTIEXP B phase and the MULTIEXP C phase in parallel, and output the second certification information and the third certification information respectively.
  • since the MULTIEXP B stage and the MULTIEXP C stage each receive their inputs directly from the SYNTHESIZE stage, these two stages can be directly parallelized and computed simultaneously.
  • Step S16 Combine the first certification information, the second certification information and the third certification information to generate a final certification result.
  • the terminal device combines the first proof information calculated in the MULTIEXP A phase, the second proof information calculated in the MULTIEXP B phase, and the third proof information calculated in the MULTIEXP C phase to generate the final proof result PROOF.
  • in summary, the terminal device obtains the calculation task of the zero-knowledge proof, and the task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, the MULTIEXP stage being further divided into the A, B and C stages according to the input data.
  • the data processing method of the present application proposes a zero-knowledge proof performance optimization method to solve application obstacles caused by performance problems and accelerate the implementation of zero-knowledge proof technology in application scenarios.
  • FIG. 4 is a schematic flowchart of another embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided by the present application.
  • this embodiment optimizes the performance of the CPU-GPU heterogeneous architecture by reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads, improving the utilization efficiency of the architecture.
  • the data processing method provided in the embodiment of the present application includes:
  • Step S21 Calculate the theoretical maximum data processing capacity of a single GPU task.
  • the terminal device can calculate the theoretical maximum data processing capacity d max of a single GPU task through the following formula:
  • mem is the memory size of the GPU
  • i is the ratio of the actual number of threads to the maximum number of parallel threads of the GPU
  • cores is the number of stream processors in the GPU.
  • k_size is the size of one piece of scalar data;
  • p_size is the size of one piece of vector data. Subtracting the video memory occupied by the buckets of all GPU threads from the total video memory and dividing the remainder by the size of one piece of input data yields the theoretical maximum data processing capacity of a single GPU task.
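A hedged reconstruction of the computation just described, using the symbols defined above. The exact formula is not reproduced in this text, so the per-thread bucket size and all numeric values below are assumptions chosen purely for illustration:

```python
def max_batch_size(mem, i, cores, bucket_mem, k_size, p_size):
    """Hedged reconstruction of d_max: subtract the video memory held by
    the per-thread buckets of all i * cores threads from the total video
    memory, then divide the remainder by the size of one input element
    (one scalar k plus one point p)."""
    threads = i * cores                      # actual number of parallel threads
    remaining = mem - threads * bucket_mem   # video memory left for input data
    return remaining // (k_size + p_size)

# Illustrative numbers only (not from the patent): 8 GiB card, i = 2,
# 2048 stream processors, 256 buckets of 96 bytes per thread,
# 32-byte scalars and 96-byte curve points.
d_max = max_batch_size(mem=8 * 1024**3, i=2, cores=2048,
                       bucket_mem=256 * 96, k_size=32, p_size=96)
```

The structure (total memory, minus fixed per-thread overhead, divided by per-element input size) is what the surrounding text specifies; the actual filing gives the precise formula.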
  • Step S22 Determine the amount of data processed by a single GPU based on the theoretical maximum amount of data processed.
  • the CPU-GPU heterogeneous architecture sets the amount of data processed by a single GPU according to the calculated theoretical maximum data processing capacity, which minimizes the number of transfers and improves processing efficiency.
  • the CPU-GPU heterogeneous architecture can also increase i, for example from 2 to 4, thereby increasing the actual number of threads, raising the amount of data processed by a single GPU, and further reducing the number of transfers.
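The effect on the number of transfers can be illustrated as follows. The figures are invented for illustration, and modeling the increase of i from 2 to 4 as a doubling of d_max is a simplification (in reality the extra threads also consume bucket memory):

```python
import math

def num_transfers(total, d_max):
    """Each GPU task moves at most d_max elements, so pushing the whole
    input requires ceil(total / d_max) CPU-to-GPU transfers."""
    return math.ceil(total / d_max)

# Illustrative: doubling i (2 -> 4) is modeled here as doubling the
# per-task batch size d_max.
before = num_transfers(100_000_000, d_max=16_000_000)   # i = 2
after = num_transfers(100_000_000, d_max=32_000_000)    # i = 4
```

Fewer, larger transfers amortize the fixed PCIe launch and copy overhead, which is the optimization the text describes.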
  • in summary, this application optimizes the performance of zero-knowledge proof on the CPU-GPU heterogeneous architecture, improving it through the three optimizations described above: parallel execution of the stages, parallelized reading of reusable parameters, and reduction of the number of CPU-GPU transfers.
  • the order in which the steps are written does not imply a strict execution order, nor does it constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • FIG. 9 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
  • the terminal device 500 in this embodiment of the present application includes a memory 51 and a processor 52, where the memory 51 and the processor 52 are coupled.
  • the memory 51 is used to store program data
  • the processor 52 is used to execute the program data to implement the data processing method based on the CPU-GPU heterogeneous architecture described in the above embodiments.
  • the processor 52 may also be referred to as a CPU (Central Processing Unit).
  • the processor 52 may be an integrated circuit chip with signal processing capability.
  • the processor 52 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the general purpose processor may be a microprocessor or the processor 52 may be any conventional processor or the like.
  • the present application also provides a computer storage medium.
  • the computer storage medium 600 is used to store program data 61.
  • when the program data 61 is executed by a processor, it is used to implement the data processing method based on the CPU-GPU heterogeneous architecture described above.
  • the present application also provides a computer program product, wherein the computer program product includes a computer program, and the computer program is operable to cause a computer to execute the data processing method based on the CPU-GPU heterogeneous architecture described in the embodiment of the present application.
  • the computer program product may be a software installation package.
  • when the data processing method based on the CPU-GPU heterogeneous architecture described in the above embodiments is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

The present application provides a data processing method based on a CPU-GPU heterogeneous architecture, a device and a storage medium. The data processing method comprises: obtaining a calculation task of a zero-knowledge proof; inputting data of the zero-knowledge proof into a SYNTHESIZE stage for processing, and respectively inputting the output data of the SYNTHESIZE stage into an FFT stage, a MULTIEXP B stage and a MULTIEXP C stage; inputting the output data of the FFT stage into a MULTIEXP A stage, which outputs first proof information; processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, which respectively output second proof information and third proof information; and generating a final proof result by combining the first, second and third proof information. In this way, the data processing method of the present application resolves the application hindrance caused by performance problems by providing a zero-knowledge proof performance optimization scheme, and accelerates the adoption of zero-knowledge proof technology in application scenarios.

Description

Data Processing Method, Device and Storage Medium Based on CPU-GPU Heterogeneous Architecture

Technical Field

The present application relates to the technical field of zero-knowledge proof, and in particular to a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.

Background Art

A zero-knowledge proof means that a prover can convince a verifier that a statement is correct without revealing any useful information to the verifier, which makes it possible to solve problems such as data security and privacy leakage.

At present, zero-knowledge proofs require a huge amount of computation to generate a proof, and the time and economic costs limit their application. In the homogeneous computing mode, the CPU cannot meet the intensive computing requirements of zero-knowledge proofs. For example, in a distributed storage system project, blocks must be sealed by completing a zero-knowledge proof and submitted to the chain; using only the CPU, the time to mine a block far exceeds the specified block time, so the block becomes invalid in the blockchain. The GPU not only has powerful floating-point computing capability but is also suited to parallel computing over large-scale data. If CPU-GPU heterogeneous computing is used, the efficiency of zero-knowledge proof can be greatly improved. However, on the CPU-GPU heterogeneous architecture, coordinating devices of different architectures so that each reaches its maximum utilization and the system as a whole is most efficient is more complicated than in the homogeneous case.
Summary of the Invention

The present application provides a data processing method, device and storage medium based on a CPU-GPU heterogeneous architecture.

The present application provides a data processing method based on a CPU-GPU heterogeneous architecture, the data processing method comprising:

obtaining the calculation task of a zero-knowledge proof;

dividing the calculation task of the zero-knowledge proof into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data;

inputting the data of the zero-knowledge proof into the SYNTHESIZE stage for processing, and inputting the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively;

inputting the output data of the FFT stage into the MULTIEXP A stage, the MULTIEXP A stage outputting first proof information;

processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, outputting second proof information and third proof information respectively; and

generating a final proof result by combining the first proof information, the second proof information and the third proof information.
Wherein, the data input into the FFT stage is divided into several pieces of sub-data;

the first piece of sub-data is processed through the FFT stage to obtain the first piece of output sub-data; and

while the second piece of sub-data is processed through the FFT stage, the first piece of output sub-data is transmitted to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data are completed.
Wherein, the data processing method further comprises:

during the CPU's data preprocessing for the FFT stage and the MULTIEXP stage, parallelizing the reading of reusable parameters in advance.
Wherein, the data processing method further comprises:

calculating the theoretical maximum data processing capacity of a single GPU task; and

determining the amount of data processed by a single GPU based on the theoretical maximum data processing capacity.
Wherein, calculating the theoretical maximum data processing capacity of a single GPU task comprises:

obtaining the total video memory of the GPU;

calculating the first video memory occupied by threads in the GPU;

obtaining the remaining video memory of the GPU based on the difference between the total video memory and the first video memory; and

obtaining the theoretical maximum data processing capacity based on the ratio of the remaining video memory of the GPU to the data volume of the input data.
Wherein, calculating the first video memory occupied by threads in the GPU comprises:

obtaining the total number of threads of the GPU;

calculating the second video memory of one thread based on the storage unit size and the number of storage units in one thread of the GPU; and

calculating the first video memory based on the total number of threads and the second video memory.
Wherein, the data processing method further comprises:

increasing the total number of threads of the GPU; and

increasing the amount of data processed by a single GPU based on the increased total number of threads of the GPU, so that the number of GPU transfers is reduced.
Wherein, in the CPU-GPU heterogeneous architecture, the CPU is responsible for logic control and data preprocessing, and the GPU is responsible for intensive and parallelizable computation.
The present application further provides a terminal device, the terminal device comprising a memory and a processor, wherein the memory is coupled to the processor;

the memory is configured to store program data, and the processor is configured to execute the program data to implement the above data processing method.

The present application further provides a computer storage medium for storing program data which, when executed by a processor, implements the above data processing method.
The beneficial effects of the present application are as follows: the terminal device obtains the calculation task of the zero-knowledge proof; the calculation task is divided into three stages, namely the SYNTHESIZE stage, the FFT stage and the MULTIEXP stage, wherein the MULTIEXP stage is divided into the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage according to the input data; the data of the zero-knowledge proof is input into the SYNTHESIZE stage for processing, and the output data of the SYNTHESIZE stage is input into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively; the output data of the FFT stage is input into the MULTIEXP A stage, which outputs the first proof information; the MULTIEXP B stage and the MULTIEXP C stage are processed in parallel and output the second proof information and the third proof information respectively; and the final proof result is generated by combining the first, second and third proof information. In this way, the data processing method of the present application resolves the application obstacles caused by performance problems by proposing a zero-knowledge proof performance optimization scheme, and accelerates the adoption of zero-knowledge proof technology in application scenarios.
Description of Drawings

In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
Fig. 1 is a schematic flowchart of an embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
Fig. 2 is a schematic diagram of the zero-knowledge proof computation data flow on the CPU-GPU heterogeneous architecture provided by the present application;
Fig. 3 shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1;
Fig. 4 is a schematic flowchart of another embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application;
Fig. 5 is a schematic flowchart of the serial execution of a prior-art data processing method;
Fig. 6 is a schematic flowchart of the parallel execution of the data processing method provided by the present application;
Fig. 7 shows the total execution time under the different optimization schemes provided by the present application;
Fig. 8 shows the execution time of the MULTIEXP stage under the different optimization schemes provided by the present application;
Fig. 9 is a schematic structural diagram of an embodiment of the terminal device provided by the present application;
Fig. 10 is a schematic structural diagram of an embodiment of the computer storage medium provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Zero-knowledge proofs involve complex algorithms, huge data volumes, and heavy computation, and the CPU-GPU heterogeneous architecture introduces system complexity of its own. To solve the above-mentioned problem of low utilization of the CPU-GPU heterogeneous architecture, the present application arranges the implementation on the CPU-GPU heterogeneous architecture so that the CPU is responsible for logic control and data preprocessing while the GPU is responsible for dense, parallelizable computation, and on this basis proposes a zero-knowledge proof performance optimization method that removes the application obstacles caused by performance problems and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Please refer to Fig. 1 and Fig. 2, where Fig. 1 is a schematic flowchart of an embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application, and Fig. 2 is a schematic diagram of the zero-knowledge proof computation data flow on the CPU-GPU heterogeneous architecture provided by the present application.
As shown in Fig. 1, the data processing method based on a CPU-GPU heterogeneous architecture according to an embodiment of the present application specifically includes the following steps:
Step S11: Obtain a zero-knowledge proof computation task.
Step S12: Divide the zero-knowledge proof computation task into three stages, namely a SYNTHESIZE (circuit generation) stage, an FFT (fast Fourier transform) stage, and a MULTIEXP (large-number multiply-accumulate) stage, where the MULTIEXP stage is further divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage.
In the embodiment of the present application, as shown in Fig. 2, a parallel execution scheme for computation parallelization is proposed based on the CPU-GPU heterogeneous architecture.
Specifically, the present application divides the zero-knowledge proof computation into a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, where the computation of the MULTIEXP stage can in turn be divided into three parts A, B, and C according to its input data, namely the MULTIEXP A stage, the MULTIEXP B stage, and the MULTIEXP C stage. In the computation data flow shown in Fig. 2, the output data of the SYNTHESIZE stage is divided into three parts: one part serves as the input of the FFT stage, and the other two parts serve as the inputs of the MULTIEXP B stage and the MULTIEXP C stage respectively. The output of the FFT stage serves as the input of the MULTIEXP A stage. Finally, the outputs of the MULTIEXP A, MULTIEXP B, and MULTIEXP C stages generate the proof PROOF.
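The dependency structure just described can be sketched as follows. The stage functions here are hypothetical placeholders (simple string transformations standing in for the real SYNTHESIZE, FFT, and MULTIEXP kernels); only the orchestration, with B and C running concurrently with the FFT-to-A chain, reflects the data flow of Fig. 2:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder stages: a real prover would run circuit
# synthesis on the CPU and FFT/MULTIEXP kernels on the GPU.
def synthesize(data):
    return {"fft_in": data + "->s1", "b_in": data + "->s2", "c_in": data + "->s3"}

def fft(x):
    return x + "->fft"

def multiexp(name, x):
    return f"proof_{name}({x})"

def prove(data):
    out = synthesize(data)                       # SYNTHESIZE runs first
    with ThreadPoolExecutor(max_workers=3) as pool:
        # FFT feeds MULTIEXP A; B and C run concurrently on their own inputs.
        fut_a = pool.submit(lambda: multiexp("A", fft(out["fft_in"])))
        fut_b = pool.submit(multiexp, "B", out["b_in"])
        fut_c = pool.submit(multiexp, "C", out["c_in"])
        # Combine the three partial proofs into the final PROOF.
        return (fut_a.result(), fut_b.result(), fut_c.result())

print(prove("witness"))
```

In a real implementation each submitted task would launch GPU work; the executor only expresses which stages may overlap.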
As can be seen from the computation data flow shown in Fig. 2, the operation of the optimized CPU-GPU heterogeneous architecture consists mainly of two parts: the pipelining of the FFT stage with the MULTIEXP A stage, and the parallelization of the MULTIEXP B stage with the MULTIEXP C stage. These two parts are described in detail below.
Step S13: Input the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feed the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively.
In the embodiment of the present application, the terminal device inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feeds the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage in parallel for data processing.
Here, the arrangement in which the CPU is responsible for logic control and data preprocessing during the zero-knowledge proof process enables the acceleration of disk IO and data preprocessing. Specifically, the CPU repeatedly uses the same parameters during data preprocessing, and CPU-side data preprocessing accounts for a large share of the time of the FFT and MULTIEXP stages. To improve CPU efficiency, the CPU can parallelize the reading of these reused parameters ahead of time, reducing the repeated reading and loading of the same parameters.
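A minimal sketch of reading reused parameters ahead of time is given below. The `load_param` reader and the parameter names are assumptions for illustration only; the point is that all reads are kicked off once, in parallel, and later stages consume the cached in-memory copy instead of re-reading from disk:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parameter loader: a real prover would read parameter files
# from disk; load_param merely simulates that work.
def load_param(name):
    return f"<{name} bytes>"

class ParamCache:
    """Prefetch reused parameters in parallel, ahead of the stages that need them."""
    def __init__(self, names):
        self._pool = ThreadPoolExecutor(max_workers=4)
        # Start all reads at once instead of re-reading per stage.
        self._futs = {n: self._pool.submit(load_param, n) for n in names}

    def get(self, name):
        return self._futs[name].result()  # blocks only if not yet loaded

cache = ParamCache(["fft_params", "multiexp_bases"])
# Both the FFT and MULTIEXP stages now reuse the same in-memory copy.
print(cache.get("fft_params"))
```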
Step S14: Input the output data of the FFT stage into the MULTIEXP A stage, and the MULTIEXP A stage outputs first proof information.
In the embodiment of the present application, since the FFT stage and the MULTIEXP A stage have a data dependency, the data processing of these two stages cannot be directly parallelized; instead, the FFT stage and the MULTIEXP A stage are pipelined. Specifically, the terminal device needs to input the output of the SYNTHESIZE stage into the FFT stage, and the data is processed by the FFT stage before being input into the MULTIEXP A stage for processing.
To further improve the utilization of the CPU-GPU heterogeneous architecture, the embodiment of the present application also provides a two-stage pipelining technical solution. Please refer to Fig. 3, which shows the specific sub-steps of step S14 of the data processing method shown in Fig. 1.
As shown in Fig. 3, the data processing method provided by the embodiment of the present application further includes:
Step S141: Divide the data input into the FFT stage into several pieces of sub-data.
In the embodiment of the present application, since both the FFT stage and the MULTIEXP A stage have the property that their data can be divided into several pieces with each piece computed independently, the CPU-GPU heterogeneous architecture can pipeline the FFT stage and the MULTIEXP A stage as a two-stage pipeline.
Step S142: Process the first piece of sub-data through the FFT stage to obtain the first piece of output sub-data.
Step S143: While processing the second piece of sub-data through the FFT stage, transmit the first piece of output sub-data to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data is completed.
Specifically, both the FFT stage and the MULTIEXP A stage divide the data to be processed into N pieces. When the thread computing the FFT stage outputs the x-th piece of data, it immediately passes it to the thread computing the MULTIEXP A stage, which then computes on the x-th piece. At the same time, the FFT stage can begin processing the (x+1)-th piece of data.
Through the above two-stage pipeline, the data processing efficiency of the FFT stage and the MULTIEXP A stage can be further improved, and can even keep pace with the data processing speed of the MULTIEXP B stage and the MULTIEXP C stage.
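The two-stage pipeline of steps S141-S143 can be sketched with a hand-off buffer between two worker threads; `fft_chunk` and `multiexp_a_chunk` are hypothetical stand-ins for the real per-chunk kernels:

```python
import queue
import threading

# Hypothetical per-chunk kernels standing in for the real FFT and
# MULTIEXP A computations.
def fft_chunk(x):
    return x * 2

def multiexp_a_chunk(x):
    return x + 1

def pipelined(chunks):
    q = queue.Queue(maxsize=1)   # hand-off buffer between the two stages
    results = []

    def fft_stage():
        for c in chunks:
            q.put(fft_chunk(c))  # while MULTIEXP A consumes chunk x,
        q.put(None)              # the FFT stage already works on x+1

    def multiexp_stage():
        while (y := q.get()) is not None:
            results.append(multiexp_a_chunk(y))

    t1 = threading.Thread(target=fft_stage)
    t2 = threading.Thread(target=multiexp_stage)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipelined([1, 2, 3]))  # [3, 5, 7]
```

The bounded queue is the design choice that makes this a pipeline rather than a batch: the FFT thread can run at most one chunk ahead of the MULTIEXP A thread, so both stages stay busy without buffering the whole dataset.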
Step S15: Process the MULTIEXP B stage and the MULTIEXP C stage in parallel, and output second proof information and third proof information respectively.
In the embodiment of the present application, since the MULTIEXP B stage and the MULTIEXP C stage have no data dependency and their GPU memory usage is low, the data processing of these two stages can be directly parallelized for simultaneous computation.
Step S16: Combine the first proof information, the second proof information, and the third proof information to generate the final proof result.
Finally, the terminal device combines the first proof information computed in the MULTIEXP A stage, the second proof information computed in the MULTIEXP B stage, and the third proof information computed in the MULTIEXP C stage to generate the final proof result PROOF.
In the embodiment of the present application, the terminal device obtains a zero-knowledge proof computation task; divides the computation task into three stages, namely a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, where the MULTIEXP stage is further divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage; inputs the zero-knowledge proof data into the SYNTHESIZE stage for processing and feeds the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively; inputs the output data of the FFT stage into the MULTIEXP A stage, which outputs first proof information; processes the MULTIEXP B stage and the MULTIEXP C stage in parallel, which output second proof information and third proof information respectively; and combines the first, second, and third proof information to generate the final proof result. In this way, the data processing method of the present application provides a zero-knowledge proof performance optimization scheme that removes the application obstacles caused by performance problems and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Please refer to Fig. 4, which is a schematic flowchart of another embodiment of the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application. In the data processing method of this embodiment, the CPU-GPU heterogeneous architecture optimizes its performance and improves its utilization efficiency by reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads.
As shown in Fig. 4, the data processing method provided by the embodiment of the present application includes:
Step S21: Calculate the theoretical maximum data throughput of a single GPU task.
In the embodiment of the present application, the terminal device can calculate the theoretical maximum data throughput d_max of a single GPU task by the following formula (the original formula images are not reproduced here; the expression below is reconstructed in symbolic form from the surrounding definitions):

d_max = ( mem − i × cores × n_bucket × buck_size ) / ( k_size + p_size )

where mem is the GPU memory size, i is the ratio of the actual number of threads to the maximum number of parallel GPU threads, and cores is the number of stream processors in the GPU. By equating the maximum number of parallel threads with the number of stream processors, i × cores represents the total number of GPU threads. window_size is the size of one GPU window, n_bucket is the number of buckets owned by one thread (determined by window_size), and buck_size is the size of one GPU bucket. Then n_bucket × buck_size is the maximum GPU memory occupied by the buckets of one thread, and i × cores × n_bucket × buck_size is the GPU memory occupied by the buckets of all threads, referred to as the second GPU memory.

Further, k_size is the size of one scalar datum and p_size is the size of one vector datum. Dividing the GPU memory remaining after subtracting the memory occupied by the buckets of all threads by the size of one input data item yields the theoretical maximum data throughput of a single GPU task.
Step S22: Determine the amount of data processed in a single GPU pass based on the theoretical maximum data throughput.
In the embodiment of the present application, the CPU-GPU heterogeneous architecture sets the amount of data processed in a single GPU pass to the calculated theoretical maximum data throughput when processing the zero-knowledge proof data, which minimizes the number of transfers and improves processing efficiency.
Further, the CPU-GPU heterogeneous architecture can also increase i, for example from 2 to 4, thereby increasing the actual number of threads, raising the amount of data processed in a single GPU pass, and further reducing the number of transfers.
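The calculation of step S21 and its effect on the transfer count can be sketched numerically. All concrete numbers below (an 8 GiB card, 4096 stream processors, 255 buckets of 96 bytes per thread, 32-byte scalars, 96-byte points, 2^28 input items) are illustrative assumptions, not values from the present application:

```python
def d_max(mem, i, cores, n_bucket, buck_size, k_size, p_size):
    """Theoretical maximum data items per GPU task: the memory left after
    the per-thread buckets, divided by the size of one (scalar, point) pair."""
    bucket_mem = i * cores * n_bucket * buck_size  # the "second GPU memory"
    return (mem - bucket_mem) // (k_size + p_size)

# Illustrative parameters only (assumed, not from the application).
mem = 8 * 2**30                                   # 8 GiB of GPU memory
items = d_max(mem, i=2, cores=4096, n_bucket=255,
              buck_size=96, k_size=32, p_size=96)

total = 2**28                                     # items in the whole task
transfers = -(-total // items)                    # ceil: CPU-GPU round trips

# Raising i (e.g. from 2 to 4) adds threads and, per the application,
# raises the per-pass data volume, further cutting the transfer count.
print(items, transfers)
```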
To validate the data processing method based on a CPU-GPU heterogeneous architecture provided by the present application, an optimization-scheme verification was performed on a computation task with 32 GB of input data, with all schemes generating correct proofs. For the verification results, please refer to Figs. 5 to 8, where Fig. 5 is a schematic flowchart of the serial execution of a prior-art data processing method, Fig. 6 is a schematic flowchart of the parallel execution of the data processing method provided by the present application, Fig. 7 shows the total execution time under the different optimization schemes provided by the present application, and Fig. 8 shows the execution time of the MULTIEXP stage under the different optimization schemes provided by the present application.
Comparing the serial execution flowchart of Fig. 5 with the parallel execution flowchart of Fig. 6, it can be seen that with the parallel execution scheme provided by the present application, the memory utilization of the GPU is greatly improved.
As shown in Fig. 7, applying parallelization and data preprocessing acceleration on top of BELLPERSON improves the speed by 9% and 37% respectively. Since parallelization adds CPU-GPU thread scheduling overhead, its speedup is modest. Data preprocessing accounts for a large share of the time, and code analysis revealed many repeated, redundant operations, so the preprocessing acceleration scheme yields the larger speedup.
As shown in Fig. 8, reducing the number of transfers and increasing the number of threads on top of preprocessing acceleration improves the speed by 35%.
In summary, the present application optimizes the performance of zero-knowledge proofs on a CPU-GPU heterogeneous architecture, improving the performance of the architecture through the following three parts:
1) Computation parallelization.
2) Acceleration of disk IO and data preprocessing.
3) Reducing the number of CPU-GPU transfers while increasing the number of GPU computing threads.
Those skilled in the art will understand that in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
To implement the data processing method based on a CPU-GPU heterogeneous architecture of the above embodiments, the present application further provides a terminal device. Please refer to Fig. 9, which is a schematic structural diagram of an embodiment of the terminal device provided by the present application.
The terminal device 500 of the embodiment of the present application includes a memory 51 and a processor 52, where the memory 51 and the processor 52 are coupled.
The memory 51 is used to store program data, and the processor 52 is used to execute the program data to implement the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments.
In this embodiment, the processor 52 may also be called a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip with signal processing capability. The processor 52 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor 52 may be any conventional processor or the like.
The present application further provides a computer storage medium. As shown in Fig. 10, the computer storage medium 600 is used to store program data 61, and the program data 61, when executed by a processor, implements the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments.
The present application further provides a computer program product, where the computer program product includes a computer program operable to cause a computer to execute the data processing method based on a CPU-GPU heterogeneous architecture described in the embodiments of the present application. The computer program product may be a software installation package.
When the data processing method based on a CPU-GPU heterogeneous architecture described in the above embodiments of the present application is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a device, for example a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above is only an implementation of the present application and does not limit the patent scope of the present application. Any equivalent structural or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (10)

  1. A data processing method based on a CPU-GPU heterogeneous architecture, wherein the data processing method comprises:
    obtaining a zero-knowledge proof computation task;
    dividing the zero-knowledge proof computation task into three stages, namely a SYNTHESIZE stage, an FFT stage, and a MULTIEXP stage, wherein the MULTIEXP stage is divided, according to its input data, into a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage;
    inputting the zero-knowledge proof data into the SYNTHESIZE stage for processing, and feeding output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage, and the MULTIEXP C stage respectively;
    inputting output data of the FFT stage into the MULTIEXP A stage, the MULTIEXP A stage outputting first proof information;
    processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, outputting second proof information and third proof information respectively; and
    combining the first proof information, the second proof information, and the third proof information to generate a final proof result.
  2. The data processing method according to claim 1, wherein the data processing method further comprises:
    dividing the data input into the FFT stage into several pieces of sub-data;
    processing the first piece of sub-data through the FFT stage to obtain a first piece of output sub-data; and
    while processing the second piece of sub-data through the FFT stage, transmitting the first piece of output sub-data to the MULTIEXP A stage for data processing, until the processing and transmission of all sub-data is completed.
  3. The data processing method according to claim 1, wherein the data processing method further comprises:
    during CPU data preprocessing for the FFT stage and the MULTIEXP stage, parallelizing the reading of reused parameters ahead of time.
  4. The data processing method according to claim 1, wherein the data processing method further comprises:
    calculating a theoretical maximum data throughput of a single GPU task; and
    determining an amount of data processed in a single GPU pass based on the theoretical maximum data throughput.
  5. The data processing method according to claim 4, wherein calculating the theoretical maximum data throughput of the single GPU task comprises:
    obtaining a total GPU memory of the GPU;
    calculating a first GPU memory occupied by threads in the GPU;
    obtaining a remaining GPU memory of the GPU based on a difference between the total GPU memory and the first GPU memory; and
    obtaining the theoretical maximum data throughput based on a ratio of the remaining GPU memory of the GPU to a data volume of the input data.
  6. The data processing method according to claim 5, wherein calculating the first GPU memory occupied by the threads in the GPU comprises:
    obtaining a total number of threads of the GPU;
    calculating a second GPU memory of one thread based on a storage unit size and a number of storage units in one thread of the GPU; and
    calculating the first GPU memory based on the total number of threads and the second GPU memory.
  7. The data processing method according to claim 6, wherein the data processing method further comprises:
    increasing the total number of threads of the GPU; and
    increasing the amount of data processed in a single GPU pass based on the increased total number of threads of the GPU, so as to reduce the number of transfers of the GPU.
  8. The data processing method according to claim 1, wherein the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing dense, parallelizable computation.
  9. A terminal device, wherein the terminal device comprises a memory and a processor, wherein the memory is coupled to the processor; and
    wherein the memory is used to store program data, and the processor is used to execute the program data to implement the data processing method according to any one of claims 1-8.
  10. A computer storage medium, wherein the computer storage medium is used to store program data, and the program data, when executed by a processor, implements the data processing method according to any one of claims 1-8.
PCT/CN2021/141312 2021-12-15 2021-12-24 Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium WO2023108801A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111539679.8 2021-12-15
CN202111539679.8A CN114880109B (en) 2021-12-15 2021-12-15 Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium

Publications (1)

Publication Number Publication Date
WO2023108801A1 true WO2023108801A1 (en) 2023-06-22

Family

ID=82667419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141312 WO2023108801A1 (en) 2021-12-15 2021-12-24 Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium

Country Status (2)

Country Link
CN (1) CN114880109B (en)
WO (1) WO2023108801A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095655A1 (en) * 2013-09-27 2015-04-02 Brent M. Sherman Apparatus and method for implementing zero-knowledge proof security techniques on a computing platform
CN111373694A (en) * 2020-02-21 2020-07-03 香港应用科技研究院有限公司 Zero-knowledge proof hardware accelerator and method thereof
CN111585770A (en) * 2020-01-21 2020-08-25 上海致居信息科技有限公司 Method, device, medium and system for distributed acquisition of zero-knowledge proof
CN113177225A (en) * 2021-03-16 2021-07-27 深圳市名竹科技有限公司 Block chain-based data storage certification method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2010067820A1 (en) * 2008-12-11 2012-05-24 日本電気株式会社 Zero knowledge proof system, zero knowledge proof device, zero knowledge verification device, zero knowledge proof method and program thereof
CN104572587B (en) * 2014-12-23 2017-11-14 中国电子科技集团公司第三十八研究所 The acceleration operation method and device that data matrix is multiplied
JP6724828B2 (en) * 2017-03-15 2020-07-15 カシオ計算機株式会社 Filter calculation processing device, filter calculation method, and effect imparting device
CN112698094B (en) * 2020-12-04 2022-06-24 中山大学 Multi-channel multi-acquisition-mode high-speed acquisition system and method
CN113114377B (en) * 2021-03-05 2022-03-04 北京遥测技术研究所 QPSK signal frequency offset estimation method for spatial coherent laser communication


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, DONGLIN ET AL.: "Data Confidentiality Mechanism of Science and Technology Service Transaction Based on Zero Knowledge Proof", SCIENCE AND TECHNOLOGY MANAGEMENT RESEARCH, no. 20, 20 October 2021 (2021-10-20), pages 80 - 86, XP009547044 *
ZHANG, YINBING ET AL.: "Research on Zero Knowledge Proof Protocol", JOURNAL OF CHIFENG UNIVERSITY (NATURAL SCIENCE EDITION), vol. 30, no. 4, 30 April 2014 (2014-04-30), XP009547045 *

Also Published As

Publication number Publication date
CN114880109B (en) 2023-04-14
CN114880109A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
WO2017124644A1 (en) Artificial neural network compression encoding device and method
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
US20200050918A1 (en) Processing apparatus and processing method
CN113628094B (en) High-throughput SM2 digital signature computing system and method based on GPU
CN112162854A (en) Method, system and medium for scheduling calculation tasks between CPU-GPU
US11977885B2 (en) Utilizing structured sparsity in systolic arrays
US20210166156A1 (en) Data processing system and data processing method
Wang et al. HE-Booster: an efficient polynomial arithmetic acceleration on GPUs for fully homomorphic encryption
CN111859775A (en) Software and hardware co-design for accelerating deep learning inference
CN110704193B (en) Method and device for realizing multi-core software architecture suitable for vector processing
US20150095389A1 (en) Method and system for generating pseudorandom numbers in parallel
WO2023108801A1 (en) Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium
US11546161B2 (en) Zero knowledge proof hardware accelerator and the method thereof
WO2016008317A1 (en) Data processing method and central node
WO2023108800A1 (en) Performance analysis method based on cpu-gpu heterogeneous architecture, and device and storage medium
TWI743648B (en) Systems and methods for accelerating nonlinear mathematical computing
WO2023125463A1 (en) Heterogeneous computing framework-based processing method and apparatus, and device and medium
Zhao et al. Hardware acceleration of number theoretic transform for zk‐SNARK
CN112799637B (en) High-throughput modular inverse computation method and system in parallel environment
Peng et al. MBFQuant: A Multiplier-Bitwidth-Fixed, Mixed-Precision Quantization Method for Mobile CNN-Based Applications
Khan et al. Accelerating SpMV multiplication in probabilistic model checkers using GPUs
CN111796797B (en) Method and device for realizing loop polynomial multiplication calculation acceleration by using AI accelerator
Falcao et al. Heterogeneous implementation of a voronoi cell-based svp solver
Liu et al. IOMRA-A High Efficiency Frequent Itemset Mining Algorithm Based on the MapReduce Computation Model
Phalakarn et al. Vectorized and parallel computation of large smooth-Degree isogenies using precedence-constrained scheduling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21967931

Country of ref document: EP

Kind code of ref document: A1