CN117785480B - Processor, reduction calculation method and electronic equipment - Google Patents

Processor, reduction calculation method and electronic equipment

Info

Publication number
CN117785480B
Authority
CN
China
Prior art keywords
reduction
calculation
data
writing
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410171926.0A
Other languages
Chinese (zh)
Other versions
CN117785480A (en)
Inventor
Inventor name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410171926.0A priority Critical patent/CN117785480B/en
Publication of CN117785480A publication Critical patent/CN117785480A/en
Application granted granted Critical
Publication of CN117785480B publication Critical patent/CN117785480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure provide a processor, a reduction calculation method, and an electronic device. The processor includes: a plurality of computing cores configured to implement reduction calculation; and a global memory configured to be accessible by the plurality of computing cores to enable data reading and writing. Each computing core includes a matrix main cache, a matrix reduction cache, a plurality of first computing units, and a reduction module, where the reduction module is configured to access the global memory and to perform reduction calculation on data written into it. Each first computing unit includes a shared memory and a plurality of second computing units configured to access the shared memory and the matrix reduction cache. The matrix reduction cache is further configured to be accessible by all second computing units located within the same computing core. By adding the two storage levels of the matrix reduction cache and the matrix main cache, the processor realizes a multi-level structure, improves the flexibility of parallel reduction calculation, and enables multi-level parallel reduction calculation.

Description

Processor, reduction calculation method and electronic equipment
Technical Field
Embodiments of the present disclosure relate to a processor, a reduction calculation method, and an electronic device.
Background
In large model training, the huge number of model parameters makes the computation expensive, so a graphics processor (Graphics Processing Unit, GPU) is usually used for parallel computation to increase the training speed. To fully utilize the computing power of GPUs, a computing task is typically divided into a plurality of subtasks that are distributed to the GPUs for parallel processing, after which Reduction operations, such as summation, extremum, or logical operations, are performed on the operation results on the GPUs. In the neural network model, normalization class operators, pooling class operators, Reduce class operators, the Softmax activation function, and the like all involve application scenarios of reduction calculation.
Disclosure of Invention
At least one embodiment of the present disclosure provides a processor comprising: a plurality of computing cores configured to implement reduction calculation; and a global memory configured to be accessible by the plurality of computing cores to enable data reading and writing. Each of the plurality of computing cores comprises: a matrix main cache configured to store data of the reduction calculation; a matrix reduction cache configured to perform the reduction calculation and store the data used and generated; and a plurality of first computing units configured to be able to access the matrix main cache to enable data reading and writing, wherein each of the plurality of first computing units comprises: a shared memory configured to store data of the reduction calculation; and a plurality of second computing units configured to access the shared memory to enable data reading and writing, and to access the matrix reduction cache to enable data reading and writing. Each of the plurality of computing cores further comprises a reduction module configured to access the global memory to enable data reading and writing and to perform the reduction calculation on data written to the reduction module. The matrix reduction cache is further configured to be accessible by all second computing units located within the same computing core.
For example, in a processor provided by at least one embodiment of the present disclosure, each of the plurality of second computing units includes a plurality of registers configured to store data of the reduction computation.
For example, in a processor provided in at least one embodiment of the present disclosure, each of the plurality of second computing units supports a plurality of thread bundles, each of the plurality of thread bundles configured to access all register resources in the corresponding second computing unit to enable data reading and writing.
For example, in the processor provided in at least one embodiment of the present disclosure, the plurality of second computing units are further configured to implement reduction computation among the plurality of second computing units by reading and writing the shared memory.
For example, in a processor provided in at least one embodiment of the present disclosure, the plurality of first computing units are further configured to implement reduction computation among the plurality of first computing units by reading and writing the matrix main cache.
For example, in a processor provided in at least one embodiment of the present disclosure, the reduction module is further configured to perform reduction calculation on data read from the global memory and the data of the reduction calculation stored in a register or in the matrix reduction cache, and to write the reduction result back to the global memory.
For example, in a processor provided in at least one embodiment of the present disclosure, the reduction module includes a cache with a reduction calculation function implemented based on a level-two cache (L2 cache).
For example, in a processor provided in at least one embodiment of the present disclosure, the global memory includes a high bandwidth memory.
For example, in a processor provided in at least one embodiment of the present disclosure, the processor includes a graphics processor or a general-purpose graphics processor.
At least one embodiment of the present disclosure provides a reduction calculation method implemented using a processor as provided by at least one embodiment of the present disclosure, including: performing reduction calculation on the calculation data of a plurality of thread bundles in the same second calculation unit to obtain a first result of the reduction calculation, where each of the plurality of thread bundles is configured to access all register resources in the second calculation unit to realize data reading and writing.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: using the matrix reduction cache to realize reduction calculation within a thread bundle, and writing the reduction result into a corresponding position in the matrix reduction cache, where the second calculation units located in the same calculation core can access the matrix reduction cache to realize data reading and writing.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: realizing reduction calculation among a plurality of second calculation units in the same first calculation unit by reading and writing the shared memory, where the second calculation units in the same first calculation unit can access the shared memory to realize data reading and writing.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: realizing reduction calculation among the plurality of first calculation units by reading and writing the matrix main cache, where the first calculation units in the same calculation core can access the matrix main cache to realize data reading and writing.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: reading data from the global memory by using the reduction module, performing reduction calculation on the data read by the reduction module and the data stored in a register or in the matrix reduction cache, and writing the reduction result back to the global memory, thereby realizing reduction calculation among a plurality of calculation cores.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: reading data used for the reduction calculation from the global memory, and writing the read data into a register in the second calculation unit.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: reading data used for the reduction calculation from the global memory and writing the read data into the matrix main cache; and reading data used for the reduction calculation from the matrix main cache and writing the read data into a register in the second calculation unit.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: writing a plurality of the first results into the matrix reduction cache; performing reduction calculation on the plurality of first results by using the matrix reduction cache to obtain second results of the reduction calculation; and performing reduction calculation on a plurality of second results from different calculation cores by using the reduction module.
For example, the reduction calculation method provided in at least one embodiment of the present disclosure further includes: performing reduction calculation on a plurality of first results from different calculation cores by using the reduction module.
At least one embodiment of the present disclosure provides an electronic device including a processor as provided by at least one embodiment of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below. It is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1 is a schematic diagram of a reduction calculation method using tree reduction mode;
FIG. 2 is a schematic block diagram of a processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of processor internal data flow interactions provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of a reduction calculation method provided by at least one embodiment of the present disclosure;
FIG. 5 is a workflow diagram of a reduction calculation method provided by at least one embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure;
fig. 7 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given their ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising," "comprises," or the like means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected," "coupled," or the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," and the like are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
In the reduction calculation, assume that there is input data $x_1, x_2, \ldots, x_n$; the reduction computation is essentially the computation of $x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ may represent addition, multiplication, taking the maximum, taking the minimum, logical AND, logical OR, or any other binary operator conforming to the associative law. In the neural network model, normalization class operators (e.g., Layer Normalization, Batch Normalization), pooling class operators (e.g., Mean Pooling, Max Pooling), Reduce class operators, the Softmax activation function, and the like all involve application scenarios of reduction calculation.
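Associativity is precisely what licenses parallel evaluation: the operands can be regrouped without changing the result. For example, for four inputs,

$$
x_1 \oplus x_2 \oplus x_3 \oplus x_4 = \left(x_1 \oplus x_2\right) \oplus \left(x_3 \oplus x_4\right),
$$

so the two parenthesized partial results can be computed simultaneously; applying the regrouping recursively yields the tree-structured parallel reduction discussed below.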
In the use of GPUs for high-performance parallel computing, a computing task is typically divided into multiple threads (threads) for execution. The multiple threads are combined into one or more thread blocks (blocks), and the multiple thread blocks are combined into one or more thread grids (Grid) to perform parallel computations.
Threads in the same thread block can access the same shared memory (Shared Memory) and can rapidly perform synchronization and communication operations. A thread block is the minimum unit of GPU parallel scheduling: all threads in one thread block must be executed together, while different thread blocks may run synchronously or asynchronously depending on the task. Consequently, all threads in one thread block must be allocated to the same computing unit for execution, whereas multiple thread blocks may be executed in the same computing unit or in different computing units.
The threads in a thread block are grouped and scheduled in units of thread bundles (Warp). Thread bundles are the basic execution units of the GPU; each thread bundle contains at most a fixed number of threads, e.g., 32. These threads execute the same instructions in a Single Instruction Multiple Thread (SIMT) fashion, differing only in the data they process. Within one thread bundle, all threads share register resources. In each computing unit, thread bundles are scheduled and allocated to the unit's multiple compute cores for execution; depending on the number of compute cores in the computing unit, the multiple thread bundles in a thread block may be executed simultaneously or in a time-shared manner. Within a thread bundle, threads may communicate with each other through a thread bundle shuffle (Warp Shuffle) instruction: threads in the same thread bundle access other threads' registers directly through registers, rather than through the shared memory or the global memory (Global Memory). This data interaction mode has the characteristics of low latency and high communication efficiency.
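As a concrete point of reference, on NVIDIA GPUs this intra-bundle register exchange is exposed through warp shuffle intrinsics such as __shfl_down_sync. The following minimal sketch of a warp-level sum reduction is standard CUDA, given only as the conventional baseline and not as the mechanism of the hardware described later in this disclosure:

```cuda
// Minimal warp-level sum reduction using the standard CUDA warp shuffle
// intrinsic. Data moves directly between the registers of threads in the
// same warp (thread bundle); no shared memory or global memory is touched.
__device__ float warp_reduce_sum(float val) {
    // A warp has 32 lanes; each step halves the number of lanes carrying data.
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;  // lane 0 now holds the sum over all 32 lanes
}
```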
In general, reduction among multiple data in a thread can be implemented through a register, reduction among multiple threads in a thread bundle can be implemented through a thread bundle shuffling instruction, reduction among multiple threads in a thread block can be implemented through a shared memory, and reduction among multiple thread blocks in a thread grid can be implemented through a global memory.
The inventors of the present disclosure noted that in GPUs, parallel reduction calculation typically employs a tree reduction pattern that requires multiple iterations. FIG. 1 is a schematic diagram of a reduction calculation method using the tree reduction pattern. As shown in FIG. 1, in each iteration, each thread performs reduction calculation on the data it is responsible for. After all threads complete the reduction calculation of their own data, reduction calculation is performed across the thread groups. Each round of iteration must end with a synchronization operation to ensure that all threads have completed their reduction tasks in the current iteration before the next iteration can start, and this global synchronization is very time-consuming.
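The per-iteration synchronization cost is easiest to see in code. Below is a minimal standard-CUDA sketch of the tree reduction pattern of FIG. 1, assuming a power-of-two block size; each of the log2(blockDim.x) iterations must end in a barrier before the next can begin:

```cuda
// Illustrative block-level tree reduction (standard CUDA; assumes blockDim.x
// is a power of two). Every round halves the number of active threads and
// ends with __syncthreads(), the synchronization overhead described above.
__global__ void tree_reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];  // blockDim.x floats, set at launch
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();  // all threads must finish this round first
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial result per block
}
```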
At least one embodiment of the present disclosure provides a processor comprising: a plurality of computing cores configured to implement reduction computations; a global memory configured to be accessible by a plurality of computing cores to enable data reading and writing; wherein each of the plurality of computing cores comprises: a matrix main cache configured to store data of the reduction calculation; a matrix reduction cache configured to perform reduction calculations and store the data used and generated; a plurality of first computing units configured to be accessible to a matrix main cache to enable data reading and writing, wherein each of the plurality of first computing units comprises: a shared memory configured to store data of the reduction calculation; the plurality of second computing units are configured to access the shared memory to realize data reading and writing, and access the matrix reduction cache to realize data reading and writing; wherein each of the plurality of computing cores further includes a reduction module configured to access the global memory to enable data read and write and to perform reduction calculations on the data written to the reduction module; wherein the matrix reduction cache is further configured to be accessible by all second computing units located within the same computing core.
The processor provided by the above at least one embodiment of the present disclosure adds two storage levels of the matrix reduction cache and the matrix main cache, thereby implementing a multi-level structure and improving the flexibility of parallel reduction computation. The multi-level parallel reduction calculation can be realized based on a plurality of storage levels, the multiple levels can be selected for combination according to actual application scenes and algorithms, the reduction of the multiple levels forms a pipeline structure, and the reduction calculation of different levels realizes pipeline parallelism, so that the parallelism degree and the execution efficiency are improved.
The processor provided by the present disclosure is described below in terms of several embodiments and examples thereof, and as described below, different features of these specific examples or embodiments may be combined with each other without contradiction to each other to obtain new examples or embodiments, which are also within the scope of the present disclosure.
Fig. 2 is a schematic block diagram of a processor provided in at least one embodiment of the present disclosure.
For example, as shown in FIG. 2, a processor provided by an embodiment of the present disclosure includes a plurality of computing cores (Streaming Processing Cluster, SPC) 10 and a global memory 11. For example, the plurality of SPCs 10 are configured to implement reduction calculation, and the global memory 11 is configured to be accessible by the plurality of SPCs 10 to realize data reading and writing. For example, the global memory includes a high bandwidth memory (High Bandwidth Memory, HBM). For example, the processor includes a graphics processor (Graphics Processing Unit, GPU) or a general-purpose graphics processor (GPGPU); of course, embodiments of the present disclosure are not limited in this respect, and the processor may be any other type of processor.
For example, as shown in fig. 2, each SPC 10 includes a matrix main buffer (GMB) 12, a plurality of first computing units (CU) 13, a reduction module 14, and a matrix reduction buffer (GEMM Reduction Buffer, GRB) 15, and each CU 13 includes a shared memory 21 and a plurality of second computing units (EU) 23.
For example, the plurality of CUs 13 are configured to be accessible to the GMB 12 to enable data reading and writing, and to enable reduction calculation between the plurality of CUs 13 by reading and writing the GMB 12.
For example, the GMB 12 is configured to store data of the reduction calculation, and the GMB 12 allows access by all CUs 13 within the same SPC to realize data reading and writing, that is, it allows access by all EUs 23 within the same SPC to realize data reading and writing. By reading and writing the GMB 12, reduction calculation among the plurality of CUs 13 in the same SPC, that is, among the plurality of EUs 23 in the same SPC, can be realized. Thus, the GMB 12 may be used to store temporary results of the reduction calculation in the SPC 10, data used or generated during the calculation, or final reduction results.
For example, the reduction module 14 is configured to access the global memory 11 to enable data reading and writing, and to perform reduction calculation on the data written to the reduction module 14. Specifically, the reduction module 14 reads data from the global memory 11, performs reduction calculation on the stored reduction data and the data read from the global memory 11, and writes the reduction result back to the global memory 11. For example, the reduction module 14 includes a cache with a reduction calculation function implemented based on a level-two cache (L2 Cache). Accordingly, the reduction module 14 can implement reduction calculation among the SPCs 10, with the temporary results of the reduction calculation among the SPCs 10, the data used or generated during the calculation, or the final reduction result stored in the global memory 11.
For example, the GRB 15 is configured to perform reduction calculations and store the data used and generated. Specifically, the GRB 15 may implement auto-reduce calculations on the data stored thereon in coordination with the corresponding hardware instruction settings, and store the resulting calculation results. For example, the GRB 15 is also configured to be accessible by all EUs 23 located in the same SPC 10 to enable data reading and writing, and by reading and writing the GRB 15, reduction calculations between multiple EUs 23 in the same SPC 10 can be achieved. Thus, the GRB 15 can also be used to store temporary results of the reduction calculations in the SPC 10, data used or generated during the calculation, or final reduction results.
For example, the EUs 23 in the same CU are configured to access the shared memory 21 to perform data reading and writing, and reduction calculation among the EUs 23 in the same CU is realized by reading and writing the shared memory 21. The plurality of EUs 23 are further configured to access the GRB 15 to enable data reading and writing, and reduction calculation among the plurality of EUs 23 can be realized by reading and writing the GRB 15. For example, each EU 23 includes a plurality of registers configured to store data of the reduction calculation. Each EU 23 supports a plurality of thread bundles, and each thread bundle is configured to access all register resources in the corresponding EU 23 to enable data reading and writing. Based on this characteristic of the thread bundles, reduction calculation between thread bundles in the same EU can be realized without using the shared memory or the global memory, which reduces access latency.
Each EU 23 also includes an arithmetic logic unit (Arithmetic Logic Unit, ALU) that can perform a wide variety of operations, such as integer and floating-point addition and multiplication, comparison operations, Boolean operations, bit shifts, and algebraic functions (e.g., planar interpolation, trigonometric functions, exponential functions, logarithmic functions). The arithmetic logic unit may read data from a specified location (also referred to as a source address) of a register and write execution results back to a specified location (also referred to as a destination address) of a register. In parallel operation, the arithmetic logic unit can provide independent register access for each thread, so that operations among threads are independent.
For example, the shared memory 21 is configured to store data of reduction calculation, the shared memory 21 allows access by all EU 23 in the same CU 13 to realize data reading and writing, and by reading and writing the shared memory 21, reduction calculation between a plurality of EU 23 in the same CU 13 can be realized. Thus, the shared memory 21 may be used to store temporary results of the reduction calculations in the CU 13, data used or generated during the calculation, or final reduction results.
Fig. 3 is a schematic diagram of processor internal data flow interaction provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 3, GMB 12 may exchange data with register 31. Each EU 23 may include a plurality of registers 31, the registers 31 configured to store data of the reduction calculation. For example, GMB 12 allows access to all CUs within the same SPC to perform data reading and writing, and thus GMB 12 also allows access to all EUs 23 within the same SPC to perform data reading and writing, EU 23 may write data in its register 31 to GMB 12, or may read data from GMB 12 and write the read data to register 31. For example, GMB 12 may also be configured to temporarily store data prefetched from global memory 11, and the prefetched data may be read from GMB 12 and written to register 31.
For example, as shown in FIG. 3, GRB 15 can exchange data with register 31. For example, the GRB 15 allows access to all EUs 23 within the same SPC to perform data reading and writing, and the EUs 23 may write the data in their registers 31 to the GRB 15, or may read the data from the GRB 15 and write the read data to the registers 31. For example, the GRB 15 may also perform a reduction calculation on the data read from the register 31 in cooperation with the corresponding hardware instruction.
For example, as shown in fig. 3, the reduction module 14 is configured to perform reduction calculation of data read from the global memory 11 and reduction calculation data stored in the register 31 or in the GRB 15, and write the reduction result back to the global memory 11, so that reduction calculation among a plurality of SPCs can be realized.
For example, as shown in FIG. 3, global memory 11 may exchange data with registers 31. For example, data may be read from global memory 11 into register 31, or data in register 31 may be written into global memory 11.
Fig. 4 is a schematic flow chart of a reduction calculation method provided in at least one embodiment of the present disclosure.
For example, the reduction calculation method is applicable to a graphic processor or a general-purpose graphic processor, and of course, embodiments of the present disclosure are not limited thereto, and the reduction calculation method may be applied to any other type of processor.
For example, a specific description of the reduction calculation method provided in at least one embodiment of the present disclosure will be made hereinafter by taking a graphics processor as an example, but it will be understood that the reduction calculation method is equally applicable to a general-purpose graphics processor or a processor having a similar architecture or principle, and the specific process will not be repeated.
As shown in FIG. 4, the reduction calculation method provided by the embodiment of the disclosure includes the following steps S101 to S105.
S101: Perform reduction calculation on the calculation data of a plurality of thread bundles in the same EU to obtain a first result of the reduction calculation, wherein each thread bundle is configured to access all register resources in the EU to realize data reading and writing.
For example, taking accumulation as the reduction operation, assume that thread bundle A and thread bundle B belong to the same EU. Since thread bundle A can access all register resources in the EU, thread bundle A can read the registers corresponding to thread bundle B and accumulate the data stored there with the data in its own registers, thereby realizing accumulation among thread bundles in the same EU. In the embodiments of the present disclosure, a thread bundle is not limited to accessing only its own register resources but can access all register resources in the EU where it is located, which improves the efficiency of data access and transfer, simplifies the steps and operations of reduction calculation among thread bundles, and further improves calculation efficiency.
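Standard CUDA offers no way for one thread bundle to read another bundle's registers, so the sketch below uses a hypothetical intrinsic, eu_load_warp_register, purely to illustrate the data flow of this step; the intrinsic's name and signature are assumptions, not a real API:

```cuda
// ILLUSTRATIVE ONLY: eu_load_warp_register() is a HYPOTHETICAL intrinsic
// standing in for this disclosure's "access all register resources in the
// EU". No such intrinsic exists in standard CUDA.
__device__ float eu_load_warp_register(int warp_id, int lane);  // hypothetical

__device__ float eu_cross_bundle_sum(float my_val, int my_warp, int num_warps) {
    int lane = threadIdx.x % 32;
    float sum = my_val;
    for (int w = 0; w < num_warps; ++w) {
        if (w == my_warp) continue;
        // Read the matching lane's partial value straight out of warp w's
        // registers, with no round trip through shared or global memory.
        sum += eu_load_warp_register(w, lane);
    }
    return sum;
}
```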
S102: Use the GRB to realize reduction calculation within the thread bundle, and write the reduction result into the corresponding position in the GRB, wherein the EUs in the same SPC can access the GRB to realize data reading and writing.
For example, taking addition as the reduction operation: relying on the GRB's ability to perform automatic reduction calculation in cooperation with the corresponding hardware instructions, the data in the thread bundle can be accumulated and the accumulation result written into the corresponding position in the GRB; alternatively, the data in the thread bundle can be accumulated directly onto the corresponding position in the GRB. Either way, accumulation within the thread bundle is realized.
S103: Realize reduction calculation among a plurality of EUs in the same CU by reading and writing the shared memory, wherein the EUs in the same CU can access the shared memory to realize data reading and writing.
For example, through the shared memory, threads in multiple EUs located in the same CU can read and write the same block of memory area, thus realizing data sharing and transfer. The shared memory can be used to store the reduction results of the EUs, and aggregation of the plurality of reduction results is then realized by reading and writing the shared memory, thereby realizing reduction calculation among the plurality of EUs.
S104: Realize reduction calculation among a plurality of CUs by reading and writing the GMB, wherein the GMB can be accessed by the CUs in the same SPC to realize data reading and writing.
For example, a CU includes multiple EUs, so all EUs in the same SPC can access the GMB to perform data reading and writing. For example, threads in EUs located in the same SPC can read and write the same memory area through the GMB, realizing data sharing and transfer. The GMB may be used to store the reduction results of the CUs, and aggregation of the reduction results is then achieved by reading and writing the GMB, thereby realizing reduction calculation among the CUs.
S105: Use the reduction module to read data from the global memory, perform reduction calculation on the data read by the reduction module and the data stored in the registers or in the GRB, and write the reduction result back to the global memory, thereby realizing reduction calculation among a plurality of SPCs.
For example, reduction calculation among a plurality of SPCs may be implemented using the reduction module. For example, the reduction module reads the data to be reduced from the global memory (the data to be reduced may be the reduction results of other SPCs), performs reduction calculation on the data to be reduced and the data stored in the registers, and writes the reduction result back to the designated location in the global memory. For example, through the reduction module, reduction calculation can also be performed on the data to be reduced and the data stored in the GRB, with the reduction result written back to the designated location in the global memory. In this way, data in the global memory does not need to be read back into registers for reduction calculation; the reduction module performs the calculation directly, which improves operation efficiency.
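For contrast, the conventional GPU route for reduction among computing cores is an atomic read-modify-write on global memory, as in the standard-CUDA sketch below; the reduction module described here instead performs this combine in hardware on the memory side, so the partial results never have to travel back into registers:

```cuda
// Conventional CUDA analogue of cross-core reduction: every partial result
// is folded into one global accumulator with atomicAdd. The serialization
// happens at the memory side; result must be zero-initialized beforehand.
__global__ void combine_partials(const float* partials, int num_partials,
                                 float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_partials) {
        atomicAdd(result, partials[i]);  // atomic read-modify-write
    }
}
```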
In multi-level parallel reduction calculation, not all of steps S101 to S105 need to be executed; rather, several levels may be selected and combined according to the actual application scenario and algorithm. For example, only any one step or any plurality of steps among S101 to S105 may be performed, and the order of performing the steps is not limited to the order shown in FIG. 4 but may follow actual needs; embodiments of the present disclosure are not limited in this respect. The reduction at multiple levels forms a pipeline structure, and the reduction calculations at different levels run in pipeline parallelism, which improves the degree of parallelism and the execution efficiency.
The reduction calculation method provided by at least one embodiment of the present disclosure is described below by taking reduction calculation in a batch normalization (Batch Normalization) operator as an example.
Batch normalization is a technique for deep neural networks that aims to speed up the training process and improve the generalization ability of the model. The main idea is to normalize the input of each batch (Batch) of the neural network to have zero mean and unit variance. This helps overcome internal covariate shift in the deep network, thereby promoting stability and convergence of the model. For batch normalization, both the forward and backward calculations involve reduction calculation. For example, in the forward calculation, for each channel (Channel), the mean and standard deviation of all samples in the current batch over that channel are calculated. For example, in the backward calculation, the weights of the batch normalization layer include a scaling parameter and a translation parameter, and in order to calculate the gradients of these two parameters, a reduction calculation also needs to be performed on the data within the current batch.
In neural networks, the tensor data transferred is typically represented using a four-dimensional tensor whose dimensions are denoted as [ N, C, H, W ], where N represents the number of samples (Batch Size) contained in each Batch, C represents the number of channels per sample, W represents the width of the feature map for each sample, and H represents the height of the feature map for each sample.
Taking the mean calculation of batch normalization as an example: for each channel, the sum of the values of all samples in the current batch over that channel is calculated and then divided by N×H×W. Thus, the accumulation calculation needs to be performed over the three dimensions N, H, and W.
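As a baseline for the multi-level scheme described below, a naive per-channel mean over an [N, C, H, W] tensor can be written as one standard-CUDA kernel (one thread block per channel, power-of-two block size assumed; illustrative only, not this disclosure's splitting strategy):

```cuda
// Naive per-channel mean for an NCHW tensor: one block per channel, each
// thread strides over that channel's N*H*W elements, then a shared-memory
// tree reduction combines the partials (assumes blockDim.x is a power of two).
__global__ void channel_mean_nchw(const float* x, float* mean,
                                  int N, int C, int H, int W) {
    extern __shared__ float sdata[];
    int c = blockIdx.x;                          // one thread block per channel
    long long hw = (long long)H * W;
    long long count = (long long)N * hw;         // N*H*W elements per channel
    float acc = 0.0f;
    for (long long k = threadIdx.x; k < count; k += blockDim.x) {
        long long n = k / hw, p = k % hw;        // sample index, pixel offset
        acc += x[(n * C + c) * hw + p];          // NCHW addressing
    }
    sdata[threadIdx.x] = acc;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) mean[c] = sdata[0] / (float)count;
}
```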
For example, one or more dimensions in [N, C, H, W] may be selected according to the hardware structure and characteristics to split the data; the present disclosure does not limit the splitting policy, and different splitting policies lead to different choices of reduction calculation levels. Here, the N dimension is split across the SPCs, and the C dimension is split across the EUs within each SPC, so reduction calculation between EUs is not needed. It should be noted that if another splitting policy is adopted, for example, a plurality of SPCs split the N dimension and a plurality of EUs within an SPC split the W dimension, then reduction calculation between the EUs will need to be performed.
FIG. 5 is a workflow diagram of a reduction calculation method provided in at least one embodiment of the present disclosure, for example, for the batch normalization calculation or other reduction calculation scenario described above.
As shown in FIG. 5, the reduction calculation method provided by the embodiment of the disclosure includes the following steps S1 to S8.
S1: Read data for the reduction calculation from the global memory, and write the read data into registers in the EU.
S2: Read data for the reduction calculation from the global memory, and write the read data into the GMB.
S3: Read data for the reduction calculation from the GMB, and write the read data into registers in the EU.
It should be noted that only one of the two paths, step S1 or steps S2+S3, needs to be executed; embodiments of the present disclosure do not limit this. For example, step S1 may be selected to read data directly from the global memory into registers, or steps S2+S3 may be selected to prefetch data from the global memory into the GMB and then read the data from the GMB into registers.
S4: Perform reduction calculation on the calculation data of a plurality of thread bundles in the same EU to obtain a first result of the reduction calculation, wherein each of the plurality of thread bundles is configured to access all register resources in the EU to realize data reading and writing.
Step S4 is the same as step S101 described above, and will not be described here.
After the first result is obtained, an intra-thread-bundle reduction also needs to be performed on the first result. For example, according to the GRB's characteristic of implementing automatic reduction calculation in cooperation with the corresponding hardware instructions, the reduction calculation within the thread bundle can be realized for the first result. For another example, a thread bundle shuffle (Warp Shuffle) instruction may be used, which allows threads within a thread bundle to directly read the registers of another thread, so that threads within a thread bundle can exchange data without passing through the shared memory or the global memory.
S5: Write the first results into the GRB, and perform reduction calculation on the plurality of first results by using the GRB to obtain second results of the reduction calculation.
S6: Perform reduction calculation on the plurality of second results from different SPCs by using the reduction module.
S7: Perform reduction calculation on a plurality of first results from different SPCs by using the reduction module.
It should be noted that only one of the two paths, steps S5+S6 or step S7, needs to be executed; embodiments of the present disclosure do not limit this. For example, steps S5+S6 may be selected: write a plurality of first results into the GRB, perform reduction calculation on the plurality of first results by using the GRB to obtain second results of the reduction calculation, and then use the reduction module to perform reduction calculation on the plurality of second results from different SPCs. Alternatively, step S7 may be selected: directly use the reduction module to perform reduction calculation on the plurality of first results from different SPCs.
S8: Use the reduction module to write the final reduction result back to the designated location in the global memory.
Fig. 6 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure. As shown in fig. 6, in some embodiments, the electronic device 600 includes a processor 610, where the processor 610 is a processor provided by any of the above embodiments of the present disclosure. The electronic device 600 may be any device having a computing function, such as a computer, a server, a smart phone, a tablet computer, etc., to which embodiments of the present disclosure are not limited.
Fig. 7 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure. As shown in fig. 7, the electronic device 700 includes a processor provided in any of the above embodiments of the present disclosure, and the electronic device 700 is suitable for implementing the reduction calculation method provided in the embodiments of the present disclosure, for example. The electronic device 700 may be a terminal device or a server, etc. It should be noted that the electronic device 700 shown in fig. 7 is only one example and does not impose any limitation on the functionality and scope of use of the disclosed embodiments.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 71, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 72 or a program loaded from a storage 78 into a Random Access Memory (RAM) 73. For example, the processing device 71 may be a processor provided by any of the above embodiments of the present disclosure. In the RAM 73, various programs and data required for the operation of the electronic apparatus 700 are also stored. The processing device 71, the ROM 72 and the RAM 73 are connected to each other via a bus 74. An input/output (I/O) interface 75 is also connected to bus 74.
In general, the following devices may be connected to the I/O interface 75: input devices 76 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 77 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; storage 78 including, for example, magnetic tape, hard disk, etc.; and communication means 79. Communication device 79 may allow electronic apparatus 700 to communicate wirelessly or by wire with other electronic apparatuses to exchange data. While fig. 7 shows an electronic device 700 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided, and that electronic device 700 may alternatively be implemented or provided with more or fewer means.
For a detailed description and technical effects of the electronic device 600/700, reference may be made to the above description of the processor and the reduction calculation method, which are not repeated here.
One or more embodiments of the present disclosure provide a processor, a reduction calculation method, and an electronic device, which have one or more of the following beneficial effects:
(1) In the processor provided in at least one embodiment of the present disclosure, each thread bundle is configured to access all register resources in a corresponding EU to implement data read-write, so that reduction computation between thread bundles in the same EU is implemented, without using a shared memory or a global memory, thereby improving computation efficiency.
(2) The processor provided by at least one embodiment of the present disclosure adds two brand-new storage levels, namely GRB and GMB, so as to implement a multi-level structure, and improve the layering and flexibility of parallel reduction computation, thereby improving the parallelism and the execution efficiency.
(3) The processor provided by at least one embodiment of the present disclosure is provided with a reduction module, and the reduction module has a reduction calculation function, so that the data in the global memory is not required to be read back into the register to perform reduction calculation, but is directly calculated by using the reduction module, thereby improving the operation efficiency.
(4) According to the reduction calculation method provided by at least one embodiment of the present disclosure, through a multi-level storage design, multi-level parallel reduction calculation can be implemented, multiple levels can be selected to be combined according to an actual application scenario and algorithm, multiple levels of reduction form a pipeline structure, and reduction calculation of different levels realizes pipeline parallelism, so that the parallelism degree and the execution efficiency are improved.
While the disclosure has been described in detail with respect to the general description and the specific embodiments thereof, it will be apparent to those skilled in the art that certain modifications and improvements may be made thereto based on the embodiments of the disclosure. Accordingly, such modifications or improvements may be made without departing from the spirit of the disclosure and are intended to be within the scope of the disclosure as claimed.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings for describing embodiments of the present disclosure, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.

Claims (19)

1. A processor, comprising:
a plurality of computing cores configured to implement reduction calculation; and
a global memory configured to be accessible by the plurality of computing cores to enable data reading and writing;
wherein each of the plurality of computing cores comprises:
a matrix main cache configured to store data of the reduction calculation;
a matrix reduction cache configured to perform the reduction calculation and store the data used and generated; and
a plurality of first computing units configured to be able to access the matrix main cache to enable data reading and writing, wherein each of the plurality of first computing units comprises:
a shared memory configured to store data of the reduction calculation; and
a plurality of second computing units configured to access the shared memory to enable data reading and writing, and to access the matrix reduction cache to enable data reading and writing;
wherein each of the plurality of computing cores further comprises a reduction module configured to access the global memory to enable data reading and writing and to perform the reduction calculation on data written to the reduction module; and
wherein the matrix reduction cache is further configured to be accessible by all second computing units located within the same computing core.
2. The processor of claim 1, wherein each of the plurality of second computing units comprises a plurality of registers configured to store data of the reduction calculation.
3. The processor of claim 2, wherein each of the plurality of second computing units supports a plurality of thread bundles, each of the plurality of thread bundles configured to access all register resources in the corresponding second computing unit to enable data reading and writing.
4. The processor of claim 1, wherein the plurality of second computing units are further configured to implement a reduction computation between the plurality of second computing units by reading from and writing to the shared memory.
5. The processor of claim 1, wherein the plurality of first computing units are further configured to implement reduction computations among the plurality of first computing units by reading from and writing to the matrix main cache.
6. The processor of claim 2, wherein the reduction module is further configured to perform reduction calculation on data read from the global memory and the data of the reduction calculation stored in a register or in a matrix reduction cache, and to write the reduction result back to the global memory.
7. The processor of any of claims 1-6, wherein the reduction module comprises a cache with a reduction calculation function implemented based on a level-two cache.
8. The processor of any of claims 1-6, wherein the global memory comprises high bandwidth memory.
9. The processor of any of claims 1-6, wherein the processor comprises a graphics processor or a general-purpose graphics processor.
10. A reduction calculation method implemented using the processor of any of claims 1-9, comprising:
performing reduction calculation on calculation data of a plurality of thread bundles in the same second calculation unit to obtain a first result of the reduction calculation, wherein each of the plurality of thread bundles is configured to access all register resources in the second calculation unit to realize data reading and writing.
11. The method of claim 10, further comprising:
utilizing the matrix reduction cache to realize reduction calculation within a thread bundle, and writing the reduction result into a corresponding position in the matrix reduction cache, wherein the second calculation units located in the same calculation core can access the matrix reduction cache to realize data reading and writing.
12. The method of claim 10, further comprising:
realizing reduction calculation among a plurality of second calculation units in the same first calculation unit by reading and writing the shared memory, wherein the second calculation units in the same first calculation unit can access the shared memory to realize data reading and writing.
13. The method of claim 10, further comprising:
realizing reduction calculation among the plurality of first calculation units by reading and writing the matrix main cache, wherein the first calculation units in the same calculation core can access the matrix main cache to realize data reading and writing.
14. The method of claim 10, further comprising:
reading data from the global memory by using the reduction module, performing reduction calculation on the data read by the reduction module and the data stored in a register or in a matrix reduction cache, and writing the reduction result back to the global memory, thereby realizing reduction calculation among a plurality of calculation cores.
15. The method of any of claims 10-14, further comprising:
reading data used for the reduction calculation from the global memory, and writing the read data into a register in the second calculation unit.
16. The method of any of claims 10-14, further comprising:
reading data used for reduction calculation from the global memory, and writing the read data into the matrix main cache;
and reading data used for the reduction calculation from the matrix main cache and writing the read data into a register in the second calculation unit.
17. The method of any of claims 10-14, further comprising:
writing a plurality of the first results into the matrix reduction cache;
performing reduction calculation on the plurality of first results by using the matrix reduction cache to obtain second results of the reduction calculation;
and performing reduction calculation on a plurality of second results from different calculation cores by utilizing the reduction module.
18. The method of any of claims 10-14, further comprising:
performing reduction calculation on a plurality of first results from different calculation cores by utilizing the reduction module.
19. An electronic device comprising the processor of any of claims 1-9.
CN202410171926.0A 2024-02-07 2024-02-07 Processor, reduction calculation method and electronic equipment Active CN117785480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410171926.0A CN117785480B (en) 2024-02-07 2024-02-07 Processor, reduction calculation method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410171926.0A CN117785480B (en) 2024-02-07 2024-02-07 Processor, reduction calculation method and electronic equipment

Publications (2)

Publication Number Publication Date
CN117785480A CN117785480A (en) 2024-03-29
CN117785480B true CN117785480B (en) 2024-04-26

Family

ID=90396569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410171926.0A Active CN117785480B (en) 2024-02-07 2024-02-07 Processor, reduction calculation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117785480B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118277331A (en) * 2024-06-03 2024-07-02 北京壁仞科技开发有限公司 Computing device, method of performing normalized class operations in a computing device, computer-readable storage medium, and computer program product
CN118484322B (en) * 2024-07-15 2024-10-22 北京壁仞科技开发有限公司 Method, computing device, medium and program product for matrix reduction computation
CN118590497B (en) * 2024-08-02 2024-10-11 之江实验室 All-reduction communication method and device based on heterogeneous communication

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN105739951A (en) * 2016-03-01 2016-07-06 浙江工业大学 GPU-based L1 minimization problem fast solving method
CN114416454A (en) * 2022-01-30 2022-04-29 上海壁仞智能科技有限公司 Hardware simulation method of graphic processor and computer readable storage medium
CN114556823A (en) * 2019-09-03 2022-05-27 辉达公司 Parallel CRC implementation on image processing units
WO2023184836A1 (en) * 2022-03-31 2023-10-05 深圳清华大学研究院 Subgraph segmented optimization method based on inter-core storage access, and application
CN117234720A (en) * 2023-09-15 2023-12-15 清华大学 Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230289292A1 (en) * 2022-03-10 2023-09-14 Nvidia Corporation Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN105739951A (en) * 2016-03-01 2016-07-06 浙江工业大学 GPU-based L1 minimization problem fast solving method
CN114556823A (en) * 2019-09-03 2022-05-27 辉达公司 Parallel CRC implementation on image processing units
CN114416454A (en) * 2022-01-30 2022-04-29 上海壁仞智能科技有限公司 Hardware simulation method of graphic processor and computer readable storage medium
WO2023184836A1 (en) * 2022-03-31 2023-10-05 深圳清华大学研究院 Subgraph segmented optimization method based on inter-core storage access, and application
CN117234720A (en) * 2023-09-15 2023-12-15 清华大学 Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Non-recursive Parallel Computation of Matrix Multiplication on Multi-core Computers; Lu Zhonglong; Zhong Cheng; Huang Hualin; Journal of Chinese Computer Systems (小型微型计算机系统); 2011-05-15 (05); full text *

Also Published As

Publication number Publication date
CN117785480A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
CN117785480B (en) Processor, reduction calculation method and electronic equipment
US11182159B2 (en) Vector reductions using shared scratchpad memory
CN112668708B (en) Convolution operation device for improving data utilization rate
US20210357732A1 (en) Neural network accelerator hardware-specific division of inference into groups of layers
US20140257769A1 (en) Parallel algorithm for molecular dynamics simulation
CN118012628B (en) Data processing method, device and storage medium
US11494326B1 (en) Programmable computations in direct memory access engine
TWI754310B (en) System and circuit of pure functional neural network accelerator
US20240311204A1 (en) Techniques for balancing workloads when parallelizing multiply-accumulate computations
CN118035618B (en) Data processor, data processing method, electronic device, and storage medium
CN116774968A (en) Efficient matrix multiplication and addition with a set of thread bundles
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
JPWO2016024508A1 (en) Multiprocessor device
CN117271136A (en) Data processing method, device, equipment and storage medium
CN115481364A (en) Parallel computing method for large-scale elliptic curve multi-scalar multiplication based on GPU (graphics processing Unit) acceleration
TWI798591B (en) Convolutional neural network operation method and device
CN117940934A (en) Data processing apparatus and method
CN118503600B (en) Data processing method and device, processor, electronic equipment and storage medium
TWI856218B (en) Vector reductions using shared scratchpad memory
CN118503601B (en) Data processing method and device, processor, electronic equipment and storage medium
CN112948758B (en) Data processing method, device and chip
CN118312327B (en) Hardware resource allocation method, electronic device and storage medium
CN118278474A (en) Three-dimensional convolution parallel computing method, device and equipment based on multi-core processor
WO2024131170A1 (en) Operator processing method and apparatus, and chip, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant