CN118012628A - Data processing method, device and storage medium - Google Patents

Data processing method, device and storage medium

Info

Publication number: CN118012628A
Authority: CN (China)
Prior art keywords: data, unit, operator, data block, block
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410296982.7A
Other languages: Chinese (zh)
Inventor: name withheld at the applicant's request
Current assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Application filed by Shanghai Bi Ren Technology Co ltd and Beijing Bilin Technology Development Co ltd
Priority to CN202410296982.7A
Publication of CN118012628A
Status: Pending

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data processing method, apparatus and storage medium. The data processing method comprises the following steps: partitioning the data used during execution of an operator to obtain a plurality of data blocks; determining the storage units of the plurality of data blocks according to their processing order; and the computing unit calling the plurality of data blocks, according to their storage locations, to execute the operator. With only a small data-transfer overhead, the data processing method increases the TLR capacity temporarily available during GPGPU operator computation and reduces spills of data to the HBM caused by a shortage of TLRs, thereby improving the overall performance of the operator; it also improves the utilization of the high-speed storage space, avoids failure of the operator algorithm, and increases the flexibility of data processing, thereby greatly improving data-processing efficiency.

Description

Data processing method, device and storage medium
Technical Field
Embodiments of the present disclosure relate to a data processing method, a data processing apparatus, and a storage medium, and more particularly, to large-scale data processing on a general-purpose graphics processor in the field of artificial intelligence.
Background
A general-purpose graphics processor (GPGPU) is a graphics processor used to perform general-purpose computing tasks that would otherwise be processed by a central processing unit (CPU). The powerful parallel processing capabilities and programmable pipelines of modern graphics processors enable them to process non-graphics data. Particularly in the field of artificial intelligence, when facing single-instruction-stream multiple-data-stream (SIMD) workloads in which the amount of data computation far exceeds the demands of data scheduling and transfer, a general-purpose graphics processor greatly outperforms a traditional CPU application in performance.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing method, including: partitioning data used in the operator executing process to obtain a plurality of data blocks; determining storage units of the plurality of data blocks according to the processing sequence of the plurality of data blocks; the computing unit calls the plurality of data blocks to execute the operator according to the storage positions of the plurality of data blocks.
For example, in a data processing method provided in at least one embodiment of the present disclosure, performing a blocking process on data used in an operator executing process to obtain a plurality of data blocks includes: and carrying out block processing on the data according to the size of each storage unit and the processing sequence of the data used in the operator execution process.
For example, in a data processing method provided in at least one embodiment of the present disclosure, the storage unit includes a register unit, a first buffer unit, and a second buffer unit, and performing a block processing on data according to a size of each storage unit and a processing sequence of the data used in the operator executing process, where the block processing includes: dividing the data into a first data block, a second data block and a third data block when the data amount of the data is larger than the storage capacities of the register unit and the first cache unit; when the data amount of the data is larger than the storage capacity of the register unit and smaller than the storage capacity of the first cache unit, the data is divided into a first data block and a second data block.
For example, in the data processing method provided in at least one embodiment of the present disclosure, the first data block is stored in the register unit, the second data block is stored in the first buffer unit, and the third data block is stored in the second buffer unit.
For example, in the data processing method provided in at least one embodiment of the present disclosure, the processing sequence of the first data block is earlier than the processing sequence of the second data block, the processing sequence of the second data block is earlier than the processing sequence of the third data block, the distance between the register unit and the calculation unit is smaller than the distance between the first cache unit and the calculation unit, and the distance between the first cache unit and the calculation unit is smaller than the distance between the second cache unit and the calculation unit.
For example, in the data processing method provided in at least one embodiment of the present disclosure, when the operator is a single operator, the data processing method further includes: dividing the single operator into a plurality of sub-parts, and performing block processing on data used by each sub-part to obtain a plurality of data blocks; determining storage units of the plurality of data blocks according to the processing sequence of the data of each sub-part of the operator and the capacity of the storage units; the computing unit calls the plurality of data blocks to execute the sub-parts of the single operator according to the storage positions of the plurality of data blocks.
For example, in the data processing method provided in at least one embodiment of the present disclosure, the plurality of sub-portions includes N sub-portions, a processing order of the nth sub-portion is earlier than a processing order of the n+1th sub-portion, and the calculating unit calls the plurality of data blocks to execute the respective sub-portions of the single operator according to the storage locations of the plurality of data blocks, including: the calculation unit performs calculation of the nth sub-portion using the first data block in the register unit and obtains the calculation result; storing the calculation result to the first buffer unit as the second data block or the second buffer unit as the third data block according to the use time of the calculation result of the nth sub-portion; reading the second data block from the first buffer unit or the third data block from the second buffer unit to the register unit according to the execution sequence of the single operator so that the calculation unit can execute the calculation of the n+1th subsection; wherein N is an integer greater than 1, N is an integer greater than 0 and less than or equal to N.
For example, in the data processing method provided in at least one embodiment of the present disclosure, when the operator is a fusion operator, the data processing method further includes: splitting the fusion operator into a plurality of single operators, and performing block processing on data used by each single operator to obtain a plurality of data blocks; determining storage units of the plurality of data blocks based on the execution sequence of each single operator, the processing sequence of the data of each single operator and the capacity of the storage units; and the computing unit calls the data blocks to execute the fusion operator according to the storage positions of the data blocks.
For example, in a data processing method provided in at least one embodiment of the present disclosure, a computing unit calls the plurality of data blocks to execute the operator according to storage locations of the plurality of data blocks, including: performing a partial calculation of an operator based on the first data block stored in the register unit to obtain temporary data; storing the temporary data to the register unit, the first buffer unit or the second buffer unit based on a processing order of the temporary data, a capacity of a storage unit; reading the second data block stored in the first cache unit or the third data block stored in the second cache unit to the register unit for use by the computing unit to execute the operator.
At least one embodiment of the present disclosure also provides a data processing apparatus, including: the partitioning unit is configured to perform partitioning processing on data used in the operator executing process to obtain a plurality of data blocks; a determining unit configured to determine storage units of the plurality of data blocks according to a processing order of the plurality of data blocks; and the computing unit is configured to call the plurality of data blocks to execute the operator according to the storage positions of the plurality of data blocks.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the blocking unit is further configured to: and carrying out block processing on the data according to the size of each storage unit and the processing sequence of the data used in the operator execution process.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the storage unit includes a register unit, a first buffer unit, and a second buffer unit, and the blocking unit is further configured to: dividing the data into a first data block, a second data block and a third data block when the data amount of the data is larger than the storage capacities of the register unit and the first cache unit; when the data amount of the data is larger than the storage capacity of the register unit and smaller than the storage capacity of the first cache unit, the data is divided into a first data block and a second data block.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the first data block is stored in the register unit, the second data block is stored in the first buffer unit, and the third data block is stored in the second buffer unit.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the processing order of the first data block is earlier than the processing order of the second data block, the processing order of the second data block is earlier than the processing order of the third data block, the distance between the register unit and the calculation unit is smaller than the distance between the first cache unit and the calculation unit, and the distance between the first cache unit and the calculation unit is smaller than the distance between the second cache unit and the calculation unit.
At least one embodiment of the present disclosure also provides a data processing apparatus, including: a processor; a memory; one or more computer program modules, wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, the one or more computer program modules comprising instructions for performing a data processing method provided by any of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides a storage medium that non-transitory stores computer readable instructions that, when executed by a computer, perform the data processing method provided by any of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
Fig. 1 is a schematic diagram of a data processing method according to at least one embodiment of the present disclosure.
Fig. 2 illustrates a flow chart of a data processing method provided by at least one embodiment of the present disclosure.
FIG. 3A illustrates a flow chart of a data processing method for a single operator example provided by at least one embodiment of the present disclosure.
FIG. 3B shows a flow chart of a data processing method when the single operator is a matrix multiplier.
FIG. 4 illustrates a flow chart of a data processing method of an example fusion operator provided by at least one embodiment of the present disclosure.
Fig. 5 is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure.
Fig. 6 is a schematic block diagram of another data processing apparatus provided in at least one embodiment of the present disclosure.
Fig. 7 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure.
Fig. 8 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In the current operator development process for GPGPUs, three storage spaces are usually used: the Global Shared Memory (GSM), the extended Thread Local Register (TLR), and High Bandwidth Memory (HBM). The TLR is closest to the computing unit and the fastest, and is usually used to store temporary data of operator computations. The GSM is commonly used to temporarily store data that needs to be used repeatedly. The HBM is the external video memory of the GPGPU; it is farthest from the computing unit and the slowest, and is usually used to store the inputs and outputs of operator computations and all other related data. In current operator implementations, especially matrix multiplication, convolution, and other operators that require interaction between a tensor computing unit (Tcore) and a vector computing unit (Vcore), the TLR is typically used for performance reasons to store the temporary interaction data of the two.
The inventor notes that, due to chip-area limitations, the number of TLRs is usually very limited. In the prior art, if the temporary space required by an operator algorithm is large, especially when the interaction data between the Tcore and the Vcore is large, the TLR capacity may be insufficient and the excess data spills to the HBM, resulting in poor operator performance. When designing and implementing high-performance algorithms, the algorithm is strictly limited by the total number of TLRs, and flexibility is reduced.
At least one embodiment of the present disclosure provides a data processing method, which partitions the data used during operator execution to obtain a plurality of data blocks; determines the storage units of the plurality of data blocks according to their processing order; and has the computing unit call the plurality of data blocks, according to their storage locations, to execute the operator.
At least one embodiment of the present disclosure also provides a data processing apparatus and a storage medium.
The data processing method provided by at least one embodiment of the present disclosure can, with only a small data-transfer overhead, increase the TLR capacity temporarily available during GPGPU operator computation and reduce spills of data to the HBM caused by a shortage of TLRs, thereby improving the overall performance of operators; it also improves the utilization of the high-speed storage space, avoids failure of the operator algorithm, and increases the flexibility of data processing, thereby greatly improving data-processing efficiency.
Embodiments of the present disclosure and some examples thereof are described in detail below with reference to the attached drawings.
Fig. 1 is a schematic diagram of a data processing method according to at least one embodiment of the present disclosure. As shown in fig. 1, the data processing method includes steps S110 to S130.
Step S110: and carrying out blocking processing on data used in the operator execution process to obtain a plurality of data blocks.
Step S120: according to the processing sequence of the plurality of data blocks, the storage units of the plurality of data blocks are determined.
Step S130: the computing unit calls the plurality of data blocks to execute the operator according to the storage locations of the plurality of data blocks.
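The three steps above can be sketched as a small simulation. All names here (Block, partition, assign_tiers, run_operator) and the earliest-used-block-to-fastest-tier policy are illustrative assumptions, not taken from the patent itself:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Block:
    data: List[float]
    tier: str = ""  # "TLR", "GSM", or "GMB"

def partition(data: List[float], block_size: int) -> List[Block]:
    """Step S110: split the operator's data into fixed-size blocks."""
    return [Block(data[i:i + block_size]) for i in range(0, len(data), block_size)]

def assign_tiers(blocks: List[Block], tlr_blocks: int, gsm_blocks: int) -> None:
    """Step S120: earliest-used blocks go to the fastest tier (TLR),
    later-used blocks to the GSM, and the remainder to the GMB."""
    for i, b in enumerate(blocks):
        if i < tlr_blocks:
            b.tier = "TLR"
        elif i < tlr_blocks + gsm_blocks:
            b.tier = "GSM"
        else:
            b.tier = "GMB"

def run_operator(blocks: List[Block], op: Callable[[float], float]) -> List[float]:
    """Step S130: the computing unit processes blocks in order; blocks not
    in the TLR are (conceptually) first read back into registers."""
    out = []
    for b in blocks:  # processing order matches the tier assignment order
        out.extend(op(x) for x in b.data)
    return out

blocks = partition([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2)
assign_tiers(blocks, tlr_blocks=1, gsm_blocks=1)
result = run_operator(blocks, lambda x: x * 2)
```

In this toy run the three blocks land on the three tiers in processing order, and the operator (a doubling function) is applied block by block.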
For step S110, for example, in some examples, the storage unit includes a register unit, a first cache unit, and a second cache unit. For example, the register unit is located at a distance from the calculation unit that is smaller than the distance from the first cache unit to the calculation unit that is smaller than the distance from the second cache unit to the calculation unit.
For example, the register unit may be the TLR, which is closest to the computing unit and has the smallest capacity; the first cache unit may be the GSM, which has a larger capacity than the TLR but is farther from the computing unit; and the second cache unit may be the vector computing unit store (GMB), which has a larger capacity than the GSM but is farther still from the computing unit. The following description takes the register unit as the TLR, the first cache unit as the GSM, and the second cache unit as the GMB as an example; embodiments of the present disclosure are not limited thereto.
For example, in some examples, the data is partitioned according to the size of each storage unit and the processing order of the data used during operator execution, so as to make the best use of each storage unit. For example, when the amount of data is larger than the storage capacities of the register unit and the first cache unit, the data may be divided into a first data block, a second data block, and a third data block; when the amount of data is larger than the storage capacity of the register unit but smaller than the storage capacity of the first cache unit, the data may be divided into a first data block and a second data block.
For example, when the amount of data is larger than the storage capacities of the register unit and the first cache unit, the first data block is stored in the register unit, the second data block in the first cache unit, and the third data block in the second cache unit. When the amount of data is larger than the storage capacity of the register unit but smaller than the storage capacity of the first cache unit, the first data block is stored in the register unit and the second data block in the first cache unit. When the amount of data is smaller than the storage capacity of the register unit, the data can be stored directly in the register unit without being partitioned. Data is thus stored flexibly within the effective space of each storage unit, which improves the speed of data interaction and the efficiency of data processing.
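The capacity-based splitting rule just described might look like the following sketch. The function name is invented, and reading the three-block threshold as the combined TLR and GSM capacities is an assumption for illustration:

```python
def split_by_capacity(n_bytes: int, tlr_cap: int, gsm_cap: int) -> dict:
    """Split the data by storage-unit capacity, per the rules above.
    Returns a dict mapping a storage-unit name to the bytes placed there.
    Treats 'larger than the capacities of both units' as larger than their
    combined capacity; names and byte units are illustrative assumptions."""
    if n_bytes <= tlr_cap:                       # fits in registers: no split
        return {"TLR": n_bytes}
    if n_bytes <= tlr_cap + gsm_cap:             # first + second data block
        return {"TLR": tlr_cap, "GSM": n_bytes - tlr_cap}
    return {"TLR": tlr_cap, "GSM": gsm_cap,      # first, second and third block
            "GMB": n_bytes - tlr_cap - gsm_cap}
```

For instance, with a 4-byte TLR and an 8-byte GSM, 10 bytes of data split into two blocks while 20 bytes split into three.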
For example, the data used above in the present disclosure may include initial input data calculated by an operator, temporary data generated during execution according to the operator, and final output data.
For step S120, for example, the processing order of the first data block is earlier than the processing order of the second data block (i.e., the use time of the first data block is earlier than the use time of the second data block), and the processing order of the second data block is earlier than the processing order of the third data block (i.e., the use time of the second data block is earlier than the use time of the third data block).
For example, as shown in the flowchart of fig. 2, when designing the operator algorithm, the usage times of the operator's data are analyzed first. According to the usage time (i.e., processing time) of the operator's data and the amount of temporary data generated, the data about to be used in the operator algorithm (for example, the first data block) is stored in the TLR for direct computation, while data not processed in the short term (for example, the "TLR data not used in the recent computation" and the "TLR data not used in the next computation" in fig. 2) is placed in the GSM and the GMB for hierarchical temporary storage. For example, small temporary data to be used soon (e.g., the second data block) is stored in the GSM, while larger temporary data, or temporary data not used for a long time (e.g., the third data block), is stored in the GMB, and the TLR space it occupied is released; the limited TLR capacity is thus hierarchically extended using the high-speed storage spaces of the GSM and the GMB. When needed later, these data are read from the GSM and the GMB back into the TLR for the computing unit to use, which reduces the operator algorithm's demand on TLR capacity and improves operator performance and flexibility.
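The hierarchical temporary-storage policy (soon-reused small data to the GSM; large or long-idle data to the GMB) can be sketched as a tiny placement function. The "soon" threshold, the parameter names, and the size model are all illustrative assumptions:

```python
def place_temporary(size: int, steps_until_reuse: int, gsm_free: int) -> str:
    """Choose where spilled temporary data goes: small data reused soon is
    kept in the GSM; large data, or data idle for a long time, goes to the
    GMB. The 'soon' threshold of 2 steps is an illustrative assumption."""
    SOON = 2  # reuse within the next 2 computation steps counts as "recent"
    if steps_until_reuse <= SOON and size <= gsm_free:
        return "GSM"
    return "GMB"
```

Under this sketch, data that is either reused too late or too large for the free GSM space falls through to the GMB tier.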
For example, in embodiments of the present disclosure, the operators may be single operators or fusion operators, as embodiments of the present disclosure are not limited in this regard.
For example, when the operator is a single operator, the data processing method further includes: dividing a single operator into a plurality of sub-parts, and performing block processing on data used by each sub-part to obtain a plurality of data blocks; determining storage units of a plurality of data blocks according to the processing sequence of data of each sub-part of the single operator and the capacity of the storage units; the computing unit calls the plurality of data blocks to execute the sub-parts of the single operator according to the storage positions of the plurality of data blocks.
For example, in some examples, the plurality of sub-portions includes N sub-portions, a processing order of the nth sub-portion is earlier than a processing order of the n+1th sub-portion, the computing unit calls each sub-portion of the plurality of data blocks to perform a single operator according to a storage location of the plurality of data blocks, including: the calculation unit performs calculation of the nth sub-portion by using the first data block in the register unit and obtains a calculation result; according to the using time of the calculation result of the nth sub-part, storing the calculation result into a first buffer unit as a second data block or a second buffer unit as a third data block; reading the second data block from the first buffer unit or the third data block from the second buffer unit to the register unit according to the execution sequence of the single operator so as to enable the calculation unit to execute the calculation of the n+1th subsection; n is an integer greater than 1, N is an integer greater than 0 and less than or equal to N.
FIG. 3A illustrates a flowchart of a data processing method for a single-operator example provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 3A, in the single-operator case, the operator's computation process can generally be split into multiple steps (i.e., multiple sub-portions), each with its own data dependencies and data outputs. The description below takes N=3 as an example, that is, the operator includes a first sub-portion (e.g., corresponding to the nth sub-portion), a second sub-portion (e.g., corresponding to the (n+1)th sub-portion), and a third sub-portion; more or fewer sub-portions may be included as the case requires, and embodiments of the present disclosure are not limited in this respect. For example, as shown in fig. 3A, computing the first sub-portion produces some temporary data (i.e., the computation result of the first sub-portion) that is needed by the third sub-portion. This temporary data can therefore be moved from the TLR to the GSM or the GMB according to its usage time and size (large data, or data not used soon, goes to the GMB; otherwise it goes to the GSM), releasing TLR resources for the computation of the second sub-portion. After the second sub-portion finishes, the data (i.e., the computation result of the first sub-portion) is fetched back into the TLR, according to its usage time, for the subsequent sub-portion to use. This data thus does not occupy the TLR for a long time, avoiding algorithm failure or data overflow caused by a shortage of TLRs. It should be noted that the second and third sub-portions may be different parts of a parallel processing algorithm.
That is, a computation that is parallelizable but requires a large TLR space can be partitioned and completed over multiple passes, with each pass occupying fewer TLR resources.
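A toy trace of this spill-and-restore pattern (fig. 3A) follows, modeling a one-slot TLR and the GSM as plain dicts. The dict-based model, the slot count, and all names are illustrative assumptions:

```python
tlr, gsm = {}, {}              # register file and first cache, modeled as dicts

def spill(name: str) -> None:
    gsm[name] = tlr.pop(name)  # move a value out of the TLR, freeing its slot

def restore(name: str) -> None:
    tlr[name] = gsm.pop(name)  # bring the value back before it is consumed

tlr["r1"] = 10                 # sub-portion 1 computes a temporary result
spill("r1")                    # r1 is not needed by sub-portion 2: park it in GSM
assert len(tlr) == 0           # the single TLR slot is free again
tlr["r2"] = 20                 # sub-portion 2 runs using the freed slot
r2 = tlr.pop("r2")
restore("r1")                  # sub-portion 3 consumes sub-portion 1's result
r3 = tlr.pop("r1") + r2
```

The point of the trace is that r1 never occupies the TLR while sub-portion 2 runs, yet is back in registers before sub-portion 3 uses it.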
FIG. 3B shows a flow chart of a data processing method when the single operator is a matrix multiplier. The following describes in detail the example of this single operator as Matrix Multiplication (MMA) operator in connection with fig. 3B.
It should be noted that the single operator is not limited to a matrix multiplication operator, but may be other operators, as embodiments of the present disclosure are not limited in this regard.
In a matrix multiplication operator, it is generally necessary to split the whole matrix multiplication into multiple matrix blocks for multiply-add computation: partial sums of the matrix blocks are computed by a tensor computing unit (Tcore), then the partial sums of each matrix block are transmitted to multiple vector computing units (Vcores), which complete the accumulation in their TLRs. When a Vcore receives the partial-sum data, matrix-multiplication blocks of certain sizes require a large amount of temporary partial-sum data, which may leave the number of TLRs insufficient, so that such block sizes cannot be used, affecting operator flexibility and performance.
For this case, in the data processing method provided by the embodiments of the present disclosure, for example, as shown in fig. 3B, suppose that during the computation of the matrix multiplication operator the matrix is divided into 8 blocks [1, 8] that must be accumulated over multiple passes, but the TLRs can hold the data of only 4 blocks [1, 4] at a time (for example, corresponding to the first data block described above). In this case, the operator temporarily stores the data of the second half [5, 8] (for example, corresponding to the second data block) in the corresponding GSM (if the GSM capacity is still insufficient, the excess portion (for example, the third data block) is temporarily stored in the GMB), and uses the TLR to accumulate the data of the first half [1, 4] to obtain a partial result. After one accumulation pass over the first half [1, 4] (for example, corresponding to the computation of the nth sub-portion) is completed, the partial result is temporarily stored in the GSM or the GMB, and the data of the second half [5, 8] is read from the GSM or the GMB into the TLR for its accumulation pass (for example, corresponding to the computation of the (n+1)th sub-portion). The front and rear halves of the data are accumulated by alternately using the TLR resources, and the final computation result is obtained and output.
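The alternating accumulation over fig. 3B's 8 blocks with a 4-block TLR can be mimicked numerically. The partial-sum values and variable names are illustrative, not from the patent:

```python
partials = [1, 2, 3, 4, 5, 6, 7, 8]   # partial sums from the Tcore, blocks [1..8]
TLR_CAP = 4                           # assume the TLR holds only 4 blocks at once

tlr = partials[:TLR_CAP]              # first half [1..4] lives in the TLR
gsm = partials[TLR_CAP:]              # second half [5..8] parked in the GSM

acc = sum(tlr)                        # first accumulation pass, in the TLR
saved = acc                           # partial result conceptually spilled to GSM/GMB
tlr = gsm                             # second half read back into the TLR
acc = saved + sum(tlr)                # second pass produces the final result
```

Both halves reuse the same TLR space, and the final accumulated value equals what a single 8-block pass would have produced.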
It should be noted that, because the GSM is physically distributed within each Vcore, its transmission path is shorter and its overhead lower than that of the GMB, which physically resides in the streaming processor cluster (Streaming-Processor-Cluster, SPC for short; an SPC includes a plurality of vector computing units and one tensor computing unit). Therefore, the data processing method provided by the embodiments of the present disclosure preferentially uses the GSM for storage.
It should be noted that the "4 blocks" and "8 blocks" above serve only to illustrate the above embodiments of the present disclosure; these numbers may be flexibly configured according to the operator characteristics and the capacities of the TLR, the GSM, and the GMB, and embodiments of the present disclosure are not limited thereto.
For example, in other examples, when the operator is a blending operator, the data processing method further comprises: splitting the fusion operator into a plurality of single operators, and performing block processing on data used by each single operator to obtain a plurality of data blocks; determining storage units of a plurality of data blocks based on the execution sequence of each single operator, the processing sequence of the data of each single operator and the capacity of the storage units; the computing unit calls the data blocks to execute the fusion operator according to the storage positions of the data blocks.
FIG. 4 illustrates a flow chart of a data processing method of an example fusion operator provided by at least one embodiment of the present disclosure. The data processing method of the fusion operator is described in detail below with reference to fig. 4.
For the fusion operator, the whole calculation process can be split into a plurality of single operators (for example, operator 1 and operator 2 shown in fig. 4; more operators may be included, which is not limited by the embodiments of the present disclosure), and each operator has its own data dependencies and data outputs. For example, the calculation result of operator 1 needs to be used by operator 2; normally, this partial data is stored in the TLR and overflows to the HBM when the TLR capacity is insufficient, and because the HBM is far away from the calculation unit, the performance is poor.
For example, as shown in fig. 4, in the embodiments of the present disclosure, the GSM and the GMB may be mapped hierarchically and logically treated as additional TLRs. When the TLR capacity is insufficient to store the interaction data of the fusion operator, the excess portion (i.e., the second data block) is stored in the GSM, which has a higher internal bandwidth within each compute unit (Compute Unit, CU for short, which performs vector computation and buffers related data); if the GSM capacity is still insufficient, or the GSM capacity has already been used up in the calculation process of operator 1, the excess temporary data (i.e., the third data block) is stored in the GMB of the SPC where it resides. In this way, the TLR capacity is logically enlarged in use, and a fusion operator with a larger data-interaction granularity is realized.
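The TLR → GSM → GMB spill order described above can be sketched as a simple tier-placement routine; the tier names, capacities, and function name are assumptions for illustration only:

```python
# Illustrative allocator mirroring the spill order in the text: each block
# goes to the fastest storage tier that still has room, trying TLR first,
# then GSM, then GMB. All names and capacities are hypothetical.

def place_blocks(block_sizes, tlr_free, gsm_free, gmb_free):
    """Assign each block to the fastest tier with room, in TLR/GSM/GMB order."""
    placement = []
    for size in block_sizes:
        if size <= tlr_free:
            placement.append("TLR"); tlr_free -= size
        elif size <= gsm_free:
            placement.append("GSM"); gsm_free -= size
        elif size <= gmb_free:
            placement.append("GMB"); gmb_free -= size
        else:
            raise MemoryError("no tier can hold this block")
    return placement

print(place_blocks([4, 4, 4], tlr_free=4, gsm_free=4, gmb_free=8))
# ['TLR', 'GSM', 'GMB']
```

With the first tier full, later blocks fall through to the next tier, which is how the TLR capacity is "logically enlarged" in use.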
For step S130, the computing unit invoking the plurality of data blocks to execute the operator according to the storage locations of the plurality of data blocks includes: performing a partial calculation of the operator based on the first data block stored in the register unit to obtain temporary data; storing the temporary data in the register unit, the first cache unit, or the second cache unit based on a processing order of the temporary data and the capacity of the storage unit; and reading the second data block stored in the first cache unit, or the third data block stored in the second cache unit, into the register unit for use by the computing unit to execute the operator.
For example, referring to the descriptions of fig. 2 to fig. 4, the temporary data is obtained by performing calculation on the data to be used in the operator algorithm (for example, the first data block), which is stored directly in the TLR. Whether the temporary data is then kept in the TLR or stored in the GSM or the GMB can be determined according to the time at which the temporary data will be used and the capacities of the TLR, the GSM, and the GMB (i.e., the storage units). When the temporary data is needed later, the stored data is read back from the GSM or the GMB into the TLR for use by the calculation unit to obtain the calculation result, so that the utilization rate of the high-speed storage space can be improved.
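As a hedged sketch of this flow (compute on the register-resident block first, then page the spilled blocks back into registers for the remaining computation), with all names hypothetical:

```python
# Hypothetical sketch of step S130: the first block is already resident in
# the register tier; spilled blocks are "read back" one at a time and folded
# into the running temporary result by the operator's combine step.

def run_operator(first_block, spilled_blocks, combine):
    """Apply `combine` to the register-resident block, then to each block
    paged back from the slower tiers, carrying the temporary result along."""
    temp = combine(first_block)            # partial result from register data
    for blk in spilled_blocks:             # read from "GSM"/"GMB" back to "TLR"
        temp = combine(blk, temp)          # continue the operator with it
    return temp

total = run_operator([1, 2, 3, 4], [[5, 6], [7, 8]],
                     lambda blk, acc=0: acc + sum(blk))
print(total)  # 36
```

The combine callback stands in for whatever partial calculation the operator performs; only the ordering (register-resident data first, spilled data on demand) reflects the text.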
It should be noted that, in the embodiments of the present disclosure, the flow of the data processing method provided in the foregoing embodiments may include more or fewer operations, and these operations may be performed sequentially or in parallel. Although the flow described above includes a plurality of operations occurring in a particular order, it should be clearly understood that the order of these operations is not limited. The data processing method described above may be performed once, or a plurality of times according to a predetermined condition.
According to the data processing method provided by the present disclosure, the capacity of TLRs temporarily available during the calculation of a GPGPU operator algorithm can be increased at a small data-transmission overhead, and the situation in which data overflows to the HBM due to insufficient TLRs is reduced, thereby improving the overall performance of the operator; meanwhile, the utilization rate of the high-speed storage space is improved, failure of the operator algorithm is avoided, and the flexibility of data processing is improved, so that the efficiency of data processing is greatly improved.
Fig. 5 is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure. For example, in the example shown in fig. 5, the data processing apparatus 100 includes a partitioning unit 110, a determining unit 120, and a computing unit 130. For example, these units may be implemented by hardware (e.g., circuit) modules or software modules; the same applies to the following embodiments, which will not be described in detail again. For example, these units may be implemented by a Central Processing Unit (CPU), a General-Purpose Graphics Processor (GPGPU), a graphics processor (GPU), a Tensor Processor (TPU), a Field-Programmable Gate Array (FPGA), or another form of processing unit having data processing and/or instruction execution capabilities, together with corresponding computer instructions. For example, in the embodiments of the present disclosure, these units may be implemented by the General-Purpose Graphics Processor (GPGPU) of the above embodiments.
The partitioning unit 110 is configured to perform partitioning processing on the data used in the operator execution process to obtain a plurality of data blocks. For example, in some examples, the partitioning unit 110 is further configured to perform block processing on the data according to the size of each storage unit and the processing order of the data used in the operator execution process.
For example, the storage unit includes a register unit, a first cache unit, and a second cache unit, and the partitioning unit 110 is further configured to: divide the data into a first data block, a second data block, and a third data block when the data amount of the data is larger than the storage capacities of the register unit and the first cache unit; and divide the data into a first data block and a second data block when the data amount of the data is larger than the storage capacity of the register unit and smaller than the storage capacity of the first cache unit.
For example, the first data block is stored in the register unit, the second data block is stored in the first cache unit, and the third data block is stored in the second cache unit.
For example, the processing order of the first data block is earlier than the processing order of the second data block, the processing order of the second data block is earlier than the processing order of the third data block, the distance of the register unit from the calculation unit is smaller than the distance of the first cache unit from the calculation unit, and the distance of the first cache unit from the calculation unit is smaller than the distance of the second cache unit from the calculation unit.
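The capacity-based splitting rule described above can be sketched as follows; `reg_cap` and `l1_cap` stand in for the register-unit and first-cache-unit capacities, and the whole function is an illustrative interpretation rather than the disclosure's implementation:

```python
# Hedged sketch of the blocking rule in the text: data is split into up to
# three blocks according to how the register and first-cache capacities
# bound it. Sizes are abstract element counts.

def split_data(data, reg_cap, l1_cap):
    """Return (first, second, third) data blocks per the capacity thresholds."""
    n = len(data)
    if n > reg_cap + l1_cap:
        # exceeds register + first cache: three blocks, tail goes to tier 3
        return (data[:reg_cap],
                data[reg_cap:reg_cap + l1_cap],
                data[reg_cap + l1_cap:])
    if n > reg_cap:
        # overflow beyond the registers fits in the first cache: two blocks
        return data[:reg_cap], data[reg_cap:], []
    return data, [], []  # fits entirely in the register unit
```

For instance, with `reg_cap=4` and `l1_cap=4`, ten elements split 4/4/2 across the three tiers, matching the first/second/third data-block mapping in the text.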
For example, the partitioning unit 110 may implement step S110; for a specific implementation, reference may be made to the description of step S110, which will not be repeated here.
The determining unit 120 is configured to determine the storage units of the plurality of data blocks according to the processing order of the plurality of data blocks. For example, the determining unit 120 may implement step S120; for a specific implementation, reference may be made to the description of step S120, which will not be repeated here.
The computing unit 130 is configured to invoke the plurality of data blocks to execute the operator according to the storage positions of the plurality of data blocks. For example, the computing unit 130 may implement step S130; for a specific implementation, reference may be made to the description of step S130, which will not be repeated here.
Fig. 6 is a schematic block diagram of another data processing apparatus provided in at least one embodiment of the present disclosure. For example, as shown in FIG. 6, the data processing apparatus 200 includes a processor 210, a memory 220, and one or more computer program modules 221.
For example, processor 210 is connected to memory 220 through bus system 230. For example, one or more computer program modules 221 are stored in the memory 220. For example, one or more computer program modules 221 include instructions for performing the data processing methods provided by any of the embodiments of the present disclosure. For example, instructions in one or more computer program modules 221 may be executed by processor 210. For example, bus system 230 may be a conventional serial, parallel communication bus, or the like, as embodiments of the present disclosure are not limited in this regard.
For example, the processor 210 may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a graphics processor (GPU), a General-Purpose Graphics Processor (GPGPU), or another form of processing unit having data processing capabilities and/or instruction execution capabilities; it may be a general-purpose processor or a special-purpose processor, and may control other components in the data processing apparatus 200 to perform desired functions. The embodiments of the present disclosure are described with a General-Purpose Graphics Processor (GPGPU) as an example.
The memory 220 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may execute the program instructions to realize the functions of the embodiments of the present disclosure and/or other desired functions, such as the data processing method. The computer-readable storage medium may also store various applications and various data, such as the first, second, and third data blocks, and various data used and/or generated by the applications.
It should be noted that, for clarity and brevity, not all of the constituent elements of the data processing device 200 are provided in the embodiments of the present disclosure. To achieve the necessary functions of the data processing apparatus 200, other constituent elements not shown may be provided and set by those skilled in the art according to specific needs, and the embodiments of the present disclosure are not limited thereto.
The data processing method or apparatus according to the embodiments of the present disclosure may also be implemented by means of the architecture of an exemplary electronic device 3000 as shown in fig. 7. As shown in fig. 7, the electronic device 3000 may include a bus 3010, one or more central processors (Central Processing Unit, CPU) or Graphics Processors (GPU) or GPGPU 3020, read-only memory (ROM) 3030, random Access Memory (RAM) 3040, communication ports 3050 connected to a network, input/output components 3060, hard disk 3070, and the like. A storage device in electronic device 3000, such as ROM 3030, hard disk 3070, or RAM memory (i.e., video memory) internal to the GPGPU itself, may store various data or files for use in processing and/or communication of the methods provided by the present disclosure, as well as program instructions for execution by the CPU or GPU or GPGPU. The electronic device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 7 is merely exemplary, and one or more components of the electronic device shown in fig. 7 may be omitted as may be desired in implementing different devices.
At least one embodiment of the present disclosure also provides a storage medium. Fig. 8 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure. For example, as shown in fig. 8, the storage medium 400 non-transitory stores computer readable instructions 401, which when executed by a computer (including a processor) may perform a data processing method provided by any of the embodiments of the present disclosure.
For example, the storage medium may be any combination of one or more computer-readable storage media. For example, one computer-readable storage medium may contain computer-readable program code for performing block processing on the data used in an operator execution process to obtain a plurality of data blocks, another may contain computer-readable program code for determining the storage units of the plurality of data blocks according to their processing order, and yet another may contain computer-readable program code for the computing unit to invoke the plurality of data blocks to execute the operator according to their storage locations. For example, when the program code is read by a computer, the computer may execute the program code stored in the computer storage medium to perform, for example, the data processing method provided by any of the embodiments of the present disclosure.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a portable Compact Disc Read-Only Memory (CD-ROM), flash memory, any combination of the foregoing, or other suitable storage media.
The following points are further noted with respect to the above disclosure:
(1) The drawings of the embodiments of the present disclosure relate only to the structures to which the embodiments of the present disclosure relate, and reference may be made to the general design for other structures.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing describes merely specific embodiments of the present disclosure, but the scope of protection of the present disclosure is not limited thereto; the scope of the disclosure shall be determined by the claims.

Claims (16)

1. A data processing method, comprising:
partitioning data used in the operator executing process to obtain a plurality of data blocks;
determining storage units of the plurality of data blocks according to a processing order of the plurality of data blocks; and
the computing unit calling the plurality of data blocks to execute the operator according to the storage positions of the plurality of data blocks.
2. The data processing method according to claim 1, wherein the partitioning of the data used in the execution of the operators to obtain the plurality of data blocks includes:
carrying out block processing on the data according to the sizes of the storage units corresponding to the data blocks and the processing order of the data used in the operator execution process.
3. The data processing method according to claim 2, wherein the storage unit includes a register unit, a first cache unit, and a second cache unit,
wherein partitioning the data according to the sizes of the storage units corresponding to the data blocks and the processing order of the data used in the operator execution process comprises:
dividing the data into a first data block, a second data block, and a third data block when the data amount of the data is larger than the storage capacities of the register unit and the first cache unit; and
dividing the data into a first data block and a second data block when the data amount of the data is larger than the storage capacity of the register unit and smaller than the storage capacity of the first cache unit.
4. The data processing method according to claim 3, wherein the first data block is stored in the register unit, the second data block is stored in the first cache unit, and the third data block is stored in the second cache unit.
5. The data processing method according to claim 3, wherein the processing order of the first data block is earlier than the processing order of the second data block, the processing order of the second data block is earlier than the processing order of the third data block, the distance from the register unit to the calculation unit is smaller than the distance from the first cache unit to the calculation unit, and the distance from the first cache unit to the calculation unit is smaller than the distance from the second cache unit to the calculation unit.
6. A data processing method according to claim 3, wherein when the operator is a single operator, the data processing method further comprises:
dividing the single operator into a plurality of sub-portions, and performing block processing on the data used by each sub-portion to obtain a plurality of data blocks;
determining the storage units of the plurality of data blocks according to the processing order of the data of each sub-portion of the single operator and the capacities of the storage units; and
the computing unit calling the plurality of data blocks to execute the sub-portions of the single operator according to the storage positions of the plurality of data blocks.
7. The data processing method according to claim 6, wherein the plurality of sub-portions includes N sub-portions, a processing order of an nth sub-portion being earlier than a processing order of an (n+1)th sub-portion,
wherein the computing unit calling the plurality of data blocks to execute each sub-portion of the single operator according to the storage positions of the plurality of data blocks comprises:
the calculation unit performs calculation of the nth sub-portion using the first data block in the register unit and obtains a calculation result;
storing the calculation result in the first cache unit as the second data block, or in the second cache unit as the third data block, according to the use time of the calculation result of the nth sub-portion; and
reading the second data block from the first cache unit, or the third data block from the second cache unit, into the register unit according to the execution order of the single operator, so that the calculation unit can execute the calculation of the (n+1)th sub-portion;
wherein N is an integer greater than 1, N is an integer greater than 0 and less than or equal to N.
8. The data processing method according to claim 1, wherein when the operator is a fusion operator, the data processing method further comprises:
splitting the fusion operator into a plurality of single operators, and performing block processing on the data used by each single operator to obtain a plurality of data blocks;
determining the storage units of the plurality of data blocks based on the execution order of each single operator, the processing order of the data of each single operator, and the capacities of the storage units; and
the computing unit calling the plurality of data blocks to execute the fusion operator according to the storage positions of the plurality of data blocks.
9. A data processing method according to claim 3, wherein the computing unit invoking the plurality of data blocks to execute the operator according to the storage locations of the plurality of data blocks comprises:
performing a partial calculation of the operator based on the first data block stored in the register unit to obtain temporary data;
storing the temporary data in the register unit, the first cache unit, or the second cache unit based on a processing order of the temporary data and the capacity of the storage unit; and
reading the second data block stored in the first cache unit, or the third data block stored in the second cache unit, into the register unit for use by the computing unit to execute the operator.
10. A data processing apparatus comprising:
a partitioning unit configured to perform partitioning processing on data used in an operator execution process to obtain a plurality of data blocks;
a determining unit configured to determine storage units of the plurality of data blocks according to a processing order of the plurality of data blocks; and
a computing unit configured to call the plurality of data blocks to execute the operator according to storage positions of the plurality of data blocks.
11. The data processing apparatus according to claim 10, wherein the partitioning unit is further configured to:
perform block processing on the data according to the sizes of the storage units corresponding to the data blocks and the processing order of the data used in the operator execution process.
12. The data processing apparatus according to claim 11, wherein the storage unit includes a register unit, a first cache unit, and a second cache unit,
and the partitioning unit is further configured to:
divide the data into a first data block, a second data block, and a third data block when the data amount of the data is larger than the storage capacities of the register unit and the first cache unit; and
divide the data into a first data block and a second data block when the data amount of the data is larger than the storage capacity of the register unit and smaller than the storage capacity of the first cache unit.
13. The data processing apparatus according to claim 12, wherein the first data block is stored in the register unit, the second data block is stored in the first cache unit, and the third data block is stored in the second cache unit.
14. The data processing apparatus according to claim 12, wherein a processing order of the first data block is earlier than a processing order of the second data block, the processing order of the second data block is earlier than a processing order of the third data block, a distance of the register unit from the calculation unit is smaller than a distance of the first cache unit from the calculation unit, and a distance of the first cache unit from the calculation unit is smaller than a distance of the second cache unit from the calculation unit.
15. A data processing apparatus comprising:
a processor;
a memory; and
one or more computer program modules, wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, the one or more computer program modules comprising instructions for performing the data processing method according to any one of claims 1-9.
16. A storage medium storing non-transitory computer readable instructions which, when executed by a computer, perform the data processing method according to any one of claims 1-9.
CN202410296982.7A 2024-03-15 2024-03-15 Data processing method, device and storage medium Pending CN118012628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410296982.7A CN118012628A (en) 2024-03-15 2024-03-15 Data processing method, device and storage medium

Publications (1)

Publication Number Publication Date
CN118012628A true CN118012628A (en) 2024-05-10

Family

ID=90948524

Country Status (1)

Country Link
CN (1) CN118012628A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587922A (en) * 2021-07-06 2023-01-10 华为技术有限公司 Tensor blocking method and device and storage medium
US20230069890A1 (en) * 2021-09-03 2023-03-09 Advanced Micro Devices, Inc. Processing device and method of sharing storage between cache memory, local data storage and register files
CN116451174A (en) * 2023-04-17 2023-07-18 昆仑芯(北京)科技有限公司 Task execution device, method, electronic device, and storage medium
CN116868202A (en) * 2021-02-23 2023-10-10 华为技术有限公司 Data processing method, device, equipment and medium
CN116881618A (en) * 2023-08-25 2023-10-13 之江实验室 General matrix multiplication calculation optimization method, device and processor
WO2024000464A1 (en) * 2022-06-30 2024-01-04 华为技术有限公司 Blocking policy generation method and apparatus for tensor computation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Zhong; TIAN Xi: "Vectorization Method of Matrix Multiplication for Multi-core Vector Processors", Chinese Journal of Computers, no. 10, 30 June 2017 (2017-06-30) *
ZHANG Jun; XIE Jingcheng; SHEN Fanfan; TAN Hai; WANG Lümeng; HE Yanxiang: "Survey of Performance Optimization Methods for the Cache Subsystem of General-Purpose Graphics Processors", Journal of Computer Research and Development, no. 06, 7 June 2020 (2020-06-07) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination