CN118170714A - Method, computing device, medium and program product for accelerating computation

Info

Publication number
CN118170714A
Authority
CN
China
Prior art keywords
chip
data
read
operation unit
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410585569.2A
Other languages
Chinese (zh)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority claimed from CN202410585569.2A
Publication of CN118170714A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method, computing device, medium and program product for accelerating computation. The method comprises the following steps: configuring a cache region on the level-two cache of the current chip where a group of operation units resides, the cache region being used to back up read-only data fetched across chips; in response to confirming that an operation unit in the group has an access request, confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request; and in response to confirming that it does not, causing the operation unit to access the level-two cache of another chip to fetch the to-be-accessed data associated with the access request. The invention can significantly reduce the consumption of network-on-chip bandwidth.

Description

Method, computing device, medium and program product for accelerating computation
Technical Field
Embodiments of the present invention relate generally to the field of artificial intelligence and, more particularly, to a method, computing device, computer-readable storage medium, and computer program product for accelerating computation.
Background
An AI processor performs parallel operations with a large number of execution units (EUs). The data required for these operations is stored in a level-two cache (e.g., an L2 cache), which the EUs access through the processor's network on chip (NOC). For a processor comprising multiple chips, the NOC bandwidth, and in particular the die-to-die (D2D) bandwidth, often becomes the bottleneck of the overall process.
In a conventional approach, when an EU has an access request for data that resides in the L2 cache of another chip, the data must be read across chips, consuming inter-chip bandwidth. In particular, in the general matrix multiplication (GEMM) operation most commonly used in deep learning, the same data may be read by multiple EUs at nearly the same time, or by a single EU at different times, multiplying the consumption of network-on-chip bandwidth.
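To make this bandwidth amplification concrete, the following minimal sketch counts cross-chip reads for a small blocked GEMM; the 2x2 tile grid and the EU-to-tile assignment are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical example: 4 EUs each compute one output tile of C = A x B.
# Operand tiles live in the L2 cache of another chip, so every fetch that
# cannot be served locally consumes die-to-die (D2D) bandwidth.

assignments = {
    "EU0": ("A0", "B0"), "EU1": ("A0", "B1"),  # same row of C shares A0
    "EU2": ("A1", "B0"), "EU3": ("A1", "B1"),  # same column shares B0/B1
}

# Without any local backup, every EU fetches both operands across chips.
remote_reads_no_backup = sum(len(tiles) for tiles in assignments.values())

# If cross-chip read-only data were backed up locally after its first use,
# each distinct tile would cross the D2D link only once.
distinct_tiles = {tile for tiles in assignments.values() for tile in tiles}
remote_reads_with_backup = len(distinct_tiles)

print(remote_reads_no_backup)    # 8
print(remote_reads_with_backup)  # 4
```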
In summary, conventional methods for accelerating computation have the following disadvantage: it is difficult to effectively reduce the consumption of network-on-chip bandwidth.
Disclosure of Invention
The present invention provides a method, computing device, computer-readable storage medium and computer program product for accelerating computation that can significantly reduce the consumption of network-on-chip bandwidth.
According to a first aspect of the present invention, a method for accelerating computation is provided. The method comprises the following steps: configuring a cache region on the level-two cache of the current chip where a group of operation units resides, the cache region being used to back up read-only data fetched across chips; in response to confirming that an operation unit in the group has an access request, confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request; and in response to confirming that the read-only data backed up in the cache region does not include the to-be-accessed data associated with the access request, causing the operation unit to access the level-two cache of another chip to fetch that data.
In some embodiments, the method for accelerating computation further comprises: in response to confirming that the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request, causing the operation unit to access the cache region of the current chip to fetch that data.
In some embodiments, each of the current chip and the other chips includes a plurality of chip partitions, each of the plurality of chip partitions including at least: a group of operation units, and a level-two cache associated with that group.
In some embodiments, confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request comprises: obtaining the original address indicated in the access request; and querying the cache region of the chip partition on the current chip that is associated with the original address, so as to confirm whether the read-only data backed up there includes the to-be-accessed data associated with the access request.
In some embodiments, confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request further comprises: in response to confirming that the read-only data backed up in that cache region does not include the to-be-accessed data, confirming whether the data backed up in the cache regions of the other chip partitions of the current chip includes it.
In some embodiments, the current chip includes a first chip partition and a second chip partition, the other chips likewise include a first chip partition and a second chip partition, and backing up read-only data fetched across chips comprises: backing up read-only data fetched from the level-two cache of the first chip partition of another chip to the cache region configured on the level-two cache of the first chip partition of the current chip; and backing up read-only data fetched from the level-two cache of the second chip partition of another chip to the cache region configured on the level-two cache of the second chip partition of the current chip.
In some embodiments, the method for accelerating computation further comprises: confirming whether the operator associated with the current computation task of the operation unit has changed; and in response to confirming that it has changed, clearing the read-only data backed up in the cache region on the level-two cache of the current chip.
In some embodiments, the group of operation units includes at least a first operation unit, a second operation unit, and a third operation unit, and the method further comprises: multiplying, via the first operation unit, a first block of a first input matrix and a first block of a second input matrix to generate a first output block, the first block of the first input matrix being read-only data fetched across chips and backed up in the cache region configured on the level-two cache of the chip where the first operation unit resides; fetching, via the second operation unit, the first block of the first input matrix backed up in that cache region; multiplying, via the second operation unit, the first block of the first input matrix and a second block of the second input matrix to generate a second output block, the second output block being adjacent to the first output block in a first direction; fetching, via the third operation unit, the first block of the second input matrix backed up in that cache region; and multiplying, via the third operation unit, a second block of the first input matrix and the first block of the second input matrix to generate a third output block, the third output block being adjacent to the first output block in a second direction.
According to a second aspect of the present invention, there is also provided a computing device. The computing device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the computing device to perform the method of the first aspect of the invention.
According to a third aspect of the present invention, there is also provided a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a machine, performs the method of the first aspect of the invention.
According to a fourth aspect of the present invention there is also provided a computer program product comprising a computer program which when executed by a machine performs the method of the first aspect of the present invention.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
FIG. 1 schematically illustrates a computing device implementing a method for accelerating computation according to an embodiment of the invention.
FIG. 2 illustrates a flow chart of a method for accelerating a calculation according to an embodiment of the invention.
FIG. 3 illustrates a schematic diagram of a computing device for performing acceleration calculations, according to some embodiments of the invention.
FIG. 4 shows a flow chart of a method for confirming whether a cache region includes to-be-accessed data associated with an access request according to an embodiment of the invention.
FIG. 5 illustrates a schematic diagram of a computing device for performing accelerated computation according to further embodiments of the invention.
FIG. 6 shows a schematic diagram of a method for performing matrix operations according to an embodiment of the invention.
FIG. 7 shows a flow chart of a method for performing accelerated computation for matrix multiplication according to an embodiment of the invention.
Like or corresponding reference characters indicate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are illustrated in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object.
As described above, conventional methods for accelerating computation have the following disadvantage: it is difficult to effectively reduce the consumption of network-on-chip bandwidth.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a method for accelerating computation. In the method, a cache region configured on the level-two cache of the current chip (the "local end") backs up read-only data that an operation unit has fetched across chips (from a "remote end"). When an operation unit in the group is confirmed to have an access request, it is first confirmed whether the read-only data backed up in the cache region includes the to-be-accessed data; only when the data is absent from the cache region of the local level-two cache does the operation unit access the level-two cache of another chip to fetch it. Thus, once read-only data fetched across chips by an EU has been backed up, subsequent reads of the same data, whether by multiple EUs at nearly the same time or by a single EU at different times, are preferentially served from the local level-two cache region instead of being fetched across chips over the network bandwidth.
FIG. 1 schematically illustrates a computing device 100 implementing a method for accelerating computation according to an embodiment of the invention. As shown in FIG. 1, the computing device 100 may have one or more processing units, including special-purpose processing units such as a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or a general-purpose graphics processing unit (GPGPU), as well as general-purpose processing units such as a CPU. The computing device 100 further includes at least: a cross-chip read-only data backup module 102, a to-be-accessed data confirmation module 104, and an other-chip level-two cache access module 106.
The cross-chip read-only data backup module 102 is configured to configure a cache region on the level-two cache of the current chip where the group of operation units resides, so as to back up read-only data fetched across chips.
The to-be-accessed data confirmation module 104 is configured to confirm whether an operation unit in the group has an access request and, in response to confirming that one does, to confirm whether the read-only data backed up in the cache region includes the to-be-accessed data associated with that request.
The other-chip level-two cache access module 106 is configured to, in response to confirming that the read-only data backed up in the cache region does not include the to-be-accessed data associated with the access request, cause the operation unit to access the level-two cache of another chip to fetch that data.
A method 200 for accelerating computation according to an embodiment of the present invention will be described below in conjunction with FIGS. 2 and 3. FIG. 2 illustrates a flow chart of the method 200 according to an embodiment of the invention. FIG. 3 illustrates a schematic diagram of a computing device 300 for performing accelerated computation according to some embodiments of the invention. It should be appreciated that the method 200 may be performed, for example, at the computing device 100 depicted in FIG. 1. The method 200 may also include additional acts not shown and/or may omit acts that are shown; the scope of the present invention is not limited in this respect.
At step 202, the computing device 100 configures a cache region on the level-two cache of the current chip where the group of operation units resides, so as to back up read-only data fetched across chips.
It should be understood that read-only data fetched across chips is backed up locally and that the backed-up data is never overwritten by EU writes; the invention can therefore extend the validity period of the backup to the maximum extent while avoiding data-inconsistency problems.
In some embodiments, the computing device includes a plurality of chips, for example the current chip and one or more other chips. The current chip where the group of operation units resides is referred to as the "local end", and the other chips are referred to as "remote ends" relative to it. As shown in FIG. 3, the first chip 310 is, for example, the current chip, and the second chip 330 is another chip. Each of the first chip 310 and the second chip 330 includes, for example, one or more groups of operation units and one or more level-two caches associated with those groups. Taking the first chip 310 as an example, FIG. 3 shows a first operation unit group 310 and a second operation unit group 314 included in the first chip 310. Each group includes one or more operation units (i.e., EUs): the first operation unit group 310 includes a plurality of operation units (e.g., indicated by reference numeral 312), and the second operation unit group 314 includes a plurality of operation units (e.g., indicated by reference numeral 316).
In some embodiments, each of the current chip and the other chips includes a plurality of chip partitions, for example a first chip partition and a second chip partition. Each chip partition includes at least one or more groups of operation units and the associated level-two cache, and the level-two cache is configured with a cache region.
Each level-two cache is associated with a group of operation units and is configured with a cache region. As shown in FIG. 3, a first level-two cache 320 is associated with the first operation unit group 310, and a second level-two cache 324 is associated with the second operation unit group 314; the groups and the level-two caches are interconnected, for example, through the NOC 318. The first level-two cache 320 is configured with a first cache region 322, and the second level-two cache 324 with a second cache region 326. The first cache region 322 and the second cache region 326 are used to back up read-only data fetched across chips by any one or more operation units in the associated group.
It should be appreciated that different operation units may access the level-two cache. In some embodiments, the proportion of the level-two cache configured as the cache region may be adjusted dynamically: for example, the computing device 100 configures the cache region's share based on the reuse rate of the backed-up data and the size of the data required by the computation task, e.g., without limitation, to 1/4, 1/8, or 1/2.
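A minimal sketch of such a sizing policy follows; the thresholds and the function itself are illustrative assumptions, not prescribed by the invention.

```python
# Hypothetical policy for choosing what fraction of the level-two cache to
# reserve as the backup cache region. The thresholds are illustrative only.
def choose_backup_fraction(reuse_rate: float, working_set_bytes: int,
                           l2_bytes: int) -> float:
    """Return the share of L2 capacity to devote to cross-chip backups."""
    if reuse_rate < 0.1:                   # backed-up data is rarely re-read
        return 1 / 8
    if working_set_bytes > l2_bytes // 2:  # large task needs most of the L2
        return 1 / 4
    return 1 / 2                           # high reuse, modest working set

# Example: high reuse and a small working set favor a large backup region.
print(choose_backup_fraction(reuse_rate=0.6, working_set_bytes=2**20,
                             l2_bytes=8 * 2**20))  # 0.5
```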
At step 204, the computing device 100 confirms whether an operation unit in the group has an access request. If the computing device 100 confirms that no operation unit in the group has an access request, it waits at step 204.
The access request is, for example, an access request issued for a computation task. In some embodiments, the access request indicates the original address of the to-be-accessed data. In some embodiments, the original address points to the level-two cache of one chip partition of another chip; in other embodiments, it points to the level-two cache of another chip.
At step 206, if the computing device 100 confirms that an operation unit in the group has an access request, it confirms whether the read-only data backed up in the cache region includes the to-be-accessed data associated with that request.
The operation unit that has the access request may be the same operation unit that fetched the read-only data across chips in step 202, or a different one. In some embodiments, the two operation units belong to the same operation unit group; in other embodiments, they belong to different operation unit groups on the same chip partition; in still other embodiments, they belong to different chip partitions of the same chip.
At step 208, if the computing device 100 confirms that the read-only data backed up in the cache region does not include the to-be-accessed data associated with the access request, it causes the operation unit to access the level-two cache of another chip to fetch that data.
In some embodiments, causing the operation unit to access the level-two cache of another chip comprises: if the read-only data backed up in the cache region does not include the to-be-accessed data associated with the access request, confirming whether the data backed up in the cache regions of the other chip partitions of the current chip includes it; and only if it does not, causing the operation unit to access the level-two cache of the other chip. With this measure, the remote level-two cache is accessed only after it is confirmed that none of the cache regions of the local chip partitions holds a backup of the to-be-accessed data.
At step 210, if the computing device 100 confirms that the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request, it causes the operation unit to access the cache region of the current chip to fetch that data.
In the above scheme, a cache region configured on the level-two cache of the current chip (the "local end") backs up the read-only data that operation units fetch across chips (from "remote ends"). When an operation unit in the group is confirmed to have an access request, it is first confirmed whether the backed-up data includes the to-be-accessed data associated with the request; only when the local level-two cache does not hold it does the operation unit access the level-two cache of another chip. Thus, once read-only data fetched across chips by an EU has been backed up, repeated reads of the same data, whether by multiple EUs at nearly the same time or by a single EU at different times, are preferentially served from the local level-two cache instead of being fetched across chips over the network. The invention can therefore significantly reduce the consumption of network-on-chip bandwidth.
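The read path of steps 204 to 210 can be summarized in the following minimal sketch. The class layout, the dictionary-based caches, and the backup-on-miss step at the end are assumptions made for illustration; the invention does not prescribe a particular data structure.

```python
# Illustrative model of the read path in method 200. Dictionaries stand in
# for hardware caches; each remote L2 access represents D2D traffic.

class ChipPartition:
    def __init__(self):
        self.l2 = {}             # level-two cache: address -> data
        self.backup_region = {}  # cache region backing up cross-chip data

class Chip:
    def __init__(self, n_partitions=2):
        self.partitions = [ChipPartition() for _ in range(n_partitions)]

def read(local, remote, addr, home):
    """Serve a read, preferring local backups over cross-chip access."""
    # Step 206: query the cache region of the associated local partition.
    if addr in local.partitions[home].backup_region:
        return local.partitions[home].backup_region[addr]
    # Then query the cache regions of the other local partitions.
    for i, part in enumerate(local.partitions):
        if i != home and addr in part.backup_region:
            return part.backup_region[addr]
    # Step 208: miss everywhere locally; fetch from the remote L2 and
    # back the read-only data up locally for later reuse.
    data = remote.partitions[home].l2[addr]
    local.partitions[home].backup_region[addr] = data
    return data

local, remote = Chip(), Chip()
remote.partitions[1].l2[0x100] = b"weights"
read(local, remote, 0x100, home=1)             # first read crosses D2D
assert 0x100 in local.partitions[1].backup_region
read(local, remote, 0x100, home=1)             # later reads served locally
```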
A method 400 for confirming whether a cache region includes to-be-accessed data associated with an access request according to an embodiment of the present invention will be described below in conjunction with FIGS. 4 and 5. FIG. 4 shows a flow chart of the method 400 according to an embodiment of the invention. FIG. 5 illustrates a schematic diagram of a computing device 500 for performing accelerated computation according to further embodiments of the invention. It should be appreciated that the method 400 may be performed, for example, at the computing device 100 depicted in FIG. 1. The method 400 may also include additional acts not shown and/or may omit acts that are shown; the scope of the present invention is not limited in this respect.
At step 402, computing device 100 obtains an original address indicated in the access request.
In some embodiments, the original address indicated in the access request points to the level-two cache of another chip. In some embodiments, the current chip includes a first chip partition and a second chip partition, and the other chips likewise include a first chip partition and a second chip partition.
As shown in FIG. 5, the first chip 510 includes, for example, a first chip partition 512 and a second chip partition 514. Each chip partition includes, for example, a group of operation units and a level-two cache adjacent to that group. Two inter-chip links run between two adjacent chips (e.g., between the first chip 510 and the second chip 520): a first inter-chip link 530 configured between the first chip partition 512 of the first chip 510 and the first chip partition 522 of the second chip 520, and a second inter-chip link 532 configured between the second chip partition 514 of the first chip 510 and the second chip partition 524 of the second chip 520.
When backing up read-only data fetched across chips, the computing device 100 backs up data fetched from the level-two cache of the first chip partition of another chip (e.g., the first chip partition 522 of the second chip 520) to the cache region configured on the level-two cache of the first chip partition of the current chip (e.g., the first chip partition 512 of the first chip 510), and backs up data fetched from the level-two cache of the second chip partition of another chip (e.g., the second chip partition 524 of the second chip 520) to the cache region configured on the level-two cache of the second chip partition of the current chip (e.g., the second chip partition 514 of the first chip 510). The invention can thus conveniently back up read-only data read across chips, using the inter-chip link between corresponding chip partitions of different chips.
At step 404, the computing device 100 queries the cache region of the chip partition on the current chip that is associated with the original address, to confirm whether the read-only data backed up there includes the to-be-accessed data associated with the access request.
For example, if the original address indicated in the access request points to the level-two cache of the second chip partition 524 of the second chip 520, then, given the backup scheme described above, the computing device 100 can determine that read-only data from that cache is backed up in the cache region configured on the level-two cache of the second chip partition 514 of the first chip 510. The second chip partition 514 of the first chip 510 is the chip partition associated with the second chip partition 524 of the second chip 520 to which the original address points.
At step 406, if the computing device 100 confirms that the read-only data backed up in that cache region does not include the to-be-accessed data associated with the access request, it confirms whether the data backed up in the cache regions of the other chip partitions of the current chip includes it.
For example, if the computing device 100 determines that the read-only data backed up in the cache region configured on the level-two cache of the second chip partition 514 of the first chip 510 does not include the to-be-accessed data, it then confirms whether the cache region configured on the level-two cache of the first chip partition 512 of the first chip 510 includes it.
With this measure, the remote level-two cache is accessed only after it is confirmed that none of the cache regions of the local chip partitions holds a backup of the to-be-accessed data, which further reduces the consumption of inter-chip network bandwidth.
Moreover, because the cache region of the chip partition associated with the original address is queried first, the efficiency of locating the to-be-accessed data is significantly improved.
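A minimal sketch of this query order follows, assuming (purely for illustration) that the associated partition can be derived from high bits of the original address; the invention does not specify the address layout.

```python
# Illustrative: derive the associated local chip partition from the
# original address, query its cache region first (step 404), then fall
# back to the other partitions (step 406). The bit layout is assumed.

N_PARTITIONS = 2
PARTITION_SHIFT = 32  # hypothetical: partition id lives above bit 31

def home_partition(original_addr: int) -> int:
    return (original_addr >> PARTITION_SHIFT) % N_PARTITIONS

def lookup(backup_regions, original_addr):
    home = home_partition(original_addr)
    order = [home] + [p for p in range(N_PARTITIONS) if p != home]
    for p in order:              # associated partition is queried first
        if original_addr in backup_regions[p]:
            return backup_regions[p][original_addr]
    return None                  # local miss: caller must go cross-chip

regions = [{}, {(7 << 32) | 0x40: b"tile"}]   # backed up in partition 1
print(lookup(regions, (7 << 32) | 0x40))      # b'tile', found on first probe
```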
A method 700 for performing accelerated computation for matrix multiplication according to the present invention will be described below in conjunction with FIGS. 6 and 7. FIG. 6 shows a schematic diagram of a method 600 for performing matrix operations according to an embodiment of the invention. FIG. 7 shows a flow chart of the method 700 according to an embodiment of the invention. It should be appreciated that the method 700 may be performed, for example, at the computing device 100 depicted in FIG. 1. The method 700 may also include additional acts not shown and/or may omit acts that are shown; the scope of the present invention is not limited in this respect.
At step 702, the computing device 100 multiplies, via the first operation unit, the first block of the first input matrix and the first block of the second input matrix to generate the first output block, the first block of the first input matrix being read-only data fetched across chips and backed up in the cache region configured on the level-two cache of the chip where the first operation unit resides.
It should be understood that the operation unit group performing the matrix multiplication includes at least a plurality of operation units, for example: a first operation unit, a second operation unit, and a third operation unit.
Regarding the first input matrix, as shown in FIG. 6, reference numeral 610 indicates the first input matrix (e.g., the A matrix); the first input matrix 610 is split into a plurality of blocks, for example a first block A0, a second block A1, and so on.
Regarding the second input matrix, as shown in FIG. 6, reference numeral 620 indicates the second input matrix (e.g., the B matrix); the second input matrix 620 is split into a plurality of blocks, for example a first block B0, a second block B1, a third block B2, a fourth block B3, and so on.
For example, via the first operation unit EU0, the first block A0 of the first input matrix and the first block B0 of the second input matrix are multiplied to generate the first output block C0 (i.e., C0 = A0 × B0).
It should be appreciated that, in the field of deep learning, the most commonly used and time-consuming matrix operations typically employ a computation strategy that divides the output data into blocks: the EUs compute the split input blocks in parallel to generate a plurality of output blocks. The output data C is, for example, the result of multiplying the A matrix and the B matrix, i.e., C = A × B.
At step 704, the computing device 100 fetches, via the second operation unit, the first block of the first input matrix backed up in the cache region of the chip where the first operation unit resides. The second operation unit is, for example, an operation unit adjacent to the first operation unit EU0.
At step 706, the computing device 100 multiplies, via the second operation unit, the first block of the first input matrix and the second block of the second input matrix to generate a second output block, the second output block being adjacent to the first output block in the first direction.
For example, via the second operation unit EU1, the first block A0 of the first input matrix and the second block B1 of the second input matrix are multiplied to generate the second output block C1 (i.e., C1 = A0 × B1).
The first direction is, for example, the horizontal direction. As shown in FIG. 6, the first output block C0 and the second output block C1 are two adjacent output blocks in the same horizontal direction. When computing two output blocks in the same horizontal direction, for example when generating C0 and C1, the first operation unit EU0 and the second operation unit EU1 both read the same first block A0 of the first input matrix. In step 706, therefore, the block A0 used for the multiplication is obtained from the cache region of the level-two cache of the chip where the first operation unit resides, without being fetched across chips again.
At step 708, the computing device 100 fetches, via the third operation unit, the first block of the second input matrix backed up in the cache region of the chip where the first operation unit resides.
At step 710, the computing device 100 multiplies, via the third operation unit, the second block of the first input matrix and the first block of the second input matrix to generate a third output block, the third output block being adjacent to the first output block in the second direction.
The second direction is, for example, the vertical direction. As shown in FIG. 6, the first output block C0 and the third output block C2 are two adjacent output blocks in the same vertical direction. When computing two output blocks in the same vertical direction, for example when generating C0 and C2, the first operation unit EU0 and the third operation unit EU2 both read the same first block B0 of the second input matrix. In step 710, therefore, the block B0 used for the multiplication is obtained from the cache region of the level-two cache of the chip where the first operation unit resides, without being fetched across chips again.
It should be appreciated that adjacent EUs are controlled to process output blocks in the same horizontal or vertical direction, so that the local backups of remote level-two cache data made for some EUs can be read by the remaining EUs, thereby reducing the number of accesses to the remote level-two cache.
In some embodiments, for example, in a later round of the computation shown in FIG. 6, the (n+1)-th output block C(n), the (n+2)-th output block C(n+1), the (n+3)-th output block C(n+2), and the (n+4)-th output block C(n+3) are generated. The first block A0 of the first input matrix and the third block B2 of the second input matrix are multiplied via the first operation unit EU0 to generate C(n) (i.e., C(n) = A0 × B2), and the first block A0 of the first input matrix and the fourth block B3 of the second input matrix are multiplied via the second operation unit EU1 to generate C(n+1) (i.e., C(n+1) = A0 × B3). The same EU, for example the first operation unit EU0, may thus process multiple output blocks, reading the same first block A0 of the first input matrix both when generating the first output block C0 and when generating C(n). The blocking algorithm keeps such blocks in the same horizontal position or the same vertical position so that the local data backup can be read repeatedly.
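The assignment pattern described above can be sketched as follows; the explicit EU-to-block mapping mirrors FIG. 6 but is otherwise an illustrative assumption.

```python
# Illustrative trace of the schedule above: a block crosses the D2D link
# only on its first use; every later read, by the same EU or a neighbor,
# hits the local backup region.

assignments = [
    # (EU, A block, B block, output block)
    ("EU0", "A0", "B0", "C0"),     # first use of A0 and B0: cross-chip
    ("EU1", "A0", "B1", "C1"),     # same row as C0: A0 from local backup
    ("EU2", "A1", "B0", "C2"),     # same column as C0: B0 from local backup
    ("EU0", "A0", "B2", "C(n)"),   # same EU, later round: A0 reused again
    ("EU1", "A0", "B3", "C(n+1)"),
]

backed_up = set()
for eu, a, b, out in assignments:
    for block in (a, b):
        source = "local backup" if block in backed_up else "remote L2 (D2D)"
        print(f"{eu} reads {block} from {source} to compute {out}")
        backed_up.add(block)
```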
By these means, the invention ensures that, during matrix multiplication, local backups of remote level-two cache data can be read by the remaining EUs, thereby reducing the number of accesses to the remote level-two cache.
The various processes and treatments described above, such as the methods 200, 400 and 700, may be performed at a computing device. The computing device includes, for example: at least one processor (at least one graphics processor and at least one central processor); and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor. In some embodiments, the methods 200, 400 and 700 may be implemented as a computer software program or program product tangibly embodied in a machine-readable medium. In some embodiments, part or all of the computer program may be loaded and/or installed onto the computing device via read-only memory (ROM) and/or a communication unit. One or more of the acts of the methods 200, 400 and 700 described above may be performed when the computer program is loaded into random-access memory (RAM) and executed by the GPU and the CPU.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention. The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a central processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the central processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors.

Claims (11)

1. A method for accelerating computation, the method comprising:
configuring a cache region on the level-two cache of the current chip where a group of operation units resides, the cache region being used to back up read-only data fetched across chips;
in response to confirming that an operation unit in the group has an access request, confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request; and
in response to confirming that the read-only data backed up in the cache region does not include the to-be-accessed data associated with the access request, causing the operation unit to access the level-two cache of another chip to fetch that data.
2. The method as recited in claim 1, further comprising:
in response to confirming that the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request, causing the operation unit to access the cache region of the current chip to fetch that data.
3. The method of claim 1, wherein each of the current chip and the other chips includes a plurality of chip partitions, each of the plurality of chip partitions including at least: a group of operation units, and a level-two cache associated with that group.
4. The method of claim 3, wherein confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request comprises:
obtaining the original address indicated in the access request; and
querying the cache region of the chip partition on the current chip that is associated with the original address, so as to confirm whether the read-only data backed up there includes the to-be-accessed data associated with the access request.
5. The method of claim 4, wherein confirming whether the read-only data backed up in the cache region includes the to-be-accessed data associated with the access request further comprises:
in response to confirming that the read-only data backed up in that cache region does not include the to-be-accessed data, confirming whether the data backed up in the cache regions of the other chip partitions of the current chip includes it.
6. The method of claim 3, wherein the current chip includes a first chip partition and a second chip partition, the other chips likewise include a first chip partition and a second chip partition, and backing up read-only data fetched across chips comprises:
backing up read-only data fetched from the level-two cache of the first chip partition of another chip to the cache region configured on the level-two cache of the first chip partition of the current chip; and
backing up read-only data fetched from the level-two cache of the second chip partition of another chip to the cache region configured on the level-two cache of the second chip partition of the current chip.
7. The method as recited in claim 1, further comprising:
confirming whether the operator associated with the current computation task of the operation unit has changed; and
in response to confirming that it has changed, clearing the read-only data backed up in the cache region on the level-two cache of the current chip.
8. The method of claim 1, wherein the group of operation units includes at least a first operation unit, a second operation unit, and a third operation unit, the method further comprising:
multiplying, via the first operation unit, a first block of a first input matrix and a first block of a second input matrix to generate a first output block, the first block of the first input matrix being read-only data fetched across chips and backed up in the cache region configured on the level-two cache of the chip where the first operation unit resides;
fetching, via the second operation unit, the first block of the first input matrix backed up in the cache region of the chip where the first operation unit resides;
multiplying, via the second operation unit, the first block of the first input matrix and a second block of the second input matrix to generate a second output block, the second output block being adjacent to the first output block in a first direction;
fetching, via the third operation unit, the first block of the second input matrix backed up in the cache region of the chip where the first operation unit resides; and
multiplying, via the third operation unit, a second block of the first input matrix and the first block of the second input matrix to generate a third output block, the third output block being adjacent to the first output block in a second direction.
9. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, performs the method of any one of claims 1-8.
11. A computer program product, comprising a computer program which, when executed by a machine, performs the method of any one of claims 1-8.
Application CN202410585569.2A, filed 2024-05-13 (priority date 2024-05-13), published as CN118170714A (en), legal status Pending: Method, computing device, medium and program product for accelerating computation

Priority Applications (1)

Application Number: CN202410585569.2A
Publication: CN118170714A (en)
Priority Date: 2024-05-13
Filing Date: 2024-05-13
Title: Method, computing device, medium and program product for accelerating computation


Publications (1)

Publication Number: CN118170714A
Publication Date: 2024-06-11

Family

ID=91352923

Family Applications (1)

Application Number: CN202410585569.2A
Status: Pending
Publication: CN118170714A (en)
Priority Date: 2024-05-13
Filing Date: 2024-05-13
Title: Method, computing device, medium and program product for accelerating computation

Country Status (1)

Country: CN
Publication: CN118170714A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20020069341A1 * | 2000-08-21 | 2002-06-06 | Gerard Chauvel | Multilevel cache architecture and data transfer
CN101334759A * | 2007-06-28 | 2008-12-31 | International Business Machines Corporation | L2 cache/nest address translation
CN105095113A * | 2015-07-21 | 2015-11-25 | Inspur (Beijing) Electronic Information Industry Co., Ltd. | Cache management method and system
CN110119304A * | 2018-02-07 | 2019-08-13 | Huawei Technologies Co., Ltd. | Interrupt processing method, apparatus and server
WO2022222040A1 * | 2021-04-20 | 2022-10-27 | Huawei Technologies Co., Ltd. | Method for accessing cache of graphics processor, graphics processor, and electronic device
CN116185942A * | 2021-11-29 | 2023-05-30 | Cambricon (Kunshan) Information Technology Co., Ltd. | Data processing method, device, storage medium and electronic equipment
CN116955220A * | 2023-08-07 | 2023-10-27 | Hygon Information Technology Co., Ltd. | Cache access method and device and electronic equipment

Non-Patent Citations (1)

ALIREZA FARSHIN et al.: "Make the Most out of Last Level Cache in Intel Processors", EuroSys '19: Proceedings of the Fourteenth EuroSys Conference, 25 March 2019, pages 1-17, XP058428787, DOI: 10.1145/3302424.3303977 *


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination