CN117132450B - Computing device capable of realizing data sharing and graphic processor - Google Patents
- Publication number
- CN117132450B (application number CN202311376818.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- input parameter
- parameter
- unit
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Logic Circuits (AREA)
Abstract
The invention discloses a computing device and a graphics processor capable of realizing data sharing. The computing device comprises a shared memory, a constant register, a multiplexing unit and a plurality of data processing modules; each data processing module comprises a data unit, a data transmission unit and an operation module; the multiplexing unit is used for acquiring data from the data units of the plurality of data processing modules and respectively providing the acquired data to the data transmission units of the plurality of data processing modules according to the first control signal lm_mux; the data transmission unit of each data processing module is used for acquiring data from the data units in the same data processing module, acquiring constants from the constant registers, acquiring data from the multiplexing unit and acquiring data from the shared memory. The invention can obviously improve the operation efficiency through multi-level data sharing, has simple structure and easy realization, and is suitable for various complex operation scenes.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a computing device and a graphics processor capable of realizing data sharing.
Background
A graphics processor (Graphics Processing Unit, GPU) is a microprocessor dedicated to image- and graphics-related operations. It is an important component of the graphics system architecture and the link between a computer and its display terminal. Real-time graphics and video applications require graphics processors to have ever more powerful general-purpose computing capability. At present, data sharing between GPU threads is difficult to achieve, and multiple move (mov) instructions must be used to accomplish the corresponding function. Because of bank conflicts, a mov instruction generally needs multiple cycles to complete, so mov instructions become a bottleneck of the whole program and seriously limit the computing capability of the graphics processor.
Disclosure of Invention
In view of the defects of the prior art and the demand for improvement, the invention provides a computing device and a graphics processor capable of realizing data sharing. Through multi-level data sharing, the invention improves operation efficiency, is simple in structure and easy to implement, and is suitable for various complex operation scenarios.
To achieve the above object, according to one aspect of the present invention, there is provided a computing device including a shared memory, a constant register, a multiplexing unit, and a plurality of data processing modules; each data processing module comprises a data unit, a data transmission unit and an operation module; the multiplexing unit is used for acquiring data from the data units of the plurality of data processing modules and respectively providing the acquired data to the data transmission units of the plurality of data processing modules according to the first control signal lm_mux; the data transmission unit of each data processing module is used for acquiring data from the data unit in the same data processing module and giving a first parameter s0, acquiring constant from a constant register and giving a second parameter s1, acquiring data from a multiplexing unit and giving a third parameter s2, and acquiring data from a shared memory and giving a fourth parameter s3; the operation module of each data processing module is used for executing corresponding operation according to the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3.
In some embodiments, the data provided by the data units of the data processing modules are denoted us0 to us(M-1), where M is the number of data processing modules; correspondingly, the operation modules of the data processing modules are numbered m, m = 0, 1, …, M-1; the first control signal lm_mux takes the values 0, 1, …, M-1; the multiplexing unit is used for performing a logic operation on the value of the first control signal and the number m of the operation module of each data processing module, and for providing the data us0 to us(M-1) of the data units to the data transmission units of the respective data processing modules according to the result of the logic operation.
In some embodiments, for the operation module numbered m, the multiplexing unit is configured to provide the data us(Y) to the data transmission unit in the same data processing module as the operation module numbered m, where Y is the result of the logic operation on the value of the first control signal lm_mux and m.
In some embodiments, the operation module of each data processing module is denoted Gm, where m = 0, 1, …, M-1 and M is the number of data processing modules; each operation module Gm comprises N operation units, denoted Un, where n = 0, 1, …, N-1 and N is the number of operation units; the data transmission unit in the same data processing module as the operation module Gm comprises a data distribution unit; the data distribution unit is used for providing a first input parameter s0', a second input parameter s1', a third input parameter s2' and a fourth input parameter s3' for each operation unit according to the values of the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3 and the number of each operation unit; each operation unit is used for executing a corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3'.
In some embodiments, for the operation unit Un, the first input parameter s0'[n] = s0[n], the second input parameter s1'[n] = s1[n], the third input parameter s2'[n] = s2[n] and the fourth input parameter s3'[n] = s3[n], where s0[n] is the (n+1)-th value of the first parameter s0, s1[n] is the (n+1)-th value of the second parameter s1, s2[n] is the (n+1)-th value of the third parameter s2, and s3[n] is the (n+1)-th value of the fourth parameter s3.
In some embodiments, the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper and the number of each operation unit and, according to the result of the logic operation, to take from the first parameter s0 the value assigned to the first input parameter s0' of each operation unit, to take from the second parameter s1 the value assigned to the second input parameter s1' of each operation unit, to take from the third parameter s2 the value assigned to the third input parameter s2' of each operation unit, and to take from the fourth parameter s3 the value assigned to the fourth input parameter s3' of each operation unit.
In some embodiments, the second control signal shuf_oper includes shuf_oper0, shuf_oper1, shuf_oper2 and shuf_oper3; for the operation unit Un, the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper0 and the number n of the operation unit Un, and to assign a value taken from the first parameter s0 to the first input parameter s0'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper1 and the number n of the operation unit Un, and to assign a value taken from the second parameter s1 to the second input parameter s1'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper2 and the number n of the operation unit Un, and to assign a value taken from the third parameter s2 to the third input parameter s2'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper3 and the number n of the operation unit Un, and to assign a value taken from the fourth parameter s3 to the fourth input parameter s3'[n] of the operation unit Un according to the result of the logic operation.
In some embodiments, for the operation unit Un, the data distribution unit is configured to assign the data s0[Z0] to the first input parameter s0'[n] of the operation unit Un, the data s1[Z1] to the second input parameter s1'[n], the data s2[Z2] to the third input parameter s2'[n], and the data s3[Z3] to the fourth input parameter s3'[n], where Z0 is the result of the logic operation on n and the second control signal shuf_oper0, Z1 is the result of the logic operation on n and the second control signal shuf_oper1, Z2 is the result of the logic operation on n and the second control signal shuf_oper2, and Z3 is the result of the logic operation on n and the second control signal shuf_oper3.
In some embodiments, the data transmission unit in the same data processing module as the operation module Gm further includes a data exchange module; the data exchange module is used for exchanging the data of two of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' of each operation unit according to preset data exchange logic when preset data exchange conditions are met; each operation unit is used for executing a corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' after the data exchange is completed.
In some embodiments, for the operation unit Un, the data exchange module is configured to determine, according to the enable signal cha_able_s0's2', whether the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit Un can be exchanged; when they can be exchanged, the data exchange module is further configured to set a determination condition according to change_s0's2' and, when the determination condition is met, to exchange the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit; the data exchange module is further configured to determine, according to the enable signal cha_able_s1's3', whether the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit can be exchanged; when they can be exchanged, the data exchange module is further configured to set a determination condition according to change_s1's3' and, when the determination condition is met, to exchange the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit.
In some embodiments, each operation unit corresponds to a thread; for the operation unit Un numbered n in the operation module Gm numbered m, the corresponding thread number is T = n + m×N; a parameter S = ceil(log2(M×N)) is set, where ceil denotes rounding up to the nearest integer; exchange flags are set as flag1 = (T >> change_s0's2') & 1 and flag2 = (T >> change_s1's3') & 1; whether to exchange the data of the first input parameter and the third input parameter of the operation unit corresponding to the thread is determined according to the value of the exchange flag flag1; and whether to exchange the data of the second input parameter and the fourth input parameter of the operation unit corresponding to the thread is determined according to the value of the exchange flag flag2.
According to another aspect of the present invention, there is provided a graphics processor including the computing device described above.
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects: first-level data sharing is realized by adding the multiplexing unit; second-level data sharing is realized by improving the data transmission unit and adding a logic operation module to it; third-level data sharing is realized by adding the data exchange module after the logic operation module of the data transmission unit. The specific mode of data sharing can be controlled according to actual calculation requirements, so that complex operations are completed with far fewer mov instructions and the operation efficiency is remarkably improved. The invention is simple in structure and easy to implement, and is especially suitable for various complex operation scenarios.
Drawings
FIG. 1 is a schematic diagram of a computing device capable of data sharing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a structure of a data transmission unit to an operation module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a structure of a data transmission unit to an operation module according to another embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
As shown in fig. 1, the computing device capable of implementing data sharing according to an embodiment of the present invention includes four data processing modules, a shared memory, a constant register, a multiplexing unit, and a local memory. The data processing modules are respectively a data processing module 101, a data processing module 103, a data processing module 105 and a data processing module 107, and each data processing module comprises a data unit, a data transmission unit and an operation module. The data processing module 101 includes a data unit 109, a data transmission unit 111, and an operation module G0, the data processing module 103 includes a data unit 113, a data transmission unit 115, and an operation module G1, the data processing module 105 includes a data unit 117, a data transmission unit 119, and an operation module G2, and the data processing module 107 includes a data unit 121, a data transmission unit 123, and an operation module G3.
The data unit is used for providing data. Specifically, data unit 109 is used to provide data us0, data unit 113 is used to provide data us1, data unit 117 is used to provide data us2, and data unit 121 is used to provide data us3. When the computing device includes M data processing modules, the data units are used to provide data usm, m = 0, 1, …, M-1, where M = 2^k1 and k1 is a natural number, i.e., M is a power of 2. The constant register is used for providing constants; the multiplexing unit is used for acquiring the data us0 to us3 (generally us0 to us(M-1)) of the data units and distributing them to the data processing modules according to the input first control signal lm_mux. The shared memory is used to provide data for the individual data processing modules.
The data transmission unit is used for acquiring data of a data unit in the same data processing module and giving a first parameter s0, acquiring a constant provided by a constant register and giving a second parameter s1, acquiring data provided by the multiplexing unit and giving a third parameter s2, and acquiring data provided by the shared memory and giving a fourth parameter s3. Taking the data processing module 101 as an example, the data transmission unit 111 obtains the data us0 of the data unit 109 to assign to the first parameter s0, obtains the constant provided by the constant register to assign to the second parameter s1, obtains one of the data provided by the multiplexing unit (us 0 to us3, specifically, which data is determined by the input first control signal lm_mux) to assign to the third parameter s2, and obtains the data provided by the shared memory to assign to the fourth parameter s3.
The data transmission unit is used for providing the assigned first parameter s0, second parameter s1, third parameter s2 and fourth parameter s3 for an operation module in the same data processing module, and the operation module executes corresponding operation according to the first parameter s0, second parameter s1, third parameter s2 and fourth parameter s3. Taking the data processing module 101 as an example, the data transmission unit 111 provides the assigned first parameter s0, second parameter s1, third parameter s2 and fourth parameter s3 to the operation module G0, and the operation module G0 performs corresponding operations according to the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3.
It should be appreciated that the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3 are the four inputs of each data transmission unit, and that the values of the first parameter s0 and the third parameter s2 are likely to differ from one data transmission unit to another.
The number of the operation module G0 is 0, the number of the operation module G1 is 1, the number of the operation module G2 is 2, and the number of the operation module G3 is 3. Generally, the number of the operation module Gm is m; when the computing device includes M data processing modules, m = 0, 1, …, M-1 and, correspondingly, lm_mux = 0, 1, …, M-1, that is, the number of values lm_mux can take equals the number of operation modules. In some embodiments, the multiplexing unit performs a logic operation on the value of the first control signal lm_mux and the number of the operation module, and provides the data us0 to us3 (generally us0 to us(M-1)) of the data units to the respective data transmission units according to the result of the logic operation.
In some embodiments, for the operation module Gm, the multiplexing unit is configured to provide the data us(Y) to the data transmission unit in the same data processing module as the operation module Gm, where Y is the result of the logic operation on the value of the first control signal lm_mux and m.
Taking the case that the computing device shown in fig. 1 includes four data processing modules as an example, the multiplexing unit performs exclusive or (XOR) logic operation on the value of the first control signal lm_mux and the number of the operation module, and data provided to each data transmission unit is shown in the following table one.
Table one
lm_mux | G0 | G1 | G2 | G3 |
0 | us0 | us1 | us2 | us3 |
1 | us1 | us0 | us3 | us2 |
2 | us2 | us3 | us0 | us1 |
3 | us3 | us2 | us1 | us0 |
It can be seen that, when lm_mux is 0, as for the operation module G0, the result of XOR logical operation of the value of lm_mux and the number 0 of the operation module is 0, the multiplexing unit provides the data us0 to the data transmission unit 111, and the data transmission unit 111 assigns the value of us0 to the third parameter s2; similarly, the data transmission unit 115 assigns the value of us1 to the third parameter s2, the data transmission unit 119 assigns the value of us2 to the third parameter s2, and the data transmission unit 123 assigns the value of us3 to the third parameter s2.
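The routing in table one can be sketched in a few lines of Python. This is only an illustrative software model of the XOR variant described above, not the hardware itself; the function name mux_outputs is hypothetical and does not appear in the patent.

```python
def mux_outputs(us, lm_mux):
    """Return the data routed to G0..G(M-1) for a given lm_mux (XOR variant)."""
    M = len(us)
    if M == 0 or M & (M - 1):
        raise ValueError("M must be a power of 2")
    if not 0 <= lm_mux < M:
        raise ValueError("lm_mux must be in 0..M-1")
    # Module m receives us(Y), where Y = lm_mux XOR m.
    return [us[lm_mux ^ m] for m in range(M)]


us = ["us0", "us1", "us2", "us3"]
for lm_mux in range(len(us)):
    print(lm_mux, mux_outputs(us, lm_mux))
```

Each printed row corresponds to one row of table one. Because XOR with a fixed value is a permutation of 0..M-1, every data unit is routed to exactly one module for any lm_mux.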
In some embodiments, each operation module comprises N operation units, denoted Un, where n = 0, 1, …, N-1, N = 2^k2 and k2 is a natural number, i.e., N is a power of 2, and the number of the operation unit Un is n. As shown in fig. 2, the data transmission unit includes a data distribution unit, and the data distribution unit provides the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' for each operation unit according to the values of the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3 and the number of each operation unit. Each operation unit Un performs a corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3'.
In some embodiments, for the operation unit Un, the first input parameter s0'[n] = s0[n], the second input parameter s1'[n] = s1[n], the third input parameter s2'[n] = s2[n] and the fourth input parameter s3'[n] = s3[n], where s0[n] is the (n+1)-th value of the first parameter s0, s1[n] is the (n+1)-th value of the second parameter s1, s2[n] is the (n+1)-th value of the third parameter s2, and s3[n] is the (n+1)-th value of the fourth parameter s3.
In order to further realize data sharing, a logic operation function is added to the data distribution unit shown in fig. 2. Specifically, the data distribution unit performs a logic operation on the value of the second control signal shuf_oper and the number of each operation unit and, based on the result of the logic operation, assigns a value taken from the first parameter s0 to the first input parameter s0' of each operation unit, a value taken from the second parameter s1 to the second input parameter s1' of each operation unit, a value taken from the third parameter s2 to the third input parameter s2' of each operation unit, and a value taken from the fourth parameter s3 to the fourth input parameter s3' of each operation unit.
In some embodiments, the second control signal shuf_oper includes shuf_oper0, shuf_oper1, shuf_oper2 and shuf_oper3, i.e., there is one shuf_oper signal for each of the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3.
In some embodiments, for the operation unit Un, the data distribution unit performs a logic operation on the value of the second control signal shuf_oper0 and the number n of the operation unit Un and, according to the result of the logic operation, assigns a value taken from the first parameter s0 to the first input parameter s0'[n] of the operation unit Un; the data distribution unit performs a logic operation on the value of the second control signal shuf_oper1 and the number n of the operation unit Un and, according to the result of the logic operation, assigns a value taken from the second parameter s1 to the second input parameter s1'[n] of the operation unit Un; the data distribution unit performs a logic operation on the value of the second control signal shuf_oper2 and the number n of the operation unit Un and, according to the result of the logic operation, assigns a value taken from the third parameter s2 to the third input parameter s2'[n] of the operation unit Un; the data distribution unit performs a logic operation on the value of the second control signal shuf_oper3 and the number n of the operation unit Un and, according to the result of the logic operation, assigns a value taken from the fourth parameter s3 to the fourth input parameter s3'[n] of the operation unit Un.
In some embodiments, for the operation unit Un, the data distribution unit is configured to assign the data s0[Z0] to the first input parameter s0'[n] of the operation unit Un, the data s1[Z1] to the second input parameter s1'[n], the data s2[Z2] to the third input parameter s2'[n], and the data s3[Z3] to the fourth input parameter s3'[n], where Z0 is the result of the logic operation on n and the second control signal shuf_oper0, Z1 is the result of the logic operation on n and the second control signal shuf_oper1, Z2 is the result of the logic operation on n and the second control signal shuf_oper2, and Z3 is the result of the logic operation on n and the second control signal shuf_oper3.
Taking as an example the case in which the data distribution unit performs an exclusive-or (XOR) logic operation on the value of the second control signal shuf_oper and the number n of the operation unit Un, the first input parameter s0'[n], the second input parameter s1'[n], the third input parameter s2'[n] and the fourth input parameter s3'[n] of the operation unit Un can be expressed as:
s0'[n] = s0[n XOR shuf_oper0],
s1'[n] = s1[n XOR shuf_oper1],
s2'[n] = s2[n XOR shuf_oper2],
s3'[n] = s3[n XOR shuf_oper3].
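The XOR selection above can be modelled as a lane shuffle. The following Python sketch assumes each parameter is held as a list of N values, one per operation unit; the name shuffle_lanes is hypothetical and the code models only the XOR example, not any other logic operation the data distribution unit might use.

```python
def shuffle_lanes(s, shuf_oper):
    """Return s' with s'[n] = s[n XOR shuf_oper] for n = 0..N-1."""
    N = len(s)
    if N == 0 or N & (N - 1):
        raise ValueError("N must be a power of 2")
    if not 0 <= shuf_oper < N:
        raise ValueError("shuf_oper must be in 0..N-1")
    return [s[n ^ shuf_oper] for n in range(N)]


s0 = [0, 1, 2, 3, 4, 5, 6, 7]      # one value per operation unit U0..U7
print(shuffle_lanes(s0, 0))        # identity: every unit keeps its own value
print(shuffle_lanes(s0, 1))        # neighbouring units swap values pairwise
print(shuffle_lanes(s0, 4))        # lower and upper halves are swapped
```

With shuf_oper = 0 the result reduces to the direct assignment s0'[n] = s0[n]; a nonzero value lets each operation unit read a value that would otherwise belong to another unit, without any mov instruction.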
As shown in fig. 3, in order to further realize data sharing, a data exchange module is introduced after the data distribution unit. When preset data exchange conditions are met, the data exchange module exchanges the data of two of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' of each operation unit according to preset data exchange logic. Each operation unit executes a corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' after the data exchange is completed.
Specifically, the data exchange module receives the data output from the data distribution unit. For the operation unit Un, the data exchange module determines, according to the enable signal cha_able_s0's2', whether the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit Un can be exchanged. When they can be exchanged, the condition parameter change_s0's2' is valid; the data exchange module sets a determination condition according to change_s0's2' and, when the determination condition is met, exchanges the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit. Similarly, the data exchange module determines, according to the enable signal cha_able_s1's3', whether the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit can be exchanged. When they can be exchanged, the condition parameter change_s1's3' is valid; the data exchange module sets a determination condition according to change_s1's3' and, when the determination condition is met, exchanges the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit.
In some embodiments, when the enable signal cha_able_s0's2' is 1, the data exchange module determines that the data of the first input parameter s0 'and the third input parameter s2' of the operation unit can be exchanged, and further sets a determination condition according to change_s0's 2'; when the enabling signal cha_able_s1's3' is 1, the data exchange module judges that the data of the second input parameter s1 'and the fourth input parameter s3' of the operation unit can be exchanged, and further sets a judging condition according to change_s1's 3'.
In some embodiments, each operation unit corresponds to a thread; a computing device that includes M operation modules, each including N operation units, therefore has a total of M×N threads.
For the operation unit Un numbered n in the operation module Gm numbered m, the corresponding thread number is T = n + m×N. For example, for a computing device comprising 4 operation modules, each comprising 16 operation units, the thread number T takes a value from 0 to 63.
A parameter S = ceil(log2(M×N)) is set, where ceil denotes rounding up to the nearest integer, for example, ceil(2) = 2, ceil(2.01) = 3 and ceil(1.99) = 2; the condition parameters change_s0's2' and change_s1's3' can each take a value from 0 to S-1.
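As a quick check of the thread numbering and of the parameter S, the following Python sketch computes both for the 4-module, 16-unit example. The function names are hypothetical; S is computed with integer arithmetic, relying on the fact that (x - 1).bit_length() equals ceil(log2(x)) for any integer x >= 1, which avoids floating-point rounding.

```python
def thread_number(m, n, N):
    """Thread number T of operation unit Un in operation module Gm."""
    return n + m * N


def condition_parameter_bound(M, N):
    """S = ceil(log2(M * N)); the condition parameters range over 0..S-1."""
    # (x - 1).bit_length() == ceil(log2(x)) for integers x >= 1.
    return (M * N - 1).bit_length()


M, N = 4, 16
print(thread_number(0, 0, N))           # first unit of G0
print(thread_number(3, 15, N))          # last unit of G3
print(condition_parameter_bound(M, N))  # number of bits needed for T
```

For M = 4 and N = 16 the thread numbers span 0 to 63 and S is 6, so the condition parameters can select any of the 6 bits of T.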
Exchange flags are set as flag1 = (T >> change_s0's2') & 1 and flag2 = (T >> change_s1's3') & 1. Whether to exchange the data of the first input parameter and the third input parameter of the operation unit corresponding to a thread is determined according to the value of the exchange flag flag1, and whether to exchange the data of the second input parameter and the fourth input parameter of the operation unit corresponding to the thread is determined according to the value of the exchange flag flag2.
In some embodiments, when the exchange flag1 is 1, exchanging data of the first input parameter and the third input parameter of the operation unit corresponding to the thread; and when the exchange flag2 is 1, exchanging the data of the second input parameter and the fourth input parameter of the operation unit corresponding to the thread.
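The enable signal and the exchange flag combine into a simple per-thread decision. The sketch below is a software model under the embodiment in which an enable value of 1 permits the exchange and a flag value of 1 triggers it; the names exchange_flag and maybe_swap are hypothetical.

```python
def exchange_flag(T, change):
    """Bit number `change` of the thread number T, i.e. (T >> change) & 1."""
    return (T >> change) & 1


def maybe_swap(a, b, enable, change, T):
    """Swap a and b for thread T only if enabled and the selected bit of T is 1."""
    if enable == 1 and exchange_flag(T, change) == 1:
        return b, a
    return a, b


# s0'[n] and s2'[n] of one operation unit, with change_s0's2' = 5:
print(maybe_swap("s0", "s2", enable=1, change=5, T=63))  # bit 5 of 63 is 1
print(maybe_swap("s0", "s2", enable=1, change=5, T=0))   # bit 5 of 0 is 0
print(maybe_swap("s0", "s2", enable=0, change=5, T=63))  # exchange disabled
```

The same function applies unchanged to the pair s1'[n] and s3'[n], using cha_able_s1's3' as the enable and change_s1's3' as the bit selector.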
For a computing device comprising 4 operation modules of 16 operation units each, performing the data exchange operation as described above yields, in one example, the exchange flags for each thread shown in Table 2 below.
Table 2
cha_able_s0's2' / cha_able_s1's3' | change_s0's2' / change_s1's3' | flag1 / flag2 (thread numbers 0 to 63, left to right) |
---|---|---|
1 | 5 | 0000_0000_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111_1111_1111_1111_1111 |
1 | 4 | 0000_0000_0000_0000_1111_1111_1111_1111_0000_0000_0000_0000_1111_1111_1111_1111 |
1 | 3 | 0000_0000_1111_1111_0000_0000_1111_1111_0000_0000_1111_1111_0000_0000_1111_1111 |
1 | 2 | 0000_1111_0000_1111_0000_1111_0000_1111_0000_1111_0000_1111_0000_1111_0000_1111 |
1 | 1 | 0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011_0011 |
1 | 0 | 0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101_0101 |
0 | x | 0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000 |
It can be seen that when the enable signal cha_able_s0's2' is 0, the value of change_s0's2' is x, that is, an invalid input; when the enable signal cha_able_s0's2' is 1, change_s0's2' takes a value from 0 to 5. Taking change_s0's2' = 5 as an example, the value of flag1 for threads 0 to 63 is:
0000_0000_0000_0000_0000_0000_0000_0000_1111_1111_1111_1111_1111_1111_1111_1111;
therefore, for the thread numbered 0, whose flag1 = 0, the logical operation module does not exchange the data of the first input parameter s0' and the third input parameter s2' of the corresponding operation unit, and the data generated by the logical operation module is kept unchanged; for the thread numbered 63, whose flag1 = 1, the logical operation module exchanges the data of the first input parameter s0' and the third input parameter s2' of the corresponding operation unit.
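The flag patterns for all 64 threads can be regenerated from the formula flag = (T >> change) & 1. The sketch below assumes, as in the example above, that the leftmost bit corresponds to thread 0 and the rightmost to thread 63; the function name `flag_row` and its parameters are hypothetical.

```python
from typing import Optional


def flag_row(enable: int, change: Optional[int], total_threads: int = 64) -> str:
    """Return the exchange flags of threads 0..total_threads-1 as a bit string.

    Bits are grouped in fours and joined with underscores. When `enable` is 0
    the condition parameter is ignored (it may be None) and every flag is 0.
    """
    bits = "".join(
        str(((t >> change) & 1) if enable else 0)
        for t in range(total_threads)
    )
    return "_".join(bits[i:i + 4] for i in range(0, len(bits), 4))


# Print one row per setting of the enable signal and condition parameter.
for enable, change in [(1, 5), (1, 4), (1, 3), (1, 2), (1, 1), (1, 0), (0, None)]:
    label = "x" if change is None else str(change)
    print(enable, label, flag_row(enable, change))
```

Each condition parameter value selects one bit of the thread number, so larger values swap operands in larger contiguous blocks of threads.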
It should be understood that, in the data transmission unit structure shown in FIG. 3, the data distribution unit may or may not have a logic operation function. In some embodiments, the data distribution unit does not have a logic operation function, in which case two-level data sharing is realized through the multiplexing unit and the data exchange module; in other embodiments, the data distribution unit has a logic operation function, in which case three-level data sharing is realized through the multiplexing unit, the data distribution unit and the data exchange module.
The computing device capable of realizing data sharing according to the present invention is described in detail below, taking as an example a butterfly operation, the basic operation unit of the FFT, performed by a computing device that includes 2 operation modules each including 2 operation units.
The calculation formula of the butterfly operation of the basic operation unit of the FFT is as follows:
,
setting q=2, the calculation formula of the butterfly operation is:
,
which expands further to:
,
,
,
。
Set lm_mux to 1. Then s2 takes data from the multiplexing unit and s0 takes data from the data unit; s1 takes data from the constant register, so every thread fetches the same data, 1; s3 takes data from the shared memory.
The values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' that the data distribution unit outputs for the operation unit corresponding to each thread are shown in Table 3 below.
Table 3
。
Set cha_able_s0's2' = 1, cha_able_s1's3' = 1, change_s0's2' = 2, and change_s1's3' = 2. The data of the first input parameter s0' and the third input parameter s2' of the threads numbered 2 and 3 are then exchanged, as are the data of the second input parameter s1' and the fourth input parameter s3'. After the data exchange, the values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' that the data exchange module outputs to each operation unit are shown in Table 4 below.
Table 4
。
The operation unit corresponding to each thread executes a multiply-add operation on the values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' in Table 4, thereby completing the calculation of the following formulas:
,
,
,
,
therefore, the two-level data sharing structure of the embodiment of the invention completes the operation with only one instruction.
For the following calculation formula:
,
,
,
,
a three-level data sharing structure, in which the data distribution unit has a logic operation function, is used. With shuf_oper2 = 1, the values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' that the data distribution unit outputs for the operation unit corresponding to each thread are shown in Table 5 below.
Table 5
。
Set cha_able_s0's2' = 1, cha_able_s1's3' = 1, change_s0's2' = 1, and change_s1's3' = 1. The data of the first input parameter s0' and the third input parameter s2' of the threads numbered 1 and 3 are then exchanged, as are the data of the second input parameter s1' and the fourth input parameter s3'. After the data exchange, the values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' that the data exchange module outputs to each operation unit are shown in Table 6 below.
Table 6
。
The operation unit corresponding to each thread executes a multiply-add operation on the values of the first input parameter s0', the second input parameter s1', the third input parameter s2' and the fourth input parameter s3' in Table 6, thereby completing the calculation of the following formulas:
,
,
,
,
therefore, the three-level data sharing structure of the embodiment of the invention completes the operation with only one instruction.
The invention also provides a graphics processor comprising the computing device described above. With the computing device capable of realizing data sharing, the graphics processor can process various graphics data more efficiently.
The multi-level data sharing structure reduces the number of instructions that must be executed and thereby improves operation efficiency when performing complex operations. The specific mode of data sharing can be controlled according to actual computing requirements, and the structure is simple, easy to implement, and well suited to a variety of complex operation scenarios.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be performed by a program that, when executed, comprises one or a combination of the steps of the method embodiments, instructs the associated hardware to perform the method.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, and these should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A computing device comprising a shared memory, a constant register, a multiplexing unit, and a plurality of data processing modules; each data processing module comprises a data unit, a data transmission unit and an operation module;
the data provided by the data units of the data processing modules is marked as us0 to us(M-1), wherein M is the number of the data processing modules; correspondingly, the number of the operation module of each data processing module is marked as m, m = 0, 1, …, M-1; a first control signal lm_mux = 0, 1, …, M-1;
the multiplexing unit is used for carrying out a logic operation on the value of the first control signal lm_mux and the number m of the operation module of each data processing module, and respectively providing the data us0 to us(M-1) provided by the data units of each data processing module to the data transmission units of each data processing module according to the result of the logic operation; for the operation module with the number m, the multiplexing unit is used for providing the data us(Y) to the data transmission unit in the same data processing module as the operation module with the number m, wherein Y is the logic operation result of the value of the first control signal lm_mux and m;
the data transmission unit of each data processing module is used for acquiring data from the data unit in the same data processing module and assigning it to a first parameter s0, acquiring a constant from the constant register and assigning it to a second parameter s1, acquiring data from the multiplexing unit and assigning it to a third parameter s2, and acquiring data from the shared memory and assigning it to a fourth parameter s3;
the operation module of each data processing module is used for executing corresponding operation according to the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3.
2. The computing device of claim 1, wherein the operation module of each data processing module is labeled Gm, where m = 0, 1, …, M-1, and M is the number of the plurality of data processing modules;
each operation module Gm comprises N operation units, denoted Un, where N is the number of the operation units, n=0, 1, …, N-1; the data transmission unit in the same data processing module as the operation module Gm comprises a data distribution unit;
the data distribution unit is used for providing a first input parameter s0', a second input parameter s1', a third input parameter s2 'and a fourth input parameter s3' for each operation unit according to the values of the first parameter s0, the second parameter s1, the third parameter s2 and the fourth parameter s3 and the numbers of the operation units;
each operation unit is used for executing corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2 'and the fourth input parameter s3'.
3. The computing device of claim 2, wherein the first input parameter of the operation unit Un is s0'[n] = s0[n], the second input parameter of the operation unit Un is s1'[n] = s1[n], the third input parameter of the operation unit Un is s2'[n] = s2[n], and the fourth input parameter of the operation unit Un is s3'[n] = s3[n], wherein s0[n] is the (n+1)th value of the first parameter s0, s1[n] is the (n+1)th value of the second parameter s1, s2[n] is the (n+1)th value of the third parameter s2, and s3[n] is the (n+1)th value of the fourth parameter s3.
4. The computing device according to claim 2, wherein the data distribution unit is configured to perform a logical operation on the value of the second control signal shuf_oper and the number of each operation unit, and to assign a value from the first parameter s0 to the first input parameter s0 'of each operation unit, a value from the second parameter s1 to the second input parameter s1' of each operation unit, a value from the third parameter s2 to the third input parameter s2 'of each operation unit, and a value from the fourth parameter s3 to the fourth input parameter s3' of each operation unit, based on the result of the logical operation.
5. The computing device of claim 4, wherein the second control signal shuf_oper includes shuf_oper0, shuf_oper1, shuf_oper2, and shuf_oper3;
for the operation unit Un, the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper0 and the number n of the operation unit Un, and assign a value from the first parameter s0 to the first input parameter s0'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper1 and the number n of the operation unit Un, and assign a value from the second parameter s1 to the second input parameter s1'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper2 and the number n of the operation unit Un, and assign a value from the third parameter s2 to the third input parameter s2'[n] of the operation unit Un according to the result of the logic operation; the data distribution unit is configured to perform a logic operation on the value of the second control signal shuf_oper3 and the number n of the operation unit Un, and assign a value from the fourth parameter s3 to the fourth input parameter s3'[n] of the operation unit Un according to the result of the logic operation.
6. The computing device according to claim 5, wherein for the operation unit Un, the data distribution unit is configured to assign data s0[ Z0] to a first input parameter s0'[ n ] of the operation unit Un, the data distribution unit is configured to assign data s1[ Z1] to a second input parameter s1' [ n ] of the operation unit Un, the data distribution unit is configured to assign data s2[ Z2] to a third input parameter s2'[ n ] of the operation unit Un, the data distribution unit is configured to assign data s3[ Z3] to a fourth input parameter s3' [ n ] of the operation unit Un, wherein Z0 is a logical operation result of n and the second control signal shuf_oper0, Z1 is a logical operation result of n and the second control signal shuf_oper1, Z2 is a logical operation result of n and the second control signal shuf_oper2, and Z3 is a logical operation result of n and the second control signal shuf_oper 3.
7. The computing device of any one of claims 2 to 6, wherein the data transmission unit in the same data processing module as the computing module Gm further comprises a data exchange module;
the data exchange module is used for exchanging data of two of a first input parameter s0', a second input parameter s1', a third input parameter s2 'and a fourth input parameter s3' of each operation unit according to preset data exchange logic when preset data exchange conditions are met;
each operation unit is used for executing corresponding operation according to the first input parameter s0', the second input parameter s1', the third input parameter s2 'and the fourth input parameter s3' after data exchange is completed.
8. The computing device of claim 7, wherein for the operation unit Un, the data exchange module is configured to determine whether the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit Un can be exchanged according to the enable signal cha_able_s0's2', and when the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit can be exchanged, the data exchange module is further configured to set a determination condition according to change_s0's2', and when the determination condition is met, exchange the data of the first input parameter s0'[n] and the third input parameter s2'[n] of the operation unit; the data exchange module is further configured to determine whether the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit can be exchanged according to the enable signal cha_able_s1's3', and when the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit can be exchanged, the data exchange module is further configured to set a determination condition according to change_s1's3', and when the determination condition is met, exchange the data of the second input parameter s1'[n] and the fourth input parameter s3'[n] of the operation unit.
9. The computing device of claim 8, wherein each operation unit corresponds to a thread, and for the operation unit Un numbered n in the operation module Gm numbered m, the thread number corresponding to the operation unit Un is T = n + m×N;
a parameter S = ceil(log2(M×N)) is set, wherein ceil denotes rounding up to the nearest integer;
an exchange flag flag1 = (T >> change_s0's2') & 1 and an exchange flag flag2 = (T >> change_s1's3') & 1 are set, wherein change_s0's2' and change_s1's3' each take a value from 0 to S-1;
according to the value of the exchange flag flag1, it is determined whether to exchange the data of the first input parameter and the third input parameter of the operation unit corresponding to the thread; and according to the value of the exchange flag flag2, it is determined whether to exchange the data of the second input parameter and the fourth input parameter of the operation unit corresponding to the thread.
10. A graphics processor comprising the computing device of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311376818.9A CN117132450B (en) | 2023-10-24 | 2023-10-24 | Computing device capable of realizing data sharing and graphic processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117132450A CN117132450A (en) | 2023-11-28 |
CN117132450B CN117132450B (en) | 2024-02-20 |
Family
ID=88854855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311376818.9A Active CN117132450B (en) | 2023-10-24 | 2023-10-24 | Computing device capable of realizing data sharing and graphic processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117132450B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||