WO2016024508A1 - マルチプロセッサ装置 - Google Patents
マルチプロセッサ装置 (Multiprocessor device)
- Publication number
- WO2016024508A1 (PCT/JP2015/072246)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instruction
- processor
- processing
- register
- memory
- Prior art date
Links
- 238000012545 processing Methods 0.000 claims abstract description 136
- 230000015654 memory Effects 0.000 claims abstract description 106
- 238000000034 method Methods 0.000 claims abstract description 77
- 230000008569 process Effects 0.000 claims abstract description 67
- 230000002776 aggregation Effects 0.000 claims abstract description 10
- 238000004220 aggregation Methods 0.000 claims abstract description 10
- 239000000284 extract Substances 0.000 claims abstract description 8
- 230000004931 aggregating effect Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 16
- 238000004364 calculation method Methods 0.000 description 11
- 230000003111 delayed effect Effects 0.000 description 6
- 230000001419 dependent effect Effects 0.000 description 6
- 230000007246 mechanism Effects 0.000 description 6
- 238000004891 communication Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 238000012546 transfer Methods 0.000 description 5
- 238000012937 correction Methods 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000009466 transformation Effects 0.000 description 3
- 230000001174 ascending effect Effects 0.000 description 2
- 230000015556 catabolic process Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000006731 degradation reaction Methods 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 238000003860 storage Methods 0.000 description 2
- 239000002699 waste material Substances 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 239000012141 concentrate Substances 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 230000006866 deterioration Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000007334 memory performance Effects 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/32—Address formation of the next instruction, e.g. by incrementing the instruction counter
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/34—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
- G06F9/38873—Iterative single instructions for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Definitions
- the present invention relates to a multiprocessor device including a plurality of processors.
- Conventionally, a processor that operates according to a program, such as a DSP (digital signal processor), has been used; in recent years, processing with a large computational load, such as image processing, has become necessary.
- Accordingly, a device composed of a plurality of processors, such as a GPU (graphics processing unit), has emerged as an alternative to the DSP.
- FIG. 2 shows a configuration of a GPU described in Japanese Patent Application Laid-Open No. 2009-37593.
- The GPU includes an external memory unit (EMU) 201, an external memory 202, vector processing engines (VPE) 205, vector control units (VCU) 206, and vector processing units (VPU) 207.
- The vector processing unit (VPU) 207 has a multiprocessor and a plurality of arithmetic units. A vector control unit (VCU) 206 sits above it as a higher-level control unit, and one vector processing engine (VPE) 205 is a set including these units. A plurality of vector processing engines (VPE) 205 are provided and are connected to one another through a crossbar in the external memory unit (EMU) 201 so that they can access each other; since the external memory 202 is also connected, memory access is possible as well.
- The vector processing unit (VPU) 207 holds an L1 cache (level-1 cache, or temporary storage) as the lowest layer, and the vector processing engine (VPE) 205 holds an L2 cache (level-2 cache) as the upper layer; memory accesses flow through this hierarchy.
- In principle, performance increases as more vector processing engines (VPE) 205 are mounted.
- However, since the vector processing engine (VPE) 205 operates in the SIMD (single instruction multiple data) style, executing the same instruction at the same time, memory accesses concentrate simultaneously as the number of mounted engines grows, and performance degrades due to the physical memory bandwidth limits of the external memory unit (EMU) 201 or the external memory 202.
- Therefore, the number of vector processing units (VPU) 207 per engine is limited to several, and the number of vector processing engines (VPE) 205 is increased instead. If each vector processing engine (VPE) 205 is given a different program, or the same program with a deliberate time offset, the simultaneous concentration of accesses mentioned above can be avoided.
- However, each vector processing engine (VPE) 205 is loosely coupled via the external memory unit (EMU) 201, so a mechanism for efficiently exchanging data is required.
- Data exchange becomes necessary when a large program is decomposed into small programs (processing is distributed across the vector processing units (VPU) 207 to improve performance). To avoid complicating programming, these data exchanges are performed automatically, independently of the program.
- Specifically, messages for memory access requests, transfer destinations, and transfer sources are defined; each device issues and receives them for processing, and multiple messages are arbitrated by each device and processed in parallel while preserving their order.
- Message processing is performed mainly by a DMA (Direct Memory Access) device in each vector processing engine (VPE). With such a mechanism, data communication among the L2 cache of the vector processing engine (VPE) 205, the L1 cache of the vector processing unit (VPU) 207, and the external memory 202 is carried out automatically. These processes are handled automatically in the background, so the program does not need to be aware of them.
- FIG. 3 shows, along the time axis, whether a hazard occurs in pipeline processing; a hazard occurs when the register written by instruction 1 is the same as the register read by instruction 2.
- To solve this problem in a conventional processor, a method has been considered in which different programs are issued alternately, as shown in FIG. 4, so that flow dependence does not occur.
- In FIG. 4, four programs A, B, C, and D are prepared, and their instructions are issued alternately. If the programs A to D use different registers, the register written by program A does not overlap with the registers read by programs B, C, and D. Further, since there are three cycles between instruction A1 and instruction A2 of program A, there is no risk of a hazard even if flow dependence occurs.
- the conventional multiprocessor device has the following problems.
- As a dynamic method, there is a method equipped with a mechanism for monitoring the state of each processor. Monitoring is performed by placing the progress of each program in a shared memory or the like, and each processor refers to it and executes a program that it can run. Alternatively, a separate management processor is prepared, which sequentially tracks the status of each processor and activates it. In either case, the overall hardware mechanism becomes complicated and costs increase.
- the multiprocessor device includes an external memory, a plurality of processors, a memory aggregation device, a register memory, a multiplexer, and an overall control device.
- the memory aggregation device aggregates memory accesses of a plurality of processors.
- In the register memory, a number of registers equal to the product of the number of registers managed by the processor and the maximum processing number of the processor is prepared.
- the multiplexer accesses the register memory in accordance with an instruction given to the register access of the processor.
- the overall control device extracts parameters from the instructions and gives them to the processor and multiplexer for control.
- The overall control device sequentially processes the given number of processes by changing the addressing of the register memory by the processor under the same instruction, switches to the next instruction when that number of processes is finished, and repeats the process.
- In a conventional device, registers are prepared only for the number of physical processors; here, registers are prepared for a very large number of logical processors, and the number of processes is adjusted automatically.
- Since the adjustment is only an addition or subtraction of the number of processes, the degree of parallelism of the processors can easily be increased or decreased.
- Moreover, the program does not need to be aware of the number of processors.
- the present invention can provide a multiprocessor device that facilitates program creation, is scalable in performance and function, and has excellent cost performance.
- FIG. 1 is a diagram illustrating a multiprocessor device according to an embodiment of the present invention.
- FIG. 2 is a diagram illustrating the configuration of a conventional multiprocessor device.
- FIG. 3 is a diagram for explaining the pipeline operation in the case where the hazard of the conventional multiprocessor device does not occur.
- FIG. 4 is a diagram for explaining a pipeline operation in which different programs of a conventional multiprocessor device are executed alternately.
- FIG. 5 is a diagram for explaining access to the registers of the multiprocessor device according to the embodiment of the present invention.
- FIG. 6 is a diagram for explaining the structure of the multiprocessor device for the register memory according to the embodiment of the present invention.
- FIG. 7 is a diagram for explaining a processing cycle and addressing for a register memory when no hazard occurs in the multiprocessor device according to the embodiment of the present invention.
- FIG. 8 is a diagram for explaining the processing cycle and addressing for the register memory when a hazard occurs in the multiprocessor device according to the embodiment of the present invention.
- FIG. 9 is a diagram for explaining that the multiprocessor device according to the embodiment of the present invention includes a plurality of processing units.
- FIG. 10 is a diagram illustrating register access to different logical processors in the multiprocessor device according to the embodiment of the present invention.
- FIG. 11 is a diagram for explaining horizontal and vertical register access in image processing in the multiprocessor device according to the embodiment of the present invention.
- FIG. 12 is a diagram for explaining the generation of the branch condition of the multiprocessor device according to the embodiment of the present invention.
- FIG. 13 is a diagram illustrating an example of an image generated using a branch condition in the multiprocessor device according to the embodiment of the present invention.
- a multiprocessor device according to the first embodiment of the present invention will be described. This embodiment will be described with reference to FIGS. 1 and 5 to 9.
- the multiprocessor device 100 includes a memory aggregation device 101, an external memory 102, a multiplexer 103, an overall control device 105, a register memory 106, and a plurality of processors 107.
- The memory aggregation device 101 aggregates memory accesses of the plurality of processors 107.
- the multiplexer 103 accesses the register memory 106 according to an instruction given to the register access of the processor 107.
- the number of processors 107 is physically eight, and the logical number (maximum number of processes) that can be processed as SIMD is 1024.
- The register memory 106 holds the registers read and written by the processors 107, and registers are prepared for the number of logical processors. For example, if there are 16 registers per processor, 16 × 1024 = 16384 registers are prepared. That is, 1024 logical processors (16384 registers) are held, while the maximum number that can be physically processed per unit time (one cycle) is eight.
- the eight processors 107 change the addressing for the registers and sequentially process the instructions with the given processing number N.
- the overall control device 105 sequentially processes the given number of processes by changing the addressing of the register memory 106 by the processor 107 with the same instruction. Further, when the processing by the processor 107 is completed, the overall control device 105 switches to the next instruction and repeats the processing for the given number of processing.
- The instructions are stored in the external memory 102 here. The overall control device 105 extracts parameters from the instructions and gives them to the processor 107 and the multiplexer 103 for control.
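- To make the control flow concrete, the following is a minimal C sketch (not the actual hardware; the Instruction type, the execute_lane function, and the constant names are assumptions of this sketch) of how the overall control device 105 applies one instruction to all N processes, eight logical processors per cycle, before moving to the next instruction.
#define NUM_PHYS 8                       /* physical processors (lanes) */

typedef struct { int opcode, src0, src1, dst; } Instruction;   /* simplified instruction parameters */

static void execute_lane(const Instruction *insn, int logical)
{
    /* placeholder for one processor 107 executing `insn` for logical processor `logical` */
    (void)insn; (void)logical;
}

/* One instruction is applied to all N processes before switching to the next instruction. */
static void run_program(const Instruction *program, int num_instructions, int N)
{
    for (int pc = 0; pc < num_instructions; pc++) {
        const Instruction *insn = &program[pc];           /* parameters extracted once */
        for (int base = 0; base < N; base += NUM_PHYS) {  /* one register-memory addressing per cycle */
            for (int lane = 0; lane < NUM_PHYS; lane++) { /* eight processors work in parallel */
                int logical = base + lane;
                if (logical < N)
                    execute_lane(insn, logical);
            }
        }
    }
}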
- FIG. 5 is a diagram showing the above operation.
- the register memory 106 is illustrated as being divided into two, reading and writing.
- Reading and writing occur as described above. For example, in a binary operation, two of the 16 registers prepared per processor are read and the result is written to one register. Which register is selected as an operand is described as an operand number in the instruction, and the multiplexer 103 receives the operand number and switches accordingly.
- The register memory 106 receives accesses from the eight processors 107 simultaneously, but since all eight execute the same instruction, the operand numbers are identical. Therefore, the multiplexer 103 can apply the same addressing and the same switching for every processor 107. However, the 16 registers must be divided into groups so that several of them can be selected at the same time, and to maximize throughput, reading and writing should also be possible simultaneously. These requirements can be met by preparing 16 banks of 2-port SRAM.
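- As a small illustrative sketch (the helper names are assumptions, not the patent's terminology), the bank organization can be expressed as a mapping from a register number and a logical processor number to a bank and a word address: the register number selects the bank, and eight consecutive logical processors share one word address, so a single common addressing serves all eight lanes.
#define NUM_PHYS  8      /* physical processors */
#define NUM_BANKS 16     /* one 2-port SRAM bank per register number R0..R15 */

/* Register r of logical processor p lives in bank r, word address p / 8, lane p % 8. */
static inline int bank_of(int reg)     { return reg; }
static inline int addr_of(int logical) { return logical / NUM_PHYS; }
static inline int lane_of(int logical) { return logical % NUM_PHYS; }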
- FIG. 6 is a diagram showing the above operation.
- the multiplexer 103 is divided into two, a multiplexer 103 that performs a read operation and a multiplexer 103 that performs a write operation.
- During a read, common addressing is applied to the 16 banks of SRAM, the two read operand numbers described in the instruction are given to the read-side multiplexer 103 (divided into upper and lower halves for convenience), and the two required operands are selected and fed to the processor 107.
- During a write, the single write operand number described in the instruction is given to the write-side multiplexer 103, and the operation result of the processor 107 is stored in the SRAM bank it designates.
- FIG. 7 the horizontal axis corresponds to time, and the vertical axis corresponds to addressing.
- R0 generated by instruction 0 is read by instruction 1, so the two instructions are flow-dependent; however, there is no problem, because the write to R0 by instruction 0 completes long before instruction 1 reads it.
- The memory aggregation device 101 basically coalesces multiple requests when adjacent addresses appear among random address requests to the memory, or mounts a cache to speed up local memory accesses. However, the more such optimization is performed, the more the latency, which is an index of response speed, fluctuates. Still, if the number of pipeline stages is large and hazards are unlikely to occur, these latencies can be absorbed to some extent.
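- The request coalescing mentioned above can be pictured roughly as follows (an illustrative sketch only; the structures and the assumption that requests arrive sorted by address are simplifications, not the actual design of the memory aggregation device 101).
typedef struct { unsigned addr; int lane; } Request;
typedef struct { unsigned addr; int len; } Burst;

/* Runs of consecutive word addresses among the simultaneous requests are merged into
   single bursts, reducing the number of external-memory transactions. */
static int coalesce(const Request req[], int n, Burst out[])
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (m > 0 && req[i].addr == out[m - 1].addr + out[m - 1].len) {
            out[m - 1].len++;              /* extend the current burst */
        } else {
            out[m].addr = req[i].addr;     /* start a new burst */
            out[m].len = 1;
            m++;
        }
    }
    return m;                              /* number of transactions actually issued */
}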
- CORDIC: Coordinate Rotation Digital Computer.
- A register that is read and written at a high operating frequency is generally built from flip-flops. In this configuration, however, low-cost SRAM can be used without problems because there are enough pipeline stages; for example, with a pipelined SRAM, even if the access itself takes several cycles, the throughput remains one access per cycle.
- In this example, the instruction remains unchanged for a period of 38 cycles.
- Therefore, the instruction update frequency is low, and there is no need for measures to speed up instruction access, such as an instruction cache.
- This feature also relaxes restrictions on the instruction length (word length) and is advantageous for implementing horizontal instructions, so-called VLIW (Very Long Instruction Word), in which the arithmetic units operate in parallel and basically do not interfere with each other.
- FIG. 9 is a diagram for explaining that the multiprocessor device is composed of a plurality of processing units.
- This multiprocessor device 100 includes an integer unit 801 (multiplication and addition), a floating-point unit 802 (multiplication and addition), the above-described CORDIC unit 803 (trigonometric and hyperbolic function calculation and division), and a memory access unit 804.
- Each processor 107 includes these, and the multiplexer 103 is expanded to select and supply necessary registers for each instruction of each unit.
- Each of the units 801 to 804 is pipelined; each square in the figure corresponds to one cycle, and the units have different latencies.
- A FIFO memory absorbs the timing differences, synchronizes the results, and returns them to the multiplexer 103.
- As shown in FIG. 9, operation basically flows one way from register read to register write, and the structure is simple and free of mutual interference, which is advantageous for increasing speed.
- the instruction word length can be easily expanded. Further, since the unit can be easily attached and detached (validation and invalidation), addition and reduction of circuits can be easily realized. As a result, it is easy to provide a multiprocessor device that allows the user to easily customize the processing circuit in accordance with the purpose.
- The register memory 106 is basically accessed per processor, but a separately accessible common register may also be prepared. Such a register is used, for example, to refer to a variable shared by the entire processing. However, when it is written from multiple processors, the value may differ from the intended one; it is therefore preferable to sum the values written by each logical processor, which also makes it possible, for example, to build a histogram.
- a multiprocessor device according to the second embodiment of the present invention will be described. This embodiment will be described with reference to FIG.
- the multiprocessor device 100 described in the first embodiment is more effective as the number of processes N is larger.
- Since the number of logical processors (the maximum number of processes) is 1024, it is necessary to handle both the case where the number of processes N is less than the number of logical processors and the case where it exceeds the number of logical processors.
- When N exceeds the number of logical processors, the processing must be divided. Unlike a conventional processor, the division is not switched at every instruction but after a series of the program has completed.
- variables x and y are QVGA coordinates
- C0 to C5 are constants representing the rotation of the Affine transformation
- mem[][] is a memory array storing each QVGA pixel.
- The transfer-source coordinates obtained by the matrix calculation (affine transformation) are assigned to R2 and R3, the transfer-source data is read into R0, and the read value is written to the transfer-destination coordinates indicated by the variables x and y (see the program listing at the end of the description).
- Internally, the variable x is scanned in steps of eight, the number of physical processors, and the variable i is processed in parallel across the eight processors. That is, within each step in which the variable y changes by one, the entire series of the program is processed, and this is repeated until the Y coordinate reaches its maximum.
- When the number of processes N is small, the number of loops can be made 1/3 in the variable y and 3 times in the variable x, as in the listing below (the correction of the parts that refer to the variables x and y is omitted). As a result, three rows of pixels can be processed combined into one row in the direction along the variable y.
- Conversely, when the number of processes N is large (for example, a 1920-pixel-wide image), the correction may make the number of loops twice in the variable y and 1/2 in the variable x, as in the listing below. As a result, one row is processed divided into two.
- an example is shown in which one process is divided into two, but the number of divisions is not particularly limited and can be arbitrarily set.
- These adjustments are made automatically by the overall control device 105. Such an adjustment can be realized, for example, by finding a multiple of the maximum X coordinate that does not exceed 1024. That is, in this embodiment as well, there is no need to be aware of the number of logical or physical processors, and a conventional program may be given as-is.
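- One possible form of this automatic adjustment is sketched below in C (the function name, the rounding policy, and the assumption that whole rows are packed into batches of at most 1024 logical processors are illustrative, not taken from the patent); it reproduces the 320 × 3 and 1920 / 2 loop bounds used in the examples above.
#define MAX_LOGICAL 1024

/* Pack as many whole rows of `width` pixels as fit into the 1024 logical processors
   (combining rows when a row is short), or split a row into equal parts when it is
   longer than 1024. */
static void adjust_loops(int width, int *rows_per_batch, int *cols_per_batch)
{
    if (width <= MAX_LOGICAL) {
        *rows_per_batch = MAX_LOGICAL / width;                /* e.g. 320 -> 3 rows combined */
        *cols_per_batch = width;
    } else {
        int parts = (width + MAX_LOGICAL - 1) / MAX_LOGICAL;  /* e.g. 1920 -> 2 parts */
        *rows_per_batch = 1;
        *cols_per_batch = width / parts;                      /* e.g. 1920 -> 960 */
    }
}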
- FIG. 8 shows a hazard that occurs when the number of pipeline stages is 60.
- Immediately after the processor 107 has been activated 38 times under instruction 0, it cannot be activated with instruction 1; it must wait until cycle 60, where the processing of instruction 0 completes. That is, when switching to a new instruction, if the processing performed under the previous instruction in the same processing order as the new instruction has not finished, processing of the new instruction must wait until it finishes.
- Here, the processing order means the position of a process among the N processes executed in order under one instruction; for example, a process whose processing order is "10" is the one executed tenth among the N processes.
- For the overall control device 105 to control this dynamically, it suffices to detect flow dependence between consecutive instructions or across several instructions; however, this requires a round-robin comparison of operand numbers.
- Alternatively, static flow-dependence information may be given to the overall control device 105 for processing.
- The presence or absence of overlapping register numbers between adjacent instructions in the program constitutes this flow-dependence information.
- For example, the compiler may transform the program so that dependence on instructions more than n places back can be ignored. That is, if processing that was executed in the same processing order as the new instruction and issued earlier than a pre-specified number of instructions has not finished, processing of the new instruction waits until it finishes; in this case, the completion of processing issued within the pre-specified number of instructions is not checked before the new instruction proceeds.
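- The wait rule can be sketched as follows (hypothetical names; the completed_order bookkeeping and the compiler-supplied dependence bit are assumptions of this sketch, not the literal hardware).
/* Lane `order` of the new instruction may start only when lane `order` of the
   flow-dependent earlier instruction has retired; completed_order[] records, for each
   in-flight instruction, the highest processing order that has finished. */
static int may_issue(int order, int depends_on_prev,
                     const int completed_order[], int prev_index)
{
    if (!depends_on_prev)                              /* no overlapping register numbers: no wait */
        return 1;
    return completed_order[prev_index] >= order;       /* wait until the same processing order retires */
}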
- A multiprocessor device according to the third embodiment of the present invention will be described. This embodiment will be described with reference to FIG. 6, FIG. 7, FIG. 10, and FIG. 11.
- the multiprocessor device 100 handles image processing. As shown in FIG. 7, the multiprocessor device 100 takes a form in which addressing is changed and instructions are sequentially executed, and addressing is performed along the X coordinate of image processing. Also, after executing the instructions in the program, the next Y coordinate is processed again, and the process ends when the entire image has been processed.
- When an operand specifies a relative shift amount (a reference to the register of another logical processor), the addressing of the register memory 106 is changed according to that shift amount. Since an operand with a specified shift amount and an operand without one may be accessed at the same time, only the banks of the shifted operand have their addressing changed; the banks of the unshifted operands are left as they are.
- FIG. 10 shows a case of referring to the register of a logical processor 12 positions ahead. Since the addressing of the current processing is n, the relative reference spans addressings n + 1 and n + 2. One addressing reads eight consecutive data items (the number of physical processors), so two addressings are needed only at the start, where the reference straddles this boundary; each subsequent reference needs only one new addressing if the previously accessed data is remembered.
- Specifically, the unused data of addressing n + 2 at local position numbers (the remainder of the logical processor number divided by 8) "4" to "7" is stored. In the next addressing, the eight data items of addressing n + 3 are acquired; as a result, the stored data at local positions "4" to "7" of addressing n + 2 and the newly acquired data at local positions "0" to "3" of addressing n + 3 can be used. The unused data at local positions "4" to "7" of addressing n + 3 is in turn stored for subsequent processing.
- Thus the overall control device 105 performs the double access only at the start of addressing. If the relative reference is a multiple of 8, the reference does not straddle two addressings, so the double access is unnecessary.
- From the data of addressings n + 1 and n + 2, the multiplexer 103 allocates the addressing n + 1 data to physical processor numbers 0 to 3 and the addressing n + 2 data to physical processor numbers 4 to 7. For the former, the data of logical-processor local position numbers 4 to 7 is shifted and supplied; for the latter, the data of local position numbers 0 to 3 is shifted and supplied.
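- A sketch of this shifted read in C (illustrative only; prev holds the eight values remembered from the earlier addressing, cur the eight newly fetched values, and shift is the relative shift amount reduced modulo 8):
#define NUM_PHYS 8

/* Combine the remembered previous addressing with the newly fetched one so that each
   physical processor lane receives the value of the logical processor `shift`
   positions ahead. */
static void shifted_read(const int prev[NUM_PHYS], const int cur[NUM_PHYS],
                         int shift, int out[NUM_PHYS])
{
    for (int lane = 0; lane < NUM_PHYS; lane++) {
        int pos = lane + (shift % NUM_PHYS);
        out[lane] = (pos < NUM_PHYS) ? prev[pos] : cur[pos - NUM_PHYS];
    }
}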
- the multiprocessor device 100 does not execute different Y coordinates until execution of all instructions for the X coordinates is completed. Therefore, it is not possible to refer to values during processing of different Y coordinates. However, at the start of a new Y coordinate program, the result of processing with the previous Y coordinate remains in the register memory 106. By referring to this, the register value of the logical processor having the smaller Y coordinate can be used.
- In that case, R0 to R3 can be used as they are; however, when processing the next Y coordinate, R0 to R3 must be updated anew.
- Since it is inefficient to change the program every time the Y coordinate is updated, R0 to R3 must appear at the same relative positions in the processing of the next Y coordinate, regardless of which Y coordinate is being processed. Therefore, when the processing of one row of X coordinates is finished, the contents would have to be transferred from R1 to R0, from R2 to R1, from R3 to R2, and the current result to R3; done in the program, this consumes several instructions.
- Therefore, the least significant 4 bits of the Y coordinate are added to the operand number designated by the instruction in the multiplexer 103 that selects the register, so that the rotation is achieved without extra instructions.
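- A minimal sketch of this operand remapping (the wrap-around over the full 16-register file is an assumption of this sketch; the description only states that the least significant 4 bits of the Y coordinate are added to the operand number):
#define NUM_REGS 16

/* The low 4 bits of the Y coordinate are added to the operand number before the bank
   is selected, so "R0..R3" in the program always refer to the same relative rows
   without any transfer instructions. */
static inline int physical_operand(int operand, int y)
{
    return (operand + (y & 0xF)) % NUM_REGS;
}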
- The horizontal and vertical start and end points involve edge processing such as mirroring and copying.
- For mirroring, a coordinate of -1 is treated as 1 and a coordinate of -2 as 2; for copying, all negative coordinates are treated as 0. The same applies not only to negative values but also when the maximum coordinate is exceeded.
- A multiprocessor device according to the fourth embodiment of the present invention will be described. This embodiment will be described with reference to FIG. 7, FIG. 9, FIG. 12, and FIG. 13.
- the multiprocessor device 100 stores branch conditions for each logical processor, and determines the presence / absence of processing from the stored information. Thereby, even when the same instruction is input to the eight processors 107, a branch can be realized for each processor 107.
- In addition to the generally used registers, the register memory 106 holds operation results such as carry and overflow and the branch-condition flags described below.
- The condition codes of the integer unit 801 and the memory access unit 804 in FIG. 9 are collectively called CCint, that of the floating-point unit is CCmad, and that of the CORDIC unit is CCcor.
- Each CC consists of 4 bits: N representing sign (positive/negative), Z representing zero, V representing overflow, and C representing carry.
- FIG. 12 shows a process of generating branch flags F0 to F3.
- The branch flag 111 held in the register memory 106 has four levels, F0 to F3. The source information used to generate the branch flag 111 is selected according to the selection table 112; the selection is specified in the instruction.
- The generation table 113 covers all combinations of the 4 bits (2^4 = 16 patterns) constituting the source information selected via the selection table 112, and generates an update branch flag from the state of those 4 bits. The use of the generation table 113 is likewise specified in the instruction.
- The instruction table 114 generates (selects) a new branch flag by combining the branch flag 111 with the update branch flag generated by the generation table 113; its use is also specified in the instruction.
- The determination table 115 covers all combinations of the 4 bits (2^4 = 16 patterns) constituting the branch flag 111 and generates (selects) a determination flag from that 4-bit state; its use is also specified in the instruction.
- The write instruction table 116 determines, from the determination flag generated by the determination table 115, whether to write to the register memory 106 and whether to write to the branch flag 111; this determination is also specified in the instruction.
- An update branch flag is generated, via the selection table 112, from either the CC (CCint, CCmad, or CCcor) attached to the operation result of each unit or the branch flag 111 itself. For example, to set the branch flag when the N and Z bits of CCint's NZVC are both 1, the selection table 112 is set to "1" and the generation table 113 to "1111,0000,0000,0000" (the patterns in which both N and Z hold).
- The instruction table 114 then specifies how to incorporate the update branch flag into the original four-level branch flag. For example, when the most recently generated flag is pushed out and the new branch flag is inserted in the vacated position, the instruction table 114 is set to "2". The result generated according to this instruction table 114 becomes the next branch flag 111.
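- The table lookups can be pictured as follows (a sketch; representing each 16-entry table as a 16-bit mask indexed by the 4-bit pattern is an interpretation consistent with the 0x3333-style constants in the program examples at the end of the description, and the merge step shows only the simple OR-into-F0 case).
/* Table-driven update of branch flag F0 from a unit's 4-bit NZVC condition code.
   `generation` is a 16-bit mask (generation table 113): bit k is the new flag value
   for the 4-bit pattern k. */
static inline int table_lookup(unsigned short table, unsigned pattern4)
{
    return (table >> (pattern4 & 0xF)) & 1;
}

static void update_flag_F0(unsigned nzvc_cc, unsigned short generation,
                           unsigned char flags[4])
{
    int update = table_lookup(generation, nzvc_cc);
    flags[0] |= update;          /* e.g. "F0 |= Form[CCcor]" in the example program */
}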
- Based on the branch flag 111 in FIG. 12, it is determined whether to write the operation result to the register memory 106 and whether to write to the branch flag 111. For example, when the two flags F0 and F1 of the branch flag 111 are used to create four states, a setting is made for each state in the determination table 115 and the write instruction table 116 is set to "1". This is described in a notation similar to the following C language (the for statements relating to the multiprocessor are omitted); see the switch statement in the code listing at the end of the description.
- the program given to the processor 107 is as follows.
- Judge[] is a 16-entry binary table indexed by a 4-bit pattern.
- F3210 is the bit concatenation of branch flags F3 to F0.
- Whether to move to a designated instruction (a branch) can likewise be described in the instruction. However, when a loop is formed by a branch, the condition for exiting the loop may never become true for every processor: the exit condition is evaluated per logical processor, and there can be cases where not all logical processors (excluding unprocessed ones) satisfy it. Therefore, in addition to the condition, an upper limit on the number of loop iterations is described.
- Switching from instruction 0 to instruction 1 occurs when the last addressing of instruction 0 is completed.
- The parameters of instruction 0 (operand designations and so on) only need to be read once and stored, so instruction 1 can be read in advance, from the cycle immediately after instruction 0 has been read; in this case the parameters are queued in a FIFO memory or the like.
- However, the branch flags of all the logical processors are not determined until cycle 37 of instruction 0. If the branch-target instruction were fetched only at cycle 37 of instruction 0, activation of the processor 107 would be delayed accordingly, degrading performance.
- Therefore, delayed branching is adopted: instruction 1 is executed unconditionally, and the branch takes effect at the end of instruction 1. The parameters of instruction 1 are acquired during the execution of instruction 0, and the parameters of the branch-destination instruction, determined at the end of instruction 0, are acquired during the execution of instruction 1, so the processor 107 can be activated continuously.
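- The effect of the delayed branch on instruction sequencing can be sketched as follows (illustrative; the Instruction structure and branch_target field are assumptions of this sketch).
typedef struct { int opcode; int branch_target; } Instruction;  /* simplified */

/* With delayed branching, instruction pc+1 (the delay slot) always runs; the branch
   decided at the end of instruction pc only selects what runs after the delay slot,
   so its parameters can be fetched while the delay slot executes. */
static int next_pc_after_delay_slot(int pc, int branch_taken, const Instruction prog[])
{
    int delay_slot = pc + 1;                                  /* executed unconditionally */
    return branch_taken ? prog[pc].branch_target : delay_slot + 1;
}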
- Instructions 0 and 1 normalize the X and Y coordinates to be operated, initialize the branch flag 111, and initialize variables.
- In instruction 2, the recurrence calculation (R2) is performed in the floating-point unit 802 and the convergence determination (R8) in the CORDIC unit 803; the iteration count R3 is also incremented.
- Instruction 2 is looped (the symbol "!&" in the instruction means the negation of the AND over all lanes).
- The number of loops is capped at 64, and instruction 3 is always executed because of the delayed branch. If the result of the CORDIC unit 803 is not an overflow V (an overflow being an inexpressible result, that is, R1 * R1 - 2 * 2 < 0), F0 is set; this indicates that R1 has reached 2 or more, i.e., the sequence diverges, and the iteration ends.
- In instruction 3, as in instruction 2, the recurrence calculation (R1) is performed in the floating-point unit 802 and the convergence determination (R9) in the CORDIC unit 803.
- The branch flag F0 is determined in the same way as for instruction 2, and the result is written into F0.
- In this way, a program can be executed with a small number of instructions.
- This is particularly effective in a configuration that reduces the instruction count by using the arithmetic units in parallel.
- This specification discloses a multiprocessor device including an external memory, a plurality of processors, a memory aggregation device, a register memory, a multiplexer, and an overall control device.
- the memory aggregation device aggregates memory accesses of a plurality of processors.
- In the register memory, a number of registers equal to the product of the number of registers managed by the processor and the maximum processing number of the processor is prepared.
- the multiplexer accesses the register memory in accordance with an instruction given to the register access of the processor.
- The overall control device extracts parameters from an instruction, gives them to the processor and the multiplexer for control, and has the processor sequentially process the given number of processes under the same instruction while changing the addressing of the register memory; when the given number of processes is finished, it switches to the next instruction and repeats the processing for the given number of processes.
- The overall control device can be configured to divide the processing into several parts when the given number of processes exceeds the maximum number of processes, and to combine several processes when the given number of processes is less than the maximum number of processes.
- When switching instructions and executing a new instruction, the overall control device can be configured to make processing of the new instruction wait until processing executed under the previous instruction in the same processing order as the new instruction has finished. Alternatively, it can wait until that processing finishes when the register write position of the processing executed in the same processing order under the previous instruction equals the register read position of the new instruction. Alternatively, it can wait only when processing that was executed in the same processing order as the new instruction and issued earlier than a pre-specified number of instructions has not finished; in this case, processing of the new instruction proceeds without confirming the completion of processing issued within the pre-specified number of instructions.
- The overall control device can also be configured to extract from a given instruction a relative shift amount relating to the processing order of each processor, give the shift amount to the multiplexer, and, when the shift amount is not an integer multiple of the number of processors, instruct that addressing of the register memory be performed twice only at the beginning.
- The multiplexer can be configured to shift and extract data according to the above shift amount from the data obtained by the current addressing of the register memory and the data obtained by past addressing, and to supply the extracted data to the plurality of processors. With this configuration, data can be exchanged between processors merely by addressing and data shifting at register-access time, which is particularly effective for 2D processing such as image processing.
- The processor can be configured to generate a flag indicating a branch condition from a given instruction and each operation result, combine it, according to the instruction, with the plurality of branch flags stored in the register memory, and store the result in the register memory as a new branch flag.
- The processor can further be configured to determine, from a given instruction and the plurality of branch flags stored in the register memory, whether to write the operation result to the register memory or whether to move to a designated instruction.
- the multiprocessor device of the present invention can be applied to digital AV equipment, mobile terminals, mobile phones, computer equipment, in-vehicle control equipment, medical equipment, and the like, which are applications of computer systems.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Advance Control (AREA)
- Executing Machine-Instructions (AREA)
- Multi Processors (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
Abstract
Description
/* Program given for one row (Y coordinate fixed): affine transformation of a QVGA image. */
for (x=0; x<320; x++) {
R2 = C0 * x + C1 * y + C2;   /* transfer-source X coordinate */
R3 = C3 * x + C4 * y + C5;   /* transfer-source Y coordinate */
R0 = mem[R3][R2];            /* read the source pixel */
mem[y][x] = R0;              /* write to the destination (x, y) */
}
/* Internal execution: each instruction is applied to all N processes before the next
   instruction starts; x advances in steps of 8 (the number of physical processors) and
   i indexes the eight processors working in parallel. */
for (x=0; x<320; x+=8)
for (i=0; i<8; i++)
R2 = C0 * (x+i) + C1 * y + C2;
for (x=0; x<320; x+=8)
for (i=0; i<8; i++)
R3 = C3 * (x+i) + C4 * y + C5;
for (x=0; x<320; x+=8)
for (i=0; i<8; i++)
R0 = mem[R3][R2];
for (x=0; x<320; x+=8)
for (i=0; i<8; i++)
mem[y][x+i] = R0;
/* Combining three QVGA rows into one when the number of processes is small: */
for (x=0; x<320*3; x+=8)
/* Splitting one 1920-pixel row into two when the number of processes is large: */
for (x=0; x<1920/2; x+=8)
/* Branch realized per logical processor from the two flags F1 and F0 (four states);
   "F1F0" denotes their bit concatenation, analogous to F3210. */
switch (F1F0) {
case 00:            /* F1 = 0, F0 = 0 */
R0 = R1 + R2;
break;
case 01:            /* F1 = 0, F0 = 1 */
R0 = R1 + R2;
break;
case 10:            /* F1 = 1, F0 = 0 */
R3 = R4 / R1;
break;
case 11:            /* F1 = 1, F0 = 1 */
R0 = R1 + R2;
R3 = R4 / R1;
break;
}
/* Judge: 16-entry binary table indexed by F3210; the division executes only in the
   lanes where the selected table entry is 0. */
Judge = 0x3333; if (!Judge[F3210]) R3 = R4 / R1;
/* Example program (instructions 0 to 4): Mandelbrot-style iteration, one logical
   processor per pixel; R3 accumulates the iteration count written out as the image. */
for (x=0; x<64; x++) {
0: R4 = 1/16 * x - 2; R3 = F3210 = 0;                            /* normalize X, clear count and flags */
1: R5 = 1/32 * y - 1; R0=R1=0;                                   /* normalize Y, clear variables */
2: R2 = R0 * R0 - R1 * R1 + R4; R8 = sqrt(R1 * R1 - 4); R3 += 1; /* recurrence (R2), convergence test (R8) */
Judge = 0xaaaa; if (!&Judge[F3210] & (Loop < 64)) goto 2;        /* loop to 2 while not all lanes diverged, at most 64 times */
Form = 0x3333; F0 |= Form[CCcor];                                /* set F0 when the CORDIC result did not overflow (R1 >= 2) */
3: R1 = (R0 * R1 + R5) * 2; R9 = sqrt(R2 * R2 - 4); R0 = R2;     /* recurrence (R1), convergence test (R9); delayed-branch slot */
Form = 0x3333; F0 |= Form[CCcor];
4: mem[x][y] = R3;                                               /* store the iteration count as the pixel value */
}
101 Memory aggregation device
102 External memory
103 Multiplexer
105 Overall control device
106 Register memory
107 Processor
Claims (5)
- 1. A multiprocessor device comprising:
an external memory;
a plurality of processors;
a memory aggregation device that aggregates memory accesses of the plurality of processors;
a register memory providing a number of registers equal to the product of the number of registers managed by the processors and the maximum processing number of the processors;
a multiplexer that accesses the register memory in accordance with an instruction given for register access of the processors; and
an overall control device that extracts parameters from an instruction and gives them to the processors and the multiplexer for control, causes the processors to process the given number of processes sequentially under the same instruction while changing the addressing of the register memory, and, when the given number of processes is finished, switches to the next instruction and repeats the processing for the given number of processes.
- 2. The multiprocessor device according to claim 1, wherein the overall control device divides the processing into several parts when the given number of processes exceeds the maximum processing number, and combines several processes when the given number of processes is less than the maximum processing number.
- 3. The multiprocessor device according to claim 1, wherein, when switching instructions and executing a new instruction, the overall control device: makes processing of the new instruction wait until processing that was executed under the instruction before switching in the same processing order as the new instruction has finished; or, when the register write position of processing executed in the same processing order under the instruction before switching equals the register read position of the new instruction, makes processing of the new instruction wait until that processing has finished; or, when processing that was executed in the same processing order as the new instruction and issued earlier than a pre-specified number of instructions has not finished, makes processing of the new instruction wait until that processing has finished.
- 4. The multiprocessor device according to claim 1, wherein the overall control device extracts from a given instruction a relative shift amount relating to the processing order of each processor, gives the shift amount to the multiplexer, and, when the shift amount is not an integer multiple of the number of processors, instructs that addressing of the register memory be performed twice only at the beginning; and
the multiplexer shifts and extracts data according to the shift amount from data obtained by the addressing of the register memory and data obtained by past addressing, and gives the extracted data to the plurality of processors.
- 5. The multiprocessor device according to claim 1, wherein the processor generates a flag indicating a branch condition from a given instruction and each operation result, combines it, according to the instruction, with a plurality of branch flags stored in the register memory, and stores the result in the register memory as a new branch flag; and
the processor determines, from a given instruction and the plurality of branch flags stored in the register memory, whether to write an operation result to the register memory or whether to move to a designated instruction.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/317,183 US10754818B2 (en) | 2014-08-12 | 2015-08-05 | Multiprocessor device for executing vector processing commands |
JP2016542545A JP6551751B2 (ja) | 2014-08-12 | 2015-08-05 | マルチプロセッサ装置 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014-164137 | 2014-08-12 | ||
JP2014164137 | 2014-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016024508A1 true WO2016024508A1 (ja) | 2016-02-18 |
Family
ID=55304138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/072246 WO2016024508A1 (ja) | 2014-08-12 | 2015-08-05 | マルチプロセッサ装置 |
Country Status (3)
Country | Link |
---|---|
US (1) | US10754818B2 (ja) |
JP (1) | JP6551751B2 (ja) |
WO (1) | WO2016024508A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019208566A1 (ja) * | 2018-04-24 | 2019-10-31 | ArchiTek株式会社 | プロセッサ装置 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7476676B2 (ja) * | 2020-06-04 | 2024-05-01 | 富士通株式会社 | 演算処理装置 |
WO2022139646A1 (en) * | 2020-12-23 | 2022-06-30 | Imsys Ab | A novel data processing architecture and related procedures and hardware improvements |
US20220207148A1 (en) * | 2020-12-26 | 2022-06-30 | Intel Corporation | Hardening branch hardware against speculation vulnerabilities |
CN114553700B (zh) * | 2022-02-24 | 2024-06-28 | 树根互联股份有限公司 | 设备分组方法、装置、计算机设备及存储介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030070059A1 (en) * | 2001-05-30 | 2003-04-10 | Dally William J. | System and method for performing efficient conditional vector operations for data parallel architectures |
JP2008217061A (ja) * | 2007-02-28 | 2008-09-18 | Ricoh Co Ltd | Simd型マイクロプロセッサ |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH077385B2 (ja) * | 1983-12-23 | 1995-01-30 | 株式会社日立製作所 | データ処理装置 |
US5790879A (en) * | 1994-06-15 | 1998-08-04 | Wu; Chen-Mie | Pipelined-systolic single-instruction stream multiple-data stream (SIMD) array processing with broadcasting control, and method of operating same |
US5513366A (en) * | 1994-09-28 | 1996-04-30 | International Business Machines Corporation | Method and system for dynamically reconfiguring a register file in a vector processor |
JP3971535B2 (ja) * | 1999-09-10 | 2007-09-05 | 株式会社リコー | Simd型プロセッサ |
US6892361B2 (en) * | 2001-07-06 | 2005-05-10 | International Business Machines Corporation | Task composition method for computer applications |
US8041929B2 (en) * | 2006-06-16 | 2011-10-18 | Cisco Technology, Inc. | Techniques for hardware-assisted multi-threaded processing |
US7627744B2 (en) | 2007-05-10 | 2009-12-01 | Nvidia Corporation | External memory accessing DMA request scheduling in IC of parallel processing engines according to completion notification queue occupancy level |
JP5049802B2 (ja) * | 2008-01-22 | 2012-10-17 | 株式会社リコー | 画像処理装置 |
US20100115233A1 (en) * | 2008-10-31 | 2010-05-06 | Convey Computer | Dynamically-selectable vector register partitioning |
US8542732B1 (en) * | 2008-12-23 | 2013-09-24 | Elemental Technologies, Inc. | Video encoder using GPU |
US8112551B2 (en) * | 2009-05-07 | 2012-02-07 | Cypress Semiconductor Corporation | Addressing scheme to allow flexible mapping of functions in a programmable logic array |
JP6081300B2 (ja) * | 2013-06-18 | 2017-02-15 | 株式会社東芝 | 情報処理装置及びプログラム |
-
2015
- 2015-08-05 WO PCT/JP2015/072246 patent/WO2016024508A1/ja active Application Filing
- 2015-08-05 JP JP2016542545A patent/JP6551751B2/ja active Active
- 2015-08-05 US US15/317,183 patent/US10754818B2/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030070059A1 (en) * | 2001-05-30 | 2003-04-10 | Dally William J. | System and method for performing efficient conditional vector operations for data parallel architectures |
JP2008217061A (ja) * | 2007-02-28 | 2008-09-18 | Ricoh Co Ltd | Simd型マイクロプロセッサ |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019208566A1 (ja) * | 2018-04-24 | 2019-10-31 | ArchiTek株式会社 | プロセッサ装置 |
JPWO2019208566A1 (ja) * | 2018-04-24 | 2021-04-08 | ArchiTek株式会社 | プロセッサ装置 |
JP7061742B2 (ja) | 2018-04-24 | 2022-05-02 | ArchiTek株式会社 | プロセッサ装置 |
US11500632B2 (en) | 2018-04-24 | 2022-11-15 | ArchiTek Corporation | Processor device for executing SIMD instructions |
Also Published As
Publication number | Publication date |
---|---|
US20170116153A1 (en) | 2017-04-27 |
US10754818B2 (en) | 2020-08-25 |
JP6551751B2 (ja) | 2019-07-31 |
JPWO2016024508A1 (ja) | 2017-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9830156B2 (en) | Temporal SIMT execution optimization through elimination of redundant operations | |
US8108659B1 (en) | Controlling access to memory resources shared among parallel synchronizable threads | |
RU2427895C2 (ru) | Оптимизированная для потоков многопроцессорная архитектура | |
US8086806B2 (en) | Systems and methods for coalescing memory accesses of parallel threads | |
CN117724763A (zh) | 用于矩阵操作加速器的指令的装置、方法和系统 | |
JP2020518042A (ja) | 処理装置と処理方法 | |
WO2016024508A1 (ja) | マルチプロセッサ装置 | |
US20110231616A1 (en) | Data processing method and system | |
US8438370B1 (en) | Processing of loops with internal data dependencies using a parallel processor | |
JP6493088B2 (ja) | 演算処理装置及び演算処理装置の制御方法 | |
US8370845B1 (en) | Method for synchronizing independent cooperative thread arrays running on a graphics processing unit | |
US20220043770A1 (en) | Neural network processor, chip and electronic device | |
US20210166156A1 (en) | Data processing system and data processing method | |
CN102012802B (zh) | 面向向量处理器数据交换的方法及装置 | |
US8473948B1 (en) | Method for synchronizing independent cooperative thread arrays running on a graphics processing unit | |
Lin et al. | swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer | |
US9477628B2 (en) | Collective communications apparatus and method for parallel systems | |
US8090762B2 (en) | Efficient super cluster implementation for solving connected problems in a distributed environment | |
US11640302B2 (en) | SMID processing unit performing concurrent load/store and ALU operations | |
US10997277B1 (en) | Multinomial distribution on an integrated circuit | |
CN112506853A (zh) | 零缓冲流水的可重构处理单元阵列及零缓冲流水方法 | |
CN112463218A (zh) | 指令发射控制方法及电路、数据处理方法及电路 | |
CN110766150A (zh) | 一种深度卷积神经网络硬件加速器中的区域并行数据载入装置及方法 | |
Raju et al. | Performance enhancement of CUDA applications by overlapping data transfer and Kernel execution | |
US11392667B2 (en) | Systems and methods for an intelligent mapping of neural network weights and input data to an array of processing cores of an integrated circuit |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15832289 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2016542545 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15317183 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15832289 Country of ref document: EP Kind code of ref document: A1 |