WO2020053618A1 - A processor and a method of operating a processor - Google Patents
A processor and a method of operating a processor Download PDFInfo
- Publication number
- WO2020053618A1 (PCT/IB2018/056875)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- core
- processor
- cores
- address
- input
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
Definitions
- the disclosure relates generally to computer processors and specifically to a processor which may lack a central CU (Control Unit) and an associated method.
- Modern computer processors or Central Processing Units may comprise plural cores.
- a Control Unit (CU) is usually configured to direct the operation of the plural cores.
- the number of cores which a processor has may also increase, even increase exponentially.
- the Applicant envisages that future processors may have millions or billions of cores (or similarly configured processing sub-units).
- a problem with having a massive number (e.g., billions) of cores in a processor is that there may be considerable overhead in using a centralised CU which conventionally would load code (e.g., opcode) and cause the cores to execute it. On such a vast scale, such loading of code is time-consuming, and potentially too time-consuming to operate the cores efficiently.
- a scaled down architecture with a smaller number of cores can be used.
- An example instruction set as well as a centralised CU may demonstrate the bottleneck.
- a core can read another core’s value, and a core only executes opcode, usually with a very small input size, for example, “A + B” or “A > B”.
- a more complex input size or instruction, e.g. “A + B * (C / D)”, cannot be executed and is not a valid instruction; it has to be broken down into smaller components.
- a symbol may be an address, a literal, or a dereference instruction. Dereferencing will load the value at the symbol and use it as the memory address for the context of the instruction in which it is used.
- Every row is an execution cycle of the processor. Every column is a core in the processor and there are 10 cores. A core may retain its internal value every cycle. In this background example, every core is an unsigned integer:
- Cycle 0, C0: loads the literal 10 into memory at address C0.
- Cycle 1, C0: copies the value of C0 (this is technically pointless because the value is already in the core, but it helps to explain the code’s behaviour).
- Cycle 1, C1: loads the literal 10 from C0 and multiplies it by 2, then stores the resulting value in C1.
- Cycle 2, C1: loads the value 10 from C0 and the value 20 from C1, multiplies them with one another, and stores the resulting value in C1.
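By way of illustration only, the background execution model above (rows as cycles, columns as cores, each core an unsigned integer) may be sketched as follows; the helper names and the 32-bit wrap-around are assumptions, not part of the background example:

```python
# Ten cores, each holding an unsigned integer; a core retains its value
# between cycles unless a micro-operation writes it this cycle.
cores = [0] * 10

def cycle(ops):
    """Apply one execution cycle; `ops` maps a core index to the function
    computing that core's new value from the previous cycle's values."""
    previous = list(cores)
    for idx, fn in ops.items():
        cores[idx] = fn(previous) & 0xFFFFFFFF  # assumed 32-bit unsigned wrap

cycle({0: lambda c: 10})                 # Cycle 0: literal 10 into C0
cycle({0: lambda c: c[0],                # Cycle 1: C0 copies its own value
       1: lambda c: c[0] * 2})           #          C1 = C0 * 2
cycle({1: lambda c: c[0] * c[1]})        # Cycle 2: C1 = C0 * C1
```

After the third call, C1 holds 10 × 20 = 200, matching the worked table.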
- FIG. 1 is an extract from FIG. 3 of PCT/IB2018/054669 and it illustrates a proposed multi-core processor architecture and is still considered as background to the present disclosure although it may not rate as prior art as it may not be publicly available at the priority date of this present application.
- every core may be assigned an index in a sequential order, i.e. C0, C1, C2, etc. Every core comprises two buffers: the first “A” buffer and the second “B” buffer, as well as the core’s own value.
- the core label itself will be used to reference this buffer (memory unit).
- Buffer A may be configured always to write its value into an associated logic unit, e.g., an Arithmetic Logic Unit (ALU).
- Buffer B may also be configured to output its value to the ALU. Both buffer A and buffer B may be responsive to a read control signal.
- a memory unit has two control signals: read and write. The read and write operations are defined from the point of view of the buffer itself. This means that if buffer A is instructed to read, it takes the value from the bus into its internal storage; a write operation writes the internal value to the configured output, so in the case of the memory unit a write outputs its value onto the bus.
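The buffer-centric read/write semantics just described may be sketched as follows (the class and attribute names are illustrative assumptions, not part of the disclosure):

```python
class Bus:
    """A single shared value standing in for the communication bus."""
    def __init__(self):
        self.value = 0

class MemoryBuffer:
    """Read and write are defined from the buffer's own point of view:
    'read' takes the value from the bus into internal storage; 'write'
    drives the internal value onto the configured output (here, the bus)."""
    def __init__(self):
        self.value = 0

    def read(self, bus):
        self.value = bus.value   # bus -> internal storage

    def write(self, bus):
        bus.value = self.value   # internal storage -> bus
```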
- Another important function of the CU is to assign the ALU its operation; that is, the CU tells the core which arithmetic operation it should do, e.g., addition, subtraction, multiplication or division. This may be notated as Cx.Operator, e.g., C0.Sub. Using this naming convention, operations may be illustrated as follows:
- Using the notation of Table 5, the opcode of Tables 3-4 can be translated into micro-instructions:
- Control unit: During every execution cycle the CU iterates over every core, one by one, in a sequential manner, executing that specific core’s opcode, which means that only one core is operating at any given time.
- The bus: Because only one core can use the bus at a time, only one core can operate at a time, which is a huge bottleneck.
- the Applicant desires a processor which may overcome or ameliorate the abovementioned drawbacks and may enable processors with a large number of cores.
- the disclosure provides a processor which comprises: a plurality of cores, each core comprising: at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit, wherein the logic unit of each core is configured to perform or execute only one type of operation; a communication bus configured to interconnect the cores; and a plurality of address modules respectively associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer, wherein each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.
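The claimed structure of a single core may be sketched as a software model; the disclosure describes hardware, so the names below (`Core`, `fetch`, `execute`) are illustrative assumptions, with the one fixed operation supplied at construction:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Core:
    """One disclosed core: input buffers, a logic unit fixed to a single
    type of operation, a memory unit at the logic unit's output, and
    address modules pointing at where the input values live."""
    op: Callable[[int, int], int]                             # the ONE operation
    addr: List[int] = field(default_factory=lambda: [0, 0])   # address modules
    buffers: List[int] = field(default_factory=lambda: [0, 0])
    memory: int = 0                                           # memory unit

    def fetch(self, mem_read):
        # Each cycle, fetch the values at the stored addresses into the buffers.
        self.buffers = [mem_read(a) for a in self.addr]

    def execute(self):
        # Calculate the output from the inputs and the fixed operation type.
        self.memory = self.op(self.buffers[0], self.buffers[1])
```

Because `op` is fixed when the core is constructed, no opcode is ever communicated to the core at runtime, mirroring the claim.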
- the plurality of address modules may embody a distributed control system.
- the type of operation may be in the form of an opcode or segments of opcode.
- each core may be configured statically, or statically pre-defined, to execute only one opcode.
- the cores may not be re-configurable in the type of operation which they can execute.
- the input value(s) may need to be communicated to a core which has been configured to perform that instruction, without communicating the instruction itself to the core.
- the values and the corresponding opcode may be communicated to any core, which may then load the opcode and execute it with respect to the input values.
- Each core may include plural input buffers.
- Each core may include two input buffers.
- the logic unit may be, or may include, an Arithmetic Logic Unit (ALU), a Floating-Point Unit (FPU), and/or a Graphics Processing Unit (GPU).
- the address modules supply the cores with addresses of the input values.
- the address modules of the present disclosure do not supply opcode, and are thus different from a conventionally configured CU which does supply opcode. In other words, in the present disclosure, the need to assign opcode to each core may be eliminated.
- Each core may include a core identifier.
- Each core, or each core in a group of cores, may be sequential or cyclical. That is, the cores may be configured to execute in a sequence.
- the core identifier may indicate a sequence of the cores.
- Each core may include an execution complete flag.
- the execution complete flag may be configured to indicate whether or not that core has executed or completed an instruction within a particular cycle. If the cores are sequential, each core may be configured to execute its instruction only if one or more cores earlier in the sequence have already executed their instruction. Accordingly, a particular core may only execute if one or more cores earlier in the sequence have their execution complete flags set as true. Conversely, a particular core may not execute, or temporarily skip or suspend execution, if one or more cores earlier in the sequence has their execution complete flag set as false.
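The sequential gating just described (a core executes only once every earlier core's execution-complete flag is true) may be sketched as follows; the dict representation of a core is an illustrative assumption:

```python
def try_execute(cores, i):
    """Core i executes only if every core earlier in the sequence has its
    execution-complete flag set as true; otherwise it temporarily skips
    or suspends execution and retries on a later cycle."""
    if all(c["done"] for c in cores[:i]):
        cores[i]["done"] = True   # executed; flag now visible to later cores
        return True
    return False                  # skipped this cycle
```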
- Each address module may comprise an address or pointer to a memory location.
- Each address module may include as many addresses or pointers as each core has input buffers. Where the core comprises two input buffers, each address module may comprise two addresses, one for each input buffer.
- Each core may be configured to fetch a value from the memory location addressed by the address module and place that value into the input buffer. The core may be configured to fetch the value at the beginning of, or during, each execution cycle.
- the communication bus may comprise two portions, a first portion to communicate values and a second portion to communicate addresses pointing to the values.
- the first portion of the communication bus may be coupled to the input buffer(s) of each core.
- the second portion of the communication bus may be coupled to the address modules of each core.
- the first and second portions of the communication bus may be logical portions or physical portions.
- the cores may be divided into groups.
- the groups may be characterised as resource groups.
- the cores within a resource group may be configured to perform the same or similar types of operations.
- the cores of different resource groups may be configured to perform different types of operations.
- the communication bus may also be divided into groups or sub-buses.
- a sub-bus may link cores within a resource group and then a larger sub-bus may link resource groups. This hierarchy of sub-buses may occur iteratively or exponentially, with groups of resource groups being connected by an even larger bus, and so forth.
- a resource group may comprise four cores (or quadrants), with a lowest level sub-bus interconnecting the four cores.
- the sub-bus may have a cross configuration (whether physical or logical).
- Four of these resource groups may be grouped together, by a second-lowest level sub-bus which may also have a cross pattern, and four of these groups of resource groups may be grouped together, and so forth.
- For example, there may thus be 4^n cores and n levels of sub-buses.
- the grouping of cores and division of the communication bus may be done in any practicable manner and need not necessarily be quadrangular or even symmetrical.
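Under the quadrant example above, the relationship between sub-bus levels and core count may be expressed as a one-line helper (a sketch only; as noted, the grouping need not be quadrangular, so the fanout is a parameter):

```python
def cores_for_levels(levels, fanout=4):
    """With n levels of sub-buses and a uniform fanout, fanout**n cores
    are interconnected: 4, 16, 64, ... for the quadrant (fanout-4) case."""
    return fanout ** levels
```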
- a gateway junction may be provided between adjacent levels of the sub-buses, e.g., between a larger, top level sub-bus and a smaller, second level sub-bus. This may result in a tree-like structure or system. Every resource group may have a gateway junction connecting it to a higher level sub-bus. By doing this, multiple resource groups may be cycled or executed simultaneously. A resource group may only be required to wait for time on the communication bus if it requests values or data from another resource group.
- An operating system which operates the processor may be configured to address the cores dynamically to reflect available cores and resources. This way the operating system can keep a computer program’s execution in as few as possible resource groups, maximising performance.
- each core may be configured to trigger or invoke a subsequent core or resource group, e.g., in a cascading arrangement. It may be acceptable for more than one core to execute at the same time, as long as they all are in cyclic sync, e.g., every core must cycle the same amount. So, if one core has cycled, it must wait for every other core to cycle as well, before moving on to the next cycle, otherwise it will violate concurrency considerations (see PCT/IB2018/054669).
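The cyclic-sync rule above (every core must cycle the same amount before any core moves to the next cycle) may be sketched with a software barrier standing in for the hardware chaining; the representation of a core as a callable is an assumption:

```python
def run_in_sync(cores, cycles):
    """Advance every core exactly one cycle per iteration; the assertion
    checks the invariant that no core runs ahead of the others."""
    counts = [0] * len(cores)
    for _ in range(cycles):
        for i, core in enumerate(cores):  # each core cycles exactly once
            core()
            counts[i] += 1
        assert len(set(counts)) == 1      # cyclic sync: equal cycle counts
    return counts
```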
- the disclosure extends to a method of operating a processor, the method comprising: providing a plurality of cores, each core comprising: at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit; and executing, by the logic unit of each core, only one type of operation.
- the disclosure extends to a non-transitory computer-readable medium which has stored thereon a set of instructions which, when executed by a computer processor, causes the computer processor to perform the method defined above.
- FIG. 1 shows an extract from FIG. 3 of PCT/IB2018/054669;
- FIG. 2 shows a schematic view of a processor comprising at least one core in accordance with the present disclosure;
- FIG. 3 shows a schematic view of a first resource group comprising plural cores of FIG. 1;
- FIG. 4 shows a schematic view of a second resource group comprising plural cores of FIG. 1;
- FIG. 5 shows a schematic view of a pattern of resource groups of FIG. 4;
- FIG. 6 shows a schematic view of the cores of FIG. 3 in a chain; and
- FIG. 7 shows a flow diagram of a method of operating a processor in accordance with the present disclosure.
- FIG. 2 illustrates a processor 100 which has plural cores 102. Only a single core 102 (core n) is illustrated in FIG. 2.
- the processor 100 has a communication bus 104 configured to interconnect the cores 102.
- the core 102 has two input buffers 106, 108, namely buffer A 106 and buffer B 108.
- the core 102 has a logic unit in the form of an ALU 110 which has an input and an output, wherein the input is in communication with the input buffers 106, 108 and the output is in communication with a memory unit or cell 112.
- the ALU 110 is configured to perform or execute only one type of operation or a specific piece of opcode. This is a first aspect in which the present disclosure differs from conventional processors or cores which are configured to execute an operation defined by a CU based on opcode supplied by the CU.
- the whole program can be loaded into the available cores 102 which eliminates the need to change the memory unit 112 of the cores 102 during runtime, which in turn eliminates the need to load new addresses into the cores 102 during execution.
- the number of instructions is still the same, but they are parallelised; the total number of cycles is still the same, as per Table 8:
- every core 102 may have an execution complete flag 114, indicating whether it has executed at least once. Then, whenever a core (e.g., C3) tries to read the value of another core (e.g., C2), it may skip one execution cycle if flag 114 is false. In this fashion, the order of operation may be correctly maintained.
- Tables 7-9 are also illustrative of methods of the present disclosure, as may be implemented by the processor 100 (refer also to FIG. 7).
- Because the processor 100 executes routines linearly, this may lead to the communication bus 104 being shared and to possible contention issues. Also, only one core 102 may be able to execute at a time. To address this, each core 102 may be made responsible for its own execution.
- the core 102 also includes address modules 122, 124.
- the address modules 122, 124 may correspond to the input buffers 106, 108.
- the core 102 may have a single address module configured to store plural addresses.
- Each address module 122, 124 is configured to store an address which points to a value to be communicated to its associated input buffer 106, 108.
- the inclusion of the address modules 122, 124 is another aspect which differentiates the present disclosure from conventional processors.
- the address modules 122, 124 may embody a distributed control system of the processor 100.
- core C2 may have the address of core C0 in I_A 122 and of core C1 in I_B 124 (or the addresses of the memory units 112 of those cores).
- the communication bus 104 may be modified to have two portions: the first portion being for the memory address and the second portion being for the value at that address. In this fashion, every core 102 may be able to write the values of their address modules 122, 124 respectively onto the communication bus 104 and, separately, then read values into the input buffers 106, 108.
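The two-portion bus just described may be sketched as follows; the shared `memory` mapping and the method name `transfer` are illustrative assumptions:

```python
class SplitBus:
    """One portion of the bus carries a memory address, the other the
    value stored at that address, so a core can write its address module's
    content and, separately, read the corresponding value into a buffer."""
    def __init__(self, memory):
        self.memory = memory   # the addressed storage (illustrative)
        self.address = None    # address portion of the bus
        self.value = None      # value portion of the bus

    def transfer(self, address):
        self.address = address              # core drives the address portion
        self.value = self.memory[address]   # responder drives the value portion
        return self.value
```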
- Because every core 102 can operate itself based on the content of its address modules 122, 124, the processor 100 may, during normal operation, not have to load any literals, opcodes, or data. Accordingly, machine code source size may no longer be a consideration, and there may no longer be a need for a centralised CU.
- the Applicant notes that an issue that may need to be overcome is that only one core 102 may use the bus 104 at a time. To address this issue, one or more groups of communications buses, groups of cores, and/or invocation chains may be implemented.
- FIGS 3-4 illustrate cores 102 grouped into resource groups 200. Again, only a few cores 102 are illustrated (11 in FIG. 3 and four in FIG. 4) but the number may be adjusted as desired or based on intended application.
- Resource groups 200 may be considered a systematic grouping of cores 102. By grouping cores 102 together, they can be tightly connected with their own internal group sub-bus 206 (which forms part of the communication bus 104), thereby allowing the grouped cores 102 to share data with one another quickly inside their own group 200.
- Groups 200 can be specialised for specific tasks; for example, a group designed for graphics processing may consist of more floating-point cores, a group optimised for string processing may have more string-manipulation cores, etc.
- the sub-bus 206 connects to a remainder of the bus 104 by a gateway junction 202. Alternatively, FIG. 4 could represent a grouping of four resource groups 200 of FIG. 3, rather than a grouping of four cores 102.
- a way to determine the cores 102 in a group 200 is to perform a frequency analysis, e.g., on existing software, like Linux™, or on a rendering engine, to determine which operation is performed most, and then, using the resulting statistics, to determine the distribution and grouping of the cores 102.
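The suggested frequency analysis may be sketched as follows; the instruction trace and the proportional-allocation policy are illustrative assumptions, not part of the disclosure:

```python
from collections import Counter

def core_distribution(trace, total_cores):
    """Count how often each operation appears in an instruction trace and
    allocate the available fixed-function cores proportionally (at least
    one core per observed operation)."""
    freq = Counter(trace)
    total = sum(freq.values())
    return {op: max(1, round(total_cores * n / total))
            for op, n in freq.items()}
```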
- the resource groups 200 can be interconnected or arrayed in a manner that allows any given resource group 200 to communicate with any other given group 200.
- FIG. 5 illustrates a fractal pattern which is an example, but other layouts are contemplated.
- FIG. 5 has 16 (4^2) groups 200 of cores 102, each group 200 having four cores 102.
- Each group 200 has a third level bus (B_3), with a second level bus (B_2) interconnecting the groups 200, and a first level bus (B_1) interconnecting the second level buses (B_2).
- Every resource group 200 may have a gateway junction connecting it to its sub-bus. By doing this, the individual resource groups 200 can be cycled at the same time. The only time a resource group 200 may wait for time on the bus is when it requests data from another resource group 200.
- In FIG. 4, an example of a small group 200 is illustrated. If a program does not need any information from an external source, all four cores 102 may safely be cycled at the same time because none of them will use the network bus to get data from another group. The same applies to groups of resource groups 200. If a core 102 in one group 200 requests data from another group 200, the rest of the resource groups 200 that are not trying to use the sub-bus 206 can still be cycled. Accordingly, two groups 200 may not hold up the rest of the processor 100 when one is waiting for data from the other.
- An operating system that loads a given program into the processor 100 may be configured to re-link the program to reflect the cores and resources actually available. This way, the operating system can keep a program’s execution in as few resource groups 200 as possible, maximising performance.
- Because the processor 100 lacks a central CU, there may be a need to cycle cores 102 and resource groups 200 in a synchronised manner to avoid violating concurrency requirements.
- individual cores 102 may be chained in a chain reaction manner, and similarly resource groups 200 may be chained in a cascading reaction manner.
- FIG. 6 illustrates such a chain 300.
- Each core 102 invokes the next core 102 in its respective chain 300 once it has completed its own execution cycle. This way, only one core 102 may use the internal sub-bus 206 at a time.
- each sub-bus (B_1 , B_2, B_3) can be thought of as a tree structure (root at the top, expanding down).
- the topmost bus (first level bus B_1) sends an invocation signal to the second level buses (B_2), which send invocation signals to the third level buses (B_3). This may iterate over and over until the lowest sub-bus is reached where the resource groups 200 are, which comprise the cores 102. This may mean that every core 102 gets executed by the same clock.
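The top-down propagation of the invocation signal through the tree of sub-buses may be sketched recursively; the dict representation of bus nodes and cores is an illustrative assumption:

```python
def invoke(node, executed):
    """A bus node forwards the invocation signal to each child in turn;
    a leaf (a core) executes and records its name. This mirrors the
    root-at-the-top tree structure of sub-buses B_1, B_2, B_3."""
    children = node.get("children", [])
    if not children:
        executed.append(node["name"])   # leaf: the core executes
        return
    for child in children:              # bus: cascade the invocation downward
        invoke(child, executed)
```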
- each core 102 may execute at the same time, as long as they all are in cyclic sync, e.g., every core 102 must cycle the same amount. If one core 102 has cycled, it may be configured to wait for every other core 102 to cycle as well, before moving on to the next cycle.
- FIG. 7 illustrates a method 400 implemented by the processor 100 as has been explained above in explaining the functionality of the processor 100.
- the processor 100 is provided (at block 402) with a plurality of cores 102, each of which is configured to perform a single type of operation. Addresses pointing to the values to be inputted to the cores 102 are stored (at block 404) in the address modules 122, 124. Then, as explained above, an output is calculated (at block 406) based on the input values to which the address modules 122, 124 point and on the type of operation which the logic unit 110 of the core 102 is configured to perform. This may be repeated (at block 408) as many times as required.
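The method 400 may be sketched end-to-end as follows; the two-core program re-creates the background example (10 × 2, then 10 × 20) with fixed-function cores and address modules, and all names are illustrative assumptions:

```python
def run(cores, cycles):
    """Blocks 404-408: each cycle, every core fetches the values its
    address modules point to (from a snapshot of the memory units taken
    at the start of the cycle), then applies its single fixed operation
    (block 406), repeating for the requested number of cycles."""
    for _ in range(cycles):
        snapshot = [c["mem"] for c in cores]         # memory units this cycle
        for c in cores:
            a, b = (snapshot[i] for i in c["addr"])  # addressed input values
            c["mem"] = c["op"](a, b)                 # the one fixed operation

# Block 402: cores configured once, each for a single type of operation.
cores = [
    {"op": lambda a, b: 10,    "addr": (0, 0), "mem": 10},  # constant core C0
    {"op": lambda a, b: a * b, "addr": (0, 1), "mem": 2},   # multiply core C1
]
run(cores, 2)   # after two cycles C1 holds 10 * 20 = 200
```

No opcode is ever moved over the bus at runtime; only addresses and values are, which is the point of the distributed control system.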
- Machine Code Size Bottleneck: eliminating the need to cycle different opcodes every cycle, by creating a processor where every line of machine code is executed in a single cycle, and by using only cores with a single function to eliminate the concept of opcodes.
- Control Unit Bottleneck: eliminating the need for a control unit by replacing it with a distributed control system, thereby removing the need for a linear execution cycle.
- Bus Usage Bottleneck: solving the problem of bus usage limitations by subdividing groups.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2018/056875 WO2020053618A1 (en) | 2018-09-10 | 2018-09-10 | A processor and a method of operating a processor |
ZA2021/01831A ZA202101831B (en) | 2018-09-10 | 2021-03-18 | A processor and a method of operating a processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2018/056875 WO2020053618A1 (en) | 2018-09-10 | 2018-09-10 | A processor and a method of operating a processor |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020053618A1 true WO2020053618A1 (en) | 2020-03-19 |
Family
ID=69776845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2018/056875 WO2020053618A1 (en) | 2018-09-10 | 2018-09-10 | A processor and a method of operating a processor |
Country Status (2)
Country | Link |
---|---|
WO (1) | WO2020053618A1 (en) |
ZA (1) | ZA202101831B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070192547A1 (en) * | 2005-12-30 | 2007-08-16 | Feghali Wajdi K | Programmable processing unit |
US20080086626A1 (en) * | 2006-10-05 | 2008-04-10 | Simon Jones | Inter-processor communication method |
US7958416B1 (en) * | 2005-11-23 | 2011-06-07 | Altera Corporation | Programmable logic device with differential communications support |
US20140122555A1 (en) * | 2012-10-31 | 2014-05-01 | Brian Hickmann | Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values |
US20150242322A1 (en) * | 2013-06-19 | 2015-08-27 | Empire Technology Development Llc | Locating cached data in a multi-core processor |
-
2018
- 2018-09-10 WO PCT/IB2018/056875 patent/WO2020053618A1/en active Application Filing
-
2021
- 2021-03-18 ZA ZA2021/01831A patent/ZA202101831B/en unknown
Non-Patent Citations (1)
Title |
---|
RANGER ET AL.: "Evaluating MapReduce for multi-core and multiprocessor systems", 2007 IEEE 13TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE, 14 February 2007 (2007-02-14), pages 13 - 24, XP055111222 * |
Also Published As
Publication number | Publication date |
---|---|
ZA202101831B (en) | 2022-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107347253B (en) | Hardware instruction generation unit for special purpose processor | |
JP5035277B2 (en) | A locking mechanism that allows atomic updates to shared memory | |
US8544019B2 (en) | Thread queueing method and apparatus | |
US6237021B1 (en) | Method and apparatus for the efficient processing of data-intensive applications | |
US20120066668A1 (en) | C/c++ language extensions for general-purpose graphics processing unit | |
CN104050033A (en) | System and method for hardware scheduling of indexed barriers | |
US20110072249A1 (en) | Unanimous branch instructions in a parallel thread processor | |
US9110692B2 (en) | Method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system | |
CN103197916A (en) | Methods and apparatus for source operand collector caching | |
CN103226463A (en) | Methods and apparatus for scheduling instructions using pre-decode data | |
US20210232394A1 (en) | Data flow processing method and related device | |
CN103885893A (en) | Technique For Accessing Content-Addressable Memory | |
CN103279379A (en) | Methods and apparatus for scheduling instructions without instruction decode | |
CN113326066B (en) | Quantum control microarchitecture, quantum control processor and instruction execution method | |
US9378533B2 (en) | Central processing unit, GPU simulation method thereof, and computing system including the same | |
CN104050032A (en) | System and method for hardware scheduling of conditional barriers and impatient barriers | |
CN103885902A (en) | Technique For Performing Memory Access Operations Via Texture Hardware | |
CN112580792B (en) | Neural network multi-core tensor processor | |
CN103885903A (en) | Technique For Performing Memory Access Operations Via Texture Hardware | |
CN116783578A (en) | Execution matrix value indication | |
CN103294449A (en) | Pre-scheduled replays of divergent operations | |
WO2020053618A1 (en) | A processor and a method of operating a processor | |
CN112463218B (en) | Instruction emission control method and circuit, data processing method and circuit | |
CN116401039A (en) | Asynchronous memory deallocation | |
Sakai et al. | Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18933547 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18933547 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/09/2021) |
|