WO2020053618A1 - A processor and a method of operating a processor - Google Patents

A processor and a method of operating a processor Download PDF

Info

Publication number
WO2020053618A1
Authority
WO
WIPO (PCT)
Prior art keywords
core
processor
cores
address
input
Prior art date
Application number
PCT/IB2018/056875
Other languages
French (fr)
Inventor
Emile BADENHORST
Original Assignee
Badenhorst Emile
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Badenhorst Emile filed Critical Badenhorst Emile
Priority to PCT/IB2018/056875 priority Critical patent/WO2020053618A1/en
Publication of WO2020053618A1 publication Critical patent/WO2020053618A1/en
Priority to ZA2021/01831A priority patent/ZA202101831B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution


Abstract

A processor comprises a plurality of cores and lacks a conventional CU (Control Unit). Each core comprises at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit. The logic unit of each core is configured to perform or execute only one type of operation. A plurality of address modules are associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer. Each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.

Description

A Processor and a Method of Operating a Processor
FIELD OF DISCLOSURE
The disclosure relates generally to computer processors and specifically to a processor which may lack a central CU (Control Unit) and an associated method.
BACKGROUND OF DISCLOSURE
Modern computer processors or Central Processing Units (CPUs) may comprise plural cores. A Control Unit (CU) is usually configured to direct the operation of the plural cores. As miniaturisation and device technologies advance, the number of cores which a processor has may also increase, even increase exponentially. The Applicant envisages that future processors may have millions or billions of cores (or similarly configured processing sub-units).
A problem with having a massive number (e.g., billions) of cores in a processor is that there may be considerable overhead in using a centralised CU which conventionally would load code (e.g., opcode) and cause the cores to execute it. On such a vast scale, such loading of code is time-consuming and potentially too time-consuming to operate the cores efficiently. The Applicant notes that a massive multi-core processor architecture is described in its previous PCT application no. PCT/IB2018/054669.
To explain the background to the problem of controlling a large number of cores, a scaled down architecture with a smaller number of cores (e.g., 10) can be used. An example instruction set as well as a centralised CU may demonstrate the bottleneck. Further, a core can read another core’s value, and a core only executes opcode, usually with a very small input size, for example, “A + B” or “A > B”. For normal opcode, a more complex input or instruction, e.g. “A + B * (C / D)”, cannot be executed and is not a valid instruction; it has to be broken down into smaller components.
Consider the following instructions:
Table 1 [rendered as an image in the original publication]
A symbol may be an address, a literal, or a dereference instruction. Dereferencing will load the value at the symbol and use it as the memory address for the context of the instruction in which it is used.
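As a rough illustration of dereferencing, consider a toy memory model (a Python sketch; the memory layout and names are illustrative assumptions, not part of the instruction set above):

memory = {0: 7, 7: 42}           # address 0 holds 7; address 7 holds 42
symbol = 0
direct = memory[symbol]          # using the symbol directly as an address yields 7
deref = memory[memory[symbol]]   # dereference: the value at the symbol (7) becomes the address, yielding 42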
Using the Instruction Set (IS) defined above, an architecture following the concurrent solution outlined in PCT/IB2018/054669 can be created. For example, every row is an execution cycle of the processor. Every column is a core in the processor and there are 10 cores. A core may retain its internal value every cycle. In this background example, every core is an unsigned integer:
Table 2 [rendered as an image in the original publication]
Taking the following code sample:
let bar(x : u32, y : u32) = x * y;
let foo(x : u32) = bar(x, x * 2);
let out = foo(10);
and converting this to opcodes will look like:
Table 3 [rendered as an image in the original publication]
Executing the code in Table 3 yields:
Table 4 [rendered as an image in the original publication]
This execution is described as follows:
Cycle 0:
C0: Literal 10 loads into memory at address C0.
Cycle 1:
C0: Copies the value of C0 (this is technically pointless because the value is already in the core, but it helps to explain the code’s behaviour).
C1: Loads the literal 10 from C0 and multiplies it by 2, then stores the resulting value in C1.
Cycle 2:
C1: Loads the value 10 from C0 and the value 20 from C1, multiplies them with one another, and stores the resulting value in C1.
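This cycle-by-cycle execution can be mimicked with a short simulation (a Python sketch; purely illustrative, assuming ten unsigned-integer cores driven one cycle at a time by the centralised CU described above):

cores = [0] * 10                     # every core holds an unsigned integer value

def cycle0():
    cores[0] = 10                    # C0: literal 10 loads into C0

def cycle1():
    cores[0] = cores[0]              # C0: the technically pointless copy
    cores[1] = cores[0] * 2          # C1: 10 * 2 = 20

def cycle2():
    cores[1] = cores[0] * cores[1]   # C1: 10 * 20 = 200

for step in (cycle0, cycle1, cycle2):
    step()                           # the CU drives one cycle at a time
print(cores[1])                      # 200, i.e. out = foo(10)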
FIG. 1 is an extract from FIG. 3 of PCT/IB2018/054669 and it illustrates a proposed multi-core processor architecture and is still considered as background to the present disclosure although it may not rate as prior art as it may not be publicly available at the priority date of this present application.
In FIG. 1, every core may be assigned an index in a sequential order, i.e. C0, C1, C2, etc. Every core comprises two buffers, the first “A” buffer and the second “B” buffer, as well as the core’s own value; the core label itself will be used to reference this buffer (memory unit). Buffer A may be configured always to write its value into an associated logic unit, e.g., an Arithmetic Logic Unit (ALU). Buffer B may also be configured to output its value to the ALU. Both buffer A and buffer B may be responsive to a read control signal. A memory unit has two control signals, read and write, operated from the point of view of the buffer itself. This means that if buffer A is instructed to read, it takes the value from the bus and reads it into its internal storage, while a write operation writes the internal value into the configured output; in the case of the memory unit, the write operation outputs its value onto the bus.
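These buffer semantics can be restated in code (a Python sketch; the class and names are illustrative assumptions, not the notation of PCT/IB2018/054669):

class MemoryUnit:
    def __init__(self, value=0):
        self.value = value
    def read(self, bus_value):    # read: take the value from the bus into internal storage
        self.value = bus_value
    def write(self):              # write: push the internal value to the configured output
        return self.value         # for the memory unit, that output is the bus itself

mem, buf_a = MemoryUnit(10), MemoryUnit()
bus = mem.write()    # the memory unit writes its value onto the bus
buf_a.read(bus)      # buffer A reads the bus into its internal storage (and feeds the ALU)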
Another important function is for the CU to assign the ALU its operation, that is, the CU tells the core which arithmetic operation it should do, e.g., addition, subtraction, multiplication or division. This may be notated as Cx.Operator, e.g., C0.Sub. Using this naming convention, operations may be illustrated as follows:
Table 5 [rendered as an image in the original publication]
Using the notation of Table 5, the opcode of Tables 3-4 can be translated into micro-instructions:
Table 6 [rendered as an image in the original publication]
The following observations may be made about the above process:
Linear nature of the Control Unit: During every execution cycle the CU iterates over every core, one by one, in a sequential manner, executing that specific core’s opcode, which means that only one core is operating at any given time.
Machine code source size: When a processor has a large number of cores, the executing code can be extremely large. For example, if every opcode is 3 bytes in size and there are 2^32 − 1 cores, and assuming every core is used every execution cycle, a single execution cycle’s machine code size is 3 × 2^32 = 12,884,901,888 bytes ≈ 12.88 GB. Loading ~13 GB of machine code every execution cycle may not be practical and may result in slow execution cycles (see the short check after this list).
The bus: Because only one core can use the bus at a time, only one core can operate at a time, which is a huge bottleneck.
Core functions: All the cores have the same instruction set, which is therefore very large and generalised, resulting in slower speed and greater complexity.
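The machine-code-size estimate above is easy to verify (a Python sketch of the arithmetic in the text):

opcode_bytes = 3
num_cores = 2 ** 32                  # one opcode per core, every execution cycle
per_cycle = opcode_bytes * num_cores
print(per_cycle)                     # 12884901888 bytes
print(per_cycle / 10 ** 9)           # ~12.88 GB of machine code per execution cycle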
The Applicant desires a processor which may overcome or ameliorate the abovementioned drawbacks and may enable processors with a large number of cores.
SUMMARY OF DISCLOSURE
Accordingly, the disclosure provides a processor which comprises: a plurality of cores, each core comprising: at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit, wherein the logic unit of each core is configured to perform or execute only one type of operation; a communication bus configured to interconnect the cores; and a plurality of address modules respectively associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer, wherein each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.
Conceptually, the plurality of address modules may embody a distributed control system. The type of operation may be in the form of an opcode or segments of opcode. Accordingly, each core may be configured statically, or statically pre-defined, to execute only one opcode. The cores may not be re-configurable in the type of operation which they can execute.
Accordingly, to execute a given instruction or opcode, the input value(s) may need to be communicated to a core which has been configured to perform that instruction, without communicating the instruction itself to the core. This is in contrast with a conventional processor in which the values and the corresponding opcode may be communicated to any core, which may then load the opcode and execute it with respect to the input values.
Each core may include plural input buffers. Each core may include two input buffers.
The logic unit may be, or may include, an Arithmetic Logic Unit (ALU), a Floating- Point Unit (FPU), and/or a Graphics Processing Unit (GPU).
It will be noted that, contrary to conventional processor architecture where a CU supplies each core with input value(s) and with opcode, in the present disclosure, there is no CU. The address modules supply the cores with addresses of the input values. The address modules of the present disclosure do not supply opcode, and are thus different from a conventionally configured CU which does supply opcode. In other words, in the present disclosure, the need to assign opcode to each core may be eliminated.
Each core may include a core identifier. Each core, or each core in a group of cores, may be sequential or cyclical. That is, the cores may be configured to execute in a sequence. The core identifier may indicate a sequence of the cores. Each core may include an execution complete flag. The execution complete flag may be configured to indicate whether or not that core has executed or completed an instruction within a particular cycle. If the cores are sequential, each core may be configured to execute its instruction only if one or more cores earlier in the sequence have already executed their instruction. Accordingly, a particular core may only execute if one or more cores earlier in the sequence have their execution complete flags set as true. Conversely, a particular core may not execute, or temporarily skip or suspend execution, if one or more cores earlier in the sequence have their execution complete flags set as false.
Each address module may comprise an address or pointer to a memory location. Each address module may include as many addresses or pointers as each core has input buffers. Where the core comprises two input buffers, each address module may comprise two addresses, one for each input buffer. Each core may be configured to fetch a value from the memory location addressed by the address module and place that value into the input buffer. The core may be configured to fetch the value at the beginning of, or during, each execution cycle.
The communication bus may comprise two portions, a first portion to communicate values and a second portion to communicate addresses pointing to the values. The first portion of the communication bus may be coupled to the input buffer(s) of each core. The second portion of the communication bus may be coupled to the address modules of each core. The first and second portions of the communication bus may be logical portions or physical portions.
The cores may be divided into groups. The groups may be characterised as resource groups. The cores with a resource group may be configured to perform the same or similar types of operations. The cores of different resource groups may be configured to perform different types of operations. The communication bus may also be divided into groups or sub-buses. A sub- bus may link cores within a resource group and then a larger sub-bus may link resource groups. This hierarchy of sub-buses may occur iteratively or exponentially, with groups of resource groups being connected by an even larger bus, and so forth.
For example, a resource group may comprise four cores (or quadrants), with a lowest level sub-bus interconnecting the four cores. The sub-bus may have a cross configuration (whether physical or logical). Four of these resource groups may be grouped together by a second-lowest level sub-bus which may also have a cross pattern, and four of these groups of resource groups may be grouped together, and so forth. For example, there may thus be 4^n cores, n levels of sub-buses, and a corresponding number of individual sub-buses [the exact expression is rendered as an image in the original publication]. The grouping of cores and division of the communication bus may be done in any practicable manner and need not necessarily be quadrangular or even symmetrical.
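Under the quad-grouping assumption, the bus counts at each level can be tabulated (a Python sketch; the closed form (4^n − 1)/3 is an inference from the described pattern, not stated in the original):

for n in range(1, 6):
    # one top-level bus, four below it, and so on down to 4**(n-1) lowest-level buses
    per_level = [4 ** (k - 1) for k in range(1, n + 1)]
    total = sum(per_level)
    assert total == (4 ** n - 1) // 3    # geometric series
    print(n, per_level, total)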
A gateway junction may be provided between adjacent levels of the sub-buses, e.g., between a larger, top level sub-bus and a smaller, second level sub-bus. This may result in a tree-like structure or system. Every resource group may have a gateway junction connecting it to a higher level sub-bus. By doing this, multiple resource groups may be cycled or executed simultaneously. A resource group may only be required to wait for time on the communication bus if it requests values or data from another resource group.
An operating system which operates the processor may be configured to address the cores dynamically to reflect available cores and resources. This way the operating system can keep a computer program’s execution in as few as possible resource groups, maximising performance.
As the processor may not include a CU, there may be a need to cycle the cores (and resource groups, if present) in a synchronised manner to avoid concurrency violations (see PCT/IB2018/054669). Accordingly, each core may be configured to trigger or invoke a subsequent core or resource group, e.g., in a cascading arrangement. It may be acceptable for more than one core to execute at the same time, as long as they all are in cyclic sync, e.g., every core must cycle the same number of times. So, if one core has cycled, it must wait for every other core to cycle as well before moving on to the next cycle; otherwise it will violate concurrency considerations (see PCT/IB2018/054669).
The disclosure extends to a method of operating a processor, the method comprising:
providing a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit; executing, by the logic unit of each core, only one type of operation;
interconnecting the cores by a communication bus;
storing, by a plurality of address modules respectively associated with the plurality of cores, an address which points to a value to be communicated to the input buffer; and
calculating, by at least one of the cores, an output based on the input value and the type of operation which its logic unit is configured to perform.
The disclosure extends to a non-transitory computer-readable medium which has stored thereon a set of instructions which, when executed by a computer processor, causes the computer processor to perform the method defined above.
BRIEF DESCRIPTION OF DRAWINGS
The disclosure will now be further described, by way of example, with reference to the accompanying diagrammatic drawings.
In the drawings:
FIG. 1 shows an extract from FIG. 3 of PCT/IB2018/054669;
FIG. 2 shows a schematic view of a processor comprising at least one core in accordance with the present disclosure;
FIG. 3 shows a schematic view of a first resource group comprising plural cores of
FIG. 1 ;
FIG. 4 shows a schematic view of a second resource group comprising plural cores of FIG. 1 ;
FIG. 5 shows a schematic view of a pattern of resource groups of FIG. 4;
FIG. 6 shows a schematic view of the cores of FIG. 3 in a chain; and
FIG. 7 shows a flow diagram of a method of operating a processor in accordance with the present disclosure.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENT
The following description of the disclosure is provided as an enabling teaching of the disclosure. Those skilled in the relevant art will recognise that many changes can be made to the embodiment described, while still attaining the beneficial results of the present disclosure. It will also be apparent that some of the desired benefits of the present disclosure can be attained by selecting some of the features of the present disclosure without utilising other features. Accordingly, those skilled in the art will recognise that modifications and adaptations to the present disclosure are possible and can even be desirable in certain circumstances, and are a part of the present disclosure. Thus, the following description is provided as illustrative of the principles of the present disclosure and not a limitation thereof.
FIG. 2 illustrates a processor 100 which has plural cores 102. Only a single core 102 (core n) is illustrated in FIG. 2 for clarity, but it will be appreciated that the architecture of the processor 100 is massively scalable and there may be a massive number of cores, e.g., 2^32, merely to give a numerical illustration of the possible scale of the processor 100. The processor 100 has a communication bus 104 configured to interconnect the cores 102.
The core 102 has two input buffers 106, 108, namely buffer A 106 and buffer B 108. The core 102 has a logic unit in the form of an ALU 110 which has an input and an output, wherein the input is in communication with the input buffers 106, 108 and the output is in communication with a memory unit or cell 112. The ALU 110 is configured to perform or execute only one type of operation or a specific piece of opcode. This is a first aspect in which the present disclosure differs from conventional processors or cores which are configured to execute an operation defined by a CU based on opcode supplied by the CU.
This implies, practically, that when a particular operation is required, inputs are merely supplied to a core 102 which is configured to perform that operation, without supplying any opcode. To demonstrate this, Table 3 in the Background of Disclosure above can be re-written as Table 7:
Table 7 [rendered as an image in the original publication]
Because a large number of cores 102 is contemplated, the whole program can be loaded into the available cores 102 which eliminates the need to change the memory unit 112 of the cores 102 during runtime, which in turn eliminates the need to load new addresses into the cores 102 during execution. The number of instructions is still the same, but they are parallelised; the total number of cycles is still the same, as per Table 8:
Table 8 [rendered as an image in the original publication]
These methods may introduce a problem of concurrency. If, for example, core C3 finishes executing before core C2, the result will be incorrect. To solve this problem, every core 102 may have an execution complete flag 114, indicating whether it has executed at least once. Then, whenever a core (e.g., C3) tries to read the value of another core (e.g., C2), it may skip one execution cycle if flag 114 is false. In this fashion, the order of operation may be correctly maintained.
This may mean that a program will still take two cycles to execute even though there is only one row in the table. That row may have to be repeated over and over until the final result is calculated. Table 9 illustrates this principle:
Table 9 [rendered as an image in the original publication]
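The flag-and-repeat principle of Tables 7-9 can be captured in a compact simulation (a Python sketch under stated assumptions: the Core class, field names and scheduling loop are illustrative, not the disclosed micro-architecture). C0 holds the literal 10, C1 doubles C0, and C2 multiplies C0 by C1; repeating the single row drives C2 to 200 in two cycles, as in Table 9:

class Core:
    def __init__(self, op=None, addr_a=None, addr_b=None, value=0):
        self.op = op          # the single operation this core is wired to perform
        self.addr_a = addr_a  # address module l_A: where input A comes from
        self.addr_b = addr_b  # address module l_B: where input B comes from
        self.value = value    # the core's memory unit
        self.done = False     # execution complete flag (114)

def run_cycle(cores):
    ready = [c.done for c in cores]          # snapshot the flags at the start of the cycle
    for core in cores:
        if core.op is None:
            continue                         # literal-holding core: nothing to execute
        deps = [d for d in (core.addr_a, core.addr_b) if d is not None]
        if not all(ready[d] for d in deps):
            continue                         # an input is not ready yet: skip this cycle
        a = cores[core.addr_a].value
        b = cores[core.addr_b].value if core.addr_b is not None else None
        core.value = core.op(a, b)
        core.done = True

cores = [
    Core(value=10),                                   # C0: literal 10, pre-loaded
    Core(op=lambda a, b: a * 2, addr_a=0),            # C1: C0 * 2
    Core(op=lambda a, b: a * b, addr_a=0, addr_b=1),  # C2: C0 * C1
]
cores[0].done = True                                  # the pre-loaded literal counts as executed
cycles = 0
while not all(c.done for c in cores):
    run_cycle(cores)
    cycles += 1
print(cores[2].value, cycles)                         # 200 after 2 cycles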
Tables 7-9 are also illustrative of methods of the present disclosure, as may be implemented by the processor 100 (refer also to FIG. 7).
If the processor 100 executes routines linearly, this may lead to the communication bus 104 being shared and possible contention issues. Also, only one core 102 may be able to execute at a time. To address this, each core 102 may be made responsible for its own execution.
Accordingly, the core 102 also includes address modules 122, 124. The address modules 122, 124 may correspond to the input buffers 106, 108. In this example, there are two input buffers 106, 108 and accordingly there are two address modules 122, 124, designated as l_A 122 (corresponding to input buffer A 106) and l_B 124 (corresponding to input buffer B 108). (Instead, conceptually, the core 102 may have a single address module configured to store plural addresses.) Each address module 122, 124 is configured to store an address which points to a value to be communicated to its associated input buffer 106, 108. The inclusion of the address modules 122, 124 is another aspect which differentiates the present disclosure from conventional processors. The address modules 122, 124 may embody a distributed control system of the processor 100.
For example, and referring back to Table 7, core C2 may have the address of core C0 in l_A 122 and of core C1 in l_B 124 (or the addresses of the memory units 112 of those cores). For efficient execution, the communication bus 104 may be modified to have two portions: the first portion being for the memory address and the second portion being for the value at that address. In this fashion, every core 102 may be able to write the values of their address modules 122, 124 respectively onto the communication bus 104 and, separately, then read values into the input buffers 106, 108.
Therefore, the processor 100 may, during normal operation, not have to load any literals, opcodes, or data. Accordingly, machine code source size may no longer be a consideration, and there may no longer be a need for a centralised CU, because every core 102 can operate itself based on the content of its address modules 122, 124.
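A minimal sketch of this two-portion bus cycle (assumed Python; the names and sequencing are illustrative): each core first writes the addresses held in its address modules onto the address portion, and then reads the corresponding values from the value portion into its input buffers:

memory_units = {0: 10, 1: 20}              # values held by cores C0 and C1 (illustrative)

class Core:
    def __init__(self, i_a, i_b):
        self.i_a, self.i_b = i_a, i_b      # address modules l_A (122) and l_B (124)
        self.buf_a = self.buf_b = None     # input buffers A (106) and B (108)
    def request(self):
        return [self.i_a, self.i_b]        # write addresses onto the address portion
    def receive(self, values):
        self.buf_a, self.buf_b = values    # read values in from the value portion

c2 = Core(i_a=0, i_b=1)                    # C2 reads from C0 and C1, as in Table 7
addr_portion = c2.request()
value_portion = [memory_units[a] for a in addr_portion]
c2.receive(value_portion)                  # the buffers now hold 10 and 20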
The Applicant notes that an issue that may need to be overcome is that only one core 102 may use the bus 104 at a time. To address this issue, one or more groups of communications buses, groups of cores, and/or invocation chains may be implemented.
FIGS 3-4 illustrate cores 102 grouped into resource groups 200. Again, only a few cores 102 are illustrated (11 in FIG. 3 and four in FIG. 4) but the number may be adjusted as desired or based on intended application. Resource groups 200 may be considered a systematic grouping of cores 102. By grouping cores 102 together, they can be tightly connected with their own internal group sub-bus 206 (which forms part of the communication bus 104), thereby allowing the grouped cores 102 to share data with one another quickly inside their own group 200. Groups 200 can be specialised to specific tasks; for example, a group designed for graphics processing may consist of more floating point cores, and a group optimised for string processing may have more string manipulation cores, etc. The sub-bus 206 connects to a remainder of the bus 104 by a gateway junction 202. Alternatively, FIG. 4 could represent a grouping of four resource groups 200 of FIG. 3, rather than a grouping of four cores 102.
A way to determine the cores 102 in a group 200 is to perform a frequency analysis, e.g., on existing software, like Linux™, or a rendering engine, to determine which operations are performed most, and then, using the resulting statistics, determine the distribution and grouping of the cores 102. Once the resource groups 200 are defined, they can be interconnected or arrayed in a manner that allows any given resource group 200 to communicate with any other given group 200. FIG. 5 illustrates a fractal pattern as one example, but other layouts are contemplated.
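Such a frequency analysis might be as simple as counting operations in existing binaries or instruction traces (a Python sketch; the trace below is fabricated purely for illustration):

from collections import Counter

trace = ["add", "mul", "add", "cmp", "add", "fmul", "mul", "add"]  # assumed op trace
histogram = Counter(trace).most_common()
print(histogram)   # e.g. allocate proportionally more adder cores than comparator cores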
FIG. 5 has 16 (4^2) groups 200 of cores 102, each group 200 having four cores 102. Each group 200 has a third level bus (B_3), with a second level bus (B_2) interconnecting the groups 200, and a first level bus (B_1) interconnecting the second level buses (B_2). This pattern can be recursively extrapolated as far as practically possible with the size of a given core in mind. In other embodiments, the pattern need not be symmetrical.
For every recursion performed on this pattern, a gateway junction is created connecting the new, smaller bus to the bus above it. That way, a pattern is created that leads to subgroups in a tree-like structure. Every resource group 200 may have a gateway junction connecting it to its sub-bus. By doing this, the individual resource groups 200 can be cycled at the same time. The only time a resource group 200 may wait for time on the bus is when it requests data from another resource group 200.
In FIG. 4, an example of a small group 200 is illustrated. If a program does not need any information from an external source, all four cores 102 may safely be cycled at the same time because none of them will use the network bus to get data from another group. The same applies to groups of resource groups 200. If a core 102 in one group 200 requests data from another group 200, the rest of the resource groups 200 that are not trying to use the sub-bus 206 can still be cycled. Accordingly, two groups 200 may not hold up the rest of the processor 100 when one is waiting for data from the other.
An operating system that loads a given program into the processor 100 may be configured to re-link the program to reflect the cores and resources actually available. This way, the operating system can keep a program’s execution in as few resource groups 200 as possible, maximising performance.
As the processor 100 lacks a central CU, there may be a need to cycle cores 102 and resource groups 200 in a synchronised manner to avoid violating concurrency requirements. To do this, individual cores 102 may be chained in a chain reaction manner, and similarly resource groups 200 may be chained in a cascading reaction manner.
FIG. 6 illustrates such a chain 300. Each core 102 invokes the next core 102 in its respective chain 300 once it has completed its own execution cycle. This way, only one core 102 may use the internal sub-bus 206 at a time.
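The chain can be sketched as a linked invocation (assumed Python; the class is illustrative):

class ChainedCore:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
    def cycle(self):
        print(self.name, "executes")   # this core has the sub-bus to itself here
        if self.successor:             # chain reaction: invoke the next core
            self.successor.cycle()

c2 = ChainedCore("C2")
c1 = ChainedCore("C1", c2)
c0 = ChainedCore("C0", c1)
c0.cycle()                             # one invocation ripples down the whole chain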
In FIG. 5, each sub-bus (B_1, B_2, B_3) can be thought of as a tree structure (root at the top, expanding down). The top most bus (first level bus B_1) sends an invocation signal to the second level buses (B_2), which in turn send invocation signals to the third level buses (B_3). This may iterate over and over until the lowest sub-bus is reached, where the resource groups 200, which comprise the cores 102, are located. This may mean that every core 102 gets executed by the same clock.
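The cascading invocation down the bus hierarchy can likewise be sketched (assumed Python; the nesting mirrors FIG. 5’s B_1 to B_2 to B_3 levels, with core names invented for illustration):

bus_tree = {                            # B_1 at the root
    "B_2a": {"B_3a": ["C0", "C1", "C2", "C3"],
             "B_3b": ["C4", "C5", "C6", "C7"]},
    "B_2b": {"B_3c": ["C8", "C9", "C10", "C11"]},
}

def invoke(node):
    if isinstance(node, dict):          # an inner bus: fan the signal out
        for child in node.values():
            invoke(child)
    else:                               # a resource group: cycle its cores
        for core in node:
            print("cycle", core)

invoke(bus_tree)                        # every core ends up cycled by the same clock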
It may be acceptable for each core 102 to execute at the same time, as long as they all are in cyclic sync, e.g., every core 102 must cycle the same number of times. If one core 102 has cycled, it may be configured to wait for every other core 102 to cycle as well, before moving on to the next cycle.
FIG. 7 illustrates a method 400 implemented by the processor 100, as explained above with reference to the functionality of the processor 100. By way of summary, the processor 100 is provided (at block 402) with a plurality of cores 102, each of which is configured to perform a single type of operation. Addresses pointing to the values to be inputted to the cores 102 are stored (at block 404) in the address modules 122, 124. Then, as explained above, an output is calculated (at block 406) based on the input values which the address modules 122, 124 point to, according to the operation type which the logic unit 110 of the core 102 is configured to perform. This may be repeated (at block 408) as many times as required.
The disclosure may provide one or more of the following advantages:
Machine Code Size Bottleneck: Elimination of the need to cycle different opcodes every cycle, by creating a processor where every line of machine code is executed in a single cycle, as well as only using cores with a single function to eliminate the concept of opcodes.
Control Unit Bottleneck: Eliminating the need for a control unit by replacing it with a distributed control system, thereby removing the need for a linear execution cycle.
Bus Usage Bottleneck: Solving the problem of bus usage limitations, by subdividing groups.

Claims

1. A processor which comprises:
a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit, wherein the logic unit of each core is configured to perform or execute only one type of operation;
a communication bus configured to interconnect the cores; and a plurality of address modules respectively associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer,
wherein each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.
2. The processor of claim 1, wherein the type of operation which each core is configured to perform is a statically pre-defined opcode or a statically pre-defined segment of opcode.
3. The processor of claim 1, wherein each core comprises two input buffers.
4. The processor of claim 1, wherein the logic unit is an Arithmetic Logic Unit (ALU), a Floating-Point Unit (FPU), or a Graphics Processing Unit (GPU).
5. The processor of claim 1, wherein no CU is provided.
6. The processor of claim 1, wherein: each core includes a core identifier; and
the cores are executed sequentially or cyclically.
7. The processor of claim 1, wherein each core comprises an execution complete flag which is configured to indicate whether or not that core has executed or completed an instruction within a particular cycle.
8. The processor of claim 7, wherein each core is configured to execute its instruction only if one or more cores earlier in the cycle have already executed their instruction, as indicated by the execution complete flag.
9. The processor of claim 1, wherein each address module comprises an address or pointer to a memory location.
10. The processor of claim 1, wherein each core comprises as many address modules, or wherein each address module stores as many memory addresses, as each core has input buffers.
11. The processor of claim 1, wherein each core is configured to fetch the value pointed to by the address in the address module at the beginning of, or during, each execution cycle.
12. The processor of claim 1, wherein the communication bus comprises two portions, a first portion to communicate values and a second portion to communicate addresses pointing to the values.
13. The processor of claim 12, wherein the first portion of the communication bus is coupled to each input buffer and the second portion of the communication bus is coupled to the address modules of each core.
14. The processor of claim 1, in which the cores are divided into resource groups.
15. The processor of claim 14, in which the cores within a resource group are configured to perform the same or similar types of operations.
16. The processor of claim 14, in which the communication bus is divided into groups or sub-buses, wherein a sub-bus links cores within a resource group and then a larger sub-bus links resource groups.
17. The processor of claim 16, wherein a gateway junction is provided between adjacent levels of the sub-buses, which results in a tree-like structure or system.
18. The processor of claim 17, wherein every resource group has a gateway junction connecting it to a higher level sub-bus, which enables multiple resource groups to be cycled or executed simultaneously.
19. A method of operating a processor, the method comprising:
providing a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit; executing, by the logic unit of each core, only one type of operation;
interconnecting the cores by a communication bus;
storing, by a plurality of address modules respectively associated with the plurality of cores, an address which points to a value to be communicated to the input buffer; and calculating, by at least one of the cores, an output based on the input value and the type of operation which its logic unit is configured to perform.
20. A non-transitory computer-readable medium which has stored thereon a set of instructions which, when executed by a computer processor, cause the computer processor to perform the method of claim 19.
PCT/IB2018/056875 2018-09-10 2018-09-10 A processor and a method of operating a processor WO2020053618A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor
ZA2021/01831A ZA202101831B (en) 2018-09-10 2021-03-18 A processor and a method of operating a processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor

Publications (1)

Publication Number Publication Date
WO2020053618A1 true WO2020053618A1 (en) 2020-03-19

Family

ID=69776845

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor

Country Status (2)

Country Link
WO (1) WO2020053618A1 (en)
ZA (1) ZA202101831B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070192547A1 (en) * 2005-12-30 2007-08-16 Feghali Wajdi K Programmable processing unit
US20080086626A1 (en) * 2006-10-05 2008-04-10 Simon Jones Inter-processor communication method
US7958416B1 (en) * 2005-11-23 2011-06-07 Altera Corporation Programmable logic device with differential communications support
US20140122555A1 (en) * 2012-10-31 2014-05-01 Brian Hickmann Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values
US20150242322A1 (en) * 2013-06-19 2015-08-27 Empire Technology Development Llc Locating cached data in a multi-core processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7958416B1 (en) * 2005-11-23 2011-06-07 Altera Corporation Programmable logic device with differential communications support
US20070192547A1 (en) * 2005-12-30 2007-08-16 Feghali Wajdi K Programmable processing unit
US20080086626A1 (en) * 2006-10-05 2008-04-10 Simon Jones Inter-processor communication method
US20140122555A1 (en) * 2012-10-31 2014-05-01 Brian Hickmann Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values
US20150242322A1 (en) * 2013-06-19 2015-08-27 Empire Technology Development Llc Locating cached data in a multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RANGER ET AL.: "Evaluating MapReduce for multi-core and multiprocessor systems", 2007 IEEE 13TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE, 14 February 2007 (2007-02-14), pages 13 - 24, XP055111222 *

Also Published As

Publication number Publication date
ZA202101831B (en) 2022-07-27

Similar Documents

Publication Publication Date Title
CN107347253B (en) Hardware instruction generation unit for special purpose processor
JP5035277B2 (en) A locking mechanism that allows atomic updates to shared memory
US8544019B2 (en) Thread queueing method and apparatus
US6237021B1 (en) Method and apparatus for the efficient processing of data-intensive applications
US20120066668A1 (en) C/c++ language extensions for general-purpose graphics processing unit
CN104050033A (en) System and method for hardware scheduling of indexed barriers
US20110072249A1 (en) Unanimous branch instructions in a parallel thread processor
US9110692B2 (en) Method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system
CN103197916A (en) Methods and apparatus for source operand collector caching
CN103226463A (en) Methods and apparatus for scheduling instructions using pre-decode data
US20210232394A1 (en) Data flow processing method and related device
CN103885893A (en) Technique For Accessing Content-Addressable Memory
CN103279379A (en) Methods and apparatus for scheduling instructions without instruction decode
CN113326066B (en) Quantum control microarchitecture, quantum control processor and instruction execution method
US9378533B2 (en) Central processing unit, GPU simulation method thereof, and computing system including the same
CN104050032A (en) System and method for hardware scheduling of conditional barriers and impatient barriers
CN103885902A (en) Technique For Performing Memory Access Operations Via Texture Hardware
CN112580792B (en) Neural network multi-core tensor processor
CN103885903A (en) Technique For Performing Memory Access Operations Via Texture Hardware
CN116783578A (en) Execution matrix value indication
CN103294449A (en) Pre-scheduled replays of divergent operations
WO2020053618A1 (en) A processor and a method of operating a processor
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
CN116401039A (en) Asynchronous memory deallocation
Sakai et al. Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18933547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18933547

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/09/2021)
