WO2020053618A1 - A processor and a method of operating a processor - Google Patents

A processor and a method of operating a processor Download PDF

Info

Publication number
WO2020053618A1
Authority
WO
WIPO (PCT)
Prior art keywords
core
processor
cores
address
input
Prior art date
Application number
PCT/IB2018/056875
Other languages
French (fr)
Inventor
Emile BADENHORST
Original Assignee
Badenhorst Emile
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Badenhorst Emile filed Critical Badenhorst Emile
Priority to PCT/IB2018/056875 priority Critical patent/WO2020053618A1/en
Publication of WO2020053618A1 publication Critical patent/WO2020053618A1/en
Priority to ZA2021/01831A priority patent/ZA202101831B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution


Abstract

A processor comprises a plurality of cores and lacks a conventional CU (Control Unit). Each core comprises at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit. The logic unit of each core is configured to perform or execute only one type of operation. A plurality of address modules are associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer. Each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.

Description

A Processor and a Method of Operating a Processor
FIELD OF DISCLOSURE
The disclosure relates generally to computer processors and specifically to a processor which may lack a central CU (Control Unit) and an associated method.
BACKGROUND OF DISCLOSURE
Modern computer processors or Central Processing Units (CPUs) may comprise plural cores. A Control Unit (CU) is usually configured to direct the operation of the plural cores. As miniaturisation and device technologies advance, the number of cores which a processor has may also increase, even increase exponentially. The Applicant envisages that future processors may have millions or billions of cores (or similarly configured processing sub-units).
A problem with having a massive number (e.g., billions) of cores in a processor is that there may be considerable overhead in using a centralised CU which conventionally would load code (e.g., opcode) and cause the cores to execute it. On such a vast scale, such loading of code is time-consuming and potentially too time-consuming to operate the cores efficiently. The Applicant notes that a massive multi-core processor architecture is described in its previous PCT application no. PCT/IB2018/054669.
To explain the background to the problem of controlling a large number of cores, a scaled down architecture with a smaller number of cores (e.g., 10) can be used. An example instruction set as well as a centralised CU may demonstrate the bottleneck. Further, a core can read another core’s value, and a core only executes opcode, usually with a very small input size, for example, “A + B” or “A > B”. For normal opcode, a more complex input or instruction, e.g. “A + B * (C / D)”, cannot be executed and is not a valid instruction; it has to be broken down into smaller components.
Consider the following instructions:
Table 1 [rendered as an image in the original publication]
A symbol may be an address, a literal, or a dereference instruction. Dereferencing will load the value at the symbol and use it as the memory address for the context of the instruction in which it is used.
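As a rough illustration of dereferencing, consider a toy memory model (a Python sketch; the memory layout and names are illustrative assumptions, not part of the instruction set above):

memory = {0: 7, 7: 42}           # address 0 holds 7; address 7 holds 42
symbol = 0
direct = memory[symbol]          # using the symbol directly as an address yields 7
deref = memory[memory[symbol]]   # dereference: the value at the symbol (7) becomes the address, yielding 42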
Using the Instruction Set (IS) defined above, an architecture following the concurrent solution outlined in PCT/IB2018/054669 can be created. For example, every row is an execution cycle of the processor. Every column is a core in the processor and there are 10 cores. A core may retain its internal value every cycle. In this background example, every core is an unsigned integer:
Table 2 [rendered as an image in the original publication]
Taking the following code sample:
let bar(x : u32, y : u32) = x * y;
let foo(x : u32) = bar(x, x * 2);
let out = foo(10);
and converting this to opcodes will look like:
Table 3 [rendered as an image in the original publication]
Executing the code in Table 3 yields:
Table 4 [rendered as an image in the original publication]
This execution is described as follows:
Cycle 0:
C0: Literal 10 loads into memory at address C0.
Cycle 1:
C0: Copies the value of C0 (this is technically pointless because the value is already in the core, but it helps to explain the code’s behaviour).
C1: Loads the literal 10 from C0 and multiplies it by 2, then stores the resulting value in C1.
Cycle 2:
C1: Loads the value 10 from C0 and the value 20 from C1, multiplies them with one another, and stores the resulting value in C1.
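This cycle-by-cycle execution can be mimicked with a short simulation (a Python sketch; purely illustrative, assuming ten unsigned-integer cores driven one cycle at a time by the centralised CU described above):

cores = [0] * 10                     # every core holds an unsigned integer value

def cycle0():
    cores[0] = 10                    # C0: literal 10 loads into C0

def cycle1():
    cores[0] = cores[0]              # C0: the technically pointless copy
    cores[1] = cores[0] * 2          # C1: 10 * 2 = 20

def cycle2():
    cores[1] = cores[0] * cores[1]   # C1: 10 * 20 = 200

for step in (cycle0, cycle1, cycle2):
    step()                           # the CU drives one cycle at a time
print(cores[1])                      # 200, i.e. out = foo(10)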
FIG. 1 is an extract from FIG. 3 of PCT/IB2018/054669 and it illustrates a proposed multi-core processor architecture and is still considered as background to the present disclosure although it may not rate as prior art as it may not be publicly available at the priority date of this present application.
In FIG. 1, every core may be assigned an index in a sequential order, i.e. C0, C1, C2, etc. Every core comprises two buffers, the first “A” buffer and the second “B” buffer, as well as the core’s own value; the core label itself will be used to reference this buffer (memory unit). Buffer A may be configured always to write its value into an associated logic unit, e.g., an Arithmetic Logic Unit (ALU). Buffer B may also be configured to output its value to the ALU. Both buffer A and buffer B may be responsive to a read control signal. A memory unit has two control signals, read and write, operated from the point of view of the buffer itself. This means that if buffer A is instructed to read, it takes the value from the bus and reads it into its internal storage, while a write operation writes the internal value into the configured output; in the case of the memory unit, the write operation outputs its value onto the bus.
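These buffer semantics can be restated in code (a Python sketch; the class and names are illustrative assumptions, not the notation of PCT/IB2018/054669):

class MemoryUnit:
    def __init__(self, value=0):
        self.value = value
    def read(self, bus_value):    # read: take the value from the bus into internal storage
        self.value = bus_value
    def write(self):              # write: push the internal value to the configured output
        return self.value         # for the memory unit, that output is the bus itself

mem, buf_a = MemoryUnit(10), MemoryUnit()
bus = mem.write()    # the memory unit writes its value onto the bus
buf_a.read(bus)      # buffer A reads the bus into its internal storage (and feeds the ALU)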
Another important function is for the CU to assign the ALU its operation, that is, the CU tells the core which arithmetic operation it should do, e.g., addition, subtraction, multiplication or division. This may be notated as Cx.Operator, e.g., C0.Sub. Using this naming convention, operations may be illustrated as follows:
Table 5 [rendered as an image in the original publication]
Using the notation of Table 5, the opcode of Tables 3-4 can be translated into micro-instructions:
Table 6 [rendered as an image in the original publication]
The following observations may be made about the above process:
Linear nature of the Control Unit: During every execution cycle the CU iterates over every core, one by one, in a sequential manner, executing that specific core’s opcode, which means that only one core is operating at any given time.
Machine code source size: When a processor has a large number of cores, the executing code can be extremely large. For example, if every opcode is 3 bytes in size and there are 2^32 − 1 cores, and assuming every core is used every execution cycle, a single execution cycle’s machine code size is 3 × 2^32 = 12,884,901,888 bytes ≈ 12.88 GB. Loading ~13 GB of machine code every execution cycle may not be practical and may result in slow execution cycles (see the short check after this list).
The bus: Because only one core can use the bus at a time, only one core can operate at a time, which is a huge bottleneck.
Core functions: All the cores have the same instruction set, which is therefore very large and generalised, resulting in slower speed and greater complexity.
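The machine-code-size estimate above is easy to verify (a Python sketch of the arithmetic in the text):

opcode_bytes = 3
num_cores = 2 ** 32                  # one opcode per core, every execution cycle
per_cycle = opcode_bytes * num_cores
print(per_cycle)                     # 12884901888 bytes
print(per_cycle / 10 ** 9)           # ~12.88 GB of machine code per execution cycle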
The Applicant desires a processor which may overcome or ameliorate the abovementioned drawbacks and may enable processors with a large number of cores.
SUMMARY OF DISCLOSURE
Accordingly, the disclosure provides a processor which comprises: a plurality of cores, each core comprising: at least one input buffer; a logic unit having an input and an output, wherein the input is in communication with the input buffer; and a memory unit in communication with the output of the logic unit, wherein the logic unit of each core is configured to perform or execute only one type of operation; a communication bus configured to interconnect the cores; and a plurality of address modules respectively associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer, wherein each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.
Conceptually, the plurality of address modules may embody a distributed control system. The type of operation may be in the form of an opcode or segments of opcode. Accordingly, each core may be configured statically, or statically pre-defined, to execute only one opcode. The cores may not be re-configurable in the type of operation which they can execute.
Accordingly, to execute a given instruction or opcode, the input value(s) may need to be communicated to a core which has been configured to perform that instruction, without communicating the instruction itself to the core. This is in contrast with a conventional processor in which the values and the corresponding opcode may be communicated to any core, which may then load the opcode and execute it with respect to the input values.
Each core may include plural input buffers. Each core may include two input buffers.
The logic unit may be, or may include, an Arithmetic Logic Unit (ALU), a Floating- Point Unit (FPU), and/or a Graphics Processing Unit (GPU).
It will be noted that, contrary to conventional processor architecture where a CU supplies each core with input value(s) and with opcode, in the present disclosure, there is no CU. The address modules supply the cores with addresses of the input values. The address modules of the present disclosure do not supply opcode, and are thus different from a conventionally configured CU which does supply opcode. In other words, in the present disclosure, the need to assign opcode to each core may be eliminated.
Each core may include a core identifier. Each core, or each core in a group of cores, may be sequential or cyclical. That is, the cores may be configured to execute in a sequence. The core identifier may indicate a sequence of the cores. Each core may include an execution complete flag. The execution complete flag may be configured to indicate whether or not that core has executed or completed an instruction within a particular cycle. If the cores are sequential, each core may be configured to execute its instruction only if one or more cores earlier in the sequence have already executed their instruction. Accordingly, a particular core may only execute if one or more cores earlier in the sequence have their execution complete flags set as true. Conversely, a particular core may not execute, or temporarily skip or suspend execution, if one or more cores earlier in the sequence have their execution complete flags set as false.
Each address module may comprise an address or pointer to a memory location. Each address module may include as many addresses or pointers as each core has input buffers. Where the core comprises two input buffers, each address module may comprise two addresses, one for each input buffer. Each core may be configured to fetch a value from the memory location addressed by the address module and place that value into the input buffer. The core may be configured to fetch the value at the beginning of, or during, each execution cycle.
The communication bus may comprise two portions, a first portion to communicate values and a second portion to communicate addresses pointing to the values. The first portion of the communication bus may be coupled to the input buffer(s) of each core. The second portion of the communication bus may be coupled to the address modules of each core. The first and second portions of the communication bus may be logical portions or physical portions.
The cores may be divided into groups. The groups may be characterised as resource groups. The cores with a resource group may be configured to perform the same or similar types of operations. The cores of different resource groups may be configured to perform different types of operations. The communication bus may also be divided into groups or sub-buses. A sub- bus may link cores within a resource group and then a larger sub-bus may link resource groups. This hierarchy of sub-buses may occur iteratively or exponentially, with groups of resource groups being connected by an even larger bus, and so forth.
For example, a resource group may comprise four cores (or quadrants), with a lowest level sub-bus interconnecting the four cores. The sub-bus may have a cross configuration (whether physical or logical). Four of these resource groups may be grouped together by a second-lowest level sub-bus which may also have a cross pattern, and four of these groups of resource groups may be grouped together, and so forth. For example, there may thus be 4^n cores, n levels of sub-buses, and a corresponding number of individual sub-buses [the exact expression is rendered as an image in the original publication]. The grouping of cores and division of the communication bus may be done in any practicable manner and need not necessarily be quadrangular or even symmetrical.
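Under the quad-grouping assumption, the bus counts at each level can be tabulated (a Python sketch; the closed form (4^n − 1)/3 is an inference from the described pattern, not stated in the original):

for n in range(1, 6):
    # one top-level bus, four below it, and so on down to 4**(n-1) lowest-level buses
    per_level = [4 ** (k - 1) for k in range(1, n + 1)]
    total = sum(per_level)
    assert total == (4 ** n - 1) // 3    # geometric series
    print(n, per_level, total)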
A gateway junction may be provided between adjacent levels of the sub-buses, e.g., between a larger, top level sub-bus and a smaller, second level sub-bus. This may result in a tree-like structure or system. Every resource group may have a gateway junction connecting it to a higher level sub-bus. By doing this, multiple resource groups may be cycled or executed simultaneously. A resource group may only be required to wait for time on the communication bus if it requests values or data from another resource group.
An operating system which operates the processor may be configured to address the cores dynamically to reflect available cores and resources. This way the operating system can keep a computer program’s execution in as few as possible resource groups, maximising performance.
As the processor may not include a CU, there may be a need to cycle the cores (and resource groups, if present) in a synchronised manner to avoid concurrency violations (see PCT/IB2018/054669). Accordingly, each core may be configured to trigger or invoke a subsequent core or resource group, e.g., in a cascading arrangement. It may be acceptable for more than one core to execute at the same time, as long as they all are in cyclic sync, e.g., every core must cycle the same number of times. So, if one core has cycled, it must wait for every other core to cycle as well before moving on to the next cycle; otherwise it will violate concurrency considerations (see PCT/IB2018/054669).
The disclosure extends to a method of operating a processor, the method comprising:
providing a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit; executing, by the logic unit of each core, only one type of operation;
interconnecting the cores by a communication bus;
storing, by a plurality of address modules respectively associated with the plurality of cores, an address which points to a value to be communicated to the input buffer; and
calculating, by at least one of the cores, an output based on the input value and the type of operation which its logic unit is configured to perform.
The disclosure extends to a non-transitory computer-readable medium which has stored thereon a set of instructions which, when executed by a computer processor, causes the computer processor to perform the method defined above.
BRIEF DESCRIPTION OF DRAWINGS
The disclosure will now be further described, by way of example, with reference to the accompanying diagrammatic drawings.
In the drawings:
FIG. 1 shows an extract from FIG. 3 of PCT/IB2018/054669;
FIG. 2 shows a schematic view of a processor comprising at least one core in accordance with the present disclosure;
FIG. 3 shows a schematic view of a first resource group comprising plural cores of
FIG. 1 ;
FIG. 4 shows a schematic view of a second resource group comprising plural cores of FIG. 1 ;
FIG. 5 shows a schematic view of a pattern of resource groups of FIG. 4;
FIG. 6 shows a schematic view of the cores of FIG. 3 in a chain; and
FIG. 7 shows a flow diagram of a method of operating a processor in accordance with the present disclosure.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENT
The following description of the disclosure is provided as an enabling teaching of the disclosure. Those skilled in the relevant art will recognise that many changes can be made to the embodiment described, while still attaining the beneficial results of the present disclosure. It will also be apparent that some of the desired benefits of the present disclosure can be attained by selecting some of the features of the present disclosure without utilising other features. Accordingly, those skilled in the art will recognise that modifications and adaptations to the present disclosure are possible and can even be desirable in certain circumstances, and are a part of the present disclosure. Thus, the following description is provided as illustrative of the principles of the present disclosure and not a limitation thereof.
FIG. 2 illustrates a processor 100 which has plural cores 102. Only a single core 102 (core n) is illustrated in FIG. 2 for clarity, but it will be appreciated that the architecture of the processor 100 is massively scalable and there may be a massive number of cores, e.g., 2^32, merely to give a numerical illustration of the possible scale of the processor 100. The processor 100 has a communication bus 104 configured to interconnect the cores 102.
The core 102 has two input buffers 106, 108, namely buffer A 106 and buffer B 108. The core 102 has a logic unit in the form of an ALU 110 which has an input and an output, wherein the input is in communication with the input buffers 106, 108 and the output is in communication with a memory unit or cell 112. The ALU 110 is configured to perform or execute only one type of operation or a specific piece of opcode. This is a first aspect in which the present disclosure differs from conventional processors or cores which are configured to execute an operation defined by a CU based on opcode supplied by the CU.
This implies, practically, that when a particular operation is required, inputs are merely supplied to a core 102 which is configured to perform that operation, without supplying any opcode. To demonstrate this, Table 3 in the Background of Disclosure above can be re-written as Table 7:
Table 7 [rendered as an image in the original publication]
Because a large number of cores 102 is contemplated, the whole program can be loaded into the available cores 102 which eliminates the need to change the memory unit 112 of the cores 102 during runtime, which in turn eliminates the need to load new addresses into the cores 102 during execution. The number of instructions is still the same, but they are parallelised; the total number of cycles is still the same, as per Table 8:
Table 8 [rendered as an image in the original publication]
These methods may introduce a problem of concurrency. If, for example, core C3 finishes executing before core C2, the result will be incorrect. To solve this problem, every core 102 may have an execution complete flag 114, indicating whether it has executed at least once. Then, whenever a core (e.g., C3) tries to read the value of another core (e.g., C2), it may skip one execution cycle if flag 114 is false. In this fashion, the order of operation may be correctly maintained.
This may mean that a program will still take two cycles to execute even though there is only one row in the table. That row may have to be repeated over and over until the final result is calculated. Table 9 illustrates this principle:
Table 9 [rendered as an image in the original publication]
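The flag-and-repeat principle of Tables 7-9 can be captured in a compact simulation (a Python sketch under stated assumptions: the Core class, field names and scheduling loop are illustrative, not the disclosed micro-architecture). C0 holds the literal 10, C1 doubles C0, and C2 multiplies C0 by C1; repeating the single row drives C2 to 200 in two cycles, as in Table 9:

class Core:
    def __init__(self, op=None, addr_a=None, addr_b=None, value=0):
        self.op = op          # the single operation this core is wired to perform
        self.addr_a = addr_a  # address module l_A: where input A comes from
        self.addr_b = addr_b  # address module l_B: where input B comes from
        self.value = value    # the core's memory unit
        self.done = False     # execution complete flag (114)

def run_cycle(cores):
    ready = [c.done for c in cores]          # snapshot the flags at the start of the cycle
    for core in cores:
        if core.op is None:
            continue                         # literal-holding core: nothing to execute
        deps = [d for d in (core.addr_a, core.addr_b) if d is not None]
        if not all(ready[d] for d in deps):
            continue                         # an input is not ready yet: skip this cycle
        a = cores[core.addr_a].value
        b = cores[core.addr_b].value if core.addr_b is not None else None
        core.value = core.op(a, b)
        core.done = True

cores = [
    Core(value=10),                                   # C0: literal 10, pre-loaded
    Core(op=lambda a, b: a * 2, addr_a=0),            # C1: C0 * 2
    Core(op=lambda a, b: a * b, addr_a=0, addr_b=1),  # C2: C0 * C1
]
cores[0].done = True                                  # the pre-loaded literal counts as executed
cycles = 0
while not all(c.done for c in cores):
    run_cycle(cores)
    cycles += 1
print(cores[2].value, cycles)                         # 200 after 2 cycles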
Tables 7-9 are also illustrative of methods of the present disclosure, as may be implemented by the processor 100 (refer also to FIG. 7).
If the processor 100 executes routines linearly, this may lead to the communication bus 104 being shared and possible contention issues. Also, only one core 102 may be able to execute at a time. To address this, each core 102 may be made responsible for its own execution.
Accordingly, the core 102 also includes address modules 122, 124. The address modules 122, 124 may correspond to the input buffers 106, 108. In this example, there are two input buffers 106, 108 and accordingly there are two address modules 122, 124, designated as l_A 122 (corresponding to input buffer A 106) and l_B 124 (corresponding to input buffer B 108). (Instead, conceptually, the core 102 may have a single address module configured to store plural addresses.) Each address module 122, 124 is configured to store an address which points to a value to be communicated to its associated input buffer 106, 108. The inclusion of the address modules 122, 124 is another aspect which differentiates the present disclosure from conventional processors. The address modules 122, 124 may embody a distributed control system of the processor 100.
For example, and referring back to Table 7, core C2 may have the address of core C0 in l_A 122 and of core C1 in l_B 124 (or the addresses of the memory units 112 of those cores). For efficient execution, the communication bus 104 may be modified to have two portions: the first portion being for the memory address and the second portion being for the value at that address. In this fashion, every core 102 may be able to write the values of their address modules 122, 124 respectively onto the communication bus 104 and, separately, then read values into the input buffers 106, 108.
Therefore, the processor 100 may, during normal operation, not have to load any literals, opcodes, or data. Accordingly, machine code source size may no longer be a consideration, and there may no longer be a need for a centralised CU, because every core 102 can operate itself based on the content of its address modules 122, 124.
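A minimal sketch of this two-portion bus cycle (assumed Python; the names and sequencing are illustrative): each core first writes the addresses held in its address modules onto the address portion, and then reads the corresponding values from the value portion into its input buffers:

memory_units = {0: 10, 1: 20}              # values held by cores C0 and C1 (illustrative)

class Core:
    def __init__(self, i_a, i_b):
        self.i_a, self.i_b = i_a, i_b      # address modules l_A (122) and l_B (124)
        self.buf_a = self.buf_b = None     # input buffers A (106) and B (108)
    def request(self):
        return [self.i_a, self.i_b]        # write addresses onto the address portion
    def receive(self, values):
        self.buf_a, self.buf_b = values    # read values in from the value portion

c2 = Core(i_a=0, i_b=1)                    # C2 reads from C0 and C1, as in Table 7
addr_portion = c2.request()
value_portion = [memory_units[a] for a in addr_portion]
c2.receive(value_portion)                  # the buffers now hold 10 and 20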
The Applicant notes that an issue that may need to be overcome is that only one core 102 may use the bus 104 at a time. To address this issue, one or more groups of communications buses, groups of cores, and/or invocation chains may be implemented.
FIGS 3-4 illustrate cores 102 grouped into resource groups 200. Again, only a few cores 102 are illustrated (11 in FIG. 3 and four in FIG. 4) but the number may be adjusted as desired or based on intended application. Resource groups 200 may be considered a systematic grouping of cores 102. By grouping cores 102 together, they can be tightly connected with their own internal group sub-bus 206 (which forms part of the communication bus 104), thereby allowing the grouped cores 102 to share data with one another quickly inside their own group 200. Groups 200 can be specialised to specific tasks; for example, a group designed for graphics processing may consist of more floating point cores, and a group optimised for string processing may have more string manipulation cores, etc. The sub-bus 206 connects to a remainder of the bus 104 by a gateway junction 202. Alternatively, FIG. 4 could represent a grouping of four resource groups 200 of FIG. 3, rather than a grouping of four cores 102.
A way to determine the cores 102 in a group 200 is to perform a frequency analysis, e.g., on existing software, like Linux™, or a rendering engine, to determine which operations are performed most, and then, using the resulting statistics, determine the distribution and grouping of the cores 102. Once the resource groups 200 are defined, they can be interconnected or arrayed in a manner that allows any given resource group 200 to communicate with any other given group 200. FIG. 5 illustrates a fractal pattern as one example, but other layouts are contemplated.
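Such a frequency analysis might be as simple as counting operations in existing binaries or instruction traces (a Python sketch; the trace below is fabricated purely for illustration):

from collections import Counter

trace = ["add", "mul", "add", "cmp", "add", "fmul", "mul", "add"]  # assumed op trace
histogram = Counter(trace).most_common()
print(histogram)   # e.g. allocate proportionally more adder cores than comparator cores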
FIG. 5 has 16 (4^2) groups 200 of cores 102, each group 200 having four cores 102. Each group 200 has a third level bus (B_3), with a second level bus (B_2) interconnecting the groups 200, and a first level bus (B_1) interconnecting the second level buses (B_2). This pattern can be recursively extrapolated as far as practically possible with the size of a given core in mind. In other embodiments, the pattern need not be symmetrical.
For every recursion performed on this pattern, a gateway junction is created connecting the new, smaller bus to the bus above it. That way, a pattern is created that leads to subgroups in a tree-like structure. Every resource group 200 may have a gateway junction connecting it to its sub-bus. By doing this, the individual resource groups 200 can be cycled at the same time. The only time a resource group 200 may wait for time on the bus is when it requests data from another resource group 200.
In FIG. 4, an example of a small group 200 is illustrated. If a program does not need any information from an external source, all four cores 102 may safely be cycled at the same time because none of them will use the network bus to get data from another group. The same applies to groups of resource groups 200. If a core 102 in one group 200 requests data from another group 200, the rest of the resource groups 200 that are not trying to use the sub-bus 206 can still be cycled. Accordingly, two groups 200 may not hold up the rest of the processor 100 when one is waiting for data from the other.
An operating system that loads a given program into the processor 100 may be configured to re-link the program to reflect the cores and resources actually available. This way, the operating system can keep a program’s execution in as few resource groups 200 as possible, maximising performance.
As the processor 100 lacks a central CU, there may be a need to cycle cores 102 and resource groups 200 in a synchronised manner to avoid violating concurrency requirements. To do this, individual cores 102 may be chained in a chain reaction manner, and similarly resource groups 200 may be chained in a cascading reaction manner.
FIG. 6 illustrates such a chain 300. Each core 102 invokes the next core 102 in its respective chain 300 once it has completed its own execution cycle. This way, only one core 102 may use the internal sub-bus 206 at a time.
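The chain can be sketched as a linked invocation (assumed Python; the class is illustrative):

class ChainedCore:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
    def cycle(self):
        print(self.name, "executes")   # this core has the sub-bus to itself here
        if self.successor:             # chain reaction: invoke the next core
            self.successor.cycle()

c2 = ChainedCore("C2")
c1 = ChainedCore("C1", c2)
c0 = ChainedCore("C0", c1)
c0.cycle()                             # one invocation ripples down the whole chain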
In FIG. 5, each sub-bus (B_1, B_2, B_3) can be thought of as a tree structure (root at the top, expanding down). The top most bus (first level bus B_1) sends an invocation signal to the second level buses (B_2), which in turn send invocation signals to the third level buses (B_3). This may iterate over and over until the lowest sub-bus is reached, where the resource groups 200, which comprise the cores 102, are located. This may mean that every core 102 gets executed by the same clock.
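The cascading invocation down the bus hierarchy can likewise be sketched (assumed Python; the nesting mirrors FIG. 5’s B_1 to B_2 to B_3 levels, with core names invented for illustration):

bus_tree = {                            # B_1 at the root
    "B_2a": {"B_3a": ["C0", "C1", "C2", "C3"],
             "B_3b": ["C4", "C5", "C6", "C7"]},
    "B_2b": {"B_3c": ["C8", "C9", "C10", "C11"]},
}

def invoke(node):
    if isinstance(node, dict):          # an inner bus: fan the signal out
        for child in node.values():
            invoke(child)
    else:                               # a resource group: cycle its cores
        for core in node:
            print("cycle", core)

invoke(bus_tree)                        # every core ends up cycled by the same clock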
It may be acceptable for each core 102 to execute at the same time, as long as they all are in cyclic sync, e.g., every core 102 must cycle the same number of times. If one core 102 has cycled, it may be configured to wait for every other core 102 to cycle as well, before moving on to the next cycle.
FIG. 7 illustrates a method 400 implemented by the processor 100, as explained above with reference to the functionality of the processor 100. By way of summary, the processor 100 is provided (at block 402) with a plurality of cores 102, each of which is configured to perform a single type of operation. Addresses pointing to the values to be inputted to the cores 102 are stored (at block 404) in the address modules 122, 124. Then, as explained above, an output is calculated (at block 406) based on the input values which the address modules 122, 124 point to, according to the operation type which the logic unit 110 of the core 102 is configured to perform. This may be repeated (at block 408) as many times as required.
The disclosure may provide one or more of the following advantages:
Machine Code Size Bottleneck: Elimination of the need to cycle different opcodes every cycle, by creating a processor where every line of machine code is executed in a single cycle, as well as only using cores with a single function to eliminate the concept of opcodes.
Control Unit Bottleneck: Eliminating the need for a control unit by replacing it with a distributed control system, thereby removing the need for a linear execution cycle.
Bus Usage Bottleneck: Solving the problem of bus usage limitations, by subdividing groups.

Claims

1. A processor which comprises:
a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit, wherein the logic unit of each core is configured to perform or execute only one type of operation;
a communication bus configured to interconnect the cores; and a plurality of address modules respectively associated with the plurality of cores, each address module being configured to store an address which points to a value to be communicated to the input buffer,
wherein each core is configured to calculate an output based on the input value and the type of operation which its logic unit is configured to perform.
2. The processor of claim 1, wherein the type of operation which each core is configured to perform is a statically pre-defined opcode or a statically pre-defined segment of opcode.
3. The processor of claim 1, wherein each core comprises two input buffers.
4. The processor of claim 1, wherein the logic unit is an Arithmetic Logic Unit (ALU), a Floating-Point Unit (FPU), or a Graphics Processing Unit (GPU).
5. The processor of claim 1, wherein no CU is provided.
6. The processor of claim 1, wherein: each core includes a core identifier; and
the cores are executed sequentially or cyclically.
7. The processor of claim 1, wherein each core comprises an execution complete flag which is configured to indicate whether or not that core has executed or completed an instruction within a particular cycle.
8. The processor of claim 7, wherein each core is configured to execute its instruction only if one or more cores earlier in the cycle have already executed their instruction, as indicated by the execution complete flag.
9. The processor of claim 1, wherein each address module comprises an address or pointer to a memory location.
10. The processor of claim 1, wherein each core comprises as many address modules, or wherein each address module stores as many memory addresses, as each core has input buffers.
11. The processor of claim 1, wherein each core is configured to fetch the value pointed to by the address in the address module at the beginning of, or during, each execution cycle.
12. The processor of claim 1, wherein the communication bus comprises two portions, a first portion to communicate values and a second portion to communicate addresses pointing to the values.
13. The processor of claim 12, wherein the first portion of the communication bus is coupled to each input buffer and the second portion of the communication bus is coupled to the address modules of each core.
14. The processor of claim 1, in which the cores are divided into resource groups.
15. The processor of claim 14, in which the cores within a resource group are configured to perform the same or similar types of operations.
16. The processor of claim 14, in which the communication bus is divided into groups or sub-buses, wherein a sub-bus links cores within a resource group and then a larger sub-bus links resource groups.
17. The processor of claim 16, wherein a gateway junction is provided between adjacent levels of the sub-buses, which results in a tree-like structure or system.
18. The processor of claim 17, wherein every resource group has a gateway junction connecting it to a higher level sub-bus, which enables multiple resource groups to be cycled or executed simultaneously.
19. A method of operating a processor, the method comprising:
providing a plurality of cores, each core comprising:
at least one input buffer;
a logic unit having an input and an output, wherein the input is in communication with the input buffer; and
a memory unit in communication with the output of the logic unit; executing, by the logic unit of each core, only one type of operation;
interconnecting the cores by a communication bus;
storing, by a plurality of address modules respectively associated with the plurality of cores, an address which points to a value to be communicated to the input buffer; and calculating, by at least one of the cores, an output based on the input value and the type of operation which its logic unit is configured to perform.
20. A non-transitory computer-readable medium which has stored thereon a set of instructions which, when executed by a computer processor, cause the computer processor to perform the method of claim 19.
PCT/IB2018/056875 2018-09-10 2018-09-10 A processor and a method of operating a processor WO2020053618A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor
ZA2021/01831A ZA202101831B (en) 2018-09-10 2021-03-18 A processor and a method of operating a processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor

Publications (1)

Publication Number Publication Date
WO2020053618A1 true WO2020053618A1 (en) 2020-03-19

Family

ID=69776845

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2018/056875 WO2020053618A1 (en) 2018-09-10 2018-09-10 A processor and a method of operating a processor

Country Status (2)

Country Link
WO (1) WO2020053618A1 (en)
ZA (1) ZA202101831B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070192547A1 (en) * 2005-12-30 2007-08-16 Feghali Wajdi K Programmable processing unit
US20080086626A1 (en) * 2006-10-05 2008-04-10 Simon Jones Inter-processor communication method
US7958416B1 (en) * 2005-11-23 2011-06-07 Altera Corporation Programmable logic device with differential communications support
US20140122555A1 (en) * 2012-10-31 2014-05-01 Brian Hickmann Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values
US20150242322A1 (en) * 2013-06-19 2015-08-27 Empire Technology Development Llc Locating cached data in a multi-core processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7958416B1 (en) * 2005-11-23 2011-06-07 Altera Corporation Programmable logic device with differential communications support
US20070192547A1 (en) * 2005-12-30 2007-08-16 Feghali Wajdi K Programmable processing unit
US20080086626A1 (en) * 2006-10-05 2008-04-10 Simon Jones Inter-processor communication method
US20140122555A1 (en) * 2012-10-31 2014-05-01 Brian Hickmann Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values
US20150242322A1 (en) * 2013-06-19 2015-08-27 Empire Technology Development Llc Locating cached data in a multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RANGER ET AL.: "Evaluating MapReduce for multi-core and multiprocessor systems", 2007 IEEE 13TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE, 14 February 2007 (2007-02-14), pages 13 - 24, XP055111222 *

Also Published As

Publication number Publication date
ZA202101831B (en) 2022-07-27

Similar Documents

Publication Publication Date Title
CN107347253B (en) Hardware instruction generation unit for special purpose processor
JP5035277B2 (en) A locking mechanism that allows atomic updates to shared memory
US8544019B2 (en) Thread queueing method and apparatus
US6237021B1 (en) Method and apparatus for the efficient processing of data-intensive applications
US20120066668A1 (en) C/c++ language extensions for general-purpose graphics processing unit
CN104050033A (en) System and method for hardware scheduling of indexed barriers
US20110072249A1 (en) Unanimous branch instructions in a parallel thread processor
US9110692B2 (en) Method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system
CN103197916A (en) Methods and apparatus for source operand collector caching
CN103226463A (en) Methods and apparatus for scheduling instructions using pre-decode data
US20210232394A1 (en) Data flow processing method and related device
CN103885893A (en) Technique For Accessing Content-Addressable Memory
CN103279379A (en) Methods and apparatus for scheduling instructions without instruction decode
CN113326066B (en) Quantum control microarchitecture, quantum control processor and instruction execution method
US9378533B2 (en) Central processing unit, GPU simulation method thereof, and computing system including the same
CN104050032A (en) System and method for hardware scheduling of conditional barriers and impatient barriers
CN103885902A (en) Technique For Performing Memory Access Operations Via Texture Hardware
CN112580792B (en) Neural network multi-core tensor processor
CN103885903A (en) Technique For Performing Memory Access Operations Via Texture Hardware
CN116783578A (en) Execution matrix value indication
CN103294449A (en) Pre-scheduled replays of divergent operations
WO2020053618A1 (en) A processor and a method of operating a processor
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
CN116401039A (en) Asynchronous memory deallocation
Sakai et al. Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18933547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18933547

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/09/2021)
