WO2022231733A1 - Method and apparatus for desynchronizing execution in a vector processor - Google Patents

Info

Publication number
WO2022231733A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
register
access control
memory access
Prior art date
Application number
PCT/US2022/021525
Other languages
French (fr)
Inventor
Christopher I. W. Norrie
Original Assignee
Microchip Technology Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/468,574 external-priority patent/US20220342668A1/en
Priority claimed from US17/669,995 external-priority patent/US20220342590A1/en
Priority claimed from US17/701,582 external-priority patent/US11782871B2/en
Application filed by Microchip Technology Inc. filed Critical Microchip Technology Inc.
Priority to DE112022000535.1T priority Critical patent/DE112022000535T5/en
Priority to CN202280017945.7A priority patent/CN117083594A/en
Publication of WO2022231733A1 publication Critical patent/WO2022231733A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/3834Maintaining memory consistency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • G06F9/3455Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results using stride
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding

Definitions

  • The present method and apparatus pertain to a vector processor. More particularly, the present method and apparatus relate to a Method and Apparatus for Desynchronizing Execution in a Vector Processor.
  • VPU vector processing unit
  • a vector processor unit is provided with preload registers for vector length, vector constant, vector address, and vector stride, with each preload register having an input and an output. All the preload register inputs are coupled to receive new vector parameters. Each preload register's output is coupled to a first input of a respective multiplexor, and a second input of each respective multiplexor is coupled to receive the new vector parameters.
  • Figure 1 illustrates, generally at 100, a block diagram overview of a decode unit according to an example.
  • Figure 2 illustrates, generally at 200, a block diagram overview of vector registers for addressing a memory access control.
  • Figure 3 illustrates, generally at 300, a block diagram overview of a portion of a vector processor unit comprising memory access control preload registers.
  • Figure 4 illustrates, generally at 400, a flowchart showing desynchronous execution of an instruction and synchronous execution of an instruction.
  • Figure 5 illustrates, generally at 500, a flowchart showing asynchronous, desynchronous, and synchronous execution of an instruction.
  • Figure 6 illustrates, generally at 600, a flowchart showing execution of vector instructions.
  • Figure 7 illustrates, generally at 700, a flowchart showing execution of desynchronized vector instructions in addition to non-desynchronized instructions.
  • a Method and Apparatus for Desynchronizing Execution in a Vector Processor is disclosed.
  • Desynchronized execution - is the act of an instruction performing a substantial component of its operation independent of the pipeline control.
  • the pipeline control can therefore control execution and completion of one or more instructions following the instruction undergoing desynchronized execution prior to completion of the desynchronized execution.
  • Desynchronized instruction is an instruction whose execution is not 100% under control of the pipeline control, i.e. a substantial component of its operation is not under control of the pipeline control, however the pipeline control can monitor its progression.
  • Non-desynchronized instruction is an instruction that does not execute desynchronously.
  • Resynchronized execution stops an instruction subsequent to a desynchronized instruction from executing until the desynchronized instruction completes. This occurs if the subsequent instruction would modify a critical processor state, in particular if that processor state would affect the results of the desynchronized instruction.
  • the pipeline control cannot monitor its progression. Meanwhile the processor can continue executing instructions.
  • Asynchronous reserialization waits for an asynchronous execution to complete before allowing a subsequent instruction to execute. Generally, this is in order to maintain the integrity of the program's results.
  • In desynchronized execution, the processor has complete control over the two instructions that are executing even though it allows the second instruction to modify processor state before the first (desynchronized) instruction has completed.
  • In asynchronous execution, the processor has no control over the timing in which the activity external to the processor invoked by the asynchronous instruction will complete.
  • While, for clarity of explanation, we generally discuss non-vector instructions that execute when a desynchronized vector instruction executes, the desynchronization method disclosed is not so limited.
  • a second vector instruction may be allowed to execute in a desynchronized manner while a first desynchronized vector instruction is executing.
  • long-running instructions other than vector instructions (i.e. instructions taking a longer time than other instructions to complete execution) are also candidates for desynchronized execution.
  • Modifying/changing/copying/transferring registers refers to modifying/changing/copying/transferring values or parameters stored within register(s).
  • copying a first register to a second register is to be understood as copying the contents or parameters contained or held in the first register into the second register such that the second register now contains the value or parameter of the first register.
  • Contention refers to two or more processes, such as, but not limited to, executing instructions trying to alter or access the same entity, such as, but not limited to a memory or register where the alteration would introduce uncertainty in the result of processing. For example, if two executing instructions are attempting to both alter a specific memory location, this is contention for the resource, i.e. contention for the same specific memory location. The contention may result in a different result in processing depending on which instruction completes execution first.
  • a desynchronization contention is a contention between an executing desynchronized instruction and another instruction that will affect the processor output resulting in a different output depending upon which instruction completes execution first.
  • an asynchronous contention is a contention between an executing asynchronous instruction and another instruction that will affect the processor output resulting in a different output depending upon which instruction completes execution first.
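The contention definitions above can be illustrated with a minimal Python sketch. It assumes each instruction can be summarized by the set of entities it reads and the set it writes; the `Instr` record, the `has_contention` helper, and the resource names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction summary: the entities it reads and writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def has_contention(in_flight: Instr, candidate: Instr) -> bool:
    """True if the outcome could depend on which instruction completes first."""
    return bool(
        in_flight.writes & candidate.writes    # both alter the same entity
        or in_flight.writes & candidate.reads  # candidate reads what in_flight alters
        or in_flight.reads & candidate.writes  # candidate alters what in_flight reads
    )


# Two instructions that both alter the same memory location contend for it.
store_a = Instr("store_a", writes=frozenset({"mem[100]"}))
store_b = Instr("store_b", writes=frozenset({"mem[100]"}))
# A register-only instruction does not contend with either of them.
add = Instr("add", reads=frozenset({"r7", "r8"}), writes=frozenset({"r7"}))

print(has_contention(store_a, store_b))  # True
print(has_contention(store_a, add))      # False
```

In this sketch two instructions that only read the same entity are not treated as contending, since no alteration is involved.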
  • Vector parameters/new vector parameters refers to information about a vector. In one example it may be a plurality of signals. More specifically, it is information needed by the processor to access memory (e.g. to read and write a vector). “New” refers to the situation where the processor is already using vector parameters and a new vector operation is being queued up or placed in the pipeline for future execution. The vector parameters for this vector operation are called “new vector parameters” to distinguish them from vector parameters that are currently being used in a vector instruction that is executing.
  • a vector processor unit having preload registers for vector length, vector constant, vector address, and vector stride is provided.
  • Each preload register has a respective input and a respective output. All the preload register inputs are coupled to receive the new vector parameters.
  • Each preload register's output is coupled to a first input of a respective multiplexor, and a second input of each respective multiplexor is coupled to receive the new vector parameters.
  • resynchronization and asynchronous reserialization mechanisms that determine when desynchronized and asynchronous execution can occur and mechanisms that stop instruction execution if the desynchronized and/or asynchronous execution must complete (called resynchronization and asynchronous reserialization respectively), generally in order to maintain the integrity of the program's results.
  • the methods disclosed not only allow desynchronized and asynchronous execution but also limit the cases when resynchronization or asynchronous reserialization is to be performed since resynchronization and asynchronous reserialization reduce program performance.
  • Figure 1 illustrates, generally at 100, a block diagram overview of a decode unit.
  • At 102 is an instruction fetch control which fetches instructions from a memory system.
  • the memory system, while not germane to the understanding of the decode unit 100, can be, for example, random access memory (RAM).
  • the instruction fetch control 102 outputs via 103 information to instruction decode 104, and outputs via 105 execute/halt information to operation state control 106 and to pipeline control 108.
  • the instruction decode 104 outputs via 107 information to stall detection 112, result bypass detection 114, and resource allocation tracking 116.
  • Pipeline control 108 outputs via 117 information to resource allocation tracking 116.
  • Resource allocation tracking 116 outputs via 119 information to result bypass detection 114, and stall detection 112.
  • Result bypass detection 114 outputs via 115 information to pipeline control 108.
  • Stall detection 112 outputs via 113 information to pipeline control 108.
  • Pipeline control 108 via 121 outputs and receives information to/from register unit 118, memory access control unit 120, scalar arithmetic logic units (ALUs) 122, vector arithmetic logic units (ALUs) 124 and branch unit 126.
  • Branch unit 126 outputs via 125 information to instruction fetch control 102.
  • Branch unit 126 outputs via 123 information to fault control 110.
  • Vector ALUs 124 outputs via 123 information to fault control 110.
  • Scalar ALUs 122 outputs via 123 information to fault control 110.
  • Memory access control unit 120 outputs via 123 information to fault control 110.
  • Register unit 118 outputs via 123 information to fault control 110.
  • Fault control 110 outputs via 109 to pipeline control 108, and via 111 to operational state control 106.
  • Branch unit 126 receives via 127 information output from scalar ALUs 122 and information from vector ALUs 124.
  • pipeline control 108 communicates, inter-alia, with register unit 118, memory access control unit 120, scalar ALUs 122, and vector ALUs 124.
  • Pipeline control 108 attempts to keep the processor in which decode unit 100 is situated running as fast as it can by trying to avoid stopping any scalar or vector ALUs from serially processing what can be done in parallel. It is in a simple sense a traffic cop directing traffic so as to improve throughput.
  • Figure 2 illustrates, generally at 200, a block diagram overview of vector registers for addressing a memory access control.
  • At 201 is new vector parameters, i.e. 201 represents the receipt of new vector parameters to be loaded.
  • New vector parameters 201 is coupled to the input of vector length register 202 and the output of vector length register 202 is coupled via 203 to memory access control 220.
  • New vector parameters 201 is also coupled to the input of vector constant register 204 and the output of vector constant register 204 is coupled via 205 to memory access control 220.
  • New vector parameters 201 is also coupled to the input of vector address register 206 and the output of vector address register 206 is coupled via 207 to memory access control 220.
  • New vector parameters 201 is coupled to the input of vector stride register 208 and the output of vector stride register 208 is coupled via 209 to memory access control 220. While vector length register 202, vector constant register 204, vector address register 206 and vector stride register 208 are illustrated, in some examples one or more of vector length register 202 and vector constant register 204 are not provided.
  • Memory access control 220 is a functional block, not a register. It takes in as inputs the vector length provided via 203 from vector length register 202, the vector constant provided via 205 from vector constant register 204, the vector address provided via 207 from vector address register 206, and the vector stride provided via 209 from the vector stride register 208.
  • the combination of vector length register 202, vector constant register 204, vector address register 206 and vector stride register 208 can be called Vector Control and memory access control 220 can be called a Memory Subsystem. That is Vector Control controls addressing to a Memory Subsystem.
  • the Memory Subsystem can include RAM (not shown).
  • Figure 3 illustrates, generally at 300, a block diagram overview of a portion of a vector processor unit comprising memory access control preload registers.
  • New vector parameters 301 is coupled to the input of vector length preload register 302 and the output of vector length preload register 302 is coupled via 303 to a first input of a respective multiplexor 310.
  • the second input of multiplexor 310 is coupled to new vector parameters 301, i.e. bypassing vector length preload register 302.
  • the output of multiplexor 310 is coupled via 311 to a vector length register 322.
  • the output of vector length register 322 is coupled via 323 to memory access control 320.
  • New vector parameters 301 is coupled to the input of vector constant preload register 304 and the output of vector constant preload register 304 is coupled via 305 to a first input of respective multiplexor 312.
  • the second input of multiplexor 312 is coupled to new vector parameters 301 , i.e. bypassing vector constant preload register 304.
  • the output of multiplexor 312 is coupled via 313 to a vector constant register 324.
  • the output of vector constant register 324 is coupled via 325 to memory access control 320.
  • New vector parameters 301 is coupled to the input of vector address preload register 306 and the output of vector address preload register 306 is coupled via 307 to a first input of respective multiplexor 314.
  • the second input of multiplexor 314 is coupled to new vector parameters 301 i.e. bypassing vector address preload register 306.
  • the output of multiplexor 314 is coupled via 315 to a vector address register 326.
  • the output of vector address register 326 is coupled via 327 to memory access control 320.
  • New vector parameters 301 is coupled to the input of vector stride preload register 308 and the output of vector stride preload register 308 is coupled via 309 to a first input of multiplexor 316.
  • the second input of multiplexor 316 is coupled to new vector parameters 301 i.e. bypassing vector stride preload register 308.
  • the output of multiplexor 316 is coupled via 317 to a vector stride register 328.
  • the output of vector stride register 328 is coupled via 329 to memory access control 320.
  • While vector length preload register 302, vector constant preload register 304, vector address preload register 306, vector stride preload register 308, vector length register 322, vector constant register 324, vector address register 326 and vector stride register 328, with the respective multiplexors 310, 312, 314, 316, are illustrated, in some examples one or more of vector length preload register 302, vector length register 322, vector constant preload register 304 and vector constant register 324, and the respective multiplexors, are not provided.
  • At 330 is multiplexor control.
  • An output of multiplexor control 330 is coupled via 331 to respective control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310. That is, control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310 are all controlled via link 331 which is output from multiplexor control 330.
  • link 331 carries a single signal to all of the control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310, and in another example link 331 carries a respective signal to each of the control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310, so that they are individually controllable.
  • Multiplexor control 330 identifies whether memory access control registers 350 are to be loaded with new vector parameters setup 301 or from the respective outputs of memory access control preload registers 340, as described below, and therefore controls link 331 so as to update memory access control registers 350 at correct points between two desynchronized vector arithmetic operations.
  • the update is from the preload registers (302, 304, 306, 308) to the registers (322, 324, 326, 328), or from new vector parameters 301 to the registers (322, 324, 326, 328).
  • multiplexor control 330 further controls writing to each of the preload registers (302, 304, 306, 308) and the registers (322, 324, 326, 328).
  • Vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 together comprise memory access control preload registers 340. Individually, each of vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 is considered a memory access control preload register.
  • Vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 together comprise memory access control registers 350. Individually, each of vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 is considered a memory access control register.
  • Memory access control 320 is a functional block, not a register. It takes in as inputs the vector length, the vector constant, the vector address, and the vector stride register values (provided by respective memory access control registers 322, 324, 326, 328 via respective links 323, 325, 327, 329). Registers 322, 324, 326, 328, and their respective parameters communicated via links 323, 325, 327, 329, are what can be called Vector Control, and memory access control 320 can be called a Memory Subsystem. That is, Vector Control controls addressing to a Memory Subsystem.
  • the Memory Subsystem can include RAM (not shown).
  • the multiplexor control 330 is considered to be in a non-preload position when new vector parameters 301 pass through multiplexors 310, 312, 314, and 316 respectively, and then via 311, 313, 315, and 317 respectively, into vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328.
  • the multiplexor control 330 is considered to be in a preload position when multiplexors 310, 312, 314, and 316 respectively receive inputs from vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 respectively via 303, 305, 307, and 309 respectively.
  • the memory access control registers 350 receive parameters from the memory access control preload registers 340.
  • multiplexor control 330 controls write signals to the memory access control registers 350 and the memory access control preload registers 340. In this way multiplexor control 330 controls which registers receive the new vector parameters 301.
  • multiplexor 310 is considered a first multiplexor.
  • multiplexor 312 is considered a second multiplexor.
  • multiplexor 314 is considered a third multiplexor.
  • multiplexor 316 is considered a fourth multiplexor.
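The preload and non-preload positions described above can be sketched for a single parameter as follows. This is a minimal behavioral model in Python, assuming one preload register, one multiplexor, and one memory access control register per parameter; the class name `ParameterLane` and its methods are hypothetical, and the timing of the write signals is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ParameterLane:
    """One preload register, its multiplexor, and the register it feeds."""
    preload: int = 0  # e.g. vector length preload register 302
    active: int = 0   # e.g. vector length register 322

    def write_preload(self, new_value: int) -> None:
        # New vector parameters are captured in the preload register only.
        self.preload = new_value

    def update_active(self, new_value: int, preload_position: bool) -> None:
        # Multiplexor: the first input is the preload register output,
        # the second input is the new vector parameter (the bypass path).
        self.active = self.preload if preload_position else new_value


lane = ParameterLane()
lane.update_active(100, preload_position=False)  # non-preload: bypass path
lane.write_preload(200)                          # active register untouched
print(lane.active)                               # 100
lane.update_active(0, preload_position=True)     # preload: take preload output
print(lane.active)                               # 200
```

With a single shared control signal, all four lanes would be called with the same `preload_position`; with individual control signals, each lane could be given its own value.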
  • Figure 4 illustrates, generally at 400, a flowchart showing desynchronous execution of an instruction and synchronous execution of an instruction.
  • fetch the next instruction to execute. Then proceed via 403 to 404.
  • When there is a desynchronization contention (Yes), go via 419 to 420.
  • When the next instruction to execute does not affect or is not dependent on the results of any current desynchronized instruction in progress (No), go via 405 to 430, optional asynchronous execution.
  • At 420 resynchronize execution by waiting for all desynchronized operations to complete before proceeding via 405 to 430, optional asynchronous execution.
  • When there is no optional asynchronous execution 430 then proceed via 409 to 410.
  • At 410 determine if the fetched next instruction can execute desynchronously. When the next instruction can execute desynchronously (Yes) then proceed via 411 to 412. At 412 initiate desynchronous execution by allowing the processor to execute the fetched next instruction desynchronously, that is, the completion of the fetched next instruction occurs desynchronously with respect to the control of the processor, but the processor tracks when an internal signal is given that indicates the operation is complete. The processor does not wait for this completion signal before continuing via 415 to 402. When the next instruction cannot execute desynchronously (No) then proceed via 413 to 414.
  • At 414 initiate synchronous execution by allowing the processor to execute the fetched next instruction synchronously, that is, the instruction has the appearance to the program that it fully completes before continuing via 415 to 402.
  • the processor may be pipelined or employ other overlapped execution techniques, however it does so in a manner that makes it appear to a program that it completes the instruction before continuing to 402.
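The decisions of Figure 4 can be sketched as a small loop. This is a simplified Python model, not the disclosed hardware: the optional asynchronous execution at 430 is omitted, a desynchronized instruction is assumed to remain in flight until a resynchronization occurs, and the function `run` and its two predicate arguments are hypothetical names.

```python
def run(program, has_contention, can_desync):
    """Walk the Figure 4 decisions for each instruction and return a trace."""
    in_flight = []  # desynchronized instructions not yet known to be complete
    trace = []
    for instr in program:                                    # fetch next instruction
        if any(has_contention(d, instr) for d in in_flight):  # decision at 404
            trace.append(("resync", instr))                  # 420: wait for completion
            in_flight.clear()
        if can_desync(instr):                                # decision at 410
            in_flight.append(instr)                          # 412: desynchronous
            trace.append(("desync", instr))
        else:
            trace.append(("sync", instr))                    # 414: synchronous
    return trace


# Assumption for this example: a single vector ALU, so two vector
# instructions contend with each other, and only vector instructions
# are allowed to execute desynchronously.
VECTOR_OPS = {"sqrt", "log"}
trace = run(
    ["sqrt", "add", "log"],
    has_contention=lambda d, i: d in VECTOR_OPS and i in VECTOR_OPS,
    can_desync=lambda i: i in VECTOR_OPS,
)
print(trace)
# [('desync', 'sqrt'), ('sync', 'add'), ('resync', 'log'), ('desync', 'log')]
```

The scalar `add` proceeds while `sqrt` is in flight, whereas `log` forces a resynchronization before it can itself be desynchronized.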
  • the multiplexor control 330 in Figure 3 can perform the updating of the memory access control registers 350 at correct points between the two desynchronized vector arithmetic operations.
  • One vector instruction can desynchronize from the executing instructions in the pipeline, allowing another instruction to execute. If a subsequent instruction has a resource contention with the desynchronized instruction then the subsequent instruction must wait until the contention goes away - this is one example of a desynchronization contention, as described in relation to 404. However, if you can execute a second vector instruction without causing a resource contention, the second vector instruction may execute desynchronized.
  • Instructions that qualify for desynchronized execution include any long-running instruction, as this allows subsequent instructions to complete their execution while the desynchronized instruction is executing. So, the execution time for subsequent instructions which are executed while the desynchronized instruction is executing is effectively reduced because they do not wait on the desynchronized instruction to complete execution.
  • Since vector instructions are long running and represent the bulk of the work in a vector processor, ideally all non-vector instructions would be allowed to execute while a desynchronized vector instruction executes. If this can be achieved, then the processing time is bounded by the execution of the vector instructions, as all other instructions would be executed while the desynchronized vector instructions are executing.
  • Vector instructions read operands from memory and write results to memory. Therefore, instructions that don’t access memory are candidates for execution when a desynchronized vector instruction is executing. These instructions include all scalar arithmetic instructions whose operands come from, and results go to, a register set. They also include instructions that access memory using either a different memory or a different region of memory than a desynchronized vector instruction. This can include subroutine calls and returns, and pushing and popping parameters from a stack, without limitation.
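The bound on processing time can be illustrated with a small timing model in Python. The cycle counts are invented for illustration, and the model assumes a single vector ALU, no contention between scalar and vector instructions, and zero cost to issue a vector instruction; `total_cycles` is a hypothetical helper.

```python
def total_cycles(program, desync):
    """program is a list of (kind, cycles); kind is "vector" or "scalar"."""
    if not desync:
        # Every instruction completes before the next one starts.
        return sum(cycles for _, cycles in program)
    t = 0                  # time seen by the pipeline control
    vector_busy_until = 0  # time at which the vector ALU becomes free
    for kind, cycles in program:
        if kind == "vector":
            start = max(t, vector_busy_until)  # resynchronize if the ALU is busy
            vector_busy_until = start + cycles
            t = start                          # pipeline continues immediately
        else:
            t += cycles                        # scalar work overlaps the vector op
    return max(t, vector_busy_until)


program = [("vector", 100), ("scalar", 1), ("scalar", 1),
           ("vector", 100), ("scalar", 1)]
print(total_cycles(program, desync=False))  # 203
print(total_cycles(program, desync=True))   # 200
```

Under these assumptions the desynchronized total equals the sum of the vector instruction times, because the scalar instructions execute entirely while a vector instruction is in progress.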
  • contention causing instructions could also execute in parallel with a desynchronized vector instruction.
  • Vector length, vector constant, vector address, and vector stride are entities that can reside in registers, for example, in memory access control registers 350 (e.g. 322, 324, 326, and 328 respectively) in Figure 3 and via 323, 325, 327, and 329 respectively are in communication with memory access control 320.
  • Vector length preload, vector constant preload, vector address preload, and vector stride preload are entities that can reside in memory access control preload registers 340 (e.g. 302, 304, 306, and 308 respectively in Figure 3) and via 303, 305, 307, and 309 respectively are in communication with multiplexor 310, 312, 314, and 316 respectively.
  • The vector length, vector constant, vector address, and vector stride, collectively called a vector port for ease of discussion, allow addressing a vector in the memory access control 320, called a memory for ease of discussion.
  • the vector port addresses the memory to point to a vector.
  • a vector length is a length of a vector in the memory.
  • a vector constant is a constant that is used when operating on a vector. For example, if there is a need to multiply every element of a vector A by 2, then vector B, the multiplier, is a vector whose elements all have the value 2. Instead of requiring vector B to be resident in memory, the vector constant can be in a register that specifies the value of each element of vector B.
  • a vector address is an address where a vector is to be found.
  • a vector stride is a stride value that is added to the vector address each time an access is made into memory for a vector element. For example, the stride may be equal to 1 if the vector is a row of a matrix but it may be set to N if it is a column of a matrix that has N elements in each row.
  • Vector address, and vector stride are used to address memory locations where a vector can be read or written.
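How the vector address, stride, length, and constant combine can be sketched in Python. This assumes a word-addressed memory modelled as a list and a row-major matrix layout; the helper names and the base address 100 used for the matrix are illustrative assumptions.

```python
def vector_addresses(address, stride, length):
    """Memory locations touched by a vector: the stride is added per element."""
    return [address + i * stride for i in range(length)]


def multiply_by_constant(memory, addresses, constant):
    """Use a vector constant instead of a second vector resident in memory."""
    return [memory[a] * constant for a in addresses]


# A 3 x 4 row-major matrix starting at address 100 (N = 4 elements per row).
N = 4
row_1 = vector_addresses(address=100 + 1 * N, stride=1, length=N)
col_1 = vector_addresses(address=100 + 1, stride=N, length=3)
print(row_1)  # [104, 105, 106, 107]
print(col_1)  # [101, 105, 109]

# Multiply every element of row 1 by 2 without storing a vector of 2s.
memory = list(range(200))  # memory[a] == a, purely for illustration
print(multiply_by_constant(memory, row_1, 2))  # [208, 210, 212, 214]
```

A row uses a stride of 1 and a column uses a stride of N, matching the matrix example in the text above.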
  • sas0 r0 r1 // set-address-and-stride for vector 0, v0: address is 100, stride is 1
  • sas1 r0 r1 // set-address-and-stride for vector 1, v1: address is 100, stride is 1
  • pipeline control 108, also called pipe control
  • pipeline control 108 needs to allow the sqrt instruction to execute desynchronized so pipeline control 108 can allow the execution of subsequent instructions (in the example above, add r7 r8, and div r7 r9).
  • pipeline control 108 may need to resynchronize a desynchronized operation if it is still in progress. For example, if the vector ALU, 124, only supports one vector operation at a time, then the following demonstrates a resynchronization:
  • sqrt v0 v1 // this desynchronizes from pipeline control, i.e. allows the sqrt instruction to execute desynchronized
  • log v0 v1 // pipeline control must resynchronize since this cannot execute yet due to resource contention (of v0 and v1), that is, log v0 v1 is attempting to use v0 and v1 however we don’t know if sqrt v0 v1 is finished yet (with v0 and v1), so it must resynchronize
  • Resynchronization represents a performance loss and, although sometimes necessary, is undesirable.
  • the vector ALU 124 should be kept as busy as possible, with a utilization as close to 100% as practical since the bulk of the work in a vector processor is the processing of vectors.
  • mov r0 200 // set up a new vector operation for the operand and result vectors in mem[200,201,202,295]
  • log v0 v1 // mem[200,201,202,295] gets the log of mem[200,201,202,295]
  • the second occurrence of the sas0, sas1, and slen instructions changes the locations in memory that define where the operand and result vectors reside. But if the sqrt instruction is still executing desynchronized when these instructions are executed, they will adversely affect the sqrt because the vectors for the sqrt are unexpectedly having their addresses, strides, and lengths changed. So the second occurrence of sas0 must cause a resynchronization, which is not desirable.
  • Figure 3 shows an example of how resynchronization can be avoided.
  • the second occurrence of the sas0, sas1, and slen instructions can be allowed to execute while the desynchronized sqrt is executing by writing the operand and result ports into the memory access control preload registers 302, 306, and 308 rather than into the memory access control registers 322, 326, and 328.
  • Multiplexor control 330, which is controlled by pipeline control 108, recognizes the attempt to modify one of the memory access control registers 350 while a desynchronized operation is in progress and causes the memory access control preload registers 340 to be written instead; that is, multiplexor control 330 decides whether the memory access control registers 350 or the memory access control preload registers 340 are written. Therefore, memory access control registers 350 are not affected by a subsequent instruction while a desynchronized operation is in progress, and the desynchronized operation is therefore not adversely affected.
  • Pipeline control 108 further recognizes when the desynchronized operation is complete and if any of memory access control preload registers 340 have been modified then their contents are moved into the respective one of memory access control registers 350 by multiplexor control 330 of pipeline control 108.
  • the vector log instruction can now execute and, being a vector instruction, can execute in a desynchronized manner. If multiple vector instructions cannot execute in parallel, the vector log will resynchronize first, responsive to pipeline control 108, so that only one desynchronized vector instruction at a time is executing.
  • the above allows the vector unit to remain near 100% busy (ignoring any inefficiencies of startup in a particular implementation).
  • the vector ALU 124 went from performing square-roots on each element of one vector to immediately performing logarithms on another vector, thereby satisfying the objective of keeping the vector ALU 124 nearly 100% busy.
  • Pipeline control 108 recognizes this and via multiplexor control 330 allows memory access control registers 350 to be updated immediately by the new vector parameters 301 without having to use memory access control preload registers 340.
  • Figure 3 represents a method where desynchronized execution may continue and allow additional instructions to execute even when those instructions have a resource contention because the arrangement of Figure 3 resolves the resource contention.
  • the particular example shown in Figure 3 is illustrative and not limiting in scope.
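The preload behavior described for Figure 3 can be sketched as a small software model. This is a simplified illustration under stated assumptions, not the disclosed hardware: the class name `MemoryAccessControl`, its method names, and the use of a single set of registers are hypothetical, and the register values are arbitrary.

```python
class MemoryAccessControl:
    """Toy model of the active registers (350) and preload registers (340)."""

    FIELDS = ("length", "constant", "address", "stride")

    def __init__(self):
        self.active = dict.fromkeys(self.FIELDS, 0)
        self.preload = {}          # holds only fields written while busy
        self.desync_busy = False   # True while a desynchronized op is running

    def write(self, field, value):
        """Model an instruction that sets a vector parameter."""
        if field not in self.FIELDS:
            raise KeyError(field)
        if self.desync_busy:
            self.preload[field] = value   # steer the write to the preload register
        else:
            self.active[field] = value    # nothing in flight: write directly

    def start_vector_op(self):
        """Begin a desynchronized vector operation using the active registers."""
        self.desync_busy = True
        return dict(self.active)

    def complete_vector_op(self):
        """On completion, move any modified preload values into the active set."""
        self.active.update(self.preload)
        self.preload.clear()
        self.desync_busy = False


mac = MemoryAccessControl()
mac.write("address", 100)
mac.write("stride", 1)
running = mac.start_vector_op()              # e.g. the sqrt starts
mac.write("address", 200)                    # later set-address lands in preload
address_during_sqrt = mac.active["address"]  # the sqrt still sees its own address
mac.complete_vector_op()                     # sqrt done: preload becomes active
address_after_sqrt = mac.active["address"]
```

The point of the sketch is that the write issued while the operation is in flight neither stalls nor disturbs the parameters in use, which is the resynchronization that the arrangement of Figure 3 avoids.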
  • Asynchronous execution is a form of desynchronized execution when certain actions cannot be predicted or anticipated because they are beyond the control of the processor.
  • An example of this is the programmatic loading or saving of local memory with an external memory or device. If a program instruction initiates the procedure for an external process to read out the local RAM and do something with the data, such as save it to an external memory, then the pipeline control 108 (also called pipe control, in Figure 1) has no idea when that external process will actually read the local memory. Similarly, if a program instruction initiates the procedure for an external process to load new data into the local RAM, then the pipe control 108 has no idea when that data will actually be written and usable by the processor.
  • the pipeline control 108 also called pipe control
  • r1 , r2, and r3 are registers that contain the desired values for the operation.
  • pipe control 108 continues executing the instructions that follow the xload or xsave, just as it does for desynchronized execution.
  • This variation of desynchronized execution is called asynchronous execution, as certain activities of the xload and xsave instructions are carried out asynchronously with respect to pipe control 108.
  • Asynchronous execution allows faster program execution. However, an issue similar to resynchronization must be considered when there is a resource contention or data dependency. Resource allocation tracking 116 monitors for these issues while the asynchronous operations have not received an external indication of their completion and, when necessary, instructs stall detection 112 to halt pipe control 108 from executing instructions until the problem is resolved or the asynchronous operation is completed. This is not the same as resynchronization because the asynchronous operation may complete while a desynchronized vector operation is still in progress. In that case, the instruction that had to wait for the asynchronous operation to complete can execute even though a resynchronization of the desynchronized vector operation has not been performed.
  • xload r0 r1 r2 // load 64 bytes into local mem[100,101, 163] from external mem[0x12345678...]
  • the xload is executed but the loading of new data into the local memory is performed asynchronously.
  • the add and mul instructions can therefore be executed.
  • the store instruction needs to write data to the local memory. Since it is unpredictable when the xload will also write to the local memory, it is possible the store and xload will attempt to perform simultaneous writes which is not supported in a design with only one write port. Therefore, the store instruction must be stalled until xload has finished writing to the local memory.
  • Resource allocation tracking 116 monitors the asynchronous xload, detects this contention, and instructs stall detection 112 to halt pipe control 108 from executing the store instruction until resource allocation tracking 116 determines the contention is resolved.
  • One mechanism for such improvement is for the external process to request from the processor permission to write to the local memory and buffer the write data until such permission is given by pipe control 108. This may be perfectly satisfactory if only small amounts of data are to be loaded from external memory, but if a lot of data is being returned from external memory and permission from pipe control 108 to write to the local memory is delayed, then the buffer may be unacceptably large. (If a very long running vector instruction is being executed desynchronized then pipe control 108 cannot interrupt it since it's desynchronized. It may take a long time to complete before it is no longer using the write port.)
  • fetch r7 r9 // fetch mem[100] and put it into r9 - this is a data contention with the xload!
  • the fetch instruction retrieves data from the local memory that is being loaded by the prior xload.
  • the fetch cannot be allowed to execute until the xload has written this data into the local memory.
  • Resource allocation tracking 116 monitors the local memory addresses associated with the xload and initiates the process for stalling any instruction that reads or writes a memory address in that range. This is an automated means of resolving the contention. Programmatic means may also or alternatively be made available. A programmer generally knows if they are prefetching data and when, later on in the program, that data is being used. Therefore, an instruction such as xlwait (xload wait) can be used by the programmer to alert pipe control 108 that it needs to wait until an outstanding asynchronous xload has completed before continuing with instruction execution. This can lead to a simpler design by moving the onus to the programmer to ensure the race hazard is avoided.
  • Pipe control 108 can issue an asynchronous execution of xsave and continue executing subsequent instructions until an instruction is encountered that has a memory read port contention.
  • Resource allocation tracking 116 monitors the local memory addresses associated with the xsave and initiates the process for stalling any instruction that modifies a memory address in that range.
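The address-range monitoring described for xload and xsave can be sketched as follows. This is a minimal illustration, assuming a hypothetical `AsyncRangeTracker` class and method names; it models only the address-range check and does not model memory port contention or the external completion signaling.

```python
class AsyncRangeTracker:
    """Track local memory ranges owned by outstanding xload/xsave operations."""

    def __init__(self):
        self.pending = []  # entries are (kind, first_address, last_address)

    def issue(self, kind, base, count):
        """Record an asynchronous operation covering `count` locations from `base`."""
        if kind not in ("xload", "xsave"):
            raise ValueError(kind)
        self.pending.append((kind, base, base + count - 1))

    def complete(self, kind, base):
        """Drop the entry once the external completion indication arrives."""
        self.pending = [
            p for p in self.pending if not (p[0] == kind and p[1] == base)
        ]

    def must_stall(self, address, is_write):
        """Return True if an access to `address` has to wait.

        An outstanding xload blocks reads and writes in its range;
        an outstanding xsave blocks only accesses that modify its range.
        """
        for kind, first, last in self.pending:
            if first <= address <= last and (kind == "xload" or is_write):
                return True
        return False


tracker = AsyncRangeTracker()
tracker.issue("xload", base=100, count=64)            # covers mem[100] to mem[163]
fetch_stalls = tracker.must_stall(100, is_write=False)
outside_stalls = tracker.must_stall(164, is_write=False)
tracker.complete("xload", base=100)
fetch_after = tracker.must_stall(100, is_write=False)
```

In this sketch the fetch of mem[100] stalls while the xload is outstanding and proceeds once the completion is recorded, while an access outside the tracked range is never stalled.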
  • xsave has an additional consideration regarding what it means for its operation to complete. In the case of xload, the operation is not considered complete until all the data has been loaded into the local memory. But for xsave, there are two points that could be considered complete:
  • a program may need to know that the xsave is 100% complete in every way and that the external write has been acknowledged.
  • the data may be of such critical nature that if the data arrived with a parity error at the receiving end, the program may want to re-xsave the data until confirmation that good data was received has been acknowledged.
  • Figure 5 illustrates, generally at 500, a flowchart showing asynchronous, desynchronous, and synchronous execution of an instruction.
  • fetch the next instruction to execute. Then proceed via 503 to 504.
  • When the fetched next instruction to execute affects or is dependent on the results of any current desynchronized instruction in progress (Yes), go via 519 to 520.
  • At 520 resynchronize execution by waiting for all desynchronized operations to complete before proceeding via 505 to 506.
  • When the fetched next instruction to execute does not affect or is not dependent on the results of any current desynchronized instruction in progress (No), go via 505 to 506.
  • When the next instruction to execute affects or is dependent on the results of any asynchronous operation in progress (Yes), go via 521 to 522; otherwise, if the next instruction to execute does not affect or is not dependent on the results of any asynchronous operation in progress (No), go via 507 to 508.
  • At 522 synchronize execution by waiting for all asynchronous operations to complete before proceeding via 507 to 508.
  • determine if the next instruction to execute can execute asynchronously. When the next instruction to execute can execute asynchronously (Yes), go via 517 to 518; otherwise, if the next instruction to execute cannot execute asynchronously (No), go via 509 to 510.
  • At 518 initiate asynchronous execution by allowing the processor to execute the next instruction asynchronously.
  • determine if the fetched next instruction can execute desynchronously. When the next instruction can execute desynchronously (Yes) then proceed via 511 to 512. At 512 initiate desynchronous execution by allowing the processor to execute the fetched next instruction desynchronously, that is, the completion of the fetched next instruction occurs desynchronously with respect to the control of the processor but the processor tracks when an internal signal is given that indicates the operation is complete. The processor does not wait for this completion signal before continuing via 515 to 502.
  • When the next instruction cannot execute desynchronously (No) then proceed via 513 to 514.
  • At 514 initiate synchronous execution by allowing the processor to execute the fetched next instruction synchronously, that is, the instruction has the appearance to the program that it fully completes before continuing via 515 to 502.
  • the processor may be pipelined or employ other overlapped execution techniques, however it does so in a manner that makes it appear to a program that it completes the instruction before continuing to 502.
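The decision sequence of Figure 5 can be summarized in a short sketch. The function name `plan_execution`, its boolean parameters, and the action labels are hypothetical names introduced for illustration; the ordering of the checks follows the flow described above.

```python
def plan_execution(conflicts_with_desync, conflicts_with_async,
                   can_run_async, can_run_desync):
    """Return the ordered actions taken for one fetched instruction."""
    actions = []
    # First resolve any dependency on a desynchronized instruction in progress.
    if conflicts_with_desync:
        actions.append("resynchronize")
    # Then resolve any dependency on an asynchronous operation in progress.
    if conflicts_with_async:
        actions.append("wait_for_async")
    # Finally choose how the instruction itself is executed.
    if can_run_async:
        actions.append("execute_asynchronously")
    elif can_run_desync:
        actions.append("execute_desynchronously")
    else:
        actions.append("execute_synchronously")
    return actions


# A vector instruction that conflicts with a desynchronized instruction in progress.
vector_plan = plan_execution(True, False, False, True)

# A scalar instruction with no conflicts.
scalar_plan = plan_execution(False, False, False, False)
```

In this sketch a conflicting vector instruction first resynchronizes and then itself runs desynchronously, whereas a scalar instruction with no conflicts simply executes synchronously.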
  • Figure 6 illustrates, generally at 600, a flowchart showing execution of vector instructions.
  • a determination is made if a first vector instruction is currently executing.
  • the first vector instruction is not currently executing (No) then via 601 return to 602.
  • the first vector instruction is currently executing (Yes) then via 603 proceed to 604 and use parameters stored in registers for accessing a memory access control for the first vector instruction then proceed via 605 to 606.
  • At 614 switch a multiplexor to a preload position thereby copying contents of the memory access control preload registers into the memory access control registers, then proceed via 615 to 616.
  • At 616 switch the multiplexor to a non-preload position, then proceed via 617 to 618.
  • At 618 execute the second vector instruction, denoting the second vector instruction as the first vector instruction, and returning via 601 to 602.
  • multiplexor control 330 allows new vector parameters 301 to enter multiplexors 310, 312, 314, and 316, and to propagate respectively via 311 , 313, 315, and 317 to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328, respectively.
  • multiplexor control 330 allows new vector parameters 301, which have been loaded into vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308, to enter multiplexors 310, 312, 314, and 316 via 303, 305, 307, and 309 respectively, and to propagate respectively via 311, 313, 315, and 317 to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328, respectively.
  • Figure 7 illustrates, generally at 700, a flowchart showing execution of desynchronized vector instructions in addition to non-desynchronized instructions.
  • a determination is made if a desynchronized vector instruction is currently executing. If a desynchronized vector instruction is not currently executing (No) then via 703 proceed to 714.
  • new desynchronized vector instructions are allowed to execute in addition to non-desynchronized instructions, and it proceeds via 701 to 702.
  • a desynchronized vector instruction is currently executing (Yes) then via 705 proceed to 704.
  • At 704 use the parameters stored in the memory access control registers (e.g. Figure 3 at 350) for accessing a memory access control for vector instructions, then proceed via 707 to 706.
  • At 706 a determination is made if there is an instruction attempting to modify a memory access control register or registers (register(s)) (e.g. Figure 3 at 350). When there is not an instruction attempting to modify a memory access control register(s) (e.g. Figure 3 at 350) (No) then via 703 proceed to 714.
  • At 710 disallow new desynchronized vector instructions from executing but continue to allow non-desynchronized instructions to execute, then via 713 proceed to 712.
  • instructions that modify memory access control register(s) no longer modify memory access control preload register(s) then proceed via 703 to 714. That is, for example, instructions that would modify memory access control registers (e.g. Figure 3 at 350) can now do so rather than modifying the memory access control preload registers (e.g. Figure 3 at 340). After 718, proceed to 714 to allow new desynchronized vector instructions to execute in addition to non-desynchronized instructions.
  • Co-pending Application Number 17/468,574, filed on 09/07/2021, describes a parameter stack, register stack, and subroutine call stack that are separated from the local memory; these stacks are used extensively by the vector ALU 124.
  • Pushing/popping parameters onto/from a stack, saving and restoring of registers, and subroutine calls and returns are all very common operations and it is undesirable if they cause the resynchronization of desynchronized or asynchronous execution.
  • Co-pending Application Number 17/468,574, filed on 09/07/2021 avoids this resynchronization and therefore is synergistic with the techniques disclosed herein.
  • references to “one example” in this description do not necessarily refer to the same example; however, neither are such examples mutually exclusive. Nor does “one example” imply that there is but a single example. For example, a feature, structure, act, without limitation described in “one example” may also be included in other examples. Thus, the invention may include a variety of combinations and/or integrations of the examples described herein.

Abstract

In one implementation, a vector processor unit has preload registers for at least some of vector length, vector constant, vector address, and vector stride. Each preload register has an input and an output. All the preload register inputs are coupled to receive new vector parameters. The output of each preload register is coupled to a first input of a respective multiplexor, and the second input of each respective multiplexor is coupled to the new vector parameters.

Description

Method and Apparatus for Desynchronizing Execution in a Vector Processor
RELATED APPLICATION
[0000] This patent application claims priority of pending U.S. Application Serial No. 63/180,634 filed 04/27/2021 by the same inventor titled “Method and Apparatus for Programmable Machine Learning and Inference” which is hereby incorporated herein by reference. This patent application claims priority of pending U.S. Application Serial No. 63/180,562 filed 04/27/2021 by the same inventor titled “Method and Apparatus for Gather/Scatter Operations in a Vector Processor” which is hereby incorporated herein by reference. This patent application claims priority of pending U.S. Application Serial No. 17/669,995 filed 02/11/2022 by the same inventor titled “Method and Apparatus for Gather/Scatter Operations in a Vector Processor” which is hereby incorporated herein by reference. This patent application claims priority of pending U.S. Application Serial No. 63/180,601 filed 04/27/2021 by the same inventor titled “System of Multiple Stacks in a Processor Devoid of an Effective Address Generator” which is hereby incorporated herein by reference. This patent application claims priority of pending U.S. Application Serial No. 17/468,574 filed 09/07/2021 by the same inventor titled “System of Multiple Stacks in a Processor Devoid of an Effective Address Generator” which is hereby incorporated herein by reference. This patent application claims priority of pending U.S. Application Serial No. 17/701,582 filed 03/22/2022 by the same inventor titled “Method and Apparatus for Desynchronizing Execution in a Vector Processor” which is hereby incorporated herein by reference.
FIELD
[0001] The present method and apparatus pertain to a vector processor. More particularly, the present method and apparatus relate to a Method and Apparatus for Desynchronizing Execution in a Vector Processor.
BACKGROUND
[0002] For improved throughput a vector processing unit (VPU) accesses vectors in memory and performs vector operations at a high rate of speed in a continuous fashion. Thus, the disruption of the vector pipeline for any reason, such as, for example to handle serial or scalar operations or housekeeping instructions comes at a high cost in lowered performance as vector processors are built for brute speed.
[0003] This presents a technical problem for which a technical solution is needed using a technical means.
BRIEF SUMMARY
[0004] A vector processor unit is provided with preload registers for vector length, vector constant, vector address, and vector stride, with each preload register having an input and an output. All the preload register inputs are coupled to receive new vector parameters. Each of the preload registers’ outputs are coupled to a first input of a respective multiplexor, and a second input of all the respective multiplexors are coupled to receive the new vector parameters.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The techniques disclosed are illustrated by way of examples and not limitations in the figures of the accompanying drawings. Same numbered items are not necessarily alike.
[0006] The accompanying Figures illustrate various non-exclusive examples of the techniques disclosed.
[0007] Figure 1 illustrates, generally at 100, a block diagram overview of a decode unit according to an example.
[0008] Figure 2 illustrates, generally at 200, a block diagram overview of vector registers for addressing a memory access control.
[0009] Figure 3 illustrates, generally at 300, a block diagram overview of a portion of a vector processor unit comprising memory access control preload registers.
[0010] Figure 4 illustrates, generally at 400, a flowchart showing desynchronous execution of an instruction and synchronous execution of an instruction.
[0011] Figure 5 illustrates, generally at 500, a flowchart showing asynchronous, desynchronous, and synchronous execution of an instruction.
[0012] Figure 6 illustrates, generally at 600, a flowchart showing execution of vector instructions.
[0013] Figure 7 illustrates, generally at 700, a flowchart showing execution of desynchronized vector instructions in addition to non-desynchronized instructions.
DETAILED DESCRIPTION
[0014] A Method and Apparatus for Desynchronizing Execution in a Vector Processor is disclosed.
[0015] DEFINITIONS and NOTES
[0016] Various terms are used to describe the techniques herein disclosed. Applicant is the lexicographer and defines these terms as follows. Terms are quoted upon their initial usage below.
[0017] “Concurrent” is the same as “parallel” and is defined as two things that are at least partially going on at once. It does not imply anything about how they relate to one another - they could be “synchronized” or “desynchronized”.
[0018] “Synchronized” execution - is the act of the pipeline control controlling every aspect of the instruction’s operation.
[0019] “Desynchronized” execution - is the act of an instruction performing a substantial component of its operation independent of the pipeline control. The pipeline control can therefore control execution and completion of one or more instructions following the instruction undergoing desynchronized execution prior to completion of the desynchronized execution.
[0020] Note that execution of instructions subsequent to a desynchronized instruction is considered to modify a critical processor state if it makes unacceptable changes to the results of the program executing on the processor. An unacceptable change is a final result of all processing for a given program that is different than if all the instructions were executed in a serial fashion, that is each instruction executing to completion before the next instruction begins. A critical processor state is one that must be maintained to avoid an unacceptable change. Changes that are acceptable may include, but are not limited to, the order faults or interrupts occur and updates to program visible registers occurring out of order with respect to the desynchronized instruction (but not out of order with respect to non-desynchronized instructions). Note that changes that would be considered unacceptable are prohibited from occurring through a process of resynchronized execution.
[0021] “Desynchronized instruction” - is an instruction whose execution is not 100% under control of the pipeline control, i.e. a substantial component of its operation is not under control of the pipeline control, however the pipeline control can monitor its progression.
[0022] “Non-desynchronized instruction” - is an instruction that does not execute desynchronously.
[0023] “Resynchronized” execution stops an instruction subsequent to a desynchronized instruction from executing until the desynchronized instruction completes. This occurs if the subsequent instruction would modify a critical processor state, in particular if that processor state would affect the results of the desynchronized instruction.
[0024] “Asynchronous” instruction/execution - an instruction, as part of its execution, invokes activity external to the processor that will complete in a time completely uncontrolled and unpredictable by the processor. The pipeline control cannot monitor its progression. Meanwhile the processor can continue executing instructions.
[0025] “Asynchronous reserialization” waits for an asynchronous execution to complete before allowing a subsequent instruction to execute. Generally, this is in order to maintain integrity of the program’s results.
[0026] Note that the difference between desynchronized and asynchronous is subtle. In desynchronized execution the processor has complete control over the two instructions that are executing even though it allows the second instruction to modify processor state before the first (desynchronized) instruction has completed. In asynchronous execution, the processor has zero (no) control of the timing in which the activity external to the processor invoked by the asynchronous instruction will complete.
[0027] Note we use the term desynchronized execution when allowing non-vector instructions to execute after a vector instruction has started but not completed. The execution of the vector instruction is considered desynchronized from the subsequent non vector instructions that are allowed to execute.
[0028] However, the desynchronization method disclosed is not so limited. That is, while we generally discuss non-vector instructions that execute when a desynchronized vector instruction executes for clarity of explanation, the desynchronization method disclosed is not so limited. In alternative implementations, a second vector instruction may be allowed to execute in a desynchronized manner while a first desynchronized vector instruction is executing. Furthermore, other long running instructions (i.e. taking a longer time than other instructions to complete execution), other than vector instructions, are also candidates for desynchronized execution.
[0029] Note we use the term asynchronous execution for example for the external load memory (xload) and external save memory (xsave) instructions that request processing machines external to the vector processing unit (VPU) to coordinate the movement of data between the VPU’s memory and external memory.
[0030] “Modifying/changing/copying/transferring registers” refers to modifying/changing/copying/transferring values or parameters stored within register(s).
That is, for example, copying a first register to a second register is to be understood as copying the contents or parameters contained or held in the first register into the second register such that the second register now contains the value or parameter of the first register.
[0031] “Contention” refers to two or more processes, such as, but not limited to, executing instructions trying to alter or access the same entity, such as, but not limited to a memory or register where the alteration would introduce uncertainty in the result of processing. For example, if two executing instructions are attempting to both alter a specific memory location, this is contention for the resource, i.e. contention for the same specific memory location. The contention may result in a different result in processing depending on which instruction completes execution first. For example, a desynchronization contention, is a contention between an executing desynchronized instruction and another instruction that will affect the processor output resulting in a different output depending upon which instruction completes execution first. For example, an asynchronous contention, is a contention between an executing asynchronous instruction and another instruction that will affect the processor output resulting in a different output depending upon which instruction completes execution first.
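The notion of contention defined above can be illustrated with a simple check over the resources each instruction reads and writes. This is a sketch under stated assumptions: the function name `contends`, the dictionary representation, and the assignment of operands to reads and writes for the sample instructions are hypothetical and chosen only for illustration.

```python
def contends(first, second):
    """Return True if two instructions contend for a resource.

    Each instruction is a dict with 'reads' and 'writes' sets of
    resource names. Contention exists when one instruction alters a
    resource that the other alters or accesses; two reads of the same
    resource do not contend.
    """
    return bool(
        first["writes"] & (second["reads"] | second["writes"])
        or second["writes"] & first["reads"]
    )


# Assumed operand roles, for illustration only.
vector_sqrt = {"reads": {"v0"}, "writes": {"v1"}}
vector_log = {"reads": {"v0"}, "writes": {"v1"}}
scalar_add = {"reads": {"r7", "r8"}, "writes": {"r8"}}

vector_conflict = contends(vector_sqrt, vector_log)
scalar_conflict = contends(vector_sqrt, scalar_add)
```

Under these assumptions the two vector instructions contend because both alter the same vector, while the scalar instruction touches only scalar registers and can proceed alongside the desynchronized vector instruction.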
[0032] “Vector parameters/new vector parameters” refers to information about a vector. In one example it may be a plurality of signals. More specifically it is information needed by the processor to access memory (e.g. read and write a vector). “New” refers to the situation where the processor is already using vector parameters and a new vector operation is being queued up or placed in the pipeline for future execution; the vector parameters for this vector operation are called “new vector parameters” to distinguish them from vector parameters that are currently being used in a vector instruction that is executing.
[0033] DESCRIPTION
[0034] In one example a vector processor unit having preload registers for vector length, vector constant, vector address, and vector stride is provided. Each preload register has a respective input and a respective output. All the preload register inputs are coupled to receive new vector parameters. The output of each preload register is coupled to a first input of a respective multiplexor, and a second input of each respective multiplexor is coupled to receive the new vector parameters.
[0035] In one example disclosed are mechanisms that determine when desynchronized and asynchronous execution can occur and mechanisms that stop instruction execution if the desynchronized and/or asynchronous execution must complete (called resynchronization and asynchronous reserialization respectively), generally in order to maintain integrity of the program’s results. The methods disclosed not only allow desynchronized and asynchronous execution but also limit the cases when resynchronization or asynchronous reserialization is to be performed, since resynchronization and asynchronous reserialization reduce program performance.
[0036] Figure 1 illustrates, generally at 100, a block diagram overview of a decode unit. At 102 is an instruction fetch control which fetches instructions from a memory system. The memory system, while not germane to the understanding of the decode unit 100, can be, for example, random access memory (RAM). The instruction fetch control 102 outputs via 103 information to instruction decode 104, and outputs via 105 execute/halt information to operational state control 106 and to pipeline control 108. The instruction decode 104 outputs via 107 information to stall detection 112, result bypass detection 114, and resource allocation tracking 116. Pipeline control 108 outputs via 117 information to resource allocation tracking 116. Resource allocation tracking 116 outputs via 119 information to result bypass detection 114, and stall detection 112. Result bypass detection 114 outputs via 115 information to pipeline control 108. Stall detection 112 outputs via 113 information to pipeline control 108. Pipeline control 108 via 121 outputs and receives information to/from register unit 118, memory access control unit 120, scalar arithmetic logic units (ALUs) 122, vector arithmetic logic units (ALUs) 124 and branch unit 126. Branch unit 126 outputs via 125 information to instruction fetch control 102. Branch unit 126 outputs via 123 information to fault control 110. Vector ALUs 124 outputs via 123 information to fault control 110. Scalar ALUs 122 outputs via 123 information to fault control 110. Memory access control unit 120 outputs via 123 information to fault control 110. Register unit 118 outputs via 123 information to fault control 110. Fault control 110 outputs via 109 to pipeline control 108, and via 111 to operational state control 106. Branch unit 126 receives via 127 information output from scalar ALUs 122 and information from vector ALUs 124.
[0037] For sake of a simple germane discussion, from Figure 1 it can be seen that pipeline control 108 communicates, inter-alia, with register unit 118, memory access control unit 120, scalar ALUs 122, and vector ALUs 124. Pipeline control 108 attempts to keep the processor in which decode unit 100 is situated running as fast as it can by trying to avoid stopping any scalar or vector ALUs from serially processing what can be done in parallel. It is in a simple sense a traffic cop directing traffic so as to improve throughput.
[0038] In a processor capable of performing both scalar and vector operations it is preferable to keep the vector ALUs operating at the highest rate of speed possible because vector operations involve more processing than scalar operations, and thus substantially determine the overall processing rate.
[0039] Figure 2 illustrates, generally at 200, a block diagram overview of vector registers for addressing a memory access control. At 201 is new vector parameters, i.e. 201 represents the receipt of new vector parameters 201 to be loaded. New vector parameters 201 is coupled to the input of vector length register 202 and the output of vector length register 202 is coupled via 203 to memory access control 220. New vector parameters 201 is also coupled to the input of vector constant register 204 and the output of vector constant register 204 is coupled via 205 to memory access control 220. New vector parameters 201 is also coupled to the input of vector address register 206 and the output of vector address register 206 is coupled via 207 to memory access control 220. New vector parameters 201 is also coupled to the input of vector stride register 208 and the output of vector stride register 208 is coupled via 209 to memory access control 220. While vector length register 202, vector constant register 204, vector address register 206 and vector stride register 208 are illustrated, in some examples one or more of vector length register 202 and vector constant register 204 are not provided.
[0040] Memory access control 220 is a functional block, not a register. It takes in as inputs the vector length provided via 203 from vector length register 202, the vector constant provided via 205 from vector constant register 204, the vector address provided via 207 from vector address register 206, and the vector stride provided via 209 from the vector stride register 208. The combination of vector length register 202, vector constant register 204, vector address register 206 and vector stride register 208 can be called Vector Control and memory access control 220 can be called a Memory Subsystem. That is Vector Control controls addressing to a Memory Subsystem. The Memory Subsystem can include RAM (not shown).
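By way of illustration only, the following Python sketch shows one way a Vector Control of this kind could address the Memory Subsystem. It assumes that the vector stride is added to the vector address once per element access and that the vector length gives the number of elements. The function name `element_addresses` and the 3 by 4 matrix layout are hypothetical and are not taken from the figures.

```python
def element_addresses(address, stride, length):
    """Return the memory addresses touched by one vector access.

    Assumption: element i lives at address + i * stride.
    """
    return [address + i * stride for i in range(length)]


# Hypothetical 3 x 4 matrix stored row by row starting at address 0.
N = 4  # elements per row
row1 = element_addresses(address=1 * N, stride=1, length=N)
col1 = element_addresses(address=1, stride=N, length=3)

print(row1)  # [4, 5, 6, 7]
print(col1)  # [1, 5, 9]
```

With a stride of 1 the access walks along a row, and with a stride of N it walks down a column, so the same three values are enough to describe either vector.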
[0041] Upon understanding Figure 3 described below, the reader will recognize that Figure 2 as illustrated is an example of an apparatus that does not support vector desynchronization in vector memory control whereas Figure 3 as illustrated is an example of an apparatus that does support vector desynchronization in vector memory control.
[0042] Figure 3 illustrates, generally at 300, a block diagram overview of a portion of a vector processor unit comprising memory access control preload registers.
[0043] At 301 is new vector parameters. New vector parameters 301 is coupled to the input of vector length preload register 302 and the output of vector length preload register 302 is coupled via 303 to a first input of a respective multiplexor 310. The second input of multiplexor 310 is coupled to new vector parameters 301, i.e. bypassing vector length preload register 302. The output of multiplexor 310 is coupled via 311 to a vector length register 322. The output of vector length register 322 is coupled via 323 to memory access control 320.
[0044] New vector parameters 301 is coupled to the input of vector constant preload register 304 and the output of vector constant preload register 304 is coupled via 305 to a first input of respective multiplexor 312. The second input of multiplexor 312 is coupled to new vector parameters 301, i.e. bypassing vector constant preload register 304. The output of multiplexor 312 is coupled via 313 to a vector constant register 324. The output of vector constant register 324 is coupled via 325 to memory access control 320.
[0045] New vector parameters 301 is coupled to the input of vector address preload register 306 and the output of vector address preload register 306 is coupled via 307 to a first input of respective multiplexor 314. The second input of multiplexor 314 is coupled to new vector parameters 301, i.e. bypassing vector address preload register 306. The output of multiplexor 314 is coupled via 315 to a vector address register 326. The output of vector address register 326 is coupled via 327 to memory access control 320.
[0046] New vector parameters 301 is coupled to the input of vector stride preload register 308 and the output of vector stride preload register 308 is coupled via 309 to a first input of multiplexor 316. The second input of multiplexor 316 is coupled to new vector parameters 301 i.e. bypassing vector stride preload register 308. The output of multiplexor 316 is coupled via 317 to a vector stride register 328. The output of vector stride register 328 is coupled via 329 to memory access control 320.
[0047] While vector length preload register 302, vector constant preload register 304, vector address preload register 306, vector stride preload register 308, vector length register 322, vector constant register 324, vector address register 326 and vector stride register 328, with the respective multiplexors 310, 312, 314, 316 are illustrated, in some examples one or more of vector length preload register 302, vector length register 322, vector constant preload register 304 and vector constant register 324, and the respective multiplexors, are not provided.
[0048] At 330 is multiplexor control. An output of multiplexor control 330 is coupled via 331 to respective control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310. That is, control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310 are all controlled via link 331 which is output from multiplexor control 330. In one example link 331 carries a single signal to all of the control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310, and in another example link 331 carries a respective signal to each of the control inputs of multiplexor 316, multiplexor 314, multiplexor 312, and multiplexor 310, so that they are individually controllable.
[0049] Multiplexor control 330 identifies whether memory access control registers 350 are to be loaded with new vector parameters 301 or from the respective outputs of memory access control preload registers 340, as described below, and therefore controls link 331 so as to update memory access control registers 350 at correct points between two desynchronized vector arithmetic operations. The update is from the preload registers (302, 304, 306, 308) to the registers (322, 324, 326, 328), or from new vector parameters 301 to the registers (322, 324, 326, 328). As described below, multiplexor control 330 further controls writing to each of the preload registers (302, 304, 306, 308) and the registers (322, 324, 326, 328).
[0050] Vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 together comprise memory access control preload registers 340. Individually each of vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 is considered a memory access control preload register.
[0051] Vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 together comprise memory access control registers 350. Individually each of vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 is considered a memory access control register.
[0052] Memory access control 320 is a functional block, not a register. It takes in as inputs the vector length, the vector constant, the vector address, and the vector stride register values (provided by respective memory access control registers 322, 324, 326, 328 via respective links 323, 325, 327, 329). Registers 322, 324, 326, 328, and their respective parameters communicated via links 323, 325, 327, 329, are what can be called Vector Control, and memory access control 320 can be called a Memory Subsystem. That is, Vector Control controls addressing to a Memory Subsystem. The Memory Subsystem can include RAM (not shown).
[0053] The multiplexor control 330 is considered to be in a non-preload position when new vector parameters 301 pass through multiplexors 310, 312, 314, and 316 respectively, and then via 311, 313, 315, and 317 respectively, into vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328.
[0054] The multiplexor control 330 is considered to be in a preload position when multiplexors 310, 312, 314, and 316 respectively receive inputs from vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 respectively via 303, 305, 307, and 309 respectively.
[0055] That is, in the non-preload position the memory access control registers 350 receive parameters from the new vector parameters 301. In the preload position the memory access control registers 350 receive parameters from the memory access control preload registers 340.
[0056] Not shown, so as not to obscure the example, is that the multiplexor control 330 controls write signals to the memory access control registers 350 and the memory access control preload registers 340. In this way multiplexor control 330 controls which registers receive the new vector parameters 301.
[0057] In Figure 3 multiplexor 310 is considered a first multiplexor.
[0058] In Figure 3 multiplexor 312 is considered a second multiplexor.
[0059] In Figure 3 multiplexor 314 is considered a third multiplexor.
[0060] In Figure 3 multiplexor 316 is considered a fourth multiplexor.
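By way of illustration only, the data path of Figure 3 can be modelled in a few lines of Python. The sketch below is a behavioural approximation, not the disclosed hardware: the class name `VectorPortDatapath`, its method names, and the use of dictionaries in place of registers are all hypothetical. It shows only the two multiplexor positions described above, with each multiplexor individually controllable.

```python
from dataclasses import dataclass, field

FIELDS = ("length", "constant", "address", "stride")


@dataclass
class VectorPortDatapath:
    # Stand-ins for preload registers 340 and control registers 350.
    preload: dict = field(default_factory=lambda: dict.fromkeys(FIELDS, 0))
    active: dict = field(default_factory=lambda: dict.fromkeys(FIELDS, 0))

    def write_preload(self, name, value):
        """New vector parameters 301 captured by one preload register."""
        self.preload[name] = value

    def clock_active(self, name, select_preload, new_value=None):
        """Load one control register through its multiplexor."""
        if select_preload:
            # Preload position: take the output of the preload register.
            self.active[name] = self.preload[name]
        else:
            # Non-preload position: bypass the preload register.
            self.active[name] = new_value


dp = VectorPortDatapath()
dp.clock_active("address", select_preload=False, new_value=100)
dp.write_preload("address", 200)
before = dp.active["address"]      # preload write leaves the control register alone
dp.clock_active("address", select_preload=True)
print(before, dp.active["address"])  # 100 200
```

Writing the preload register does not disturb the value seen by the memory access control until the multiplexor is placed in the preload position and the control register is loaded.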
[0061] Figure 4 illustrates, generally at 400, a flowchart showing desynchronous execution of an instruction and synchronous execution of an instruction. At 402 fetch the next instruction to execute. Then proceed via 403 to 404. At 404 determine if the next instruction to execute affects or is dependent on the results of any current desynchronized instruction in progress. When the next instruction to execute affects or is dependent on the results of any current desynchronized instruction in progress, this being called a desynchronization contention (Yes), then go via 419 to 420. When the next instruction to execute does not affect and is not dependent on the results of any current desynchronized instruction in progress (No), go via 405 to 430, optional asynchronous execution. At 420 resynchronize execution by waiting for all desynchronized operations to complete before proceeding via 405 to 430, optional asynchronous execution. When there is no optional asynchronous execution 430 then proceed via 409 to 410.
[0062] At 410 determine if the fetched next instruction can execute desynchronously. When the next instruction can execute desynchronously (Yes) then proceed via 411 to 412. At 412 initiate desynchronous execution by allowing the processor to execute the fetched next instruction desynchronously, that is, the completion of the fetched next instruction occurs desynchronously with respect to the control of the processor but the processor tracks when an internal signal is given that indicates the operation is complete. The processor does not wait for this completion signal before continuing via 415 to 402.
[0063] When the next instruction cannot execute desynchronously (No) then proceed via 413 to 414. At 414 initiate synchronous execution by allowing the processor to execute the fetched next instruction synchronously, that is, the instruction has the appearance to the program that it fully completes before continuing via 415 to 402. The processor may be pipelined or employ other overlapped execution techniques, however it does so in a manner that makes it appear to a program that it completes the instruction before continuing to 402.
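By way of illustration only, the decisions at 404, 410, 412, 414 and 420 of Figure 4 can be sketched in Python as follows. The sketch is a simplification under stated assumptions: an instruction is described by a hypothetical dictionary with a set of resources it uses and a `long_running` flag that stands in for "can execute desynchronously", desynchronized operations are assumed to finish only when a resynchronization waits for them, and the optional asynchronous execution at 430 is omitted.

```python
def step(instr, in_flight):
    """Apply the Figure 4 decisions to one fetched instruction.

    instr: dict with 'name', 'uses' (set of resource names), 'long_running'.
    in_flight: list of resource sets held by desynchronized instructions.
    Returns the list of actions taken, in order.
    """
    actions = []
    # 404: does the instruction affect or depend on a desynchronized one?
    if any(instr["uses"] & held for held in in_flight):
        actions.append("resynchronize")        # 420: wait for all to complete
        in_flight.clear()
    # 410: can the instruction execute desynchronously?
    if instr["long_running"]:
        in_flight.append(set(instr["uses"]))   # 412: issue without waiting
        actions.append("desynchronous")
    else:
        actions.append("synchronous")          # 414: appears complete before next fetch
    return actions


in_flight = []
program = [
    {"name": "sqrt v0 v1", "uses": {"v0", "v1", "valu"}, "long_running": True},
    {"name": "add r7 r8", "uses": {"r7", "r8"}, "long_running": False},
    {"name": "log v0 v1", "uses": {"v0", "v1", "valu"}, "long_running": True},
]
trace = [step(instr, in_flight) for instr in program]
print(trace)
# [['desynchronous'], ['synchronous'], ['resynchronize', 'desynchronous']]
```

The scalar add proceeds without waiting, while the second vector instruction shares resources with the first and therefore forces a resynchronization before it is issued.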
[0064] Some operations are allowed to occur out of order and others are not. Not everything can be out of order otherwise the general integrity of a program (and therefore its usefulness) is undermined. To avoid instructions that can corrupt the processor state, there is provided a process called resynchronization, i.e. 420, that halts further execution until a desynchronized operation has completed. This impacts performance and this disclosure details the elimination of some of the causes of resynchronization, thereby speeding up program execution.
[0065] Knowing when there is desynchronized execution of one or more instructions, e.g. vector instruction, for example in Figure 4 at 412, then the multiplexor control 330 in Figure 3 can perform the updating of the memory access control registers 350 at correct points between the two desynchronized vector arithmetic operations.
[0066] One vector instruction can desynchronize from the executing instructions in the pipeline, allowing another instruction to execute. If a subsequent instruction has a resource contention with the desynchronized instruction then the subsequent instruction must wait until the contention goes away - this is one example of a desynchronization contention, as described in relation to 404. However, if you can execute a second vector instruction without causing a resource contention, the second vector instruction may execute desynchronized.
[0067] Instructions that qualify for desynchronized execution are any long running instruction as this allows subsequent instructions to complete their execution while the desynchronized instruction is executing. So, the execution time for subsequent instructions which are executed while the desynchronized instruction is executing is effectively reduced because they do not wait on the desynchronized instruction to complete execution.
[0068] Another way of looking at the examples disclosed herein is to see what instructions can execute when a desynchronized instruction is executing.
[0069] Since vector instructions are long running and represent the bulk of the work in a vector processor, ideally all non-vector instructions would be allowed to execute while a desynchronized vector instruction executes. If this can be achieved, then the processing time is bounded by the execution of the vector instructions as all other instructions would be executed while the desynchronized vector instructions are executing.
[0070] Vector instructions read operands from memory and write results to memory. Therefore, instructions that don’t access memory are candidates for execution when a desynchronized vector instruction is executing. These instructions include all scalar arithmetic instructions whose operands come from, and whose results go to, a register set. They also include instructions that access memory using either a different memory or a different region of memory than a desynchronized vector instruction. This can include subroutine calls and returns, and pushing and popping parameters from a stack, without limitation.
[0071] There is a class of instructions that may cause contention with a desynchronized vector instruction. For example, instructions that set up a subsequent vector operation (vector addresses in memory, vector lengths, without limitation) modify resources and can thereby adversely affect the currently executing desynchronized vector instruction.
[0072] For performance reasons, it would be desirable if these contention causing instructions could also execute in parallel with a desynchronized vector instruction.
[0073] If the processing of vectors represents the bulk of the work in a vector processor, then instructions that set up those vectors are also very common and having to resynchronize execution every time a new vector is being set up is a significant performance degradation.
[0074] Therefore, there is a need for instructions that set up memory access control preload registers (e.g. Figure 3 at 340) that specify vector addresses, vector strides, vector lengths, and vector constant values so that the currently executing desynchronized vector instruction is not adversely affected.
[0075] Vector length, vector constant, vector address, and vector stride are entities that can reside in registers, for example, in memory access control registers 350 (e.g. 322, 324, 326, and 328 respectively) in Figure 3 and via 323, 325, 327, and 329 respectively are in communication with memory access control 320. Vector length preload, vector constant preload, vector address preload, and vector stride preload are entities that can reside in memory access control preload registers 340 (e.g. 302, 304, 306, and 308 respectively in Figure 3) and via 303, 305, 307, and 309 respectively are in communication with multiplexor 310, 312, 314, and 316 respectively. The vector length, vector constant, vector address, and vector stride, collectively called a vector port for ease of discussion, allow addressing a vector in the memory access control 320, called a memory for ease of discussion. Thus the vector port addresses the memory to point to a vector.
[0076] For example, a vector length, is a length of a vector in the memory.
[0077] For example, a vector constant is a constant that is used when operating on a vector. For example, if there is a need to multiply every element of a vector A by 2, then vector B, the multiplier, is a vector whose elements all have the value 2. Instead of requiring vector B to be resident in memory, the vector constant can be in a register that specifies the value of each element of vector B.
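By way of illustration only, the following Python sketch shows how a vector constant could stand in for an operand vector that is never stored in memory. The `operand` helper and the dictionary-based ports are hypothetical, and the address arithmetic assumes that element i of a memory-resident vector is found at address + i * stride.

```python
def operand(memory, port, i):
    """Element i of an operand: the constant if one is set, else a memory read."""
    if port.get("constant") is not None:
        return port["constant"]
    return memory[port["address"] + i * port["stride"]]


memory = {100 + i: float(i) for i in range(4)}   # vector A = [0, 1, 2, 3]
port_a = {"address": 100, "stride": 1}           # A is resident in memory
port_b = {"constant": 2.0}                       # every element of B is 2
length = 4

result = [operand(memory, port_a, i) * operand(memory, port_b, i)
          for i in range(length)]
print(result)  # [0.0, 2.0, 4.0, 6.0]
```

Vector B occupies no memory locations in this sketch; the single register value is returned for every element index.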
[0078] A vector address is an address where a vector is to be found. A vector stride is a stride value that is added to the vector address each time an access is made into memory for a vector element. For example, the stride may be equal to 1 if the vector is a row of a matrix but it may be set to N if it is a column of a matrix that has N elements in each row. Vector address and vector stride are used to address memory locations where a vector can be read or written.
[0079] DETAILED INSTRUCTION EXECUTION EXAMPLES
[0080] Because the techniques disclosed are used for enhancing execution of a vector processor, these detailed examples are illustrative of the techniques.
[0081] First is shown an example of Desynchronized Execution. Then an example of Asynchronous Execution. And finally an example showing relevance with respect to co-pending Application Number 17/468,574 filed 09/07/2021 which describes a parameter stack, register stack, and subroutine call stack that are separated from the local memory, which is used extensively by the vector ALUs 124.
[0082] In the following examples these mnemonics mean the following:
[0083] mov - move
[0084] rX - register, where X is an integer number of the register
[0085] sas - set-address-and-stride
[0086] slen - set vector length
[0087] sqrt - square root
[0088] vX - vector, where X is an integer number of the vector
[0089] mem[Xs] - memory, where Xs are the memory addresses
[0090] add - addition
[0091] div - division
[0092] etc - etcetera, meaning possible continuing instructions
[0093] log - logarithm
[0094] // - a comment follows (not part of the executing code)
[0095] xload - load data from an external source
[0096] xsave - save data to an external destination
[0097] store - save in local memory
[0098] fetch - get from local memory
[0099] xswait - a stall instruction until an asynchronous xsave operation is complete
[00100] push - put the value referenced onto the top of a stack
[00101] call - passing control to the specified instructions/routine
[00102] - - a comment follows (not part of the executing code), and is an alternative syntax to //
[00103] xlwait - a stall instruction until an asynchronous xload operation is complete
[00104] In order to not confuse the reader, while in Figure 1 box 124 indicates vector ALUs (plural) the examples below will consider the case where box 124 is a single vector ALU and will refer to it as vector ALU 124. The techniques disclosed are not so limited and multiple ALUs are possible.
[00105] Desynchronized Execution
[00106] ========================
[00107] mov r0 100 // r0 gets 100
[00108] mov r1 1 // r1 gets 1
[00109] sas0 r0 r1 // set-address-and-stride for vector 0, v0: address is 100, stride is 1
[00110] sas1 r0 r1 // set-address-and-stride for vector 1, v1: address is 100, stride is 1
[00111] mov r2 64 // r2 gets 64
[00112] slen r2 // set vector length to 64, thus v0 and v1 occupy memory locations mem[100,101,102,...,163]
[00113] sqrt v0 v1 // v0 gets the square root of v1, and since v0 and v1 have the same address, v1 also gets the square root
[00114] add r7 r8 // without desynchronization this instruction has to wait until the previous sqrt instruction completes
[00115] div r7 r9 // without desynchronization this instruction has to wait until the previous sqrt instruction completes
[00116] etc
[00117] There is no reason the instructions illustrated above that follow the sqrt instruction cannot execute while the sqrt instruction is executing. This means pipeline control, 108 (also called pipe control), needs to allow the sqrt instruction to execute desynchronized so pipeline control 108 can allow the execution of subsequent instructions (in the example above, add r7 r8, and div r7 r9).
[00118] However at some point pipeline control 108 may need to resynchronize a desynchronized operation if it is still in progress. For example, if the vector ALU, 124, only supports one vector operation at a time, then the following demonstrates a resynchronization:
[00119] sqrt v0 v1 // this desynchronizes from pipeline control, i.e. allows the sqrt instruction to execute desynchronized
[00120] add r7 r8 // pipeline control allows this to execute
[00121] div r7 r9 // pipeline control allows this to execute
[00122] log v0 v1 // pipeline control must resynchronize since this cannot execute yet due to resource contention (of v0 and v1), that is, log v0 v1 is attempting to use v0 and v1 however we don’t know if sqrt v0 v1 is finished yet (with v0 and v1), so it must resynchronize
[00123] In the immediately above example the original vector is square-rooted and then, since no vector addresses were changed, the result of that square root is operated on by the logarithm function. But if vector ALU 124 can only perform one vector operation at a time, then the square root must complete before the logarithm can start. If the square root has not completed (as monitored by resource allocation tracking 116) then the desynchronized sqrt must be resynchronized with the execution of pipeline control 108 before the log instruction can execute. This is done by resource allocation tracking 116 indicating to stall detection 112 that a resynchronization needs to occur, and stall detection 112 stalls pipe control 108 from executing the log instruction until the resynchronization is complete and vector ALU 124 is available.
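By way of illustration only, the effect of desynchronization on the instruction sequence above can be estimated with a toy timing model in Python. All latencies are hypothetical (64 cycles for a vector operation, 1 cycle for a scalar operation and for issuing a desynchronized vector operation), and a single vector ALU is assumed, so a second vector operation resynchronizes until the ALU is free.

```python
VECTOR_OPS = {"sqrt", "log"}


def total_cycles(program, desync, vector_latency=64, scalar_latency=1):
    """Cycle at which the last instruction finishes under a toy timing model."""
    now = 0          # cycle at which pipeline control can issue the next instruction
    valu_free = 0    # cycle at which the single vector ALU becomes available
    for op in program:
        if op in VECTOR_OPS:
            now = max(now, valu_free)        # resynchronize if the ALU is still busy
            valu_free = now + vector_latency
            # Desynchronized: issue and move on. Otherwise wait for completion.
            now = now + 1 if desync else valu_free
        else:
            now += scalar_latency
    return max(now, valu_free)


program = ["sqrt", "add", "div", "log"]
print(total_cycles(program, desync=False))  # 130
print(total_cycles(program, desync=True))   # 128
```

In this model the scalar add and div are hidden entirely behind the sqrt when it executes desynchronized, so the total time is bounded by the two vector operations alone.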
[00124] Resynchronization represents a performance loss and, although sometimes necessary, is undesirable. Ideally, the vector ALU 124 should be kept as busy as possible, with a utilization as close to 100% as practical since the bulk of the work in a vector processor is the processing of vectors.
[00125] Consider the following example, which is representative of many common scenarios:
[00126] mov r0 100 // same as the above example all the way down to the sqrt
[00127] mov r1 1
[00128] sas0 r0 r1
[00129] sas1 r0 r1
[00130] mov r2 64
[00131] slen r2
[00132] sqrt v0 v1
[00133] mov r0 200 // set up a new vector operation for the operand and result vectors in mem[200,201,202,...,295]
[00134] mov r1 1
[00135] sas0 r0 r1
[00136] sas1 r0 r1
[00137] mov r2 96
[00138] slen r2
[00139] log v0 v1 // mem[200,201,202,...,295] gets the log of mem[200,201,202,...,295]
[00140] In this case, the second occurrence of the sas0, sas1, and slen instructions changes the locations in memory that define where the operand and result vectors reside. But if the sqrt instruction is still executing desynchronized when these instructions are executed, they will adversely affect the sqrt because the vectors for the sqrt are unexpectedly having their addresses, strides, and lengths changed. So the second occurrence of sas0 must cause a resynchronization, which is not desirable.
[00141] Figure 3 shows an example of how resynchronization can be avoided.
[00142] The second occurrence of the sas0, sas1, and slen instructions can be allowed to execute while the desynchronized sqrt is executing by writing the operand and result ports into the memory access control preload registers 302, 306, and 308 rather than into the memory access control registers 322, 326, and 328.
[00143] Multiplexor control 330, which is controlled by pipeline control 108, recognizes the attempt to modify one of the memory access control registers 350 while a desynchronized operation is in progress and causes the memory access control preload registers 340 to be written instead; that is, multiplexor control 330 decides whether the memory access control registers 350 or the memory access control preload registers 340 are written. Therefore, memory access control registers 350 are not affected by a subsequent instruction while a desynchronized operation is in progress, and the desynchronized operation is therefore not adversely affected.
[00144] Pipeline control 108 further recognizes when the desynchronized operation is complete, and if any of memory access control preload registers 340 have been modified then their contents are moved into the respective one of memory access control registers 350 by multiplexor control 330 of pipeline control 108. Thus, the full functionality required by the second execution of the sas0, sas1, and slen instructions is provided without them having to resynchronize, and therefore lose performance. The vector log instruction can now execute and, being a vector instruction, can execute in a desynchronized manner. If multiple vector instructions cannot execute in parallel, the vector log will resynchronize first, responsive to pipeline control 108, so that only one desynchronized vector instruction at a time is executing.
[00145] The above allows the vector unit to remain near 100% busy (ignoring any inefficiencies of startup in a particular implementation). The vector ALU 124 went from performing square-roots on each element of one vector to immediately performing logarithms on another vector, thereby satisfying the objective of keeping the vector ALU 124 nearly 100% busy.
[00146] Had the sqrt completed before the second occurrence of the sas0, sas1, and slen instructions, then no desynchronized operation was in progress. Pipeline control 108 recognizes this and via multiplexor control 330 allows memory access control registers 350 to be updated immediately by the new vector parameters 301 without having to use memory access control preload registers 340.
[00147] It may be that the second sas0 updated registers 306 and 308 rather than 326 and 328 due to the desynchronized execution of the sqrt, but by the time the slen instruction was executed the desynchronized execution had completed. In this case, when the sqrt completes, multiplexor control 330 updates registers 326 and 328 from registers 306 and 308 and allows the slen to write directly into register 322.
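By way of illustration only, the write policy described in the preceding paragraphs can be sketched in Python. The class `VectorControl`, its method names, and the register keys such as `address0` are hypothetical. The sketch assumes a single desynchronized operation at a time and models only which register bank a write lands in and when modified preload values are moved across.

```python
class VectorControl:
    """Toy model of registers 340 and 350 under multiplexor control 330."""

    def __init__(self):
        self.active = {}         # memory access control registers 350
        self.preload = {}        # modified preload registers 340 only
        self.desync_busy = False

    def start_desync(self):
        """A vector instruction begins executing desynchronized."""
        self.desync_busy = True

    def write(self, name, value):
        """Handle a setup instruction such as sas0, sas1 or slen."""
        if self.desync_busy:
            self.preload[name] = value   # leave the running operation undisturbed
        else:
            self.active[name] = value    # nothing in progress: write directly

    def desync_complete(self):
        """Move every modified preload value into its control register."""
        self.active.update(self.preload)
        self.preload.clear()
        self.desync_busy = False


vc = VectorControl()
vc.write("address0", 100)
vc.write("stride0", 1)
vc.write("length", 64)
vc.start_desync()                 # sqrt v0 v1 is now running
vc.write("address0", 200)         # second sas0 is redirected to the preload bank
seen_by_sqrt = vc.active["address0"]
vc.desync_complete()              # sqrt finishes; preload contents move across
vc.write("length", 96)            # slen now writes the control register directly
print(seen_by_sqrt, vc.active["address0"], vc.active["length"])  # 100 200 96
```

The running sqrt continues to see address 100 throughout, while the setup for the following operation proceeds without a resynchronization. The last write reproduces the mixed case in which the slen arrives after the sqrt has completed.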
[00148] Figure 3 represents a method where desynchronized execution may continue and allow additional instructions to execute even when those instructions have a resource contention because the arrangement of Figure 3 resolves the resource contention. The particular example shown in Figure 3 is illustrative and not limiting in scope.
[00149] Asynchronous Execution
[00150] ======================
[00151] Asynchronous execution is a form of desynchronized execution when certain actions cannot be predicted or anticipated because they are beyond the control of the processor.
[00152] An example of this is the programmatic loading or saving of local memory with an external memory or device. If a program instruction initiates the procedure for an external process to read out the local RAM and do something with the data, such as save it to an external memory, then the pipeline control 108 (also called pipe control) (in Figure 1) has no idea when that external process will actually read the local memory. Similarly, if a program instruction initiates the procedure for an external process to load new data into the local RAM then the pipe control 108 has no idea when that data will actually be written and useable by the processor.
[00153] This example can be further elucidated by two instructions:
[00154] xload r1 r2 r3 - load r2 bytes of data starting from external memory address r3 onward to local memory starting with address r1 onwards. That is, load the contents of external memory locations r3, r3+1,...,r3+r2-1 into the respective local memory locations r1, r1+1,...,r1+r2-1.
[00155] xsave r1 r2 r3 - save r2 bytes of data from local memory address r3 onward to external memory address r1 onwards. That is, save the contents of local memory locations r3, r3+1,...,r3+r2-1 into the respective external memory locations r1, r1+1,...,r1+r2-1.
[00156] where r1, r2, and r3 are registers that contain the desired values for the operation.
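By way of illustration only, the data movement performed by the two instructions can be written out in Python. The sketch models local and external memory as hypothetical byte arrays and performs each copy immediately; it does not model the asynchronous timing of the transfer.

```python
def xload(local, external, r1, r2, r3):
    """Copy r2 bytes from external[r3 ...] into local[r1 ...]."""
    local[r1:r1 + r2] = external[r3:r3 + r2]


def xsave(local, external, r1, r2, r3):
    """Copy r2 bytes from local[r3 ...] into external[r1 ...]."""
    external[r1:r1 + r2] = local[r3:r3 + r2]


local = bytearray(256)
external = bytearray(range(256))   # external[i] == i, for easy checking

xload(local, external, r1=100, r2=4, r3=16)
print(list(local[100:104]))        # [16, 17, 18, 19]

xsave(local, external, r1=0, r2=4, r3=100)
print(list(external[0:4]))         # [16, 17, 18, 19]
```

Note that in both instructions r1 names the destination address, r2 the byte count, and r3 the source address.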
[00157] Because it may take a significant amount of time for xload and xsave to carry out the operation, it would be preferable if pipe control 108 continues executing the instructions that follow the xload or xsave, just as it does for desynchronized execution. This variation of desynchronized execution is called asynchronous execution, as certain activities of the xload and xsave instructions are carried out asynchronously with respect to pipe control 108.
[00158] Asynchronous execution allows faster program execution. However, as with resynchronization, resource contentions and data dependencies must be considered. Resource allocation tracking 116 monitors for these issues while the asynchronous operations have not received an external indication of their completion, and when necessary instructs stall detection 112 to halt pipe control 108 from executing instructions when a problem is encountered that necessitates the halting of instruction execution until the problem is resolved or the asynchronous operation is completed. This is not the same as resynchronization because the asynchronous operation may complete while a desynchronized vector operation is still in progress. In that case the instruction that had to wait for the asynchronous operation to complete can now execute even though a resynchronization of the desynchronized vector operation has not been performed.
[00159] Consider the xload instruction. Once it is issued by pipe control 108, at some unpredictable point in the future an external process will write to the local memory the data that is being retrieved from an external memory or external device. If the local memory does not have separate write ports for external writes and internal (processor generated) writes, then this is a resource contention. Even if multiple write ports are present, a future instruction may need to use the new data being loaded by the xload. This too is a resource contention, the resource being the data and the contention being the correct ordering of the loading of the data from the external source and the usage of the data by an instruction that follows the xload.
[00160] Consider the xsave instruction. Once it is issued by pipe control 108 (i.e. pipeline control 108), at some unpredictable point in the future, an external process will read the data from the local memory and save it to external memory or to an external device. If the local memory does not have separate read ports for the external reads and internal (processor generated) reads then this is a resource contention. Even if multiple read ports are present, a future instruction may write over the data that is still in the process of being saved by the xsave instruction. This too is a resource contention, the resource being the data and the contention being the correct ordering of the reading of the data before it is overwritten by new data.
[00161] Here is an example instruction stream:
[00162] mov r0 100
[00163] mov r1 64
[00164] mov r2 0x12345678
[00165] xload r0 r1 r2 // load 64 bytes into local mem[100, 101, ..., 163] from external mem[0x12345678...]
[00166] add r7 r8 // these can be executed while the xload continues asynchronously
[00167] mul r7 r9
[00168] mov r9 500
[00169] store r9 r7 // writes r7 into local mem[500] - resource contention on memory write port with the xload
[00170] In this example, the xload is executed but the loading of new data into the local memory is performed asynchronously. The add and mul instructions can therefore be executed. But the store instruction needs to write data to the local memory. Since it is unpredictable when the xload will also write to the local memory, it is possible the store and xload will attempt to perform simultaneous writes which is not supported in a design with only one write port. Therefore, the store instruction must be stalled until xload has finished writing to the local memory. Resource allocation tracking 116 monitors the asynchronous xload, detects this contention, and instructs stall detection 112 to halt pipe control 108 from executing the store instruction until resource allocation tracking 116 determines the contention is resolved.
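The stall decision described in this example can be sketched in Python. This is a minimal illustration only, assuming a single write port and at most one outstanding xload; the class name `WritePortTracker`, the set `WRITES_LOCAL_MEMORY`, and the helper `run` are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass

# Assumption: in this toy model only "store" writes the local memory.
WRITES_LOCAL_MEMORY = {"store"}


@dataclass
class WritePortTracker:
    """Toy stand-in for resource allocation tracking 116 with one write port."""
    xload_outstanding: bool = False

    def issue(self, opcode: str) -> str:
        """Return 'execute' or 'stall' for the next instruction."""
        if opcode == "xload":
            # The xload is issued; its writes arrive asynchronously later.
            self.xload_outstanding = True
            return "execute"
        if self.xload_outstanding and opcode in WRITES_LOCAL_MEMORY:
            # The single write port may be needed by the xload at any time.
            return "stall"
        return "execute"

    def xload_complete(self) -> None:
        """External indication that the xload has finished writing."""
        self.xload_outstanding = False


def run(stream):
    """Issue each opcode in order and record the decision made for it."""
    tracker = WritePortTracker()
    return [(op, tracker.issue(op)) for op in stream]
```

Running `run(["mov", "mov", "mov", "xload", "add", "mul", "mov", "store"])` marks every instruction as `execute` except the final `store`, which stalls until `xload_complete()` is signalled.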
[00171] In this example, allowing xload to execute asynchronously gained some performance improvement, all the way up to the store instruction. But additional improvements can be made since the store instruction writes to a different memory location than the xload. It would be desirable for the store instruction and the instructions that follow to be allowed to execute while the asynchronous xload is still in progress.
[00172] One mechanism for such improvement is for the external process to request from the processor permission to write to the local memory and buffer the write data until such permission is given by pipe control 108. This may be perfectly satisfactory if only small amounts of data are to be loaded from external memory, but if a lot of data is being returned from external memory and permission from pipe control 108 to write to the local memory is delayed, then the buffer may need to be unacceptably large. (If a very long running vector instruction is being executed desynchronized then pipe control 108 cannot interrupt it since it's desynchronized. It may take a long time to complete before it is no longer using the write port.)
[00173] Another mechanism that solves this problem and eliminates the buffer is for the external process to shut off the clocks to the vector processor, perform the writes then turn the vector processor clocks back on. This is like the vector processor becoming unconscious for a moment and during that time of zero activity the local RAM was written to and only then the vector processor became conscious again. From the perspective of the vector processor, it is as if the new data suddenly appeared in the local memory. This requires the local memory to be on a clock separate from the rest of the vector processor which is not shut off during this "unconscious" operation.
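The clock-gating idea can be illustrated with a small Python model. It is a sketch under stated assumptions, not a description of the actual hardware: the class `GatedProcessor`, its fields, and the one-clock-edge-per-byte timing are all hypothetical.

```python
class GatedProcessor:
    """Toy model: local memory stays clocked while the processor clock is gated."""

    def __init__(self, memory_size: int = 1024):
        self.local_memory = bytearray(memory_size)  # on its own, ungated clock
        self.clock_enabled = True
        self.cycles_executed = 0

    def tick(self) -> None:
        """One processor clock edge; has no effect while the clock is off."""
        if self.clock_enabled:
            self.cycles_executed += 1

    def external_write(self, address: int, data: bytes) -> None:
        """External process writes local memory during an 'unconscious' interval."""
        self.clock_enabled = False          # shut off the vector processor clocks
        for offset, value in enumerate(data):
            self.local_memory[address + offset] = value
            self.tick()                     # gated edge: processor does not advance
        self.clock_enabled = True           # processor becomes 'conscious' again
```

From the processor's point of view in this model, `cycles_executed` is unchanged across `external_write`, yet the new bytes are present in `local_memory` on the next enabled clock edge.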
[00174] This "unconscious" operation does not solve all the problems. Consider the following instruction stream:
[00175] mov r0 100 // all the same instructions as before
[00176] mov r1 64
[00177] mov r2 0x12345678
[00178] xload r0 r1 r2
[00179] add r7 r8
[00180] mul r7 r9
[00181] mov r9 500
[00182] store r9 r7 // this instruction is now allowed to execute
[00183] etc // plus many more instructions
[00184] mov r9 100
[00185] fetch r7 r9 // fetch mem[100] and put it into r7 - this is a data contention with the xload!!!
[00186] In this example, the fetch instruction retrieves data from the local memory that is being loaded by the prior xload. The fetch cannot be allowed to execute until the xload has written this data into the local memory.
[00187] Resource allocation tracking 116 monitors the local memory addresses associated with the xload and initiates the process for stalling any instruction that reads or writes a memory address in that range. This is an automated means of resolving the contention. Programmatic means may also or alternatively be made available. A programmer generally knows whether data is being prefetched and when, later in the program, that data is used. Therefore, an instruction such as xlwait (xload wait) can be used by the programmer to alert pipe control 108 that it needs to wait until an outstanding asynchronous xload has completed before continuing with instruction execution. This can lead to a simpler design by moving the onus to the programmer to ensure the race hazard is avoided.
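Both the automated address-range check and the programmer-directed xlwait can be sketched as follows. This is an illustrative Python model only; `AddressRangeTracker` and its method names are hypothetical, and port contention is assumed to be handled separately (for example by the clock-gating mechanism).

```python
from dataclasses import dataclass, field


@dataclass
class AddressRangeTracker:
    """Toy model of address tracking for outstanding asynchronous xloads."""
    outstanding: list = field(default_factory=list)  # (start, length) pairs

    def xload_issued(self, start: int, length: int) -> None:
        self.outstanding.append((start, length))

    def xload_completed(self, start: int, length: int) -> None:
        self.outstanding.remove((start, length))

    def must_stall(self, address: int) -> bool:
        """Automated check: stall a read or write that falls in a pending range."""
        return any(s <= address < s + n for s, n in self.outstanding)

    def xlwait_must_stall(self) -> bool:
        """Programmatic check: xlwait stalls while any xload is outstanding."""
        return bool(self.outstanding)
```

With the example stream, `xload_issued(100, 64)` leaves `must_stall(500)` false, so the store proceeds, while `must_stall(100)` is true, so the later fetch waits until `xload_completed(100, 64)` is called.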
[00188] Similar considerations pertain to the xsave instruction:
[00189] - Pipe control 108 can issue an asynchronous execution of xsave and continue executing subsequent instructions until an instruction is encountered that has a memory read port contention.
[00190] - Memory read port contention can be eliminated by allowing external logic to shut off the vector processor clocks.
[00191] - Resource allocation tracking 116 monitors the local memory addresses associated with the xsave and initiates the process for stalling any instruction that modifies a memory address in that range.
[00192] - An xswait instruction can move the onus to the programmer to indicate when instruction execution should stall until the asynchronous operation is complete.
[00193] xsave has an additional consideration regarding what it means for its operation to complete. In the case of xload, the operation is not considered complete until all the data has been loaded into the local memory. But for xsave, there are two points that could be considered complete:
[00194] - all the data to be saved has been read out of the local memory
[00195] - all the data to be saved has been read out of the local memory and the external memory/device has acknowledged the receipt of such data.
[00196] The latter definition of complete allows the external memory/process to indicate that not only has the data been received (as in, the xsave saved it to a legal location) but to also indicate the integrity of the data received (as in, did it arrive with good parity, for example).
[00197] Most often, a program only cares for the former definition, i.e. that the data has been read from the internal memory even though it may not have yet been received and acknowledged by the external memory/device. This is because the program only cares that it can now continue execution and modify the data that was saved because the original state of the data is what is being saved.
[00198] But sometimes a program may need to know that the xsave is 100% complete in every way and that the external write has been acknowledged. For example, the data may be of such critical nature that if the data arrived with a parity error at the receiving end, the program may want to re-xsave the data until confirmation that good data was received has been acknowledged.
[00199] For this reason, there may be two variants of xswait, one for each definition of xsave completion.
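The two completion points, and a wait variant for each, might be modelled as below. The state names and the two function names are hypothetical; the disclosure says only that two xswait variants may exist.

```python
from enum import Enum


class XsaveState(Enum):
    READING = 1       # data is still being read out of the local memory
    READ_OUT = 2      # all data read out; external acknowledgement pending
    ACKNOWLEDGED = 3  # external memory/device has acknowledged receipt


def xswait_read_must_stall(state: XsaveState) -> bool:
    """First variant: stall only until the data has left the local memory."""
    return state is XsaveState.READING


def xswait_ack_must_stall(state: XsaveState) -> bool:
    """Second variant: stall until the external side has acknowledged the data."""
    return state is not XsaveState.ACKNOWLEDGED
```

In the `READ_OUT` state the first variant lets the program continue and overwrite the saved region, while the second keeps waiting, which suits data that must be confirmed as received intact.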
[00200] Figure 5 illustrates, generally at 500, a flowchart showing asynchronous, desynchronous, and synchronous execution of an instruction. At 502 fetch the next instruction to execute. Then proceed via 503 to 504. At 504 determine if the fetched next instruction to execute affects or is dependent on the results of any current desynchronized instruction in progress, i.e. is there a desynchronization contention. When the fetched next instruction to execute affects or is dependent on the results of any current desynchronized instruction in progress (Yes) then go via 519 to 520. At 520 resynchronize execution by waiting for all desynchronized operations to complete before proceeding via 505 to 506. When the fetched next instruction to execute does not affect or is not dependent on the results of any current desynchronized instruction in progress (No) go via 505 to 506.
[00201] At 506 determine if the fetched next instruction to execute affects or is dependent on the results of any asynchronous operation in progress, i.e. an asynchronous contention. When the next instruction to execute affects or is dependent on the results of any asynchronous operation in progress (Yes), go via 521 to 522, otherwise if the next instruction to execute does not affect or is not dependent on the results of any asynchronous operation in progress (No) go via 507 to 508. At 522 synchronize execution by waiting for all asynchronous operations to complete before proceeding via 507 to 508.
At 508 determine if the next instruction to execute can execute asynchronously. When the next instruction to execute can execute asynchronously (Yes), go via 517 to 518, otherwise if the next instruction to execute cannot execute asynchronously (No) go via 509 to 510.
At 518 initiate asynchronous execution by allowing the processor to execute the next instruction asynchronously.
[00202] At 510 determine if the fetched next instruction can execute desynchronously. When the next instruction can execute desynchronously (Yes) then proceed via 511 to 512. At 512 initiate desynchronous execution by allowing the processor to execute the fetched next instruction desynchronously, that is, the completion of the fetched next instruction occurs desynchronously with respect to the control of the processor but the processor tracks when an internal signal is given that indicates the operation is complete. The processor does not wait for this completion signal before continuing via 515 to 502.
[00203] When the next instruction cannot execute desynchronously (No) then proceed via 513 to 514. At 514 initiate synchronous execution by allowing the processor to execute the fetched next instruction synchronously, that is, the instruction has the appearance to the program that it fully completes before continuing via 515 to 502. The processor may be pipelined or employ other overlapped execution techniques; however, it does so in a manner that makes it appear to a program that it completes the instruction before continuing to 502.
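The decision sequence of Figure 5 can be summarized as a short Python function. This is a sketch for illustration: the `Instr` fields stand in for determinations the hardware would make, and the action strings are hypothetical labels for blocks 520, 522, 518, 512, and 514.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    desync_contention: bool = False  # outcome of the test at 504
    async_contention: bool = False   # outcome of the test at 506
    can_async: bool = False          # outcome of the test at 508
    can_desync: bool = False         # outcome of the test at 510


def dispatch(instr: Instr) -> list:
    """Return, in order, the actions Figure 5 takes for one fetched instruction."""
    actions = []
    if instr.desync_contention:
        actions.append("resynchronize")    # 520: wait for desynchronized operations
    if instr.async_contention:
        actions.append("wait_async")       # 522: wait for asynchronous operations
    if instr.can_async:
        actions.append("execute_async")    # 518
    elif instr.can_desync:
        actions.append("execute_desync")   # 512
    else:
        actions.append("execute_sync")     # 514
    return actions
```

For instance, `dispatch(Instr("fetch", async_contention=True))` yields `["wait_async", "execute_sync"]`, matching a fetch that must wait for an outstanding xload before executing synchronously.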
[00204] Figure 6 illustrates, generally at 600, a flowchart showing execution of vector instructions. At 602 a determination is made if a first vector instruction is currently executing. When the first vector instruction is not currently executing (No) then via 601 return to 602. When the first vector instruction is currently executing (Yes) then via 603 proceed to 604 and use parameters stored in registers for accessing a memory access control for the first vector instruction then proceed via 605 to 606.
[00205] At 606 a determination is made if the first vector instruction has finished execution. When the first vector instruction has finished execution (Yes) then proceed via 601 to 602. When the first vector instruction has not finished execution (No) proceed via 607 to 608.
[00206] At 608 a determination is made if a second vector instruction is waiting to execute. When a second vector instruction is not waiting to execute (No) then return via 601 to 602. When a second vector instruction is waiting to execute (Yes) then proceed via 609 to 610 and load new vector parameters into memory access control preload registers for use with the second vector instruction, then proceed via 611 to 612. At 612 a determination is made if the first vector instruction has finished execution. When the first vector instruction has not finished execution (No) then proceed via 611 to 612. When the first vector instruction has finished execution (Yes) proceed via 613 to 614. At 614 switch a multiplexor to a preload position thereby copying contents of the memory access control preload registers into the memory access control registers, then proceed via 615 to 616. At 616 switch the multiplexor to a non-preload position, then proceed via 617 to 618. At 618 execute the second vector instruction, denoting the second vector instruction as the first vector instruction, and returning via 601 to 602.
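The preload hand-off of Figure 6 can be sketched as a small Python model. The class `MemoryAccessControlRegisters`, its method names, and the parameter keys are hypothetical; the model copies values in software where the hardware would route them through multiplexors.

```python
PARAMS = ("length", "constant", "address", "stride")


class MemoryAccessControlRegisters:
    """Toy model of preload registers feeding the active registers via a mux."""

    def __init__(self):
        self.preload = dict.fromkeys(PARAMS, 0)
        self.active = dict.fromkeys(PARAMS, 0)
        self.mux_position = "non-preload"

    def load_preload(self, **new_params) -> None:
        """Block 610: stage parameters for the waiting second vector instruction."""
        self.preload.update(new_params)

    def hand_off(self) -> None:
        """Blocks 614 and 616: copy preload into active, then restore the mux."""
        self.mux_position = "preload"
        self.active = dict(self.preload)
        self.mux_position = "non-preload"
```

While the first vector instruction is executing, `load_preload(length=128, stride=2)` leaves `active` untouched; only `hand_off()`, called once the first instruction has finished, makes the new parameters visible to the memory access control.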
[00207] When the multiplexor is in the non-preload position it allows new vector parameters to be set up. For example, referring to Figure 3, in the non-preload position multiplexor control 330 allows new vector parameters 301 to enter multiplexors 310, 312, 314, and 316, and to propagate respectively via 311, 313, 315, and 317 to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328, respectively.
[00208] When the multiplexor is in the preload position it allows new vector parameters to be set up from the memory access control preload registers 340. For example, referring to Figure 3, in the preload position multiplexor control 330 allows new vector parameters 301, which have been loaded into vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308, to enter multiplexors 310, 312, 314, and 316 via 303, 305, 307, and 309, respectively, and to propagate via 311, 313, 315, and 317 to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328, respectively.
[00209] Figure 7 illustrates, generally at 700, a flowchart showing execution of desynchronized vector instructions in addition to non-desynchronized instructions. At 702 a determination is made if a desynchronized vector instruction is currently executing. If a desynchronized vector instruction is not currently executing (No) then via 703 proceed to 714. At 714 new desynchronized vector instructions are allowed to execute in addition to non-desynchronized instructions, and execution proceeds via 701 to 702.
[00210] If a desynchronized vector instruction is currently executing (Yes) then via 705 proceed to 704. At 704 use the parameters stored in the memory access control registers (e.g. Figure 3 at 350) for accessing a memory access control for vector instructions, then proceed via 707 to 706. At 706 a determination is made if there is an instruction attempting to modify a memory access control register or registers (register(s)) (e.g. Figure 3 at 350). When there is not an instruction attempting to modify a memory access control register(s) (e.g. Figure 3 at 350) (No) then via 703 proceed to 714.
[00211] When there is an instruction attempting to modify a memory access control register(s) (Yes) then via 709 proceed to 708. At 708 modify the corresponding memory access control preload register or registers (register(s)) (e.g. Figure 3 at 340) instead of the memory access control register(s) (e.g. Figure 3 at 350), then via 711 proceed to 710. For example, using Figure 3, the vector length register 322 has a corresponding vector length preload register 302. The example holds for vector constant register 324 and corresponding vector constant preload register 304. The example holds for vector address register 326 and corresponding vector address preload register 306. The example holds for vector stride register 328 and corresponding vector stride preload register 308.
[00212] At 710 disallow new desynchronized vector instructions from executing but continue to allow non-desynchronized instructions to execute, then via 713 proceed to 712.
[00213] At 712 a determination is made if all desynchronized vector instructions have completed. When all desynchronized vector instructions have not completed (No) then proceed via 715 to 704. When all desynchronized vector instructions have completed (Yes) then proceed via 717 to 716.
[00214] At 716 move any modified memory access control preload register(s) parameters into the memory access control register(s) and then proceed via 719 to 718. Optionally, at 720, move all memory access control preload registers parameters into the memory access control registers, without consideration as to whether they have been modified. For example, using Figure 3, move all memory access control preload registers 340 parameters into the memory access control registers 350, using the multiplexor control 330.
[00215] At 718 instructions that modify memory access control register(s) no longer modify memory access control preload register(s), then proceed via 703 to 714. That is, for example, instructions that would modify memory access control registers (e.g. Figure 3 at 350) can now do so rather than modifying the memory access control preload registers (e.g. Figure 3 at 340). After 718, proceed to 714 to allow new desynchronized vector instructions to execute in addition to non-desynchronized instructions.
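The redirection of register writes in Figure 7 can be sketched in Python as follows. The class `DesyncRegisterFile` and its members are hypothetical names chosen for illustration, and the model commits only the modified parameters, as at 716.

```python
class DesyncRegisterFile:
    """Toy model of Figure 7: writes go to preload while a desync vector op runs."""

    def __init__(self, **initial):
        self.active = dict(initial)      # memory access control registers
        self.preload = {}                # memory access control preload registers
        self.modified = set()
        self.desync_in_flight = 0
        self.new_desync_allowed = True

    def start_desync_vector(self) -> bool:
        """Try to begin a desynchronized vector instruction (refused after 710)."""
        if not self.new_desync_allowed:
            return False
        self.desync_in_flight += 1
        return True

    def write_register(self, name: str, value: int) -> None:
        if self.desync_in_flight > 0:
            self.preload[name] = value        # 708: modify the preload register
            self.modified.add(name)
            self.new_desync_allowed = False   # 710
        else:
            self.active[name] = value

    def desync_vector_completed(self) -> None:
        self.desync_in_flight -= 1
        if self.desync_in_flight == 0:
            for name in self.modified:        # 716: commit modified parameters
                self.active[name] = self.preload[name]
            self.modified.clear()
            self.new_desync_allowed = True    # 718, then 714
```

A write such as `write_register("length", 128)` issued while a desynchronized vector instruction is in flight leaves `active["length"]` unchanged until `desync_vector_completed()` drains the last outstanding operation.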
[00216] Relevance with respect to co-pending Application Number 17/468,574, filed on 09/07/2021.
[00217] ================================
[00218] These methods can be used with co-pending Application Number 17/468,574, filed on 09/07/2021. Co-pending Application Number 17/468,574, filed on 09/07/2021, describes a parameter stack, register stack, and subroutine call stack that are separated from the local memory; these stacks are used extensively by the vector ALU 124.
[00219] Consider the following instruction sequence, which is similar to a previous example on desynchronized execution:
[00220] mov r0 100
[00221] mov r1 1
[00222] sas0 r0 r1
[00223] sas1 r0 r1
[00224] mov r2 64
[00225] slen r2
[00226] sqrt v0 v1 // this could execute desynchronized
[00227] push r0 // as long as this does not have the stack in the same memory the vector ALU uses!
[00228] push r1
[00229] push r2
[00230] call function_that_does_vector_log
[00231] Pushing/popping parameters onto/from a stack, saving and restoring of registers, and subroutine calls and returns are all very common operations and it is undesirable if they cause the resynchronization of desynchronized or asynchronous execution. Co-pending Application Number 17/468,574, filed on 09/07/2021 avoids this resynchronization and therefore is synergistic with the techniques disclosed herein.
[00232] Thus a Method and Apparatus for Desynchronizing Execution in a Vector Processor have been described.
[00233] For purposes of discussing and understanding the examples, it is to be understood that various terms are used by those knowledgeable in the art to describe techniques and approaches. Furthermore, in the description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the examples. It will be evident, however, to one of ordinary skill in the art that the examples may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the examples. These examples are described in sufficient detail to enable those of ordinary skill in the art to practice the examples, and it is to be understood that other examples may be utilized and that logical, mechanical, and other changes may be made without departing from the scope of the examples.
[00234] As used in this description, "one example" or "an example" or similar phrases means that the feature(s) being described are included in at least one example.
References to "one example" in this description do not necessarily refer to the same example; however, neither are such examples mutually exclusive. Nor does "one example" imply that there is but a single example. For example, a feature, structure, or act, without limitation, described in "one example" may also be included in other examples. Thus, the invention may include a variety of combinations and/or integrations of the examples described herein.
[00235] As used in this description, "substantially" or "substantially equal" or similar phrases are used to indicate that the items are very close or similar. Since two physical entities can never be exactly equal, a phrase such as "substantially equal" is used to indicate that they are for all practical purposes equal.
[00236] It is to be understood that in any one or more examples where alternative approaches or techniques are discussed that any and all such combinations as may be possible are hereby disclosed. For example, if there are five techniques discussed that are all possible, then denoting each technique as follows: A, B, C, D, E, each technique may be either present or not present with every other technique, thus yielding 2^5 or 32 combinations, in binary order ranging from not A and not B and not C and not D and not E to A and B and C and D and E. Applicant(s) hereby claims all such possible combinations. Applicant(s) hereby submit that the foregoing combinations comply with applicable EP (European Patent) standards. No preference is given to any combination.
[00237] Thus a Method and Apparatus for Desynchronizing Execution in a Vector Processor have been described.

Claims

CLAIMS What is claimed is:
1. A vector processor unit comprising: a plurality of memory access control preload registers, each memory access control preload register having an input and an output, all the memory access control preload register inputs coupled to receive a new vector parameters; a plurality of multiplexors, each multiplexor having a first input, a second input, a switching input, and an output, each of the memory access control preload register outputs coupled to the first input of a respective multiplexor, each of the second input of the respective multiplexor coupled to receive the new vector parameters; a multiplexor control, each of the multiplexor switching inputs responsive to the multiplexor control; a plurality of memory access control registers, each memory access control register having an input and an output, each of the memory access control register inputs coupled to the respective multiplexor outputs; and a memory access control, the memory access control having a plurality of inputs, the plurality of memory access control register outputs coupled to the respective memory access control inputs.
2. The vector processing unit of claim 1 wherein the plurality of memory access control preload registers is selected from the group consisting of a vector length preload register, a vector constant preload register, a vector address preload register, and a vector stride preload register; and wherein the plurality of memory access control registers is selected from the group consisting of a vector length register, a vector constant register, a vector address register, and a vector stride register.
3. The vector processing unit of claim 1 wherein: the plurality of memory access control preload registers comprise a vector length preload register, a vector constant preload register, a vector address preload register, and a vector stride preload register; and wherein the plurality of memory access control registers comprise a vector length register, a vector constant register, a vector address register, and a vector stride register.
4. A method comprising:
(a) fetching a next instruction;
(b) determining if there is a desynchronization contention with the next instruction;
(c) when there is the desynchronization contention with the next instruction then waiting for any desynchronized operations to complete;
(h) determining if the next instruction can execute desynchronously;
(i) when the next instruction can execute desynchronously then initiating desynchronous execution and then return to (a); and
(j) when the next instruction cannot execute desynchronously then initiating synchronous execution and then return to (a).
5. The method of claim 4 comprising inserted in alphabetical order:
(d) determining if there is an asynchronous contention with the next instruction;
(e) when there is the asynchronous contention with the next instruction then waiting for any asynchronous operations to complete;
(f) determining if the next instruction can execute asynchronously;
(g) when the next instruction can execute asynchronously then initiating asynchronous execution and then return to (a).
6. A method comprising:
(a) determining if a first vector instruction is currently executing;
(b) when the first vector instruction is not currently executing then returning to (a);
(c) when the first vector instruction is currently executing then accessing a memory access control for the first vector instruction using vector parameters stored in registers;
(d) determining if a second vector instruction is waiting to execute;
(e) when the second vector instruction is not waiting to execute then returning to (a);
(f) when the second vector instruction is waiting to execute then loading new vector parameters into preload registers for use with the second vector instruction;
(g) determining if the first vector instruction has finished execution;
(h) when the first vector instruction has not finished execution then returning to (g);
(i) when the first vector instruction has finished execution then switching a multiplexor to a preload position so as to copy contents of the preload registers into the registers;
(j) switching the multiplexor to a non-preload position; and
(k) executing the second vector instruction, denoting the second vector instruction as the first vector instruction, and returning to (a).
7. The method of claim 6 comprising the multiplexor non-preload position connecting to a new vector parameters.
8. A method comprising:
(a) determining if a desynchronized vector instruction is currently executing;
(b) when the desynchronized vector instruction is not currently executing proceed to (c);
(c) allowing new desynchronized vector instructions to execute in addition to allowing non-desynchronized instructions to execute;
(d) using parameters stored in memory access control registers for accessing a memory access control for vector instructions;
(e) determining if an instruction is attempting to modify one or more memory access control registers;
(f) when the instruction is not attempting to modify the one or more memory access control registers then proceeding to (c);
(g) when the instruction is attempting to modify the one or more memory access control registers then modifying one or more corresponding memory access control preload registers;
(h) disallowing new desynchronized vector instructions from executing but continuing to allow non-desynchronized instructions to execute;
(i) determining if all desynchronized vector instructions have completed execution;
(j) when all the desynchronized vector instructions have not completed execution then proceeding to (d);
(k) when all the desynchronized vector instructions have completed execution then proceeding to (L);
(L) moving any modified memory access control preload registers parameters into the one or more corresponding memory access control registers; and
(m) allowing instructions that modify memory access control register(s) parameters to no longer modify the corresponding memory access control preload registers parameter(s), then proceeding to (c).
9. The method of claim 8 wherein at (L) moving any modified memory access control preload register(s) parameters into the corresponding memory access control register(s) is by switching a multiplexor.
PCT/US2022/021525 2021-04-27 2022-03-23 Method and apparatus for desynchronizing execution in a vector processor WO2022231733A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE112022000535.1T DE112022000535T5 (en) 2021-04-27 2022-03-23 Method and device for desynchronizing execution in a vector processor
CN202280017945.7A CN117083594A (en) 2021-04-27 2022-03-23 Method and apparatus for desynchronized execution in a vector processor

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US202163180634P 2021-04-27 2021-04-27
US202163180601P 2021-04-27 2021-04-27
US202163180562P 2021-04-27 2021-04-27
US63/180,562 2021-04-27
US63/180,634 2021-04-27
US63/180,601 2021-04-27
US17/468,574 US20220342668A1 (en) 2021-04-27 2021-09-07 System of Multiple Stacks in a Processor Devoid of an Effective Address Generator
US17/468,574 2021-09-07
US17/669,995 US20220342590A1 (en) 2021-04-27 2022-02-11 Method and Apparatus for Gather/Scatter Operations in a Vector Processor
US17/669,995 2022-02-11
US17/701,582 2022-03-22
US17/701,582 US11782871B2 (en) 2021-04-27 2022-03-22 Method and apparatus for desynchronizing execution in a vector processor

Publications (1)

Publication Number Publication Date
WO2022231733A1 true WO2022231733A1 (en) 2022-11-03

Family

ID=81325847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/021525 WO2022231733A1 (en) 2021-04-27 2022-03-23 Method and apparatus for desynchronizing execution in a vector processor

Country Status (1)

Country Link
WO (1) WO2022231733A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590353A (en) * 1993-07-15 1996-12-31 Hitachi, Ltd. Vector processor adopting a memory skewing scheme for preventing degradation of access performance
US6101596A (en) * 1995-03-06 2000-08-08 Hitachi, Ltd. Information processor for performing processing without register conflicts

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22715918

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280017945.7

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 112022000535

Country of ref document: DE