WO2020220935A1 - Computing device (运算装置) - Google Patents

Computing device

Info

Publication number
WO2020220935A1
Authority
WO
WIPO (PCT)
Prior art keywords
operand
memory
instruction
instructions
sub
Prior art date
Application number
PCT/CN2020/083280
Other languages
English (en)
French (fr)
Inventor
刘少礼
赵永威
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201910545270.3A (published as CN111860798A)
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2020220935A1

Classifications

    • G06N 3/063: Physical realisation, i.e. hardware implementation, of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/0418: Neural network architecture, e.g. interconnection topology, using chaos or fractal principles
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 20/00: Machine learning
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 9/3009: Thread control instructions
    • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/3828: Bypassing or forwarding of data results with global bypass, e.g. between pipelines, between clusters
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F 9/3889: Concurrent instruction execution using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, and in particular to a computing device.
  • neural network algorithms are among the most popular machine learning algorithms in recent years and have achieved very good results in various fields, such as image recognition, speech recognition, and natural language processing.
  • as the complexity of these algorithms grows, the scale of the models keeps increasing; processing such large-scale models with GPUs and CPUs takes a great deal of computing time and consumes a great deal of power.
  • the present disclosure proposes a computing device whose hierarchical structure is built in a multi-layer iterative manner.
  • every computing node of the computing device has the same structure.
  • computing nodes on different layers, and computers of different scales, have the same programming interface and instruction set architecture, so they can execute programs of the same format; this simplifies user programming, and extending the computing device or porting programs between different computing devices becomes very easy.
  • a computing device including at least two layers of computing nodes, where each computing node includes a memory component, a processor, and next-layer computing nodes;
  • the processor in any computing node is used to decompose the input instruction of that computing node into parallel sub-instructions and send the parallel sub-instructions to the next-layer computing nodes of that computing node;
  • that computing node is also used to load the operands required to execute the parallel sub-instructions from the memory component of its upper-layer computing node into its own memory component, so that its next-layer computing nodes execute the parallel sub-instructions in parallel on those operands.
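As a rough illustration of this layered decomposition, the following Python sketch (class and method names such as Node, Instruction and run are hypothetical, not taken from the patent) models a node that copies the needed operands from its parent's memory into its own memory, splits the work across its child nodes, and combines their partial results.

```python
# Hypothetical sketch of the layered node structure described above; names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class Instruction:
    op: str               # operator, e.g. "ADD"
    operand: List[float]  # operand data, kept as a flat vector for simplicity

class Node:
    def __init__(self, depth: int, fanout: int):
        self.memory: List[float] = []   # this node's memory component
        self.children = [Node(depth - 1, fanout) for _ in range(fanout)] if depth > 0 else []

    def run(self, inst: Instruction, parent_memory: List[float]) -> List[float]:
        # Load the operands needed by this instruction from the parent's memory component.
        self.memory = list(parent_memory)
        if not self.children:
            # Leaf node: perform the actual computation (a toy "+1" stands in for ADD).
            return [x + 1 for x in self.memory] if inst.op == "ADD" else self.memory
        # Parallel decomposition: one sub-instruction (and operand slice) per child node.
        n = len(self.children)
        size = (len(self.memory) + n - 1) // n
        chunks = [self.memory[i * size:(i + 1) * size] for i in range(n)]
        results = [child.run(Instruction(inst.op, chunk), chunk)
                   for child, chunk in zip(self.children, chunks)]
        # Combine the children's partial results (plain concatenation here).
        combined: List[float] = []
        for r in results:
            combined.extend(r)
        return combined

root = Node(depth=2, fanout=2)
print(root.run(Instruction("ADD", []), parent_memory=[1.0, 2.0, 3.0, 4.0]))  # [2.0, 3.0, 4.0, 5.0]
```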
  • any one of the computing nodes further includes: a memory controller,
  • a data path connects the memory component of any computing node with the memory components of that node's upper-layer computing node and next-layer computing nodes; the memory controller is connected to the data path and controls it to send the operands of the input instruction from one memory component to another.
  • the processor includes a serial decomposer, a parallel decomposer, and a decoder, and the memory controller is connected to the serial decomposer and the decoder;
  • the serial decomposer is configured to serially decompose the input instruction into serial sub-instructions according to the capacity of the memory component of the computing node and the memory capacity required by the input instruction;
  • the decoder is used to decode the serial sub-instructions, send them to the parallel decomposer, and send a control signal to the memory controller according to each serial sub-instruction; according to the control signal, the memory controller loads the operands required to execute the serial sub-instruction from the memory component of the upper-layer computing node into the memory component of the computing node;
  • the parallel decomposer is used to decompose each decoded serial sub-instruction into parallel sub-instructions according to the number of next-layer computing nodes, and to send the parallel sub-instructions to the next-layer computing nodes so that they execute the parallel sub-instructions on the operands.
  • the device performs serial decomposition of the input instruction to obtain serial sub-instructions.
  • the memory component of any one computing node includes a static memory segment and a dynamic memory segment.
  • the operands of the input instruction include shared operands and other operands; the serial decomposer serially decomposes the input instruction into serial sub-instructions according to the relationship between the memory capacity required by the shared operands and the remaining capacity of the static memory segment, and the relationship between the memory capacity required by the other operands and the capacity of the dynamic memory segment;
  • the shared operand is an operand used in common by the serial sub-instructions, and the other operands are the operands of the input instruction other than the shared operand.
  • the decomposed serial sub-instructions include a head instruction and a body instruction; the decoder sends a first control signal to the memory controller according to the head instruction, and the memory controller loads the shared operand from the memory component of the upper-layer computing node into the static memory segment according to the first control signal;
  • the decoder sends a second control signal to the memory controller according to the body instruction, and the memory controller loads the other operands from the memory component of the upper-layer computing node into the dynamic memory segment according to the second control signal.
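A minimal sketch of this serial decomposition, under the assumption that capacities are counted in elements and that the shared operand fits in the static segment (function and field names are illustrative only): one head instruction loads the shared operand once, and as many body instructions as needed each load a slice of the other operands that fits in the dynamic segment.

```python
# Illustrative serial decomposition into a head instruction plus body instructions
# (hypothetical names; capacities counted in elements).
def serial_decompose(shared_size, other_size, static_free, dynamic_capacity):
    if shared_size > static_free:
        raise ValueError("shared operand itself would need further decomposition")
    # Number of body instructions so that each slice of the other operands fits
    # into the dynamic memory segment (ceiling division).
    parts = max(1, -(-other_size // dynamic_capacity))
    head = {"type": "HEAD", "load_shared": shared_size, "segment": "static"}
    body = [{"type": "BODY",
             "load_other": min(dynamic_capacity, other_size - i * dynamic_capacity),
             "segment": "dynamic"} for i in range(parts)]
    return [head] + body

for sub in serial_decompose(shared_size=64, other_size=300,
                            static_free=128, dynamic_capacity=100):
    print(sub)
```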
  • the processor further includes a control unit, and the computing node further includes a local processing unit;
  • the input terminal of the control unit is connected to the output terminal of the decoder, and the output terminal of the control unit is connected to the input terminal of the local processing unit;
  • according to the serial sub-instruction, the control unit controls the local processing unit to perform reduction processing on the operation results of the next-layer computing nodes to obtain the operation result of the input instruction.
  • an output dependency of the serial sub-instructions means that the operation results of the serial sub-instructions need to be reduced to obtain the operation result of the input instruction.
  • when the control unit detects that the resources required to perform the reduction processing on the operation results of the next-layer computing nodes exceed the resource limit of the local processing unit, the control unit sends a delegation instruction to the parallel decomposer according to the serial sub-instruction;
  • according to the delegation instruction, the parallel decomposer controls the next-layer computing nodes to perform the reduction processing on their own operation results to obtain the operation result of the input instruction.
  • the hierarchical structure of the computing device is constructed through multi-layer iteration.
  • every computing node of the computing device has the same structure.
  • computing nodes on different layers, and computers of different scales, have the same programming interface and instruction set architecture, so they execute programs of the same format; data is loaded implicitly between layers and users do not need to manage memory space, which simplifies user programming, and extending the computing device or porting programs between different computing devices becomes very easy.
  • an arithmetic device includes multiple layers of computing nodes, and each computing node includes a processor and next-layer computing nodes;
  • the processor in any computing node controls its next-layer computing nodes to execute the operation corresponding to that node's input instruction in multiple stages, in a pipelined manner;
  • the multiple stages include operation execution EX, and the next-layer computing nodes are used to execute the operation-execution stage of the pipeline.
  • any one of the computing nodes further includes: a local processing unit, a memory component, and a memory controller, and the processor includes: a pipeline control unit, a decoder, and a reduction control unit,
  • the input terminal of the decoder receives the input instruction, and the output terminal of the decoder is connected to the input terminal of the memory controller,
  • a data path is connected between the memory component of any one computing node and the memory components of the upper layer of computing node and the next layer of computing node of any one computing node,
  • the memory controller is connected to the data path, and controls the data path to send the operand of the input instruction from one memory component to another memory component,
  • the output terminal of the decoder is also connected to the input terminal of the next layer arithmetic node and the input terminal of the reduction control unit, and the reduction control unit is connected to the local processing unit,
  • the pipeline control unit is connected to the decoder, the reduction control unit, and the memory controller.
  • any computing node further includes pipeline latches: pipeline latches are set between the decoder and the memory controller, between the memory controller and the next-layer computing nodes, between the next-layer computing nodes and the local processing unit, and between the local processing unit and the memory controller.
  • the pipeline control unit synchronizes the multiple stages by controlling the pipeline latches.
  • the multiple stages further include: instruction decoding ID, data loading LD, operation reduction RD, and data writing back WB,
  • the pipeline propagates in the order instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB,
  • the decoder is used for instruction decoding ID;
  • the memory controller is used for data loading LD: the operands of the input instruction are loaded into the memory component;
  • the reduction control unit is used for controlling the local processing unit to perform operation reduction RD to obtain the operation result of the input instruction;
  • the memory controller is further configured to write the operation result back (WB) to the memory component of the upper-layer computing node of the computing node.
  • after the pipeline control unit receives the first feedback signals sent by the decoder, the memory controller, the next-layer computing nodes, and the reduction control unit, it sends a first control signal to each pipeline latch, and each pipeline latch updates its output according to the first control signal.
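The latch-synchronized pipeline just described could be pictured as follows: the controller lets all latches shift together once every stage has reported completion for the current cycle. This is a toy model with hypothetical names, not the patent's implementation.

```python
# Toy model of the ID -> LD -> EX -> RD -> WB pipeline with latches that all
# update together once per cycle (hypothetical names).
STAGES = ["ID", "LD", "EX", "RD", "WB"]

def run_pipeline(instructions):
    latches = [None] * len(STAGES)      # one latch in front of each stage
    pending = list(instructions)
    cycle = 0
    while pending or any(latches):
        # Each stage works on whatever its input latch holds this cycle.
        for stage, inst in zip(STAGES, latches):
            if inst is not None:
                print(f"cycle {cycle}: {stage} processing {inst}")
        # Every stage done: the pipeline controller lets all latches update at once.
        latches = [pending.pop(0) if pending else None] + latches[:-1]
        cycle += 1

run_pipeline(["inst0", "inst1", "inst2"])
```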
  • when the DD detects that a serial sub-instruction has a data dependency, the DD stops fetching serial sub-instructions from the sub-instruction queue SQ.
  • the processor further includes a serial decomposer connected to the input terminal of the decoder; the serial decomposer is used to serially decompose the input instruction into serial sub-instructions;
  • the processor controls the next-layer computing nodes to execute the operations corresponding to the serial sub-instructions in multiple stages, in a pipelined manner.
  • when the decoder detects that the input operands of the currently decoded serial sub-instruction do not overlap the output operands of the previous serial sub-instructions, it decodes the currently decoded serial sub-instruction and preloads its data onto the next-layer computing nodes.
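This preloading condition amounts to an overlap test: a serial sub-instruction may be decoded and its data preloaded only if its input address ranges do not overlap the output address ranges of earlier, not-yet-finished sub-instructions. A small sketch under the assumption that operands can be modelled as half-open address ranges (names are hypothetical):

```python
# Sketch of the data-dependency check that gates preloading.
def ranges_overlap(a, b):
    # a and b are half-open ranges (start, end)
    return a[0] < b[1] and b[0] < a[1]

def can_preload(inputs, pending_outputs):
    return not any(ranges_overlap(i, o) for i in inputs for o in pending_outputs)

# Earlier sub-instruction writes [100, 200); the next one reads [150, 180):
print(can_preload([(150, 180)], [(100, 200)]))   # False -> must wait
print(can_preload([(250, 300)], [(100, 200)]))   # True  -> preload allowed
```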
  • the processor further includes a parallel decomposer; the input terminal of the parallel decomposer is connected to the output terminal of the decoder, and the output terminal of the parallel decomposer is connected to the input terminals of the next-layer computing nodes;
  • the parallel decomposer is used to decompose each decoded serial sub-instruction into parallel sub-instructions according to the number of next-layer computing nodes, and to send the parallel sub-instructions to the next-layer computing nodes.
  • a sub-instruction queue SQ is provided between the serial decomposer and the decoder, and the sub-instruction queue SQ is used to temporarily store the serial sub-instructions.
  • an arithmetic device includes multiple layers of computing nodes, and each computing node includes a memory component, a processor, and next-layer computing nodes.
  • the memory component includes a static memory segment and a cyclic memory segment;
  • the processor is used to decompose the input instruction of the computing node into multiple sub-instructions;
  • the processor allocates memory space for the shared operand in the static memory segment, and allocates memory space for the other operands of the multiple sub-instructions in the cyclic memory segment;
  • the shared operand is the operand that the next-layer computing nodes of the computing node all use when executing the multiple sub-instructions;
  • the other operands are the operands of the multiple sub-instructions other than the shared operand.
  • a first counter is provided in the processor, and the cyclic memory segment includes multiple sub-memory blocks,
  • the processor allocating memory space for other operands of the multiple sub-instructions in the cyclic memory segment includes:
  • the processor allocates memory space for the other operands from the sub-memory block corresponding to the count value of the first counter in the cyclic memory segment.
  • a second counter is provided in the processor
  • the processor allocating memory space for the shared operand in the static memory segment includes:
  • the processor allocates memory space for the shared operand starting from the first starting end in the static memory segment, where the first starting end is the starting end corresponding to the count value of the second counter.
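One way to picture the two counters: the first counter selects which sub-memory block of the cyclic segment receives the next instruction's non-shared operands (wrapping around), while the second counter selects the starting end of the static segment used for the next shared operand. The sketch below is a hypothetical illustration with made-up sizes and names, not the patent's allocator.

```python
# Hypothetical sketch of the two allocation counters described above.
class MemoryAllocator:
    def __init__(self, static_size=256, block_size=64, num_blocks=3):
        self.static_size = static_size
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.counter1 = 0   # selects the sub-memory block of the cyclic segment
        self.counter2 = 0   # selects which end of the static segment to start from

    def alloc_shared(self, size):
        # Shared operands: allocate from alternating starting ends of the static segment.
        start = 0 if self.counter2 % 2 == 0 else self.static_size - size
        self.counter2 += 1
        return ("static", start, size)

    def alloc_other(self, size):
        # Other operands: use the sub-memory block selected by the first counter.
        block = self.counter1 % self.num_blocks
        self.counter1 += 1
        base = self.static_size + block * self.block_size
        return ("cyclic", base, size)

alloc = MemoryAllocator()
print(alloc.alloc_shared(32))   # ('static', 0, 32)
print(alloc.alloc_other(16))    # ('cyclic', 256, 16)
print(alloc.alloc_other(16))    # next sub-memory block: ('cyclic', 320, 16)
```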
  • the processor includes a serial resolver SD,
  • the processor is used to decompose the input instruction of any arithmetic node to obtain multiple sub-instructions, including:
  • according to the memory capacity required by the input instruction, the capacity of the static memory segment, and the capacity of the cyclic memory segment, the SD serially decomposes the input instruction to obtain the serial sub-instructions.
  • the processor includes a serial decomposer SD; according to the count value of the second counter, the SD determines the first starting end from which memory space is allocated for the shared operand;
  • the SD calculates the remaining memory capacity of the static memory segment starting from the first starting end, and performs a first serial decomposition of the input instruction according to that remaining memory capacity and the memory capacity required by the shared operand to obtain first serial sub-instructions;
  • the SD performs a second serial decomposition of the first serial sub-instructions according to the memory capacity of the cyclic memory segment and the memory capacity required by the other operands to obtain the serial sub-instructions.
  • the processor further includes a decoder DD, where the DD is used to decode the multiple sub-instructions,
  • the DD allocates memory space for the other operands from the sub-memory block corresponding to the count value of the first counter in the circular memory segment.
  • the serial sub-instructions include a head instruction and a body instruction, and the head instruction is used to load the shared operand;
  • the head instruction records the address of the memory space allocated for the shared operand;
  • the body instruction is used to load the other operands and to perform the operation on the shared operand and the other operands.
  • the processor in any computing node controls the next-layer computing nodes to execute the operation corresponding to the serial sub-instruction of that computing node in multiple stages, in a pipelined manner;
  • the multiple stages include: instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB.
  • the pipeline propagates in the order instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, data write-back WB.
  • any computing node further includes a local processing unit LFU and a second memory controller DMA, and the processor includes a decoder DD and a reduction control unit RC;
  • the decoder DD is used for instruction decoding ID;
  • the DMA is used for data loading LD: the operands of the input instruction are loaded into the memory component;
  • the next-layer computing nodes are used for operation execution EX: executing the decoded instruction on the operands to obtain the execution result;
  • the reduction control unit RC is used to control the LFU to perform operation reduction RD on the execution result to obtain the operation result of the input instruction;
  • the DMA is also used to write the operation result back to the memory component of the upper-layer computing node of the computing node.
  • the cyclic memory segment includes multiple sub-memory blocks
  • the DMA, the next-level arithmetic node, and the LFU sequentially cyclically use the multiple sub-memory blocks.
  • the memory capacities of the multiple sub-memory blocks are the same.
  • an operand obtaining method, including:
  • looking up, in a data address information table, whether the operand has been saved on the local memory component, and if so, determining the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table;
  • the method further includes:
  • if the operand has not been saved on the local memory component, generating a control signal for loading the operand according to the storage address of the operand, where the control signal for loading the operand is used to load the operand from that storage address onto the local memory component.
  • the data address information table records an address correspondence relationship;
  • the address correspondence relationship includes the correspondence between the storage address of the operand on the local memory component and the storage address of the operand in the external storage space.
  • looking up, in the data address information table, whether the operand has been stored in the local memory component includes:
  • if the address correspondence relationship includes the storage addresses of all of the operands in the external storage space, determining that the operands have been stored in the local memory component.
  • determining the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table includes:
  • taking the storage address on the local memory component that corresponds to the storage address of the operand in the external storage space as the storage address of the operand on the local memory component.
  • the method further includes:
  • the data address information table is updated according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component.
  • updating the data address information table according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component includes:
  • recording, in the data address information table, the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
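In effect the data address information table is a translation table from external (parent-memory) addresses to local addresses: look up first, load only on a miss, and record the new mapping. A minimal sketch, assuming an operand is identified by its external address and sizes are in elements (class and method names are hypothetical):

```python
# Sketch of the data address information table: external address -> local address.
class DataAddressTable:
    def __init__(self):
        self.table = {}        # storage address in external space -> local address
        self.next_local = 0    # next free position on the local memory component

    def get_or_load(self, external_addr, size):
        if external_addr in self.table:           # operand already on local memory
            return self.table[external_addr], False
        local_addr = self.next_local              # miss: a load would be issued here
        self.next_local += size
        self.table[external_addr] = local_addr    # update the address correspondence
        return local_addr, True

dat = DataAddressTable()
print(dat.get_or_load(0x1000, 64))   # (0, True)  -> loaded from the parent memory
print(dat.get_or_load(0x1000, 64))   # (0, False) -> reused, no reload needed
```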
  • the local memory component includes: a static memory segment
  • the data address information table is updated according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component, including:
  • the data address information table to be updated is determined according to the count value of the first counter, where the count value of the first counter is used to indicate storage location information in the static memory segment;
  • the data address information table to be updated is updated according to the storage address in the external storage space of the loaded operand and the storage address in the static memory segment.
  • the local memory component further includes: a cyclic memory segment, and the cyclic memory segment includes multiple sub-memory blocks,
  • the data address information table is updated according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component, including:
  • the method is applied to a computing device, where the computing device includes multiple layers of computing nodes, and each computing node includes a local memory component, a processor, and next-layer computing nodes;
  • the external storage space is a memory component of an upper-level computing node or a memory component of a lower-level computing node of the computing node.
  • an arithmetic device includes: multi-layer arithmetic nodes, each arithmetic node includes a local memory component, a processor, and a next-layer arithmetic node,
  • when the processor is to load an operand from the memory component of the upper-layer computing node of the current computing node to the local memory component, it looks up, in the data address information table, whether the operand has already been saved on the local memory component;
  • if so, the processor determines the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table, and assigns that storage address on the local memory component to the instruction so as to obtain the operand.
  • if the operand is not stored on the local memory component, the processor generates a control signal for loading the operand according to the storage address of the operand, and the control signal is used to load the operand from that storage address onto the local memory component.
  • the data address information table records an address correspondence relationship, and the address correspondence relationship includes: the operand is locally The corresponding relationship between the storage address on the memory component and the storage address of the operand in the external storage space.
  • the local memory component includes a static memory segment and a cyclic memory segment
  • the processor is used to decompose the input instruction of the computing node into multiple sub-instructions;
  • the processor allocates memory space for the shared operand in the static memory segment, and allocates memory space for the other operands of the multiple sub-instructions in the cyclic memory segment;
  • the shared operand is the operand that the next-layer computing nodes of the computing node all use when executing the multiple sub-instructions;
  • the other operands are the operands of the multiple sub-instructions other than the shared operand.
  • the processor is provided with at least one data address information table corresponding to the static memory segment and multiple data address information tables corresponding to the cyclic memory segment.
  • before allocating memory space for the shared operand in the static memory segment, the processor first looks up, in the at least one data address information table corresponding to the static memory segment, whether the shared operand has already been stored in the static memory segment of the local memory component;
  • before allocating memory space for the other operands in the cyclic memory segment, the processor first looks up, in the multiple data address information tables corresponding to the cyclic memory segment, whether the other operands have already been saved in the cyclic memory segment of the local memory component;
  • the processor determines the data address information table to be updated according to the count value of the first counter; wherein the count value of the first counter is used to determine different data address information tables corresponding to both ends of the static memory segment;
  • when loading other operands from the external storage space to any one of the multiple sub-memory blocks of the cyclic memory segment, the processor updates the data address information table corresponding to that sub-memory block according to the storage addresses of the loaded other operands in the external storage space and their storage addresses on the local memory component.
  • an operand obtaining device including:
  • a memory for storing processor executable instructions
  • the processor is used to implement any one possible implementation method of the fourth aspect when executing instructions.
  • a non-volatile computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement the method of any possible implementation of the fourth aspect.
  • an arithmetic device includes multiple layers of computing nodes; any computing node includes a local memory component, a processor, next-layer computing nodes, and a memory controller, and the processor is connected to the next-layer computing nodes and the memory controller;
  • the processor is used to receive an input instruction, decompose the input instruction into multiple sub-instructions, and send the multiple sub-instructions to the next-layer computing nodes;
  • the memory controller is used to load, from the memory component of the upper-layer computing node of the computing node, the second operands within the first operands corresponding to the multiple sub-instructions into the local memory component;
  • the next-layer computing nodes are used to execute the multiple sub-instructions according to the multiple sub-instructions and their second operands;
  • the input instruction and the multiple sub-instructions have the same format.
  • the input instruction and the multiple sub-instructions each include an operator and operand parameters;
  • an operand parameter is a parameter pointing to an operand of the instruction;
  • the operand parameters include a global parameter and a local parameter;
  • the global parameter indicates the size of the first operand corresponding to the instruction;
  • the local parameter indicates the starting position of the second operand of the instruction within the first operand, and the size of the second operand;
  • the memory controller is configured to load, according to the operand parameters, the second operands within the first operands corresponding to the multiple sub-instructions from the memory component of the upper-layer computing node of the computing node into the local memory component.
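An instruction in this format thus carries, per operand, a global parameter (the size of the whole first operand) and a local parameter (the start position and size of the second operand, i.e. the slice this sub-instruction actually uses), and only the slice is copied. The sketch below is an illustration with hypothetical field names, not the patent's instruction encoding.

```python
# Illustrative instruction format: operator plus operand parameters (hypothetical names).
from dataclasses import dataclass

@dataclass
class OperandParam:
    global_size: int   # global parameter: size of the first (whole) operand
    local_start: int   # local parameter: start of the second operand within the first
    local_size: int    # local parameter: size of the second operand

@dataclass
class Instr:
    op: str
    operands: list     # list of OperandParam

def load_slice(full_operand, param: OperandParam):
    # The memory controller copies only the slice described by the local parameter.
    return full_operand[param.local_start:param.local_start + param.local_size]

x = list(range(16))
inst = Instr("ADD", [OperandParam(global_size=16, local_start=4, local_size=4)])
print(load_slice(x, inst.operands[0]))   # [4, 5, 6, 7]
```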
  • a data path connects the local memory component with the memory components of the upper-layer computing node and the next-layer computing nodes of the computing node, and the memory controller is connected to the data path.
  • the processor is further configured to generate multiple corresponding control signals according to multiple sub-commands, and send the multiple control signals to the memory controller;
  • the memory controller controls the data path according to each control signal, and loads the operand of the sub-instruction corresponding to the control signal from the memory component of the upper-level computing node to the local memory component.
  • the memory controller includes a first memory controller and a second memory controller.
  • the first memory controller is connected to the data path through the second memory controller.
  • the first memory controller is also used to generate a load instruction according to the control signal and send the load instruction to the second memory controller, and the second memory controller is used to control the data path according to the load instruction.
  • the first memory controller determines a base address, a start offset, an amount of data to load, and a jump offset according to the control signal, and generates the load instruction according to the base address, the start offset, the amount of data to load, and the jump offset;
  • the base address is the starting address at which the operand is stored in the memory component;
  • the start offset is the offset of the starting position of the second operand relative to the starting position of the first operand;
  • the amount of data to load is the number of operand elements loaded starting from the start offset;
  • the jump offset is the offset of the start offset of the next read relative to the start offset of the previous read.
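These four quantities describe a strided copy: start at the base address plus the start offset, read a run of data, jump by the jump offset, and repeat. As a hedged illustration (flat row-major memory assumed; function and parameter names are made up), such a load can gather a sub-matrix from the parent memory:

```python
# Sketch of a strided load built from base address, start offset, amount of data
# per read, and jump offset (illustrative names; flat row-major memory assumed).
def strided_load(memory, base, start_offset, count_per_row, jump, rows):
    out = []
    offset = start_offset
    for _ in range(rows):
        out.append(memory[base + offset: base + offset + count_per_row])
        offset += jump                      # jump to the start of the next read
    return out

# A 4x4 matrix stored row-major at base 0; load the 2x2 block starting at row 1, column 1.
mem = list(range(16))
print(strided_load(mem, base=0, start_offset=5, count_per_row=2, jump=4, rows=2))
# [[5, 6], [9, 10]]
```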
  • the processor includes a serial decomposer, a decoder, and a parallel decomposer.
  • the input terminal of the serial decomposer is connected to the output terminal of the parallel decomposer in the processor of the upper-layer computing node, the output terminal of the serial decomposer is connected to the input terminal of the decoder, the output terminal of the decoder is connected to the input terminal of the parallel decomposer, and the output terminal of the parallel decomposer is connected to the input terminals of the next-layer computing nodes.
  • the serial decomposer is used to serially decompose the input instruction into serial sub-instructions according to the capacity of the memory component of the computing node and the memory capacity required by the input instruction;
  • the decoder is used to decode the serial sub-instructions, send them to the parallel decomposer, and send a control signal to the memory controller according to each serial sub-instruction; according to the control signal, the memory controller loads the operands required to execute the serial sub-instruction from the memory component of the upper-layer computing node into the memory component of the computing node;
  • the parallel decomposer is used to decompose each decoded serial sub-instruction into parallel sub-instructions according to the number of next-layer computing nodes, and to send the parallel sub-instructions to the next-layer computing nodes so that they execute the parallel sub-instructions on the operands.
  • the memory component of any computing node includes a static memory segment and a dynamic memory segment
  • the decomposed serial sub-instructions include a head instruction and a body instruction.
  • the decoder is also used to send a first control signal to the memory controller according to the head instruction, and the memory controller loads the shared operand from the memory component of the upper-layer computing node into the static memory segment according to the first control signal;
  • the decoder is also configured to send a second control signal to the memory controller according to the body instruction, and the memory controller loads the other operands from the memory component of the upper-layer computing node into the dynamic memory segment according to the second control signal.
  • the first memory controller determines the start offset according to the starting position in the local parameter, determines the amount of data to load according to the size in the local parameter, and determines the jump offset according to the global parameter or the local parameter.
  • Figure 1 shows a graph of the energy efficiency growth of machine learning computers from 2012 to 2018.
  • Figure 2 shows an example of the organization of a traditional machine learning computer.
  • Fig. 3 shows a block diagram of a computing device according to an embodiment of the present disclosure.
  • Figures 4a and 4b respectively show block diagrams of a computing node according to an embodiment of the present disclosure.
  • Fig. 5 shows a flowchart of a serial decomposition process according to an embodiment of the present disclosure.
  • Fig. 6 shows a schematic diagram of a pipeline according to an example of the present disclosure.
  • Fig. 7 shows a block diagram of a computing node according to an example of the present disclosure.
  • Fig. 8 shows a schematic diagram of an operation node and a pipeline operation process according to an example of the present disclosure.
  • Fig. 9 shows a schematic diagram of an operand according to an embodiment of the present disclosure.
  • Fig. 10a shows a block diagram of a computing node according to an embodiment of the present disclosure.
  • Figure 10b shows an example of a pipeline according to an embodiment of the present disclosure.
  • FIG. 11 shows a schematic diagram of an example of division of a memory component according to an embodiment of the present disclosure.
  • FIG. 12 shows a schematic diagram of an example of division of a memory component according to an embodiment of the present disclosure.
  • FIG. 13 shows a schematic diagram of a memory component according to an embodiment of the present disclosure.
  • Fig. 14 shows a schematic diagram of a memory space allocation method of a static memory segment according to an embodiment of the present disclosure.
  • FIG. 15 shows a schematic diagram of a memory space allocation method of a static memory segment according to an embodiment of the present disclosure.
  • Fig. 16 shows a schematic diagram of an application scenario according to an embodiment of the present disclosure.
  • Fig. 17 shows a flowchart of a method for obtaining an operand according to an embodiment of the present disclosure.
  • Fig. 18 shows a flowchart of a method for acquiring an operand according to an embodiment of the present disclosure.
  • Machine learning is a computing and memory access intensive technology that is highly parallel at different levels.
  • the present disclosure decomposes machine learning into operations based on matrices and vectors, for example, aggregating operations such as vector-multiplied-by-matrix and matrix-multiplied-by-vector into matrix multiplication, aggregating operations such as matrix-add/subtract-matrix, matrix-multiplied-by-scalar, and elementary vector arithmetic into element-wise operations, and so on.
  • in this way, seven main computation primitives can be obtained: inner product (IP), convolution (CONV), pooling (POOL), matrix multiplication (MMM, matrix-multiplied-by-matrix), element-wise operation (ELTW), sorting (SORT), and counting (COUNT).
  • if an operation f(·) with operand X satisfies f(X) = g(f(X_A), f(X_B), ...), the operation f(·) on operand X is called a decomposable operation, where f(·) is the target operator, g(·) is the retrieving operator, X represents all the operands of f(·), and X_A, X_B, ... represent subsets of the operand X, where X can be tensor data.
  • for example, for f(X) = X × k, where k is a scalar, f(X) can be decomposed into f(X_A), f(X_B), ..., and the operation g(·) combines the operation results of f(X_A), f(X_B), ... into a matrix or vector according to the way X was decomposed.
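A small numeric check of this definition for the scalar-multiplication example, where the retrieving operator g(·) is simply concatenation in the order X was split (function names are illustrative):

```python
# Numeric check of f(X) = g(f(X_A), f(X_B), ...) for f(X) = X * k, with g = concatenation.
def f(x, k=3):
    return [v * k for v in x]

def g(*parts):
    out = []
    for p in parts:
        out.extend(p)
    return out

X = [1, 2, 3, 4, 5, 6]
X_A, X_B = X[:3], X[3:]
assert f(X) == g(f(X_A), f(X_B))   # decomposing X does not change the result
print(f(X))
```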
  • independent operation: the decomposed operands X_A, X_B, ... are independent of each other and do not overlap; each subset X_A, X_B, ... can be operated on locally, and the final operation result is obtained simply by combining the results of the local operations. Vector addition is an example of an independent operation.
  • input-dependent operation: the decomposed operands X_A, X_B, ... overlap, i.e. the decomposed local operations overlap in their operands, so there is input redundancy. One-dimensional convolution is an example of an input-dependent operation.
  • output-dependent operation: the final operation result can only be obtained by reducing the results of the local operations after decomposition.
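The three cases can be seen side by side in a short numeric example: vector addition splits with no redundancy (independent), a valid 1-D convolution needs overlapping input windows at the split boundary (input-dependent), and a global sum needs an extra reduction over the partial results (output-dependent). This is only an illustrative sketch.

```python
# Independent, input-dependent, and output-dependent decompositions, side by side.
x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]
w = [1, 1, 1]                                    # 1-D convolution kernel

# Independent: vector addition; halves are computed separately and just concatenated.
add = [a + b for a, b in zip(x[:3], y[:3])] + [a + b for a, b in zip(x[3:], y[3:])]

# Input-dependent: valid 1-D convolution; the second half must also read the last
# len(w) - 1 elements of the first half (input redundancy at the boundary).
def conv(v):
    return [sum(v[i + j] * w[j] for j in range(len(w))) for i in range(len(v) - len(w) + 1)]
conv_split = conv(x[:4]) + conv(x[2:])           # x[2:4] is read by both halves

# Output-dependent: a sum; the partial results still need a final reduction g(.).
total = sum([sum(x[:3]), sum(x[3:])])

print(add, conv_split, total)
assert conv_split == conv(x) and total == sum(x)
```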
  • in the decomposition of IP, decomposition by length refers to decomposition along the length direction of the vector.
  • the operand of the convolution operation can be tensor data represented in NHWC (batch, height, width, channels) format.
  • decomposition in the feature direction means decomposition along the C dimension; the POOL operation also decomposes its operand in the feature direction.
  • the convolution operation is input-dependent when decomposed in the N dimension, where the input redundancy is the weight, i.e. the convolution kernel; it is also input-dependent when decomposed spatially, where the input redundancy includes, in addition to the weight, the overlapping tensor data between the two decomposed parts.
  • in the decomposition of MMM, left and right refer to decomposing the left operand or the right operand of the MMM, and vertical refers to decomposition along the vertical direction of the matrix.
  • the ELTW operation is independent under any decomposition of its operand, while the SORT and COUNT operations are output-dependent under any decomposition of their operands.
  • the calculation primitives of machine learning are all decomposable calculations.
  • when the computing device of the present disclosure is used to perform machine learning computations, the computation primitives can therefore be decomposed and computed according to actual requirements.
  • input instruction: an instruction that describes a machine learning operation.
  • a machine learning operation can be composed of the above computation primitives or of combinations of them.
  • an input instruction can include operands and an operator.
  • shared operands: the operands used in common by the multiple sub-operations after an operation is decomposed, or in other words, the operands used in common by the multiple sub-instructions after an input instruction is decomposed into multiple sub-instructions.
  • Machine learning is widely used in fields such as image recognition, voice recognition, facial recognition, video analysis, advertising recommendation, and games.
  • many dedicated machine learning computers of different sizes have been deployed in embedded devices, servers, and data centers.
  • the architecture of most machine learning computers still focuses on optimizing performance and energy efficiency.
  • machine learning accelerators have allowed the energy efficiency of machine learning computers to increase at an astonishing rate.
  • FIG 2 shows an example of the organization of a traditional machine learning computer.
  • Traditional machine learning computers often have many heterogeneous parallel components organized in a hierarchical manner, such as the heterogeneous organization of CPU (Central Processing Unit) and GPU (Graphics Processing Unit) shown in Figure 2.
  • the organization shown includes 2 CPUs and 8 GPUs, with the GPUs serving as the arithmetic units.
  • the specific structure of each layer is different, and the storage method and control method are different. As a result, each layer may have a different programming interface, the programming is complicated, and the amount of code is large.
  • programming multiple GPUs requires manual work based on MPI (Message Passing Interface) or NCCL (Nvidia Collective multi-GPU Communication Library); programming a single GPU chip requires the CUDA (Compute Unified Device Architecture) language to manipulate thousands of GPU threads; and CPU programming requires C/C++ and a parallel API (Application Programming Interface) to write parallel programs containing dozens of CPU threads.
  • the software stack in a single GPU is also very complicated.
  • the software stack includes CUDA PTX (Parallel Thread Execution) and microcode.
  • CUDA PTX is used to program the grid/block/thread hierarchy in the GPU, and the microcode is used to program the stream processors.
  • the present disclosure provides a computing device.
  • the programming interface and instruction set architecture provided to users on each layer of the computing device are the same: computing nodes on different layers, and computers of different scales, have the same programming interface and instruction set architecture and can execute programs of the same format; the operands are stored at the top layer and the other layers load data implicitly, so users do not need to manage memory space. This simplifies user programming, and extending the computing device or porting programs between different computing devices is very easy.
  • the computing device of an embodiment of the present disclosure may include multiple (at least two) computing nodes, and each computing node includes a memory component, a processor, and a next-level computing node.
  • Fig. 3 shows a block diagram of a computing device according to an embodiment of the present disclosure.
  • the first layer of the computing device can be a computing node, which can include processors, memory components, and the next layer (second layer) computing nodes.
  • there can be multiple next-layer (second-layer) computing nodes; the specific number is not limited in this disclosure.
  • each computing node in the second layer may also include a processor, a memory component, and a computing node in the next layer (third layer).
  • each computing node of the i-th layer may include: a processor, a memory component, and a computing node of the i+1-th layer, where i is a natural number.
  • the processor can be implemented in the form of hardware, such as digital circuits, analog circuits, etc.; the physical implementation of the hardware structure includes but not limited to transistors, memristors, etc., and the processor can also be implemented in software.
  • the memory component can be random access memory (RAM), read only memory (ROM), cache (CACHE), etc.
  • the specific form of the memory component of the present disclosure is not limited.
  • FIG. 3 only shows the expanded structure of one of the second-level arithmetic nodes included in the first-level arithmetic nodes (the second level shown in Fig. 3), it can be understood that Figure 3 is only a schematic diagram.
  • the expanded structure of other second-level computing nodes also includes processors, memory components, and third-level computing nodes.
  • the same is true for the i-th layer computing nodes that are not shown.
  • the number of i+1th layer arithmetic nodes included in different i-th layer arithmetic nodes may be the same or different, which is not limited in the present disclosure.
  • the processor in any computing node can be used to decompose the input instruction of that computing node into parallel sub-instructions and send the parallel sub-instructions to the next-layer computing nodes of that computing node; the computing node loads the operands needed to execute the parallel sub-instructions from the memory component of the upper-layer computing node into its own memory component, so that its next-layer computing nodes execute the parallel sub-instructions in parallel on those operands.
  • the parallel sub-instructions obtained by decomposition can be executed in parallel, and each computing node can include one or more next-layer computing nodes; if it includes multiple next-layer computing nodes, they can run independently of one another.
  • the processor can decompose the input instruction into parallel sub-instructions according to the number of next-layer computing nodes.
  • the processor can decompose the input instruction and the operands of the corresponding operation, and then send the parallel sub-instructions together with the decomposed operands to the next-layer computing nodes, which execute them in parallel.
  • the hierarchical structure of the computing device is constructed through multi-layer iteration.
  • the structure of each computing node of the computing device is the same.
  • computing nodes on different layers, and computers of different scales, have the same programming interface and instruction set architecture and execute programs of the same format; data is loaded implicitly between layers and users do not need to manage memory space, which simplifies user programming, and extending the computing device or porting programs between different computing devices is very easy.
  • the processor can decompose input instructions in three stages: a serial decomposition stage, a (demotion) decoding stage, and a parallel decomposition stage; accordingly, the processor can include a serial decomposer, a decoder, and a parallel decomposer.
  • the serial decomposer is used to serially decompose the input instruction into serial sub-instructions according to the capacity of the memory component of the computing node and the memory capacity required by the input instruction.
  • serial decomposition can refer to decomposing an input instruction into multiple instructions that can be executed serially, in sequence.
  • if the memory required by the input instruction is greater than the capacity of the memory component of the computing node, the serial decomposer serially decomposes the input instruction into serial sub-instructions according to the memory required by the input instruction and the capacity of the memory component; if the memory required by the input instruction is less than or equal to the capacity of the memory component of the computing node, the input instruction is sent to the decoder, which decodes it directly and sends it to the parallel decomposer.
  • the decoder is used to decode the serial sub-instructions and send them to the parallel decomposer.
  • the any arithmetic node can load the operand required to execute the serial sub-instruction from the memory component of the upper-level arithmetic node to the memory component of the any arithmetic node.
  • any one of the computing nodes further includes a memory controller, and the memory controller is connected to the decoder.
  • the decoder may send a control signal to the memory controller according to the serial sub-instruction, and the memory controller may load and execute the serial sub-instruction from the memory component of the upper-level computing node according to the control signal The required operand to the memory component of any one of the computing nodes.
  • the memory controller can be implemented by a hardware circuit or a software program, which is not limited in the present disclosure.
  • the parallel resolver is used to decompose the decoded serial sub-instructions in parallel to obtain parallel sub-instructions according to the number of operation nodes in the next layer, and send the parallel sub-instructions to the operation nodes of the next layer to The next-level arithmetic node executes parallel sub-instructions according to the operand.
  • the processor may include a serial decomposer SD, a decoder DD (demotion decoder, where demotion refers to passing instructions from an upper-layer computing node down to the next layer), and a parallel decomposer PD.
  • the input terminal of SD can be connected to the output terminal of PD in the processor of the upper layer of computing node
  • the output terminal of SD can be connected to the input terminal of DD
  • the output terminal of DD can be connected to the input terminal of PD
  • the output terminal of the PD can be connected to the input terminals of the next-layer computing nodes.
  • a data path is connected between the memory component of any computing node and the memory components of the upper and lower computing nodes of any computing node, as shown in Figure 4a .
  • memory component i is connected to memory component i-1 of the upper-layer computing node, and connecting memory component i to the next-layer computing node means connecting it to memory component i+1 of the next-layer computing node.
  • the memory controller can be connected to the data path, and the memory controller can control the data path to send the operand of the input instruction from one memory component to another according to the control signals sent by other components in the computing node.
  • for example, the memory controller can load the operands of the input instruction from the memory component of the upper-layer computing node into the local memory component according to the control signal sent by the DD, or it can write the operation result of the input instruction from the local memory component back to the memory component of the upper-layer computing node.
  • the input of the SD can be connected to an instruction queue IQ (Instruction Queue); that is, the processor can first load the output instructions of the upper-layer computing node, as the input instructions of this layer's computing node, into the instruction queue IQ.
  • the computing node of this layer can refer to the computing node to which the processor belongs.
  • SD obtains the input instructions from the IQ.
  • the SD can decompose the input instruction into multiple serial sub-instructions that are executed serially.
  • by setting IQ as a buffer between the SD and the upper-level computing node, the strict synchronous execution relationship between the SD and the upper-level computing node can be omitted.
  • IQ can simplify circuit design and improve execution efficiency; for example, it allows the SD to execute asynchronously and independently of the upper-level computing node, reducing the time the SD spends waiting for the upper-level computing node to send input instructions.
  • the input instruction may be an instruction describing the operation of machine learning, the operation of machine learning may be composed of the above calculation primitives, and the input instruction may include operands and operators.
  • the serial decomposition of the input instruction may include the decomposition of the operand of the input instruction and the decomposition of the input instruction.
  • the serial sub-instructions obtained by the serial decomposition will have the largest possible decomposition granularity.
  • the decomposition granularity of the serial sub-instructions obtained by the serial decomposition is determined according to the resources of the computing node and the resources required by the input instruction.
  • the resources of the computing nodes may be the capacity of the memory components of the computing nodes, and the resources required for the input instructions may refer to the memory capacity required to store the operands of the input instructions.
  • the decomposition granularity here can refer to the dimension of the decomposed operand.
  • the memory capacity required by the input instruction can be determined according to the memory capacity needed to store the operands of the input instruction and the memory capacity needed to store the intermediate results produced when the operator processes the operands; after the memory capacity required by the input instruction is determined, it can be judged whether the capacity of the memory component of this layer's computing node meets that requirement, and if it does not, the input instruction can be serially decomposed into serial sub-instructions according to the capacity of the memory component of this layer's computing node and the memory capacity required by the input instruction.
  • for example, for an input instruction that multiplies matrix X by matrix Y, the SD can determine the memory capacity required by the input instruction according to the sizes of matrix X and matrix Y, and compare the required memory capacity with the capacity of the memory component of this layer's computing node; if the memory capacity required by the input instruction is greater than the capacity of the memory component of this layer's computing node, the input instruction needs to be serially decomposed.
  • the specific process can be to decompose the operands, thereby dividing the input instruction into multiple serial sub-instructions that can be executed serially; for example, matrix X or matrix Y can be decomposed, or both matrix X and matrix Y can be decomposed.
  • the input instruction can be serially decomposed into serial sub-instructions for multiple matrix multiplications and a serial sub-instruction for summation, which are executed serially; after the serial sub-instructions for the multiple matrix multiplications are completed, their operation results are summed according to the summation serial sub-instruction to obtain the operation result of the input instruction.
  • the serial decomposition of matrix multiplication described here is only an example used to illustrate the function of the SD and does not limit the present disclosure in any way.
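  • as a minimal illustrative sketch (not part of the disclosed hardware), the following Python function shows one way such a memory-budget-driven serial decomposition of a matrix multiplication could look; the function name serial_decompose_matmul, the element-count capacity model, and the choice to split along the shared dimension are assumptions for illustration only.

```python
# Illustrative sketch only: splits X (m x k) @ Y (k x n) along the k dimension
# so that each serial sub-instruction's operands fit in `mem_capacity` elements.
def serial_decompose_matmul(m, k, n, mem_capacity):
    def required(kk):
        # operands X_part (m x kk), Y_part (kk x n) plus the partial result (m x n)
        return m * kk + kk * n + m * n

    if required(k) <= mem_capacity:
        return [("MATMUL", 0, k)]          # no serial decomposition needed

    # shrink the k-granularity until one sub-instruction fits
    step = k
    while step > 1 and required(step) > mem_capacity:
        step //= 2

    sub_instructions = [("MATMUL", start, min(step, k - start))
                        for start in range(0, k, step)]
    # the partial products must later be summed: add a reduction sub-instruction
    sub_instructions.append(("ADD_REDUCE", len(sub_instructions)))
    return sub_instructions

print(serial_decompose_matmul(64, 1024, 64, 100_000))
```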
  • the serial decomposer performs serial decomposition of the input instruction to obtain serial sub-instructions according to the capacity of the memory component of any computing node and the memory capacity required by the input instruction, which can specifically include: determining the decomposition priority of the dimensions of the operand of the input instruction, selecting the dimension to decompose in order of decomposition priority, and determining the maximum decomposition granularity by dichotomy until the memory capacity required by the decomposed operand is less than or equal to the capacity of the memory component of this layer's computing node.
  • for any selected dimension in which the operand is decomposed, before the maximum decomposition granularity in that dimension is determined by dichotomy, it can first be determined whether decomposition in that dimension alone can meet the capacity requirement.
  • Fig. 5 shows a flowchart of a serial decomposition process according to an embodiment of the present disclosure.
  • the decomposition priority of the dimension of the operand of the input instruction can be determined first.
  • the decomposition priority can be determined according to the sizes of the dimensions of the operand: the larger the dimension, the higher its decomposition priority, so the largest dimension of the operand is decomposed first.
  • for example, the operand X may be an N-dimensional tensor whose dimensions are t1, t2, ..., ti, ..., tN, where t1 < t2 < ... < tN, and the lower bound of the decomposition granularity in any dimension is 1.
  • the input instruction can then be decomposed according to the decomposed operands, which may specifically include: decomposing the input instruction into multiple serial sub-instructions, where the multiple serial sub-instructions include sub-instructions that are each responsible for the operation on a decomposed subset of the operands; if output dependency exists after the serial decomposition, the multiple serial sub-instructions may also include a reduction instruction.
  • FIG. 5 is only an example of the process of decomposing operands, and does not limit the present disclosure in any way. It is understandable that the decomposition granularity can also be determined in other ways. For example, the decomposition priority can be selected in other ways, and the way of dimensional decomposition is not limited to the dichotomy, as long as the decomposition granularity can be selected as large as possible.
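  • the dimension-priority and dichotomy procedure described above can be illustrated with the short sketch below; the function choose_split is hypothetical, only handles the single highest-priority dimension, and measures capacity simply in element counts.

```python
# Illustrative sketch: choose the largest dimension first, then binary-search
# (dichotomy) for the largest granularity along it such that the decomposed
# operand fits in the memory budget.
def choose_split(shape, capacity):
    """shape: operand dimensions (t1..tN); capacity: memory budget in elements."""
    def size(s):
        total = 1
        for d in s:
            total *= d
        return total

    if size(shape) <= capacity:
        return list(shape)                 # no decomposition needed

    # decomposition priority: the larger the dimension, the higher the priority
    dim = max(range(len(shape)), key=lambda i: shape[i])

    lo, hi = 1, shape[dim]                 # dichotomy over the granularity
    while lo < hi:
        mid = (lo + hi + 1) // 2
        trial = list(shape)
        trial[dim] = mid
        if size(trial) <= capacity:
            lo = mid
        else:
            hi = mid - 1

    result = list(shape)
    result[dim] = lo                       # maximum granularity that still fits
    return result

print(choose_split([1024, 512], 64 * 1024))   # e.g. -> [128, 512]
```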
  • in the present disclosure, a sub-level instruction queue SQ (sub-level instruction queue) may be connected between the output terminal of the SD and the input terminal of the DD; the output terminal of the SD is connected to the input terminal of the SQ, and the output terminal of the SQ is connected to the input terminal of the DD.
  • SQ acts as a buffer between SD and DD, which can eliminate the strict synchronization execution relationship between SD and DD. SQ can simplify circuit design and improve execution efficiency. For example, it allows SD to execute asynchronously on its own and reduces the time that DD waits for SD to serialize input instructions.
  • SD can output the serial sub-instructions after serial decomposition to SQ
  • DD obtains the serial sub-instructions from SQ
  • the DD can allocate memory space on the memory component of this layer's computing node for the serial sub-instruction according to the storage requirements of the operands corresponding to the serial sub-instruction, and bind the address (local address) of the allocated memory space to the operands in the serial sub-instruction, thereby realizing the decoding process.
  • the DD can also send a control signal to the memory controller according to the serial sub-instruction; according to the control signal, the memory controller finds the storage location of the operand corresponding to the serial sub-instruction in the memory component of the upper-layer computing node according to the operand address recorded in the instruction, reads the operand, and writes it into the memory component of this layer's computing node according to the local address, thereby loading the operand corresponding to the serial sub-instruction into the allocated memory space.
  • the DD decodes the serial sub-instructions and sends them to the PD.
  • the PD can decompose the decoded serial sub-instructions in parallel according to the number of next-layer computing nodes connected to the PD; parallel decomposition means that the decomposed parallel sub-instructions can be executed in parallel.
  • for example, for a serial sub-instruction that adds operands A and B, the PD can decompose the serial sub-instruction in parallel to obtain 4 parallel sub-instructions, which respectively add A1 and B1, A2 and B2, A3 and B3, and A4 and B4, and the PD can send the 4 parallel sub-instructions to the next-level computing nodes.
  • the above examples are only to illustrate examples of parallel decomposition, and do not limit the present disclosure in any way.
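  • a minimal sketch of such a parallel decomposition is shown below; the function parallel_decompose_add and the slicing scheme are illustrative assumptions, chosen so that the slices given to different next-level nodes do not overlap.

```python
# Illustrative sketch: split an element-wise add across `num_nodes` next-level
# nodes; each parallel sub-instruction works on a disjoint slice, so there is
# no overlap (and hence no input dependency) between them.
def parallel_decompose_add(length, num_nodes):
    chunk = (length + num_nodes - 1) // num_nodes
    sub_instructions = []
    for i in range(num_nodes):
        start = i * chunk
        stop = min(start + chunk, length)
        if start < stop:
            sub_instructions.append(("ADD", start, stop - start))
    return sub_instructions

# With 4 next-level nodes this mirrors the A1+B1 .. A4+B4 example above.
print(parallel_decompose_add(1024, 4))
```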
  • the input dependency of the serial sub-instructions can be relieved, that is, there is no overlap between the operands corresponding to the parallel sub-instructions obtained by the parallel decomposition.
  • the dimension of decomposition can be selected to relieve input dependence, so that input redundancy can be avoided as much as possible and memory space can be saved.
  • the memory component of any one computing node includes a static memory segment and a dynamic memory segment. If the operands of the input instruction include shared operands and other operands, the serial resolver According to the size relationship between the memory capacity required by the shared operand and the remaining capacity of the static memory segment, and the size relationship between the memory capacity required by the other operands and the capacity of the dynamic memory segment, the Enter the instruction to perform serial decomposition to obtain serial sub-instructions.
  • the shared operand is an operand commonly used by the serial sub-instructions, and other operands are data in the operands of the input instruction except for the shared operand, and the remaining capacity of the static memory segment may be Refers to the unused capacity in the static memory segment.
  • SD, DD and PD in the processor are separated, and memory allocation can be staggered in time.
  • PD always allocates memory space after DD, but the allocated memory space is released earlier.
  • DD always allocates memory space after SD, but the allocated memory space is also released earlier.
  • the memory space used by the serial decomposition of the SD may be used by multiple serial sub-instructions; therefore, a static memory segment is set aside for the SD, and the other part of the memory component outside the static memory segment is shared (the dynamic memory segment).
  • for some operations in machine learning, part of the operands will be shared among the decomposed parts of these operations; this part of the operands is referred to in the present disclosure as the shared operand.
  • taking the matrix multiplication operation as an example, suppose the input instruction is to multiply matrices X and Y; if only matrix X is decomposed, the serial sub-instructions obtained by serially decomposing the input instruction all need to use operand Y, so operand Y is the shared operand.
  • the serial decomposer SD of the present disclosure can generate a hint instruction ("load") when performing serial decomposition and specify in the hint instruction that the shared operand is to be loaded into the static memory segment; the DD treats the hint instruction as an ordinary serial sub-instruction that only needs to load data into the static memory segment, without execution, reduction, or write-back.
  • the DD sends the first control signal to the memory controller according to the hint instruction to load the shared operand into the static memory segment, which avoids frequent data access and saves bandwidth resources.
  • DD can generate a second control signal, and DD can send the generated second control signal to the memory controller, and the memory controller loads other operands into the dynamic memory segment according to the control signal.
  • the serial decomposer can serially decompose the input instruction to obtain serial sub-instructions according to the size relationship between the memory capacity required by the shared operand and the remaining capacity of the static memory segment, and the size relationship between the memory capacity required by the other operands and the capacity of the dynamic memory segment.
  • if the memory capacity required by the shared operand is not greater than the remaining capacity of the static memory segment and the memory capacity required by the other operands is not greater than the capacity of the dynamic memory segment, the serial decomposer may send the input instruction directly to the decoder, and the decoder decodes the input instruction and sends it to the parallel decomposer.
  • otherwise, the input instruction needs to be serially decomposed: if the memory capacity required by the other operands is greater than the capacity of the dynamic memory segment, the serial decomposer can decompose the other operands according to the capacity of the dynamic memory segment and thereby serially decompose the input instruction.
  • the specific process of splitting other operands according to the capacity of the dynamic memory segment and serializing the input instruction can be: determining the decomposition priority of the dimensions of the other operands, and selecting them in the order of the decomposition priority Decompose the dimensions of other operands and determine the maximum decomposition granularity in a dichotomy method until the memory capacity required by the other operands after decomposition is less than the capacity of the dynamic memory segment.
  • for the specific process, refer to Figure 5 and the related description above.
  • if the memory capacity required by the shared operand is greater than the remaining capacity of the static memory segment, the serial decomposer can decompose the shared operand according to the remaining capacity of the static memory segment and thereby serially decompose the input instruction.
  • the specific decomposition method can also refer to the process in FIG. 5.
  • the decomposed serial sub-instructions may include a head instruction and a body instruction; the decoder may send a control signal to the memory controller according to the head instruction to load the shared operand from the memory component of the upper-layer computing node into the static memory segment, and send a control signal to the memory controller according to the body instruction to load the other data from the memory component of the upper-layer computing node into the dynamic memory segment.
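  • the head/body split can be sketched as follows; decompose_with_shared_operand, the element-count capacity model, and the row-wise slicing of X are assumptions used only to illustrate how one head instruction (loading the shared operand into the static segment) can be followed by several body instructions (each loading a slice of the other operand into the dynamic segment).

```python
# Illustrative sketch: emit a "head" instruction that loads the shared operand Y
# into the static segment once, then "body" instructions that each cover a slice
# of X (and its matching partial result) sized to fit the dynamic segment.
def decompose_with_shared_operand(x_rows, x_cols, y_cols, dynamic_capacity):
    slice_rows = max(1, dynamic_capacity // (x_cols + y_cols))  # rows per body
    instructions = [("HEAD_LOAD", "Y", "static_segment")]       # executed once
    for start in range(0, x_rows, slice_rows):
        rows = min(slice_rows, x_rows - start)
        instructions.append(("BODY_MATMUL", "X", start, rows))
    return instructions

print(decompose_with_shared_operand(x_rows=256, x_cols=128, y_cols=64,
                                    dynamic_capacity=16_384))
```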
  • the processor may also include a reduction control unit RC (Reduction Controller), and any computing node may also include a local processing unit LFU (local functional units; the processing unit in Figure 4b); the input of the control unit RC is connected to the output of the decoder DD, the output of the control unit RC is connected to the input of the local processing unit LFU, and the local processing unit LFU is connected to the memory component.
  • the local processing unit LFU is mainly used to perform reduction processing on the operation result of the serial sub-instruction with output dependency, and the RC can be used to send a reduction instruction to the LFU.
  • the LFU can be implemented in a hardware circuit or a software program, which is not limited in the present disclosure.
  • for serial sub-instructions with output dependency, the control unit RC controls the local processing unit, according to the serial sub-instruction, to perform reduction processing on the operation results of the next-level computing nodes to obtain the operation result of the input instruction; a serial sub-instruction having output dependency means that the operation results of the serial sub-instructions need to be reduced in order to obtain the operation result of the input instruction.
  • DD will send serial sub-commands to RC.
  • the RC can check whether the serial sub-instructions have output dependency; if they do, the RC sends a reduction instruction to the LFU according to the serial sub-instructions so that the LFU performs reduction processing on the operation results of the next-layer computing nodes to obtain the operation result of the input instruction.
  • the specific process can be that the next-layer computing nodes (their memory controllers) write the operation results of the parallel sub-instructions back to the memory component of this layer's computing node, and the LFU can read the operation results of multiple serial sub-instructions from the memory component of this layer's computing node.
  • the multiple serial sub-instructions can be obtained by serial decomposition of the same input instruction.
  • the LFU reduces the operation results of the multiple serial sub-instructions to obtain the operation result of the corresponding input instruction, and stores it in the memory component.
  • after the processor determines that the execution of this layer's input instruction is completed, it can send a write-back signal to the memory controller, and the memory controller can write the operation result back to the memory component of the upper-layer computing node according to the write-back signal, until the first-layer computing node has completed all instructions.
  • if the control unit RC detects that the resources required to reduce the operation results of the next-level computing nodes are greater than the resource upper limit of the local processing unit, the control unit RC sends a commission instruction to the parallel decomposer according to the serial sub-instruction, and the parallel decomposer, according to the commission instruction, controls the next-level computing nodes to perform the reduction processing on their own operation results to obtain the operation result of the input instruction.
  • the RC can evaluate the resources (for example, computing resources, etc.) required for the reduction processing according to the serial sub-instructions.
  • the local processing unit can have a preset resource upper limit; the RC can therefore determine whether the resources required to reduce the operation results of the next-level computing nodes are greater than the resource upper limit of the local processing unit, and if they are, the processing speed of the LFU may have a great impact on the performance of the entire computing node.
  • in this case, the RC can send a commission instruction to the PD according to the serial sub-instruction, and the PD can control the next-level computing nodes according to the commission instruction to perform reduction processing on their operation results to obtain the operation result of the input instruction; processing efficiency can be improved in this delegated way.
  • the processor may also include a commission register CMR (Commission Register); when the RC determines that the resources required to reduce the operation results of the next-layer computing nodes are greater than the resource upper limit of the local processing unit, the RC can write a commission instruction into the CMR according to the serial sub-instruction, and the PD can periodically check whether there is a commission instruction in the CMR; if there is, the PD controls the next-level computing nodes to reduce their operation results to obtain the operation result of the input instruction.
  • the periodic check may be performed once per processing cycle, and the processing cycle may be determined according to the time the next-layer computing nodes take to process one serial sub-instruction, which is not limited in the present disclosure.
  • the processing efficiency of the entire computing node can be improved.
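  • a minimal sketch of this delegation decision is given below; LFU_RESOURCE_LIMIT, the cost estimate, and the list standing in for the CMR are illustrative assumptions, not the disclosed register layout.

```python
# Illustrative sketch: the RC either reduces locally on the LFU or, when the
# estimated cost exceeds the LFU's resource limit, writes a commission to a
# CMR-like register that the PD polls once per processing cycle.
LFU_RESOURCE_LIMIT = 1 << 16      # assumed upper bound, in arbitrary units
commission_register = []          # stands in for the CMR

def handle_reduction(serial_sub_instruction, estimated_cost):
    if estimated_cost <= LFU_RESOURCE_LIMIT:
        return ("LFU_REDUCE", serial_sub_instruction)
    commission_register.append(("DELEGATED_REDUCE", serial_sub_instruction))
    return ("COMMISSIONED", serial_sub_instruction)

def pd_poll_commissions():
    # The PD periodically checks the register and dispatches any commissions
    # to the next-level computing nodes for reduction.
    dispatched = list(commission_register)
    commission_register.clear()
    return dispatched

print(handle_reduction("sum_partials_0", estimated_cost=1 << 20))
print(pd_poll_commissions())
```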
  • in a layer-by-layer execution scheme, the highest-level (level 0) computing node decodes an instruction and sends it to its next-level computing nodes (child nodes), where each next-level computing node repeats the decoding and sending process until the leaf computing nodes execute the operation.
  • the leaf computing nodes then return the calculation results to their parent nodes, and this is repeated until the results reach the highest-level computing node.
  • while instructions are being decoded and sent downward and results are being returned upward, the leaf computing nodes are in an idle state, which affects the efficiency of the operation.
  • the processor in the computing node of the computing device controls the next-level computing node to execute operations corresponding to the input instructions of the computing node in multiple stages in a pipeline manner.
  • the processor in any computing node controls the next-level computing nodes to execute, in a pipeline manner and in multiple stages, the operation corresponding to the input instruction of that computing node; the multiple stages include operation execution EX (Execution), and the next-level computing nodes execute the operation execution EX among the multiple stages in a pipeline manner.
  • the input instruction may be an instruction describing the operation of the machine learning technology, and the input instruction may include operands and operators.
  • the multiple stages may also include: instruction decoding ID (Instruction Decoding), data loading LD (Loading), operation reduction RD (Reduction), and data writing back WB (Writing Back)
  • the pipeline propagates in the order of instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB.
  • the multiple stages in the above embodiments are only an example of the present disclosure and do not limit the present disclosure in any way.
  • the multiple stages may also include instruction input and the like.
  • the instruction decoding ID may refer to decoding the input instruction received from the upper layer (or from the input terminal), which may specifically include: allocating memory space on the memory component of this layer's computing node for the input instruction according to the storage requirements of the operands corresponding to the input instruction, binding the address (local address) of the allocated memory space to the operands in the input instruction, and so on.
  • data loading LD can refer to finding the storage location of the operand corresponding to the input instruction in the memory component of the upper-level computing node according to the operand address recorded in the input instruction, reading the operand, and then writing it into the memory component of this layer's computing node according to the local address.
  • Operation execution EX can refer to the process of obtaining operation results based on operators and operands.
  • for some operations, the processor also decomposes the input instruction, and the operation results of the decomposed instructions need to be reduced, that is, operation reduction RD, in order to obtain the operation result of the input instruction.
  • Writing data back to WB can refer to writing the operation result of the input instruction of the operation node of this layer back to the operation node of the upper layer.
  • Fig. 6 shows a schematic diagram of a pipeline according to an example of the present disclosure. The following describes the process of executing operations corresponding to input instructions in multiple stages in a pipeline manner with reference to the arithmetic device shown in FIG. 3 and FIG. 6.
  • the i-th layer computing node receives an input instruction from the upper-layer ((i-1)-th layer) computing node, decodes it in the ID stage, loads the data needed to run the input instruction in the LD stage, and then sends the decoded instruction to the next-layer ((i+1)-th layer) computing nodes, which execute the decoded instruction based on the loaded data to complete the operation execution EX stage.
  • the processor can also decompose the input instruction, and some operations also need to reduce the operation results of the decomposed instructions, that is, the operation reduction stage RD, in order to obtain the operation result of the input instruction; if the i-th layer computing node is not the first-level computing node, the processor of the i-th layer computing node can also write the operation result of the input instruction back to the upper-layer ((i-1)-th layer) computing node.
  • the next-layer ((i+1)-th layer) computing nodes also execute the operation execution EX among the multiple stages in a pipeline manner, as shown in Figure 6; that is, after receiving the instruction sent by the processor of this layer's (i-th layer) computing node as its own input instruction, a next-layer ((i+1)-th layer) computing node can decode that instruction, load the data it requires from this layer's memory component, and send the decoded instruction to its own next-layer ((i+2)-th layer) computing nodes for the operation execution stage; in other words, the next-layer ((i+1)-th layer) computing nodes are also pipelined in the order of instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB.
  • the computing device of the embodiment of the present disclosure constructs the hierarchical structure of the computing device through a multi-layer iterative manner.
  • the structure of each computing node of the computing device is the same; computing nodes of different layers and computing devices of different scales have the same programming interface and instruction set architecture, execute programs in the same format, and load data implicitly between layers.
  • the hierarchical structure of the computing device makes it possible to execute the operations corresponding to the input instructions in an iterative pipeline, efficiently utilize the computing nodes of each level, and improve the efficiency of computing.
  • any of the computing nodes may also include: a local processing unit LFU (local functional units) and a memory controller (for example, a DMA, Direct Memory Access), and the processor may include: a pipeline control unit, a demotion decoder DD (where demotion refers to passing instructions from the upper-layer computing node down to the next layer), and a reduction controller RC (Reduction Controller).
  • Fig. 7 shows a block diagram of a computing node according to an example of the present disclosure.
  • the input terminal of the decoder DD receives input instructions
  • the output terminal of the decoder DD is connected to the input terminal of the memory controller.
  • the memory component can be connected, through data paths, to the memory components of the upper-level computing node and the next-level computing nodes of any computing node, and the memory controller is connected to those data paths; as shown in Figure 7, memory component i is connected to memory component i-1, where memory component i-1 represents the memory component of the upper-level computing node of the current computing node; memory component i being connected to the next-level computing node means being connected to the memory component of the next-level computing node; and the memory controller is connected to the data path between the memory components.
  • the data path sends data from one memory component to another under the control of the memory controller.
  • the output terminal of the decoder DD is also connected to the input terminal of the next layer arithmetic node and the input terminal of the reduction control unit RC, which is connected to the local processing unit LFU.
  • the decoder DD is used for instruction decoding ID
  • the memory controller is used for data loading LD: the operand of the input instruction is loaded from the memory component of the upper computing node to the local memory component
  • the reduction control unit RC is used to control the LFU to execute the operation reduction RD to obtain the operation result of the input instruction
  • the memory controller is also used to write the operation result back to the memory component of the upper operation node of any operation node.
  • the pipeline control unit is connected to the decoder DD, the reduction control unit RC, the memory controller, and the next layer of computing nodes, and synchronizes the multiple stages based on feedback from them; for example, after the pipeline control unit receives the first feedback signals sent by the decoder DD, the memory controller, the next-level computing nodes, and the reduction control unit RC, it controls the pipeline to propagate downward in order.
  • the first feedback signal may refer to a signal indicating that the decoder DD, the memory controller, the next-level arithmetic node, and the reduction control unit RC have completed the corresponding stage of the current instruction.
  • for example, after the pipeline control unit receives the first feedback signals sent by the memory controller, the RC, the next-level computing nodes, and the DD, it can control the pipeline to propagate downward in order: the memory controller performs data write-back WB for input instruction 2, the RC controls the local processing unit to perform operation reduction RD for input instruction 3, the next-layer computing nodes perform operation execution EX for input instruction 4, the memory controller performs data loading LD for input instruction 5, and the DD performs instruction decoding ID on input instruction 6.
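  • the staggering of the five stages can be illustrated with the toy model below; it abstracts the feedback handshake into one advance per loop iteration, and the instruction names are placeholders, so it only shows how five consecutive input instructions come to occupy ID, LD, EX, RD, and WB at the same time.

```python
# Illustrative sketch: the pipeline control unit lets all latches update once
# per cycle, so successive input instructions occupy successive stages.
STAGES = ["ID", "LD", "EX", "RD", "WB"]

def run_pipeline(instructions, num_cycles):
    in_flight = []                            # newest instruction first
    for cycle in range(num_cycles):
        if instructions:
            in_flight.insert(0, instructions.pop(0))
        in_flight = in_flight[:len(STAGES)]   # instructions retire after WB
        snapshot = {STAGES[i]: instr for i, instr in enumerate(in_flight)}
        print(f"cycle {cycle}: {snapshot}")

# At the last printed cycle, instruction 6 is in ID, 5 in LD, 4 in EX,
# 3 in RD and 2 in WB, mirroring the example above.
run_pipeline(["instr1", "instr2", "instr3", "instr4", "instr5", "instr6"], 6)
```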
  • Fig. 8 shows a schematic diagram of an operation node and a pipeline operation process according to an example of the present disclosure.
  • the processor may further include a serial decomposer SD (sequential decomposer); the serial decomposer SD is connected to the input end of the decoder DD and is used to serially decompose the input instruction to obtain serial sub-instructions, and the processor controls the next layer of computing nodes to execute the operations corresponding to the serial sub-instructions in multiple stages in a pipeline manner.
  • a sub-instruction queue SQ (sub-level instruction queue) can also be set between the serial decomposer SD and the decoder DD; the sub-instruction queue SQ is used to temporarily store the serial sub-instructions, and the DD is also used to decode the serial sub-instructions to obtain decoded serial sub-instructions.
  • Setting SQ to temporarily store serial sub-instructions, for input instructions that need to be serially decomposed, can accelerate the propagation of the pipeline and improve the efficiency of calculation.
  • the input of the SD can also be connected to the instruction queue IQ (Instruction Queue), that is to say, the processor can first load the output instructions of the upper layer of arithmetic nodes as the input instructions of this layer of arithmetic nodes to IQ.
  • the computing node of this layer can refer to the computing node to which the processor belongs.
  • SD obtains input instructions from IQ. Taking into account the limitations of hardware, SD can decompose the input instructions into multiple serial sub-commands that can be executed serially and temporarily store them To SQ, DD obtains serial sub-instructions from SQ for decoding.
  • by setting IQ as a buffer between the SD and the upper-level computing node, the strict synchronous execution relationship between the SD and the upper-level computing node can be omitted.
  • IQ can simplify circuit design and improve execution efficiency; for example, it allows the SD to execute asynchronously and independently of the upper-level computing node, reducing the time the SD spends waiting for the upper-level computing node to send input instructions.
  • SQ acts as a buffer between SD and DD, which can eliminate the strict synchronization execution relationship between SD and DD.
  • SQ can simplify circuit design and improve execution efficiency. For example, it allows SD to execute asynchronously on its own and reduces the time that DD waits for SD to serialize input instructions.
  • the processing efficiency of the computing device can be improved by setting IQ and SQ.
  • the serial decomposition of the input instruction may include the decomposition of the operand of the input instruction and the decomposition of the input instruction.
  • the serial sub-instructions obtained by the serial decomposition will have the largest possible decomposition granularity.
  • the decomposition granularity of the serial sub-instructions obtained by the serial decomposition is determined according to the resources of the computing node and the resources required by the input instruction.
  • the resource of the computing node can be the capacity of the memory component of the computing node.
  • the decomposition granularity here can refer to the dimension of the decomposition operand.
  • the memory capacity required by the input instruction can be determined according to the memory capacity needed to store the operands of the input instruction and the memory capacity needed to store the intermediate results produced when the operator processes the operands; after the memory capacity required by the input instruction is determined, it can be judged whether the capacity of the memory component of this layer's computing node meets that requirement, and if it does not, the input instruction can be serially decomposed into serial sub-instructions according to the capacity of the memory component of this layer's computing node.
  • for example, for an input instruction that multiplies matrix X by matrix Y, the SD can determine the memory capacity required by the input instruction according to the sizes of matrix X and matrix Y, and compare the required memory capacity with the capacity of the memory component of this layer's computing node; if the memory capacity required by the input instruction is greater than the capacity of the memory component of this layer's computing node, the input instruction needs to be serially decomposed.
  • the specific process can be to decompose the operands, thereby dividing the input instruction into multiple serial sub-instructions that can be executed serially; for example, matrix X or matrix Y can be decomposed, or both matrix X and matrix Y can be decomposed.
  • the input instruction can be serially decomposed into serial sub-instructions for multiple matrix multiplications and a serial sub-instruction for summation, which are executed serially; after the serial sub-instructions for the multiple matrix multiplications are completed, their operation results are summed according to the summation serial sub-instruction to obtain the operation result of the input instruction.
  • serial decomposition method for matrix multiplication is only an example of the present disclosure to illustrate the function of SD, and does not limit the present disclosure in any way.
  • the processor may also include a parallel decomposer PD (Parallel decomposer).
  • the input end of the parallel decomposer PD is connected to the output end of the decoder DD.
  • the output terminal of the PD is connected to the input terminal of the next layer of arithmetic nodes.
  • the parallel decomposer PD is used to decompose the decoded serial sub-instructions in parallel according to the number of next-layer computing nodes to obtain parallel sub-instructions, and to send the parallel sub-instructions to the next-level computing nodes so that the next-level computing nodes run the parallel sub-instructions in parallel according to the operands corresponding to the parallel sub-instructions.
  • parallel decomposition may mean that the decomposed parallel sub-instructions can be executed in parallel.
  • for example, for a serial sub-instruction that adds operands A and B, the PD can decompose the serial sub-instruction in parallel to obtain 4 parallel sub-instructions, which respectively add A1 and B1, A2 and B2, A3 and B3, and A4 and B4.
  • the four parallel sub-commands can be sent to the next-level computing node. It should be noted that the above examples are only to illustrate examples of parallel decomposition, and do not limit the present disclosure in any way.
  • the memory controller may include a DMA (Direct Memory Access) and a DMAC (Direct Memory Access Controller); the DMAC is referred to as the first memory controller and the DMA as the second memory controller.
  • the DMA is connected to the data path
  • the DMAC is connected to the DMA, DD, SD, pipeline control unit, and the next layer of computing nodes.
  • the DMAC can generate a load instruction according to the control signal, and send the load instruction to the DMA, and the DMA controls the data path according to the load instruction to realize data loading.
  • the DMAC can also send the above-mentioned first feedback signal to the pipeline control unit; the DMA can notify the DMAC after it completes data loading or data write-back, and the DMAC can send the first feedback signal to the pipeline control unit after receiving the notification.
  • the input instruction may include an operator and operand parameters, where the operand parameters point to the operands of the input instruction and may include global parameters and local parameters; a global parameter indicates the size of the first operand corresponding to the input instruction, and a local parameter indicates the starting position of the second operand of the input instruction within the first operand and the size of the second operand.
  • the second operand can be part or all of the data in the first operand.
  • processing of the second operand can be realized when the input instruction is executed, and the processing of the second operand can be the processing corresponding to the operator of the input instruction.
  • the instruction used by the computing device of the present disclosure can be a triple <O, P, G>, where O represents an operator, P represents a finite set of operands, and G represents a granularity indicator; the specific expression form can be "O, P[N][n1][n2]", where N can be a positive integer representing a global parameter (multiple different Ns can be set according to the tensor dimensions), and n1 and n2 are natural numbers smaller than N representing local parameters, with n1 indicating the starting position of the operation on the operand and n2 indicating the size.
  • executing the above instruction realizes the operation O on the part of operand P from n1 to n1+n2, and n1 and n2 can be set as required.
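  • a sketch of the "O, P[N][n1][n2]" form is shown below; the Instruction class, the flat-list operand, and the SUM operator are illustrative assumptions used only to show how n1 and n2 select the second operand out of the first operand.

```python
# Illustrative sketch of the <O, P, G> instruction form "O, P[N][n1][n2]":
# N describes the full (first) operand, n1 the starting position and n2 the
# size of the part (second operand) actually operated on.
from dataclasses import dataclass

@dataclass
class Instruction:
    operator: str      # O
    operand: list      # P, here a flat list standing in for a tensor
    n1: int            # start position (local parameter)
    n2: int            # size (local parameter)

    def select(self):
        # the instruction applies `operator` to elements n1 .. n1 + n2 - 1
        return self.operand[self.n1:self.n1 + self.n2]

inst = Instruction("SUM", list(range(16)), n1=4, n2=8)
print(inst.select())          # elements 4..11 of the global operand
print(sum(inst.select()))     # result of applying the operator to them
```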
  • the format of the input instruction received by each layer of the computing device of the present disclosure is the same, so the decomposition of the instruction, the operation corresponding to the execution instruction, etc. can be automatically completed.
  • after any (current) computing node receives the input instruction sent by the upper-level computing node, it can read the corresponding operand from the memory component of the upper-level computing node according to the operand parameters of the input instruction and save it in the memory component of the current computing node; after executing the input instruction and obtaining the operation result, the computing node can write the operation result back to the memory component of the upper-level computing node.
  • the processor of the current computing node can send a control signal to the DMAC according to the operand parameters of the input instruction.
  • the DMAC can control the DMA according to the control signal.
  • the DMA controls the data path connecting the memory component of the current computing node and the memory component of the upper-level computing node, so as to load the operand of the input instruction into the memory component of the current computing node.
  • the DMAC may generate a load instruction according to the control signal, and send the load instruction to the DMA, and the DMA controls the data path according to the load instruction to implement data loading.
  • the DMAC can determine parameters such as the base address, the starting offset, the amount of loaded data, and the jump offset according to the control signal, generate the load instruction from these parameters, and also set the number of cycles for loading the data according to the dimensions of the operand.
  • the base address can be the starting address of the operand stored in the memory component
  • the starting offset is the position where the operand to be read starts in the original operand
  • the starting offset can be determined according to the starting position in the local parameters.
  • the number of loaded data can be determined according to the size in the local parameter.
  • the jump offset indicates the offset, within the original operand, between the starting position of the part of the operand to be read next and the starting position of the part read previously; that is, the jump offset is the jump of the starting offset of the next read relative to the starting offset of the previous read.
  • the jump offset can be determined according to the global parameters or the local parameters; for example, the starting position can be used as the starting offset, the size in the local parameters can be used as the amount of data loaded at one time, and the size in the global parameters can be used to determine the jump offset.
  • the starting address for reading the operand can be determined according to the base address and the starting offset, and the end address of one read can be determined according to the amount of loaded data and the starting address.
  • the starting address of the next part of the operand to be read can be determined according to the previous starting address and the jump offset, and the end position of that read can be determined according to the amount of loaded data and that starting address; this process is repeated until the number of cycles for loading the operand is reached.
  • "the previous read" and "this read" refer to the fact that reading the same operand may need to be completed in one or more passes, each pass reading part of the same operand; "previous" and "this" refer to one of those multiple passes.
  • reading an operand may need to loop multiple readings to complete.
  • the first memory controller can determine the starting address and end address of each read based on the base address, the starting offset, the amount of loaded data, and the jump offset; for example, for each read process, the starting address of this read can be determined from the starting address of the previous read and the jump offset, and the end address of this read can be determined from its starting address and the amount of loaded data (and the format of the data).
  • the jump offset can be determined according to the number of jumped data and the format of the data.
  • FIG. 9 shows a schematic diagram of an operand according to an embodiment of the present disclosure.
  • the operand P is a matrix P[M,N] with M rows and N columns
  • the control signal is "Load P [M,N][0,0][M,N/2], P'”.
  • the DMAC can set the starting offset in the row and column directions to be 0, the number of loaded data is N/2, the jump offset is N, and the number of cycles is M.
  • reading starts with N/2 columns of data from the first row and first column, then jumps to the second row and first column to read another N/2 columns of data, and so on.
  • the data can be loaded in M cycles.
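  • the address generation implied by this example can be sketched as follows; generate_read_ranges, the row-major layout, and the element-sized addresses are assumptions for illustration, with starting offset 0, N/2 elements per read, a jump offset of N, and M cycles as described above.

```python
# Illustrative sketch of the address generation for "Load P[M,N][0,0][M,N/2], P'":
# starting offset 0, N/2 elements per load, a jump of N elements between loads,
# repeated for M cycles (one cycle per row of the left half of the matrix).
def generate_read_ranges(base, M, N):
    start_offset, count, jump = 0, N // 2, N
    ranges = []
    addr = base + start_offset
    for _ in range(M):                       # one cycle per row
        ranges.append((addr, addr + count))  # [start, end) of this read
        addr += jump                         # jump to the next row's start
    return ranges

# A 4 x 8 matrix stored row-major from address 1000: read the left half of
# every row (columns 0 .. 3).
for start, end in generate_read_ranges(base=1000, M=4, N=8):
    print(start, end)
```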
  • any one of the computing nodes may also include pipeline latches: pipeline latches are respectively provided between the decoder DD and the memory controller, between the memory controller and the next-layer computing nodes FFU (Fractal Functional Units), between the next-layer computing nodes FFU and the local processing unit LFU, and between the local processing unit LFU and the memory controller.
  • the pipeline latch is used to cache the instructions to be processed in the next stage.
  • the pipeline control unit synchronizes the multiple stages by controlling the pipeline latch.
  • the pipeline control unit sends the first control signal to each of the pipeline latches, and each pipeline latch updates its output according to the first control signal.
  • the first control signal may be a high-level signal or a low-level signal, which is not limited in the present disclosure.
  • updating the output means that when a pipeline latch receives the first control signal (the control signal sent by the pipeline control unit to the pipeline latch, as shown in Figure 8), its output changes to follow its input, that is, the parallel sub-instruction or the control signal related to the operation of the input instruction that is fed to it; the input parallel sub-instruction or control signal refers to what is input from the left side of the pipeline latch in Figure 8.
  • the DMAC receives the control signal output by the pipeline latch 4, and controls the DMA according to the control signal to write data back to the input instruction 1 to WB;
  • the local processing unit LFU receives the control signal output by the pipeline latch 3, operates the input instruction 2 to reduce RD, and stores the reduction result (the operation result of the input instruction 2) in the memory component;
  • the next layer of arithmetic nodes receives the parallel sub-instructions in the pipeline latch 2 (obtained after decomposing the input instruction 3), performs an EX operation on the input instruction 3, and writes the execution result back to the memory component;
  • the DMAC receives the control signal sent by the pipeline latch 1, and controls the DMA to load the input operand of the input instruction 4 into the memory component according to the control signal;
  • DD decodes the ID of the input instruction 5, and sends the decoded input instruction 5 to PD and RC, and buffers related control signals such as data loading and data writing back in the pipeline latch 1.
  • PD decomposes the decoded input instruction 5 in parallel to obtain parallel sub-instructions, caches the parallel sub-instructions in the pipeline latch 1, and RC caches the control signal corresponding to the operation reduction of the input instruction 5 in the pipeline latch 1 in.
  • after each functional component completes its stage, it can send the first feedback signal to the pipeline control unit; after the pipeline control unit receives the first feedback signals sent by the DMAC, the RC, the next-level computing nodes, and the DD, it can send the first control signal to each of the pipeline latches, and the pipeline is controlled to propagate downward in order.
  • after each pipeline latch receives the first control signal, its output control signal follows its input signal. For example, (1) the control signal corresponding to the data write-back of input instruction 2 is output from pipeline latch 4, and the control signal corresponding to the data write-back of input instruction 3 is output from pipeline latch 3 to pipeline latch 4.
  • the control signal corresponding to the operation reduction of input instruction 3 is output from pipeline latch 3, the control signal corresponding to the operation reduction of input instruction 4 is output from pipeline latch 2 to pipeline latch 3, and the control signal corresponding to the operation reduction of input instruction 5 is output from pipeline latch 1 to pipeline latch 2;
  • the parallel sub-instructions of input instruction 4 are output from pipeline latch 2, and the parallel sub-instructions of input instruction 5 are output from pipeline latch 1 to pipeline latch 2;
  • the control signal corresponding to the data load of the input instruction 5 is output from the pipeline latch 1;
  • the input instruction 6 is input to the DD, DD decodes the ID of the input instruction 6 and sends the decoded input instruction 6 to PD and RC, and buffers the related control signals such as data loading and data writing back in the pipeline latch 1.
  • the PD decomposes the decoded input instruction 6 in parallel to obtain parallel sub-instructions and caches them in pipeline latch 1, and the RC caches the control signal corresponding to the operation reduction of input instruction 6 in pipeline latch 1.
  • the execution process of DMAC, RC, next-level computing nodes and DD is as follows:
  • the DMAC receives the control signal output by the pipeline latch 4, and controls the DMA to write back the data of the operation result of the input instruction 2 to WB;
  • LFU receives the control signal output by the pipeline latch 3, and obtains the execution result of the input instruction 3 after executing EX from the memory component according to the control signal, and operates the instruction result of the input instruction 3 to reduce RD, and The reduction result (the operation result of input instruction 3) is stored in the memory component;
  • the next layer of arithmetic node receives the parallel sub-instruction output from the pipeline latch 2 for the input instruction 4, performs an EX operation on the input instruction 4, and writes the execution result back to the memory component;
  • the DMAC receives the control signal sent by the pipeline latch 1, and controls the DMA according to the control signal to load the input operand of the input instruction 5 into the memory component;
  • the DD obtains input instruction 6 from the SQ and performs instruction decoding ID on input instruction 6.
  • DD can detect the data dependency of the serial sub-instruction when it obtains the serial sub-instruction from the SQ. If the data dependency of the serial sub-instruction is detected, the DD can stop from the SQ Get serial sub-commands in.
  • the existence of data dependency of the serial sub-instruction may refer to the overlap (data dependency) between the input operand of the serial sub-instruction and the output operand of multiple previous serial sub-instructions.
  • the number of previous serial sub-commands can be determined according to the number of stages of the pipeline. For example, in the 5-stage pipeline example of the embodiment of the present disclosure, the previous serial sub-commands may refer to the previous 4 serial sub-commands.
  • the overlap between the input operand of the currently decoded serial sub-instruction and the output operands of the previous serial sub-instructions may mean that the input operand of the currently decoded serial sub-instruction overlaps with the output operand of any one or more of the previous serial sub-instructions, which is not limited in the present disclosure.
  • if the input operand of the currently decoded serial sub-instruction overlaps with the output operands of the previous serial sub-instructions, that is, the input operand of the currently decoded serial sub-instruction is part or all of those output operands, then it can only be loaded after the previous serial sub-instructions have been executed and their output operands obtained; therefore, the propagation of the pipeline needs to be suspended until the previous serial sub-instructions have been executed and the output operands obtained, after which the propagation of the pipeline continues.
  • the specific process can be as follows: the DD stops obtaining serial sub-instructions from the SQ, and the output of the DD remains unchanged.
  • the first pipeline latch after the DD no longer outputs the latched control signal, but instead outputs a bubble (empty) control signal.
  • the functional components that receive the bubble control signal do not operate and only immediately send the first feedback signal to the pipeline control unit.
  • the pipeline control unit continues to transmit the first control signal as before, so that the pipeline continues to execute with the bubble injected from the first pipeline latch until the data dependency is resolved; after the data dependency is resolved, the DD continues to fetch instructions from the SQ, and the first pipeline latch continues to output the latched control signal.
  • the process of the pipeline can be flexibly controlled to avoid errors in calculation results.
  • when the decoder detects that the input operand of the currently decoded serial sub-instruction does not overlap with the output operands of the previous serial sub-instructions, it can decode the current serial sub-instruction and preload it onto the next-layer computing nodes.
  • specifically, the decoder can send a preload signal to the pipeline control unit; if the next-layer computing nodes have completed executing the parallel sub-instructions of input instruction 4 and sent the first feedback signal to the pipeline control unit, the pipeline control unit can send the first control signal to pipeline latch 1 according to the preload signal, and pipeline latch 1 outputs the parallel sub-instructions of input instruction 6 to the next-layer computing nodes (that is, the preloaded serial sub-instruction, shown by the dotted arrow from pipeline latch 1 to the FFUs in Figure 8), so that the next-layer computing nodes perform the EX operation on input instruction 6 in advance, thereby improving the computing efficiency of the computing device.
  • the decoder DD can detect data dependency by comparing the addresses of the output operands of the previous several (for example, 5) serial sub-instructions with the address and size descriptor of the input operand of the currently decoded serial sub-instruction.
  • the instruction preloading method can be used to speed up the processing speed and improve the processing efficiency of the arithmetic device.
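  • a minimal sketch of the dependency check that decides between stalling (injecting bubbles) and issuing (or preloading) is given below; the range-overlap model and the function names are illustrative assumptions rather than the disclosed descriptor format.

```python
# Illustrative sketch: the DD compares the input-operand range of the
# instruction it is about to decode against the output-operand ranges of the
# previous few serial sub-instructions; on overlap it stalls and injects
# bubbles instead of issuing, until the conflicting writes have retired.
def ranges_overlap(a_start, a_size, b_start, b_size):
    return a_start < b_start + b_size and b_start < a_start + a_size

def decode_or_stall(current_input, recent_outputs):
    """current_input: (start, size); recent_outputs: list of (start, size)."""
    for out in recent_outputs:
        if ranges_overlap(*current_input, *out):
            return "STALL"        # stop fetching from SQ, emit bubbles
    return "ISSUE"                # safe to decode and even preload to FFUs

recent = [(0, 256), (1024, 128), (4096, 64), (512, 32)]   # last sub-instructions
print(decode_or_stall((1050, 16), recent))   # overlaps the second write -> STALL
print(decode_or_stall((2048, 64), recent))   # no overlap -> ISSUE
```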
  • machine learning is a computational and memory-intensive technology.
  • the present disclosure provides a memory management method adopted by a computing device.
  • the memory component may include a static memory segment and a cyclic memory segment.
  • FIG. 11 shows a schematic diagram of an example of division of a memory component according to an embodiment of the present disclosure. As shown in FIG. 11, the memory space of the memory component can be divided into a static memory segment and a cyclic memory segment.
  • after these operations are decomposed, some of the operands will be shared among the decomposed parts of the operations; this part of the operands is referred to in the present disclosure as a shared operand.
  • taking the matrix multiplication operation as an example, suppose the input instruction is to multiply matrices X and Y; if only matrix X is decomposed, the serial sub-instructions obtained by serially decomposing the input instruction all need to use operand Y, so operand Y is the shared operand.
  • the input instruction may be an instruction describing the operation (operation) of machine learning.
  • the operation (operation) of the machine learning may consist of the above calculation primitives.
  • the input instruction may include operands and operators. That is to say, for the input instruction of any operation node, the multiple sub-instructions obtained by the processor decomposing the input instruction may share a part of the operands, and this part of the operands is the shared operand.
  • whether the decomposed operation or instruction has a shared operand can be determined according to the operation type and the decomposed dimension, where the operation type can refer to a specific operation or operation, for example, matrix multiplication;
  • the dimension of decomposition refers to the dimension along which the operand (tensor) of the input instruction is decomposed. For example, assume the operand is represented as NHWC (batch, height, width, channels); if the dimension determined according to the process shown in Figure 5 is the C dimension, then the operand is decomposed along the C dimension.
  • the processor allocates memory space for the shared operand in the static memory segment, and allocates memory space for the other operands of the multiple sub-instructions in the cyclic memory segment; here the shared operand is an operand that must be used when the next-level operation nodes in any operation node execute the multiple sub-instructions, and the other operands are the operands of the multiple sub-instructions other than the shared operand.
  • the present disclosure sets a static memory segment in the memory component to store the shared operands.
  • for the shared operand of the multiple sub-instructions, the operation of loading the shared operand from the memory component of the upper-level operation node of any operation node into the static memory segment only needs to be executed once before the multiple sub-instructions are executed, which avoids frequent data access and saves bandwidth resources.
  • the other operands can refer to the decomposed operands of the input instruction's operands, the intermediate results obtained by executing the sub-instructions, the reduction results, and so on, where a reduction result is obtained by performing operation reduction on intermediate results; operation reduction here refers to the reduction process mentioned above.
  • the processor decomposing an input instruction of any operation node to obtain multiple sub-instructions may include: the SD serially decomposes the input instruction into serial sub-instructions according to the memory capacity required by the input instruction, the capacity of the static memory segment, and the capacity of the cyclic memory segment.
  • for example, the input instruction may be serially decomposed into serial sub-instructions according to the memory capacity required by the input instruction and the capacity of the cyclic memory segment.
  • alternatively, the input instruction may be serially decomposed into serial sub-instructions according to the relationship between the memory capacity required by the shared operand and the remaining capacity of the static memory segment, and the relationship between the memory capacity required by the other operands and the capacity of the cyclic memory segment.
  • for input instructions whose decomposition yields shared operands, if the memory capacity required by the shared operand is greater than the remaining capacity of the static memory segment, or the memory capacity required by the other operands is greater than the capacity of the cyclic memory segment, the input instruction is serially decomposed.
  • specifically, the SD can calculate the remaining memory capacity of the static memory segment, and perform a first serial decomposition of the input instruction according to the remaining memory capacity of the static memory segment and the memory capacity required by the shared operand, obtaining first serial sub-instructions.
  • for example, the decomposition priority of the dimensions of the shared operand can be determined; the dimensions of the shared operand are selected in order of decomposition priority, and the maximum decomposition granularity is determined by dichotomy (binary search), until the memory capacity required by the decomposed shared operand is less than or equal to the remaining memory capacity of the static memory segment of the operation node of this layer.
  • the input instruction can then be decomposed in accordance with the decomposition of the shared operand.
  • the SD may then perform a second serial decomposition of the first serial sub-instructions according to the memory capacity of the cyclic memory segment and the memory capacity required by the other operands, obtaining the serial sub-instructions.
  • similarly, the decomposition priority of the dimensions of the other operands can be determined; the dimensions of the other operands to be decomposed are selected in order of decomposition priority, and the maximum decomposition granularity is determined by dichotomy, until the memory capacity required by the decomposed other operands is less than or equal to the remaining memory capacity of the cyclic memory segment of the operation node of this layer.
  • the first serial sub-instructions can then be decomposed in accordance with the decomposition of the other operands; the dichotomy itself is illustrated by the sketch below.
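  • the following sketch illustrates, under simplifying assumptions, the dichotomy described above for choosing the maximum decomposition granularity; the cost function memory_needed and the priority list are placeholders introduced for this sketch, not part of the disclosure.

```python
# Illustrative sketch of the dichotomy (binary search) used to pick the largest
# decomposition granularity along one dimension such that the decomposed
# operand still fits in the remaining capacity of the target memory segment.
def max_granularity(dim_size, memory_needed, remaining_capacity):
    """Largest granularity g (1 <= g <= dim_size) with memory_needed(g) <= remaining_capacity."""
    lo, hi, best = 1, dim_size, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if memory_needed(mid) <= remaining_capacity:
            best = mid          # fits: try a coarser (larger) granularity
            lo = mid + 1
        else:
            hi = mid - 1        # does not fit: decompose more finely
    return best

def serial_decompose(operand_dims, priority, memory_needed, remaining_capacity):
    """Try dimensions in decomposition-priority order until a granularity fits."""
    for dim in priority:
        g = max_granularity(operand_dims[dim],
                            lambda k: memory_needed(dim, k),
                            remaining_capacity)
        if g is not None:
            return dim, g       # decompose along `dim` with granularity g
    raise ValueError("operand cannot be made to fit even at granularity 1")
```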
  • in the matrix multiplication example above, the memory capacity required to store operand Y can be compared with the capacity of the static memory segment. If the memory capacity required to store operand Y is less than the capacity of the static memory segment, operand Y need not be decomposed; if it is greater than the capacity of the static memory segment, operand Y is decomposed according to the process shown in FIG. 5, and the input instruction is serially decomposed according to the decomposition of operand Y.
  • then, combining operand X with the decomposed operand Y, the memory capacity required to store operand X, the intermediate results, and the reduction results can be determined. If the memory capacity required to store these other operands is less than the capacity of the cyclic memory segment, operand X need not be decomposed; if it is greater than the capacity of the cyclic memory segment, the decomposition of operand X can be adjusted according to the process shown in Figure 5, with the difference that what is compared each time is the memory capacity required to store all the other operands against the capacity of the cyclic memory segment, not just the capacity required by operand X.
  • the serial sub-instructions obtained after serial decomposition of input instructions include head instructions and main instructions.
  • the head instructions are used to load common operands.
  • the SD can allocate memory space for the shared operand in the static memory segment; the head instruction records the address of the memory space allocated for the shared operand, and the main instructions are used to load the other operands and to perform the operation on the shared operand and the other operands, as in the sketch below.
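  • as a non-normative illustration of the head/main split, the sketch below builds one head instruction that loads the shared operand into the static memory segment and several main instructions that reuse its address; the namedtuples and the static_alloc helper are assumptions made for this sketch.

```python
# Sketch of serial decomposition into one head instruction plus main instructions.
from collections import namedtuple

HeadInsn = namedtuple("HeadInsn", "shared_operand static_addr")
MainInsn = namedtuple("MainInsn", "op shared_addr other_operands")

def serial_decompose_with_head(op, shared_operand, other_operand_parts,
                               static_alloc):
    """static_alloc: assumed allocator for the static memory segment."""
    static_addr = static_alloc(shared_operand.size)
    head = HeadInsn(shared_operand, static_addr)      # load the shared operand once
    mains = [MainInsn(op, static_addr, part)          # each main instruction reuses it
             for part in other_operand_parts]
    return [head] + mains
```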
  • the computing node of the present disclosure is provided with a local processing unit LFU (local functional units), a first memory controller (DMAC, Direct Memory Access Controller), and a second memory controller (DMA, Direct Memory Access)
  • the first memory controller can be implemented in a hardware circuit or a software program, which is not limited in the present disclosure.
  • the first memory controller is connected to the second memory controller.
  • for other aspects of the second memory controller, please refer to the introduction above, which will not be repeated here.
  • the first memory controller is respectively connected to SD and DD, and reads operands from the memory components of the upper-level computing node according to the control signals sent by the SD or DD, and writes them to the memory components of the current computing node.
  • the first memory controller is also responsible for writing back data between different layers of operation nodes, for example, writing the operation results of the i+1 layer operation node back to the i-th layer operation node.
  • the memory component of each computing node is also connected to the local processing unit LFU in the same computing node.
  • the output terminal of the decoder DD is also connected to the reduction control unit RC, which is connected to the local processing unit LFU.
  • the reduction control unit RC is used to control the LFU to perform operation reduction RD to obtain the operation result of the input instruction, and write the operation result to the memory component.
  • the first memory controller can also control the second memory controller to write the operation result in the memory component back to the memory component of the upper-level operation node.
  • SD can output the serial sub-instructions after serial decomposition to SQ
  • DD obtains the serial sub-instructions from SQ
  • DD mainly allocates memory space on the loop memory segment according to the main instruction's data storage needs
  • DD can allocate memory space on the memory component of the operation node of this layer for a serial sub-instruction according to the storage requirement of its corresponding operands, and bind the address (local address) of the allocated memory space into the instruction for obtaining the operands of the main instruction, thereby completing the decoding process.
  • DD can also send a control signal to the first memory controller DMAC according to the serial sub-instruction, and the DMAC can, according to the control signal, control the second memory controller DMA to load the operand corresponding to the serial sub-instruction into the memory space allocated for it; that is, according to the address of the operand recorded in the serial sub-instruction, the storage location of the operand is found in the memory component of the upper-level operation node, the operand is read, and it is written into the memory component of the operation node of this layer according to the local address.
  • the processor in any operation node controls the next-level operation nodes to execute, in a pipeline manner over multiple stages, the operations corresponding to the serial sub-instructions of that operation node.
  • Figure 10b shows an example of a pipeline according to an embodiment of the present disclosure.
  • the multiple stages can include: instruction decoding ID (Instruction Decoding), data loading LD (Loading), operation execution EX (Execution), operation reduction RD (Reduction), and data writing back WB (Writing Back); the pipeline propagates in the order ID, LD, EX, RD, WB.
  • DD is used to perform instruction decoding ID on the multiple sub-instructions (serial sub-instructions).
  • the decoder sends a first control signal to the first memory controller according to the head instruction, so that the first memory controller controls the second memory controller according to the first control signal to load the shared operand.
  • DD can allocate memory space on the cyclic memory segment of the operation node of this layer according to the storage requirements of the other operands corresponding to the main instruction, and bind the address (local address) of the allocated memory space into the main instruction for obtaining or storing the other operands, thereby completing the decoding process.
  • the decoder may also send a second control signal to the first memory controller according to the main instruction, so that the first memory controller controls the second memory controller according to the second control signal to access the other operands.
  • the second memory controller DMA is used for data loading LD, that is, loading the operands of the input instruction into the memory component, which specifically includes: loading the shared operand from the memory component of the upper-level operation node into the static memory segment according to the first control signal corresponding to the head instruction, and loading the other data from the memory component of the upper-level operation node into the cyclic memory segment according to the second control signal corresponding to the main instruction.
  • the other data that the second memory controller loads from the memory component of the upper-level operation node into the cyclic memory segment is mainly the part of the other operands that belongs to the input operands, rather than intermediate results or reduction results.
  • the DD decodes the serial sub-instructions and sends them to the PD.
  • the PD can decompose the decoded serial sub-instructions in parallel according to the number of the next layer of operation nodes connected to the PD.
  • parallel decomposition means that the decomposed parallel sub-instructions can be executed in parallel.
  • the operation node of the next layer can execute the operation execution EX in the multiple stages in a pipeline manner to obtain the execution result.
  • RC is used to control the LFU to perform operation reduction RD on the execution results to obtain the operation result of the input instruction.
  • the DMA is also used for data writing back WB: writing the operation result back into the memory component of the upper-level operation node of said any operation node.
  • SD, DD and PD in the processor are separated, and memory allocation can be staggered in time.
  • PD always allocates memory space after DD, but the allocated memory space is released earlier.
  • DD always allocates memory space after SD, but the allocated memory space is also released earlier.
  • the memory space used by the serial decomposition of SD may be used across multiple serial sub-instructions; therefore, a static memory segment is set aside for SD, while the remaining part of the memory component (the cyclic memory segment) is shared by the other units.
  • the LD and WB stages are both DMA accessing memory segments.
  • the order of LD and WB is controlled by the DMAC, so there will be no conflicts when accessing memory; that is to say, at most 3 instructions need to access the cyclic memory segment at the same time. Therefore, the cyclic memory segment can be divided into multiple sub-memory blocks, for example, three sub-memory blocks.
  • memory space can be allocated for the operands of the serial sub-instructions in the three sub-memory blocks in the order in which the serial sub-instructions arrive. In this way, the complexity of memory management can be reduced and memory space utilization can be improved.
  • the processor is provided with a first counter
  • the cyclic memory segment includes multiple sub-memory blocks
  • the processor allocating memory space in the cyclic memory segment for the other operands of the multiple sub-instructions includes: the processor allocates memory space for the other operands from the sub-memory block corresponding to the count value of the first counter in the cyclic memory segment.
  • FIG. 12 and FIG. 13 show schematic diagrams of examples of the division of memory components according to an embodiment of the present disclosure.
  • the cyclic memory segment is divided into three sub-memory blocks.
  • the memory capacities of the three sub-memory blocks may be the same or different, and the present disclosure does not limit this.
  • the processor can be provided with a counter 1. After DD obtains serial sub-instructions from SQ, memory space in the cyclic memory segment can be allocated to each main instruction according to the order of the main instructions and the count value of counter 1.
  • for example, when the count value of counter 1 is 0, DD allocates memory space for the operands of main instruction 1 in cyclic memory segment 0; when main instruction 2 is then obtained and the count value of counter 1 is 1, DD allocates memory space for the operands of main instruction 2 in cyclic memory segment 1; when main instruction 3 is then obtained and the count value of counter 1 is 2, DD allocates memory space for the operands of main instruction 3 in cyclic memory segment 2, as sketched below.
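  • the round-robin allocation driven by counter 1 could be expressed, purely for illustration, as in the following sketch; the sub-memory block objects and their alloc method are assumed helpers and do not denote the actual hardware interface.

```python
# Illustrative sketch of counter-driven allocation over the sub-memory blocks
# of the cyclic memory segment.
class CyclicSegment:
    def __init__(self, sub_blocks):
        self.sub_blocks = sub_blocks   # e.g. 3 sub-memory blocks
        self.counter1 = 0              # "counter 1" in the text

    def alloc_for_main_instruction(self, operand_size):
        block = self.sub_blocks[self.counter1]
        addr = block.alloc(operand_size)            # local address in that block
        self.counter1 = (self.counter1 + 1) % len(self.sub_blocks)
        return addr
```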
  • Fig. 12 also shows a schematic diagram of a pipeline propagation process of multiple instructions according to an embodiment of the present disclosure. This will be described below in conjunction with the above example of allocating memory space and the propagation process of the pipeline.
  • DD allocates memory space for main instruction 2 in cyclic memory segment 1; meanwhile, the input operands of main instruction 1 are loaded into cyclic memory segment 0 by the DMA, that is, cyclic memory segment 0 is used by the DMA at this time.
  • next, DD allocates memory space for main instruction 3 in cyclic memory segment 2; the input operands of main instruction 2 are loaded into cyclic memory segment 1 by the DMA, that is, cyclic memory segment 1 is used by the DMA at this time; and the next-layer operation nodes FFU (Fractal Functional Units) execute parallel instruction 1 and write the execution result back to cyclic memory segment 0, that is, cyclic memory segment 0 is used by the FFU at this time.
  • then, DD allocates memory space for main instruction 4 in cyclic memory segment 0; the input operands of main instruction 3 are loaded into cyclic memory segment 2 by the DMA, that is, cyclic memory segment 2 is used by the DMA at this time; the FFU executes parallel instruction 2 and writes the execution result back to cyclic memory segment 1, that is, cyclic memory segment 1 is used by the FFU at this time; and the LFU performs operation reduction RD on the execution result of main instruction 1, that is, cyclic memory segment 0 is used by the LFU at this time.
  • subsequently, the DMA writes the reduction result in cyclic memory segment 0 back to the memory component of the upper-level operation node and loads the input operands of main instruction 4 into cyclic memory segment 0, that is, cyclic memory segment 0 is used by the DMA at this time; for main instruction 3, in the EX stage the FFU executes parallel instruction 3 and writes the execution result back to cyclic memory segment 2, that is, cyclic memory segment 2 is used by the FFU at this time; for main instruction 2, the LFU performs operation reduction RD on its execution result, that is, cyclic memory segment 1 is used by the LFU at this time.
  • in this way, the DMA, the next-layer operation nodes (FFU), and the LFU use the 3 sub-memory blocks in rotation, which reduces the complexity of memory management and improves memory space utilization.
  • A1, A2, and B are located in the memory components of the upper computing node, and K1 and K2 are allocated to the static memory segment by SD.
  • Fig. 14 shows a schematic diagram of a memory space allocation method of a static memory segment according to an embodiment of the present disclosure.
  • in one approach, SD first allocates memory space for operand 1 of input instruction 1, and then allocates memory space for operand 2 of the second input instruction 2; at this time operand 1 is still in use, so memory space for operand 2 is allocated adjacent to the location where operand 1 is stored.
  • when the third input instruction 3 arrives, operand 1 may no longer be in use while operand 2 is still in use, so memory space for operand 3 can be allocated at the storage location of operand 1; however, the memory space required by operand 3 may be slightly smaller than that of operand 1, leaving a gap of unused memory between the space of operand 3 and operand 2.
  • alternatively, the memory space required to store operand 3 may be slightly larger than that of operand 1, in which case memory space for operand 3 may have to be allocated to the right of operand 2 in FIG. 14; this makes memory management complicated and lowers memory space utilization.
  • to address this, the present disclosure also provides a second counter (which may be called counter 2) in the processor.
  • the SD can allocate memory space for the shared operands from different ends of the static memory segment according to the order of the head instructions generated by serial decomposition and the count value of counter 2.
  • the processor allocating memory space for the shared operand in the static memory segment may include: the processor allocates memory space for the shared operand starting from a first starting end of the static memory segment, where the first starting end is the end corresponding to the count value of the second counter.
  • the count value of counter 2 may include 0 and 1, where 0 may correspond to one end of the static memory segment, and 1 may correspond to the other end of the static memory segment.
  • FIG. 15 shows a schematic diagram of a memory space allocation method of a static memory segment according to an embodiment of the present disclosure.
  • SD obtains input instruction 1 from SQ and serially decomposes it to obtain multiple serial sub-instructions 1, which share operand 1; SD therefore needs to allocate memory space for operand 1 in the static memory segment. Assuming that the count value of counter 2 is 0 at this time, SD can allocate memory space for operand 1 starting from the left end shown in Figure 15. SD then obtains input instruction 2 from SQ and serially decomposes it to obtain multiple serial sub-instructions 2.
  • the serial sub-instructions 2 share operand 2, so SD needs to allocate memory space for operand 2 in the static memory segment. Assuming that the count value of counter 2 is now 1, SD can allocate memory space for operand 2 starting from the right end shown in Figure 15. SD then obtains input instruction 3 from SQ and serially decomposes it to obtain multiple serial sub-instructions 3, which share operand 3, so SD needs to allocate memory space for operand 3 in the static memory segment. Assuming that the count value of counter 2 is 0 at this time, SD can allocate memory space for operand 3 starting from the left end shown in Figure 15.
  • in one possible implementation, the SD may determine, according to the count value of the second counter, the first starting end from which memory space is allocated for the shared operand; the SD then calculates the remaining memory capacity of the static memory segment starting from that first starting end, and performs the first serial decomposition of the input instruction according to this remaining memory capacity and the memory capacity required by the shared operand, obtaining the first serial sub-instructions. That is to say, in this embodiment, when the SD calculates the remaining memory capacity of the static memory segment, it determines the starting end according to the count value of the second counter, calculates the remaining capacity from that end, and then determines, according to the relationship between the memory capacity required to store the shared operand and this remaining capacity, whether to decompose the shared operand and the corresponding input instruction. A sketch of this alternating-end allocation is given below.
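  • the alternating-end allocation in the static memory segment, driven by counter 2, could look like the following simplified sketch; the bump-pointer allocation is an assumption made for illustration only.

```python
# Illustrative sketch of allocating shared operands from alternating ends of
# the static memory segment, as in Figure 15.
class StaticSegment:
    def __init__(self, capacity):
        self.capacity = capacity
        self.left = 0              # next free offset growing from the left end
        self.right = capacity      # next free offset growing from the right end
        self.counter2 = 0          # "counter 2": 0 -> left end, 1 -> right end

    def remaining(self):
        return self.right - self.left

    def alloc_shared(self, size):
        if size > self.remaining():
            raise MemoryError("shared operand must be further decomposed")
        if self.counter2 == 0:
            addr = self.left
            self.left += size      # allocate from the left end
        else:
            self.right -= size     # allocate from the right end
            addr = self.right
        self.counter2 ^= 1         # toggle the end used for the next head instruction
        return addr
```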
  • the computing device of the present disclosure can reduce the complexity of memory management and improve the utilization of memory space.
  • Machine learning is a computing and memory access-intensive technology. Frequent access to data places high requirements on the bandwidth of computing devices for machine learning operations.
  • the present disclosure provides a method for obtaining operands. This method can be applied to a processor, and the processor can be a general-purpose processor.
  • the processor can be a central processing unit (CPU), a graphics processing unit (GPU), etc.
  • the processor may also be an artificial intelligence processor for performing artificial intelligence operations.
  • the artificial intelligence operations may include machine learning operations, brain-like operations, and the like. Among them, machine learning operations include neural network operations, k-means operations, and support vector machine operations.
  • the artificial intelligence processor may include, for example, NPU (Neural-Network Processing Unit, neural network processing unit), DSP (Digital Signal Processor, digital signal processing unit), field programmable gate array (Field-Programmable Gate Array, FPGA) chip One or a combination of.
  • the artificial intelligence processor may include multiple computing units, and multiple computing units can perform operations in parallel. The method for obtaining operands provided in the present disclosure can also be applied to the arithmetic device described above.
  • Fig. 16 shows a schematic diagram of an application scenario according to an embodiment of the present disclosure.
  • when the processor executes an input instruction, it needs to load the operand of the input instruction from the external storage space into the local memory component, and after the input instruction is executed, the operation result of the input instruction is output to the external storage space.
  • the process of frequent loading and output requires a lot of bandwidth.
  • the embodiments of the present disclosure record the data stored on the local memory component by means of a data address information table, so that before loading the operand of an input instruction from the external storage space, it is checked whether the operand has already been stored on the local memory component; if so, there is no need to load the operand from the external storage space into the local memory component, and the operand already stored on the local memory component is used directly, which saves bandwidth resources.
  • the address correspondence relationship may be recorded in the data address information table, and the address correspondence relationship may include: the correspondence relationship between the storage address of the operand on the local memory component and the storage address of the operand on the external storage space.
  • Table 1 shows an example of a data address information table according to an embodiment of the present disclosure.
  • Out_addr1, In_addr1, etc. in Table 1 are just symbols representing addresses.
  • the addresses recorded in the data address information table of the embodiments of the present disclosure may be in the form of a start address plus a granularity identifier, where the start address is the starting address of the memory space in which the operand is stored and the granularity identifier indicates the size of the operand; that is to say, the starting address of the data and the size of the data are recorded, for example as in the sketch below.
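  • a possible software view of one entry of the data address information table, recording each side of the mapping as a start address plus a granularity (size), is sketched below; the field names and methods are illustrative assumptions, not the disclosed hardware format.

```python
# Sketch of one entry of the data address information table (tensor replacement
# table): it maps a region of the external storage space to a region of the
# local memory component.
from dataclasses import dataclass

@dataclass
class AddrEntry:
    ext_start: int    # start address in the external storage space (e.g. Out_addr1)
    local_start: int  # start address on the local memory component (e.g. In_addr1)
    size: int         # granularity identifier: number of addresses covered
    valid: bool = True

    def contains_ext(self, start: int, size: int) -> bool:
        # True if [start, start + size) lies entirely inside the recorded
        # external region, i.e. the queried operand is already resident locally.
        return self.valid and start >= self.ext_start and \
               start + size <= self.ext_start + self.size

    def to_local(self, ext_addr: int) -> int:
        # Translate an external address inside this entry to its local address.
        return self.local_start + (ext_addr - self.ext_start)
```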
  • Fig. 17 shows a flowchart of a method for obtaining an operand according to an embodiment of the present disclosure. As shown in Figure 17, the method may include:
  • Step S11: search the data address information table to determine whether the operand has been stored on the local memory component;
  • Step S12: if the operand has been stored on the local memory component, determine the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table;
  • Step S13: assign the storage address of the operand on the local memory component to the instruction for obtaining the operand.
  • after the processor receives a data load instruction, it can execute the data load instruction to load the operand into the local memory component.
  • the data load instruction is bound with the storage address of the operand in the external storage space; a control signal for loading the data is generated according to the data load instruction (and its bound storage address), and the DMA (Direct Memory Access) performs the data loading process according to the control signal.
  • step S11 may be executed to find whether the operand to be loaded is stored in the local memory component in the data address information table.
  • the address correspondence can be recorded in the data address information table.
  • if the address correspondences include all of the storage addresses of the operand in the external storage space, it is determined that the operand has been stored on the local memory component.
  • if the address correspondences do not include all of the storage addresses of the operand in the external storage space, it is determined that the operand is not stored on the local memory component.
  • in other words, whether the operand has already been saved on the local memory component can be determined from the storage addresses on the external storage space recorded in the data address information table: if the operand to be loaded has been stored before, the data address information table records the correspondence between its storage address on the external storage space and its storage address on the local memory component; if the recorded external storage addresses contain the storage address of the operand to be loaded in the external storage space, the operand to be loaded has already been stored on the local memory component and can be used directly without being loaded again.
  • the operand may not only be a number, but may be multiple numbers or a vector, matrix, tensor, etc. containing multiple numbers.
  • the storage address on the external storage space of the operand bound by the data load instruction can be the address of a section of storage space: if the storage address on the external storage space in an address correspondence completely contains the storage address of the operand bound by the data load instruction, it can be determined that the operand has been stored on the local memory component; if the storage addresses on the external storage space in the address correspondences do not contain, or only partially contain, the storage address in the external storage space of the operand bound by the data load instruction, it can be determined that the operand is not stored on the local memory component.
  • checking whether one address range contains another does not require traversing the addresses of all of the operand's data; it is only necessary to check whether the addresses of the data at two extreme points of the operand fall within the storage address on the external storage space of any one of the address correspondences recorded in the data address information table. For example, if the operand is a matrix, it suffices to check whether the storage addresses of the data at the two vertices on a diagonal of the matrix are contained in the storage address on the external storage space of any one recorded address correspondence.
  • if they are, every data element in the matrix is contained by the storage address on the external storage space of that address correspondence.
  • more generally, for two parallel (axis-aligned) hypercubes in N-dimensional space, it is only necessary to check whether the storage addresses of the data at the two vertices on the main diagonal of the operand are contained in the storage address on the external storage space of any one of the address correspondences recorded in the data address information table.
  • in hardware, each table entry can be equipped with two discriminators in addition to the registers required for the entry record; the two discriminators are used to determine whether the two diagonal vertices satisfy the inclusion condition.
  • if both conditions are met, the entry is considered a hit, that is, the storage address in the external storage space of the operand to be queried falls within the storage address on the external storage space of that table entry's address correspondence, indicating that the operand to be queried has already been saved on the local memory component. For example, suppose the vertex address to be checked satisfies an equation of the form 1·x0 + 21·x1 = offset, where x0 is the low-dimensional variable and x1 is the high-dimensional variable.
  • because the constant (1) of the low-dimensional variable (x0) is always a factor of the constant (21) of the high-dimensional variable (x1), the equation can be solved with integer division alone: when the dimension is 1 it can be solved directly; when the dimension is 2, one integer division is required; when the dimension is n, n-1 consecutive integer divisions are required, each time taking the remainder as the next dividend and assigning values from the highest dimension down to the lowest.
  • here n is the number of dimensions, and is usually no more than 8.
  • Two discriminators judge the two vertices separately. If both discriminators give a positive judgment, the entry is considered a hit.
  • there is no need to reserve many entries in each TTT (tensor replacement table), for example 8 to 32 entries, because not many tensors are processed in an operation.
  • when making a query, the maximum and minimum addresses of the operand are first calculated and broadcast to the two discriminators of every record in each TTT; all discriminators work at the same time, and the TTT only needs to return any one hit to determine the matching entry. The sketch below illustrates this query in software.
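  • the following sketch illustrates, under the assumption of a row-major layout in which each stride divides the next (the innermost stride being 1), how the two diagonal vertices of a queried operand can be checked against every valid entry using successive integer divisions; it is a software illustration, not a description of the discriminator circuitry, and the dictionary keys are assumptions.

```python
# Sketch of the two-vertex containment query against the recorded table entries.
def address_to_coords(offset, strides):
    """Recover N-dimensional coordinates by successive integer divisions,
    assigning from the highest dimension down (strides sorted descending)."""
    coords = []
    for stride in strides:
        coords.append(offset // stride)
        offset %= stride          # remainder becomes the next dividend
    return coords

def vertex_in_entry(addr, entry):
    """entry: dict with 'base', 'strides' (descending) and 'shape' per dimension."""
    if addr < entry["base"]:
        return False
    coords = address_to_coords(addr - entry["base"], entry["strides"])
    return all(0 <= c < extent for c, extent in zip(coords, entry["shape"]))

def query(table, min_addr, max_addr):
    """Return the first entry containing both diagonal vertices, else None."""
    for entry in table:
        if entry.get("valid", True) and \
           vertex_in_entry(min_addr, entry) and vertex_in_entry(max_addr, entry):
            return entry          # hit: operand already resident locally
    return None
```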
  • for step S12, if it is determined that the operand has been stored on the local memory component, the storage address of the operand on the local memory component can be determined from the storage address of the operand in the external storage space and the address correspondences recorded in the data address information table; specifically, the storage address on the local memory component to which the address correspondence maps the operand's storage address in the external storage space is used as the storage address of the operand on the local memory component.
  • for example, if the storage address of the operand in the external storage space is Out_addr1, then according to the address correspondence in Table 1 it can be determined that the storage address of the operand on the local memory component is In_addr1; or, if the storage address of the operand in the external storage space is a part of Out_addr1, then according to the address correspondence, the corresponding part of In_addr1 can be determined as the storage address of the operand on the local memory component.
  • for example, if Out_addr1 is the range addr11 to addr12, and the storage address of the operand in the external storage space is addr13 to addr14 within addr11 to addr12, then the addresses in In_addr1 corresponding to addr13 to addr14 are the storage address of the operand on the local memory component.
  • the instruction to obtain the operand may refer to a data load instruction.
  • the storage address of the operand on the local memory component can be bound to the data load instruction corresponding to the operand; in this way, the processor can directly execute the data load instruction and obtain the operand from the local memory component, eliminating the process of loading the operand from the external storage space into the local memory component and saving bandwidth resources.
  • Fig. 18 shows a flowchart of a method for acquiring an operand according to an embodiment of the present disclosure. As shown in Figure 18, the method may further include:
  • Step S14: if the operand is not stored on the local memory component, generate a control signal for loading the operand according to the storage address of the operand, the control signal being used to load the operand from its storage address onto the local memory component.
  • the operand can be loaded from the external storage space to the local memory component according to the normal process.
  • the specific process can be: allocate memory space for the operand on the local memory component, determine the address of the allocated memory space, generate the control signal for loading the operand according to the storage address of the operand bound by the data load instruction and the address of the allocated memory space, and send the control signal to the DMA, which loads the operand from the storage address of the operand into the local memory component according to the control signal.
  • the method may further include:
  • Step S15: when the operand is loaded from the external storage space into the local memory component, update the data address information table according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
  • if the loaded operand overwrites an operand originally stored on the local memory component, the address correspondence of the loaded operand (the correspondence between its storage address in the external storage space and its storage address on the local memory component) can replace the address correspondence of the originally stored operand in the data address information table.
  • the specific process can also be to first determine whether the memory space allocated for the loaded operand overlaps with the storage address recorded in an existing address correspondence; if there is overlap, the originally recorded address correspondence is invalidated and the address correspondence of the newly loaded operand is recorded, that is, the correspondence between the storage address of the loaded operand on the external storage space and its storage address on the local memory component.
  • for example, suppose the processor allocates the memory space In_addr1 to the operand to be loaded (whose storage address in the external storage space is Out_addr3); after the operand is loaded, it overwrites the data originally stored in the memory space In_addr1.
  • in the data address information table, the address correspondence between Out_addr1 and In_addr1 can then be invalidated and replaced with the address correspondence between Out_addr3 and In_addr1.
  • if In_addr1 represents a section of memory space and the processor only allocates a part of it, In_addr3, to the above operand, then the address correspondence between Out_addr3 and In_addr3 can be used to replace the original address correspondence between Out_addr1 and In_addr1.
  • alternatively, only the address correspondence of the most recently loaded operand is recorded in the data address information table; therefore, when loading an operand from the external storage space into the local memory component, the original address correspondence in the data address information table is directly replaced with the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component. A sketch of the table update is given below.
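  • a simplified sketch of the table update in step S15 is given below, reusing the illustrative AddrEntry type sketched earlier: entries whose local region is overwritten are invalidated and the new correspondence is appended; this is an assumption-laden illustration, not the disclosed implementation.

```python
# Sketch of updating the data address information table after a load (step S15).
def ranges_overlap(a_start, a_size, b_start, b_size):
    return not (a_start + a_size <= b_start or b_start + b_size <= a_start)

def update_table(table, ext_start, local_start, size):
    # Invalidate any entry whose local storage is (partly) overwritten.
    for entry in table:
        if entry.valid and ranges_overlap(entry.local_start, entry.size,
                                          local_start, size):
            entry.valid = False
    # Record the correspondence for the newly loaded operand.
    table.append(AddrEntry(ext_start=ext_start, local_start=local_start,
                           size=size))
```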
  • the specific process can also include an invalidation mechanism: an aging time can be set, and after an address correspondence is recorded, a timer is started; when the aging time is reached, the address correspondence becomes invalid.
  • the length of the aging time can be set according to the balance between the requirements for bandwidth and efficiency, and the present disclosure does not specifically limit the length of the aging time.
  • the aging time may be set to be greater than or equal to two pipeline cycles, and one pipeline cycle may refer to the time required for the pipeline of the computing node to propagate one stage forward.
  • correspondingly, in step S11, only when an address correspondence is valid and the storage address on the external storage space in that address correspondence contains the storage address in the external storage space of the operand to be loaded is the result returned that the operand has been stored on the local memory component.
  • if either of these two conditions is not met, the result that the operand is stored on the local memory component is not returned; for example, if the address correspondence is invalid, this result is not returned.
  • a validity flag for each address correspondence can also be recorded in the data address information table; the flag indicates whether the address correspondence is valid, for example 1 for valid and 0 for invalid.
  • when an address correspondence is recorded, its flag can be set to 1, and when the aging time is reached, the flag is set to 0; a sketch of this aging mechanism follows.
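  • purely as an illustration of the aging mechanism, the sketch below attaches a deadline to each entry and clears the validity flag once the aging time has elapsed; the use of a wall-clock timer is an assumption, since the disclosure measures the aging time in pipeline cycles.

```python
# Minimal sketch of an aging table entry, assuming a monotonic time source.
import time

class AgingEntry:
    def __init__(self, ext_start, local_start, size, aging_time):
        self.ext_start, self.local_start, self.size = ext_start, local_start, size
        self.flag = 1                       # 1 = valid, 0 = invalid
        self.deadline = time.monotonic() + aging_time

    def is_valid(self):
        if self.flag and time.monotonic() >= self.deadline:
            self.flag = 0                   # aging time reached: invalidate
        return self.flag == 1
```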
  • through the above method, when an operand has already been stored on the local memory component, the processor can directly execute the data load instruction to obtain the operand from the local memory component, eliminating the process of loading the operand from the external storage space into the local memory component and saving bandwidth resources.
  • the method of the present disclosure may be applied to a computing device, and the computing device may include: a multi-layer computing node, and each computing node includes a local memory component, a processor, and a next-level computing node.
  • the external storage space may be the memory component of the upper-level computing node of the computing node or the memory component of the next-level computing node.
  • the computing device may be provided with a tensor replacement table (an example of a data address information table); the tensor replacement table can record, for an operand stored in the static memory segment, the correspondence between its storage address in the external storage space and its storage address in the static memory segment.
  • the external storage space here can refer to the memory component of the upper-level computing node.
  • before SD allocates memory space for a shared operand in the static memory segment, it can first look up in the tensor replacement table whether the shared operand has already been saved in the static memory segment of the local memory component; if it has, the storage address of the shared operand on the local memory component is determined from the storage address of the shared operand in the external storage space (its storage address on the memory component of the upper-level operation node) and the tensor replacement table, and that storage address is assigned to the head instruction.
  • in this case, step S15 may include: when the operand is loaded from the external storage space into the static memory segment, determining the data address information table (tensor replacement table) to be updated according to the count value of the second counter, and updating that table according to the storage address of the operand in the external storage space and its storage address in the static memory segment.
  • the external storage space may be the memory component of the upper computing node of the current computing node.
  • a tensor replacement table 1 and a tensor replacement table 2 can be set in the operation node.
  • the tensor replacement table 1 is used to record the address correspondences of the operands stored at the left end of the static memory segment, and the tensor replacement table 2 is used to record the address correspondences of the operands stored at the right end of the static memory segment.
  • SD obtains input instruction 1 from SQ and serially decomposes it to obtain multiple serial sub-instructions 1, which share operand 1, so SD needs to allocate memory space for operand 1 in the static memory segment.
  • SD looks up in tensor replacement table 1 and tensor replacement table 2 whether shared operand 1 has already been stored in the static memory segment. If it has not, and assuming the count value of counter 2 is 0 at this time, SD can allocate memory space for operand 1 starting from the left end shown in Figure 15, and record in tensor replacement table 1 the correspondence between the storage address of shared operand 1 in the memory component of the upper-level operation node and its storage address in the local memory component.
  • SD then obtains input instruction 2 from SQ and serially decomposes it to obtain multiple serial sub-instructions 2, which share operand 2, so SD needs to allocate memory space for operand 2 in the static memory segment. SD looks up in tensor replacement table 1 and tensor replacement table 2 whether shared operand 2 has already been stored in the static memory segment. If it has not, and assuming the count value of counter 2 is 1, SD can allocate memory space for operand 2 starting from the right end shown in Figure 15, and record in tensor replacement table 2 the correspondence between the storage address of shared operand 2 in the memory component of the upper-level operation node and its storage address in the local memory component.
  • after recording an address correspondence in the tensor replacement table, the SD can start the timer corresponding to that address correspondence; when the timer reaches the aging time, the SD sets the address correspondence to invalid.
  • for example, timer 1 can be set for the address correspondence of shared operand 1 and timer 2 for the address correspondence of shared operand 2. Before timer 1 and timer 2 reach the aging time, both address correspondences are valid; after timer 1 reaches the aging time, the address correspondence of shared operand 1 is set to invalid, and after timer 2 reaches the aging time, the address correspondence of shared operand 2 is set to invalid.
  • SD then obtains input instruction 3 from SQ and serially decomposes it to obtain multiple serial sub-instructions 3, which share operand 3, so SD needs to allocate memory space for operand 3 in the static memory segment. SD looks up in tensor replacement table 1 and tensor replacement table 2 whether shared operand 3 has already been saved in the static memory segment; if it finds that a part of the already saved shared operand 1 is shared operand 3, it directly binds the storage address of the part of shared operand 1 corresponding to shared operand 3 to the head instruction.
  • note that only when the timer 1 corresponding to the address correspondence of shared operand 1 has not reached the aging time, and the storage address on the external storage space in that address correspondence contains the storage address of shared operand 3 in the external storage space, is the result returned that shared operand 3 has been stored in the static memory segment; otherwise this result is not returned. A combined sketch of this lookup-then-allocate flow is given below.
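  • the lookup-then-allocate flow of SD for shared operands could be summarized, under the assumptions of the earlier sketches (StaticSegment, AgingEntry), as follows; load_from_parent stands in for the DMAC/DMA load and is an assumption made for this sketch.

```python
# Sketch: query both tensor replacement tables, reuse on a hit, otherwise
# allocate from the end chosen by counter 2, load, and record a timed entry.
def lookup(table, ext_start, size):
    for entry in table:
        if entry.is_valid() and ext_start >= entry.ext_start and \
           ext_start + size <= entry.ext_start + entry.size:
            return entry.local_start + (ext_start - entry.ext_start)
    return None

def bind_shared_operand(segment, tables, ext_start, size, aging_time,
                        load_from_parent):
    for table in tables:                       # tensor replacement tables 1 and 2
        local = lookup(table, ext_start, size)
        if local is not None:
            return local                       # reuse: no load needed
    end = segment.counter2                     # 0 -> left end (table 1), 1 -> right end (table 2)
    local = segment.alloc_shared(size)         # also toggles counter 2
    load_from_parent(ext_start, local, size)   # DMA loads the shared operand
    tables[end].append(AgingEntry(ext_start, local, size, aging_time))
    return local
```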
  • the complexity of memory management can be reduced, and the utilization rate of the memory space can be improved while saving bandwidth resources.
  • multiple tensor replacement tables may be set to record the operands stored in different sub-memory blocks of the cyclic memory segment.
  • before DD allocates memory space for operands on the cyclic memory segment, it can first look up, in the multiple tensor replacement tables corresponding to the cyclic memory segment, whether the operands have already been saved in the cyclic memory segment of the local memory component.
  • if they have, the storage address of the operand on the local memory component is determined according to the tensor replacement table and assigned to the instruction for obtaining the operand; if they have not been saved in the cyclic memory segment of the local memory component, the data is loaded.
  • similarly, a validity flag for each address correspondence can be recorded in the tensor replacement table, and a timer can be started after an address correspondence is recorded; when the timer reaches the aging time, the address correspondence is set to invalid. Only when an address correspondence in the tensor replacement table is valid, and the storage address on the external storage space in that correspondence contains the storage address in the external storage space of the operand to be loaded, is the result returned that the operand has been stored in the cyclic memory segment of the local memory component.
  • in this case, step S15 may include: when the operand is loaded from the external storage space into any one of the multiple sub-memory blocks of the cyclic memory segment, DD updates the data address information table (tensor replacement table) corresponding to that sub-memory block according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
  • that is, a tensor replacement table is set for each sub-memory block.
  • for example, tensor replacement table 4, tensor replacement table 5, and tensor replacement table 6 can be set to correspond to cyclic memory segment 0, cyclic memory segment 1, and cyclic memory segment 2, respectively.
  • when an operand is loaded into cyclic memory segment 0, tensor replacement table 4 is updated according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
  • in one possible implementation, a third counter is provided in the processor and the cyclic memory segment includes multiple sub-memory blocks; the processor allocating memory space in the cyclic memory segment for the other operands of the multiple sub-instructions includes: the processor allocates memory space for the other operands from the sub-memory block corresponding to the count value of the third counter in the cyclic memory segment.
  • the circular memory segment is divided into multiple sub-memory blocks, such as three sub-memory blocks.
  • the memory capacities of the three sub-memory blocks may be the same or different, and the present disclosure does not limit this.
  • the processor can be provided with a counter 3. After DD obtains the serial sub-instruction from SQ, for the main instruction in the serial sub-instruction, it can allocate the memory space of the cyclic memory segment in the order of the main instruction and the count value of the counter 3.
  • before allocating memory space, DD can look up, in the multiple tensor replacement tables corresponding to the cyclic memory segment, whether the operand has already been stored in the cyclic memory segment of the local memory component; if it has, the storage address of the operand on the local memory component is assigned to the instruction for obtaining the operand.
  • for example, when the count value of counter 3 is 0, DD allocates memory space for the operands of main instruction 1 in cyclic memory segment 0. It then obtains a main instruction 2 and looks up in tensor replacement table 4, tensor replacement table 5, and tensor replacement table 6 whether the operands of main instruction 2 have been saved in the cyclic memory segment of the local memory component; if they have not, and the count value of counter 3 is now 1, DD allocates memory space for the operands of main instruction 2 in cyclic memory segment 1. It then obtains a main instruction 3 and looks up in tensor replacement table 4, tensor replacement table 5, and tensor replacement table 6 whether the operands of main instruction 3 have been saved in the cyclic memory segment; if they have, DD assigns the storage address of the operand on the local memory component to the instruction for obtaining that operand.
  • in this way, when main instruction 3 is executed, the operands can be obtained directly from the cyclic memory segment of the local memory component, without the DMAC needing to load them from the upper-level operation node onto the cyclic memory segment of the local memory component.
  • the complexity of memory management can be reduced, and the utilization rate of the memory space can be improved while saving bandwidth resources.
  • the method for obtaining operands of the present disclosure supports data reuse in the form of "pipeline forwarding".
  • that is, the next instruction can use the result of the previous instruction as input, so that there is no bubble barrier between the two instructions when they execute in the pipeline.
  • after the tensor replacement table is added, it records the address at which the output operand B of the first instruction is stored on the local memory component, and this output operand is ready once the EX stage ends; accordingly, after the input operand address of the second instruction is replaced with that address on the local memory component, its LD stage becomes an empty bubble, and EX, now the first effective stage of the second instruction, is scheduled directly at the beat at which the data becomes ready.
  • the pipeline is:
  • the present disclosure also provides an instruction set architecture, in which instructions in the instruction set architecture can be decomposed during execution.
  • the corresponding input instruction is also decomposed into multiple sub-instructions, and the execution of the sub-instructions can complete the operation of part of the operands of the input instruction.
  • the processor is further configured to generate corresponding multiple control signals according to multiple sub-commands, and send the multiple control signals to the memory controller; the memory controller according to each control signal The data path is controlled, and the operand of the sub-instruction corresponding to the control signal is loaded from the memory component of the upper-level computing node to the local memory component.
  • the processor therein can receive input instructions sent by the upper-level arithmetic node or input instructions input by other means (such as user programming).
  • the input instruction may include an operator and operand parameters; an operand parameter is a parameter pointing to the operand of the input instruction and may include a global parameter and a local parameter, where the global parameter is a parameter indicating the size of the first operand corresponding to the input instruction, and the local parameter is a parameter indicating the starting position of the second operand of the input instruction within the first operand and the size of the second operand.
  • the second operand can be part or all of the data in the first operand.
  • when the input instruction is executed, processing of the second operand is carried out, and this processing is the processing corresponding to the operation of the input instruction.
  • the memory controller is configured to load, according to the operand parameters, the second operands of the first operand corresponding to the multiple sub-instructions from the memory component of the upper-level operation node of said any operation node into the local memory component.
  • the instruction used by the computing device of the present disclosure can be a triple <O, P, G>, where O represents an operator, P represents a finite set of operands, and G represents a granularity index.
  • a concrete form is "O, P[N][n1][n2]", where N is a positive integer representing a global parameter (multiple different N can be set according to the tensor dimensions), and n1 and n2 are natural numbers smaller than N representing local parameters: n1 is the starting position at which the operand is operated on, and n2 is the size.
  • executing this instruction performs operation O on elements n1 to n1+n2 of operand P; likewise, multiple different n1 and n2 can be set for different tensor dimensions.
  • the format of the input instructions received by each layer of the computing device of the present disclosure is the same, so instruction decomposition, execution of the operation corresponding to an instruction, and so on can be completed automatically.
  • computing nodes of different layers and computers of different sizes all have the same programming interface and instruction set architecture: they can execute programs of the same format and load data between layers implicitly, which simplifies user programming, and extending the computing device or porting programs between different computing devices is very easy.
  • any operation node can decompose an input instruction to obtain multiple sub-instructions.
  • the input instruction and the multiple sub-instructions have the same format, and at least some of the sub-instructions have the same operator as the input instruction.
  • after any operation node receives an input instruction, it can decompose the input instruction according to the number of operation nodes in the next layer to obtain multiple parallel sub-instructions; executing one parallel sub-instruction completes the operation on part of the operand corresponding to the input instruction, and executing all parallel sub-instructions completes the operation corresponding to the input instruction.
  • the first-layer computing node can decompose the received input instruction according to the number of next-layer computing nodes to obtain multiple parallel sub-instructions; as shown in Figure 1, the first-layer computing node includes three next-layer computing nodes, so the above input instruction can be decomposed into at least three parallel sub-instructions:
  • C1, C2 and C3 have the same format as C.
  • the first-layer operation node can send the decomposed parallel sub-instructions to the next-layer operation nodes; the next-layer operation nodes receive the parallel sub-instructions C1, C2 and C3 and can perform a similar decomposition, and so on until the last-layer operation nodes.
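The sketch below shows one possible (even-split) policy for turning an instruction of this format into one parallel sub-instruction per next-layer operation node, each covering a disjoint part of the operand; the function name and the split policy are assumptions made for illustration only.

```python
def parallel_decompose(op, N, num_children):
    """Split "op, P[N][0][N]" into one same-format sub-instruction per child node."""
    step = (N + num_children - 1) // num_children
    subs = []
    for i in range(num_children):
        n1 = i * step
        n2 = min(step, N - n1)
        if n2 > 0:
            subs.append(f"{op}, P[{N}][{n1}][{n2}]")
    return subs

# Three next-layer nodes, as in the example above:
print(parallel_decompose("ADD", 9, 3))
# ['ADD, P[9][0][3]', 'ADD, P[9][3][3]', 'ADD, P[9][6][3]']
```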
  • after receiving an input instruction sent by the upper-layer operation node, any (current) operation node can, according to the operand parameters of the input instruction, read the corresponding operand from the memory component of the upper-layer operation node and store it in the memory component of the current operation node.
  • after any operation node has executed the input instruction and obtained the operation result, it can also write the operation result back into the memory component of the upper-layer operation node.
  • the processor of the current operation node can send a control signal to the memory controller according to the operand parameters of the input instruction, and the memory controller can, according to the control signal, control the data path connecting the memory component of the current operation node and the memory component of the upper-layer operation node so as to load the operand of the input instruction into the memory component of the current operation node.
  • the memory controller of any computing node includes a first memory controller and a second memory controller; the first memory controller is connected to the data path through the second memory controller (for example a DMA, Direct Memory Access), and the first memory controller can be a DMAC (Direct Memory Access Controller); the first memory controller can generate load instructions according to the control signal and send them to the second memory controller, and the second memory controller controls the data path according to the load instruction so as to load the data.
  • the first memory controller can be implemented by a hardware circuit or a software program, which is not limited in the present disclosure.
  • the first memory controller can determine the base address, starting offset, number of data items to load and jump offset according to the control signal, generate a load instruction from these parameters, and also set the number of load loops according to the dimensions of the operand.
  • the base address can be the starting address at which the original operand is stored in the memory component;
  • the starting offset can be the position within the original operand at which the operand to be read begins, and it can be determined from the starting position in the local parameter;
  • the number of loaded data items can be the number of operand elements loaded starting from the starting offset, and it can be determined from the size in the local parameter;
  • the jump offset indicates the offset between the starting position, within the original operand, of the next portion to be read and the starting position of the previously read portion; in other words, the jump offset is the starting offset of the next read relative to the starting offset of the previous read, and it can be determined from the global parameters or the local parameters.
  • for example, the starting position can be used as the starting offset, the size in the local parameter can be used as the amount of data loaded at one time, and the size in the local parameter can also be used as the jump offset.
  • the start address of a read can be determined from the base address and the starting offset, and the end address of that read from the number of loaded data items and the start address;
  • the start address of the next portion to be read can be determined from the previous start address and the jump offset, and its end address again from the number of loaded data items and that start address; this process repeats until the configured number of load loops is reached.
  • here, "one read" and "this read" mean that loading the same operand may take one or more passes, each pass reading part of that operand, and the terms refer to one of those passes.
  • in other words, reading an operand may require looping over multiple passes; the first memory controller can determine the start and end addresses of each pass from the base address, the starting offset, the number of loaded data items and the jump offset: for each pass, the start address is obtained from the start address of the previous pass and the jump offset, and the end address from the start address of this pass and the number of loaded data items (together with the data format).
  • the jump offset can be determined from the number of data items skipped and the data format.
  • continuing with the example above, the processor can generate the control signals "Load A[N][0][N/3], A'" and "Load B[N][0][N/3], B'" according to instruction C1, where A' and B' are the memory spaces allocated by the processor on the memory component of the second-layer computing node.
  • the first memory controller can, according to the control signal, set the starting offset to 0 and the number of loaded data items to N/3; since operand A is a one-dimensional vector, there is no need to set a jump offset or a number of load loops.
  • a load instruction for operand B can be generated in the same way to load its data.
  • as another example, suppose the operand P is a matrix of M rows and N columns and the control signal is "Load P[M,N][0,0][M,N/2], P'"; the first memory controller can set the starting offsets in the row and column directions to 0, the number of loaded data items to N/2, the jump offset to N, and the number of loops to M.
  • N/2 columns of data are read starting from the first row and first column, then the controller jumps to the second row, first column and reads another N/2 columns of data, and so on;
  • the data can be loaded in M loops.
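Under the assumption of a row-major, element-indexed layout, the following sketch shows how the base address, starting offset, load count, jump offset and loop count described above could be expanded into per-pass address ranges; `generate_reads` and its parameters are illustrative names, not the actual controller interface.

```python
def generate_reads(base, start_offset, count, jump, loops):
    """Yield a (start, end) element range for each load pass."""
    start = base + start_offset
    for _ in range(loops):
        yield start, start + count   # one contiguous pass of `count` elements
        start += jump                # jump to the start of the next pass

# The matrix example above: load the first N/2 columns of every row of an
# M x N row-major matrix (start offset 0, count N/2, jump N, M loops).
M, N = 4, 8
for rng in generate_reads(base=0, start_offset=0, count=N // 2, jump=N, loops=M):
    print(rng)   # (0, 4), (8, 12), (16, 20), (24, 28)
```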
  • the SD in the processor of an operation node of the i-th layer obtains an input instruction from IQ whose operand is P[M,N][0,0][M,N/2]; SD determines that storing the operand P[M,N][0,0][M,N/2] requires more memory than the capacity of the memory component, so the input instruction must be serially decomposed.
  • the decomposition granularity is determined to be M and N/4, that is, the operands of the serial sub-instructions are P[M,N][0,0][M,N/4] and P[M,N][0,(N/4)+1][M,N/4].
  • SD outputs the serial sub-instructions to SQ.
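As an illustrative-only model of this serial decomposition, the sketch below simply halves the column granularity until one sub-operand fits into the local memory component and then emits same-format sub-instructions over 0-based column ranges; the function, its parameters and the toy capacity figures are assumptions, not the disclosed SD implementation.

```python
def serial_decompose(op, M, N, col_start, col_extent, bytes_per_elem, capacity):
    """Halve the column granularity until one piece fits, then emit sub-instructions."""
    n2 = col_extent
    while M * n2 * bytes_per_elem > capacity and n2 > 1:
        n2 //= 2                               # shrink the decomposition granularity
    subs, c = [], col_start
    while c < col_start + col_extent:
        size = min(n2, col_start + col_extent - c)
        subs.append(f"{op}, P[{M},{N}][0,{c}][{M},{size}]")
        c += size
    return subs

# Matching the example: the first N/2 columns are split into two N/4-wide pieces.
M, N = 4, 8
print(serial_decompose("OP", M, N, 0, N // 2,
                       bytes_per_elem=4, capacity=M * (N // 4) * 4))
# ['OP, P[4,8][0,0][4,2]', 'OP, P[4,8][0,2][4,2]']
```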
  • DD gets the serial sub-instructions from SQ.
  • DD can allocate memory space for the operands of a serial sub-instruction and bind the address (local address) of the allocated memory space to the operand-fetching part of the serial sub-instruction; that is, DD can generate control signals such as:
  • the first memory controller can set the starting offsets in the row and column directions to 0, the number of loaded data items to N/4, the jump offset to N, and the number of loops to M. As shown in Figure 9, N/4 columns of data are read starting from the first row, first column and written to the location P1' in the local memory component, then the controller jumps to the second row, first column and reads another N/4 columns of data, and so on; the data loading is completed after M loops.
  • the first memory controller can generate a load instruction from the determined base address, starting offset, number of loaded data items and jump offset, and send the load instruction to the second memory controller, which reads the operand in the manner described above according to the load instruction and writes it into the local memory component.
  • for the second serial sub-instruction, the first memory controller can set the row starting offset to 0, the column starting offset to (N/4)+1, the number of loaded data items to N/4, the jump offset to N, and the number of loops to M.
  • N/4 columns of data are read starting from column (N/4)+1 of the first row and written to the location P1' in the local memory component, then the controller jumps to column (N/4)+1 of the second row and reads another N/4 columns of data, and so on; the data can be loaded in M loops.
  • the memory component of any operation node includes a static memory segment and a dynamic memory segment; if the operands of the input instruction include a shared operand and other operands, the serial decomposer serially decomposes the input instruction into serial sub-instructions according to the relationship between the memory capacity required by the shared operand and the remaining capacity of the static memory segment, and the relationship between the memory capacity required by the other operands and the capacity of the dynamic memory segment.
  • the shared operand is an operand used in common by the serial sub-instructions, the other operands are the data in the operands of the input instruction other than the shared operand, and the remaining capacity of the static memory segment refers to the unused capacity in the static memory segment.
  • for some operations in machine learning, the parts obtained by decomposing the operation share some of the operands; in the present disclosure this shared part is called the shared operand.
  • taking matrix multiplication as an example, suppose the input instruction multiplies matrices X and Y; if only matrix X is decomposed, the serial sub-instructions obtained by serially decomposing the input instruction all need to use operand Y, so operand Y is the shared operand.
  • when performing serial decomposition, the serial decomposer SD of the present disclosure can generate a hint instruction ("load") that specifies that the shared operand is to be loaded into the static memory segment; DD treats the hint instruction as an ordinary serial sub-instruction that only needs to load data into the static memory segment, without execution, reduction or write-back.
  • DD sends a first control signal to the first memory controller according to the hint instruction so that the shared operand is loaded into the static memory segment, which avoids frequent data accesses and saves bandwidth resources.
  • for the other operands, DD can generate a second control signal and send it to the first memory controller, and the first memory controller controls the second memory controller according to this control signal to load the other operands into the dynamic memory segment.
  • the process of loading the shared operand and other operands by the memory controller can refer to the process described above, and will not be repeated.
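A toy allocator illustrating the division just described: the shared operand is placed in the static memory segment and loaded only once, while the other operands use the dynamic memory segment; the class, method names and capacities are assumptions made purely for illustration.

```python
class MemoryComponent:
    def __init__(self, static_size, dynamic_size):
        self.static_free = static_size     # static memory segment (shared operands)
        self.dynamic_free = dynamic_size   # dynamic memory segment (other operands)
        self.static, self.dynamic = {}, {}

    def alloc_shared(self, name, size):
        if size > self.static_free:
            raise MemoryError("shared operand must be decomposed further")
        self.static_free -= size
        self.static[name] = size           # loaded once via the "load" hint instruction

    def alloc_other(self, name, size):
        if size > self.dynamic_free:
            raise MemoryError("other operands must be decomposed further")
        self.dynamic_free -= size
        self.dynamic[name] = size          # loaded per serial sub-instruction

mem = MemoryComponent(static_size=1024, dynamic_size=2048)
mem.alloc_shared("Y", 512)        # e.g. the shared matrix Y in the example above
mem.alloc_other("X_part0", 1024)  # a decomposed slice of X
```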
  • the format of the input instructions received by each layer of the computing device of the present disclosure is the same, so instruction decomposition, execution of the operation corresponding to an instruction, and so on can be completed automatically.
  • computing nodes of different layers and computers of different sizes all have the same programming interface and instruction set architecture: they can execute programs of the same format and load data between layers implicitly, which simplifies user programming, and extending the computing device or porting programs between different computing devices is very easy.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of the units/modules in the foregoing embodiment is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated units/modules can be implemented in the form of hardware or software program modules.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the memory component may be any suitable magnetic or magneto-optical storage medium, for example RRAM (Resistive Random Access Memory), DRAM (Dynamic Random Access Memory), SRAM (Static Random-Access Memory), EDRAM (Enhanced Dynamic Random Access Memory), HBM (High-Bandwidth Memory), HMC (Hybrid Memory Cube), and so on.
  • if the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.

Abstract

A computing device. The computing device may include a processor, a memory controller, a storage unit, and multiple computing nodes, wherein the processor is configured to receive an input instruction, the memory controller is configured to load operands into the storage unit, and the computing nodes are configured to execute the input instruction according to the input instruction and the operands so that the processing of the operands specified by the input instruction is carried out. The computing device can improve computational efficiency.

Description

运算装置
本申请主张2019年4月27日提交的中国专利申请号为201910347027.0的优先权、以及2019年6月21日提交的中国专利申请号为201910544723.0、201910544726.4、201910545271.8、201910545270.3、201910545272.2的优先权,其全部内容通过引用包含于此。
技术领域
本公开涉及人工智能技术领域,尤其涉及一种运算装置。
背景技术
在人工智能技术领域,神经网络算法是最近非常流行的一种机器学习算法,在各种领域中都取得了非常好的效果,比如图像识别,语音识别,自然语言处理等。随着神经网络算法的发展,算法的复杂度也越来越高,为了提高识别度,模型的规模也在逐渐增大。用GPU和CPU处理起这些大规模的模型,要花费大量的计算时间,并且耗电量很大。
发明内容
本公开提出了一种运算装置,通过多层迭代的方式构建运算装置的层级架构,该运算装置的每个运算节点的结构是相同的,不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构,能够执行相同格式的程序,简化用户编程的复杂性,且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
根据本公开的第一方面,提供了一种运算装置,包括:至少两层运算节点,每一个运算节点包括内存组件、处理器以及下一层运算节点;
对于任意一个运算节点,所述任意一个运算节点中的处理器用于对所述任意一个运算节点的输入指令进行分解,获得并行子指令,并将并行子指令发送给所述任意一个运算节点的下一层运算节点;
所述任意一个运算节点还用于从上一层运算节点的内存组件中加载执行所述并行子指令需要的操作数到所述任意一个运算节点的内存组件,以使所述任意一个运算节点的下一层运算节点根据所述操作数并行执行所述并行子指令。
结合第一方面的第一种可能的实现方式中,所述任意一个运算节点还包括:内存控制器,
所述任意一个运算节点的内存组件与所述任意一个运算节点的上一层运算节点和下一层运算节点的内存组件之间连接有数据通路,所述内存控制器连接所述数据通路,控制所述数据通路将输入指令的操作数从一个内存组件送往另一个内存组件。
结合第一方面的第一种可能的实现方式,在第二种可能的实现方式中,所述处理器包括:串行分解器、并行分解器以及译码器,所述内存控制器连接所述串行分解器和所述译码器;
其中,所述串行分解器用于根据所述任意一个运算节点的内存组件的容量、以及所述输入指令需要的内存容量,对所述输入指令进行串行分解得到串行子指令;
所述译码器用于对所述串行子指令进行译码处理后发送给所述并行分解器、并根据串行子指令向所述内存控制器发送控制信号,所述内存控制器根据所述控制信号从上一层运算节点的内存组件中加载执行所述串行子指令需要的操作数到所述任意一个运算节点的内存组件;
所述并行分解器用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点,以使所述下一层运算节点根据所述操作数执行并行子指令。
结合第一方面的第二种可能的实现方式,在第三种可能的实现方式中,若所述输入指令需要的内存大于所述任意一个运算节点的内存组件的容量,则所述串行分解器根据所述输入指令需要的内存和所述任意一个运算节点的内存组件的容量,对所述输入指令进行串行分解得到串行子指令。
结合第一方面的第一、第二或第三种可能的实现方式,在第四种可能的实现方式中,所述任意一个运算节点的内存组件包括静态内存段以及动态内存段,若所述输入指令的操作数包括共用操作数以及其他操作数,则串行分解器根据所述共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与动态内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令,
其中,所述共用操作数为所述串行子指令共同使用的操作数,其他操作数为所述输入指令的操作数中除了所述共用操作数以外的数据。
结合第一方面的第四种可能的实现方式,在第五种可能的实现方式中,分解得到的串行子指令包括头部指令和主体指令,所述译码器根据所述头部指令向所述内存控制器发送第一控制信号,所述内存控制器根据所述第一控制信号从上一层运算节点的内存组件中加载所述共用操作数到所述静态内存段;
所述译码器根据所述主体指令向所述内存控制器发送第二控制信号,所述内存控制器根据所述第二控制信号从上一层运算节点的内存组件中加载所述其他数据到所述动态内存段。
结合第一方面的第二种可能的实现方式,在第六种可能的实现方式中,并行分解得到的并行子指令对应的操作数之间不存在重叠的部分。
结合第一方面的第一至第六种可能的实现方式种的任意一种,在第七种可能的实现方式中,所述处理器还包括控制单元,所述任意一个运算节点还包括本地处理单元,
所述控制单元的输入端连接所述译码器的输出端,所述控制单元的输出端连接所述本地处理单元的输入端。
结合第一方面的第七种可能的实现方式,在第八种可能的实现方式中,若所述串行子指令存在输出依赖,所述控制单元根据所述串行子指令控制所述本地处理单元对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果;
其中,所述串行子指令存在输出依赖是指,需要对所述串行子指令的运算结果进行归约处理才能得到所述输入指令的运算结果。
结合第一方面的第八种可能的实现方式,在第九种可能的实现方式中,若所述控制单元检测到对所述下一层运算节点的运算结果进行归约处理所需要的资源大于所述本地处理单元的资源上限,则所述控制单元根据所述串行子指令向所述并行分解器发送委托指令,
所述并行分解器根据所述委托指令控制所述下一层运算节点对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果。
通过多层迭代的方式构建运算装置的层级架构,该运算装置的每个运算节点的结构是相同的,不 同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构,能够执行相同格式的程序,层与层之间隐式装载数据,用户无需管理内存空间,简化用户编程的复杂性,且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
根据本公开的第二方面提供了一种运算装置,所述运算装置包括多层运算节点,每一个运算节点包括处理器以及下一层运算节点;
对于任意一个运算节点,所述任意一个运算节点中的所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述任意一个运算节点的输入指令对应的操作;
其中,所述多个阶段包括:操作执行EX,所述下一层运算节点用于以流水线的方式分所述多个阶段执行所述操作执行。
结合第二方面的第一种可能的实现方式中,所述任意一个运算节点还包括:本地处理单元、内存组件、内存控制器,所述处理器包括:流水线控制单元、译码器、归约控制单元,
所述译码器的输入端接收所述输入指令,所述译码器的输出端连接所述内存控制器的输入端,
所述任意一个运算节点的内存组件与所述任意一个运算节点的上一层运算节点和下一层运算节点的内存组件之间连接有数据通路,
所述内存控制器连接所述数据通路,控制所述数据通路将输入指令的操作数从一个内存组件送往另一个内存组件,
所述译码器的输出端还连接下一层运算节点的输入端以及所述归约控制单元的输入端,所述归约控制单元连接所述本地处理单元,
所述流水线控制单元连接所述译码器、所述归约控制单元、所述内存控制器。
结合第二方面的第一种可能的实现方式,在第二种可能的实现方式中,所述任意一个运算节点还包括:流水线锁存器,所述译码器和所述内存控制器之间、所述内存控制器和所述下一层运算节点之间、下一层运算节点和所述本地处理单元之间、以及所述本地处理单元和所述内存控制器之间分别设置有流水线锁存器;
所述流水线控制器通过控制所述流水线锁存器同步所述多个阶段。
结合第二方面的第二种可能的实现方式,在第三种可能的实现方式中,所述多个阶段还包括:指令译码ID、数据加载LD、操作归约RD以及数据写回WB,所述流水线按照指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB的顺序传播,
所述译码器用于指令译码,所述内存控制器用于数据加载:将所述输入指令的操作数加载到所述内存组件,所述归约控制单元用于控制本地处理单元进行操作归约得到所述输入指令的运算结果,所述内存控制器还用于将所述运算结果写回到所述任意一个运算节点的上一层运算节点的内存组件中。
结合第二方面的第二种可能的实现方式,在第四种可能的实现方式中,所述流水线控制器在接收到所述译码器、内存控制器、下一层运算节点以及所述归约控制单元发送的第一反馈信号后,分别向各个所述流水线锁存器发送第一控制信号,各个所述流水线锁存器根据所述第一控制信号更新输出。
结合第二方面的第二种可能的实现方式,在第五种可能的实现方式中,DD在检测到串行子指令存在数据依赖,则DD停止从SQ中获取串行子指令。
结合第二方面的第一至第五种可能的实现方式中的任意一种,在第六种可能的实现方式中,所述处理器还包括串行分解器,所述串行分解器连接所述译码器的输入端,所述串行分解器用于对所述输入指令进行串行分解得到串行子指令;
所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述串行子指令对应的操作。
结合第二方面的第六种可能的实现方式,在第七种可能的实现方式中,所述译码器在检测到当前译码的串行子指令的输入操作数与之前的多条串行子指令的输出操作数不存在重叠时,将当前译码的串行子指令译码后预加载到所述下一层运算节点上。
结合第二方面的第七种可能的实现方式,在第八种可能的实现方式中,所述处理器还包括并行分解器,所述并行分解器的输入端连接所述译码器的输出端,所述并行分解器的输出端连接所述下一层运算节点的输入端,
所述并行分解器用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点。
结合第二方面的第六种可能的实现方式,在第九种可能的实现方式中,所述串行分解器和所述译码器之间设置有子指令队列SQ,所述子指令队列SQ用于暂存所述串行子指令。
根据本公开的第三方面提供了一种运算装置,所述运算装置包括多层运算节点,每一个运算节点包括内存组件、处理器以及下一层运算节点,所述内存组件包括静态内存段和循环内存段,
处理器用于对任意一个运算节点的输入指令进行分解得到多个子指令;
如果所述多个子指令之间存在共用操作数,则所述处理器在所述静态内存段中为所述共用操作数分配内存空间,在所述循环内存段中为所述多个子指令的其他操作数分配内存空间;
其中,所述共用操作数为:所述任意一个运算节点中的下一层运算节点执行所述多个子指令时都要使用的操作数,所述其他操作数为:所述多个子指令的操作数中除了所述共用操作数以外的操作数。
结合第三方面的第一种可能的实现方式中,所述处理器中设置有第一计数器,所述循环内存段包括多段子内存块,
所述处理器在所述循环内存段中为所述多个子指令的其他操作数分配内存空间,包括:
所述处理器从所述循环内存段中与所述第一计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
结合第三方面的第二种可能的实现方式中,所述处理器中设置有第二计数器,
所述处理器在所述静态内存段中为所述共用操作数分配内存空间,包括:
所述处理器从所述静态内存段中的第一起始端开始为所述共用操作数分配内存空间,其中,所述第一起始端为与所述第二计数器的计数值对应的起始端。
结合第三方面的第二种可能的实现方式,在第三种可能的实现方式中,所述处理器包括串行分解器SD,
处理器用于对任意一个运算节点的输入指令进行分解得到多个子指令,包括:
所述SD根据所述输入指令需要的内存容量、所述静态内存段的容量以及所述循环内存段的容量, 对所述输入指令进行串行分解得到串行子指令。
结合第三方面的第二种可能的实现方式,在第四种可能的实现方式中,所述处理器包括串行分解器SD,所述SD根据所述第二计数器的数值确定为所述共用操作数分配内存空间的第一起始端,
所述SD计算从所述第一起始端开始,所述静态内存段剩余的内存容量,所述SD根据所述静态内存段剩余的内存容量以及所述共用操作数需要的内存容量对所述输入指令进行第一串行分解得到第一串行子指令;
所述SD根据所述循环内存段的内存容量以及所述其他操作数需要的内存容量对所述第一串行子指令进行第二串行分解得到所述串行子指令。
结合第三方面的第一种可能的实现方式,在第五种可能的实现方式中,所述处理器还包括译码器DD,所述DD用于对所述多个子指令进行指令译码,
所述DD在对所述多个子指令进行指令译码过程中,从所述循环内存段中与所述第一计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
结合第三方面的第三种可能的实现方式,在第六种可能的实现方式中,所述串行子指令包括头部指令和主体指令,所述头部指令用于加载所述共用操作数,所述头部指令记录了为所述共用操作数分配的内存空间的地址,所述主体指令用于加载所述其他操作数、以及对所述共用操作数和其他操作数进行计算。
结合第三方面的第三种或第六种可能的实现方式,在第七种可能的实现方式中,所述任意一个运算节点中的所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述任意一个运算节点的串行子指令对应的操作;
所述多个阶段包括:指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB,所述流水线按照指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB的顺序传播。
结合第三方面的第七种可能的实现方式,在第八种可能的实现方式中,所述任意一个运算节点还包括:本地处理单元LFU、第二内存控制器DMA,所述处理器包括:译码器DD、归约控制单元RC,
所述译码器DD用于指令译码ID,
所述DMA用于数据加载LD:将所述输入指令的操作数加载到所述内存组件,
所述下一层运算节点用于根据操作数和译码后的指令进行操作执行EX得到执行结果,
所述归约控制单元RC用于控制所述LFU对所述执行结果进行操作归约RD得到所述输入指令的运算结果,
所述DMA还用于将所述运算结果写回到所述任意一个运算节点的上一层运算节点的内存组件中。
结合第三方面的第八种可能的实现方式,在第九种可能的实现方式中,所述循环内存段包括多段子内存块,
在所述流水线传播的过程中,所述DMA、下一层运算节点以及LFU按顺序循环使用所述多段子内存块。
结合第三方面的第九种可能的实现方式,在第十种可能的实现方式中,所述多段子内存块的内存容量大小相同。
根据本公开的第四方面提供了一种操作数的获取方法,所述方法包括:
在数据地址信息表中查找操作数是否已保存在本地内存组件上;
若操作数已保存在本地内存组件上,则根据操作数在外部存储空间上的存储地址和数据地址信息表确定所述操作数在本地内存组件上的存储地址;
将所述操作数在本地内存组件上的存储地址赋值给获取所述操作数的指令。
结合第四方面的第一种可能的实现方式中,所述方法还包括:
若操作数未保存在本地内存组件上,则根据所述操作数的存储地址生成加载操作数的控制信号,所述加载操作数的控制信号用于将所述操作数从所述操作数的存储地址加载到本地内存组件上。
结合第四方面的第二种可能的实现方式中,所述数据地址信息表记录有地址对应关系,所述地址对应关系包括:操作数在本地内存组件上的存储地址和操作数在外部存储空间上的存储地址的对应关系。
结合第四方面的第二种可能的实现方式,在第三种可能的实现方式中在数据地址信息表中查找操作数是否已保存在本地内存组件上,包括:
在所述地址对应关系中包含全部所述操作数在外部存储空间上的存储地址时,确定所述操作数已保存在本地内存组件上。
结合第四方面的第三种可能的实现方式,在第四种可能的实现方式中,根据操作数在外部存储空间上的存储地址和数据地址信息表确定所述操作数在本地内存组件上的存储地址,包括:
将所述地址对应关系中,与所述操作数在外部存储空间上的存储地址对应的本地内存组件上的存储地址,作为所述操作数在本地内存组件上的存储地址。
结合第四方面的第五种可能的实现方式中,所述方法还包括:
当从外部存储空间上加载操作数到本地内存组件时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新所述数据地址信息表。
结合第四方面的第五种可能的实现方式,在第六种可能的实现方式中,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新所述数据地址信息表,包括:
在数据地址信息表中记录加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址的对应关系。
结合第四方面的第五种可能的实现方式,在第七种可能的实现方式中,所述本地内存组件包括:静态内存段,
当从外部存储空间上加载操作数到本地内存组件时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新所述数据地址信息表,包括:
当从外部存储空间上加载操作数到所述静态内存段时,根据第一计数器的计数值确定待更新的数据地址信息表;其中,所述第一计数器的计数值用于表示在静态内存段上的存储位置信息;
根据加载的操作数在外部存储空间上的存储地址和在静态内存段上的存储地址更新所述待更新数据地址信息表。
结合第四方面的第五种可能的实现方式,在第八种可能的实现方式中,所述本地内存组件还包括:循环内存段,所述循环内存段包括多个子内存块,
当从外部存储空间上加载操作数到本地内存组件时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新所述数据地址信息表,包括:
当从外部存储空间上加载操作数到循环内存段上的多个子内存块中的任一子内存块时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新与所述任一子内存块对应的数据地址信息表。
结合第四方面的第二种至第八种可能的实现方式中的任意一种,在第九种可能的实现方式中,所述方法应用于运算装置,所述运算装置包括:多层运算节点,每一个运算节点包括本地内存组件、处理器以及下一层运算节点,
所述外部存储空间为所述运算节点的上一层运算节点的内存组件或者下一层运算节点的内存组件。
根据本公开的第五方面,提供了一种运算装置,所述运算装置包括:多层运算节点,每一个运算节点包括本地内存组件、处理器以及下一层运算节点,
所述处理器要从当前运算节点的上一层运算节点的内存组件中加载操作数到本地内存组件时,在数据地址信息表中查找操作数是否已保存在本地内存组件上;
若操作数已保存在本地内存组件上,则处理器根据操作数在外部存储空间上的存储地址和数据地址信息表确定所述操作数在本地内存组件上的存储地址;并将所述操作数在本地内存组件上的存储地址赋值给获取所述操作数的指令。
结合第五方面的第一种可能的实现方式中,若操作数未保存在本地内存组件上,则处理器根据所述操作数的存储地址生成加载操作数的控制信号,所述加载操作数的控制信号用于将所述操作数从所述操作数的存储地址加载到本地内存组件上。
结合第五方面或者第五方面的第一种可能的实现方式,在第二种可能的实现方式中,所述数据地址信息表记录有地址对应关系,所述地址对应关系包括:操作数在本地内存组件上的存储地址和操作数在外部存储空间上的存储地址的对应关系。
结合第五方面的第二种可能的实现方式,在第三种可能的实现方式中,所述本地内存组件包括静态内存段和循环内存段,
处理器用于对任意一个运算节点的输入指令进行分解得到多个子指令;
如果所述多个子指令之间存在共用操作数,则所述处理器在所述静态内存段中为所述共用操作数分配内存空间,在所述循环内存段中为所述多个子指令的其他操作数分配内存空间;
其中,所述共用操作数为:所述任意一个运算节点中的下一层运算节点执行所述多个子指令时都要使用的操作数,所述其他操作数为:所述多个子指令的操作数中除了所述共用操作数以外的操作数。
结合第五方面的第三种可能的实现方式,在第四种可能的实现方式中,所述处理器内设置有与所述静态内存段对应的至少一个数据地址信息表,以及与所述循环内存段对应的多个数据地址信息表。
结合第五方面的第四种可能的实现方式,在第五种可能的实现方式中,所述处理器在静态内存段中为共用操作数分配内存空间之前,先在与所述静态内存段对应的至少一个数据地址信息表中查找共用操作数是否已保存在本地内存组件的静态内存段上,
若已经保存在了本地内存组件的静态内存段上,则根据共用操作数在上一层运算节点的内存组件上的存储地址和所述与所述静态内存段对应的至少一个数据地址信息表确定所述共用操作数在本地内存组件上的存储地址;
将所述共用操作数在本地内存组件上的存储地址赋值给加载共用操组数的指令。
结合第五方面的第四种可能的实现方式,在第六种可能的实现方式中,所述处理器在循环内存段上为其他操作数分配内存空间之前,先在所述与所述循环内存段对应的多个数据地址信息表中查找其他操作数是否已保存在本地内存组件的循环内存段上,
若已经保存在了本地内存组件的循环内存段上,则根据其他操作数在上一层运算节点的内存组件上的存储地址和所述与所述循环内存段对应的多个数据地址信息表确定所述其他操作数在本地内存组件上的存储地址,
将所述其他操作数在本地内存组件上的存储地址赋值给获取其他操作数的指令;
若未保存在本地内存组件的循环内存段上,则加载数据。
结合第五方面的第五种或第六种可能的实现方式,在第七种可能的实现方式中,当从上一层运算节点的内存组件上加载操作数到所述静态内存段时,所述处理器根据第一计数器的计数值确定待更新的数据地址信息表;其中,第一计数器的计数值用于确定所述静态内存段的两端对应的不同的数据地址信息表;
根据加载的操作数在上一层运算节点的内存组件上的存储地址和在静态内存段上的存储地址更新所述待更新数据地址信息表。
结合第五方面的第五种或第六种可能的实现方式,在第八种可能的实现方式中,当从外部存储空间上加载其他操作数到循环内存段上的多个子内存块中的任一子内存块时,处理器根据加载的其他操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新与所述任一子内存块对应的数据地址信息表。
根据本公开的第六方面,提供了一种操作数的获取装置,包括:
处理器;
用于存储处理器可执行指令的存储器;
其中,所述处理器用于执行指令时实现第四方面的任意一种可能的实现方式的方法。
根据本公开的第七方面,提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时实现第四方面任意一种可能的实现方式的方法。
根据本公开的第八方面,提供了一种运算装置,所述运算装置包括:多层运算节点,任意一个运算节点包括本地内存组件、处理器、下一层运算节点、以及内存控制器,所述处理器连接下一层运算节点和内存控制器;
其中,所述处理器用于接收输入指令,并对输入指令进行分解得到多个子指令,将所述多个子指令发送给所述下一层运算节点;所述内存控制器用于从所述任意一个运算节点的上一层运算节点的内 存组件加载多个子指令对应的第一操作数中的第二操作数到所述本地内存组件;所述下一层运算节点用于根据所述多个子指令的运算符和所述多个子指令的第二操作数执行所述多个子指令;
所述输入指令和多个子指令具有相同的格式。
结合第八方面的第一种可能的实现方式中,所述输入指令和所述多个子指令都包括:运算符、操作数参数,所述操作数参数是指向指令的操作数的参数,所述操作数参数包括全局参数和局部参数,全局参数是表示指令对应的第一操作数的大小的参数,局部参数是表示指令的第二操作数在所述第一操作数中的起始位置和第二操作数的大小的参数;
所述内存控制器用于根据所述操作数参数从所述任意一个运算节点的上一层运算节点的内存组件加载多个子指令对应的第一操作数中的第二操作数到所述本地内存组件。
结合第八方面或者第八方面的第一种可能的实现方式,在第二种可能的实现方式中,
所述本地内存组件与所述任意一个运算节点的上一层运算节点和下一层运算节点的内存组件之间连接有数据通络,所述内存控制器连接所述数据通路。
结合第八方面的第二种可能的实现方式,在第三种可能的实现方式中,
所述处理器还用于根据多个子指令生成对应的多个控制信号,并将多个控制信号发送给内存控制器;
所述内存控制器根据每个控制信号控制所述数据通路,从上一层运算节点的内存组件中加载该控制信号对应的子指令的操作数到本地内存组件。
结合第八方面的第三种可能的实现方式,在第四种可能的实现方式中,
所述内存控制器包括第一内存控制器和第二内存控制器,第一内存控制器通过第二内存控制器连接数据通路,第一内存控制器还用于根据控制信号生成加载指令,将加载指令发送给第二内存控制器,第二内存控制器用于根据加载指令控制数据通路。
结合第八方面的第四种可能的实现方式,在第五种可能的实现方式中,第一内存控制器根据控制信号确定基地址、起始偏移量、加载数据的数量、跳转的偏移量,根据基地址、起始偏移量、加载数据的数量、跳转的偏移量数生成加载指令;
其中,基地址为操作数在内存组件中存储的起始地址,起始偏移量为第二操作数的起始位置相对于第一操作数的起始位置的偏移量,加载数据的数量为从起始偏移量开始加载的操作数的个数,跳转的偏移量为下一个读取数据的起始偏移量相对于上一个读取数据的起始偏移量的偏移量。
结合第八方面的第五种可能的实现方式,在第六种可能的实现方式中,
所述处理器包括串行分解器、译码器以及并行分解器,其中,串行分解器的输入端连接上一层运算节点的处理器中的并行分解器的输出端,串行分解器的输出端连接译码器的输入端,译码器的输出端连接并行分解器的输入端,并行分解器的输出端连接下一层运算节点的输入端。
结合第八方面的第六种可能的实现方式,在第七种可能的实现方式中,串行分解器用于根据所述任意一个运算节点的内存组件的容量、以及所述输入指令需要的内存容量,对所述输入指令进行串行分解得到串行子指令;
译码器用于对所述串行子指令进行译码处理后发送给并行分解器、并根据串行子指令向所述内存控制器发送控制信号,所述内存控制器根据所述控制信号从上一层运算节点的内存组件中加载执行所 述串行子指令需要的操作数到所述任意一个运算节点的内存组件;
并行分解器用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点,以使所述下一层运算节点根据所述操作数执行并行子指令。
结合第八方面的第七种可能的实现方式,在第八种可能的实现方式中,
所述任意一个运算节点的内存组件包括静态内存段以及动态内存段,
分解得到的串行子指令包括头部指令和主体指令,译码器还用于根据所述头部指令向所述内存控制器发送第一控制信号,所述内存控制器根据所述第一控制信号从上一层运算节点的内存组件中加载共用操作数到所述静态内存段;
译码器还用于根据所述主体指令向所述内存控制器发送第二控制信号,所述内存控制器根据所述第二控制信号从上一层运算节点的内存组件中加载其他数据到所述动态内存段。
结合第八方面的第五种可能的实现方式,在第九种可能的实现方式中,
所述第一内存控制器根据局部参数中的起始位置确定起始偏移量,根据局部参数中的大小确定加载数据的数量,根据全部参数或局部参数确定跳转的偏移量。
根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。
附图说明
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本公开的示例性实施例、特征和方面,并且用于解释本公开的原理。
图1示出2012年-2018年期间机器学习计算机的能效增长的曲线图。
图2示出了传统的机器学习计算机的组织形式的一个示例。
图3示出根据本公开一实施例的运算装置的框图。
图4a和图4b分别示出根据本公开一实施例的运算节点的框图。
图5示出根据本公开一实施方式的串行分解的过程的流程图。
图6示出根据本公开一示例的流水线的示意图。
图7示出根据本公开一示例的运算节点的框图。
图8示出根据本公开一示例的运算节点以及流水线运行过程的示意图。
图9示出根据本公开一实施例的操作数的示意图。
图10a示出根据本公开一实施例的运算节点的框图。
图10b示出根据本公开一实施例的流水线的示例。
图11示出根据本公开一实施例的内存组件的划分的示例的示意图。
图12示出根据本公开一实施例的内存组件的划分的示例的示意图。
图13示出根据本公开一实施例的内存组件的示意图。
图14示出根据本公开一实施例的静态内存段的内存空间分配方法的示意图。
图15示出根据本公开一实施例的静态内存段的内存空间分配方法的示意图。
图16示出根据本公开一实施例的应用情景示意图。
图17示出根据本公开一实施例的操作数的获取方法的流程图。
图18示出根据本公开一实施例的操作数的获取方法的流程图。
具体实施方式
以下将参考附图详细说明本公开的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
另外,为了更好的说明本公开,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本公开同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本公开的主旨。
为了便于更好的理解本申请所描述的技术方案,下面先解释本申请实施例所涉及的技术术语:
计算原语:机器学习为计算及访存密集型技术,在不同层次上是高度并行的,本公开将机器学习分解为基于矩阵和向量的运算,例如,将向量乘法矩阵和矩阵乘法向量等操作聚合为矩阵相乘,将矩阵加/减矩阵、矩阵乘法标量和向量基本算数运算等操作聚合为逐元素运算,等等。通过将机器学习进行分解、聚合可以得到七个主要的计算原语,包括:内积(IP,inner production),卷积(CONV),池化(POOL),矩阵相乘(MMM,matrix multiplying matrix),逐元素运算(ELTW,element-wise operation),排序(SORT)和计数(COUNT)。以上计算原语概括了机器学习的主要特征,并且这些计算原语都是可以分解的运算。
可以分解的运算:如果一个运算g(·)满足以下公式(1)
f(X)=g(f(X A),f(X B),...)    (1)
则带有操作数X的f(·)运算称为可以分解的运算,其中,f(·)是目标算子,g(·)是检索算子,X表示f(·)所有的操作数,X A、X B,...表示操作数X的子集,其中,X可以为张量数据。
举例来说,如果f(X)=X×k,其中,k为标量,那么f(X)可以分解为:
f(X)=[X A,X B,...]×k=g(f(X A),f(X B),…),
其中,运算g(·)就是根据分解X的方式,将f(X A)、f(X B)…的运算结果合并成矩阵或向量的形式。
运算的分类:对于上文所述的可以分解的运算,基于分解后的操作数X A、X B…和X之间的关系,可以将运算分为三类:独立运算、输入依赖运算和输出依赖运算。
独立运算:可以是指,分解后的操作数X A、X B...彼此独立且不重合,每个子集X A、X B...可以做局部运算,且只需要组合每个子集做局部运算的结果即可得到最终的运算结果。以向量加法运算作为示例来解释说明独立运算,首先可以将X分成两个操作数(即两个输入向量x,y)用于加法运算,由于x,y可以分为两个子集(x A,x B)和(y A,y B),所以两个子集可以独立完成局部向量加法运算,即z A=x A+y A和z B=x B+y B,最终的运算结果只需要组合每个局部运算的结果即可,即z=[z A,z B]。
输入依赖运算:可以是指,分解后的操作数X A、X B...有重合,分解后的局部运算的操作数之间有重合,即有输入冗余。以一维卷积为示例来解释说明输入依赖运算,使用x、y表示两个操作数,并且 x=[x A,x B],z=[z A,z B]=x*y=[x A,x B]*y,运算仍分成两部分,然而这两部分局部运算的操作数有重叠,还额外需要部分x A和部分x B(分别为x a,x b),即z A=[x A,x b]*y、z B=[x a,x B]*y,每一部分的局部运算可以独立进行,而最终的运算结果只需要组合每个局部运算的结果即可,即z=[z A,z B]。
输出依赖运算:可以是指,最终的运算结果需要对分解后每个局部运算的结果进行归约处理后得到。以内积运算为示例来解释说明输出依赖运算,内积运算(z=x·y)可以分成两部分局部运算,其中,每个部分的局部运算仍然执行内积运算z A=x A·y A和z B=x B·y B,但要获得最终的运算结果,则需要对每个局部运算的结果进行求和,即z=z A+z B。因此,g(·)为求和操作,g(·)=sum(·)。需要注意的是,有些运算在分解后既可以是输入依赖,也可以是输出依赖,具体的依赖性与分解方式有关。
在一种可能的实现方式中,可以将上述计算原语划分为三类,但是,需要注意的是,不同的分解方式会导致依赖性的不同,具体可以参见如下表1所示。
表1计算原语分析
计算原语 分解方式 依赖性 g(·) 数据冗余
IP 长度 输出依赖 相加  
CONV 特征 输出依赖 相加  
CONV N维度(批量) 输入依赖   权值
CONV H或者W维度(空间) 输入依赖   权值,重合
POOL 特征 独立    
POOL H或者W维度(空间) 输入依赖   重合
MMM 左侧,垂直 输出依赖 相加  
MMM 右侧,垂直 输入依赖   左矩阵
ELTW 任意 独立    
SORT 任意 输出依赖 合并  
COUNT 任意 输出依赖 相加  
其中,IP的分解方式中的长度可以是指对向量的长度方向进行分解。卷积操作的操作数可以为采用NHWC(batch,height,width,channels)表示的张量数据,在特征方向分解可以是指在C维度方向进行分解,POOL操作对操作数在特征方向分解也是同样的含义,卷积操作在N维度方向分解存在输入依赖,输入冗余为权值,也就是卷积核,在空间上进行分解也存在输入依赖,输入冗余除了权值还包括分解后的两个张量数据的重合。MMM的分解方式中的左侧、右侧是指对MMM的左侧操作数或者右侧操作数进行分解,垂直可以是指在矩阵的垂直方向进行分解。ELTW操作对操作数的任意分解方式都是独立的,SORT和COUNT操作对操作数的任意分解方式都存在输出依赖。
根据上述分析可知,机器学习的计算原语都是可以分解的运算,采用本公开的运算装置进行机器学习技术的运算时,可以根据实际的需求对计算原语进行分解后运算。
输入指令:可以是描述了机器学习的操作的指令,机器学习的操作可以由上文中的计算原语或者由计算原语组成,输入指令可以包括操作数和操作符等。
共用操作数:一个运算被分解后的多个子运算之间共同使用的操作数为共用操作数,或者说,一条输入指令被分解为多个子指令后,多个子指令共同使用的操作数。
机器学习广泛应用于图像识别、语音识别、面部认知、视频分析、广告推荐和游戏等领域。近年来,许多不同规模的专用机器学习计算机已经部署在了嵌入式设备、服务器和数据中心中。目前,大多数机器学习计算机的架构仍然关注优化性能和能效,如图1所示为2012年-2018年期间,机器学习加速器使得机器学习计算机的能效以惊人的速度增长。
图2示出了传统的机器学习计算机的组织形式的一个示例。传统的机器学习计算机往往有许多异构并行组件以分层方式组织,例如图2中所示的CPU(Central Processing Unit,中央处理器)和GPU(Graphics Processing Unit,图形处理器)的异构组织形式,包含2个CPU和8个GPU,GPU作为运算单元。各层具体结构是不同的,存储方式和控制方式都有区别,导致每一层可能有不同的编程接口,编程复杂,代码量很大。对于图2所示的示例,编程多个GPU需要基于MPI(Message Passing Interface,消息通信接口)或NCCL(Nvidia Collective multi-GPU Communication Library)的手动工作,编程单个GPU芯片需要使用CUDA(Compute Unified Device Architecture,统一计算设备架构)语言来操纵数千个GPU线程;为CPU编程需要通过C/C++和并行API(Application Programming Interface,应用程序编程接口)编写包含数十条CPU线程的并行程序。
另外,单个GPU内的软件堆栈也很复杂,其中,软件堆栈包括CUDA PTX(Parallel Thread Execution)和微代码,CUDA PTX用于编程GPU中的网格/块/线程,微代码用于编程流处理器。
由于以上编程复杂以及软件堆栈开发难的问题,导致现有的机器学习计算机在扩展和程序移植上存在很大的困难。
为了解决上述技术问题,本公开提供了一种运算装置,该运算装置在每一层上提供给用户的编程接口和指令集架构是相同的:不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构,能够执行相同格式的程序,操作数存储于最上层,其它层隐式装载数据,用户无需管理内存空间,简化用户编程的复杂性,且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
本公开一实施方式的运算装置可以包括:多层(至少两层)运算节点,每一个运算节点包括内存组件、处理器以及下一层运算节点。
图3示出根据本公开一实施例的运算装置的框图。如图3所示,运算装置的第一层可以为一个运算节点,该运算节点可以包括处理器、内存组件以及下一层(第二层)运算节点,第二层运算节点可以有多个,具体的数量本公开不作限定。如图3所示,第二层每个运算节点内也可以包括:处理器、内存组件以及下一层(第三层)运算节点。同样的,第i层每个运算节点内可以包括:处理器、内存组件以及第i+1层运算节点,其中,i为自然数。
其中,处理器可以以硬件的形式实现,例如可以是数字电路,模拟电路等等;硬件结构的物理实现包括但不局限于晶体管,忆阻器等等,处理器也可以通过软件的方式实现,本公开对此不作限定。内存组件可以为随机存储器(RAM),只读存储器(ROM),以及高速缓存(CACHE)等,本公开内存组件的具体形式不作限定。
需要说明的是,尽管附图3中只画出了第一层运算节点中包括的第二层运算节点中的一个运算节点的展开结构(图3示出的第二层),可以理解的是图3仅仅是示意图,其他第二层运算节点的展开结构中同样包括处理器、内存组件以及第三层运算节点,图3为了简化没有示出其他第二层运算节点的 展开结构,第i层运算节点同样也是如此。其中,不同的第i层运算节点中包括的第i+1层运算节点的个数可能相同,也可能不同,本公开对此不作限定。
采用本公开的运算装置,在对机器学习指令进行处理时,对于任意一个运算节点,所述任意一个运算节点中的处理器可以用于对所述任意一个运算节点的输入指令进行分解得到并行子指令,并将并行子指令发送给所述任意一个运算节点的下一层运算节点;所述任意一个运算节点从上一层运算节点的内存组件中加载执行所述并行子指令需要的操作数到所述任意一个运算节点的内存组件,以使所述任意一个运算节点的下一层运算节点根据所述操作数并行执行所述并行子指令。
其中,分解得到的并行子指令是可以并行执行的,每个运算节点可以包括一个或多个下一层运算节点,如果包括多个下一层运算节点,多个下一层运算节点可以独立运行,在一种可能的实现方式中,处理器可以根据下一层运算节点的数量对输入指令进行分解得到并行子指令。
对于可以分解的运算,在采用本公开的运算装置执行运算对应的输入指令时,可以由处理器将运算对应的输入指令和操作数分解后,将分解后的并行子指令以及分解后的操作数分别发送给下一层的运算节点,由下一层的运算节点并行执行。
通过多层迭代的方式构建运算装置的层级架构,该运算装置的每个运算节点的结构是相同的,不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构,能够执行相同格式的程序,层与层之间隐式装载数据,用户无需管理内存空间,简化用户编程的复杂性,且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
在一种可能的实现方式中,处理器对输入指令进行分解可以包括三个阶段:串行分解阶段、(降级)译码阶段和并行分解阶段,因此,处理器可以包括串行分解器、译码器以及并行分解器。
其中,所述串行分解器用于根据所述任意一个运算节点的内存组件的容量、以及所述输入指令需要的内存容量,对所述输入指令进行串行分解得到串行子指令。串行分解可以是指将输入指令分解成多个可以按顺序串行执行的指令。
在一种可能的实现方式中,若所述输入指令需要的内存大于所述任意一个运算节点的内存组件的容量,则所述串行分解器根据所述输入指令需要的内存和所述任意一个运算节点的内存组件的容量,对所述输入指令进行串行分解得到串行子指令;若所述输入指令需要的内存小于或等于所述任意一个运算节点的内存组件的容量,则将所述输入指令发送给译码器,由译码器直接对输入指令进行译码处理后发送给并行分解器。
对于分解后的串行子指令,所述译码器用于对串行子指令进行译码处理后发送给所述并行分解器。所述任意一个运算节点可以从上一层运算节点的内存组件中加载执行所述串行子指令需要的操作数到所述任意一个运算节点的内存组件。在一种可能的实现方式中,所述任意一个运算节点还包括:内存控制器,所述内存控制器连接所述译码器。所述译码器可以根据串行子指令向所述内存控制器发送控制信号,所述内存控制器可以根据所述控制信号从上一层运算节点的内存组件中加载执行所述串行子指令需要的操作数到所述任意一个运算节点的内存组件。内存控制器可以通过硬件电路或者软件程序的方式实现,本公开对此不作限定。
所述并行分解器用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点,以使所述下一层运算节点根据所述操作数 执行并行子指令。
图4a和图4b分别示出根据本公开一实施例的运算节点的框图。如图4a所示,所述处理器可以包括串行分解器SD(Sequential decomposer)、译码器DD(Demotion Decoder,这里的降级可以是指从上一层到下一层运算节点)以及并行分解器PD(Parallel decomposer)。其中,SD的输入端可以连接上一层运算节点的处理器中的PD的输出端,SD的输出端可以连接DD的输入端,DD的输出端可以连接PD的输入端,PD的输出端可以连接下一层运算节点的输入端。
在一种可能的实现方式中,任意一个运算节点的内存组件与所述任意一个运算节点的上一层运算节点和下一层运算节点的内存组件之间连接有数据通路,如图4a所示,内存组件i连接内存组件i-1,内存组件i连接下一层运算节点可以是指连接下一层运算节点的内存组件i+1。内存控制器可以连接数据通路,内存控制器可以根据运算节点中的其他组件发送的控制信号控制所述数据通路将输入指令的操作数从一个内存组件送往另一个内存组件。例如,内存控制器可以根据DD发送的控制信号将输入指令的操作数从上一层运算节点的内存组件加载到本地内存组件,或者,也可以将输入指令的运算结果从本地内存组件写回到上一层运算节点的内存组件。
在一种可能的实现方式中,如图4b所示,SD的输入端可以连接指令队列IQ(Instruction Queue),也就是说,处理器可以先将上一层运算节点的输出指令作为本层运算节点的输入指令加载到指令队列IQ,本层运算节点可以是指处理器所属的运算节点,SD从IQ中获取输入指令,考虑到硬件的限制,SD可以将输入指令分解为多个可以串行执行的串行子指令。通过设置IQ作为SD与上一层运算节点之间的缓冲,可以省去SD与上一层运算节点之间严格的同步执行关系。IQ可以简化电路设计,同时提高执行效率,例如,允许SD和上一层运算节点之间独自异步执行,减少SD等待上一层运算节点发送输入指令的时间等。
其中,输入指令可以是描述了机器学习的操作的指令,机器学习的操作可以由上文中的计算原语组成,输入指令可以包括操作数和操作符等。对输入指令的串行分解可以包括对输入指令的操作数的分解以及对输入指令的分解。在进行串行分解时,为了更有效的利用运算节点的资源,串行分解得到的串行子指令将具有尽可能大的分解粒度,串行分解得到的串行子指令的分解粒度由运算节点的资源以及输入指令需要的资源决定,例如,运算节点的资源可以为运算节点的内存组件的容量,输入指令需要的资源可以是指存储输入指令的操作数需要的内存容量。这里的分解粒度可以指分解后的操作数的维度。
输入指令需要的内存容量可以根据存储输入指令的操作数需要的内存容量、以及存储操作符对操作数进行处理后的中间结果需要的内存容量等确定,在确定输入指令需要的内存容量后,可以判断本层运算节点的内存组件的容量是否满足输入指令需要的内存容量,如果不满足,则可以根据本层运算节点的内存组件的容量以及输入指令需要的内存容量对输入指令进行串行分解得到串行子指令。
以矩阵相乘运算作为示例说明SD的功能,假设输入指令为对矩阵X和Y相乘,SD可以根据矩阵X和矩阵Y的大小确定输入指令需要的内存容量,可以将输入指令需要的内存容量与本层运算节点的内存组件的容量进行比较,如果输入指令需要的内存容量大于本层运算节点的内存组件的容量,则需要对输入指令进行串行分解。具体的过程可以为,对操作数进行分解,从而将输入指令分为多个串行子指令,该多个串行子指令可以串行执行,例如,可以对矩阵X或者矩阵Y进行分解,或者对矩阵X和矩 阵Y都进行分解,以对矩阵X进行分解为例,可以将输入指令串行分解为多个矩阵相乘的串行子指令以及求和的串行子指令,在串行执行完多个矩阵相乘的串行子指令后,根据多个矩阵相乘的串行子指令的运算结果以及求和的串行子指令进行求和得到输入指令的运算结果。需要说明的是,上述对于矩阵相乘的串行分解方式仅仅是本公开为了说明SD的功能的一个示例,不以任何方式限制本公开。
在一种可能的实现方式中,串行分解器根据所述任意一个运算节点的内存组件的容量、以及所述输入指令需要的内存容量,对所述输入指令进行串行分解得到串行子指令,具体可以包括:确定输入指令的操作数的维度的分解优先级,按照分解优先级的顺序依次选择对操作数分解的维度并以二分法方式确定最大分解粒度,直到分解后的操作数需要的内存容量小于或等于本层运算节点的内存组件的容量。
在一种可能的实现方式中,为了提高分解的效率,对于任一选择的对操作数分解的维度,在该维度方向上以二分法方式确定最大分解粒度之前,可以先确定在该维度方向上分解为原子大小之后的操作数需要的内存容量与本层运算节点的内存组件的容量之间的大小关系:如果在该维度方向上分解为原子大小之后的操作数需要的内存容量<本层运算节点的内存组件的容量,则可以在该维度方向上以二分法方式拆分操作数;如果在该维度方向上分解为原子大小之后的操作数需要的内存容量>本层运算节点的内存组件的容量,则可以按照分解优先级在下一个维度方向上重复以上过程;如果在该维度方向上分解为原子大小之后的操作数需要的内存容量=本层运算节点的内存组件的容量,则可以直接确定分解的维度,结束分解操作数的过程。其中,分解为原子大小可以指分解粒度为1。
图5示出根据本公开一实施方式的串行分解的过程的流程图。如图5所示:(1)在步骤S50中,可以先确定输入指令的操作数的维度的分解优先级,在一种可能的实现方式中,可以按照操作数的维度的大小确定分解优先级,维度越大分解优先级越高,优先分解操作数的最大维度,比如说,操作数X为N维张量,维度分别为t1、t2、…ti、…tN,其中,t1<t2<…ti…<tN,其中,i表示不同的维度,i为正整数且i≤N,那么在确定对操作数X的维度的分解优先级时,tN维度最大,分解优先级最高,其次为tN-1…ti…t2、t1。(2)按照分解优先级的顺序选择对操作数分解的维度,将i初始化为N,此时,在步骤S51中,可以判断i=N>0;在步骤S52中,在tN方向上确定分解粒度为1,在步骤S53中,判断在tN方向分解为1后的操作数需要的内存容量与本层运算节点的内存组件的容量的大小关系,若小于,则在tN维度方向上以二分法方式分解操作数,具体过程可以为:步骤S54,确定最小分解粒度min=0,最大分解粒度max=tN;步骤S55,确定在tN方向上分解粒度为[(max-min)/2];步骤S56,判断在tN方向上分解为[(max-min)/2]的操作数需要的内存容量与本层运算节点的内存组件的容量的大小关系,若分解为[(max-min)/2]的操作数需要的内存容量=本层运算节点的内存组件的容量,则可以结束分解的过程,在tN方向上按照分解粒度[(max-min)/2]对操作数进行分解;若分解为[(max-min)/2]的操作数需要的内存容量<本层运算节点的内存组件的容量,则步骤S57设置最小分解粒度min=[(max-min)/2],若分解为[(max-min)/2]的操作数需要的内存容量>本层运算节点的内存组件的容量,则步骤S58设置最大分解粒度max=[(max-min)/2];步骤S59,判断此时最大分解粒度与最小分解粒度的差值是否为1,如果为1,则执行步骤S60,在tN方向上确定分解粒度为min,若不为1,则返回步骤S55继续再确定在tN方向上分解粒度为[(max-min)/2],重复以上S55-S60的过程。(3)回到刚才的步骤S51,若在tN方向分解为1后的操作数需要的内存容量等于本层运算节点的内存组件的容量,则可以确定分解的维度,结束分解操作 数的过程;若在tN方向分解为1维后的操作数需要的内存容量大于本层运算节点的内存组件的容量,则令i=i-1,并返回到步骤S51,判断此时i=N-1>0,则执行步骤S52,重复上述过程,直到确定出分解后的操作数需要的内存容量满足本层运算节点的内存组件的容量。
在分解完操作数后,可以根据分解的操作数对输入指令进行分解,具体可以包括:将输入指令分解为多个串行子指令,多个串行子指令中包括负责分解后的各子集的操作数的运算的串行子指令,若串行分解后存在输出依赖,则多个串行子指令中还可以包括归约指令。
需要说明的是,图5仅仅是对操作数分解的过程的一个示例,不以任何方式限制本公开。可以理解的是,还可以通过其他方式确定分解粒度,比如,分解优先级可以通过其他方式选择,对维度分解的方式也不限于二分法,只要能选择尽可能大的分解粒度即可。
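As a rough, illustrative-only sketch of the Figure-5 procedure described above, the fragment below tries the dimensions in order of decreasing size and, within a dimension, finds by bisection the largest decomposition granularity whose memory requirement still fits the local memory component; the toy cost model `mem_needed` stands in for the real memory-requirement calculation, and all names are assumptions.

```python
def max_granularity(dim_size, fits):
    """Largest g in [1, dim_size] with fits(g) True, assuming fits() is monotone."""
    if not fits(1):
        return None                       # even granularity 1 does not fit
    lo, hi = 1, dim_size                  # fits(lo) holds; fits(hi) may not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if fits(mid) else (lo, mid)
    return hi if fits(hi) else lo

def choose_split(shape, capacity, cost):
    """Pick a dimension (largest first) and the largest granularity that fits."""
    for d in sorted(range(len(shape)), key=lambda k: shape[k], reverse=True):
        g = max_granularity(shape[d], lambda g: cost(shape, d, g) <= capacity)
        if g is not None:
            return d, g
    raise ValueError("operand cannot be decomposed to fit")

def mem_needed(shape, d, g, bytes_per_elem=4):
    """Toy cost model: bytes of one sub-operand when dimension d is cut to g."""
    sub = list(shape)
    sub[d] = g
    total = bytes_per_elem
    for s in sub:
        total *= s
    return total

print(choose_split((1024, 512), capacity=256 * 1024, cost=mem_needed))  # (0, 128)
```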
如图4b所示,在一种可能的实现方式中,本公开的SD的输出端和DD的输入端之间还可以连接有子指令队列SQ(sub-level instruction Queue),SD的输出端连接SQ的输入端,SQ的输出端连接DD的输入端。SQ作为SD与DD之间的缓冲,可以省去SD与DD之间严格的同步执行关系。SQ可以简化电路设计,同时提高执行效率,例如,允许SD独自异步执行,减少DD等待SD对输入指令进行串行分解的时间等。
SD可以将串行分解后的串行子指令输出到SQ中,DD从SQ中获取串行子指令,DD可以根据串行子指令对应的操作数的存储需求为串行子指令分配本层运算节点的内存组件上的内存空间,并将分配的内存空间的地址(本地地址)绑定到串行子指令中获取操作数的指令上,从而实现译码处理。DD还可以根据串行子指令向内存控制器发送控制信号,内存控制器可以根据控制信号将串行子指令对应的操作数加载到为其分配的内存空间中,也就是说根据串行子指令中记载的输入指令对应的操作数的地址从上一层运算节点的内存组件中查找到串行子指令对应的操作数的存储位置,并读取操作数,然后根据本地地址写入到本层运算节点的内存组件中。
如图4b所示,DD对串行子指令进行译码处理后发送给PD,PD可以根据PD连接的下一层运算节点的数量对译码处理后的串行子指令进行并行分解,并行分解可以是指分解后的并行子指令可以并行执行。举例来说,假设串行子指令为对向量A和B相加,其中,A=(A1,A2…Aj,…An),B=(B1,B2…Bj,…Bn),其中,n表示向量A和B中元素的个数,n为正整数,j表示元素的序号,j为正整数且j≤n,那么PD可以根据下一层运算节点的数量将串行子指令并行分解为多个并行子指令,每个并行子指令负责处理向量中部分数据的相加操作,例如,假设n=4,PD连接了4个下一层运算节点,则PD可以对串行子指令并行分解得到4个并行子指令,4个并行子指令分别为对A1和B1、A2和B2、A3和B3以及A4和B4相加,PD可以将4个并行子指令发送给所述下一层运算节点。需要说明的是,以上举例仅仅是为了说明并行分解的示例,不以任何方式限制本公开。
在一种可能的实现方式中,PD在进行并行分解时,可以解除串行子指令的输入依赖,也就是说,并行分解得到的并行子指令对应的操作数之间不存在重叠的部分。例如,根据表1所示,可以选择分解的维度以解除输入依赖,这样可以尽量避免输入冗余,节省内存空间。
在另一种可能的实现方式中,所述任意一个运算节点的内存组件包括静态内存段以及动态内存段,若所述输入指令的操作数包括共用操作数以及其他操作数,则串行分解器根据所述共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与动态 内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令。
其中,所述共用操作数为所述串行子指令共同使用的操作数,其他操作数为所述输入指令的操作数中除了所述共用操作数以外的数据,静态内存段的剩余容量可以是指静态内存段中未被使用的容量。
处理器中的SD、DD和PD是分开的,内存分配在时间上可以很好地错开。具体来说,PD总是在DD之后分配内存空间,但分配的内存空间释放得更早,DD总是在SD之后分配内存空间,但分配的内存空间同样释放得更早。而用于SD进行串行分解的内存空间可能会在多个串行子指令中用到,因此,为SD设置了静态内存段,而其他部分共用内存组件中除了静态内存段外的内存(动态内存段)。
举例来说,对于机器学习中的一些运算,这些运算被分解后的几部分运算之间会共用一部分操作数,对于这部分操作数,本公开称作共用操作数。以矩阵相乘运算作为示例,假设输入指令为对矩阵X和Y相乘,如果仅仅对矩阵X进行分解,那么对输入指令进行串行分解得到的串行子指令需要共同使用操作数Y,操作数Y为共用操作数。对于共用操作数,本公开的串行分解器SD可以在进行串行分解时生成一条提示性指令(“装载”),并在提示性指令中指明将共用操作数装载到静态内存段中,DD将提示性指令作为一条只需要装载数据至静态内存段、而无需执行、规约或写回的普通串行子指令处理,DD根据提示性指令向内存控制器发送第一控制信号以将共用操作数加载到静态内存段,以避免频繁存取数据、节约带宽资源。对于其他操作数,DD可以生成第二控制信号,DD可以将生成的第二控制信号发送给内存控制器,由内存控制器根据控制信号将其他操作数加载到动态内存段中。
因此,串行分解器可以根据所述共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与动态内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令。
如果共用操作数需要的内存容量小于或等于所述静态内存段的剩余容量,且其他操作数需要的内存容量小于或等于动态内存段的容量,则串行分解器可以将所述输入指令发送给译码器,由译码器直接对输入指令进行译码处理后发送给并行分解器。
如果共用操作数需要的内存容量大于所述静态内存段的剩余容量,或者,其他操作数需要的内存容量大于动态内存段的容量,则需要对输入指令进行串行分解。
如果其他操作数需要的内存容量大于动态内存段的容量,而共用操作数需要的内存容量小于或等于所述静态内存段的剩余容量,则串行分解器可以根据动态内存段的容量对其他操作数进行分解,并对输入指令进行串行分解。其中,根据动态内存段的容量对其他操作数进行拆分,并对输入指令进行串行分解的具体过程可以为:确定对其他操作数的维度的分解优先级,按照分解优先级的顺序依次选择对其他操作数分解的维度,并以二分法方式确定最大分解粒度,直到分解后的其他操作数需要的内存容量小于动态内存段的容量。具体的过程,可以参见图5以及上文的相关描述。
如果共用操作数需要的内存容量大于所述静态内存段的剩余容量,其他操作数需要的内存容量小于或等于动态内存段的容量,则串行分解器可以根据静态内存段的剩余容量对共用操作数进行分解,并对输入指令进行串行分解。具体的分解方式同样可以参见图5的过程。
在一种可能的实现方式中,对于存在共用操作数的输入指令,分解得到的串行子指令可以包括头部指令和主体指令,所述译码器可以根据所述头部指令向内存控制器发送控制信号,以从上一层运算节点的内存组件中加载所述共用操作数到所述静态内存段;所述译码器根据所述主体指令向内存控制 器发送控制信号,以从上一层运算节点的内存组件中加载所述其他数据到所述动态内存段。
在一种可能的实现方式中,如图4b所示,所述处理器还可以包括控制单元RC(Reduction Controller,也叫归约控制器),所述任意一个运算节点还可以包括本地处理单元(LFU,local functional units,图4b中的处理单元),所述控制单元RC的输入端连接所述译码器DD的输出端,所述控制单元RC的输出端连接所述本地处理单元LFU的输入端,本地处理单元LFU连接内存组件。其中,所述本地处理单元LFU主要用于对存在输出依赖的串行子指令的运算结果进行归约处理,RC可以用于向LFU发送归约指令。LFU都可以通过硬件电路或者软件程序的方式实现,本公开对此不作限定。
在一种可能的实现方式中,若所述串行子指令存在输出依赖,所述控制单元RC根据所述串行子指令控制所述本地处理单元对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果;其中,所述串行子指令存在输出依赖是指,需要对所述串行子指令的运算结果进行归约处理才能得到所述输入指令的运算结果。
DD会发送串行子指令到RC,RC可以对串行子指令的输出依赖的情况进行检查,若串行子指令存在输出依赖,由RC根据串行子指令向LFU发送归约指令,以使得LFU对下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果。具体的过程可以为,下一层运算节点(中的内存控制器)可以将对并行子指令的运算结果写回到本层运算节点的内存组件中,LFU可以从本层运算节点的内存组件中读取多个串行子指令的运算结果,该多个串行子指令可以是由同一条输入指令串行分解得到的,LFU对多个串行子指令的运算结果进行归约处理可以得到对应的输入指令的运算结果,将运算结果存储在内存组件中,处理器在确定本层输入指令执行完成后,可以向内存控制器发送写回信号,内存控制器可以根据写回信号将运算结果写回到上一层运算节点的内存组件中,直到第一层运算节点完成所有指令的运算。
在一种可能的实现方式中,若所述控制单元RC检测到对所述下一层运算节点的运算结果进行归约处理所需要的资源大于所述本地处理单元的资源上限,则所述控制单元RC根据所述串行子指令向所述并行分解器发送委托指令,所述并行分解器根据所述委托指令控制所述下一层运算节点对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果。
RC可以根据串行子指令评估进行归约处理需要的资源(例如,计算资源等),本地处理单元可以具有预设的资源上限,因此,RC可以判断对所述下一层运算节点的运算结果进行归约处理所需要的资源是否大于本地处理单元的资源上限,若大于,那么LFU的处理速度可能会对整个运算节点的性能产生很大的影响,因此,RC可以根据串行子指令向PD发送委托指令,PD可以根据委托指令控制下一层运算节点对所述下一层运算节点的运算结果进行归约处理得到输入指令的运算结果,通过委托的方式可以提高处理的效率。
在一种可能的实现方式中,处理器还可以包括CMR(Commission Register,委托寄存器),在RC判断对所述下一层运算节点的运算结果进行归约处理所需要的资源大于本地处理单元的资源上限时,RC可以根据串行子指令向CMR写入委托指令,PD可以定期检查CMR中是否存在委托指令,若存在委托指令,则根据委托指令控制下一层运算节点对所述下一层运算节点的运算结果进行归约处理得到输入指令的运算结果。其中的定期检查可以是根据处理的周期检查,处理的周期可以根据下一层运算节点处理完一条串行子指令的时间等确定,本公开对此不作限定。通过设置CMR可以提高整个运算节点 的处理效率。
对于具有父子连接关系的运算节点的运算装置,最高级(0级)运算节点(父节点)译码并将指令发送到其下一层运算节点(子节点),其中,每个下一层运算节点重复译码、发送过程直到叶子运算节点执行为止。叶子运算节点将计算结果返回到其父节点,该操作一直重复到最高级运算节点(父节点)为止。在该过程中,当叶子运算节点的上层运算节点译码指令时,叶子运算节点处于空闲状态,影响运算的效率。
为了解决上述技术问题,本公开提供的运算装置的运算节点中的处理器控制下一层运算节点以流水线的方式分多个阶段执行所述运算节点的输入指令对应的操作。在一种可能的实现方式中,对于任意一个运算节点,所述任意一个运算节点中的所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述任意一个运算节点的输入指令对应的操作;其中,所述多个阶段包括:操作执行EX(Execution),所述下一层运算节点用于以流水线的方式分所述多个阶段执行所述操作执行EX。其中,输入指令可以是描述了机器学习技术的操作的指令,输入指令可以包括操作数和操作符等。
在一种可能的实现方式中,所述多个阶段还可以包括:指令译码ID(Instruction Decoding)、数据加载LD(Loading)、操作归约RD(Reduction)以及数据写回WB(Writing Back),所述流水线按照指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB的顺序传播。需要说明的是,以上实施方式中的多个阶段仅仅是本公开的一个示例,不以任何方式限制本公开,例如,多个阶段还可以包括指令输入等。
对于运算装置中的任意一层运算节点,其中,指令译码ID可以是指对接收到的上一层(或者输入端)发送的输入指令进行译码处理,具体可以包括:根据输入指令对应的操作数的存储需求为输入指令分配本层运算节点的内存组件上的内存空间,并将分配的内存空间的地址(本地地址)绑定到输入指令中写操作数的指令上,等等。数据加载LD可以是指根据输入指令中记载的输入指令对应的读取操作数的地址从上一层运算节点的内存组件中查找到输入指令对应的操作数的存储位置,并读取操作数,然后根据本地地址写入到本层运算节点的内存组件中。操作执行EX可以是指根据操作符以及操作数获得运算结果的过程。如上所述,由于下一层运算节点可能有多个,或者,本层运算节点的内存组件的容量小于存储输入指令需要的数据所需要的内存的容量,因此处理器还可以对输入指令进行分解,有些操作还需要对分解后的指令的运算结果进行归约,即操作归约RD,才能得到输入指令的运算结果。数据写回WB可以是指将本层运算节点的输入指令的运算结果写回到上一层运算节点中。
图6示出根据本公开一示例的流水线的示意图。下面结合图3所示的运算装置以及图6对以流水线的方式分多个阶段执行输入指令对应的操作的过程进行说明。如图3所示,以第i层运算节点为例,第i层运算节点接收上一层(第i-1层)运算节点的输入指令,并对输入指令进行指令译码ID得到译码后的指令,加载运行输入指令需要的数据,然后将译码后的指令发送给下一层(第i+1层)运算节点,由下一层(第i+1层)运算节点根据加载的数据执行译码后的指令以完成操作执行EX阶段。由于下一层(第i+1层)运算节点可能有多个,或者,本层运算节点的内存组件的容量可能小于存储输入指令需要的数据所需要的内存的容量,因此处理器还可以对输入指令进行分解,有些操作还需要对分解后的指令的运算结果进行归约,即操作归约阶段RD,才能得到输入指令的运算结果,如果第i层运算节点不是第一层运算节点,第i层运算节点的处理器还可以将输入指令的运算结果写回到上一层(第i-1 层)运算节点中。需要说明的是,下一层(第i+1层)运算节点也是以流水线的方式分所述多个阶段执行所述操作执行EX,如图6所示,也就是说,下一层(第i+1层)运算节点在接收到本层(第i层)运算节点的处理器发送的指令(作为下一层(第i+1层)运算节点的输入指令)后,可以对输入指令进行指令译码,从本层的内存组件中加载输入指令需要的数据,将译码后的指令发送给下一层(第i+1层)运算节点的下一层(第i+2层)运算节点以进行操作执行阶段……,换言之,下一层(第i+1层)运算节点按照指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB的顺序以流水线的形式执行下一层(第i+1层)运算节点的上一层(第i层)运算节点发送的输入指令对应的操作。
本公开实施例的运算装置通过多层迭代的方式构建运算装置的层级架构,该运算装置的每个运算节点的结构是相同的,不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构,执行相同格式的程序,层与层之间隐式装载数据。运算装置的层级架构使得可以通过迭代的流水线的方式执行输入指令对应的操作,高效利用每一层级的运算节点,提高了运算的效率。
在一种可能的实现方式中,所述任意一个运算节点还可以包括:本地处理单元LFU(local functional units)、内存控制器(例如,可以为DMA,Direct Memory Access),所述处理器可以包括:流水线控制单元、译码器DD(Demotion Decoder,这里的降级可以是指从上一层到下一层运算节点)、归约控制单元RC(Reduction Controller,也叫归约控制器)。
图7示出根据本公开一示例的运算节点的框图。如图7所示,译码器DD的输入端接收输入指令,译码器DD的输出端连接内存控制器的输入端,内存组件可以通过数据通路连接任意一个运算节点的上一层运算节点和下一层运算节点的内存组件,内存控制器连接上述数据通路,如图7所示内存组件i连接内存组件i-1,内存组件i-1可以表示当前运算节点的上一层运算节点的内存组件,内存组件i连接下一层运算节点表示连接下一层运算节点的内存组件,内存控制器连接内存组件之间的数据通路。数据通路在内存控制器的控制下将数据从一个内存组件送往另一个内存组件。译码器DD的输出端还连接下一层运算节点的输入端以及归约控制单元RC的输入端,归约控制单元RC连接本地处理单元LFU。
译码器DD用于指令译码ID,内存控制器用于数据加载LD:将输入指令的操作数从上一层运算节点的内存组件加载到本地内存组件,归约控制单元RC用于控制LFU执行操作归约RD得到输入指令的运算结果,内存控制器还用于将运算结果写回到所述任意一个运算节点的上一层运算节点的内存组件中。
流水线控制单元连接译码器DD、归约控制单元RC、内存控制器以及下一层运算节点,流水线控制单元根据译码器DD、归约控制单元RC、内存控制器以及下一层运算节点的反馈同步多个阶段。例如,所述流水线控制单元在接收到所述译码器DD、内存控制器、下一层运算节点以及所述归约控制单元RC发送的第一反馈信号后,控制流水线按顺序向下传播,其中,第一反馈信号可以是指表示译码器DD、内存控制器、下一层运算节点以及所述归约控制单元RC执行完当前指令的相应阶段的信号。
示例性的,假设有输入指令1、输入指令2、输入指令3、输入指令4和输入指令5、输入指令6,内存控制器对输入指令1进行数据写回WB,RC控制本地处理单元LFU对输入指令2进行操作归约RD,下一层运算节点对输入指令3进行操作执行EX,内存控制器对输入指令4进行数据加载LD,DD对输入指令5进行指令译码ID。在DMAC、RC、下一层运算节点以及DD执行完当前指令的相应阶段的处理后,可以向流水线控制单元发送第一反馈信号,流水线控制单元在接收到内存控制器、RC、下一层运算 节点以及DD发送的第一反馈信号后,可以控制流水线按顺序向下传播:内存控制器对输入指令2进行数据写回WB,RC控制本地处理单元对输入指令3进行操作归约RD,下一层运算节点对输入指令4进行操作执行EX,内存控制器对输入指令5进行数据加载LD,DD对输入指令6进行指令译码ID。
图8示出根据本公开一示例的运算节点以及流水线运行过程的示意图。在一种可能的实现方式中,所述处理器还可以包括串行分解器SD(Sequential decomposer),串行分解器SD连接译码器DD的输入端,串行分解器SD用于对所述输入指令进行串行分解得到串行子指令;所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述串行子指令对应的操作。串行分解器SD和译码器DD之间还可以设置有子指令队列SQ(sub-level instruction Queue),子指令队列SQ用于暂存所述串行子指令,DD还用于对串行子指令进行译码得到译码后的串行子指令。设置SQ暂存串行子指令,对于需要做串行分解的输入指令,可以加速流水线的传播,提高运算效率。
如图8所示,SD的输入端还可以连接指令队列IQ(Instruction Queue),也就是说,处理器可以先将上一层运算节点的输出指令作为本层运算节点的输入指令加载到IQ,本层运算节点可以是指处理器所属的运算节点,SD从IQ中获取输入指令,考虑到硬件的限制,SD可以将输入指令分解为多个可以串行执行的串行子指令,并暂存到SQ中,DD从SQ中获取串行子指令进行译码。
通过设置IQ作为SD与上一层运算节点之间的缓冲,可以省去SD与上一层运算节点之间严格的同步执行关系。IQ可以简化电路设计,同时提高执行效率,例如,允许SD和上一层运算节点之间独自异步执行,减少SD等待上一层运算节点发送输入指令的时间等。SQ作为SD与DD之间的缓冲,可以省去SD与DD之间严格的同步执行关系。SQ可以简化电路设计,同时提高执行效率,例如,允许SD独自异步执行,减少DD等待SD对输入指令进行串行分解的时间等。通过设置IQ和SQ可以提高运算装置的处理效率。
对输入指令的串行分解可以包括对输入指令的操作数的分解以及对输入指令的分解。在进行串行分解时,为了更有效的利用运算节点的资源,串行分解得到的串行子指令将具有尽可能大的分解粒度,串行分解得到的串行子指令的分解粒度由运算节点的资源决定,例如,运算节点的资源可以为运算节点的内存组件的容量。这里的分解粒度可以指分解操作数的维度。
输入指令需要的内存容量可以根据存储输入指令的操作数需要的内存容量、以及存储操作符对操作数进行处理后的中间结果需要的内存容量等确定,在确定输入指令需要的内存容量后,可以判断本层运算节点的内存组件的容量是否满足输入指令需要的内存容量,如果不满足,则可以根据本层运算节点的内存组件的容量对输入指令进行串行分解得到串行子指令。
以矩阵相乘运算作为示例说明SD的功能,假设输入指令为对矩阵X和Y相乘,SD可以根据矩阵X和矩阵Y的大小确定输入指令需要的内存容量,可以将输入指令需要的内存容量与本层运算节点的内存组件的容量进行比较,如果输入指令需要的内存容量大于本层运算节点的内存组件的容量,则需要对输入指令进行串行分解。具体的过程可以为,对操作数进行分解,从而将输入指令分为多个串行子指令,该多个串行子指令可以串行执行,例如,可以对矩阵X或者矩阵Y进行分解,或者对矩阵X和矩阵Y都进行分解,以对矩阵X进行分解为例,可以将输入指令串行分解为多个矩阵相乘的串行子指令以及求和的串行子指令,在串行执行完多个矩阵相乘的串行子指令后,根据多个矩阵相乘的串行子指令的运算结果以及求和的串行子指令进行求和得到输入指令的运算结果。需要说明的是,上述对于矩 阵相乘的串行分解方式仅仅是本公开为了说明SD的功能的一个示例,不以任何方式限制本公开。
在一种可能的实现方式中,如图8所示,处理器还可以包括并行分解器PD(Parallel decomposer),所述并行分解器PD的输入端连接译码器DD的输出端,并行分解器PD的输出端连接下一层运算节点的输入端,并行分解器PD用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点,以使所述下一层运算节点根据并行子指令对应的操作数并行运行并行子指令。其中,并行分解可以是指分解后的并行子指令可以并行执行,举例来说,假设串行子指令为对向量A和B相加,其中,A=(A1,A2…Aj,…An),B=(B1,B2…Bj,…Bn),其中,n表示向量A和B中元素的个数,n为正整数,j表示元素的序号,j为正整数且j≤n,那么PD可以根据下一层运算节点的数量将串行子指令并行分解为多个并行子指令,每个并行子指令负责处理向量中部分数据的相加操作,例如,假设n=4,PD连接了4个下一层运算节点,则PD可以对串行子指令并行分解得到4个并行子指令,4个并行子指令分别为对A1和B1、A2和B2、A3和B3以及A4和B4相加,PD可以将4个并行子指令发送给所述下一层运算节点。需要说明的是,以上举例仅仅是为了说明并行分解的示例,不以任何方式限制本公开。
在一种可能的实现方式中,所述内存控制器可以包括DMA(内存控制器,Direct Memory Access)以及DMAC(Direct Memory Access Controller),本文中称DMAC为第一内存控制器、DMA为第二内存控制器。其中,DMA连接数据通路,DMAC连接DMA以及DD、SD、流水线控制单元、下一层运算节点等。DMAC可以根据控制信号生成加载指令,将加载指令发送给DMA,由DMA根据加载指令控制数据通路,实现数据的加载。DMAC还可以向流水线控制单元发送上文所述的第一反馈信号,在DMA执行完数据加载或者数据写回后可以通知DMAC,DMAC收到通知后可以向流水线控制单元发送第一反馈信号。
输入指令可以包括:运算符、操作数参数,所述操作数参数可以是指向输入指令的操作数的参数,所述操作数参数可以包括全局参数和局部参数,全局参数是表示输入指令对应的第一操作数的大小的参数,局部参数是表示输入指令的第二操作数在所述第一操作数中的起始位置和第二操作数的大小的参数。也就是说,第二操作数可以是第一操作数中的部分数据或者全部数据,执行输入指令时可以实现对第二操作数的处理,对第二操作数的处理可以是与输入指令的运算符对应的处理。
就是说,本公开的运算装置采用的指令可以是一个三元组<O,P,G>,其中,O表示运算符,P表示一个操作数的有限集,G表示粒度指标,具体的表现形式可以为“O,P[N][n1][n2]”,其中,N可以为正整数,表示全局参数,根据张量维度的不同可以设置多个不同的N,n1和n2为小于N的自然数,表示局部参数,其中,n1表示对操作数进行运算时的起始位置,n2表示大小,执行上述指令可以实现对操作数P中n1到n1+n2的操作数的运算O,同样的,根据张量维度的不同可以设置多个不同的n1和n2。本公开的运算装置的每一层接收到的输入指令的格式都是相同的,因此,可以自动完成指令的分解、执行指令对应的操作,等等。
任意一个(当前)运算节点在接收到上一层运算节点发送的输入指令后,可以根据输入指令的操作数参数从上一层运算节点的内存组件中读取相应的操作数,并保存在当前运算节点的内存组件中,任意一个运算节点在执行完输入指令得到运算结果后,还可以将运算结果写回到上一层运算节点的内存组件中。举例来说,当前运算节点的处理器可以根据输入指令的操作数参数向DMAC发送控制信号, DMAC可以根据控制信号控制DMA,DMA控制当前运算节点的内存组件和上一层运算节点的内存组件之间连接的数据通路,从而将输入指令的操作数加载到当前运算节点的内存组件中。
在一种可能的实现方式中,DMAC可以根据控制信号生成加载指令,将加载指令发送给DMA,由DMA根据加载指令控制数据通路,实现数据的加载。
DMAC可以根据控制信号确定基地址、起始偏移量、加载数据的数量、跳转的偏移量等参数,然后根据基地址、起始偏移量、加载数据的大小、跳转的偏移量等参数生成加载指令,还可以根据操作数的维度设置循环加载数据的次数。其中,基地址可以是操作数在内存组件中存储的起始地址,起始偏移量为要读的操作数在原操作数中开始的位置,起始偏移量可以根据局部参数中的起始位置确定,加载数据的数量可以根据局部参数中的大小确定,跳转的偏移量表示下一部分要读的操作数在原操作数中开始的位置相对于上一部分读的操作数在原始操作数中开始的位置之间的偏移,也就是说,跳转的偏移量为下一个读取数据的起始偏移量相对于上一个读取数据的起始偏移量的偏移量跳转的偏移量可以根据全部参数或局部参数确定。例如,可以将起始位置作为起始偏移量,将局部参数中的大小作为一次加载的数据的数量,可以将局部参数中的大小作为跳转的偏移量。
在一种可能的实现方式中,可以根据基地址以及起始偏移量确定开始读取操作数的起始地址,根据加载数据的数量以及起始地址可以确定一次读取操作数的结束地址,根据起始地址以及跳转的偏移量可以确定下一部分要读的操作数的起始地址,同样的,可以根据加载数据的数量以及下一部分要读的操作数的起始地址确定本次读取操作数的结束位置……重复以上过程,直到达到循环加载操作数的次数。其中的一次读取操作数和本次读取操作数可以是指:读取同一个操作数需要一次或多次完成,每次读取同一个操作数中的部分操作数,上述一次和本次可以是指多次中的一次。
也就是说,读取一个操作数可能需要循环多次读取完成,第一内存控制器可以根据基地址、起始偏移量、加载数据的数量、跳转的偏移量确定每次读取操作数时的起始地址和结束地址,例如,针对每次读取过程,可以根据上一次读取过程的起始地址和跳转的偏移量确定本次读取过程的起始地址,可以根据本次读取过程的起始地址和加载数据的数量(以及数据的格式)确定本地读取过程的结束地址。其中,跳转的偏移量可以根据跳转的数据的数量以及数据的格式确定。
示例性的,图9示出根据本公开一实施例的操作数的示意图,如图9所示,假设操作数P为M行N列的矩阵P[M,N],控制信号为“Load P[M,N][0,0][M,N/2],P’”。DMAC根据控制信号可以设置在行和列方向的起始偏移量均为0,加载数据的数量为N/2,跳转的偏移量为N,循环的次数为M。如图9所示,从第一行第一列开始读取N/2列数据,跳转到第二行第一列读取N/2列数据……循环M次可以完成数据的加载。
需要说明的是,以上示例仅仅是为了说明本公开的运算装置加载数据的方式,不以任何方式限制本公开。
在一种可能的实现方式中,所述任意一个运算节点还可以包括:流水线锁存器,所述译码器DD和所述内存控制器之间、所述内存控制器和所述下一层运算节点FFU(Fractal Functional Units)之间、下一层运算节点FFU和所述本地处理单元LFU之间、以及所述本地处理单元LFU和所述内存控制器之间分别设置有流水线锁存器。流水线锁存器用于缓存下一阶段要处理的指令。所述流水线控制单元通过控制所述流水线锁存器同步所述多个阶段。
在一种可能的实现方式中,所述流水线控制单元在接收到所述译码器DD、内存控制器、下一层运算节点LFU以及所述归约控制单元RC发送的第一反馈信号后,分别向各个所述流水线锁存器发送第一控制信号,各个所述流水线锁存器根据所述第一控制信号更新输出。其中,所述第一控制信号可以是高电平信号或者低电平信号,本公开对此不作限定。更新输出是指流水线锁存器在接收到第一控制信号(如图8所示,流水线控制单元向流水线锁存器发送的控制信号)时,输出跟随输入的并行子指令或者与输入指令的操作相关的控制信号而变化,这里输入的并行子指令或者与输入指令的操作相关的控制信号是指图8中从流水线锁存器的左侧输入的。
仍然以上文所述的输入指令1、输入指令2、输入指令3、输入指令4和输入指令5、输入指令6为例,结合图8对流水线的处理过程进行说明。
(1.1)DMAC接收到流水线锁存器4输出的控制信号,根据控制信号控制DMA对输入指令1进行数据写回WB;
(1.2)本地处理单元LFU接收流水线锁存器3输出的控制信号,对输入指令2进行操作归约RD,将归约结果(输入指令2的运算结果)存储到内存组件中;
(1.3)下一层运算节点接收流水线锁存器2中的并行子指令(对输入指令3分解后得到的),对输入指令3进行操作执行EX,将执行结果写回到内存组件中;
(1.4)DMAC接收流水线锁存器1发送的控制信号,根据控制信号控制DMA将输入指令4的输入操作数加载到内存组件中;
(1.5)DD对输入指令5进行指令译码ID,并将译码后的输入指令5发送给PD和RC,将数据加载、以及数据写回等相关的控制信号缓存在流水线锁存器1中,PD对译码后的输入指令5进行并行分解得到并行子指令,将并行子指令缓存在流水线锁存器1中,RC将输入指令5的操作归约对应的控制信号缓存在流水线锁存器1中。
在DMAC、RC、下一层运算节点以及DD执行完当前指令的相应阶段的处理后,可以向流水线控制单元发送第一反馈信号,流水线控制单元在接收到DMAC、RC、下一层运算节点以及DD发送的第一反馈信号后,可以向各个所述流水线锁存器发送第一控制信号,控制流水线按顺序向下传播,各流水线锁存器在接收到第一控制信号后,输出的控制信号跟随输入信号变化。例如,(1)针对输入指令2的数据写回对应的控制信号从流水线锁存器4输出、针对输入指令3的数据写回对应的控制信号从流水线锁存器3输出到流水线锁存器4;(2)针对输入指令3的操作归约对应的控制信号从流水线锁存器3输出、针对输入指令2的操作归约对应的控制信号从流水线锁存器2输出到流水线锁存器3、针对输入指令1的操作归约对应的控制信号从流水线锁存器1输出到流水线锁存器2;(3)针对输入指令4的并行子指令从流水线锁存器2输出、针对输入指令5的并行子指令从流水线锁存器1输出到流水线锁存器2;(4)针对输入指令5的数据加载对应的控制信号从流水线锁存器1输出;(5)输入指令6输入到DD中,DD对输入指令6进行指令译码ID,并将译码后的输入指令6发送给PD和RC,将数据加载、以及数据写回等相关的控制信号缓存在流水线锁存器1中,PD对译码后的输入指令6进行并行分解得到并行子指令,将并行子指令缓存在流水线锁存器1中,RC将输入指令6的操作归约对应的控制信号缓存在流水线锁存器1中。DMAC、RC、下一层运算节点以及DD的执行过程如下:
(2.1)DMAC接收到流水线锁存器4输出的控制信号,控制DMA对输入指令2的运算结果进行数 据回写WB;
(2.2)LFU接收流水线锁存器3输出的控制信号,根据控制信号从内存组件中获取对输入指令3进行操作执行EX后的执行结果,对输入指令3的指令结果进行操作归约RD,将归约结果(输入指令3的运算结果)存储到内存组件中;
(2.3)下一层运算节点接收流水线锁存器2输出的针对输入指令4的并行子指令,对输入指令4进行操作执行EX,将执行结果写回到内存组件中;
(2.4)DMAC接收流水线锁存器1发送的控制信号,根据控制信号控制DMA将输入指令5的输入操作数加载到内存组件中;
(2.5)DD从SQ中获取输入指令6,对输入指令6进行指令译码ID。
在一种可能的实现方式中,DD在从SQ中获取到串行子指令时,可以检测串行子指令的数据依赖情况,若检测到串行子指令存在数据依赖,则DD可以停止从SQ中获取串行子指令。
串行子指令存在数据依赖可以是指串行子指令的输入操作数与之前的多条串行子指令的输出操作数存在重叠(数据依赖)。之前的多条串行子指令的条数可以根据流水线的级数确定,比如在本公开实施例的5级流水线的示例中,之前的多条串行子指令可以是指之前的4条串行子指令。当前译码的串行子指令的输入操作数与之前的多条串行子指令的输出操作数存在重叠,可以是指当前译码的串行子指令的输入操作数与之前的多条串行子指令中的任意一条或多条的输出操作数存在重叠,本公开对此不作限定。
由于当前译码的串行子指令的输入操作数与之前的多条串行子指令的输出操作数存在重叠,也就是说,当前译码的串行子指令的输入操作数是之前的多条串行子指令的输出操作数中的部分或全部,因此,需要之前多条串行子指令执行完得到输出操作数之后才能够加载当前译码的串行子指令的输入操作数。所以,需要暂停流水线的传播,直到运行完之前的多条串行子指令得到输出操作数,继续流水线的传播过程。具体过程可以为,DD停止从SQ中获取串行子指令,DD的输出不变,DD之后的第一个流水线锁存器不再输出锁存的控制信号,而是输出空泡控制信号,收到空泡控制信号的各功能部件不进行操作,仅立刻向流水线控制单元发送第一反馈信号。流水线控制单元继续按原条件发射第一控制信号,让流水线带着从第一个流水线锁存器注入的空泡继续执行,直到数据依赖得以解决。数据依赖解决后,DD继续从SQ中取指令,第一个流水线锁存器继续输出锁存的控制信号。
根据上述实施方式的流水线控制过程,可以灵活的控制流水线的进程,避免计算结果出错。
在一种可能的实现方式中,所述译码器在检测到当前译码的串行子指令的输入操作数与之前的多条串行子指令的输出操作数不存在重叠时,将当前译码的串行子指令译码后预加载到所述下一层运算节点上。
根据上文描述的过程可知,对于一条串行子指令,在译码完成后,需要等待数据加载LD完成后,才会加载到下一层运算节点上进行操作执行EX。根据上文中的示例,在(2.3)中下一层运算节点对输入指令4进行操作执行EX时,(2.5)DD从SQ中获取输入指令6,对输入指令6进行指令译码ID,输入指令的并行子指令被缓存在流水线锁存器1中,还没有加载到下一层运算节点上,在下一个第一控制信号到来时,才会加载到下一层运算节点上。
对于输入操作数与之前的多条串行子指令的输出操作数不存在重叠的情况,译码器可以向流水线 控制单元发送预加载信号。如果下一层运算节点已经完成了输入指令4的并行子指令的操作执行并向流水线控制单元发送了第一反馈信号,这时,流水线控制单元可以根据预加载信号,向流水线锁存器1发送第一控制信号,流水线锁存器1根据第一控制信号预先将输入指令6的并行子指令输出到下一层运算节点(也就是预加载串行子指令,如图8中的流水线锁存器1到FFU的虚线箭头所示),以使下一层运算节点提前对输入指令6的进行操作执行EX,从而提升运算装置的运算效率。
在以上示例中,对于当前译码的串行子指令的输入操作数与之前的多条串行子指令的输出操作数是否存在重叠,译码器DD可以通过检测之前多条(例如5条)串行子指令的输出操作数的地址以及当前译码的串行子指令的输入操作数的地址和大小描述符来确定。
通过本实施方式可以在输入操作数与之前的多条串行子指令的输出操作数不存在重叠的情况,采用指令预加载的方式加快处理的速度,提高运算装置的处理效率。
如上所述机器学习为计算及访存密集型技术,为了提高机器学习的运算效率,本公开提供了一种运算装置采用的内存管理方法。
在一种可能的实现方式中,所述内存组件可以包括静态内存段和循环内存段。图11示出根据本公开一实施例的内存组件的划分的示例的示意图。如图11所示,可以所述内存组件的内存空间划分为静态内存段和循环内存段。
对于机器学习中的一些运算,这些运算被分解后的几部分运算之间会共用一部分操作数,对于这部分操作数,本公开称作共用操作数。以矩阵相乘运算作为示例,假设输入指令为对矩阵X和Y相乘,如果仅仅对矩阵X进行分解,那么对输入指令进行串行分解得到的串行子指令需要共同使用操作数Y,操作数Y为共用操作数。
如上文所述,输入指令可以是描述了机器学习的操作(运算)的指令,机器学习的操作(运算)以由上文中的计算原语组成,输入指令可以包括操作数和操作符等。也就是说,对于任意一个运算节点的输入指令:处理器对输入指令进行分解得到的多个子指令,这多个子指令可能会共用一部分操作数,这部分操作数即共用操作数。
在一种可能的实现方式中,被分解后的运算或者指令是否存在共用操作数可以根据操作类型和被分解的维度确定,其中操作类型可以是指具体的操作或运算,例如,矩阵乘法;被分解的维度可以是指输入指令的操作数(张量)被分解的维度,举例来说,假设操作数的表示形式为NHWC(batch,height,width,channels),根据图5所示的过程确定分解的维度为C维度,那么操作数被分解的维度为C维度。
如果所述多个子指令之间存在共用操作数,则所述处理器在所述静态内存段中为所述共用操作数分配内存空间,在所述循环内存段中为所述多个子指令的其他操作数分配内存空间;其中,所述共用操作数为:所述任意一个运算节点中的下一层运算节点执行所述多个子指令时都要使用的操作数,所述其他操作数为:所述多个子指令的操作数中除了所述共用操作数以外的操作数。
对于共用操作数,为了避免频繁的读写,本公开在内存组件中设置静态内存段专门用于存储共用操作数,对于多个子指令的共用操作数,在执行多条子指令之前,只需要执行一次将共用操作数从任意一个运算节点的上一层运算节点的内存组件中加载共用操作数到所述静态内存段的操作即可,可以避免频繁存取数据、节约带宽资源。
上述其他操作数可以是指,输入指令的操作数中被分解的操作数、执行子指令得到的中间结果、 归约结果,等等,其中,归约结果可以是对中间结果进行操作归约得到的,操作归约可以是指上文中提到的归约处理。
在一种可能的实现方式中,处理器用于对任意一个运算节点的输入指令进行分解得到多个子指令,可以包括:所述SD根据所述输入指令需要的内存容量、所述静态内存段的容量以及所述循环内存段的容量,对所述输入指令进行串行分解得到串行子指令。
在一个示例中,对于分解后不存在共用操作数的输入指令,可以根据输入指令需要的内存容量以及循环内存段的容量,对所述输入指令进行串行分解得到串行子指令。
在一个示例中,对于分解后存在共用操作数的输入指令,可以根据共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与循环内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令。
对于分解后存在共用操作数的输入指令,如果共用操作数需要的内存容量大于所述静态内存段的剩余容量,或者,其他操作数需要的内存容量大于循环内存段的容量,则需要对输入指令进行串行分解。
对于共用操作数:SD可以计算所述静态内存段剩余的内存容量,所述SD根据所述静态内存段剩余的内存容量以及所述共用操作数需要的内存容量对所述输入指令进行第一串行分解得到第一串行子指令。具体地,可以确定共用操作数的维度的分解优先级,按照分解优先级的顺序依次选择对共用操作数分解的维度并以二分法方式确定最大分解粒度,直到分解后的共用操作数需要的内存容量小于或等于本层运算节点的静态内存段剩余的内存容量。具体的过程可以参见关于图5部分的描述,不再赘述。然后可以根据对共用操作数的分解方式对输入指令进行分解。
对于其他操作数:SD可以根据所述循环内存段的内存容量以及所述其他操作数需要的内存容量对所述第一串行子指令进行第二串行分解得到所述串行子指令。同样的,可以确定其他操作数的维度的分解优先级,按照分解优先级的顺序依次选择对其他操作数分解的维度并以二分法方式确定最大分解粒度,直到分解后的其他操作数需要的内存容量小于或等于本层运算节点的循环内存段剩余的内存容量。具体的过程可以参见关于图5部分的描述,不再赘述。然后可以根据对其他操作数的分解方式对输入指令进行分解。
举例来说,假设输入指令为对矩阵X和Y相乘,操作数Y为共用操作数,其他操作数包括操作数X。根据本公开的实施方式,可以确定存储操作数Y需要的内存容量以及静态内存段的容量,如果存储操作数Y需要的内存容量小于静态内存段的容量,那么可以不对操作数Y进行分解,如果存储操作数Y需要的内存容量大于静态内存段的容量,那么可以根据图5所示的过程对操作数Y的分解方式。根据对操作数Y的分解方式可以对输入指令进行串行分解。还可以确定存储操作数X、中间结果以及归约结果需要的内存容量,其中,存储中间结果、归约结果需要的内存容量可以结合操作数X以及上述分解后的操作数Y确定,如果存储其他操作数需要的内存容量小于循环内存段的容量,那么可以不对操作数X进行分解,如果存储其他操作数需要的内存容量大于静态内存段的容量,那么可以根据图5所示的过程对操作数X的分解方式,只不过每次需要判断的是存储其他操作数需要的内存容量与循环内存段的容量的大小,而不单单是操作数X。
SD确定对操作数的分解方式后,对输入指令进行串行分解后得到的串行子指令包括头部指令和 主体指令,所述头部指令用于加载共用操作数,SD可以在静态内存段中为所述共用操作数分配内存空间,头部指令记录了为所述共用操作数分配的内存空间的地址,所述主体指令用于加载所述其他操作数、以及对所述共用操作数和其他操作数进行计算。
如图10a所示,本公开的运算节点中设置有本地处理单元LFU(local functional units)、第一内存控制器(DMAC,Direct Memory Access Controller)以及第二内存控制器(DMA,Direct Memory Access),第一内存控制器可以通过硬件电路或者软件程序的方式实现,本公开对此不作限定。第一内存控制器连接第二内存控制器。其他内容可以参见上文的介绍,不再赘述。
第一内存控制器分别连接SD、DD,根据SD或DD发送的控制信号从上一层运算节点的内存组件中读取操作数、并写入当前运算节点的内存组件中。第一内存控制器除了负责数据的读取、写入,还负责不同层运算节点之间的数据写回,例如,将i+1层运算节点的运算结果写回到第i层运算节点。
在一种可能的实现方式中,每一个运算节点的内存组件还连接同一运算节点内的本地处理单元LFU。译码器DD的输出端还连接归约控制单元RC,归约控制单元RC连接本地处理单元LFU。归约控制单元RC用于控制LFU执行操作归约RD得到输入指令的运算结果,并将运算结果写入到内存组件中,第一内存控制器可以控制第二内存控制器将内存组件中的运算结果写回到上一层运算节点的内存组件中。
SD可以将串行分解后的串行子指令输出到SQ中,DD从SQ中获取串行子指令,DD主要根据主体指令存储数据的需求在循环内存段上分配内存空间,DD可以根据主体指令对应的操作数的存储需求为串行子指令分配本层运算节点的内存组件上的内存空间,并将分配的内存空间的地址(本地地址)绑定到主体指令中获取操作数的指令上,从而实现译码处理。
DD还可以根据串行子指令向第一内存控制器DMAC发送控制信号,第一内存控制器DMAC可以根据控制信号控制第二内存控制器DMA将串行子指令对应的操作数加载到为其分配的内存空间中,也就是说根据串行子指令中记载的输入指令对应的操作数的地址从上一层运算节点的内存组件中查找到串行子指令对应的操作数的存储位置,并读取操作数,然后根据本地地址写入到本层运算节点的内存组件中。
在一种可能的实现方式中,所述任意一个运算节点中的所述处理器控制所述下一层运算节点,以流水线的方式分多个阶段执行所述任意一个运算节点的串行子指令对应的操作。图10b示出根据本公开一实施例的流水线的示例。
如图10b所示,多个阶段可以包括:指令译码ID(Instruction Decoding)、数据加载LD(Loading)、操作执行EX(Execution)、操作归约RD(Reduction)以及数据写回WB(Writing Back),所述流水线按照指令译码ID、数据加载LD、操作执行EX、操作归约RD以及数据写回WB的顺序传播。
DD用于对所述多个子指令(串行子指令)进行指令译码ID。译码器根据所述头部指令向所述第一内存控制器发送第一控制信号，以使第一内存控制器根据第一控制信号控制第二内存控制器加载共用操作数。对于所述主体指令，DD可以根据主体指令对应的其他操作数的存储需求分配本层运算节点的循环内存段上的内存空间，并将分配的内存空间的地址(本地地址)绑定到主体指令中获取或者存储其他操作数的指令上，从而实现译码处理。译码器还可以根据主体指令向第一内存控制器发送第二控制信号，以使内存控制器根据第二控制信号控制第二内存控制器存取其他操作数。
第二内存控制器DMA用于数据加载LD：将输入指令的操作数加载到内存组件，具体包括：根据与所述头部指令对应的第一控制信号从上一层运算节点的内存组件中加载所述共用操作数到所述静态内存段，根据与所述主体指令对应的第二控制信号从上一层运算节点的内存组件中加载所述其他操作数到所述循环内存段。所述第二内存控制器根据所述第二控制信号加载到所述循环内存段的，主要是其他操作数中属于输入操作数的那部分数据，而不是中间结果或者归约结果。
DD对串行子指令进行译码处理后发送给PD,PD可以根据PD连接的下一层运算节点的数量对译码处理后的串行子指令进行并行分解,并行分解可以是指分解后的并行子指令可以并行执行。
下一层运算节点可以以流水线的方式分所述多个阶段执行所述操作执行EX,得到执行结果。RC用于控制LFU对所述执行结果进行操作归约RD,得到所述输入指令的运算结果,所述DMA还用于数据写回WB:将运算结果写回到所述任意一个运算节点的上一层运算节点的内存组件中。以流水线的方式分多个阶段执行输入指令对应的操作的过程可以参照上文中结合图3以及图6进行说明的示例。
处理器中的SD、DD和PD是分开的,内存分配在时间上可以很好地错开。具体来说,PD总是在DD之后分配内存空间,但分配的内存空间释放得更早,DD总是在SD之后分配内存空间,但分配的内存空间同样释放得更早。而用于SD进行串行分解的内存空间可能会在多个串行子指令中用到,因此,为SD设置了静态内存段,而其他部分共用内存组件中除了静态内存外的内存(循环内存段)。
在以上流水线的多个阶段中,除了ID外其他4个阶段均涉及内存的访问,因此,最多有4条指令同时需要访问内存。而LD和WB阶段都是DMA访问内存段,LD和WB的先后顺序由DMAC控制,访问内存时不会产生冲突,也就是说只有3条指令同时需要访问循环内存段,因此,可以将循环内存段划分为多段子内存块,例如可以划分为3段子内存块。在DD需要为串行子指令的操作数分配内存空间时,可以按照串行子指令的输入顺序依次在3段子内存块中为串行子指令的操作数分配内存空间,这样的话,可以降低内存管理复杂性、并且可以提高内存空间利用率。
在一种可能的实现方式中,所述处理器中设置有第一计数器,所述循环内存段包括多段子内存块,所述处理器在所述循环内存段中为所述多个子指令的其他操作数分配内存空间,包括:所述处理器从所述循环内存段中与所述第一计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
在一种可能的实现方式中,控制器中的DD在对所述多个子指令进行指令译码过程中,从所述循环内存段中与所述第一计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
图12以及图13示出根据本公开一实施例的内存组件的划分的示例的示意图。如图12和图13所示,将循环内存段划分为3段子内存块,所述3段子内存块的内存容量大小可以相同,也可以不同,本公开对此不作限定。处理器中可以设置有计数器1,DD从SQ中获取串行子指令后,对于串行子指令中的主体指令,可以按照主体指令以及计数器1的计数值顺序为其分配循环内存段的内存空间。举例来说,若获取了一条主体指令1,计数器1的计数值为0,那么DD将在循环内存段0中为主体指令1的操作数分配内存空间;然后获取了一条主体指令2,此时计数器1的计数值为1,那么DD将在循环内存段1中为主体指令2的操作数分配内存空间;然后获取了一条主体指令3,此时计数器1的计数值为2,那么DD 将在循环内存段2中为主体指令3的操作数分配内存空间……。
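上述“按计数器1在3段子内存块中轮流为主体指令分配内存空间”的做法，可以用下面这段示意性的Python草图表示。其中的类名、接口以及“每条主体指令独占一段子内存块、轮到该块时上一条使用它的指令已完成写回”的简化均为本说明的假设。

# 示意性草图：计数器1 依次递增，主体指令按计数值轮流使用 3 段子内存块。
class CircularSegmentAllocator:
    def __init__(self, num_blocks=3, block_size=4096):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.counter = 0                           # 对应文中的“计数器1”
    def allocate(self, size):
        # 简化假设：轮到某段子内存块时，之前使用它的主体指令已完成写回，整块可复用
        assert size <= self.block_size, "其他操作数所需容量超过子内存块容量，需要进一步串行分解"
        block_id = self.counter % self.num_blocks  # 本条主体指令使用的子内存块编号
        self.counter += 1
        return block_id                            # 返回的编号与块内地址共同构成“本地地址”

alloc = CircularSegmentAllocator()
print(alloc.allocate(1024))   # 主体指令1 -> 循环内存段0
print(alloc.allocate(1024))   # 主体指令2 -> 循环内存段1
print(alloc.allocate(1024))   # 主体指令3 -> 循环内存段2
print(alloc.allocate(1024))   # 主体指令4 -> 又回到循环内存段0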
在一种可能的实现方式中,在所述流水线传播的过程中,所述DMA、下一层运算节点以及LFU按顺序循环使用所述3段子内存块。图12中还示出根据本公开一实施例的多条指令的流水线传播过程的示意图。下面结合上述分配内存空间的示例以及流水线的传播过程对此进行说明。
如图12所示，在T1时，DD为主体指令1在循环内存段0中分配内存空间，然后将主体指令1发送给PD，由PD对主体指令1进行并行分解得到(多个)并行子指令1。
在T2时,DD为主体指令2在循环内存段1中分配内存空间,对于主体指令1,在LD阶段,由DMA将主体指令1的输入操作数加载到循环内存段0中,也就是此时由DMA使用循环内存段0。
在T3时,DD为主体指令3在循环内存段2中分配内存空间;对于主体指令2,在LD阶段,由DMA将主体指令2的输入操作数加载到循环内存段1中,也就是此时由DMA使用循环内存段1;对于主体指令1,在EX阶段,由下一层运算节点FFU(Fractal Functional Units)执行并行指令1,并将执行结果写回到循环内存段0,也就是此时由FFU使用循环内存段0。
在T4时，对于主体指令4，DD为主体指令4在循环内存段0中分配内存空间；对于主体指令3，在LD阶段，由DMA将主体指令3的输入操作数加载到循环内存段2中，也就是此时由DMA使用循环内存段2；对于主体指令2，在EX阶段，由FFU执行并行指令2，并将执行结果写回到循环内存段1，也就是此时由FFU使用循环内存段1；对于主体指令1，LFU对执行结果进行操作归约RD，也就是此时由LFU使用循环内存段0。
在T5时,对于主体指令1,在WB阶段,DMA将循环内存段0中的归约结果写回到上一层运算节点的内存组件上,对于主体指令4,在LD阶段,由DMA将主体指令4的输入操作数加载到循环内存段0中,也就是此时由DMA使用循环内存段0;对于主体指令3,在EX阶段,由FFU执行并行指令3,并将执行结果写回到循环内存段2,也就是此时由FFU使用循环内存段2;对于主体指令2,LFU对执行结果进行操作归约RD,也就是此时由LFU使用循环内存段1。
通过以上过程可知,在流水线传播的过程中,DMA、下一层运算节点(FFU)以及LFU按顺序循环使用3段子内存块。能够降低内存管理的复杂性,并提高内存空间利用率。
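上述 T1~T5 的占用规律也可以用一小段示意性代码来推演：假设每个阶段各占一个时刻、第 i 条主体指令固定使用编号为 (i-1) mod 3 的子内存块（与上文计数器1的分配结果一致），则各时刻 DMA、FFU、LFU 占用的子内存块如下。此草图仅用于辅助验证文中描述，并非本公开的实现方式。

# 示意性草图：推演流水线各时刻 DMA(LD/WB)、FFU(EX)、LFU(RD) 占用的子内存块。
STAGE_USER = {1: "DMA(LD)", 2: "FFU(EX)", 3: "LFU(RD)", 4: "DMA(WB)"}

def block_usage(num_instrs, num_blocks=3):
    for t in range(1, num_instrs + 5):
        usage = []
        for i in range(1, num_instrs + 1):
            stage = t - i                       # 第 i 条指令在时刻 t 所处的阶段（0=ID ... 4=WB）
            if stage in STAGE_USER:
                usage.append(f"{STAGE_USER[stage]} 主体指令{i} -> 循环内存段{(i - 1) % num_blocks}")
        print(f"T{t}:", "; ".join(usage))

block_usage(4)   # 输出与文中 T1~T5 的描述一致；LD 与 WB 同为 DMA 访问，其先后顺序由 DMAC 控制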
需要说明的是，并不是每条指令的执行过程中都具备完整的五级流水线。例如，对输入指令：排序SORT A,B进行串行分解，会产生归约，SD会得到串行子指令：
SORT A1,K1;
SORT A2,K2;
MERGE K1,K2,B;
其中A1,A2,B位于上一层运算节点的内存组件中,K1,K2被SD分配于静态内存段。
所以执行串行子指令SORT A1,K1时，DD不对K1进行降级，因此，也就不需要写回，WB阶段会成为空泡，RD阶段LFU将结果写到静态内存段的K1上；执行串行子指令SORT A2,K2的过程和执行串行子指令SORT A1,K1的过程类似。
而执行MERGE K1,K2,B时，DD也不对K1,K2进行降级，也就是说，不需要加载数据，LD阶段会成为空泡，EX阶段FFU会直接访问静态内存段来取数据。
在一种可能的实现方式中，由于本公开的运算装置采用流水线的方式处理输入指令，如果三条相邻(或者相距比较近)的输入指令都存在共用操作数，那么SD在静态内存段为这些共用操作数分配内存空间时，就有可能出现静态内存段碎片化的问题，造成内存空间利用率低。举例来说，假设三条相邻的输入指令的共用操作数分别为操作数1、操作数2和操作数3。
图14示出根据本公开一实施例的静态内存段的内存空间分配方法的示意图。如图14所示，SD先为输入指令1的操作数1分配内存空间，再为第二条输入指令2的操作数2分配内存空间，此时操作数1还在使用，因此可以在操作数1存储位置的相邻位置为操作数2分配内存空间；在第三条输入指令3到达时，操作数1可能已经使用完成，操作数2还在使用，此时可以在操作数1存储的位置为操作数3分配内存空间，但是操作数3需要的内存空间可能稍微小于存储操作数1的内存空间，此时，存储操作数3和操作数2的内存空间之间可能就会有一部分内存空间无法利用；或者，存储操作数3需要的内存空间可能稍微大于存储操作数1的内存空间，此时，可能需要在图14中操作数2的右侧为操作数3分配内存空间。这都会导致内存管理复杂，并且内存空间利用率低。
为了解决上述技术问题，本公开还在所述处理器中设置有第二计数器(可以称作计数器2)，SD可以按照串行分解产生的头部指令的顺序以及计数器2的计数值，在静态内存段中不同的端为共用操作数分配内存空间。
在一种可能的实现方式中,处理器在所述静态内存段中为所述共用操作数分配内存空间,可以包括:所述处理器从所述静态内存段中的第一起始端开始为所述共用操作数分配内存空间,其中,所述第一起始端为与所述第二计数器的计数值对应的起始端。举例来说,计数器2的计数值可以包括0和1,其中,0可以对应静态内存段的一端,1可以对应静态内存段的另一端。
图15示出根据本公开一实施例的静态内存段的内存空间分配方法的示意图。结合图15对SD为共用操作数分配静态内存段的内存空间的过程进行说明。SD从SQ中获取输入指令1,对输入指令1进行串行分解后得到多个串行子指令1,多个串行子指令1共用操作数1,SD要从静态内存段中为操作数1分配内存空间,假设此时计数器2的计数值为0,那么SD可以从图15所示的左侧一端为操作数1分配内存空间。SD从SQ中获取输入指令2,对输入指令2进行串行分解后得到多个串行子指令2,多个串行子指令2共用操作数2,SD要从静态内存段中为操作数2分配内存空间,假设此时计数器2的计数值为1,那么SD可以从图15所示的右侧一端为操作数2分配内存空间。SD从SQ中获取输入指令3,对输入指令3进行串行分解后得到多个串行子指令3,多个串行子指令3共用操作数3,SD要从静态内存段中为操作数3分配内存空间,假设此时计数器2的计数值为0,那么SD可以从图15所示的左侧一端为操作数3分配内存空间。
在一种可能的实现方式中,所述SD可以根据所述第二计数器的计数值确定为所述共用操作数分配内存空间的第一起始端,SD计算从所述第一起始端开始,所述静态内存段剩余的内存容量,所述SD根据所述静态内存段剩余的内存容量以及所述共用操作数需要的内存容量对所述输入指令进行第一串行分解得到第一串行子指令。也就是说,在本实施方式中,SD在计算静态内存段剩余的内存容量时,可以根据第二计数器的计数值确定计算的起始端,然后从起始端开始计算静态内存段剩余的内存容量,然后根据存储共用操作数需要的内存容量与静态内存段剩余的内存容量之间的大小关系确定是否要对共用操作数以及对应的输入指令进行分解。
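结合图15，根据计数器2在静态内存段两端交替分配、并从对应起始端起算剩余容量的做法，可以用如下示意性的Python草图表示。其中的类名、接口以及“每端同一时刻最多保留一个共用操作数”的简化均为本说明的假设。

# 示意性草图：计数值为 0 时从静态内存段左端分配，为 1 时从右端分配；
# 剩余容量从对应起始端起算，需避开另一端仍在使用的共用操作数。
class StaticSegmentAllocator:
    def __init__(self, size):
        self.size = size
        self.in_use = {0: 0, 1: 0}     # 两端各自当前保留的共用操作数大小（简化假设）
        self.counter = 0               # 对应文中的“计数器2”
    def remaining(self, end):
        return self.size - self.in_use[1 - end]   # 从该端起算的剩余容量
    def allocate(self, size):
        end = self.counter % 2
        assert size <= self.remaining(end), "剩余容量不足，需对共用操作数及输入指令做第一串行分解"
        self.in_use[end] = size
        addr = 0 if end == 0 else self.size - size
        self.counter += 1
        return addr

seg = StaticSegmentAllocator(1 << 16)
print(seg.allocate(4096))   # 共用操作数1：左端，地址 0
print(seg.allocate(8192))   # 共用操作数2：右端，地址 57344
print(seg.allocate(2048))   # 共用操作数3：复用左端（此时假设操作数1已使用完毕）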
通过以上内存管理的方式,本公开的运算装置可以降低内存管理复杂性,并且提高内存空间利用率。
机器学习为计算及访存密集型技术,频繁的存取数据对进行机器学习运算的运算装置的带宽提出了很高的要求,为了降低运算装置的带宽的压力,本公开提供了一种操作数的获取方法,该方法可以应用于处理器,所述处理器可以为通用处理器,例如,处理器可以为中央处理单元CPU(Central Processing Unit)、图形处理单元GPU(Graphics Processing Unit)等。所述处理器还可以为用于执行人工智能运算的人工智能处理器,人工智能运算可包括机器学习运算,类脑运算等。其中,机器学习运算包括神经网络运算、k-means运算、支持向量机运算等。该人工智能处理器可例如包括NPU(Neural-Network Processing Unit,神经网络处理单元)、DSP(Digital Signal Processor,数字信号处理单元)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)芯片中的一种或组合。人工智能处理器可以包括多个运算单元,多个运算单元可以并行执行运算。本公开提供的操作数的获取方法也可以应用于上文所述的运算装置中。
图16示出根据本公开一实施例的应用情景示意图。如图16所示,处理器在执行输入指令时,需要将输入指令的操作数从外部存储空间加载到本地内存组件上,执行完输入指令后,再将输入指令的运算结果输出到外部存储空间。频繁的加载和输出的过程需要很大的带宽,为了降低带宽压力,本公开的实施方式通过设置数据地址信息表记录本地内存组件上存储的数据,从而可以实现在从外部存储空间加载输入指令的操作数之前先检查本地内存组件上是否已存储有该操作数,如果已经存储了上述操作数,则无需将输入指令的操作数从外部存储空间加载到本地内存组件上,直接使用本地内存组件上存储的操作数即可,可以节省带宽资源。
其中,所述数据地址信息表中可以记录有地址对应关系,所述地址对应关系可以包括:操作数在本地内存组件上的存储地址和操作数在外部存储空间上的存储地址的对应关系。
表1示出根据本公开一实施例的数据地址信息表的示例。
需要说明的是,表1中的Out_addr1、In_addr1等仅仅是一个表示地址的符号,本公开实施方式的数据地址信息表中记录的地址可以是起始地址+粒度标识的形式,起始地址可以指操作数存储的内存空间的起始地址,粒度标识可以表示操作数的大小,也就是说记录了数据存储的起始地址以及数据的大小等信息。
表1 数据地址信息表
外部存储空间上的存储地址 本地内存组件上的存储地址
Out_addr1 In_addr1
Out_addr2 In_addr2
图17示出根据本公开一实施例的操作数的获取方法的流程图。如图17所示，所述方法可以包括：
步骤S11,在数据地址信息表中查找操作数是否已保存在本地内存组件上;
步骤S12，若操作数已保存在本地内存组件上，则根据操作数在外部存储空间上的存储地址和数据地址信息表确定所述操作数在本地内存组件上的存储地址；
步骤S13,将所述操作数在本地内存组件上的存储地址赋值给获取所述操作数的指令。
处理器在接收到数据加载指令后,可以执行数据加载指令以加载操作数到本地内存组件上。具体地,数据加载指令绑定有操作数在外部存储空间上的存储地址,根据数据加载指令(绑定的存储地址)生成加载数据的控制信号,由DMA(Direct Memory Access)根据控制信号执行数据加载的过程。
而根据本公开的实施例,在生成加载数据的控制信号加载操作数之前,可以执行步骤S11,在数据地址信息表中查找要加载的操作数是否已保存在本地内存组件上。
如上所述，数据地址信息表中可以记录有地址对应关系，可以在地址对应关系中包含全部操作数在外部存储空间上的存储地址时，确定所述操作数已保存在本地内存组件上，在地址对应关系中未包含全部操作数在外部存储空间上的存储地址时，确定操作数未保存在本地内存组件上。具体地，可以在数据地址信息表中记录的外部存储空间上的存储地址中查找操作数是否已保存在本地内存组件上，换言之，假设之前存储过要加载的操作数，那么会在数据地址信息表中记录有操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址的对应关系，在下一次要加载同样的操作数时，如果发现数据地址信息表中记录的外部存储空间上的存储地址包含要加载的操作数在外部存储空间上的存储地址，那么就说明要加载的操作数已经存储在本地内存组件上了，直接使用就可以了，不需要重复加载。
示例性的,有些情况下操作数可能不仅仅是一个数,而有可能是多个数或者包含多个数的向量、矩阵、张量,等等。在这种情况下,数据加载指令绑定的操作数在外部存储空间上的存储地址可以为一段存储空间的地址,在地址对应关系中的外部存储空间上的存储地址完全包含数据加载指令绑定的操作数在外部存储空间上的存储地址时,可以确定操作数已保存在本地内存组件上;若地址对应关系中的外部存储空间上的存储地址不包含或者仅包含一部分数据加载指令绑定的操作数在外部存储空间上的存储地址时,可以确定操作数未保存在本地内存组件上。
在一种可能的实现方式中,检查两段地址之间是否为包含关系的方法可以不用遍历操作数中的所有数据的地址进行检查,而是只需要检查操作数的两个点的数据的地址是否落在数据地址信息表中记录的任意一条地址对应关系中的外部存储空间的存储地址上即可。举例来说,如果操作数为矩阵,只要检查矩阵对角线上的两个顶点的数据的存储地址是否被数据地址信息表中记录的任意一条地址对应关系中的外部存储空间的存储地址包含即可,不需要检查矩阵中的每一个数据的存储地址是否被数据地址信息表中记录的任意一条地址对应关系中的外部存储空间的存储地址包含。推广至N维空间,在N维空间中两个平行的超立方,也只需要检查操作数的主对角线上的两个顶点的数据的存储地址是否被数据地址信息表中记录的任意一条地址对应关系中的外部存储空间的存储地址包含即可。每一个表项的硬件结构除了表项记录所需的寄存器外,还可以配备两个判别器,两个判别器可以用于判断两个对角线的顶点是否满足包含条件,如果两个判别器均给出肯定判别,则认为表项命中,也就是说待查询的操作数在外部存储空间上的存储地址落入(表项)地址对应关系中的外部存储空间的存储地址中,表明待查询的操作数已保存在本地内存组件上。举例来说,假设:
记录表项10000[10,11][1,2][20,21],
待查询项10053[4,5][6,7][18,19]
由记录表项的粒度标识,可以知道地址为10000+21*x1+x0的数据位于此张量内的条件为:
0<=x0<21
2<=x0<2+11
0<=x1<20
1<=x1<1+10
由待查询项的粒度标识,可以知道地址为10053+19*y1+y0的数据位于此张量内的条件为:
0<=y0<19
7<=y0<7+5
0<=y1<18
6<=y1<6+4
检查待查询项在主对角线上的两个顶点:y0,y1同时取极小值的点,和y0,y1同时取极大值的点,也分别对应着数据地址范围中的最小值和最大值。最小值为y0=7,y1=6,地址为10174;最大值为y0=11,y1=9,地址为10235。
检查10174和10235是否位于记录表项内部,首先要反求坐标x0和x1。令
10000+21*x1+x0=10174
21*x1+x0=174
因为,低维度变量(x0)的常数(1)总是高维度变量(x1)的常数(21)的因数,求解这个方程只需要做整数除法即可。(维度为1时可以直接得解;维度为2时需要一次整数除法;维度为n时需要连续做n-1次整数除法,每一次将余数作为被除数,从高维度向低维度依次赋值)
174/21=8余6,舍去尾数,令x1=8,则x0=6。如此即可得到x的唯一解。
接下来判断x1=8,x0=6是否满足位于张量内部的条件。由于1<=x1<11,2<=x0<13,这个点是位于张量内部的。
如上判别器需要一个减法器(10174-10000)、n个整数除法器、2n个比较器即可实现。n为最大维度数，通常在8以内。
两个判别器对两个顶点分别进行判断。如果两个判别器均给出肯定判别,则认为表项命中。
每一个TTT(即下文所述的张量置换表)内不需要预留很多项，例如可以为8~32项，因为运算中处理的张量数量不多。做查询时，首先将极大、极小两个地址计算出来，将地址广播至每一个TTT中每一项记录的两个判别器，所有判别器都同时工作，TTT只需要返回任意一项给出肯定判别的表项。
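上述判别过程可以用如下示意性的Python草图复现。草图中把表项组织为(基地址, 各维局部大小, 各维起始位置, 各维全局大小)的形式，这一字段组织方式是本说明为复现上面的数值示例所作的假设；判别逻辑与文中的减法器、整数除法器、比较器一一对应。

# 示意性草图：只检查待查询张量主对角线上两个顶点的地址是否落在记录表项覆盖的范围内。
def corner_addresses(base, sizes, starts, globals_):
    """返回各维同时取极小值、极大值的两个顶点的线性地址。"""
    lo, hi = base, base
    stride = 1
    for size, start, g in reversed(list(zip(sizes, starts, globals_))):  # 自低维向高维累积步长
        lo += start * stride
        hi += (start + size - 1) * stride
        stride *= g
    return lo, hi

def inside(entry, addr):
    base, sizes, starts, globals_ = entry
    offset = addr - base                        # 对应文中的一个减法器
    strides, stride = [], 1
    for g in reversed(globals_):
        strides.append(stride)
        stride *= g
    coords = []
    for s in reversed(strides):                 # 逐维整数除法，从高维向低维求坐标，余数继续作被除数
        coords.append(offset // s)              # （最低维除以1可直接得解，对应文中的 n-1 次整数除法）
        offset %= s
    # 对应文中的 2n 个比较器：每个坐标都要落在 [起始位置, 起始位置+局部大小) 内
    return all(st <= c < st + sz for c, st, sz in zip(coords, starts, sizes))

record = (10000, [10, 11], [1, 2], [20, 21])    # 记录表项 10000[10,11][1,2][20,21]
query  = (10053, [4, 5],  [6, 7], [18, 19])     # 待查询项 10053[4,5][6,7][18,19]
lo, hi = corner_addresses(*query)
print(lo, hi)                                   # 与文中一致：10174 10235
print(inside(record, lo), inside(record, hi))   # 两个判别器都给出肯定判别时才算表项命中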
对于步骤S12，若确定操作数已保存在本地内存组件上，则可以根据操作数在外部存储空间上的存储地址和数据地址信息表中记录的地址对应关系确定操作数在本地内存组件上的存储地址。具体可以为：将所述地址对应关系中，与所述操作数在外部存储空间上的存储地址对应的本地内存组件上的存储地址，作为所述操作数在本地内存组件上的存储地址。举例来说，如表1所示，若操作数在外部存储空间的存储地址为Out_addr1，那么根据表1中的地址对应关系可以确定操作数在本地内存组件上的存储地址为In_addr1；或者，若操作数在外部存储空间的存储地址为Out_addr1中的一部分，那么根据地址对应关系可以确定In_addr1中相应的部分为操作数在本地内存组件上的存储地址，具体地，Out_addr1为addr11~addr12，操作数在外部存储空间的存储地址为addr11~addr12中的一段addr13~addr14，那么In_addr1中与addr13~addr14段对应的地址为操作数在本地内存组件上的存储地址。
对于步骤S13,其中的获取所述操作数的指令可以是指数据加载指令,在步骤S12中确定了操作数在本地内存组件上的存储地址后,可以将操作数在本地内存组件上的存储地址绑定到与操作数对应的数据加载指令上,这样,处理器可以直接执行数据加载指令,从本地内存组件上获取操作数,省去了从外部存储空间加载操作数到本地内存组件的过程,节省带宽资源。
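步骤S11~S13的整体流程可以概括为如下示意性的Python草图。为简化起见，地址用一维的“起始地址+大小”区间表示，类名与字段名均为本说明的假设；多维张量的包含判断参见前面的判别器示例。

# 示意性草图：先查表（S11），命中则换算本地地址（S12）并绑定到获取操作数的指令上（S13），
# 未命中则按正常流程从外部存储空间加载（对应后文的步骤S14）。
class DataAddressTable:
    def __init__(self):
        self.entries = []                       # 每项: [外部起始地址, 本地起始地址, 大小, 是否有效]
    def record(self, ext_addr, local_addr, size):
        self.entries.append([ext_addr, local_addr, size, True])
    def lookup(self, ext_addr, size):
        for e_ext, e_local, e_size, valid in self.entries:
            # 仅当表项有效、且其外部地址范围完全包含待加载操作数的地址范围时才算命中
            if valid and e_ext <= ext_addr and ext_addr + size <= e_ext + e_size:
                return e_local + (ext_addr - e_ext)     # S12：按偏移换算出本地地址
        return None

table = DataAddressTable()
table.record(ext_addr=0x1000, local_addr=0x80, size=256)     # 先前加载时记录的地址对应关系
local = table.lookup(ext_addr=0x1040, size=64)
if local is not None:
    load_instruction_addr = local               # S13：把本地地址赋值给获取操作数的指令
    print(hex(local))                           # 输出 0xc0，直接使用本地内存组件上的数据
else:
    print("未保存在本地内存组件上，需从外部存储空间加载")   # 对应步骤S14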
图18示出根据本公开一实施例的操作数的获取方法的流程图。如图18所示,所述方法还可以包括:
步骤S14,若操作数未保存在本地内存组件上,则根据所述操作数的存储地址生成加载操作数的控制信号,所述加载操作数的控制信号用于将所述操作数从所述操作数的存储地址加载到本地内存组件上。
如果操作数没有保存在本地内存组件上,则可以按照正常的过程将操作数从外部存储空间加载到本地内存组件上。具体过程可以为,可以在本地内存组件上为操作数分配内存空间,确定分配的内存空间的地址,根据数据加载指令绑定的操作数的存储地址以及分配的内存空间的地址生成加载操作数的控制信号,将加载操作数的控制信号发送给DMA,DMA根据控制信号将操作数从操作数的存储地址加载到本地内存组件上。
在一种可能的实现方式中,如图18所述,所述方法还可以包括:
步骤S15,当从外部存储空间上加载操作数到本地内存组件时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新所述数据地址信息表。
在一种可能的实现方式中，加载的操作数覆盖了本地内存组件上原来存储的操作数，可以用加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址的对应关系，替换数据地址信息表中上述原来存储的操作数的地址对应关系。具体过程也可以为：先判断加载的操作数在外部存储空间上的存储地址与地址对应关系中的外部存储空间上的存储地址是否存在重叠，如果存在重叠，则可以将原来记录的地址对应关系置为无效，并记录新加载的操作数的地址对应关系，也就是记录加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址的对应关系。
举例来说,如表1所示,处理器将In_addr1的内存空间分配给了上述操作数,加载操作数后覆盖了In_addr1的内存空间处原来存储的数据,此时,可以将数据地址信息表中Out_addr1和In_addr1的地址对应关系无效,替换为Out_addr3和In_addr1的地址对应关系。需要说明的是,以上仅仅是本公开的一个示例,不以任何方式限制本公开,例如,In_addr1表示的是一段内存空间,处理器只是分配了其中的一部分内存空间In_addr3给上述操作数,那么可以采用Out_addr3和In_addr3的地址对应关系替换原来的Out_addr1和In_addr1的地址对应关系。
在一种可能的实现方式中，可以将数据地址信息表中原来的地址对应关系替换为：加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址的对应关系。在本实施方式中，数据地址信息表中只记录最近一次加载的操作数的地址对应关系，因此，在从外部存储空间上加载操作数到本地内存组件时，直接进行上述替换即可。具体过程也可以包括上述置为无效的过程，也就是可以设置老化时间：在记录了一条地址对应关系后开始计时，到达老化时间时，将相应的地址对应关系设置为无效。这样，即使在要加载新的操作数时，查找到数据地址信息表中记录了本地内存组件已经保存了要加载的操作数，但由于地址对应关系已经无效，返回的结果仍然是该操作数未保存在本地内存组件上。
其中,老化时间的长短可以根据对带宽和效率的需求平衡而设置,本公开对老化时间的长短不作具体限定。在一种可能的实现方式中,老化时间可以设置为大于或等于两个流水线周期,一个流水线周期可以是指运算节点的流水线向前传播一级需要的时间。
也就是说，对于步骤S11，在地址对应关系有效，且地址对应关系中的在外部存储空间上的存储地址包含要加载的操作数在外部存储空间上的存储地址时，才会返回操作数已保存在本地内存组件上的结果；以上两个条件中的任何一个不满足，都不会返回操作数已保存在本地内存组件上的结果。比如说，地址对应关系无效时，不会返回操作数已保存在本地内存组件上的结果；或者，虽然地址对应关系有效，但地址对应关系中的在外部存储空间上的存储地址不包含要加载的操作数在外部存储空间上的存储地址，也不会返回操作数已保存在本地内存组件上的结果。
在一种可能的实现方式中,还可以在数据地址信息表中记录地址对应关系的无效标识位,无效标识位可以表示地址对应关系是否有效,例如,无效标识位为1表示有效,为0可以表示无效。相应的,在记录一条地址对应关系后,可以设置对应的无效标识位为1,到达老化时间时,将无效标识设置为0。
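老化时间与无效标识位的配合可以用如下示意性草图表示。此处按文中给出的下限，把老化时间假设为2个流水线周期，类名与字段名均为示意性假设。

# 示意性草图：每条地址对应关系带有记录时刻与无效标识位；到达老化时间后即视为无效，
# 即使地址范围匹配，也返回“未保存在本地内存组件上”。
AGING_PERIOD = 2                                 # 假设老化时间为 2 个流水线周期

class TableEntry:
    def __init__(self, ext_addr, local_addr, size, cycle):
        self.ext_addr, self.local_addr, self.size = ext_addr, local_addr, size
        self.birth = cycle
        self.valid = True                        # 无效标识位：True 表示有效

    def lookup(self, ext_addr, size, cycle):
        if cycle - self.birth >= AGING_PERIOD:
            self.valid = False                   # 到达老化时间，将地址对应关系置为无效
        if not self.valid:
            return None
        if self.ext_addr <= ext_addr and ext_addr + size <= self.ext_addr + self.size:
            return self.local_addr + (ext_addr - self.ext_addr)
        return None

entry = TableEntry(ext_addr=0x2000, local_addr=0x100, size=128, cycle=0)
print(entry.lookup(0x2000, 64, cycle=1))   # 未超过老化时间且地址被包含 -> 返回本地地址 256
print(entry.lookup(0x2000, 64, cycle=3))   # 已超过老化时间 -> None，视为未保存在本地内存组件上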
根据本公开上述实施方式的操作数的获取方法,在操作数已保存在本地内存组件上时,处理器可以直接执行数据加载指令,从本地内存组件上获取操作数,省去了从外部存储空间加载操作数到本地内存组件的过程,节省带宽资源。
在一种可能的实现方式中,本公开的方法可以应用于运算装置,该运算装置可以包括:多层运算节点,每一个运算节点包括本地内存组件、处理器以及下一层运算节点,所述外部存储空间可以为所述运算节点的上一层运算节点的内存组件或者下一层运算节点的内存组件。
下面结合图3所示的运算装置对本申请的实施方式进行说明,在一种可能的实现方式中,运算装置中可以设置有张量置换表(数据地址信息表的一个示例),张量置换表中可以记录静态内存段中存储的操作数在外部存储空间上的存储地址和在静态内存段中的存储地址的对应关系,此处的外部存储空间可以指上一层运算节点的内存组件。
SD在静态内存段中为共用操作数分配内存空间之前，可以先在张量置换表中查找共用操作数是否已保存在本地内存组件的静态内存段上，若已经保存在了本地内存组件的静态内存段上，则根据共用操作数在外部存储空间上的存储地址(操作数在上一层运算节点的内存组件上的存储地址)和张量置换表确定所述共用操作数在本地内存组件上的存储地址；将所述共用操作数在本地内存组件上的存储地址赋值给上述头部指令。
对于图15中对应的操作数分配地址的实施方式,在本公开的实施方式中可以设置多个张量置换表分别记录静态内存段的不同端存储的操作数在外部存储空间上的存储地址和在静态内存段中的存储地址的对应关系。这样,步骤S15,可以包括:当从外部存储空间上加载操作数到所述静态内存段时,根据第二计数器的计数值确定待更新的数据地址信息表(张量置换表);根据加载的操作数在外部存储空间上的存储地址和在静态内存段上的存储地址更新所述待更新数据地址信息表(张量置换表)。 其中的外部存储空间可以是当前运算节点的上一层运算节点的内存组件。
举例来说,运算节点中可以设置有张量置换表1和张量置换表2,张量置换表1用于记录静态内存段左侧一端存储的操作数的地址的对应关系,张量置换表2用于记录静态内存段右侧一端存储的操作数的地址的对应关系。
以上文中的示例为例,SD从SQ中获取输入指令1,对输入指令1进行串行分解后得到多个串行子指令1,多个串行子指令1共用操作数1,SD要从静态内存段中为操作数1分配内存空间,SD在张量置换表1和张量置换表2中查找共用操作数1是否已保存在静态内存段上,若没有保存在静态内存段上,假设此时计数器2的计数值为0,那么SD可以从图15所示的左侧一端为操作数1分配内存空间,并在张量置换表1中记录共用操作数1在上一层运算节点的内存组件中的存储地址与本地内存组件中的存储地址的对应关系。
SD从SQ中获取输入指令2，对输入指令2进行串行分解后得到多个串行子指令2，多个串行子指令2共用操作数2，SD要从静态内存段中为操作数2分配内存空间，SD在张量置换表1和张量置换表2中查找共用操作数2是否已保存在静态内存段上，若没有保存在静态内存段上，假设此时计数器2的计数值为1，那么SD可以从图15所示的右侧一端为操作数2分配内存空间，并在张量置换表2中记录共用操作数2在上一层运算节点的内存组件中的存储地址与本地内存组件中的存储地址的对应关系。
在张量置换表中记录地址对应关系后,SD可以分别设置与地址对应关系相应计时器开始计时,在计时器到达老化时间时,SD可以设置与计时器相应的地址对应关系无效。如上所述的示例,针对共用操作数1的地址对应关系,可以设置计时器1,针对共用操作数2的地址对应关系可以设置计时器2,在计时器1、计时器2到达老化时间之前,共用操作数1的地址对应关系和共用操作数2的地址对应关系都是有效的,在计时器1到达老化时间后,可以设置共用操作数1的地址对应关系无效,在计时器2到达老化时间后,可以设置共用操作数2的地址对应关系无效。
SD从SQ中获取输入指令3,对输入指令3进行串行分解后得到多个串行子指令3,多个串行子指令3共用操作数3,SD要从静态内存段中为操作数3分配内存空间,SD在张量置换表1和张量置换表2中查找共用操作数3是否已保存在静态内存段上,若查找到已保存的共用操作数1中的一部分为共用操作数3,则直接将与共用操作数3对应的共用操作数1的存储地址绑定到头部指令上。
需要说明的是，如果共用操作数1的地址对应关系无效，是不会返回共用操作数3已保存在静态内存段上的结果的；只有在共用操作数1的地址对应关系相应的计时器1未到达老化时间，且共用操作数1的地址对应关系中的在外部存储空间上的存储地址包含共用操作数3在外部存储空间上的存储地址时，才会返回共用操作数3已存储在静态内存段上的结果。
通过上述实施方式的内存分配方式,可以在降低内存管理复杂性,并且提高内存空间利用率的同时,节省带宽资源。
对于本实施方式的内存管理方式，可以设置多个张量置换表(数据地址信息表的示例)分别记录循环内存段的不同子内存块存储的操作数。DD在循环内存段上为操作数分配内存空间之前，可以先在与循环内存段对应的多个张量置换表中查找操作数是否已保存在本地内存组件的循环内存段上，若已经保存在了本地内存组件的循环内存段上，则根据张量置换表确定所述操作数在本地内存组件上的存储地址，将所述操作数在本地内存组件上的存储地址赋值给获取操作数的指令；若未保存在本地内存组件的循环内存段上，则加载数据。
在图12所示的内存管理方式的实施方式中,同样可以在张量置换表中记录地址对应关系的无效标识位,并且,在记录一条地址对应关系后,可以设置计时器进行计时,在计时器到达老化时间时,将地址对应关系设置为无效。而且,在张量置换表中的地址对应关系为有效,且地址对应关系中的在外部存储空间上的存储地址包含要加载的操作数在外部存储空间上的存储地址,才会返回要加载的操作数已经保存在了本地内存组件的循环内存段上的结果。
在本实施方式中,步骤S15可以包括:当从外部存储空间上加载操作数到循环内存段上的多个子内存块中的任一子内存块时,DD可以根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新与所述任一子内存块对应的数据地址信息表(张量置换表)。
例如,对于任一子内存块,分别设置与该任一子内存块对应的张量置换表,对于包含3个子内存块的示例:循环内存段0、循环内存段1和循环内存段2,可以设置张量置换表4、张量置换表5和张量置换表6分别与循环内存段0、循环内存段1和循环内存段2对应。这样,当从外部存储空间上加载操作数到循环内存段0时,根据加载的操作数在外部存储空间上的存储地址和在本地内存组件上的存储地址更新张量置换表4。
在一种可能的实现方式中,所述处理器中设置有第三计数器,所述循环内存段包括多段子内存块,所述处理器在所述循环内存段中为所述多个子指令的其他操作数分配内存空间,包括:所述处理器从所述循环内存段中与所述第三计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
在一种可能的实现方式中,处理器中的DD在对所述多个子指令进行指令译码过程中,从所述循环内存段中与所述第三计数器的计数值对应的子内存块内,为所述其他操作数分配内存空间。
如图12所示,将循环内存段划分为多段子内存块,例如3段子内存块,所述3段子内存块的内存容量大小可以相同,也可以不同,本公开对此不作限定。处理器中可以设置有计数器3,DD从SQ中获取串行子指令后,对于串行子指令中的主体指令,可以按照主体指令以及计数器3的计数值顺序为其分配循环内存段的内存空间,在分配内存空间之前,DD可以在与循环内存段对应的多个张量置换表中查找操作数是否已保存在本地内存组件的循环内存段上,若已经保存在了本地内存组件的循环内存段上,则将所述操作数在本地内存组件上的存储地址赋值给获取操作数的指令。
举例来说，若获取了一条主体指令1，在张量置换表4、张量置换表5和张量置换表6中查找主体指令1的操作数是否已保存在本地内存组件的循环内存段上，若未保存在循环内存段上，且计数器3的计数值为0，那么DD将在循环内存段0中为主体指令1的操作数分配内存空间；然后获取了一条主体指令2，在张量置换表4、张量置换表5和张量置换表6中查找主体指令2的操作数是否已保存在本地内存组件的循环内存段上，若未保存在循环内存段上，且此时计数器3的计数值为1，那么DD将在循环内存段1中为主体指令2的操作数分配内存空间；然后获取了一条主体指令3，在张量置换表4、张量置换表5和张量置换表6中查找主体指令3的操作数是否已保存在本地内存组件的循环内存段上，若保存在循环内存段上，那么DD将操作数在本地内存组件上的存储地址赋值给获取操作数的指令，这样在执行主体指令3时可以直接从本地内存组件的循环内存段上获取操作数，不需要由DMAC控制从上一层运算节点加载到本地内存组件的循环内存段上。
通过上述实施方式的内存分配方式，可以在降低内存管理复杂性、提高内存空间利用率的同时，节省带宽资源。
在一种可能的实现方式中,本公开的操作数的获取方法支持以“流水线前递”的形式进行数据重用,下一条指令可使用前一条指令的结果作为输入,从而使两条指令在流水线执行时没有气泡阻隔。
举例说明。现在有两条指令:
ELTW A,B;
ELTW B,C
假设它们都不需要RD。
在没有张量置换表时,B需要先被第一条指令WB,然后再被第二条指令LD。流水线是:
ID LD EX RD WB;
__ __ __ __ ID LD EX RD WB;
加入了张量置换表之后，张量置换表会记录第一条指令的输出操作数B在本地内存组件上存储的地址，且输出操作数会在EX阶段结束后准备完毕；相应地，第二条指令的输入操作数地址被替换为本地内存组件上的地址后，LD阶段变为空泡，EX作为指令的初始阶段被直接安排在数据准备完毕的那一拍。流水线是：
ID LD EX RD WB;
__ ID LD EX RD WB;
流水线的执行变得和没有依赖一样,数据被从第一条指令的EX直接传递至第二条指令的EX。这种技术在传统静态流水线处理器里被称为“流水线前递”,是通过增加额外数据通路实现的,而在本案通过张量置换表实现了相同的效果,相比于传统的静态流水线可以简化数据通路,降低处理器结构的复杂度。
为了更好的实现可以分解的运算,本公开还提供了一种指令集架构,该指令集架构中的指令在执行时是可以分解的。
对于上文所述的可以分解的运算,对运算分解后,对应的输入指令也被分解为多条子指令,执行子指令可以完成输入指令的操作数中部分操作数的运算。
在一种可能的实现方式中,所述处理器还用于根据多个子指令生成对应的多个控制信号,并将多个控制信号发送给内存控制器;所述内存控制器根据每个控制信号控制所述数据通路,从上一层运算节点的内存组件中加载该控制信号对应的子指令的操作数到本地内存组件。
对于任意一个运算节点,其中的处理器可以接收上一层运算节点发送的输入指令或者通过其他方式输入(例如用户编程)的输入指令。输入指令可以包括:运算符、操作数参数,所述操作数参数可以是指向输入指令的操作数的参数,所述操作数参数可以包括全局参数和局部参数,全局参数是表示输入指令对应的第一操作数的大小的参数,局部参数是表示输入指令的第二操作数在所述第一操作数中的起始位置和第二操作数的大小的参数。也就是说,第二操作数可以是第一操作数中的部分数据或者全部数据,执行输入指令时可以实现对第二操作数的处理,对第二操作数的处理可以是与输入指令的运算符对应的处理。
在一种可能的实现方式中,所述内存控制器用于根据所述操作数参数从所述任意一个运算节点的上一层运算节点的内存组件加载多个子指令对应的第一操作数中的第二操作数到所述本地内存组件。
也就是说,本公开的运算装置采用的指令可以是一个三元组<O,P,G>,其中,O表示运算符,P表示一个操作数的有限集,G表示粒度指标,具体的表现形式可以为“O,P[N][n1][n2]”,其中,N可以为正整数,表示全局参数,根据张量维度的不同可以设置多个不同的N,n1和n2为小于N的自然数,表示局部参数,其中,n1表示对操作数进行运算时的起始位置,n2表示大小,执行上述指令可以实现对操作数P中n1到n1+n2的操作数的运算O,同样的,根据张量维度的不同可以设置多个不同的n1和n2。
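为直观起见，下面用一段示意性的Python草图表示该三元组指令“O, P[N][n1][n2]”的组织方式，并检查局部参数是否落在全局参数所表示的范围内。其中的类名、字段名以及一维数值示例均为本说明的假设。

# 示意性草图：三元组指令 <O, P, G>，表现形式为 “O, P[N][n1][n2]”。
from dataclasses import dataclass
from typing import List

@dataclass
class OperandParam:
    global_size: List[int]    # 全局参数 N：第一操作数各维度的大小
    start: List[int]          # 局部参数 n1：第二操作数在第一操作数中的起始位置
    size: List[int]           # 局部参数 n2：第二操作数各维度的大小
    def check(self):
        for n, n1, n2 in zip(self.global_size, self.start, self.size):
            assert 0 <= n1 and n1 + n2 <= n, "局部参数超出全局参数表示的范围"

@dataclass
class Instruction:
    op: str                        # 运算符 O
    operands: List[OperandParam]   # 操作数的有限集 P（带粒度指标 G）

# 对应文中的输入指令 C = “ADD, A[N][0][N], B[N][0][N]”（此处取 N=12 作数值示例）
N = 12
C = Instruction("ADD", [OperandParam([N], [0], [N]), OperandParam([N], [0], [N])])
for p in C.operands:
    p.check()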
本公开的运算装置的每一层接收到的输入指令的格式都是相同的，因此，可以自动完成指令的分解、执行指令对应的操作，等等。不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构，能够执行相同格式的程序，层与层之间隐式装载数据，降低用户编程的复杂度，且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
以图3所示的示例为例对本公开的运算装置采用的指令集架构进行说明,假设第一层运算节点接收的外部输入的输入指令为C=“ADD,A[N][0][N],B[N][0][N]”,其中,“ADD”表示运算符,A[N][0][N]和B[N][0][N]为操作数及操作数参数,第一个N表示操作数A和B的大小,“0”表示对操作数A和B执行加运算时的起始位置,第二个N表示执行加运算的操作数的大小。执行上述指令可以实现对操作数A和B的加运算。
在一种可能的实现方式中,任意一个运算节点都可以对输入指令进行分解得到多个子指令,所述输入指令和多个子指令具有相同的格式,至少部分子指令的运算符与输入指令的运算符是相同的。
在一种可能的实现方式中,任意一个运算节点在收到输入指令后,可以根据下一层运算节点的数量对输入指令进行分解得到多个并行子指令,执行一个并行子指令可以完成输入指令对应的操作数的部分操作数的运算,执行全部并行子指令可以完成输入指令对应的运算。
第一层运算节点可以根据下一层运算节点的数量对接收到的输入指令进行分解得到多个并行子指令,如图1所示,第一层运算节点包括3个下一层运算节点,因此,可以将上述输入指令分解为至少三个并行子指令:
C1=“ADD,A[N][0][N/3],B[N][0][N/3]”,
C2=“ADD,A[N][(N/3)+1][N/3],B[N][(N/3)+1][N/3]”,
C3=“ADD,A[N][(2N/3)+1][N/3],B[N][(2N/3)+1][N/3]”,
C1、C2和C3与C的格式都相同。
第一层运算节点可以将分解后的并行子指令发送给下一层运算节点,下一层运算节点接收到并行子指令C1、C2和C3,可以进行类似的分解,直到最后一层运算节点。
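作为示意，下面的Python草图按下一层运算节点的数量，把一维加法指令均匀切分成格式相同的并行子指令，各并行子指令对应的操作数区间互不重叠。切分边界按 0 起始、最后一份吸收余量处理，这一取整方式是本说明的假设，与文中 C1~C3 采用的写法可能略有出入。

# 示意性草图：把 “ADD, A[N][n1][n2], B[N][n1][n2]” 按 m 个下一层运算节点切分成 m 条并行子指令。
def parallel_decompose(op, global_size, start, size, m):
    subs, step = [], size // m
    for k in range(m):
        sub_start = start + k * step
        sub_size = step if k < m - 1 else size - step * (m - 1)   # 最后一份吸收不能整除的余量
        subs.append(f"{op}, A[{global_size}][{sub_start}][{sub_size}], "
                    f"B[{global_size}][{sub_start}][{sub_size}]")
    return subs

for sub in parallel_decompose("ADD", 12, 0, 12, 3):
    print(sub)
# ADD, A[12][0][4], B[12][0][4]
# ADD, A[12][4][4], B[12][4][4]
# ADD, A[12][8][4], B[12][8][4]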
对于操作数的存储，为了避免不同层之间频繁的数据交换，任意一个(当前)运算节点在接收到上一层运算节点发送的输入指令后，可以根据输入指令的操作数参数从上一层运算节点的内存组件中读取相应的操作数，并保存在当前运算节点的内存组件中，任意一个运算节点在执行完输入指令得到运算结果后，还可以将运算结果写回到上一层运算节点的内存组件中。例如，当前运算节点的处理器可以根据输入指令的操作数参数向内存控制器发送控制信号，内存控制器可以根据控制信号控制当前运算节点的内存组件和上一层运算节点的内存组件之间连接的数据通路，从而将输入指令的操作数加载到当前运算节点的内存组件中。
在一种可能的实现方式中,任意一个运算节点的所述内存控制器包括第一内存控制器和第二内存控制器,第一内存控制器可以通过第二内存控制器(例如,DMA,Direct Memory Access,直接内存存取)连接数据通路,第一内存控制器可以为DMAC(Direct Memory Access controller),第一内存控制器可以根据控制信号生成加载指令,将加载指令发送给第二内存控制器,由第二内存控制器根据加载指令控制数据通路,实现数据的加载。第一内存控制器可以通过硬件电路或者软件程序的方式实现,本公开对此不作限定。
第一内存控制器可以根据控制信号确定基地址、起始偏移量、加载数据的数量、跳转的偏移量等参数，然后根据基地址、起始偏移量、加载数据的数量、跳转的偏移量等参数生成加载指令，还可以根据操作数的维度设置循环加载数据的次数。其中，基地址可以是原操作数在内存组件中存储的起始地址；起始偏移量可以为要读的操作数在原操作数中开始的位置，起始偏移量可以根据局部参数中的起始位置确定；加载数据的数量可以为从起始偏移量开始加载的操作数的个数，加载数据的数量可以根据局部参数中的大小确定；跳转的偏移量表示下一部分要读的操作数在原操作数中开始的位置相对于上一部分读的操作数在原始操作数中开始的位置之间的偏移，也就是说，跳转的偏移量为下一部分读取数据的起始偏移量相对于上一部分读取数据的起始偏移量的偏移量，跳转的偏移量可以根据全局参数或局部参数确定。
举例来说,可以将起始位置作为起始偏移量,将局部参数中的大小作为一次加载的数据的数量,可以将局部参数中的大小作为跳转的偏移量。
在一种可能的实现方式中,可以根据基地址以及起始偏移量确定开始读取操作数的起始地址,根据加载数据的数量以及起始地址可以确定一次读取操作数的结束地址,根据起始地址以及跳转的偏移量可以确定下一部分要读的操作数的起始地址,同样的,可以根据加载数据的数量以及下一部分要读的操作数的起始地址确定本次读取操作数的结束位置……重复以上过程,直到达到循环加载操作数的次数。其中的一次读取操作数和本次读取操作数可以是指:读取同一个操作数需要一次或多次完成,每次读取同一个操作数中的部分操作数,上述一次和本次可以是指多次中的一次。
也就是说,读取一个操作数可能需要循环多次读取完成,第一内存控制器可以根据基地址、起始偏移量、加载数据的数量、跳转的偏移量确定每次读取操作数时的起始地址和结束地址,例如,针对每次读取过程,可以根据上一次读取过程的起始地址和跳转的偏移量确定本次读取过程的起始地址,可以根据本次读取过程的起始地址和加载数据的数量(以及数据的格式)确定本地读取过程的结束地址。其中,跳转的偏移量可以根据跳转的数据的数量以及数据的格式确定。
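基地址、起始偏移量、加载数据的数量、跳转的偏移量与循环次数之间的配合，可以用下面这段示意性草图说明。地址以元素个数为单位，未考虑数据格式的字节宽度，函数名与参数组织均为本说明的简化假设。

# 示意性草图：按“本次起始地址 = 上一次起始地址 + 跳转偏移量”的规律，生成每次读取的地址区间。
def generate_load_ranges(base, start_offset, count, jump, loops):
    ranges, addr = [], base + start_offset
    for _ in range(loops):
        ranges.append((addr, addr + count))   # 本次读取的 [起始地址, 结束地址)
        addr += jump                          # 下一部分数据相对上一部分的跳转偏移量
    return ranges

# 对应下文操作数 P[M,N][0,0][M,N/2] 的加载：起始偏移 0，每次读 N/2 个数据，跳转偏移 N，循环 M 次
M, N = 4, 8
print(generate_load_ranges(base=0, start_offset=0, count=N // 2, jump=N, loops=M))
# [(0, 4), (8, 12), (16, 20), (24, 28)]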
示例性的,仍然以上文中的示例为例,第二层运算节点在接收到输入指令C1时,处理器可以根据输入指令C1生成控制信号“Load A[N][0][N/3],A’”以及“Load B[N][0][N/3],B’”,其中,A’和B’是处理器在第二层运算节点的内存组件上分配的内存空间。第一内存控制器可以根据控制信号设置起始偏移量为0,加载数据的数量为N/3,由于操作数A为一维向量,所以,可以不设置跳转的偏移量以及循环加载数据的次数。对于操作数B可以采用同样的方式生成加载指令进行数据加载。
示例性的，如图9所示，假设操作数P为M行N列的矩阵P[M,N]，控制信号为“Load P[M,N][0,0][M,N/2]，P’”。第一内存控制器根据控制信号可以设置在行和列方向的起始偏移量均为0，加载数据的数量为N/2，跳转的偏移量为N，循环的次数为M。如图9所示，从第一行第一列开始读取N/2列数据，跳转到第二行第一列读取N/2列数据……循环M次可以完成数据的加载。
需要说明的是,以上示例仅仅是为了说明本公开的运算装置加载数据的方式,不以任何方式限制本公开。
下面结合图5、图9和图10a所示的示例,对本公开加载操作数的过程进行详细的说明。
第i层的一个运算节点的处理器中的SD从IQ中获取输入指令，输入指令的操作数为P[M,N][0,0][M,N/2]，SD确定存储操作数P[M,N][0,0][M,N/2]需要的内存容量大于内存组件的容量，需要对输入指令进行串行分解。根据图5所示的过程确定分解的粒度为M、N/4，也就是说，串行子指令的操作数分别为P[M,N][0,0][M,N/4]和P[M,N][0,(N/4)+1][M,N/4]。SD将串行子指令输出到SQ，DD从SQ中获取串行子指令。DD可以为串行子指令的操作数分配内存空间，并将分配的内存空间的地址(本地地址)绑定到串行子指令中获取操作数的指令上，也就是说，DD可以生成控制信号：
Load P[M,N][0,0][M,N/4],P1’;
第一内存控制器根据控制信号可以设置在行和列方向的起始偏移量均为0，加载数据的数量为N/4，跳转的偏移量为N，循环的次数为M。如图9所示，从第一行第一列开始读取N/4列数据写到本地内存组件P1’的位置，跳转到第二行第一列读取N/4列数据……循环M次可以完成数据的加载。第一内存控制器可以根据确定的基地址、起始偏移量、加载数据的数量、跳转的偏移量等参数生成加载指令，将加载指令发送给第二内存控制器，第二内存控制器根据加载指令以上述方式读取操作数并写入到本地内存组件中。
DD在获取到与操作数P[M,N][0,(N/4)+1][M,N/2]对应的串行子指令时,还可以生成控制信号:
Load P[M,N][0,(N/4)+1][M,N/4],P2’;
第一内存控制器根据控制信号可以设置在行方向的起始偏移量为0，在列方向的起始偏移量为(N/4)+1，加载数据的数量为N/4，跳转的偏移量为N，循环的次数为M。如图9所示，从第一行第(N/4)+1列开始读取N/4列数据写到本地内存组件P2’的位置，跳转到第二行第(N/4)+1列读取N/4列数据……循环M次可以完成数据的加载。
需要说明的是,以上仅仅是为了更清楚的说明本公开的数据加载的方法而列举的示例,不以任何方式限制本公开。
在一种可能的实现方式中,所述任意一个运算节点的内存组件包括静态内存段以及动态内存段,若所述输入指令的操作数包括共用操作数以及其他操作数,则串行分解器根据所述共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与动态内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令。
其中,所述共用操作数为所述串行子指令共同使用的操作数,其他操作数为所述输入指令的操作数中除了所述共用操作数以外的数据,静态内存段的剩余容量可以是指静态内存段中未被使用的容量。
举例来说,对于机器学习中的一些运算,这些运算被分解后的几部分运算之间会共用一部分操作数,对于这部分操作数,本公开称作共用操作数。以矩阵相乘运算作为示例,假设输入指令为对矩阵X和Y相乘,如果仅仅对矩阵X进行分解,那么对输入指令进行串行分解得到的串行子指令需要共同使用操作数Y,操作数Y为共用操作数。
对于共用操作数,本公开的串行分解器SD可以在进行串行分解时生成一条提示性指令(“装载”),并在提示性指令中指明将共用操作数装载到静态内存段中,DD将提示性指令作为一条只需要装载数据至静态内存段、而无需执行、规约或写回的普通串行子指令处理,DD根据提示性指令向第一内存控制器发送第一控制信号以将共用操作数加载到静态内存段,以避免频繁存取数据、节约带宽资源。对于其他操作数,DD可以生成第二控制信号,DD可以将生成的第二控制信号发送给第一内存控制器,由第一内存控制器根据控制信号控制第二内存控制器将其他操作数加载到动态内存段中。内存控制器加载共用操作数和其他操作数的过程都可以参见上文描述的过程,不再赘述。
本公开的运算装置的每一层接收到的输入指令的格式都是相同的，因此，可以自动完成指令的分解、执行指令对应的操作，等等。不同层的运算节点、不同规模的计算机上都具有相同的编程接口和指令集架构，能够执行相同格式的程序，层与层之间隐式装载数据，降低用户编程的复杂度，且运算装置的扩展或者程序在不同运算装置之间的移植都非常容易。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
进一步需要说明的是,虽然图中的各个模块按照箭头的指示依次显示,但是这些并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,执行顺序并没有严格的顺序限制。
应该理解,上述的装置实施例仅是示意性的,本披露的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本披露各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,处理器可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述内存组件可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本披露的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本披露各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM, Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上已经描述了本公开的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的技术改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。

Claims (10)

  1. 一种运算装置,其特征在于,包括:至少两层运算节点,每一个运算节点包括内存组件、处理器以及下一层运算节点;
    对于任意一个运算节点,所述任意一个运算节点中的处理器用于对所述任意一个运算节点的输入指令进行分解,获得并行子指令,并将并行子指令发送给所述任意一个运算节点的下一层运算节点;
    所述任意一个运算节点还用于从上一层运算节点的内存组件中加载执行所述并行子指令需要的操作数到所述任意一个运算节点的内存组件,以使所述任意一个运算节点的下一层运算节点根据所述操作数并行执行所述并行子指令。
  2. 根据权利要求1所述的运算装置,其特征在于,所述任意一个运算节点还包括:内存控制器,
    所述任意一个运算节点的内存组件与所述任意一个运算节点的上一层运算节点和下一层运算节点的内存组件之间连接有数据通路,所述内存控制器连接所述数据通路,控制所述数据通路将输入指令的操作数从一个内存组件送往另一个内存组件。
  3. 根据权利要求2所述的运算装置,其特征在于,所述处理器包括:串行分解器、并行分解器以及译码器,所述内存控制器连接所述串行分解器和所述译码器;
    其中,所述串行分解器用于根据所述任意一个运算节点的内存组件的容量、以及所述输入指令需要的内存容量,对所述输入指令进行串行分解得到串行子指令;
    所述译码器用于对所述串行子指令进行译码处理后发送给所述并行分解器、并根据串行子指令向所述内存控制器发送控制信号,所述内存控制器根据所述控制信号从上一层运算节点的内存组件中加载执行所述串行子指令需要的操作数到所述任意一个运算节点的内存组件;
    所述并行分解器用于根据所述下一层运算节点的数量,对译码后的串行子指令进行并行分解得到并行子指令,并将并行子指令发送给所述下一层运算节点,以使所述下一层运算节点根据所述操作数执行并行子指令。
  4. 根据权利要求3所述的运算装置,其特征在于,若所述输入指令需要的内存大于所述任意一个运算节点的内存组件的容量,则所述串行分解器根据所述输入指令需要的内存和所述任意一个运算节点的内存组件的容量,对所述输入指令进行串行分解得到串行子指令。
  5. 根据权利要求2-4任意一项所述的运算装置,其特征在于,所述任意一个运算节点的内存组件包括静态内存段以及动态内存段,若所述输入指令的操作数包括共用操作数以及其他操作数,则串行分解器根据所述共用操作数需要的内存容量与所述静态内存段的剩余容量之间的大小关系、以及所述其他操作数需要的内存容量与动态内存段的容量之间的大小关系,对所述输入指令进行串行分解得到串行子指令,
    其中,所述共用操作数为所述串行子指令共同使用的操作数,其他操作数为所述输入指令的操作数中除了所述共用操作数以外的数据。
  6. 根据权利要求5所述的运算装置,其特征在于,分解得到的串行子指令包括头部指令和主体指令,所述译码器根据所述头部指令向所述内存控制器发送第一控制信号,所述内存控制器根据所述第一控制信号从上一层运算节点的内存组件中加载所述共用操作数到所述静态内存段;
    所述译码器根据所述主体指令向所述内存控制器发送第二控制信号,所述内存控制器根据所述第二控制信号从上一层运算节点的内存组件中加载所述其他数据到所述动态内存段。
  7. 根据权利要求3所述的运算装置,其特征在于,并行分解得到的并行子指令对应的操作数之间不存在重叠的部分。
  8. 根据权利要求2-7任意一项所述的运算装置,其特征在于,所述处理器还包括控制单元,所述任意一个运算节点还包括本地处理单元,
    所述控制单元的输入端连接所述译码器的输出端,所述控制单元的输出端连接所述本地处理单元的输入端。
  9. 根据权利要求8所述的运算装置,其特征在于,
    若所述串行子指令存在输出依赖,所述控制单元根据所述串行子指令控制所述本地处理单元对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果;
    其中,所述串行子指令存在输出依赖是指,需要对所述串行子指令的运算结果进行归约处理才能得到所述输入指令的运算结果。
  10. 根据权利要求9所述的运算装置,其特征在于,若所述控制单元检测到对所述下一层运算节点的运算结果进行归约处理所需要的资源大于所述本地处理单元的资源上限,则所述控制单元根据所述串行子指令向所述并行分解器发送委托指令,
    所述并行分解器根据所述委托指令控制所述下一层运算节点对所述下一层运算节点的运算结果进行归约处理得到所述输入指令的运算结果。
PCT/CN2020/083280 2019-04-27 2020-04-03 运算装置 WO2020220935A1 (zh)

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN201910347027 2019-04-27
CN201910347027.0 2019-04-27
CN201910544726.4 2019-06-21
CN201910545271.8 2019-06-21
CN201910544726 2019-06-21
CN201910544723.0 2019-06-21
CN201910545272.2 2019-06-21
CN201910545270.3 2019-06-21
CN201910545270.3A CN111860798A (zh) 2019-04-27 2019-06-21 运算方法、装置及相关产品
CN201910545271 2019-06-21
CN201910545272.2A CN111860799A (zh) 2019-04-27 2019-06-21 运算装置
CN201910544723.0A CN111860797B (zh) 2019-04-27 2019-06-21 运算装置

Publications (1)

Publication Number Publication Date
WO2020220935A1 true WO2020220935A1 (zh) 2020-11-05

Family

ID=72984883

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2020/083280 WO2020220935A1 (zh) 2019-04-27 2020-04-03 运算装置
PCT/CN2020/087043 WO2020221170A1 (zh) 2019-04-27 2020-04-26 分形计算装置、方法、集成电路及板卡

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087043 WO2020221170A1 (zh) 2019-04-27 2020-04-26 分形计算装置、方法、集成电路及板卡

Country Status (4)

Country Link
US (2) US20220261637A1 (zh)
EP (3) EP3998528A1 (zh)
CN (6) CN111860806A (zh)
WO (2) WO2020220935A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469327A (zh) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 执行转数提前的集成电路装置
CN115221102A (zh) * 2021-04-16 2022-10-21 中科寒武纪科技股份有限公司 用于优化片上系统的卷积运算操作的方法和相关产品

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831332A (zh) * 2020-07-16 2020-10-27 中国科学院计算技术研究所 用于智能处理器的控制系统、方法及电子设备
KR20220045828A (ko) * 2020-10-06 2022-04-13 삼성전자주식회사 태스크 수행 방법 및 이를 이용하는 전자 장치

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992329A (zh) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 一种计算方法及相关产品
CN108710943A (zh) * 2018-05-21 2018-10-26 南京大学 一种多层前馈神经网络并行加速器
US20190073590A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Sparse Neural Network Training Optimization
US20190073586A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Nested Machine Learning Architecture

Family Cites Families (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2161001B (en) * 1984-06-25 1988-09-01 Rational Distributed microcode address apparatus for computer
JPH06100968B2 (ja) * 1986-03-25 1994-12-12 日本電気株式会社 情報処理装置
AU7305491A (en) * 1990-01-29 1991-08-21 Teraplex, Inc. Architecture for minimal instruction set computing system
JP2669158B2 (ja) * 1991-01-22 1997-10-27 三菱電機株式会社 データ処理装置
CA2053941A1 (en) * 1991-03-29 1992-09-30 Stamatis Vassiliadis System for preparing instructions for instruction parallel processor and system with mechanism for branching in the middle of a compound instruction
GB9204360D0 (en) * 1992-02-28 1992-04-08 Monro Donald M Fractal coding of data
JP2572522B2 (ja) * 1992-05-12 1997-01-16 インターナショナル・ビジネス・マシーンズ・コーポレイション コンピューティング装置
US6240508B1 (en) * 1992-07-06 2001-05-29 Compaq Computer Corporation Decode and execution synchronized pipeline processing using decode generated memory read queue with stop entry to allow execution generated memory read
US5542058A (en) * 1992-07-06 1996-07-30 Digital Equipment Corporation Pipelined computer with operand context queue to simplify context-dependent execution flow
US5542059A (en) * 1994-01-11 1996-07-30 Exponential Technology, Inc. Dual instruction set processor having a pipeline with a pipestage functional unit that is relocatable in time and sequence order
JPH07219769A (ja) * 1994-02-07 1995-08-18 Mitsubishi Electric Corp マイクロプロセッサ
US5748978A (en) * 1996-05-17 1998-05-05 Advanced Micro Devices, Inc. Byte queue divided into multiple subqueues for optimizing instruction selection logic
JPH11149561A (ja) * 1997-11-14 1999-06-02 Dainippon Printing Co Ltd 非フラクタルなスカラ場の生成方法及び装置
US6304954B1 (en) * 1998-04-20 2001-10-16 Rise Technology Company Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline
US6460068B1 (en) * 1998-05-01 2002-10-01 International Business Machines Corporation Fractal process scheduler for testing applications in a distributed processing system
JP2001014161A (ja) * 1999-06-25 2001-01-19 Matsushita Electric Works Ltd プログラマブルコントローラ
DE69912860T2 (de) * 1999-09-29 2004-11-04 Stmicroelectronics Asia Pacific Pte Ltd. Mehrfachinstanzimplementierung von sprachkodierer-dekodierern
US7093097B2 (en) * 2001-11-27 2006-08-15 International Business Machines Corporation Dynamic self-tuning memory management method and system
US7913069B2 (en) * 2006-02-16 2011-03-22 Vns Portfolio Llc Processor and method for executing a program loop within an instruction word
EP1821199B1 (en) * 2006-02-16 2012-07-04 VNS Portfolio LLC Execution of microloop computer instructions received from an external source
US20080091909A1 (en) * 2006-10-12 2008-04-17 International Business Machines Corporation Method and system to manage virtual machine memory
CN100461094C (zh) * 2007-03-19 2009-02-11 中国人民解放军国防科学技术大学 一种针对流处理器的指令控制方法
US8156307B2 (en) * 2007-08-20 2012-04-10 Convey Computer Multi-processor system having at least one processor that comprises a dynamically reconfigurable instruction set
US8209702B1 (en) * 2007-09-27 2012-06-26 Emc Corporation Task execution using multiple pools of processing threads, each pool dedicated to execute different types of sub-tasks
US8341613B2 (en) * 2007-12-04 2012-12-25 International Business Machines Corporation Reducing stack space consumption via head-call optimization
US20120324462A1 (en) * 2009-10-31 2012-12-20 Rutgers, The State University Of New Jersey Virtual flow pipelining processing architecture
US10949415B2 (en) * 2011-03-31 2021-03-16 International Business Machines Corporation Logging system using persistent memory
US20150019468A1 (en) * 2013-07-09 2015-01-15 Knowmtech, Llc Thermodynamic computing
US9355061B2 (en) * 2014-01-28 2016-05-31 Arm Limited Data processing apparatus and method for performing scan operations
CN105893319A (zh) * 2014-12-12 2016-08-24 上海芯豪微电子有限公司 一种多车道/多核系统和方法
CN105159903B (zh) * 2015-04-29 2018-09-28 北京交通大学 基于分形多级蜂窝网格的大数据组织与管理方法及系统
US20170168819A1 (en) * 2015-12-15 2017-06-15 Intel Corporation Instruction and logic for partial reduction operations
CN105550157B (zh) * 2015-12-24 2017-06-27 中国科学院计算技术研究所 一种分形树结构通信结构、方法、控制装置及智能芯片
CN105630733B (zh) * 2015-12-24 2017-05-03 中国科学院计算技术研究所 分形树中向量数据回传处理单元的装置、方法、控制装置及智能芯片
CN107329936A (zh) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 一种用于执行神经网络运算以及矩阵/向量运算的装置和方法
US10863138B2 (en) * 2016-05-31 2020-12-08 Intel Corporation Single pass parallel encryption method and apparatus
US10761851B2 (en) * 2017-12-22 2020-09-01 Alibaba Group Holding Limited Memory apparatus and method for controlling the same
US10540227B2 (en) * 2018-01-03 2020-01-21 Hewlett Packard Enterprise Development Lp Sequential memory access on a high performance computing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992329A (zh) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 一种计算方法及相关产品
US20190073590A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Sparse Neural Network Training Optimization
US20190073586A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Nested Machine Learning Architecture
CN108710943A (zh) * 2018-05-21 2018-10-26 南京大学 一种多层前馈神经网络并行加速器

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221102A (zh) * 2021-04-16 2022-10-21 中科寒武纪科技股份有限公司 用于优化片上系统的卷积运算操作的方法和相关产品
CN115221102B (zh) * 2021-04-16 2024-01-19 中科寒武纪科技股份有限公司 用于优化片上系统的卷积运算操作的方法和相关产品
CN113469327A (zh) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 执行转数提前的集成电路装置
CN113469327B (zh) * 2021-06-24 2024-04-05 上海寒武纪信息科技有限公司 执行转数提前的集成电路装置

Also Published As

Publication number Publication date
CN111860804A (zh) 2020-10-30
CN111860804B (zh) 2022-12-27
CN111860807B (zh) 2023-05-02
EP3964950A4 (en) 2022-12-14
EP4012556A2 (en) 2022-06-15
CN111860805A (zh) 2020-10-30
CN111860806A (zh) 2020-10-30
US20220188614A1 (en) 2022-06-16
CN111860803A (zh) 2020-10-30
EP3998528A1 (en) 2022-05-18
CN111860805B (zh) 2023-04-07
CN111860807A (zh) 2020-10-30
EP4012556A3 (en) 2022-08-10
WO2020221170A1 (zh) 2020-11-05
CN111860808A (zh) 2020-10-30
EP3964950A1 (en) 2022-03-09
US20220261637A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
WO2020220935A1 (zh) 运算装置
US20210072986A1 (en) Methods for performing processing-in-memory operations on serially allocated data, and related memory devices and systems
JP4246204B2 (ja) マルチプロセッサシステムにおける共有メモリの管理のための方法及び装置
US11061742B2 (en) System, apparatus and method for barrier synchronization in a multi-threaded processor
US20090138680A1 (en) Vector atomic memory operations
US11126690B2 (en) Machine learning architecture support for block sparsity
Yavits et al. GIRAF: General purpose in-storage resistive associative framework
US20120198178A1 (en) Address-based hazard resolution for managing read/write operations in a memory cache
US11500828B1 (en) Method and device for constructing database model with ID-based data indexing-enabled data accessing
US20220114270A1 (en) Hardware offload circuitry
Zhao et al. Cambricon-F: machine learning computers with fractal von Neumann architecture
JP2020530176A5 (zh)
Chen et al. fgSpMSpV: A fine-grained parallel SpMSpV framework on HPC platforms
Guan et al. Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in cnns
US20220114133A1 (en) Fractal calculating device and method, integrated circuit and board card
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
TW202221705A (zh) 儲存器系統
CN111860797B (zh) 运算装置
Cicalese et al. The design of a distributed key-value store for petascale hot storage in data acquisition systems
US11442643B2 (en) System and method for efficiently converting low-locality data into high-locality data
Angelic et al. Near data processing and its applications
Dubeyko Processor in Non-Volatile Memory (PiNVSM): Towards to Data-centric Computing in Decentralized Environment
김병호 DRAM-based Processing-in-Memory Microarchitectures for Memory-intensive Machine Learning Applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20798218

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20798218

Country of ref document: EP

Kind code of ref document: A1