WO2019015541A1 - 一种计算方法及相关产品 - Google Patents

一种计算方法及相关产品 Download PDF

Info

Publication number
WO2019015541A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
instruction
unit
operation instruction
Prior art date
Application number
PCT/CN2018/095706
Other languages
English (en)
French (fr)
Inventor
陈天石
刘少礼
王在
胡帅
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to CN202010189417.2A priority Critical patent/CN111221578B/zh
Priority to CN202010189355.5A priority patent/CN111176727B/zh
Priority to CN201880004680.0A priority patent/CN110036369B/zh
Priority to EP18835662.0A priority patent/EP3686734A4/en
Publication of WO2019015541A1 publication Critical patent/WO2019015541A1/zh
Priority to US16/745,743 priority patent/US11481215B2/en
Priority to US17/929,730 priority patent/US11983534B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30079Pipeline control instructions, e.g. multicycle NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present disclosure relates to the field of data processing, and in particular, to a computing method and related products.
  • Data processing is a step or stage that most algorithms need to go through. Since computers were introduced into the field of data processing, more and more data processing has been carried out by computers.
  • With existing algorithms, however, computing devices are slow and inefficient when performing the data calculations of a neural network.
  • the computing device controls the matrix calculation unit to acquire a first operation instruction, where the first operation instruction includes a matrix read indication for the matrix required to execute the instruction; the required matrix is at least one matrix, and the at least one matrix is a matrix of the same length or a matrix of different lengths;
  • the computing device controls the computing unit to send a read command to the memory according to the matrix read indication;
  • the computing device controls the computing unit to read the matrix corresponding to the matrix read indication in a batch read mode, and to execute the first operation instruction on that matrix.
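The three steps above can be sketched as a minimal Python model. The names (`Memory`, `ComputeUnit`, `read_indication`) and the example operation are illustrative assumptions, not taken from the patent text:

```python
# Minimal model of the described flow: the instruction carries a matrix
# read indication, the compute unit sends a read command to the memory,
# batch-reads the whole matrix, and executes the operation on it.

class Memory:
    """Store supporting batch reads of whole matrices by address."""
    def __init__(self):
        self.cells = {}

    def write_matrix(self, addr, matrix):
        self.cells[addr] = matrix

    def batch_read(self, addr):
        # One read command fetches the entire matrix in a batch,
        # rather than element by element.
        return self.cells[addr]

class ComputeUnit:
    def __init__(self, memory):
        self.memory = memory

    def execute(self, instruction):
        # 1) The instruction carries a matrix read indication (here: an address).
        addr = instruction["read_indication"]
        # 2) Send a read command to the memory and batch-read the matrix.
        matrix = self.memory.batch_read(addr)
        # 3) Execute the first operation instruction on the matrix.
        return instruction["op"](matrix)

mem = Memory()
mem.write_matrix(0x100, [[1, 2], [3, 4]])
unit = ComputeUnit(mem)
# A hypothetical first operation instruction: sum all matrix elements.
inst = {"read_indication": 0x100, "op": lambda m: sum(sum(row) for row in m)}
result = unit.execute(inst)  # 10
```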
  • the matrix read indication comprises: a storage address of a matrix required by the instruction or an identifier of a matrix required by the instruction.
  • the computing device controls the operation unit to send the read command to the memory according to the matrix read indication, including:
  • the computing device controls the computing unit to read, from the register unit and according to the identifier, the storage address corresponding to the identifier, and then controls the computing unit to send to the memory a read command carrying that storage address, acquiring the matrix in a batch reading manner.
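When the read indication is an identifier rather than a direct address, the lookup path can be sketched as follows. The dictionary-based `register_unit` and `memory` are illustrative stand-ins for the hardware units:

```python
# Identifier path: the compute unit first looks the storage address up in
# the register unit, then batch-reads the matrix from memory at that address.

register_unit = {"mat_A": 0x200}        # identifier -> storage address
memory = {0x200: [[5, 6], [7, 8]]}      # address -> matrix (batch-readable)

def read_by_identifier(ident):
    addr = register_unit[ident]         # read address from the register unit
    return memory[addr]                 # batch read the matrix from memory

matrix = read_by_identifier("mat_A")
```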
  • the computing device controls the computing unit to perform a first pipeline-stage calculation on the matrix to obtain a first result, inputs the first result to the second pipeline stage to obtain a second result, inputs the second result to the third pipeline stage to obtain a third result, and inputs the third result to the memory for storage.
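The chained three-stage pipeline can be sketched as below. The concrete operation performed by each stage is not specified in this passage, so the stage functions here are illustrative assumptions:

```python
# Three chained pipeline stages: each stage's output feeds the next,
# and the third result is written back to memory.

def stage1(matrix):          # assumed: elementwise square
    return [[x * x for x in row] for row in matrix]

def stage2(first_result):    # assumed: row sums
    return [sum(row) for row in first_result]

def stage3(second_result):   # assumed: total sum
    return sum(second_result)

memory_store = []            # stand-in for the memory

m = [[1, 2], [3, 4]]
first = stage1(m)            # first result -> input to second stage
second = stage2(first)       # second result -> input to third stage
third = stage3(second)       # third result -> stored in memory
memory_store.append(third)
```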
  • the computing device further includes: a cache unit, the method further includes:
  • the computing device caches an operation instruction to be executed in the cache unit.
  • before the computing device controls the matrix calculation unit to acquire the first operation instruction, the method further includes:
  • the computing device determines whether the first operation instruction is associated with a second operation instruction that precedes the first operation instruction; if the first operation instruction is associated with the second operation instruction, the first operation instruction is cached in the cache unit, and after execution of the second operation instruction is completed, the first operation instruction is extracted from the cache unit and transmitted to the operation unit;
  • Determining whether the first operation instruction is associated with the second operation instruction preceding the first operation instruction comprises:
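The determination criterion is truncated in this excerpt. A common scheme for such dependency checks, assumed here purely for illustration, is to flag an association when the operand address ranges of the two instructions overlap:

```python
# Assumed criterion (not stated in this excerpt): two instructions are
# associated if any of their operand address ranges overlap.

def ranges_overlap(a_start, a_len, b_start, b_len):
    return a_start < b_start + b_len and b_start < a_start + a_len

def has_dependency(first_inst, second_inst):
    return any(
        ranges_overlap(fa, fl, sa, sl)
        for fa, fl in first_inst["operands"]
        for sa, sl in second_inst["operands"]
    )

i1 = {"operands": [(0x100, 64)]}      # 64 bytes starting at 0x100
i2 = {"operands": [(0x120, 64)]}      # overlaps i1's range
i3 = {"operands": [(0x400, 64)]}      # disjoint from i1's range

dependent = has_dependency(i1, i2)    # True: cache i1 until i2 completes
independent = has_dependency(i1, i3)  # False: i1 can issue immediately
```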
  • a computing device comprising: a memory, a register unit, a matrix computing unit, and a control unit;
  • the memory is used to store a matrix, the matrix is at least one matrix, and the at least one matrix is a matrix of the same length or a matrix of different lengths;
  • the register unit is configured to store scalar data, where the scalar data includes at least: a storage address of the matrix in the memory;
  • the control unit is configured to control the matrix calculation unit to acquire a first operation instruction, where the first operation instruction includes a matrix read indication for the matrix required to execute the instruction, the required matrix is at least one matrix, and the at least one matrix is a matrix of the same length or a matrix of different lengths;
  • the operation unit is configured to send a read command to the memory according to the matrix read indication, to read the matrix corresponding to the matrix read indication in a batch read manner, and to execute the first operation instruction on that matrix.
  • the matrix read indication comprises: a storage address of a matrix required by the instruction or an identifier of a matrix required by the instruction.
  • the control unit is configured to control the operation unit to read, from the register unit and according to the identifier, the storage address corresponding to the identifier, and to control the operation unit to send to the memory a read command carrying that storage address, acquiring the matrix in a batch reading manner.
  • the operation unit is specifically configured to perform a first pipeline-stage calculation on the matrix to obtain a first result, input the first result to the second pipeline stage to obtain a second result, input the second result to the third pipeline stage to obtain a third result, and input the third result to the memory for storage.
  • the computing device further includes:
  • a cache unit configured to cache an operation instruction to be executed
  • the control unit is configured to cache an operation instruction to be executed in the cache unit.
  • the control unit is configured to determine whether the first operation instruction is associated with a second operation instruction that precedes the first operation instruction; if the first operation instruction is associated with the second operation instruction, the first operation instruction is cached in the cache unit, and after execution of the second operation instruction is completed, the first operation instruction is extracted from the cache unit and transmitted to the operation unit;
  • Determining whether the first operation instruction is associated with the second operation instruction preceding the first operation instruction comprises:
  • a computer readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided by the first aspect.
  • FIG. 1A is a schematic structural diagram of a computing device.
  • FIG. 1B is a schematic structural diagram of another computing device.
  • FIG. 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • FIG. 2A is a schematic structural diagram of a matrix calculation unit provided by an embodiment of the present application.
  • FIG. 2B is a schematic structural diagram of a pipeline stage provided by an embodiment of the present application.
  • FIG. 3 is a schematic flow chart of a matrix calculation method disclosed in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a format of an instruction set provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a computing device performing a matrix multiplication vector instruction according to an embodiment of the present application.
  • FIG. 6A is another schematic structural diagram of a computing device provided by an embodiment of the present application.
  • FIG. 6B is a schematic flowchart of a convolution calculation instruction provided by an embodiment of the present application.
  • FIG. 6C is a schematic flowchart of a full connection layer forward operation instruction according to an embodiment of the present application.
  • FIG. 6D is a flowchart of a forward operation of the pooling operation provided by the embodiment of the present application.
  • FIG. 6F is a flowchart of batch normalization forward operation provided by an embodiment of the present application.
  • FIG. 7A is a schematic diagram of a format of an instruction set of the present application.
  • FIG. 7B is a schematic diagram of a format of a neural network operation instruction of the present application.
  • FIG. 7C is a schematic diagram of the format of a matrix operation instruction of the present application.
  • FIG. 7D is a schematic diagram of the format of a vector operation instruction of the present application.
  • FIG. 7E is a schematic diagram of the format of a matrix-vector operation instruction of the present application.
  • FIG. 7 is a schematic structural diagram of a hub_one_to_two of the present application.
  • FIG. 8 is a schematic diagram of the behavior of a hub_one_to_two handshake with a data receiver in the present application.
  • FIG. 10 is a schematic diagram of the behavior of data transmission in a hub in another embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an h-tree structure of the present application developed into a complete binary tree topology
  • FIG. 12 is a schematic diagram of data of full bandwidth and data segments corresponding to each leaf tile on an h-tree in another embodiment of the present application.
  • FIG. 13 is a schematic diagram of an on-chip multi-core structure of 64+1 cores connected by using an x-tree in an embodiment of the present application;
  • FIG. 14 is a schematic diagram of the behavior of data transmission in a hub in another embodiment of the present application.
  • FIG. 16 is a schematic diagram of data of full bandwidth and data segments corresponding to each leaf tile on an x-tree in another embodiment of the present application;
  • Figure 17 is a schematic block diagram showing the overall structure of an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a node of a sparsely connected neural network as an embodiment of the present application.
  • FIG. 20 is a schematic diagram showing a connection relationship of a sparsely connected neural network as another embodiment of the present application.
  • FIG. 21 is a schematic diagram of a convolution operation as an embodiment of the present application.
  • FIG. 22 is a graph showing changes in input, output, and weight when the convolutional neural network becomes sparse.
  • FIG. 23 is a schematic structural diagram of a sparsely connected artificial neural network computing device as an embodiment of the present application.
  • FIG. 24 is a schematic structural diagram of a mapping unit as an embodiment of the present application.
  • FIG. 25 is a flowchart of the operation process of a sparsely connected artificial neural network as an embodiment of the present application.
  • FIG. 26 is a schematic structural diagram of a sparsely connected artificial neural network computing device as another embodiment of the present application.
  • FIG. 27 is a schematic structural diagram of a mapping unit as another embodiment of the present application.
  • FIG. 28 is a schematic structural diagram of an artificial neural network computing device which is a sparse connection according to still another embodiment of the present application.
  • FIG. 29 is a schematic structural diagram of a mapping unit as still another embodiment of the present application.
  • FIG. 31 is a schematic structural diagram of a mapping unit as still another embodiment of the present application.
  • FIG. 32 is a structural block diagram of an embodiment of a processing system of a neural network of the present application.
  • FIG. 33 is a structural block diagram of another embodiment of a processing system of a neural network of the present application.
  • FIG. 34 is a schematic diagram of neural network division in an embodiment of the present application.
  • FIG. 35 is a schematic diagram of neural network division in another embodiment of the present application.
  • FIG. 36 is a schematic diagram of neural network division in still another embodiment of the present application.
  • An H-tree module, as an embodiment of an interconnection module.
  • FIG. 40 illustrates an example block diagram of a main operational module structure in an apparatus for performing artificial neural network forward operations supporting discrete data representations in accordance with an embodiment of the present application.
  • FIG. 41 illustrates an example block diagram of a slave arithmetic module structure in an apparatus for performing artificial neural network forward operations supporting discrete data representations in accordance with an embodiment of the present application.
  • FIG. 42 shows an example block diagram of a neural network forward operation process in accordance with an embodiment of the present application.
  • FIG. 43 illustrates an example block diagram of a neural network reverse training process that supports discrete data representations in accordance with an embodiment of the present application.
  • FIG. 45 shows an example structure of an arithmetic unit according to an embodiment of the present application.
  • FIG. 48 is a block diagram showing the structure of a neural network computing device in accordance with the present disclosure.
  • FIG. 49 is a flow chart of a neural network computing method in accordance with the present disclosure.
  • Figure 49.1 is a schematic illustration of a coding table in accordance with the present disclosure.
  • FIG. 49.5 is a schematic diagram showing a method of representing power data according to the present disclosure.
  • Figure 49.7 is a schematic diagram of the multiplication operation of neurons and power weights in accordance with the present disclosure.
  • FIG. 50 is a flow chart of a neural network computing method in accordance with the present disclosure.
  • Figure 50.1 is a schematic illustration of a coding table in accordance with the present disclosure.
  • Figure 50.2 is another schematic diagram of a coding table in accordance with the present disclosure.
  • Figure 50.3 is another schematic diagram of a coding table in accordance with the present disclosure.
  • FIG. 50.5 is a schematic diagram showing a method of representing power data according to the present disclosure.
  • FIG. 50.6 is a schematic diagram of a multiplication operation of a power neuron and a power weight according to the present disclosure.
  • FIG. 51 is a flow chart of a processing method of an embodiment of the present disclosure.
  • FIG. 52 is another flow chart of a processing method of an embodiment of the present disclosure.
  • FIG. 53 illustrates a pruning method for a fully connected layer of a neural network according to an embodiment of the present disclosure.
  • Figure 55 is a block diagram showing the structure of a processing apparatus according to an embodiment of the present disclosure.
  • Figure 56 is a schematic structural view of an acceleration device according to an embodiment of the present disclosure.
  • FIG. 57 is a schematic structural view of another acceleration device according to an embodiment of the present disclosure.
  • FIG. 58 shows a specific embodiment of a processing method of the present disclosure.
  • FIG. 59 is a specific representation of a short-digit floating-point data structure for storing data according to an embodiment of the present application.
  • FIG. 60A is a block diagram showing an example of an apparatus for performing an artificial neural network forward operation according to the present application.
  • FIG. 60 is a block diagram showing an example of a floating point data statistics module in an apparatus for performing an artificial neural network forward operation according to an embodiment of the present application;
  • FIG. 61 is a block diagram showing an example of a short-digit floating-point calculation portion of a forward operation module in an apparatus for performing an artificial neural network forward operation according to an embodiment of the present application;
  • FIG. 63 is a block diagram showing an example of an operational flow for executing an artificial neural network forward computing device according to an embodiment of the present application.
  • FIG. 64 is a specific representation of a fixed point data structure for storing data according to an embodiment of the present application.
  • FIG. 65A is a block diagram showing an example of an apparatus for performing an artificial neural network forward operation according to the present application.
  • FIG. 65 is a block diagram showing an example of a floating point data statistics module in an apparatus for performing an artificial neural network forward operation according to an embodiment of the present application;
  • FIG. 66 is a block diagram showing an example of a short-digit fixed-point calculation portion of a forward operation module in an apparatus for performing an artificial neural network forward operation according to an embodiment of the present application;
  • FIG. 67 is a block diagram showing an example of a neural network forward operation process according to an embodiment of the present application.
  • FIG. 68 is a block diagram showing an exemplary flow of operations for executing an artificial neural network forward computing device according to an embodiment of the present application.
  • FIG. 69 is a general flowchart of an algorithm implementation according to an embodiment of the present application.
  • Figure 70 is a block diagram showing an example of the overall structure of a preferred embodiment of the apparatus for on-chip repeat addressing of the present application.
  • Figure 71 is a data address division diagram of a preferred embodiment of the method for on-chip repeat addressing of the present application.
  • Figure 72 is a schematic diagram of data division of a preferred embodiment of the method for on-chip repeat addressing of the present application.
  • Figure 73 is a second schematic diagram of data division of a preferred embodiment of the method for on-chip repeat addressing of the present application.
  • Figure 74 is a schematic diagram of an alternative strategy of a preferred embodiment of the method of on-chip repeat addressing of the present application.
  • Figure 75 is a flow diagram of a specific embodiment of the method of on-chip repeat addressing of the present application.
  • FIG. 76 is a schematic diagram of a preferred embodiment of an on-chip repeat index of the method for on-chip repeat addressing of the present application.
  • FIG. 77 is a schematic structural diagram of an on-chip data division and reading system according to the present application.
  • FIG. 79A is a schematic diagram of implementation of an on-chip data partitioning strategy according to the present application.
  • FIG. 79B is a second schematic diagram of implementation of the on-chip data partitioning strategy according to the present application.
  • FIG. 80 is a schematic diagram of an embodiment of an on-chip data index of the on-chip data partitioning and reading system according to the present application.
  • FIG. 81 is a schematic diagram of a physical framework of a method for dividing and reading data on-chip according to the present application.
  • FIG. 83 is a schematic flowchart of a method for dividing and reading data on the chip according to the present application.
  • Figure 84 is a flow chart showing a specific embodiment of the on-chip data division and reading method in the present application.
  • FIG. 85 is a block diagram showing the structure of a neural network computing system according to an embodiment of the present application.
  • FIG. 86A is a schematic diagram showing one embodiment of a multiprocessor in accordance with an embodiment of the present application.
  • FIG. 86B is a schematic diagram showing another embodiment of a multiprocessor in accordance with an embodiment of the present application.
  • Figure 87 shows a block diagram of a neural network computing system for training and reasoning in accordance with an embodiment of the present application.
  • FIG. 88 is a diagram showing the structure of a computing system in which a computing processor shares a storage unit according to an embodiment of the present application.
  • FIG. 89 is a block diagram showing a structure of a neural network computing system in which a computing processor shares a storage unit according to an embodiment of the present application.
  • Figure 90 illustrates an example block diagram of a system for complex neural network tasks in accordance with an embodiment of the present application.
  • the matrix in the specific implementation of the present application may specifically be an m*n matrix, a 1*n matrix, or an m*1 matrix, where m and n are integers greater than or equal to 2.
  • When the matrix is a 1*n matrix or an m*1 matrix, it may also be referred to as a vector.
  • The matrices discussed below may be any of the above three types, and this will not be repeated below.
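The three shapes just described can be sketched as nested lists, where the outer list holds rows (an illustrative representation, not the patent's storage format):

```python
# The three matrix shapes: m*n, 1*n (row vector), and m*1 (column vector),
# with m and n both integers >= 2.
m, n = 2, 3
m_by_n = [[0] * n for _ in range(m)]   # m*n matrix
row_vec = [[0] * n]                    # 1*n matrix, also called a vector
col_vec = [[0] for _ in range(m)]      # m*1 matrix, also called a vector
```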
  • the artificial neural network algorithm is taken as an example, and many neural network algorithms contain a large number of matrix operations.
  • FIG. 1A shows a computing device for matrix calculation. It includes a plurality of general-purpose processors 101 (CPUs), each of which has its own memory; the processing method can be that the multiple CPUs process the matrix calculation in parallel.
  • FIG. 1B is another computing device.
  • a graphics processing unit (GPU) 102 is included, and the operation of the matrix is performed by the GPU 102.
  • The GPU itself also includes a memory 1021.
  • When the GPU 102 processes a matrix operation, the GPU 102 needs to extract the matrix required for the matrix operation from the memory 1021.
  • Because of its large amount of data, a matrix occupies more storage space than a scalar.
  • The memory capacity of the GPU 102 is not enough to store a large number of matrices. To solve this problem, FIG. 1B adds an off-chip database 103, and the GPU 102 can read the matrix from the off-chip database 103.
  • The reading mode is that the GPU 102 extracts the matrix to be calculated from the off-chip database 103, stores it in the memory 1021, decodes the matrix instruction when performing the matrix operation, and then extracts the matrix from the memory 1021 for calculation.
  • The decoding of the matrix instruction by the GPU 102 takes up a large part of the GPU's computing power, affecting the calculation speed of the matrix and resulting in low efficiency.
  • The input neurons and output neurons mentioned in this application do not refer to the neurons in the input layer and the output layer of the entire neural network, but to any two adjacent layers in the network: the neurons in the lower layer of the network feedforward operation are the input neurons, and the neurons in the upper layer of the network feedforward operation are the output neurons.
  • A specific embodiment of the present application provides a matrix calculation method, which is completed in a computing device as shown in FIG. 2. As shown in FIG. 2, the computing device includes:
  • a memory 201 configured to store a matrix.
  • The memory can be a scratch pad memory capable of supporting matrix data of different lengths; the present application temporarily stores the necessary calculation data in a memory (preferably a scratch pad memory), so that the computing device can support data of different lengths more flexibly and effectively during matrix operations.
  • The above memory may also be an off-chip database, a database, or another medium capable of storage, and the like.
  • a scalar data storage unit 202 (e.g., a scalar register unit) for storing scalar data, the scalar data including, but not limited to, the address of the matrix data in the storage medium 201 and the scalars involved in matrix-and-scalar operations.
  • The scalar register unit can be a scalar register file that provides the scalar registers required during the operation.
  • The scalar registers store not only matrix addresses but also scalar data. When matrix-and-scalar operations are involved, the operation unit must acquire both the matrix address and the corresponding scalar from the register unit.
  • an operation unit 203 configured to acquire and execute a first operation instruction.
  • The operation unit includes a plurality of operators, including, but not limited to, a matrix adder 231, a matrix multiplier 232, a size comparison operator 233, a nonlinear operator 234, and a matrix-scalar multiplication operator 235.
  • the matrix calculation method is as shown in FIG. 3, and includes the following steps:
  • Step S301 the operation unit 203 acquires a first operation instruction, where the first operation instruction includes: a matrix read instruction required to execute the instruction.
  • For example, the first operation instruction may carry the storage addresses of the matrices required by the matrix operation formula, e.g., the storage address of A is 0000-0FFF and the storage address of B is 1000-1FFF.
  • Alternatively, the identifiers of A and B may be carried, e.g., the identifier of A is 0101 and the identifier of B is 1010.
  • Step S302 the operation unit 203 sends a read command to the memory 201 according to the matrix read instruction.
  • The implementation of the foregoing step S302 may specifically be:
  • if the matrix read indication is a storage address, the operation unit 203 sends a read command for that storage address to the memory 201 and acquires the corresponding matrix in batch read mode;
  • if the matrix read indication is an identifier, the operation unit 203 reads the storage address corresponding to the identifier from the scalar data storage unit, then sends a read command for that storage address to the memory 201 and acquires the corresponding matrix in batch read mode.
  • Step S303 the operation unit 203 reads the matrix corresponding to the indication by using a batch reading manner, and executes the first operation instruction on the matrix.
  • The batch reading mode in the above step S303 may specifically be that each read returns a plurality of data items; that is, regardless of the amount of data required, each read fetches multiple data items. This batch data reading method is well suited to reading big data.
  • Because of the large capacity a matrix occupies, a single-item reading mode would make the reading speed very slow, so the batch reading method is adopted here to obtain multiple data items at a time and thus read the matrix data quickly, avoiding slow matrix-data reading from becoming a bottleneck for the matrix calculation speed.
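The batch reading described above can be sketched in plain Python. This is only an illustrative model, not the patented hardware: the memory is a flat list, and the (hypothetical) `batch_read` helper fetches a contiguous block of matrix data with multiple items per request rather than one element at a time.

```python
# Hypothetical sketch of the batch read mode: each request returns many data
# items, rather than a single element per read.

def batch_read(memory, start_addr, length, batch_size=64):
    """Read `length` words starting at `start_addr`, `batch_size` words per request."""
    data = []
    addr = start_addr
    end = start_addr + length
    while addr < end:
        chunk = memory[addr:min(addr + batch_size, end)]
        data.extend(chunk)       # one request returns a plurality of data items
        addr += batch_size
    return data

memory = list(range(4096))              # toy flat memory
matrix_a = batch_read(memory, 0x0000, 16)
assert matrix_a == list(range(16))
```

A single-item reading mode would issue one request per element; here a 4096-word matrix needs only 64 requests at the default batch size.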
  • The computing device of the technical solution provided by the present application is provided with a scalar data storage unit and a memory, which respectively store scalar data and matrix data, and the present application allocates a unit reading mode and a batch reading mode to the two memories. By assigning a data reading mode that matches the characteristics of the matrix data, the bandwidth can be used well and the impact of a bandwidth bottleneck on the matrix calculation speed is avoided. In addition, for the scalar data storage unit, because it stores scalar data, the scalar data reading method improves bandwidth utilization. Therefore, the technical solution provided by the present application can make good use of the bandwidth and avoid its influence on the calculation speed, so it has the advantages of high calculation speed and high efficiency.
  • The performing of the first operation instruction on the matrix may specifically be:
  • performing an n-stage pipeline calculation on the matrix; specifically, performing the first pipeline-stage calculation on the matrix to obtain a first result, inputting the first result to the second pipeline stage to perform the second pipeline-stage calculation to obtain a second result, inputting the second result to the third pipeline stage to perform the third pipeline-stage calculation to obtain a third result, and so on, until the (n-1)th result is input to the nth pipeline stage to perform the nth pipeline-stage calculation to obtain the nth result, which is input to the memory.
  • The first pipeline stage described above includes, but is not limited to, a matrix addition calculator, a matrix multiplication calculator, and the like.
  • The second pipeline stage described above includes, but is not limited to, a size comparison calculator and the like.
  • Dividing the matrix operation into three pipeline stages is mainly to improve the calculation speed.
  • For a general-purpose processor, the operation steps may specifically be: the processor performs a calculation on the matrix to obtain a first result and stores the first result in the memory; the processor reads the first result from the memory and performs a second calculation to obtain a second result, then stores the second result in the memory; the processor reads the second result from the memory and performs a third calculation to obtain a third result, then stores the third result in the memory.
  • It can be seen that when a general-purpose processor performs matrix calculation, it does not pipeline the calculation, so the computed data must be saved after each calculation and read again for the next calculation; this solution therefore requires data to be stored and read repeatedly, many times over.
  • In the technical solution of the present application, the first result of the first pipeline-stage calculation directly enters the second pipeline stage for calculation, and the second result of the second pipeline-stage calculation directly enters the third pipeline stage for calculation; the first and second results of the first and second pipeline-stage calculations need not be stored, which firstly reduces the occupied memory space, secondly avoids multiple storing and reading of results and improves bandwidth utilization, further improving the calculation efficiency.
  • In another embodiment of the present application, the pipeline components can be freely combined, or a single pipeline stage can be employed.
  • For example, the second pipeline stage and the third pipeline stage can be merged, or the first, second, and third pipeline stages can all be merged, or each pipeline stage can be responsible for different operations.
  • For example, the first pipeline stage is responsible for comparison operations and part of the multiplication operations, and the second pipeline stage is responsible for combinations of operations such as the nonlinear operation and the matrix-scalar multiplication.
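The n-stage flow described above can be sketched as a chain of stage functions, where each stage's result feeds the next stage directly and only the final result is written back. This is a minimal behavioral model, not the hardware; the stage functions are hypothetical examples.

```python
# Minimal sketch of the n-stage pipeline: intermediate results pass straight
# from one stage to the next and are never stored back to memory.

def run_pipeline(matrix, stages):
    result = matrix
    for stage in stages:          # first result -> second stage -> ... -> nth result
        result = stage(result)    # intermediate results stay inside the pipeline
    return result                 # only the nth result is written to memory

# Example stages (hypothetical): elementwise add, multiply, and a ReLU-style clamp.
stages = [
    lambda m: [x + 1 for x in m],      # e.g. matrix addition stage
    lambda m: [x * 2 for x in m],      # e.g. matrix multiplication stage
    lambda m: [max(x, 0) for x in m],  # e.g. comparison / nonlinear stage
]
assert run_pipeline([-3, 0, 4], stages) == [0, 2, 10]
```

Merging stages, as the embodiment above allows, corresponds here to composing two stage functions into one.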
  • Before the first operation instruction is executed, it is determined whether there is an association relationship between the first operation instruction and a second operation instruction that precedes it. If the first operation instruction is associated with the second operation instruction preceding it, then after the second operation instruction has been executed, the first operation instruction is extracted from the cache unit and passed to the operation unit 203. If the first operation instruction has no association relationship with the instruction preceding it, the first operation instruction is directly transmitted to the operation unit.
  • A specific method of determining the association may be: extracting a first storage address interval of the matrix required by the first operation instruction and a second storage address interval of the matrix required by the second operation instruction; if the first storage address interval and the second storage address interval have an overlapping area, it is determined that the first operation instruction and the second operation instruction have an association relationship; if the first storage address interval and the second storage address interval do not overlap, it is determined that the first operation instruction and the second operation instruction have no association relationship.
  • The storage area accessed by the second operation instruction may contain the storage area accessed by the first operation instruction. For example, if the second operation instruction accesses the A matrix storage area, the B matrix storage area, and the C matrix storage area, and the A and B storage areas are adjacent or the A and C storage areas are adjacent, then the storage area accessed by the second operation instruction is the A-and-B storage area plus the C storage area, or the A-and-C storage area plus the B storage area.
  • In this situation, if the determination condition merely required the storage area of the matrix accessed by the first operation instruction to be the same as a storage area of a matrix of the second operation instruction, the condition would determine that the first operation instruction is not associated with the second operation instruction; yet practice shows that the first operation instruction and the second operation instruction do have an association relationship in this case. Therefore, the present application uses the existence of an overlapping area as the condition for determining the association relationship, which avoids the misjudgment of the above situation.
  • In an actual example, the matrices required by the first operation instruction are an A matrix and a D matrix, where the storage area of the A matrix is [0001, 0FFF] and the storage area of the D matrix is [A000, AFFF]; the matrices required by the second operation instruction are an A matrix, a B matrix, and a C matrix, whose corresponding storage areas are [0001, 0FFF], [1000, 1FFF], and [B000, BFFF]. For the first operation instruction, the corresponding storage areas are [0001, 0FFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage areas are [0001, 1FFF] and [B000, BFFF]. The two share the area [0001, 0FFF], so the first operation instruction has an association relationship with the second operation instruction.
  • In another example, the matrices required by the first operation instruction are an E matrix and a D matrix, where the storage area of the E matrix is [C000, CFFF] and the storage area of the D matrix is [A000, AFFF]; the matrices required by the second operation instruction are an A matrix, a B matrix, and a C matrix, whose corresponding storage areas are [0001, 0FFF], [1000, 1FFF], and [B000, BFFF]. For the first operation instruction, the corresponding storage areas are [C000, CFFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage areas are [0001, 1FFF] and [B000, BFFF]. The storage areas of the second operation instruction do not overlap those of the first operation instruction, so the first operation instruction has no association relationship with the second operation instruction.
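The overlap-based association check in the two examples above can be sketched as an interval test. This is an illustrative model only; the function names are hypothetical.

```python
# Two instructions are associated iff any storage interval accessed by the
# first overlaps any storage interval accessed by the second.

def intervals_overlap(a, b):
    """Closed intervals (lo, hi) overlap iff each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

def has_association(first_regions, second_regions):
    return any(intervals_overlap(f, s)
               for f in first_regions for s in second_regions)

second = [(0x0001, 0x0FFF), (0x1000, 0x1FFF), (0xB000, 0xBFFF)]  # A, B, C matrices

# First example: A [0001, 0FFF] and D [A000, AFFF] -> shares A with the second
# instruction, so the instructions are associated.
assert has_association([(0x0001, 0x0FFF), (0xA000, 0xAFFF)], second)

# Second example: E [C000, CFFF] and D [A000, AFFF] -> no overlap, no association.
assert not has_association([(0xC000, 0xCFFF), (0xA000, 0xAFFF)], second)
```

Note that the test uses overlap rather than equality, matching the misjudgment-avoidance argument above.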
  • The operation instruction includes an operation code and at least one operation field. The operation code is used to indicate the function of the operation instruction, and the operation unit can perform different matrix operations by identifying the operation code. The operation field is used to indicate the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, when a matrix is to be acquired, the matrix start address and the matrix length may be obtained from the corresponding register according to the register number, and then the matrix at the corresponding address is obtained from the storage medium according to the matrix start address and the matrix length.
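The opcode-plus-operation-field layout just described can be sketched as follows. All field names and the register-file contents here are assumptions for illustration, not the patent's actual encoding: a register-number operand is resolved to a start address and length, while an immediate is used as-is.

```python
# Illustrative decode of an instruction: (opcode, *fields), where each field is
# tagged as a register number ("reg") or an immediate ("imm").

registers = {3: {"start": 0x1000, "length": 0x0FFF}}  # hypothetical register file

def decode(instruction):
    opcode, *fields = instruction
    operands = []
    for kind, value in fields:
        if kind == "reg":                 # register number -> look up address/length
            operands.append(registers[value])
        else:                             # immediate value used directly
            operands.append(value)
    return opcode, operands

op, ops = decode(("MMV", ("reg", 3), ("imm", 42)))
assert op == "MMV"
assert ops == [{"start": 0x1000, "length": 0x0FFF}, 42]
```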
  • the instruction set contains arithmetic instructions with different functions:
  • a matrix-multiply-vector instruction (MMV), according to which the device extracts matrix data and vector data of a set length from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs the matrix-multiply-vector computation in the operation unit, and writes the result back. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). It is worth noting that the vector can be stored in the memory (preferably a scratch pad memory or a scalar register file) as a special form of matrix (a matrix with only one row of elements);
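The computation the MMV instruction performs can be sketched in plain Python (the hardware performs it in the operation unit; this only shows the arithmetic): each output element is the dot product of one matrix row with the vector.

```python
# What MMV computes: result[i] = sum_j matrix[i][j] * vector[j].

def mmv(matrix, vector):
    assert all(len(row) == len(vector) for row in matrix)
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

result = mmv([[1, 2], [3, 4]], [5, 6])
assert result == [17, 39]   # [1*5 + 2*6, 3*5 + 4*6]
```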
  • a vector-multiply-matrix instruction, according to which the device extracts vector data and matrix data of a set length from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs the vector-multiply-matrix computation in the operation unit, and writes the result back. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). It is worth noting that the vector can be stored in the memory (preferably a scratch pad memory or a scalar register file) as a special form of matrix (a matrix with only one row of elements);
  • a matrix-multiply-scalar instruction (VMS), according to which the device takes out matrix data of a set length from a specified address of the memory (preferably a scratch pad memory or a scalar register file), takes out scalar data from a specified address of the scalar register file, performs the matrix-multiply-scalar computation in the operation unit, and writes back the result. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). It is worth noting that the scalar register file stores not only the address of the matrix but also the scalar data;
  • a tensor operation instruction (TENS), according to which the device extracts two pieces of matrix data of a set length from two specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs a tensor operation on the two pieces of matrix data in the operation unit, and writes back the calculation result. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • a matrix addition instruction (MA), according to which the device fetches two pieces of matrix data of a set length from two specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs an addition operation on the two matrices in the operation unit, and writes back the result. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • a matrix subtraction instruction (MS), according to which the device fetches two pieces of matrix data of a set length from two specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs a subtraction operation on the two matrices in the operation unit, and writes back the result. Preferably, the result of the calculation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • a matrix retrieval instruction (MR), according to which the device fetches vector data of a set length from a specified address of the memory (preferably a scratch pad memory or a scalar register file) and fetches matrix data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file). In the operation unit, the vector is an index vector, and the i-th element of the output vector is the number found in the i-th column of the matrix, indexed by the i-th element of the index vector; the output vector is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
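The MR semantics described above — column i of the matrix selected by the i-th entry of the index vector — can be sketched as a gather. The function name is an assumption for illustration.

```python
# MR sketch: output[i] = matrix[index_vector[i]][i], i.e. the element of
# column i selected by the i-th entry of the index vector.

def matrix_retrieve(matrix, index_vector):
    return [matrix[index_vector[i]][i] for i in range(len(index_vector))]

matrix = [[10, 11, 12],
          [20, 21, 22],
          [30, 31, 32]]
assert matrix_retrieve(matrix, [2, 0, 1]) == [30, 11, 22]
```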
  • a matrix load instruction, according to which the device loads data of a set length from a specified external source address to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • a matrix store instruction, according to which the device stores matrix data of a set length from a specified address of the memory (preferably a scratch pad memory or a scalar register file) to an external destination address;
  • a matrix move instruction (MMOVE), according to which the device moves matrix data of a set length from one specified address of the memory (preferably a scratch pad memory or a scalar register file) to another specified address of the memory (preferably a scratch pad memory or a scalar register file).
  • The set length in the above instructions can be set by the user.
  • In an optional embodiment, the user can set the set length to one value; in practical applications, the user can also set the set length to multiple values.
  • The specific embodiment of the present application does not limit the specific value or the number of the set length.
  • FIG. 5 is another computing device 50 according to an embodiment of the present application.
  • the computing device 50 includes a memory 501, a scalar data storage unit 502 (preferably a scalar register unit), a matrix calculation unit 503, and a control unit 504;
  • a memory 501 configured to store a matrix
  • the scalar data storage unit 502 is configured to store scalar data, where the scalar data includes at least: a storage address of the matrix in the memory;
  • the control unit 504 is configured to control the matrix calculation unit to acquire a first operation instruction, where the first operation instruction includes a matrix read indication required to execute the instruction;
  • the operation unit 503 is configured to send a read command to the memory according to the matrix read indication; and read the matrix corresponding to the matrix read indication according to a batch read manner, and execute the first operation instruction on the matrix .
  • the matrix read indication comprises: a storage address of a matrix required by the instruction or an identifier of a matrix required by the instruction.
  • Optionally, if the matrix read indication is an identifier, the control unit 504 is configured to control the operation unit to read the storage address corresponding to the identifier from the register unit according to the identifier, and to control the operation unit to send a read command for that storage address to the memory and obtain the matrix in a batch read manner.
  • Optionally, the operation unit 503 is configured to perform the first pipeline-stage calculation on the matrix to obtain a first result, input the first result to the second pipeline stage to perform the second pipeline-stage calculation to obtain a second result, input the second result to the third pipeline stage to perform the third pipeline-stage calculation to obtain a third result, and so on; after the (n-1)th result is input to the nth pipeline stage, the calculation of the nth pipeline stage is performed and the nth result is input to the memory.
  • n may be an integer greater than or equal to 2.
  • the computing device further includes:
  • a cache unit 505 configured to cache an operation instruction to be executed
  • The control unit 504 is configured to cache the operation instruction to be executed in the cache unit 505.
  • control unit 504 is configured to determine whether the first operation instruction is associated with the second operation instruction before the first operation instruction, if the first operation instruction and the second operation instruction exist Correlation relationship, the first operation instruction is cached in the cache unit, after the execution of the second operation instruction, the first operation instruction is extracted from the cache unit and transmitted to the operation unit;
  • Determining whether the first operation instruction is associated with the second operation instruction preceding the first operation instruction comprises: extracting a first storage address interval of the matrix required by the first operation instruction and a second storage address interval of the matrix required by the second operation instruction; if the first storage address interval and the second storage address interval have an overlapping area, determining that the first operation instruction and the second operation instruction have an association relationship; if they do not overlap, determining that the first operation instruction and the second operation instruction have no association relationship.
  • Optionally, the control unit 504 is configured to obtain an operation instruction from the instruction cache unit, process the operation instruction, and then provide it to the operation unit.
  • The control unit 504 can be divided into three modules, namely: an instruction fetch module 5031, a decoding module 5032, and an instruction queue module 5033.
  • The instruction fetch module 5031 is configured to obtain an operation instruction from the instruction cache unit;
  • the decoding module 5032 is configured to decode the obtained operation instruction;
  • the instruction queue 5033 is used to store the decoded operation instructions sequentially; considering that different instructions may have dependencies on the registers they involve, it buffers the decoded instructions and issues them after the dependencies are satisfied.
  • FIG. 6 is a flowchart of a matrix multiplication vector instruction executed by a computing device according to an embodiment of the present application.
  • the hardware structure of the computing device is as shown in FIG.
  • The memory takes the scratch pad memory as an example.
  • the process of executing the matrix multiplication vector instruction includes:
  • Step S601 the computing device controls the fetch module to take out the matrix multiplication vector instruction, and sends the matrix multiplication vector instruction to the decoding module.
  • Step S602 the decoding module decodes the matrix multiplication vector instruction, and sends the matrix multiplication vector instruction to the instruction queue.
  • Step S603 in the instruction queue, the matrix multiplication vector instruction needs to obtain data in a scalar register corresponding to five operation fields in the instruction from the scalar register file, the data including an input vector address, an input vector length, an input matrix address, Output vector address and output vector length.
  • Step S606 after the operation of the arithmetic unit is completed, the result is written into a designated address of a memory (a preferred scratch pad memory or a scalar register file), and the matrix multiply vector instruction in the reordering buffer is submitted.
  • The matrix calculation instruction in FIG. 6 is taken as an example of a matrix-multiply-vector instruction.
  • In practical applications, the matrix-multiply-vector instruction in the embodiment shown in FIG. 6 can be replaced by a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction, a tensor operation instruction, a matrix addition instruction, a matrix subtraction instruction, a matrix retrieval instruction, a matrix load instruction, a matrix store instruction, or a matrix move instruction, which will not be repeated here.
  • FIG. 6A provides a computing device, including: a memory 611 (optional), a register unit 612, an interconnect module 613, an operation unit 614, a control unit 615, and a data access unit 616;
  • the operation unit 614 includes at least two of an addition calculator, a multiplication calculator, a comparator, and an activation operator.
  • the interconnecting module 613 is configured to control the connection relationship of the calculators in the computing unit 614 such that the at least two types of calculators form different computing topologies.
  • The control unit 615 is configured to extract an operation instruction, an operation domain corresponding to the operation instruction, and a first calculation topology corresponding to the operation instruction from the register unit 612, and decode the operation instruction into an execution instruction, the execution instruction being used to control the operation unit to perform the operation; the control unit transmits the operation domain to the data access unit 616.
  • the data access unit 616 is configured to extract a data block corresponding to the operation domain from the memory 611, and transmit the data block to the interconnection module 613.
  • the interconnection module 613 is configured to receive a data block, and send the data block to the operation unit 614.
  • The operation unit 614 is configured to call the calculators of the operation unit 614 according to the execution instruction to perform an operation on the data block to obtain an operation result, and the operation result is transmitted to the data access unit and stored in the memory.
  • the operation unit 614 is configured to perform an operation operation on the data block according to the first calculation topology and the execution instruction to obtain an operation result, and the operation result is transmitted to the data access unit and stored in the memory.
  • the first computing topology may be: a multiplier-adder-adder-activation operator.
  • the specific calculation method of the computing device shown in FIG. 6A is illustrated by different operation instructions.
  • The operation instruction here is exemplified by a convolution calculation instruction, which can be applied in a neural network, so the convolution calculation instruction can also be referred to as a convolutional neural network instruction.
  • The formula that it actually needs to execute can be: s = f(Σ(w · x_i) + b), wherein the convolution kernel w is multiplied by the input data x_i and summed, the offset b is then added, and an activation operation f is performed to obtain the final output result s.
  • According to the formula, the calculation topology can be obtained as: multiplier - adder - (optional) activation operator.
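The per-output computation of the convolution instruction, as given by the formula above, can be sketched as follows. The function name and the choice of ReLU as the example activation are assumptions for illustration only.

```python
# Sketch of one convolution output: s = f(sum_i(w_i * x_i) + b).

def conv_step(w, x, b=0.0, activation=lambda v: max(v, 0.0)):  # ReLU as example f
    acc = sum(wi * xi for wi, xi in zip(w, x))  # multiplier + adder (sum)
    acc += b                                    # offset addition (skipped when b == 0)
    return activation(acc)                      # optional activation operator

s = conv_step([1.0, -2.0, 0.5], [3.0, 1.0, 4.0], b=1.0)
assert s == 4.0   # (3 - 2 + 2) + 1 = 4; ReLU leaves it unchanged
```

This mirrors the multiplier-adder-activation topology: the multiply-sum stage, the bias addition, and the activation each map to one element of the topology.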
  • the convolution calculation instruction may include an instruction set including: a convolutional neural network instruction, a convolutional neural network COMPUTE instruction having different functions, and a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • the COMPUTE instruction includes:
  • a convolutional neural network instruction, according to which the device extracts input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), and performs a convolution operation in the convolution operation unit to obtain the output directly; that is, the instruction does not perform subsequent operations, and directly performs the convolution operation to obtain the output result;
  • a convolutional neural network sigmoid instruction, according to which the device respectively extracts input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs a convolution operation in the convolution operation unit, and then, preferably, performs sigmoid activation on the output;
  • a convolutional neural network TanH instruction, according to which the device respectively extracts input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then, preferably, performs TanH activation on the output;
  • a convolutional neural network ReLU instruction, according to which the device respectively extracts input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratch pad memory), performs a convolution operation in the convolution operation unit, and then, preferably, performs ReLU activation on the output; and
  • a convolutional neural network group instruction, according to which the device respectively extracts input data of a specified size and a convolution kernel from specified addresses of the memory (preferably a scratch pad memory), divides them into groups, performs a convolution operation in the convolution operation unit, and then, preferably, activates the output.
  • the IO instruction reads the input data required for the calculation from the external storage space and stores the data back to the external space after the calculation is completed.
  • The NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operations;
  • the JUMP instruction is responsible for controlling the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • The MOVE instruction is responsible for moving data at one address in the device's internal address space to another address in the internal address space; the process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • the method for executing the convolution calculation instruction by the computing device shown in FIG. 6A may specifically be:
  • the control unit 615 extracts an operation domain corresponding to the convolution calculation instruction and the convolution calculation instruction from the register unit 612, and the control unit transmits the operation domain to the data access unit.
  • the data access unit extracts the convolution kernel w and the offset b corresponding to the operation domain from the memory (when b is 0, there is no need to extract the offset b), and the convolution kernel w and the offset b are transmitted to the arithmetic unit.
  • The multiplier of the operation unit performs a multiplication operation on the convolution kernel w and the input data Xi to obtain a first result, inputs the first result to the adder to perform an addition operation to obtain a second result, performs an addition operation on the second result and the offset b to obtain a third result, inputs the third result to the activation operator to perform an activation operation to obtain the output result s, and transmits the output result s to the data access unit for storage in the memory.
  • Optionally, the output result can be directly transferred to the data access unit and stored in the memory.
  • the step of performing the addition of the second result and the offset b to obtain the third result is an optional step, that is, when b is 0, this step is not required.
  • The technical solution provided by the present application realizes the calculation of convolution by a single instruction, that is, the convolution calculation instruction; the intermediate data in the convolution calculation (for example, the first result, the second result, and the third result) need not be stored or extracted, which reduces the storage and extraction operations of the intermediate data. The solution therefore has the advantages of reducing the corresponding operation steps and improving the computational effect of the convolution.
  • FIG. 6B is a flowchart of a convolutional neural network computing device performing a convolutional neural network according to an embodiment of the present application. As shown in FIG. 6B, the process of executing a convolutional neural network instruction includes:
  • In step S6B1, an IO instruction is pre-stored at the first address of the instruction storage unit.
  • step S6B2 the controller unit reads the IO instruction from the first address of the instruction storage unit, and according to the decoded control signal, the data access unit reads all corresponding convolutional neural network operation instructions from the memory and caches them. In the instruction storage unit.
  • step S6B4: the controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the device configures the various constants required for the calculation of this layer of the neural network.
  • for example, the arithmetic unit configures the values of its internal registers based on the parameters in the control signal, including, for example, the data required by the activation function.
  • step S6B5: the controller unit then reads the next COMPUTE instruction from the instruction storage unit, and based on the decoded control signal, the interconnection module sends the input data in the convolution window to each calculator in the calculation unit.
  • step S6B6: according to the control signal decoded from the COMPUTE instruction, the interconnection module connects the multiplication calculator, the addition calculator, and the activation calculator to form the first calculation topology.
  • step S6B7: the multiplier performs the multiplication of the convolution kernel w and the input data Xi to obtain a first result; the first result is input to the adder to perform the addition operation, obtaining a second result; the second result and the offset b are added to obtain a third result; the third result is input to the activation operator to perform the activation operation, obtaining the output result s; and the output result s is transmitted to the data access unit for storage in the storage medium. The step of adding the second result and the offset b to obtain the third result is optional; that is, when b is 0, this step is not required.
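  • the fused pipeline of step S6B7 can be sketched as follows. This is an illustrative Python sketch of one convolution-window position, not the patented hardware; the names conv_window and relu are chosen for illustration, and relu stands in for an unspecified activation function:

```python
# Sketch of the first calculation topology (multiplier-adder-activation)
# applied at one convolution-window position, without storing the
# intermediate results externally.

def relu(x):
    return x if x > 0.0 else 0.0

def conv_window(w, xi, b=0.0, activate=relu):
    """Compute s = activate(sum(w * xi) + b) for one window."""
    # first result: element-wise products of kernel w and input Xi
    first = [wk * xk for wk, xk in zip(w, xi)]
    # second result: accumulate the products in the adder
    second = sum(first)
    # third result: add the offset b (skipped when b == 0)
    third = second + b if b != 0.0 else second
    # output result s: activation
    return activate(third)

print(conv_window([1.0, 2.0], [3.0, -4.0], b=1.0))  # relu(3 - 8 + 1) = 0.0
```

  • as in the text, setting b to 0 simply skips the third-result step.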
  • the specific calculation method of the computing device shown in FIG. 6A is illustrated below by means of different operation instructions.
  • the operation instruction here is exemplified by a fully connected layer forward operation instruction, which can be applied to a neural network. The fully connected layer forward operation computes out = f(w1 × in + b), where in is the input neuron vector, w1 is the weight, b is the offset, and f is the activation function. From this, the calculation topology can be obtained as multiplier-adder-activation operator. The above offset b may also be 0; the specific value of the offset b may be determined by the fully connected layer forward operation instruction.
  • the artificial neural network full connection layer forward operation instruction set includes a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction, among which:
  • the CONFIG command configures various constants required for current layer calculation before each layer of artificial neural network calculation begins;
  • the NOP instruction is responsible for clearing the control signals in all control signal buffer queues of the current device, ensuring that all instructions before the NOP instruction have completed. The NOP instruction itself does not contain any calculation operations;
  • the JUMP instruction is responsible for controlling the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is responsible for moving data at one address in the device's internal address space to another address in the internal address space. This process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • the method for executing the fully connected layer forward operation instruction by the computing device shown in FIG. 6A may specifically be:
  • the control unit 615 extracts the fully connected layer forward operation instruction and its corresponding operation domain from the register unit 612, and the control unit transmits the operation domain to the data access unit.
  • the data access unit extracts the weight W1 and the offset b corresponding to the operation domain from the storage medium, and transmits the weight W1 and the offset b to the operation unit.
  • the arithmetic unit performs the operation according to the second topology (multiplier-adder-(optional) activation operator). Specifically, the multiplier multiplies the weight W1 and the input data in to obtain a first result; the first result and the offset are input to the adder to perform the addition operation, obtaining a second result; the second result is input to the activation operator to perform the activation operation, obtaining the output result; and the output result is transmitted to the data access unit for storage in the memory. After any of these steps, the current result may instead be transferred directly to the data access unit and stored in the memory, without performing the following steps. In addition, if the offset b is 0, there is no need to input the first result and the offset to the adder to obtain the second result.
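  • the forward pass this instruction realizes can be sketched as follows. This is an illustrative Python sketch, not the device itself; the name fc_forward is invented, and a ReLU-style function stands in for the unspecified activation f:

```python
# Sketch of the second topology for one output neuron of the fully
# connected layer: out = f(W1 . in + b), with the adder stage skipped
# when the offset b is 0.
def fc_forward(w1, x, b=0.0, f=lambda v: max(v, 0.0)):
    first = sum(wi * xi for wi, xi in zip(w1, x))  # multiplier stage
    second = first + b if b != 0.0 else first      # adder stage (optional)
    return f(second)                               # activation stage

print(fc_forward([0.5, -1.0], [2.0, 1.0], b=0.5))  # f(1.0 - 1.0 + 0.5) = 0.5
```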
  • FIG. 6C illustrates another, more detailed implementation of the forward operation of a single-layer artificial neural network fully connected layer. The method shown in FIG. 6C is implemented in a computing device whose arithmetic unit includes a primary (master) operation unit and one or more slave operation units; the computing device in the method of FIG. 6C is illustrated with a plurality of slave operation units. The interconnection module connects the master operation unit and the plurality of slave operation units, and may have a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure.
  • step S2.1: the first IO instruction is pre-stored in the instruction storage unit.
  • step S2.2: the controller unit reads the first IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit reads all the corresponding artificial neural network fully connected layer operation instructions from the memory and stores them in the instruction storage unit.
  • step S2.3: the controller unit then reads the second IO instruction from the instruction storage unit, and according to the control signal decoded from the second IO instruction, the data access unit reads all the data required by the master operation unit (that is, the activation operation unit) from the memory (for example, including the input neuron vector, the interpolation table, the constant table, the offset, and so on) into the first storage unit of the master operation unit.
  • step S2.4: the controller unit then reads the third IO instruction from the instruction storage unit, and according to the control signal decoded from the third IO instruction, the data access unit reads from the memory the weight matrix data required by the slave operation units (the addition calculators or multiplication calculators).
  • step S2.5: the controller unit then reads the CONFIG instruction from the instruction storage unit, and according to the decoded control signal, configures the various constants required for the calculation of this layer of the neural network.
  • step S2.6: the controller unit then reads the fully connected layer forward operation instruction from the instruction storage unit, and according to the decoded control signal, the master operation unit first sends the input neuron vector to each slave operation unit through the interconnection module and saves it to the second storage unit of the slave operation unit.
  • step S2.7: according to the control signal decoded from the COMPUTE instruction, each slave operation unit reads the weight from its third storage unit and the input neuron vector from its second storage unit, completes the dot product of the weight and the input neuron vector to obtain an intermediate result, and returns the intermediate result through the interconnection module.
  • step S2.8: in the interconnection module, the intermediate results returned by the slave operation units are combined stage by stage into a complete intermediate result vector.
  • step S2.9: the master operation unit obtains the intermediate result vector returned by the interconnection module; according to the control signal decoded from the COMPUTE instruction, it reads the offset vector from the first storage unit, adds the offset vector and the intermediate result vector in the vector addition unit to obtain an addition result, activates the addition result in the activation unit to obtain the output neuron vector, and writes the final output neuron vector back to the first storage unit.
  • step S2.10: the controller unit then reads the fourth IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit stores the output neuron vector to the designated address in the memory, and the operation ends.
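  • the master/slave split of steps S2.6 through S2.9 can be sketched as follows. This is an illustrative Python sketch under an assumed data layout (one weight-matrix row held per slave operation unit); the names slave_dot and fc_layer_forward are invented for illustration, and ReLU stands in for the unspecified activation:

```python
# Sketch of the distributed fully connected forward pass: each slave
# unit computes a dot product with the broadcast input neuron vector,
# the interconnection module gathers the partial results into the
# intermediate result vector, and the master unit adds the offset
# vector and applies the activation.
def slave_dot(weight_row, x):
    return sum(w * xi for w, xi in zip(weight_row, x))

def fc_layer_forward(weight_rows, x, bias, f=lambda v: max(v, 0.0)):
    # interconnect: one intermediate result per slave unit (step S2.8)
    intermediate = [slave_dot(row, x) for row in weight_rows]
    # master unit: vector addition with the offset, then activation (step S2.9)
    return [f(m + b) for m, b in zip(intermediate, bias)]

print(fc_layer_forward([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0], [0.0, -10.0]))
```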
  • the specific calculation method of the computing device shown in FIG. 6A is illustrated below by means of different operation instructions.
  • the operation instruction here is exemplified by a pooling operation instruction, which can be applied to machine learning, such as a neural network.
  • the pooling operation refers to performing a downsampling operation of local features in the feature layer of the neural network to reduce the dimension of the feature layer.
  • the pooling operation includes, but is not limited to, three types: maxpooling, which takes the maximum value within the kernel as the result; avgpooling, which takes the average value within the kernel as the result; and minpooling, which takes the minimum value within the kernel as the result. Here kernel_area is the area of the pooling kernel (that is, the total number of elements in the kernel).
  • according to the requirements of the actual algorithm, the pooling above may be average pooling; of course, in actual applications it may also be max pooling, min pooling, or another form of pooling.
  • from this, the calculation topology can be obtained as (optional) multiplier-adder/comparison operator-(optional) activation operator.
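  • the three pooling variants over one kernel window can be sketched as follows. This is an illustrative Python sketch; the name pool is invented, and the window is a flat list of the kernel_area elements in the kernel:

```python
# Sketch of maxpooling, minpooling, and avgpooling for one kernel
# window; avgpooling accumulates and then multiplies by 1/kernel_area,
# matching the adder-then-multiplier path in the topology above.
def pool(window, mode="max"):
    if mode == "max":
        return max(window)                        # maxpooling
    if mode == "min":
        return min(window)                        # minpooling
    if mode == "avg":
        return sum(window) * (1.0 / len(window))  # avgpooling
    raise ValueError(mode)

print(pool([1.0, 4.0, 2.0, 3.0], "max"))  # 4.0
print(pool([1.0, 4.0, 2.0, 3.0], "avg"))  # 2.5
```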
  • the pooling instruction set includes a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction, where:
  • the CONFIG instruction configures the various constants required for the current layer's calculation before each layer of the artificial neural network calculation begins; for example, 1/kernel_area can be configured using the CONFIG instruction.
  • the COMPUTE instruction includes a pooling operation instruction, which includes:
  • the Maxpooling forward operation instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the maxpooling forward operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • the Maxpooling reverse training instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the maxpooling reverse training operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • the Avgpooling forward operation instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the avgpooling forward operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • the Avgpooling reverse training instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the avgpooling reverse training operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • the Minpooling forward operation instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the minpooling forward operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • the Minpooling reverse training instruction, according to which the device extracts input data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), performs the minpooling reverse training operation in the pooling operation unit, and then writes the output back to the specified address of the memory (preferably a scratch pad memory or a scalar register file).
  • the IO instruction realizes reading input data required for calculation from the storage medium and storing the data back to the external space after the calculation is completed;
  • the NOP instruction is responsible for clearing the microinstructions currently loaded into all internal microinstruction buffer queues, ensuring that all instructions preceding the NOP instruction have completed. The NOP instruction itself does not contain any calculation operations;
  • the JUMP instruction is responsible for the jump of the next instruction address that the controller will read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is responsible for moving data at one address in the device's internal address space to another address in the internal address space. This process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • the method for performing a pooling operation of the present application includes the following stages:
  • before the forward operation, the data access unit may extract in (all the numbers in the kernel) from the memory according to the value of kernel_area stored in the instruction storage unit, and then transmit 1/kernel_area and in to the operation unit for the forward operation; the operation unit successively compares the sizes of the input vectors and takes the maximum value (or minimum value) to obtain the output vector.
  • for the maxpooling (or minpooling) reverse training instruction, the corresponding index vector is saved at the same time; the input vectors of each new pooling kernel are read cyclically, and the above comparison operation is performed to obtain the output vector of the new kernel, until the pooling operation of this layer ends. During reverse training, the operation unit routes each input gradient vector to the corresponding storage location according to the index vector saved in the forward operation, obtaining the output gradient vector.
  • for avgpooling, the data access unit may extract in (all the numbers in the kernel) from the memory according to the kernel_area stored in the instruction storage unit, and then transmit 1/kernel_area and in to the operation unit for the forward operation; the operation module 4 successively accumulates the input vectors and then multiplies by 1/kernel_area to obtain the output vector, cyclically reading the input vectors of each new kernel and performing the above accumulation and multiplication operations until the pooling operation of this layer ends. During avgpooling reverse training, the operation module 4 multiplies the input gradient vector by 1/kernel_area and routes the input gradient vector through the data access unit 3 to the corresponding storage location, obtaining the output gradient vector.
  • the control unit 615 extracts the pooling operation instruction, the operation domain corresponding to the pooling operation instruction, and the third calculation topology corresponding to the pooling operation instruction ((optional) multiplier-adder/comparison operator-(optional) activation operator) from the register unit 612; the control unit transmits the operation domain to the data access unit and the third calculation topology to the interconnection module.
  • the data access unit extracts in and 1/kernel_area corresponding to the operation domain from the storage medium, and transmits in and 1/kernel_area to the computing unit.
  • the computing unit receives the data and executes a pooling instruction.
  • the multiplication unit of the calculation unit multiplies the input data in by 1/kernel_area to obtain a first result, inputs the first result to the adder to perform the addition operation, obtaining a second result, and then (optionally) inputs the second result to the activation operator for the activation operation.
  • Other instructions are not described here.
  • FIG. 6D shows a flow chart of the forward pass of the pooling operation according to one embodiment.
  • the flowchart depicts a process for performing a pooling operation forward operation using the apparatus and instruction set of the present application.
  • step S1: an IO instruction is pre-stored at the first address of the instruction storage unit.
  • step S2: the operation starts; the control unit reads the IO instruction from the first address of the instruction storage unit, and according to the translated microinstruction, the data access unit reads all the corresponding pooling operation instructions from the storage medium and caches them in the instruction storage unit.
  • step S3: the control unit then reads the next IO instruction from the instruction storage unit, and according to the translated microinstruction, the data access unit reads all the data required by the operation unit (for example, including the input neuron vector, the interpolation table, the constant table, and so on) from the storage medium into the memory of the operation unit.
  • step S4: the controller unit then reads the next CONFIG instruction from the instruction storage unit, and according to the translated microinstruction, the device configures the various constants required for this layer's pooling operation. For example, the operation unit configures the values of its internal registers according to the parameters in the microinstruction, including, for example, the precision setting of this layer's calculation and the data of the activation function (for example, the precision bits of this layer's calculation, and, for avgpooling, the reciprocal 1/kernel_area of the pooling kernel size).
  • step S5: according to the microinstruction decoded from the COMPUTE instruction, the adder of the operation unit reads the input neuron vector and the intermediate result vector from the neuron storage unit and completes the operation on the input neuron vector (for avgpooling, accumulating the input neuron vectors and then multiplying by 1/kernel_area; for maxpooling, comparing sizes to obtain the maximum value), and writes the final output neuron vector back to the neuron storage unit.
  • step S6: the control unit then reads the next IO instruction from the instruction storage unit; according to the translated microinstruction, the data access unit stores the output neuron vector in the neuron storage unit to the designated address of the storage medium, and the operation ends.
  • FIG. 6E is a flow chart showing the reverse training of the pooling operation, according to one embodiment.
  • the flowchart depicts a process for implementing a pooling operation reverse training using the apparatus and instruction set of the present application.
  • step T1: an IO instruction is pre-stored at the first address of the instruction storage unit.
  • step T2: the operation begins; the controller unit reads the IO instruction from the first address of the instruction storage unit, and according to the translated microinstruction, the data access unit reads from the storage medium all the instructions related to the reverse training of the pooling operation and caches them in the instruction storage unit.
  • step T3: the controller unit then reads the next IO instruction from the instruction storage unit; according to the translated microinstruction, the data access unit reads all the data required by the operation unit from the storage medium into the neuron storage unit of the operation unit. The data includes the input gradient vector and, for maxpooling, the required index vector.
  • step T4: the controller unit then reads the next CONFIG instruction from the instruction storage unit, and the operation unit configures the values of its internal registers according to the parameters in the translated microinstruction, including the various constants required for this layer's pooling operation, such as, for avgpooling, the reciprocal 1/kernel_area of the pooling kernel size, the precision of this layer's calculation, and the learning rate used when updating the weights.
  • step T5: according to the microinstruction decoded from the COMPUTE instruction, the addition unit of the operation unit reads the input gradient vector and, for maxpooling, the index vector from the neuron storage unit to complete the multiplication operation (for avgpooling, multiplying by 1/kernel_area; for maxpooling, multiplying by the index vector), and passes on the output gradient vector to obtain the input gradient vector for the reverse training of the next layer, writing it back to the neuron storage unit.
  • step T6: the controller unit then reads the next IO instruction from the instruction storage unit; according to the translated microinstruction, the data access unit stores the output gradient vector in the neuron storage unit to the designated address of the storage medium, and the operation ends.
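  • the index-vector mechanism used by the maxpooling reverse training above can be sketched as follows. This is an illustrative Python sketch for a single kernel window; the names maxpool_forward and maxpool_backward are invented for illustration:

```python
# Sketch of maxpooling forward and reverse: the forward pass saves,
# per window, the index of the maximum element, and the reverse pass
# routes the incoming gradient to that saved position.
def maxpool_forward(window):
    idx = max(range(len(window)), key=lambda i: window[i])
    return window[idx], idx           # output value and index-vector entry

def maxpool_backward(grad_out, idx, kernel_area):
    grad_in = [0.0] * kernel_area
    grad_in[idx] = grad_out           # gradient flows only to the max position
    return grad_in

out, idx = maxpool_forward([1.0, 5.0, 3.0])
print(maxpool_backward(2.0, idx, 3))  # [0.0, 2.0, 0.0]
```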
  • the implementation process for a multi-layer artificial neural network is similar to the pooling operation of a single-layer neural network: after the previous layer of the neural network finishes executing, the operation instruction of the next layer takes the output neuron vector or the output gradient vector calculated in the operation unit as the input neuron vector or the input gradient vector of the next layer and performs the above calculation process, and the weight address and the weight gradient address in the instruction are likewise changed to the addresses corresponding to that layer.
  • the specific calculation method of the computing device shown in FIG. 6A is illustrated below by means of different operation instructions.
  • the operation instruction here is exemplified by a batch normalization operation instruction, which can be applied to a neural network. The batch normalization operation computes out = (in − middle1) / middle2, where middle1 and middle2 are intermediate values, which may be the same or different. From this, the calculation topology can be obtained as adder-multiplier.
  • the batch normalization instruction set includes the CONFIG instruction, the batch normalization instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG directive configures the various constants required for the current layer calculation before the batch normalization calculation begins;
  • the batch normalization instruction completes the calculation of batch normalization
  • the IO instruction realizes reading input data required for calculation from the external address space and storing the data back to the external space after the calculation is completed;
  • the NOP instruction is responsible for clearing the microinstructions in all the microinstruction buffer queues of the current device, ensuring that all the instructions before the NOP instruction have completed. The NOP instruction itself does not contain any calculation operations;
  • the JUMP instruction is responsible for controlling the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is responsible for moving data at one address in the device's internal address space to another address in the internal address space. This process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
  • the specific method for the computing device shown in FIG. 6A to perform batch normalization may include:
  • the control unit 615 extracts the batch normalization operation instruction and its corresponding operation domain from the register unit 612, and the control unit transmits the operation domain to the data access unit.
  • the data access unit extracts −middle1 and 1/middle2 corresponding to the operation domain from the storage medium, and transmits them to the operation unit.
  • the arithmetic unit executes a batch normalization operation instruction to obtain an output result, and transmits the output result to the data access unit for storage in the memory.
  • the method by which the arithmetic unit executes the batch normalization operation instruction to obtain the output result may include: the adder of the operation unit adds the input data in and −middle1 to obtain a first result, and the first result and 1/middle2 are input to the multiplier to perform the multiplication operation, obtaining the output result.
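  • the adder-multiplier topology just described can be sketched as follows. This is an illustrative Python sketch; the name batch_norm_apply is invented, and the interpretation of middle1 as the mean and middle2 as sqrt(var + eps) is an assumption consistent with the forward flow described later:

```python
# Sketch of the adder-multiplier topology for the batch normalization
# instruction: the adder forms in + (-middle1), the multiplier scales
# by 1/middle2, giving out = (in - middle1) / middle2.
def batch_norm_apply(x, neg_middle1, inv_middle2):
    first = x + neg_middle1       # adder stage, operand -middle1
    return first * inv_middle2    # multiplier stage, operand 1/middle2

# e.g. middle1 = mean = 2.0, middle2 = sqrt(var + eps) = 4.0
print(batch_norm_apply(10.0, -2.0, 0.25))  # (10 - 2) / 4 = 2.0
```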
  • Figure 6F illustrates a flow chart of the batch normalization forward operation during training in accordance with one embodiment.
  • the flowchart depicts the process of implementing the forward operation of the batch normalization operation shown in FIG. 6F using the apparatus and instruction set of FIG. 6A.
  • step F2: the operation starts; the controller unit reads the IO instruction from the first address of the instruction storage unit, and according to the translated microinstruction, the data access unit reads all the corresponding batch normalization forward operation instructions from the external address space and caches them in the instruction storage unit.
  • step F3: the controller unit then reads the next IO instruction from the instruction storage unit, and according to the translated microinstruction, the data access unit reads all the data required by the operation unit (for example, including the input neuron vector, the batch size, the learning parameters alpha and beta, the minimum value eps, the mean, the variance, and so on) from the external address space into the neuron cache unit of the operation unit.
  • step F4: the controller unit then reads the next CONFIG instruction from the instruction storage unit, and the device configures the batch normalization operation based on the translated microinstruction: for example, whether the forward calculation uses precomputed mean and variance or computes the mean and variance from the input.
  • step F5: the controller unit then reads the next COMPUTE instruction from the instruction storage unit; according to the translated microinstruction, the operation unit reads the input neuron vector from the neuron cache unit, calculates the mean and variance of the input neurons, and stores them in the intermediate value cache unit.
  • step F6: according to the microinstruction decoded from the COMPUTE instruction, the arithmetic unit subtracts the mean from the data in the input neuron cache unit and the intermediate value cache unit, divides by the square root of the variance plus the minimum value eps, and stores the result back in the intermediate value cache unit.
  • step F7: according to the microinstruction decoded from the COMPUTE instruction, the arithmetic unit reads the learning parameter alpha from the neuron cache unit, multiplies it by the intermediate value, adds the learning parameter beta, and writes the result back to the neuron cache unit.
  • step F8: the controller unit then reads the next IO instruction from the instruction storage unit; according to the translated microinstruction, the data access unit stores the output neuron vector in the neuron cache unit to the specified address in the external address space, and the operation ends.
  • the difference between the forward process of the batch normalization operation in use (inference) and the batch normalization operation in the training process is that in use, step F4 configures constant mean and variance, so no dynamic calculation is required each time; that is, step F5 is removed. The rest is the same as in FIG. 6F.
  • the reverse process for the batch normalization operation is similar to the forward process described above; the difference is in the data being operated on.
  • suppose the gradient arriving at one pixel is dl/dY, the gradient propagated backward is dl/dx, the output of the forward process is Y, and the other parameters have the same meaning as in the forward process; then the gradient propagated backward by batch normalization is dl/dx = (alpha/sqrt(var(x)+eps)) * (dl/dY − mean(dl/dY) − mean(dl/dY*Y)*Y), where mean is the averaging operation.
  • in the reverse process of batch normalization, the arithmetic unit first normalizes the gradient data, computing, for example, the mean and variance; the arithmetic unit then completes the rest of the formula in parallel. By using a dedicated arithmetic unit for the batch normalization operation, the relationship between parallel and serial computation is better balanced. This avoids the weaknesses of a CPU architecture, which performs only serial operations and becomes slow as the data size grows, and of a GPU architecture, which performs only parallel operations and handles the normalization part of the operation poorly. In this application, the data storage unit and the arithmetic unit cooperate to better balance the serial and parallel parts of the normalized operation.
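  • as a hedged illustration, the backward formula quoted above can be written out directly. The Python sketch below assumes mean() is taken over the batch dimension; the function names are invented, and alpha and eps have the same meaning as in the forward process:

```python
# Sketch of the quoted backward formula:
#   dl/dx = (alpha / sqrt(var(x) + eps))
#           * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y)
def mean(v):
    return sum(v) / len(v)

def batch_norm_backward(dl_dY, Y, var_x, alpha=1.0, eps=1e-5):
    m1 = mean(dl_dY)                               # mean(dl/dY): a reduction
    m2 = mean([g * y for g, y in zip(dl_dY, Y)])   # mean(dl/dY * Y): a reduction
    scale = alpha / (var_x + eps) ** 0.5
    # the reductions above are the serial part; the element-wise
    # remainder below can be computed in parallel
    return [scale * (g - m1 - m2 * y) for g, y in zip(dl_dY, Y)]

print(batch_norm_backward([3.0, 1.0], [1.0, 0.0], var_x=1.0, eps=0.0))  # [-0.5, -1.0]
```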
  • the computing device may support one or more computing instructions; that is, the computing device may execute one or more of the foregoing computing instructions, including but not limited to the convolution instruction, the fully connected instruction, the batch normalization instruction, or the pooling instruction. The specific structure of these instructions and how to apply them are described in the embodiments of FIG. 6A, FIG. 6B, FIG. 6C, FIG. 6D, FIG. 6E, and FIG. 6F. Optionally, in addition to the above instructions, the calculation instructions that the computing device can execute may specifically include:
  • a vector inner product (tensor) instruction, according to which the device extracts vector data of a specified size from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs the inner product calculation on the two vectors in the vector calculation unit, and writes the result back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
  • a vector outer product instruction, according to which the device extracts vector data of a specified size from specified addresses of the memory (preferably a scratch pad memory or a scalar register file), performs the outer product operation on the two vectors in the vector calculation unit, and writes the result back to a specified address of the memory (preferably a scratch pad memory or a scalar register file);
• A vector four-operation group, comprising: a vector-add-scalar instruction (VAS), according to which the device fetches vector data of a specified size from a specified address of the memory (preferably a scratch pad memory), fetches scalar data from a specified address of the scalar register file of the memory, adds the scalar value to each element of the vector in the scalar calculation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file);
• A scalar-minus-vector instruction (SSV), according to which the device fetches scalar data from a specified address of the scalar register file of the memory (preferably a scratch pad memory or a scalar register file), fetches vector data from a specified address of the memory (preferably a scratch pad memory or a scalar register file), subtracts the corresponding elements of the vector from the scalar in the vector calculation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file);
• A vector division instruction (VD), according to which the device fetches vector data of a specified size from designated addresses of the memory (preferably a scratch pad memory or a scalar register file), divides the two vectors element-wise in the vector operation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file);
• Vector logic instructions, including:
  • VAV Vector and instruction
  • the device fetches the vector data of the specified size from the specified address of the memory (preferably the scratch pad memory or the scalar register file), pairs the two vector pairs in the vector operation unit, and writes the result back and The result is written back.
  • the designated address of the memory preferably the scratch pad memory or scalar register file
• A vector-AND instruction (VAND), according to which the device fetches vector data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), ANDs together every bit of the vector in the vector operation unit, and writes the result back to the designated address of the scalar register file of the memory (preferably a scratch pad memory or a scalar register file);
• A vector-OR-vector instruction (VOV), according to which the device fetches vector data of a specified size from designated addresses of the memory (preferably a scratch pad memory), performs a bitwise OR of the two vectors in the vector operation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file);
• A vector-OR instruction (VOR), according to which the device fetches vector data of a specified size from a specified address of the memory (preferably a scratch pad memory or a scalar register file), ORs together every bit of the vector in the vector operation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file);
• A transcendental function instruction, according to which the device fetches vector data of a specified size from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs a transcendental function operation on the vector data in the operation unit, and writes the result back to the designated address of the memory (preferably a scratch pad memory or a scalar register file).
• A greater-than-or-equal operation instruction (GE), according to which the device can obtain the parameters of the instruction, including the vector length, the start addresses of the two vectors, and the storage address of the output vector, directly from the instruction or by accessing the register number of the memory (preferably a scratch pad memory or a scalar register file) provided by the instruction. The device then reads the two vectors and compares the elements at all positions in the vector comparison operation unit: if the value of the first vector at a position is greater than or equal to the value of the second vector, the value of the comparison result vector at that position is set to 1, otherwise it is set to 0. Finally, the comparison result is written back to the specified storage address of the memory (preferably a scratch pad memory or a scalar register file).
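The comparison semantics above, a 0/1 result vector written back to memory, can be sketched as follows under the same illustrative flat-memory model (function and parameter names are assumptions for illustration, not the patent's):

```python
def vector_ge(memory, addr_a, addr_b, length, addr_out):
    """Model of the GE instruction: compare two vectors element-wise and
    write back a vector of 1s (where a[i] >= b[i]) and 0s (elsewhere)."""
    a = memory[addr_a:addr_a + length]
    b = memory[addr_b:addr_b + length]
    memory[addr_out:addr_out + length] = [
        1 if x >= y else 0 for x, y in zip(a, b)
    ]
```

The less-than-or-equal variant is identical except for the comparator.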
• A less-than-or-equal operation instruction (LE), according to which the device can obtain the parameters of the instruction, including the vector length, the start addresses of the two vectors, and the storage address of the output vector, directly from the instruction or by accessing the register number of the memory (preferably a scratch pad memory or a scalar register file) provided by the instruction. The device then reads the two vectors and compares the elements at all positions in the vector comparison operation unit: if the value of the first vector at a position is less than or equal to the value of the second vector, the value of the comparison result vector at that position is set to 1, otherwise it is set to 0. Finally, the comparison result is written back to the specified storage address of the memory (preferably a scratch pad memory or a scalar register file).
• A further comparison operation instruction of the same form, according to which the device obtains the parameters of the instruction, including the vector length, the start addresses of the two vectors, and the storage address of the output vector, directly from the instruction or by accessing the register number of the memory (preferably a scratch pad memory or a scalar register file) provided by the instruction. Where the comparison holds for the corresponding elements of the two vectors, the value of the comparison result vector at that position is set to 1, otherwise it is set to 0. Finally, the comparison result is written back to the specified storage address of the memory (preferably a scratch pad memory or a scalar register file).
• A vector maximum instruction (VMAX), according to which the device extracts vector data of a specified size from a designated address of the memory (preferably a scratch pad memory or a scalar register file), selects the largest element as the result, and writes the result back to the designated address of the scalar register file of the memory (preferably a scratch pad memory or a scalar register file);
• A vector minimum instruction (VMIN), according to which the device extracts vector data of a specified size from a designated address of the memory (preferably a scratch pad memory or a scalar register file), selects the smallest element as the result, and writes the result back to the designated address of the scalar register file of the memory (preferably a scratch pad memory or a scalar register file);
• A cyclic shift operation instruction, according to which the device can obtain the parameters of the instruction directly from the instruction or by accessing the register number of the memory (preferably a scratch pad memory or a scalar register file) provided by the instruction, then performs a cyclic shift on the vector in a shift unit (which may be a separate vector shift unit or the computing unit), and writes the shifted result back to the specified storage address of the memory (preferably a scratch pad memory or a scalar register file). The cyclic shift operation instruction format is shown in Figure 3; it contains four operation fields: the start address and length of the vector, the shift step size, and the storage address of the output vector.
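Under the same illustrative flat-memory model (the function name, and the assumption that the shift direction is toward lower indices, are not taken from the patent), the four operation fields of the cyclic shift instruction map naturally onto parameters:

```python
def vector_cyclic_shift(memory, addr_in, length, step, addr_out):
    """Model of the cyclic shift instruction: rotate the vector by `step`
    positions (direction assumed: toward lower indices) and write the
    rotated vector to the output address."""
    v = memory[addr_in:addr_in + length]
    step %= length                         # shifts wrap around the vector
    memory[addr_out:addr_out + length] = v[step:] + v[:step]
```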
• A random vector generation instruction, according to which the device reads one or more random distribution parameters, together with the size and storage address of the random vector to be generated, from the instruction or from a register file of the memory (preferably a scratch pad memory or a scalar register file), then generates a random vector obeying the random distribution in the random vector generation unit, and writes the generated random vector result back to the storage address of the specified memory (preferably a scratch pad memory or a scalar register file). The random vector generation instruction may specifically be:
• A uniform distribution instruction (UNIF), according to which the device reads the upper- and lower-bound parameters of the uniform distribution, together with the size and storage address of the random vector to be generated, from the instruction or from a register file of the memory (preferably a scratch pad memory or a scalar register file), then generates a random vector obeying the uniform distribution in the random vector generation unit, and writes the generated random vector result back to the storage address of the specified memory (preferably a scratch pad memory or a scalar register file).
• A Gaussian distribution instruction (GAUS), according to which the device reads the mean and variance parameters of the Gaussian distribution, together with the size and storage address of the random vector to be generated, from the instruction or from a register file of the memory (preferably a scratch pad memory or a scalar register file), then generates a random vector obeying the Gaussian distribution in the random vector generation unit, and writes the generated random vector result back to the storage address of the specified memory (preferably a scratch pad memory or a scalar register file).
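A minimal software sketch of the two random vector generation instructions, substituting Python's standard generator for the hardware random vector generation unit (note that GAUS is parameterized by variance, so the standard deviation passed to the generator is its square root; function names are illustrative assumptions):

```python
import random

def unif(lower, upper, size):
    """Model of UNIF: a vector of `size` samples drawn uniformly
    from the interval [lower, upper]."""
    return [random.uniform(lower, upper) for _ in range(size)]

def gaus(mean, variance, size):
    """Model of GAUS: a vector of `size` samples drawn from a Gaussian
    with the given mean and variance."""
    return [random.gauss(mean, variance ** 0.5) for _ in range(size)]
```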
• The format of the above instructions is shown in Figure 7A.
  • the format of the neural network operation instruction is shown in Figure 7B.
  • the format of the matrix operation instruction is shown in Figure 7C.
  • the format of the vector operation instruction is shown in Figure 7D.
• A schematic diagram of the format of the matrix-vector operation instruction is shown in Figure 7E. It should be noted that the above instruction format diagrams are only one possible embodiment, and the formats of the above instructions are not limited to the representations in the above figures.
• An embodiment of the present application further provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, the computer program causing a computer to execute part or all of the steps of any matrix calculation method described in the foregoing method embodiments.
• An embodiment of the present application further provides a computer program product, comprising a non-transitory computer readable storage medium storing a computer program, the computer program being operative to cause a computer to perform part or all of the steps of any matrix calculation method recited in the foregoing method embodiments.
  • the artificial neural network computing device in the above embodiment can be a generalized computing device integrated with a DMA and a control unit.
  • the artificial neural network computing device may further comprise a general-purpose computing device, such as a general-purpose processor.
  • the specific implementation form of the foregoing storage medium may be a storage device, an on-chip storage medium, a memory or a storage unit, etc.
  • the specific implementation form of the instruction storage unit may be DMA.
  • the specific implementation form of the arithmetic unit may be a main operation module, a slave operation module, a discrete data operation unit or a continuous data operation unit, etc.
• The specific implementation form of the cache unit may be an instruction cache, an input neuron cache, a weight buffer, an output neuron cache, an instruction cache unit, a neuron cache unit that supports the discrete data representation, or a weight buffer unit that supports the discrete data representation, and the like; this is not limited in the embodiments of the present application.
  • One or more central nodes which are communication data centers of the on-chip network, for broadcasting or multicasting communication data to the plurality of leaf nodes;
• leaf nodes, which are communication data nodes of the on-chip network, for transmitting communication data to the central node;
  • a repeater module configured to connect the central node and the plurality of leaf nodes, and the communication data is forwarded by the repeater module;
  • the plurality of leaf nodes are divided into N groups, and the central node is separately connected to each group of leaf nodes by the repeater module.
  • the number of leaf nodes in each group is the same.
• One of ordinary skill in the art will also understand that the number of leaf nodes in each group may also be different.
  • the communication structure formed by each group of leaf nodes has self-similarity.
  • the data distribution device has a fractal tree network structure.
• One of ordinary skill in the art will also understand that the communication structure formed by each group of leaf nodes may also have other structures; that is, the structure is not limited to self-similarity.
• the plurality of leaf nodes and the central node are communicably connected in a complete multi-way tree manner through the plurality of repeater modules.
  • the central node or the leaf node may be a computing device as shown in FIG. 6A.
  • the central node or the leaf node may also be referred to as a computing unit.
  • Each node includes a local cache structure for storing a subset of the central node publishing data
• Each leaf node has an id identifier, and the id identifiers increase sequentially from one side of the topology of the complete multi-way tree.
  • the data distribution device shares a clock signal.
  • the repeater module includes a local cache structure for storing data.
• The present application further provides a data distribution method using the data distribution apparatus: the communication data is distributed to the plurality of leaf nodes by the central node. After the data sender is ready to send data, it asserts a data valid signal and places the data on the bus; after the data receiver is ready to receive data, it asserts a data ready-to-receive signal; when both the data valid signal and the data ready-to-receive signal are detected, the data sender considers that the data has been sent and received by the data receiver.
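The two-signal handshake described above reduces to a simple per-cycle rule: a transfer completes exactly when the data valid signal and the data ready-to-receive signal are both high in the same cycle. A sketch (names are illustrative assumptions, not the patent's):

```python
def handshake_transfer(data_valid, data, ready_to_receive):
    """One bus cycle of the handshake protocol: the transfer completes
    only when the sender's data-valid signal and the receiver's
    ready-to-receive signal are both high in the same cycle."""
    if data_valid and ready_to_receive:
        return True, data    # receiver latches the data off the bus
    return False, None       # data stays with the sender for a retry
```

When the receiver's local cache is full, it simply deasserts ready-to-receive, which is what makes the back-pressure behaviour described below work.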
• When broadcasting communication data from the central node to the plurality of leaf nodes, the data first enters, by a handshake protocol, the local cache of the repeater module directly connected to the central node for temporary storage; after each successful handshake, it enters the local cache of the next intermediate repeater module for temporary storage; finally, it enters a repeater module directly connected to the leaf nodes, which issues it to the group of leaf nodes connected to it.
• If, on the next clock tick, the handshake protocol between the data sender and the data receiver succeeds, the data is stored, in a pipelined manner, into the local cache of the data receiver. If the handshake protocol is unsuccessful, the data is kept in the local cache of the current layer; the current layer, acting as the data receiver of the upper layer, stops asserting its data ready-to-receive signal, so the data in the local cache of the current layer stops updating, and the data remains in the current layer until the handshake protocol succeeds.
• When the central node multicasts communication data to the plurality of leaf nodes, the data likewise first enters, by a handshake protocol, the local cache of the repeater module directly connected to the central node for temporary storage; after each successful handshake, it enters the local cache of the next intermediate repeater module for temporary storage; finally, it enters a repeater module directly connected to the leaf nodes, which issues it to the group of leaf nodes connected to it.
• When receiving the data, each leaf node selects the data of the preset bandwidth according to its corresponding id identifier.
  • the application also proposes a control device comprising the data distribution device.
  • the application also proposes a smart chip comprising the control device.
• FIG. 7 is a schematic diagram of an on-chip multi-core structure of 16+1 cores connected by an h-tree in an embodiment of the present application, where 16 and 1 are merely examples and not limiting. One of ordinary skill in the art will also understand that the structure may have 2n+m cores or yn+m cores.
  • the root node of the h-tree is a central tile, which is the starting point of data publishing; the leaf node of the h-tree is a leaf tile, which is the end point of data publishing; the remaining intermediate nodes are hubs for transmitting and distributing data;
• The 16 leaf tiles in the figure are divided into 8 groups, the number of leaf tiles in each group being 2; the hub is separately connected with each group of leaf tiles through the repeater module, the communication structure formed by each group of leaf tiles has self-similarity, and the plurality of leaf tiles and the central tile are connected in a complete binary tree manner through the plurality of repeater modules; this device implements the publishing of data from a data center to the processing units in a broadcast or multicast manner.
  • Figure 8 shows a schematic diagram of a hub structure.
  • the hub is composed of a hub_one_to_two module.
• Hub_one_to_two divides a set of full-bandwidth input data 20 into two sets of full-bandwidth data 21 and 22 for output, for transmission from the central tile to the leaf tiles.
• When the hub_one_to_two module labeled 310 has placed the data and the data valid signal on the bus, and data receiver 0 labeled 320 and data receiver 1 labeled 330 are both ready to receive, the handshake protocol is considered successful: in this beat, 310 considers that the data receivers 320 and 330 have received the data, and in the next beat, 320 and 330 store the data on the bus into their own caches.
• Assume that the central tile labeled 410 broadcasts data to initialize all the leaf tiles. At first, the local caches of all the hubs and the leaf tiles are empty and their data ready-to-receive signals are all high; hub0_0 labeled 420, which is directly connected to 410, likewise has its data ready-to-receive signal high.
• In the first beat, 410 prepares the data and sets it and the data valid signal high; since the data ready-to-receive signal of hub0_0 labeled 420 is high, the handshake between 410 and 420 succeeds, and in the second beat, 420 stores the data from the bus into its local cache for temporary storage.
• Since in the second beat the local cache of 420 already holds the data, 420 sends the data and its valid signal onto the bus toward the next layer's 430 and 431; because the data ready-to-receive signals of hub1_0 labeled 430 and hub1_1 labeled 431 are also high, the handshake between 420 and the next layer's 430 and 431 succeeds, and in the third beat, 430 and 431 store the data from the bus into their local caches for temporary storage. Execution proceeds in this way, the data moving one layer onward from the upper layer in each beat.
• In the fourth beat, the data is temporarily stored in the local cache of hub2_0 labeled 440; in the fifth beat, the data flows into the local cache of hub3_0 labeled 450 for temporary storage; in the sixth beat, after the handshake protocol succeeds, 450 passes the full-bandwidth data through its two ports to the local caches of the group of leaf tiles connected to it, at which point the data reaches leaf tile0 labeled 460.
  • the hierarchical flow of data is guaranteed.
  • this example takes hub1_0 as an example.
• The hub1_0 labeled 520 receives the data from hub0_0 labeled 510, and 520 places the data and its data valid signal on the bus toward the next layer's 530 and 531. Now set the scenario as follows: hub2_0 labeled 530 and hub2_1 labeled 531 do not assert a data ready-to-receive signal at this time and remain in this state for a while. Because the handshake between 520 and the next layer's 530 and 531 fails, the data of 520 cannot be transmitted to the next layer's 530 and 531 and stays in the local cache of 520, and 520 cannot assert its data ready-to-receive signal. In the following time, since the local cache of 510 is empty, it could receive new data again; but since 520 does not assert its data ready-to-receive signal, the handshake between 520 and 510 fails, that is, the data of 510 cannot be sent to 520. This guarantees the safety of the data in the local cache of 520 and thereby achieves reliable data transmission.
  • this example takes hub1_0 as an example.
  • the hub will be able to stream data.
  • the hub1_0 labeled 520 receives the data from the hub0_0 labeled 510.
  • 520 places the data and its data valid signal on the bus in the direction of the next layer 530 and 531.
  • the current setting scenario is as follows.
• Hub2_0 labeled 530 and hub2_1 labeled 531 assert a data ready-to-receive signal at this time and maintain this state for a while; the handshake between 520 and the next layer's 530 and 531 therefore succeeds, and the data of 520 is transmitted to the next layer's 530 and 531, so 520 can again assert its data ready-to-receive signal. If the local cache of 510 has prepared new data and placed the data and its data valid signal on the bus toward 520, then in this beat, because 520 asserts its data ready-to-receive signal, the handshake between 520 and 510 succeeds; in the next beat, 520 stores the data transmitted by 510 into its local cache and puts the data and its valid signal on the bus toward 530 and 531. It can be seen that the hub can stream data as long as the data path is unobstructed, that is, as long as the data source is sufficient.
  • the hub is named by the combination of its number of layers and the serial number.
  • the flag 610 is hub0_0, that is, the node 0 of the first layer
  • the flag 620 is hub1_0, that is, the node 0 of the second layer
• the flag 621 is hub1_1, that is, node 1 of the second layer.
• Assume that the central tile labeled 60 multicasts data to initialize all the leaf tiles.
  • the local caches of all the hubs and the leaf tiles are empty, and the data ready to receive signals are high. That is, the data path is smooth and transmitted according to the data stream.
• In the first beat, the handshake between 60 and 610 succeeds; in the second beat, 610 stores the data from the bus into its local cache for temporary storage, and in that beat the handshakes between 610 and the next layer's 620 and 621 succeed; in the third beat, 620 and 621 store the data from the bus into their local caches for temporary storage.
• When the data arrives at each leaf tile, it is full bandwidth. Assume that the preset bandwidth of each leaf tile is 16-bit data; as shown in Figure 12, each leaf tile can then select, from the full-bandwidth data according to its id number, the data multicast to itself, the position of that data in the full bandwidth being [id*16 : id*16+15]. For example, the data D0 for id number 15 is located at data[255:240], and the data D0 for id number 0 is located at data[15:0].
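The id-based selection rule data[id*16 : id*16+15] can be sketched directly by modeling the full-bandwidth word as an integer (the function name and integer model are illustrative assumptions):

```python
def leaf_slice(full_data, leaf_id, bandwidth=16):
    """Select the bit field [id*16, id*16+15] belonging to a leaf tile
    from the full-bandwidth word, modeled as a Python integer."""
    lo = leaf_id * bandwidth
    return (full_data >> lo) & ((1 << bandwidth) - 1)

# Leaf 0 owns bits [15:0], leaf 15 owns bits [255:240] of a 256-bit word.
word = 0x1234 | (0xBEEF << (15 * 16))
```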
  • FIG. 13 is a schematic diagram of an on-chip multi-core structure of 64+1 cores connected by using an x-tree in an embodiment of the present application.
• The root node of the x-tree is a central tile, which is the starting point of data publishing; the leaf nodes of the x-tree are leaf tiles, which are the end points of data publishing; the remaining intermediate nodes are hubs for transmitting and distributing data. The 64 leaf tiles in the figure are divided into 16 groups, the number of leaf tiles in each group being 4; the hub communicates with each group of leaf tiles through the repeater module, the communication structure formed by each group of leaf tiles has self-similarity, and the plurality of leaf tiles and the central tile are connected in a complete quadtree manner through the plurality of repeater modules; this device enables the distribution of data from a data center to the processing units in a broadcast or multicast manner.
  • Figure 14 shows a schematic diagram of a hub structure.
  • the hub is composed of hub_one_to_four modules.
• Hub_one_to_four divides a set of full-bandwidth input data 800 into four sets of full-bandwidth data 801, 802, 803, and 804 for output, for transmission from the central tile to the leaf tiles.
• Assume that the central tile labeled A10 broadcasts data to initialize all leaf tiles. At first, the local caches of all hubs and leaf tiles are empty and their data ready-to-receive signals are high; hub0_0 labeled A20, which is directly connected to A10, likewise has its data ready-to-receive signal high. In the first beat, A10 prepares the data and sets it and the data valid signal high; since the data ready-to-receive signal of A20 is high, the handshake between A10 and A20 succeeds, and in the second beat, A20 stores the data from the bus into its local cache for temporary storage. Since in the second beat the local cache of A20 already holds the data, A20 sends the data and its valid signal onto the bus toward the next layer's A30, A31, A32, and A33; because the data ready-to-receive signals of hub1_0 labeled A30, hub1_1 labeled A31, hub1_2 labeled A32, and hub1_3 labeled A33 are also high, the handshake between A20 and the next layer's A30, A31, A32, and A33 succeeds.
• In the third beat, A30, A31, A32, and A33 store the data from the bus into their local caches for temporary storage; execution proceeds in this way, the data moving one layer onward from the previous layer in each beat.
• Taking the branch from hub1_3 labeled A33 to the leaf tile48 labeled A50 as an example: in the fourth beat, the data flows into the local cache of hub2_12 labeled A40 for temporary storage; in the fifth beat, A40 stores the full-bandwidth data through its four ports into the local caches of the group of four leaf tiles connected to it, including A50, A51, A52, and A53; at this point, the data arrives at the leaf tile48 labeled A50.
  • the hierarchical flow of data is guaranteed.
• In the x-tree, the hub is a non-leaf node and the leaf tile is a leaf node; the serial numbers of the nodes at the same height in the tree increase sequentially counterclockwise, and the hub is named by the combination of its layer number and serial number.
  • the flag 910 is hub0_0, that is, the node 0 of the first layer
  • the flag 920 is hub1_0, that is, the node 0 of the second layer
• the flag 921 is hub1_1, that is, node 1 of the second layer.
• Assume that the central tile labeled 90 multicasts data to initialize all leaf tiles.
  • the local caches of all hubs and leaf tiles are empty, and the data ready to receive signals are high. That is, the data path is smooth and transmitted according to the data stream.
• In the first beat, the handshake between 90 and 910 succeeds; in the second beat, 910 stores the data from the bus into its local cache for temporary storage, and in that beat the handshakes between 910 and the next layer's 920, 921, 922, and 923 succeed. In the third beat, 920, 921, 922, and 923 store the data from the bus into their local caches for temporary storage; in that beat, the handshake between 920 and the next layer's 930, 931, 932, and 933 succeeds, 921 handshakes successfully with the next layer's 934, 935, 936, and 937, and 922 handshakes successfully with the next layer's 938, 939, 93a, and 93b.
• When the data arrives at each leaf tile, it is full bandwidth. Assume that the preset bandwidth of each leaf tile is 16-bit data; as shown in Figure 16, each leaf tile can then select, from the full-bandwidth data according to its id number, the data multicast to itself, the position of that data in the full bandwidth being [id*16 : id*16+15]. For example, the data D0 for id number 63 is located at data[1023:1008], and the data D0 for id number 0 is located at data[15:0].
  • the present application also discloses a machine learning computing device for sparse connection.
• the machine learning may include an artificial neural network, the device including:
• a mapping unit configured to convert the input data into input neurons, weights, and connection data, select, according to the connection data, the neurons to be computed from the input neurons, and store the computing neurons in a memory or a cache;
  • a memory for storing computational neurons, weights, and calculation instructions
• An operation unit configured to perform a corresponding operation on the computing neurons and the weights according to the calculation instruction stored in the storage device. The operation unit mainly performs a three-step operation: in the first step, the computing neurons and the weight data are multiplied to obtain a first result; in the second step, an addition-tree operation is performed to obtain a second result, specifically, the first result of the first step is added stage by stage through the addition tree to obtain the second result, or a bias is added to the first result to obtain the second result; in the third step, an activation function operation is performed on the second result to obtain the final output neuron.
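The three-step operation can be sketched as follows. Note that the addition tree reduces the products pairwise, which yields the same value as a plain sum, and that ReLU is used only as an illustrative activation since the text does not fix one here; the function and parameter names are assumptions for illustration:

```python
def sparse_neuron_op(neurons, weights, bias=0.0, activation=None):
    """Model of the three-step operation of the operation unit:
    (1) element-wise multiply, (2) addition-tree reduction plus an
    optional bias, (3) activation function."""
    products = [n * w for n, w in zip(neurons, weights)]  # step 1: multiply
    total = sum(products) + bias                          # step 2: addition tree + bias
    if activation is None:
        activation = lambda x: max(0.0, x)                # illustrative ReLU
    return activation(total)                              # step 3: activation
```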
  • the above operation unit may specifically include: an addition calculator, a multiplication calculator, and an activation calculator, and the connection relationship is as shown in FIG. 2B.
  • Each calculator corresponds to a pipeline level, and the calculation manner can save operation time and speed up calculation.
• The pipeline stages can be freely combined: the second and third pipeline stages may be merged, the first, second, and third pipeline stages may all be merged, or each pipeline stage may be responsible for different operations. For example, the first pipeline stage may be responsible for comparison operations and part of the multiplications, and the second pipeline stage may be responsible for a combination of operations such as nonlinear operations and matrix-scalar multiplication.
  • connection data is expressed as follows:
  • The first case:
  • the connection state of each output neuron with all input neurons forms a string of 0s and 1s that indicates the connection relationship of that output neuron;
  • the connection state of each input neuron with all output neurons forms a string of 0s and 1s that indicates the connection relationship of that input neuron.
  • The second case: the distance from the position of an output neuron's first connection to the first input neuron, the distance from the output neuron's second connected input neuron to the previous connected input neuron, the distance from the output neuron's third connected input neuron to the previous connected input neuron, and so on until all input neurons of the output neuron are exhausted, together represent the connection relationship of that output neuron.
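The two connection-data representations above can be illustrated with a short Python sketch (the helper names are illustrative, not from the application):

```python
# Sketch (assumed semantics): encode one output neuron's connections to
# its input neurons in the two representations described above.

def to_bitstring(connected):
    """First case: one bit per input neuron, 1 = connected, 0 = not connected."""
    return "".join("1" if c else "0" for c in connected)

def to_distances(connected):
    """Second case: distance of the first connection from the first input
    neuron, then the gap between each connection and the previous one."""
    idxs = [i for i, c in enumerate(connected) if c]
    return [idxs[0]] + [b - a for a, b in zip(idxs, idxs[1:])]

# An output neuron connected to inputs i1, i3, i4 (and not i2):
conn = [True, False, True, True]
print(to_bitstring(conn))   # 1011
print(to_distances(conn))   # [0, 2, 1]
```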
  • said artificial neural network computing device further comprises a DMA for reading or writing data or instructions in said storage device and cache.
  • the artificial neural network computing device further comprises:
  • Instruction cache for storing dedicated instructions
  • control unit configured to read the dedicated instruction from the instruction cache and decode the instruction into each operation unit instruction.
  • the artificial neural network computing device further comprises:
  • Weight buffer for caching weight data
  • the artificial neural network computing device further comprises:
  • Output a neuron cache for buffering output neurons output by the arithmetic unit.
  • The mapping unit is configured to convert the input data into a storage format in which the input neurons and the weights correspond one-to-one, and to output them to the operation unit rather than storing them in the storage device.
  • Said artificial neural network computing device further comprises an input neuron buffer and/or a weight buffer; the input neuron buffer is used to cache the input neuron data provided to said operation unit, and the weight buffer is used to cache weight data.
  • The mapping unit is configured to convert the input data into a storage format in which the input neurons and the weights correspond one-to-one, and to output them to the input neuron cache and/or the weight buffer.
  • the activation function performed by the arithmetic unit in the third step is a sigmoid function, a tanh function or a ReLU function.
  • Step 1 Convert the input data into input neurons, weights, and connection data; wherein the connection data is expressed as:
  • connection state of each output neuron and all input neurons constitutes a string of 0 and 1 to indicate the connection relationship of the output neurons
  • connection state of each input neuron and all output neurons forms a string of 0 and 1 to indicate the connection relationship of the input neurons
  • The second case: the distances between an output neuron's connections are used to represent the connection relationship, as in the second representation described above.
  • Step 2: screen the input neurons according to the connection data to obtain the computed neurons, and multiply the computed neurons by the weight data to obtain the first result;
  • when the input data directly includes the input neurons, weights, and connection data, these are extracted from the input data directly.
  • The computed neurons are obtained by screening the input neurons according to the connection data.
  • The above screening may be implemented, for example, as follows: assume there are four input neurons and a connection-data bit of 1 indicates a connection. If the connection data is 1011, as shown in FIG. 18, and the input neurons are i 1 , i 2 , i 3 and i 4 , the unconnected second neuron i 2 is deleted, giving the computed neuron data: i 1 , i 3 and i 4 . Of course, a 1 in the connection data may instead indicate no connection; in that case the unconnected i 1 , i 3 and i 4 are deleted, giving the computed neuron data: i 2 .
  • Step 3: perform an addition-tree operation on the first result to obtain the second result.
  • Step 4: perform an activation function operation on the second result to obtain the final output neuron, wherein the activation function is a sigmoid, tanh, or ReLU function.
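Steps 1-4 above can be sketched in Python as follows; the function name is an assumption, the addition tree is simulated by stepwise pairwise summation, and ReLU is chosen as the activation:

```python
def forward(inputs, weights, conn_bits):
    """Sketch of Steps 1-4: screen by connection data, multiply,
    sum via an addition tree, then apply an activation function."""
    # Step 2: keep only the connected ("computed") neurons (bit 1 = connected).
    kept = [x for x, b in zip(inputs, conn_bits) if b == 1]
    products = [x * w for x, w in zip(kept, weights)]     # first result
    # Step 3: addition tree, summing stepwise in pairs.
    while len(products) > 1:
        products = [sum(products[i:i + 2]) for i in range(0, len(products), 2)]
    second = products[0] if products else 0.0
    # Step 4: activation function (ReLU here; sigmoid or tanh also fit the text).
    return max(0.0, second)

# Connection data 1011 screens out i2, leaving three neuron/weight pairs.
print(forward([1.0, 9.0, 2.0, 3.0], [0.5, 1.0, 2.0], [1, 0, 1, 1]))  # 8.5
```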
  • The I/O interface 1 is used for I/O data, which is sent to the sparse multi-layer artificial neural network computing device 4 by the central processing unit CPU 3 and then written into the memory 2 by the device 4; the dedicated programs required by the sparse multi-layer artificial neural network computing device 4 are likewise transmitted by the CPU 3 to the device 4.
  • the memory 2 is used to temporarily store sparse multi-layer artificial neural network models and neuron data, especially when all models cannot be stored in the cache on the sparse multi-layer artificial neural network computing device 4.
  • The CPU 3 is used for basic control, such as data transfer and starting and stopping the sparse multi-layer artificial neural network computing device 4, and serves as the interface between the device 4 and external control.
  • The sparse artificial neural network operation device 4 is configured to receive data and programs from the CPU 3 and execute the sparse multi-layer artificial neural network operation algorithm; the execution result of the device 4 is then transferred back to the CPU 3.
  • the sparse artificial neural network operation device 4 is used as a coprocessor of the CPU 3 or the GPU to execute a sparse multi-layer artificial neural network operation algorithm.
  • In a system structure in which a plurality of sparse artificial neural network computing devices are interconnected, the devices 4 can be interconnected through a PCIE bus to support larger-scale sparse multi-layer artificial neural network operations; they can share the same host CPU or have their own host CPUs, and can share memory or each have its own memory.
  • the interconnection method can be any interconnection topology.
  • In a sparsely connected neural network, there are four input neurons: i 1 , i 2 , i 3 , i 4 , and two output neurons: o 1 , o 2 .
  • o 1 is connected to i 1 , i 3 , i 4 , with the connection weights represented as w 11 , w 31 , w 41 ; o 2 is connected to i 2 , i 3 , with the connection weights represented as w 22 , w 32 .
  • There are two ways to represent the connection relationship of the sparse neural network above: one uses one bit between each input neuron and each output neuron to indicate whether there is a connection, and the other uses the distance between connections to represent the position of each connection.
  • The first connection representation:
  • The connection relationship of output neuron o 1 is 1011, where each bit indicates whether there is a connection with the corresponding input neuron, 1 indicating a connection and 0 indicating no connection; the connection relationship of output neuron o 2 is 0110.
  • Input neurons whose connection bit is 0 are filtered out and deleted, that is, no operation is performed on them.
  • For output neuron o 1 , input i 2 is filtered out.
  • For o 2 , inputs i 1 and i 4 are filtered out, so the filtered input neurons need not be computed during the calculation.
  • When storing the connection relationship, it may be stored in input-neuron-first or output-neuron-first order.
  • the specific storage formats are as follows:
  • Format 1: place the connection states of all input neurons of each output neuron in turn.
  • For the example above, the stored order is 10110110.
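Format 1 for the example above can be reproduced with a one-line sketch (the function name is illustrative):

```python
def format1(conn_per_output):
    """Format 1: concatenate each output neuron's 0/1 string in turn."""
    return "".join("".join(str(b) for b in bits) for bits in conn_per_output)

# o1 -> 1011, o2 -> 0110, stored together:
print(format1([[1, 0, 1, 1], [0, 1, 1, 0]]))  # 10110110
```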
  • The second connection representation:
  • The output neuron o 1 is connected to the input neurons i 1 , i 3 , i 4 , so the connection relationship is 0, 2, 1.
  • Here 0 indicates that the distance from the position of the first connection to the first input neuron is 0, i.e. it is the first input neuron; 2 indicates that the second connected input neuron is at distance 2 from the previous input neuron, i.e. it is the third input neuron; and 1 indicates that the third connected input neuron is at distance 1 from the previous input neuron, i.e. it is the fourth input neuron.
  • the connection relationship of o 2 is 1,1.
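As a check, a small sketch (with an illustrative helper name) can decode these distance lists back to input-neuron indices:

```python
def from_distances(dists):
    """Recover 0-based connected input-neuron indices from the
    distance-based representation described above."""
    idxs, pos = [], 0
    for i, d in enumerate(dists):
        pos = d if i == 0 else pos + d  # first entry is absolute, rest are gaps
        idxs.append(pos)
    return idxs

# o1: 0, 2, 1 -> input neurons i1, i3, i4 (0-based indices 0, 2, 3)
print(from_distances([0, 2, 1]))  # [0, 2, 3]
# o2: 1, 1 -> input neurons i2, i3 (0-based indices 1, 2)
print(from_distances([1, 1]))     # [1, 2]
```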
  • The connection relationships supported by the mapping unit of the present application include, but are not limited to, the above.
  • a convolutional neural network is a type of artificial neural network.
  • the convolutional layer contains a plurality of filters, that is, convolution kernels. These convolution kernels are repeatedly applied to all input images to extract local features. Different convolution kernels can extract different kinds of local features, and an input image becomes an abstract feature that can be better understood after passing through the convolution layer.
  • Natural images have an inherent property: the statistical characteristics of one part of an image are the same as those of any other part. This means that features learned in one part can also be used in another, so the same learned features can be applied at all positions of the image.
  • The features learned from an 8*8 sample can therefore be used as a detector and applied to any part of the image.
  • Specifically, the features learned from the 8*8 sample can be convolved with the original large-size image to obtain an activation value for that feature at each location on the large image.
  • This 8*8 sample feature is called a convolution kernel. For the calculation of the above convolution, refer to the description in the embodiment of FIG. 6B, and details are not described herein.
  • Figure 21 is an example of a convolution operation.
  • the convolution kernel is a 2*2 matrix, and the convolution kernel slides over the input image.
  • the convolution kernel matrix is multiplied and added with the corresponding input image data.
  • the required input neurons are i 0 , i 1 , i 3 , i 4 , and the input weights are: w 0 , w 3 , and the connection relationship is 1001 or 0, 2;
  • the required input neurons are i 3 , i 5 , i 7 , i 8 , and the input weights are: w 0 , w 3 , and the connection relationship is 1001 or 0, 2.
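The sparse 2*2 convolution described above, with only w 0 and w 3 retained (connection 1001), can be sketched as follows; the function name and the 3*3 test image are assumptions:

```python
# Sketch: a 2x2 kernel with sparse mask 1001 keeps only positions 0 and 3,
# so each output value uses just two input pixels.

def sparse_conv2x2(image, w0, w3):
    """Slide a 2x2 kernel masked by 1001 over a 2D list-of-lists image."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - 1):
        row = []
        for c in range(w - 1):
            # Only kernel positions 0 (top-left) and 3 (bottom-right) connect.
            row.append(image[r][c] * w0 + image[r + 1][c + 1] * w3)
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]          # input neurons i0..i8, row-major
print(sparse_conv2x2(img, w0=1, w3=10))  # [[51, 62], [84, 95]]
```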
  • An artificial neural network computing device that supports sparse connections can process sparsely connected artificial neural networks in various sparse-connection representations; such a device has a unit dedicated to processing sparse connections, referred to herein as the mapping unit. The structure of the sparsely connected artificial neural network computing device differs slightly for different sparse connection relationships and processing methods, and the different structures and methods are described separately below.
  • mapping unit 1 is used to convert input data into input neurons, weights, and connection data.
  • Memory 2 is used to store data and instructions; especially when the scale of the neural network is very large and the instruction cache 4, input neuron cache 6, output neuron cache 9, and weight buffer 8 cannot hold all the data, the data can only be stored temporarily in memory 2.
  • Instruction cache 4 is used to store dedicated instructions.
  • the control unit 5 reads the dedicated instruction from the instruction buffer 4 and decodes it into each arithmetic unit instruction.
  • the input neuron cache 6 is used to store the input neuron data of the operation.
  • the operation unit 7 is configured to perform a specific operation.
  • the arithmetic unit is mainly divided into three stages, and the first stage performs a multiplication operation for multiplying the input neurons and weight data.
  • the second stage performs the addition tree operation, and the first and second stages are combined to complete the vector inner product operation.
  • the third stage performs an activation function operation, and the activation function may be a sigmoid function, a tanh function, or the like.
  • the third stage gets the output neurons and writes back to the output neuron cache.
  • the weight buffer 8 is used to store weight data.
  • the output neuron cache 9 is used to store the output neurons of the operation.
  • The structure of the mapping unit is as shown in FIG.
  • The connection relationship may be either of the two sparse representations described above, and the mapping unit outputs the mapped neurons and weights according to the connection relationship.
  • the mapped neurons and weights can be used directly during the operation without considering the connection relationship.
  • the specific process for mapping the output neurons o 1 is as follows:
  • the input neurons are: i 1 , i 2 , i 3 , i 4 , and the input weights are: w 11 , w 31 , w 41 , and the connection relationship can be: 1011, or 0, 2, 1.
  • the mapping unit changes the input neurons and the weights into corresponding relationships according to the connection relationship.
  • The output has two cases: one is to remove the unconnected input neurons, so the mapped neurons are i 1 , i 3 , i 4 and the mapped weights are w 11 , w 31 , w 41 ; the other is to insert a weight of 0 where there is no connection, so the mapped neurons are i 1 , i 2 , i 3 , i 4 and the mapped weights are w 11 , 0, w 31 , w 41 .
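The two mapping cases can be sketched as follows (the helper names are illustrative):

```python
# Sketch of the two mapping outputs described above for o1
# (connection 1011, weights w11, w31, w41).

def map_remove(inputs, weights, conn):
    """Case 1: drop unconnected inputs; the weight list stays as given."""
    kept = [x for x, b in zip(inputs, conn) if b]
    return kept, list(weights)

def map_zero_fill(inputs, weights, conn):
    """Case 2: keep all inputs; insert weight 0 where there is no connection."""
    it = iter(weights)
    return list(inputs), [next(it) if b else 0.0 for b in conn]

ins, ws, conn = ["i1", "i2", "i3", "i4"], ["w11", "w31", "w41"], [1, 0, 1, 1]
print(map_remove(ins, ws, conn))     # (['i1', 'i3', 'i4'], ['w11', 'w31', 'w41'])
print(map_zero_fill(ins, ws, conn))  # (['i1', 'i2', 'i3', 'i4'], ['w11', 0.0, 'w31', 'w41'])
```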
  • The arithmetic unit may include three parts: a first-part multiplier, a second-part addition tree, and a third-part activation function unit.
  • the first part multiplies the input neuron (in) by the weight (w) to obtain the weighted output neuron (out).
  • The storage device 1 is configured to store data and instructions; especially when the neural network is very large and the instruction cache 3, input neuron cache 6, output neuron cache 9, and weight buffer 8 cannot hold all the data, the data can only be stored temporarily in the storage device 1.
  • DMA2 is used to move data or instructions in the storage device to each cache.
  • Instruction cache 3 is used to store dedicated instructions.
  • the control unit 4 reads the dedicated instruction from the instruction buffer 3 and decodes it into each arithmetic unit instruction.
  • the mapping unit 5 is configured to convert the input data into a storage manner in which the input neurons and the weights are in one-to-one correspondence.
  • the input neuron cache 6 is used to store the input neuron data of the operation.
  • the operation unit 7 is configured to perform a specific operation.
  • the arithmetic unit is mainly divided into three stages, and the first stage performs a multiplication operation for multiplying the input neurons and weight data.
  • the second stage performs the addition tree operation, and the first and second stages are combined to complete the vector inner product operation.
  • the third stage performs an activation function operation, and the activation function may be a sigmoid function, a tanh function, or the like.
  • the third stage gets the output neurons and writes back to the output neuron cache.
  • the weight buffer 8 is used to store weight data.
  • the output neuron cache 9 is used to store the output neurons of the operation.
  • The structure of the mapping unit is as shown in FIG.
  • The connection relationship may be either of the two sparse representations described above, and the mapping unit outputs the mapped neurons and weights according to the connection relationship.
  • the mapped neurons and weights can be used directly during the operation without considering the connection relationship.
  • the specific process for mapping the output neurons o 1 is as follows:
  • the input neurons are: i 1 , i 2 , i 3 , i 4 , and the input weights are: w 11 , w 31 , w 41 , and the connection relationship can be: 1011, or 0, 2, 1.
  • the mapping unit changes the input neurons and the weights into corresponding relationships according to the connection relationship.
  • The output has two cases: one is to remove the unconnected input neurons, so the mapped neurons are i 1 , i 3 , i 4 and the mapped weights are w 11 , w 31 , w 41 ; the other is to insert a weight of 0 where there is no connection, so the mapped neurons are i 1 , i 2 , i 3 , i 4 and the mapped weights are w 11 , 0, w 31 , w 41 .
  • The main difference between the mapping units in Structure 1 and Structure 2 is that in Structure and Method 1 the input neurons and weights are mapped before calculation and the mapped data are stored in the storage device, whereas in Structure and Method 2 the mapping is performed during calculation and the mapped data are sent directly to the operation unit for computation.
  • With a slight modification of Structure and Method 2, the structure can be changed to that shown in FIG. 28, in which the mapping unit maps only the input neurons.
  • the input neurons are: i 1 , i 2 , i 3 , i 4 , and the connection relationship can be: 1011, or: 0, 2, 1.
  • the mapping unit changes the input neuron and the weight into a corresponding relationship according to the connection relationship, and removes the input neurons that are not connected, and the mapped neurons are i 1 , i 3 , i 4 .
  • With a slight modification of Structure and Method 2, the structure can be changed to that shown in FIG. 30, in which the mapping unit maps only the input weights.
  • the input weights are: w 11 , w 31 , w 41 , and the connection relationship can be: 1011, or: 0, 2, 1.
  • the mapping unit changes the input neuron and the weight into a corresponding relationship according to the connection relationship, and the mapped weights are w 11 , 0, w 31 , w 41 .
  • the present application further provides a processing system 100 for a neural network.
  • the processing system 100 of the neural network may be a computing device as shown in FIG. 6A.
  • In the computing device shown in FIG. 6A, one or more arithmetic logic units are added; the plurality of arithmetic logic units are used to perform non-linear operations. In an alternative embodiment, the computing device shown in FIG. 6A may also be extended with the units or modules of the neural network processing system shown in FIG.
  • The system includes at least one on-chip storage medium 10, at least one on-chip address indexing module 20, a multi-core processing module 30, and one or more Arithmetic Logic Unit (ALU) modules 40.
  • the multi-core processing module 30 includes a plurality of core processing sub-modules 31.
  • the on-chip address indexing module 20 is connected to the on-chip storage medium 10, and the on-chip address indexing module 20, the multi-core processing module 30, and the ALU module 40 are respectively connected to each other.
  • The multi-core processing module 30 is configured to perform the vector multiply-accumulate operations in neural network operations, and the plurality of ALU modules 40 are configured to acquire input data from the multi-core processing module 30 or the on-chip storage medium 10 to perform the non-linear operations that the multi-core processing module 30 cannot complete.
  • the plurality of core processing sub-modules 31 share the on-chip storage medium 10 and the ALU module 40.
  • the on-chip storage medium 10 is configured to store data transmitted from outside the neural network processing system or to store data generated during the processing.
  • the data generated during this process includes the processing results or intermediate results generated during the processing. These results may come from the on-chip core computing module of the processor or from other computing components, such as the ALU module 40 in this application.
  • The on-chip storage medium 10 may be a common storage medium such as static random access memory (SRAM), dynamic random access memory (DRAM), enhanced dynamic random access memory (eDRAM), or a register file (RF), or it may be a novel storage device such as non-volatile memory (NVM) or a 3D memory device.
  • the on-chip address indexing module 20 is configured to map to the correct storage address according to the input index when performing the operation to send the correct data to the multi-core processing module 30 for processing. This allows data and on-chip storage media to interact correctly.
  • the address mapping process here includes direct mapping, arithmetic transformation, and the like.
  • the indexing module can be implemented by hardware circuitry including, but not limited to, FPGA, CGRA, application specific integrated circuit ASIC, analog circuitry, and memristor.
  • the multi-core processing module 30 includes a plurality of core processing sub-modules 31 for performing vector multiply-accumulate operations in neural network operations. Specifically, the multi-core processing module 30 performs most of the operations in the neural network algorithm, which are linear operations, that is, multiply-accumulate operations.
  • the structure of each core processing module 31 can be various, such as a one-dimensional processing element (PE) implementation, a two-dimensional PE or a multi-dimensional implementation.
  • The single core processing module 31 itself is not limited to a specific implementation principle and may use different implementation methods, such as a systolic scheme or matrix-vector multiply-add operators.
  • the plurality of core processing sub-modules 31 of the multi-core processing module 30 may be in a homogeneous design or a heterogeneous design.
  • the processing module can be implemented by hardware circuitry including, but not limited to, FPGA, CGRA, application specific integrated circuit ASIC, analog circuitry, and memristor.
  • the ALU module 40 is configured to acquire input data from the multi-core processing module 30 or the on-chip storage medium to perform a non-linear operation that cannot be completed by the core processing module.
  • the module can be implemented by hardware circuitry including, but not limited to, FPGAs, CGRAs, application specific integrated circuit ASICs, analog circuits, and memristors.
  • the data path of the multi-core processing module 30, the ALU module 40, and the on-chip storage medium 10 includes, but is not limited to, H-TREE, or an interconnection technology such as FAT-TREE.
  • A plurality of core processing sub-modules 31 collectively multiplex a portion of the input to reduce bandwidth requirements: when the processing system 100 of the neural network performs processing, the same input neuron is sent separately to the multiple core processing sub-modules 31 of the multi-core processing module 30, different input weights are assigned to different core processing sub-modules 31, and the core processing sub-modules 31 each perform a vector inner product (multiply-accumulate) operation on the input neurons and their input weights to obtain different output neurons.
  • Different output neurons correspond to different weights, that is, for processing different output neurons, the input neurons are the same, and the weights are different.
  • the weights are not multiplexed by multiple cores in most cases. However, in some cases, if multiple cores work together on the same feature map, the weights can also be multiplexed.
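The input-neuron multiplexing scheme can be sketched sequentially in Python (in hardware the cores run in parallel; all names here are assumptions):

```python
# Sketch: the same input neurons are broadcast to every core; each core
# holds only the weights for its own output neurons, so the input is
# multiplexed across cores and bandwidth requirements are reduced.

def core_compute(shared_inputs, core_weights):
    """One core: a vector inner product per output neuron it owns."""
    return [sum(x * w for x, w in zip(shared_inputs, ws)) for ws in core_weights]

inputs = [1.0, 2.0, 3.0]          # broadcast (multiplexed) to all cores
core1_w = [[1.0, 0.0, 0.0]]       # core 1 computes output neuron o1
core2_w = [[0.0, 1.0, 1.0]]       # core 2 computes output neuron o2
outputs = core_compute(inputs, core1_w) + core_compute(inputs, core2_w)
print(outputs)  # [1.0, 5.0]
```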
  • the core processing part of the neural network processing system of the present application improves the processing speed of the core computing part in the neural network algorithm by increasing the number of core processing modules on the chip, so that the processor obtains higher performance.
  • Core processing refers to vector multiply-accumulate operations that take up most of the processing time in neural network algorithms. Therefore, the application can improve the operation speed of the neural network processing system, so that the neural network processing system has higher performance and is more efficient.
  • The processing system 200 of the neural network includes a plurality of on-chip storage media 201, a plurality of on-chip address indexing modules 202, a plurality of core processing modules 203, and a plurality of ALU modules 204, wherein each core processing module 203 has its own input interface and input structure, and the ALU module 204 is also partitioned so that an ALU can exist in each core.
  • The plurality of core processing sub-modules 31 each perform only a specific core operation and do not themselves have more functions; the multiple processing cores share the on-chip storage medium 10 and the ALU module 40.
  • each core processing module 203 has its own independent on-chip storage medium 201 and ALU module 204.
  • In the loosely coupled design, multiple cores can process together to achieve higher performance requirements, but each core lacks flexibility; in the tightly coupled design shown in Figure 33, each core has a certain degree of flexibility, but because each core is independent, the complexity of multi-core coordination is higher, which increases control complexity. Loose coupling suits multi-core homogeneous designs, while tight coupling suits multi-core heterogeneous designs.
  • the convolutional layer is organized according to the feature map, that is, the input is a plurality of graphs, and the output is a plurality of graphs.
  • From the output perspective, the neural network can be partitioned by having each core process one layer of output feature maps.
  • FIG. 34 includes input feature map 1, input feature map 2, core processing module 1, core processing module 2, output feature map 1, and output feature map 2, each of which is a two-dimensional matrix.
  • Input feature maps 1 and 2 are sent to core processing modules 1 and 2 respectively; core processing module 1 processes output feature map 1 and core processing module 2 processes output feature map 2, so the two modules each process one layer of output feature map. That is, when performing two-dimensional or multi-dimensional processing, the input feature maps are sent to multiple core processing modules, which each process one layer of output feature maps; only after all core processing modules have completed the current output feature map does the multi-core processing module begin processing a new output feature map.
  • Here there may be multiple input feature maps, core processing modules, and output feature maps; the processing method of the multi-core processing module is illustrated below with 2 cores (core #1, core #2), 4 output feature maps (output feature maps #1, #2, #3, #4), and 4 input feature maps (input feature maps #1, #2, #3, #4).
  • After processing starts, core #1 is responsible for output feature map #1 and core #2 for output feature map #2; input feature map #1 is sent to both core #1 and core #2 (that is, input feature map #1 is shared), and the corresponding weights are also sent to both cores for processing. When input feature map #1 has been processed, input feature map #2 is read from on-chip storage and sent to core #1 and core #2 for processing (the weights are read likewise). After core #1 and core #2 complete the processing of output feature maps #1 and #2, they start processing output feature maps #3 and #4, that is, the above operation is repeated.
  • From the output perspective, the neural network can also be partitioned by having different cores process different regions of the same output feature map. The corresponding inputs are sent to each core, and the weights are read according to the corresponding connections; here the weights may be multiplexed, as in the convolutional layer of a convolutional neural network. A new output feature map is processed only after all cores have completed the current output feature map.
  • the input feature map 1 and the input feature map 2 are both sent to the core processing module 1 and the core processing module 2, and the core processing module 1 is responsible for processing the region 1 of the output feature map 1 and the region 1 of the output feature map 2.
  • Core processing module 2 is responsible for processing region 2 of output feature map 1 and region 2 of output feature map 2. Thus, when performing two-dimensional or multi-dimensional operations, the input feature maps are sent to multiple core processing modules, which each process a different region of the same output feature map; only after all core processing modules have completed the current output feature map does the multi-core processing module begin processing a new output feature map.
  • From the output perspective, the neural network can also be partitioned so that each core processing module processes a portion of the output neurons.
  • Each core is responsible for processing different neurons, and the division method here can be various, and is not limited to the division method shown in FIG.
  • the input is sent to each core processing module, and the weights are read according to the corresponding connections.
  • the new feature map processing is performed only after all the core processing modules have completed the current output feature map processing. That is, when the neural network processing system performs the one-dimensional operation, the same input is separately sent to the plurality of core processing modules, and the plurality of core processing modules respectively process different output neurons, and the plurality of core processing modules respectively complete the current output neurons. After processing, the processing of the new input is performed.
  • Neural network partitioning includes partitioning by input neurons, partitioning by output neurons, and partitioning by weight connections.
  • The present application partitions by output neurons: an output neuron requires multiple or even all input neurons to participate in its processing, while the processing of different output neurons is mostly independent of one another.
  • Input neurons can be multiplexed according to the output neuron partition, reducing bandwidth requirements, making the processor more efficient.
  • FIG. 37 is a flowchart of a method for processing a neural network according to the present application.
  • the method is implemented in a computing device as shown in FIG. 2, FIG. 5 or FIG. 6A.
  • The computing device includes a plurality of ALUs. The method includes the following steps:
  • Step S601: the on-chip address indexing module maps the input index to the correct storage address;
  • Step S602: obtain the input data from the on-chip storage medium according to the storage address;
  • Step S603: send the input data to the multi-core processing module or the ALU module;
  • Step S604: the multi-core processing module performs the vector multiply-accumulate operations in the neural network operation, and the ALU module performs the non-linear operations that the multi-core processing module cannot complete, according to the processing result of the multi-core processing module or the input data obtained from the on-chip storage medium;
  • Step S605: cache the data generated during the processing back to the on-chip storage medium.
•   the method further comprises: transmitting the same input neuron to the plurality of core processing modules and assigning different input weights to different core processing modules; each core processing module performs a vector inner product of the input neurons with its input weights to obtain different output neurons.
•   The neural network processing system of the present application increases the number of on-chip core processing modules to speed up the core computation of the neural network algorithm, so that the processor achieves higher performance.
•   Core processing refers to the vector multiply-accumulate operations that take up most of the processing time in neural network algorithms. The application can therefore improve the operation speed of the neural network processing system, making it both faster and more efficient.
•   the forward operation of a multi-layer artificial neural network supporting discrete data representation comprises two or more layers, each with multiple neurons.
•   the input neuron vector first undergoes a dot product with the weight vector, and the result passes through the activation function to produce the output neuron.
•   the activation function may be a sigmoid, tanh, relu, or softmax function, etc., and the device supports a discretized or continuous representation of the activated output neurons.
•   the device supports converting the dot product operation into bit operations on the data, such as shifts, NOT, and XOR.
•   the device supports discrete or non-discrete representation of the data; the user can customize which layers of data use discrete representation, and can customize the number of bits of the discrete data according to specific needs, in place of the number of real values to be represented. For example, discrete data of 1 bit, 2 bits, or 3 bits can represent 2, 4, or 8 real data values, respectively.
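As a sketch of this user-customizable bit width, the snippet below shows how an n-bit discrete code simply indexes one of 2^n real values. The function names and the codebook values are hypothetical; the patent leaves the actual values user-defined.

```python
def capacity(bits):
    # n-bit discrete data can stand in for 2**n real data values
    return 2 ** bits

# Hypothetical user-defined codebook for 2-bit discrete data:
codebook_2bit = {0b00: -1.0, 0b01: -0.5, 0b10: 0.5, 0b11: 1.0}

def decode(code, codebook=codebook_2bit):
    # replace the discrete code with the real value it represents
    return codebook[code]
```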
  • FIG. 38 illustrates an example block diagram of an overall structure of an apparatus for performing artificial neural network forward operation supporting discrete data representations in accordance with an embodiment of the present application.
  • the device may be a computing device as shown in FIG. 6A.
  • a continuous discrete conversion module may be added in the computing device as shown in FIG. 6A.
•   the computing device shown in FIG. 6A may also be expanded by adding the modules or units of the device described below.
  • the apparatus comprises an instruction cache unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and a plurality of slave operation modules 6, optionally further comprising a continuous Discrete conversion module 7.
•   the instruction cache unit 1, the controller unit 2, the data access unit 3, the interconnect module 4, the main operation module 5, the slave operation modules 6, and the continuous-discrete conversion module 7 may all be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
  • the device can provide storage and computational support for discrete data.
  • the instruction cache unit 1 reads in an instruction through the data access unit 3 and caches the read instruction.
  • the controller unit 2 reads instructions from the instruction cache unit 1 and translates the instructions into micro-instructions that control the behavior of other modules, such as the data access unit 3, the main arithmetic module 5, and the slave arithmetic module 6.
  • the data access unit 3 can access the external address space, directly read and write data to each cache unit inside the device, and complete data loading and storage.
•   the data is either discretely represented or non-discretely represented; this unit is designed to be able to read data in discrete representation.
•   the interconnection module 4 is used to connect the main operation module and the slave operation modules, and can implement different interconnection topologies (such as a tree structure, a ring structure, a grid structure, a hierarchical interconnection, or a bus structure).
  • FIG. 39 schematically illustrates an embodiment of the interconnection module 4: a tree module.
  • the tree module 4 constitutes a data path between the main arithmetic module 5 and the plurality of slave arithmetic modules 6, and has a tree structure.
•   the tree module may have an n-ary tree structure, for example the binary tree path shown in FIG. 39: each node sends the upstream data identically to the two downstream nodes, merges the data returned by the two downstream nodes, and returns the result to the upstream node.
•   at the beginning of each layer's calculation, the neuron data in the main operation module 5, which may be in discrete or non-discrete representation, is sent to each slave operation module 6 through the tree module 4; after the slave operation modules 6 complete their calculations, the neuron values output by each slave operation module are assembled stage by stage in the tree into a complete vector of neurons as an intermediate result vector.
•   For discrete data representation, we specifically note the operation module dedicated to discrete data operations inside the master and slave operation modules, shown in Fig. 44. Taking the neural network fully connected layer as an example:
•   the intermediate result vector is segmented by N, each segment having N elements, and the i-th slave operation module calculates the i-th element of each segment.
•   the N elements are assembled into a vector of length N through the tree module and returned to the main operation module. Thus, if the network has only N output neurons, each slave operation unit only needs to output the value of a single neuron; if the network has m*N output neurons, each slave operation unit needs to output m neuron values.
  • the tree module supports discrete data representation in the process of storing and transmitting data.
  • FIG. 40 shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing artificial neural network forward operation according to an embodiment of the present application.
  • the main operation module 5 includes an operation unit 51, a data dependency determination unit 52, and a neuron buffer unit 53 that supports discrete data representation.
  • the neuron buffer unit 53 supporting the discrete data representation is used to buffer the input data and the output data used by the main operation module 5 in the calculation process.
  • the arithmetic unit 51 performs various arithmetic functions of the main arithmetic module 5.
•   addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup.
•   2-bit discrete data can represent 4 continuous data values.
•   For two operands drawn from 4 continuous values there are 4*4 = 16 combinations.
•   A 4*4 index table can be created and maintained, and the corresponding calculated value is found through the index table.
•   Four 4*4 index tables are required in total, one for each of the four operations.
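A minimal sketch of the four index tables follows. The 2-bit codebook values are hypothetical (the patent leaves them user-defined); the point is that runtime arithmetic on discrete operands reduces to a lookup into a precomputed 4*4 table.

```python
import operator

# Hypothetical 2-bit codebook (the real values are user-defined).
values = {0: -1.0, 1: -0.5, 2: 0.5, 3: 1.0}

# One 4x4 index table per operation: entry [a][b] holds the precomputed
# result for the pair of 2-bit codes (a, b).
ops = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}
tables = {name: [[fn(values[a], values[b]) for b in range(4)] for a in range(4)]
          for name, fn in ops.items()}

def discrete_op(name, a, b):
    # table lookup replaces the arithmetic circuit
    return tables[name][a][b]
```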
  • the corresponding bit operations may be preset for the addition, subtraction, multiplication, and division operations for different discrete data.
•   a dot product operation between discrete data and continuous data may be replaced by a bitwise exclusive OR followed by cumulative summation, multiplied by the corresponding powers of 2.
•   for a multiplication operation, if the multiplicative factor is discretely represented, the multiplication of continuous data may be replaced by operations indexed by the discrete data (e.g., bitwise XOR, NOT, and shift of the corresponding data), which reduces the number of multiplier components.
•   the function of the arithmetic unit can thus be replaced by switching operations such as an index lookup. For example, the discrete data representation of -1/2 may be specified as 01; if an operation factor is -1/2, the discrete data received by the arithmetic unit 51 is 01, and the arithmetic unit 51 uses the operation corresponding to discrete data 01.
•   Multiplying by 16 via that operation yields -8 in decimal.
•   Consider 16 divided by -2, where 16 is continuous data and -2 is discrete data, with the binary code of discrete data -2 specified as 10.
•   The arithmetic unit then uses the division operation corresponding to discrete data 10: the 8-bit fixed-point representation of 16, 00010000, is right-shifted by one bit to 00001000, and the sign bit is inverted to obtain 10001000, which is -8 in decimal.
•   Addition and subtraction operations are similar: using the binary code of the discrete data as an index, operations such as left shift, right shift, or XOR are applied, thereby realizing addition or subtraction with the real data represented by the discrete data.
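The two worked examples above can be sketched as follows. The code-to-rule assignments (01 for multiply-by--1/2, 10 for divide-by--2) are taken from the text, while the `RULES` dictionary itself is an illustrative stand-in for the preset hardware switching; either rule amounts to an arithmetic right shift by one bit followed by a sign flip, so no multiplier or divider circuit is needed.

```python
# Hypothetical code assignments, following the worked examples in the text:
# 01 -> "multiply by -1/2", 10 -> "divide by -2".
RULES = {
    0b01: lambda x: -(x >> 1),   # x * (-1/2): shift right one bit, flip sign
    0b10: lambda x: -(x >> 1),   # x / (-2):   shift right one bit, flip sign
}

def apply_discrete_rule(code, x):
    # the discrete code selects a preset shift/sign rule instead of
    # driving a real multiplier or divider
    return RULES[code](x)
```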
•   the data dependency judging unit 52 is the port through which the arithmetic unit 51 reads and writes the neuron buffer unit 53, and at the same time ensures read-write consistency of the data in the neuron buffer unit.
•   the data dependency judging unit 52 is also responsible for sending the read data to the slave operation modules through the interconnect module 4, and the output data of the slave operation modules 6 is sent directly to the arithmetic unit 51 through the interconnect module 4.
•   the commands output by the controller unit 2 are sent to the arithmetic unit 51 and the data dependency judging unit 52 to control their behavior.
  • the arithmetic unit 61 receives the microinstructions issued by the controller unit 2 and performs an arithmetic logic operation.
•   addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup.
•   2-bit discrete data can represent 4 continuous data values.
•   For two operands drawn from 4 continuous values there are 4*4 = 16 combinations.
•   A 4*4 index table can be created and maintained, and the corresponding calculated value is found through the index table; four 4*4 index tables are required in total, one for each of the four operations.
•   the corresponding bit operations may be preset for the addition, subtraction, multiplication, and division of different discrete data.
•   a dot product operation between discrete data and continuous data may be replaced by a bitwise exclusive OR followed by cumulative summation, multiplied by the corresponding powers of 2.
•   for a multiplication operation, if the multiplicative factor is discretely represented, the multiplication of continuous data may be replaced by operations indexed by the discrete data (e.g., bitwise XOR, NOT, and shift of the corresponding data), which reduces the number of multiplier components.
•   For example, if -1/2 is multiplied by 16 via the operation corresponding to its discrete code, the result in decimal is -8.
•   Consider 16 divided by -2, where 16 is continuous data and -2 is discrete data, with the binary code of discrete data -2 specified as 10.
•   The arithmetic unit then uses the division operation corresponding to discrete data 10: the 8-bit fixed-point representation of 16, 00010000, is right-shifted by one bit to 00001000, and the sign bit is inverted to obtain 10001000, which is -8 in decimal.
•   Addition and subtraction operations are similar: using the binary code of the discrete data as an index, operations such as left shift, right shift, or XOR are applied, thereby realizing addition or subtraction with the real data represented by the discrete data.
•   the data dependency judging unit 62 is responsible for the read and write operations on the neuron cache unit during the calculation process. Before performing a read/write operation, the data dependency judging unit 62 first ensures that there is no read-write consistency conflict between the data used by the instructions. For example, all microinstructions sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • the neuron buffer unit 63 supporting the discrete data representation buffers the input neuron vector data and the output neuron value data of the slave arithmetic module 6. This data can be stored and transmitted in the form of discrete data.
•   the weight buffer unit 64 supporting discrete data representation buffers the weight data required by the slave operation module 6 during the calculation process. This data may be discretely represented or not, as defined by the user. Each slave operation module 6 stores only the weights between all input neurons and a portion of the output neurons. Taking the fully connected layer as an example, the output neurons are segmented according to the number N of operation units, and the weights corresponding to the n-th output neuron of each segment are stored in the n-th slave operation unit.
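This segmentation rule can be sketched in a few lines (the helper name is hypothetical): with the output neurons cut into segments of length N, slave unit n holds the weights of the n-th neuron of every segment, i.e. every output index congruent to n modulo N.

```python
def outputs_for_slave(out_dim, N, n):
    # Slave n stores the weight rows for the n-th output neuron of each
    # segment of length N: indices n, n+N, n+2N, ...
    return [i for i in range(out_dim) if i % N == n]
```

Together the N slaves cover every output neuron exactly once, which is what lets the intermediate result vector be reassembled without overlap.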
  • Each of the slave arithmetic modules 6 calculates an output neuron value, and all of the output neuron values are combined in the interconnect module 4 to obtain an intermediate result vector.
  • Each slave arithmetic module 6 only needs to calculate the output neuron value corresponding to the module in the intermediate result vector y.
•   the interconnection module 4 assembles all the neuron values output by the slave operation modules 6 into the final intermediate result vector y.
•   the main operation module 5 performs subsequent calculations based on the intermediate result vector y, such as adding a bias, pooling (for example, MAXPOOLING or AVGPOOLING), activation, and sampling.
  • Fig. 45 is a block diagram showing the structure of an arithmetic unit which can be used for the arithmetic unit 51 in the main arithmetic module or the arithmetic unit 61 in the slave arithmetic module.
  • the input data during the operation can be discrete data or continuous data.
•   the data type judging unit 71 judges whether the input data is all continuous data, all discrete data, or mixed data containing both continuous and discrete data.
•   When the input data is all continuous data, the continuous data operation unit 72 performs the corresponding operation.
•   When the input data is all discrete data, the discrete data operation unit 73 performs the corresponding operation.
•   addition, subtraction, multiplication, and division of discrete data with discrete data can be realized by table lookup.
•   2-bit discrete data can represent 4 continuous data values.
•   For two operands drawn from 4 continuous values there are 4*4 = 16 combinations.
  • the operation decision unit 74 decides which operation to perform according to the discrete data therein.
  • the corresponding operations can be preset for different discrete data.
  • the mixed data operation unit 75 performs a corresponding operation based on the determination result of the arithmetic decision unit 74.
•   the corresponding bit operations may be preset for the addition, subtraction, multiplication, and division of different discrete data.
•   a dot product operation between discrete data and continuous data may be replaced by a bitwise exclusive OR followed by cumulative summation, multiplied by the corresponding powers of 2.
•   For a multiplication operation, if the multiplicative factor is discretely represented, the multiplication of continuous data may be replaced by operations indexed by the discrete data (e.g., bitwise XOR, NOT, and shift of the corresponding data), which reduces the number of multiplier components. For example, for a multiplication of continuous data by discrete data, -1/2 multiplied by 16, a traditional multiplier component would multiply -1/2 and 16 directly.
•   Instead, the function of the arithmetic unit can be replaced by switching operations such as an index lookup. For example, the discrete data representation of -1/2 can be specified as 01.
•   If an operation factor is -1/2, the discrete data received by the arithmetic unit 51 is 01.
•   The arithmetic unit 51 then uses the operation corresponding to discrete data 01.
•   Multiplying by 16 via that operation yields -8 in decimal.
•   Consider 16 divided by -2.
•   16 is continuous data and -2 is discrete data.
•   The binary code of discrete data -2 is specified as 10.
•   The arithmetic unit then uses the division operation corresponding to discrete data 10: the 8-bit fixed-point representation of 16, 00010000, is right-shifted by one bit to 00001000, and the sign bit is inverted to obtain 10001000, which is -8 in decimal.
•   Addition and subtraction operations are similar: using the binary code of the discrete data as an index, operations such as left shift, right shift, or XOR are applied, thereby realizing addition or subtraction with the real data represented by the discrete data.
•   Figure 46 shows the continuous-discrete conversion unit. The user can define whether this module is used to convert continuous data into discrete data. The input is continuous data and the output is discrete data.
•   The unit includes a random number generation module, a judgment module, and an operation module.
•   The operation module processes the input continuous data to obtain a result y; the judgment module compares the random number with this result to determine which interval the random number falls into, thereby determining the specific value of the output discrete data.
•   For example, suppose the user defines that binary discrete data is to be produced for any input continuous data x.
•   The judgment module outputs discrete data 1 if the random number is greater than y, and outputs discrete data 0 otherwise.
•   Discrete data 1 and 0 represent -1 and +1 of the continuous data, respectively. The resulting discrete data is stored back into memory, where it waits for the operation units in the master and slave operation modules to perform the corresponding operations.
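The judgment step can be sketched as follows. The function names are hypothetical, and `y` is assumed to be the value the operation module derived from the continuous input x, as described above; the code mappings (1 → -1, 0 → +1) follow the text.

```python
import random

def continuous_to_discrete(y, rng):
    # Judgment module: draw a uniform random number in [0, 1); output code 1
    # if it is greater than y, otherwise code 0.
    return 1 if rng.random() > y else 0

def discrete_to_continuous(code):
    # Per the text, codes 1 and 0 stand for -1 and +1 respectively.
    return -1.0 if code == 1 else 1.0
```

Because the output is random, repeated conversions of the same x give a stochastic (dithered) discrete stream rather than a fixed rounding.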
•   the weight data and the input and output data in the forward process may or may not be represented by discrete data.
•   the multiplication operation on continuous data can be replaced by XOR, NOT, shift, and similar operations based on the discrete data.
•   For example, if the weight is represented by 1-bit discrete data, with 0 representing +1 and 1 representing -1, multiplication by the weight is realized by XORing the sign bit of the data multiplied by the weight.
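A software sketch of this sign-bit trick follows (hypothetical helper; `struct` is used here to reach the IEEE-754 sign bit that the hardware would flip directly):

```python
import struct

def mul_by_1bit_weight(x, w):
    # 1-bit weight: 0 encodes +1, 1 encodes -1.  Multiplying a float by the
    # weight reduces to XORing its IEEE-754 sign bit with the weight bit,
    # so no multiplier is needed.
    bits, = struct.unpack("<I", struct.pack("<f", x))
    bits ^= w << 31                      # flip the sign bit iff the weight is -1
    result, = struct.unpack("<f", struct.pack("<I", bits))
    return result
```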
  • an instruction set for performing an artificial neural network forward operation on the aforementioned device includes a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction, among which:
  • the CONFIG command configures various constants required for current layer calculation before each layer of artificial neural network calculation begins;
•   the COMPUTE instruction completes the arithmetic and logic calculations of each layer of the artificial neural network;
•   the IO instruction reads the input data required for the calculation from the external address space and stores the data back to external space after the calculation is completed, the data supporting discretized representation;
  • the NOP instruction is responsible for clearing all the microinstructions in the microinstruction buffer queue of the current device, and ensuring that all the instructions before the NOP instruction are all completed.
  • the NOP instruction itself does not contain any calculation operations;
•   the JUMP instruction is responsible for jumping the address of the next instruction that the controller will read from the instruction cache unit, so as to implement a jump in the control flow;
•   the MOVE instruction is responsible for carrying data from one address in the device's internal address space to another address in the internal address space; the process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
•   the input neuron vector undergoes a dot product operation with the weight vector of each slave operation module 6 to obtain the corresponding output neuron values; all of these output neuron values form an intermediate result vector, which, after adding the offset vector and applying the activation operation, yields the final output neuron vector of this layer of the neural network.
•   the weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6.
•   the interconnect module sends the input neuron vector [in0, ..., inN] to all slave operation units, where it is temporarily stored in the neuron cache unit.
•   each slave operation unit i calculates the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector.
•   the results output by the slave operation units are assembled into the complete output vector through the interconnect module and returned to the main operation unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
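The per-slave dot products, gathering, offset, and activation described above can be sketched as follows. The helper name is hypothetical, NumPy stands in for the hardware datapath, and sigmoid is chosen as one of the supported activation options.

```python
import numpy as np

def layer_forward(x, W, b):
    # Slave unit i computes the dot product of its weight vector W[i] with
    # the broadcast input neuron vector x; the interconnect gathers the
    # results into a complete vector, and the main unit adds the offset b
    # and applies the activation (sigmoid here).
    partial = np.array([W[i] @ x for i in range(W.shape[0])])  # per-slave dot products
    return 1.0 / (1.0 + np.exp(-(partial + b)))                # offset + activation
```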
•   FIG. 43 is an implementation of the forward calculation of a single-layer artificial neural network supporting discrete data representation, in accordance with one embodiment.
  • the flowchart depicts an artificial neural network forward operation process for implementing a single layer discrete data representation as shown in FIG. 5 using the apparatus and instruction set of the present application. This calculation method is implemented in a computing device as shown in FIG. 2, FIG. 5 or FIG. 6A.
•   Step S1.1: storing the initial instructions into the instruction storage unit 1;
•   Step S1.2: reading an instruction from the instruction storage unit 1;
•   Step S1.3: decoding the instruction;
•   Step S1.4: performing the corresponding operation according to the decoded control signal.
  • the readable instructions include, but are not limited to, a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction.
  • step S1.3 the control signal of the corresponding module is obtained according to the operation type of the instruction (CONFIG, COMPUTE, IO, NOP, JUMP, MOVE, etc.).
•   For a CONFIG instruction, decoding yields the configuration information of the remaining modules.
•   For a COMPUTE instruction, decoding yields the control signals of the master and slave operation modules, which control the corresponding operations taken for different discrete data.
•   For an IO instruction, decoding yields the control signal of the data access module.
•   For a NOP instruction, no actual control signal is generated; it only clears the control signals in all control-signal buffer queues of the current device, ensuring that all instructions before the NOP instruction are fully executed.
•   For a JUMP instruction, a control signal for jumping the instruction stream is obtained.
•   For a MOVE instruction, a control signal for carrying data inside the device is obtained.
  • each module writes the result of the operation back to the corresponding cache.
  • the output neuron vector obtained by the main operation module is written back to the storage unit.
  • Figure 44 is another, more detailed implementation of a single layer artificial neural network forward operation in accordance with one embodiment.
  • the flowchart depicts a process for implementing a single layer neural network forward operation illustrated in FIG. 4 using the apparatus and instruction set of the present application.
•   In step S2, the operation starts: the controller unit 2 reads an IO instruction from the first address of the instruction cache unit 1, and according to the translated microinstruction, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction cache unit 1.
•   In step S3, the controller unit 2 then reads the next IO instruction from the instruction cache unit; according to the translated microinstruction, the data access unit 3 reads all data required by the main operation module 5 from the external address space (for example, including input neuron vectors, interpolation tables, constant tables, and offsets) to the neuron buffer unit 53 of the main operation module 5; the data supports discrete representation and may be all discrete or partially discrete.
•   In step S4, the controller unit 2 then reads the next IO instruction from the instruction cache unit; according to the translated microinstruction, the data access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space; the data supports discrete representation and may be all discrete or partially discrete.
  • step S5 the controller unit 2 then reads the next CONFIG command from the instruction cache unit, and according to the translated microinstruction, the device configures various constants required for the calculation of the layer neural network.
•   the arithmetic units 51, 61 configure the values of their internal registers according to the parameters in the microinstruction; the parameters include, for example, the precision setting of this layer's calculation and the data of the activation function (for example, the precision bits of this layer's calculation, the rang parameter of the LRN layer algorithm, the reciprocal of the window size of the AveragePooling layer algorithm, etc.).
•   In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction cache unit; according to the translated microinstruction, the main operation module 5 first sends the input neuron vector to each slave operation module 6 through the interconnection module 4 and saves it to the neuron buffer unit 63 of each slave operation module 6.
•   In step S8, in the interconnection module 4, the intermediate results returned by each slave operation module 6 are assembled stage by stage into a complete intermediate result vector.
•   In step S9, the main operation module 5 obtains the value returned by the interconnection module 4; according to the microinstruction translated from the COMPUTE instruction, it reads the offset vector from the neuron buffer unit 53, adds it to the vector returned by the interconnection module 4, and activates the addition result (the device supports a user-defined choice of whether to discretize the result after activation). The final output neuron vector is written back to the neuron buffer unit 53.
•   In step S10, the controller unit then reads the next IO instruction from the instruction cache unit; according to the translated microinstruction, the data access unit 3 stores the output neuron vector in the neuron buffer unit 53 to the designated address in the external address space, and the operation ends.
  • FIG. 47 is a schematic diagram of a neural network computing device in accordance with the present embodiment.
  • the neural network computing device may be a computing device as shown in FIG. 6A.
•   a power conversion unit may be added; the power conversion unit is connected to the storage medium and is used for converting non-power weight data in the neural network input data into power weight data.
  • the foregoing computing device may further include: a control unit, an operation unit, and the like.
•   the illustrated computing device can also be augmented or expanded into the neural network computing device shown in the figure.
  • the structure of the neural network computing device is as shown in FIG. 47, including:
•   a storage unit 1 for storing data and operation instructions;
•   a control unit, connected to the storage unit, for controlling the interaction of data and operation instructions, receiving the data and operation instructions sent by the storage unit, and decoding the operation instructions into operation microinstructions;
•   an operation unit 7, connected to the control unit, which receives the data and operation microinstructions sent by the control unit and performs a neural network operation on the received neuron data and weight data according to the operation microinstructions;
•   a power conversion unit 9, connected to the storage unit, for converting the input neuron data and/or output neuron data of the neural network operation into power neuron data.
  • control unit includes:
  • the data control module 2 is connected to the storage unit, and is used for interaction between data and operation instructions between the storage unit and each cache module;
  • the instruction cache module 3 is connected to the data control module and configured to receive an operation instruction sent by the data control module;
  • the decoding module 4 is connected to the instruction cache module, and is configured to read an operation instruction from the instruction cache module and decode the operation instruction into each operation micro instruction;
  • the input neuron cache module 5 is connected to the data control module and configured to receive the neuron data sent by the data control module;
  • the weight caching module 6 is connected to the data control module and configured to receive the weight data sent from the data control module.
•   the operation unit 7 is respectively connected to the decoding module, the input neuron cache module, and the weight cache module, receives the operation microinstructions, the neuron data, and the weight data, and performs the corresponding operations on the received neuron data and weight data according to the operation microinstructions.
  • the output neuron buffer unit 8 is connected to the operation unit for receiving neuron data output by the operation unit; and transmitting the data to the data control module 2 of the control unit. This can be used as input data for the next layer of neural network operations.
  • the storage unit receives data and instructions from the external address space, the data including neural network weight data, neural network input data, and the like.
•   The first power conversion method:
s_out = s_in
d_out+ = ⌊log₂(d_in+)⌋
where d_in is the input data of the power conversion unit; d_out is the output data of the power conversion unit; s_in is the sign of the input data; s_out is the sign of the output data; d_in+ is the positive part of the input data, d_in+ = d_in × s_in; d_out+ is the positive part of the output data, d_out+ = d_out × s_out; and ⌊x⌋ indicates the floor operation on the data x.
•   The second power conversion method:
s_out = s_in
d_out+ = ⌈log₂(d_in+)⌉
where d_in, d_out, s_in, s_out, d_in+, and d_out+ are as above, and ⌈x⌉ indicates the ceiling operation on the data x.
•   The third power conversion method:
s_out = s_in
d_out+ = [log₂(d_in+)]
where [x] indicates the rounding operation on the data x.
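Under the assumption that power data stores a sign plus a base-2 exponent, the three conversion methods differ only in how the logarithm of the magnitude is rounded. A sketch with a hypothetical helper name:

```python
import math

def power_convert(d_in, mode="floor"):
    # The sign passes through unchanged (s_out = s_in); the magnitude is
    # replaced by its base-2 exponent, taken with floor, ceil, or round
    # depending on which of the three conversion methods is selected.
    s = -1 if d_in < 0 else 1
    rounders = {"floor": math.floor, "ceil": math.ceil, "round": round}
    exponent = rounders[mode](math.log2(abs(d_in)))
    return s, exponent  # represents the value s * 2**exponent
```

Storing only a sign and a small exponent is what lets later multiplications by power weight data collapse into shifts.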
  • Figure 48 is a schematic diagram of a neural network computing device in accordance with the present embodiment.
  • the neural network computing device of this embodiment includes:
  • the storage unit 101 is configured to store data and an operation instruction; the storage unit receives data and operation instructions from an external address space, and the data includes neural network weight data, neural network input data, and the like.
  • control unit coupled to the storage unit, for controlling interaction of data and operation instructions, receiving data and instructions sent by the storage unit, and decoding the operation instructions into operation microinstructions;
  • the operation unit 107 is connected to the control unit, receives the data and the operation micro-instruction sent by the control unit, and performs a neural network operation on the weight data and the neuron data received according to the operation micro-instruction;
  • An output neuron buffer unit 108 connected to the operation unit, for receiving neuron data output by the operation unit, and transmitting the same to the control unit;
  • a power conversion unit 109 coupled to the storage unit, for converting input neuron data and/or output neuron data of a neural network operation into power neuron data;
  • the power conversion unit 110 is connected to the output neuron buffer unit 108 for converting the neuron data after the neural network operation into power neuron data and transmitting the data to the control unit.
  • control unit includes:
  • the data control module 102 is connected to the storage unit, and is used for interaction between data and operation instructions between the storage unit and each cache module;
  • the instruction cache module 103 is connected to the data control module and configured to receive an operation instruction sent by the data control module;
  • the decoding module 104 is connected to the instruction cache module, and is configured to read an operation instruction from the instruction cache module and decode the operation instruction into each operation micro instruction;
  • the weight caching module 106 is coupled to the data control module for receiving weight data transmitted from the data control module.
  • the operation unit 107 is respectively connected to the decoding module, the input neuron cache module, and the weight cache module, receives each operation microinstruction, the neuron data, and the weight data, and performs the corresponding operations on the received neuron data and weight data according to each operation microinstruction.
  • the power conversion unit 110 is connected to the data control module, and is configured to convert the neuron data after the neural network operation into power neuron data and send it to the data control module 102 of the control unit.
  • the power neuron data obtained by the power conversion unit 110 can be used as an input neuron of the next layer of the neural network operation.
  • the embodiment of the present disclosure further provides a neural network operation method
  • FIG. 49 is a flowchart of the neural network operation method of the embodiment.
  • the neural network of the embodiment of the present disclosure is a multi-layer neural network.
  • the operation method shown in FIG. 49 can be performed, wherein the power weight data input to the first layer of the neural network can be read into the storage unit from the external address space; if the weight data read from the external address space is already power weight data, it is directly transferred to the storage unit; otherwise, it is first converted into power weight data by the power conversion unit.
  • the single layer neural network operation method of this embodiment includes:
  • in step S1, the instruction, the neuron data, and the power weight data are acquired.
  • the step S1 includes the following sub-steps:
  • the data control module receives the instruction, the neuron data, and the power weight data sent by the storage unit.
  • the instruction cache module, the input neuron cache module, and the weight buffer module respectively receive the operation instruction, the neuron data, and the power weight data sent by the data control module, and distribute the data to the decoding module or the operation unit.
  • the power weight data indicates that the value of the weight data is represented by its power exponent value. Specifically, the power weight data includes a sign bit and power bits: the sign bit represents the sign of the weight data with one or more bits, and the power bits represent the power data of the weight data with m bits, where m is a positive integer greater than 1.
  • the storage unit prestores a coding table, and provides an exponential value corresponding to each power bit data of the power weight data.
  • the coding table sets one or more power-bit data (ie, zero-powered bit data) to specify that the corresponding power weight data is zero. That is to say, when the power bit data of the power weight data is the zero power bit data in the code table, it indicates that the power weight data is 0.
  • the coding table may have a flexible storage manner, and may be stored in a form of a table or a mapping by a function relationship.
  • the correspondence of the coding tables can be arbitrary.
  • the correspondence of the coding tables may be out of order.
  • when the power bit data is 00001, the corresponding exponent value is 3.
  • when the power bit data is 00010, the corresponding exponent value is 4.
  • when the power bit data is 00011, the corresponding exponent value is 1.
  • when the power bit data is 00100, the corresponding power weight data is 0.
  • the correspondence relationship of the coding table may also be positively correlated.
  • the storage unit prestores an integer value x and a positive integer value y; the smallest power bit data corresponds to an exponent value of x, and any other one or more pieces of power bit data correspond to power weight data of 0.
  • x represents the offset value and y represents the step size.
  • the smallest power bit data corresponds to an exponent value of x, the largest power bit data corresponds to power weight data of 0, and the power bit data other than the minimum and maximum corresponds to an exponent value of (power bit data + x) * y.
  • the neural network computing device By presetting different x and y and by changing the values of x and y, the range of representations of the powers becomes configurable and can be applied to different application scenarios requiring different ranges of values. Therefore, the neural network computing device has a wider application range, is more flexible to use, and can be adjusted according to user requirements.
  • y is 1, and the value of x is equal to -2^(m-1).
  • the exponent range of the numerical value represented by the power weight data is -2^(m-1) to 2^(m-1)-1.
  • a part of the coding table in which m is 5, x is 0, and y is 1.
  • when the power bit data is 00000, the corresponding exponent value is 0.
  • when the power bit data is 00010, the corresponding exponent value is 2.
  • when the power bit data is 00011, the corresponding exponent value is 3.
  • when the power bit data is 11111, the corresponding power weight data is 0.
  • another part of the coding table, in which m is 5, x is 0, and y is 2.
  • when the power bit data is 00000, the corresponding exponent value is 0.
  • when the power bit data is 00001, the corresponding exponent value is 2.
  • when the power bit data is 00010, the corresponding exponent value is 4.
  • when the power bit data is 00011, the corresponding exponent value is 6.
  • when the power bit data is 11111, the corresponding power weight data is 0.
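  • For illustration (following the positively correlated formula above, with the all-ones code reserved for zero; the function name and dictionary representation are ours), such a coding table might be built as:

```python
def make_coding_table(m=5, x=0, y=1):
    """Positively correlated coding table: codes map to the exponent
    (power bit data + x) * y, and the largest code (all ones, e.g.
    11111 for m = 5) marks a power weight value of 0."""
    zero_code = (1 << m) - 1
    table = {code: (code + x) * y for code in range(zero_code)}
    table[zero_code] = None   # None flags power weight data 0
    return table

t = make_coding_table(m=5, x=0, y=2)
assert t[0b00010] == 4 and t[0b00011] == 6
assert t[0b11111] is None
```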
  • the correspondence relationship of the coding table may be negatively correlated.
  • the storage unit prestores an integer value x and a positive integer value y; the largest power bit data corresponds to an exponent value of x, and any other one or more pieces of power bit data correspond to power weight data of 0.
  • x represents the offset value and y represents the step size.
  • the largest power bit data corresponds to an exponent value of x, the smallest power bit data corresponds to power weight data of 0, and the power bit data other than the minimum and maximum corresponds to an exponent value of (power bit data - x) * y.
  • y is 1, and the value of x is equal to 2^(m-1).
  • the exponent range of the numerical value represented by the power weight data is -2^(m-1)-1 to 2^(m-1).
  • when the power bit data is 11110, the corresponding exponent value is 1.
  • when the power bit data is 11101, the corresponding exponent value is 2.
  • when the power bit data is 11100, the corresponding exponent value is 3.
  • when the power bit data is 00000, the corresponding power weight data is 0.
  • the correspondence relationship of the coding table may be that the highest bit of the power bit data represents a zero flag, and the other m-1 bits of the power bit data correspond to an exponent value. When the highest bit of the power bit data is 0, the corresponding power weight data is 0; when the highest bit of the power bit data is 1, the corresponding power weight data is not 0. The reverse is also possible. In other words, one bit of the power bits of the power weight data is used to indicate whether the power weight data is zero.
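  • A sketch of the zero-flag variant (the two's-complement layout of the remaining m-1 exponent bits is our illustrative assumption, following the two's-complement coding used elsewhere in the text):

```python
def decode_power_weight(bits, m=6):
    """Decode m power bits whose highest bit is a zero flag: flag 0 means
    the weight is 0; flag 1 means a nonzero weight whose exponent is the
    low m-1 bits, interpreted as two's complement (an assumption here)."""
    if not (bits >> (m - 1)) & 1:
        return 0.0
    e = bits & ((1 << (m - 1)) - 1)
    if e >= 1 << (m - 2):          # sign-extend the (m-1)-bit exponent
        e -= 1 << (m - 1)
    return 2.0 ** e

assert decode_power_weight(0b100011) == 8.0      # flag 1, exponent 3
assert decode_power_weight(0b000011) == 0.0      # flag 0 -> zero weight
```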
  • in step S2, the neural network operation is performed on the neuron data and the power weight data according to the operation instruction.
  • the step S2 includes the following sub-steps:
  • the decoding module reads the instruction from the instruction cache module, and decodes the instruction into each operation instruction;
  • the operation unit respectively receives the operation instruction, the power weight data, and the neuron data sent by the decoding module, the input neuron cache module, and the weight cache module, and performs the neural network operation on the neuron data and the power weight data according to the operation instruction.
  • the multiplication of a neuron and a power weight is specifically performed as follows: the neuron data sign bit and the power weight data sign bit are XORed; if the correspondence of the coding table is out of order, the coding table is searched to find the exponent value corresponding to the power bits of the power weight data; if the correspondence of the coding table is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to find the exponent value corresponding to the power bits of the power weight data; if the correspondence of the coding table is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to find the exponent value corresponding to the power bits of the power weight data; the exponent value is then added to the exponent bits of the neuron data, and the valid bits of the neuron data remain unchanged.
  • the neuron data is 16-bit floating point data: the sign bit is 0, the exponent bits are 10101, and the valid bits are 0110100000.
  • the actual value represented is 1.40625*2^6.
  • the power weight data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5.
  • in the coding table, when the power bit data is 11111, the corresponding power weight data is 0; when the power bit data takes other values, it corresponds to the corresponding two's complement code.
  • the power weight is 000110, and the actual value it represents is 64, that is, 2^6.
  • the sum of the power bits of the power weight and the exponent bits of the neuron is 11011, so the actual value of the result is 1.40625*2^12, which is the product of the neuron and the power weight.
  • the multiplication operation becomes an addition operation, and the amount of calculation required for the calculation is reduced.
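  • Example 1 can be checked mechanically. The following sketch (a simplification that ignores exponent overflow and special values; names are ours) simulates the operation for 16-bit floating point neurons:

```python
import struct

def half_to_float(bits):
    """Decode a 16-bit IEEE 754 half-precision bit pattern to a float."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def mul_neuron_by_power_weight(neuron_bits, w_sign, w_exp):
    """XOR the sign bits and add the weight's exponent to the neuron's
    exponent field; the valid bits (significand) stay unchanged."""
    sign = (neuron_bits >> 15) ^ w_sign
    exp = ((neuron_bits >> 10) & 0x1F) + w_exp   # no overflow handling
    mant = neuron_bits & 0x3FF
    return (sign << 15) | (exp << 10) | mant

# example 1 from the text: neuron 0|10101|0110100000 = 1.40625 * 2**6
n = 0b0101010110100000
assert half_to_float(n) == 1.40625 * 2**6
# power weight 000110: sign 0, power bits 00110 -> exponent +6 (value 2**6)
r = mul_neuron_by_power_weight(n, 0, 6)
assert half_to_float(r) == 1.40625 * 2**12       # exponent bits become 11011
```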
  • the specific example 2 is shown in Figure 49.7.
  • the neuron data is 32-bit floating point data: the sign bit is 1, the exponent bits are 10000011, and the valid bits are 10010010000000000000000.
  • the actual value represented is -1.5703125*2^4.
  • the power weight data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5.
  • in the coding table, when the power bit data is 11111, the corresponding power weight data is 0; when the power bit data takes other values, it corresponds to the corresponding two's complement code.
  • the power weight is 111100, and its actual value is -2^-4. The sum of the exponent bits of the neuron and the power bits of the power weight is 01111111, so the actual value of the result is 1.5703125*2^0, which is the product of the neuron and the power weight.
  • the method further includes step S3: outputting the neuron data after the neural network operation as the input data of the next layer of the neural network operation.
  • the output neuron buffer unit receives the neuron data obtained by the neural network operation sent by the computing unit.
  • the power neuron data obtained by the power conversion unit can be used as the input power neuron of the next layer of the neural network operation, and then steps 1 to 3 are repeated until the last layer operation of the neural network ends.
  • the power neuron data range that can be represented by the neural network operation device can be adjusted.
  • FIG. 50 is a flowchart of the neural network operation method of the embodiment.
  • in step S4, the instruction, the power neuron data, and the power weight data are acquired.
  • the step S4 includes the following sub-steps:
  • the data control module receives the operation instruction, the power neuron data, and the power weight data sent by the storage unit.
  • the instruction cache module, the input neuron cache module, and the weight buffer module respectively receive the instruction, the power neuron data, and the power weight data sent by the data control module, and distribute the data to the decoding module or the operation unit.
  • the correspondence of the coding tables can be arbitrary.
  • the correspondence of the coding tables may be out of order.
  • when the power bit data is 00001, the corresponding exponent value is 3.
  • when the power bit data is 00010, the corresponding exponent value is 4.
  • when the power bit data is 00011, the corresponding exponent value is 1.
  • when the power bit data is 00100, the corresponding power neuron data and power weight data are 0.
  • the correspondence relationship of the coding table may also be positively correlated.
  • the storage unit prestores an integer value x and a positive integer value y; the smallest power bit data corresponds to an exponent value of x, and any other one or more pieces of power bit data correspond to power neuron data and power weight data of 0.
  • x represents the offset value and y represents the step size.
  • the smallest power bit data corresponds to an exponent value of x, the largest power bit data corresponds to power neuron data and power weight data of 0, and the power bit data other than the minimum and maximum corresponds to an exponent value of (power bit data + x) * y.
  • the representation range of the powers becomes configurable and can be applied to different application scenarios requiring different ranges of values. Therefore, the neural network computing device has a wider application range, is more flexible to use, and can be adjusted according to user requirements.
  • a part of the coding table in which m is 5, x is 0, and y is 1:
  • when the power bit data is 00000, the corresponding exponent value is 0.
  • when the power bit data is 00010, the corresponding exponent value is 2.
  • when the power bit data is 00011, the corresponding exponent value is 3.
  • when the power bit data is 11111, the corresponding power neuron data and power weight data are 0.
  • Fig. 50.3 shows another part of the coding table, in which m is 5, x is 0, and y is 2:
  • when the power bit data is 00000, the corresponding exponent value is 0.
  • when the power bit data is 00001, the corresponding exponent value is 2.
  • when the power bit data is 00010, the corresponding exponent value is 4.
  • when the power bit data is 00011, the corresponding exponent value is 6.
  • when the power bit data is 11111, the corresponding power neuron data and power weight data are 0.
  • when the power bit data is 11110, the corresponding exponent value is 1.
  • when the power bit data is 11101, the corresponding exponent value is 2.
  • when the power bit data is 11100, the corresponding exponent value is 3.
  • when the power bit data is 00000, the corresponding power neuron data and power weight data are 0.
  • the correspondence relationship of the coding table may be that the highest bit of the power bit data represents a zero flag, and the other m-1 bits of the power bit data correspond to an exponent value. When the highest bit of the power bit data is 0, the corresponding power neuron data and power weight data are 0; when the highest bit of the power bit data is 1, the corresponding power neuron data and power weight data are not 0.
  • the reverse is also possible, that is, when the highest bit of the power bit data is 1, the corresponding power neuron data and power weight data are 0; when the highest bit of the power bit data is 0, the corresponding power neuron data and power weight data are not 0.
  • in other words, one bit of the power bits of the power neuron data and the power weight data is used to indicate whether the power neuron data and the power weight data are zero.
  • the storage space required to store the neuron data and the weight data can be reduced.
  • the power data is 8-bit data; it should be recognized that the data length is not fixed, and on different occasions different data lengths are used according to the data ranges of the neuron data and the weight data.
  • step S5 the neural network operation is performed on the power neuron data and the power weight data according to the operation instruction.
  • the step S5 includes the following sub-steps:
  • the decoding module reads the operation instruction from the instruction cache module, and decodes the operation instruction into each operation micro instruction;
  • the operation unit respectively receives the operation instruction, the power neuron data, and the power weight data sent by the decoding module, the input neuron cache module, and the weight cache module, and performs the neural network operation on the power neuron data and the power weight data according to the operation microinstructions.
  • the multiplication of a power neuron and a power weight is specifically performed as follows: the power neuron data sign bit and the power weight data sign bit are XORed; if the correspondence of the coding table is out of order, the coding table is searched to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; if the correspondence of the coding table is positively correlated, the minimum exponent value of the coding table is recorded and additions are performed to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; if the correspondence of the coding table is negatively correlated, the maximum value of the coding table is recorded and subtractions are performed to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; the exponent value corresponding to the power neuron data and the exponent value corresponding to the power weight data are then added.
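  • Since both operands are powers, the whole multiplication reduces to two integer operations. A minimal sketch (the function name and the sign/exponent tuple representation are illustrative):

```python
def mul_powers(n_sign, n_exp, w_sign, w_exp):
    """Multiply a power neuron (-1)**n_sign * 2**n_exp by a power weight
    (-1)**w_sign * 2**w_exp: XOR the sign bits, add the exponents."""
    return n_sign ^ w_sign, n_exp + w_exp

# (-2**3) * (+2**-1) = -2**2
assert mul_powers(1, 3, 0, -1) == (1, 2)
```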
  • the method of this embodiment may further include, in step S6, outputting the neural network data after the neural network operation as the input data of the next layer neural network operation.
  • the bandwidth required for transmitting it to the data control module is greatly reduced compared with the bandwidth required for floating point data, thereby further reducing the overhead of neural network storage resources and computing resources and increasing the operation speed of the neural network.
  • All of the modules of the disclosed embodiments may be hardware structures; physical implementations include, but are not limited to, physical devices such as transistors, memristors, and DNA computers.
  • Step S101 is actually a process of pruning the neural network; in step S1022, the pruned neural network is retrained using a back-propagation algorithm, and the weights that have been set to 0 will always remain 0 during the training process.
  • the method for selecting a set of weights of the neural network may be as follows: the arithmetic mean of the absolute values of all weights in the group is smaller than a first threshold; or the geometric mean of the absolute values of all weights in the group is smaller than a second threshold; or the maximum of the absolute values of all weights in the group is smaller than a third threshold.
  • the selection of each of the foregoing first threshold, the second threshold, and the third threshold may be preset according to the situation, and the disclosure is not limited thereto.
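  • For illustration only (the function name, thresholds, and the small epsilon guarding the logarithm are ours, not the disclosure's), the three selection criteria can be sketched as:

```python
import numpy as np

def group_prunable(weights, t1, t2, t3):
    """A group of weights qualifies for coarse-grained pruning when the
    arithmetic mean, geometric mean, or maximum of its absolute values
    falls below the first, second, or third threshold respectively."""
    a = np.abs(weights)
    geo_mean = np.exp(np.log(a + 1e-12).mean())  # epsilon avoids log(0)
    return bool(a.mean() < t1 or geo_mean < t2 or a.max() < t3)
```

A group of small weights such as [0.01, -0.02] would be selected under thresholds of 0.1, while a group like [1.0, 2.0] would not.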
  • pruning the neural network may include: pruning the weights of the fully connected layer, the convolution layer, or the LSTM layer of the neural network.
  • the sliding window can slide along the direction of Bin with a stride of Sin, or along the direction of Bout with a stride of Sout, where Sin is a positive integer greater than or equal to 1 and less than or equal to Bin, and Sout is a positive integer greater than or equal to 1 and less than or equal to Bout.
  • the convolutional layer of the neural network can be regarded as a four-dimensional matrix (Nfin, Nfout, Kx, Ky), where Nfin represents the number of input feature maps, Nfout represents the number of output feature maps, and (Kx, Ky) represents the size of the convolution kernel.
  • during coarse-grained pruning, we first set a sliding window of size Bfin*Bfout*Bx*By, where Bfin is a positive integer greater than or equal to 1 and less than or equal to Nfin, Bfout is a positive integer greater than or equal to 1 and less than or equal to Nfout, Bx is a positive integer greater than or equal to 1 and less than or equal to Kx, and By is a positive integer greater than or equal to 1 and less than or equal to Ky.
  • the sliding window can slide along the direction of Bfin with a stride of Sfin, or along the direction of Bfout with a stride of Sfout, or along the direction of Bx with a stride of Sx, or along the direction of By with a stride of Sy, where Sfin is a positive integer greater than or equal to 1 and less than or equal to Bfin, Sfout is a positive integer greater than or equal to 1 and less than or equal to Bfout, Sx is a positive integer greater than or equal to 1 and less than or equal to Bx, and Sy is a positive integer greater than or equal to 1 and less than or equal to By.
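  • A minimal sketch of the convolution-layer case (the function name and the use of the max-absolute-value criterion are our illustrative choices; other selection criteria from the text would work equally well):

```python
import numpy as np

def coarse_prune_conv(w, window, stride, threshold):
    """Slide a (Bfin, Bfout, Bx, By) window over a (Nfin, Nfout, Kx, Ky)
    weight tensor and zero every group whose max |weight| is below the
    threshold; all weights of a selected group are zeroed together."""
    Bfin, Bfout, Bx, By = window
    Sfin, Sfout, Sx, Sy = stride
    Nfin, Nfout, Kx, Ky = w.shape
    for i in range(0, Nfin - Bfin + 1, Sfin):
        for j in range(0, Nfout - Bfout + 1, Sfout):
            for x in range(0, Kx - Bx + 1, Sx):
                for y in range(0, Ky - By + 1, Sy):
                    g = w[i:i+Bfin, j:j+Bfout, x:x+Bx, y:y+By]
                    if np.abs(g).max() < threshold:
                        g[...] = 0.0  # Bfin*Bfout*Bx*By weights zeroed at once
    return w
```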
  • the weight of the LSTM layer is composed of multiple fully connected layer weights. It is assumed that the weight of the LSTM layer is composed of m full connection layer weights, where m is a positive integer greater than zero.
  • the i-th fully connected layer weight is (Nin_i, Nout_i), where i is a positive integer greater than 0 and less than or equal to m, Nin_i represents the number of input neurons of the i-th fully connected layer weight, and Nout_i represents the number of output neurons of the i-th fully connected layer weight.
  • during coarse-grained pruning, we set a sliding window of size Bin_i*Bout_i, where Bin_i is a positive integer greater than or equal to 1 and less than or equal to Nin_i, and Bout_i is a positive integer greater than or equal to 1 and less than or equal to Nout_i.
  • the sliding window can slide along the direction of Bin_i with a stride of Sin_i, or along the direction of Bout_i with a stride of Sout_i, where Sin_i is a positive integer greater than or equal to 1 and less than or equal to Bin_i, and Sout_i is a positive integer greater than or equal to 1 and less than or equal to Bout_i.
  • when a set of weights in the sliding window is selected, the set of weights will all be set to 0, i.e., Bin_i*Bout_i weights will be set to 0 at the same time.
  • the embodiment of the present application further provides a processing device.
  • the processing device may be the computing device shown in FIG. 6A. It should be noted that a coarse-grained pruning unit and a neural network training unit may be added to the computing device shown in FIG. 6A; in practical applications, the computing device shown in FIG. 6A may also add or extend modules or units of the processing device shown in FIG. 55. In another alternative embodiment, the processing device, as shown in FIG. 55, is used for performing coarse-grained pruning on a neural network, includes a memory for storing executable instructions, and reduces memory access and computation time.
  • Coarse-grained pruning unit used for pruning a neural network, including using a sliding window to select a set of weights from the neural network, and setting the selected weights to zero;
  • Neural network training unit used to train the pruned neural network: the weight that has been set to zero during training remains at zero.
  • the training unit integrates the neural network reverse training algorithm, receives the coarse-grained pruned neural network, and uses the reverse training algorithm to train.
  • the weight of the pruned is always 0 in the training process.
  • the training unit transmits the trained neural network to the coarse-grained pruning unit for further pruning operations, or directly outputs.
  • the coarse-grained pruning unit further comprises a full-connection layer coarse-grained pruning unit, which implements a coarse-grain pruning operation on the fully connected layer of the neural network.
  • the coarse-grained pruning unit further includes a coarse-grained pruning unit of the convolution layer, and performs a coarse-grain pruning operation on the convolution layer of the neural network.
  • the coarse-grained pruning unit further includes a coarse-grained pruning unit of the LSTM layer to perform coarse-grain pruning operations on the LSTM layer of the neural network.
  • FIG. 56 is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure.
  • the processing device shown in FIG. 56 can process the coarse-grained and sparse neural network, fully exploit the coarse-grained and sparse characteristics, reduce the memory access and reduce the computation amount, thereby reducing the computation time and reducing the energy consumption.
  • the processing device includes a storage unit, an instruction control unit, a coarse-grain selection unit, and an operation unit.
  • the processing device can be for neural network processing.
  • the storage unit can be used to store neurons, weights, and instructions of the neural network.
  • the instruction control unit is configured to receive the instruction from the storage unit, decode it to generate control information, and control the coarse-grained selection unit to perform a selection operation and the operation unit to perform a calculation operation.
  • the arithmetic unit in the present application can be used to execute a neural network dedicated instruction.
  • the neural network specific instructions in this application include, but are not limited to, all instructions dedicated to performing artificial neural network operations.
  • Neural network specific instructions include, but are not limited to, control instructions, data transfer instructions, arithmetic instructions, and logic instructions.
  • the control command controls the execution process of the neural network.
  • Data transfer instructions complete the transfer of data between different storage media, including but not limited to matrices, vectors, and scalars.
  • the arithmetic instruction completes the arithmetic operation of the neural network, including but not limited to the matrix operation instruction, the vector operation instruction, the scalar operation instruction, the convolutional neural network operation instruction, the fully connected neural network operation instruction, the pooled neural network operation instruction, the RBM neural network operation Instruction, LRN neural network operation instruction, LCN neural network operation instruction, LSTM neural network operation instruction, RNN neural network operation instruction, RELU neural network operation instruction, PRELU neural network operation instruction, SIGMOID neural network operation instruction, TANH neural network operation instruction, MAXOUT neural network operation instructions.
  • Logic instructions complete the logical operations of the neural network, including but not limited to vector logic operations instructions and scalar logic operation instructions.
  • the RBM neural network operation instruction is used to implement the Restricted Boltzmann Machine (RBM) neural network operation.
  • the LRN neural network operation instruction is used to implement the Local Response Normalization (LRN) neural network operation.
  • the LSTM neural network operation instruction is used to implement Long Short-Term Memory (LSTM) neural network operation.
  • the RNN neural network operation instruction is used to implement Recurrent Neural Networks (RNN) neural network operation.
  • the RELU neural network operation instruction is used to implement the Rectified linear unit (RELU) neural network operation.
  • the SIGMOID neural network operation instruction is used to implement the sigmoid (S-type growth curve) neural network operation.
  • the TANH neural network operation instruction is used to implement the hyperbolic tangent function (TANH) neural network operation.
  • where W is the weight and b is the bias.
  • More specifically, the neural network dedicated instructions include the Cambricon instruction set.
  • the Cambricon instruction set is characterized in that each instruction in the instruction set has a fixed length (for example, 64 bits or 128 bits), and an instruction is composed of an operation code and operands.
  • the instruction set contains four types of instructions, namely, control instructions, data transfer instructions, computational instructions, and logical instructions.
  • Control instructions are used to control the execution process.
  • Control instructions include jump instructions and conditional branch instructions.
  • the data transfer instruction is used to complete data transfer between different storage media.
  • the data transfer instructions include a load instruction, a store instruction, and a move instruction.
  • the load instruction is used to load data from the main memory to the cache
  • the store instruction is used to store data from the cache to the main memory
  • the move instruction is used to transfer data between the cache and the cache or the cache and registers or registers and registers.
  • Data transfer instructions support three different ways of organizing data, including matrices, vectors, and scalars.
  • the arithmetic instructions are used to perform neural network arithmetic operations.
  • the arithmetic instructions include matrix operation instructions, vector operation instructions, and scalar operation instructions.
  • the matrix operation instruction completes the matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.
  • the vector operation instruction completes the vector operations in the neural network, including vector elementary arithmetic, vector transcendental functions, dot product, random vector generation, and the maximum/minimum of a vector.
  • the vector basic operations include vector addition, subtraction, multiplication, and division (add, subtract, multiply, divide).
  • Vector transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • scalar operations complete scalar operations in neural networks, including scalar elementary arithmetics and scalar transcendental functions.
  • the scalar basic operations include scalar addition, subtraction, multiplication, and division.
  • the scalar transcendental functions are functions that do not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions, and inverse trigonometric functions.
  • Logical operations include vector logic operations instructions and scalar logic operation instructions.
  • vector logic operation instructions include vector compare, vector logical operations, and vector greater-than merge. Vector compare includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. Vector logical operations include AND, OR, and NOT.
  • scalar logic operations include scalar comparison, scalar logical operations.
  • the scalar comparison includes greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to.
  • Scalar logic operations include AND, OR, and NOT.
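As a concrete illustration of the fixed-length instruction format described above, the following sketch packs an operation code and three operand fields into one 64-bit word. All field widths and opcode values here are illustrative assumptions for the four instruction types (control, data transfer, computational, logical), not the actual Cambricon encoding.

```python
# Hypothetical 64-bit instruction layout: 8-bit opcode, one 20-bit and two
# 18-bit operand fields. Widths and opcode numbers are assumptions only.
OPCODES = {
    "JUMP": 0x01,   # control instruction
    "LOAD": 0x10,   # data transfer: main memory -> cache
    "STORE": 0x11,  # data transfer: cache -> main memory
    "MOVE": 0x12,   # data transfer: cache/register <-> cache/register
    "MMV": 0x20,    # computational: matrix multiply vector
    "VGT": 0x30,    # logical: vector greater-than comparison
}

def encode(name, op0=0, op1=0, op2=0):
    """Pack an opcode and three operands into one 64-bit instruction word."""
    assert op0 < (1 << 20) and op1 < (1 << 18) and op2 < (1 << 18)
    return (OPCODES[name] << 56) | (op0 << 36) | (op1 << 18) | op2

def decode(word):
    """Recover (opcode, op0, op1, op2) from a 64-bit instruction word."""
    return (word >> 56,
            (word >> 36) & ((1 << 20) - 1),
            (word >> 18) & ((1 << 18) - 1),
            word & ((1 << 18) - 1))
```

A fixed instruction length of this kind keeps decoding simple: the decoder can slice fields by position without scanning for instruction boundaries.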
  • the coarse-grained selection unit is configured to receive the input neurons and the non-zero weight position information, use a sliding window to select a group of weights of the neural network, set the selected weights to zero, and select the neurons corresponding to the non-zero weights.
  • the arithmetic unit is configured to receive the selected input neurons and the non-zero weights, complete the neural network operation through the multiply-add operation unit, and transmit the output neurons back to the storage portion.
  • when the storage unit stores the weights, only the non-zero weights and the position information of the non-zero weights are stored.
  • the coarse-grained selection unit selects only the neurons corresponding to the non-zero weights and transmits them to the arithmetic unit.
  • the acceleration device may further include a pre-processing module. As shown in Figure 57, the module preprocesses the raw data, including segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
  • the acceleration device may further include a direct memory access (DMA) unit.
  • the acceleration device may further include an instruction cache, an input neuron cache, a non-zero weight cache, a non-zero weight location cache, and an output neuron cache.
  • the storage unit is mainly used to store the neurons, weights, and instructions of the neural network. When storing weights, only the non-zero weights and the position information of the non-zero weights are stored.
  • the DMA is used to read or write data or instructions in the storage unit, the instruction cache, the non-zero weight cache, the non-zero weight location cache, the input neuron cache, and the output neuron cache.
  • Instruction cache for storing dedicated instructions
  • non-zero weighted cache for caching non-zero weighted data
  • Non-zero weight location cache for caching non-zero weight location data
  • the non-zero weight location cache maps each connection weight in the input data to the corresponding input neuron.
  • one one-to-one correspondence method for the non-zero weight location cache is: 1 indicates a connection and 0 indicates no connection, and the connection states between each output and all inputs form a string of 0s and 1s that represents the connection relationship of that output.
  • another one-to-one correspondence method for the non-zero weight location cache is: 1 indicates a connection and 0 indicates no connection, and the connection states between each input and all outputs form a string of 0s and 1s that represents the connection relationship of that input.
  • a third one-to-one correspondence method for the non-zero weight location cache is: for each output, record the distance from the input neuron of its first connection to the first input neuron, the distance from the input neuron of its second connection to the previous connected input neuron, the distance from the input neuron of its third connection to the previous connected input neuron, ..., and so on until all inputs of that output are exhausted, to represent the connection relationship of that output.
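The bit-string and distance encodings of connection positions described in the preceding bullets can be sketched as follows. The function names and the Python list representation are illustrative assumptions, not the cache's actual storage format.

```python
def bitstring_encoding(connected):
    """Encode the connection state of one output (or input): 1 = there is a
    connection, 0 = no connection, as a string of 0s and 1s."""
    return "".join("1" if c else "0" for c in connected)

def distance_encoding(connected):
    """Encode connections as: index of the first connected input neuron, then
    the distance of each later connected input from the previous one."""
    idxs = [i for i, c in enumerate(connected) if c]
    return [idxs[0]] + [b - a for a, b in zip(idxs, idxs[1:])]
```

For example, the connection pattern of the Figure 58 embodiment (inputs n1, n2, n5, n6 connected) encodes as the bit string "11001100", or as the distance list [0, 1, 3, 1].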
  • the output neuron cache unit is configured to cache the output neurons output by the operation unit.
  • the control unit is configured to receive the instructions in the instruction cache and, after decoding, generate control information to control the operation unit to perform calculation operations.
  • the coarse-grained selection unit is configured to receive input neuron and non-zero weight position information, and select a neuron that needs to be operated.
  • the coarse-grained selection unit only selects the neurons corresponding to the non-zero weights and transmits them to the arithmetic unit.
  • an operation unit configured to perform a corresponding operation on the data according to an instruction stored in the storage unit.
  • the arithmetic unit includes, but is not limited to, three parts: the first part is one or more multipliers, and the second part is one or more adders; preferably, the second part includes a plurality of adders, and the plurality of adders constitute an addition tree.
  • the third part is the activation function unit.
  • the third part obtains the activated output data (out) by passing the fifth input data (in5) through an activation function (active); the process is out = active(in5).
  • Pooling operations include, but are not limited to, average pooling, maximum pooling, median pooling, and input data in is the data in a pooled core associated with output out.
  • the operation performed by the operation unit includes several parts: the first part multiplies the first input data and the second input data to obtain multiplied data; the second part performs an addition tree operation, adding the third input data step by step through the addition tree, or adding the third input data to the fourth input data to obtain output data; the third part performs an activation function operation, passing the fifth input data through an activation function (active) to obtain output data.
  • the operations of the above parts can be freely combined to realize the operation of various functions.
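The three parts described above can be sketched as freely combinable building blocks. This is a minimal functional sketch; the helper names and the default ReLU activation are assumptions for illustration.

```python
# Part 1: element-wise multiplication of the first and second input data.
def multiply(in1, in2):
    return [a * b for a, b in zip(in1, in2)]

# Part 2: addition tree -- sum the data pairwise, level by level.
def addition_tree(data):
    while len(data) > 1:
        pairs = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:          # carry an unpaired element to the next level
            pairs.append(data[-1])
        data = pairs
    return data[0]

# Part 3: activation function; ReLU is used here only as a placeholder.
def activate(x, active=lambda v: max(v, 0.0)):
    return active(x)
```

Chaining the parts, e.g. `activate(addition_tree(multiply(in1, in2)))`, realizes a multiply-accumulate followed by activation, which is the inner step of a neural network layer.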
  • Figure 58 shows a specific embodiment of a processing method of the present disclosure. As shown in Fig. 58, it is the result of coarse-grained pruning of a fully connected layer of the neural network.
  • the fully connected layer has 8 input neurons n1 to n8 and 3 output neurons o1 to o3.
  • the weights between the four input neurons n3, n4, n7, n8 and the three output neurons o1, o2, o3 are set to zero by coarse-grained sparsification; n1 is connected to o1, o2, o3 by the three weights s11, s12, s13; n2 is connected to o1, o2, o3 by the three weights s21, s22, s23; n5 is connected to o1, o2, o3 by the three weights s31, s32, s33; and n6 is connected to o1, o2, o3 by the three weights s41, s42, s43. We use the bit string 11001100 to represent the connection between the input neurons and the output neurons, that is, the first kind of non-zero weight position information: 1 means that the input neuron is connected to the three output neurons, and 0 means that the input neuron is not connected to the three output neurons.
  • Table 1 describes the information of the neurons and weights in the embodiment, and Equation 1 describes the arithmetic formulas of the three output neurons o1, o2, and o3. It can be seen from Equation 1 that o1, o2, and o3 receive the same neurons for operation.
  • when the processing device performs the operation, the 8 input neurons, 12 weights, 8 bits of position information, and the corresponding instructions are transmitted to the storage unit.
  • the coarse-grained selection unit receives 8 input neurons and non-zero weight positions, and selects n1, n2, n5, n6 four neurons that need to participate in the operation.
  • the arithmetic unit receives the four selected neurons and weights, completes the operation of the output neurons by Equation 1, and then transmits the output neurons back to the storage portion.
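The selection-and-compute flow of this embodiment can be sketched as follows, using the 11001100 position string of Fig. 58. The numeric neuron and weight values below are made up for illustration; only the structure follows the embodiment.

```python
def select_neurons(neurons, position_bits):
    """Coarse-grained selection: keep only neurons whose position bit is 1."""
    return [n for n, b in zip(neurons, position_bits) if b == "1"]

def compute_outputs(selected, weights):
    """Equation-1 style outputs: o_j = sum_i selected[i] * weights[i][j]."""
    n_out = len(weights[0])
    return [sum(s * w[j] for s, w in zip(selected, weights)) for j in range(n_out)]

# n1..n8 with position string 11001100 -> n1, n2, n5, n6 are selected
neurons = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
selected = select_neurons(neurons, "11001100")

# 12 non-zero weights: one row (s_i1, s_i2, s_i3) per selected input neuron
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6],
           [0.7, 0.8, 0.9],
           [1.0, 1.1, 1.2]]
outputs = compute_outputs(selected, weights)
```

Only the four selected neurons and the 12 non-zero weights reach the arithmetic unit, which is the source of the memory-access and computation savings described below.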
  • the disclosed related apparatus and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division.
  • in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • a coarse grained thinning processing method and corresponding processing device of a neural network and a chip, a chip package structure, a board, and an electronic device are provided.
  • the coarse-grained sparse processing method can make the sparse neural network more regular, which is convenient for hardware acceleration, and reduces the storage space of non-zero weight positions.
  • the neural network processor can fully exploit the coarse-grained and sparse characteristics, reduce the memory access and reduce the amount of computation, thereby obtaining the speedup ratio and reducing the energy consumption.
  • the present application also discloses an apparatus for performing an artificial neural network forward operation.
  • the apparatus for performing an artificial neural network forward operation may be a computing device as shown in FIG. 6A.
  • the device may further include a fixed-point data conversion module and a corresponding fixed-point data operation module; the fixed-point data conversion module includes a floating-point data statistics module and a data conversion unit; the computing device shown in FIG. 6A may also add the units or modules shown in Fig. 59 or Fig. 60.
  • the floating point data statistics module is used for statistical analysis to obtain the exponent bit offset and the number of exponent bits required for storing each type of data in the artificial neural network forward operation;
  • the floating point data conversion unit is used to implement conversion between the short-digit floating-point data type and the long-digit floating-point data type, such as the 32-bit floating-point data type;
  • the floating-point data operation module is used to perform the various types of operations required for short-digit floating-point data.
  • the "long-digit floating-point data" denotes the original floating-point data, such as 32-bit floating-point data; it may also be a standard 64-bit or 16-bit floating-point number, etc.; here, only 32-bit is used as a specific embodiment.
  • "floating-point data with fewer digits", also known as "short-digit floating-point data", means floating-point data represented by fewer bits relative to the original floating-point data.
  • the forward operation of the multi-layer artificial neural network includes two or more layers of multiple neurons.
  • for the weights, offsets, and other data required in the forward operation, the short-digit floating-point data type is used to represent them and to participate in the operations between the layers.
  • FIG. 59 shows a specific representation of a short-digit floating-point data structure for storing data in accordance with an embodiment of the present application. Among them, 1 bit is used to represent the sign, M bits are used to represent the exponent part, and N bits are used to represent the significand part.
  • in addition to occupying fewer bits, the short-digit floating-point data representation used here sets two additional flag bits for data of the same layer and the same type in the neural network, such as all the weight data of the first convolutional layer:
  • the flag bit offset is used to record the initial offset of the exponent bits; the actual exponent value represented = the exponent-bit data + offset;
  • the flag bit EL is used to record the number of bits occupied by the exponent bits.
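A sketch of how such a short-digit floating-point value could be decoded under this flag-bit scheme. The normalized significand with an implicit leading 1 is an assumption for illustration, not the exact format of Fig. 59.

```python
def decode_short_float(sign, exp_field, frac_field, offset, N):
    """Decode a short-digit float: 1 sign bit, an exponent field of EL bits
    whose stored value is exp_field, and N significand bits whose stored
    value is frac_field. Per the flag-bit scheme, the actual exponent is
    exp_field + offset. An implicit leading 1 is assumed."""
    significand = 1 + frac_field / (1 << N)
    return (-1) ** sign * significand * 2.0 ** (exp_field + offset)
```

Because offset and EL are shared per layer and per data type, each individual value needs only the sign, exponent, and significand bits, which is where the storage saving comes from.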
  • Figure 60A shows an example block diagram of an apparatus for performing an artificial neural network forward operation. As shown in Figure 60A, the apparatus includes:
  • the floating point data statistics module 11 is configured to perform data analysis on the input neurons, weights, and/or offset data in the forward operation of the neural network to obtain the exponent bit offset (offset) and the exponent bit length EL of the floating-point data.
  • the floating point data conversion module 12 is configured to: convert the input neurons, weights, and/or offset data from long-digit floating-point data according to the exponential bit offset of the floating-point data and the length EL of the exponent bit Type conversion to a short-digit floating point data type;
  • the floating point data operation module 13 is configured to perform an artificial neural network forward operation according to the input neurons, weights, and/or offset data converted to the short-digit floating-point data type.
  • FIG. 60 shows an example block diagram of a floating point data statistics module.
  • the floating point data statistics module includes a data extracting unit 21, a statistic unit 22, and an analyzing unit 23.
  • the purpose of this module is to extract all data represented by the long-digit floating-point data type in the neural network, such as the input neurons, weights, and/or offset data, and to analyze this long-digit floating-point data to obtain the exponent bit offset (offset) and exponent bit length EL required for each different type of data (such as input neurons, weights, and offset data) when represented by the short-digit floating-point data type in the neural network, so that the subsequent short-digit floating-point forward operation achieves a better effect.
  • the data extracting unit 21 is configured to extract each type of data in the long-digit floating-point forward operation; the statistics unit 22 is configured to count the data range of the same type of data and the data distribution over each data segment; the analyzing unit 23 is configured to obtain, according to the results counted by the statistics unit 22, the exponent bit length EL and the exponent bit offset (offset) that should be set when representing each type of data with short-digit floating point, the exponent bit length EL being set so that the representable data range contains all data of that type as far as possible.
  • the foregoing apparatus for performing the artificial neural network forward operation acquires from other units or devices, such as a CPU, each different type of data represented by the long-digit floating-point data type in the forward operation, including the input neurons, weights, and offset data, then counts the data range of the same type of data and the distribution over each data segment, and according to these statistics determines the exponent bit length EL and exponent bit offset to be used when representing each type of data with short-digit floating-point data.
  • alternatively, the exponent bit length EL and the exponent bit offset to be used when representing each type of data in the artificial neural network, or each data type of each layer, with short-digit floating-point data may be preset by other units or devices, such as a CPU.
  • y represents the short-digit floating-point data after random rounding;
  • x represents the long-digit floating-point data before random rounding;
  • ε is the smallest positive number that the current short-digit floating-point data type can represent, that is, 2^(offset−(X−1−EL));
  • ⌊x⌋ is the largest integer multiple of ε less than or equal to x;
  • w.p. represents probability; that is, the data y obtained by random rounding is ⌊x⌋ with probability 1 − (x − ⌊x⌋)/ε, and is ⌊x⌋ + ε with probability (x − ⌊x⌋)/ε.
  • y represents the short-digit floating-point data after rounding up;
  • x represents the long-digit floating-point data before rounding up;
  • y takes the value ⌈x⌉, the smallest integer multiple of ε greater than or equal to x;
  • ε is the smallest positive number that the current short-digit floating-point data type can represent, that is, 2^(offset−(X−1−EL)).
  • y represents the short-digit floating-point data after rounding down;
  • x represents the long-digit floating-point data before rounding down;
  • y takes the value ⌊x⌋, the largest integer multiple of ε less than or equal to x, where ε is the smallest positive number that the current short-digit floating-point data type can represent, that is, 2^(offset−(X−1−EL)).
  • the truncation rounding unit performs the following operations:
  • y represents the short-digit floating-point data after truncation rounding
  • x represents the long-digit floating-point data before truncation
  • [x] represents the number obtained by directly truncating the original data x to the short-digit floating-point data type.
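The rounding modes used when converting to the short-digit type can be sketched together in one function, with ε = 2^(offset−(X−1−EL)) passed in as `eps`. This is an illustrative implementation; resolving exact ties toward ⌊x⌋ in the nearest mode is one possible choice, and `truncate` is taken as dropping toward zero.

```python
import math
import random

def round_to_eps(x, eps, mode, rng=random.random):
    """Round a long-digit float x to a multiple of eps, the smallest positive
    number representable by the short-digit type."""
    lo = math.floor(x / eps) * eps      # largest multiple of eps <= x
    hi = math.ceil(x / eps) * eps       # smallest multiple of eps >= x
    if mode == "down":
        return lo
    if mode == "up":
        return hi
    if mode == "nearest":               # ties resolved toward lo (assumption)
        return lo if (x - lo) <= eps / 2 else hi
    if mode == "truncate":              # drop the fraction, toward zero
        return math.trunc(x / eps) * eps
    if mode == "random":                # probability proportional to distance
        return lo if rng() >= (x - lo) / eps else hi
    raise ValueError(mode)
```

Random rounding is unbiased in expectation, which is why it is often preferred when repeatedly quantizing the same data during training.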
  • the application also discloses a method for performing a forward operation of an artificial neural network, and the specific implementation steps are:
  • the short-digit floating-point data type obtained by statistical analysis is used for the neural network forward operation; that is, all data in the neural network forward operation is represented by the short-digit floating-point data type, while a copy represented by the long-digit floating-point data type is retained for the weights and offset data of the neural network, and then the forward operation is performed.
  • in the forward operation, some operations lead to expansion of the data range, such as addition and multiplication; cache space is needed to store the intermediate calculation results, and the intermediate results are stored in the long-digit floating-point data type and, after calculation, converted back to the corresponding short-digit floating-point data type.
  • converting from the long-digit floating-point data type to the short-digit floating-point data type requires rounding, including random rounding, round-to-nearest rounding, rounding up, rounding down, and truncation rounding, expressed as follows:
  • y represents the short-digit floating-point data after round-to-nearest rounding;
  • x represents the long-digit floating-point data before rounding;
  • ε is the smallest positive number that the current short-digit floating-point data type can represent, that is, 2^(offset−(X−1−EL)), and ⌊x⌋ is the largest integer multiple of ε less than or equal to x; y takes the value ⌊x⌋ if ⌊x⌋ ≤ x ≤ ⌊x⌋ + ε/2, and ⌊x⌋ + ε if ⌊x⌋ + ε/2 < x ≤ ⌊x⌋ + ε.
  • y represents the short-digit floating-point data after rounding up;
  • x represents the long-digit floating-point data before rounding up;
  • y takes the value ⌈x⌉, the smallest integer multiple of ε greater than or equal to x, where ε is the smallest positive number that the current short-digit floating-point data type can represent, that is, 2^(offset−(X−1−EL)).
  • y represents the short-digit floating-point data after truncation rounding;
  • x represents the long-digit floating-point data before truncation rounding;
  • [x] represents the number obtained by directly truncating the original data x to the short-digit floating-point data type.
  • FIG. 62 is a flow chart of a forward operation of a single-layer artificial neural network according to an embodiment of the present application.
  • the flowchart depicts a process for a single layer neural network forward operation implemented using the apparatus and instruction set of the present application. This operation is implemented in a computing device as shown in FIG. 2, FIG. 5 or FIG. 6A.
  • the input neuron vector is first weighted and summed to calculate the intermediate result vector of the layer.
  • the intermediate result vector is biased and activated to obtain an output neuron vector.
  • the output neuron vector is used as the input neuron vector of the next layer.
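The per-layer flow just described (weighted sum, then bias and activation) can be sketched as follows. The ReLU default is only a placeholder activation, and the function name is an assumption.

```python
def layer_forward(in_vec, weights, bias, active=lambda v: max(v, 0.0)):
    """One layer of the forward operation: the input neuron vector is first
    weighted and summed to form the intermediate result vector, which is then
    biased and activated to produce the output neuron vector."""
    intermediate = [sum(w * x for w, x in zip(row, in_vec)) for row in weights]
    return [active(v + b) for v, b in zip(intermediate, bias)]
```

The returned output neuron vector would then be fed as the input neuron vector of the next layer, matching the flow of Fig. 62.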
  • the type conversion unit 54 converts the data represented by the long-digit floating-point data type into data represented by the short-digit floating-point data type; data exceeding the precision range representable by the short-digit floating-point data type is rounded during the conversion, where the rounding operation is performed by the rounding unit 55, which is the same as the rounding unit in Fig. 62.
  • the foregoing forward operation may also use input neurons, weights, and/or offset data represented by the long-digit floating-point data type, and the reverse training may also use a short-digit floating-point data type.
  • the data range space representable by the short-digit floating-point data type is fully utilized; compared with representation by the long-digit floating-point data type, the space required to store the network parameters is greatly reduced, optimizing the area-to-power ratio of the hardware.
  • the forward operation of the multi-layer artificial neural network includes two or more layers of multiple neurons.
  • for the weights, offsets, and other data required in the forward operation, the short-digit fixed-point data type is used to represent them and to participate in the operations between the layers.
  • FIG. 64 shows a specific representation of a short-digit fixed-point data structure for storing data according to an embodiment of the present application. Here, 1 bit is used to represent the sign, M bits are used to represent the integer part, and N bits are used to represent the fractional part; compared with the 32-bit floating-point data representation, the short-digit fixed-point data representation used in the present application occupies fewer bits, and in addition, for data of the same layer and the same type in the neural network, such as all the weight data of the first convolutional layer, an additional flag bit (Point location) is set to record the position of the decimal point, so that the precision and representable data range of the fixed-point data type can be adjusted according to the actual distribution of the data.
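The Point location flag can be sketched as a shared scale factor: a short fixed-point value with decimal point location P represents raw × 2^−P, where raw is an X-bit two's-complement integer. In this illustrative sketch, round-to-nearest and saturation on overflow are assumptions, not the only possible choices.

```python
def float_to_fixed(x, point_location, X=16):
    """Quantize a long-digit float to an X-bit fixed-point raw integer with
    the given decimal point location (round-to-nearest shown here)."""
    raw = int(round(x * (1 << point_location)))
    lo, hi = -(1 << (X - 1)), (1 << (X - 1)) - 1
    return max(lo, min(hi, raw))        # saturate on overflow (assumption)

def fixed_to_float(raw, point_location):
    """Recover the represented value: raw * 2^-point_location."""
    return raw * 2.0 ** -point_location
```

A larger point_location gives finer precision but a narrower representable range, which is exactly the trade-off the statistics module below resolves per data type.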
  • Figure 65A shows an example block diagram of an apparatus for performing an artificial neural network forward operation. As shown in Figure 65A, the apparatus includes:
  • the floating point data statistic module 11 is configured to perform data analysis on the input neurons, weights, and/or offset data in the forward operation of the artificial neural network to obtain a decimal point position of the fixed point data type;
  • the data conversion module 12 is configured to convert the input neuron, weight, and/or offset data from a long-digit floating-point data type to a short-digit fixed-point data type according to a decimal point position of the fixed-point data;
  • the fixed point data operation module 13 is configured to perform an artificial neural network forward operation according to input neurons, weights, and/or offset data converted to a short-digit fixed-point data type.
  • FIG. 65 shows an example block diagram of a floating point data statistics module.
  • the floating point data statistic module 11 includes a data extracting unit 21, a statistic unit 22, and an analyzing unit 23.
  • the purpose of this module is to extract all data represented by the long-digit floating-point data type in the neural network, such as the input neurons, weights, and/or offset data, and to analyze this long-digit floating-point data to obtain the decimal point location (Point location) required for each different type of data (such as input neurons, weights, and offset data) when represented by the short-digit fixed-point data type in the neural network, so that the subsequent short-digit fixed-point forward operation achieves a better effect.
  • the data extracting unit 21 is configured to extract each type of data in the long-digit floating-point forward operation; the statistics unit 22 is configured to count the data range of the same type of data and the data distribution over each data segment; the analyzing unit 23 is configured to obtain, according to the results counted by the statistics unit 22, the decimal point location (Point location) that should be set when representing each type of data with the short-digit fixed-point data type.
  • alternatively, the decimal point location (Point location) to be used when representing each type of data in the artificial neural network, or each data type of each layer, with short-digit fixed-point data may be preset by other units or devices, such as a CPU.
  • Fig. 66 is a block diagram showing an example of a short-digit fixed-point calculation section of the forward operation module.
  • the operation buffer unit 31, the data conversion unit 32, and the rounding unit 33 are included.
  • the operation buffer unit 31 is configured to store the intermediate results of the forward operation represented by a data type with higher precision, because in the forward operation, addition or multiplication may cause the data range to expand; after the operation ends, a rounding operation is applied to the data exceeding the precision range representable by the short-digit fixed-point data type, and then the data stored in the operation buffer unit is converted from the long-digit floating-point data type to the short-digit fixed-point data type by the data conversion unit 32.
  • the rounding unit 33 is configured to perform a rounding operation on data exceeding the precision range of the short-digit fixed-point data type; this unit may be a random rounding unit, a round-to-nearest unit, a round-up unit, a round-down unit, a truncation rounding unit, etc.; different rounding units can perform different rounding operations on data beyond the precision range representable by the short-digit fixed-point data type.
  • the random rounding unit performs the following operations:
  • the rounding unit performs the following operations:
  • y represents the short-digit fixed-point data after rounding up;
  • x represents the long-digit floating-point data before rounding up;
  • y takes the value ⌈x⌉, the smallest integer multiple of ε greater than or equal to x, where ε is the smallest positive number that the current short-digit fixed-point data type can represent, that is, 2^(−Point_location).
  • the truncation rounding unit performs the following operations:
  • y represents the short-digit fixed-point data after truncation rounding
  • x represents the long-digit floating-point data before truncation
  • [x] represents the number obtained by directly truncating the original data x to the short-digit fixed-point data type.
  • the application also discloses a method for performing a forward operation of an artificial neural network, and the specific implementation steps are:
  • the 32-bit floating-point model data of each layer of the neural network is obtained through the trained 32-bit floating-point model of the neural network, including the weights, offsets, input and output values, and other data parameters of each layer.
  • for the same type of input data in each layer of the multi-layer network model, the proportions p_0, p_1, ..., p_n of that data falling within each of n+1 candidate intervals are counted, where the i-th interval is [−2^(X−1−i), 2^(X−1−i) − 2^−i], i ∈ {0, 1, 2, ..., n}.
  • an overflow rate EPL is preset, and the largest i is obtained from 0, 1, 2, ..., n such that p_i ≥ 1 − EPL; this largest i is taken as the decimal point location of the same type of input data in each layer of the above multi-layer network model.
  • that is, the decimal point location of the same type of input data in each layer of the above multi-layer network model is: max{i | p_i ≥ 1 − EPL, i ∈ {0, 1, 2, ..., n}}; among the p_i satisfying p_i ≥ 1 − EPL, the largest subscript value i is selected as the decimal point location of the same type of input data in each layer of the above multi-layer network model.
  • the above p_i is the proportion of the same type of input data in each layer of the above multi-layer network model whose value lies in the interval [−2^(X−1−i), 2^(X−1−i) − 2^−i]: if, out of m1 input data of the same type, m2 lie in that interval, then p_i = m2/m1.
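The statistics above can be sketched as follows: for each candidate i, the proportion p_i of data inside the representable interval is measured, and the largest i with p_i ≥ 1 − EPL is kept as the decimal point location. Here X is the total bit width of the short fixed-point type; the function name is an assumption.

```python
def choose_point_location(data, X, EPL, n):
    """Return the largest i in 0..n such that at least (1 - EPL) of the data
    lies within [-2^(X-1-i), 2^(X-1-i) - 2^-i]."""
    best = 0
    for i in range(n + 1):
        lo = -2.0 ** (X - 1 - i)
        hi = 2.0 ** (X - 1 - i) - 2.0 ** -i
        p_i = sum(1 for v in data if lo <= v <= hi) / len(data)
        if p_i >= 1 - EPL:
            best = i
    return best
```

Intuitively, a larger i trades range for fractional precision, and EPL controls how much of the data is allowed to overflow the chosen range.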
  • the short-digit fixed-point data type obtained by statistical analysis is used for the neural network forward operation; that is, all data in the neural network forward operation is represented by the short-digit fixed-point data type, while a copy represented by the long-digit floating-point data type is retained for the weights and offset data of the neural network, and then the forward operation is performed.
  • in the forward operation, some operations lead to expansion of the data range, such as addition and multiplication; cache space is needed to store the intermediate calculation results, and the intermediate results are stored in the long-digit floating-point data type and, after calculation, converted back to the corresponding short-digit fixed-point data type.
  • converting from the long-digit floating-point data type to the short-digit fixed-point data type requires rounding, including random rounding, round-to-nearest rounding, rounding up, rounding down, and truncation rounding, expressed as follows:
  • the random rounding unit performs the following operations:
  • the rounding unit performs the following operations:
  • y represents the short-digit fixed-point data after rounding up;
  • x represents the long-digit floating-point data before rounding up;
  • y takes the value ⌈x⌉, the smallest integer multiple of ε greater than or equal to x;
  • ε is the smallest positive number that the current short-digit fixed-point data type can represent, that is, 2^(−Point_location).
  • y represents the short-digit fixed-point data after rounding down;
  • x represents the long-digit floating-point data before rounding down;
  • y takes the value ⌊x⌋, the largest integer multiple of ε less than or equal to x, where ε is the smallest positive number that the current short-digit fixed-point data type can represent, that is, 2^(−Point_location).
  • when reverse training is performed after the forward operation ends, the data represented by the short-digit fixed-point data type in the forward operation needs to be converted into data represented by the long-digit floating-point data type, which then participates in the reverse operation; the weights and offset data participating in the reverse operation are the copies represented by the long-digit floating-point data type retained in the forward operation. After the reverse operation ends, the data represented by the long-digit floating-point data type is converted into data represented by the short-digit fixed-point data type, which then participates in the subsequent forward operation; at the same time, during the forward operation, copies of the weights and offset data of the neural network represented by the long-digit floating-point data type are still retained. A rounding operation needs to be performed during the conversion, and this operation is the same as the rounding operation in the above forward operation.
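The forward/backward data-type cycle described above amounts to a mixed-precision scheme: quantized copies for the forward pass, long-float master copies for the backward pass and update. A minimal single-neuron sketch follows; the function names and the simple squared-error gradient are illustrative assumptions.

```python
def quantize(x, point_location):
    """Convert to short fixed-point and back (round-to-nearest), standing in
    for the long-float to short fixed-point conversion unit."""
    eps = 2.0 ** -point_location
    return round(x / eps) * eps

def train_step(master_w, inputs, target, lr, point_location):
    """One forward/backward cycle: forward on quantized weight copies,
    backward and weight update on the long-float master copies."""
    w_q = [quantize(w, point_location) for w in master_w]
    pred = sum(w * x for w, x in zip(w_q, inputs))   # forward (short fixed-point values)
    err = pred - target                               # squared-error gradient (assumed loss)
    grads = [err * x for x in inputs]                 # backward in long float
    return [w - lr * g for w, g in zip(master_w, grads)]
```

Keeping the master copies in long-digit floating point prevents small gradient updates from being lost to the fixed-point quantization step, which is the stated reason for retaining the long-float copies of the weights and offsets.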
  • Figure 67 is a flow diagram showing a single layer artificial neural network forward operation in accordance with one embodiment. This operation is implemented in a computing device as shown in FIG. 2, FIG. 5 or FIG. 6A.
  • the flowchart depicts a process for a single layer neural network forward operation implemented using the apparatus and instruction set of the present application. For each layer, the input neuron vector is first weighted and summed to calculate the intermediate result vector of the layer. The intermediate result vector is biased and activated to obtain an output neuron vector. The output neuron vector is used as the input neuron vector of the next layer.
  • FIG. 68 is a block diagram showing an example of an operational flow according to an embodiment of the present application. This operation is implemented in a computing device as shown in FIG. 2, FIG. 5 or FIG. 6A.
  • when reverse training is performed, the data represented by the short-bit fixed-point data type obtained by the forward operation module 51 (other than the weights and bias data) must first be converted by the short-bit fixed-point to long-bit floating-point data conversion unit 53 into the long-bit floating-point representation before the back-propagation operation is performed; after the back-propagation operation performed by the reverse operation module is completed, the data must be converted back through the long-bit floating-point to short-bit fixed-point conversion.
  • here the short-bit floating-point data type is relative to the long-bit floating-point data type mentioned above: when the short-bit floating-point data type is a 16-bit floating-point data type, the long-bit floating-point data type can be a 32-bit or a 64-bit floating-point data type; when the short-bit floating-point data type is a 32-bit floating-point data type, the long-bit floating-point data type is a 64-bit floating-point data type.
  • Figure 70 is an exemplary block diagram of the overall structure of a preferred embodiment. As shown in the embodiment of FIG. 70, in a practical application the structure may further include an interconnection module and an operation unit as shown in FIG. 6A, where the operation unit includes a plurality of calculators.
  • the on-chip memory 20 of the processor can store only a very limited amount of data; the limited on-chip resources generally make it impossible to place all data on the chip.
  • the present application therefore provides a method of on-chip repeated addressing, a data management strategy used when the total data is too large to fit in the on-chip storage medium 20, so that off-chip data can be read onto the chip and addressed repeatedly and quickly. Off-chip repeated addressing is also possible, but it is more efficient to group the data that is accessed together, move it onto the chip in one transfer, and then address it quickly and directly on-chip.
  • the method includes:
  • the index address 50 of the data includes a data block address 51 and an intra-block address 52; that is, the address of each data item is the current data block address 51 stitched together with the intra-block address 52. After the data is divided into reasonable data blocks, dividing addresses into on-chip and off-chip parts makes addressing more efficient.
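The concatenation of a data block address and an intra-block address can be sketched as follows; the field widths are illustrative assumptions, not taken from the source:

```python
INTRA_BITS = 8  # assumed width of the intra-block address field

def make_index_address(block_addr: int, intra_addr: int) -> int:
    """Stitch the data block address and the intra-block address
    together into a single index address."""
    return (block_addr << INTRA_BITS) | intra_addr

def split_index_address(index_addr: int):
    """Recover (block address, intra-block address) from an index address."""
    return index_addr >> INTRA_BITS, index_addr & ((1 << INTRA_BITS) - 1)

addr = make_index_address(0x3, 0x2A)
print(hex(addr))                  # 0x32a
print(split_index_address(addr))  # (3, 42)
```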
  • the techniques used for address indexing are not limited to simple data indexes, but also include partitioning implementations such as codebooks.
  • the on-chip storage medium 20 and the on-chip processing unit 30 exchange data through an on-chip data path; the on-chip storage medium 20 and the off-chip storage medium 10 exchange data through an on-chip/off-chip data path, and the on-chip storage medium 20 or the off-chip storage medium 10 is read from and written to, from inside or outside, at least once; the data is carried between the on-chip storage medium 20, the off-chip storage medium 10, and/or the on-chip processing unit 30 in units of data blocks.
  • the on-chip storage medium 20 is designed to be separated by a read/write port, so that reading and writing of data are independent of each other and can be performed simultaneously.
  • the on-chip processing unit 30 is an on-chip computing module, and the data is selected according to a predetermined condition, and the data satisfying the predetermined condition is divided into the same data block.
  • the predetermined condition includes a simple division condition, an average of a predetermined number of data block conditions, a condition related to different output neurons, or a predetermined mathematical relationship condition.
  • Figure 73 is a schematic diagram of data partitioning of a preferred embodiment. Taking a common neural network as an example (a vector operation), the weight connections satisfying a specified condition are divided into and stored in the same data block, such as the solid-line weight connections and the dotted-line weight connections. At different times different data blocks are loaded, and the operation unit selects data according to the specified condition: for example, all output neurons first perform the calculations related to the solid-line weight connections, and then, after the data block is replaced, perform the calculations related to the dotted-line weight connections.
  • the present application accordingly provides an apparatus for implementing a method of on-chip repeat addressing, the apparatus comprising:
  • a data dividing module configured to divide data of the on-chip storage medium and/or the off-chip storage medium into different data blocks according to a predetermined data division principle, where the data division principle comprises dividing data with a reuse distance lower than a predetermined distance threshold into the same data block;
  • a data indexing module configured to sequentially load different data blocks into at least one on-chip processing unit according to a sequential relationship of the predetermined replacement policy, and the repeated data in the loaded data block is repeatedly addressed on the chip.
  • the data indexing module is configured to sequentially load different data blocks into at least one of the on-chip processing units according to the order relationship of the replacement policy and the data block address, in the loaded data block.
  • the repeated data is repeatedly addressed on-chip, and the new data block is replaced when all the intra-block addresses of the data block are indexed until no data block is required to be loaded.
  • the on-chip storage medium and the on-chip processing unit exchange data through an on-chip data path;
  • the on-chip storage medium and the off-chip storage medium exchange data through an on-chip external data path, and the on-chip storage medium or the off-chip storage medium reads and writes from inside or outside at least once; the data is in a data block
  • the unit is transported between the on-chip storage medium, the off-chip storage medium, and/or the on-chip processing unit.
  • the data amount of the data block is smaller than the capacity of the on-chip storage medium.
  • the on-chip storage medium is designed to be separated by a read/write port.
  • the apparatus is applied to a learning class processor.
  • the device is applied to a heterogeneous environment.
  • the on-chip processing unit is an on-chip arithmetic module, and the data is selected according to a predetermined condition, and the data that satisfies the predetermined condition is divided into the same data block.
  • the predetermined condition includes a simple division condition, an average of a predetermined number of data block conditions, a condition associated with a different output neuron, or a predetermined mathematical relationship condition.
  • the replacement strategy includes sequential replacement, reverse replacement or out-of-order replacement; or
  • the replacement strategy includes data write back, and the final result or intermediate result is written back to the on-chip storage medium, the off-chip storage medium, and/or the on-chip processing unit after the data processing is completed.
  • Figure 75 is a flow diagram of a preferred embodiment that uses on-chip data repeated addressing to reduce memory access bandwidth requirements. After the computation starts:
  • in step S101, the data is divided into different data blocks according to the data division principle;
  • in step S102, a data block is loaded into the on-chip storage medium 20.
  • only one data block is loaded onto the on-chip storage medium 20 at a time for on-chip computation, and under different replacement strategies, data blocks are loaded for operation in different orders.
  • In step S103, on-chip computation is performed on the acquired data.
  • In step S104, it is judged whether all computations are finished and no data block needs to be loaded again; if so, all computation ends; otherwise, the process returns to step S102.
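The loop of steps S101 to S104 above can be sketched as follows; the partition function, compute callback and replacement order below are illustrative placeholders, not from the source:

```python
def run_on_chip(data, partition, compute, replacement_order):
    """Sketch of steps S101-S104: partition the data, then repeatedly
    load one block into on-chip storage, compute on it, and replace it
    with the next block until every block has been processed."""
    blocks = partition(data)                    # S101: divide into data blocks
    results = []
    for block_id in replacement_order(blocks):  # replacement policy picks the order
        on_chip = blocks[block_id]              # S102: load one block on-chip
        results.append(compute(on_chip))        # S103: on-chip calculation
    return results                              # S104: all blocks processed

# Toy usage: blocks of 2 elements, sequential replacement, summing each block.
out = run_on_chip(
    [1, 2, 3, 4, 5, 6],
    partition=lambda d: [d[i:i + 2] for i in range(0, len(d), 2)],
    compute=sum,
    replacement_order=lambda blocks: range(len(blocks)),
)
print(out)  # [3, 7, 11]
```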
  • Figure 76 is a block diagram showing the repeating addressing of a computing unit of a preferred embodiment based on an address.
  • during address indexing, if the data stored at address DA is required by computing units #0, #2 and #4, the embodiment indexes to address DA and propagates the data in DA to the computing units that need it, i.e. #0, #2 and #4.
  • the data required by the three computing units is the same, so only one copy is stored on-chip; that is, the same data is repeatedly addressed three times.
  • the way in which the data in Figure 76 is transferred to the on-chip computing units is not limited to the BUS connection method; it also includes other connection methods such as Crossbar, FAT-TREE and H-TREE structures.
  • the present application divides data whose reuse distance is less than a predetermined distance threshold into the same data block. The reuse distance refers to the distance between two uses of the same datum, where distance is measured in number of memory accesses; data with a short reuse distance will be accessed again in the short term, that is, it exhibits strong temporal locality.
  • such a data partition can be loaded into on-chip storage once and then used as many times as possible, making memory access more efficient.
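A minimal sketch of reuse-distance-based partitioning follows; the threshold and the two-bucket grouping heuristic are illustrative assumptions rather than the method claimed by the application:

```python
def reuse_distances(trace):
    """For each datum in an access trace, record the distance (number of
    accesses) between consecutive uses of that datum."""
    last_seen, dists = {}, {}
    for i, d in enumerate(trace):
        if d in last_seen:
            dists.setdefault(d, []).append(i - last_seen[d])
        last_seen[d] = i
    return dists

def partition_by_reuse(trace, threshold):
    """Put data whose minimum reuse distance is below the threshold into
    one 'hot' block; everything else goes into a 'cold' block."""
    dists = reuse_distances(trace)
    hot = {d for d, ds in dists.items() if min(ds) < threshold}
    cold = set(trace) - hot
    return hot, cold

hot, cold = partition_by_reuse(["a", "b", "a", "c", "a", "b"], threshold=3)
print(sorted(hot), sorted(cold))  # ['a'] ['b', 'c']
```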
  • This application aims to utilize on-chip repeated addressing to reduce memory access bandwidth.
  • the device of the present application and the related methods of use can effectively exploit data reusability and meet flexible addressing requirements, can be applied to different scenarios, and are not limited to machine learning processors.
  • FIG. 77 shows that the present application provides an on-chip data partitioning and reading system 100.
  • the on-chip data partitioning and reading system shown in FIG. 77 can be applied to FIG. 6A, FIG. 26, FIG. 28, and FIG.
  • the memory of the computing device is an off-chip storage system, and the computing device shown in FIG. 6A may include an on-chip data partitioning read/write system as shown in FIG.
  • the system includes:
  • the data dividing module 10 is configured to divide the on-chip stored data into different areas according to the data partitioning strategy, and to store them in the on-chip storage medium and the off-chip storage medium respectively;
  • the pre-operation module 20 is configured to perform an operation process on the on-chip address index of the on-chip stored data in advance when performing data splicing;
  • the data splicing module 30 is configured to splicing the on-chip storage data and the off-chip input data according to the data splicing strategy to obtain the original data representation.
  • the on-chip data that can be stored on the processor is very limited, so all data must be divided into data blocks that can be stored on-chip, and the required data blocks are read in or written out through data exchange between the large off-chip memory and the small on-chip memory. Meanwhile, on-chip data addresses are provided to the on-chip computing unit (such as the arithmetic unit shown in FIG. 6A) through an on-chip address index; the physical framework is shown in FIG. 81. The divisions in the embodiments shown in FIG. 78, FIG. 79A and FIG. 79B are only typical cases involved in the present application; the present application is not limited to a specific data division. Extreme cases, such as all of the data placed on-chip or all of the data divided off-chip, are also within the scope of implementation of the present application.
  • the on-chip data partitioning and reading system 100 of the present application further includes:
  • a storage module 40 configured to store the on-chip storage data of the on-chip storage medium and the off-chip input data from the off-chip storage medium;
  • the storage module 40 is separated by a read/write port, and data read and write are independent of each other;


Abstract

The present disclosure provides a computing method applied in a computing device, the computing device comprising: a memory, a register unit and a matrix computation unit. The method comprises the following steps: the computing device controls the matrix computation unit to obtain a first operation instruction, the first operation instruction comprising a matrix read indication for the matrix required to execute the instruction; the computing device controls the operation unit to send a read command to the memory according to the matrix read indication; the computing device controls the operation unit to read, in a batch read mode, the matrix corresponding to the matrix read indication, and to execute the first operation instruction on that matrix. The technical solution provided by the present application has the advantages of fast computation and high efficiency.

Description

A Computing Method and Related Products
Technical Field
The present disclosure relates to the field of data processing, and in particular to a computing method and related products.
Background
Data processing is a step or stage that most algorithms must go through. After computers were introduced into the field of data processing, more and more data processing has been implemented by computers; among existing algorithms, computing devices are slow and inefficient when performing data computations for neural networks.
Summary
Embodiments of the present application provide a computing method and related products, which can increase the processing speed and efficiency of a computing device.
In a first aspect, an embodiment of the present application provides a computing method applied in a computing device, the computing device comprising: a memory, a register unit and a matrix computation unit; the method comprises the following steps:
the computing device controls the matrix computation unit to obtain a first operation instruction, the first operation instruction comprising a matrix read indication for the matrix required to execute the instruction, the required matrix being at least one matrix, and the at least one matrix being matrices of the same length or matrices of different lengths;
the computing device controls the operation unit to send a read command to the memory according to the matrix read indication;
the computing device controls the operation unit to read, in a batch read mode, the matrix corresponding to the matrix read indication, and to execute the first operation instruction on that matrix.
Optionally, the matrix read indication comprises: the storage address of the matrix required by the instruction, or the identifier of the matrix required by the instruction.
Optionally, when the matrix read indication is the identifier of the matrix required by the instruction, controlling, by the computing device, the operation unit to send a read command to the memory according to the matrix read indication comprises:
the computing device controls the operation unit to read, from the register unit in a unit read mode according to the identifier, the storage address corresponding to the identifier; the computing device controls the operation unit to send to the memory a read command for that storage address and to obtain the matrix in a batch read mode.
Optionally, executing the first operation instruction on the matrix comprises:
the computing device controls the operation unit to perform a first pipeline stage computation on the matrix to obtain a first result, input the first result to a second pipeline stage to execute the second pipeline stage and obtain a second result, input the second result to a third pipeline stage to execute the third pipeline stage and obtain a third result, and input the third result to the memory for storage.
Optionally, the computing device further comprises a cache unit, and the method further comprises:
the computing device caches the operation instruction to be executed in the cache unit.
Optionally, before the computing device controls the matrix computation unit to obtain the first operation instruction, the method further comprises:
the computing device determines whether there is a dependency between the first operation instruction and a second operation instruction preceding the first operation instruction; if the first operation instruction has a dependency on the second operation instruction, the first operation instruction is cached in the cache unit, and after the second operation instruction has finished executing, the first operation instruction is fetched from the cache unit and transferred to the operation unit;
determining whether there is a dependency between the first operation instruction and the second operation instruction preceding it comprises:
extracting, according to the first operation instruction, a first storage address interval of the matrix required by the first operation instruction, and extracting, according to the second operation instruction, a second storage address interval of the matrix required by the second operation instruction; if the first storage address interval and the second storage address interval have an overlapping region, determining that the first operation instruction and the second operation instruction have a dependency; if the first storage address interval and the second storage address interval have no overlapping region, determining that the first operation instruction and the second operation instruction have no dependency.
Optionally, the matrix is an m*n matrix, a 1*n matrix or an m*1 matrix, where m and n are integers greater than or equal to 2.
In a second aspect, a computing device is provided, comprising: a memory, a register unit, a matrix computation unit and a control unit;
the memory is configured to store a matrix, the matrix being at least one matrix, the at least one matrix being matrices of the same length or matrices of different lengths;
the register unit is configured to store scalar data, the scalar data comprising at least: the storage address of the matrix in the memory;
the control unit is configured to control the matrix computation unit to obtain a first operation instruction, the first operation instruction comprising a matrix read indication for the matrix required to execute the instruction, the required matrix being at least one matrix, the at least one matrix being matrices of the same length or matrices of different lengths;
the operation unit is configured to send a read command to the memory according to the matrix read indication, to read, in a batch read mode, the matrix corresponding to the matrix read indication, and to execute the first operation instruction on that matrix.
Optionally, the matrix read indication comprises: the storage address of the matrix required by the instruction, or the identifier of the matrix required by the instruction.
Optionally, when the matrix read indication is the identifier of the matrix required by the instruction,
the control unit is configured to control the operation unit to read, from the register unit in a unit read mode according to the identifier, the storage address corresponding to the identifier, and to control the operation unit to send to the memory a read command for that storage address and to obtain the matrix in a batch read mode.
Optionally, the operation unit is specifically configured to perform a first pipeline stage computation on the matrix to obtain a first result, input the first result to a second pipeline stage to execute the second pipeline stage and obtain a second result, input the second result to a third pipeline stage to execute the third pipeline stage and obtain a third result, and input the third result to the memory for storage.
Optionally, the computing device further comprises:
a cache unit configured to cache operation instructions to be executed;
the control unit is configured to cache the operation instructions to be executed in the cache unit.
Optionally, the control unit is configured to determine whether there is a dependency between the first operation instruction and a second operation instruction preceding the first operation instruction; if the first operation instruction has a dependency on the second operation instruction, the first operation instruction is cached in the cache unit, and after the second operation instruction has finished executing, the first operation instruction is fetched from the cache unit and transferred to the operation unit;
determining whether there is a dependency between the first operation instruction and the second operation instruction preceding it comprises:
extracting, according to the first operation instruction, a first storage address interval of the matrix required by the first operation instruction, and extracting, according to the second operation instruction, a second storage address interval of the matrix required by the second operation instruction; if the first storage address interval and the second storage address interval have an overlapping region, determining that the first operation instruction and the second operation instruction have a dependency; if they have no overlapping region, determining that the first operation instruction and the second operation instruction have no dependency.
In a third aspect, a computer-readable storage medium is provided, storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
In a fourth aspect, a computer program product is provided, comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method provided in the first aspect.
Implementing the embodiments of the present application has the following beneficial effects:
It can be seen that, according to the embodiments of the present application, the computing device is provided with a register unit and a memory that store scalar data and matrix data respectively, and the present application assigns a unit read mode and a batch read mode to the two kinds of storage. By assigning to the matrix data a read mode that matches its characteristics, bandwidth can be well utilized, avoiding the impact of a bandwidth bottleneck on matrix computation speed; in addition, for the register unit, since it stores scalar data, a scalar read mode is provided, improving bandwidth utilization. The technical solution provided by the present application therefore makes good use of bandwidth and avoids its impact on computation speed, and thus has the advantages of fast computation and high efficiency.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1A is a schematic structural diagram of a computing device.
FIG. 1B is a schematic structural diagram of another computing device.
FIG. 2 is a schematic structural diagram of the computing device provided by an embodiment of the present application.
FIG. 2A is a schematic structural diagram of the matrix computation unit provided by an embodiment of the present application.
FIG. 2B is a schematic structural diagram of the pipeline stages provided by an embodiment of the present application.
FIG. 3 is a schematic flowchart of a matrix computation method disclosed by an embodiment of the present application.
FIG. 4 is a schematic diagram of the format of the instruction set provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
FIG. 6 is a flowchart of the computing device provided by an embodiment of the present application executing a matrix-multiply-vector instruction.
FIG. 6A is another schematic structural diagram of the computing device provided by an embodiment of the present application.
FIG. 6B is a schematic flowchart of a convolution computation instruction provided by an embodiment of the present application.
FIG. 6C is a schematic flowchart of a fully connected layer forward operation instruction provided by an embodiment of the present application.
FIG. 6D is a flowchart of the pooling forward operation provided by an embodiment of the present application.
FIG. 6E is a flowchart of the pooling reverse operation provided by an embodiment of the present application.
FIG. 6F is a flowchart of the batch normalization forward operation provided by an embodiment of the present application.
FIG. 7A is a schematic diagram of the format of the instruction set of the present application;
FIG. 7B is a schematic diagram of the format of the neural network operation instruction of the present application;
FIG. 7C is a schematic diagram of the format of the matrix operation instruction of the present application;
FIG. 7D is a schematic diagram of the format of the vector operation instruction of the present application;
FIG. 7E is a schematic diagram of the format of the matrix-vector operation instruction of the present application;
FIG. 7 is a schematic diagram of the hub_one_to_two structure of the present application;
FIG. 8 is a schematic diagram of the handshake behavior between hub_one_to_two and a data receiver in the present application;
FIG. 9 is a schematic diagram of an on-chip multi-core structure of 16+1 cores connected by an h-tree in one embodiment of the present application;
FIG. 10 is a schematic diagram of the behavior of data transmission in a hub in another embodiment of the present application;
FIG. 11 is a schematic structural diagram of the h-tree structure of the present application expanded into a complete binary tree topology;
FIG. 12 is a schematic diagram, in another embodiment of the present application, of full-bandwidth data on the h-tree and the data segment corresponding to each leaf tile.
FIG. 13 is a schematic diagram of an on-chip multi-core structure of 64+1 cores connected by an x-tree in one embodiment of the present application;
FIG. 14 is a schematic diagram of the behavior of data transmission in a hub in another embodiment of the present application;
FIG. 15 is a schematic structural diagram of the complete quadtree topology of the x-tree structure of the present application;
FIG. 16 is a schematic diagram, in another embodiment of the present application, of full-bandwidth data on the x-tree and the data segment corresponding to each leaf tile;
FIG. 17 is a schematic block diagram of an overall structure as one embodiment of the present application;
FIG. 18 is a schematic diagram of the node structure of a sparsely connected neural network as one embodiment of the present application;
FIG. 19 is a schematic diagram of the connection relationships of the neural network of FIG. 4;
FIG. 20 is a schematic diagram of the connection relationships of a sparsely connected neural network as yet another embodiment of the present application;
FIG. 21 is a schematic diagram of a convolution operation as one embodiment of the present application;
FIG. 22 is a diagram of the changes in input, output and weights when a convolutional neural network becomes sparse;
FIG. 23 is a schematic structural diagram of a sparsely connected artificial neural network operation device as one embodiment of the present application;
FIG. 24 is a schematic structural diagram of a mapping unit as one embodiment of the present application;
FIG. 25 is a flowchart of the operation process of a sparsely connected artificial neural network as one embodiment of the present application;
FIG. 26 is a schematic structural diagram of a sparsely connected artificial neural network operation device as another embodiment of the present application;
FIG. 27 is a schematic structural diagram of a mapping unit as another embodiment of the present application;
FIG. 28 is a schematic structural diagram of a sparsely connected artificial neural network operation device as a further embodiment of the present application;
FIG. 29 is a schematic structural diagram of a mapping unit as a further embodiment of the present application;
FIG. 30 is a schematic structural diagram of a sparsely connected artificial neural network operation device as still another embodiment of the present application;
FIG. 31 is a schematic structural diagram of a mapping unit as still another embodiment of the present application;
FIG. 32 is a structural block diagram of one embodiment of the neural network processing system of the present application;
FIG. 33 is a structural block diagram of another embodiment of the neural network processing system of the present application;
FIG. 34 is a schematic diagram of neural network partitioning in one embodiment of the present application;
FIG. 35 is a schematic diagram of neural network partitioning in another embodiment of the present application;
FIG. 36 is a schematic diagram of neural network partitioning in yet another embodiment of the present application;
FIG. 37 is a flowchart of the neural network processing method of the present application;
FIG. 38 shows an example block diagram of the overall structure of the device for executing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present application.
FIG. 39 schematically shows the structure of the H-tree module (one implementation of the interconnection module) in the device for executing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present application.
FIG. 40 shows an example block diagram of the structure of the master operation module in the device for executing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present application.
FIG. 41 shows an example block diagram of the structure of a slave operation module in the device for executing an artificial neural network forward operation supporting discrete data representation according to an embodiment of the present application.
FIG. 42 shows an example block diagram of the neural network forward operation process according to an embodiment of the present application.
FIG. 43 shows an example block diagram of the neural network reverse training process supporting discrete data representation according to an embodiment of the present application.
FIG. 44 shows a flowchart of a single-layer artificial neural network operation according to an embodiment of the present application.
FIG. 45 shows an example structure of the operation unit according to an embodiment of the present application.
FIG. 46 shows an example structure of the continuous-discrete conversion module for conversion between continuous data and discrete data according to an embodiment of the present application;
FIG. 47 is a schematic structural diagram of a neural network operation device according to the present disclosure.
FIG. 48 is a schematic structural diagram of a neural network operation device according to the present disclosure.
FIG. 49 is a flowchart of a neural network operation method according to the present disclosure.
FIG. 49.1 is a schematic diagram of a coding table according to the present disclosure.
FIG. 49.2 is another schematic diagram of a coding table according to the present disclosure.
FIG. 49.3 is another schematic diagram of a coding table according to the present disclosure.
FIG. 49.4 is another schematic diagram of a coding table according to the present disclosure.
FIG. 49.5 is a schematic diagram of the representation method of power data according to the present disclosure.
FIG. 49.6 is a schematic diagram of the multiplication of a neuron by a power weight according to the present disclosure.
FIG. 49.7 is a schematic diagram of the multiplication of a neuron by a power weight according to the present disclosure.
FIG. 50 is a flowchart of a neural network operation method according to the present disclosure.
FIG. 50.1 is a schematic diagram of a coding table according to the present disclosure.
FIG. 50.2 is another schematic diagram of a coding table according to the present disclosure.
FIG. 50.3 is another schematic diagram of a coding table according to the present disclosure.
FIG. 50.4 is another schematic diagram of a coding table according to the present disclosure.
FIG. 50.5 is a schematic diagram of the representation method of power data according to the present disclosure.
FIG. 50.6 is a schematic diagram of the multiplication of a power neuron by a power weight according to the present disclosure;
FIG. 51 is a flowchart of a processing method according to an embodiment of the present disclosure.
FIG. 52 is another flowchart of a processing method according to an embodiment of the present disclosure.
FIG. 53 shows a pruning method for a fully connected layer of a neural network according to an embodiment of the present disclosure.
FIG. 54 shows a coarse-grained pruning method for a convolutional layer of a neural network according to an embodiment of the present disclosure.
FIG. 55 is a schematic structural diagram of a processing device according to an embodiment of the present disclosure.
FIG. 56 is a schematic structural diagram of an acceleration device according to an embodiment of the present disclosure.
FIG. 57 is a schematic structural diagram of another acceleration device according to an embodiment of the present disclosure.
FIG. 58 is a specific embodiment of a processing method of the present disclosure;
FIG. 59 is a specific representation method of the short-bit floating-point data structure for storing data according to an embodiment of the present application;
FIG. 60A is an example block diagram of the device for executing an artificial neural network forward operation according to the present application;
FIG. 60 is an example block diagram of the floating-point data statistics module in the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 61 is an example block diagram of the short-bit floating-point computation part of the forward operation module in the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 62 is an example block diagram of the neural network forward operation process according to an embodiment of the present application;
FIG. 63 schematically shows an example block diagram of the operation flow of the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 64 is a specific representation method of the fixed-point data structure for storing data according to an embodiment of the present application;
FIG. 65A is an example block diagram of the device for executing an artificial neural network forward operation according to the present application;
FIG. 65 is an example block diagram of the floating-point data statistics module in the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 66 is an example block diagram of the short-bit fixed-point computation part of the forward operation module in the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 67 is an example block diagram of the neural network forward operation process according to an embodiment of the present application;
FIG. 68 schematically shows an example block diagram of the operation flow of the device for executing an artificial neural network forward operation according to an embodiment of the present application;
FIG. 69 is an overall flowchart of the algorithm implementation according to an embodiment of the present application;
FIG. 70 is a block diagram of an example of the overall structure of a preferred embodiment of the on-chip repeated addressing device of the present application;
FIG. 71 is a data address partitioning diagram of a preferred embodiment of the on-chip repeated addressing method of the present application;
FIG. 72 is a first schematic diagram of data partitioning of a preferred embodiment of the on-chip repeated addressing method of the present application;
FIG. 73 is a second schematic diagram of data partitioning of a preferred embodiment of the on-chip repeated addressing method of the present application;
FIG. 74 is a schematic diagram of the replacement policy of a preferred embodiment of the on-chip repeated addressing method of the present application;
FIG. 75 is a flowchart of a specific embodiment of the on-chip repeated addressing method of the present application;
FIG. 76 is a schematic diagram of a preferred embodiment of on-chip repeated indexing in the on-chip repeated addressing method of the present application;
FIG. 77 is a schematic structural diagram of the on-chip data partitioning read/write system of the present application;
FIG. 78 is a schematic structural diagram of the on-chip data partitioning read/write system of a preferred embodiment of the present application;
FIG. 79A is a first schematic diagram of the implementation of the on-chip data partitioning strategy of the present application;
FIG. 79B is a second schematic diagram of the implementation of the on-chip data partitioning strategy of the present application;
FIG. 80 is a schematic diagram of an on-chip data indexing embodiment of the on-chip data partitioning read/write system of the present application;
FIG. 81 is a schematic diagram of the physical framework of the on-chip data partitioning read/write method of the present application;
FIG. 82 is a physical design framework diagram of the data splicing operation of one embodiment of the on-chip data partitioning read/write method of the present application;
FIG. 83 is a schematic flowchart of the on-chip data partitioning read/write method of the present application;
FIG. 84 is a schematic flowchart of a specific embodiment of the on-chip data partitioning read/write method of the present application;
FIG. 85 shows a schematic structural diagram of a neural network computing system according to an embodiment of the present application.
FIG. 86A schematically shows one embodiment of a multiprocessor according to an embodiment of the present application.
FIG. 86B schematically shows another embodiment of a multiprocessor according to an embodiment of the present application.
FIG. 87 shows a schematic structural diagram of a neural network computing system for training and inference according to an embodiment of the present application.
FIG. 88 shows a schematic structural diagram of a computing system in which computing processors share a storage unit according to an embodiment of the present application.
FIG. 89 shows a schematic structural diagram of a neural network computing system in which computing processors and a control processor share a storage unit according to an embodiment of the present application.
FIG. 90 shows an example block diagram of a system for complex neural network tasks according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure is further described in detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that the matrix in the specific embodiments of the present application may specifically be an m*n matrix, a 1*n matrix or an m*1 matrix, where m and n are integers greater than or equal to 2. When the matrix is a 1*n matrix or an m*1 matrix, it may also be called a vector; any matrix below may be any of the above three types of matrices, which will not be repeated. In machine learning algorithms, taking artificial neural network algorithms as an example, many neural network algorithms contain a large number of matrix operations. In a neural network, the operation expression of an output neuron is y = f(wx + b), where w is a first matrix, x is a second matrix and b is a third matrix; the process of computing the output matrix y is to multiply the matrix w by the matrix x and add the matrix b. Therefore, matrix operations have become an issue that various computing devices must consider at design time; existing matrix computation is slow, cannot meet users' requirements for computing devices, and is inefficient.
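As a minimal sketch of the output-neuron expression y = f(wx + b) above, the helper below uses plain Python lists and a ReLU activation as illustrative choices (the function names are not from the source):

```python
def matvec_add_activate(w, x, b, f):
    """Compute y = f(w*x + b) for one layer: w is a matrix (list of rows),
    x and b are vectors, and f is the activation applied elementwise."""
    wx = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return [f(s + bi) for s, bi in zip(wx, b)]

relu = lambda v: max(0.0, v)
y = matvec_add_activate([[1.0, 2.0], [3.0, -4.0]], [1.0, 1.0], [0.5, 0.5], relu)
print(y)  # [3.5, 0.0]
```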
Referring to FIG. 1A, FIG. 1A shows a computing device. The matrix computing device shown in FIG. 1A contains a plurality of general-purpose processors 101 (CPUs), each with its own memory, and its processing method may be that the plurality of CPUs process the matrix computation in parallel. Although this scheme uses parallel processing for matrix computation, it cannot effectively improve efficiency, because the result of a second matrix operation may need to use the result of a first matrix operation. Specifically, if the first matrix operation is f(1) = A + B and the second matrix operation is f(2) = f(1) + C, the second operation must fetch the result f(1) of the first operation before the actual matrix computation can proceed. This situation is especially prominent in neural network computation. Since the plurality of CPUs process matrix operations in parallel, when matrix computations are distributed it is quite possible that CPU 1 executes the first matrix operation and CPU 2 executes the second; then CPU 2 needs to fetch the result f(1) of the first matrix operation from CPU 1. Therefore, for multi-CPU parallel processing of matrices, the communication between the CPUs becomes the bottleneck of matrix operations and affects the speed of matrix computation.
Referring to FIG. 1B, FIG. 1B shows another computing device. The computing device shown in FIG. 1B contains a graphics processing unit (GPU) 102, which performs the matrix operations; the GPU itself also contains a memory 1021. When the GPU 102 processes a matrix operation, it needs to fetch from the memory 1021 the matrix required by the operation. Because of its large data volume, a single matrix occupies much more storage space than a scalar; although the GPU 102 has very strong computing capability, its memory capacity is insufficient to store a large number of matrices. To solve this problem, FIG. 1B is configured with an off-chip database 103, and the GPU 102 can read matrices from the off-chip database 103. The specific reading method is that the GPU 102 fetches the matrix to be computed from the off-chip database 103, stores it in the memory 1021, decodes the matrix instruction when the matrix operation is executed, and then fetches the matrix from the memory 1021 for computation. In this technical solution, when executing matrix computation, the decoding of matrix instructions by the GPU 102 occupies a large part of the GPU's computing capability, affecting the speed of matrix computation and resulting in low efficiency.
The input neurons and output neurons mentioned in the present application do not refer to the neurons in the input layer and output layer of the whole neural network: for any two adjacent layers of the network, the neurons in the lower layer of the network feed-forward operation are the input neurons, and the neurons in the upper layer of the network feed-forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, in which the neurons are the input neurons, and the (K+1)-th layer is called the output layer, in which the neurons are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
A specific embodiment of the present application provides a matrix computation method, which is completed in the computing device shown in FIG. 2. As shown in FIG. 2, the computing device comprises:
a memory 201 configured to store matrices. The memory is preferably a scratchpad memory capable of supporting matrix data of different lengths; the present application temporarily stores the necessary computation data in the memory (preferably a scratchpad memory), so that the computing device can support data of different lengths more flexibly and effectively during matrix operations. The memory may also be an off-chip database, a database, or another storage medium.
a scalar data storage unit 202 (for example, a scalar register unit) configured to store scalar data, where the scalar data includes but is not limited to: the address of matrix data in the storage medium 201 and scalars involved in matrix-scalar operations. In one embodiment, the scalar register unit may be a scalar register file providing the scalar registers required during operations; the scalar registers store not only matrix addresses but also scalar data. When matrix-scalar operations are involved, the operation unit must obtain not only the matrix address but also the corresponding scalar from the register unit.
an operation unit 203 configured to obtain and execute a first operation instruction. As shown in FIG. 2A, the operation unit comprises a plurality of calculators, including but not limited to: a matrix addition calculator 231, a matrix multiplication calculator 232, a magnitude comparison calculator 233, a nonlinear operation calculator 234 and a matrix-scalar multiplication calculator 235.
As shown in FIG. 3, the matrix computation method comprises the following steps:
Step S301: the operation unit 203 obtains a first operation instruction, the first operation instruction comprising a matrix read indication for the matrix required to execute the instruction.
In step S301, the matrix read indication required to execute the instruction may take many forms. For example, in one optional technical solution of the present application, the matrix read indication may be the storage address of the required matrix. In another optional technical solution, the matrix read indication may be the identifier of the required matrix, and the identifier may take many forms, for example the name of the matrix, the identification number of the matrix, or the register number or address of the matrix in the register unit; the identifier may also include the size of the matrix.
A practical example is used below to illustrate the matrix read indication contained in the first operation instruction. Suppose the matrix operation formula is f(x) = A + B, where A and B are both matrices. In addition to carrying the matrix operation formula, the first operation instruction may carry the storage addresses of the matrices required by the formula: for example, the storage address of A is 0000-0FFF and the storage address of B is 1000-1FFF. Alternatively, it may carry the identifiers of A and B: for example, the identifier of A is 0101 and the identifier of B is 1010.
Step S302: the operation unit 203 sends a read command to the memory 201 according to the matrix read indication.
The implementation of step S302 may specifically be:
if the matrix read indication is the storage address of the required matrix, the operation unit 203 sends to the memory 201 a read command for that storage address and obtains the corresponding matrix in a batch read mode.
If the matrix read indication is the identifier of the required matrix, the operation unit 203 reads, from the scalar data storage unit in a single read mode according to the identifier, the storage address corresponding to the identifier, and then sends to the memory 201 a read command for that storage address and obtains the corresponding matrix in a batch read mode.
The single read mode may specifically be that each read is of a single datum, for example 1 bit, multiple bits, 1 byte, 4 bytes or 8 bytes of data. The reason for providing a single read mode here is that scalar data occupies very little capacity; if a batch read mode were used, the amount of data read would easily be larger than the capacity of the required data, which would waste bandwidth. Therefore, for scalar data, a single read mode is used to reduce bandwidth waste.
Step S303: the operation unit 203 reads, in a batch read mode, the matrix corresponding to the indication, and executes the first operation instruction on the matrix.
The batch read mode in step S303 may specifically be that each read is of multiple data; that is, regardless of the amount of data required, multiple data are read each time. This batch read mode is very suitable for reading large data: for a matrix, which occupies a large capacity, a single read mode would be very slow, so a batch read mode is used here to obtain multiple data and thereby read the matrix data quickly, avoiding the problem of slow matrix reads affecting the matrix computation speed.
The computing device of the technical solution provided by the present application is provided with a scalar data storage unit and a memory that store scalar data and matrix data respectively, and the present application assigns a unit read mode and a batch read mode to the two kinds of storage. By assigning to the matrix data a read mode that matches its characteristics, bandwidth can be well utilized, avoiding the impact of a bandwidth bottleneck on matrix computation speed; in addition, since the scalar data storage unit stores scalar data and a scalar read mode is provided, bandwidth utilization is improved. The technical solution of the present application therefore makes good use of bandwidth, avoids the impact of bandwidth on computation speed, and thus has the advantages of fast computation and high efficiency.
Optionally, executing the first operation instruction on the matrix may specifically be:
performing an n-stage pipeline computation on the matrix; specifically, performing a first pipeline stage computation on the matrix to obtain a first result, inputting the first result to a second pipeline stage to execute the second pipeline stage computation and obtain a second result, inputting the second result to a third pipeline stage to execute the third pipeline stage computation and obtain a third result, and, continuing stage by stage, inputting the (n-1)-th result to the n-th pipeline stage to execute the n-th pipeline stage computation and obtain an n-th result, which is input to the memory. n may be an integer greater than or equal to 2. Taking n = 3 as an example, the operation flowchart of the above pipeline stages is shown in FIG. 2B.
The first pipeline stage includes but is not limited to: matrix addition calculators, matrix multiplication calculators, and the like.
The second pipeline stage includes but is not limited to: magnitude comparison calculators and the like.
The third pipeline stage includes but is not limited to: nonlinear operation calculators, matrix-scalar multipliers, and the like.
Dividing the matrix operation into three pipeline stages is mainly intended to increase the speed of the operation. For matrix computation, when the general-purpose processor shown in FIG. 1A is used, the operation steps may specifically be: the processor computes on the matrix to obtain a first result and stores the first result in memory; the processor reads the first result from memory, performs a second computation to obtain a second result, and stores the second result in memory; the processor reads the second result from memory, performs a third computation to obtain a third result, and stores the third result in memory. It can be seen from these steps that when a general-purpose processor performs matrix computation it does not compute in pipeline stages, so the computed data must be saved after each computation and read again in the next; this scheme repeatedly stores and reads the data many times. For the technical solution of the present application, the first result of the first pipeline stage directly enters the second pipeline stage for computation, and the second result of the second pipeline stage directly enters the third pipeline stage; the first and second results of the first and second pipeline stages need not be stored. Firstly, this reduces the memory footprint; secondly, it avoids multiple stores and reads of results, improving bandwidth utilization and further improving computation efficiency.
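The three-stage pipeline above can be sketched as a simple composition in which each intermediate result feeds the next stage directly, without a memory round-trip in between; the stage functions below are illustrative toys, not the calculators of the device:

```python
def pipeline3(x, stage1, stage2, stage3):
    """Three pipeline stages: each result is passed straight to the
    next stage instead of being written back to memory in between."""
    first = stage1(x)       # e.g. matrix addition / multiplication
    second = stage2(first)  # e.g. magnitude comparison
    return stage3(second)   # e.g. nonlinear op / matrix-scalar multiply

# Toy usage on a 2x2 matrix stored as nested lists.
double = lambda m: [[2 * v for v in row] for row in m]
clamp = lambda m: [[min(v, 5) for v in row] for row in m]
negate = lambda m: [[-v for v in row] for row in m]
print(pipeline3([[1, 2], [3, 4]], double, clamp, negate))
# [[-2, -4], [-5, -5]]
```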
In another embodiment of the present application, the pipeline components can be freely combined, or a single pipeline stage can be adopted. For example, the second and third pipeline stages can be merged, or the first, second and third pipelines can all be merged, or each pipeline stage can be made responsible for different operations in various permutations: for example, the first pipeline stage is responsible for comparison operations and part of the multiplication operations, while the second pipeline stage is responsible for combinations such as nonlinear operations and matrix-scalar multiplication.
Optionally, the computing device may further include a cache unit 204 configured to cache the first operation instruction. While an instruction is executing, it is also cached in the cache unit; after an instruction finishes executing, if the instruction is also the earliest of the uncommitted instructions in the cache unit, the instruction is committed, and once committed, the changes to the device state made by the operations of that instruction cannot be undone. In one embodiment, the instruction cache unit may be a reorder buffer.
Optionally, before step S301 the method may further comprise:
determining whether there is a dependency between the first operation instruction and a second operation instruction preceding it; if there is a dependency, then after the second operation instruction finishes executing, the first operation instruction is fetched from the cache unit and passed to the operation unit 203. If the first operation instruction has no dependency on the instructions preceding it, the first operation instruction is passed directly to the operation unit.
A specific implementation of determining whether there is a dependency between the first operation instruction and the preceding second operation instruction may be:
extracting, according to the first operation instruction, a first storage address interval of the matrix required by the first operation instruction, and extracting, according to the second operation instruction, a second storage address interval of the matrix required by the second operation instruction; if the first and second storage address intervals have an overlapping region, it is determined that the first and second operation instructions have a dependency; if they have no overlapping region, it is determined that they have no dependency.
The appearance of an overlapping region in the storage address intervals indicates that the first and second operation instructions access the same matrix. Since the storage space for a matrix is relatively large, if identical storage regions were used as the condition for judging dependency, a situation could occur in which the storage region accessed by the second operation instruction contains the storage region accessed by the first. For example, the second operation instruction accesses the storage regions of matrix A, matrix B and matrix C; if the A and B regions are adjacent, or the A and C regions are adjacent, the storage region accessed by the second instruction is the A and B regions plus the C region, or the A and C regions plus the B region. In this case, if the first operation instruction accesses the storage regions of matrix A and matrix D, the matrix storage region accessed by the first instruction cannot be identical to the matrix storage region accessed by the second; with an "identical" condition it would be determined that the two instructions are not dependent, but practice shows that the first and second instructions are in fact dependent in this case. Therefore, the present application judges dependency by whether there is an overlapping region, which avoids the misjudgment in the above situation.
A practical example is used below to illustrate which situations constitute a dependency and which do not. Suppose the matrices required by the first operation instruction are matrix A and matrix D, the storage region of matrix A being [0001, 0FFF] and the storage region of matrix D being [A000, AFFF], and the matrices required by the second operation instruction are matrix A, matrix B and matrix C, with corresponding storage regions [0001, 0FFF], [1000, 1FFF] and [B000, BFFF]. For the first operation instruction, the corresponding storage regions are [0001, 0FFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage regions are [0001, 1FFF] and [B000, BFFF]. The storage region of the second instruction overlaps the storage region of the first instruction in [0001, 0FFF], so the first and second operation instructions have a dependency.
Now suppose the matrices required by the first operation instruction are matrix E and matrix D, the storage region of matrix E being [C000, CFFF] and the storage region of matrix D being [A000, AFFF], and the matrices required by the second operation instruction are matrix A, matrix B and matrix C, with corresponding storage regions [0001, 0FFF], [1000, 1FFF] and [B000, BFFF]. For the first operation instruction, the corresponding storage regions are [C000, CFFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage regions are [0001, 1FFF] and [B000, BFFF]. The storage regions of the second instruction and the first instruction have no overlapping region, so the first and second operation instructions have no dependency.
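The interval-overlap dependency check in the two examples above can be sketched as follows, using the interval values from the examples (the function names are illustrative):

```python
def intervals_overlap(a, b):
    """Two closed address intervals [a0, a1] and [b0, b1] overlap iff
    each starts no later than the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]

def has_dependency(first_intervals, second_intervals):
    """Two instructions depend on each other if any address interval of
    the first overlaps any address interval of the second."""
    return any(intervals_overlap(a, b)
               for a in first_intervals for b in second_intervals)

first = [(0x0001, 0x0FFF), (0xA000, 0xAFFF)]    # matrices A and D
second = [(0x0001, 0x1FFF), (0xB000, 0xBFFF)]   # matrices A+B and C
print(has_dependency(first, second))   # True: [0001, 0FFF] overlaps

first_e = [(0xC000, 0xCFFF), (0xA000, 0xAFFF)]  # matrices E and D
print(has_dependency(first_e, second))  # False: no interval overlaps
```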
FIG. 4 is a schematic diagram of the format of the instruction set provided by the present application; the ellipsis in FIG. 4 indicates that multiple registers or immediates may be included. As shown in FIG. 4, an operation instruction includes an opcode and at least one operation field, where the opcode indicates the function of the operation instruction (the operation unit can perform different matrix operations by identifying the opcode) and the operation field indicates the data information of the operation instruction, where the data information may be an immediate or a register number. For example, to obtain a matrix, the matrix start address and matrix length can be obtained from the corresponding register according to the register number, and then the matrix stored at the corresponding address is obtained from the storage medium according to the matrix start address and matrix length.
The instruction set contains operation instructions with different functions:
Matrix-multiply-vector instruction (MMV): according to this instruction, the device fetches matrix data and vector data of a set length from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), performs the matrix-multiply-vector multiplication in the operation unit, and writes the result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file). It is worth noting that a vector can be stored in the memory (preferably a scratchpad memory or a scalar register file) as a special form of matrix (a matrix with only one row of elements).
Vector-multiply-matrix instruction (VMM): according to this instruction, the device fetches vector data and matrix data of a set length from specified addresses of the memory (preferably a scratchpad memory or a scalar register file), performs the vector-multiply-matrix multiplication in the operation unit, and writes the result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file). It is worth noting that a vector can be stored in the memory (preferably a scratchpad memory or a scalar register file) as a special form of matrix (a matrix with only one row of elements).
Matrix-multiply-scalar instruction (VMS): according to this instruction, the device fetches matrix data of a set length from a specified address of the memory (preferably a scratchpad memory or a scalar register file), fetches scalar data of a specified size from a specified address of the scalar register file, performs the scalar-multiply-matrix multiplication in the operation unit, and writes the computation result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file). It should be noted that the scalar register file stores not only the address of the matrix but also scalar data.
Tensor operation instruction (TENS): according to this instruction, the device fetches two blocks of matrix data, each of a set length, from two specified addresses of the memory (preferably a scratchpad memory or a scalar register file), performs the tensor operation on the two blocks of matrix data in the operation unit, and writes the computation result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file).
Matrix addition instruction (MA): according to this instruction, the device fetches two blocks of matrix data, each of a set length, from two specified addresses of the memory (preferably a scratchpad memory or a scalar register file), performs addition on the two matrices in the operation unit, and writes the computation result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file).
Matrix subtraction instruction (MS): according to this instruction, the device fetches two blocks of matrix data, each of a set length, from two specified addresses of the memory (preferably a scratchpad memory or a scalar register file), performs subtraction on the two matrices in the operation unit, and writes the computation result back. Preferably, the computation result is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file).
Matrix retrieval instruction (MR): according to this instruction, the device fetches vector data of a set length from a specified address of the memory (preferably a scratchpad memory or a scalar register file), and fetches matrix data of a specified size from a specified address of the memory (preferably a scratchpad memory or a scalar register file). In the operation unit, the vector is an index vector; the i-th element of the output vector is the number found in the i-th column of the matrix using the i-th element of the index vector as an index. The output vector is written back to a specified address of the memory (preferably a scratchpad memory or a scalar register file).
Matrix load instruction (ML): according to this instruction, the device loads data of a set length from a specified external source address to a specified address of the memory (preferably a scratchpad memory or a scalar register file).
Matrix store instruction (MS): according to this instruction, the device stores matrix data of a set length from a specified address of the memory (preferably a scratchpad memory or a scalar register file) to an external destination address.
Matrix move instruction (MMOVE): according to this instruction, the device stores matrix data of a set length from a specified address of the memory (preferably a scratchpad memory or a scalar register file) to another specified address of the memory (preferably a scratchpad memory or a scalar register file).
The set length in the above instructions can be determined by the user. In an optional embodiment, the user can set the length to one value; of course, in practical applications, the user can also set the length to multiple values. The specific value and number of the set length are not limited in the specific embodiments of the present application. To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
Referring to FIG. 5, FIG. 5 shows another computing device 50 provided by a specific embodiment of the present application. For specific implementations, refinements or technical effects of the illustrated embodiment, refer to the descriptions of the embodiments shown in FIG. 2 or FIG. 3, which are not repeated here. As shown in FIG. 5, the computing device 50 comprises: a memory 501, a scalar data storage unit 502 (preferably a scalar register unit), a matrix computation unit 503 and a control unit 504;
the memory 501 is configured to store matrices;
the scalar data storage unit 502 is configured to store scalar data, the scalar data comprising at least: the storage address of the matrix in the memory;
the control unit 504 is configured to control the matrix computation unit to obtain a first operation instruction, the first operation instruction comprising a matrix read indication for the matrix required to execute the instruction;
the operation unit 503 is configured to send a read command to the memory according to the matrix read indication, to read, in a batch read mode, the matrix corresponding to the matrix read indication, and to execute the first operation instruction on that matrix.
Optionally, the matrix read indication comprises: the storage address of the matrix required by the instruction, or the identifier of the matrix required by the instruction.
Optionally, when the matrix read indication is the identifier of the matrix required by the instruction,
the control unit 504 is configured to control the operation unit to read, from the register unit in a unit read mode according to the identifier, the storage address corresponding to the identifier, and to control the operation unit to send to the memory a read command for that storage address and to obtain the matrix in a batch read mode.
Optionally, the operation unit 503 is specifically configured to perform a first pipeline stage computation on the matrix to obtain a first result, input the first result to a second pipeline stage to execute the second pipeline stage and obtain a second result, input the second result to a third pipeline stage to execute the third pipeline stage and obtain a third result, and, continuing stage by stage, input the (n-1)-th result to the n-th pipeline stage to execute the n-th pipeline stage computation and obtain an n-th result, which is input to the memory. n may be an integer greater than or equal to 2.
Optionally, the computing device further comprises:
a cache unit 505 configured to cache operation instructions to be executed;
the control unit 504 is configured to cache the operation instructions to be executed in the cache unit 505.
Optionally, the control unit 504 is configured to determine whether there is a dependency between the first operation instruction and a second operation instruction preceding the first operation instruction; if the first operation instruction has a dependency on the second operation instruction, the first operation instruction is cached in the cache unit, and after the second operation instruction has finished executing, the first operation instruction is fetched from the cache unit and transferred to the operation unit;
determining whether there is a dependency between the first operation instruction and the second operation instruction preceding it comprises:
extracting, according to the first operation instruction, a first storage address interval of the matrix required by the first operation instruction, and extracting, according to the second operation instruction, a second storage address interval of the matrix required by the second operation instruction; if the first storage address interval and the second storage address interval have an overlapping region, determining that the first operation instruction and the second operation instruction have a dependency; if they have no overlapping region, determining that the first operation instruction and the second operation instruction have no dependency.
Optionally, the control unit 504 can be configured to obtain an operation instruction from the instruction cache unit, process the operation instruction, and then provide it to the operation unit. The control unit 504 can be divided into three modules: an instruction fetch module 5031, a decoding module 5032 and an instruction queue module 5033,
the instruction fetch module 5031 is configured to obtain an operation instruction from the instruction cache unit;
the decoding module 5032 is configured to decode the obtained operation instruction;
the instruction queue 5033 is configured to store the decoded operation instructions in order. Considering that different instructions may have dependencies on the registers they involve, it caches the decoded instructions and issues an instruction once its dependencies are satisfied.
Referring to FIG. 6, FIG. 6 is a flowchart of the computing device provided by an embodiment of the present application executing a matrix-multiply-vector instruction. As shown in FIG. 6, the hardware structure of the computing device refers to the structure shown in FIG. 5; taking the memory shown in FIG. 5 as a scratchpad memory as an example, the process of executing the matrix-multiply-vector instruction comprises:
Step S601: the computing device controls the instruction fetch module to fetch the matrix-multiply-vector instruction and send it to the decoding module.
Step S602: the decoding module decodes the matrix-multiply-vector instruction and sends it to the instruction queue.
Step S603: in the instruction queue, the matrix-multiply-vector instruction needs to obtain, from the scalar register file, the data in the scalar registers corresponding to the five operation fields of the instruction, the data including the input vector address, the input vector length, the input matrix address, the output vector address and the output vector length.
Step S604: the control unit determines whether there is a dependency between the matrix-multiply-vector instruction and the operation instructions preceding it; if there is a dependency, the matrix-multiply-vector instruction is stored in the cache unit; if there is no dependency, the matrix-multiply-vector instruction is transferred to the operation unit.
Step S605: the operation unit fetches the required matrix and vector data from the scratchpad memory according to the data in the scalar registers corresponding to the five operation fields, and then completes the multiplication in the operation unit.
Step S606: after the operation unit completes the operation, the result is written to the specified address of the memory (preferably a scratchpad memory or a scalar register file), and the matrix-multiply-vector instruction in the reorder buffer is committed.
The matrix computation instruction in FIG. 6 above takes the matrix-multiply-vector instruction as an example. In practical applications, the matrix-multiply-vector instruction in the embodiment shown in FIG. 6 can be replaced by a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction, a tensor operation instruction, a matrix addition instruction, a matrix subtraction instruction, a matrix retrieval instruction, a matrix load instruction, a matrix store instruction or a matrix move instruction, which are not described one by one here.
Referring to FIG. 6A, FIG. 6A provides a computing device comprising: a memory 611 (optional), a register unit 612, an interconnection module 613, an operation unit 614, a control unit 615 and a data access unit 616;
the operation unit 614 includes at least two of: an addition calculator, a multiplication calculator, a comparator and an activation calculator.
The interconnection module 613 is configured to control the connection relationships of the calculators in the operation unit 614 so that the at least two kinds of calculators form different computation topologies.
The instruction storage unit (which may be a register unit, an instruction cache, or a scratchpad memory) 612 is configured to store the operation instruction, the address of the data block in the storage medium, and the computation topology corresponding to the operation instruction.
The operation instruction may include an operation field and an opcode. Taking a convolution computation instruction as an example, as shown in Table 1, register 0, register 1, register file 2, register 3 and register 4 may be operation fields, and each of register 0, register 1, register 2, register 3 and register 4 may be one or more registers.
Figure PCTCN2018095706-appb-000001
The memory 611 may be an off-chip memory; of course, in practical applications, referring to FIG. 6D, when it is an on-chip memory, the on-chip memory may be a cache, specifically a scratchpad cache, configured to store a data block. The data block may specifically be n-dimensional data, n being an integer greater than or equal to 1: for example, when n = 1 it is 1-dimensional data, i.e. a vector; when n = 2 it is 2-dimensional data, i.e. a matrix; when n = 3 or more it is multi-dimensional data.
The control unit 615 is configured to extract, from the register unit 612, the operation instruction, the operation field corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction, to decode the operation instruction into an execution instruction used to control the operation unit to perform the operation, and to transfer the operation field to the data access unit 616.
The data access unit 616 is configured to extract, from the memory 611, the data block corresponding to the operation field and transfer the data block to the interconnection module 613.
The interconnection module 613 is configured to receive the data block and send the data block to the operation unit 614.
The operation unit 614 is configured to have the execution instruction call the calculators of the operation unit 614 to perform the operation on the data block to obtain an operation result, and to transfer the operation result to the data access unit for storage in the memory. In one embodiment, the operation unit 614 is configured to call the calculators according to the first computation topology and the execution instruction to perform the operation on the data block to obtain an operation result, and to transfer the operation result to the data access unit for storage in the memory.
In an optional embodiment, the first computation topology may be: multiplication operator → addition operator → addition operator → activation operator.
Specific computation methods of the computing device shown in FIG. 6A are described below through different operation instructions. Here the operation instruction takes the convolution computation instruction as an example; the convolution computation instruction can be applied in a neural network, so the convolution computation instruction may also be called a convolutional neural network instruction. For the convolution computation instruction, the formula it actually needs to execute may be:
Figure PCTCN2018095706-appb-000002
in which the convolution kernel w is multiplied by the input data x_i, the products are summed, the bias b is added, and then the activation operation is performed to obtain the final output result s. From this formula, the computation topology can be obtained as: multiplication operator → addition operator → (optional) activation operator.
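A minimal sketch of this multiply-sum-bias-activate chain follows; the kernel, input window and activation are illustrative toy values, and the function name is not from the source:

```python
def conv_neuron(w, x, b, activate):
    """Multiply the convolution kernel w elementwise by the input
    window x, sum the products, add the bias b, then activate."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activate(s)

relu = lambda v: max(0.0, v)
print(conv_neuron([1.0, 0.5], [2.0, 4.0], -1.0, relu))  # 3.0
```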
上述卷积计算指令可以包括指令集,该指令集包括:卷积神经网络指令,有不同功能的卷积神经网络COMPUTE指令以及CONFIG指令、IO指令、NOP指令、JUMP指令和MOVE指令。在一种实施例中,COMPUTE指令包括:
卷积神经网络指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据和卷积核,在卷积运算部件中做卷积运算直接得到输出结果。即该指令不执行后续的操作,直接做卷积运算得到输出结果。
卷积神经网络sigmoid指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据和卷积核,在卷积运算部件中做卷积操作,优选的,然后将输出结果做sigmoid激活;
卷积神经网络TanH指令,根据该指令,装置分别从存储器(优选的高速暂存存储器)的指定地址取出指定大小的输入数据和卷积核,在卷积运算部件中做卷积操作,优选的,然后将输出结果做TanH激活;
卷积神经网络ReLU指令,根据该指令,装置分别从存储器(优选的高速暂存存储器)的指定地址取出指定大小的输入数据和卷积核,在卷积运算部件中做卷积操作,优选的,然后将输出结果做ReLU激活;以及
卷积神经网络group指令,根据该指令,装置分别从存储器(优选的高速暂存存储器)的指定地址取出指定大小的输入数据和卷积核,划分group之后,在卷积运算部件中做卷积操作,优选的,然后将输出结果做激活。
CONFIG指令在每层人工神经网络计算开始前配置当前层计算需要的各种常数。
IO指令实现从外部存储空间读入计算需要的输入数据以及在计算完成后将数据存回至外部空间。
NOP指令负责清空当前装置内部所有控制信号缓存队列中的控制信号,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何操作;
JUMP指令负责控制将要从指令存储单元读取的下一条指令地址的跳转,用来实现控制流的跳转;
MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
如图6A所示的计算装置执行卷积计算指令的方法具体可以为:
控制单元615从寄存器单元612内提取卷积计算指令、卷积计算指令对应的操作域,控制单元将该操作域传输至数据访问单元。
数据访问单元从存储器内提取该操作域对应的卷积核w和偏置b(当b为0时,不需要提取偏置b),将卷积核w和偏置b传输至运算单元。
计算单元的乘法运算器将卷积核w与输入数据Xi执行乘法运算以后得到第一结果,将第一结果输入到加法运算器执行加法运算得到第二结果,将第二结果和偏置b执行加法运算得到第三结果,将第三结果输入到激活运算器执行激活运算得到输出结果s,将输出结果s传输至数据访问单元存储至存储器内。其中,每个步骤后都可以直接将输出结果传输到数据访问单元存储至存储器内。另外,将第二结果和偏置b执行加法运算得到第三结果这一步骤为可选步骤,即当b为0时,不需要这个步骤。
本申请提供的技术方案通过一个指令即卷积计算指令即实现了卷积的计算,卷积计算的中间数据(例如第一结果、第二结果、第三结果)均无需存储或提取,减少了中间数据的存储以及提取操作,所以其具有减少对应的操作步骤、提高卷积计算效率的优点。
图6B是本申请实施例提供的卷积神经网络运算装置执行卷积神经网络的流程图,如图6B所示,执行卷积神经网络指令的过程包括:
在步骤S6B1,在指令存储单元的首地址处预先存入一条IO指令。
在步骤S6B2,控制器单元从指令存储单元的首地址读取该条IO指令,根据译出的控制信号,数据访问单元从存储器读取相应的所有卷积神经网络运算指令,并将其缓存在指令存储单元中。
在步骤S6B3,控制器单元接着从指令存储单元读入下一条IO指令,根据译出的控制信号,数据访问单元从存储器读取运算单元需要的所有数据块(例如,包括输入数据、用于作快速的激活函数运算的插值表、用于配置运算器件参数的常数表、偏置数据等)。
在步骤S6B4,控制器单元接着从指令存储单元读入下一条CONFIG指令,根据译出的控制信号,装置配置该层神经网络计算需要的各种常数。例如,运算单元根据控制信号里的参数配置单元内部寄存器的值,所述参数包括例如激活函数需要的数据。
在步骤S6B5,控制器单元接着从指令存储单元读入下一条COMPUTE指令,根据译出的控制信号,互连模块将卷积窗口内的输入数据发给计算单元内的各计算器。
在步骤S6B6,根据COMPUTE指令译出的控制信号,互联模块将乘法计算器、加法计算器和激活计算器连接形成第一计算拓扑结构。
在步骤S6B7,乘法运算器将卷积核w与输入数据Xi执行乘法运算以后得到第一结果,将第一结果输入到加法运算器执行加法运算得到第二结果,将第二结果和偏置b执行加法运算得到第三结果,将第三结果输入到激活运算器执行激活运算得到输出结果s,将输出结果s传输至数据访问单元存储至存储介质内。其中,将第二结果和偏置b执行加法运算得到第三结果这一步骤可选,即当b为0时,不需要这个步骤。
下面通过不同的运算指令来说明如图6A所示的计算装置的具体计算方法,这里的运算指令以全连接层正向运算指令为例,该全连接层正向运算指令可以应用在神经网络中。对于全连接层正向运算指令来说,其实际需要运算的公式可以为:out=f(w1*in+b),其中,out输出神经元向量、in是输入神经元向量、b是偏置向量,w1是权值,f是激活函数。依据该实际运算即可以得到该计算拓扑结构为,乘法运算器-加法运算器-激活运算器。在实际应用中,上述偏置b也可以为0,其偏置b的具体值可以由全连接层正向运算指令来确定。
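上述out=f(w1*in+b)的全连接层正向运算,可以用如下Python草图示意(仅为算法示意,fc_forward等名称为假设,并非本申请装置的实际实现):

```python
def fc_forward(w1, x, b, f):
    """全连接层正向运算 out = f(w1*in + b) 的示意:
    w1 为权值矩阵(每行对应一个输出神经元),x 为输入神经元向量,b 为偏置向量,
    f 为激活函数。对应 乘法运算器-加法运算器-激活运算器 的计算拓扑。"""
    out = []
    for row, bias in zip(w1, b):
        acc = sum(w * xi for w, xi in zip(row, x))  # 乘法运算器 + 加法运算器
        out.append(f(acc + bias))                    # 激活运算器
    return out
```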
人工神经网络全连接层正向运算的指令集。指令集中包括CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令和MOVE指令等,其中:
CONFIG指令在每层人工神经网络计算开始前配置当前层计算需要的各种常数;
全连接层正向运算指令,根据该指令,装置从存储器的指定地址取出权值数据和偏置数据,在计算单元中进行全连接运算,并将结果写回。优选的,将结果写回存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址。IO指令实现从存储器读入计算需要的输入数据以及在计算完成后将数据存回至外部空间;
NOP指令负责清空当前装置内部所有控制信号缓存队列中的控制信号,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何计算操作;
JUMP指令负责控制将要从指令存储单元读取的下一条指令地址的跳转,用来实现控制流的跳转;
MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
如图6A所示的计算装置执行全连接层正向运算指令的方法具体可以为:
控制单元615从寄存器单元612内提取全连接层正向运算指令、全连接层正向运算指令对应的操作域,控制单元将该操作域传输至数据访问单元。
数据访问单元从存储器内提取该操作域对应的权值W1和偏置b,将权值W1和偏置b传输至运算单元。
运算单元可以按第二计算拓扑结构(乘法运算器-加法运算器-(可选的)激活运算器)执行运算,具体的:乘法运算器将权值W1与输入数据in执行乘法运算以后得到第一结果,将第一结果和偏置输入到加法运算器执行加法运算得到第二结果,将第二结果输入到激活运算器执行激活运算得到输出结果,将输出结果传输至数据访问单元存储至存储器内。其中,每个步骤后都可以直接将输出结果传输到数据访问单元存储至存储器内,无需下面的步骤。另外,如果偏置b为0时候,就不需要将第一结果和偏置输入到加法运算器执行加法运算得到第二结果这一步骤。
图6C示出单层人工神经网络全连接层正向运算的另一种更详细的实施方法,如图6C所示的方法在计算装置中实现,上述计算装置中的运算单元包括一个主运算单元和一个或多个从运算单元,如图6C所示的方法中计算装置以多个从运算单元为例进行说明,上述互联模块连接主运算单元和多个从运算单元,上述互联模块可以为树状结构、环状结构、网格状结构、分级互连或总线结构。
在步骤S2.1,在指令存储单元处预先存入第一IO指令。
在步骤S2.2,控制器单元从指令存储单元读取该第一IO指令,根据译出的控制信号,数据访问单元从存储器读取相应的所有人工神经网络全连接层运算指令,并将其存储在指令存储单元中。
在步骤S2.3,控制器单元接着从指令存储单元读入第二IO指令,根据第二IO指令译出的控制信号,数据访问单元从存储器读取主运算单元(即激活运算器)需要的所有数据(例如,包括输入神经元向量、插值表、常数表和偏置等)至主运算单元的第一存储单元。
在步骤S2.4,控制器单元接着从指令存储单元读入第三条IO指令,根据第三条IO指令译出的控制信号,数据访问单元从存储器读取从运算单元(加法计算器或乘法计算器)需要的权值矩阵数据。
在步骤S2.5(可选的),控制器单元接着从指令存储单元读入CONFIG指令,根据译出的控制信号,配置该层神经网络计算需要的各种常数。
在步骤S2.6,控制器单元接着从指令存储单元读入全连接层正向运算指令,根据译出的控制信号,主运算单元首先通过互连模块将输入神经元向量发给各从运算单元,保存至从运算单元的第二存储单元。
在步骤S2.7,根据COMPUTE指令译出的控制信号,从运算单元的第二运算单元从第三存储单元读取权值,从第二存储单元读取输入神经元向量,完成权值和输入神经元向量的点积运算得到中间结果,将中间结果通过互连模块返回。
在步骤S2.8,在互连模块中,各从运算单元返回的中间结果被逐级拼成完整的中间结果向量。
在步骤S2.9,主运算单元得到互连模块的返回的中间结果向量,根据COMPUTE指令译出的控制信号,从第一存储单元读取偏置向量,将偏置向量与中间结果向量通过向量加单元相加得到相加结果,激活单元对相加结果做激活得到输出神经元向量,并将最后的输出神经元向量写回至第一存储单元中。
在步骤S2.10,控制器单元接着从指令存储单元读入第四IO指令,根据译出的控制信号,数据访问单元将输出神经元向量存至存储器指定地址,运算结束。
下面通过不同的运算指令来说明如图6A所示的计算装置的具体计算方法,这里的运算指令以池化(pooling)运算指令为例,该pooling运算指令可以应用在机器学习中,例如神经网络中。pooling运算是指在神经网络的特征层中,进行局部特征的降采样运算,减少特征层的维度。pooling运算包括但不仅限于三种:maxpooling是指在kernel内,取最大值作为结果;avgpooling是指在kernel内,取平均值作为结果;minpooling是指在kernel内,取最小值作为结果。此处的kernel即pooling核,其大小由参数指定,并根据步长stride在特征层上滑动,进行pooling运算,得到结果。对于pooling运算指令来说,其实际需要运算的公式可以为:out=avg(in)=Σin*1/kernel_area,其中,out是输出神经元向量、in是每个kernel里的所有输入神经元向量、kernel_area为pooling核kernel的面积(kernel里的数的总数),上述pooling根据实际的算法的需求可以为average pooling,当然在实际应用中,还可以为max pooling,min pooling,或其他形式的pooling。依据该实际运算即可以得到该计算拓扑结构为,(可选的)乘法运算器-加法运算器/比较运算器-(可选的)激活运算器。
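上述三种pooling运算可以用如下Python草图示意(以一维特征数据为例,pool1d为假设的函数名,仅为算法示意):

```python
def pool1d(data, kernel, stride, mode="avg"):
    """pooling 运算示意:pooling 核大小为 kernel,按步长 stride 在特征层上滑动。
    maxpooling 取核内最大值,avgpooling 取平均值(Σin * 1/kernel_area),
    minpooling 取最小值。"""
    ops = {"max": max, "min": min,
           "avg": lambda w: sum(w) * (1.0 / len(w))}
    out = []
    for start in range(0, len(data) - kernel + 1, stride):
        window = data[start:start + kernel]   # 当前 pooling 核覆盖的输入
        out.append(ops[mode](window))
    return out
```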
pooling指令集中包括CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令和MOVE指令,其中:
CONFIG指令在每层人工神经网络计算开始前配置当前层计算需要的各种常数;例如1/kernel_area可以使用config指令配置得到。
COMPUTE指令包括pooling运算指令,其包括:
Maxpooling正向运算指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做Maxpooling正向运算操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
Maxpooling反向训练指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做maxpooling反向训练操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
Avgpooling正向运算指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做Avgpooling正向运算操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
Avgpooling反向训练指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做Avgpooling反向训练操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
Minpooling正向运算指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做Minpooling正向运算操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
Minpooling反向训练指令,根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的输入数据,在pooling运算部件中做Minpooling反向训练操作,然后将输出结果写回到存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
IO指令实现从存储介质读入计算需要的输入数据以及在计算完成后将数据存回至外部空间;
NOP指令负责清空当前装置内部所有微指令缓存队列中的微指令,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何计算操作;
JUMP指令负责控制将要从指令存储单元读取的下一条指令地址的跳转,用来实现控制流的跳转;
MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
本申请的执行pooling运算的方法包括如下几个阶段,
对于maxpooling(或者minpooling)正向运算指令,在运算单元进行正向运算之前,数据访问单元可以依据指令存储单元内存储的kernel_area的值从存储器内提取出in(kernel里的所有数),然后将1/kernel_area和in传输至运算单元进行正向运算,由运算单元依次完成比较每一个输入向量大小,取最大值(或者最小值)的操作,得到输出向量。对于maxpooling(或者minpooling)反向训练指令,则同时保存对应的索引向量;循环读取新的pooling核kernel的输入向量,做上述比较大小的运算操作,得到新的kernel的输出向量,直至本层pooling运算结束。反向训练时,运算单元根据正向运算时保存的索引向量,通过数据访问单元将输入梯度向量对应输出至相应的存储位置,得到输出梯度向量。或者对于avgpooling正向运算指令,数据访问单元可以依据指令存储单元内存储的kernel_area从存储器内提取出in(kernel里的所有数),然后将该1/kernel_area和in传输至运算单元进行正向运算,由运算模块4依次完成累加每一个输入向量;然后在运算模块4中完成乘以1/kernel_area运算,得到输出向量;循环读取新的kernel的输入向量,做上述累加、乘法运算操作,得到新的kernel的输出向量,直至本层pooling运算结束;或者对于avgpooling反向训练指令,运算模块4将输入梯度向量乘以1/kernel_area,通过数据访问单元3将输入梯度向量对应输出至相应的存储位置,得到输出梯度向量。
控制单元615从寄存器单元612内提取pooling运算指令、pooling运算指令对应的操作域以及pooling运算指令对应的第三计算拓扑结构((可选的)乘法运算器-加法运算器/比较运算器-(可选的)激活运算器),控制单元将该操作域传输至数据访问单元,将该第三计算拓扑结构传输至互联模块。
数据访问单元从存储器内提取该操作域对应的in和1/kernel_area,将in和1/kernel_area传输至计算单元。
计算单元接收数据,执行pooling指令。
举例说明,对于avgpooling正向运算指令,计算单元的乘法运算器将输入数据in与1/kernel_area进行乘法运算后得到第一结果,将第一结果输入到加法器执行加法运算得到第二结果,优选的,再将第二结果输入到激活运算器里进行激活运算。其他的指令不再赘述。
需要说明的是,上述加法运算(或者比较运算)和乘法运算的顺序可以调换。
图6D示出根据一个实施例的pooling运算正向运算流程图。该流程图描述利用本申请的装置和指令集实现的一种pooling运算正向运算的过程。
在步骤S1,在指令存储单元的首地址处预先存入一条IO指令。
在步骤S2,运算开始,控制单元从指令存储单元的首地址读取该条IO指令,根据译出的微指令,数据访问单元从存储介质读取相应的所有pooling运算指令,并将其缓存在指令存储单元中。
在步骤S3,控制单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元从存储介质读取运算单元需要的所有数据(例如,包括输入神经元向量、插值表、常数表等)至运算单元的存储器。
在步骤S4,控制器单元接着从指令存储单元读入下一条CONFIG指令,根据译出的微指令,装置配置该层pooling运算需要的各种常数。例如,运算单元根据微指令里的参数配置单元内部寄存器的值,所述参数例如包括本层计算的精度设置、激活函数需要的数据(例如avgpooling时pooling核大小的倒数1/kernel_area)等。
在步骤S5,根据COMPUTE指令译出的微指令,运算单元的加法运算器从神经元存储单元读取输入神经元向量和中间结果向量,完成对输入神经元向量的运算(avgpooling中是累加输入神经元向量,然后与1/kernel_area相乘,maxpooling是比较大小,求得最大值),并将最后的输出神经元向量写回至神经元存储单元。
在步骤S6,控制单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元将神经元存储单元中的输出神经元向量存至存储介质指定地址,运算结束。
图6E是示出根据一个实施例的pooling运算反向训练流程图。该流程图描述利用本申请的装置和指令集实现一种pooling运算反向训练的过程。
在步骤T1,在指令存储单元的首地址处预先存入一条IO指令。
在步骤T2,运算开始,控制器单元从指令存储单元的首地址读取该条IO指令,根据译出的微指令,数据访问单元从存储介质读取与该pooling运算反向训练有关的所有指令,并将其缓存在指令存储单元中。
在步骤T3,控制器单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元从存储介质读取运算单元需要的所有数据至运算单元的神经元存储单元,所述数据包括输入梯度向量和maxpooling时需要的索引向量index。
在步骤T4,控制器单元接着从指令存储单元读入下一条CONFIG指令,运算单元根据译出的微指令里的参数配置运算单元内部寄存器的值,包括该层pooling运算需要的各种常数,avgpooling时pooling核大小的倒数1/kernel_area、本层计算的精度设置、更新权值时的学习率等。
在步骤T5,根据COMPUTE指令译出的微指令,运算单元的加法计算器从神经元存储单元读取输入梯度向量和maxpooling时需要的索引向量index,完成乘法运算(avgpooling中是与1/kernel_area相乘,maxpooling是与索引向量index相乘),传递输出梯度向量,得到下一层反向训练的输入梯度向量,将其写回至神经元存储单元。
在步骤T6,控制器单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元将神经元存储单元中的输出梯度向量存至存储介质指定地址,运算结束。
对于多层人工神经网络的pooling运算,其实现过程与单层神经网络的pooling运算类似,当上一层人工神经网络执行完毕后,下一层的运算指令会将运算单元中计算出的输出神经元向量或输出梯度向量作为下一层训练的输入神经元向量或输入梯度向量进行如上的计算过程,指令中的权值地址和权值梯度地址也会变更至本层对应的地址。
通过采用用于执行pooling运算的装置和指令集,解决了CPU和GPU运算性能不足,前端译码开销大的问题。有效提高了对多层人工神经网络pooling运算的支持。
通过采用针对pooling运算的专用片上缓存,充分挖掘了输入神经元和权值数据的重用性,避免了反复向内存读取这些数据,降低了内存访问带宽,避免了内存带宽成为pooling运算正向运算及反向训练性能瓶颈的问题。
下面通过不同的运算指令来说明如图6A所示的计算装置的具体计算方法,这里的运算指令以批归一化(batch normalization)运算指令为例,该batch normalization运算指令可以应用在神经网络中。对于batch normalization运算指令来说,其实际需要运算的公式可以为:out=(in-middle1)/middle2,其中,out是输出神经元向量、in是输入神经元向量、middle1,middle2是运算过程中的中间值,middle1,middle2的值可以相同也可以不同。依据该实际运算即可以得到该计算拓扑结构为,加法运算器-乘法运算器。或者实际需要运算的公式可以为:out=in/middle2-middle1/middle2,这种情况下,可以得到该计算拓扑结构为乘法运算器-加法运算器。
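上述out=(in-middle1)/middle2及其等价形式,可以用如下Python草图示意(仅为算法示意,batch_norm为假设的函数名):

```python
def batch_norm(x, middle1, middle2):
    """batch normalization 示意:out = (in - middle1) / middle2。
    写成 out = in * (1/middle2) - middle1/middle2 即对应
    乘法运算器-加法运算器 的计算拓扑。"""
    inv = 1.0 / middle2              # 1/middle2,对应乘法运算器的操作数
    return [(xi - middle1) * inv for xi in x]
```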
batch normalization指令集中包括CONFIG指令、batch normalization指令、IO指令、NOP指令、JUMP指令和MOVE指令,其中:
CONFIG指令在batch normalization计算开始前配置当前层计算需要的各种常数;
batch normalization指令完成batch normalization的计算;
IO指令实现从外部地址空间读入计算需要的输入数据以及在计算完成后将数据存回至外部空间;
NOP指令负责清空当前装置内部所有微指令存储队列中的微指令,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何计算操作;
JUMP指令负责控制将要从指令存储单元读取的下一条指令地址的跳转,用来实现控制流的跳转;
MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
如图6A所示的计算装置执行batch normalization的具体方法可以包括:
控制单元615从寄存器单元612内提取batch normalization运算指令、batch normalization运算指令对应的操作域,控制单元将该操作域传输至数据访问单元。
数据访问单元从存储器内提取该操作域对应的-middle1和1/middle2,将-middle1和1/middle2传输至运算单元。
运算单元执行batch normalization运算指令得到输出结果,将输出结果传输至数据访问单元存储至存储器内。
具体的,运算单元执行batch normalization运算指令得到输出结果的方法可以包括:运算单元的加法运算器将输入数据in与-middle1执行加法运算以后得到第一结果,将第一结果和1/middle2输入到乘法运算器执行乘法运算得到输出结果。
图6F示出根据一个实施例的训练过程中的batch normalization正向运算流程图。该流程图描述利用如图6A的装置和指令集实现图6F所示的batch normalization运算的正向运算的过程。
在步骤F1,在指令存储单元的首地址处预先存入一条IO指令。
在步骤F2,运算开始,控制器单元从指令存储器的首地址读取该条IO指令,根据译出的微指令,数据访问单元从外部地址空间读取相应的所有batch normalization正向运算指令,并将其缓存在指令存储单元中。
在步骤F3,控制器单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元从外部地址空间读取运算单元需要的所有数据(例如,包括输入神经元向量、batch大小、学习参数alpha、beta、极小值eps、均值、方差等)至运算单元的神经元缓存单元。
在步骤F4,控制器单元接着从指令存储单元读入下一条CONFIG指令,根据译出的微指令,装置配置batch normalization运算。例如,本次正向运算过程是使用计算好的均值方差,还是根据输入计算均值方差。
在步骤F5,控制器单元接着从指令存储单元读入下一条COMPUTE指令,根据译出的微指令,运算单元从神经元缓存单元读取输入神经元向量,计算输入神经元的均值和方差存入中间值缓存单元中。
在步骤F6,运算单元根据COMPUTE指令译出的微指令将输入神经元缓存单元和中间值缓存单元中的数据完成减去均值后除以方差与极小量eps和的平方根操作,将结果存回中间值缓存单元。
在步骤F7,运算单元根据COMPUTE指令译出的微指令,从神经元缓存单元读取学习参数alpha,与中间值相乘后加上学习参数beta返回至神经元缓存。
在步骤F8,控制器单元接着从指令存储单元读入下一条IO指令,根据译出的微指令,数据访问单元将神经元缓存单元中的输出神经元向量存至外部地址空间指定地址,运算结束。
对于使用过程中的batch normalization运算的正向过程与训练过程中的batch normalization运算的正向过程区别在于步骤F4中配置使用常数均值和方差,不需要每次动态计算,也就是去掉了步骤F5。其他与图6F相同。
对于batch normalization运算的反向过程与上述的正向过程类似。区别在于操作的数据不同。假设一个像素点传入的梯度为dl/dY,反向传出的梯度是dl/dx,正向过程输出为Y,其余参数表示含义与正向过程相同,则经过batch normalization反向传播出的梯度dl/dx=(alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y),其中mean是取均值操作。学习参数的alpha的梯度:dl/dalpha=(Σdl/dY)*Y,学习参数beta的梯度:dl/dbeta=Σdl/dY,通过这两个梯度更新学习参数的数值。batch normalization的反向过程通过运算单元归一化运算梯度数据例如取均值、方差等。之后运算单元并行的完成公式中其余操作。
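上述反向过程的梯度公式可以用如下Python草图示意(仅为算法示意;其中dl/dalpha按Σ(dl/dY*Y)逐元素求和理解,这是对正文公式的一种假设性解读):

```python
import math

def bn_backward(dY, Y, alpha, var_x, eps):
    """batch normalization 反向过程示意,按正文公式:
    dl/dx = (alpha/sqrt(var(x)+eps)) * (dl/dY - mean(dl/dY) - mean(dl/dY*Y)*Y)。
    学习参数梯度 dl/dbeta = Σ dl/dY;dl/dalpha 此处按 Σ(dl/dY*Y) 理解(假设)。"""
    n = len(dY)
    mean_dY = sum(dY) / n                                  # mean(dl/dY)
    mean_dYY = sum(d * y for d, y in zip(dY, Y)) / n       # mean(dl/dY*Y)
    scale = alpha / math.sqrt(var_x + eps)
    dx = [scale * (d - mean_dY - mean_dYY * y) for d, y in zip(dY, Y)]
    dbeta = sum(dY)
    dalpha = sum(d * y for d, y in zip(dY, Y))
    return dx, dalpha, dbeta
```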
通过采用用于执行batch normalization运算的装置和指令集,解决了CPU和GPU运算性能不足,前端译码开销大的问题。有效提高了对batch normalization正反向运算的支持。
通过采用针对batch normalization运算的专用片上缓存,充分挖掘了输入神经元和中间数据的重用性,避免了反复向内存读取这些数据,降低了内存访问带宽,避免了内存带宽成为多层人工神经网络正向运算性能瓶颈的问题。
通过采用针对batch normalization运算的专用运算单元较好的平衡了并行和串行之间的关系。避免了CPU架构只是串行运算,数据规模较大时速度较慢,GPU架构只是并行运算,处理不好归一化运算的弱点。本申请中数据存储单元和运算单元相配合可以较好的平衡归一化串行运算和并行运算。
需要说明的是,上述计算装置的计算指令可以为一个或多个,即该计算装置可以执行一个或多个上述的计算指令,该计算指令包括但不限于上述的卷积指令、全连接指令、batch normalization指令或pooling指令,上述指令的具体结构以及如何应用可以参见如图6A、图6B、图6C、图6D、图6E和图6F实施例描述,可选的,除了上述指令,该计算装置能够执行的计算指令具体可以包括:
向量内积指令(VP)。根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,在向量计算单元中将两向量进行内积运算,并将结果写回。优选的,将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址。
向量外积指令(TENS)。根据该指令,装置分别从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,在向量计算单元中将两向量进行外积运算,并将结果写回。优选的,并将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址;
向量四则运算,包括:向量加标量指令(VAS),根据该指令,装置从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,从标量寄存器堆的指定地址取出标量数据,在标量运算单元中将向量的每一个元素加上该标量值,并将结果写回。优选的,将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址;
标量减向量指令(SSV)。根据该指令,装置从标量寄存器堆的指定地址取出标量数据,从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出向量数据,在向量计算单元中用该标量减去向量中的相应元素,并将结果写回。优选的,将结果写回存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址;
向量除法指令(VD)。根据该指令,装置从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址分别取出指定大小的向量数据,在向量运算单元中将两向量对位相除,并将结果写回。优选的,将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址;
标量除向量指令(SDV)。根据该指令,装置从标量寄存器堆的指定位置取出标量数据,从存储器(优选的高速暂存存储器)的指定位置取出指定大小的向量数据,在向量计算单元中用标量分别除以向量中的相应元素,并将结果写回。优选的,将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定位置;
向量逻辑指令,包括:
向量间与指令(VAV)。根据该指令,装置从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址分别取出指定大小的向量数据,在向量运算单元中将两向量对位相与,并将结果写回。优选的,将结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址;
向量内与指令(VAND)。根据该指令,装置从存储器(优选的高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,在向量运算单元中将向量中每一位相与,并将结果写回。优选的,将结果写回至标量寄存器堆的指定地址;
向量间或指令(VOV)。根据该指令,装置从存储器(优选的,高速暂存存储器)的指定地址分别取出指定大小的向量数据,在向量运算单元中将两向量对位相或,并将结果写回。优选的,将结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址;
向量内或指令(VOR)。根据该指令,装置从存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,在向量运算单元中将向量中每一位相或,并将结果写回。优选的,将结果写回至标量寄存器堆的指定地址;
超越函数指令,根据该指令,装置从存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,在运算单元中对向量数据做超越函数运算,并将结果写回。优选的,将结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址。
向量比较运算指令,包括
大于等于运算指令(GE),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值大于等于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。
小于等于运算指令(LE),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值小于等于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。
大于运算指令(GT),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值大于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的高速暂存存储器或者标量寄存器堆)的指定存储地址。
小于运算指令(LT),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值小于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。
等于运算指令(EQ),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值等于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。
不等于运算指令(UEQ),根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,包括向量的长度、两向量的起始地址以及输出向量的存储地址,然后读取两向量数据,在向量比较运算单元中对向量中所有位置上的元素进行比较,若某位置上前一向量的值不等于后一向量的值,则将比较结果向量在该位置上的值置为1,否则置为0。最后将比较结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。
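上述各向量比较运算指令的比较部分,可以用如下Python草图统一示意(仅为算法示意,vector_compare为假设的函数名):

```python
import operator

def vector_compare(op, a, b):
    """向量比较运算指令示意:对两向量所有位置上的元素逐位比较,
    满足条件的位置置 1,否则置 0。op 对应 GE/LE/GT/LT/EQ/UEQ。"""
    ops = {"GE": operator.ge, "LE": operator.le, "GT": operator.gt,
           "LT": operator.lt, "EQ": operator.eq, "UEQ": operator.ne}
    cmp = ops[op]
    return [1 if cmp(x, y) else 0 for x, y in zip(a, b)]
```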
向量最大值指令(VMAX)。根据该指令,装置从存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,从中选出最大的元素作为结果,并将结果写回。优选的,将结果写回至标量寄存器堆的指定地址;
向量最小值指令(VMIN)。根据该指令,装置从存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定地址取出指定大小的向量数据,从中选出最小的元素作为结果,并将结果写回。优选的,将结果写回至标量寄存器堆的指定地址;
循环移位运算指令:根据该指令,装置可以直接从指令中或者通过访问指令提供的寄存器号来获得指令的参数,然后在向量移位单元(可以是独立的向量移位单元,也可以是使用计算单元)中进行循环移位,并将移位后的结果写回至存储器(优选的,高速暂存存储器或者标量寄存器堆)的指定存储地址。循环移位运算指令格式如图3所示,包含四个操作域:向量的起始地址和长度、移位步长,以及输出向量的存储地址。
随机向量生成指令,根据该指令,装置从指令中或从存储器(优选的,高速暂存存储器或者标量寄存器堆)中读取一个或多个随机分布参数,以及要生成的随机向量的大小和存储地址,然后在随机向量生成单元中生成服从该随机分布的随机向量,并将生成的随机向量结果写回至指定的存储器(优选的,高速暂存存储器或者标量寄存器堆)的存储地址。
随机向量生成指令具体可以为:
均匀分布指令(UNIF),根据该指令,装置从指令中或从存储器(优选的,高速暂存存储器或者标量寄存器堆)中读取均匀分布的上界参数和下界参数,以及要生成的随机向量的大小和存储地址,然后在随机向量生成单元中生成服从该均匀分布的随机向量,并将生成的随机向量结果写回至指定的存储器(优选的,高速暂存存储器或者标量寄存器堆)的存储地址。
高斯分布指令(GAUS),根据该指令,装置从指令中或从存储器(优选的,高速暂存存储器或者标量寄存器堆)中读取高斯分布的均值参数和方差参数,以及要生成的随机向量的大小和存储地址,然后在随机向量生成单元中生成服从该高斯分布的随机向量,并将生成的随机向量结果写回至指定的存储器(优选的,高速暂存存储器或者标量寄存器堆)的存储地址。
上述指令的格式示意图如图7A所示,神经网络运算指令的格式示意图如图7B所示,矩阵运算指令的格式示意图如图7C所示;向量运算指令的格式示意图如图7D所示;矩阵-向量运算指令的格式示意图如图7E所示。需要说明的是,上述指令的格式示意图仅仅只是一种可能存在的实施例,本申请对上述指令的格式并不限定在上述图示中的表现形式。
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储用于电子数据交换的计算机程序,该计算机程序使得计算机执行如上述方法实施例中记载的任何一种矩阵计算方法的部分或全部步骤。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如上述方法实施例中记载的任何一种矩阵计算方法的部分或全部步骤。
其中,上述实施例中人工神经网络运算装置可以是通用的集成有DMA、控制单元的计算器件。人工神经网络运算装置还可以包括通用计算器件,例如通用的处理器,上述存储介质的具体实现形式可以是存储装置、片上存储介质、内存或存储单元等;指令存储单元的具体实现形式可以是DMA等;运算单元的具体实现形式可以是主运算模块、从运算模块、离散数据运算单元或连续数据运算单元等;缓存单元的具体实现形式可以是指令缓存、输入神经元缓存、权值缓存、输出神经元缓存、指令缓存单元、支持离散数据表示的神经元缓存单元或支持离散数据表示的权值缓存单元等;本申请实施例不作限制。
本申请提出一种数据发布装置,包括:
一个或多个中心节点,其为所述片上网络的通信数据中心,用于向所述多个叶子节点进行通信数据的广播或多播;
多个叶子节点,其为所述片上网络的通信数据节点,用于向所述中心节点进行通信数据的传递;
转发器模块,用于连接所述中心节点与所述多个叶子节点,通信数据通过所述转发器模块进行转发;
将所述多个叶子节点分为N组,所述中心节点通过所述转发器模块单独与每一组叶子节点进行通信连接。
可选的,每组中叶子节点的个数相同。一般技术人员也可以理解每组中叶子节点的个数也可以不同。
可选的,每组叶子节点构成的通信结构具有自相似性。这种情况下,数据发布装置具有分形树网络结构。一般技术人员也可以理解,每组叶子节点构成的通信结构可以有其他的结构,即不局限在自相似性的结构。
可选的,所述多个叶子节点与所述中心节点通过多层所述转发器模块以完全多叉树方式进行通信连接。
在本申请的一种实施方式中,上述中心节点或叶子节点具体可以为如图6A中的计算装置,当然在其他技术场景中,上述中心节点或叶子节点也可以称为计算单元。
每个节点包括本地高速缓存结构,用于存储所述中心节点发布数据的子集;
每个叶子节点均有id标识,且所述id标识从完全多叉树的拓扑一侧按序依次增加序号。
所述数据发布装置共享一个时钟信号。
所述转发器模块包括本地高速缓存结构,用于存储数据。
本申请还提出一种利用所述数据发布装置的数据发布方法,通过所述中心节点将通信数据向所述多个叶子节点进行发布,其中,数据发送方准备好发送数据后,发送数据有效信号,并将数据置于总线;数据接收方准备好接收数据后,发送数据准备接收信号;当所述数据有效信号与所述数据准备接收信号双方检测到后,数据发送方认为数据已经发出,且被数据接收方接收。
当从所述中心节点将通信数据向所述多个叶子节点之间进行广播时,首先数据通过握手协议从所述中心节点进入与所述中心节点直接相连的所述转发器模块的局部缓存中暂时存储,每次握手协议成功后,进入下一层中间转发器模块局部缓存中暂时存储,最后输入与所述叶子节点直接相连的转发器模块,并由转发器模块分别发布给与其相连的一组叶子节点。
如果下一时钟节拍数据发送方与数据接收方握手协议成功,则数据以流水方式进入数据接收方的局部缓存中存储;如果握手协议不成功,则数据在当前层的局部缓存中保存,且使得当前层作为上一层的数据接收方,并停止发送数据准备接收信号,使得当前层的局部缓存中的数据停止更新,数据一直保存在当前层,直到握手协议成功。
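上述数据有效信号/数据准备接收信号的握手与滞留机制,可以用如下Python草图示意(仅为行为层面的示意,Hub类为假设,并非实际硬件实现):

```python
class Hub:
    """转发器模块(hub)握手协议示意:局部缓存为空时才发送数据准备接收信号;
    握手成功则数据进入局部缓存,否则数据滞留在发送方,缓存中的数据停止更新。"""
    def __init__(self):
        self.buffer = None          # 局部缓存

    @property
    def ready(self):                # 数据准备接收信号:缓存为空才为高
        return self.buffer is None

    def push(self, data):
        if self.ready:              # 有效信号与准备接收信号同时有效,握手成功
            self.buffer = data
            return True
        return False                # 握手失败,数据滞留在发送方

    def pop(self):                  # 数据发往下一层,腾空局部缓存
        data, self.buffer = self.buffer, None
        return data
```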
当所述中心节点将通信数据向所述多个叶子节点之间进行多播时,首先数据通过握手协议从所述中心节点进入与所述中心节点直接相连的所述转发器模块的局部缓存中暂时存储,每次握手协议成功后,进入下一层中间转发器模块局部缓存中暂时存储,最后输入与所述叶子节点直接相连的转发器模块,并由转发器模块分别发布给与其相连的一组叶子节点。
在接收数据时,所述叶子节点根据与其相对应的id标识选取预设带宽的数据。
本申请还提出一种包含所述数据发布装置的控制装置。
本申请还提出一种包含所述控制装置的智能芯片。
下面结合附图对本申请做进一步的详细说明,以令本领域技术人员参照说明书文字能够据以实施。
附图7为本申请的一个实施例中使用h-tree连接的16+1个核的片上多核结构示意图,其中16和1只是作为举例,没有进行特别限定,一般技术人员也可以理解是2n+m个核,或者yn+m个核。h树的根节点为central tile,其为数据发布的起点;h树的叶子节点为leaf tile,其为数据发布的终点;其余的中间节点为hub,用于传输并分发数据;
将图中16个leaf tiles分为8组,每组中leaf tile的个数均为2,所述hub通过所述转发器模块单独与每一组leaf tile进行通信连接,每组leaf tile构成的通信结构具有自相似性,所述多个leaf tile与所述central tile通过多层所述转发器模块以完全二叉树方式进行连接;此设备实现了从一个数据中心以广播或者多播的方式向处理单元发布数据的情况。
附图8表示了hub结构示意图,hub由hub_one_to_two模块构成,hub_one_to_two将一组全带宽的输入数据20分成两组全带宽的数据21和22输出,用于从central tile到leaf tile的传输。
如图9所示,当标记为310的hub_one_to_two模块已经将数据与数据有效信号发至总线上,且标记为320的数据接收方0与标记为330的数据接收方1已经将数据准备接收信号发至总线时,此时握手协议才算成功:此拍310认为数据接收方,即320和330,已经接收数据,而下一拍320和330将此拍总线上的数据存入自己的缓冲区。
如图7所示,标记410的central tile广播数据来初始化全部的leaf tile,此时所有hub和leaf tile的局部缓存均为空,其数据准备接收信号均为高,此时与410直接相连的标记为420的hub0_0其数据准备接收信号同样为高。在第一拍时,410准备好数据,将其和数据有效信号置高,由于标记420的hub0_0此时的数据准备接收信号为高,410与420握手成功,在第二拍时,420将数据从总线存入其局部缓存中暂时存储,由于第二拍时,420的局部缓存中已经存有数据,它将数据及其有效信号发送至向下一级430与431方向的总线上,而此时标记为430的hub1_0与标记为431的hub1_1的数据准备接收信号也为高,当拍420与下一层的430和431握手成功,在第三拍时,430与431将数据从总线存入其局部缓存中暂时存储,依次执行,数据每一拍都从上一层向下一层行进一步。在此实例中,以430的hub1_0至标记为460的leaf tile0分支为例,在第四拍时,数据流入标记为440的hub2_0的局部缓存中暂时存储;在第五拍时,数据流入标记为450的hub3_0的局部缓存中暂时存储;在第六拍时,450通过两个输入端口在握手协议成功后分别将全带宽的数据存储到与其相连的一组leaf tile的局部缓存中,此时数据到达标记为460的leaf tile0。由此,在数据通路顺畅的情况下,数据的按层级流水传输得以保证。
如图10所示,此一实例以hub1_0作例,当如下情况发生时,数据将滞留在hub中,在第一拍时,标记为520的hub1_0收到来自标记为510的hub0_0的数据,此时,520将数据与其数据有效信号置于向下一层530与531方向的总线上。现设置情景如下,此时标记为530的hub2_0与标记为531的hub2_1此时并未发出数据准备信号,并在之后时间内一直保持这样的状况,此时由于520与下一层的530和531握手不成功,520的数据无法传输给下一级530与531,并滞留于520的局部缓存中,此时520无法发送数据准备接收信号,在之后的时间内,由于510的局部缓存为空,其又可以接收新的数据,但由于520并未发送数据准备接收信号,导致520与510握手不成功,即510的数据无法发送至520,保证了520的局部缓存中的数据的安全性,从而,使得数据传输的可靠性得以实现。
如图10所示,此一实例以hub1_0作例,当如下情况发生时,hub将可以流水传输数据,在第一拍时,标记为520的hub1_0收到来自标记为510的hub0_0的数据,此时,520将数据与其数据有效信号置于向下一层530与531方向的总线上。现设置情景如下,此时标记为530的hub2_0与标记为531的hub2_1此时发出数据准备信号,并在之后时间内一直保持这样的状况,此时由于520与下一层的530和531握手成功,520的数据传输给下一级530与531,此时520已经可以发送数据准备接收信号,若此时510的局部缓存已经准备好新的数据,并将数据与其数据有效信号置于向520方向的总线上,在当拍,由于520发送数据准备接收信号,520与510握手成功,在第二拍,520将510传输过来的数据存于局部缓存中,并将数据及其有效信号置于向下一层530与531方向的总线上,由此可见,hub在数据通路顺畅即数据源充足的情况下,可以进行流水传输数据。
如图11所示,假设有16个leaf tile,将h树以完全二叉树的拓扑展开,hub为非叶节点,而leaf tile为叶节点,将在树中高度相同的节点都从左到右依次增序,hub以其层数与序号相结合命名,如标记610为hub0_0,即第一层的0号节点,标记620为hub1_0,即第二层的0号节点,标记621为hub1_1,即第二层的1号节点。
如图11所示,在一实施例中,标记60的central tile多播数据来初始化全部的leaf tile,此时所有hub和leaf tile的局部缓存均为空,其数据准备接收信号均为高,即数据通路顺畅,按数据流水传输,在第一拍时,60与610握手成功,在第二拍时,610将数据从总线存入其局部缓存中暂时存储,当拍610与下一层的620和621握手成功,在第三拍时,620与621将数据从总线存入其局部缓存中暂时存储,当拍620与下一层的630和631握手成功,621与下一层的632和633握手成功,在第四拍时,630,631,632,633将数据从总线存入其局部缓存中暂时存储,630与下一层的640和641握手成功,631与下一层的642和643握手成功,632与下一层的644和645握手成功,633与下一层的646和647握手成功,在第五拍时,640,641,642,643,644,645,646,647将数据从总线存入其局部缓存中暂时存储,640与下一层的650和651握手成功,641与下一层的652和653握手成功,642与下一层的654和655握手成功,643与下一层的656和657握手成功,644与下一层的658和659握手成功,645与下一层的65a和65b握手成功,646与下一层的65c和65d握手成功,647与下一层的65e和65f握手成功,在第六拍时,数据同时存储至所有leaf tile,650,651,652,653,654,655,656,657,658,659,65a,65b,65c,65d,65e,65f的局部缓存中,由此可见,数据从中心向叶子节点广播的数据在数据通路顺畅的情况下可以同时到达,数据的同步性得以实现。
在上一实例中,数据到达每个leaf tile时都是全带宽的,假设如图12所示,每个leaf tile的预设带宽均为16位数据,则其可以按照其id序号从全带宽的数据中取得对自己多播的数据,数据在全带宽中的位置为[id*16:id*16+15]。如id序号为15的数据D0,位于data[255:240],而id序号为0的数据D0,位于data[15:0]。
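按id序号从全带宽数据中选取预设带宽数据的过程,可以用如下Python草图示意(假设总线数据用整数表示,每个leaf tile的预设带宽为16位):

```python
def leaf_slice(data, leaf_id, width=16):
    """leaf tile 按 id 序号从全带宽数据中取得发给自己的那段数据,
    位置为 data[id*width + width - 1 : id*width](此处 data 为整数表示的总线数据)。"""
    return (data >> (leaf_id * width)) & ((1 << width) - 1)
```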
图13为本申请的一个实施例中使用x-tree连接的64+1个核的片上多核结构示意图,x树的根节点为central tile,其为数据发布的起点;x树的叶子节点为leaf tile,其为数据发布的终点;其余的中间节点为hub,用于传输并分发数据;将图中64个leaf tiles分为16组,每组中leaf tile的个数均为4,所述hub通过所述转发器模块单独与每一组leaf tile进行通信连接,每组leaf tile构成的通信结构具有自相似性,所述多个leaf tile与所述central tile通过多层所述转发器模块以完全四叉树方式进行连接;此设备实现了从一个数据中心以广播或者多播的方式向处理单元发布数据的情况。
图14表示了hub结构示意图,hub由hub_one_to_four模块构成,hub_one_to_four将一组全带宽的输入数据800分成四组全带宽的数据801、802、803和804输出,用于从central tile到leaf tile的传输。
如图15所示,标记A10的central tile广播数据来初始化全部的leaf tile,此时所有hub和leaf tile的局部缓存均为空,其数据准备接收信号均为高,此时与A10直接相连的标记为A20的hub0_0其数据准备接收信号同样为高。在第一拍时,A10准备好数据,将其和数据有效信号置高,由于标记A20的hub0_0此时的数据准备接收信号为高,A10与A20握手成功,在第二拍时,A20将数据从总线存入其局部缓存中暂时存储,由于第二拍时,A20的局部缓存中已经存有数据,它将数据及其有效信号发送至向下一级A30、 A31、A32、A33方向的总线上,而此时标记为A30的hub1_0、标记为A31的hub1_1、标记为A32的hub1_2、标记为A33的hub1_3的数据准备接收信号也为高,当拍A20与下一层的A30、A31、A32、A33握手成功,在第三拍时,A30、A31、A32、A33将数据从总线存入其局部缓存中暂时存储,依次执行,数据每一拍都从上一层向下一层行进一步。在此实例中,以A33的hub1_3至标记为A50的leaf tile48分支为例:在第四拍时,数据流入标记为A40的hub2_12的局部缓存中暂时存储;在第五拍时,A40通过四个输入端口在握手协议成功后分别将全带宽的数据存储到与其相连的一组,即4个leaf tile的局部缓存中,包括A50、A51、A52、A53;此时数据到达标记为A50的leaf tile48。由此,在数据通路顺畅的情况下,数据的按层级流水传输得以保证。
如图13所示,假设有64个leaf tile与一个central tile,通过x树以完全四叉树为拓扑连接,hub为非叶节点,而leaf tile为叶节点,将在树中高度相同的节点都逆时针依次增序,hub以其层数与序号相结合命名,如标记910为hub0_0,即第一层的0号节点,标记920为hub1_0,即第二层的0号节点,标记921为hub1_1,即第二层的1号节点。
如图13所示,在一实施例中,标记90的central tile多播数据来初始化全部的leaf tile,此时所有hub和leaf tile的局部缓存均为空,其数据准备接收信号均为高,即数据通路顺畅,按数据流水传输,在第一拍时,90与910握手成功;在第二拍时,910将数据从总线存入其局部缓存中暂时存储,当拍910与下一层的920、921、922和923握手成功;在第三拍时,920、921、922和923将数据从总线存入其局部缓存中暂时存储,当拍920与下一层的930、931、932和933握手成功,921与下一层的934、935、936和937握手成功,922与下一层的938、939、93a和93b握手成功,923与下一层的93c、93d、93e和93f握手成功;在第四拍时,930、931、932、933、934、935、936、937、938、939、93a、93b、93c、93d、93e和93f将数据从总线存入其局部缓存中暂时存储,930与下一层的940、941、942和943握手成功,931与下一层的944、945、946和947握手成功,932与下一层的948、949、950和951握手成功,933与下一层的952、953、954和955握手成功,934与下一层的956、957、958和959握手成功,935与下一层的960、961、962和963握手成功,936与下一层的964、965、966和967握手成功,937与下一层的968、969、970和971握手成功,938与下一层的972、973、974和975握手成功,939与下一层的976、977、978和979握手成功,93a与下一层的980、981、982和983握手成功,93b与下一层的984、985、986和987握手成功,93c与下一层的988、989、990和991握手成功,93d与下一层的992、993、994和995握手成功,93e与下一层的996、997、998和999握手成功,93f与下一层的9a0、9a1、9a2和9a3握手成功;在第五拍时,数据同时存储至所有leaf tile,940~9a3的局部缓存中,由此可见,数据从中心向叶子节点广播的数据在数据通路顺畅的情况下可以同时到达,数据的同步性得以实现。
在上一实例中,数据到达每个leaf tile时都是全带宽的,假设如图16所示,每个leaf tile的预设带宽均为16位数据,则其可以按照其id序号从全带宽的数据中取得对自己多播的数据,数据在全带宽中的位置为[id*16:id*16+15]。如id序号为63的数据D0,位于data[1023:1008],而id序号为0的数据D0,位于data[15:0]。
本申请还公开了一种用于稀疏连接的机器学习计算装置,具体的,该机器学习可以包括人工神经网络,包括:
映射单元,用于将输入数据转换成输入神经元、权值和连接数据,依据连接数据对该输入神经元进行筛选得到计算神经元,将该计算神经元存储在存储器或者缓存中;
存储器,用于存储计算神经元、权值和计算指令;
运算单元,用于根据所述存储装置中存储的计算指令对所述计算神经元以及权值执行相应的运算;所述运算单元主要执行三步运算,第一步是将计算神经元和权值数据相乘得到第一结果;第二步执行加法树运算得到第二结果,具体的,用于将第一步处理后的第一结果通过加法树逐级相加得到第二结果,或者将第一结果通过和偏置相加得到第二结果;第三步对第二结果执行激活函数运算,得到最终输出神经元。
上述运算单元具体可以包括:加法计算器、乘法计算器和激活计算器,其连接的关系如图2B所示,每个计算器对应一个流水级,此计算方式能够节省运算时间,加快运算。在另一种可选的实施例中,可以自由组合各流水部件或者采取一级流水级。例如将第二个流水级和第三个流水级合并,或者将第一、第二以及第三个流水级都合并,或者各个流水级负责不同的运算进行排列组合。例如,第一级流水负责比较运算和部分乘法运算,第二级流水负责非线性运算和矩阵标量乘法等组合。
其中,所述连接数据表示如下:
第一种情形:
采用1表示有连接,0表示无连接,每个输出神经元与所有输入神经元的连接状态组成一个0和1的字符串来表示该输出神经元的连接关系;或者
采用1表示有连接,0表示无连接,每个输入神经元与所有输出神经元的连接状态组成一个0和1的字符串来表示该输入神经元的连接关系;
第二种情形:
将一输出神经元第一个连接所在的位置距离第一个输入神经元的距离、所述输出神经元第二个输入神经元距离上一个输入神经元的距离,所述输出神经元第三个输入神经元距离上一个输入神经元的距离,依次类推,直到穷举所述输出神经元的所有输入神经元,来表示所述输出神经元的连接关系。
作为优选,所述人工神经网络计算装置还包括DMA,用于在所述存储装置和缓存中进行数据或者指令读写。
作为优选,所述人工神经网络计算装置还包括:
指令缓存,用于存储专用指令;以及
控制单元,用于从所述指令缓存中读取专用指令,并将其译码成各运算单元指令。
作为优选,所述人工神经网络计算装置还包括:
输入神经元缓存,用于缓存输入神经元到所述运算单元的输入神经元数据;以及
权值缓存,用于缓存权值数据。
作为优选,所述人工神经网络计算装置还包括:
输出神经元缓存,用于缓存所述运算单元输出的输出神经元。
作为优选,所述映射单元用于将输入数据转换成输入神经元和权值一一对应的存储方式,并输出神经元到所述运算单元,而不是存储在存储装置中。
作为优选,所述人工神经网络计算装置还包括输入神经元缓存和/或权值缓存,所述输入神经元缓存用于缓存输入神经元到所述运算单元的输入神经元数据,所述权值缓存用于缓存权值数据,所述映射单元用于将输入数据转换成输入神经元和权值一一对应的存储方式,并输出神经元到所述输入神经元缓存和/或权值缓存。
作为优选,所述运算单元在第三步执行的激活函数为sigmoid函数、tanh函数或ReLU函数。
本申请还公开了一种用于稀疏连接的人工神经网络的计算方法,该计算方法在如图26,图28或图30所示的装置中实现,包括以下步骤:
步骤1,将输入数据转换成输入神经元、权值和连接数据;其中,所述连接数据表示为:
第一种情形:
采用1表示有连接,0表示无连接,每个输出神经元与所有输入神经元的连接状态组成一个0和1的字符串来表示该输出神经元的连接关系;或者
采用1表示有连接,0表示无连接,每个输入神经元与所有输出神经元的连接状态组成一个0和1的字符串来表示该输入神经元的连接关系;
第二种情形:
将一输出神经元第一个连接所在的位置距离第一个输入神经元的距离、所述输出神经元第二个输入神经元距离上一个输入神经元的距离,所述输出神经元第三个输入神经元距离上一个输入神经元的距离,依次类推,直到穷举所述输出神经元的所有输入,来表示所述输出神经元的连接关系。
步骤2,依据连接数据将输入神经元进行筛选得到计算神经元,将计算神经元和权值数据相乘得到第一结果;
上述输入数据包括:输入神经元、权值和连接数据,直接包含在输入数据内,直接从输入数据内提取该输入神经元、权值以及连接数据即可。上述计算神经元可以依据连接数据将输入神经元进行筛选得到计算神经元。
上述筛选的实现方案具体可以为,例如假设输入神经元为4个,连接数据为1时表示连接,如果连接数据为如图18所示的1011,该输入神经元为i 1、i 2、i 3和i 4,则将无连接关系的第二神经元i 2删除得到计算神经元数据为:i 1、i 3和i 4。当然上述连接数据中的1也可以表示不连接,如果1表示不连接,那么无连接关系的i 1、i 3和i 4删除得到计算神经元数据为:i2。
步骤3,将第一结果执行加法树运算得到第二结果。
具体的实现方法有多种,例如,将第一结果执行加法树逐级相加得到第二结果。又如将第一结果加偏置得到第二结果;
步骤4,对第二结果执行激活函数运算,得到最终输出神经元;其中,所述激活函数为sigmoid函数、tanh函数或ReLU函数。
下面结合附图和具体实施例对本申请的技术方案进行进一步的阐释说明。
图17是根据本申请一个实施例的总体结构的示意性框图。
I/O接口1,用于I/O数据需要经过中央处理器CPU 3发给稀疏的多层人工神经网络运算装置,然后由稀疏的多层人工神经网络运算装置4写入存储器2,稀疏的多层人工神经网络运算装置4需要的专用程序也是由CPU 3传输到稀疏的多层人工神经网络运算装置4。
存储器2,用于暂存稀疏的多层人工神经网络模型和神经元数据,特别是当全部模型无法在稀疏的多层人工神经网络运算装置4上的缓存中存储时。
CPU 3,用于进行数据搬运以及稀疏的多层人工神经网络运算装置4启动停止等基本控制,作为稀疏的多层人工神经网络运算装置4与外部控制的接口。
稀疏的人工神经网络运算装置4,用于执行稀疏的多层人工神经网络运算单元,接受来自CPU3的数据和程序,执行上述稀疏的多层人工神经网络运算算法,稀疏的人工神经网络运算装置4的执行结果将传输回CPU 3。
通用系统结构:将稀疏的人工神经网络运算装置4作为CPU 3或者GPU的协处理器来执行稀疏的多层人工神经网络运算算法。
多个稀疏的人工神经网络运算装置互联的系统结构:多个稀疏的人工神经网络运算装置4可以通过PCIE总线互联,以支持更大规模的稀疏的多层人工神经网络运算,可以共用同一个宿主CPU或者分别有自己的宿主CPU,可以共享内存也可以每个处理器有各自的内存。此外其互联方式可以是任意互联拓扑。
对于一个稀疏连接的神经网络如图18所示,有4个输入神经元:i 1,i 2,i 3,i 4,有2个输出神经元:o 1,o 2。其中,o 1和i 1,i 3,i 4有连接,把连接的权值分别表示为w 11,w 31,w 41;o 2和i 2,i 3有连接,把连接的权值分别表示为w 22,w 32
有两种方法可以表示上面稀疏神经网络的连接关系,一种是每个输入神经元与输出神经元之间都用一位表示是否有连接,另一种是用连接之间的距离来表示每个连接的位置。
第一种连接表示:
对于图18中的神经网络,如图19所示,输出神经元o 1的连接关系为:1011,每一位表示是否与输入神经元有连接,1表示有连接,0表示无连接,输出神经元o 2的连接关系为0110。在运算时,连接关系为0所对应的输入神经元会筛选删除,即不会进行运算,具体的对于输出神经元o 1,其i 2会被筛选删除,对于o 2,其i 1、i 4会被筛选删除,这样在计算时无需对筛选的输入神经元进行计算。
在存储连接关系时,可以按照优先输入神经元或者输出神经元的顺序对连接关系进行存储。具体存储格式有以下几种:
格式一:将每个输出神经元的所有输入神经元依次摆放完,上面的例子摆放的顺序为10110110。
格式二:将每个输入神经元的所有的输出神经元依次摆放完,上面的例子摆放的顺序为10011110。
第二种连接表示:
比如对于图20中的神经网络,输出神经元o 1与输入神经元i 1,i 3,i 4相连接,那么连接关系为0,2,1。0表示第一个连接所在的位置距离第一个输入神经元的距离为0,即第一个输入神经元,2表示第二个输入神经元距离上一个输入神经元的距离为2,即表示第三个输入神经元,1表示第三个输入神经元距离上一个输入神经元的距离为1,即表示第四个输入神经元。同理,o 2的连接关系为1,1。
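两种连接表示之间的转换可以用如下Python草图示意(bitmask_to_distances为假设的函数名,仅为算法示意):

```python
def bitmask_to_distances(mask):
    """将第一种连接表示(0/1 位串)转换为第二种表示:
    第一个连接距第一个输入神经元的距离,其后每个连接距上一个连接的距离。
    例如 1011 -> [0, 2, 1],0110 -> [1, 1]。"""
    dists, prev = [], -1
    for pos, bit in enumerate(mask):
        if bit == 1:                                # 该位置有连接
            dists.append(pos if prev < 0 else pos - prev)
            prev = pos
    return dists
```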
本申请的映射单元包括但不限于以上的连接关系。
卷积神经网络是人工神经网络的一种,卷积层包含多个滤波器,也就是卷积核,这些卷积核重复的作用于所有输入图像上,提取局部特征。不同的卷积核能够提取出不同种类的局部特征,一副输入图像在经过卷积层之后就变成一些能够被更好理解的抽象特征。
自然图像有其固有特性,也就是说,图像的一部分的统计特性与其他部分是一样的。这也意味着在这一部分学习的特征也能用在另一部分上,所以对于这个图像上的所有位置,都能使用同样的学习特征。当从一个大尺寸图像中随机选取一小块,比如说8*8作为样本,并且从这个小块样本中学习到了一些特征,这时可以把从这个8*8样本中学习到的特征作为探测器,应用到这个图像的任意地方中去。特别是,可以用从8*8样本中学习到的特征跟原本的大尺寸图像做卷积,从而对这个大尺寸图像上的任意位置获得一个不同特征的激活值。这个8*8的样本特征被称作卷积核。上述卷积的计算可以参见如图6B实施例中的描述,这里不再赘述。
如图21是一个卷积操作的例子。卷积核是一个2*2的矩阵,卷积核在输入图像上滑动。
假设每次滑动一个像素点,则总共会有四次卷积操作。对于每次卷积操作,卷积核矩阵与对应的输入图像数据做乘加操作。
假设卷积核的权值变得稀疏,由之前的2*2,变成只有两个参数,如图22所示。则对于输出神经元o 0来说,需要的输入神经元为i 0,i 1,i 3,i 4,输入权值为:w 0,w 3,连接关系为1001或者0,2;
对于输出神经元o 3来说,需要的输入神经元为i 3,i 5,i 7,i 8,输入权值为:w 0,w 3,连接关系为1001或者0,2。
由此可见,对于同个输出神经元特征图上的不同的输出神经元,所需要的输入神经元不同,权值和连接关系是相同的。
可执行稀疏连接的人工神经网络运算装置可以处理各种稀疏连接表示的稀疏连接的人工神经网络,可执行稀疏连接的人工神经网络运算装置中有一个专门用于处理稀疏连接的单元,在这里称为映射单元,对于不同的稀疏连接关系和处理方法,稀疏连接的人工神经网络运算装置结构会略有不同,下面将分别描述不同的结构和方法。
结构和方法一:
如图23所示,映射单元1,用来将输入数据转换成输入神经元、权值和连接数据。
存储器2,用来存储数据和指令,尤其是神经网络规模很大的时候,指令缓存4、输入神经元缓存6、输出神经元缓存9、权值缓存8放不下这么多数据,只能将数据临时存放在存储器2。
DMA3,用来将存储装置中的数据或者指令搬到各个缓存中。
指令缓存4,用来存储专用指令。
控制单元5,从指令缓存4中读取专用指令,并将其译码成各运算单元指令。
输入神经元缓存6,用来存储运算的输入神经元数据。
运算单元7,用于执行具体的运算。运算单元主要被分为三个阶段,第一阶段执行乘法运算,用于将输入的神经元和权值数据相乘。第二阶段执行加法树运算,第一、二两阶段合起来完成了向量内积运算。第三阶段执行激活函数运算,激活函数可以是sigmoid函数、tanh函数等。第三阶段得到输出神经元,写回到输出神经元缓存。
权值缓存8,用来存储权值数据。
输出神经元缓存9,用来存储运算的输出神经元。
映射单元的结构如图24所示。
以上面稀疏连接的神经网络为例,连接关系可以是上述的两种稀疏表示之一,映射单元会根据连接关系,将输入神经元和输入权值按照连接关系输出映射后的神经元和权值,映射后的神经元和权值可以在运算时被直接使用而不需要考虑连接关系,对于输出神经元o 1映射的具体过程如下:
输入神经元为:i 1,i 2,i 3,i 4,输入权值为:w 11,w 31,w 41,连接关系可以为:1011,或0,2,1。映射单元根据连接关系,将输入神经元和权值变成相对应的关系,输出有两种情况:一种是去除掉没有连接的输入神经元,则映射后的神经元为i 1,i 3,i 4,映射后的权值为w 11,w 31,w 41;另一种是权值在没有连接的地方补成0,则映射后的神经元为i 1,i 2,i 3,i 4,映射后的权值为w 11,0,w 31,w 41
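上述映射单元的两种输出情况,可以用如下Python草图示意(map_neurons为假设的函数名,仅为算法示意):

```python
def map_neurons(neurons, weights, mask, mode="prune"):
    """映射单元示意:按连接关系 mask(1 表示有连接)输出映射后的神经元和权值。
    mode="prune" 去除无连接的输入神经元;mode="pad" 在无连接处将权值补 0。
    weights 只给出有连接位置上的权值。"""
    it = iter(weights)
    if mode == "prune":
        mapped_n = [n for n, m in zip(neurons, mask) if m]
        mapped_w = [next(it) for m in mask if m]
    else:  # pad:保留全部输入神经元,无连接处权值补 0
        mapped_n = list(neurons)
        mapped_w = [next(it) if m else 0 for m in mask]
    return mapped_n, mapped_w
```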
运算单元可以包括三个部分,第一部分乘法器,第二部分加法树,第三部分为非线性函数单元。第一部分将输入神经元(in)通过和权值(w)相乘得到加权输出神经元(out),过程为:out=w*in;第二部分将加权输出神经元通过加法树逐级相加,另外还可以将输出神经元(in)通过和偏置(b)相加得到加偏置输出神经元(out),过程为:out=in+b;第三部分将输出神经元(in)通过激活函数(active)运算得到激活输出神经元(out),过程为:out=active(in),激活函数active可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可以实现其他的非线性函数,可将输入神经元(in)通过运算(f)得到输出神经元(out),过程为:out=f(in)。
运算过程如图25所示。
结构和方法二:
如图26所示,存储装置1,用来存储数据和指令,尤其是神经网络规模很大的时候,指令缓存3、输入神经元缓存6、输出神经元缓存9、权值缓存8放不下这么多数据,只能将数据临时存放在存储装置1。
DMA2,用来将存储装置中的数据或者指令搬到各个缓存中。
指令缓存3,用来存储专用指令。
控制单元4,从指令缓存3中读取专用指令,并将其译码成各运算单元指令。
映射单元5,用来将输入数据转换成输入神经元和权值一一对应的存储方式。
输入神经元缓存6,用来存储运算的输入神经元数据。
运算单元7,用于执行具体的运算。运算单元主要被分为三个阶段,第一阶段执行乘法运算,用于将输入的神经元和权值数据相乘。第二阶段执行加法树运算,第一、二两阶段合起来完成了向量内积运算。 第三阶段执行激活函数运算,激活函数可以是sigmoid函数、tanh函数等。第三阶段得到输出神经元,写回到输出神经元缓存。
权值缓存8,用来存储权值数据。
输出神经元缓存9,用来存储运算的输出神经元。
映射单元的结构如图27所示。
以上述稀疏连接的神经网络为例,连接关系可以是上述的两种稀疏表示之一,映射单元会根据连接关系,将输入神经元和输入权值按照连接关系输出映射后的神经元和权值,映射后的神经元和权值可以在运算时被直接使用而不需要考虑连接关系,对于输出神经元o 1映射的具体过程如下:
输入神经元为:i 1,i 2,i 3,i 4,输入权值为:w 11,w 31,w 41,连接关系可以为:1011,或0,2,1。映射单元根据连接关系,将输入神经元和权值变成相对应的关系,输出有两种情况:一种是去除掉没有连接的输入神经元,则映射后的神经元为i 1,i 3,i 4,映射后的权值为w 11,w 31,w 41;另一种是权值在没有连接的地方补成0,则映射后的神经元为i 1,i 2,i 3,i 4,映射后的权值为w 11,0,w 31,w 41
结构和方法一和结构方法二中的映射单元的主要区别是结构和方法一中的映射单元是在计算之前事先把输入神经元和权值映射好后存储在存储装置中,结构和方法二是在计算中进行映射,将映射好的数据直接给运算单元进行运算。
结构和方法三:
基于结构和方法二稍作修改可以改成如图28所示的结构,映射单元只对输入神经元进行映射。
此时,映射单元的结构图如图29所示。
对于输出神经元o1映射的具体过程如下:
输入神经元为:i 1,i 2,i 3,i 4,连接关系可以为:1011,或者:0,2,1。映射单元根据连接关系,将输入神经元和权值变成相对应的关系,去除掉没有连接的输入神经元,则映射后的神经元为i 1,i 3,i 4
结构和方法四:
基于结构和方法二稍作修改可以改成如图30所示的结构,映射单元只对输入权值进行映射。
此时,映射单元的结构图如图31所示。
对于输出神经元o1映射的具体过程如下:
输入权值为:w 11,w 31,w 41,连接关系可以为:1011,或者:0,2,1。映射单元根据连接关系,将输入神经元和权值变成相对应的关系,映射后的权值为w 11,0,w 31,w 41
如图32所示,本申请还提供了一种神经网络的处理系统100,在一种可选的实施方案中,该神经网络的处理系统100具体可以为如图6A所示的计算装置,相对于如图6A所示的计算装置,增加了一个或者多个算数逻辑单元,该多个算数逻辑单元,用于执行非线性运算,在一种可选实施例中,如图6A所示的计算装置还可以扩展如图32所示的神经网络的处理系统中的单元或模块。在另一种实施例中,该系统包括至少一片上存储介质10、至少一片内地址索引模块20、多核心处理模块30以及一个或者多个算数逻辑单元(Arithmetic Logic Unit,ALU)模块40。多核心处理模块30包括多个核心处理子模块31。其中片内地址索引模块20与片上存储介质10连接,片内地址索引模块20、多核心处理模块30以及ALU模块40分别相互连接。多核心处理模块30用于执行神经网络运算中的向量乘加操作,多个ALU模块40用于从多核心处理模块30或片上存储介质10获取输入数据执行多核心处理模块30无法完成的非线性运算,在本实施例中,多个核心处理子模块31共享片上存储介质10以及ALU模块40。
片上存储介质10,用于存储神经网络处理系统外部传来的数据或用于存储处理过程中产生的数据。该处理过程中产生的数据包括处理过程中产生的处理结果或中间结果。这些结果可能来自处理器的片内核心 运算模块,也可能来自其他运算部件,如本申请中ALU模块40。该片上存储介质10可以是静态随机存储器(Static Random Access Memory,SRAM),动态随机存储器(Dynamic Random Access Memory,DRAM),增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,e-DRAM),寄存器堆(Register file,RF)等常见存储介质,也可以是新型的存储器件,如非易失存储器(Non-Volatile Memory,NVM)或者3D存储器件等等。
片内地址索引模块20,用于在执行运算时候根据输入的索引映射至正确的存储地址以将正确的数据送至多核心处理模块30进行处理。从而使得数据和片上存储介质可以正确的交互。这里的地址映射过程包括直接映射,算术变换等。该索引模块可以通过硬件电路(包括但不限于FPGA、CGRA、专用集成电路ASIC、模拟电路和忆阻器等)实现。
多核心处理模块30包括多个核心处理子模块31,用于执行神经网络运算中的向量乘加操作。具体的,多核心处理模块30完成神经网络算法中的大部分运算,均为线性运算,即乘加操作。每个核心处理模块31的结构可以多种,例如一维处理单元(processing element,PE)实现方式,二维PE或者多维实现方式。单个核心处理模块31本身不局限于特定实施原则,包括不同的实现方法,如systolic方案,矩阵向量乘加操作符。且多核心处理模块30的多个核心处理子模块31之间可以为同构设计或异构设计。该处理模块可以通过硬件电路(包括但不限于FPGA、CGRA、专用集成电路ASIC、模拟电路和忆阻器等)实现。
ALU模块40,用于从多核心处理模块30或片上存储介质获取输入数据执行核心处理模块无法完成的非线性运算。该模块可以通过硬件电路(包括但不限于FPGA、CGRA、专用集成电路ASIC、模拟电路和忆阻器等)实现。在本申请中,多核心处理模块30、ALU模块40与片上存储介质10的数据通路包括但不局限于H-TREE,或者FAT-TREE等互联技术。
在本申请中,多个核心处理子模块31共同复用部分输入以减少带宽需求,所述神经网络的处理系统100进行处理时,将同一输入神经元分别发送至多核心处理模块30的多个核心处理子模块31,而将不同的输入权值分配至不同的核心处理模块31,多个核心处理子模块31分别将输入神经元和输入权值进行向量内积(乘加和)操作后得到不同的输出神经元。不同的输出神经元对应不同的权值,也即对于处理不同的输出神经元,输入神经元是相同的,权值则不同。在本申请中,权值大部分情况下不可被多个核心复用,然而在某些情况下,如多个核心共同处理同一个特征图时,权值也可以被复用。
本申请针对神经网络处理系统的核心处理部分通过提升片上核心处理模块的数目从而提升神经网络算法中的核心运算部分处理速度,使得处理器获得更高的性能。核心处理指的是神经网络算法中占据大部分处理时间的向量乘加操作。从而本申请能够提升神经网络处理系统的运算速度,使得神经网络处理系统性能更高,更加高效。
图33是本申请一种神经网络的处理系统的另一实施例的结构框图,其与图32中神经网络的处理系统的区别是,图32中神经网络的处理系统是采用松耦合设计,而图33中神经网络的处理系统采用紧耦合设计。在图33中,神经网络的处理系统200包括多个片上存储介质201,多个片内地址索引模块202,多个核心处理模块203以及多个ALU模块204,其中每个核心处理模块203具有单独的输入接口和输入结构,其中ALU模块204也被划分,存在于每个核心中。
在图32中,多个核心处理子模块31只完成特定的核心操作,本身不具有更多的功能,多核处理核心共享片上存储介质10和ALU模块40。与之相比,在图33的紧耦合设计中,每个核心处理模块203具有自己独立的片上存储介质201和ALU模块204。在图32所示的松耦合设计中多个核心可以协同处理,更易实现更高的性能需求,然而每个核缺少灵活性;在如图33所示的紧耦合设计中每个核心具有一定的灵活性,然而由于每个核的独立性也使得多核协同的复杂度更高,使得控制的复杂度增加。松耦合多适用于多核同构的设计,紧耦合则多适用于多核异构的设计。
在本申请中,神经网络可以根据多核处理模式设计进行神经网络的划分,其中包括从输入神经元进行划分,输出神经元划分和权值连接进行划分。神经网络的划分是对于神经网络处理模式的分解,并不是将神经网络划分成为独立的子网,也即划分是算法层面的划分,是软件或者编译器完成的操作,其目的是将处理划分成为可以在多个核心处理的多个部分。
图34是本申请一种实施例中神经网络划分的示意图;图35是本申请另一实施例中神经网络划分的示意图;图36是本申请又一实施例中神经网络划分的示意图。
在神经网络的处理中,卷积层是按照特征图进行组织,也即输入是多个图,输出是多个图。在图34中,对于二维或者多维运算,从输出角度可按照每个核处理一层输出特征图进行神经网络划分。图34中包括输入特征图1、输入特征图2、核心处理模块1、核心处理模块2、输出特征图1、输出特征图2,每个特征图为二维矩阵。在进行处理时,将输入特征图1、2分别发送至核心处理模块1、2,核心处理模块1处理输出特征图1,核心处理模块2处理输出特征图2,核心处理模块1和核心处理模块2分别处理一层输出特征图。也即,在进行二维或多维处理时,将输入特征图分别发送至多个核心处理模块,多个核心处理模块分别处理一层输出特征图。多个核心处理模块均分别完成当前输出特征图的处理后,多核心处理模块再执行新的输出特征图处理,也即只有当所有的核完成当前的输出特征图处理后才会进行新的特征图处理。
在实际应用中,输入特征图、核心处理模块、输出处理模块均可以有多个。下面以2个核(核#1、核#2)、4个输出特征图(输出特征图#1、#2、#3、#4)、4个输入特征图(输入特征图#1、#2、#3、#4)为例说明多核心处理模块的处理方式:处理开始后,核#1负责处理输出特征图#1,核#2负责处理输出特征图#2,输入特征图#1被送入核#1和核#2(也即共享输入特征图#1),同时相应的权值也被送入核#1和核#2进行处理;当输入特征图#1处理完成后,输入特征图#2被从片上存储读取,送入核#1和核#2进行处理(同样读取权值);当核#1和核#2完成输出特征图#1和#2的处理后,核#1和核#2则开始处理输出特征图#3和#4,也即重复以上的操作过程。
如图35所示,对于二维或者多维运算,从输出角度也可按照每个核处理同一输出特征图的不同区域进行神经网络划分。不同的核负责处理同一特征图的不同区域,相应的输入则被送至每一个核中,权值则根据相应的连接进行读取,这里权值有可能存在复用,如卷积神经网络中的卷积层。只有当所有的核完成当前的输出特征图处理后才会进行新的特征图处理。在图35中,输入特征图1和输入特征图2均送入核心处理模块1和核心处理模块2,核心处理模块1负责处理输出特征图1的区域1和输出特征图2的区域1,核心处理模块2负责处理输出特征图1的区域2和输出特征图2的区域2。从而,在执行二维或者多维运算时,将输入特征图分别发送至多个核心处理模块,多个核心处理模块分别处理同一输出特征图的不同区域,多个核心处理模块均分别完成当前输出特征图的处理后,多核心处理模块再执行新的输出特征图处理。
如图36所示,对于一维运算,从输出角度按照每个核心处理模块处理输出的一部分进行神经网络划分。每个核负责处理不同的神经元,这里的划分方式则可以多种多样,并不局限于图36所示的划分方法。输入被送至每一个核心处理模块中,权值则根据相应的连接进行读取,只有当所有的核心处理模块完成当前的输出特征图处理后才会进行新的特征图处理。也即神经网络处理系统在执行一维运算时,将同一输入分别发送至多个核心处理模块,多个核心处理模块分别处理不同的输出神经元,多个核心处理模块均分别完成当前输出神经元的处理后,再执行新的输入的处理。
神经网络划分包括从输入神经元进行划分,输出神经元划分和权值连接进行划分。本申请按照输出神经元进行划分,输出神经元需要多个甚至全部输入神经元参与处理,而输出神经元的处理多数情况下彼此独立。按照输出神经元划分可以复用输入神经元,降低带宽需求,从而使得处理器更加高效。
图37是本申请一种神经网络的处理方法的流程图,该方法在如图2、图5或图6A所示的计算装置中实现,此时该计算装置中包括多个ALU,包括:
步骤S601,片内地址索引模块根据输入的索引映射至正确的存储地址;
步骤S602,根据存储地址从片上存储介质中获取输入数据;
步骤S603,将输入数据发送至多核心处理模块或所述ALU模块;
步骤S604,多核心处理模块执行神经网络运算中的向量乘加操作,ALU模块根据多核心处理模块的处理结果或者从片上存储介质中获取的输入数据执行多核心处理模块无法完成的非线性运算;
步骤S605,将处理过程中产生的数据缓存至片上存储介质。
优选的是,所述方法还包括:将同一输入神经元分别发送至多个核心处理模块,将不同的输入权值分配至不同的核心处理模块,多个核心处理模块分别将输入神经元和输入权值进行向量内积操作后得到不同的输出神经元。
综上所述,本申请针对神经网络处理系统的核心处理部分通过提升片上核心处理模块的数目从而提升神经网络算法中的核心运算部分处理速度,使得处理器获得更高的性能。核心处理指的是神经网络算法中占据大部分处理时间的向量乘加操作。从而本申请能够提升神经网络处理系统的运算速度,使得神经网络处理系统性能更高,更加高效。
根据本申请实施例的支持离散数据表示的多层人工神经网络的正向运算,包括两层或者两层以上的多个神经元。对于每一层来说,输入神经元向量首先和权值向量进行点积运算,结果经过激活函数得到输出神经元。其中激活函数可以是sigmoid函数,tanh、relu、softmax函数等,支持将激活后的输出神经元离散化表示或连续化表示。
对于离散数据表示的输入神经元向量或离散数据表示的权值向量的点积运算,本装置支持将点积运算转换为数据的移位、取非、异或等位运算。对于数据的表示方式,本装置支持数据离散表示或非离散表示,用户可以自定义哪一个层的哪些数据采用离散表示形式或非离散表示,并且可以根据具体需要自定义离散数据的位数,从而代替表示的真实数据的个数,例如设定为1比特、2比特、3比特等位数的离散数据,分别可以表示2个、4个、8个真实数据。
图38示出了根据本申请实施例的用于执行支持离散数据表示的人工神经网络正向运算的装置的整体结构的示例框图。如图38所示,该装置在一种可选实施例中,可以为如图6A所示的计算装置,可选的,在如图6A所示的计算装置内还可以添加连续离散转换模块,用于将连续数据与离散数据的互换,其与数据访问单元连接实现数据互通,在一种可选实施例中,如图6A所示的计算装置还可以扩展或增加如图38所示的装置的模块或单元。在另一种可选实施例中,该装置包括指令缓存单元1、控制器单元2、数据访问单元3、互联模块4、主运算模块5和多个从运算模块6,可选地还包括连续离散转换模块7。指令缓存单元1、控制器单元2、数据访问单元3、互联模块4、主运算模块5和从运算模块6、连续离散转换模块7均可以通过硬件电路(例如包括但不限于FPGA、CGRA、专用集成电路ASIC、模拟电路和忆阻器等)实现。特别的,本装置可以对离散数据提供存储和运算支持。
指令缓存单元1通过数据访问单元3读入指令并缓存读入的指令。
控制器单元2从指令缓存单元1中读取指令,将指令译成控制其他模块行为的微指令,所述其他模块例如数据访问单元3、主运算模块5和从运算模块6等。
数据访问单元3能够访存外部地址空间,直接向装置内部的各个缓存单元读写数据,完成数据的加载和存储。该数据是离散表示的或非离散表示的。该单元用来设计可以读取离散表示的数据。
互联模块4用于连接主运算模块和从运算模块,可以实现成不同的互连拓扑(如树状结构、环状结构、网格状结构、分级互连、总线结构等)。
图39示意性示出了互联模块4的一种实施方式:树型模块。树型模块4构成主运算模块5和多个从运算模块6之间的数据通路,并具有树型的结构。可选的,该树型模块可以为n叉树结构,例如如图39所示的二叉树通路,每个节点将上游的数据同样地发给下游的两个节点,将下游的两个节点返回的数据进行合并,并返回给上游的节点。例如,在每层人工神经网络开始计算阶段,主运算模块5内的神经元数据(该数据可以是离散表示或非离散表示的)通过树型模块4发送给各个从运算模块6;当从运算模块6的计算过程完成后,每个从运算模块输出的神经元的值会在树型中逐级拼成一个完整的由神经元组成的向量,作为中间结果向量。针对于离散数据表示的运算,我们特别提到了在主从运算模块内部的专用于离散数据运算的运算模块(见图44)。以神经网络全连接层进行说明,假设装置中共有N个从运算模块,则中间结果向量按N分段,每段有N个元素,第i个从运算模块计算每段中的第i个元素。N个元素经过树型模块拼成长度为N的向量并返回给主运算模块。所以如果网络只有N个输出神经元,则每个从运算单元只需输出单个神经元的值,若网络有m*N个输出神经元,则每个从运算单元需输出m个神经元值。树型模块在存储和传输数据的过程中均支持离散数据表示。
图40示出了根据本申请实施例的用于执行人工神经网络正向运算的装置中主运算模块5的结构的示例框图。如图40所示,主运算模块5包括运算单元51、数据依赖关系判断单元52和支持离散数据表示的神经元缓存单元53。
支持离散数据表示的神经元缓存单元53用于缓存主运算模块5在计算过程中用到的输入数据和输出数据。
运算单元51完成主运算模块5的各种运算功能。对于运算因子全是离散数据的情况,可以通过查表实现离散数据与离散数据的加减乘除运算。例如2位的离散数据,可以表示4个连续数据值。对于4个连续数据共有4*4=16种组合。对于每种加减乘除运算的操作,可以制作并维护该4*4的索引表,通过索引表找到对应的计算值。4种运算共需要4张4*4的索引表。
对于运算因子包含离散数据和连续数据的情况,可以针对不同离散数据,为加、减、乘、除运算预先设定相应的位操作。例如,可以采取按位异或后乘2的相应位次幂之后累加求和的方式代替离散数据与连续数据的点积运算。例如,对于乘法操作,乘法因子数据如果存在离散表示的,可以通过离散数据索引相应的操作(例如,对相应数据的按位异或、取非、移位等操作)代替和该离散数据表示的连续数据的乘法操作,从而减少了乘法器部件数量。例如对于连续数据与离散数据的乘法操作,-1/2乘以16,传统的乘法器部件会将-1/2与16直接做乘法。在运算单元51中,由于离散数据的可能性较少,可以通过查找索引这样一种开关判断的方法代替运算单元的功能。例如,可以规定-1/2的离散数据表示方法为01。如果一个运算因子是-1/2,则运算单元51接收到的离散数据为01,运算单元51便采用离散数据01对应的操作:对16的8位定点数表示00010000符号位取反,向右移1位得到10001000,十进制表示为-8。对于除法操作,例如16除以-2,其中16是连续数据,-2是离散数据,如果规定离散数据-2的二进制表示为10,运算单元便采用离散数据10对应的除法操作:对16的8位定点数表示00010000右移1位之后符号位取反得到10001000,十进制表示为-8,即得到结果。加法和减法操作与上述过程类似。根据离散数据的二进制作为一个索引,索引到按位左移、右移、异或等操作,经过该操作后实现了与离散数据表示的真实数据的相加或者相减操作。
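以上述编码为例,用"查表索引到符号取反与移位操作"代替离散数据与连续数据的乘法,可以用如下Python草图示意(编码表中00、11两项为补充的示例假设,01、10两项取自正文;函数名discrete_multiply亦为假设):

```python
# 离散数据与连续数据相乘的位操作示意:
# 每个离散编码索引到"是否符号取反 + 移位位数", 代替真正的乘法器。

DISCRETE_OPS = {
    '00': (False, 0),   # ×(+1):  不取反, 不移位(示例假设)
    '01': (True, -1),   # ×(-1/2): 符号取反, 右移1位(正文示例)
    '10': (True, 1),    # ×(-2):  符号取反, 左移1位(正文示例)
    '11': (False, 1),   # ×(+2):  不取反, 左移1位(示例假设)
}

def discrete_multiply(x, code):
    """用查表得到的符号/移位操作, 代替 x 与离散编码所表示真实值的乘法。"""
    negate, shift = DISCRETE_OPS[code]
    y = -x if negate else x
    # 左移相当于乘2的正次幂, 右移相当于乘2的负次幂
    return y * (2 ** shift)

print(discrete_multiply(16, '01'))  # -8.0, 即 16 × (-1/2)
```

这样,硬件只需移位器与取反逻辑即可完成与离散权值的乘法,减少乘法器部件数量。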
依赖关系判断单元52是运算单元51读写神经元缓存单元53的端口,同时能够保证神经元缓存单元中数据的读写一致性。同时,数据依赖关系判断单元52也负责将读取数据通过互联模块4发送给从运算模块,而从运算模块6的输出数据通过互联模块4直接发送给运算单元51。控制器单元2输出的指令发送给计算单元51和数据依赖关系判断单元52,来控制其行为。
图41示出了根据本申请实施例的用于执行支持离散数据表示的人工神经网络正向运算的装置中从运算模块6的结构的示例框图。如图41所示,每个从运算模块6包括运算单元61、数据依赖关系判定单元62、支持离散数据表示的神经元缓存单元63和支持离散数据表示的权值缓存单元64。
运算单元61接收控制器单元2发出的微指令并进行算数逻辑运算。对于运算因子全是离散数据的情况,可以通过查表实现离散数据与离散数据的加减乘除运算。例如2位的离散数据,可以表示4个连续数据值。对于4个连续数据共有4*4=16种组合。对于每种加减乘除运算的操作,可以制作并维护该4*4的索引表,通过索引表找到对应的计算值。4种运算共需要4张4*4的索引表。
对于运算因子包含离散数据和连续数据的情况,可以针对不同离散数据,为加、减、乘、除运算预先设定相应的位操作。例如,可以采取按位异或后乘2的相应位次幂之后累加求和的方式代替离散数据与连续数据的点积运算。例如,对于乘法操作,乘法因子数据如果存在离散表示的,可以通过离散数据索引相应的操作(例如,对相应数据的按位异或、取非、移位等操作)代替和该离散数据表示的连续数据的乘法操作,从而减少了乘法器部件数量。例如对于连续数据与离散数据的乘法操作,-1/2乘以16,传统的乘法器部件会将-1/2与16直接做乘法。在运算单元61中,由于离散数据的可能性较少,可以通过查找索引这样一种开关判断的方法代替运算单元的功能。例如,可以规定-1/2的离散数据表示方法为01。如果一个运算因子是-1/2,则运算单元61接收到的离散数据为01,运算单元61便采用离散数据01对应的操作:对16的8位定点数表示00010000符号位取反,向右移1位得到10001000,十进制表示为-8。对于除法操作,例如16除以-2,其中16是连续数据,-2是离散数据,如果规定离散数据-2的二进制表示为10,运算单元便采用离散数据10对应的除法操作:对16的8位定点数表示00010000右移1位之后符号位取反得到10001000,十进制表示为-8,即得到结果。加法和减法操作与上述过程类似。根据离散数据的二进制作为一个索引,索引到按位左移、右移、异或等操作,经过该操作后实现了与离散数据表示的真实数据的相加或者相减操作。
数据依赖关系判断单元62负责计算过程中对神经元缓存单元的读写操作。数据依赖关系判断单元62执行读写操作之前会首先保证指令之间所用的数据不存在读写一致性冲突。例如,所有发往数据依赖关系单元62的微指令都会被存入数据依赖关系单元62内部的指令队列里,在该队列中,读指令的读取数据的范围如果与队列位置靠前的写指令写数据的范围发生冲突,则该指令必须等到所依赖的写指令被执行后才能够执行。
支持离散数据表示的神经元缓存单元63缓存该从运算模块6的输入神经元向量数据和输出神经元值数据。该数据可以以离散数据的形式存储和传输。
支持离散数据表示的权值缓存单元64缓存该从运算模块6在计算过程中需要的权值数据。该数据根据用户定义可以是离散表示的或不是。对于每一个从运算模块6,都只会存储全部输入神经元与部分输出神经元之间的权值。以全连接层为例,输出神经元按照从运算单元的个数N进行分段,每段的第n个输出神经元对应的权值存放在第n个从运算单元中。
从运算模块6实现每层人工神经网络正向运算过程中可以并行的前半部分。该模块中的数据存储以及运算都支持离散数据表示。以人工神经网络全连接层(MLP)为例,过程为y=f(wx+b),其中权值矩阵w和输入神经元向量x的乘法可以划分为不相关的并行计算子任务,out与in是列向量,每个从运算模块6只计算in中相应的部分标量元素与权值矩阵w对应的列的乘积,得到的每个输出向量都是最终结果的一个待累加的部分和,这些部分和在互联模块4中逐级两两相加得到最后的结果。这个结果可以是离散数据表示的。所以计算过程变成了并行的计算部分和的过程和后面的累加的过程。每个从运算模块6计算出输 出神经元值,所有的输出神经元值在互联模块4中拼成得到中间结果向量。每个从运算模块6只需要计算出中间结果向量y中与本模块对应的输出神经元值即可。互联模块4对所有从运算模块6输出的神经元值求和,得到最终的中间结果向量y。主运算模块5基于中间结果向量y进行后续计算,比如加偏置、池化(例如最大值池化(MAXPOOLING)或平均值池化(AVGPOOLING)等)、做激活和做采样等。
图45示出了运算单元的结构框图,其可用于主运算模块中的运算单元51或从运算模块中的运算单元61。运算过程中输入数据可以是离散数据或连续数据。数据类型判断单元71判断输入数据全是连续数据、全是离散数据或是既包含连续数据又包含离散数据的混合数据。当输入数据全是连续数据时,连续数据运算单元72执行相应运算。
当输入数据全是离散数据时,离散数据运算单元73执行相应运算。对于运算因子全是离散数据的情况,可以通过查表实现离散数据与离散数据的加减乘除运算。例如2位的离散数据,可以表示4个连续数据值。对于4个连续数据共有4*4=16种组合。对于每种加减乘除运算的操作,我们制作并维护该4*4的索引表,通过索引表找到对应的计算值。4种运算共需要4张4*4的索引表。
当输入数据是混合数据时,运算决定单元74根据其中的离散数据决定应对其执行何种操作。可以针对不同的离散数据分别预先设置相应操作。然后,混合数据运算单元75根据运算决定单元74的决定结果,执行相应操作。对于运算因子包含离散数据和连续数据的情况,可以针对不同离散数据,为加、减、乘、除运算预先设定相应的位操作。例如,可以采取按位异或后乘2的相应位次幂之后累加求和的方式代替离散数据与连续数据的点积运算。例如,对于乘法操作,乘法因子数据如果存在离散表示的,可以通过离散数据索引相应的操作(例如,对相应数据的按位异或、取非、移位等操作)代替和该离散数据表示的连续数据的乘法操作,从而减少了乘法器部件数量。例如对于连续数据与离散数据的乘法操作,-1/2乘以16,传统的乘法器部件会将-1/2与16直接做乘法。在运算单元51中,由于离散数据的可能性较少,可以通过查找索引这样一种开关判断的方法代替运算单元的功能。例如,可以规定-1/2的离散数据表示方法为01。如果一个运算因子是-1/2,则运算单元51接收到的离散数据为01,运算单元51便采用离散数据01对应的操作:对16的8位定点数表示00010000符号位取反,向右移1位得到10001000,十进制表示为-8。对于除法操作,例如16除以-2,其中16是连续数据,-2是离散数据,如果规定离散数据-2的二进制表示为10,运算单元便采用离散数据10对应的除法操作:对16的8位定点数表示00010000右移1位之后符号位取反得到10001000,十进制表示为-8,即得到结果。加法和减法操作与上述过程类似。根据离散数据的二进制作为一个索引,索引到按位左移、右移、异或等操作,经过该操作后实现了与离散数据表示的真实数据的相加或者相减操作。
图46示出了连续离散转换单元。用户可以定义采用该模块将连续数据转换为离散数据或不采用。输入连续数据,输出离散数据。该单元包括随机数产生模块、判断模块、运算模块。对于输入的连续数据,通过运算模块得到运算后的结果,经由判断模块用随机数与运算后的结果比较,判断随机数落在哪一个区间,从而决定出输出的离散数据的具体值。例如用户定义产生二元离散数据,对于输入的任意连续数据x,经由运算模块计算出结果y=abs(clip(x,-1,1));之后通过判断模块,如果随机数大于y,则输出的离散数据是1,反之输出的离散数据是0。离散数据1和0分别代表了连续数据的-1和+1。将得到的离散数据存储回内存中,等待主从运算模块中的运算单元使用,产生相应的操作。
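按正文描述的二元离散化规则,连续离散转换可以用如下Python草图示意(函数名continuous_to_discrete为假设;rng参数用于注入随机数源以便测试,亦属示例假设):

```python
import random

def continuous_to_discrete(x, rng=random.random):
    """连续数据随机二值化为离散数据的示意:
    先经运算模块得到 y = abs(clip(x, -1, 1)), 再由判断模块用随机数与y比较。
    随机数大于y时输出1(代表连续数据-1), 否则输出0(代表连续数据+1)。"""
    y = abs(max(-1.0, min(1.0, x)))   # 运算模块: y = abs(clip(x, -1, 1))
    r = rng()                         # 随机数产生模块
    return 1 if r > y else 0          # 判断模块

# 示例: 用固定的"随机数"0.5观察判断结果
print(continuous_to_discrete(1.0, rng=lambda: 0.5))  # 0
print(continuous_to_discrete(0.0, rng=lambda: 0.5))  # 1
```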
正向过程中的权值数据、输出输入数据可以采用离散数据表示或不采用。对于连续数据的乘法操作,可以通过基于离散数据的异或、取非、位移等方式代替连续数据的乘法操作。例如权值用1比特离散数据表示,0代表+1,1代表-1,通过对与权值相乘数据的符号位异或,实现了对权值的乘法运算。
根据本申请实施例,还提供了在前述装置上执行人工神经网络正向运算的指令集。指令集中包括CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令和MOVE指令等,其中:
CONFIG指令在每层人工神经网络计算开始前配置当前层计算需要的各种常数;
COMPUTE指令完成每层人工神经网络的算术逻辑计算;
IO指令实现从外部地址空间读入计算需要的输入数据以及在计算完成后将数据存回至外部空间,该数据支持离散化表示;
NOP指令负责清空当前装置内部所有微指令缓存队列中的微指令,保证NOP指令之前的所有指令全部执行完毕。NOP指令本身不包含任何计算操作;
JUMP指令负责控制器将要从指令缓存单元读取的下一条指令地址的跳转,用来实现控制流的跳转;
MOVE指令负责将装置内部地址空间某一地址的数据搬运至装置内部地址空间的另一地址,该过程独立于运算单元,在执行过程中不占用运算单元的资源。
图42示出了根据本申请实施例的神经网络正向运算过程的示例框图。在不同从运算模块6中,输入神经元向量分别与该从运算模块6的权值向量进行点积运算,得到对应的输出神经元值,所有这些输出神经元值组成中间结果向量,该中间结果向量经过加偏置向量以及激活运算得到该层神经网络的最终输出神经元向量,公式描述为out=f(w*in+b),其中out是输出神经元向量、in是输入神经元向量、b是偏置向量,w是权值矩阵,f是激活函数。每个从运算模块6的权值向量是权值矩阵中与该从运算模块6相对应的列向量。互联模块将输入神经元向量[in0,…,inN]发送给所有的从运算单元,暂存在神经元缓存单元中。对于第i个从运算单元,计算其相应的权值向量[w_i0,…,w_iN]与输入神经元向量的点积。从运算单元输出的结果经过互联模块拼成完整的输出向量并返回给主运算单元,在主运算单元中进行激活运算,得到最后的输出神经元向量[out0,out1,out2,…,outN]。
图43是示出根据一个实施例的单层支持离散数据表示的人工神经网络正向计算的一种实施方法。该流程图描述利用本申请的装置和指令集实现图5所示的一种单层离散数据表示的人工神经网络正向运算过程。该计算方法在如图2、图5或图6A所示的计算装置中实现。
步骤S1.1,将初始指令存放到指令存储单元1中;
步骤S1.2,从指令存储单元1中读取一条指令;
步骤S1.3,对上述指令进行译码;
步骤S1.4,根据译码得到的控制信号,进行相应操作;
步骤S1.5,将操作结果写回到相应存储中。
在步骤S1.1中,可以存入初始化IO指令,用于搬运后续指令。
在步骤S1.2中,可读取的指令包括但不限于CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令和MOVE指令等。
在步骤S1.3中,根据指令的操作类型(CONFIG,COMPUTE,IO,NOP,JUMP,MOVE等)译码得到相应模块的控制信号。对于CONFIG指令,译码得到配置其余模块的配置信息。对于COMPUTE指令,译码得到主从运算模块的控制信号,控制不同离散数据采取的对应操作。对于IO指令,译码得到数据访问模块的控制信号。对于NOP指令,不产生实际控制信号,只用于清空当前装置内部所有控制信号缓存队列中的控制信号,保证NOP指令之前的所有指令全部执行完毕。对于JUMP指令,得到跳转指令流的控制信号。对于MOVE指令,得到在装置内部搬运数据的控制信号。
在步骤S1.4中,上述模块2-6根据控制信号执行相应操作。以执行支持离散数据表示的神经网络正向的COMPUTE指令为例,互连模块将输入神经元向量[in0,…,inN]发送给所有的从运算模块,暂存在神经元缓存单元中。对于第i个从运算模块,计算其相应的权值向量[w_i0,…,w_iN]与输入神经元向量的点积。从运算模块输出的结果经过互连模块拼成完整的输出向量并返回给主运算模块,在主运算模块中进行激活运算,得到最后的输出神经元向量[out0,out1,out2,…,outN]。
在步骤S1.5中,各个模块将操作结果写回到相应缓存中。以执行离散数据表示的神经网络正向的运算为例,主运算模块得到的输出神经元向量被写回到存储单元。
图44是示出根据一个实施例的单层人工神经网络正向运算的另一种更详细的实施方法。该流程图描述利用本申请的装置和指令集实现图4所示的一种单层神经网络正向运算的过程。
在步骤S1,在指令缓存单元1的首地址处预先存入一条IO指令。
在步骤S2,运算开始,控制器单元2从指令缓存单元1的首地址读取该条IO指令,根据译出的微指令,数据访问单元3从外部地址空间读取相应的所有人工神经网络运算指令,并将其缓存在指令缓存单元1中。
在步骤S3,控制器单元2接着从指令缓存单元读入下一条IO指令,根据译出的微指令,数据访问单元3从外部地址空间读取主运算模块5需要的所有数据(例如,包括输入神经元向量、插值表、常数表和偏置等)至主运算模块5的神经元缓存单元53,该数据支持离散表示,可以是全部离散或部分离散。
在步骤S4,控制器单元2接着从指令缓存单元读入下一条IO指令,根据译出的微指令,数据访问单元3从外部地址空间读取从运算模块6需要的权值矩阵数据,该数据支持离散表示,可以是全部离散或部分离散。
在步骤S5,控制器单元2接着从指令缓存单元读入下一条CONFIG指令,根据译出的微指令,装置配置该层神经网络计算需要的各种常数。例如,运算单元51、61根据微指令里的参数配置单元内部寄存器的值,所述参数例如包括本层计算的精度设置、激活函数的数据(例如本层计算的精度位,Lrn层算法的rang参数,AveragePooling层算法窗口大小的倒数等)。
在步骤S6,控制器单元2接着从指令缓存单元读入下一条COMPUTE指令,根据译出的微指令,主运算模块5首先通过互联模块4将输入神经元向量发给各从运算模块6,保存至从运算模块6的神经元缓存单元63。
在步骤S7,根据COMPUTE指令译出的微指令,从运算模块6的运算单元61从权值缓存单元64读取权值向量(权值矩阵中对应于该从运算模块6的列向量),从神经元缓存单元读取输入神经元向量,完成权值向量和输入神经元向量的点积运算,将中间结果通过互联返回;对于离散数据,可自定义采用异或等位运算代替点积运算或不采用。例如对于1比特的离散数据表示,0代表+1,1代表-1,通过对与权值相乘数据的符号位异或,实现了对权值的乘法运算。
在步骤S8,在互联模块4中,各从运算模块6返回的中间结果被逐级拼成完整的中间结果向量。
在步骤S9,主运算模块5得到互联模块4的返回值,根据COMPUTE指令译出的微指令,从神经元缓存单元53读取偏置向量,与互联模块4返回的向量相加,然后再对相加结果做激活,该装置支持用户自定义是否将激活后的结果离散化表示。并将最后的输出神经元向量写回至神经元缓存单元53。
在步骤S10,控制器单元接着从指令缓存单元读入下一条IO指令,根据译出的微指令,数据访问单元3将神经元缓存单元53中的输出神经元向量存至外部地址空间指定地址,运算结束。
对于人工神经网络批归一化(Batch Normalization)运算,其运算步骤与上述过程相仿。通过提供的指令集,控制器完成以下过程:控制器控制数据访问单元读入输入的数据,之后控制主从运算模块根据batch大小求出各自位置的均值以及方差,或使用设定好的均值方差;之后控制器控制将对应位置的输入数据减去均值再除以方差;最后控制器控制用处理后的数据与学习参数相乘后加上另一个学习参数。
对于多层人工神经网络,其实现过程与单层神经网络类似,当上一层人工神经网络执行完毕后,下一层的运算指令会将主运算单元中存储的上一层的输出神经元地址作为本层的输入神经元地址。同样地,指令中的权值地址和偏置地址也会变更至本层对应的地址。
通过采用用于执行人工神经网络正向运算的装置和指令集,解决了CPU和GPU运算性能不足,前端译码开销大的问题。有效提高了对多层人工神经网络正向运算的支持。
通过采用针对多层人工神经网络正向运算的专用片上缓存,充分挖掘了输入神经元和权值数据的重用性,避免了反复向内存读取这些数据,降低了内存访问带宽,避免了内存带宽成为多层人工神经网络正向运算性能瓶颈的问题。
通过采用离散数据表示的方法,相较于浮点数、定点数等表示方法,大大减少了装置的存储能耗等开销,可以在有限的面积上优化结构布局,提高运算速度或性能能耗比等指标。
本公开还提供了一种神经网络运算装置。图47为依据本实施例神经网络运算装置的示意图。在一种可选实施例中,该神经网络运算装置可以为如图6A所示的计算装置,在如图6A所示的计算装置内,可以添加幂次转换单元,该幂次转换单元与存储介质连接,用于将神经网络输入数据中非幂次权值数据转换为幂次权值数据。可选的,上述计算装置还可以包括:控制单元以及运算单元等等,控制单元以及运算单元的具体描述可以参见如图6A所示实施例的描述,这里不再赘述,另外,上述如图6A所示的计算装置还可以增加或扩展如图47所示的神经网络运算装置。另一种可选实施例中,神经网络运算装置的结构如图47,包括:
存储单元1,用于存储数据和运算指令;
控制单元,与所述存储单元连接,用于控制数据和运算指令的交互,其接收该存储单元发送的数据和运算指令,并将运算指令译码成运算微指令;
运算单元7,与所述控制单元连接,接收该控制单元发送的数据和运算微指令,并根据运算微指令对其接收的神经元数据及权值数据执行神经网络运算;
幂次转换单元9,其与所述存储单元连接,用于将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据。
具体的,所述控制单元包括:
数据控制模块2,与所述存储单元连接,用于存储单元和各缓存模块之间的数据和运算指令交互;
指令缓存模块3,与所述数据控制模块连接,用于接收数据控制模块发送的运算指令;
译码模块4,与所述指令缓存模块连接,用于从指令缓存模块中读取运算指令,并将其译码成各运算微指令;
输入神经元缓存模块5,与所述数据控制模块连接,用于接收数据控制模块发送的神经元数据;
权值缓存模块6,与所述数据控制模块连接,用于接收从数据控制模块发送的权值数据。
进一步的,所述运算单元7,分别与所述译码模块、输入神经元缓存模块及权值缓存模块连接,接收运算微指令、神经元数据及权值数据,用于根据运算微指令对其接收的神经元数据和权值数据执行相应的运算。所述输出神经元缓存单元8,与所述运算单元连接,用于接收运算单元输出的神经元数据;并将其发送至所述控制单元的数据控制模块2。由此可作为下一层神经网络运算的输入数据。
其中,存储单元从外部地址空间接收数据和指令,该数据包括神经网络权值数据、神经网络输入数据等。
进一步的,幂次转换操作有多种可选方式。下面列举本实施例所采用的三种幂次转换操作:
第一种幂次转换方法:
s_out = s_in
d_out+ = 2^⌊log2(d_in+)⌋
其中,d_in为幂次转换单元的输入数据,d_out为幂次转换单元的输出数据,s_in为输入数据的符号,s_out为输出数据的符号,d_in+为输入数据的正数部分,d_in+=d_in×s_in,d_out+为输出数据的正数部分,d_out+=d_out×s_out;⌊x⌋表示对数据x做取下整操作。
第二种幂次转换方法:
s_out = s_in
d_out+ = 2^⌈log2(d_in+)⌉
其中,d_in为幂次转换单元的输入数据,d_out为幂次转换单元的输出数据,s_in为输入数据的符号,s_out为输出数据的符号,d_in+为输入数据的正数部分,d_in+=d_in×s_in,d_out+为输出数据的正数部分,d_out+=d_out×s_out;⌈x⌉表示对数据x做取上整操作。
第三种幂次转换方法:
s_out = s_in
d_out+ = 2^[log2(d_in+)]
其中,d_in为幂次转换单元的输入数据,d_out为幂次转换单元的输出数据;s_in为输入数据的符号,s_out为输出数据的符号;d_in+为输入数据的正数部分,d_in+=d_in×s_in,d_out+为输出数据的正数部分,d_out+=d_out×s_out;[x]表示对数据x做四舍五入操作。
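上述三种幂次转换方法可以用如下Python草图统一示意(函数名power_convert及mode参数均为说明而设的假设;'floor'、'ceil'、'round'分别对应取下整、取上整和四舍五入三种方法):

```python
import math

def power_convert(d_in, mode='floor'):
    """将输入数据转换为幂次数据(符号 × 2的整数次幂)的示意实现。
    保留符号 s_out = s_in, 对正数部分 d_in+ 取 log2 后按 mode 取整。"""
    if d_in == 0:
        return 0.0
    s = 1.0 if d_in > 0 else -1.0        # s_out = s_in
    d_pos = abs(d_in)                     # d_in+ = d_in × s_in
    e = math.log2(d_pos)
    if mode == 'floor':
        k = math.floor(e)                 # 第一种: 取下整
    elif mode == 'ceil':
        k = math.ceil(e)                  # 第二种: 取上整
    else:
        k = round(e)                      # 第三种: 四舍五入
    return s * (2.0 ** k)                 # d_out = s_out × 2^k

print(power_convert(10, 'floor'))  # 8.0
print(power_convert(10, 'ceil'))   # 16.0
```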
本公开还提供了另一种神经网络运算装置。图48为依据本实施例神经网络运算装置的示意图。请参照图48,本实施例神经网络运算装置,包括:
存储单元101,用于存储数据和运算指令;该存储单元从外部地址空间接收数据和运算指令,该数据包括神经网络权值数据、神经网络输入数据等。
控制单元,与所述存储单元连接,用于控制数据和运算指令的交互,其接收该存储单元发送的数据和指令,并将运算指令译码成运算微指令;
运算单元107,与所述控制单元连接,接收该控制单元发送的数据和运算微指令,并根据运算微指令对其接收的权值数据和神经元数据执行神经网络运算;
输出神经元缓存单元108,与所述运算单元连接,用于接收运算单元输出的神经元数据,并将其发送至所述控制单元;
幂次转换单元109,其与所述存储单元连接,用于将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据;以及
幂次转换单元110,其与所述输出神经元缓存单元108连接,用于将神经网络运算后的神经元数据转换为幂次神经元数据,并发送至所述控制单元。
进一步的,所述控制单元包括:
数据控制模块102,与所述存储单元连接,用于存储单元和各缓存模块之间的数据和运算指令交互;
指令缓存模块103,与所述数据控制模块连接,用于接收数据控制模块发送的运算指令;
译码模块104,与所述指令缓存模块连接,用于从指令缓存模块中读取运算指令,并将其译码成各运算微指令;
输入神经元缓存模块105,与所述数据控制模块连接,用于接收数据控制模块发送的神经元数据;
权值缓存模块106,与所述数据控制模块连接,用于接收从数据控制模块发送的权值数据。
具体的,所述运算单元107,分别与所述译码模块、输入神经元缓存模块及权值缓存模块连接,接收各运算微指令、神经元数据及权值数据,用于根据各运算微指令对其接收的神经元数据和权值数据执行相应的运算。
所述幂次转换单元110,与所述数据控制模块连接,用于将神经网络运算后的神经元数据转换为幂次神经元数据,并发送至所述控制单元的数据控制模块102。通过幂次转换单元110获得的幂次神经元数据可作为神经网络运算下一层的输入神经元。
另外,所述幂次转换的具体操作方法与前述实施例相同,此处不再赘述。
另外,本公开实施例还提供了一种神经网络运算方法,图49为本实施例神经网络运算方法的流程图。具体而言,本公开实施例的神经网络为多层神经网络,对于每层神经网络可按图49所示的运算方法进行运算,其中,神经网络第一层输入幂次权值数据可通过存储单元从外部地址读入,若外部地址读入的权值数据已经为幂次权值数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次权值数据。请参照图49,本实施例单层神经网络运算方法,包括:
步骤S1,获取指令、神经元数据及幂次权值数据。
其中,所述步骤S1包括以下子步骤:
S11,将运算指令、神经元数据及权值数据输入存储单元;其中,对幂次权值数据直接输入存储单元,对非幂次权值数据经过幂次转换单元转换后输入存储单元;
S12,数据控制模块接收该存储单元发送的指令、神经元数据及幂次权值数据;
S13,指令缓存模块、输入神经元缓存模块及权值缓存模块分别接收所述数据控制模块发送的运算指令、神经元数据及幂次权值数据并分发给译码模块或运算单元。
所述幂次权值数据表示权值数据的数值采用其幂指数值形式表示,具体为,幂次权值数据包括符号位和幂次位,符号位用一位或多位比特位表示权值数据的符号,幂次位用m位比特位表示权值数据的幂次位数据,m为大于1的正整数。存储单元预存有编码表,提供幂次权值数据的每个幂次位数据对应的指数数值。编码表设置一个或者多个幂次位数据(即置零幂次位数据)为指定对应的幂次权值数据为0。也就是说,当幂次权值数据的幂次位数据是编码表里的置零幂次位数据时候,表示该幂次权值数据为0。其中,所述编码表可以有灵活的存储方式,既可以是表格形式进行存储,还可以是通过函数关系进行的映射。
编码表的对应关系可以是任意的。
例如,编码表的对应关系可以是乱序的。如图49.1所示一种m为5的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为3。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为1。幂次位数据为00100的时候对应幂次权值数据为0。
编码表的对应关系也可以是正相关的,存储单元预存一个整数值x和一个正整数值y,最小的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次权值数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最小的幂次位数据对应指数数值为x,最大的幂次位数据对应幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据+x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配,可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施方式中,y为1,x的数值等于-2^(m-1)。由此幂次权值数据所表示的数值的指数范围为-2^(m-1)~2^(m-1)-1。
在一种实施方式中,如图49.2所示,一种m为5,x为0,y为1的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为1。幂次位数据为00010的时候对应指数数值为2。幂次位数据为00011的时候对应指数数值为3。幂次位数据为11111的时候对应幂次权值数据为0。如图49.3所示,另一种m为5,x为0,y为2的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为2。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为6。幂次位数据为11111的时候对应幂次权值数据为0。
编码表的对应关系可以是负相关的,存储单元预存一个整数值x和一个正整数值y,最大的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次权值数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最大的幂次位数据对应指数数值为x,最小的幂次位数据对应幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据-x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配,可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施方式中,y为1,x的数值等于2^(m-1)。由此幂次权值数据所表示的数值的指数范围为-2^(m-1)-1~2^(m-1)。
如图49.4所示,一种m为5的编码表的部分内容,幂次位数据为11111的时候对应指数数值为0。幂次位数据为11110的时候对应指数数值为1。幂次位数据为11101的时候对应指数数值为2。幂次位数据为11100的时候对应指数数值为3。幂次位数据为00000的时候对应幂次权值数据为0。
编码表的对应关系可以是幂次位数据最高位代表置零位,幂次位数据其他m-1位对应指数数值。当幂次位数据最高位为0时,对应幂次权值数据为0;当幂次位数据最高位为1时,对应幂次权值数据不为0。反之亦可,即当幂次位数据最高位为1时,对应幂次权值数据为0;当幂次位数据最高位为0时,对应幂次权值数据不为0。用另一种语言来描述,即幂次权值数据的幂次位被分出一个比特来指示幂次权值数据是否为0。
在一个具体实例中,如图49.5所示,符号位为1位,幂次位数据位为7位,即m为7。编码表为:幂次位数据为1111111的时候对应幂次权值数据为0,幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。当幂次权值数据符号位为0,幂次位为0001001,则其表示具体数值为2^9,即512;幂次权值数据符号位为1,幂次位为1111101,则其表示具体数值为-2^(-3),即-0.125。相对于浮点数据,幂次数据只保留数据的幂次位,极大减小了存储数据所需的存储空间。
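按照该具体实例的编码表(m为7,幂次位全1表示0,其余幂次位为二进制补码),幂次权值数据的解码过程可以用如下Python草图示意(函数名decode_power_weight为说明而设的假设):

```python
def decode_power_weight(sign_bit, power_bits):
    """解码幂次权值数据: 幂次位全1表示0, 其余按二进制补码解释为指数,
    符号位决定正负。编码表取自正文m=7的具体实例。"""
    if power_bits == '1' * len(power_bits):   # 置零幂次位数据
        return 0.0
    v = int(power_bits, 2)
    if power_bits[0] == '1':                  # 补码最高位为1 → 负指数
        v -= 1 << len(power_bits)
    magnitude = 2.0 ** v
    return -magnitude if sign_bit == 1 else magnitude

print(decode_power_weight(0, '0001001'))  # 512.0   (2^9)
print(decode_power_weight(1, '1111101'))  # -0.125  (-2^-3)
```

可以看到,一条幂次数据只需1位符号加m位幂次位即可表示很大动态范围的数值。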
通过幂次数据表示方法,可以减小存储权值数据所需的存储空间。在本实施例所提供示例中,幂次数据为8位数据,应当认识到,该数据长度不是固定不变的,在不同场合下,根据数据权值的数据范围采用不同的数据长度。
步骤S2,根据运算指令对神经元数据及幂次权值数据进行神经网络运算。其中,所述步骤S2包括以下子步骤:
S21,译码模块从指令缓存模块中读取指令,并将其译码成各运算指令;
S22,运算单元分别接收所述译码模块、输入神经元缓存模块及权值缓存模块发送的运算指令、幂次权值数据以及神经元数据,并根据运算指令对神经元数据及幂次表示的权值数据进行神经网络运算。
所述神经元与幂次权值乘法操作具体为,神经元数据符号位与幂次权值数据符号位做异或操作;编码表的对应关系为乱序的情况下查找编码表找出幂次权值数据幂次位对应的指数数值,编码表的对应关系为正相关的情况下记录编码表的指数数值最小值并做加法找出幂次权值数据幂次位对应的指数数值,编码表的对应关系为负相关的情况下记录编码表的最大值并做减法找出幂次权值数据幂次位对应的指数数值;将指数数值与神经元数据幂次位做加法操作,神经元数据有效位保持不变。
具体实例一如图49.6所示,神经元数据为16位浮点数据,符号位为0,幂次位为10101,有效位为0110100000,则其表示的实际数值为1.40625*2^6。幂次权值数据符号位为1位,幂次位数据位为5位,即m为5。编码表为:幂次位数据为11111的时候对应幂次权值数据为0,幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次权值为000110,则其表示的实际数值为64,即2^6。幂次权值的幂次位加上神经元的幂次位结果为11011,则结果的实际数值为1.40625*2^12,即为神经元与幂次权值的乘积结果。通过该运算操作,使得乘法操作变为加法操作,减小计算所需的运算量。
具体实例二如图49.7所示,神经元数据为32位浮点数据,符号位为1,幂次位为10000011,有效位为10010010000000000000000,则其表示的实际数值为-1.5703125*2^4。幂次权值数据符号位为1位,幂次位数据位为5位,即m为5。编码表为:幂次位数据为11111的时候对应幂次权值数据为0,幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次权值为111100,则其表示的实际数值为-2^(-4)。神经元的幂次位加上幂次权值的幂次位结果为01111111,则结果的实际数值为1.5703125*2^0,即为神经元与幂次权值的乘积结果。
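神经元与幂次权值相乘即"符号位异或、指数相加、有效位不变"的过程,可以借助Python的frexp/ldexp做如下行为示意(函数名mul_by_power为假设,这里直接操作浮点数的指数部分来模拟幂次位加法,并非定点硬件的实际实现):

```python
import math

def mul_by_power(neuron, weight_exponent, weight_sign=0):
    """神经元数据与幂次权值(±2^weight_exponent)相乘的行为示意:
    有效位(尾数)保持不变, 指数做加法, 符号位做异或。"""
    mantissa, exp = math.frexp(neuron)                    # neuron = mantissa × 2^exp
    result = math.ldexp(mantissa, exp + weight_exponent)  # 指数相加, 尾数不变
    return -result if weight_sign else result             # 符号位异或

# 对应实例一: (1.40625×2^6) × 2^6 = 1.40625×2^12
print(mul_by_power(1.40625 * 2**6, 6))  # 5760.0
```

乘法被归约为一次整数加法(及符号处理),从而减小计算所需的运算量。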
可选的,还包括步骤S3,将神经网络运算后的神经元数据输出并作为下一层神经网络运算的输入数据。
其中,所述步骤S3可包括以下子步骤:
S31,输出神经元缓存单元接收所述计算单元发送的神经网络运算后得到的神经元数据。
S32,将输出神经元缓存单元接收的神经元数据传输给数据控制模块,通过输出神经元缓存单元获得的神经元数据可作为神经网络运算下一层的输入神经元,再重复步骤S1至步骤S3直到神经网络最后一层运算结束。
另外,通过幂次转换单元获得的幂次神经元数据可作为神经网络运算下一层的输入幂次神经元,再重复步骤1至步骤3直到神经网络最后一层运算结束。通过改变存储单元预存的整数值x和正整数值y,可以调整神经网络运算装置所能表示的幂次神经元数据范围。
另外,所述幂次转换的具体操作方法与前述实施例相同,此处不再赘述。
另外,本公开实施例还提供了另一种神经网络运算方法,图50为本实施例神经网络运算方法的流程图。
具体而言,本公开实施例的神经网络为多层神经网络,对于每层神经网络可按图50所示的运算方法进行运算,其中,神经网络第一层输入幂次权值数据可通过存储单元从外部地址读入,若外部地址读入的数据已经为幂次权值数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次权值数据;而神经网络第一层输入幂次神经元数据可通过存储单元从外部地址读入,若外部地址读入的数据已经为幂次数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次神经元数据,此后各层神经网络的输入神经元数据可由在该层之前的一层或多层神经网络的输出幂次神经元数据提供。请参照图50,本实施例单层神经网络运算方法,包括:
步骤S4,获取指令、幂次神经元数据及幂次权值数据。
其中,所述步骤S4包括以下子步骤:
S41,将指令、神经元数据及权值数据输入存储单元;其中,对幂次神经元数据及幂次权值数据直接输入存储单元,对非幂次神经元数据及非幂次权值数据则经过所述第一幂次转换单元转换为幂次神经元数据及幂次权值数据后输入存储单元;
S42,数据控制模块接收该存储单元发送的运算指令、幂次神经元数据及幂次权值数据;
S43,指令缓存模块、输入神经元缓存模块及权值缓存模块分别接收所述数据控制模块发送的指令、幂次神经元数据及幂次权值数据并分发给译码模块或运算单元。
所述幂次神经元数据及幂次权值数据表示神经元数据及权值数据的数值采用其幂指数值形式表示,具体为,幂次神经元数据及幂次权值数据均包括符号位和幂次位,符号位用一位或多位比特位表示神经元数据及权值数据的符号,幂次位用m位比特位表示神经元数据及权值数据的幂次位数据,m为大于1的正整数。存储单元预存有编码表,提供幂次神经元数据及幂次权值数据的每个幂次位数据对应的指数数值。编码表设置一个或者多个幂次位数据(即置零幂次位数据)为指定对应的幂次神经元数据及幂次权值数据为0。也就是说,当幂次神经元数据及幂次权值数据的幂次位数据是编码表里的置零幂次位数据时候,表示该幂次神经元数据及幂次权值数据为0。
编码表的对应关系可以是任意的。
例如,编码表的对应关系可以是乱序的。如图50.1所示一种m为5的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为3。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为1。幂次位数据为00100的时候对应幂次神经元数据及幂次权值数据为0。
编码表的对应关系也可以是正相关的,存储单元预存一个整数值x和一个正整数值y,最小的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次神经元数据及幂次权值数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最小的幂次位数据对应指数数值为x,最大的幂次位数据对应幂次神经元数据及幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据+x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配,可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施例方式中,y为1,x的数值等于-2^(m-1)。由此幂次神经元数据及幂次权值数据所表示的数值的指数范围为-2^(m-1)~2^(m-1)-1。
在一种实施例方式中,如图50.2所示一种m为5,x为0,y为1的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为1。幂次位数据为00010的时候对应指数数值为2。幂次位数据为00011的时候对应指数数值为3。幂次位数据为11111的时候对应幂次神经元数据及幂次权值数据为0。如图50.3所示另一种m为5,x为0,y为2的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为2。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为6。幂次位数据为11111的时候对应幂次神经元数据及幂次权值数据为0。
编码表的对应关系可以是负相关的,存储单元预存一个整数值x和一个正整数值y,最大的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次神经元数据及幂次权值数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最大的幂次位数据对应指数数值为x,最小的幂次位数据对应幂次神经元数据及幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据-x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配, 可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施例方式中,y为1,x的数值等于2^(m-1)。由此幂次神经元数据及幂次权值数据所表示的数值的指数范围为-2^(m-1)-1~2^(m-1)。
如图50.4所示一种m为5的编码表的部分内容,幂次位数据为11111的时候对应指数数值为0。幂次位数据为11110的时候对应指数数值为1。幂次位数据为11101的时候对应指数数值为2。幂次位数据为11100的时候对应指数数值为3。幂次位数据为00000的时候对应幂次神经元数据及幂次权值数据为0。
编码表的对应关系可以是幂次位数据最高位代表置零位,幂次位数据其他m-1位对应指数数值。当幂次位数据最高位为0时,对应幂次神经元数据及幂次权值数据为0;当幂次位数据最高位为1时,对应幂次神经元数据及幂次权值数据不为0。反之亦可,即当幂次位数据最高位为1时,对应幂次神经元数据及幂次权值数据为0;当幂次位数据最高位为0时,对应幂次神经元数据及幂次权值数据不为0。用另一种语言来描述,即幂次神经元数据及幂次权值数据的幂次位被分出一个比特来指示幂次神经元数据及幂次权值数据是否为0。
在一个具体实例方式中,如图50.5所示,符号位为1位,幂次位数据位为7位,即m为7。编码表为:幂次位数据为1111111的时候对应幂次神经元数据及幂次权值数据为0,幂次位数据为其他数值的时候幂次神经元数据及幂次权值数据对应相应的二进制补码。当幂次神经元数据及幂次权值数据符号位为0,幂次位为0001001,则其表示具体数值为2^9,即512;幂次神经元数据及幂次权值数据符号位为1,幂次位为1111101,则其表示具体数值为-2^(-3),即-0.125。相对于浮点数据,幂次数据只保留数据的幂次位,极大减小了存储数据所需的存储空间。
通过幂次数据表示方法,可以减小存储神经元数据及权值数据所需的存储空间。在本实施例所提供示例中,幂次数据为8位数据,应当认识到,该数据长度不是固定不变的,在不同场合下,根据神经元数据及权值数据的数据范围采用不同的数据长度。
步骤S5,根据运算指令对幂次神经元数据及幂次权值数据进行神经网络运算。其中,所述步骤S5包括以下子步骤:
S51,译码模块从指令缓存模块中读取运算指令,并将其译码成各运算微指令;
S52,运算单元分别接收所述译码模块、输入神经元缓存模块及权值缓存模块发送的运算指令、幂次神经元数据及幂次权值数据,并根据运算微指令对幂次神经元数据及幂次权值数据进行神经网络运算。
所述幂次神经元与幂次权值乘法操作具体为,幂次神经元数据符号位与幂次权值数据符号位做异或操作;编码表的对应关系为乱序的情况下查找编码表找出幂次神经元数据及幂次权值数据幂次位对应的指数数值,编码表的对应关系为正相关的情况下记录编码表的指数数值最小值并做加法找出幂次神经元数据及幂次权值数据幂次位对应的指数数值,编码表的对应关系为负相关的情况下记录编码表的最大值并做减法找出幂次神经元数据及幂次权值数据幂次位对应的指数数值;将幂次神经元数据对应的指数数值与幂次权值数据对应的指数数值做加法操作。
具体实例一如图50.6所示,幂次神经元数据和幂次权值数据符号位为1位,幂次位数据位为4位,即m为4。编码表为:幂次位数据为1111的时候对应幂次权值数据为0,幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次神经元数据为00010,则其表示的实际数值为2^2。幂次权值为00110,则其表示的实际数值为64,即2^6。幂次神经元数据和幂次权值数据的乘积为01000,其表示的实际数值为2^8。
可以看到,幂次神经元数据和幂次权值的乘法运算相比于浮点数据的乘法以及浮点数据和幂次数据的乘法都更加的简单方便。
本实施例方法还可进一步包括,步骤S6,将神经网络运算后的神经元数据输出并作为下一层神经网络运算的输入数据。
其中,所述步骤S6包括以下子步骤:
S61,输出神经元缓存单元接收所述计算单元发送的神经网络运算后得到的神经元数据。
S62,将输出神经元缓存单元接收的神经元数据传输给数据控制模块,通过输出神经元缓存单元获得的神经元数据可作为神经网络运算下一层的输入神经元,再重复步骤S4至步骤S6直到神经网络最后一层运算结束。
由于神经网络运算后得到的神经元数据也为幂次数据,将其传输给数据控制模块所需带宽相比于浮点数据所需带宽大大减少,因此进一步减小了神经网络存储资源和计算资源的开销,提高了神经网络的运算速度。
另外,所述幂次转换的具体操作方法与前述实施例相同,此处不再赘述。
所公开的实施例的所有的模块都可以是硬件结构,硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器,DNA计算机。
图51是本公开实施例的处理方法的流程图。在本公开一些实施例中,提供了一种处理方法,用于机器学习的稀疏化,例如神经网络的稀疏化,如图51所示,处理方法在如图2、图5或图6A所示的计算装置中实现,包括:
S101:使用滑动窗口从神经网络选取出一组权值,将选取的权值都置为零;
S102:对神经网络进行训练,训练过程中已经被置为零的权值保持为零。
步骤S101实际为对神经网络进行剪枝的过程;步骤S102是将剪枝后的神经网络使用反向传播算法(back propagation)进行重新训练,训练过程中已经被置为0的权值将一直保持0。
其中,选取神经网络的一组权值的方法可以有以下几种,组内所有权值绝对值的算术平均值小于第一阈值;或者组内所有权值绝对值的几何平均值小于第二阈值;或者组内所有权值绝对值的最大值小于第三阈值。上述第一阈值、第二阈值和第三阈值中各自的选择可以本领域技术人员可以根据情况进行预先设定,本公开并不以此为限。
图52是本公开实施例的处理方法的另一流程图。除包括与步骤S101和S102对应的步骤S201和S202外,还可以包括步骤S203:不断重复S201和S202直至在保证精度不损失x%的前提下没有权值能被置为0,x为大于0小于100的数,x根据不同的神经网络以及不同的应用可以有不同的选择。在一个实施例里,x的值为0-5。
本公开实施例中,对神经网络进行剪枝可包括:对神经网络的全连接层、卷积层或LSTM层的权值进行剪枝。
图53是本公开实施例神经网络的全连接层的剪枝方法。如图53所示,神经网络的全连接层可以看成是一个二维矩阵(Nin,Nout),其中Nin表示输入神经元的个数,Nout表示输出神经元的个数,共有Nin*Nout个权值。在粗粒度剪枝时,我们先设定一个大小为Bin*Bout的滑动窗口,其中Bin为大于等于1小于等于Nin的正整数,Bout为大于等于1小于等于Nout的正整数。滑动窗口可以沿着Bin的方向按照Sin的步长(stride)进行滑动,也可以沿着Bout方向按照Sout的步长进行滑动,其中Sin为大于等于1小于等于Bin的正整数,Sout为大于等于1小于等于Bout的正整数。当滑动窗口内的一组权值被选取时,这组权值将全部被置为0,即Bin*Bout个权值将同时置为0。
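以全连接层为例,上述滑动窗口粗粒度剪枝可以用如下Python草图示意(这里假设步长Sin、Sout等于窗口大小Bin、Bout,并以"组内权值绝对值的最大值小于阈值"作为选取条件;函数名与参数均为说明而设的示例):

```python
def prune_fc(weights, Bin, Bout, threshold):
    """全连接层粗粒度剪枝示意: 以Bin×Bout滑动窗口扫描权值矩阵,
    当窗口内权值绝对值的最大值小于阈值时, 该组Bin×Bout个权值同时置零。"""
    Nin, Nout = len(weights), len(weights[0])
    for i in range(0, Nin, Bin):
        for j in range(0, Nout, Bout):
            block = [abs(weights[r][c])
                     for r in range(i, min(i + Bin, Nin))
                     for c in range(j, min(j + Bout, Nout))]
            if max(block) < threshold:           # 组内最大绝对值小于阈值 → 整组选取
                for r in range(i, min(i + Bin, Nin)):
                    for c in range(j, min(j + Bout, Nout)):
                        weights[r][c] = 0.0      # 整组同时置零
    return weights

W = [[0.01, 0.02, 0.9, 0.8],
     [0.03, 0.01, 0.7, 0.6]]
print(prune_fc(W, 2, 2, 0.1))
# [[0.0, 0.0, 0.9, 0.8], [0.0, 0.0, 0.7, 0.6]]
```

按组置零使稀疏模式更加规则,便于硬件按组跳过访存与运算。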
图54是本公开实施例神经网络的卷积层粗粒度剪枝方法。如图54所示,神经网络的卷积层可以看成是一个四维矩阵(Nfin,Nfout,Kx,Ky),其中Nfin表示输入特征图像(feature map)数量,Nfout表示输出特征图像数量,(Kx,Ky)表示卷积核(kernel)的大小。在粗粒度剪枝时,我们先设定一个大小为Bfin*Bfout*Bx*By的滑动窗口,其中Bfin为大于等于1小于等于Nfin的正整数,Bfout为大于等于1小于等于Nfout的正整数,Bx为大于等于1小于等于Kx的正整数,By为大于等于1小于等于Ky的正整数。滑动窗口可以沿着Bfin的方向按照Sfin的步长(stride)进行滑动,或者沿着Bfout方向按照Sfout的步长进行滑动,或者沿着Bx方向按照Sx的步长进行滑动,或沿着By方向按照Sy的步长进行滑动,其中Sfin为大于等于1小于等于Bfin的正整数,Sfout为大于等于1小于等于Bfout的正整数,Sx为大于等于1小于等于Bx的正整数,Sy为大于等于1小于等于By的正整数。当某个滑动窗口内的一组权值被选取时,这组权值将全部被置为0,即Bfin*Bfout*Bx*By个权值将同时置为0。
LSTM层的权值由多个全连接层权值组成,假设LSTM层的权值由m个全连接层权值组成,其中m为大于0的正整数。第i个全连接层权值为(Nin_i,Nout_i),其中i是大于0小于等于m的正整数,Nin_i表示第i个全连接层权值输入神经元个数,Nout_i表示第i个全连接层权值输出神经元个数。在粗粒度剪枝时,对于第i个全连接层,我们先设定一个大小为Bin_i*Bout_i的滑动窗口,其中Bin_i为大于等于1小于等于Nin_i的正整数,Bout_i为大于等于1小于等于Nout_i的正整数。滑动窗口可以沿着Bin_i的方向按照Sin_i的步长进行滑动,也可以沿着Bout_i方向按照Sout_i的步长进行滑动,其中Sin_i为大于等于1小于等于Bin_i的正整数,Sout_i为大于等于1小于等于Bout_i的正整数。当滑动窗口内的一组权值被选取时,这组权值将全部被置为0,即Bin_i*Bout_i个权值将同时置为0。
本申请实施例还提供一种处理装置,在一种可选实施例中,该处理装置可以为图6A所示的计算装置,需要说明的是,上述如图6A所示的计算装置可以添加粗粒度剪枝单元和神经网络训练单元,在实际应用中,上述如图6A所示的计算装置也可以添加或扩展如图55所示的处理装置的模块或单元。在另一种可选实施例中,该处理装置如图55所示,用于对神经网络进行粗粒度剪枝,以减少访存并减少运算时间,包括:
存储器:用于存储可执行指令;
粗粒度剪枝单元:用于对神经网络进行剪枝,包括使用滑动窗口从神经网络选取出一组权值,将选取的权值都置为零;
神经网络训练单元:用于将剪枝后的神经网络进行训练:训练过程中已经被置为零的权值保持为零。
训练单元集成了神经网络反向训练算法,接收粗粒度剪枝后的神经网络,采用反向训练算法进行训练,在训练过程中被剪枝的权值始终保持为0。训练单元将训练后的神经网络或者传输给粗粒度剪枝单元进行进一步剪枝操作,或者直接输出。
进一步的,粗粒度剪枝单元还包括全连接层粗粒度剪枝单元,实现对神经网络的全连接层进行粗粒度剪枝操作。
进一步的,粗粒度剪枝单元还包括卷积层粗粒度剪枝单元,实现对神经网络的卷积层进行粗粒度剪枝操作。
进一步的,粗粒度剪枝单元还包括LSTM层粗粒度剪枝单元,实现对神经网络的LSTM层进行粗粒度剪枝操作。
本公开提供了一种处理装置,该处理装置可以添加在如图6A所示计算装置内。图56是本公开实施例的处理装置的结构示意图。如图56所示的处理装置,能够处理粗粒度稀疏后的神经网络,充分挖掘粗粒度稀疏的特性,减少访存同时减少运算量,从而减少运算时间并降低能耗。
处理装置包括存储单元,指令控制单元,粗粒度选数单元和运算单元。处理装置可以是用于神经网络处理。
存储单元可用来存储神经网络的神经元,权值以及指令。
指令控制单元用于接收存储部分中的指令,经过译码后生成控制信息,控制粗粒度选数单元进行选数操作和运算单元进行计算操作。
综上所述,本申请中的运算单元,可以用于执行神经网络专用指令。本申请中的神经网络专用指令,包括但不限于所有专用于完成人工神经网络运算的指令。神经网络专用指令包括但不限于控制指令,数据传输指令,运算指令和逻辑指令。其中控制指令控制神经网络执行过程。数据传输指令完成不同存储介质之间的数据传输,数据格式包括但不仅限于矩阵,向量和标量。运算指令完成神经网络的算术运算,包括但不仅限于矩阵运算指令,向量运算指令,标量运算指令,卷积神经网络运算指令,全连接神经网络运算指令,池化神经网络运算指令,RBM神经网络运算指令,LRN神经网络运算指令,LCN神经网络运算指令,LSTM神经网络运算指令,RNN神经网络运算指令,RELU神经网络运算指令,PRELU神经网络运算指令,SIGMOID神经网络运算指令,TANH神经网络运算指令,MAXOUT神经网络运算指令。逻辑指令完成神经网络的逻辑运算,包括但不仅限于向量逻辑运算指令和标量逻辑运算指令。
其中,RBM神经网络运算指令用于实现Restricted Boltzmann Machine(RBM)神经网络运算。
其中,LRN神经网络运算指令用于实现Local Response Normalization(LRN)神经网络运算。
其中,LSTM神经网络运算指令用于实现Long Short-Term Memory(LSTM)神经网络运算。
其中,RNN神经网络运算指令用于实现Recurrent Neural Networks(RNN)神经网络运算。
其中,RELU神经网络运算指令用于实现Rectified linear unit(RELU)神经网络运算。
其中,SIGMOID神经网络运算指令用于实现S型生长曲线(SIGMOID)神经网络运算,y=sigmoid(x)=1/(1+e^(-x)),其中x,y是实数。
其中,TANH神经网络运算指令用于实现双曲正切函数(TANH)神经网络运算。
其中,MAXOUT神经网络运算指令用于实现MAXOUT神经网络运算,即使用maxout激活函数输出一个节点,其表达式为maxout_i(x)=max_(j∈[1,k])(x^T·W_ij+b_ij),
其中,W为权重,b为偏置。
更具体的,它包括Cambricon指令集。
所述Cambricon指令集的特征在于,指令集中每一条指令长度为设定长度(例如64bit或128bit),指令由操作码和操作数组成。指令集包含四种类型的指令,分别是控制指令(control instructions),数据传输指令(data transfer instructions),运算指令(computational instructions),逻辑指令(logical instructions)。
进一步的,控制指令用于控制执行过程。控制指令包括跳转(jump)指令和条件分支(conditional branch)指令。
进一步的,数据传输指令用于完成不同存储介质之间的数据传输。数据传输指令包括加载(load)指令,存储(store)指令,搬运(move)指令。load指令用于将数据从主存加载到缓存,store指令用于将数据从缓存存储到主存,move指令用于在缓存与缓存或者缓存与寄存器或者寄存器与寄存器之间搬运数据。数据传输指令支持三种不同的数据组织方式,包括矩阵,向量和标量。
进一步的,运算指令用于完成神经网络算术运算。运算指令包括矩阵运算指令,向量运算指令和标量运算指令。
更进一步的,矩阵运算指令完成神经网络中的矩阵运算,包括矩阵乘向量(matrix multiply vector),向量乘矩阵(vector multiply matrix),矩阵乘标量(matrix multiply scalar),外积(outer product),矩阵加矩阵(matrix add matrix),矩阵减矩阵(matrix subtract matrix)。
更进一步的,向量运算指令完成神经网络中的向量运算,包括向量基本运算(vector elementary arithmetics),向量超越函数运算(vector transcendental functions),内积(dot product),向量随机生成(random vector generator),向量中最大/最小值(maximum/minimum of a vector)。其中向量基本运算包括向量加,减, 乘,除(add,subtract,multiply,divide),向量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
更进一步的,标量运算指令完成神经网络中的标量运算,包括标量基本运算(scalar elementary arithmetics)和标量超越函数运算(scalar transcendental functions)。其中标量基本运算包括标量加,减,乘,除(add,subtract,multiply,divide),标量超越函数是指那些不满足任何以多项式作系数的多项式方程的函数,包括但不仅限于指数函数,对数函数,三角函数,反三角函数。
进一步的,逻辑指令用于神经网络的逻辑运算。逻辑运算包括向量逻辑运算指令和标量逻辑运算指令。
更进一步的,向量逻辑运算指令包括向量比较(vector compare),向量逻辑运算(vector logical operations)和向量大于合并(vector greater than merge)。其中向量比较包括大于,小于,等于,大于等于,小于等于和不等于。向量逻辑运算包括与,或,非。
更进一步的,标量逻辑运算包括标量比较(scalar compare),标量逻辑运算(scalar logical operations)。其中标量比较包括大于,小于,等于,大于等于,小于等于和不等于。标量逻辑运算包括与,或,非。
粗粒度选数单元用于接收输入神经元和非零权值位置信息,使用滑动窗口选取神经网络的一组权值,将选取的权值都置为零,并选取出非零权值对应的神经元。
运算单元用于接收输入被选择的神经元和非零权值,通过乘加运算单元完成神经网络运算并将输出神经元重新传输给存储部分。
更进一步的,存储单元存放权值时只存放非零权值以及非零权值的位置信息。
更进一步的,粗粒度选数单元只会选择出非零权值对应的神经元并传输给运算单元。
更进一步的,加速装置还可包括预处理模块。如图57所示,该模块对原始数据进行预处理,包括切分、高斯滤波、二值化、正则化、归一化等等。
更进一步的,加速装置还可包括直接数据存取单元DMA(direct memory access)。
更进一步的,加速装置还可包括指令缓存,输入神经元缓存,非零权值缓存,非零权值位置缓存,输出神经元缓存。
特别的,存储单元主要用来存储神经网络的神经元,权值以及指令。其中存放权值时只存放非零权值以及非零权值的位置信息。
特别的,DMA用于在所述存储单元、指令缓存、非零权值缓存、非零权值位置缓存,输入神经元缓存和输出神经元缓存中进行数据或者指令读写。
指令缓存,用于存储专用指令;
非零权值缓存,用于缓存非零权值数据;
非零权值位置缓存,用于缓存非零权值位置数据;
非零权值位置缓存将输入数据中每个连接权值一一对应到相应的输入神经元。
一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输出与所有输入的连接状态组成一个0和1的字符串来表示该输出的连接关系。另一种情形下非零权值位置缓存一一对应的方法为采用1表示有连接,0表示无连接,每组输入与所有输出的连接状态组成一个0和1的字符串来表示该输入的连接关系。另一种情形下非零权值位置缓存一一对应的方法为将一组输出第一个连接所在的输入神经元位置距离第一个输入神经元的距离、所述输出第二组输入神经元距离上一个输入神经元的距离,所述输出第三组输入神经元距离上一个输入神经元的距离,……,依次类推,直到穷举所述输出的所有输入,来表示所述输出的连接关系。
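以第一种"1表示有连接,0表示无连接"的表示方式为例,粗粒度选数单元按位置信息选出非零权值对应神经元的过程可以用如下Python草图示意(函数名select_neurons为说明而设的假设):

```python
def select_neurons(input_neurons, connection_bits):
    """粗粒度选数示意: 按位置字符串(1有连接, 0无连接),
    只选出非零权值对应的输入神经元传给运算单元。"""
    return [n for n, bit in zip(input_neurons, connection_bits) if bit == '1']

neurons = ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8']
print(select_neurons(neurons, '11001100'))  # ['n1', 'n2', 'n5', 'n6']
```

被置零权值对应的神经元不再参与运算,从而同时减少访存与运算量。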
输入神经元缓存单元,用于缓存输入到粗粒度选数单元的输入神经元;
输出神经元缓存单元,用于缓存运算单元输出的输出神经元。
控制单元,用于接收指令缓存中的指令,经过译码后生成控制信息控制运算单元进行计算操作。
粗粒度选数单元,用于接收输入神经元和非零权值位置信息,选择出需要进行运算的神经元。粗粒度选数单元只会选择出非零权值对应的神经元并传输给运算单元。
运算单元,用于根据存储单元中存储的指令对所述数据执行相应运算。
运算单元包括但不仅限于三个部分:第一部分为一个或多个乘法器;第二部分为一个或多个加法器,优选的,第二部分包括多个加法器,多个加法器组成加法树;第三部分为激活函数单元。第一部分将第一输入数据(in1)和第二输入数据(in2)相乘得到相乘之后的输出(out1),过程为:out1=in1*in2;第二部分将第三输入数据(in3)通过加法树逐级相加得到第二输出数据(out2),其中in3是一个长度为N的向量,N大于1,过程为:out2=in3[1]+in3[2]+...+in3[N],和/或将第三输入数据(in3)通过加法树累加之后和第四输入数据(in4)相加得到第二输出数据(out2),过程为:out2=in3[1]+in3[2]+...+in3[N]+in4,或者将第三输入数据(in3)和第四输入数据(in4)相加得到第二输出数据(out2),过程为:out2=in3+in4;第三部分将第五输入数据(in5)通过激活函数(active)运算得到激活输出数据(out3),过程为:out3=active(in5),激活函数active可以是sigmoid、tanh、relu、softmax等。除了做激活操作,第三部分还可以实现其他的非线性函数,即将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。
运算单元还可以包括池化单元,池化单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out),过程为out=pool(in),其中pool为池化操作,池化操作包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
所述运算单元执行运算包括几个部分,第一部分是将所述第一输入数据和第二输入数据相乘,得到相乘之后的数据;第二部分执行加法树运算,用于将第三输入数据通过加法树逐级相加,或者将所述第三输入数据通过和第四输入数据相加得到输出数据;第三部分执行激活函数运算,对第五输入数据通过激活函数(active)运算得到输出数据。以上几个部分的运算可以自由组合,从而实现各种不同功能的运算。
以下,列举神经网络处理器实施例,对本公开的处理方法进行具体说明,但应理解的是其并非因此限制本公开,凡是利用本具体实施例所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本公开的保护范围内。
图58是本公开处理方法的一具体实施例。如图58所示,其是神经网络的一个全连接层经过粗粒度剪枝后的结果,全连接层共有8个输入神经元n1~n8和3个输出神经元o1~o3。其中n3,n4,n7,n8四个输入神经元与o1,o2,o3三个输出神经元之间的权值通过粗粒度稀疏被置为零;n1与o1,o2,o3之间通过s11,s12,s13三个权值连接,n2与o1,o2,o3之间通过s21,s22,s23三个权值连接,n5与o1,o2,o3之间通过s31,s32,s33三个权值连接,n6与o1,o2,o3之间通过s41,s42,s43三个权值连接;我们用11001100这个比特串表示输入神经元与输出神经元之间的连接情况,即第一种表示非零权值位置信息的情况,1表示输入神经元与三个输出神经元都连接,0表示输入神经元与三个输出神经元都不连接。表1描述了实施例中神经元与权值的信息,公式1描述了o1,o2,o3三个输出神经元的运算公式。从公式1中可以看出o1,o2,o3将接收到相同的神经元进行运算。
表1
(表1以附图形式给出,此处从略)
公式1--输出神经元运算公式:
o1=n1*s11+n2*s12+n5*s13+n6*s14
o2=n1*s21+n2*s22+n5*s23+n6*s24
o3=n1*s31+n2*s32+n5*s33+n6*s34
在处理装置进行运算时,8个输入神经元,12个权值和8比特的位置信息以及相应的指令被传输到存储单元。粗粒度选数单元接收8个输入神经元和非零权值位置,选出n1,n2,n5,n6四个需要参与运算的神经元。运算单元接收四个被选择的神经元与权值,通过公式1完成输出神经元的运算,然后将输出神经元传输回存储部分。
应该理解到,所揭露的相关装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。
通过本公开的实施例,提供了神经网络的粗粒度稀疏化的处理方法和对应的处理装置,以及芯片、芯片封装结构、板卡和电子装置。其中,粗粒度稀疏化处理方法能够使稀疏神经网络更加规则化,利于用硬件进行加速,同时减少非零权值位置的存储空间。神经网络处理器能够充分挖掘粗粒度稀疏的特性,减少访存同时减少运算量,从而获得加速比并降低能耗。
本申请还公开了一种用于执行人工神经网络正向运算的装置,在一种可选的实施方案中,该用于执行人工神经网络正向运算的装置可以为如图6A所示的计算装置,该计算装置还可以包括定点数据转换模块及相应的定点数据运算模块,所述定点数据转换模块包括浮点数据统计模块和数据转换单元;上述如图6A所示的计算装置还可以添加如图59或图60所示的单元或模块。其中,浮点数据统计模块用于统计及计算获得人工神经网络正向运算中存储各个类型数据所需的指数位偏移及指数位所需的比特位数;浮点数据转换单元用于实现短位数浮点数据类型与长位数浮点数据类型的转换,例如32位浮点数据类型的转换;浮点数据运算模块用于完成针对于短位数浮点数据所需的各类运算。
其中,“长位数浮点数据”表示原来的浮点数据,例如32位浮点数据,也可以是针对标准的64位或者16位浮点数等,这里只是以32位为具体实施例进行说明;“较少位数浮点数据”,又名“短位数浮点数据”,表示相对于原来的浮点数据来说,采用更少的位数来表示的浮点数据。
根据本申请实施例的多层人工神经网络的正向运算,包括两层或者两层以上的多个神经元。对于正向运算中所需的输入神经元、权值、偏置等数据,均采用短位数浮点数据类型表示,并参与各个层之间的运算。
图59示出了根据本申请一实施例的用于存储数据的短位数浮点数据结构的具体表示方法。其中,1位用于表示符号,M位用于表示指数部分,N位用于表示有效位部分。由于浮点表示法要求第一位有效数字不能为0,对于二进制来说只能为1,因此有效位的最高位1可以作为隐藏位,不写入内存,所以实际表示的浮点数有效位数为(N+1)位。相比于32位浮点数据表示形式,本申请采用的短位浮点数据表示形式除了占用比特位数更少外,对于神经网络中同一层、同一类型的数据,如第一个卷积层的所有权值数据,还另外设置了两个标志位:标志位offset和标志位EL,其中标志位offset用于记录指数位的初始偏移,实际指数位表示=指数位表示数据+偏移量(offset),标志位EL用于记录指数位所占用的比特数M,则有效位所占比特数N=X-1-M。
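按上述格式(实际指数=指数位表示数据+offset,有效位带隐藏的最高位1),短位数浮点数据的解码可以用如下Python草图示意(函数名decode_short_float及参数划分均为说明而设的假设,指数位与有效位的位宽在此由字符串长度隐含给出):

```python
def decode_short_float(sign, exp_bits, frac_bits, offset):
    """短位数浮点数据示意解码:
    实际指数 e = 指数位表示数据 + offset; 有效数 = 1.frac(含隐藏位1)。"""
    e = int(exp_bits, 2) + offset
    frac = 1.0 + int(frac_bits, 2) / (2 ** len(frac_bits))  # 隐藏位1 + 小数部分
    value = frac * (2.0 ** e)
    return -value if sign else value

# 示例(假设offset=-3): 指数位101=5 → 实际指数2, 有效数1.25 → 1.25×2^2
print(decode_short_float(0, '101', '0100', offset=-3))  # 5.0
```

通过为同一层同一类型数据共用offset与EL两个标志位,较少的比特即可覆盖该类数据的实际取值范围。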
图60A示出了用于执行人工神经网络正向运算的装置的示例框图。如图60A所示,该装置包括:
浮点数据统计模块11,用于对所述神经网络正向运算中的输入神经元、权值和/或偏置数据进行数据分析,以得到浮点数据的指数位偏移量及指数位的长度EL;
浮点数据转换模块12,用于根据所述浮点数据的指数位偏移量和指数位的长度EL,将所述输入神经元、权值和/或偏置数据从长位数浮点数据类型转换为短位数浮点数据类型;
浮点数据运算模块13,用于根据转换为短位数浮点数据类型的输入神经元、权值和/或偏置数据进行人工神经网络正向运算。
图60示出了浮点数据统计模块的示例框图。该浮点数据统计模块包括数据提取单元21、统计单元22和分析单元23。该模块的目的是,通过提取采用长位数浮点数据类型表示的神经网络中的所有长位数浮点数据,比如包括输入神经元、权值和/或偏置数据,并通过分析这些长位数浮点数据得到神经网络中用短位数浮点数据类型表示的各个不同类型数据(比如输入神经元、权值和偏移数据)所需的指数位偏移(offset)及指数位长度EL,以便在之后的短位数浮点正向运算中有更好的效果。
其中,数据提取单元21用于提取长位数浮点正向运算过程中各个不同类型的数据;统计单元22用于统计同一类型数据的数据范围及各个数据段的数据分布情况;分析单元23根据统计单元22统计的结果,以得到采用短位数浮点表示各个类型数据时应当设定的指数位长度EL及指数位偏移(offset),指数位长度EL的设定使得可表示的数据范围尽可能包含该类型的所有数据。
在一种可行的实施例中，上述用于执行人工神经网络正向运算的装置从其他单元或者装置中，比如CPU中，获取上述正向运算过程中采用长位数浮点数据类型表示的各个不同类型的数据，包括输入神经元、权值和偏置数据，然后统计同一类型的数据的数据范围和各个数据段的分布情况，根据该统计结果得到采用短位数浮点数据表示各个类型数据或者每一层各个类型数据时应当设定的指数位长度EL及指数位偏移；或者，
上述用于执行人工神经网络正向运算的装置从其他单元或者装置中,比如CPU中,获取采用短位数浮点数据表示上述人工神经网络中各个类型数据或者每一层各个数据类型时应当设定的指数位长度EL和指数位偏置。
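统计单元与分析单元由数据范围得到offset与EL的过程，可以用如下Python示意（假设性实现：以各数据的二进制指数范围近似"数据范围"，选取EL恰好覆盖指数跨度，这一具体策略是本示例的假设）：

```python
# 浮点数据统计模块功能的简化示意：统计同一类型数据的指数范围，
# 选取指数位偏移 offset 与指数位长度 EL，使可表示范围尽可能覆盖所有数据。
import math

def analyze_exponent_range(data):
    exps = [int(math.floor(math.log2(abs(x)))) for x in data if x != 0.0]
    e_min, e_max = min(exps), max(exps)
    offset = e_min                           # 指数位初始偏移取最小指数
    span = e_max - e_min + 1
    EL = max(1, math.ceil(math.log2(span)))  # EL 个比特需覆盖指数跨度
    return offset, EL

# 数据指数范围为 [-2, 2]，跨度 5，需要 EL=3 个指数比特
print(analyze_exponent_range([0.25, 0.5, 1.0, 6.0]))
```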
图61示出了正向运算模块的短位数浮点计算部分的示例框图。包括运算缓存单元31、数据转换单元32、舍入单元33。其中:运算缓存单元31用于存储采用精度较高的数据类型表示的正向运算的中间结果,这是由于在正向运算时,加法或者乘法运算可能会导致数据范围扩大;运算结束后,对超出短位数浮点数据类型所表示的精度范围的数据,进行舍入操作,接着通过数据转换单元32将运算缓存单元中存储的数据从长位数浮点数据类型转换为短位数浮点数据类型。
舍入单元33用于对超出短位浮点数据类型表示精度范围的数据进行舍入操作,该舍入单元可以为随机舍入单元、四舍五入单元、向上舍入单元、向下舍入单元、截断舍入单元等,通过不同的舍入单元可以实现对超出短位数浮点数据类型表示精度范围的数据进行不同的舍入操作。
随机舍入单元执行如下操作：

$$y=\begin{cases}\lfloor x\rfloor, & w.p.\ 1-\dfrac{x-\lfloor x\rfloor}{\varepsilon}\\ \lfloor x\rfloor+\varepsilon, & w.p.\ \dfrac{x-\lfloor x\rfloor}{\varepsilon}\end{cases}$$

其中，y表示随机舍入后的短位数浮点数据，x表示随机舍入前的长位数浮点数据，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$；$\lfloor x\rfloor$ 表示对原数据x直接截得短位数浮点数据所得的数（类似于对小数做向下取整操作），w.p.表示概率，即随机舍入获得的数据y为 $\lfloor x\rfloor$ 的概率为 $1-\frac{x-\lfloor x\rfloor}{\varepsilon}$，为 $\lfloor x\rfloor+\varepsilon$ 的概率为 $\frac{x-\lfloor x\rfloor}{\varepsilon}$。
四舍五入单元执行如下操作：

$$y=\begin{cases}\lfloor x\rfloor, & \lfloor x\rfloor\le x\le\lfloor x\rfloor+\dfrac{\varepsilon}{2}\\ \lfloor x\rfloor+\varepsilon, & \lfloor x\rfloor+\dfrac{\varepsilon}{2}<x\le\lfloor x\rfloor+\varepsilon\end{cases}$$

其中，y表示四舍五入后的短位数浮点数据，x表示四舍五入前的长位数浮点数据，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数。
向上舍入单元执行如下操作：

$$y=\lceil x\rceil$$

其中，y表示向上舍入后的短位数浮点数据，x表示向上舍入前的长位数浮点数据，$\lceil x\rceil$ 为ε的整数倍，其值为大于或等于x的最小数，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$。
向下舍入单元执行如下操作：

$$y=\lfloor x\rfloor$$

其中，y表示向下舍入后的短位数浮点数据，x表示向下舍入前的长位数浮点数据，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$。
截断舍入单元执行如下操作:
y=[x];
其中,y表示截断舍入后的短位数浮点数据,x表示截断舍入前的长位数浮点数据,[x]表示对原数据x直接截得短位数浮点数据所得的数。
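上述五种舍入方式可以用如下统一的Python示意来对照理解（假设性实现，以ε为量化步长演示各模式的行为，并非本公开舍入单元的实际硬件逻辑）：

```python
# 五种舍入方式的统一示意：ε 为短位数数据类型能表示的最小正数，
# ⌊x⌋ 为不大于 x 的 ε 整数倍，⌈x⌉ 为不小于 x 的 ε 整数倍。
import math
import random

def round_to_eps(x, eps, mode, rng=random.random):
    lo = math.floor(x / eps) * eps            # ⌊x⌋
    hi = lo + eps                             # ⌊x⌋ + ε
    if mode == 'down':                        # 向下舍入
        return lo
    if mode == 'up':                          # 向上舍入
        return lo if x == lo else hi
    if mode == 'nearest':                     # 四舍五入
        return lo if x - lo <= eps / 2 else hi
    if mode == 'truncate':                    # 截断舍入：向零方向直接截取
        return math.trunc(x / eps) * eps
    if mode == 'stochastic':                  # 随机舍入：概率与残差成正比
        return hi if rng() < (x - lo) / eps else lo
    raise ValueError(mode)

eps = 0.25
print(round_to_eps(0.6, eps, 'down'), round_to_eps(0.6, eps, 'up'),
      round_to_eps(0.6, eps, 'nearest'), round_to_eps(-0.6, eps, 'truncate'))
```

随机舍入的期望值等于原数据本身，因此在训练中常用于减小长期累积的量化偏差。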
本申请还公开了一种执行人工神经网络正向运算的方法,具体实施步骤为:
通过已训练好的神经网络长位数浮点模型获取神经网络各个层的以长位数浮点数据类型表示的数据,包括每一层的权值、偏置、输入神经元、输出神经元及其它数据参数。
对不同层,不同类型数据单独进行统计分析,获得不同层各类型数据采用短位数浮点数据类型表示时所需要的各个参数,包括指数位的位宽、有效位的位宽,以及指数位所需表示的范围等。
对统计分析得到的短位数浮点数据类型用于神经网络正向运算，即神经网络正向运算中所有数据用短位数浮点数据类型表示，同时，对神经网络的权值和偏置数据保留一份以长位数浮点数据类型表示的副本，然后进行正向运算。在正向运算中，某些运算（如加法、乘法等）会导致数据范围扩大，需要用缓存空间存储中间计算结果，中间结果用长位数浮点数据类型存储，计算完后再转回相应的短位数浮点数据类型。从长位数浮点数据类型转换为短位数浮点数据类型的过程中需要采用舍入的方式，其中包括随机舍入、四舍五入、向上舍入、向下舍入和截断舍入等，分别表示如下：
随机舍入的具体操作如下式所示：

$$y=\begin{cases}\lfloor x\rfloor, & w.p.\ 1-\dfrac{x-\lfloor x\rfloor}{\varepsilon}\\ \lfloor x\rfloor+\varepsilon, & w.p.\ \dfrac{x-\lfloor x\rfloor}{\varepsilon}\end{cases}$$

其中，y表示随机舍入后的短位数浮点数据，x表示随机舍入前的长位数浮点数据，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$；$\lfloor x\rfloor$ 表示对原数据x直接截得短位数浮点数据所得的数（类似于对小数做向下取整操作），w.p.表示概率，即随机舍入获得的数据y为 $\lfloor x\rfloor$ 的概率为 $1-\frac{x-\lfloor x\rfloor}{\varepsilon}$，为 $\lfloor x\rfloor+\varepsilon$ 的概率为 $\frac{x-\lfloor x\rfloor}{\varepsilon}$。
四舍五入的具体操作如下式所示：

$$y=\begin{cases}\lfloor x\rfloor, & \lfloor x\rfloor\le x\le\lfloor x\rfloor+\dfrac{\varepsilon}{2}\\ \lfloor x\rfloor+\varepsilon, & \lfloor x\rfloor+\dfrac{\varepsilon}{2}<x\le\lfloor x\rfloor+\varepsilon\end{cases}$$

其中，y表示四舍五入后的短位数浮点数据，x表示四舍五入前的长位数浮点数据，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数。
向上舍入的具体操作如下式所示：

$$y=\lceil x\rceil$$

其中，y表示向上舍入后的短位数浮点数据，x表示向上舍入前的长位数浮点数据，$\lceil x\rceil$ 为ε的整数倍，其值为大于或等于x的最小数，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$。
向下舍入的具体操作如下式所示：

$$y=\lfloor x\rfloor$$

其中，y表示向下舍入后的短位数浮点数据，x表示向下舍入前的长位数浮点数据，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数，ε为当前短位数浮点数据类型所能表示的最小正数，即 $2^{offset-(X-1-EL)}$。
截断舍入的具体操作如下式所示:
y=[x];
其中，y表示截断舍入后的短位数浮点数据，x表示截断舍入前的长位数浮点数据，[x]表示对原数据x直接截得短位数浮点数据所得的数。
正向运算结束后，做反向运算时，需要将正向运算中的以短位数浮点数据类型表示的数据转换为以长位数浮点数据类型表示的数据，然后使用以长位数浮点数据类型表示的数据参与反向运算，其中，参与反向运算的权值和偏置数据采用正向运算时保留的以长位数浮点数据类型表示的副本。反向运算结束后，将以长位数浮点数据类型表示的数据转换为以短位数浮点数据类型表示的数据，然后使用以短位数浮点数据类型表示的数据参与之后的正向运算，同时，在正向运算过程中仍对神经网络的权值和偏置数据保留以长位数浮点数据类型表示的副本。转换过程中需要做舍入操作，操作同上述正向运算中的舍入操作。
重复进行如上所述的正向及反向运算直到神经网络训练完成。
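上述"正向用短位数表示、反向在长位数副本上更新、更新后再舍入回短位数"的交替流程，可以用如下Python示意（假设性实现：以向下舍入近似短位数表示，学习率与梯度均为示例数值）：

```python
# 训练循环的最简示意：权值保留长位数"主副本"，反向更新作用在主副本上，
# 正向运算使用由主副本舍入得到的短位数权值。
import math

def quantize(x, eps):
    """以向下舍入近似"转换为短位数表示"：取不大于 x 的 ε 整数倍"""
    return math.floor(x / eps) * eps

def train_step(master_w, grad, lr, eps):
    """返回 (更新后的长位数主副本, 参与下次正向运算的短位数权值)"""
    master_w = master_w - lr * grad        # 反向更新在长位数副本上累积小增量
    return master_w, quantize(master_w, eps)

w = 1.0
for g in [0.3, 0.3, 0.3]:
    w, w_short = train_step(w, g, 0.1, 0.25)
print(round(w, 2), w_short)
```

保留长位数主副本的意义在于：单步更新量（此处为 0.03）往往小于短位数类型的精度 ε，若直接在短位数权值上更新，更新会被舍入吞掉，训练无法收敛。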
图62为根据本申请一实施例的单层人工神经网络正向运算流程图。该流程图描述利用本申请的装置和指令集实现的一种单层神经网络正向运算的过程。该运算过程在如图2、图5或图6A所示的计算装置中实现。对于每一层来说,首先对输入神经元向量进行加权求和计算出本层的中间结果向量。该中间结果向量加偏置并激活得到输出神经元向量。将输出神经元向量作为下一层的输入神经元向量。
图63示意性示出了根据本申请一实施例的运算流程示例框图。其中，正向运算模块51进行正向运算得到的除权值、偏置数据之外的以短位数浮点数据类型表示的数据，在进行反向训练时要先通过短位数-长位数浮点数据类型转换单元53转换成以长位数浮点数据类型表示的数据，然后进行反向运算；反向运算模块进行的反向运算结束后，需要通过长位数-短位数浮点数据类型转换单元54将以长位数浮点数据类型表示的数据转换成以短位数浮点数据类型表示的数据。在转换过程中，需对超出短位数浮点数据类型所能表示的精度范围的数据进行舍入操作，此处舍入操作由舍入单元55完成，过程同图61中的舍入单元进行的舍入操作。
需要说明的是,上述正向运算也可以采用以长位数浮点数据类型表示的输入神经元、权值和/或偏置数据,上述反向训练也可以采用以短位数浮点数据类型表示的输入神经元、权值和/或偏置数据。
需要说明的是,上述短位数浮点数据类型是相对于上述长位数浮点数据类型的,当短位数浮点数据类型为16位浮点数据类型时,上述长位数浮点数据类型可为32位浮点数据类型或64位浮点数据类型;当上述短位数浮点数据类型为32位浮点数据类型时,上述长位数浮点数据类型为64位浮点数据类型。
通过将正向运算的数据用短位数浮点数据类型表示,充分利用了短位数浮点数据类型表示的数据范围空间,相对于以长位浮点数据类型表示的数据,极大地减少了存储网络参数所需的空间,优化了硬件的面积功耗比。
本申请公开了一种用于执行神经网络正向运算的装置，该用于执行神经网络正向运算的装置在一种可选的技术方案中，可以为如图6A所示的计算装置，该计算装置可以包括定点数据转换模块及相应的定点数据运算模块，所述定点数据转换模块包括浮点数据统计模块和数据转换单元；如图6A所示的计算装置还可以包括如图64、65、66所示装置的模块或单元。其中，浮点数据统计模块用于统计及计算获得正向运算中人工神经网络各个类型数据的合适的小数点位置(point location)；数据转换单元用于实现短位数定点数据类型与长位数浮点数据类型的转换；定点运算模块用于完成针对于短位数定点数据所需的各类正向运算。
其中,“长位数浮点数据”表示原来的浮点数据,例如32位浮点数据,也可以是针对标准的64位或者16位浮点数等,这里只是以32位为具体实施例进行说明;“较少位数定点数据”,又名“短位数定点数据”,表示相对于原来的浮点数据来说,采用更少的位数来表示的定点数据。
根据本申请实施例的多层人工神经网络的正向运算,包括两层或者两层以上的多个神经元。对于正向运算中所需的输入神经元、权值、偏置等数据,均采用短位数定点数据类型表示,并参与各个层之间的运算。
图64示出了根据本申请实施例的用于存储数据的短位数定点数据结构的具体表示方法。其中，1比特位用于表示符号，M位用于表示整数部分，N位用于表示小数部分；相比于32位浮点数据表示形式，本申请采用的短位数定点数据表示形式除了占用比特位数更少外，对于神经网络中同一层、同一类型的数据，如第一个卷积层的所有权值数据，还另外设置了一个标志位Point location记录小数点的位置，这样可以根据实际数据的分布调整定点数据类型所能表示的精度与可表示数据范围。
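带Point location的短位数定点表示可以用如下Python示意（假设性实现：以最近舍入加饱和处理演示量化行为，X=16等参数均为示例）：

```python
# 短位数定点数据类型的示意性转换：X 位中 1 位符号、M 位整数、N 位小数，
# Point location 记录小数点位置，可表示精度为 2^{-Point_location}。

def float_to_fixed(x, point_location, X=16):
    scale = 1 << point_location               # 2^Point_location
    q = int(round(x * scale))                 # 最近舍入到定点网格
    q_max = (1 << (X - 1)) - 1                # 饱和处理，防止超出可表示范围
    q = max(-q_max - 1, min(q_max, q))
    return q / scale                          # 还原为实数，便于观察量化误差

# point_location=8 时精度为 1/256
print(float_to_fixed(3.1416, point_location=8))
```

Point location越大精度越高但可表示范围越小，这正是"根据实际数据分布调整精度与范围"的含义。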
图65A示出了用于执行人工神经网络正向运算的装置的示例框图。如图65A所示，该装置包括：
浮点数据统计模块11,用于对所述人工神经网络正向运算中的输入神经元、权值和/或偏置数据进行数据分析,以得到定点数据类型的小数点位置;
数据转换模块12,用于根据所述定点数据的小数点位置,将所述输入神经元、权值和/或偏置数据从长位数浮点数据类型转换为短位数定点数据类型;
定点数据运算模块13,用于根据转换为短位数定点数据类型的输入神经元、权值和/或偏置数据进行人工神经网络正向运算。
图65示出了浮点数据统计模块的示例框图。该浮点数据统计模块11包括数据提取单元21、统计单元22和分析单元23。该模块的目的是，通过提取采用长位数浮点数据类型表示的神经网络中的所有长位数浮点数据，比如包括输入神经元、权值和/或偏置数据，并通过分析这些长位数浮点数据得到神经网络中用短位数定点数据类型表示的各个不同类型数据（比如输入神经元、权值和偏移数据）所需的小数点位置Point location，以便在之后的短位数定点正向运算中有更好的效果。
其中,数据提取单元21用于提取长位数浮点正向运算过程中各个不同类型的数据;统计单元22用于统计同一类型数据的数据范围及各个数据段的数据分布情况;分析单元23根据统计单元22统计的结果,以得到用短位数定点数据类型表示各个类型数据应当设定的小数点位置Point location。
在一种可行的实施例中，上述用于执行人工神经网络正向运算的装置从其他单元或者装置中，比如CPU中，获取上述正向运算过程中采用长位数浮点数据类型表示的各个不同类型的数据，包括输入神经元、权值和偏置数据，然后统计同一类型的数据的数据范围和各个数据段的分布情况，根据该统计结果得到采用短位数定点数据表示各个类型数据或者每一层各个类型数据时应当设定的小数点位置；或者，
上述用于执行人工神经网络正向运算的装置从其他单元或者装置中，比如CPU中，获取采用短位数定点数据表示上述人工神经网络中各个类型数据或者每一层各个数据类型时应当设定的小数点位置Point location。
图66示出了正向运算模块的短位数定点计算部分的示例框图。包括运算缓存单元31、数据转换单元32、舍入单元33。其中:运算缓存单元31用于存储采用精度较高的数据类型表示的正向运算的中间结果,这是由于在正向运算时,加法或者乘法运算可能会导致数据范围扩大;运算结束后,对超出短位数定点数据类型所表示的精度范围的数据,进行舍入操作,接着通过数据转换单元32将运算缓存单元中存储的数据从长位数浮点数据类型转换为短位数定点数据类型。
舍入单元33用于对超出短位定点数据类型表示精度范围的数据进行舍入操作,该单元可以为随机舍入单元、四舍五入单元、向上舍入单元、向下舍入单元、截断舍入单元等,通过不同的舍入单元可以实现对超出短位数定点数据类型表示精度范围的数据进行不同的舍入操作。
随机舍入单元执行如下操作：

$$y=\begin{cases}\lfloor x\rfloor, & w.p.\ 1-\dfrac{x-\lfloor x\rfloor}{\varepsilon}\\ \lfloor x\rfloor+\varepsilon, & w.p.\ \dfrac{x-\lfloor x\rfloor}{\varepsilon}\end{cases}$$

其中，y表示随机舍入后的短位数定点数据，x表示随机舍入前的长位数浮点数据，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$；$\lfloor x\rfloor$ 表示对原数据x直接截得短位数定点数据所得的数（类似于对小数做向下取整操作），w.p.表示概率，即随机舍入获得的数据y为 $\lfloor x\rfloor$ 的概率为 $1-\frac{x-\lfloor x\rfloor}{\varepsilon}$，为 $\lfloor x\rfloor+\varepsilon$ 的概率为 $\frac{x-\lfloor x\rfloor}{\varepsilon}$。
四舍五入单元执行如下操作：

$$y=\begin{cases}\lfloor x\rfloor, & \lfloor x\rfloor\le x\le\lfloor x\rfloor+\dfrac{\varepsilon}{2}\\ \lfloor x\rfloor+\varepsilon, & \lfloor x\rfloor+\dfrac{\varepsilon}{2}<x\le\lfloor x\rfloor+\varepsilon\end{cases}$$

其中，y表示四舍五入后的短位数定点数据，x表示四舍五入前的长位数浮点数据，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数。
向上舍入单元执行如下操作：

$$y=\lceil x\rceil$$

其中，y表示向上舍入后的短位数定点数据，x表示向上舍入前的长位数浮点数据，$\lceil x\rceil$ 为ε的整数倍，其值为大于或等于x的最小数，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$。
向下舍入单元执行如下操作：

$$y=\lfloor x\rfloor$$

其中，y表示向下舍入后的短位数定点数据，x表示向下舍入前的长位数浮点数据，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$。
截断舍入单元执行如下操作:
y=[x];
其中,y表示截断舍入后的短位数定点数据,x表示截断舍入前的长位数浮点数据,[x]表示对原数据x直接截得短位数定点数据所得的数。
本申请还公开了一种执行人工神经网络正向运算的方法,具体实施步骤为:
通过已训练好的神经网络32位浮点模型获取神经网络各个层的32位浮点模型数据,包括每一层的权值、偏置、输入输出值及其它数据参数。
提取所述多层网络模型的每一层中同一类型的输入数据;统计并获取所述多层网络模型的每一层中同一类型的输入数据在预设区间上的分布比例;根据所述分布比例获取所述多层网络模型的每一层中同一类型的输入数据的小数点位置。
其中，上述预设区间可为 $[-2^{X-1-i},\ 2^{X-1-i}-2^{-i}]$，$i=0,1,2,\dots,n$，n为预先设定的一正整数，X为定点数据所占的比特位数。上述预设区间族包括n+1个子区间。统计上述多层网络模型的每一层中同一类型的输入数据在上述n+1个子区间上的分布信息，并根据该分布信息获取第一分布比例。该第一分布比例为 $p_0,p_1,p_2,\dots,p_n$，这n+1个数值为上述多层网络模型的每一层中同一类型的输入数据在上述n+1个子区间上的分布比例。预先设定一个溢出率EPL，从 $0,1,2,\dots,n$ 中获取最大的i，使得 $p_i\ge 1-EPL$，该最大的i即为上述多层网络模型的每一层中同一类型的输入数据的小数点位置。换句话说，取上述小数点位置为 $\max\{i\mid p_i\ge 1-EPL,\ i\in\{0,1,2,\dots,n\}\}$，即在满足大于或等于1-EPL的 $p_i$ 中，选取最大的下标值i。
需要说明的是，上述 $p_i$ 为上述多层网络模型的每一层中同一类型的输入数据中取值在区间 $[-2^{X-1-i},\ 2^{X-1-i}-2^{-i}]$ 中的输入数据的个数与该层中同一类型的输入数据总个数的比值。比如m1个同一类型的输入数据中有m2个输入数据取值在区间 $[-2^{X-1-i},\ 2^{X-1-i}-2^{-i}]$ 中，则 $p_i=\frac{m2}{m1}$。
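按溢出率EPL选取小数点位置的过程可以用如下Python示意（假设性实现，数据与参数取值均为示例）：

```python
# 根据溢出率 EPL 选取小数点位置：统计各候选 i 下落入可表示区间
# [-2^{X-1-i}, 2^{X-1-i}-2^{-i}] 的数据比例 p_i，取满足 p_i >= 1-EPL 的最大 i。

def choose_point_location(data, X, n, EPL):
    best = 0
    for i in range(n + 1):
        lo = -2 ** (X - 1 - i)
        hi = 2 ** (X - 1 - i) - 2 ** (-i)
        p_i = sum(lo <= x <= hi for x in data) / len(data)
        if p_i >= 1 - EPL:
            best = i                          # i 越大，精度 2^{-i} 越高
    return best

# X=8、EPL=0 时：i=3 的区间 [-16, 15.875] 恰能容纳全部数据，i=4 则溢出
print(choose_point_location([-10.0, 0.5, 12.0], X=8, n=4, EPL=0.0))
```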
根据所述小数点位置Point location,将所有以长位数浮点数据类型表示的数据均采用短位数定点数据类型表示。
对统计分析得到的短位数定点数据类型用于神经网络正向运算，即神经网络正向运算中所有数据采用短位数定点数据类型表示，同时，对神经网络的权值和偏置数据保留一份以长位数浮点数据类型表示的副本，然后进行正向运算。在正向运算中，某些运算（如加法、乘法等）会导致数据范围扩大，需要用缓存空间存储中间计算结果，中间结果用长位数浮点数据类型存储，计算完后再转换为相应的短位数定点数据类型。从长位数浮点数据类型转换为短位数定点数据类型的过程中需要采用舍入的方式，其中包括随机舍入、四舍五入、向上舍入、向下舍入和截断舍入等，分别表示如下：
随机舍入的具体操作如下式所示：

$$y=\begin{cases}\lfloor x\rfloor, & w.p.\ 1-\dfrac{x-\lfloor x\rfloor}{\varepsilon}\\ \lfloor x\rfloor+\varepsilon, & w.p.\ \dfrac{x-\lfloor x\rfloor}{\varepsilon}\end{cases}$$

其中，y表示随机舍入后的短位数定点数据，x表示随机舍入前的长位数浮点数据，ε为当前短位数定点数据表示格式所能表示的最小正数，即 $2^{-Point\_location}$；$\lfloor x\rfloor$ 表示对原数据x直接截得短位数定点数据所得的数（类似于对小数做向下取整操作），w.p.表示概率，即随机舍入获得的数据y为 $\lfloor x\rfloor$ 的概率为 $1-\frac{x-\lfloor x\rfloor}{\varepsilon}$，为 $\lfloor x\rfloor+\varepsilon$ 的概率为 $\frac{x-\lfloor x\rfloor}{\varepsilon}$。
四舍五入的具体操作如下式所示：

$$y=\begin{cases}\lfloor x\rfloor, & \lfloor x\rfloor\le x\le\lfloor x\rfloor+\dfrac{\varepsilon}{2}\\ \lfloor x\rfloor+\varepsilon, & \lfloor x\rfloor+\dfrac{\varepsilon}{2}<x\le\lfloor x\rfloor+\varepsilon\end{cases}$$

其中，y表示四舍五入后的短位数定点数据，x表示四舍五入前的长位数浮点数据，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数。
向上舍入的具体操作如下式所示：

$$y=\lceil x\rceil$$

其中，y表示向上舍入后的短位数定点数据，x表示向上舍入前的长位数浮点数据，$\lceil x\rceil$ 为ε的整数倍，其值为大于或等于x的最小数，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$。
向下舍入的具体操作如下式所示：

$$y=\lfloor x\rfloor$$

其中，y表示向下舍入后的短位数定点数据，x表示向下舍入前的长位数浮点数据，$\lfloor x\rfloor$ 为ε的整数倍，其值为小于或等于x的最大数，ε为当前短位数定点数据类型所能表示的最小正数，即 $2^{-Point\_location}$。
截断舍入的具体操作如下式所示：
y=[x];
其中,y表示截断舍入后的短位数定点数据,x表示截断舍入前的长位数浮点数据,[x]表示对原数据x直接截得短位数定点数据所得的数。
正向运算结束后，做反向运算时，需要将正向运算中的以短位数定点数据类型表示的数据转换为以长位数浮点数据类型表示的数据，然后使用以长位数浮点数据类型表示的数据参与反向运算，其中，参与反向运算的权值和偏置数据用正向运算时保留的以长位数浮点数据类型表示的副本。反向运算结束后，将以长位数浮点数据类型表示的数据转换为以短位数定点数据类型表示的数据，然后使用以短位数定点数据类型表示的数据参与之后的正向运算，同时，在正向运算过程中仍对神经网络的权值和偏置数据保留以长位数浮点数据类型表示的副本。转换过程中需要做舍入操作，操作同上述正向运算中的舍入操作。
重复进行如上所述的正向及反向运算直到神经网络训练完成。
图67是示出根据一个实施例的单层人工神经网络正向运算流程图。该运算过程在如图2、图5或图6A所示的计算装置中实现。该流程图描述利用本申请的装置和指令集实现的一种单层神经网络正向运算的过程。对于每一层来说,首先对输入神经元向量进行加权求和计算出本层的中间结果向量。该中间结果向量加偏置并激活得到输出神经元向量。将输出神经元向量作为下一层的输入神经元向量。
图68示意性示出了根据本申请一实施例的运算流程示例框图。该运算过程在如图2、图5或图6A所示的计算装置中实现。其中，正向运算模块51进行正向运算得到的除权值、偏置数据之外的以短位数定点数据类型表示的数据，在进行反向训练时要先通过短位数定点数据-长位数浮点数据转换单元53转换成以长位数浮点数据类型表示的数据，然后进行反向传播运算；反向运算模块进行的反向传播运算结束后，需要通过长位数浮点数据-短位数定点数据转换单元54将以长位数浮点数据类型表示的数据转换成以短位数定点数据类型表示的数据。在转换过程中，需对超出短位数定点数据类型所能表示的精度范围的数据进行同图66中的舍入操作，此处舍入操作由舍入单元55完成。
图69示出了根据本申请实施例的算法实施总体流程图。该运算过程在如图2、图5或图6A所示的计算装置中实现。细节操作在对图64到图68的说明中已经给出,详细步骤和申请内容中的具体实施步骤完全相同,这里不作赘述。
需要说明的是,上述正向运算也可以采用以长位数浮点数据类型表示的输入神经元、权值和/或偏置数据,上述反向训练也可以采用以短位数定点数据类型表示的输入神经元、权值和/或偏置数据。
需要说明的是,上述短位数浮点数据类型是相对于上述长位数浮点数据类型的,当短位数浮点数据类型为16位浮点数据类型时,上述长位数浮点数据类型可为32位浮点数据类型或64位浮点数据类型;当上述短位数浮点数据类型为32位浮点数据类型时,上述长位数浮点数据类型为64位浮点数据类型。
通过将正向运算的数据用短位数定点表示,充分利用了短位数定点数据类型所能表示的数据范围空间,相对于长位数浮点数据类型表示,极大地减少了存储网络参数所需的空间,优化了硬件的面积功耗比。
本申请包括用于片上重复数据寻址的装置及该装置的调度使用方法。在如图6A所示的计算装置内，如该存储介质为存储器时，数据访问单元与存储器之间的数据调用方法可以采用片上重复数据寻址的装置及该装置的调度使用方法，该方法还可以应用在如图26、图28、图30所示的装置内。该方法针对重复数据高效地进行读写，可以有效地实现片上重复寻址，同时支持片上片外数据交换，通过数据和地址划分，片上数据重复寻址空间可以被扩展到片外地址空间。本申请能够降低访存带宽需求，同时提供良好的灵活性，从而降低片上存储开销，而且能够适用于不同场景，并不仅仅局限于机器学习类处理器。
本申请同时可通过合理调度数据,缩减片上缓存开销,从而可提供更加高效的处理器设计支持。合理调度数据不仅仅指数据替换策略,也包括对于计算的划分,重新安排计算顺序,使得集中访问的数据可被安排在相同的数据块中。本申请为异构环境下利用片上重复寻址用于降低访存带宽,涉及存储单元、寻址单元的实施和调度。
图70是优选实施例的总体结构的示例框图。如图70所示的实施例,在实际应用中,还可以包括如图6A所示的互联模块以及运算单元,该运算单元包括多个计算器。对于如图70所示的总体结构,举例说明,对于异构平台来说,处理器的片上存储器20能够存储的数据十分有限,通常来讲片上有限的资源限制了将所有数据放置在片上的可能性,所以将大存储器(廉价,速度稍慢)放在片外,小存储器(昂贵,速度快)集成在片上,需要将所有的数据划分成为大小可以存储在片上存储器20的数据块,通过存储容量大的片外存储介质10和存储容量小的片上存储介质20上的数据交互将所需数据块读入或者写出。其间,片内地址索引单元40将片内数据地址按需提供给片上处理单元30。本申请的存储器并不限定,可以是静态随机存储器(Static Random Access Memory,SRAM),动态随机存储器(Dynamic Random Access Memory,DRAM),增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,eDRAM),寄存器堆(Register file,RF)等常见存储介质,也可是新型的存储器件,如非易失存储器(Non-Volatile Memory,NVM)或3D存储器件等。
本申请提供一种片上重复寻址的方法,是一种当总数据过大,大于片上存储介质20的存储容量时所使用的数据管理策略,从而可以把片外的数据读取至片内进行快速重复寻址,当然,也可以实现片外重复寻址,然而高效的做法是将集中访问的数据放在一起,一次搬至片内,然后直接在片内快速寻址。该方法包括:
数据划分步骤,根据预定的数据划分原则将片上存储介质和/或片外存储介质的数据划分为不同的数据块,所述数据划分原则包括将重用距离低于预定距离阈值的数据划分在同一个数据块。重用距离指的是一个数据两次使用的距离,距离是指访存次数,重用距离近的数据在运行短期内就会被访问,也即就有很强的时间上的相关性。这些数据划分在同一数据块上可以一次载入片内存储然后使用尽可能多的次数,从而访存更加高效。在每个数据块中,数据则按照预定的规则存储介质内,例如,顺序存储。
数据索引步骤,根据预定的替换策略的顺序关系,依次载入不同的所述数据块到至少一个片上处理单元,被载入的所述数据块中的重复数据在片内重复寻址。该数据块里的数据可在片内直接重复寻址,避免从片外存储或IO多次读写(速度慢,功耗高)。采用有效的数据划分原则,从而使得上述替换发生次数尽可能的少(有效的数据划分原则可减少替换次数,有效的数据替换策略在此基础上可进一步减少替换次数)。优选的是,图71所示即为数据地址划分图,所述数据的索引地址50包括数据块地址51与块内地址52;即每个数据的地址为当前数据块地址51与块内地址52拼接而成。将数据划分成为合理的数据块后,通过将地址划分成为片内和片外使得片内重复寻址更加高效。地址索引所采用的技术并不局限于简单的数据索引,也包括codebook(码本)等划分实施方案。
所述数据索引步骤包括:根据所述替换策略的顺序关系和数据块地址51,依次载入不同的所述数据块到至少一个片上处理单元30,被载入的所述数据块中的重复数据在片内重复寻址,当所述数据块的块内地址52全部索引完成后才替换新的数据块,直至没有数据块被需要载入为止。在数据块内进行索引时,只有数据的块内地址52有用,则索引的硬件单元不需要使用数据块地址51,然而数据块地址51仍然需要记录从而可以被后续使用。
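数据块地址与块内地址的拼接及拆分可以用如下Python示意（假设性实现，块内地址位宽 BLOCK_BITS 为示例参数）：

```python
# 数据的索引地址由数据块地址与块内地址拼接而成；块内索引时仅块内地址
# 参与硬件索引，数据块地址被记录以供后续替换数据块时使用。

BLOCK_BITS = 8                                 # 假设每个数据块含 2^8 个数据

def split_address(index_addr):
    block_addr = index_addr >> BLOCK_BITS              # 高位：数据块地址
    inner_addr = index_addr & ((1 << BLOCK_BITS) - 1)  # 低位：块内地址
    return block_addr, inner_addr

def join_address(block_addr, inner_addr):
    return (block_addr << BLOCK_BITS) | inner_addr

b, i = split_address(0x1A3)
print(b, i, hex(join_address(b, i)))
```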
优选的是,片上存储介质20与片上处理单元30通过片内数据通路进行数据交换;片上存储介质20与片外存储介质10通过片内外数据通路进行数据交换,片上存储介质20或片外存储介质10至少一次从内部或外部进行读写;所述数据以数据块为单位在片上存储介质20、片外存储介质10和/或片上处理单元30两两之间搬运。
优选的是,所述数据块的数据量小于片上存储介质20的容量,优选能够被其整除。
优选的是,片上存储介质20采用读写端口分离设计,从而使得数据的读出和写入相互独立,可以同时进行。
优选的是,所述方法应用于学习类处理器。
优选的是,所述方法应用于异构环境。
优选的是,片上处理单元30为片上运算模块,所述根据预定条件选取数据,满足所述预定条件的所述数据被划分在相同的所述数据块中。具体的是,所述预定条件包括简单划分条件、平均为预定数目的数据块条件、与不同输出神经元相关条件或者满足预定数学关系条件。这些是针对不同情况下具体的数据划分准则,仍在数据划分原则限定的范围内。
如图72所示为一个优选实施例的数据划分示意图。以常见的神经网络为例（向量运算），不同输出神经元所需的权值数据存储在不同的数据块，运算时，需要在不同的时刻载入不同的数据块进行索引。输入神经元的值被复用，计算两个输出神经元用的是同样的输入。在计算输出神经元的时候需要载入相关的权值，计算完成后这部分权值就完全不需要了。其中相同输入神经元的值只存了一份，也即计算时需要重复寻址；相同的权值也只存了一份，也需要重复寻址获得。
如图73所示为一个优选实施例的数据划分示意图。同样以常见的神经网络为例(向量运算),满足指定条件的权值连接被划分存储在同样的数据块中,如实线权值连接和虚线权值连接。在不同的时刻,不同数据块被载入,运算单元根据指定条件选取数据,如所有的输出神经元先计算与实线权值连接的相关计算,在数据块替换后再计算与虚线权值连接的相关计算。
优选的是,所述替换策略包括顺序替换、逆序替换或者乱序替换;图74所示即为一个优选实施例的替换策略示意图,数据被划分成为不同的数据块,在不同时刻,根据不同的替换策略载入不同的数据块。如顺序替换,数据块按照#1、#2、#3以此类推的顺序载入;逆序替换,数据块按照#N、#(N-1)、#(N-2)的顺序载入;乱序替换,则根据指定的顺序读入数据块。或者,所述替换策略包括数据写回,在数据处理完成后将最终结果或中间结果写回所述片上存储介质、所述片外存储介质和/或所述片上处理单元。不同的替换策略应当考虑到数据的一致性。
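顺序、逆序与乱序三种替换策略下数据块的载入顺序，可以用如下Python示意（假设性实现，乱序此处用带种子的随机打乱代表"指定的顺序"）：

```python
# 生成三种替换策略下数据块的载入顺序：#1..#N。
import random

def load_order(num_blocks, strategy, seed=0):
    order = list(range(1, num_blocks + 1))
    if strategy == 'sequential':               # 顺序替换：#1, #2, #3, ...
        return order
    if strategy == 'reverse':                  # 逆序替换：#N, #(N-1), ...
        return order[::-1]
    if strategy == 'shuffle':                  # 乱序替换：按指定顺序读入
        random.Random(seed).shuffle(order)
        return order
    raise ValueError(strategy)

print(load_order(4, 'sequential'), load_order(4, 'reverse'))
```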
本申请相应提供一种实现片上重复寻址的方法的装置,该装置包括:
数据划分模块,用于根据预定的数据划分原则将片上存储介质和/或片外存储介质的数据划分为不同的数据块,所述数据划分原则包括将重用距离低于预定距离阈值的数据划分在同一个数据块;
数据索引模块,用于根据预定的替换策略的顺序关系,依次载入不同的所述数据块到至少一个片上处理单元,被载入的所述数据块中的重复数据在片内重复寻址。
优选的是,所述数据的索引地址包括数据块地址与块内地址;
所述数据索引模块用于根据所述替换策略的顺序关系和所述数据块地址,依次载入不同的所述数据块到至少一个所述片上处理单元,被载入的所述数据块中的重复数据在片内重复寻址,当所述数据块的所述块内地址全部索引完成后才替换新的数据块,直至没有数据块被需要载入为止。
优选的是,所述片上存储介质与所述片上处理单元通过片内数据通路进行数据交换;
所述片上存储介质与所述片外存储介质通过片内外数据通路进行数据交换,所述片上存储介质或所述片外存储介质至少一次从内部或外部进行读写;所述数据以数据块为单位在所述片上存储介质、所述片外存储介质和/或所述片上处理单元两两之间搬运。
优选的是,所述数据块的数据量小于所述片上存储介质的容量。
优选的是,所述片上存储介质采用读写端口分离设计。
优选的是,所述装置应用于学习类处理器。
优选的是,所述装置应用于异构环境。
优选的是,所述片上处理单元为片上运算模块,所述根据预定条件选取数据,满足所述预定条件的所述数据被划分在相同的所述数据块中。
优选的是,所述预定条件包括简单划分条件、平均为预定数目的数据块条件、与不同输出神经元相关条件或者满足预定数学关系条件。
优选的是,所述替换策略包括顺序替换、逆序替换或者乱序替换;或者
所述替换策略包括数据写回,在数据处理完成后将最终结果或中间结果写回所述片上存储介质、所述片外存储介质和/或所述片上处理单元。
图75所示即为一个优选实施例的利用片上数据重复寻址降低访存带宽需求装置使用的流程图。开始计算后,
步骤S101,数据按照数据划分原则划分成为不同的数据块。
步骤S102,将数据块载入片上存储介质20。在某一时刻,只有一块数据块被载入片上存储介质20用于片上计算,根据不同的替换策略,不同的数据块按不同的顺序被载入用于运算。
步骤S103,对获取的数据进行片上计算。
步骤S104,判断是否所有的计算完毕没有数据块需要再次载入,如果是则全部计算结束,否则,回到步骤S102。
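步骤S101~S104的流程可以用如下Python示意（假设性实现：以列表切片模拟数据块划分，以回调函数模拟片上计算）：

```python
# S101~S104 的最简示意：数据按划分原则分块，每次只载入一个数据块到片上存储，
# 片上计算完成后再载入下一块，直至没有数据块需要载入。

def run_on_chip(data, block_size, compute):
    blocks = [data[i:i + block_size]           # S101: 按数据划分原则划分数据块
              for i in range(0, len(data), block_size)]
    results = []
    for block in blocks:                       # S102: 逐块载入片上存储介质
        results.append(compute(block))         # S103: 对获取的数据进行片上计算
    return results                             # S104: 无数据块需载入时全部计算结束

print(run_on_chip(list(range(8)), 4, sum))
```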
图76所示即为一个优选实施例的计算单元根据地址进行重复寻址的框图。根据地址索引,存储于地址DA的数据被计算单元#0、#2、#4所需要,则实施例索引至地址DA,并将DA中的数据传播给所需的计算单元,即#0、#2和#4。这个例子中,三个计算单元所需要的数据因为是一样的,所以在片上只存储了一份,也即同一个数据要被重复寻址三次。图76中数据传递给片上计算单元的方式并不局限于BUS总线的连接方式,也包括Crossbar结构、FAT-TREE、H-TREE等其他连接方式。
综上所述,本申请将重用距离小于预定的距离阈值的数据划分在同一个数据块,重用距离指的是一个数据两次使用的距离,距离是指访存次数,重用距离近的数据在运行短期内就会被访问,也即就有很强的时间上的相关性。这些数据划分在同一数据块上可以一次载入片内存储然后使用尽可能多的次数,从而访存更加高效。本申请旨在利用片上的重复寻址用于降低访存带宽。本申请的装置及其相关使用方法可以有效的提供数据的复用性和其灵活寻址的需求,能够适用于不同场景,并不仅仅局限于机器学习类处理器。
现有异构平台,处理器的片上能够存储的数据十分有限,需要将所有的数据划分成为大小可以存储在片上的数据块,通过片外大存储介质和片内小存储介质上的数据交互将所需数据块读入或者写出。
为了实现上述目的,图77示出了本申请提供一种片上数据划分读写系统100,如图77所示的片上数据划分读写系统可以应用到如图6A、图26,图28,图30所示的装置内,如图6A所示的计算装置的存储器如为片外存储系统,则如图6A所示的计算中可以包括如图77所示的片上数据划分读写系统。该系统包括:
数据划分模块10,用于根据数据划分策略将片内存储数据划分在不同区域,分别存储在片内存储器和片外存储器;
预先操作模块20,用于在进行数据拼接时预先对片内存储数据的片内地址索引进行操作处理;
数据拼接模块30,用于根据数据拼接策略将片内存储数据和片外输入数据拼接得到所述原始数据表示。
对于异构平台来说,处理器的片上能够存储的数据十分有限,需要将所有的数据划分成为大小可以存储在片上的数据块,通过片外大存储器和片内小存储器上的数据交互将所需数据块读入或者写出。其间,片内数据地址通过片内地址索引按需提供给片上计算单元(如图6A所示的运算单元),物理框架如图81所示;图78和图79A、图79B所示的实施例划分只为本申请所涉及的典型情况,本申请并不局限于特定的数据划分,极端情况如数据全部被在片上,或者数据全部被划分在片外,也在本申请的实现范围之内。
进一步地,本申请所述片上数据划分读写系统100,还包括:
存储模块40,用于存储搬运所述片内存储介质的所述片内存储数据和来自所述片外存储介质的所述片外输入数据;
所述存储模块40采用读写端口分离,数据的读出和写入相互独立;
所述预先操作模块20还包括：
片上处理子模块21,用于运算处理所述片内存储数据;
片外处理子模块22，用于运算处理外部输入数据，所述外部输入数据包括所述片外输入数据、所述读写端口直接读入的数据。
进一步地,存储模块40还包括:
地址索引接口41,用于根据片内地址索引来索引所述片内存储数据;
数据读出接口42,用于已索引到所述片内存储数据的输出出口;
数据写入接口43,用于将要存储的数据根据写入地址写入相应存储位置。
所述片上数据划分读写系统100,优选的是数据划分模块10还包括:
地址划分子模块11,用于地址空间划分成为片外数据空间和片内数据空间;
数据替换子模块12,用于根据数据替换策略在所述片内存储介质和片外存储介质之间进行数据替换;所述数据替换策略包括顺序替换、逆序替换以及随机替换;
所述数据划分策略包括定点数划分、浮点数划分。作为典型，如图79A所示即为一个定点数实施例的数据划分，这种划分将定点数据划分成为整数部分和小数部分；图79B所示为一个浮点数实施例的数据划分，这种划分将浮点数划分成为指数部分和小数部分。图79A和图79B所示的实施例划分只为本申请所涉及的典型情况，本申请并不局限于特定的数据划分，极端情况，如数据全部被划分在片上，或者数据全部被划分在片外，片上的缓存结构包括对输入数据的缓存，也在本申请的设计范围之内。地址划分子模块11将索引的地址空间划分对应到片外数据空间和片内数据空间，有需要的时候通过数据替换子模块12进行交换，将需要加速处理的数据转移到片内。数据划分模块10基于芯片中的一个或多个片上计算单元实现，所述片上计算单元发起读写请求并处理拼接得到的原始数据。
所述数据拼接模块30还包括:
索引拼接子模块31,用于片内片外数据传输的形式从原始数据表示转为全部或者部分的数据索引,拼接全部或者部分的片上的所述数据索引的结果获得所述原始数据表示;
所述数据拼接模块30读写通过片内片外数据通路或片内数据通路进行,所述片内片外数据通路包括PCI(Peripheral Component Interconnect,外部控制器接口)、PCIE(总线和接口标准,Peripheral Component Interface Express)、HT互联技术(Hyper Transport,超传输,是一种全新的具有可升级性的新型、高速、高性能的端到端集成电路互联总线技术),所述片内数据通路包括FAT-TREE、H-TREE互联技术(hierarchy tree,层次树),片内片外数据连接方式包括多芯片互联结构;图77所示的片内片外数据连接并不局限于PCIE总线连接,也包涵多芯片互联结构如片上网络。图77所示的片上计算单元与片内存储器的数据通路不局限于H-TREE,或者FAT-TREE等互联技术,通过片内片外数据通路可以在片外寻址,从而所述片上数据划分读写系统100可以对准确无误地将各种需要拼接的数据还原成原始数据,可以有效的支持不同的数据划分策略,从而减少片内片外数据交换。
所述片内存储器或所述片外存储器中的所述数据被一次或者多次读写，所述数据被读至一个或者多个片上运算单元；所述片内存储器或所述片外存储器被一次或者多次从外部进行读写，所述片内存储器被一次或者多次从内部读写。
图83是本申请所述片上数据划分读写方法的一个具体实施例的流程图，其可通过本申请所述片上数据划分读写系统100实现。如图83所示，所述片上数据划分读写方法包括：
步骤S701,数据划分步骤,根据数据划分策略将片上数据存储在不同区域,分别存储在片内存储器和片外存储器;
步骤S702,预先操作步骤,在进行数据拼接时预先对片内存储数据的片内地址索引进行操作处理;
步骤S703,数据拼接步骤,根据数据拼接策略将所述片内存储数据和片外输入数据拼接得到原始数据表示。
分别通过数据划分模块10、预先操作模块20和数据拼接模块30实现,将原始数据在片内进行无损恢复。
其中优选的,本申请所述片上数据划分读写方法需要实现对于存储的管理,实现拼接过程需要存储模块40的支持,所述数据划分读写方法还包括:
数据存储步骤,存储搬运所述片内存储介质的所述片内存储数据和来自所述片外存储介质的所述片外输入数据;所述存储步骤中读写端口分离,数据的读出和写入相互独立;具体地,所述数据存储步骤还包括:
第一、根据片内地址索引来索引所述片内存储数据;
第二、将已索引到数据的输出出口;
第三、将要存储的数据根据写入地址写入相应存储位置;
读写时分别由地址索引接口41、数据读出接口42、数据写入接口43提供支持，与片内片外数据通路和片内数据通路配合实现模块内外的数据通信，独立的读写接口可以实现同时读写。片上数据根据片内地址索引，该片内地址索引有可能经过预先操作模块20一定的操作（如地址偏移计算），检索片内存储得到片内存储数据，结合外部输入至片内的数据，经过拼接操作，得到最后的完整数据。
在一个具体实施例中，本申请所述片上数据划分读写方法的一个优选实施例的流程图如图84所示，所述片上数据划分读写方法步骤包括：
步骤S801,地址空间划分成为片外数据空间和片内数据空间;
步骤S802,根据数据替换策略在所述片内存储器和片外存储器之间进行数据替换;所述数据替换策略包括顺序替换、逆序替换以及随机替换;所述数据划分策略包括定点数划分、浮点数划分;
步骤S803,运算处理所述片内存储数据;
步骤S804，运算处理外部输入数据，所述外部输入数据包括所述片外输入数据、所述读写端口直接读入的数据。
步骤S805,片内片外数据传输的形式从所述原始数据表示转为全部或者部分的数据索引,拼接全部或者部分的片上的所述数据索引的结果获得所述原始数据表示。
经过处理过后的片内存储数据和片外输入数据拼接在一起,然后才能交由后续的模块进行原始数据的处理,实现处理器的功能。
进一步地,为便于理解,下面以图80~图82所示的一个具体实施例的物理设计框架图进行说明。
对于异构平台来说,处理器的片上能够存储的数据十分有限,需要将所有的数据划分成为大小可以存储在片上的数据块,通过片外大存储器(即片外存储器)和片内小存储器(即片内存储器)上的数据交互将所需数据块读入或者写出,在数据块大小上有区分,因而划分并存储在不同区域,根据容量需求不同增设所述片外存储介质。其间,片内数据地址通过片内地址索引按需提供给片上计算单元,如图82通过片内地址索引接口41获取索引以及得到索引对应的数据,图80所示即为一个实施例的片上数据索引过程,装置根据8-bit地址索引256个存储位置,得到32-bit的数据,并不局限于图示的地址索引位宽和片上数据存储位宽。流程的实现在硬件上还依赖于片内存储器、片外存储器、片内片外数据通路以及片内数据通路之间的相互通信。
如图82所示即为一个实施例的数据拼接操作过程,片内存储数据,图示为32bit位宽,经过片上数据处理子模块31处理,图示为32bit位宽。片上数据处理子模块31并不局限于寻址操作,也包括其他运算,如算术计算。片外输入数据,图示为32bit位宽,经过片外数据处理子模块32处理,图示为32bit位宽。处理过后的片内存储数据和片外输入数据拼接在一起,图示为64bit位宽,输送给后续模块处理,如片上计算单元,经过处理的片内存储数据和片外输入数据并不局限于图示的位宽,数据块并不局限于特定的数据位宽,数据处理并不局限于特定的操作,而可能包涵复杂的操作,不仅是简单的拼接,而包涵其他操作处理。
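上述"片内32bit数据与片外输入32bit数据拼接为64bit"的操作，可以用如下Python示意（假设性实现：高低位的安排为本示例的假设，实际装置的拼接方式由数据拼接策略决定）：

```python
# 片内 32bit 字与片外 32bit 字拼接为 64bit 字，并可无损还原原始数据表示。

def splice(on_chip_word, off_chip_word):
    """示例约定：片内数据作高 32 位、片外数据作低 32 位拼接"""
    assert 0 <= on_chip_word < 2 ** 32 and 0 <= off_chip_word < 2 ** 32
    return (on_chip_word << 32) | off_chip_word

def unsplice(word64):
    """拆回片内、片外两部分，验证拼接是无损的"""
    return word64 >> 32, word64 & 0xFFFFFFFF

w = splice(0x12345678, 0x9ABCDEF0)
print(hex(w), unsplice(w) == (0x12345678, 0x9ABCDEF0))
```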
具体地,所述数据拼接步骤通过片内片外数据通路或片内数据通路进行,尤其所述片内片外数据通路包括PCI、PCIE、HT互联技术,实现内部与片外之间的数据流,所述片内数据通路包括FAT-TREE、H-TREE互联技术,片内片外数据连接方式包括多芯片互联结构,如片上网络。
所述片内存储器或所述片外存储器中的所述数据可以被一次或者多次读写，所述数据可以被读至一个或者多个片上运算单元；所述片内存储介质或所述片外存储介质可以被一次或者多次从外部进行读写，介质可以被一次或者多次从内部读写。
本申请提供一种片上读写装置,包括所述片上数据划分读写系统100,所述片上读写装置包括片内存储介质、片外存储介质、片内片外数据通路和片内数据通路,所述片上读写装置优选的是,还包括了静态随机存储器(Static Random Access Memory,SRAM),动态随机存储器(Dynamic Random Access Memory,DRAM),增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,eDRAM),寄存器堆(Registerfile,RF)等常见存储介质,也可以是新型的存储器件,如非易失存储器(Non-Volatile Memory,NVM)或者3D存储器件等等。
本申请将数据表示转换到索引,可以高效的进行片上地址空间内的重复寻址,也可以进行片外地址寻址;异构环境下片上重复寻址的装置及其使用策略,不同于直接对数据本身缓存进行加速,硬件支持需要包含片内存储介质,片外存储介质,地址索引器件,片内片外数据通路,片内数据通路。
最后,本申请旨在用于不同的数据划分的策略、装置和方法,根据不同的划分策略,数据被划分成为不同的部分,本申请中的装置支持不同划分策略的装置。
综上所述,本申请的装置及其相关使用方法可以有效的提供数据的复用性和其灵活寻址的需求,有效的降低访存带宽需求,能够适用于不同场景,并不仅仅局限于机器学习类处理器。本申请同时可以通过合理调度数据,缩减片上缓存开销,从而可以提供更加高效的处理器设计支持。
参阅图85，图85提供了一种基于多处理器协同的用于神经网络算法的推理和训练的运算系统，该系统可以包括n个处理器（n为大于等于2的整数）、互联装置以及存储器。其中，n个处理器分别可以是神经网络处理器、GPU、CPU、FPGA、DSP等任意具备计算部分神经网络算法能力的设备，当然在实际应用中，上述神经网络处理器还可以为本申请中的专用处理器、计算装置等等，具体的可以参见如图6-图84的实施例的描述；互联装置用于连接各处理器，负责各个处理器之间的通信以及数据传输，其连接方式可以是通过各种片上互连技术（如总线、光互连等），也可以是通过SoC集成的方式连接等；存储器用于存储神经网络的输入、输出数据，训练的模型参数，运算过程中产生的各种中间数据，以及各处理器所需的计算指令。
互联模块可以使用但不限于环状,树状,交叉开关(crossbar),mesh,或者torus等拓扑结构。
不同处理器之间的连接方式,以及存储方式不局限于一种,即系统中可能存在大于一种的互连装置或存储器。
参阅图85，图85中的处理器可以为一种用于执行人工神经网络正向运算的装置，该执行人工神经网络正向运算的装置的具体结构可以为如图6A所示的计算装置的结构，当然在实际应用中，该装置还可以包括指令缓存单元、控制器单元、直接内存访问单元、树型模块、主运算模块、以及多个从运算模块，其中：指令缓存单元用于通过直接内存访问单元读入指令并缓存读入的指令；控制器单元用于从指令缓存单元读取指令，并将该指令译码成控制树型模块、主运算模块、以及从运算模块行为的微指令；直接内存访问单元用于从外部地址空间向主运算模块和各从运算模块的相应数据缓存单元中写数据或从所述数据缓存单元向外部地址空间读数据；树型模块用于，在每层神经网络反向开始计算的阶段，主运算模块通过树型模块向所有的从运算模块传输本层的输入神经元向量，在从计算模块的计算过程完成后，树型模块逐级将各从计算模块的输出神经元值拼成中间结果向量；主运算模块用于利用中间结果向量完成后续计算。
该人工神经网络正向运算装置作为一种计算型处理器,可以和其他类型的处理器(如GPU,CPU)结合在一起组成一种新的神经网络任务处理系统。
图86A、86B显示了一种可能的实施方案。图86A中，包含三个模块：控制模块，包含了如CPU的控制处理器，用于进行逻辑控制，生成指令，以及调用其他的处理器；其次，正向处理模块，包含n个（n大于等于1）正向计算模块（人工神经网络专用正向计算装置），用于神经网络正向的计算；以及，m个（m大于等于1）反向计算模块（使用通用处理器，比如GPU/DSP/FPGA等）用于进行神经网络的反向计算。控制模块和计算模块之间通过互联装置1进行连接和通信，正向处理模块和反向处理模块之间通过互联装置2进行连接和通信。
或者正向计算模块和反向计算模块使用人工神经网络专业处理器,权值更新使用通用处理器,比如GPU、DSP或FPGA。
图86B中展示了一种当n=1,m=1时的多处理器协同装置,其中包括了CPU,神经网络处理器,以及GPU三个处理器。该装置可以用于进行神经网络的推理和训练。
图87为一种更具体的，用于神经网络的训练和推理的多处理器协同装置。其中，1为控制模块，用于控制整个执行过程，包含控制处理器，常见情况下是CPU；3为正向处理模块，其中包含了n个用于进行正向计算的正向处理模块，用于进行训练和推理过程中的正向神经元的计算，常见情况下为人工神经网络正向运算装置；2为反向处理模块，包含m个反向计算模块，包括了反向处理器，常见情况下为GPU/FPGA/DSP，用于进行训练过程中的反向梯度传递，和权值更新的操作；5为存储器，正向处理模块从存储单元1中获取数据，包括神经元，权值等，控制处理器从存储单元3中获得数据，包括指令，网络模型等，反向处理器从存储单元2中获得数据，包括目标标签，权值，梯度等。
正向计算模块之间通过互联模块1进行连接,反向计算模块之间通过互连模块2进行连接。控制模块则通过互联模块3连接正向处理模块和反向处理模块进行通信。
图88是图87装置的变换。由于神经网络算法中,反向计算中需要用到的神经元,突触,偏置数据是正向过程计算出来的,如果将正向数据和反向数据分开存储会导致额外的数据传输开销,即反向计算开始之前,数据要从正向处理模块传输到反向处理模块可以访问的存储单元中,导致整体处理速度下降,功率增加。因此,我们设计一种正向处理模块和反向处理模块共享同一存储单元的装置。其中,正向处理模块和反向处理模块在运算过程中所需要的数据(包括输入原始数据,神经元,突触,梯度,标签等)都存放在存储单元1中。存储单元1的介质可以是之前所述的类型。
图89是另一种存储器组织结构。其中,控制模块,正向处理模块和反向处理模块共享同一个存储单元1。这样的好处是,省去了从控制处理器(CPU)存储器移动数据到其他处理器存储器的过程。
图89示出本公开中提出的人工神经网络正向处理模块的整体结构的示例框图。如图89所示,该装置包括指令缓存单元1、控制器单元2、直接内存访问单元3、树型模块4、主运算模块5和多个从运算模块6。指令缓存单元1、控制器单元2、直接内存访问单元3、树型模块4、主运算模块5和从运算模块6均可以通过硬件电路(例如专用集成电路ASIC)实现。
指令缓存单元1通过直接内存访问单元3读入指令并缓存读入的指令。
控制器单元2从指令缓存单元1中读取指令,将指令译成控制其他模块行为的微指令,所述其他模块例如直接内存访问单元3、主运算模块5和从运算模块6等。
直接内存访问单元3能够访存外部地址空间,直接向装置内部的各个缓存单元读写数据,完成数据的加载和存储。
如图90所示的系统可以包括：控制模块1，存储单元模块2，互联模块3，神经网络计算模块4。控制模块一般为CPU，存储单元1是其内存；神经网络计算模块为若干神经网络处理器组成的计算模块，用于处理任务中的神经网络算法的计算，如卷积、pooling或上述神经网络专用指令中的一种或多种等。控制处理器和神经网络计算模块的连接和通信通过互连模块2实现；神经网络计算模块中各处理器之间通过互连模块1进行连接和通信；神经网络计算模块从存储单元2中读取计算所需要的数据（权值、输入数据等）。
本申请通过设置多种类、多个处理器，保证神经网络处理装置的灵活性、高效性以及可扩展性：既可以高效完成朴素的神经网络算法，也可以通过多处理器的协作完成如目标识别这类复杂任务。通过将不同特点的计算任务划分给不同的处理器，可以在让神经网络处理器发挥出其最大效率的同时，保证装置的可扩展性、兼容性，以及保证计算精度和计算效率。上述如图85、图86A、图86B、图87、图88、图89、图90的结构可以应用到任何的神经网络计算指令的计算中或神经网络应用中。本申请并不限制上述结构的应用场景，另外，对于不同的神经网络计算指令的执行可能需要添加或扩展其他的功能模块，本申请也不限于添加或扩展的其他的功能模块的具体形式，例如，扩展的功能模块可以为如图6A中的模块或单元。
本申请一些实施例中,公开了一种加速装置,包括:存储器:存储有可执行指令;处理器:用于执行存储单元中的可执行指令,在执行指令时依照上述处理方法进行操作。
其中,处理器可以是单个处理单元,但也可以包括两个或更多个处理单元。另外,处理器还可以包括通用处理器(CPU)或者图形处理器(GPU);还可以包括在现场可编程逻辑门阵列(FPGA)或者专用集成电路(ASIC),以对神经网络进行设置和运算。处理器还可以包括用于缓存用途的片上存储器(即包括处理装置中的存储器)。
在一些实施例里,公开了一种芯片,其包括了上述神经网络处理器。
在一些实施例里,公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,公开了一种板卡,其包括了上述芯片封装结构。
在一些实施例里,公开了一种电子装置,其包括了上述板卡。
电子装置包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器(英文:Read-Only Memory,简称:ROM)、随机存取器(英文:Random Access Memory,简称:RAM)、磁盘或光盘等。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (16)

  1. 一种计算方法,其特征在于,所述方法应用于计算装置内,所述计算装置包括:存储器、寄存器单元和矩阵计算单元;所述方法包括如下步骤:
    所述计算装置控制所述矩阵计算单元获取第一运算指令,所述第一运算指令包括执行所述指令所需的矩阵读取指示,所述所需的矩阵为至少一个矩阵,所述至少一个矩阵为长度相同的矩阵或长度不相同的矩阵;
    所述计算装置控制所述运算单元依据所述矩阵读取指示向所述存储器发送读取命令;
    所述计算装置控制所述运算单元采用批量读取方式读取所述矩阵读取指示对应的矩阵，对该矩阵执行所述第一运算指令。
  2. 根据权利要求1所述的方法,其特征在于,所述矩阵读取指示包括:所述指令所需的矩阵的存储地址或所述指令所需矩阵的标识。
  3. 根据权利要求2所述的方法,其特征在于,如所述矩阵读取指示为所述指令所需矩阵的标识时,所述计算装置控制所述运算单元依据所述矩阵读取指示向所述存储器发送读取命令包括:
    所述计算装置控制所述运算单元依据所述标识从所述寄存器单元中采用单位读取方式读取所述标识对应的存储地址，所述计算装置控制所述运算单元向所述存储器发送读取所述存储地址的读取命令并采用批量读取方式获取所述矩阵。
  4. 根据权利要求1-3任意一项所述的方法,其特征在于,所述对该矩阵执行所述第一运算指令包括:
    所述计算装置控制所述运算单元对该矩阵执行第一流水级的计算得到第一结果,将第一结果输入到第二流水级执行第二流水级得到第二结果,将所述第二结果输入到第三流水级执行第三流水级得到第三结果,将所述第三结果输入到所述存储器进行存储。
  5. 根据权利要求1-3任意一项所述的方法,其特征在于,所述计算装置还包括:缓存单元,所述方法还包括:
    所述计算装置将待执行的运算指令缓存于所述缓存单元内。
  6. 根据权利要求1-3任意一项所述的方法,其特征在于,所述方法在所述计算装置控制所述矩阵计算单元获取第一运算指令之前还包括:
    所述计算装置确定所述第一运算指令与所述第一运算指令之前的第二运算指令是否存在关联关系，如所述第一运算指令与所述第二运算指令存在关联关系，则将所述第一运算指令缓存于所述缓存单元内，在所述第二运算指令执行完毕后，从所述缓存单元提取所述第一运算指令传输至所述运算单元；
    所述确定该第一运算指令与第一运算指令之前的第二运算指令是否存在关联关系包括:
    依据所述第一运算指令提取所述第一运算指令中所需矩阵的第一存储地址区间,依据所述第二运算指令提取所述第二运算指令中所需矩阵的第二存储地址区间,如所述第一存储地址区间与所述第二存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第二运算指令具有关联关系,如所述第一存储地址区间与所述第二存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第二运算指令不具有关联关系。
  7. 根据权利要求1-3任意一项所述的方法,其特征在于,
    所述矩阵为m*n矩阵、1*n矩阵或m*1矩阵,其中m、n为大于等于2的整数。
  8. 一种计算装置,其特征在于,所述计算装置包括:存储器、寄存器单元、矩阵计算单元和控制单元;
    所述存储器,用于存储矩阵;
    所述寄存器单元,用于存储标量数据,所述标量数据至少包括:所述矩阵在所述存储器内的存储地址;
    所述控制单元,用于控制所述矩阵计算单元获取第一运算指令,所述第一运算指令包括执行所述指令所需的矩阵读取指示,所述所需的矩阵为至少一个矩阵,所述至少一个矩阵为长度相同的矩阵或长度不相同的矩阵;
    所述运算单元，用于依据所述矩阵读取指示向所述存储器发送读取命令；采用批量读取方式读取所述矩阵读取指示对应的矩阵，对该矩阵执行所述第一运算指令。
  9. 根据权利要求8所述的计算装置,其特征在于,所述矩阵读取指示包括:所述指令所需的矩阵的存储地址或所述指令所需矩阵的标识。
  10. 根据权利要求8所述的计算装置,其特征在于,如所述矩阵读取指示为所述指令所需矩阵的标识时,
    所述控制单元，用于控制所述运算单元依据所述标识从所述寄存器单元中采用单位读取方式读取所述标识对应的存储地址，控制所述运算单元向所述存储器发送读取所述存储地址的读取命令并采用批量读取方式获取所述矩阵。
  11. 根据权利要求8-10任意一项所述的计算装置,其特征在于,
    所述运算单元,具体用于对该矩阵执行第一流水级的计算得到第一结果,将第一结果输入到第二流水级执行第二流水级得到第二结果,将所述第二结果输入到第三流水级执行第三流水级得到第三结果,将所述第三结果输入到所述存储器进行存储。
  12. 根据权利要求8-10任意一项所述的计算装置,其特征在于,所述计算装置还包括:
    缓存单元,用于缓存待执行的运算指令;
    所述控制单元,用于将待执行的运算指令缓存于所述缓存单元内。
  13. 根据权利要求8-10任意一项所述的计算装置,其特征在于,
    所述控制单元，用于确定所述第一运算指令与所述第一运算指令之前的第二运算指令是否存在关联关系，如所述第一运算指令与所述第二运算指令存在关联关系，则将所述第一运算指令缓存于所述缓存单元内，在所述第二运算指令执行完毕后，从所述缓存单元提取所述第一运算指令传输至所述运算单元；
    所述确定该第一运算指令与第一运算指令之前的第二运算指令是否存在关联关系包括:
    依据所述第一运算指令提取所述第一运算指令中所需矩阵的第一存储地址区间,依据所述第二运算指令提取所述第二运算指令中所需矩阵的第二存储地址区间,如所述第一存储地址区间与所述第二存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第二运算指令具有关联关系,如所述第一存储地址区间与所述第二存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第二运算指令不具有关联关系。
  14. 根据权利要求8-10任意一项所述的计算装置,其特征在于,
    所述矩阵为m*n矩阵、1*n矩阵或m*1矩阵,其中m、n为大于等于2的整数。
  15. 根据权利要求8-10任意一项所述的计算装置,其特征在于,
    所述存储器为高速暂存存储器。
  16. 一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如权利要求1-7任一项所述的方法。
WO2020073923A1 (zh) * 2018-10-09 2020-04-16 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111222633A (zh) * 2018-11-23 2020-06-02 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111353125B (zh) * 2018-12-20 2022-04-22 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111290788B (zh) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111047030A (zh) * 2018-10-11 2020-04-21 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111026440B (zh) * 2018-10-09 2022-03-29 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN110096309B (zh) * 2018-11-14 2020-04-14 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111290789B (zh) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111047005A (zh) * 2018-10-11 2020-04-21 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111061507A (zh) * 2018-10-16 2020-04-24 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111045729A (zh) * 2018-10-12 2020-04-21 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111353124A (zh) * 2018-12-20 2020-06-30 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111047024B (zh) * 2018-10-12 2023-05-23 上海寒武纪信息科技有限公司 一种计算装置及相关产品
CN111047023B (zh) * 2018-10-12 2023-11-24 上海寒武纪信息科技有限公司 一种计算装置及相关产品
CN111210011B (zh) * 2018-11-21 2022-12-02 上海寒武纪信息科技有限公司 数据处理装置及相关产品
CN111209245B (zh) * 2018-11-21 2021-11-16 上海寒武纪信息科技有限公司 数据处理装置、方法及相关产品
CN111078623B (zh) * 2018-10-18 2022-03-29 上海寒武纪信息科技有限公司 片上网络处理系统和片上网络数据处理方法
CN111078624B (zh) * 2018-10-18 2022-03-25 上海寒武纪信息科技有限公司 片上网络处理系统和片上网络数据处理方法
CN111209231B (zh) * 2018-11-21 2021-05-11 上海寒武纪信息科技有限公司 数据处理方法、装置及相关产品
CN111079908B (zh) * 2018-10-18 2024-02-13 上海寒武纪信息科技有限公司 片上网络数据处理方法、存储介质、计算机设备和装置
WO2020078470A1 (zh) * 2018-10-18 2020-04-23 上海寒武纪信息科技有限公司 片上网络数据处理方法及装置
CN111209244B (zh) * 2018-11-21 2022-05-06 上海寒武纪信息科技有限公司 数据处理装置及相关产品
CN111210012B (zh) * 2018-11-21 2022-12-09 上海寒武纪信息科技有限公司 数据处理方法、装置及相关产品
CN111078625B (zh) * 2018-10-18 2022-03-29 上海寒武纪信息科技有限公司 片上网络处理系统和片上网络数据处理方法
CN111209243B (zh) * 2018-11-21 2022-12-02 上海寒武纪信息科技有限公司 数据处理装置、方法及相关产品
CN111209230B (zh) * 2018-11-21 2021-08-31 上海寒武纪信息科技有限公司 数据处理装置、方法及相关产品
CN111079909B (zh) * 2018-10-19 2021-01-26 安徽寒武纪信息科技有限公司 运算方法、系统及相关产品
CN111078284B (zh) * 2018-10-19 2021-02-05 中科寒武纪科技股份有限公司 运算方法、系统及相关产品
CN111079913B (zh) * 2018-10-19 2021-02-05 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN111078286B (zh) * 2018-10-19 2023-09-01 上海寒武纪信息科技有限公司 数据通信方法、计算系统和存储介质
CN111078282B (zh) * 2018-10-19 2020-12-22 安徽寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111078291B (zh) * 2018-10-19 2021-02-09 中科寒武纪科技股份有限公司 运算方法、系统及相关产品
CN111079912B (zh) * 2018-10-19 2021-02-12 中科寒武纪科技股份有限公司 运算方法、系统及相关产品
CN111078280B (zh) * 2018-10-19 2021-01-26 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN109669773B (zh) * 2018-11-12 2024-03-08 平安科技(深圳)有限公司 金融数据处理方法、装置、设备和存储介质
CN111191774B (zh) * 2018-11-14 2023-04-07 上海富瀚微电子股份有限公司 面向精简卷积神经网络的低代价加速器架构及其处理方法
CN109583579B (zh) * 2018-11-30 2021-04-09 上海寒武纪信息科技有限公司 计算装置及相关产品
CN109558110B (zh) * 2018-11-30 2021-06-01 上海寒武纪信息科技有限公司 数据转换装置及相关产品
CN111260070B (zh) * 2018-11-30 2022-11-29 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111381871B (zh) * 2018-12-28 2022-12-09 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
CN111258935B (zh) * 2018-11-30 2022-01-25 上海寒武纪信息科技有限公司 数据传输装置和方法
CN111258641B (zh) * 2018-11-30 2022-12-09 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
US11573765B2 (en) * 2018-12-13 2023-02-07 Advanced Micro Devices, Inc. Fused convolution and batch normalization for neural networks
CN109684087B (zh) * 2018-12-17 2020-01-10 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN109635944B (zh) * 2018-12-24 2020-10-27 西安交通大学 一种稀疏卷积神经网络加速器及实现方法
CN111368990B (zh) * 2018-12-25 2023-03-07 上海寒武纪信息科技有限公司 一种神经网络计算装置和方法
CN111367567B (zh) * 2018-12-25 2023-03-07 上海寒武纪信息科技有限公司 一种神经网络计算装置和方法
CN111368967B (zh) * 2018-12-25 2023-04-07 上海寒武纪信息科技有限公司 一种神经网络计算装置和方法
CN111368985B (zh) * 2018-12-25 2023-11-28 上海寒武纪信息科技有限公司 一种神经网络计算装置和方法
CN111383638A (zh) 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 信号处理装置、信号处理方法及相关产品
CN111488976B (zh) * 2019-01-28 2023-06-30 中科寒武纪科技股份有限公司 神经网络计算装置、神经网络计算方法及相关产品
CN111523652B (zh) * 2019-02-01 2023-05-02 阿里巴巴集团控股有限公司 处理器及其数据处理方法、摄像装置
CN109902819B (zh) * 2019-02-12 2023-04-18 Oppo广东移动通信有限公司 神经网络计算方法、装置、移动终端及存储介质
US20200264891A1 (en) * 2019-02-20 2020-08-20 Nanjing Iluvatar CoreX Technology Co., Ltd. (DBA “Iluvatar CoreX Inc. Nanjing”) Constant scalar register architecture for acceleration of delay sensitive algorithm
CN109993293B (zh) * 2019-02-28 2021-04-13 中山大学 一种适用于堆叠式沙漏网络的深度学习加速器
CN109885407B (zh) * 2019-03-05 2021-09-21 上海商汤智能科技有限公司 数据处理方法和装置、电子设备、存储介质
CN111695686B (zh) * 2019-03-15 2022-11-01 上海寒武纪信息科技有限公司 地址分配方法及装置
CN111723920B (zh) * 2019-03-22 2024-05-17 中科寒武纪科技股份有限公司 人工智能计算装置及相关产品
US11983535B2 (en) 2019-03-22 2024-05-14 Cambricon Technologies Corporation Limited Artificial intelligence computing device and related product
WO2020200250A1 (zh) * 2019-04-02 2020-10-08 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
US10698842B1 (en) * 2019-04-10 2020-06-30 Xilinx, Inc. Domain assist processor-peer for coherent acceleration
CN111832739B (zh) 2019-04-18 2024-01-09 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品
US20200334522A1 (en) 2019-04-18 2020-10-22 Cambricon Technologies Corporation Limited Data processing method and related products
CN111860798A (zh) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
WO2020220935A1 (zh) * 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 运算装置
CN110298441B (zh) * 2019-05-24 2022-01-11 深圳云天励飞技术有限公司 一种数据处理方法、电子装置及计算机可读存储介质
CN112068799B (zh) * 2019-06-11 2022-08-02 云南大学 一种最优带符号二进制快速计算方法以及椭圆曲线标量乘法
CN112085181B (zh) 2019-06-12 2024-03-29 上海寒武纪信息科技有限公司 神经网络量化方法及装置以及相关产品
US11676028B2 (en) 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
CN110245750B (zh) * 2019-06-14 2022-07-15 西南科技大学 一种基于fpga的神经网络数值模拟方法
CN110390383B (zh) * 2019-06-25 2021-04-06 东南大学 一种基于幂指数量化的深度神经网络硬件加速器
WO2021004076A1 (zh) * 2019-07-05 2021-01-14 山东大学 基于人工智能芯片的适形穿戴式生物信息监测设备及系统
CN112168140B (zh) * 2019-07-05 2021-07-13 山东大学齐鲁医院 基于人工智能芯片的穿戴式生物信息监测设备及方法
CN110348021B (zh) * 2019-07-17 2021-05-18 湖北亿咖通科技有限公司 基于命名实体模型的字符串识别方法、电子设备、存储介质
WO2021022441A1 (zh) * 2019-08-05 2021-02-11 华为技术有限公司 数据传输方法、装置、电子设备及可读存储介质
CN112346707A (zh) * 2019-08-07 2021-02-09 上海寒武纪信息科技有限公司 指令处理方法、装置及相关产品
CN110728365B (zh) * 2019-09-12 2022-04-01 东南大学 多位宽pe阵列计算位宽的选择方法及计算精度控制电路
US11579802B2 (en) * 2019-10-04 2023-02-14 Fungible, Inc. Pipeline using match-action blocks
CN112667288A (zh) * 2019-10-15 2021-04-16 北京希姆计算科技有限公司 数据运算电路、数据处理装置、芯片、卡板及电子设备
WO2021077283A1 (zh) * 2019-10-22 2021-04-29 深圳鲲云信息科技有限公司 神经网络计算压缩方法、系统及存储介质
CN111080400B (zh) * 2019-11-25 2023-04-18 中山大学 一种基于门控图卷积网络的商品推荐方法及系统、存储介质
CN110989970B (zh) * 2019-11-27 2023-04-11 广州海格通信集团股份有限公司 一种双精度浮点矩阵运算处理器及方法
CN111091181B (zh) * 2019-12-09 2023-09-05 Oppo广东移动通信有限公司 卷积处理单元、神经网络处理器、电子设备及卷积运算方法
CN111124500B (zh) * 2019-12-12 2022-03-08 浪潮(北京)电子信息产业有限公司 一种指令执行方法、装置、设备及存储介质
CN111104513B (zh) * 2019-12-13 2023-05-02 中山大学 一种游戏平台用户问答业务的短文本分类方法
CN111026445A (zh) * 2019-12-17 2020-04-17 湖南长城银河科技有限公司 一种智能识别方法及芯片
CN111242293B (zh) * 2020-01-13 2023-07-18 腾讯科技(深圳)有限公司 一种处理部件、数据处理的方法以及电子设备
CN111221479B (zh) * 2020-01-19 2022-08-05 苏州浪潮智能科技有限公司 一种判断存储容量变化量异常的方法、系统及存储介质
US20210295134A1 (en) * 2020-03-18 2021-09-23 Infineon Technologies Ag Artificial neural network activation function
CN111507473B (zh) * 2020-04-20 2023-05-12 上海交通大学 一种基于Crossbar架构的剪枝方法及系统
CN111522776B (zh) * 2020-04-27 2022-04-05 西安交通大学 一种计算架构
CN113626080B (zh) * 2020-05-08 2023-10-03 安徽寒武纪信息科技有限公司 数据处理装置以及相关产品
CN113626082A (zh) * 2020-05-08 2021-11-09 安徽寒武纪信息科技有限公司 数据处理方法及装置以及相关产品
CN113807507A (zh) * 2020-06-16 2021-12-17 安徽寒武纪信息科技有限公司 数据处理方法及装置以及相关产品
CN111832718B (zh) * 2020-06-24 2021-08-03 上海西井信息科技有限公司 芯片架构
CN113867800A (zh) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 计算装置、集成电路芯片、板卡、电子设备和计算方法
CN113867799A (zh) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 计算装置、集成电路芯片、板卡、电子设备和计算方法
CN113867793A (zh) * 2020-06-30 2021-12-31 上海寒武纪信息科技有限公司 计算装置、集成电路芯片、板卡、电子设备和计算方法
CN111783954B (zh) * 2020-06-30 2023-05-02 安徽寒武纪信息科技有限公司 一种用于确定神经网络的性能的方法、电子设备和存储介质
CN111651207B (zh) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 一种神经网络模型运算芯片、方法、装置、设备及介质
WO2022040643A1 (en) * 2020-08-21 2022-02-24 Fu Zhi Sing Processing unit architectures and techniques for reusable instructions and data
KR20220034542A (ko) * 2020-09-11 2022-03-18 삼성전자주식회사 스토리지 장치 및 스토리지 장치의 동작 방법
CN112259071A (zh) * 2020-09-22 2021-01-22 北京百度网讯科技有限公司 语音处理系统、语音处理方法、电子设备和可读存储介质
CN112036554B (zh) * 2020-11-04 2021-04-06 深圳追一科技有限公司 神经网络模型的处理方法、装置、计算机设备和存储介质
CN112506436B (zh) * 2020-12-11 2023-01-31 西北工业大学 用于卷积神经网络加速器的高效率数据动态存储分配方法
CN112783556B (zh) * 2021-01-06 2023-04-07 南阳理工学院 信息处理方法、信息处理装置及终端设备
CN115271047A (zh) * 2021-04-29 2022-11-01 华为技术有限公司 一种数据处理方法及装置
CN113469326B (zh) * 2021-06-24 2024-04-02 上海寒武纪信息科技有限公司 在神经网络模型中执行剪枝优化的集成电路装置及板卡
CN113806285A (zh) * 2021-09-18 2021-12-17 北京爱芯科技有限公司 一种数据处理模组、芯片和数据处理方法
CN114139693A (zh) * 2021-12-03 2022-03-04 安谋科技(中国)有限公司 神经网络模型的数据处理方法、介质和电子设备
CN114237612A (zh) * 2021-12-03 2022-03-25 龙芯中科技术股份有限公司 程序代码的编译方法、装置、电子设备及存储介质
CN114296798A (zh) * 2021-12-10 2022-04-08 龙芯中科技术股份有限公司 向量移位方法、处理器及电子设备
CN114372012B (zh) * 2021-12-21 2024-02-20 中国科学院深圳先进技术研究院 一种通用、可配置的高能效池化计算单行输出系统和方法
CN114265872B (zh) * 2022-02-24 2022-05-24 苏州浪潮智能科技有限公司 一种用于总线的互联装置
CN114726512B (zh) * 2022-03-08 2024-03-26 支付宝(杭州)信息技术有限公司 数据处理方法和装置
CN114692833B (zh) * 2022-03-30 2023-11-21 广东齐芯半导体有限公司 一种卷积计算电路、神经网络处理器以及卷积计算方法
CN114818803A (zh) * 2022-04-25 2022-07-29 上海韶脑传感技术有限公司 基于神经元优化的单侧肢体患者运动想象脑电建模方法
CN115390654A (zh) * 2022-08-11 2022-11-25 Oppo广东移动通信有限公司 降低功耗的方法、处理器、电子设备及存储介质
KR20240033565A (ko) * 2022-09-05 2024-03-12 리벨리온 주식회사 뉴럴 프로세싱 장치, 그에 포함되는 프로세싱 엘리먼트 및 뉴럴 프로세싱 장치의 다양한 포맷 연산 방법
CN115203126B (zh) * 2022-09-15 2023-04-18 太初(无锡)电子科技有限公司 一种算子融合处理方法、装置、设备及存储介质
CN115934768A (zh) * 2022-12-01 2023-04-07 摩尔线程智能科技(北京)有限责任公司 数据的处理方法、显示适配器、电子设备及存储介质
CN115826910B (zh) * 2023-02-07 2023-05-02 成都申威科技有限责任公司 一种向量定点的alu处理系统
CN116360858B (zh) * 2023-05-26 2023-08-29 摩尔线程智能科技(北京)有限责任公司 数据的处理方法、图形处理器、电子设备及存储介质
KR102653745B1 (ko) * 2023-06-02 2024-04-02 라이프앤사이언스주식회사 최적화된 연산속도를 가지는 교육용 로봇제어기
CN117992396B (zh) * 2024-03-29 2024-05-28 深存科技(无锡)有限公司 流式张量处理器

Family Cites Families (91)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1013070B (zh) * 1988-01-09 1991-07-03 北京信通电脑技术公司 直接处理接近数学公式的″机器表达式″的计算机系统
US5083285A (en) * 1988-10-11 1992-01-21 Kabushiki Kaisha Toshiba Matrix-structured neural network with learning circuitry
US5327537A (en) * 1990-03-13 1994-07-05 At&T Bell Laboratories Apparatus for controlling instruction execution in a pipelined processor
GB2288521B (en) * 1994-03-24 1998-10-14 Discovision Ass Reconfigurable process stage
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
US5717891A (en) 1995-10-12 1998-02-10 Analog Devices, Inc. Digital signal processor with caching of instructions that produce a memory conflict
US5889985A (en) * 1996-08-07 1999-03-30 Elbrus International Array prefetch apparatus and method
CN1302403A (zh) * 1998-05-22 2001-07-04 弗兰普顿·E·埃利斯三世 全球网络计算机
WO2004013752A1 (en) * 2002-07-26 2004-02-12 Koninklijke Philips Electronics N.V. Method and apparatus for accessing multiple vector elements in parallel
US6941289B2 (en) * 2001-04-06 2005-09-06 Sas Institute Inc. Hybrid neural network generation system and method
US7236995B2 (en) * 2002-12-27 2007-06-26 Arm Limited Data processing apparatus and method for converting a number between fixed-point and floating-point representations
US9555052B2 (en) * 2003-06-13 2017-01-31 Sumathi Paturu Intrauterine fetal growth restriction—the biochemical rationale of treatment modalities including extraperitoneal transamniotic fetal supplements
US7539714B2 (en) * 2003-06-30 2009-05-26 Intel Corporation Method, apparatus, and instruction for performing a sign operation that multiplies
US7020769B2 (en) * 2003-09-30 2006-03-28 Starcore, Llc Method and system for processing a loop of instructions
CN101211341A (zh) * 2006-12-29 2008-07-02 上海芯盛电子科技有限公司 图像智能模式识别搜索方法
WO2008092883A2 (en) * 2007-01-30 2008-08-07 Nema Labs Ab Speculative throughput computing
CN101021832A (zh) * 2007-03-19 2007-08-22 中国人民解放军国防科学技术大学 支持局部寄存和条件执行的64位浮点整数融合运算群
CN101399977A (zh) * 2007-09-29 2009-04-01 智多微电子(上海)有限公司 解码装置中控制片内存储器的数据并行读写的方法及装置
US8181003B2 (en) * 2008-05-29 2012-05-15 Axis Semiconductor, Inc. Instruction set design, control and communication in programmable microprocessor cores and the like
US20100047768A1 (en) * 2008-08-18 2010-02-25 J. Craig Venter Institute, Inc. Amplification of single viral genomes
US20100122070A1 (en) * 2008-11-07 2010-05-13 Nokia Corporation Combined associative and distributed arithmetics for multiple inner products
CN101644921B (zh) * 2009-08-05 2011-07-20 无锡信捷电气有限公司 一种改进型板料数控折弯设计方法
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
CN101667114B (zh) * 2009-09-30 2012-07-04 西安电子科技大学 适于矩阵求逆的超长指令集微处理系统
CN101770515B (zh) * 2010-01-18 2012-01-11 杭州顺网科技股份有限公司 一种基于数据块比较的数据更新方法
CN101783805B (zh) * 2010-03-01 2013-04-17 田耕 一种利用动态矢量矩阵的加密通信方法
CN101833441B (zh) * 2010-04-28 2013-02-13 中国科学院自动化研究所 并行向量处理引擎结构
US9129220B2 (en) * 2010-07-07 2015-09-08 Qualcomm Incorporated Methods and systems for digital neural processing with discrete-level synapes and probabilistic STDP
CN101916180B (zh) * 2010-08-11 2013-05-29 中国科学院计算技术研究所 Risc处理器中执行寄存器类型指令的方法和其系统
CN101963983A (zh) * 2010-09-28 2011-02-02 江苏瑞蚨通软件科技有限公司(中外合资) 一种粗集优化神经网络的数据挖掘方法
SG180028A1 (en) * 2010-10-12 2012-05-30 St Electronics Info Software Systems Pte Ltd Information management system and device
US8515885B2 (en) * 2010-10-29 2013-08-20 International Business Machines Corporation Neuromorphic and synaptronic spiking neural network with synaptic weights learned using simulation
CN102637157B (zh) * 2011-02-15 2014-12-03 郑磊 一种片上数字模板系统dtsoc
US8843425B2 (en) * 2011-07-29 2014-09-23 International Business Machines Corporation Hierarchical routing for two-way information flow and structural plasticity in neural networks
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
FR2980905B1 (fr) * 2011-09-29 2014-03-14 Continental Automotive France Procede d'effacement d'informations memorisees dans une memoire reinscriptible non volatile, support de memorisation et calculateur de vehicule automobile
CN102510282B (zh) * 2011-10-25 2014-07-09 中国科学院空间科学与应用研究中心 一种时间分辨单光子计数二维成像系统及方法
US9960917B2 (en) * 2011-12-22 2018-05-01 Intel Corporation Matrix multiply accumulate instruction
CN102609378B (zh) * 2012-01-18 2016-03-30 中国科学院计算技术研究所 一种消息式内存访问装置及其访问方法
KR20130090147A (ko) * 2012-02-03 2013-08-13 안병익 신경망 컴퓨팅 장치 및 시스템과 그 방법
CN103377033B (zh) * 2012-04-12 2016-01-13 无锡江南计算技术研究所 运算核心及其指令管理方法
CN102880341B (zh) * 2012-08-29 2015-08-05 北京集创北方科技有限公司 触摸屏数据处理系统、方法及专用alu
CN103699360B (zh) * 2012-09-27 2016-09-21 北京中科晶上科技有限公司 一种向量处理器及其进行向量数据存取、交互的方法
CN103023839B (zh) * 2012-12-04 2016-12-28 温州大学 基于输出反馈偏置型复连续反馈神经网络结构的无线光通信系统盲均衡方法
US9171029B2 (en) * 2013-01-31 2015-10-27 International Business Machines Corporation Performing batches of selective assignments in a vector friendly manner
CN103246541B (zh) * 2013-04-27 2016-03-23 中国人民解放军信息工程大学 一种自动并行化多级并行代价评估方法
CN103399486B (zh) * 2013-07-05 2016-04-06 杭州电子科技大学 塑料烘干器温度优化节能控制方法
EP2858024A1 (en) * 2013-10-01 2015-04-08 Enyx SA An asset management device and method in a hardware platform
US9582248B2 (en) * 2014-09-26 2017-02-28 Arm Limited Standalone floating-point conversion unit
US20160124651A1 (en) * 2014-11-03 2016-05-05 Texas Instruments Incorporated Method for performing random read access to a block of data using parallel lut read instruction in vector processors
US9996350B2 (en) * 2014-12-27 2018-06-12 Intel Corporation Hardware apparatuses and methods to prefetch a multidimensional block of elements from a multidimensional array
US20170061279A1 (en) * 2015-01-14 2017-03-02 Intel Corporation Updating an artificial neural network using flexible fixed point representation
US10223635B2 (en) * 2015-01-22 2019-03-05 Qualcomm Incorporated Model compression and fine-tuning
US11544214B2 (en) * 2015-02-02 2023-01-03 Optimum Semiconductor Technologies, Inc. Monolithic vector processor configured to operate on variable length vectors using a vector length register
CN104699629B (zh) * 2015-03-16 2017-09-22 清华大学 共享片上缓存划分装置
CN104778026A (zh) * 2015-04-28 2015-07-15 浪潮电子信息产业股份有限公司 一种带simd的高速数据格式转换部件及转换方法
US9633306B2 (en) * 2015-05-07 2017-04-25 Siemens Healthcare Gmbh Method and system for approximating deep neural networks for anatomical object detection
US9805303B2 (en) * 2015-05-21 2017-10-31 Google Inc. Rotating data for neural network computations
CN107924428B (zh) * 2015-09-01 2022-03-15 弗莱克斯-罗技克斯技术公司 可编程逻辑ic的块存储器布局和体系架构及其操作方法
US10776690B2 (en) * 2015-10-08 2020-09-15 Via Alliance Semiconductor Co., Ltd. Neural network unit with plurality of selectable output functions
CN106447036B (zh) * 2015-10-08 2019-03-15 上海兆芯集成电路有限公司 执行随机舍入的神经网络单元
CN106570559A (zh) * 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 一种基于神经网络的数据处理方法和装置
CN105224505B (zh) * 2015-10-29 2018-05-25 中国人民解放军国防科学技术大学 基于矩阵转置操作的fft加速器装置
CN105550749A (zh) * 2015-12-09 2016-05-04 四川长虹电器股份有限公司 一种新型网络拓扑结构的卷积神经网络的构造方法
US11579677B2 (en) * 2015-12-18 2023-02-14 Hewlett Packard Enterprise Development Lp Memristor crossbar arrays to activate processors
CN105630680B (zh) * 2015-12-28 2018-12-18 中国科学院微电子研究所 一种随机测试程序生成方法
US10762164B2 (en) * 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
CN107609642B (zh) * 2016-01-20 2021-08-31 中科寒武纪科技股份有限公司 计算装置和方法
CN105844330B (zh) * 2016-03-22 2019-06-28 华为技术有限公司 神经网络处理器的数据处理方法及神经网络处理器
CN105843775B (zh) * 2016-04-06 2018-12-04 中国科学院计算技术研究所 片上数据划分读写方法、系统及其装置
CN105912476A (zh) * 2016-04-06 2016-08-31 中国科学院计算技术研究所 片上重复寻址的方法及装置
US20170337156A1 (en) * 2016-04-26 2017-11-23 Onnivation Llc Computing machine architecture for matrix and array processing
US11740903B2 (en) * 2016-04-26 2023-08-29 Onnivation, LLC Computing machine using a matrix space and matrix pointer registers for matrix and array processing
CN105930281B (zh) * 2016-05-12 2019-01-15 清华大学 以配置信息驱动数据访存模式匹配的片上缓存预取机制
CN106022614A (zh) * 2016-05-22 2016-10-12 广州供电局有限公司 一种基于最近邻聚类的神经网络数据挖掘方法
CN106066783A (zh) * 2016-06-02 2016-11-02 华为技术有限公司 基于幂次权重量化的神经网络前向运算硬件结构
CN105976024B (zh) * 2016-06-03 2018-12-25 福州大学 基于rbf的模式分类器及其工作方法
CN106203622B (zh) * 2016-07-14 2018-12-25 杭州华为数字技术有限公司 神经网络运算装置
CN106250103A (zh) * 2016-08-04 2016-12-21 东南大学 一种卷积神经网络循环卷积计算数据重用的系统
CN106650922B (zh) * 2016-09-29 2019-05-03 清华大学 硬件神经网络转换方法、计算装置、软硬件协作系统
US10175980B2 (en) * 2016-10-27 2019-01-08 Google Llc Neural network compute tile
CN106599840A (zh) * 2016-12-13 2017-04-26 郑州云海信息技术有限公司 一种图像识别协处理器、图像识别系统及方法
CN106845631B (zh) * 2016-12-26 2020-05-29 上海寒武纪信息科技有限公司 一种流执行方法及装置
CN106775599B (zh) * 2017-01-09 2019-03-01 南京工业大学 递归神经网络的多计算单元粗粒度可重构系统及方法
CN106909971A (zh) * 2017-02-10 2017-06-30 华南理工大学 一种面向多核计算环境的bp神经网络并行化方法
CN106940815B (zh) * 2017-02-13 2020-07-28 西安交通大学 一种可编程卷积神经网络协处理器ip核
CN106951961B (zh) * 2017-02-24 2019-11-26 清华大学 一种粗粒度可重构的卷积神经网络加速器及系统
EP3654172A1 (en) * 2017-04-19 2020-05-20 Shanghai Cambricon Information Technology Co., Ltd Fused vector multiplier and method using the same
US10223114B1 (en) * 2017-09-29 2019-03-05 Intel Corporation Fixed point to floating point conversion
US11210063B2 (en) * 2019-03-27 2021-12-28 Intel Corporation Machine learning training architecture for programmable devices
US11983530B2 (en) * 2020-03-27 2024-05-14 Intel Corporation Reconfigurable digital signal processing (DSP) vector engine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541814A (zh) * 2010-12-27 2012-07-04 北京国睿中数科技股份有限公司 Matrix computing device and method for a data communication processor
US20160342890A1 (en) * 2015-05-21 2016-11-24 Google Inc. Batch processing in a neural network processor
CN104915322A (zh) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Convolutional neural network hardware acceleration method and AXI bus IP core thereof
CN107992329A (zh) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 Calculation method and related products

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3686734A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506522A (zh) * 2019-01-31 2020-08-07 阿里巴巴集团控股有限公司 Data processing device and method
CN111506522B (zh) * 2019-01-31 2023-04-18 阿里巴巴集团控股有限公司 Data processing device and method
EP3905248A1 (en) * 2020-04-27 2021-11-03 Intel Corporation Ultra-deep compute static random access memory with high compute throughput and multi-directional data propagation
US11450672B2 (en) 2020-04-27 2022-09-20 Intel Corporation Ultra-deep compute static random access memory with high compute throughput and multi-directional data propagation
US11823035B2 (en) 2020-07-07 2023-11-21 Qualcomm Incorporated Power-efficient compute-in-memory pooling
CN111930506A (zh) * 2020-08-13 2020-11-13 山东云海国创云计算装备产业创新中心有限公司 Matrix scheduling method and related apparatus
CN112257859A (zh) * 2020-10-30 2021-01-22 地平线(上海)人工智能技术有限公司 Feature data processing method and apparatus, device, and storage medium
CN112711218A (zh) * 2020-12-08 2021-04-27 杭州电子科技大学上虞科学与工程研究院有限公司 Method for industrial equipment data acquisition
CN116055049A (zh) * 2023-04-03 2023-05-02 富算科技(上海)有限公司 Multi-party secure computation method, apparatus, system, electronic device, and storage medium
CN116055049B (zh) * 2023-04-03 2023-07-04 富算科技(上海)有限公司 Multi-party secure computation method, apparatus, system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN107832082A (zh) 2018-03-23
CN110688157B (zh) 2022-02-22
CN110688158A (zh) 2020-01-14
CN107729990B (zh) 2021-06-08
CN110825434B (zh) 2021-12-21
CN110597559B (zh) 2021-10-19
CN110688159B (zh) 2021-12-14
CN107844322A (zh) 2018-03-27
CN110597558B (zh) 2021-11-12
CN110036369A (zh) 2019-07-19
CN109284130A (zh) 2019-01-29
CN110597559A (zh) 2019-12-20
CN110825434A (zh) 2020-02-21
CN111176727A (zh) 2020-05-19
US20210224069A1 (en) 2021-07-22
CN107844322B (zh) 2020-08-04
CN110597558A (zh) 2019-12-20
EP3686734A1 (en) 2020-07-29
CN111221578A (zh) 2020-06-02
CN110688157A (zh) 2020-01-14
US20230024840A1 (en) 2023-01-26
CN110688158B (zh) 2022-02-22
CN107729990A (zh) 2018-02-23
CN107832082B (zh) 2020-08-04
CN111221578B (zh) 2022-07-15
CN109284822A (zh) 2019-01-29
US11983534B2 (en) 2024-05-14
CN107992329A (zh) 2018-05-04
CN107729989B (zh) 2020-12-29
CN107608715A (zh) 2018-01-19
CN107807819B (zh) 2021-06-25
US11481215B2 (en) 2022-10-25
CN110688159A (zh) 2020-01-14
EP3686734A4 (en) 2021-08-18
CN107608715B (zh) 2020-07-03
CN110036369B (zh) 2023-11-24
CN109284822B (zh) 2021-09-21
CN107729989A (zh) 2018-02-23
CN107807819A (zh) 2018-03-16
CN111176727B (zh) 2022-05-31
CN107992329B (zh) 2021-05-11
CN109284130B (zh) 2021-03-23

Similar Documents

Publication Publication Date Title
CN110036369B (zh) Calculation method and related products
CN109104876B (zh) Operation device and related products
US10896369B2 (en) Power conversion in neural networks
WO2019085655A1 (zh) Information processing method and terminal device
US11307865B2 (en) Data processing apparatus and method
US20180349763A1 (en) Reconfigurable processing unit
CN110163353B (zh) Computing device and method
TWI827432B (zh) Computing device, machine learning operation device, combined processing device, neural network chip, electronic apparatus, board card, and computing method
CN111045728B (zh) Computing device and related products
CN111626413A (zh) Computing device and method
CN111198714B (zh) Retraining method and related products
CN111382848A (zh) Computing device and related products
CN111222632A (zh) Computing device, computing method and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18835662; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2018835662; Country of ref document: EP; Effective date: 20200220