WO2018192500A1 - Processing device and processing method - Google Patents

Processing device and processing method

Info

Publication number
WO2018192500A1
WO2018192500A1 (PCT/CN2018/083415)
Authority
WO
WIPO (PCT)
Prior art keywords
data
multiplier
bits
result
bit
Prior art date
Application number
PCT/CN2018/083415
Other languages
English (en)
French (fr)
Inventor
陈天石
韦洁
支天
王在
刘少礼
罗宇哲
郭崎
李韦
周聖元
杜子东
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201710256445.XA external-priority patent/CN108733412B/zh
Priority claimed from CN201710269106.5A external-priority patent/CN108734281A/zh
Priority claimed from CN201710264686.9A external-priority patent/CN108733408A/zh
Priority claimed from CN201710269049.0A external-priority patent/CN108734288B/zh
Priority to KR1020197038135A priority Critical patent/KR102258414B1/ko
Priority to EP19214371.7A priority patent/EP3786786B1/en
Priority to EP19214320.4A priority patent/EP3654172A1/en
Priority to JP2019549467A priority patent/JP6865847B2/ja
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to US16/476,262 priority patent/US11531540B2/en
Priority to EP18788355.8A priority patent/EP3614259A4/en
Priority to KR1020197025307A priority patent/KR102292349B1/ko
Priority to CN201880000923.3A priority patent/CN109121435A/zh
Publication of WO2018192500A1 publication Critical patent/WO2018192500A1/zh
Priority to US16/697,533 priority patent/US11531541B2/en
Priority to US16/697,687 priority patent/US11734002B2/en
Priority to US16/697,637 priority patent/US11720353B2/en
Priority to US16/697,727 priority patent/US11698786B2/en
Priority to US16/697,603 priority patent/US11507350B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/46 Methods or arrangements for performing computations using exclusively denominational number representation using electromechanical counter-type accumulators
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G06F7/527 Multiplying only in serial-parallel fashion, i.e. one operand being entered serially and the other in parallel
    • G06F7/533 Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even
    • G06F7/544 Methods or arrangements for performing computations using non-contact-making devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30014 Arithmetic instructions with variable precision
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation of neural networks using electronic means
    • G06N3/065 Analogue means
    • G06N3/08 Learning methods

Definitions

  • the present disclosure relates to the field of computers, and further relates to a processing device and a processing method in the field of artificial intelligence.
  • neural network algorithms have become a research hotspot in the field of artificial intelligence in recent years, and have been widely used in pattern recognition, image analysis, and intelligent robots.
  • Deep learning is a branch of machine learning based on learning representations of data.
  • An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges, regions of particular shapes, and the like. Certain representations make it easier to learn a task (e.g., face recognition or facial expression recognition) from examples.
  • Deep learning architectures, such as deep neural networks, convolutional neural networks, deep belief networks, and recurrent neural networks, have been applied to fields such as computer vision, speech recognition, natural language processing, audio recognition, and bioinformatics. By now, "deep learning" has become a near-synonymous term for, or a rebranding of, neural networks.
  • With the boom of deep learning (neural networks), neural network accelerators have emerged. Through the design of dedicated memory and computing modules, a neural network accelerator can achieve speedups of tens to hundreds of times over a general-purpose processor when performing deep learning operations, with a smaller area and lower power consumption.
  • the present disclosure provides a processing device with dynamically configurable computation bit width, comprising:
  • a memory for storing data, the data including the data to be operated on by the neural network, intermediate operation results, final operation results, and data to be cached;
  • a data width adjustment circuit configured to adjust the width of the data to be operated on, the intermediate operation results, the final operation results, and/or the data to be cached;
  • an arithmetic circuit configured to operate on the neural network data of different computation bit widths; and
  • a control circuit for controlling the memory, the data width adjustment circuit, and the arithmetic circuit.
  • the present disclosure also provides a processing method for the processing device with dynamically configurable computation bit width, including the following steps (a minimal control-flow sketch is given after these steps):
  • the control circuit generates a control instruction and transmits it to the memory, the data width adjustment circuit, and the arithmetic circuit;
  • the memory inputs the data to be operated on by the neural network to the arithmetic circuit according to the received control instruction;
  • the data width adjustment circuit adjusts the width of the neural network data to be operated on according to the received control instruction;
  • the arithmetic circuit selects multipliers and adder circuits of the corresponding types in its first operation module according to the received control instruction;
  • the arithmetic circuit operates on the neural network data of different computation bit widths according to the input data to be operated on, the neural network parameters, and the control instruction.
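  • The steps above can be summarized, purely for illustration, in the following minimal control-flow sketch in Python. The module classes and their methods are stand-ins invented for this sketch, not interfaces defined by the present disclosure.

```python
# Stand-in modules for the five steps; names and interfaces are assumptions.
class Control:
    def generate_instruction(self):
        return {"op": "mul", "width": 8}               # step 1: control instruction

class Memory:
    def __init__(self, data): self.data = data
    def read(self, instr): return self.data            # step 2: memory -> arithmetic circuit

class WidthAdjust:
    def adjust(self, xs, instr):                       # step 3: data width adjustment
        return [x & ((1 << instr["width"]) - 1) for x in xs]

class ALU:
    def compute(self, xs, instr):                      # steps 4 and 5: select units, operate
        a, b = xs
        return a * b if instr["op"] == "mul" else a + b

instr = Control().generate_instruction()
operands = WidthAdjust().adjust(Memory([300, 7]).read(instr), instr)
print(ALU().compute(operands, instr))                  # 300 masked to 8 bits is 44; 44 * 7 = 308
```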
  • the present disclosure also provides a processing apparatus, including: a memory for storing data, the data including the data to be operated on by a neural network; an arithmetic circuit for performing operations on that data, including using adder circuits and multipliers to operate on neural network data of different computation bit widths; and a control circuit for controlling the memory and the arithmetic circuit, including determining the types of the multipliers and adder circuits of the arithmetic circuit according to the data to be operated on.
  • the present disclosure also provides a method of using the above processing apparatus, comprising the steps of: the control circuit generates a control instruction and transmits it to the memory and the arithmetic circuit; the memory inputs the data to be operated on by the neural network to the arithmetic circuit according to the received control instruction; the arithmetic circuit selects multipliers and adder circuits of the corresponding types in the first operation module according to the received control instruction; and the arithmetic circuit operates on the neural network data of different computation bit widths according to the input data, the neural network parameters, and the control instruction, and sends the operation result back to the memory.
  • the present disclosure also provides an arithmetic device, including: an input module configured to acquire input data, where the input data includes data to be processed, a network structure, and weight data, or includes data to be processed and/or offline model data;
  • a model generation module configured to construct an offline model according to the input network structure and weight data;
  • a neural network operation module configured to generate an operation instruction based on the offline model, cache it, and operate on the data to be processed based on the operation instruction; an output module configured to output the operation result;
  • and a control module configured to detect the type of the input data and control the input module, the model generation module, and the neural network operation module to perform operations.
  • the present disclosure also proposes an arithmetic method applying the above arithmetic device, comprising the following steps:
  • the operation instruction is called, and the data to be operated on is computed to obtain an operation result for output.
  • the present disclosure also provides an apparatus for supporting composite scalar instructions, including a controller module, a storage module, and an operator module, wherein: the storage module is configured to store composite scalar instructions and data, the data having more than one type, with data of different types stored at different addresses in the storage module; the controller module is configured to read a composite scalar instruction from the storage module and decode it into a control signal; and the operator module is configured to receive the control signal, read data from the storage module, determine the data type according to the address of the read data, and perform the operation on the data.
  • the present disclosure also provides a processor for executing composite scalar instructions, wherein a composite scalar instruction includes an opcode field, an operand address field, and a destination address field; the opcode stored in the opcode field distinguishes different types of operations, the operand address field distinguishes the types of the operands, and the destination address field is the address at which the operation result is stored.
  • the present disclosure also provides a method for executing a composite scalar instruction, comprising the steps of: storing data of different types at different addresses; decoding the composite scalar instruction into a control signal; reading the operand data according to the control signal, determining the type of the operand data according to its address, and operating on it; and storing the operation result at an address of the corresponding type (a minimal decoding sketch follows).
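  • As a minimal decoding sketch, the following Python fragment shows how an operand type can be recovered from the operand address alone, since different types live at different addresses. The address-region-to-type mapping and the opcode set here are assumptions made for illustration.

```python
TYPE_BY_REGION = {0x0000: "int", 0x1000: "float"}      # assumed memory layout

def execute_composite_scalar(opcode, src1, src2, dest, mem):
    t1 = TYPE_BY_REGION[src1 & 0xF000]                 # operand type from address region
    t2 = TYPE_BY_REGION[src2 & 0xF000]
    assert t1 == t2, "operand types must match"
    a, b = mem[src1], mem[src2]
    mem[dest] = {"add": a + b, "mul": a * b}[opcode]   # result stored at typed destination

mem = {0x0000: 6, 0x0004: 7, 0x0008: None}
execute_composite_scalar("mul", 0x0000, 0x0004, 0x0008, mem)
print(mem[0x0008])                                     # 42
```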
  • the present disclosure also provides a counting device, comprising a register unit, a counting unit, and a storage unit, wherein: the register unit is configured to store the address at which the input data to be counted is stored in the storage unit; the counting unit is connected to the register unit and configured to acquire a counting instruction, read the storage address of the input data from the register unit according to the counting instruction, obtain the corresponding input data to be counted from the storage unit, statistically count the number of elements in the input data that satisfy a given condition, and obtain a counting result; and the storage unit is connected to the counting unit and configured to store the input data to be counted and the counting result.
  • the present disclosure also provides a counting method for the above counting device, comprising the steps of: the counting unit acquires a counting instruction, obtains the corresponding input data to be counted from the storage unit according to the storage address read from the register unit per the counting instruction, counts the number of elements in the input data that satisfy the given condition to obtain a counting result, and transmits the counting result to the storage unit.
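  • The counting flow can be sketched functionally as follows; the register file and storage are simplified to Python containers, and the "given condition" is passed in as a predicate. This illustrates the data flow, not the device's actual interfaces.

```python
def count_instruction(register_file, storage, reg_index, condition):
    addr, length = register_file[reg_index]            # storage address of the input data
    data = storage[addr:addr + length]                 # fetch the data to be counted
    result = sum(1 for x in data if condition(x))      # count elements meeting the condition
    storage.append(result)                             # write the count back to storage
    return result

storage = [0, 3, -1, 5, 0, 2]
print(count_instruction({0: (0, 6)}, storage, 0, lambda x: x != 0))   # 4 nonzero elements
```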
  • FIG. 1 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to another embodiment of the present disclosure.
  • FIG. 3 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to still another embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to yet another embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a bit-serial addition tree device for the apparatus of one embodiment of the present disclosure.
  • FIG. 6 is a block diagram of a bit-serial operator in a processing device with dynamically configurable bit width of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a first basic multiplier device according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a second basic multiplier device according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a sparse multiplier device according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a device for performing vector multiplication by a basic multiplier or a sparse multiplier according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a device for performing vector multiplication by a fused vector multiplier according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of a specific implementation flow of the fused vector multiplier device and other multiplier devices provided by the present disclosure.
  • FIG. 13 is a schematic diagram of the combination of a second basic multiplier and a bit-serial addition tree according to an embodiment of the present disclosure.
  • FIG. 14 is a flowchart of a processing method with dynamically configurable computation bit width according to an embodiment of the present disclosure.
  • FIG. 15 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to another embodiment of the present disclosure.
  • FIG. 16 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to another embodiment of the present disclosure.
  • FIG. 17 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to still another embodiment of the present disclosure.
  • FIG. 18 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to yet another embodiment of the present disclosure.
  • FIG. 19 is a schematic structural diagram of a basic multiplier device according to an embodiment of the present disclosure.
  • FIG. 20 is a schematic structural diagram of a sparse multiplier device according to an embodiment of the present disclosure.
  • FIG. 21 is a schematic structural diagram of a device for performing vector multiplication by a basic multiplier or a sparse multiplier according to an embodiment of the present disclosure.
  • FIG. 22 is a schematic structural diagram of a device for performing vector multiplication by a fused vector multiplier according to an embodiment of the present disclosure.
  • FIG. 23 is a schematic diagram of a specific implementation flow of the fused vector multiplier device and other multiplier devices provided by the present disclosure.
  • FIG. 24 is a flowchart of a processing method with dynamically configurable computation bit width according to an embodiment of the present disclosure.
  • FIG. 25 is a diagram of a typical programming framework.
  • FIG. 26 is a flowchart of the operations of an arithmetic method according to an embodiment of the present disclosure.
  • FIG. 27 is a structural block diagram of an arithmetic device according to another embodiment of the present disclosure.
  • FIG. 29A is a diagram showing an example of the RAM organization of a storage module according to an embodiment of the present disclosure.
  • FIG. 29B is a diagram showing an example of the register file organization of a storage module according to an embodiment of the present disclosure.
  • FIG. 30A is a diagram showing an example of a composite scalar instruction according to an embodiment of the present disclosure.
  • FIG. 30B is a diagram showing an example of a composite scalar instruction using register addressing according to an embodiment of the present disclosure.
  • FIG. 30C is a diagram showing an example of a composite scalar instruction using register-indirect addressing according to an embodiment of the present disclosure.
  • FIG. 30D is a diagram showing an example of a composite scalar instruction using immediate addressing according to an embodiment of the present disclosure.
  • FIG. 30E is a diagram showing an example of a composite scalar instruction using RAM addressing according to an embodiment of the present disclosure.
  • FIG. 31 is a flowchart of an operation method supporting composite scalar instructions according to an embodiment of the present disclosure.
  • FIG. 32 is a schematic structural diagram of the framework of a counting device according to an embodiment of the present disclosure.
  • FIG. 33 is a schematic structural diagram of a counting unit in a counting device according to an embodiment of the present disclosure.
  • FIG. 34 is a block diagram showing the structure of the adder in the counting unit of FIG. 33.
  • FIG. 35 is a schematic diagram of the instruction set format of a counting instruction in a counting device according to an embodiment of the present disclosure.
  • FIG. 36 is a flowchart of the execution process of a counting unit in a counting device according to an embodiment of the present disclosure.
  • FIG. 37 is a schematic structural diagram of a counting device according to an embodiment of the present disclosure.
  • FIG. 38 is a flowchart of the execution process of a counting device according to an embodiment of the present disclosure.
  • the "memory" described in the present disclosure may be integrated inside a computationally-configured bit-width dynamically configurable processing device, or may be a separate device for data transfer memory as an external memory and a computationally-programmable dynamically configurable processing device. It can be integrated inside a computationally configurable bit-width dynamically configurable processing device, or it can be a separate device that acts as an external memory for data transfer with a computationally-programmable dynamically configurable processing device.
  • FIG. 1 is a schematic structural diagram of a processing device with dynamically configurable computation bit width according to an embodiment of the present disclosure. As shown in FIG. 1, the device includes a control circuit, a data width adjustment circuit, an arithmetic circuit, and a memory.
  • the control circuit is used to send control signals to the data width adjustment circuit, the arithmetic circuit, and the memory, to control their operation and coordinate the data transmission among them.
  • the memory is used to store related data, which may include input data (including the data to be operated on and control instructions), intermediate operation results, final operation results, neurons, synapses, and data to be cached; the stored data content, the storage organization, and the access method can be planned differently according to different requirements.
  • the data width adjustment circuit is configured to adjust the width of data. This may happen after data is read from the memory, with the data width adjustment circuit adjusting the bit width before passing the data to the arithmetic circuit; after the arithmetic circuit computes results, with the results passing through the data width adjustment circuit before being transferred to the memory; or within the memory path, with data from the memory having its bit width adjusted and then being written back to the memory. The specific operation is controlled by the control signals of the control circuit.
  • the specific operation includes increasing, decreasing, or maintaining the data bit width without loss of precision; increasing, decreasing, or maintaining the data bit width with an acceptable degree of precision loss; and increasing, decreasing, or maintaining the data bit width according to some specified transformation or operation requirement (such as a specified "bitwise AND" operation).
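  • As one hedged illustration of the first two cases, the sketch below shows lossless widening by sign extension and lossy narrowing by discarding low bits for two's-complement data; the actual circuit behavior is selected by the control signals.

```python
def widen(x, from_bits, to_bits):
    """Sign-extend: increase the bit width without loss of precision."""
    if (x >> (from_bits - 1)) & 1:                     # negative: fill high bits with 1s
        x |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return x & ((1 << to_bits) - 1)

def narrow(x, drop_bits):
    """Discard the lowest bits: reduce the bit width with bounded precision loss."""
    return x >> drop_bits

assert widen(0b1110, 4, 8) == 0b11111110               # -2 in 4 bits stays -2 in 8 bits
assert narrow(0b10111011, 2) == 0b101110
```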
  • the arithmetic circuit may include at least one adder and at least one multiplier for operating on the data.
  • the at least one adder includes an adder, an addition tree, and/or a bit-serial addition tree; the at least one multiplier includes a basic multiplier, a sparse multiplier, and/or a fused vector multiplier.
  • the arithmetic circuit may further include a comparator and/or an ALU, etc.; the multipliers and adders can operate on data of different computation bit widths, and can perform operations between operand data of different bit widths according to different requirements.
  • the multiplier can be a bit-serial operator that implements multiplication in a bit-serial manner. It should be noted that the arithmetic circuit can also exchange data with the memory directly, without passing through the data width adjustment circuit.
  • the control circuit is connected to each module or submodule of the memory and to the arithmetic circuit, and includes at least one control signal register and at least one control processor. The control signal register is used to store control signals; optionally, it is a first-in, first-out register.
  • the control processor is configured to take out the control signal to be executed and, after analyzing the control logic, to control and coordinate the memory, the data width adjustment circuit, and the arithmetic circuit.
  • the memory includes an input storage module, an output storage module, and a synaptic storage module, wherein the output storage module can be used to store intermediate operation results and final operation results.
  • the data width adjustment circuit can be divided into an input data processing module and an output data processing module. The input data processing module is configured to adjust the data width of the data in the input storage module and/or the synaptic storage module, and may be placed after the input storage module; the output data processing module is configured to adjust the width of the data computed by the arithmetic circuit before it is stored. The arithmetic circuit is mainly used to accelerate the convolution operations of the convolutional layer and the fully connected layer, and the averaging or maximum-value operation of the pooling layer.
  • the arithmetic circuit may include a multiplier module, an addition tree module, and a nonlinear operation module (e.g., a module that performs a sigmoid function operation).
  • the multiplier module, the addition tree module, and the nonlinear operation module can be executed in parallel in a pipelined manner.
  • the device can accelerate the operation process of a convolutional neural network, reduce off-chip data exchange, and save storage space.
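  • The three stages can be illustrated functionally as below; the sketch runs the stages one after another, whereas the hardware overlaps them across successive groups of data. The data layout is an assumption for illustration.

```python
import math

def pipeline(inputs, weights):
    stage1 = [[x * w for x, w in zip(xs, ws)]           # multiplier module
              for xs, ws in zip(inputs, weights)]
    stage2 = [sum(products) for products in stage1]     # addition tree module
    stage3 = [1 / (1 + math.exp(-s)) for s in stage2]   # nonlinear module (sigmoid)
    return stage3

print(pipeline([[1.0, 2.0]], [[0.5, -0.25]]))           # [0.5], since sigmoid(0) = 0.5
```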
  • FIG. 3 is a schematic structural diagram of a processing apparatus according to another embodiment of the present disclosure.
  • the device is structured such that the control circuit is connected to each module of the memory and to the arithmetic circuit, and includes a control signal register and a control processor; the control signal register is used to store control signals, and the control processor is configured to take out the control signal to be executed and, after analyzing the control logic, to control and coordinate the memory and the arithmetic circuit.
  • optionally, the control signal register is first in, first out.
  • the memory includes an input storage module, an output storage module, and a synaptic storage module.
  • the synaptic storage module comprises a plurality of synaptic storage submodules, and the arithmetic circuit comprises a plurality of arithmetic modules;
  • the synaptic storage submodules are connected to the arithmetic modules correspondingly: one synaptic storage submodule may be connected to one arithmetic module, or multiple synaptic storage submodules may be connected to one arithmetic module.
  • the data width adjustment circuit can be divided into an input data processing module and an output data processing module. The input data processing module is configured to adjust the data width of the data in the input storage module and/or the synaptic storage module, and may be placed after the input storage module; the output data processing module is configured to adjust the width of the data computed by the arithmetic circuit before storage.
  • each time the input storage module passes data through the input data processing module, the input data is transmitted to all the arithmetic modules; each synaptic storage module transmits synapse data to its corresponding arithmetic module; and after an arithmetic module performs its operation, the output data processing module writes the result to the output storage module. In this way, in large-scale operations with many parameters, computational efficiency is significantly improved.
  • the device can effectively accelerate the operation process of a convolutional neural network, and is especially suitable for cases where the network scale is large and the parameters are numerous.
  • the control circuit is connected to each module of the memory, to the arithmetic circuit, and to the data width adjustment circuit, and includes an instruction queue and a decoder; each time a new instruction is to be executed, it is taken from the instruction queue and sent to the decoder, where it is decoded, and the control information is sent to each module of the memory, the arithmetic circuit, and the data width adjustment circuit.
  • the memory includes an input storage module, an output storage module, a synapse storage module, and a cache module, wherein the output storage module can be used to store intermediate operation results and final operation results.
  • the data to be operated on is first transmitted to the cache module; the data in the cache is then read into the data width adjustment circuit, which performs the corresponding processing according to the control instruction, for example expanding the bit width of the data without loss of precision, or forcibly discarding the lowest bits of the data to reduce its bit width. After processing by the data width adjustment circuit, the data is sent to the corresponding arithmetic module; if the control instruction requires no processing, the data is passed directly through the data width adjustment circuit to the corresponding arithmetic module. Similarly, when an arithmetic module finishes its operation, the result is first sent to the data width adjustment circuit.
  • the operation circuit includes a plurality of operation modules, including a first operation module and a second operation module.
  • the arithmetic modules can perform related operations in parallel and can also transfer data to one another, thereby reducing the reuse distance of localized data and further improving the operation speed.
  • the first operation module is mainly used to accelerate linear operations of the same or different computation bit widths in neural network algorithms, including: multiplication, addition, and multiply-add between matrices; operations between a matrix and a vector; between a matrix and a constant; between vectors; between a vector and a constant; and between constants. It can also be used for comparison operations, selection of maximum/minimum values, and the like. Preferred operations include dot product, matrix multiplication, and/or matrix addition.
  • the second operation module is configured to perform the operations not completed in the first operation module, including nonlinear operations, division operations, separate addition operations, or separate multiplication operations.
  • the advantage of this is that the bit width can be dynamically adjusted in the calculation process according to the control instruction, so that the hardware utilization of the operation circuit and the memory can be further improved.
  • FIG. 5 is a schematic diagram of a bit-serial addition tree device for the device according to an embodiment of the present disclosure, which satisfies the requirement of dynamically configurable computation bit width.
  • suppose there are M items of data to be operated on, with a maximum bit width of N, where M and N are positive integers. Data of fewer than N bits is padded to N bits in a reasonable way that does not affect data precision; usable methods include padding the most/least significant bits with 0, padding the most significant bits with the sign bit, shifting, and other operations.
  • the adders in the first through x-th layers of the bit-serial addition tree can perform the addition of numbers of n bits (n ≥ 1), and the adder in the (x+1)-th layer can complete the addition of numbers of not less than N bits. First, the register and the carry input Cin of each adder are initialized to zero.
  • each adder in the first layer completes the addition of the lowest n bits of the data to be operated on, which are input at its a and b terminals; the resulting sum s is transmitted to the a or b terminal of an adder in the next higher layer, and the resulting carry Cout is sent back to the carry input Cin of the same adder, to be added with the data arriving in the next beat.
  • the adders of the following layers operate similarly: they add the incoming data, pass the result to the next higher layer, and feed the carry back to themselves, until the x-th layer is reached.
  • the adder of the x-th layer shifts the operation result, adds it to the previous value taken from the register, and saves the sum back to the register. Then the next-lowest n bits of the data to be operated on are sent into the bit-serial addition tree to complete the corresponding operation; at this time, the Cin of each adder is the carry output from its Cout terminal in the previous beat.
  • meanwhile, the next batch of n-bit slices of the data to be operated on can already be fed in, so operations proceed in parallel, improving operator utilization and further increasing operation speed. When all operations are completed, the data in the register is the final result.
  • an adder may also be turned off while its operands (a, b) and carry input (Cin) are all 0, to save power.
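  • Functionally, the tree computes the sum of M operands n bits at a time. The sketch below abstracts away the per-adder carry registers and models only the slice-and-accumulate behavior, which yields the same result.

```python
def bit_serial_sum(values, N, n=2):
    """Sum non-negative N-bit integers, processing n-bit slices per step."""
    result = 0
    for step in range(0, N, n):
        mask = (1 << n) - 1
        slice_sum = sum((v >> step) & mask for v in values)   # layers 1..x
        result += slice_sum << step                           # shift-accumulate at the top
    return result

assert bit_serial_sum([5, 9, 12], N=4) == 26
```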
  • the bit-serial operator used in embodiments of the present disclosure, such as in the basic multipliers, includes an arithmetic component, a processing component, and a storage component, as shown in FIG. 6.
  • the arithmetic component is used to perform multiplication and/or addition of data of one or more bits; the data to be operated on comes from the storage component and/or the processing component, and the output operation result is transmitted directly to the storage component to be saved, or passed to the processing component for processing.
  • the processing component is used to perform operations such as shifting data, expanding or reducing the data bit width according to a given rule, and modifying one or more bits of the data according to a given rule; the data to be processed comes from the arithmetic component and/or the storage component, and the processed data can be passed to the arithmetic component and/or the processing component.
  • the storage component is used to store data, including the data to be operated on, intermediate operation results, final operation results, and the like.
  • the storage component here can be an on-chip cache.
  • each component can be further subdivided into a plurality of units according to its functions; for example, the arithmetic component can be subdivided into a multiplication unit, an addition unit, and the like.
  • a specific embodiment of a multiplier of the bit-serial operator may include the first basic multiplier of FIG. 7, the second basic multiplier of FIG. 8, or the sparse multiplier of FIG. 9.
  • FIG. 7 is a schematic diagram of a bit-serial operator of the present disclosure: a first basic multiplier device that satisfies the requirement of dynamically configurable computation bit width.
  • the first basic multiplier can be used in the apparatus of the present disclosure. As shown in FIG. 7, it takes a multiplicand of M bits and a multiplier of N bits, where M and N are positive integers; the positions of the multiplier and the multiplicand can be exchanged under the control of the control module.
  • the lower n bits of the multiplier (n is a positive integer with 1 ≤ n ≤ N; optionally 1 < n ≤ N, which can further improve the parallelism of the operation, make full use of hardware resources, and speed up the operation) are input to the input selection circuit and ANDed bit by bit with the multiplicand: if a multiplier bit is 1, the multiplicand itself is output; otherwise 0 is output.
  • at the same time, the multiplier is sent to the first shift register for shifting, the low n bits are shifted out, and the next input to the input selection circuit is the new low n bits.
  • the results of the input selection circuit are sent to the second shift register for the corresponding shifts and then into the addition tree for addition. What is added here is the input-selected and shifted data together with the result of the previous addition; the sum is stored in the result register as an intermediate operation result. When the next group of selected and shifted multiplicands arrives, the result register reads out the intermediate result and sends it to the addition tree for addition. When the multiplier is all 0, the multiplication ends.
  • the operation proceeds as follows for the multiplicand 10111011 and the multiplier 1011 with n = 2. First, the lowest 2 bits of the multiplier, 11, are taken out and sent to the input selection circuit together with the multiplicand; the multiplicand itself is selected in both cases and sent to the second shift register. The selected multiplicand corresponding to the lowest bit needs no shift and remains 10111011, and the selected multiplicand corresponding to the next bit is shifted left by 1 bit to give 101110110; both are sent to the addition tree. Since nothing has been added before, the sum of 10111011 and 101110110, namely 1000110001, is sent to the result register. Next, the multiplier is shifted right by 2 bits, and its lowest 2 bits, 10, are sent to the input selection circuit together with the multiplicand, yielding 0 and 10111011; through the second shift register, 0 shifted left by 2 bits remains 0, and 10111011 shifted left by 3 bits becomes 10111011000; these are sent to the addition tree together with the 1000110001 held in the result register, giving 100000001001, which is sent back to the result register. Finally, the multiplier is shifted right by 2 bits again and becomes all 0, so the operation ends, and the result register holds the final result, 100000001001.
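  • The shift-add behavior of the first basic multiplier can be modeled in a few lines; this functional sketch reproduces the worked example above, processing n = 2 multiplier bits per cycle.

```python
def base_multiply(a, b, n=2):
    result, shift = 0, 0
    while b != 0:
        bits = b & ((1 << n) - 1)               # lowest n bits of the multiplier
        for k in range(n):                      # input selection plus second shift register
            if (bits >> k) & 1:
                result += a << (shift + k)      # addition tree plus result register
        b >>= n                                 # first shift register
        shift += n
    return result

assert base_multiply(0b10111011, 0b1011) == 0b100000001001
```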
  • FIG. 8 is a schematic diagram of a second basic multiplier device for the present device according to yet another embodiment of the present disclosure, capable of meeting the requirement of dynamically configurable computation bit width. As shown in FIG. 8, it takes a multiplicand of M bits and a multiplier of N bits, where M and N are positive integers; the positions of the multiplier and the multiplicand can be exchanged under the control of the control module.
  • the lower m bits of the multiplicand are multiplied with the lower n bits of the multiplier. The multiplier is sent to the first shift register for shifting, the low n bits are shifted out, and the next input to the input selection circuit is the new low n bits. The results of the input selection are sent to the second shift register for the corresponding shifts and then into the addition tree for addition. What is added here is the input-selected and shifted data together with the result of the previous addition; the sum is stored in the result register as an intermediate operation result. When the next group of data has been selected and shifted, the result register reads out the intermediate result and sends it to the addition tree for addition.
  • when the multiplier becomes all 0, the multiplicand is sent to the third shift register for shifting to remove its low m bits, the multiplier is taken out of the backup register, and the above steps are repeated; the multiplication ends when both the multiplicand and the multiplier are all 0.
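  • A functional sketch of the second basic multiplier follows: m bits of the multiplicand meet n bits of the multiplier each step, and every partial product is weighted by the combined shift. The chunk widths m and n below are illustrative choices.

```python
def base_multiply2(a, b, m=4, n=2):
    result, a_shift = 0, 0
    while a != 0:
        a_chunk = a & ((1 << m) - 1)            # low m bits of the multiplicand
        bb, b_shift = b, 0                      # multiplier restored from the backup register
        while bb != 0:
            b_chunk = bb & ((1 << n) - 1)       # low n bits of the multiplier
            result += (a_chunk * b_chunk) << (a_shift + b_shift)
            bb >>= n
            b_shift += n
        a >>= m                                 # third shift register drops the low m bits
        a_shift += m
    return result

assert base_multiply2(0b10111011, 0b1011) == 0b10111011 * 0b1011
```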
  • FIG. 9 is a schematic diagram of a sparse multiplier device for the present apparatus according to an embodiment of the present disclosure, which can meet the requirement of dynamically configurable computation bit width.
  • sparse multipliers can be used for sparse operations, that is, when the 1s in the binary representation of the multiplier or the multiplicand are sparse; the positions of the 1s are then given in a sparse representation, which can further improve the effectiveness of the operation and speed it up.
  • as shown in FIG. 9, for a multiplicand of M bits and a multiplier of N bits, M and N are positive integers; that is, the numbers of bits of the multiplicand and the multiplier here may or may not be equal.
  • the multiplier uses a sparse representation that indicates the positions of its 1s in absolute or relative form.
  • the operation modules of the sparse multiplier provided in this embodiment are all configurable, so when the operation is performed with different representation methods, the devices inside the arithmetic unit can be configured according to requirements. For example, if the result register requires no shifting when accumulating, the shift register connected to the result register can be configured to be inactive, and the shift information of the multiplier need not be transferred to that shift register.
  • the relevant specific details, such as the shifting of the multiplicand and the addition of results, can be adjusted as needed.
  • suppose the positions of the 1s in the multiplier are represented in absolute form, the rightmost bit of the number is called bit 0, and the bit to the left of bit 0 is called bit 1, and so on. For the multiplicand 10111011 and the multiplier 00100010, the multiplier is then expressed as (1, 5).
  • first, the position of the lowest 1 of the multiplier, namely 1, is taken: the multiplicand is sent to the shift register, shifted left by 1 bit to give 101110110, and sent to the adder. Since no earlier numbers have been added, the value sent to the result register is 101110110. Then the position of the next 1 of the multiplier, namely 5, is taken and sent to the shift register together with the multiplicand; in the shift register, the multiplicand is shifted left by 5 bits to obtain 1011101100000, which is sent to the adder. At the same time, the result 101110110 in the result register is taken out; since the absolute representation requires no shifting of it, it can be sent directly to the adder, giving the sum 1100011010110, which is again sent to the result register.
  • at this point, every 1 of the multiplier has been processed, so the operation ends. Alternatively, the multiplier can be expressed in relative form, defined here by the distances between adjacent bits that are not 0, starting from the highest (leftmost) 1 down to the lowest bit. For 00100010, the first 1 is 4 bits away from the next 1, and that second 1 is 1 bit away from the lowest bit, so the multiplier is expressed as (4, 1).
  • with this representation, both the shift register connected to the result register and the one connected to the multiplicand in this embodiment need to work. First, the number 4 is taken: the multiplicand is shifted left by 4 bits and sent to the adder; the data in the result register is 0, so the addition result 101110110000 is obtained and sent back to the result register. Then the number 1 is taken: the multiplicand and the value in the result register are each sent to their shift registers and shifted left by 1 bit, giving 101110110 and 1011101100000, which are sent to the adder for addition, with the result 1100011010110.
  • the result is sent to the result register again.
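  • With the absolute representation, the sparse multiplier reduces to one shift-add per 1 bit of the multiplier; the sketch below reproduces the worked example.

```python
def sparse_multiply(a, one_positions):
    """Multiplier given as the absolute positions of its 1 bits (bit 0 is rightmost)."""
    result = 0
    for p in one_positions:
        result += a << p                        # shift register, adder, result register
    return result

assert sparse_multiply(0b10111011, (1, 5)) == 0b10111011 * 0b00100010
```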
  • FIG. 11 is a schematic structural diagram of an apparatus for performing vector multiplication by a fused vector multiplier according to an embodiment of the present disclosure.
  • to compute the inner product of the vectors A and B, the data of each corresponding dimension is sent to the multiplier to await operation, as shown in FIG. 11.
  • here A and B are required to have the same dimension, (N+1), but the bit widths of their elements are not necessarily the same; n bits of the multiplier are taken per operation, where n is a positive integer greater than 1 and not greater than the bit width of one dimension of B.
  • each dimension performs the same operation as the first dimension; then, through the addition tree, the data sent from these dimensions are added together with the value in the result register, and the sum is sent back to the result register.
  • in this way, the multiplier can flexibly configure the bit width of the data to be operated on, without needing to re-count the multiplicand shift amounts each time a group of data is multiplied.
  • for example, suppose the bit width of A is 8 bits and the bit width of B is 4 bits.
  • the operation flow using the above basic multiplier or sparse multiplier (assuming n is 2, that is, the multiplier moves 2 bits at a time) is divided into two stages: first the products of the respective components are calculated separately, and then they are summed, as shown in FIG. 10. Specifically, for the operands Ai and Bi of a certain dimension, the shift register is cleared. In the first clock cycle, the lowest two bits bi1, bi0 of Bi are taken, input-selected, shifted, and fed to the adder to obtain the value Ai * bi1bi0, and the shift register is increased by 2. In the next clock cycle, Bi is shifted right by 2 bits, the new lowest two bits bi3, bi2 are taken, input-selected, and shifted to obtain the shifted Ai * bi3bi2, which is added to the previous sum to obtain Ai * bi3bi2bi1bi0, that is, the final operation result Ai * Bi of this dimension.
  • each product is then sent to an addition tree for addition, and the final vector inner product is obtained.
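  • The two-stage flow can be sketched as follows: stage one forms each per-dimension product by shift-add, restarting the shift count for every dimension, and stage two sums the products. The element bit width is an assumed parameter.

```python
def dot_product_two_stage(A, B, n=2, width=4):
    products = []
    for a, b in zip(A, B):                      # one multiplier per dimension, or in turn
        acc, mask = 0, (1 << n) - 1
        for shift in range(0, width, n):        # shift count restarts for each dimension
            acc += (a * ((b >> shift) & mask)) << shift
        products.append(acc)
    return sum(products)                        # stage two: addition tree

assert dot_product_two_stage([3, 5, 7], [2, 9, 11]) == 3*2 + 5*9 + 7*11
```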
  • in the first stage, one multiplier can be selected to calculate each dimension in turn, or multiple multipliers can be provided to operate in parallel, one dimension per multiplier, as shown in FIGS. 11 and 12. When basic multipliers are used, the shift amount of the multiplier Bi needs to be re-counted for each dimension.
  • the multiplier at this stage may be the first basic multiplier, the second basic multiplier, or the sparse multiplier described above.
  • the above arithmetic units can be combined as required to perform the desired operations. For example, the second basic multiplier and the bit-serial addition tree can be combined, as shown in FIG. 13, to perform vector multiplication.
  • to compute the inner product of the vectors A and B, the data of the corresponding dimensions are sent to the multipliers to await operation, as shown in FIG. 13. A and B are required to have the same dimension, (N+1), but the bit widths of their elements are not necessarily the same. A is taken as the multiplicand and B as the multiplier; in each operation, A contributes a specified m bits and B a specified n bits, where m is a positive integer not greater than the bit width of one dimension of A, and n is a positive integer not greater than the bit width of one dimension of B.
  • in each round, the low n bits of B are multiplied with the low m bits of A (B is then shifted by n bits), and the products are sent to the bit-serial addition tree for addition; the data already in the storage unit is shifted by the third shift unit and added in, and the result of the operation is saved back to the storage unit. After B has been shifted through, A is shifted by m bits and then operated on in turn with the n-bit slices of B.
  • when the operation ends, the data in the storage unit is the final operation result.
  • in this way, the multiplier can flexibly configure the bit widths of the data to be operated on without saving intermediate data, thereby reducing storage overhead and speeding up the operation. The characteristics of low data bit width and high vector dimensionality can also be fully exploited, and the process can be executed in parallel in a pipelined manner, reducing the time required for the operation, further increasing the operation speed, and improving performance per watt.
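  • By contrast with the two-stage flow, a fused flow shares one bit-slice loop across all dimensions, so the shift amount is maintained once rather than re-counted per dimension; each cycle, the partial products of every dimension and the running result are summed together, as in the addition tree. This sketch is functional, under the same assumed widths as above.

```python
def dot_product_fused(A, B, n=2, width=4):
    result, mask = 0, (1 << n) - 1
    for shift in range(0, width, n):            # one shared shift for all dimensions
        partials = [(a * ((b >> shift) & mask)) << shift
                    for a, b in zip(A, B)]      # one partial product per dimension
        result = sum(partials) + result         # addition tree plus result register
    return result

assert dot_product_fused([3, 5, 7], [2, 9, 11]) == 3*2 + 5*9 + 7*11
```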
  • the device and the method can significantly improve the operation speed of a neural network and are dynamically configurable, meeting the requirements of the diversity of data bit widths and the dynamic variability of the data bit width during the operation, with the advantages of strong flexibility, high configurability, fast speed, and low power consumption.
  • a processing method for the processing device with dynamically configurable computation bit width includes the following steps:
  • S1401: the control circuit generates a control instruction and transmits it to the memory, the data width adjustment circuit, and the arithmetic circuit;
  • S1402: the memory inputs the data to be operated on by the neural network to the arithmetic circuit according to the received control instruction;
  • S1403: the data width adjustment circuit adjusts the width of the data to be operated on, the intermediate operation results, the final operation results, and/or the data to be cached according to actual needs;
  • S1404: the arithmetic circuit selects bit-serial operators, that is, multipliers and adder circuits of the corresponding types, according to the received control instruction;
  • S1405: the arithmetic circuit operates on the neural network data of different computation bit widths according to the input data to be operated on, the neural network parameters, and the control instruction.
  • with its data width adjustment circuit, the method of this embodiment can significantly improve the operation speed of a neural network and offers dynamic configurability, satisfying the requirements of the diversity of data bit widths and the dynamic variability of the data bit width during the operation.
  • optionally, the first operation module in step S1405 performs operations on the data to be operated on by the neural network using an adder circuit together with a basic multiplier, a sparse multiplier, and/or a fused vector multiplier.
  • FIG. 15 is a schematic structural diagram of a processing apparatus according to still another embodiment of the present disclosure.
  • the device is mainly divided into three parts, a control circuit, an arithmetic circuit and a memory.
  • the control circuit sends control signals to the arithmetic circuit and the memory to control the operation of the two and coordinate the data transmission between the two.
•   For the function of each part, refer to the description of the corresponding parts in the embodiment shown in FIG. 1; details are not described here again.
  • FIG. 16 is a block diagram showing the structure of a processing apparatus according to an embodiment of the present disclosure.
•   The structure shown in FIG. 16 is the structure shown in FIG. 2 with the data width adjustment circuit removed, i.e., the memory is directly connected to the arithmetic circuit; for the corresponding arrangements, refer to the description above.
  • the three modules can be executed in parallel in a pipelined manner.
•   The device can accelerate the operation process of a convolutional neural network, reduce on-chip/off-chip data exchange, and save storage space.
  • FIG. 17 is a schematic structural diagram of a processing apparatus according to another embodiment of the present disclosure.
•   The structure shown in FIG. 17 is similar to that of FIG. 3, except that FIG. 17 does not include the structure and connection relationships related to the data width adjustment circuit; for the connections and functions implemented in FIG. 17, refer to the description of the embodiment corresponding to FIG. 3, which is not repeated here.
  • the processing apparatus of this embodiment significantly improves the calculation efficiency in the large-scale calculation with many parameters.
  • the device can effectively accelerate the operation process of the convolutional neural network, and is especially suitable for the case where the network scale is large and the parameters are relatively large.
  • FIG. 18 is a schematic structural diagram of a processing apparatus according to still another embodiment of the present disclosure.
•   The structure shown in FIG. 18 is similar to that of FIG. 4, except that FIG. 18 does not include the structure and connection relationships related to the data width adjustment circuit; for the connections and functions implemented in FIG. 18, refer to the description of the embodiment corresponding to FIG. 4, which is not repeated here.
•   FIG. 19 is a schematic diagram of a base multiplier for the present device according to yet another embodiment of the present disclosure, capable of meeting the requirement of a dynamically configurable calculation bit width.
•   Assume the multiplicand has M bits and the multiplier has N bits, where M and N are positive integers; that is, the numbers of bits of the multiplicand and the multiplier may or may not be equal.
•   The lower n bits of the multiplier (n is a positive integer, and 1 ≤ n ≤ N) are input to the input selection circuit, where each of these low n bits is ANDed with the multiplicand; that is, where the multiplier bit is 1, the multiplicand itself is taken, and otherwise 0 is taken.
•   Meanwhile, the multiplier is sent to the first shift register for shifting: the low n bits are shifted out, and what is input to the input selection circuit next time is the new low n bits.
•   The result of the input selection is input to the second shift register for the corresponding shifting, and is then sent to the addition tree for accumulation.
•   What is accumulated here is the data that has undergone input selection and shifting, together with the result of the previous accumulation.
•   The obtained result is stored in the result register as an intermediate result.
•   At the next accumulation, the result register fetches the intermediate result and sends it to the addition tree (adder) for accumulation.
•   When the multiplier bits are all 0, the multiplication ends.
•   Assume the multiplicand is 10111011 and the multiplier is 1011, with 2 bits of the multiplier taken per operation. The operation proceeds as follows: first, the lowest 2 bits of the multiplier, 11, are taken out and sent to the input selection circuit together with the multiplicand; both select the multiplicand itself. The selected multiplicand corresponding to the lowest bit does not need to be shifted, i.e., 10111011, while the selected multiplicand corresponding to the next-lowest bit is shifted left by 1 bit, i.e., 101110110, and both are sent to the addition tree. Since no number has been accumulated before, the sum of 10111011 and 101110110, i.e., 1000110001, is sent to the result register.
•   Then the multiplier is shifted right by 2 bits, and the new lowest 2 bits, namely 10, are sent to the input selection circuit together with the multiplicand, obtaining 0 and 10111011; the second shift register then shifts these left by 2 and 3 bits respectively, so 10111011 becomes 10111011000, which is sent to the addition tree together with the 1000110001 in the result register, yielding 100000001001, which is sent to the result register.
•   The multiplier is shifted right by 2 bits again and is now all 0s, so the operation ends; the value in the result register, 100000001001, is the final result.
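The base multiplier above can be modeled in a few lines of Python. This is a hedged sketch (the register names and control structure are illustrative), and it reproduces the worked example 10111011 × 1011 = 100000001001:

```python
def base_multiply(multiplicand, multiplier, n=2):
    # Each cycle: the low n bits of the multiplier gate (AND-select) shifted
    # copies of the multiplicand, the addition tree accumulates them with the
    # result register, and the multiplier shifts right by n bits until all 0.
    result = 0                  # result register
    shift = 0                   # second shift register: bits already consumed
    while multiplier != 0:
        for k in range(n):      # input selection for each of the low n bits
            if (multiplier >> k) & 1:                  # bit 1: take multiplicand
                result += multiplicand << (shift + k)  # bit 0: take 0
        multiplier >>= n        # first shift register shifts out the low n bits
        shift += n
    return result

assert base_multiply(0b10111011, 0b1011) == 0b100000001001  # worked example
```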
•   The sparse multiplier can improve the effectiveness of the operation and speed it up in the sparse case, i.e., when the multiplier or the multiplicand is given in a sparse representation that records the positions of its 1s.
•   Again the multiplicand has M bits and the multiplier has N bits, where M and N are positive integers; that is, the numbers of bits of the multiplicand and the multiplier may or may not be equal.
•   Here the multiplier is given in a sparse representation, and the positions of the 1s in the multiplier are expressed in absolute or relative form.
•   The arithmetic circuit is configurable, so when the operation is performed with different representation methods, the devices inside the arithmetic unit can be configured as required. For example, if the result register need not be shifted during accumulation, the shift register connected to the result register can be specified as inactive, and the shift information of the multiplier need not be transferred to that shift register; the relevant details, such as the shifting of the multiplicand and the accumulation of the results, can be adjusted as needed.
•   First, consider the case in which the positions of the 1s in the multiplier are represented by absolute position, calling the rightmost bit of the number the 0th bit and the bit to the left of the 0th bit the 1st bit, and so on.
•   Assume the multiplicand is 10111011 and the multiplier is 00100010; the multiplier is then expressed as (1, 5).
•   First, the position of the first 1 of the multiplier, namely 1, is taken; the multiplicand is sent to the shift register, shifted left by 1 bit to give 101110110, and sent to the adder. Since no numbers have been added before, the result sent to the result register is 101110110. Then the position of the next 1 of the multiplier, namely 5, is taken and sent to the shift register together with the multiplicand; in the shift register the multiplicand is shifted left by 5 bits to obtain 1011101100000, which is sent to the adder. At the same time, the result 101110110 in the result register is taken out.
•   Since this representation requires no shifting of the result, it can be sent directly to the adder for accumulation, obtaining 1100011010110.
•   The accumulated result is sent to the result register again.
•   All the 1s of the multiplier have now been processed, so the operation ends.
•   When the multiplier is expressed in a relative manner, the representation is defined as the number of bits between each pair of adjacent digits that are not 0, counting from the first nonzero digit at the highest (leftmost) end down to the lowest digit. For 00100010, the first nonzero digit is 4 bits away from the next nonzero digit, and the second nonzero digit is 1 bit away from the lowest digit, so 00100010 is expressed as (4, 1).
•   In this embodiment, both the shift register connected to the result register and the shift register connected to the multiplicand need to work.
•   First, the first digit of the multiplier, 4, is fetched and sent to the two shift registers; the multiplicand is shifted left by 4 bits, the data in the result register is likewise shifted left by 4 bits, and both are sent to the adder for accumulation.
•   At this time the data in the result register is 0, so the accumulated result 101110110000 is obtained and saved to the result register.
•   Then the next digit of the multiplier, 1, is fetched; the multiplicand and the contents of the result register are shifted left by 1 bit in the shift registers, giving 101110110 and 1011101100000, which are sent to the adder for accumulation, yielding 1100011010110.
•   The result is sent to the result register again; all the 1s of the multiplier have now been processed, so the operation ends.
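Both sparse representations can be modeled directly. The Python sketch below is illustrative (the hardware shift registers become plain shifts) and reproduces the two worked examples, in which the multiplicand is 10111011 and the multiplier 00100010 is written as (1, 5) in absolute form and (4, 1) in relative form:

```python
def sparse_multiply_absolute(multiplicand, one_positions):
    # Absolute representation: each entry is the bit position of a 1 in the
    # multiplier (bit 0 is the rightmost); the result register is never shifted.
    result = 0
    for pos in one_positions:
        result += multiplicand << pos     # shift multiplicand, then accumulate
    return result

def sparse_multiply_relative(multiplicand, gaps):
    # Relative representation: each entry is the distance to the next nonzero
    # digit, processed from the highest end; both the result register and the
    # multiplicand are shifted by the current gap before accumulating.
    result = 0
    for gap in gaps:
        result = (result << gap) + (multiplicand << gap)
    return result

M = 0b10111011
assert sparse_multiply_absolute(M, [1, 5]) == 0b1100011010110
assert sparse_multiply_relative(M, [4, 1]) == 0b1100011010110
```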
  • FIG. 22 is a schematic structural diagram of an apparatus for performing vector multiplication by a fused vector multiplier according to an embodiment of the present disclosure.
•   To calculate the inner product of two vectors A and B, the data of each corresponding dimension is sent to the multiplier to await operation, as shown in FIG. 22.
•   A and B are required to have the same dimension, (N+1), but the bit width of each dimension is not necessarily the same; it is assumed that n bits of the multiplier are taken per operation, where n is a positive integer greater than 1 and not greater than the bit width of one dimension of B.
•   With this multiplier, the bit width of the data to be operated on can be configured flexibly, without re-counting the multiplicand shift amounts every time a set of data multiplications is performed.
•   At the same time, the characteristics of low data bit width and high vector dimension can be fully exploited, and the process can be executed in parallel in a pipelined manner, reducing the time required for the operation, further increasing operation speed and improving performance per watt.
•   Assume the bit width of each dimension of A is 8 bits, i.e., A_i = a_{i7}a_{i6}...a_{i1}a_{i0}, and the bit width of each dimension of B is 4 bits, i.e., B_i = b_{i3}b_{i2}b_{i1}b_{i0}.
•   The operation flow using the above-described basic or sparse multiplier (assuming n is 2, i.e., the multiplier is shifted by 2 bits at a time) is divided into two phases: first the products of the individual components are calculated, and then they are summed, as shown in Figure 21. Specifically, for a certain dimension to be calculated, with A_i and B_i given, the shift register is cleared.
•   In the first clock cycle, the lowest two bits b_{i0}, b_{i1} of B_i are taken, input-selected, shifted, and fed to the adder to give the partial product of A_i with those two bits, and 2 is added to the shift register; in the next clock cycle, B_i is shifted right by 2 bits and the new lowest two bits b_{i2}, b_{i3} are taken, input-selected and shifted to give the corresponding partial product, which is added to the previous sum to obtain the final result A_i*B_i of that dimension.
•   Then the operation of the next dimension is performed: A_{i+1} and B_{i+1} are input and the shift register is cleared, and so on, until every dimension has been processed and (A_0*B_0, A_1*B_1, ..., A_7*B_7) is obtained; phase 1 is then complete. In phase 2, the products are sent to an addition tree for summation, giving the final vector inner product.
•   In phase 1, one multiplier can be used to calculate each dimension in turn, or multiple multipliers can operate in parallel, each handling one dimension, as shown in FIG. 21. When multiple multipliers are used, the shift value of the multiplier B_i of each dimension must be counted separately.
  • the multiplier at this stage can be either the basic multiplier or the sparse multiplier described above.
•   In contrast, the fused vector multiplier performs the whole operation by horizontal accumulation across the dimensions.
  • the structure is as shown in Fig. 22.
•   As soon as the partial product of one component of each dimension has been calculated, it is sent to the addition tree for accumulation, until the operation completes and the final result is obtained.
  • the operation flow is as shown in the elliptical box of Fig. 23.
•   In each cycle, the lowest bits of every multiplier B_i select the corresponding multiplicands A_i; the selected products are sent to the addition tree together with the (shifted) data of the result register, the accumulated sum is written back to the result register, and the shift register is incremented by one, until all multiplier bits have been consumed (see the sketch below).
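A minimal Python model of this fused, horizontally accumulated flow, assuming one bit of every multiplier is consumed per cycle; the names are illustrative. Note that no per-dimension product A_i*B_i is ever stored:

```python
def fused_inner_product(A, B, width_b):
    # Cycle j: bit j of every multiplier B[i] selects A[i]; all selected
    # multiplicands enter one addition tree together with the result
    # register, and the shift count grows by one.
    result = 0                                    # result register
    for j in range(width_b):                      # one bit-slice per cycle
        selected = sum(a for a, b in zip(A, B) if (b >> j) & 1)
        result += selected << j                   # addition tree + shift
    return result

A = [3, 5, 7, 9, 11, 13, 15, 17]                  # 8 dimensions, as in the text
B = [1, 2, 3, 4, 5, 6, 7, 8]                      # 4-bit multipliers
assert fused_inner_product(A, B, 4) == sum(a * b for a, b in zip(A, B))
```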
•   According to another embodiment, a processing method for a processing device with dynamically configurable calculation bit width includes the following steps:
•   S2400: The control circuit generates a control instruction and transmits it to the memory and the operation circuit;
•   S2401: The memory inputs the data to be calculated of the neural network to the operation circuit according to the received control instruction;
•   S2402: The operation circuit selects, according to the received control instruction, the multipliers and adder circuits of the corresponding type in the first operation module;
•   S2403: The operation circuit calculates the data to be calculated of the neural network with different calculation bit widths according to the input data to be calculated, the neural network parameters, and the control instruction.
•   The first operation module in step S2403 performs operations on the data to be operated of the neural network using an adder, and a base multiplier, a sparse multiplier, and/or a fused vector multiplier.
•   The processing apparatus and method above can significantly improve the operation speed of the neural network and are dynamically configurable, satisfying the requirements of data bit width diversity and the dynamic variability of data bit width during the operation, with high flexibility, high configurability, fast operation speed, and low power consumption.
•   The present disclosure also provides an operation method and an operation device using an offline model. After an offline model is generated, operations can be performed directly according to the offline model, avoiding the extra overhead of running the entire software architecture, including the deep learning framework. This will be specifically described below in conjunction with specific embodiments.
•   In typical application scenarios, the neural network accelerator programming framework sits at the top of the software stack; the programming framework can be Caffe, Tensorflow, Torch, etc. As shown in Figure 25, from bottom to top the stack consists of the neural network processor (dedicated hardware for neural network operations), the hardware driver (for software to call the neural network processor), the neural network processor programming library (providing an interface for invoking the neural network processor), the neural network processor programming framework, and the advanced applications of neural network operations.
  • An aspect of an embodiment of the present disclosure provides a method for computing a neural network, including the steps of:
•   Step 1: Obtain input data;
•   Step 2: Acquire or determine an offline model according to the input data, and determine an operation instruction from the offline model for subsequent calculation calls;
•   Step 3: Call the operation instruction and operate on the data to be processed to obtain an operation result for output.
•   The input data includes the data to be processed, a network structure, and weight data, or the input data includes the data to be processed and offline model data.
•   The offline model in step 2 may already exist or may be built afterwards from external data (such as a network structure or weight data); obtaining the operation instruction by means of the offline model streamlines the calculation process.
•   Calling the operation instruction in step 3 means that, when the input data includes only the data to be processed (with neither an offline model nor data for determining one), the network operation is performed according to the cached operation instruction alone.
•   When the input data includes the data to be processed, a network structure, and weight data, the following steps are performed:
•   Step 11: obtain the input data;
•   Step 12: construct an offline model according to the network structure and the weight data;
•   Step 13: parse the offline model, obtain an operation instruction and cache it for subsequent calculation calls;
•   Step 14: operate on the data to be processed according to the operation instruction to obtain an operation result for output.
•   In this case, the offline model is first constructed according to the network structure and the weight data, and the offline model is then parsed to obtain the operation instruction; this allows full performance in low-memory, real-time application environments in which no offline model is stored, and makes the calculation process more concise and fast.
•   When the input data includes the data to be processed and an offline model, the following steps are performed:
•   Step 21: acquire the input data;
•   Step 22: parse the offline model, obtain an operation instruction and cache it for subsequent calculation calls;
•   Step 23: operate on the data to be processed according to the operation instruction to obtain an operation result for output.
•   When the input data includes an offline model, i.e., an offline model has already been established, the offline model is parsed and the operation instruction is obtained during the operation, avoiding the overhead of running the entire software architecture including the deep learning framework.
•   When the input data includes only the data to be processed, the following steps are performed:
•   Step 31: acquire the input data;
•   Step 32: call the cached operation instruction, and operate on the data to be processed to obtain an operation result for output.
•   In each case, the operation result is obtained by operating on the data to be processed according to the operation instruction.
•   In the above methods, a neural network processor operates on the data to be processed according to the operation instruction. The neural network processor is mainly used for neural network operations and receives instructions, the data to be processed, and/or a network model (for example, an offline model); for a multi-layer neural network, for example, output layer data is calculated based on input layer data together with data such as neurons, weights, and biases.
  • the neural network processor has an instruction cache unit for buffering the received operational instructions.
  • the neural network processor further has a data buffer unit for buffering the data to be processed.
•   The data to be processed is input to the neural network processor and temporarily stored in the data buffer unit, and the operation is then performed in accordance with the operation instruction.
  • an embodiment of the present disclosure further provides an operation device, including:
•   an input module configured to acquire input data, where the input data includes the data to be processed, a network structure, and weight data, or includes the data to be processed and offline model data;
•   a model generation module configured to construct an offline model according to the input network structure and weight data;
•   a neural network operation module configured to generate an operation instruction based on the offline model data in the input module or the offline model built by the model generation module, and to operate on the data to be processed based on the operation instruction;
•   an output module configured to output the operation result; and
•   a control module that detects the type of the input data and performs the following operations:
•   When the input data includes the data to be processed, a network structure, and weight data, the control module controls the input module to input the network structure and the weight data into the model generation module to construct an offline model, and controls the neural network operation module to operate on the data to be processed input by the input module, based on the offline model provided by the model generation module;
•   When the input data includes the data to be processed and an offline model, the control module controls the input module to input the data to be processed and the offline model into the neural network operation module, and controls the neural network operation module to generate an operation instruction based on the offline model, cache it, and operate on the data to be processed based on the operation instruction;
•   When the input data includes only the data to be processed, the control module controls the input module to input it into the neural network operation module, and controls the neural network operation module to call the cached operation instruction and operate on the data to be processed (a dispatch sketch follows below).
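The control module's three cases can be summarized in a short Python sketch. All class, method, and key names here are assumptions made for illustration; the patent defines the behavior, not this API.

```python
class OperationDevice:
    # Illustrative stand-in for the operation device; method bodies are
    # placeholders, since the patent specifies behavior rather than an API.
    def __init__(self):
        self.instruction_cache = None        # cached operation instructions

    def build_offline_model(self, structure, weights):
        # model generation module: e.g. AlexNet + bvlc_alexnet.caffemodel
        return {"structure": structure, "weights": weights}

    def parse_to_instructions(self, offline_model):
        # model parsing unit: map the offline model to operation instructions
        return ["run_" + offline_model["structure"]]

    def execute(self, instructions, data):
        # neural network processor: placeholder for the actual operation
        return (instructions, data)

def dispatch(device, input_data):
    # Control module logic for the three input cases described above.
    if "network_structure" in input_data:            # case 1: structure + weights
        model = device.build_offline_model(input_data["network_structure"],
                                           input_data["weight_data"])
        device.instruction_cache = device.parse_to_instructions(model)
    elif "offline_model" in input_data:              # case 2: offline model given
        device.instruction_cache = device.parse_to_instructions(
            input_data["offline_model"])
    # case 3: only data to be processed -> reuse the cached instructions
    return device.execute(device.instruction_cache, input_data["data"])

dev = OperationDevice()
dispatch(dev, {"network_structure": "AlexNet",
               "weight_data": "bvlc_alexnet.caffemodel", "data": "picture_0"})
dispatch(dev, {"data": "picture_1"})     # same network: cached instructions reused
```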
  • the above neural network operation module includes a model analysis unit and a neural network processor, wherein:
  • a model parsing unit configured to generate an operation instruction based on the offline model
•   The neural network processor is configured to cache the operation instruction for subsequent calculation calls; or, when the input data includes only the data to be processed, to call the cached operation instruction and operate on the data to be processed based on it to obtain the operation result.
  • the neural network processor described above has an instruction cache unit for caching the operation instructions for subsequent computational calls.
  • the offline model may be a text file defined according to a special structure, and may be various neural network models, such as Cambricon_model, AlexNet_model, GoogleNet_model, VGG_model, R-CNN_model, GAN_model, LSTM_model, RNN_model, ResNet_model.
•   The offline models include, but are not limited to, the models proposed in this embodiment.
•   The offline model may include the network weights of the respective computing nodes in the original network and necessary network structure information such as instruction data, where the instructions may include information on the computing attributes of the respective computing nodes and the connection relationships between them; thus, when the original network is to be run again, the offline model corresponding to the network can be run directly, without compiling the same network again, shortening the processor's running time and improving its processing efficiency.
•   The processor may be a general-purpose processor, such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), or an IPU (Intelligence Processing Unit), i.e., a processor dedicated to artificial neural network operations.
  • the data to be processed is an input that can be processed by a neural network, such as at least one of a continuous single picture, voice, or video stream.
•   The foregoing network structure may be any of various neural network structures, such as AlexNet, GoogleNet, ResNet, VGG, R-CNN, GAN, LSTM, RNN, etc., and is not limited to the structures listed in this embodiment. It should be pointed out that the network structure and the offline model correspond to each other: for example, when the network structure is RNN, the offline model is RNN_model, and that model includes the network weight of each node in the RNN network and necessary RNN network structure information such as instruction data.
  • the instruction may include information such as calculation properties of the respective computing nodes and connection relationships between the respective computing nodes.
  • the computing device of the embodiment of the present disclosure may have the following three execution forms according to different input data of the input module:
•   In the first case, the control module controls the input module to transmit the network structure and the weight data to the model generation module, and to transmit the data to be processed to the model parsing unit;
•   The control module controls the model generation module to generate an offline model according to the specific network structure and the corresponding weight data (the offline model may be a text file defined according to a preset structure, and may include the network weights of each computing node in the neural network and necessary network structure information such as instruction data, where the instructions may include information on the calculation properties of each computing node and the connection relationships between the computing nodes; for example, the offline model may be constructed according to the corresponding network structure type and weight data), and to transmit the generated offline model to the model parsing unit; the control module then controls the model parsing unit to parse the received offline model into operation instructions recognizable by the neural network processor (that is, to map the text file of the offline model described above to the corresponding network operation);
•   The operation instruction and the data to be processed are transmitted to the neural network processor, which operates on the data to be processed according to the received operation instruction to obtain the operation result, and the operation result is transmitted to the output module for output.
•   In the second case, the control module controls the input module to transmit the offline model and the data to be processed directly to the model parsing unit; the subsequent working principle is the same as in the first case.
•   In the third case, the control module controls the input module to transmit the data to be processed to the neural network processor via the model parsing unit, and the neural network processor operates on the data according to the cached operation instructions to obtain the operation result.
•   The input module can include a determination module for determining the type of the input data. It can be understood that the third case should not occur on the first use of the neural network processor, so as to ensure that there are operation instructions in the instruction cache.
•   On the first neural network operation, the data input by the input module should include the network structure, the weight data, and the data to be processed, and the model generation module generates a new offline model for the subsequent network operations;
•   thereafter, when the offline model of the current network operation differs from that of the previous one, the data input by the input module should include the offline model and the data to be processed; when the current network operation uses the same offline model as the previous one, the input data need only include the data to be processed.
  • the computing device described in this disclosure is integrated as a sub-module into a central processor module of an entire computer system.
  • the data to be processed and the offline model are transferred to the computing device under the control of the central processor.
  • the model parsing unit parses the incoming neural network offline model and generates operational instructions.
  • the operation instruction and the data to be processed are sent to the neural network processor, and the operation result is obtained through the operation processing, and the operation result is returned to the main storage unit.
•   As long as the network structure is not changed, only the data to be processed needs to be transmitted continuously to complete the neural network calculations and obtain the operation results.
  • this embodiment provides an operation method, including the following steps:
  • Step 11 obtaining input data
  • Step 12 Construct an offline model according to the network structure and the weight data
  • Step 13 parsing the offline model, obtaining an operation instruction and buffering, for subsequent calculation call;
  • Step 14 Perform operation on the data to be processed according to the operation instruction to obtain a neural network operation result for output;
  • Step 21 Acquire input data
  • Step 22 Parsing the offline model, obtaining an operation instruction and buffering, for subsequent calculation calls;
  • Step 23 Perform operation on the data to be processed according to the operation instruction to obtain a neural network operation result for output;
  • Step 31 Acquire input data.
  • Step 32 Calling the cached operation instruction, and performing processing on the data to be processed to obtain a neural network operation result for output.
  • the neural network processor processes the data to be processed according to the operation instruction to obtain an operation result;
  • the neural network processor has an instruction buffer unit and a data buffer unit for respectively buffering the received operation instruction and the data to be processed.
  • the input network structure proposed in this embodiment is AlexNet
  • the weight data is bvlc_alexnet.caffemodel
  • the data to be processed is a continuous single picture
  • the offline model is Cambricon_model.
•   The offline model Cambricon_model can be parsed to generate a series of operation instructions, which are transmitted to the instruction cache unit on the neural network processor 2707, and the input picture input by the input module 2701 is transmitted to the data buffer unit on the neural network processor 2707.
  • the process of using the neural network processor to perform operations can be greatly simplified, and the extra memory and IO overhead of calling the traditional whole programming framework can be avoided.
  • the neural network accelerator can fully exert its computing performance in a low-memory, real-time environment.
•   The embodiment further provides an operation device, including: an input module 2701, a model generation module 2702, a neural network operation module 2703, an output module 2704, and a control module 2705, where the neural network operation module 2703 includes a model parsing unit 2706 and a neural network processor 2707.
•   The keyword of the device is offline execution: after the offline model is generated, the offline model is used directly to generate the relevant operation instructions, the weight data is transmitted, and the data to be processed is processed. More specifically:
  • the input module 2701 is configured to input a combination of a network structure, weight data, and data to be processed or a combination of an offline model and data to be processed.
  • the network structure and weight data are passed to the model generation module 2702 to generate an offline model for performing the following operations.
  • the offline model and the data to be processed are directly transmitted to the model parsing unit 2706 to perform the following operations.
•   The output module 2704 is configured to output the operation result generated according to a specific network structure and a set of data to be processed.
  • the output data is calculated by the neural network processor 2707.
•   The model generation module 2702 is configured to generate, according to the input network structure parameters and the weight data, an offline model for use by the lower layers.
  • the model parsing unit 2706 is configured to parse the incoming offline model, generate an operation instruction that can be directly transmitted to the neural network processor 2707, and transmit the to-be-processed data input by the input module 2701 to the neural network processor 2707.
  • the neural network processor 2707 is configured to perform an operation according to the incoming operation instruction and the data to be processed, and the determined operation result is transmitted to the output module 2704, and has an instruction buffer unit and a data buffer unit.
  • the above control module 2705 is configured to detect an input data type and perform the following operations:
•   When the input data includes the data to be processed, a network structure, and weight data, the control module controls the input module 2701 to input the network structure and weight data into the model generation module 2702 to construct an offline model, and controls the neural network operation module 2703 to perform neural network operations on the data to be processed input by the input module 2701, based on the offline model provided by the model generation module 2702;
•   When the input data includes the data to be processed and an offline model, the control module controls the input module 2701 to input them into the neural network operation module 2703, and controls the neural network operation module 2703 to generate operation instructions based on the offline model, cache them, and perform neural network operations on the data to be processed based on the operation instructions;
•   When the input data includes only the data to be processed, the control module controls the input module 2701 to input it into the neural network operation module 2703, and controls the neural network operation module 2703 to call the cached operation instructions and perform neural network operations on the data to be processed.
  • the input network structure proposed in this embodiment is AlexNet, the weight data is bvlc_alexnet.caffemodel, and the data to be processed is a continuous single picture.
•   The model generation module 2702 generates a new offline model Cambricon_model according to the input network structure and weight data; the generated offline model Cambricon_model can also be used alone as the next input. The model parsing unit 2706 can then parse the offline model Cambricon_model to generate a series of operation instructions.
  • the model parsing unit 2706 transmits the generated operation instruction to the instruction cache unit on the neural network processor 2707, and transmits the input picture input from the input module 2701 to the data buffer unit on the neural network processor 2707.
•   The present disclosure also provides an operation device and an operation method supporting composite scalar instructions, i.e., instructions that unify floating point instructions and fixed point instructions. Floating point and fixed point instructions are thereby unified to a large extent: the type of instruction is not distinguished during the decoding phase, and the operands are determined to be floating point data or fixed point data according to the address in the operand address field, which simplifies the decoding logic of the instructions and also makes the instruction set more streamlined. This will be specifically described below in conjunction with specific embodiments.
  • FIG. 28 is a schematic structural diagram of a device for supporting a composite scalar instruction according to an embodiment of the present disclosure. As shown in FIG. 28, the device includes a controller module 2810, a storage module 2820, an operator module 2830, and an input/output module 2840.
  • the controller module 2810 is configured to read an instruction from the storage module and store it in a local instruction queue, and then decode the instruction in the instruction queue into a control signal to control the behavior of the storage module, the operator module, and the input/output module.
  • the storage module 2820 includes storage devices such as a register file, a RAM, and a ROM for storing different data such as instructions and operands.
  • the operands include floating point data and fixed point data.
•   In this embodiment, the memory module stores the floating point data and the fixed point data in spaces corresponding to different addresses, such as different RAM addresses or different register numbers, so that whether the read data is floating point or fixed point can be judged from the address or the register number.
•   The operator module 2830 can perform arithmetic, logical, shift, and complement operations on floating point and fixed point data, where the arithmetic operations include addition, subtraction, multiplication, and division, and the logical operations include AND, OR, NOT, and XOR.
•   The operator module can determine whether the read data is floating point or fixed point from the address or register number in which the operand resides; the arithmetic unit reads the data from the storage module and performs the corresponding operation. The intermediate results of the operation are stored in the storage module, and the final calculation result is stored in the input/output module.
•   The input/output module 2840 stores and transmits input and output data. During initialization, it stores the initial input data and the compiled composite scalar instructions into the storage module, and after the operation ends it receives the final operation result transmitted by the operator module. In addition, the input/output module can also read from memory the information required for compilation, for the computer compiler to compile programs into instructions.
  • the apparatus for supporting the composite scalar instruction provided by the embodiment of the present disclosure provides an efficient execution environment for the composite scalar instruction.
  • 29A and 29B are diagrams showing an example of an organization form of a storage module according to an embodiment of the present disclosure.
  • the storage module stores the floating point data and the fixed point data in different address spaces, such as different addresses or different register numbers, so that the address and the register number can be used to determine whether the read data is a floating point number or a fixed point number.
•   Here, the present disclosure uses a storage module composed of a RAM with start address 0000H and end address 3FFFH, together with a register file of 16 registers, as an example of how floating point numbers and fixed point numbers are stored separately.
•   As shown in FIG. 29A, in the RAM, fixed point data is stored only in the RAM units with addresses 0000H to 1FFFH, and floating point data only in the RAM units with addresses 2000H to 3FFFH; instructions can be stored in any RAM unit, and information that is invariant in the instruction set can also be stored in the ROM.
•   As shown in FIG. 29B, registers 0 to 7 are used to store the RAM addresses of fixed point data, and registers 8 to 15 are used to store the RAM addresses of floating point data (the sketch below illustrates this address map).
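The address-based type separation can be expressed as two small helper functions; this is a sketch of the mapping in FIGS. 29A/29B, with function names chosen for illustration.

```python
# Address map of FIGS. 29A/29B: fixed point data lives in RAM 0000H-1FFFH and
# registers 0-7 (which hold fixed point RAM addresses); floating point data
# lives in RAM 2000H-3FFFH and registers 8-15.
def type_of_ram_address(addr):
    if 0x0000 <= addr <= 0x1FFF:
        return "fixed"
    if 0x2000 <= addr <= 0x3FFF:
        return "float"
    raise ValueError("address outside the 0000H-3FFFH RAM")

def type_of_register(reg_no):
    if 0 <= reg_no <= 7:
        return "fixed"
    if 8 <= reg_no <= 15:
        return "float"
    raise ValueError("the register file has 16 registers")

assert type_of_ram_address(0x1ABC) == "fixed"
assert type_of_register(9) == "float"
```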
  • FIG. 30A is a diagram showing an example of a composite scalar instruction provided by an embodiment of the present disclosure.
•   Each instruction has an opcode field, an operand address field (or an immediate), and a target address field.
•   The opcode field contains the opcode.
•   The operand address field includes source operand address 1 and source operand address 2, indicating the storage address of each source operand; the target address field is the storage address of the result of operating on the operands.
  • the opcode field is used to distinguish between different types of operations, such as addition, subtraction, multiplication, and division, but is not used to distinguish the type of operand.
  • the operand address field may contain a RAM address, a register number, and an immediate value.
  • the RAM address and register number used to store floating-point data and fixed-point data are different, so the address field can be used to distinguish between floating-point operands and fixed-point operands.
•   When the operand address field stores an immediate value, a data type flag bit recognizable by the operator module is also needed to distinguish floating point operands from fixed point operands.
  • the target address field can be either a RAM address or a register number.
  • the address field should correspond to the operand type, that is, the operation result of the floating-point operand is stored in the storage unit corresponding to the floating-point data; and the operation result of the fixed-point operand is stored in the storage unit corresponding to the fixed-point data.
•   The composite scalar instruction provided by the present disclosure unifies the floating point instruction and the fixed point instruction to a large extent: the type of instruction is not distinguished in the decoding stage, and during the specific calculation, whether an operand is floating point data or fixed point data is determined according to the address read from the operand address field, which simplifies the decoding logic of the instruction and makes the instruction set more compact.
•   Taking the addition instruction as an example, assume that the operation code of the addition instruction is 0001; the composition of the composite scalar instruction under each addressing mode is then as shown in FIGS. 30B to 30E.
  • FIG. 30B is a diagram showing an example of a composite scalar instruction using register addressing according to an embodiment of the present disclosure.
•   As shown in FIG. 30B, when register addressing is used, the addressing mode flag is 01, and source operand 1 and source operand 2 are stored in the registers designated by the source operand 1 register number and the source operand 2 register number, respectively; the registers numbered 0 to 7 store fixed point data, and the registers numbered 8 to 15 store floating point data;
  • FIG. 30C is a diagram showing an example of a composite scalar instruction using register indirect addressing according to an embodiment of the present disclosure.
•   As shown in FIG. 30C, when register indirect addressing is used, the addressing mode flag is 10, and the RAM addresses of source operand 1 and source operand 2 are held in the registers designated by the source operand 1 register number and the source operand 2 register number, respectively; the RAM addresses of fixed point data (0000H to 1FFFH) are stored in registers 0 to 7, and the RAM addresses of floating point data (2000H to 3FFFH) are stored in registers 8 to 15.
  • the target address field stores the destination register number or the target RAM address.
  • the fixed point data is stored in a RAM unit having an address in the range of 0000H to 1FFFH; the floating point data is stored in a RAM unit having an address in the range of 2000H to 3FFFH.
•   FIG. 30D is a diagram showing an example of a composite scalar instruction using immediate addressing according to an embodiment of the present disclosure. As shown in FIG. 30D, when the data of the operand address field are two immediates, the addressing mode flag is 00, and a data type flag bit is set between the addressing mode flag bit and the operand address field: when the immediate is fixed point data, the data type flag bit is 0; when the immediate is floating point data, the data type flag bit is 1.
  • FIG. 30E is a diagram showing an example of a composite scalar instruction using RAM addressing according to an embodiment of the present disclosure.
  • the operand address field is a RAM address
  • the addressing mode flag is 11.
  • the source operand 1 and the source operand 2 are respectively stored in the RAM unit corresponding to the RAM address.
  • the fixed point data exists in the RAM unit corresponding to the RAM address 0000H to 1FFFH; the floating point data exists in the RAM unit corresponding to the RAM address 2000H to 3FFFH.
  • the target address field stores the target register number or the target RAM address.
•   The fixed point data is stored in registers 0 to 7 or in RAM units with addresses in the range 0000H to 1FFFH; floating point data is stored in registers 8 to 15 or in RAM units with addresses in the range 2000H to 3FFFH (the decode sketch below summarizes the four modes).
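The following Python sketch summarizes how an operand's type can be recovered from the addressing mode flag and the operand address field, per FIGS. 30B to 30E; the field layout is simplified and the function name is illustrative.

```python
def operand_type(mode, field, dtype_flag=None):
    # Determine an operand's data type under the four addressing modes.
    if mode == 0b00:               # immediate addressing: use the type flag bit
        return "float" if dtype_flag == 1 else "fixed"
    if mode in (0b01, 0b10):       # register / register indirect addressing:
        return "fixed" if field <= 7 else "float"   # registers 0-7 vs 8-15
    if mode == 0b11:               # RAM addressing: the address range decides
        return "fixed" if field <= 0x1FFF else "float"
    raise ValueError("unknown addressing mode flag")

assert operand_type(0b01, 3) == "fixed"        # register 3: fixed point data
assert operand_type(0b10, 9) == "float"        # register 9 holds a float address
assert operand_type(0b11, 0x2A00) == "float"   # 2A00H lies in 2000H-3FFFH
assert operand_type(0b00, 0b0101, dtype_flag=0) == "fixed"
```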
•   FIG. 31 is a flowchart of a method for supporting composite scalar instructions according to an embodiment of the present disclosure. As shown in FIG. 31, an embodiment of the present disclosure provides an operation method supporting composite scalar instructions, performing data operations with the composite scalar instruction device described above; specifically, it includes the following steps:
  • S3101 Store different types of data in different addresses.
  • the memory module stores floating point data and fixed point data in a space corresponding to different addresses, such as different RAM addresses or different register numbers.
  • S3102 Decode the composite scalar instruction into a control signal.
  • the controller module sends an input/output (IO) instruction to the storage module, reads the composite scalar instruction from the storage module, and stores it in the local instruction queue.
  • the controller module reads the composite scalar instruction from the local instruction queue and decodes it into a control signal.
  • S3103 Read operation data according to the control signal, and judge the type of the operation data according to the address of the read operation data, and perform operation on the operation data.
•   The operator module determines whether floating point or fixed point data is read from the operand address field: if the operand is an immediate, its type is determined according to the data type flag bit and the calculation is performed; if the operand comes from the RAM or a register, its type is determined according to the RAM address or the register number, and the operand is read from the storage module and the corresponding operation is performed.
  • S3104 Store the operation result in the address of the corresponding type.
  • the controller module sends an IO command to the operator module, and the operator module transmits the operation result to the storage module or the input/output module.
•   In summary, the provided device supporting composite scalar instructions offers an efficient execution environment for such instructions, and the provided execution method can execute composite scalar instructions accurately and efficiently.
•   The present disclosure also provides a counting device and a counting method supporting counting instructions, which can improve computational efficiency by casting, in instruction form, the counting of the number of elements in the input data (the data to be counted) that satisfy a given condition. This will be specifically described below in conjunction with specific embodiments.
  • FIG. 32 is a schematic structural diagram of a frame of a counting device according to an embodiment of the present disclosure.
•   The counting device supporting counting instructions of the present disclosure includes: a storage unit, a counting unit, and a register unit.
•   The storage unit is connected to the counting unit and is configured to store the input data to be counted and the number of elements in the input data that satisfy the given condition (the counting result). The storage unit may be a main memory or a temporary memory and, further, may be a high-speed scratchpad memory.
•   When the storage unit is a scratchpad memory that can support input data of different bit widths and/or input data occupying storage spaces of different sizes, the input data to be counted is temporarily stored in the scratchpad memory, so that the counting process can flexibly and effectively support data of different widths.
•   The counting unit is connected to the register unit and is configured to acquire the counting instruction, read the address of the input data from the register unit according to the counting instruction, obtain the corresponding input data to be counted from the storage unit according to that address, statistically count the number of elements in the input data satisfying the given condition, and store the final counting result in the storage unit.
  • the register unit is used to store an address of the input data to be counted stored in the storage unit. In one embodiment, the address stored by the register unit is the address of the input data to be counted on the scratchpad memory.
  • the data type of the input data to be counted may be a 0/1 vector, or may be a numeric vector or matrix.
  • FIG. 33 is a schematic structural diagram of a counting unit in a counting device according to an embodiment of the present disclosure. As shown in FIG. 33, the counting unit includes an input and output module, an arithmetic module, and an accumulator module.
•   The input/output module is connected with the operation module; each time, it takes from the input data to be counted in the storage unit a piece of data of a set length (configurable according to actual requirements) and inputs it to the operation module for calculation; after the operation module finishes, it continues with the next fixed-length piece until all elements of the input data to be counted have been taken. The input/output module outputs the counting result calculated by the accumulator module to the storage unit.
•   The operation module is connected to the accumulator module; it takes in a fixed length of data, adds up, via its adder, the number of elements of the input data satisfying the given condition, and outputs the result to the accumulator module.
•   The operation module further includes a judgment sub-module for judging whether each input element satisfies the given condition (the given condition may be equality with a given element, or that the value lies within a set interval); if satisfied it outputs 1, otherwise 0, and the outputs are then sent to the adder for accumulation.
•   The structure of the adder may include n layers, where the first layer has l full adders and each subsequent layer has a correspondingly smaller number of full adders (the per-layer counts, given in the original with rounding-up notation ⌈x⌉, follow from each full adder taking 3 inputs and producing 2 outputs); here l, m, n are integers greater than 1, with m denoting a layer index greater than 1 and less than n. Full adders within the same layer can execute in parallel, and the final output of the tree is the number of 1s in that segment of the 0/1 vector.
•   FIG. 34 is a schematic diagram of a specific full adder tree, in which the adder structure has 7 layers (i.e., n is 7), the first layer has 6 full adders, and the fixed-length 0/1 vector has a length of 18 (i.e., l is 6); the full adders of each layer can operate in parallel. When the input data is (0,1,0), (1,0,0), (1,1,0), (0,1,0), (1,0,0), (1,1,0), the full adder tree of the embodiment of the present disclosure yields the result (001000), i.e., 8, which is exactly the number of 1s in the input.
  • the above adder can increase the parallelism of the addition calculation and effectively improve the operation speed of the arithmetic module.
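The layered full-adder reduction can be modeled as a carry-save tree: each full adder takes three equal-weight words and produces a sum word and a carry word, and a whole layer operates in parallel. The Python sketch below is a functional model only (the layer sizes emerge from the loop rather than being fixed in hardware) and reproduces the 18-bit example above, whose count of 1s is 8:

```python
def popcount_adder_tree(bits):
    nums = list(bits)                      # layer 0: the 0/1 vector itself
    while len(nums) > 2:                   # one full-adder layer per iteration
        nxt = []
        for i in range(0, len(nums) - 2, 3):      # one full adder per triple
            a, b, c = nums[i], nums[i + 1], nums[i + 2]
            nxt.append(a ^ b ^ c)                           # sum word
            nxt.append(((a & b) | (b & c) | (a & c)) << 1)  # carry word
        nxt.extend(nums[len(nums) - len(nums) % 3:])  # leftovers pass through
        nums = nxt
    return sum(nums)                       # final small addition

v = [0,1,0, 1,0,0, 1,1,0, 0,1,0, 1,0,0, 1,1,0]    # the example input
assert popcount_adder_tree(v) == 8                # result (001000) = 8
```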
  • the accumulator module is connected to the input and output module, and the result of the operation module is accumulated by using the accumulator until there is no new input.
•   The counting unit has a multi-stage pipeline structure, in which the vector reading in the input/output module is at the first pipeline stage, the operation module at the second, and the accumulator module at the third; since these units are at different pipeline stages, the operations required by the counting instruction can be implemented more efficiently.
  • FIG. 35 is a schematic diagram of a format of an instruction set of a counting instruction in a counting device according to an embodiment of the present disclosure.
•   The counting instruction includes an operation code and one or more operation fields; the operation code indicates that the instruction is a counting instruction, and the counting unit performs a counting operation upon identifying it. The operation field may include the address information of the input data to be counted, and may further include the address information of the judgment condition.
•   The address information may be an immediate or a register number; for example, to obtain a vector, the vector start address and the vector length are obtained from the corresponding register according to the register number, and the vector stored at the corresponding address is then fetched from the storage unit according to the start address and length.
  • the instructions adopted by the embodiments of the present disclosure have a compact format, so that the instruction set is convenient to use and the supported data length is flexible.
  • FIG. 36 is a flowchart of an execution process of a counting unit in a counting device according to an embodiment of the present disclosure.
  • the counting unit acquires the address of the input data to be counted in the register unit according to the address information in the operation field of the counting instruction, and then acquires the input data to be counted in the storage unit according to the address.
•   The input data to be counted is stored in the scratchpad memory; each time, the counting unit acquires a fixed length of input data from the scratchpad memory, the judgment sub-module determines whether each element satisfies the given condition, and the adder counts the number of elements in this piece of input data that satisfy the condition; the numbers of qualifying elements of all pieces are accumulated by the accumulator module to obtain the final counting result, which is stored in the storage unit (see the sketch below).
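A compact functional model of this segment-by-segment execution; the storage layout, register-file format, segment length, and default condition are assumptions made for illustration:

```python
def count_instruction(storage, reg_file, reg_no, length, segment=8,
                      condition=lambda x: x == 1):
    # Register unit supplies the address; the input/output module feeds
    # fixed-length pieces to the operation module, whose judgment sub-module
    # maps each element to 0/1; the accumulator module sums the piece counts.
    addr = reg_file[reg_no]
    data = storage[addr:addr + length]          # input data to be counted
    total = 0                                   # accumulator module
    for start in range(0, length, segment):     # one fixed-length piece each time
        piece = data[start:start + segment]
        total += sum(1 for x in piece if condition(x))   # judge + add
    return total

storage = [0] * 64
storage[16:34] = [0,1,0, 1,0,0, 1,1,0, 0,1,0, 1,0,0, 1,1,0]
reg_file = {0: 16}                              # register 0 holds the address
assert count_instruction(storage, reg_file, 0, 18) == 8
```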
  • FIG. 37 is a detailed structural diagram of a counting device according to an embodiment of the present disclosure.
  • the apparatus of the present disclosure that supports counting instructions may further include: an instruction memory, an instruction processing unit, an instruction cache unit, and a dependency processing unit.
•   The instruction processing unit is configured to acquire a counting instruction from the instruction memory and, after processing the counting instruction, provide it to the instruction cache unit and the dependency processing unit.
  • the instruction processing unit includes: an instruction fetch module and a decoding module.
  • the fetching module is connected to the instruction memory for acquiring the counting instruction from the instruction memory;
  • the decoding module is connected with the fetching module for decoding the obtained counting instruction.
•   The instruction processing unit may further include an instruction queue memory connected to the decoding module for sequentially storing the decoded counting instructions and sequentially sending them to the instruction cache unit and the dependency processing unit. Given that the numbers of instructions the instruction cache unit and the dependency processing unit can accommodate are limited, the instructions in the instruction queue memory must wait until the instruction cache unit and the dependency processing unit have free space before sequential transmission continues.
  • the instruction cache unit is connectable to the instruction processing unit for sequentially storing the counting instructions to be executed.
•   A counting instruction is also cached in the instruction cache unit during its execution; when an instruction finishes executing, its running result (the counting result) is transferred back to the instruction cache unit, and if that instruction is also the earliest uncommitted instruction in the cache, the instruction is committed and its counting result is written back to the scratchpad memory together with it.
  • the instruction cache unit may be a reordering cache.
•   The dependency processing unit may be connected to the instruction queue memory and the counting unit, and is used to determine, before the counting unit acquires the counting instruction, whether the vector required by the counting instruction (i.e., the vector to be counted) is up to date; if so, the counting instruction is provided directly to the counting unit; otherwise the counting instruction is stored in a storage queue of the dependency processing unit and is provided to the counting unit only after the required vector has been updated. Specifically, when the counting instruction accesses the scratchpad memory while a preceding instruction has not yet written its result into that storage space, the counting instruction must wait.
  • the dependency processing unit enables instructions to be executed out of order, sequentially, effectively reducing pipeline blocking, and enabling precise exceptions.
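As a toy illustration of this dependency check, the following sketch shows the read-after-write test in software form; the field names and queue discipline are assumptions for illustration, not the disclosed circuit:

```python
# An instruction may issue only when no earlier, unfinished instruction
# writes an address that this instruction reads (a RAW hazard check).
def can_issue(inst, in_flight):
    return all(w not in inst["reads"]
               for prev in in_flight for w in prev["writes"])

in_flight = [{"writes": {"v0"}}]                  # earlier write still pending
count_inst = {"reads": {"v0"}, "writes": {"r1"}}  # counting instruction reads v0
assert not can_issue(count_inst, in_flight)       # must wait in the storage queue
in_flight.clear()                                 # previous result written back
assert can_issue(count_inst, in_flight)           # dependency cleared, may issue
```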
  • the fetch module is responsible for fetching the next instruction to be executed from the instruction memory and transmitting the instruction to the decoding module;
  • the decoding module is responsible for decoding the instruction and transmitting the decoded instruction to the instruction queue memory;
  • the instruction queue memory is used to buffer decoded instructions and, once the instruction cache unit and the dependency processing unit have free slots, to send the instructions to them;
  • the counting instruction is sent from the instruction queue memory to the dependency processing unit.
  • the counting instruction reads the address of the input data in the storage unit from the register unit;
  • the dependency processing unit handles possible data dependencies between the current instruction and earlier ones: a counting instruction accesses the storage unit, and previously issued instructions that have not yet finished may access the same block of storage. To guarantee correct results, the current instruction must wait in the storage queue of the dependency processing unit until the dependency is cleared.
  • the counting unit acquires the counting instruction from the dependency processing unit, reads the address of the input data from the register unit according to the instruction, fetches the corresponding input data to be counted from the storage unit, statistically counts the number of elements in the input data that satisfy the given condition, and transmits the counting result to the instruction cache unit; finally, the counting result and the counting instruction are written back to the storage unit.
  • FIG. 38 is a flowchart of an execution process of a counting device according to an embodiment of the present disclosure. As shown in FIG. 38, the process of executing the counting instruction includes:
  • the fetching module fetches the counting instruction from the instruction memory, and sends the counting instruction to the decoding module.
  • the decoding module decodes the counting instruction and sends the counting instruction to the instruction queue memory.
  • the counting instruction waits in the instruction queue memory until the instruction cache unit and the dependency processing unit have free slots, and is then sent to both.
  • while the counting instruction is sent from the instruction queue memory to the dependency processing unit, it reads the storage address of the input data in the storage unit from the register unit, and the dependency processing unit analyzes whether the instruction has a data dependency on previously issued instructions that have not yet finished executing. If it does, the counting instruction must wait in the storage queue of the dependency processing unit until it no longer has a data dependency on any unfinished earlier instruction.
  • once no dependency remains, the counting instruction is sent to the counting unit.
  • the counting unit fetches the input data from the storage unit according to the storage address and counts the number of elements in the input data that satisfy the given condition. After counting completes, the counting result is written back to the storage unit through the instruction cache unit, and the instruction is committed.
  • a chip is also disclosed that includes the neural network processor, processing device, counting device, or computing device described above.
  • a chip package structure is also disclosed that includes the above described chip.
  • a board is also disclosed that includes the chip package structure described above.
  • an electronic device is also disclosed that includes the above-described board.
  • Electronic devices may include, but are not limited to, robots, computers, printers, scanners, tablets, smart terminals, cell phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicles may include airplanes, ships, and/or cars;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
  • the related apparatus and method disclosed may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into parts or modules is only a logical function division; in actual implementation there may be other division manners, for example, multiple parts or modules may be combined or integrated into one system, or some features may be ignored or not executed.
  • the term “and/or” may have been used.
  • the term “and/or” means one or the other or both (eg, A and/or B means both A or B or both A and B).
  • Each functional part/unit/subunit/module/submodule/component in the present disclosure may be hardware, such as the hardware may be a circuit, including digital circuits, analog circuits, and the like.
  • Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like.
  • the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.


Abstract

一种计算位宽动态可配置的处理装置,包括:存储器,用于存储数据,所述数据包括待运算数据、中间运算结果、最终运算结果和待缓存数据;数据宽度调整电路,用于调整所述待运算数据、中间运算结果、最终运算结果和/或待缓存数据的宽度;运算电路,用于对待运算数据进行运算,包括采用加法器电路和乘法器对不同计算位宽的待运算数据进行计算;以及控制电路,用于控制存储器、数据宽度调整电路和运算电路。所述的装置具有灵活性强、可配置程度高、运算速度快、功耗低等优点。

Description

处理装置和处理方法 技术领域
本公开涉及计算机领域,进一步涉及人工智能领域的处理装置和处理方法。
背景技术
随着大数据时代的来临,神经网络算法成为了近些年人工智能领域的一个研究热点,在模式识别、图像分析、智能机器人等方面都得到了广泛的应用。
深度学习是机器学习中一种基于对数据进行表征学习的方法。观测值(例如一幅图像)可以使用多种方式来表示,如每个像素强度值的向量,或者更抽象地表示成一系列边、特定形状的区域等。而使用某些特定的表示方法更容易从实例中学习人物(例如,人脸识别或面部表情识别)。
至今已有数种深度学习架构,如深度神经网络、卷积神经网络和深度信念网络和递归神经网络已被应用于计算机视觉、语音识别、自然语言处理、音频识别与生物信息学等领域,并获取了极好的效果。另外,深度学习已成为类似术语,或者说是神经网络的品牌重塑。
随着深度学习（神经网络）的大热，神经网络加速器也应运而生，通过专门的内存和运算模块设计，神经网络加速器在进行深度学习运算时可以获得相比较通用处理器几十倍甚至上百倍的加速比，并且面积更小，功耗更低。
发明内容
本公开提供一种计算位宽动态可配置的处理装置,包括:
存储器,用于存储数据,所述数据包括神经网络的待运算数据、中间运算结果、最终运算结果和待缓存数据;
数据宽度调整电路,用于调整所述待运算数据、中间运算结果、最终运算结果和/或待缓存数据的宽度;
运算电路,用于对神经网络的待运算数据进行运算;以及
控制电路,用于控制存储器、数据宽度调整电路和运算电路。
本公开还提供一种计算位宽动态可配置的处理装置的使用方法,包括步骤:
控制电路生成控制指令,传送给存储器、数据宽度调整电路和运算电路;
存储器根据接收的控制指令,向运算电路输入神经网络的待运算数据;
数据宽度调整电路根据接收的控制指令,调整神经网络的待运算数据的宽度;
运算电路根据接收的控制指令,选择第一运算模块中的对应类型的乘法器和加法器电路;
运算电路根据输入的待运算数据和神经网络参数以及控制指令,对不同计算位宽的神经网络的待运算数据进行运算。
本公开还提供一种处理装置,包括:存储器,用于存储数据,所述数据包括神经网络的待运算数据;运算电路,用于对神经网络的待运算数据进行运算,包括采用加法器电路和乘法器对不同计算位宽的神经网络的待运算数据进行计算;以及控制电路,用于控制存储器和运算电路,包括根据待运算数据确定运算电路的乘法器和加法器电路的类型以进行运算。
本公开还提供一种使用上述处理装置的方法,包括步骤:控制电路生成控制指令,传送给存储器和运算电路;存储器根据接收的控制指令,向运算电路输入神经网络的待运算数据;运算电路根据接收的控制指令,选择第一运算模块中的对应类型的乘法器和加法器电路;运算电路根据输入的待运算数据和神经网络参数以及控制指令,对不同计算位宽的神经网络的待运算数据进行运算,运算结果送回存储器。
本公开还提供一种运算装置,包括:输入模块,用于获取输入数据, 该输入数据包括待处理数据、网络结构和权值数据,或者该输入数据包括待处理数据和/或离线模型数据;模型生成模块,用于根据输入的网络结构和权值数据构建离线模型;神经网络运算模块,用于基于离线模型生成运算指令并缓存,以及基于运算指令对待处理数据进行运算得到运算结果;输出模块,用于输出所述运算结果;控制模块,用于检测输入数据类型并控制输入模块、模型生成模块和神经网络运算模块进行运算。
本公开还提出了一种应用上述运算装置的运算方法,包括以下步骤:
获取输入数据;
获取离线模型,或根据输入数据确定离线模型,依据离线模型确定运算指令,以供后续计算调用;
调用所述运算指令,对待处理数据进行运算得到运算结果以供输出。
本公开还提供一种支持复合标量指令的装置,包括控制器模块、存储模块和运算器模块,其中:所述存储模块,用于存储复合标量指令和数据,所述数据有一种以上的类型,不同类型的数据存储于存储模块中不同的地址内;所述控制器模块,用于从存储模块读取复合标量指令并译码为控制信号;所述运算器模块,用于接收控制信号,从所述存储模块读取数据,根据读取数据的地址判断数据类型,并对数据进行运算。
本公开还提供一种处理器,用于执行复合标量指令,其中该复合标量指令包括操作码域、操作数地址域和目的地址域;所述操作码域中存储的操作码用于区分不同类型的操作,所述操作数地址域用于区分操作数的类型,所述目的地址域为运算结果存储的地址。
本公开还提供一种复合标量指令的执行方法,包括以下步骤:将不同类型的数据存储于不同的地址内;将复合标量指令译码为控制信号;根据控制信号读取操作数据,根据读取操作数据的地址判断操作数据的类型,对操作数据进行运算;将运算结果存储于对应类型的地址内。
本公开还提供一种计数装置,包括:寄存器单元、计数单元和存储单元,其中,寄存器单元,用于存储待计数的输入数据在存储单元中存储的地址;计数单元,与寄存器单元连接,用于获取计数指令,根据计数指令在寄存器单元中读取输入数据的存储地址,在存储单元中获取相 应的待计数的输入数据,并对输入数据中满足给定条件的元素个数进行统计计数,得到计数结果;存储单元,与计数单元连接,用于存储待计数的输入数据以及用于存储所述的计数结果。
本公开还提供一种上述计数装置的计数方法,包括以下步骤:计数单元获取计数指令,根据计数指令在寄存器单元中读取的输入数据的存储地址,在存储单元中获取相应的待计数的输入数据,并对输入数据中满足给定条件的元素个数进行统计计数,得到计数结果;将统计的计数结果传输至存储单元中。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他附图。
图1是本公开一实施例提供的计算位宽动态可配置的处理装置的结构示意图。
图2是本公开另一实施例提供的计算位宽动态可配置的处理装置结构示意图。
图3是本公开再一实施例提供的计算位宽动态可配置的处理装置的结构示意图。
图4是本公开又一实施例提供的计算位宽动态可配置的处理装置另一个实施例的结构示意图。
图5是本公开的一个实施例用于本装置的位串行加法树装置示意图。
图6是本公开计算位宽动态可配置的处理装置中位串行运算器方框示意图。
图7是本公开提供的一个实施例的第一基础乘法器装置的结构示意图。
图8是本公开提供的一个实施例的第二基础乘法器装置的结构示意图。
图9是本公开提供的一个实施例的稀疏乘法器装置的结构示意图。
图10是本公开提供的一个实施例的基础乘法器或稀疏乘法器进行 向量乘法的装置的结构示意图。
图11是本公开提供的一个实施例的融合向量乘法器进行向量乘法的装置的结构示意图。
图12是本公开提供的融合向量乘法器装置和其他乘法器装置具体实施流程的结构示意图。
图13是本公开一个实施例的第二基础乘法器和位串行加法树进行组合示意图。
图14是本公开一实施例提供的计算位宽动态可配置的处理方法流程图。
图15是本公开另一实施例提供的计算位宽动态可配置的处理装置的结构示意图。
图16是本公开另一实施例提供的计算位宽动态可配置的处理装置结构示意图。
图17是本公开再一实施例提供的计算位宽动态可配置的处理装置结构示意图。
图18是本公开又一实施例提供的计算位宽动态可配置的处理装置的另一个实施例的结构示意图。
图19是本公开提供的一个实施例的基础乘法器装置的结构示意图。
图20是本公开提供的一个实施例的稀疏乘法器装置的结构示意图。
图21是本公开提供的一个实施例的基础乘法器或稀疏乘法器进行向量乘法的装置的结构示意图。
图22是本公开提供的一个实施例的融合向量乘法器进行向量乘法的装置的结构示意图。
图23是本公开提供的融合向量乘法器装置和其他乘法器装置具体实施流程的结构示意图。
图24是本公开一实施例提供的计算位宽动态可配置的处理方法流程图。
图25是典型的编程框架图。
图26是本公开一实施例提出的运算方法的运算流程图。
图27是本公开另一实施例提出的运算装置的结构框架图。
图28是本公开实施例提供的运算装置的结构示意图;
图29A是本公开实施例提供的一种存储模块RAM组织形式示例图;
图29B是本公开实施例提供的一种存储模块寄存器堆组织形式示例图;
图30A是本公开实施例提供的复合标量指令示例图;
图30B是本公开实施例提供的采用寄存器寻址时复合标量指令示例图;
图30C是本公开实施例提供的采用寄存器间接寻址时复合标量指令示例图;
图30D是本公开实施例提供的采用立即数寻址时复合标量指令示例图;
图30E是本公开实施例提供的采用RAM寻址时复合标量指令示例图;
图31是本公开实施例提供的支持复合标量指令的运算方法流程图。
图32为本公开实施例计数装置的框架结构示意图。
图33为本公开实施例计数装置中计数单元的结构示意图。
图34为图33计数单元中的加法器结构示意图。
图35为本公开实施例计数装置中计数指令的指令集格式示意图。
图36为本公开实施例计数装置中计数单元的执行过程流程图。
图37为本公开实施例计数装置的结构示意图。
图38为本公开实施例计数装置的执行过程流程图。
具体实施方式
下面结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。基于本公开的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开的保护范围。
在本公开中所述的“存储器”可以集成在计算位宽动态可配置的处理装置的内部，也可以是一个单独的器件，作为外部存储器与计算位宽动态可配置的处理装置进行数据传输。
图1是本公开实施例提供的计算位宽动态可配置的处理装置的结构示意图。如图1所示,本装置包括控制电路、数据宽度调整电路、运算电路和存储器。
控制电路用于向数据宽度调整电路、运算电路和存储器发出控制信号,来控制三者的运行,协调三者间的数据传输。存储器用于存储相关数据,可包括输入数据(包括待运算数据和控制指令)、中间运算结果、最终运算结果、神经元、突触、待缓存数据等等,可以根据需求不同,对具体的存储数据内容、存储组织方式和存取调用方式进行不同的规划。可以如图1所示,所述数据宽度调整电路,用于调整数据的宽度,该过程可发生在从存储器读取数据后经过数据宽度调整电路对数据进行位宽调整后传递给运算电路、运算电路将计算结果经过数据宽度调整电路对数据进行位宽调整后传递给存储器、存储器将数据经过数据宽度调整电路对数据进行位宽调整后传递回存储器等。其具体操作通过控制电路的控制信号进行控制。其具体操作包括在不损失精度的情况下,对数据位宽进行增加或减少或保持不变;在可接受程度的精度损失的情况下,对数据位宽进行增加或减少或保持不变;根据某种指定的变换或运算要求(如指定“按位与”运算),对数据位宽进行增加或减少或保持不变等。运算电路可包括至少一个加法运算器和至少一个乘法运算器,用于数据的运算。至少一个加法运算器包括加法器、加法树、和/或串行加法树;至少一个乘法运算器包括基础乘法器、稀疏乘法器和/或融合向量乘法器。运算电路还可以包括比较器和/或ALU等,其中,乘法运算器和加法运算器能够满足不同计算位宽的数据进行运算,根据不同的需求,可以进行不同位宽的运算数据之间的运算。其中乘法器可以为串行运算器,通过位串行方式实现乘法运算。需要说明的是,运算电路也可以不 经过数据位宽调整电路,直接与存储器进行数据传输。
图2是本公开一个实施例的计算位宽动态可配置的处理装置的结构示意图。如图2所示，本装置的结构为，控制电路连接存储器的每个模块或子模块和运算电路，包括至少一个控制信号暂存器和至少一个控制处理器，控制信号暂存器用于存储控制信号，可选的，该控制信号暂存器是先进先出的。控制处理器用于取出待执行的控制信号，对控制逻辑进行分析后，对存储器、数据宽度调整电路和运算电路进行控制和协调。存储器包括输入存储模块、输出存储模块、突触存储模块，其中输出存储模块可以用于存储中间运算结果和最终运算结果。数据宽度调整电路可以分为输入数据处理模块和输出数据处理模块，输入数据处理模块用于将输入存储模块和/或突触存储模块中的数据进行数据宽度的调整，其可以设置于输入存储模块后端；输出数据处理模块用于将运算电路运算后的数据进行宽度调整后存入输出存储模块；运算电路主要用于加速卷积层、全连接层的卷积运算和池化（pooling）层的取平均值或最大值的运算。可选的，运算电路可以包括乘法器模块、加法树模块和非线性运算模块（如，完成sigmoid函数运算的模块）。该乘法器模块、加法树模块和非线性运算模块可以采用流水线的方式并行执行。本装置能够加速卷积神经网络的运算过程，减少片内片外的数据交换，节约存储空间。
图3是本公开另一个实施例的处理装置的结构示意图。如图3所示，本装置的结构为，控制电路连接存储器的每个模块和运算电路，包括控制信号暂存器和控制处理器，用于存储控制信号，控制处理器用于取出待执行的控制信号，对控制逻辑进行分析后，对存储器和运算电路进行控制和协调。可选的，控制信号暂存器为先进先出的。存储器包括输入存储模块、输出存储模块和突触存储模块。本装置中，突触存储模块包括多个突触存储子模块，运算电路包括多个运算模块，将突触存储子模块与运算模块分别对应连接，可以将一个突触存储子模块与一个运算模块对应连接，也可以将多个突触存储子模块与一个运算模块对应连接。数据宽度调整电路可以分为输入数据处理模块和输出数据处理模块，输入数据处理模块用于将输入存储模块和/或突触存储模块中的数据进行数据宽度的调整，其可以设置于输入存储模块后端；输出数据处理模块用于将运算电路运算后的数据进行宽度调整后存储至输出存储模块；每一次运算的时候，输入存储模块经输入数据处理模块后向所有的运算模块传递输入数据，突触存储模块向对应的运算模块传递突触数据，运算模块进行运算后，经输出数据处理模块将结果写到输出存储模块中。这样，在参数多的大规模运算当中，明显提高了运算效率。本装置能够有效加速卷积神经网络的运算过程，尤其适用于网络规模比较大，参数比较多的情况。
图4是本公开再一个实施例的处理装置的结构示意图。如图4所示,本装置的结构为,控制电路连接存储器的每个模块和运算电路和数据宽度调整电路,包括一个指令队列和一个解码器,每一次执行新的指令时,从指令队列中取出一条新的指令,送入解码器,通过解码器进行解码,将控制信息送入存储器的每个模块、运算电路和数据宽度调整电路。存储器包括输入存储模块、输出存储模块、突触存储模块和缓存模块,其中输出存储模块可以用于存储中间运算结果和最终运算结果。其中,每一次输入存储模块和突触存储模块向运算电路传递数据,都是先将数据传入缓存模块中。而后,将缓存中的数据读取至数据宽度调整电路。如果控制指令要求对数据进行处理,则在数据宽度调整电路完成相应的处理,例如对数据进行位宽进行不损失精度的位数扩大、强制删除数据最低位来减少数据位宽等。经过数据宽度调整电路的处理,再送入相应的运算模块中。如果控制指令无需对数据进行处理,则数据可直接通过数据宽度调整电路传递到相应的运算模块中。同样的,当运算模块运算完毕,将结果也是先送入数据宽度调整电路,根据控制指令,在完成数据处理操作或者对数据不做操作后传入缓存模块中,再从缓存模块写入输出存储模块。运算电路包括多个运算模块,包括第一运算模块和第二运算模块。运算模块间可以并行执行相关运算,也可以相互传递数据,从而降低具有局部性的数据的重用距离,进一步提高运算速度。第一运算模块主要用于加速神经网络算法中相同或不同计算位宽的线性运算,包 括:矩阵间乘法、加法、乘法混合加法;矩阵和向量;矩阵和常数;向量间;向量与常数;常数与常数,还可以用于比较运算、选择最大/小值等,优选的运算包括点积、矩阵乘法和/或矩阵加法运算。第二运算模块用于完成上述第一运算模块中未完成的运算,包括非线性运算、除法运算、单独的加法运算或单独的乘法运算。这样的好处是能够根据控制指令,在计算过程中,对数据动态调整位宽,从而使得运算电路、存储器的硬件利用率能够得到进一步提升。
图5是本公开的一个实施例用于本装置的位串行加法树装置示意图,能够满足计算位宽动态可配置的要求。如图5所示,M个待运算的数据,最大位宽为N,其中M,N均为正整数。若不足N位的数据,采用合理的方式在不影响数据精度的情况下将其位数补至N位。可采用的方式包括最高/低位补0、最高/低位补符号位、移位、进行运算操作等。位串行加法树中的第一层到第x层中的加法器可以完成n(n≥1)位数字的加法运算,第x+1层中的加法器位可以完成不小于N位的数字的加法运算。首先,将寄存器、各加法器中的进位输出端Cin初始为0。取各待运算数据的最低n位,分别输入至第一层的加法器中的a,b端,每个加法器中完成a,b端传入的待运算数据的最低n位的加法运算,得到的结果值s传向高一层的加法器a或b端,得到的进位值Cout传回该层加法器的进位输入Cin处,待下一拍和传入的待运算的数据进行加法运算。上一层的加法器的操作类似,将传入的数据加法运算,而后结果再向高一层的传递,进位传回该层的加法器。直到达到第x层。第x层的加法器将运算结果经过移位,和寄存器中传来的原结果进行加法运算后保存回寄存器。而后,待运算数据选择次低的n位传入位串行加法树中完成相应的运算。此时每个加法器中的Cin为上一拍中该加法器的Cout端输出的进位结果。优选的,该操作在第一层加法器运算完毕后,即可输入第二批待运算的n位数据,通过并行运算,提高了运算器的使用率,进一步提升运算速度。当全部运算完成后,寄存器中的数据即为所得结果。在一些实施例中,加法器还可以在输入给该加法器的待运算的数据(a,b端)及进位输入(Cin端)全部为0的情况下,在该次运算过程中关闭,从 而达到节省功耗的目的。
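作为对上述位串行加法思想的示意，下面给出一段最简化的Python草图：它只示意“每拍取各操作数的低n位相加、进位保留到下一拍”这一点，并非所公开硬件的实现，函数名与参数均为示意性假定。

```python
# 按 n 位一组、自低位向高位逐拍累加多个数的软件模型：每一拍只处理
# 各操作数的 n 位，进位保留到下一拍，与位串行加法树“低 n 位先行、
# 进位回传 Cin”的思想一致。
def bit_serial_sum(values, n=2, width=8):
    mask = (1 << n) - 1
    result, carry = 0, 0
    for step in range(0, width, n):
        digits = [(v >> step) & mask for v in values]  # 各数当前拍的 n 位
        s = sum(digits) + carry                        # 本拍相加（含上一拍进位）
        result |= (s & mask) << step                   # 写回本拍的 n 位结果
        carry = s >> n                                 # 进位留待下一拍
    return result + (carry << width)                   # 最高层补上剩余进位

vals = [0b10111011, 0b00001011, 0b01100110]
assert bit_serial_sum(vals, n=2) == sum(vals)
```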
其中,本公开实施例中运用的位串行运算器,如基础乘法器等,如图6所示,包括运算部件、处理部件、存储部件。运算部件用于完成一位或多位数据的乘法和/或加法运算,其输入的待运算数据来自于存储部件的数据和/或经过处理部件处理后的数据,输出的运算结果直接传入存储部件进行保存,或传入处理部件进行处理。处理部件用于完成数据移位、根据某种给定规则扩大/减少数据位宽、根据某种给定规则对数据的某一位或多位进行修改等处理操作,其待处理数据来源于运算部件和/或存储部件,处理后的数据可传入运算部件和/或处理部件。存储部件用于存储数据,包括待运算数据、中间运算结果、最终运算结果等。这里的存储部件可以为片上缓存。其中,每个单元可以根据其不同功能,均可进一步细分为多个单元,如运算部件可以细分为乘法单元、加法单元等。位串行运算器的中乘法器的具体实施例可以包括图7的第一基础乘法器,图8的第二基础乘法器,图9的稀疏乘法器。
图7是本公开的位串行运算器的一具体实施例:第一基础乘法器装置示意图,能够满足计算位宽动态可配置的要求。该第一基础乘法器可以用于本公开的装置。如图7所示,M位的被乘数和N位的乘数,其中M,N均为正整数。其中,乘数和被乘数的位置可以在控制模块的控制下进行交换。将乘数的低n位(n为正整数,且1≤n≤N,可选的为1<n≤N,从而能够进一步提高运算的并行度,充分利用硬件资源,加快运算速度)输入至输入选择电路中,将乘数的低n位分别与被乘数做“与”运算,即如果乘数该位值为1,则输出被乘数本身,否则输出0。同时,将乘数送入第一移位寄存器中进行移位,将低n位移出,则下一次再输入至输入选择电路中的为新的低n位。输入选择电路选择后的结果向上输入到第二移位寄存器进行相应的移位,再送入加法树中进行加法运算。这里进行加法运算的是进行输入选择并进行移位后的数据和之前进行加法运算的结果。得到结果后作为中间运算结果存入结果寄存器。待下一次被乘数进行输入选择后进行移位时,结果寄存器取出中间运算结果送入加法树(器)中进行加法运算。当乘数全为0时,乘法运算结束。
为更清楚的表明该基础乘法器的运算流程,我们给出一个具体实施例,假定被乘数为10111011,即M=8,乘数为1011,即N=4。
当n=2时,即每次移动2位的时候,该运算过程如下:首先,取出乘数的最低2位的11,和被乘数一起送入输入选择电路,选择均为被乘数本身,送入第一移位寄存器,最低位对应的选择出的被乘数无需移位,即10111011,次低位对应的选择出的被乘数左移1位,即101110110,送入加法树中,由于之前没有数字相加,故送入结果寄存器的为10111011与101110110的和,即1000110001。而后,乘数右移2位后取其最低2位,即10,和被乘数一起送入输入选择电路中,得到0和10111011,而后通过第二移位寄存器,0左移了2位还是0,10111011左移3位为10111011000,和结果寄存器中的1000110001一起送入加法树中进行运算,得到100000001001,送入结果寄存器中。此时,乘数右移2位,全部为0,即运算结束,结果寄存器中即为最终运算结果,即100000001001。
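下面用一段Python草图复现上述“选择—移位—累加”的流程，仅为示意性软件模型（每拍处理乘数的n位），并非该第一基础乘法器的正式实现：

```python
# 第一基础乘法器的软件模型：每拍取乘数低 n 位，按位对被乘数做
# “与”选择并移位后累加，随后乘数右移 n 位，直到乘数全为 0。
def basic_multiply(multiplicand, multiplier, n=2):
    result, shift = 0, 0
    while multiplier != 0:
        low = multiplier & ((1 << n) - 1)              # 乘数的低 n 位
        for i in range(n):
            if (low >> i) & 1:                         # 输入选择电路
                result += multiplicand << (shift + i)  # 移位寄存器 + 加法树
        multiplier >>= n                               # 第一移位寄存器移出低 n 位
        shift += n
    return result

# 复现正文实施例：10111011（M=8）× 1011（N=4），n=2
assert basic_multiply(0b10111011, 0b1011, n=2) == 0b100000001001
```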
图8是本公开的又一实施例的用于本装置的第二基础乘法器装置示意图,能够满足计算位宽动态可配置的要求。如图8所示,M位的被乘数和N位的乘数,其中M,N均为正整数。其中,乘数和被乘数的位置可以在控制模块的控制下进行交换。将被乘数的第m位(m为正整数,且1≤m≤M)输入至输入选择电路中,将乘数的低n位(n为正整数,且1≤n≤N)输入至输入选择电路中,被乘数的低m位分别与乘数的低n位做乘法运算。并将乘数送入第一移位寄存器中进行移位,将低n位移出,则下一次再输入至输入选择电路中的为新的低n位。输入选择后的结果向上输入到第二移位寄存器进行相应的移位,再送入加法树中进行加法运算。这里进行加法运算的是进行输入选择并进行移位后的数据和之前进行加法运算的结果。得到结果后作为中间运算结果存入结果寄存器。待下一次被乘数进行输入选择后进行移位时,结果寄存器取出中间运算结果送入加法树(器)中进行加法运算。当乘数全为0时,将被乘数送入第三移位寄存器中进行移位,将低m位移除,乘数从备份寄存器中取出,重复上述步骤进行运算。直到被乘数、乘数均为0时,乘法运 算结束。
图9是本公开提供的一实施例用于本装置的稀疏乘法器装置示意图,能够满足要求的计算位宽动态可配置的要求。稀疏乘法器可以用于稀疏运算的情况,也就是说,当乘数或者被乘数的二进制表示中的1是稀疏的情况,那么将乘数或被乘数用稀疏的方式表示出1的位置,可以进一步提高了运算的有效性,加快运算速度。如图9所示,M位的被乘数和N位的乘数,其中M,N均为正整数,也就是说,这里的被乘数和乘数的位数可以相等,也可以不相等。这里,乘数用稀疏的表示方法,用绝对或相对位置的方式表示该乘数中1的位置。这里,本实施例提供的稀疏乘法器的运算模块都是可配置的,故当采用不同的表示方法进行运算时,运算器内部的装置可以根据需求进行配置。譬如,当结果寄存器进行加法运算时无需移位,那么可以此时将和结果寄存器相接的移位寄存器配置为不工作,此时乘数的移位信息也可以不传递到该移位寄存器中。本领域人员可以理解,相关具体细节均可以根据需要做相应的调整,来完成对被乘数的移位和对结果的加法运算等相关具体细节。
为更清楚的表明该稀疏乘法器的运算流程,我们给出一个具体实施例,假定被乘数为10111011,即M=8,乘数为00100010,即N=8。当采用绝对的表示方式来表示乘数,那么用绝对位置表示出乘数中1的位置,假定我们把数的最右侧一位称为第0位,第0位的左侧一位称为第1位,以此类推。那么,该乘数表示为(1,5)。同时,我们要求该实施例中的与结果寄存器相连的移位寄存器不工作,乘数的数据无需传递给该移位寄存器。那么首先取出乘数的第一个数,即1,表示在第1位处有一个1。将被乘数送入移位寄存器,然后移动1位后为101110110送入加法器。由于之前数字相加,故送入结果寄存器的结果为101110110。而后取出乘数的下一个1的位置,即5,和被乘数一起送入移位寄存器。在移位寄存器中,将被乘数右移5位,得到1011101100000,送入加法器。同时取出结果寄存器中的结果101110110,由于采用的这种绝对表示的方法无需进行移位,故可直接将该结果送入加法器进行加法运算,得到1100011010110。加法运算后的结果再次送入结果寄存器。 此时,乘数中的1都已经计算完毕,故运算结束。如果采用相对的方式表示乘数,并定义其表示方法为从最高位(最左边)的第一个不为0的数字开始,到最低位,每两个不为0的数字间相距的位数。对于00100010,在第一个不为0的数字和下一个不为0的数字之间相距4位,在第二个不为0的数字到最低位,相距1位,故表示为(4,1)。这里,我们要求该实施例中的与结果寄存器相连的和与被乘数相连的移位寄存器均需要工作。首先,取出乘数的第一个数字4,送入两个移位寄存器中,那么将被乘数右移4位,和结果寄存器中的数据右移4位后送入加法器中进行加法运算。此时结果寄存器的数据为0,故得到加法运算结果101110110000,送入结果寄存器保存。而后,取出乘数的第二个数字1,那么将该值送入移位寄存器中,得到101110110和1011101100000,送入加法器进行加法运算,得到结果1100011010110。该结果再次送入结果寄存器。此时,乘数中的1都已经计算完毕,故运算结束。这样,可以有效利用数据的稀疏性,只进行有效的运算,即非0数据之间的运算。从而减少了无效的运算,加快运算速度,提高了性能功耗比。
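上述稀疏乘法器（绝对位置表示法）的流程可用如下Python草图加以示意；代码仅为思想层面的软件模型，并非所公开硬件的实现：

```python
# 稀疏乘法器的软件模型：乘数以其中 1 的绝对位置列表表示，每个
# 位置只触发一次“被乘数移位 + 累加”，所有为 0 的位被直接跳过。
def sparse_multiply(multiplicand, one_positions):
    result = 0
    for pos in one_positions:           # 只遍历乘数中为 1 的位
        result += multiplicand << pos   # 移位寄存器 + 加法器
    return result

# 复现正文实施例：10111011 × 00100010，乘数中 1 的绝对位置为 (1, 5)
assert sparse_multiply(0b10111011, [1, 5]) == 0b1100011010110
```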
图10是本公开提供的一个实施例的融合向量乘法器进行向量乘法的装置的结构示意图。这里，我们假定计算向量 $\vec{A}$ 和 $\vec{B}$ 的内积值，将相应维度的数据送入乘法器中等待运算，如图11所示。这里，要求 $\vec{A}$ 和 $\vec{B}$ 的维度相同，均为(N+1)，但是每一维度的位宽不一定相同，同时假定每次取n位进行运算，其中n为大于1且不大于 $\vec{B}$
的一个维度的位宽的正整数。首先,取B 0的低n位和A 0均送入一个输入选择电路中,将B 0的低n位分别与A 0做与运算,得到的选择的结果送入后面的移位寄存器进行移位。取移位后,将结果送入加法树中。在此过程中,每个维度都和第一维度进行着相同的操作。而后通过加法树,对这些维度送入的数据进行加法运算,并将结果寄存器中的值送入加法树中,一同进行加法运算,得到加法运算后的结果再送入结果寄存器中。在运算的同时,每一维度的B i(i=0,1,……,N)值送入移位寄存器中右移n位后,重复上述操作,即取移位后的B i(i=0,1,……,N)值的最低n位和对应的A i(i=0,1,……,N)值一起送入输入选择电路中 进行选择,再送入移位寄存器中进行移位,而后送入加法树中进行加法运算。不断重复该过程直到每一维度的B i(i=0,1,……,N)值全为0,运算结束,此时结果寄存器中的数据即为所求的最终运算结果。利用该乘法器能够灵活的配置待运算数据的位宽,无需在每进行一组数据乘法时就需要重新对被乘数移位位数进行计数的过程。同时,当数据位数比较低或者向量位数比较高的时候,能够极大地利用数据低位宽、向量高维度的特性,可以采用流水线的方式并行执行该过程,降低运行所需时间,进一步加快运算速度,提高性能功耗比。
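融合向量乘法器“每拍对所有维度同时处理乘数的n位并横向累加”的思想可用如下Python草图示意（仅为示意性软件模型，非正式实现）：

```python
# 融合向量乘法器的软件模型：每拍取每一维乘数 B_i 的低 n 位，对所有
# 维度同时做选择与移位，横向送入加法树与结果寄存器一同累加。
def fused_inner_product(A, B, n=2):
    result, shift = 0, 0
    B = list(B)
    while any(B):                            # 直到各维乘数全为 0
        for i in range(len(A)):
            low = B[i] & ((1 << n) - 1)      # 第 i 维乘数的低 n 位
            result += (A[i] * low) << shift  # 选择、移位后横向累加
            B[i] >>= n                       # 各维乘数右移 n 位
        shift += n
    return result

A = [0b10111011] * 8                         # 8 维，每维 8 位
B = [0b1011] * 8                             # 8 维，每维 4 位
assert fused_inner_product(A, B) == sum(a * b for a, b in zip(A, B))
```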
为更清楚的表明该融合向量乘法器的运算流程以及该乘法器和其他乘法器运算流程的区别及优势，给出一个具体实施例，结合图10、图11和图12进行说明。首先我们假定 $\vec{A}$ 和 $\vec{B}$ 的维度为8，即N=7；$\vec{A}$ 的位宽为8位，即 $\vec{A}$ 的每一维度均为8位，即 $A_i=\{a_{i7}\cdots a_{i1}a_{i0}\}$，其中 $i=0,1,\ldots,7$；$\vec{B}$ 的位宽为4位，即 $\vec{B}$ 的每一维度均为4位，即 $B_i=\{b_{i3}b_{i2}b_{i1}b_{i0}\}$，其中 $i=0,1,\ldots,7$。那么向量内积 $\vec{A}\cdot\vec{B}=\sum_{i=0}^{7}A_iB_i$。
一般情况下,采用基础乘法器或上述的基础或稀疏乘法器(假定n为2,即每次乘数移动2位)时的运算流程分为两个阶段:首先分别计算各自分量的乘积,然后再进行求和,如图10所示。具体的说,对于某一维度A i和B i进行计算,移位寄存器清零。第一个时钟周期取B i的最低两位b i0,b i1,输入选择、移位、送入加法器,得到A i*b i0b i1的值,并将移位寄存器加2;第二个时钟周期,B i右移2位后取最低两位得到最低位b i2,b i3,输入选择、移位得到A i*b i2b i3,将结果与之前的和相加,得到最终运算结果A i*b i0b i1b i2b i3,即得到该维度的最终运算结果A i*B i。进行下一维度的运算,输入A i+1和B i+1,移位寄存器清零……直到每一维度运算完毕,得到(A 0*B 0,A 1*B 1,……,A 7*B 7),阶段1运算完毕。而后,在阶段2,将乘积送入一个加法树中进行加法运算,得到最终的向量内积的结果,即
$\vec{A}\cdot\vec{B}=\sum_{i=0}^{7}A_iB_i$。
在阶段1中,可以选择1个乘法器,依次计算每个维度;也可以提供多个乘法器并行运算,在一个乘法器中完成一个维度的运算,如图11和12所示。当采用多个乘法器时,每个维度的乘 数B i的移位值都需要重新进行计数。该阶段的乘法器采用上述的第一基础乘法器、第二基础乘法器或者稀疏乘法器均可。
上述运算器可以采用任意组合的方式完成所需运算。如，将第二基础乘法器和位串行加法树进行组合，如图13所示，来进行向量乘法。这里，我们假定计算向量 $\vec{A}$ 和 $\vec{B}$ 的内积值，将相应维度的数据送入乘法器中等待运算，如图11所示。这里，要求 $\vec{A}$ 和 $\vec{B}$ 的维度相同，均为(N+1)，但是每一维度的位宽不一定相同，同时假定A为被乘数，B为乘数，每次运算，A取指定的m位、B取指定的n位进行运算，其中m不大于 $\vec{A}$ 的一个维度的位宽的正整数，n不大于 $\vec{B}$
的一个维度的位宽的正整数。首先,取A 0的低m位和B 0的低n位乘法器中,将A 0的低m位和B 0的低n位做乘法运算,得到的选择的结果送入位串行加法树中进行加法运算。并将结果保存到存储单元中。而后,将B移位n位,和A的低m位进行乘法操作,并送入位串行加法树中进行加法运算,同时原存储单元的数据经过第三移位单元移位后一同进行加法运算,结果保存到存储单元。待B全部运算完毕后,A移位m位,重新依次与B的n位进行运算。待全部运算结束,此时存储单元中的数据即为所求的最终运算结果。利用该乘法器能够灵活的配置待运算数据的位宽,无需保存中间数据,从而降低了存储开销。加快了运算速度。同时,当数据位数比较低或者向量位数比较高的时候,能够极大地利用数据低位宽、向量高维度的特性,可以采用流水线的方式并行执行该过程,降低运行所需时间,进一步加快运算速度,提高性能功耗比。
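上述“A取m位、B取n位”的分块乘加组合可用如下Python草图示意（仅为思想层面的软件模型，块宽与位宽均为示意性假定）：

```python
# 第二基础乘法器与位串行加法树组合的软件模型：A 按 m 位分块、
# B 按 n 位分块，各块乘积按相应移位量送入加法树累加。
def block_multiply(A, B, m=4, n=2, width_a=8, width_b=8):
    result = 0
    for i in range(0, width_a, m):                # A 每次取 m 位
        a_blk = (A >> i) & ((1 << m) - 1)
        for j in range(0, width_b, n):            # B 每次取 n 位
            b_blk = (B >> j) & ((1 << n) - 1)
            result += (a_blk * b_blk) << (i + j)  # 位串行加法树累加
    return result

assert block_multiply(0b10111011, 0b01100110) == 0b10111011 * 0b01100110
```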
综上所述,利用该装置和方法能够明显提高神经网络的运算速度,同时具有动态可配置性,满足数据位宽的多样性和运算过程中数据位宽的动态可变性的相关要求,具有灵活性强、可配置程度高、运算速度快、功耗低等优点。
根据本公开实施例的另一方面,还提供一种计算位宽动态可配置的处理装置的处理方法,参见图14所示,包括步骤:
S1401控制电路生成控制指令,传送给存储器、数据宽度调整电路和运算电路;
S1402存储器根据接收的控制指令,向运算电路输入神经网络的待运算数据;
S1403数据宽度调整电路根据实际需求调整所述待运算数据、中间运算结果、最终运算结果和/或待缓存数据的宽度;
S1404运算电路根据接收的控制指令，选择对应类型的乘法器和加法器电路等位串行运算器；
S1405运算电路根据输入的待运算数据和神经网络参数以及控制指令,对不同计算位宽的神经网络的待运算数据进行运算。
以上,利用该本实施例的方法中数据宽度调整电路,能够明显提高神经网络的运算速度,同时具有动态可配置性,满足数据位宽的多样性和运算过程中数据位宽的动态可变性的相关要求。
进一步的,步骤S1403中第一运算模块包括采用加法器电路,以及基础乘法器、稀疏乘法器和/或融合向量乘法器对神经网络的待运算数据进行运算。通过动态的选择具体的加法器电路,以及基础乘法器、稀疏乘法器和/或融合向量乘法器,能够使处理方法具有灵活性强、可配置程度高、运算速度快、功耗低等特点。
以下,将介绍另一种方案的计算位宽动态可配置的处理装置和处理方法的实施例,以下介绍的方案中将不包含数据宽度调整电路和与数据宽度调整电路相关的功能单元。
图15是本公开又一实施例提供的处理装置的结构示意图。如图15所示,本装置主要分为三个部分,控制电路、运算电路和存储器。控制电路向运算电路和存储器发出控制信号,来控制二者的运行,协调二者间的数据传输。各部分的功能参照图1所示实施例中各部分的描述内容,在此不予赘述。
图16是本公开一个实施例的处理装置的结构示意图。图16所示的结构为图2所示结构的基础上去除数据宽度调整电路,即存储器直接与运算电路连接,相应的各设置方式可参照以上所述。三个模块可以采用流水线的方式并行执行。本装置能够加速卷积神经网络的运算过程,减 少片内片外的数据交换,节约存储空间。
图17是本公开另一个实施例的处理装置的结构示意图。图17所示结构与图3类似,不同之处仅在于图17中不包含数据宽度调整电路的相关结构和连接关系,关于图17中的各连接关系及所实现功能参照图3的相应实施例描述,在此不予赘述。本实施例的处理装置在参数多的大规模运算当中,明显提高了运算效率。本装置能够有效加速卷积神经网络的运算过程,尤其适用于网络规模比较大,参数比较多的情况。
图18是本公开再一个实施例的处理装置的结构示意图。图18所示结构与图4类似,不同之处仅在于图18中,不包含数据宽度调整电路的相关结构和连接关系,关于图18中的各连接关系及所实现功能参照图4的相应实施例描述,在此不予赘述。
图19是本公开的又一实施例的用于本装置的基础乘法器装置示意图,能够满足计算位宽动态可配置的要求。如图19所示,M位的被乘数和N位的乘数,其中M,N均为正整数,也就是说,这里的被乘数和乘数的位数可以相等,也可以不相等。将乘数的低n位(n为正整数,且1<n≤N)输入至输入选择电路中,当乘数的低n值分别与被乘数做“与”运算,即乘数该位值为1,则取被乘数本身,否则取0。同时,将乘数送入第一移位寄存器中进行移位,将低n位移出,则下一次再输入至输入选择电路中的为新的低n位。输入选择后的结果向上输入到第二移位寄存器进行相应的移位,再送入加法树中进行累加。这里进行累加的是进行输入选择并进行移位后的数据和之前进行累加的结果。得到结果后作为中间结果存入结果寄存器。待下一次被乘数进行输入选择后进行移位时,结果寄存器取出中间结果送入加法树(器)中进行累加。当乘数全为0时,乘法运算结束。
为更清楚的表明该基础乘法器的运算流程,我们给出一个具体实施例,假定被乘数为10111011,即M=8,乘数为1011,即N=4。
当n=2时,即每次移动2位的时候,该运算过程如下:首先,取出乘数的最低2位的11,和被乘数一起送入输入选择电路,选择均为被乘数本身,送入第一移位寄存器,最低位对应的选择出的被乘数无需移位, 即10111011,次低位对应的选择出的被乘数左移1位,即101110110,送入加法树中,由于之前没有数字相加,故送入结果寄存器的为10111011与101110110的和,即1000110001。而后,乘数右移2位后取其最低2位,即10,和被乘数一起送入输入选择电路中,得到0和10111011,而后通过移位寄存器,0左移了2位还是0,10111011左移3位为10111011000,和结果寄存器中的1000110001一起送入加法树中进行运算,得到100000001001,送入结果寄存器中。此时,乘数右移2位,全部为0,即运算结束,结果寄存器中即为最终结果,即100000001001。
图20是本公开提供的一实施例用于本装置的稀疏乘法器装置示意图,能够满足要求的计算位宽动态可配置的要求。顾名思义,稀疏乘法器针对稀疏运算的情况,即当乘数或者被乘数用稀疏表示的方式表示出1的位置时,可以进一步提高了运算的有效性,加快运算速度。如图20所示,M位的被乘数和N位的乘数,其中M,N均为正整数,也就是说,这里的被乘数和乘数的位数可以相等,也可以不相等。这里,乘数用稀疏表示的方法,用绝对或相对位置的方式表示该乘数中1的位置。这里,我们的运算电路是可配置的,故当采用不同的表示方法进行运算时,运算器内部的装置可以根据需求进行配置。譬如,可以当结果寄存器进行累加时无需移位,那么可以规定此时和结果寄存器相接的移位寄存器不工作,此时乘数的移位信息也可以不传递到该移位寄存器中。相关具体细节均可以根据需要做相应的调整,来完成对被乘数的移位和对结果的累加等相关具体细节。
为更清楚的表明该稀疏乘法器的运算流程,我们给出一个具体实施例,假定被乘数为10111011,即M=8,乘数为00100010,即N=8。当采用绝对的表示方式来表示乘数,那么用绝对位置表示出乘数中1的位置,假定我们把数的最右侧一位称为第0位,第0位的左侧一位称为第1位,以此类推。那么,该乘数表示为(1,5)。同时,我们要求该实施例中的与结果寄存器相连的移位寄存器不工作,乘数的数据无需传递给该移位寄存器。那么首先取出乘数的第一个数,即1,表示在第1位处有一个1。将被乘数送入移位寄存器,然后移动1位后为1011 10110送入加法器。由于之前数字相加,故送入结果寄存器的结果为101110110。而后取出乘数的下一个1的位置,即5,和被乘数一起送入移位寄存器。在移位寄存器中,将被乘数右移5位,得到1011101100000,送入加法器。同时取出结果寄存器中的结果101110110,由于采用的这种绝对表示的方法无需进行移位,故可直接将该结果送入加法器进行累加,得到1100011010110。累加后的结果再次送入结果寄存器。此时,乘数中的1都已经计算完毕,故运算结束。如果采用相对的方式表示乘数,并定义其表示方法为从最高位(最左边)的第一个不为0的数字开始,到最低位,每两个不为0的数字间相距的位数。对于00100010,在第一个不为0的数字和下一个不为0的数字之间相距4位,在第二个不为0的数字到最低位,相距1位,故表示为(4,1)。这里,我们要求该实施例中的与结果寄存器相连的和与被乘数相连的移位寄存器均需要工作。首先,取出乘数的第一个数字4,送入两个移位寄存器中,那么将被乘数右移4位,和结果寄存器中的数据右移4位后送入加法器中进行累加。此时结果寄存器的数据为0,故得到累加结果101110110000,送入结果寄存器保存。而后,取出乘数的第二个数字1,那么将该值送入移位寄存器中,得到101110110和1011101100000,送入加法器进行累加,得到结果1100011010110。该结果再次送入结果寄存器。此时,乘数中的1都已经计算完毕,故运算结束。这样,可以有效利用数据的稀疏性,只进行有效的运算,即非0数据之间的运算。从而减少了无效的运算,加快运算速度,提高了性能功耗比。
图22是本公开提供的一个实施例的融合向量乘法器进行向量乘法的装置的结构示意图。这里，我们假定计算向量 $\vec{A}$ 和 $\vec{B}$ 的内积值，将相应维度的数据送入乘法器中等待运算，如图8所示。这里，要求 $\vec{A}$ 和 $\vec{B}$ 的维度相同，均为(N+1)，但是每一维度的位宽不一定相同，同时假定每次取n位进行运算，其中n为大于1且不大于 $\vec{B}$
的一个维度的位宽的正整数。首先,取B 0的低n位和A 0同时送入第一个输入选择电路中,将B 0的低n位分别与A 0做与运算,得到的选择的结果送入后面的移位寄存器进行移位。取移位后,将结果送 入加法树中。在此过程中,每个维度都和第一维度进行着相同的操作。而后通过加法树,对这些维度送入的数据进行累加,并将结果寄存器中的值送入加法树中,一同进行累加,得到累加后的结果再送入结果寄存器中。在运算的同时,每一维度的B i(i=0,1,……,N)值送入移位寄存器中右移n位后,重复上述操作,即取移位后的B i(i=0,1,……,N)值的最低n位和对应的A i(i=0,1,……,N)值一起送入输入选择电路中进行选择,再送入移位寄存器中进行移位,而后送入加法树中进行累加。不断重复该过程直到每一维度的B i(i=0,1,……,N)值全为0,运算结束,此时结果寄存器中的数据即为所求的最终结果。利用该乘法器能够灵活的配置待运算数据的位宽,无需在每进行一组数据乘法时就需要重新对被乘数移位位数进行计数的过程。同时,当数据位数比较低或者向量位数比较高的时候,能够极大地利用数据低位宽、向量高维度的特性,可以采用流水线的方式并行执行该过程,降低运行所需时间,进一步加快运算速度,提高性能功耗比。
我们可以采用多种方式来完成向量的内积运算，结合图21、图22和图23进行说明。首先我们假定 $\vec{A}$ 和 $\vec{B}$ 的维度为8，即N=7；$\vec{A}$ 的位宽为8位，即 $\vec{A}$ 的每一维度均为8位，即 $A_i=\{a_{i7}\cdots a_{i1}a_{i0}\}$，其中 $i=0,1,\ldots,7$；$\vec{B}$ 的位宽为4位，即 $\vec{B}$ 的每一维度均为4位，即 $B_i=\{b_{i3}b_{i2}b_{i1}b_{i0}\}$，其中 $i=0,1,\ldots,7$。那么向量内积 $\vec{A}\cdot\vec{B}=\sum_{i=0}^{7}A_iB_i$。
采用基础乘法器或上述的基础或稀疏乘法器(假定n为2,即每次乘数移动2位)时的运算流程分为两个阶段:首先分别计算各自分量的乘积,然后再进行求和,如图21所示。具体的说,对于某一维度A i和B i进行计算,移位寄存器清零。第一个时钟周期取B i的最低两位b i0,b i1,输入选择、移位、送入加法器,得到A i*b i0b i1的值,并将移位寄存器加2;第二个时钟周期,B i右移2位后取最低两位得到最低位b i2,b i3,输入选择、移位得到A i*b i2b i3,将结果与之前的和相加,得到最终结果A i*b i0b i1b i2b i3,即得到该维度的最终结果A i*B i。进行下一维度的运算,输入A i+1和B i+1,移位寄存器清零……直到每一维度运算完毕,得到 (A 0*B 0,A 1*B 1,……,A 7*B 7),阶段1运算完毕。而后,在阶段2,将乘积送入一个加法树中进行加法运算,得到最终的向量内积的结果,即
$\vec{A}\cdot\vec{B}=\sum_{i=0}^{7}A_iB_i$。
在阶段1中,可以选择1个乘法器,依次计算每个维度;也可以提供多个乘法器并行运算,在一个乘法器中完成一个维度的运算,如图7所示。当采用多个乘法器时,每个维度的乘数B i的移位值都需要重新进行计数。该阶段的乘法器采用上述的基础乘法器或者稀疏乘法器均可。
利用融合向量乘法器,是整体进行横向的累加运算,其结构如图22所示,将每一维度的一个分量的乘积运算完毕即送入加法树中进行累加,直到运算完毕,得到最终结果。例如,其运算流程如图23的椭圆形框所示,第一个时钟周期,每一维计算得到A i*b i0(i=0,1,……,7)的乘积,送入加法树中累加,计算结果送入结果寄存器,移位寄存器加1;第二个时钟周期,每一维根据移位寄存器计算得到2*A i*b i1(i=0,1,……,7)的乘积,和结果寄存器的数据一同送入加法树中累加,移位寄存器加1;第三个时钟周期,每一维根据移位寄存器计算得到4*A i*b i2(i=0,1,……,7)的乘积,和结果寄存器的数据一同送入加法树中累加,移位寄存器加1;最后,第四个时钟周期,计算得到8*A i*b i3(i=0,1,……,7)的乘积,和结果寄存器的数据一同送入加法树中累加,得到最终结果。因此我们在4个运算周期之后就得到了所需要的结果,运算过程中,共移位3次。而一般的乘法器,每个数据运算都需要进行移位操作,即,在有4个操作数的情况下,共需要4*3=12次移位操作。所以,我们的设计,通过改变运算顺序,大大减少了对移位值的计数操作从而有效提高了性能功耗比。
根据本公开实施例的另一方面,还提供一种计算位宽动态可配置的处理方法,参见图24所示,包括步骤:
S2400:控制电路生成控制指令,传送给存储器和运算电路;
S2401:存储器根据接收的控制指令,向运算电路输入神经网络的待运算数据;
S2402:运算电路根据接收的控制指令,选择第一运算模块中的对应类型的乘法器和加法器电路;
S2403:运算电路根据输入的待运算数据和神经网络参数以及控制指令,对不同计算位宽的神经网络的待运算数据进行运算。
进一步的,步骤S2403中第一运算模块包括采用加法器,以及基础乘法器、稀疏乘法器和/或融合向量乘法器对神经网络的待运算数据进行运算。
综上所述,利用该处理装置和方法能够明显提高神经网络的运算速度,同时具有动态可配置性,满足数据位宽的多样性和运算过程中数据位宽的动态可变性的相关要求,具有灵活性强、可配置程度高、运算速度快、功耗低等优点。
此外，本公开还提供一种包含构建离线模型的运算方法和运算装置，在生成离线模型之后，可根据离线模型直接进行运算，避免了运行包括深度学习框架在内的整个软件架构带来的额外开销，以下将结合具体实施例对此进行具体阐述。
在典型的应用场景中,神经网络加速器编程框架通常位于最上层,编程框架可以为Caffe,Tensorflow,Torch等,如图25所示,从底层到上层依次为神经网络处理器(用于神经网络运算的专用硬件),硬件驱动(用于软件调用神经网络处理器),神经网络处理器编程库(用于提供调用神经网络处理器的接口),神经网络处理器编程框架以及需要进行神经网络运算的高级应用。
本公开实施例的一方面,提供了一种神经网络的运算方法,包括步骤:
步骤1:获取输入数据;
步骤2:获取或根据输入数据确定离线模型,依据离线模型确定运算指令,以供后续计算调用;
步骤3:调用所述运算指令,对待处理数据进行运算得到运算结果以供输出。
其中，该输入数据包括待处理数据、网络结构和权值数据，或者该输入数据包括待处理数据和/或离线模型数据。
其中,步骤2中的离线模型可以是已有的,或者是根据外部数据(例如网络结构或者权值数据)进行后期构建的。通过设置离线模型得到运算指令的方式,能够提高运算过程。
步骤3中的调用运算指令可以是在输入数据仅包括待处理数据不包含离线模型或者用于确定离线模型的数据情况下,仅根据运算指令进行网络运算。
在一些实施例中,当输入数据包括待处理数据、网络结构和权值数据时,执行如下步骤:
步骤11,获取输入数据;
步骤12,根据网络结构和权值数据构建离线模型;
步骤13,解析离线模型,得到运算指令并缓存,以供后续计算调用;
步骤14,根据运算指令,对待处理数据进行运算得到运算结果以供输出。
上述实施例中首先根据网络结构以及权值数据构建出离线模型，然后对离线模型进行解析后获取运算指令，这使得在不存储离线模型的低内存、实时性强的应用环境中能够充分发挥性能，运算过程更为简洁快速。
在一些实施例中,当输入数据包括待处理数据和离线模型时,执行如下步骤:
步骤21,获取输入数据;
步骤22,解析离线模型,得到运算指令并缓存,以供后续计算调用;
步骤23,根据运算指令,对待处理数据进行运算得到运算结果以供输出。
上述实施例中当输入数据包括离线模型时,当建立起离线模型后,运算时对离线模型进行解析后获取运算指令,从而避免了运行包括深度学习框架在内的整个软件架构带来的额外开销。
在一些实施例中,当输入数据仅包括待处理数据时,执行如下步骤:
步骤31,获取输入数据;
步骤32,调用缓存的运算指令,对待处理数据进行运算得到运算结果以供输出。
上述实施例当输入数据仅包括待处理数据而不含神经网络结构和权值数据时,则通过调取运算指令对待处理数据进行运算得到运算结果。
在一些实施例中,通过神经网络处理器,根据运算指令,对待处理数据进行运算得到运算结果;其中,神经网络处理器主要用于神经网络运算,接收指令、待处理数据和/或网络模型(例如离线模型)后进行运算;举例来说,对于多层神经网络来说,例如根据输入层数据,以及神经元、权值和偏置等数据,计算得到输出层数据。
在进一步的实施例中,该神经网络处理器具有指令缓存单元,用于对接收的运算指令进行缓存。
在一些实施例中,上述神经网络处理器还具有数据缓存单元,用于缓存所述待处理数据。待处理数据输入神经网络处理器后在该数据缓存单元中暂存,后续结合运算指令再进行运算。
基于上述运算方法,本公开实施例还提供了一种运算装置,包括:
输入模块，用于获取输入数据，该输入数据包括待处理数据、网络结构和权值数据，或者该输入数据包括待处理数据和/或离线模型数据；
模型生成模块,用于根据输入的网络结构和权值数据构建离线模型;
神经网络运算模块,用于基于输入模块中的离线模型数据或者模型生成模块中构建的离线模型生成运算指令并缓存,以及基于运算指令对待处理数据进行运算得到运算结果;
输出模块,用于输出所述运算结果;
控制模块,用于检测输入数据类型并执行如下操作:
当输入数据包括待处理数据、网络结构和权值数据时,控制输入模块将网络结构和权值数据输入模型生成模块以构建离线模型,并控制神经网络运算模块基于模型生成模块输入的离线模型,对输入模块输入的待处理数据进行运算;
当输入数据包括待处理数据和离线模型时,控制输入模块将待处理数据和离线模型输入神经网络运算模块,并控制神经网络运算模块基于离线模型生成运算指令并缓存,并基于运算指令对待处理数据进行运算;
当输入数据仅包括待处理数据时,控制输入模块将待处理数据输入神经网络运算模块,并控制神经网络运算模块调用缓存的运算指令,对待处理数据进行运算。
上述神经网络运算模块包括模型解析单元和神经网络处理器,其中:
模型解析单元,用于基于离线模型生成运算指令;
神经网络处理器,用于缓存运算指令用于后续计算调用;或在输入数据中仅包括待处理数据时调用缓存的运算指令,并基于运算指令对待处理数据进行运算得到运算结果。
在一些实施例中,上述神经网络处理器具有指令缓存单元,用于缓存运算指令以供后续计算调用。
在一些实施例中,上述离线模型可以是一个按照特殊结构定义的文本文件,可以为各种神经网络模型,如可以为Cambricon_model、AlexNet_model、GoogleNet_model、VGG_model、R-CNN_model、GAN_model、LSTM_model、RNN_model、ResNet_model等模型,但并不只限于本实施例提出的这些模型。
离线模型可以包含原始网络中各个计算节点的网络权值以及指令数据等必要网络结构信息,其中,指令可以包括各个计算节点的计算属性以及各个计算节点之间的连接关系等信息,从而在处理器再次运行该原始网络时,可以直接运行该网络对应的离线模型,无需再次对同一网络进行编译等操作,从而缩短处理器运行该网络时的运行时间,提高处理器的处理效率。
可选地,处理器可以是通用处理器,如CPU(Central Processing Unit,中央处理器)、GPU(Graphics Processing Unit,图形处理器)或IPU(Intelligence Processing Unit,智能处理器),IPU为用于执行人工神经网络运算的处理器。
在一些实施例中,待处理数据为能用神经网络进行处理的输入,例如为连续的单张图片、语音或视频流中的至少一种。
在一些实施例中,上述网络结构可以为各种神经网络结构,例如可以为AlexNet、GoogleNet、ResNet、VGG、R-CNN、GAN、LSTM、RNN、ResNet等,但并不只限于本实施例提出的此些结构。需要指出的是,这里的网络结构与离线模型相互对应,例如当网络结构为RNN时,则离线模型为RNN_model,该模型包括RNN网络中各个节点的网络权值以及指令数据等必要RNN网络结构信息,其中,指令可以包括各个计算节点的计算属性以及各个计算节点之间的连接关系等信息。
具体地,根据输入模块输入数据的不同,本公开实施例的运算装置可以具有以下三种执行形式:
1、当输入模块输入的数据为网络结构、权值数据和待处理数据时,则控制模块控制输入模块将网络结构和权值数据传输至模型生成模块,将待处理数据传输至模型解析模块;控制模块控制模型生成模块根据具体的网络结构以及相应的权值数据生成离线模型(离线模型可以是一个按照预设的结构定义的文本文件,可以包含神经网络中各个计算节点的网络权值以及指令数据等必要网络结构信息,其中,指令可以包括各个计算节点的计算属性以及各个计算节点之间的连接关系等信息,例如可以根据相应的网络结构类型以及权值数据构建出该离线模型),并将该生成的离线模型传输至模型解析单元;控制模块控制模型解析单元对接收的离线模型进行解析,得到神经网络处理器可识别的运算指令(也就是说根据上述的离线模型的文本文件映射出相应的网络运算指令,而无需进行网络编译操作),并将运算指令和待处理数据传输至神经网络处理器;神经网络处理器根据接收的运算指令,对待处理数据进行运算,得到运算结果,并将该运算结果传输至输出模块以供输出。
2、当输入模块输入的数据为离线模型和待处理数据时,控制模块则控制输入模块将离线模型和待处理数据直接传输至模型解析单元,后续工作原理与第一种情况相同。
3、当输入模块输入的数据仅包含有待处理数据时,则控制模块控制输入模块将此待处理数据经模型解析单元传输至神经网络处理器,神经网络处理器根据缓存的运算指令对待处理数据进行运算得到运算结果。输入模块可以包括判断模块,用于判断输入数据的类型。可以理解的是,通常这种情况不会在首次使用神经网络处理器中出现,以确保指令缓存中已有确定的运算指令。
因此,在当前网络运算与上一次网络运算的离线模型不同时,输入模块输入的数据应包括网络结构、权值数据和待处理数据,通过模型生成模块生成新的离线模型后进行后续的网络运算;在当前网络运算事先已得到相应的离线模型时,输入模块输入的数据应包括离线模型和待处理数据;在当前网络运算与上一次网络运算的离线模型相同时,输入模块输入的数据仅包括待处理数据即可。
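上述三种输入情形的分发逻辑可用如下Python草图示意；其中三个辅助函数均为占位实现，模块接口与名称皆为示意性假定：

```python
# 控制模块按输入数据的组成选择执行路径的软件模型（示意）。
def build_offline_model(net, weights): return {"net": net, "weights": weights}
def parse_offline_model(model):        return [("run_layer", model["net"])]
def run(instructions, data):           return [(op, data) for op, _ in instructions]

def dispatch(inputs, cache):
    if "network" in inputs and "weights" in inputs:  # 情形1：结构+权值+待处理数据
        model = build_offline_model(inputs["network"], inputs["weights"])
    elif "offline_model" in inputs:                  # 情形2：离线模型+待处理数据
        model = inputs["offline_model"]
    else:                                            # 情形3：仅待处理数据
        return run(cache, inputs["data"])            # 直接复用缓存的运算指令
    cache[:] = parse_offline_model(model)            # 解析离线模型并缓存运算指令
    return run(cache, inputs["data"])

cache = []
dispatch({"network": "AlexNet", "weights": "w", "data": "img1"}, cache)
out = dispatch({"data": "img2"}, cache)              # 之后仅传数据即可复用指令
```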
在本公开的一些实施例中,本公开描述的运算装置作为子模块集成到整个计算机系统的中央处理器模块当中。待处理数据和离线模型被中央处理器控制传送到运算装置中。模型解析单元会对传入的神经网络离线模型进行解析并生成运算指令。接着运算指令和待处理数据会被传入神经网络处理器中,通过运算处理得到运算结果,并将该运算结果返回到主存单元中。在后续计算过程中,网络结构不再改变,则只需要不断传入待处理数据即可完成神经网络计算,得到运算结果。
以下通过具体实施例对本公开提出的运算装置及方法进行详细描述。
如图26所示,本实施例提出一种运算方法,包括以下步骤:
当输入数据包括待处理数据、网络结构和权值数据时,执行如下步骤:
步骤11、获取输入数据;
步骤12、根据网络结构和权值数据构建离线模型;
步骤13、解析离线模型,得到运算指令并缓存,以供后续计算调用;
步骤14、根据运算指令,对待处理数据进行运算得到神经网络运算结果以供输出;
当输入数据包括待处理数据和离线模型时,执行如下步骤:
步骤21、获取输入数据;
步骤22、解析离线模型,得到运算指令并缓存,用于后续计算调用;
步骤23、根据运算指令,对待处理数据进行运算得到神经网络运算结果以供输出;
当输入数据仅包括待处理数据时,执行如下步骤:
步骤31、获取输入数据;
步骤32、调用缓存的运算指令,对待处理数据进行运算得到神经网络运算结果以供输出。
通过神经网络处理器,根据运算指令,对待处理数据进行处理得到运算结果;该神经网络处理器具有指令缓存单元和数据缓存单元,用于分别对接收的运算指令和待处理数据进行缓存。
本实施例中提出的输入的网络结构为AlexNet,权值数据为bvlc_alexnet.caffemodel,待处理数据为连续的单张图片,离线模型为Cambricon_model。对于已有的离线模型,可以对该离线模型Cambricon_model进行解析,从而生成一系列运算指令,随后将生成的运算指令传输到神经网络处理器2707上的指令缓存单元中,将输入模块2701传入的输入图片传输到神经网络处理器2707上的数据缓存单元中。
综上所述,运用本实施例提出的方法,可以极大程度上简化使用神经网络处理器进行运算的流程,避免调用传统整套编程框架到来的额外内存和IO开销。运用本方法,能让神经网络加速器在低内存、实时性强的环境下充分发挥运算性能。
如图27所示，本实施例还提出一种运算装置，包括：输入模块2701、模型生成模块2702、神经网络运算模块2703、输出模块2704及控制模块2705，其中，神经网络运算模块2703包括模型解析单元2706和神经网络处理器2707。
该装置的关键词在于离线执行,是指生成离线模型后直接利用离线模型生成相关的运算指令并传入权值数据,对待处理数据进行处理运算。更具体的:
上述输入模块2701,用于输入网络结构、权值数据和待处理数据的组合或者离线模型和待处理数据的组合。当输入为网络结构、权值数据和待处理数据时,则将网络结构和权值数据传入模型生成模块2702,以生成离线模型用以执行下面运算。当输入为离线模型和待处理数据时,则将离线模型、待处理数据直接传入模型解析单元2706,以执行下面运算。
上述输出模块2704,用于输出根据特定网络结构和一组待处理数据产生的确定的运算数据。其中输出数据由神经网络处理器2707运算得到。
上述模型生成模块2702,用于根据输入的网络结构参数,权值数据生成用于可供下层使用的离线模型。
上述模型解析单元2706,用于解析传入的离线模型,生成可以直接传入神经网络处理器2707的运算指令,同时将输入模块2701传入的待处理数据传到神经网络处理器2707中。
上述神经网络处理器2707,用于根据传入的运算指令和待处理数据进行运算,得到确定的运算结果传入到输出模块2704中,具有指令缓存单元和数据缓存单元。
上述控制模块2705,用于检测输入数据类型并执行如下操作:
当输入数据包括待处理数据、网络结构和权值数据时,控制输入模块2701将网络结构和权值数据输入模型生成模块2702以构建离线模型,并控制神经网络运算模块2703基于模型生成模块2702输入的离线模型,对输入模块2701输入的待处理数据进行神经网络运算;
当输入数据包括待处理数据和离线模型时,控制输入模块2701将待处理数据和离线模型输入神经网络运算模块2703,并控制神经网络运算模块2703基于离线模型生成运算指令并缓存,并基于运算指令对待处理数据进行神经网络运算;
当输入数据仅包括待处理数据时,控制输入模块2701将待处理数据输入神经网络运算模块2703,并控制神经网络运算模块2703调用缓存的运算指令,对待处理数据进行神经网络运算。
本实施例中提出的输入的网络结构为AlexNet，权值数据为bvlc_alexnet.caffemodel，待处理数据为连续的单张图片。模型生成模块2702根据输入的网络结构和权值数据生成新的离线模型Cambricon_model，生成的离线模型Cambricon_model也可以作为下次的输入单独使用；模型解析单元2706可以解析离线模型Cambricon_model，从而生成一系列运算指令。模型解析单元2706将生成的运算指令传输到神经网络处理器2707上的指令缓存单元中，将输入模块2701传入的输入图片传输到神经网络处理器2707上的数据缓存单元中。
此外，本公开还提供一种支持复合标量指令的运算装置及运算方法，通过在运算中提供复合标量指令（一种将浮点指令和定点指令统一起来的指令），在较大程度上统一了浮点指令和定点指令，在译码阶段不对指令的种类做区分，在具体计算时才根据操作数地址域中的地址来确定操作数是浮点数据还是定点数据，简化了指令的译码逻辑，也使得指令集变得更为精简。以下将结合具体实施例对此进行具体阐述。
图28是本公开实施例提供的支持复合标量指令装置的结构示意图,如图28所示,装置包括控制器模块2810、存储模块2820、运算器模块2830和输入输出模块2840。
控制器模块2810,用于从存储模块读取指令并存储于本地的指令队列中,再将指令队列中的指令译码为控制信号以控制存储模块、运算器模块和输入输出模块的行为。
存储模块2820,包括寄存器堆、RAM和ROM等存储器件,用于保存指令、操作数等不同数据。操作数包括浮点数据和定点数据,存储器模块将浮点数据和定点数据存储于不同的地址所对应的空间,如不同的RAM地址或不同的寄存器号,从而可以通过地址和寄存器号来判断读取的数据是浮点数还是定点数。
运算器模块2830,可以对浮点数据和定点数据进行四则运算、逻辑运算、移位操作和求补运算等操作,其中,四则运算包括加、减、乘和除四种运算操作;逻辑运算包括与、或、非和异或四种运算操作。运算器模块接收控制器模块的控制信号后,可以通过读取操作数所在的地址或寄存器号来判断所读取的是浮点类型的数据还是定点类型的数据,运算器模块从存储模块读取操作数据并进行对应的运算,运算的中间结果存在存储模块中,将最终运算结果存储至输入输出模块。
输入输出模块2840,可以用于输入输出数据的存储和传输,在初始化时,输入输出模块将初始的输入数据和编译好的复合标量指令存储至存储模块中,运算结束后,接收运算器模块传输的最终运算结果,此外,输入输出模块还可以从存储器中读取编译指令所需的信息,以供计算机编译器将程序编译为各种指令。
由此可见,本公开实施例提供的支持复合标量指令的装置,为复合标量指令提供了高效的执行环境。
图29A和图29B是本公开实施例提供的一种存储模块组织形式示例图。存储模块将浮点数据和定点数据存储于不同的地址空间,如不同的地址或不同的寄存器号,从而可以通过地址和寄存器号来判断读取的数据是浮点数还是定点数。
在本实施例中,本公开使用由起始地址为0000H,终止地址为3FFFH的RAM和16个寄存器组成的寄存器堆所构成的存储模块为例,展示如何将浮点数的存储与定点数的存储分离。如图29A所示,在RAM中,定点数据只存储在地址为0000H到1FFFH的RAM单元中,而浮点数据只存储在2000H到3FFFH的RAM单元中,指令可以存储在任意RAM单元中,也可以将指令集中不变的信息存储在ROM中。如图29B所示,在寄存器堆中,定点数据只存在0至7号寄存器中,浮点数据只存在8到15号寄存器中。当寄存器里存储的值为RAM地址时,0至7号寄存器用于存储定点数据的RAM地址,8至15号寄存器用于存储浮点数据的RAM地址。
图30A是本公开实施例所提供的复合标量指令示例图。如图30A所 示,每一条指令拥有操作码域、操作数地址域(或立即数)和目标地址域,操作码域包括操作码,操作数地址域包括源操作数地址1和源操作数地址2,表示各源操作数的存储地址,目标地址域为操作数运算结果的存储地址:
操作码域用于区分不同类型的操作,如加法、减法、乘法和除法等,但不用于区分操作数的类型。
操作数地址域中可能包含RAM地址、寄存器号和立即数。存储浮点数据和定点数据所用的RAM地址和寄存器号不同,因而能用地址域来区分浮点操作数和定点操作数。当操作数地址域所储存的是立即数时,还需要一个运算器模块可识别的数据类型标志位来区分浮点操作数和定点操作数。
目标地址域可以是RAM地址,也可以是寄存器号。该地址域应与操作数类型相对应,即将浮点操作数的运算结果存入浮点数据对应的存储单元;将定点操作数的运算结果存入定点数据对应的存储单元。
由此可见,本公开提供的复合标量指令,是一种将浮点指令和定点指令统一起来的指令,在较大程度上统一了浮点指令和定点指令,在译码阶段不对指令的类型做区分,在具体计算时才根据操作数地址域中的读取操作数的地址来确定操作数是浮点数据还是定点数据,简化了指令的译码逻辑,也使得指令集变得更为精简。
另外,针对本公开提供的复合标量指令,若采用多种寻址方式,则还需增加确定寻址方式的标志位。
例如,采用图29A和29B所示的存储模块组织结构,加法指令的操作码为0001,采用多种寻址方式时,复合标量指令的组成如图30B至图30E所示。
图30B是本公开实施例提供的采用寄存器寻址时复合标量指令示例图,如图30B所示,当采用寄存器寻址时,寻址方式标志位为01,源操作数1和源操作数2分别存在源操作数1寄存器号和源操作数2寄存器号所对应的寄存器中,编号0至7的寄存器中存储的是定点数据,编号8至15的寄存器中存储的是浮点数据;
图30C是本公开实施例提供的采用寄存器间接寻址时复合标量指令示例图,如图30C所示,当采用寄存器间接寻址时,寻址方式标志位为10,源操作数1和源操作数2在RAM中的地址分别存在源操作数1寄存器号和源操作数2寄存器号所对应的寄存器中,其中定点数据的RAM地址(0000H至1FFFH)存于0至7号寄存器中;浮点数据的RAM地址(2000H至3FFFH)存于8至15号寄存器中。目标地址域存储目标寄存器号或者目标RAM地址。定点数据存于地址在0000H至1FFFH范围内的RAM单元中;浮点数据存于地址在2000H至3FFFH范围内的RAM单元中。
图30D是本公开实施例提供的采用立即数寻址时复合标量指令示例图，如图30D所示，若操作数地址域的数据为两个立即数，则寻址方式标志位为00，在寻址方式标志位和操作数地址域之间还设置有数据类型标志位，当立即数为定点类型时，该数据类型标志位为0；当立即数为浮点类型时，该数据类型标志位为1。
图30E是本公开实施例提供的采用RAM寻址时复合标量指令示例图,如图30E所示,若操作数地址域为RAM地址,则寻址方式标志位为11。源操作数1和源操作数2分别存在RAM地址对应的RAM单元中。其中,定点数据存在RAM地址0000H至1FFFH对应的RAM单元中;浮点数据存在RAM地址2000H至3FFFH对应的RAM单元中。
在采用以上各寻址方式的相关指令中,目标地址域存储目标寄存器号或者目标RAM地址。定点数据存于0至7号寄存器或者地址在0000H至1FFFH范围内的RAM单元中;浮点数据存于8至15号寄存器或者地址在2000H至3FFFH范围内的RAM单元中。
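按地址区间区分浮点/定点操作数的判断逻辑可用如下Python草图示意；地址边界取自上文0000H至1FFFH/2000H至3FFFH与0至7/8至15号寄存器的划分，仅为示意模型：

```python
# 复合标量指令的软件模型：译码时不区分数据类型，取操作数时才根据
# RAM 地址或寄存器号判定其为定点数还是浮点数。
def operand_type(addressing, value):
    if addressing == "ram":                 # RAM 寻址：按地址区间判断
        return "fixed" if 0x0000 <= value <= 0x1FFF else "float"
    if addressing == "register":            # 寄存器寻址：按寄存器号判断
        return "fixed" if 0 <= value <= 7 else "float"
    raise ValueError("立即数寻址需依靠数据类型标志位判定")

assert operand_type("ram", 0x1A00) == "fixed"
assert operand_type("ram", 0x2F00) == "float"
assert operand_type("register", 9) == "float"
```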
图31是本公开实施例提供的支持复合标量指令的运算方法流程图，如图31所示，本公开实施例提供一种支持复合标量指令的运算方法，利用上述支持复合标量指令装置进行数据运算，具体包括以下步骤：
S3101:将不同类型的数据存储于不同的地址内。
存储器模块将浮点数据和定点数据存储于不同的地址所对应的空间,如不同的RAM地址或不同的寄存器号。
S3102:将复合标量指令译码为控制信号。
控制器模块向存储模块发送输入输出(IO)指令,从存储模块中读取复合标量指令,并存入本地指令队列。控制器模块从本地指令队列中读取复合标量指令,并译码为控制信号。
S3103:根据控制信号读取操作数据,并根据读取操作数据的地址判断操作数据的类型,对操作数据进行运算。
运算器模块收到来自控制器模块的控制信号后,可以通过读取操作数地址域来判断所读取的是浮点类型的数据还是定点类型的数据。若操作数是立即数,则根据数据类型标志位判断操作数类型并计算;若操作数来自RAM或寄存器,则根据RAM地址或寄存器号来判断操作数类型,从存储模块读取操作数并进行对应的运算。
S3104:将运算结果存储于对应类型的地址内。
控制器模块向运算器模块发送IO指令,运算器模块将运算结果传输至存储模块或输入输出模块。
从上述实施例可以看出,本公开提供的复合标量指令的执行方法,能够准确高效地执行复合标量指令。其中,所提供的支持复合标量指令的装置,为复合标量指令提供了高效的执行环境;所提供的复合标量指令的执行方法,能够准确高效地执行复合标量指令。
此外，本公开还提供一种支持计数指令的计数装置和计数方法，通过将统计输入数据（待计数的数据）中满足给定条件的元素个数的算法编写成指令的形式，可以提高计算效率，以下将结合具体实施例对此进行具体阐述。
在本公开的示例性实施例中,提供了一种支持计数指令的计数装置。图32为本公开实施例计数装置的框架结构示意图。如图32所示,本公开支持计数指令的计数装置包括:存储单元、计数单元、以及寄存器单元。存储单元与计数单元连接,用于存储待计数的输入数据以及用于存储统计的输入数据中满足给定条件的元素个数(计数结果),该存储单元可以是主存;也可以是暂存型存储器,进一步的,可以是高速暂存存 储器,通过将待统计的输入数据暂存在高速暂存存储器上,使得计数指令可以灵活有效地支持不同宽度的数据,提升执行性能。
在一种实施方式中,该存储单元是高速暂存存储器,能够支持不同位宽的输入数据和/或占据不同大小存储空间的输入数据,将待计数的输入数据暂存在高速暂存存储器上,使计数过程可以灵活有效地支持不同宽度的数据。计数单元与寄存器单元连接,计数单元用于获取计数指令,根据计数指令读取寄存器单元中的输入数据的地址,然后根据输入数据的地址在存储单元中获取相应的待计数的输入数据,并对输入数据中满足给定条件的元素个数进行统计计数,得到最终计数结果并将该计数结果存储于存储单元中。寄存器单元用于存储待计数的输入数据在存储单元中存储的地址。在一种实施方式中,寄存器单元存储的地址为待计数的输入数据在高速暂存存储器上的地址。
在一些实施例中,待计数的输入数据的数据类型可以是0/1向量,也可以是数值型向量或矩阵。统计输入数据中满足给定条件的元素个数时,所统计元素要满足的条件,可以是与一给定元素相同,例如统计向量A中包含元素x的个数,x可以是数字n,n=0,1,2…,x也可以是向量m,例如m=00,01,11…。所统计元素要满足的条件,也可以是满足给定表达式,例如统计向量B中大于数值y的元素个数,其中y可以是整数n,n=0,1,2…也可以是浮点数f,f=0.5,0.6…;例如统计向量C中能够整除z的元素个数,其中z可以是整数n,n=0,1,2…。
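上述几类“给定条件”可抽象为可配置的谓词函数，示意如下（Python草图，仅为说明条件可配置这一点）：

```python
# 三类示例条件：等于给定元素 x、大于给定值 y、能被 z 整除。
equals_x    = lambda x: (lambda e: e == x)
greater_y   = lambda y: (lambda e: e > y)
divisible_z = lambda z: (lambda e: e % z == 0)

A = [0, 1, 1, 2, 3, 4, 6]
assert sum(map(equals_x(1), A)) == 2     # 向量 A 中等于 1 的元素个数
assert sum(map(greater_y(2), A)) == 3    # 大于 2 的元素个数：3, 4, 6
assert sum(map(divisible_z(3), A)) == 3  # 能被 3 整除的元素个数：0, 3, 6
```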
图33为本公开实施例计数装置中计数单元的结构示意图。如图33所示,计数单元包括输入输出模块、运算模块、累加器模块。
输入输出模块与运算模块连接,对存储单元中待计数的输入数据,每次取其中设定长度(该长度可以根据实际需求配置)的一段数据,输入到运算模块进行运算,运算模块运算完成后,输入输出模块继续取固定长度的下一段数据,直到取完待计数的输入数据的所有元素;输入输出模块将累加器模块计算得到的计数结果输出到存储单元。
运算模块与累加器模块连接,输入一段固定长度的数据,用运算模块的加法器将所述输入数据的满足给定条件的各个元素的个数相加,将 得到的结果输出到累加器模块。运算模块中还包括判断子模块,用于判断输入数据是否满足给定的条件(给定的条件可以与一给定元素相同,也可以是数值介于设定的区间内),如满足,则输出1,如不满足,则输出0,然后送入加法器中累加。
在一种实施方式中，加法器的结构可包括n层，其中：第一层有l个全加器、第二层有 $\lceil 2l/3 \rceil$ 个全加器、……第m层有 $\lceil (2/3)^{m-1}\cdot l \rceil$ 个全加器；其中，l、m、n为大于1的整数，m为大于1小于n的整数，$\lceil x \rceil$ 表示对数据x做取上整操作。下面对其具体工作过程进行描述，假设输入的数据类型为0/1向量，现要统计待计数的0/1向量中1的个数，假设一段固定长度的0/1向量长度为3l，其中l为大于1的整数。加法器第一层有l个全加器；加法器第二层有 $\lceil 2l/3 \rceil$ 个全加器，每个全加器有3个输入和2个输出，则第二层总共得到4l/3个输出；按照所述方法，各层全加器都有3个输入和2个输出，并且同一层的加法器可并行执行；计算过程中若第i位数据个数为1，则可作为最后结果的第i位输出，即为该部分0/1向量中1的个数。
图34为一具体的全加器示意图，其中加法器结构包括7层（即n为7），第一层有6个全加器，一段固定长度的0/1向量长度为18（即l为6），其中每一层的全加器可以并行，例如第3层则有 $\lceil (2/3)^{2}\cdot 6 \rceil = 3$ 个（即m为3，l为6）全加器，当输入数据为(0,1,0),(1,0,0),(1,1,0),(0,1,0),(1,0,0),(1,1,0)时，通过本公开实施例的全加器统计，结果为(001000)，即为8。使用上述加法器可以增加加法计算的并行性，有效提高运算模块的运算速度。
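上述逐层3:2压缩统计1的个数（popcount）的过程，可用如下Python草图示意；按权重分列、列内压缩的组织方式是为便于说明而做的软件化假定：

```python
# 全加器树统计 0/1 向量中 1 的个数的软件模型：columns[w] 存放所有
# 权重为 2^w 的位；同一列内按 3 个（或 2 个）一组压缩为“本位和 +
# 进位”，进位进入下一列；直至每列至多剩一位，即计数的二进制表示。
def adder_tree_popcount(bits):
    columns = {0: list(bits)}
    while any(len(c) > 1 for c in columns.values()):
        nxt = {}
        for w in sorted(columns):
            col = columns[w]
            for k in range(0, len(col), 3):
                group = col[k:k + 3]
                if len(group) == 1:
                    nxt.setdefault(w, []).extend(group)       # 不足一组，保留
                else:
                    s = sum(group)                            # 一个（半/全）加法器
                    nxt.setdefault(w, []).append(s & 1)       # 本位和，权重 2^w
                    nxt.setdefault(w + 1, []).append(s >> 1)  # 进位，权重 2^(w+1)
        columns = nxt
    return sum(col[0] << w for w, col in columns.items() if col)

vec = [0,1,0, 1,0,0, 1,1,0, 0,1,0, 1,0,0, 1,1,0]  # 正文图34实施例，18 位
assert adder_tree_popcount(vec) == 8               # 二进制 001000
```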
累加器模块又与输入输出模块连接,将运算模块输出的结果使用累加器进行累加,直到无新的输入。
计数单元为多流水级结构,其中,输入输出模块中取向量操作处于第一流水级,运算模块处于第二流水级,累加器模块处于第三流水级。这些单元处于不同的流水级,可以更加高效地实现计数指令所要求的操作。
图35为本公开实施例计数装置中计数指令的指令集格式示意图。 如图35所示,计数指令包括一操作码和一个或多个操作域,其中,操作码用于指示该指令为计数指令,计数单元通过识别该操作码可进行计数运算,操作域可包括:用于指示该计数指令中待计数的输入数据的地址信息,还可以包括判断条件的地址信息。其中,地址信息可以是立即数或寄存器号,例如,要获取一个向量时,根据寄存器号可以在相应的寄存器中获取向量起始地址和向量长度,再根据向量起始地址和向量长度在存储单元中获取相应地址存放的向量。本公开实施例采用的指令具有精简的格式,使得指令集使用方便、支持的数据长度灵活。
图36为本公开实施例计数装置中计数单元的执行过程流程图。如图36所示,工作时候,计数单元根据计数指令操作域中的地址信息在寄存器单元中获取待计数的输入数据的地址,然后,根据该地址在存储单元中获取待计数的输入数据。待计数的输入数据存储在高速暂存存储器上,每次计数单元从高速暂存存储器上获取一段固定长度的输入数据,判断子模块判断元素是否满足给定条件,然后用加法器统计该部分输入数据中满足给定条件的元素的个数,将每一段的满足给定条件的元素的个数用累加器模块进行累加,得到最终计数结果并将计数结果存储于存储单元中。
图37为本公开实施例计数装置的详细结构示意图。如图37所示,本公开支持计数指令的装置还可包括:指令存储器、指令处理单元、指令缓存单元和依赖关系处理单元。
对于指令处理单元,其用于从指令存储器中获取计数指令,并对计数指令进行处理后,提供给所述指令缓存单元和依赖关系处理单元。其中,指令处理单元包括:取指模块和译码模块。取指模块与指令存储器连接,用于从指令存储器中获取计数指令;译码模块与取指模块连接,用于对获取的计数指令进行译码。此外,指令处理单元还可以包括指令队列存储器,指令队列存储器与译码模块连接,用于对译码后的计数指令进行顺序存储,并顺序将指令发送到指令缓存单元和依赖关系处理单元。考虑到指令缓存单元和依赖关系处理单元可容纳的指令数量有限,指令队列存储器中的指令必须等到指令缓存单元和依赖关系处理单元 有空闲才可继续顺序发送。
指令缓存单元,可与指令处理单元连接,用于顺序存储待执行的计数指令。计数指令在执行过程中,同时也被缓存在指令缓存单元中,当一条指令执行完之后,将指令运行结果(计数结果)传输到指令缓存单元,如果该指令同时也是指令缓存单元中未被提交指令中最早的一条指令,则该指令将被提交,并一起将指令运行结果(计数结果)写回高速暂存存储器。在一种实施方式中,指令缓存单元可以是重排序缓存。
依赖关系处理单元,可以与指令队列存储器和计数单元连接,用于在计数单元获取计数指令前,判断该计数指令所需向量(即要被计数的向量)是否为最新,若是,直接将计数指令提供给所述计数单元;否则,将该计数指令存储在依赖关系处理单元的一存储队列中,所需向量被更新后,将存储队列中的该计数指令提供给所述计数单元。具体地,计数指令访问高速暂存存储器时,存储空间正等待之前指令的结果写入,为了保证指令执行结果的正确性,当前指令如果被检测到与之前指令的数据存在依赖关系,该指令必须在存储队列内等待至依赖关系被消除。依赖关系处理单元使指令可以实现乱序执行,顺序提交,有效减少流水线阻塞,并且可实现精确例外。
取指模块负责从指令存储器中取出下一条将要执行的指令,并将该指令传给译码模块;译码模块负责对指令进行译码,并将译码后的指令传给指令队列存储器;指令队列存储器用于缓存译码后的指令,当指令缓存单元和依赖关系处理单元有空闲之后发送指令到指令缓存单元和依赖关系处理单元;计数指令从指令队列存储器中被发送到依赖关系处理单元的过程中,计数指令从寄存器单元中读取输入数据在存储单元中的地址;依赖关系处理单元用于处理当前指令与前一条指令可能存在的数据依赖关系,计数指令会访问存储单元,此前执行的其他指令可能会访问同一块存储空间。为了保证指令执行结果的正确性,当前指令如果被检测到与之前的指令数据存在依赖关系,该指令必须在依赖关系处理单元的存储队列内等待至依赖关系被消除。计数单元从依赖关系处理单元中获取计数指令,根据计数指令在寄存器单元中读取的输入数据的地 址,在存储单元中获取相应的待计数的输入数据,并对输入数据中满足给定条件的元素的个数进行统计计数,将计数结果传输至指令缓存单元,最后计数结果和该条计数指令被写回存储单元。
图38为本公开实施例计数装置的执行过程流程图。如图38所示,执行计数指令的过程包括:
S3801,取指模块从指令存储器中取出计数指令,并将该计数指令送往译码模块。
S3802,译码模块对计数指令译码,并将计数指令送往指令队列存储器。
S3803,计数指令在指令队列存储器中等待指令缓存单元和依赖关系处理单元有空闲后,被发送到指令缓存单元和依赖关系处理单元。
S3804,计数指令从指令队列存储器中被发送到依赖关系处理单元的过程中,计数指令从寄存器单元中读取输入数据在存储单元中的存储地址,依赖关系处理单元分析该指令与前面的尚未执行结束的指令在数据上是否存在依赖关系,该条计数指令需要在依赖关系处理单元的存储队列中等待至其与前面的未执行结束的指令在数据上不再存在依赖关系为止。
S3805:依赖关系不存在后,该条计数指令被送往计数单元。计数单元根据存储地址从存储单元中获取输入数据,统计输入数据中满足给定条件的元素个数。
S3806,计数完成后,计数结果通过指令缓存单元被写回存储单元中,指令缓存单元将该条计数指令提交至存储单元中。
至此,已经结合附图对本实施例进行了详细描述。依据以上描述,本领域技术人员应当对本公开实施例支持计数指令的计数装置及其计数方法有了清楚的认识。
在一些实施例中,还公开了一种芯片,其包括了上述神经网络处理器、处理装置、计数装置或者运算装置。
在一些实施例中,还公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例中,还公开了一种板卡,其包括了上述芯片封装结构。
在一个实施例中,还公开了一种电子设备,其包括了上述板卡。
电子设备可包括但不限于机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备交通工具、家用电器、和/或医疗设备。
所述交通工具可包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
本公开所提供的实施例中,应理解到,所揭露的相关装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述部分或模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个部分或模块可以结合或者可以集成到一个系统,或一些特征可以忽略或者不执行。
本公开中,术语“和/或”可能已被使用。如本文中所使用的,术语“和/或”意指一个或其他或两者(例如,A和/或B意指A或B或者A和B两者)。
在上面的描述中,出于说明目的,阐述了众多具体细节以便提供对本公开的各实施例的全面理解。然而,对本领域技术人员将显而易见的是,没有这些具体细节中的某些也可实施一个或多个其他实施例。所描述的具体实施例不是为了限制本公开而是为了说明。本公开的范围不是由上面所提供的具体示例确定,而是仅由下面的权利要求确定。在其他情况下,以框图形式,而不是详细地示出已知的电路、结构、设备,和操作以便不至于使对描述的理解变得模糊。在认为适宜之处,附图标记或附图标记的结尾部分在诸附图当中被重复以指示可选地具有类似特性或相同特征的对应或类似的要素,除非以其他方式来指定或显而易见。
已描述了各种操作和方法。已经以流程图方式以相对基础的方式对一些方法进行了描述,但这些操作可选择地被添加至这些方法和/或从这 些方法中移去。另外,尽管流程图示出根据各示例实施例的操作的特定顺序,但可以理解,该特定顺序是示例性的。替换实施例可以可任选地以不同方式执行这些操作、组合某些操作、交错某些操作等。设备的此处所描述的组件、特征,以及特定可选细节还可以可任选地应用于此处所描述的方法,在各实施例中,这些方法可以由这样的设备执行和/或在这样的设备内执行。
本公开中各功能部分/单元/子单元/模块/子模块/部件都可以是硬件,比如该硬件可以是电路,包括数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器等等。所述计算装置中的计算模块可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如RRAM,DRAM,SRAM,EDRAM,HBM,HMC等等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
以上所述的具体实施例,对本公开的目的、技术方案和有益效果进行了进一步详细说明,应理解的是,以上所述仅为本公开的具体实施例而已,并不用于限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开的保护范围之内。

Claims (34)

  1. 一种计算位宽动态可配置的处理装置,其特征在于,包括:
    存储器,用于存储数据,所述数据包括神经网络的待运算数据、中间运算结果、最终运算结果和待缓存数据;
    数据宽度调整电路,用于调整所述待运算数据、中间运算结果、最终运算结果和/或待缓存数据的宽度;
    运算电路,用于对神经网络的待运算数据进行运算;以及
    控制电路,用于控制存储器、数据宽度调整电路和运算电路。
  2. 根据权利要求1所述的装置,其特征在于,所述运算电路用于对神经网络的待运算数据进行运算包括:根据待运算数据确定运算电路的乘法器和加法器电路的类型以进行运算。
  3. 根据权利要求1所述的装置,其特征在于,所述数据宽度调整电路包括:
    输入数据处理模块,用于将存储器的数据进行数据宽度的调整;
    输出数据处理模块,用于将运算电路运算后的数据进行宽度调整后存入存储器。
  4. 根据权利要求1或2所述的装置,其特征在于,所述存储器包括:
    输入存储模块,用于存储神经网络的待运算数据;
    输出存储模块,用于存储中间运算结果和最终运算结果;以及
    缓存模块,用于数据的缓冲存储;
    其中,所述输入存储模块包括:
    神经元存储模块,用于存储神经元参数;
    突触存储模块,用于存储突触参数;
    所述输出存储模块包括:
    中间运算结果存储子模块,用于存储中间运算结果;
    最终运算结果存储子模块,用于存储最终运算结果。
  5. 根据权利要求3所述的装置，其特征在于，所述运算电路包括多个运算模块，所述突触存储模块包括多个突触存储子模块，每个所述运算模块分别与一个或多个突触存储子模块对应连接。
  6. 根据权利要求1-5任一所述的装置,其特征在于,所述运算电路包括:
    第一运算模块,用于进行不同位宽数据的运算;所述第一运算模块包括加法器电路以及乘法器,进行神经网络中的不同位宽数据的运算。
  7. 根据权利要求6所述的装置,其特征在于,所述第一运算模块还包括位串行加法树,所述位串行加法树包括移位器、寄存器和多个加法器,每一加法器均包括a端和b端,所述位串行加法树包括x+1层结构,x为正整数,该串行加法树按如下方式运行:
    各寄存器和加法器中的进位输出端Cin被初始为0,各待运算数据的最低n位,分别输入至第一层的加法器中的a,b端,第一层的每个加法器中完成a,b端传入的待运算数据的最低n位的加法运算,确定的结果值s传向高一层的加法器a或b端,每一加法器得到的进位值Cout传回该层加法器的进位输入Cin处,待下一拍和传入的待运算的数据进行加法运算;
    上一层的加法器的操作参照前一层的加法器,将传入的数据进行加法运算,而后结果再向高一层的传递,进位传回该层的加法器,直到达到第x层,第x层的加法器将运算结果经过移位器移位,和寄存器中传来的原结果进行加法运算后保存回寄存器,而后,待下一拍选择运算数据次低的n位传入位串行加法树中完成相应的运算。
  8. 根据权利要求2所述的装置,其特征在于,所述乘法器包括位串行运算器,所述位串行运算器包括以下至少一种:
    基础乘法器,用于将乘数分为多个低位宽数据分别与被乘数相乘后累加的运算;
    稀疏乘法器,用于在乘数和/或被乘数用稀疏方式进行表示的情况下进行乘法运算。
    融合向量乘法器,用于向量间的乘法运算。
  9. 根据权利要求8所述的装置,其特征在于,所述位串行运算器包括运算部件、处理部件和存储部件,其中,
    运算部件,输入待运算数据,完成一位或多位数据的乘法和/或加法运算,输出运算结果;
    存储部件,输入运算结果进行存储;
    处理部件,输入运算结果,用于完成数据移位、根据设定规则扩大/减少数据位宽、根据某设定规则对数据的某一位或多位进行操作。
  10. 根据权利要求8所述的装置,其特征在于,所述基础乘法器为第一基础乘法器,包括:
    乘法存储单元,用于存储乘数,所述乘数位宽为N位;
    第一移位寄存器,用于每次移出乘数的低n位,进行移出操作后的乘数重新送入乘法存储单元,其中1<n≤N;
    输入选择电路,用于每次输入乘数的低n位和被乘数,乘数的低n位中每位的值分别与乘数作“与”运算,得出与运算结果;
    第二移位寄存器,用于输入与运算结果并进行移位;
    加法器,用于输入移位后的数据进行相加;
    结果寄存器,用于寄存加法器的相加结果并将所述相加结果重新输入至加法器参加下次相加。
  11. 根据权利要求8所述的装置,其特征在于,所述基础乘法器为第二基础乘法器,包括:
    乘法存储单元,用于存储乘数,所述乘数位宽为N位;
    第一移位寄存器,用于每次移出乘数的低n位,进行移出操作后的乘数重新送入乘法存储单元,其中1<n≤N;
    备份寄存器,暂存移位后的乘数;
    输入选择电路,用于每次输入乘数的低n位和被乘数,乘数的低n位中每位的值分别与乘数作“与”运算,得出与运算结果;
    第二移位寄存器,用于输入与运算结果并进行移位;
    第三移位寄存器,用于将被乘数进行移位,将乘数的低m位移除;
    加法器,用于输入移位后的数据进行相加;
    结果寄存器,用于寄存加法器的相加结果并重新输入至加法器参加下次相加。
  12. 根据权利要求10或11所述的装置,其特征在于,所述第一基础乘法器或第二基础乘法器还包括判断电路,用于判断乘法存储单元当次的乘数数值是否全为0。
  13. 根据权利要求7所述的装置,其特征在于,所述稀疏乘法器包括:
    乘法存储单元,用于存储乘数,所述乘数采用稀疏方式表示,位宽为N位;
    输入选择电路,用于每次从低位选择乘数数值为1的位;
    第一移位寄存器,用于每次移出所述位数为1以下的各低位,并重新送入乘法储存单元,作为下次的乘数;
    第二移位寄存器,用于依据所述位数为1的位进行移位操作;
    加法器,输入移位后的数据并进行相加;
    结果寄存器,寄存加法器的相加结果;
    第三移位寄存器,根据数值为1的位,对结果寄存器内结果进行移位后重新输入至加法器参与下次运算;
    判断电路,用于判断乘法存储单元当次的乘数数值是否全为0。
  14. 根据权利要求1-13任一所述的装置,其特征在于,所述数据的运算包括:点积、矩阵间乘法、加法、乘法混合加法;矩阵和向量的乘法、加法、乘法混合加法;矩阵和常数的乘法、加法、乘法混合加法;向量间的乘法、加法、乘法混合加法;向量与常数的乘法、加法、乘法混合加法;常数与常数的乘法、加法、乘法混合加法;比较选择最大/小值,以及可以拆分为乘法、加法、或乘加混合的运算。
  15. 根据权利要求8所述的装置,其特征在于,所述乘法器以及加法树采用流水线的方式并行执行。
  16. 根据权利要求11所述的装置,其特征在于,所述第一运算模块包括第二基础乘法器和位串行加法树,按照如下方式进行运算:
    设定计算向量 $\vec{A}$ 和 $\vec{B}$ 的内积值，$\vec{A}$ 和 $\vec{B}$ 的维度相同，均为(N+1)，A为被乘数，B为乘数，每次运算，A取指定的m位、B取指定的n位进行运算，其中m不大于 $\vec{A}$ 的一个维度的位宽的正整数，n不大于 $\vec{B}$ 的一个维度的位宽的正整数；
    取A 0的低m位和B 0的低n位乘法器中,将A 0的低m位和B 0的低n位做乘法运算,得到的选择的结果送入位串行加法树中进行加法运算,并将结果保存到存储器中;
    将B移位n位,和A的低m位进行乘法操作,并送入位串行加法树中进行加法运算,同时原存储单元的数据经过第三移位单元移位后一同进行加法运算,结果保存到存储单元;
    待B全部运算完毕后,A移位m位,重新依次与B的n位进行运算;
    待全部运算结束,此时存储单元中的数据即为所求的最终运算结果。
  17. 一种芯片,其特征在于,所述芯片包括权利要求1-16中任一权利要求所述的装置。
  18. 一种电子设备，其特征在于，所述电子设备包括权利要求17所述的芯片。
  19. 一种使用权利要求1-16任一所述装置的方法,其特征在于包括步骤:
    控制电路生成控制指令,传送给存储器、数据宽度调整电路和运算电路;
    存储器根据接收的控制指令,向运算电路输入神经网络的待运算数据;
    数据宽度调整电路根据接收的控制指令,调整神经网络的待运算数据的宽度;
    运算电路根据输入的待运算数据和神经网络参数以及控制指令,对不同计算位宽的神经网络的待运算数据进行运算,运算结果送回存储器。
  20. 根据权利要求19所述的方法,其特征在于,所述数据宽度调整电路包括:输入数据处理模块,用于将存储器的数据进行数据宽度的调整;输出数据处理模块,用于将运算电路运算后的数据进行宽度调整后存入存储器。
  21. 根据权利要求20所述的方法,其特征在于,调整神经网络的 待运算数据的宽度包括以下至少一种方式:
    在不损失精度的情况下,对数据位宽进行增加或减少或保持不变;
    在可设定精度损失的情况下,对数据位宽进行增加或减少或保持不变;
    根据指定的变换或运算要求,对数据位宽进行增加或减少或保持不变。
  22. 根据权利要求19或20所述的方法,其特征在于,所述存储器包括:
    输入存储模块:用于存储神经网络的待运算数据;
    输出存储模块:用于存储中间运算结果和最终运算结果;
    缓存模块:用于数据的缓冲存储;
    所述输入存储模块包括:
    神经元存储模块:用于存储神经元参数;
    突触存储模块：用于存储突触参数；
    所述输出存储模块包括：
    中间运算结果存储子模块：用于存储中间运算结果；
    最终运算结果存储子模块：用于存储最终运算结果。
  23. 根据权利要求19所述的方法,其特征在于,还包括设置多个运算模块,分别与一个或多个突触模块对应,在运算时,输入存储模块向所有的运算模块传递输入数据,突触存储模块向对应的运算模块传递突触数据,运算模块进行运算后,将结果写入输出存储模块。
  24. 根据权利要求19-23任一所述的方法,其特征在于,还包括:
    采用第一运算模块进行不同位宽数据的运算,包括:采用加法器以及乘法器进行加速神经网络中的不同位宽数据的运算。
  25. 根据权利要求24所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括采用位串行加法树进行不同位宽数据的运算,运算方式如下:
    设定具有M个待运算的数据，最大位宽为N，其中M，N均为正整数，若不足N位的数据，采用合理的方式将其位数补至N位；位串行加法树包括x+1层，其中，x为正整数，第1层到第x层中的加法器完成n（n≥1）位数字的加法运算，第x+1层中的加法器完成不小于N位的数字的加法运算；首先，将寄存器、各加法器中的进位输出端Cin初始为0，取各待运算数据的最低n位，分别输入至第一层的加法器中的a，b端，每个加法器中完成a，b端传入的待运算数据的最低n位的加法运算，得到的结果值s传向高一层的加法器a或b端，得到的进位值Cout传回该层加法器的进位输入Cin处，待下一拍和传入的待运算的数据进行加法运算；
    上一层的加法器的操作类似,将传入的数据加法运算,而后结果再向高一层的传递,进位传回该层的加法器,直到达到第x层,第x层的加法器将运算结果经过移位,和寄存器中传来的原结果进行加法运算后保存回寄存器,而后,待运算数据选择次低的n位传入位串行加法树中完成相应的运算。
  26. 根据权利要求25所述的方法,其特征在于:所述采用位串行加法树进行不同位宽数据的运算时还包括:在第一层加法器运算完毕后,输入第二批待运算的n位数据。
  27. 根据权利要求24所述的方法,其特征在于:所述采用位串行加法树进行不同位宽数据的运算时还包括:当所述加法器在输入给该加法器的待运算的数据的a,b端及进位输入Cin端全部为0的情况下,在该次运算过程中关闭。
  28. 根据权利要求19所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括:采用位串行运算器进行运算,包括如下操作:
    使用运算部件输入待运算数据,完成一位或多位数据的乘法和/或加法运算,输出运算结果;
    采用存储部件输入运算结果进行存储;
    采用处理部件输入运算结果,用于完成数据移位、根据设定规则扩大或减少数据位宽、根据某设定规则对数据的某一位或多位进行操作。
  29. 根据权利要求19所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括:采用第一基础乘法器进行不同位宽数据的运算,包括如下操作:
    采用乘法存储单元存储乘数,所述乘数位宽为N位;
    采用第一移位寄存器,每次移出乘数的低n位,将进行移出操作后的乘数重新送入乘法存储单元,其中1<n≤N;
    采用输入选择电路,每次输入乘数的低n位和被乘数,乘数的低n位中每位的值分别与乘数作“与”运算,得出与运算结果;
    采用第二移位寄存器,输入与运算结果并进行移位;
    采用加法器,输入移位后的数据进行相加;
    采用结果寄存器,寄存加法器的相加结果并将相加结果重新输入至加法器参加下次相加。
  30. 根据权利要求19所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括:采用第二基础乘法器进行不同位宽数据的运算,包括如下操作:
    采用乘法存储单元,存储乘数,所述乘数位宽为N位;
    采用第一移位寄存器,每次移出乘数的低n位,将进行移出操作后的乘数重新送入乘法存储单元,其中1<n≤N;
    采用备份寄存器,暂存移位后的乘数;
    采用输入选择电路,每次输入乘数的低n位和被乘数,乘数的低n位中每位的值分别与乘数作“与”运算,得出与运算结果;
    采用第二移位寄存器,输入与运算结果并进行移位;
    采用第三移位寄存器,将被乘数进行移位,将低m位移除;
    采用加法器,输入移位后的数据进行相加;
    采用结果寄存器,寄存加法器的相加结果并将相加结果重新输入至加法器参加下次相加。
  31. 根据权利要求19所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括:采用稀疏乘法器进行不同位宽数据的运算,包括如下操作:
    采用乘法存储单元,存储乘数,所述乘数采用稀疏方式表示,位宽为N位;
    采用输入选择电路,每次从低位选择乘数数值为1的位;
    采用第一移位寄存器,每次移出所述位数为1以下的各低位,并重新送入乘法储存单元,作为下次的乘数;
    采用第二移位寄存器,依据所述位数为1的位进行移位操作;
    采用加法器,输入移位后的数据并进行相加;
    采用结果寄存器,寄存加法器的相加结果;
    采用第三移位寄存器,根据数值为1的位,对结果寄存器内结果进行移位后重新输入至加法器参与下次运算。
  32. 根据权利要求19所述的方法,其特征在于,所述对不同计算位宽的神经网络的待运算数据进行运算包括:采用第二基础乘法器和位串行加法树,按照如下方式进行操作:
    设定计算向量 $\vec{A}$ 和 $\vec{B}$ 的内积值，$\vec{A}$ 和 $\vec{B}$ 的维度相同，均为(N+1)，A为被乘数，B为乘数，每次运算，A取指定的m位、B取指定的n位进行运算，其中m不大于 $\vec{A}$ 的一个维度的位宽的正整数，n不大于 $\vec{B}$ 的一个维度的位宽的正整数；
    取A 0的低m位和B 0的低n位乘法器中,将A 0的低m位和B 0的低n位做乘法运算,得到的选择的结果送入位串行加法树中进行加法运算,并将结果保存到存储单元中;
    将B移位n位,和A的低m位进行乘法操作,并送入位串行加法树中进行加法运算,同时原存储单元的数据经过第三移位单元移位后一同进行加法运算,结果保存到存储单元;
    待B全部运算完毕后,A移位m位,重新依次与B的n位进行运算;
    待全部运算结束,此时存储单元中的数据即为所求的最终运算结果。
  33. 根据权利要求19-32任一权利要求所述的方法,所述对不同计算位宽的神经网络的待运算数据进行运算包括:通过所述运算电路进行全连接层和/或池化层的运算。
  34. 根据权利要求19所述的方法,其特征在于,还包括:所述运算电路根据接收的控制指令,选择第一运算模块中的对应类型的乘法器和加法器电路。
PCT/CN2018/083415 2017-04-19 2018-04-17 处理装置和处理方法 WO2018192500A1 (zh)

Priority Applications (13)

Application Number Priority Date Filing Date Title
US16/476,262 US11531540B2 (en) 2017-04-19 2018-04-17 Processing apparatus and processing method with dynamically configurable operation bit width
EP18788355.8A EP3614259A4 (en) 2017-04-19 2018-04-17 TREATMENT APPARATUS AND TREATMENT METHOD
KR1020197025307A KR102292349B1 (ko) 2017-04-19 2018-04-17 처리 장치 및 처리 방법
CN201880000923.3A CN109121435A (zh) 2017-04-19 2018-04-17 处理装置和处理方法
EP19214371.7A EP3786786B1 (en) 2017-04-19 2018-04-17 Processing device, processing method, chip, and electronic apparatus
EP19214320.4A EP3654172A1 (en) 2017-04-19 2018-04-17 Fused vector multiplier and method using the same
JP2019549467A JP6865847B2 (ja) 2017-04-19 2018-04-17 処理装置、チップ、電子設備及び方法
KR1020197038135A KR102258414B1 (ko) 2017-04-19 2018-04-17 처리 장치 및 처리 방법
US16/697,603 US11507350B2 (en) 2017-04-21 2019-11-27 Processing apparatus and processing method
US16/697,727 US11698786B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method
US16/697,637 US11720353B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method
US16/697,687 US11734002B2 (en) 2017-04-19 2019-11-27 Counting elements in neural network input data
US16/697,533 US11531541B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201710256445.XA CN108733412B (zh) 2017-04-19 2017-04-19 一种运算装置和方法
CN201710256445.X 2017-04-19
CN201710269106.5 2017-04-21
CN201710264686.9 2017-04-21
CN201710269049.0A CN108734288B (zh) 2017-04-21 2017-04-21 一种运算方法及装置
CN201710264686.9A CN108733408A (zh) 2017-04-21 2017-04-21 计数装置及计数方法
CN201710269049.0 2017-04-21
CN201710269106.5A CN108734281A (zh) 2017-04-21 2017-04-21 处理装置、处理方法、芯片及电子装置

Related Child Applications (6)

Application Number Title Priority Date Filing Date
US16/476,262 A-371-Of-International US11531540B2 (en) 2017-04-19 2018-04-17 Processing apparatus and processing method with dynamically configurable operation bit width
US16/697,637 Continuation US11720353B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method
US16/697,603 Continuation US11507350B2 (en) 2017-04-21 2019-11-27 Processing apparatus and processing method
US16/697,727 Continuation US11698786B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method
US16/697,533 Continuation US11531541B2 (en) 2017-04-19 2019-11-27 Processing apparatus and processing method
US16/697,687 Continuation US11734002B2 (en) 2017-04-19 2019-11-27 Counting elements in neural network input data

Publications (1)

Publication Number Publication Date
WO2018192500A1 true WO2018192500A1 (zh) 2018-10-25

Family

ID=63856461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/083415 WO2018192500A1 (zh) 2017-04-19 2018-04-17 处理装置和处理方法

Country Status (6)

Country Link
US (5) US11531540B2 (zh)
EP (3) EP3614259A4 (zh)
JP (2) JP6865847B2 (zh)
KR (2) KR102292349B1 (zh)
CN (1) CN109121435A (zh)
WO (1) WO2018192500A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750232A (zh) * 2019-10-17 2020-02-04 电子科技大学 一种基于sram的并行乘加装置
CN111352656A (zh) * 2018-12-24 2020-06-30 三星电子株式会社 使用按位运算的神经网络设备和方法
JP2022518640A (ja) * 2019-12-27 2022-03-16 北京市商▲湯▼科技▲開▼▲發▼有限公司 データ処理方法、装置、機器、記憶媒体及びプログラム製品
US20220214875A1 (en) * 2018-08-10 2022-07-07 Cambricon Technologies Corporation Limited Model conversion method, device, computer equipment, and storage medium

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10554382B2 (en) * 2017-06-27 2020-02-04 Amazon Technologies, Inc. Secure models for IoT devices
US11350360B2 (en) 2017-06-27 2022-05-31 Amazon Technologies, Inc. Generating adaptive models for IoT networks
CN107807819B (zh) * 2017-07-20 2021-06-25 上海寒武纪信息科技有限公司 一种支持离散数据表示的用于执行人工神经网络正向运算的装置及方法
CN108228696B (zh) * 2017-08-31 2021-03-23 深圳市商汤科技有限公司 人脸图像检索方法和系统、拍摄装置、计算机存储介质
US11275713B2 (en) * 2018-06-09 2022-03-15 International Business Machines Corporation Bit-serial linear algebra processor
US11520561B1 (en) * 2018-11-28 2022-12-06 Amazon Technologies, Inc. Neural network accelerator with compact instruct set
CN112085176B (zh) * 2019-06-12 2024-04-12 安徽寒武纪信息科技有限公司 数据处理方法、装置、计算机设备和存储介质
US11675676B2 (en) * 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
CN110766155A (zh) * 2019-09-27 2020-02-07 东南大学 一种基于混合精度存储的深度神经网络加速器
CN110909869B (zh) * 2019-11-21 2022-08-23 浙江大学 一种基于脉冲神经网络的类脑计算芯片
CN110991633B (zh) * 2019-12-04 2022-11-08 电子科技大学 一种基于忆阻网络的残差神经网络模型及其应用方法
CN111105581B (zh) * 2019-12-20 2022-03-15 上海寒武纪信息科技有限公司 智能预警方法及相关产品
CN111176582A (zh) * 2019-12-31 2020-05-19 北京百度网讯科技有限公司 矩阵存储方法、矩阵访问方法、装置和电子设备
US20210241080A1 (en) * 2020-02-05 2021-08-05 Macronix International Co., Ltd. Artificial intelligence accelerator and operation thereof
US11593628B2 (en) * 2020-03-05 2023-02-28 Apple Inc. Dynamic variable bit width neural processor
KR102414582B1 (ko) * 2020-04-29 2022-06-28 한국항공대학교산학협력단 신경망 모델의 추론 속도 향상 장치 및 방법
KR20230010669A (ko) * 2020-05-14 2023-01-19 더 가버닝 카운슬 오브 더 유니버시티 오브 토론토 심층 학습 네트워크를 위한 메모리 압축 시스템 및 방법
US11783163B2 (en) * 2020-06-15 2023-10-10 Arm Limited Hardware accelerator for IM2COL operation
CN111930671B (zh) * 2020-08-10 2024-05-14 中国科学院计算技术研究所 异构智能处理器、处理方法及电子设备
US11427290B2 (en) 2020-08-31 2022-08-30 Mike Scheck Anchor rescue system
CN112183732A (zh) * 2020-10-22 2021-01-05 中国人民解放军国防科技大学 卷积神经网络加速方法、装置和计算机设备
CN112099898B (zh) * 2020-11-06 2021-02-09 广州市玄武无线科技股份有限公司 一种基于Web前端的表格处理系统及方法
CN112765936B (zh) * 2020-12-31 2024-02-23 出门问问(武汉)信息科技有限公司 一种基于语言模型进行运算的训练方法及装置
CN113434113B (zh) * 2021-06-24 2022-03-11 上海安路信息科技股份有限公司 基于静态配置数字电路的浮点数乘累加控制方法及系统
CN113642724B (zh) * 2021-08-11 2023-08-01 西安微电子技术研究所 一种高带宽存储的cnn加速器
KR102395744B1 (ko) * 2021-09-16 2022-05-09 오픈엣지테크놀로지 주식회사 데이터 스케일을 고려한 덧셈 연산 방법 및 이를 위한 하드웨어 가속기, 이를 이용한 컴퓨팅 장치
KR102442577B1 (ko) * 2022-03-08 2022-09-13 주식회사 마키나락스 개발환경을 제공하는 방법
KR20230132343A (ko) * 2022-03-08 2023-09-15 주식회사 마키나락스 개발환경을 제공하는 방법
US20240080423A1 (en) * 2022-09-02 2024-03-07 Samsung Electronics Co., Ltd. Fusion techniques for combining most significant bits and least significant bits of image data in image processing or other applications

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101309430A (zh) * 2008-06-26 2008-11-19 天津市亚安科技电子有限公司 基于fpga的视频图像预处理器
CN101359453A (zh) * 2007-07-31 2009-02-04 奇美电子股份有限公司 数据处理装置与其数据处理方法
CN102750127A (zh) * 2012-06-12 2012-10-24 清华大学 一种协处理器
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 可动态重构的多级并行单指令多数据阵列处理系统
US20140081893A1 (en) * 2011-05-31 2014-03-20 International Business Machines Corporation Structural plasticity in spiking neural networks with symmetric dual of an electronic neuron
CN106066783A (zh) * 2016-06-02 2016-11-02 华为技术有限公司 基于幂次权重量化的神经网络前向运算硬件结构

Family Cites Families (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2639461A1 (fr) * 1988-11-18 1990-05-25 Labo Electronique Physique Two-dimensional arrangement of memory points and neural network structure using such an arrangement
DE69031842T2 (de) * 1989-02-20 1998-04-16 Fujitsu Ltd Learning system and learning method for a data processing device
JPH02287862A (ja) * 1989-04-28 1990-11-27 Toshiba Corp Neural network arithmetic device
US5086479A (en) * 1989-06-30 1992-02-04 Hitachi, Ltd. Information processing system using neural network learning function
EP0813143A3 (en) 1989-11-13 1998-01-28 Harris Corporation Sign extension in plural-bit recoding multiplier
JPH0820942B2 (ja) * 1991-09-26 1996-03-04 インターナショナル・ビジネス・マシーンズ・コーポレイション High-speed multiplier
JPH0652132A (ja) * 1992-07-28 1994-02-25 Mitsubishi Electric Corp Parallel-operation semiconductor integrated circuit device and system using the same
JPH06139217A (ja) * 1992-10-29 1994-05-20 Hitachi Ltd High-precision arithmetic processing device and method
US6601051B1 (en) * 1993-08-09 2003-07-29 Maryland Technology Corporation Neural systems with range reducers and/or extenders
US5630024A (en) * 1994-01-19 1997-05-13 Nippon Telegraph And Telephone Corporation Method and apparatus for processing using neural network with reduced calculation amount
JPH0973440A (ja) * 1995-09-06 1997-03-18 Fujitsu Ltd Time-series trend estimation system and method using a column-structured recurrent neural network
US6049793A (en) * 1996-11-15 2000-04-11 Tomita; Kenichi System for building an artificial neural network
US6718457B2 (en) * 1998-12-03 2004-04-06 Sun Microsystems, Inc. Multiple-thread processor for threaded software applications
JP2001117900A (ja) * 1999-10-19 2001-04-27 Fuji Xerox Co Ltd Neural network arithmetic device
KR20030009682A (ko) * 2001-07-23 2003-02-05 엘지전자 주식회사 Method for implementing a neural network algorithm for extracting addition-sharing information in adder-based distributed arithmetic
WO2005050396A2 (en) * 2003-11-18 2005-06-02 Citigroup Global Markets, Inc. Method and system for artificial neural networks to predict price movements in the financial markets
WO2005109221A2 (en) * 2004-05-03 2005-11-17 Silicon Optix A bit serial processing element for a simd array processor
US7398347B1 (en) * 2004-07-14 2008-07-08 Altera Corporation Methods and apparatus for dynamic instruction controlled reconfigurable register file
WO2006054861A1 (en) * 2004-11-16 2006-05-26 Samsung Electronics Co., Ltd. Apparatus and method for processing digital signal in an ofdma wireless communication system
US7428521B2 (en) * 2005-06-29 2008-09-23 Microsoft Corporation Precomputation of context-sensitive policies for automated inquiry and action under uncertainty
US8543343B2 (en) * 2005-12-21 2013-09-24 Sterling Planet, Inc. Method and apparatus for determining energy savings by using a baseline energy use model that incorporates an artificial intelligence algorithm
US7881889B2 (en) * 2005-12-21 2011-02-01 Barclay Kenneth B Method and apparatus for determining energy savings by using a baseline energy use model that incorporates an artificial intelligence algorithm
US7451122B2 (en) * 2006-03-29 2008-11-11 Honeywell International Inc. Empirical design of experiments using neural network models
GB2447428A (en) * 2007-03-15 2008-09-17 Linear Algebra Technologies Lt Processor having a trivial operand register
CN100492415C (zh) 2007-04-20 2009-05-27 哈尔滨工程大学 Diesel engine operating data recording method
US8055886B2 (en) * 2007-07-12 2011-11-08 Texas Instruments Incorporated Processor micro-architecture for compute, save or restore multiple registers and responsive to first instruction for repeated issue of second instruction
US7694112B2 (en) * 2008-01-31 2010-04-06 International Business Machines Corporation Multiplexing output from second execution unit add/saturation processing portion of wider width intermediate result of first primitive execution unit for compound computation
CN101527010B (zh) 2008-03-06 2011-12-07 上海理工大学 Hardware implementation method and system for artificial neural network algorithms
US8521801B2 (en) 2008-04-28 2013-08-27 Altera Corporation Configurable hybrid adder circuitry
CN101685388B (zh) 2008-09-28 2013-08-07 北京大学深圳研究生院 Method and apparatus for performing comparison operations
CN101599828A (zh) * 2009-06-17 2009-12-09 刘霁中 Efficient RSA encryption/decryption method and coprocessor therefor
US8468191B2 (en) * 2009-09-02 2013-06-18 Advanced Micro Devices, Inc. Method and system for multi-precision computation
KR101303591B1 (ko) * 2011-10-07 2013-09-11 전자부품연구원 Integrated support vector machine circuit device
US20140108480A1 (en) 2011-12-22 2014-04-17 Elmoustapha Ould-Ahmed-Vall Apparatus and method for vector compute and accumulate
CN103699360B (zh) 2012-09-27 2016-09-21 北京中科晶上科技有限公司 Vector processor and method for accessing and exchanging vector data therewith
US9563401B2 (en) 2012-12-07 2017-02-07 Wave Computing, Inc. Extensible iterative multiplier
US9110657B2 (en) 2013-01-21 2015-08-18 Tom Yap Flowchart compiler for a compound complex instruction set computer (CCISC) processor architecture
US9189200B1 (en) * 2013-03-14 2015-11-17 Altera Corporation Multiple-precision processing block in a programmable integrated circuit device
US9558743B2 (en) * 2013-03-15 2017-01-31 Google Inc. Integration of semantic context information
US9037945B2 (en) * 2013-03-26 2015-05-19 Seagate Technology Llc Generating partially sparse generator matrix for a quasi-cyclic low-density parity-check encoder
CN203299808U (zh) * 2013-04-16 2013-11-20 西华大学 Bit-serial adder
JP6042274B2 (ja) * 2013-06-28 2016-12-14 株式会社デンソーアイティーラボラトリ Neural network optimization method, neural network optimization device, and program
KR20150016089A (ko) * 2013-08-02 2015-02-11 안병익 Neural network computing device, system, and method therefor
US9495155B2 (en) 2013-08-06 2016-11-15 Intel Corporation Methods, apparatus, instructions and logic to provide population count functionality for genome sequencing and alignment
US9513907B2 (en) 2013-08-06 2016-12-06 Intel Corporation Methods, apparatus, instructions and logic to provide vector population count functionality
US10068170B2 (en) * 2013-09-23 2018-09-04 Oracle International Corporation Minimizing global error in an artificial neural network
US10373047B2 (en) * 2014-02-28 2019-08-06 Educational Testing Service Deep convolutional neural networks for automated scoring of constructed responses
CN105207794B (zh) 2014-06-05 2019-11-05 南京中兴软件有限责任公司 Statistical counting device, implementation method thereof, and system having the statistical counting device
CN104699458A (zh) 2015-03-30 2015-06-10 哈尔滨工业大学 Fixed-point vector processor and vector data memory-access control method thereof
US10262259B2 (en) 2015-05-08 2019-04-16 Qualcomm Incorporated Bit width selection for fixed point neural networks
CN105005911B (zh) 2015-06-26 2017-09-19 深圳市腾讯计算机系统有限公司 Operation system and operation method for deep neural networks
KR101778679B1 (ko) * 2015-10-02 2017-09-14 네이버 주식회사 Method and system for automatically classifying, using deep learning, data represented by a plurality of factors whose values are sequences of text words and symbols
US10275393B2 (en) * 2015-10-08 2019-04-30 Via Alliance Semiconductor Co., Ltd. Tri-configuration neural network unit
CN106484362B (zh) 2015-10-08 2020-06-12 上海兆芯集成电路有限公司 Device for performing user-specified two-dimensional fixed-point arithmetic operations
CN105426160B (zh) 2015-11-10 2018-02-23 北京时代民芯科技有限公司 Instruction classification multi-issue method based on the SPRAC V8 instruction set
CN105512724B (zh) * 2015-12-01 2017-05-10 中国科学院计算技术研究所 Adder device, data accumulation method, and data processing device
CN105913118B (zh) * 2015-12-09 2019-06-04 上海大学 Artificial neural network hardware implementation device based on stochastic computing
US10757043B2 (en) * 2015-12-21 2020-08-25 Google Llc Automatic suggestions and other content for messaging applications
CN107578099B (zh) * 2016-01-20 2021-06-11 中科寒武纪科技股份有限公司 Computing device and method
US10387771B2 (en) * 2016-05-26 2019-08-20 The Governing Council Of The University Of Toronto Accelerator for deep neural networks
US11295203B2 (en) * 2016-07-27 2022-04-05 International Business Machines Corporation Optimizing neuron placement in a neuromorphic system
CN106484366B (zh) * 2016-10-17 2018-12-14 东南大学 Variable-bit-width modular multiplier over binary fields
CN106447034B (zh) 2016-10-27 2019-07-30 中国科学院计算技术研究所 Neural network processor based on data compression, design method, and chip
US11003985B2 (en) * 2016-11-07 2021-05-11 Electronics And Telecommunications Research Institute Convolutional neural network system and operation method thereof
US10083162B2 (en) * 2016-11-28 2018-09-25 Microsoft Technology Licensing, Llc Constructing a narrative based on a collection of images
US10546575B2 (en) * 2016-12-14 2020-01-28 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
US10249292B2 (en) * 2016-12-14 2019-04-02 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
US10691996B2 (en) * 2016-12-15 2020-06-23 Beijing Deephi Intelligent Technology Co., Ltd. Hardware accelerator for compressed LSTM
US11354565B2 (en) * 2017-03-15 2022-06-07 Salesforce.Com, Inc. Probability-based guider
US20180314963A1 (en) * 2017-04-19 2018-11-01 AIBrain Corporation Domain-independent and scalable automated planning system using deep neural networks
US20180314942A1 (en) * 2017-04-19 2018-11-01 AIBrain Corporation Scalable framework for autonomous artificial intelligence characters
SG11201810989VA (en) * 2017-04-27 2019-01-30 Beijing Didi Infinity Technology & Development Co Ltd Systems and methods for route planning
US11170287B2 (en) * 2017-10-27 2021-11-09 Salesforce.Com, Inc. Generating dual sequence inferences using a neural network model
CN109117184A (zh) * 2017-10-30 2019-01-01 上海寒武纪信息科技有限公司 Artificial intelligence processor and method of executing a plane rotation instruction using the processor
US10599391B2 (en) * 2017-11-06 2020-03-24 Google Llc Parsing electronic conversations for presentation in an alternative interface
US10365340B1 (en) * 2018-03-01 2019-07-30 Siemens Medical Solutions Usa, Inc. Monitoring dynamics of patient brain state during neurosurgical procedures
US10497366B2 (en) * 2018-03-23 2019-12-03 Servicenow, Inc. Hybrid learning system for natural language understanding
US11526728B2 (en) * 2018-04-09 2022-12-13 Microsoft Technology Licensing, Llc Deep learning model scheduling
CN109829451B (zh) * 2019-03-22 2021-08-24 京东方科技集团股份有限公司 Biological motion recognition method, apparatus, server, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359453A (zh) * 2007-07-31 2009-02-04 奇美电子股份有限公司 Data processing device and data processing method thereof
CN101309430A (zh) * 2008-06-26 2008-11-19 天津市亚安科技电子有限公司 FPGA-based video image preprocessor
US20140081893A1 (en) * 2011-05-31 2014-03-20 International Business Machines Corporation Structural plasticity in spiking neural networks with symmetric dual of an electronic neuron
CN102750127A (zh) * 2012-06-12 2012-10-24 清华大学 Coprocessor
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN106066783A (zh) * 2016-06-02 2016-11-02 华为技术有限公司 Neural network forward-operation hardware structure based on power-of-two weight quantization

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220214875A1 (en) * 2018-08-10 2022-07-07 Cambricon Technologies Corporation Limited Model conversion method, device, computer equipment, and storage medium
US11853760B2 (en) * 2018-08-10 2023-12-26 Cambricon Technologies Corporation Limited Model conversion method, device, computer equipment, and storage medium
CN111352656A (zh) * 2018-12-24 2020-06-30 三星电子株式会社 Neural network device and method using bitwise operations
CN110750232A (zh) * 2019-10-17 2020-02-04 电子科技大学 SRAM-based parallel multiply-accumulate device
JP2022518640A (ja) * 2019-12-27 2022-03-16 北京市商湯科技開發有限公司 Data processing method, apparatus, device, storage medium, and program product

Also Published As

Publication number Publication date
US11531540B2 (en) 2022-12-20
KR102258414B1 (ko) 2021-05-28
EP3654172A1 (en) 2020-05-20
EP3786786C0 (en) 2023-06-07
US20200097792A1 (en) 2020-03-26
EP3786786A1 (en) 2021-03-03
JP2020518042A (ja) 2020-06-18
EP3786786B1 (en) 2023-06-07
KR102292349B1 (ko) 2021-08-20
US11531541B2 (en) 2022-12-20
US20200050918A1 (en) 2020-02-13
US20200117976A1 (en) 2020-04-16
US11734002B2 (en) 2023-08-22
JP2020074099A (ja) 2020-05-14
US20200097794A1 (en) 2020-03-26
KR20190139837A (ko) 2019-12-18
US20200097795A1 (en) 2020-03-26
EP3614259A4 (en) 2021-02-24
JP6821002B2 (ja) 2021-01-27
EP3614259A1 (en) 2020-02-26
CN109121435A (zh) 2019-01-01
US11698786B2 (en) 2023-07-11
KR20200000480A (ko) 2020-01-02
US11720353B2 (en) 2023-08-08
JP6865847B2 (ja) 2021-04-28

Similar Documents

Publication Publication Date Title
WO2018192500A1 (zh) Processing apparatus and processing method
CN109219821B (zh) Operation device and method
CN109117948B (zh) Painting style conversion method and related products
US11442786B2 (en) Computation method and product thereof
CN108733348B (zh) Fused vector multiplier and method for performing operations using the same
CN110163357B (zh) Computing device and method
KR102252137B1 (ko) Computing device and method
WO2017177446A1 (zh) Artificial neural network reverse-training device and method supporting discrete data representation
CN111178492B (zh) Computing device, related products, and computing method for executing an artificial neural network model

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18788355
    Country of ref document: EP
    Kind code of ref document: A1

ENP Entry into the national phase
    Ref document number: 20197025307
    Country of ref document: KR
    Kind code of ref document: A

ENP Entry into the national phase
    Ref document number: 2019549467
    Country of ref document: JP
    Kind code of ref document: A

NENP Non-entry into the national phase
    Ref country code: DE

ENP Entry into the national phase
    Ref document number: 2018788355
    Country of ref document: EP
    Effective date: 20191119