WO2019218896A1 - Computing method and related products - Google Patents

Computing method and related products

Info

Publication number
WO2019218896A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
precision
data
calculation
layer
Prior art date
Application number
PCT/CN2019/085844
Other languages
English (en)
French (fr)
Inventor
刘少礼
罗宇哲
孟小甫
张曦珊
宋新开
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201810479540.0A external-priority patent/CN110503179B/zh
Priority claimed from CN201811040961.XA external-priority patent/CN110880037A/zh
Priority claimed from CN201811041573.3A external-priority patent/CN110880033A/zh
Priority claimed from CN201811592249.0A external-priority patent/CN111368987B/zh
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to EP19803375.5A priority Critical patent/EP3624020A4/en
Publication of WO2019218896A1 publication Critical patent/WO2019218896A1/zh
Priority to US16/718,742 priority patent/US11409575B2/en
Priority to US16/720,145 priority patent/US11442785B2/en
Priority to US16/720,171 priority patent/US11442786B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30109Register structure having multiple operands in a single register
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of neural networks, and in particular, to a calculation method and related products.
  • a neural network is an operational model consisting of a large number of nodes (or neurons) connected to each other. Each node represents a specific output function, called an activation function. Each connection between two nodes carries a weighting value for the signal passing through it, called a weight, which serves as the memory of the artificial neural network.
  • the output of the network varies with the network's connection topology, the weight values, and the activation function.
  • the network itself is usually an approximation of some algorithm or function found in nature, or it may be an expression of a logic strategy.
  • the calculation methods of the neural network include but are not limited to: addition, multiplication, activation, and the like.
  • existing neural network calculation methods cannot compute neural network data quickly, which limits the operation speed.
  • the present application provides a calculation method and related products, which have the advantage of improving the operation speed of an existing integrated circuit chip.
  • a computing method is provided, the computing method being applied to a computing system, the computing system comprising: a control unit, a computing group, and a total storage unit; the control unit comprises: a first memory, a decoding logic, and a controller;
  • the computing group includes: a group controller and a plurality of computing units; the total storage unit is configured to store data; and the computing method includes the following steps:
  • the controller receives a first level instruction sequence, and the decoding logic splits the first level instruction sequence into a plurality of second level instruction sequences.
  • the controller opens M threads for the plurality of second-level instruction sequences, and the controller allocates an independent register for each of the M threads and configures an independent addressing function, where M is an integer greater than or equal to 1;
  • the group controller acquires a plurality of calculation types of the plurality of second-level instruction sequences and obtains a fusion calculation manner corresponding to those calculation types, and the plurality of computing units invoke the M threads using the fusion calculation manner to perform calculations on the plurality of second-level instruction sequences to obtain a final result.
  • the group controller acquiring the plurality of calculation types of the plurality of second-level instruction sequences, obtaining the fusion calculation manner corresponding to those calculation types, and the plurality of computing units invoking the M threads using the fusion calculation manner to perform calculations on the plurality of second-level instruction sequences to obtain the final result includes:
  • the group controller invokes a fusion calculation manner that combines same-type single instruction multiple data (SIMD) with single instruction multiple threads (SIMT), and uses the M threads to perform the calculation to obtain the final result, specifically including:
  • Decoding logic splits M threads into N thread groups and allocates them to a plurality of computing units
  • the group controller converts the plurality of second instruction sequences into a plurality of second control signals and sends them to multiple computing units
  • each computing unit invokes its allocated thread group and second control signal to extract the corresponding data according to the independent addressing function; the plurality of computing units perform operations on the data to obtain a plurality of intermediate results, and the plurality of intermediate results are spliced to obtain the final result.
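A minimal Python sketch of the fused SIMD/SIMT flow described above, under the assumption that each second-level instruction sequence reduces to an element-wise operation over a slice of memory; the names ThreadGroup, compute_unit, and fused_simd_simt are illustrative and not taken from the patent.

```python
# Illustrative sketch of the fused SIMD/SIMT calculation flow (not the patented implementation).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ThreadGroup:
    thread_ids: List[int]          # threads assigned to this group
    base_address: int              # independent addressing: where this group's slice starts

def split_threads(m_threads: int, n_groups: int) -> List[ThreadGroup]:
    """Split M threads into N thread groups (decoding logic step)."""
    ids = list(range(m_threads))
    size = max(1, m_threads // n_groups)
    return [ThreadGroup(ids[i:i + size], base_address=i) for i in range(0, m_threads, size)]

def compute_unit(group: ThreadGroup, control_signal: Callable[[float], float],
                 memory: List[float]) -> List[float]:
    """Each computing unit extracts its own data slice and applies the control signal (SIMD)."""
    data = memory[group.base_address: group.base_address + len(group.thread_ids)]
    return [control_signal(x) for x in data]

def fused_simd_simt(memory: List[float], m_threads: int, n_groups: int,
                    control_signal: Callable[[float], float]) -> List[float]:
    groups = split_threads(m_threads, n_groups)
    intermediate = [compute_unit(g, control_signal, memory) for g in groups]  # groups run independently (SIMT)
    return [x for part in intermediate for x in part]                         # splice intermediate results

print(fused_simd_simt(memory=[1.0, 2.0, 3.0, 4.0], m_threads=4, n_groups=2,
                      control_signal=lambda x: x * 2))   # [2.0, 4.0, 6.0, 8.0]
```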
  • the group controller acquiring the plurality of calculation types of the plurality of second-level instruction sequences, obtaining the fusion calculation manner corresponding to those calculation types, and the plurality of computing units invoking the M threads using the fusion calculation manner to perform calculations on the plurality of second-level instruction sequences to obtain the final result includes:
  • the group controller invokes synchronous multithreading (SMT) and the M threads perform the calculation to obtain the final result, specifically including:
  • the decoding logic splits the M threads into N thread groups and converts the plurality of second-level instruction sequences into a plurality of second control signals; the group controller acquires the calculation types supported by the plurality of computing units, and the controller allocates each thread group and second control signal to the computing unit whose supported calculation type matches that thread group and second control signal; the plurality of computing units invoke the allocated thread groups and second control signals, extract the corresponding data, and perform operations on the data to obtain a plurality of intermediate results, and all intermediate results are spliced together to obtain the final result.
  • the method further includes:
  • if a thread group A among the plurality of thread groups is blocked, the thread group A is added to a waiting queue; if the data of the thread group A has been extracted, the thread group A is added to a ready queue, the ready queue being the queue from which thread groups are scheduled for execution when computing resources are idle.
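A small sketch of the waiting-queue/ready-queue scheduling just described; the queue names and helper functions are assumptions made for illustration.

```python
# Illustrative thread-group scheduling with waiting and ready queues (names are assumptions).
from collections import deque

waiting_queue = deque()   # blocked thread groups whose data is not yet available
ready_queue = deque()     # thread groups whose data has been extracted, awaiting idle compute resources

def on_blocked(group):
    waiting_queue.append(group)

def on_data_extracted(group):
    if group in waiting_queue:
        waiting_queue.remove(group)
    ready_queue.append(group)

def on_compute_resource_idle():
    # Schedule the next ready thread group for execution, if any.
    return ready_queue.popleft() if ready_queue else None
```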
  • the first level instruction sequence includes: a very long instruction word
  • the second level instruction sequence includes: a sequence of instructions
  • the computing system further includes: a tree module, the tree module includes: a root port and a plurality of branch ports, the root port of the tree module is connected to the group controller, the tree type a plurality of branch ports of the module are respectively connected to one of the plurality of computing units;
  • the tree module forwards a data block, a thread group, or a sequence of instructions between the group controller and the plurality of computing units.
  • the tree module is an n-tree, and the n is an integer greater than or equal to 2.
  • the computing system further includes: a branch processing circuit,
  • the branch processing circuit is connected between the group controller and the plurality of computing units;
  • the branch processing circuit forwards data, a thread group, or a sequence of instructions between the group controller and the plurality of computing units.
  • a computing system comprising: a control unit, a computing group and a total storage unit, the control unit comprising: a first memory, a decoding logic and a controller, the computing group comprising: a group a controller and a plurality of computing units; the total storage unit for storing data;
  • the controller is configured to receive a first level instruction sequence and to control the first memory and the decoding logic;
  • the decoding logic is configured to split the first level instruction sequence into multiple second level instruction sequences
  • the controller is further configured to: open M threads for the plurality of second-level instruction sequences, allocate an independent register for each of the M threads, and configure an independent addressing function, where M is an integer greater than or equal to 1; and convert the plurality of second-level instruction sequences into a plurality of control signals and send them to the group controller;
  • the group controller is configured to receive the plurality of control signals, acquire a plurality of calculation types of the plurality of control signals, divide the M threads into N thread groups, and allocate the N thread groups and the plurality of control signals to the plurality of computing units according to the plurality of calculation types;
  • the plurality of computing units are configured to use the allocated thread groups and control signals to extract data from the total storage unit and perform operations to obtain intermediate results;
  • the group controller is configured to splice all intermediate results to obtain a final calculation result.
  • the plurality of computing units include: an addition calculator, a multiplication calculator, an activation calculator, or a dedicated calculator.
  • the dedicated calculator includes: a face recognition calculator, a graph calculator, a fingerprint calculator, or a neural network calculator.
  • the group controller is specifically configured to: when the calculation type of the multiple control signals is graphic calculation, fingerprint recognition, face recognition, or neural network operation, respectively assign the multiple control signals to the face recognition Calculator, graphing calculator, fingerprint calculator or neural network calculator.
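A hedged sketch of routing control signals to dedicated calculators by calculation type, as the item above describes; the type strings and calculator functions are placeholders, not terms from the patent.

```python
# Illustrative dispatch of control signals to dedicated calculators by calculation type (assumed names).
def face_recognition_calculator(signal): return ("face_recognition", signal)
def graph_calculator(signal): return ("graphic_calculation", signal)
def fingerprint_calculator(signal): return ("fingerprint_recognition", signal)
def neural_network_calculator(signal): return ("neural_network_operation", signal)

DISPATCH = {
    "face_recognition": face_recognition_calculator,
    "graphic_calculation": graph_calculator,
    "fingerprint_recognition": fingerprint_calculator,
    "neural_network_operation": neural_network_calculator,
}

def dispatch(calculation_type: str, control_signal):
    # The group controller routes each control signal to the calculator that supports its type.
    return DISPATCH[calculation_type](control_signal)

print(dispatch("neural_network_operation", control_signal={"op": "conv"}))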
  • the first level instruction sequence includes: a very long instruction word
  • the second level instruction sequence includes: a sequence of instructions
  • the computing system includes: a tree module, where the tree module includes: a root port and a plurality of branch ports, the root port of the tree module is connected to the group controller, and the tree module a plurality of branch ports respectively connected to one of the plurality of computing units;
  • the tree module is configured to forward a data block, a thread group, or a sequence of instructions between the group controller and the plurality of computing units.
  • the tree module is an n-tree, and the n is an integer greater than or equal to 2.
  • the computing system includes: a branch processing circuit,
  • the branch processing circuit is connected between the group controller and the plurality of computing units;
  • the branch processing circuit is configured to forward data, a thread group, or a sequence of instructions between the group controller and the plurality of computing units.
  • the present invention provides a neural network operation module for performing a multi-layer neural network operation, including:
  • a storage unit for storing input neuron precision, weight precision, and output neuron gradient accuracy
  • a controller unit configured to acquire, from the storage unit, the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy of the L-th layer of the multi-layer neural network, where L is an integer greater than 0; obtain a gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy; and, when the gradient update precision T is less than a preset precision T_r, adjust the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized;
  • an arithmetic unit configured to represent the output neurons and weights of the L-th layer according to the adjusted input neuron precision S_x(l) and weight precision S_w(l), and to represent the L-th layer output neuron gradient obtained by the operation according to the adjusted output neuron gradient accuracy, for use in subsequent operations.
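The patent's preset formula for the gradient update precision T is not reproduced in this text, so the sketch below assumes, purely for illustration, that T is the sum of the three precision values; it only shows the shape of the adjustment loop that drives |T - T_r| toward its minimum while the input neuron precision and weight precision are held fixed.

```python
# Sketch of the precision-adjustment idea. The exact preset formula for T is not given here,
# so T is ASSUMED to be the sum of the three precision values, purely for illustration.
def adjust_precisions(s_x: float, s_w: float, s_grad: float, t_r: float, step: float = 1.0):
    """Adjust the output neuron gradient accuracy so |T - T_r| is minimized (T < T_r case)."""
    def gradient_update_precision(sx, sw, sg):
        return sx + sw + sg          # assumed stand-in for the patent's preset formula

    t = gradient_update_precision(s_x, s_w, s_grad)
    # Keep input neuron precision and weight precision unchanged; increase the gradient accuracy value.
    while t < t_r and abs((t + step) - t_r) <= abs(t - t_r):
        s_grad += step
        t = gradient_update_precision(s_x, s_w, s_grad)
    return s_x, s_w, s_grad, t

print(adjust_precisions(s_x=1.0, s_w=1.0, s_grad=1.0, t_r=6.4))  # stops when |T - T_r| is minimal
```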
  • the controller unit obtaining the gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the controller unit calculates the gradient update precision T from the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy according to a preset formula;
  • the first preset formula is:
  • the controller unit adjusting the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the controller unit keeps the input neuron precision S_x(l) and the weight precision S_w(l) unchanged and increases the output neuron gradient accuracy;
  • when the controller unit increases the output neuron gradient accuracy, the bit width of the fixed point data format representing the output neuron gradient is reduced.
  • after the controller unit increases the output neuron gradient accuracy, the controller unit is further configured to:
  • the controller unit reduces a bit width of a fixed point data format representing the output neuron gradient, including:
  • the controller unit reduces a bit width of the fixed point data format indicating the output neuron gradient according to a first preset step size N1;
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
  • the controller unit reduces a bit width of a fixed point data format representing the output neuron gradient, including:
  • the controller unit reduces the bit width of the fixed point data format representing the output neuron gradient in a 2x decrement manner.
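A sketch of the two bit-width reduction strategies named above (a fixed step N1, or a 2x decrement); the lower bound `minimum` is an added assumption used only to keep the width positive.

```python
# Sketch of the two bit-width reduction strategies (fixed step N1, or halving each time).
def reduce_bitwidth_by_step(width: int, n1: int = 8, minimum: int = 8) -> int:
    """Reduce the fixed-point bit width by a preset step N1, not going below a floor."""
    return max(minimum, width - n1)

def reduce_bitwidth_by_halving(width: int, minimum: int = 8) -> int:
    """Reduce the fixed-point bit width in a 2x decrement manner."""
    return max(minimum, width // 2)

print(reduce_bitwidth_by_step(32, n1=8))    # 24
print(reduce_bitwidth_by_halving(32))       # 16
```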
  • controller unit is further configured to:
  • an embodiment of the present invention provides a neural network operation method, including:
  • the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy are adjusted so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized;
  • the output neurons and weights of the L-th layer are represented according to the adjusted input neuron precision S_x(l) and weight precision S_w(l); the L-th layer output neuron gradient obtained by the operation is represented according to the adjusted output neuron gradient accuracy, for use in subsequent operations.
  • calculating the gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the preset formula is:
  • the adjusting of the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the increasing of the output neuron gradient accuracy further includes reducing the bit width of the fixed point data format representing the output neuron gradient;
  • the method further includes:
  • the reducing the bit width of the fixed point data format representing the output neuron gradient comprises:
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
  • the reducing the bit width of the fixed point data format representing the output neuron gradient comprises:
  • the bit width of the fixed point data format representing the output neuron gradient is reduced in a 2x decrement manner.
  • the method further includes:
  • the present invention provides a neural network operation module, which is used for performing operations on a multi-layer neural network, including:
  • a storage unit for storing input neuron precision, weight precision, and output neuron gradient accuracy
  • a controller unit configured to acquire, from the storage unit, the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy of the L-th layer of the multi-layer neural network, where L is an integer greater than 0; obtain a gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy; and, when the gradient update precision T is greater than a preset precision T_r, adjust the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized;
  • an arithmetic unit configured to represent the output neurons and weights of the L-th layer according to the adjusted input neuron precision S_x(l) and weight precision S_w(l), and to represent the L-th layer output neuron gradient obtained by the operation according to the adjusted output neuron gradient accuracy, for use in subsequent operations.
  • the controller unit obtaining the gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the controller unit calculates the gradient update precision T from the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy according to a preset formula;
  • the first preset formula is:
  • the controller unit adjusting the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the controller unit keeps the input neuron precision S_x(l) and the weight precision S_w(l) unchanged and reduces the output neuron gradient accuracy;
  • when the controller unit reduces the output neuron gradient accuracy, the bit width of the fixed point data format representing the output neuron gradient is increased.
  • after the controller unit increases the output neuron gradient accuracy, the controller unit is further configured to:
  • the bit width of the fixed point data format representing the output neuron gradient is increased.
  • the controller unit increases a bit width of a fixed point data format representing the output neuron gradient, including:
  • the controller unit increases a bit width of the fixed point data format indicating the output neuron gradient according to a first preset step size N1;
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
  • the controller unit increases a bit width of a fixed point data format representing the output neuron gradient, including:
  • the controller unit increases the bit width of the fixed point data format representing the output neuron gradient in a 2-fold incremental manner.
  • controller unit is further configured to:
  • an embodiment of the present invention provides a neural network operation module, where the neural network operation module is used to perform a multi-layer neural network operation, including:
  • a storage unit configured to store an output neuron gradient of the multi-layer neural network
  • if the scale data a is greater than a second preset threshold, the L-th layer output neuron gradient accuracy is reduced;
  • an arithmetic unit configured to represent the L-th layer output neuron gradient according to the reduced output neuron gradient accuracy, for use in subsequent operations.
  • when the controller unit increases the L-th layer output neuron gradient accuracy, the bit width of the fixed point data format representing the L-th layer output neuron gradient is increased.
  • after the controller unit reduces the L-th layer output neuron gradient accuracy, the controller unit is further configured to:
  • the bit width of the fixed point data format representing the L-th layer output neuron gradient is increased.
  • the increasing of the bit width of the fixed point data format representing the L-th layer output neuron gradient includes:
  • the controller unit increases the bit width of the fixed point data format representing the L-th layer output neuron gradient according to a second preset step size N2.
  • the controller unit increasing the bit width of the fixed point data format representing the L-th layer output neuron gradient includes:
  • the controller unit increases the bit width of the fixed point data format representing the L-th layer output neuron gradient in a 2-fold increment manner.
  • an embodiment of the present invention provides a neural network operation method, including:
  • the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy are adjusted so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized;
  • the output neurons and weights of the L-th layer are represented according to the adjusted input neuron precision S_x(l) and weight precision S_w(l); the L-th layer output neuron gradient obtained by the operation is represented according to the adjusted output neuron gradient accuracy, for use in subsequent operations.
  • calculating the gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the preset formula is:
  • the adjusting of the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient accuracy includes:
  • the method further includes:
  • the bit width of the fixed point data format representing the output neuron gradient is increased.
  • the increasing the bit width of the fixed point data format representing the output neuron gradient comprises:
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
  • the increasing the bit width of the fixed point data format representing the output neuron gradient comprises:
  • the bit width of the fixed point data format representing the output neuron gradient is increased in a 2-fold increment.
  • the method further includes:
  • an embodiment of the present application provides a neural network operation method, including:
  • the L-th layer output neuron gradient is represented according to the reduced output neuron gradient accuracy, for use in subsequent operations.
  • when the L-th layer output neuron gradient accuracy is reduced, the bit width of the fixed point data format representing the L-th layer output neuron gradient is increased.
  • the method further includes:
  • the bit width of the fixed point data format representing the L-th layer output neuron gradient is increased.
  • the increasing of the bit width of the fixed point data format representing the L-th layer output neuron gradient includes:
  • the bit width of the fixed point data format representing the L-th layer output neuron gradient is increased according to a third preset step size N2.
  • the increasing of the bit width of the fixed point data format representing the L-th layer output neuron gradient includes: increasing the bit width of the fixed point data format representing the L-th layer output neuron gradient in a 2-fold increment manner.
  • the present invention provides a neural network computing device for performing an artificial neural network training calculation;
  • the neural network training calculation includes a multi-layer training operation of the neural network, the multi-layer training operation including an i-th layer; at least part of the data in the forward operation or the reverse operation of the i-th layer is used for fixed-point data operations, where i is an integer greater than or equal to 1;
  • the computing device includes: a controller unit, an operation unit, and a conversion unit; the controller unit is connected to the operation unit and the conversion unit;
  • the i-th layer training operation includes a forward operation of the i-th layer and a reverse operation of the i-th layer; the controller unit is configured to acquire the i-th layer input neuron data, the i-th layer weight data, and an i-th layer forward calculation instruction;
  • the controller unit is further configured to parse the i-th layer forward calculation instruction to obtain a plurality of forward operation instructions, and send the i-th layer input neuron data and the i-th layer weight data to the conversion unit, and the plurality of operation instructions Sent to the arithmetic unit;
  • a conversion unit configured to perform floating point/fixed point type conversion on all or part of the i-th layer input neuron data and the i-th layer weight data to obtain all fixed point data or mixed data, and to send the all fixed point data or mixed data to the operation unit, the mixed data including: partial fixed point data and partial floating point data;
  • the operation unit is configured to perform fixed point operations on the all fixed point data or mixed operations on the mixed data according to the forward operation instructions to obtain an i-th layer forward output result;
  • the mixing operation includes performing a fixed point operation on the partial fixed point data and a floating point operation on the partial floating point data.
  • the controller unit is further configured to acquire the input neuron data of the i-th layer, the i-th layer weight data, the i-th input neuron gradient, and the i-th layer reverse calculation instruction;
  • the controller unit is further configured to parse the i-th layer calculation instruction to obtain a plurality of reverse operation instructions, and send the i-th layer input neuron data, the i-th layer weight data, and the i-th layer input neuron gradient to the conversion unit, Transmitting the plurality of operation instructions to the operation unit;
  • the conversion unit is further configured to perform floating point type and fixed point type conversion on all or part of the input data of the i-th layer input neuron data, the i-th layer weight data, and the i-th input neuron gradient to obtain all fixed point data or a mixture Data, sending all fixed point data or mixed data to the operation unit, the mixed data includes: partial fixed point data and partial floating point data;
  • the operation unit is further configured to perform fixed point operations on the all fixed point data or mixed operations on the mixed data according to the plurality of forward operation instructions to obtain an i-th layer weight gradient and an i-th layer output result gradient, and to update the i-th layer weight using the i-th layer weight gradient.
  • the conversion unit is specifically configured to convert a portion of the i-th layer input neuron data into partial fixed-point input neuron data and to convert a portion of the i-th layer weight data into partial fixed-point weight data;
  • the partial fixed-point input neuron data and the partial fixed-point weight data are sent to the operation unit, and the remaining partial input neuron data and partial weight data are sent to the operation unit;
  • the operation unit is specifically configured to perform fixed-point data operations on the partial fixed-point input neuron data and the partial fixed-point weight data to obtain a partial fixed-point forward output result, and to send the partial fixed-point forward output result to the conversion unit;
  • the conversion unit is specifically configured to perform fixed point to floating point conversion on the partial fixed-point forward output result to obtain a first partial floating-point forward output result, and to send the first partial floating-point forward output result to the operation unit;
  • the operation unit is specifically configured to perform floating-point operations on the partial input neuron data and the partial weight data to obtain a second partial floating-point forward operation result, and to combine the first partial floating-point forward operation result and the second partial floating-point forward operation result to obtain the i-th layer forward output result.
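A hedged NumPy sketch of the mixed fixed-point/floating-point forward pass described above: part of the data is quantized to fixed point, the rest stays in floating point, and the two partial results are combined. The 50/50 split of the features, the scale `point`, and all function names are assumptions made for illustration only.

```python
# Illustrative mixed-precision forward pass (not the patented implementation).
import numpy as np

def to_fixed(x: np.ndarray, point: int) -> np.ndarray:
    return np.round(x / 2.0 ** point).astype(np.int32)

def to_float(x_fixed: np.ndarray, point: int) -> np.ndarray:
    return x_fixed.astype(np.float32) * 2.0 ** point

def mixed_forward(inputs: np.ndarray, weights: np.ndarray, point: int = -8) -> np.ndarray:
    half = weights.shape[1] // 2
    # Fixed-point partial result on the first half of the input features.
    fx, fw = to_fixed(inputs[:, :half], point), to_fixed(weights[:, :half], point)
    part_fixed = to_float(fx @ fw.T, point) * 2.0 ** point   # undo both scale factors of the product
    # Floating-point partial result on the remaining features.
    part_float = inputs[:, half:] @ weights[:, half:].T
    return part_fixed + part_float                           # combine the two partial forward results

x = np.random.randn(2, 4).astype(np.float32)
w = np.random.randn(3, 4).astype(np.float32)
print(np.max(np.abs(mixed_forward(x, w) - x @ w.T)))         # small quantization error only
```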
  • the conversion unit is specifically configured to convert a portion of the i-th layer input neuron data into partial fixed-point input neuron data, convert a portion of the i-th layer weight data into partial fixed-point weight data, and convert a portion of the i-th layer input neuron gradient into a partial fixed-point input neuron gradient; the partial fixed-point input neuron data, the partial fixed-point input neuron gradient, and the partial fixed-point weight data are sent to the operation unit, and the remaining partial input neuron data, partial input neuron gradient, and partial weight data are sent to the operation unit;
  • the operation unit is specifically configured to perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point input neuron data to obtain a partial i-th layer weight gradient, to perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point weight data to obtain a partial i-th layer output result gradient, and to send the partial i-th layer weight gradient and the partial i-th layer output result gradient to the conversion unit;
  • the conversion unit is specifically configured to perform fixed point to floating point conversion on the partial i-th layer weight gradient and the partial i-th layer output result gradient to obtain a first partial i-th layer weight gradient and a first partial i-th layer output result gradient, and to send the first partial i-th layer weight gradient and the first partial i-th layer output result gradient to the operation unit;
  • the operation unit is specifically configured to perform floating-point operations on the partial input neuron gradient and the partial input neuron data to obtain a second partial i-th layer weight gradient, and on the partial input neuron gradient and the partial weight data to obtain a second partial i-th layer output result gradient; the first partial i-th layer weight gradient and the second partial i-th layer weight gradient are combined to obtain the i-th layer weight gradient, and the first partial i-th layer output result gradient and the second partial i-th layer output result gradient are combined to obtain the i-th layer output result gradient.
  • the conversion unit is specifically configured to determine the decimal point position point used for converting the floating point numbers;
  • maxabs is the largest absolute value among the floating point data to be converted, and width is the bit width of the fixed point number.
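The conversion formula itself is not reproduced in this text; the sketch below uses one common choice of decimal point position that keeps maxabs just inside the representable range of a signed fixed-point number of the given width, and is offered only as an assumption.

```python
# Sketch of choosing the fixed-point decimal position from maxabs and width (assumed formula).
import math

def decimal_point_position(maxabs: float, width: int) -> int:
    # Representable magnitude is roughly (2**(width-1) - 1) * 2**point; solve for point (maxabs > 0 assumed).
    return math.ceil(math.log2(maxabs)) - (width - 1)

print(decimal_point_position(maxabs=6.0, width=8))   # -4, so 127 * 2**-4 = 7.9375 covers 6.0
```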
  • the method for obtaining an i-th layer input neuron gradient specifically includes:
  • the controller unit is configured to receive the (i+1)-th layer output result gradient and to send the (i+1)-th layer output result gradient to the operation unit;
  • the operation unit is specifically configured to obtain the i-th layer input neuron gradient according to the (i+1)-th layer output result gradient;
  • the i-th layer input neuron gradient = f' * the (i+1)-th layer output result gradient;
  • f' is the derivative of the activation function f.
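A worked example of the relation above, with sigmoid chosen arbitrarily as the activation function f; the derivative f' is evaluated at the i-th layer pre-activation, which is an assumption about where the derivative is taken.

```python
# The i-th layer input neuron gradient as the elementwise product of f' and the (i+1)-th layer
# output result gradient (sigmoid used purely for illustration).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_neuron_gradient(z_i: np.ndarray, next_layer_output_gradient: np.ndarray) -> np.ndarray:
    f_prime = sigmoid(z_i) * (1.0 - sigmoid(z_i))      # derivative of the activation function f
    return f_prime * next_layer_output_gradient         # f' * (i+1)-th layer output result gradient

z = np.array([0.5, -1.0, 2.0])
out_grad = np.array([0.1, -0.2, 0.05])
print(input_neuron_gradient(z, out_grad))
```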
  • the operation unit includes: a main processing circuit and a plurality of slave processing circuits; wherein
  • the main processing circuit is configured to perform pre-processing on the data, and transmit data and an operation instruction to the plurality of slave processing circuits;
  • a plurality of slave processing circuits configured to compute a plurality of intermediate results in parallel according to the data and operation instructions transmitted from the main processing circuit, and to transmit the plurality of intermediate results to the main processing circuit;
  • the main processing circuit is configured to obtain an i-th layer forward output result, an i-th layer output result gradient, an i-th layer weight gradient according to the plurality of intermediate results, and update the i-th layer weight according to the i-th layer weight gradient .
  • the main processing circuit is specifically configured to send the i-th layer input neuron data to each slave processing circuit and to transmit the i-th layer input neuron gradient to each slave processing circuit; each slave processing circuit multiplies the scalar data corresponding to that slave processing circuit in the i-th layer input neuron gradient in_gradient by the i-th layer input neuron data to obtain the original weight update gradient vector dw_original of the i-th layer for that slave processing circuit;
  • the original weight update gradient vector dw_original is multiplied with the weight of each slave processing circuit to obtain the updated weight of each slave processing circuit;
  • the slave processing circuit is specifically configured to use the weight update gradient dw' with the weight to obtain the updated weight of each slave processing circuit of the i-th layer.
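A hedged sketch of the per-slave weight-gradient step described above: the scalar component of the input neuron gradient assigned to a slave circuit is multiplied by the input neuron data to form dw_original, and the slave's weight row is then updated. The learning-rate (SGD-style) update shown is an assumption; the patent text only states that dw_original and the weight are combined to produce the updated weight.

```python
# Hedged sketch of a slave processing circuit's weight update (SGD step assumed for illustration).
import numpy as np

def slave_weight_update(in_gradient_scalar: float, input_neurons: np.ndarray,
                        weight_row: np.ndarray, lr: float = 0.01) -> np.ndarray:
    dw_original = in_gradient_scalar * input_neurons   # original weight update gradient vector dw_original
    return weight_row - lr * dw_original               # updated weight for this slave processing circuit

neurons = np.array([0.2, 0.5, 0.8])
row = np.array([0.1, -0.3, 0.7])
print(slave_weight_update(in_gradient_scalar=0.4, input_neurons=neurons, weight_row=row))
```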
  • the main processing circuit and the slave processing circuit each include a storage module
  • the storage module is configured to store data
  • the storage module further includes at least one shared area, the shared area being a storage space shared and used by the main processing circuit or the slave processing circuits.
  • the computing unit further includes: a branch processing circuit
  • the branch processing circuit is disposed between the main processing circuit and the plurality of slave processing circuits, and implements forwarding of data between the main processing circuit and the plurality of slave processing circuits and the operation instructions.
  • the branch processing circuit includes: a storage module, where the storage module includes at least one shared area, where the shared area is a storage space shared by the processing circuit and used by the processing circuit.
  • the device further includes a tree module; for example, the tree module may be an interconnection module, the interconnection module being an n-ary tree path composed of a plurality of nodes, where the data of an upstream node of the n-ary tree is sent to the n downstream nodes, and the interconnection module combines the data returned by the n downstream nodes and sends it to the upstream node, n being an integer greater than or equal to 2.
  • the tree module may be an interconnection module;
  • the interconnection module is an n-ary tree path composed of a plurality of nodes; the data of an upstream node of the n-ary tree is sent to the n downstream nodes, and the interconnection module combines the data returned by the n downstream nodes and sends it to the upstream node, where n is an integer greater than or equal to 2.
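An illustrative recursion for the n-ary tree interconnection just described: the upstream node fans data out to its n downstream nodes and combines the returned results on the way back up. Summation as the combine step, and all names, are assumptions.

```python
# Illustrative n-ary tree broadcast and combine (not the patented interconnect).
from typing import Callable

def tree_broadcast_and_combine(data, leaf_compute: Callable, num_leaves: int, n: int = 2):
    """Recursively fan data out to num_leaves leaves through an n-ary tree and combine the results."""
    if num_leaves == 1:
        return leaf_compute(data)                      # a leaf node stands in for a computing unit
    per_child = max(1, num_leaves // n)
    results = []
    remaining = num_leaves
    while remaining > 0:
        take = min(per_child, remaining)
        results.append(tree_broadcast_and_combine(data, leaf_compute, take, n))
        remaining -= take
    return sum(results)                                # combine data returned by the downstream nodes

print(tree_broadcast_and_combine(3.0, leaf_compute=lambda x: x * x, num_leaves=4))  # 4 leaves * 9.0 = 36.0
```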
  • the activation function f is any one of a nonlinear function sigmoid, tanh, relu, softmax or a linear function;
  • the operation instructions include: a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, or a MOVE instruction.
  • the main processing circuit includes a first storage unit, a first operation unit, and a first data dependency determination unit, where:
  • a neuron buffer unit for buffering input data and output data used by the main processing circuit in the calculation process
  • a first arithmetic unit configured to perform various computing functions of the main processing circuit
  • a first data dependency determining unit configured to read the input neuron vector from the first storage unit, and send the same to the slave processing circuit through the interconnect module; and receive the intermediate result vector of the interconnect module, and send the intermediate result vector to The first arithmetic unit.
  • the first operation unit includes: a vector addition unit and an activation operation unit;
  • the vector adding unit is configured to add offset data to the intermediate result to obtain an offset result
  • the activation operation unit is configured to perform an activation function operation on the bias result.
  • each slave processing circuit includes a second operation unit, a second data dependency determination unit, a second storage unit, and a third storage unit, wherein:
  • a second operation unit configured to perform an arithmetic logic operation
  • a second data dependency determining unit configured to perform a read/write operation on the second storage unit and the third storage unit
  • a second storage unit configured to buffer data of the input neuron vector and the output neuron value calculated by the processing circuit
  • a third storage unit configured to buffer a weight vector required by the processing circuit in the calculation process.
  • the second operation unit includes: a vector multiplication unit and an accumulation unit;
  • the vector multiplication unit is configured to perform a vector multiplication operation in a dot product operation
  • the accumulating unit is configured to perform an accumulating operation in a dot product operation.
  • a neural network training method is provided, the method is used in a neural network computing device;
  • the neural network training calculation comprises a neural network multi-layer training operation, and the multi-layer training operation includes an i-th layer, At least part of the data in the forward or reverse operation of the i-th layer is used for fixed-point data operations, and the above i is an integer greater than or equal to 1;
  • the computing device includes: a controller unit, an arithmetic unit, and a conversion unit, wherein the control The unit is connected to the operation unit and the conversion unit;
  • the i-th layer training operation includes an i-th layer forward operation and an i-th layer inverse operation;
  • the ith layer forward operation includes:
  • the controller unit acquires the input neuron data of the i-th layer, the i-th layer weight data, and the i-th layer forward calculation instruction; parses the i-th layer calculation instruction to obtain a plurality of forward operation instructions, and inputs the i-th layer into the neuron Data and the i-th layer weight data are sent to the conversion unit, and the plurality of forward operation instructions are sent to the operation unit;
  • the converting unit performs all or part of the i-th layer input neuron data and the i-th layer weight data to perform floating point type and fixed point type conversion to obtain all fixed point data or mixed data, and sends all fixed point data or mixed data to the arithmetic unit.
  • the mixed data includes: partial fixed point data and partial floating point data;
  • the arithmetic unit performs a fixed point operation on all the fixed point data according to the plurality of forward operation instructions or performs a mixed operation on the mixed data to obtain a forward output result of the i-th layer;
  • the mixing operation includes performing a fixed point operation on the partial fixed point data and a floating point operation on the partial floating point data.
  • the i-th layer reverse operation includes:
  • the controller unit acquires the input neuron data of the i-th layer, the i-th layer weight data, the i-th input neuron gradient, and the i-th layer reverse calculation instruction; and parses the i-th layer calculation instruction to obtain a plurality of reverse operation instructions Transmitting the i-th layer input neuron data, the i-th layer weight data, and the i-th layer input neuron gradient to the conversion unit, and transmitting the plurality of reverse operation instructions to the operation unit;
  • the conversion unit performs all or part of the i-th layer input neuron data, the i-th layer weight data, and the i-th input neuron gradient to perform floating point type and fixed point type conversion to obtain all fixed point data or mixed data, and all the fixed points are fixed.
  • Data or mixed data is sent to the arithmetic unit, the mixed data includes: partial fixed point data and partial floating point data;
  • the arithmetic unit performs a fixed point operation on all the fixed point data according to the plurality of forward operation instructions or performs a mixed operation on the mixed data to obtain the weight gradient of the i-th layer and the output gradient of the i-th layer; and the weight gradient of the i-th layer is used for the i-th Layer weight update.
  • the conversion unit performing floating point/fixed point type conversion on all or part of the i-th layer input neuron data and the i-th layer weight data to obtain all fixed point data or mixed data, sending the all fixed point data or mixed data to the operation unit (the mixed data including: partial fixed point data and partial floating point data), and the operation unit performing fixed point operations on the all fixed point data or mixed operations on the mixed data according to the plurality of forward operation instructions to obtain the i-th layer forward output result specifically includes:
  • the conversion unit converts a portion of the i-th layer input neuron data into partial fixed-point input neuron data and converts a portion of the i-th layer weight data into partial fixed-point weight data; the partial fixed-point input neuron data and partial fixed-point weight data are sent to the operation unit, and the remaining partial input neuron data and partial weight data are sent to the operation unit;
  • the operation unit performs fixed-point data operations on the partial fixed-point input neuron data and the partial fixed-point weight data to obtain a partial fixed-point forward output result, and sends the partial fixed-point forward output result to the conversion unit;
  • the conversion unit performs fixed point to floating point conversion on the partial fixed-point forward output result to obtain a first partial floating-point forward output result, and sends the first partial floating-point forward output result to the operation unit;
  • the operation unit performs floating-point operations on the partial input neuron data and the partial weight data to obtain a second partial floating-point forward operation result, and combines the first partial floating-point forward operation result and the second partial floating-point forward operation result to obtain the i-th layer forward output result.
  • the conversion unit performing floating point/fixed point type conversion on all or part of the i-th layer input neuron data, the i-th layer weight data, and the i-th layer input neuron gradient to obtain all fixed point data or mixed data, sending the all fixed point data or mixed data to the operation unit (the mixed data including: partial fixed point data and partial floating point data), the operation unit performing fixed point operations on the all fixed point data or mixed operations on the mixed data according to the plurality of forward operation instructions to obtain the i-th layer weight gradient and the i-th layer output result gradient, and updating the i-th layer weight with the i-th layer weight gradient specifically includes:
  • the conversion unit converts a portion of the i-th layer input neuron data into partial fixed-point input neuron data, converts a portion of the i-th layer weight data into partial fixed-point weight data, and converts the i-th input neuron gradient into Partially-point input neuron gradient; send part of the fixed-point input neuron data, partial fixed-point input neuron gradient, and partial fixed-point weight data to the operation unit, and send part of the input neuron data, part of the input neuron gradient, and part of the weight data Giving an arithmetic unit;
  • the operation unit performs fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point input neuron data to obtain a partial i-th layer weight gradient, performs fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point weight data to obtain a partial i-th layer output result gradient, and sends the partial i-th layer weight gradient and the partial i-th layer output result gradient to the conversion unit;
  • the conversion unit performs fixed point to floating point conversion on the partial i-th layer weight gradient and the partial i-th layer output result gradient to obtain a first partial i-th layer weight gradient and a first partial i-th layer output result gradient, and sends the first partial i-th layer weight gradient and the first partial i-th layer output result gradient to the operation unit;
  • the operation unit performs floating-point operations on the partial input neuron gradient and the partial input neuron data to obtain a second partial i-th layer weight gradient, and on the partial input neuron gradient and the partial weight data to obtain a second partial i-th layer output result gradient; the first partial i-th layer weight gradient and the second partial i-th layer weight gradient are combined to obtain the i-th layer weight gradient, and the first partial i-th layer output result gradient and the second partial i-th layer output result gradient are combined to obtain the i-th layer output result gradient.
  • a neural network training device comprising: the computing device provided by the fifth aspect, configured to acquire data to be calculated and control information from other processing devices, and perform a specified operation, The execution result is transmitted to other processing devices through the I/O interface;
  • the neural network training device includes a plurality of the computing devices
  • the plurality of the computing devices may connect and transmit data through a specific structure
  • a plurality of said computing devices are interconnected and transmit data through a peripheral component interconnect express (PCIe) bus to support larger-scale neural network training operations; a plurality of said computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memory; and the interconnection manner of the plurality of computing devices is an arbitrary interconnection topology.
  • a combined processing device comprising the neural network training device of the seventh aspect, a universal interconnection interface and other processing devices;
  • the neural network training device interacts with the other processing devices to jointly perform a user-specified computing operation.
  • a neural network chip comprising the computing device provided by the fifth aspect or the neural network training device of the seventh aspect or the combined processing device of the eighth aspect.
  • an electronic device comprising the chip provided by the ninth aspect.
  • a card comprising: a storage device, an interface device, and a control device, and a neural network chip provided by the ninth aspect;
  • neural network chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is configured to store data
  • the interface device is configured to implement data transmission between the chip and an external device
  • the control device is configured to monitor a status of the chip.
  • the storage device includes: a plurality of groups of storage units, each group of the storage units being connected to the chip by a bus, the storage unit being: DDR SDRAM;
  • the chip includes: a DDR controller for controlling data transmission and data storage of each of the storage units;
  • the interface device is: a standard PCIE interface.
  • FIG. 1 is a schematic flow chart of a calculation method provided by the present application.
  • Figure 1A is a schematic diagram of a fixed point data format.
  • FIG. 1B is a schematic structural diagram of a neural network operation module according to an embodiment of the present invention.
  • FIG. 1C is a schematic flowchart diagram of a neural network operation method according to an embodiment of the present invention.
  • FIG. 1D is a schematic flowchart diagram of another neural network operation method according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a computing system provided by the present application.
  • FIG. 2A is a schematic structural view of a control unit of the present application.
• FIG. 2B is a schematic structural diagram of a computing group of the present application.
• FIG. 2C is a schematic diagram of a hardware structure of a group controller and a plurality of computing units.
• FIG. 2D is a schematic diagram of another hardware structure of a group controller and a plurality of computing units.
• FIG. 3A is a schematic structural diagram of a computing unit.
  • FIG. 3B is a schematic structural diagram of an arithmetic unit.
  • FIG. 3C is a schematic structural diagram of another arithmetic unit.
  • FIG. 4 shows an example block diagram of the overall structure of a neural network computing device in accordance with an embodiment of the present application.
  • FIG. 4A is a schematic block diagram showing the structure of an arithmetic unit according to an embodiment of the present application.
  • FIG. 4B is a schematic diagram showing another structure of an arithmetic unit according to an embodiment of the present application.
  • FIG. 4C schematically shows a schematic diagram of transmission of a tree module according to an embodiment of the present application.
  • FIG. 4D schematically illustrates a receiving diagram of a tree module in accordance with an embodiment of the present application.
  • FIG. 4E schematically shows a schematic structural view of a combined processing apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram showing the structure of a board according to an embodiment of the present application.
• references to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
• the appearances of this phrase in various places in the specification do not necessarily refer to the same embodiment, nor are they separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
  • SIMD single instruction multiple data stream
  • SIMT single instruction multiple thread
  • SMT synchronous multithreading
  • SIMD Single Instruction Multiple Data Stream
• SIMD refers to a computer executing the same operation on multiple data items simultaneously. For example, when two long vectors need to be added, in the SIMD scenario the two long vectors can be split into several short vectors, so that multiple vector-addition components perform the additions of the several short vectors in parallel; the addition results of the several short vectors are then combined to obtain the addition result of the long vectors. A minimal sketch of this splitting idea follows below.
• in SIMD, the instruction stream is single at any time, that is, the instruction being executed is the same instruction, but the data being operated on can be different.
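• The following is a minimal sketch of the long-vector splitting described above, written in Python with NumPy chunks standing in for the hardware vector-addition components; the chunk size (lane_width) and the function name are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def simd_long_vector_add(a, b, lane_width=8):
    """Split two long vectors into short vectors of `lane_width` elements,
    add the short vectors chunk by chunk (standing in for parallel
    vector-addition components), then combine the partial results."""
    assert len(a) == len(b)
    partial_sums = []
    for start in range(0, len(a), lane_width):
        chunk_a = a[start:start + lane_width]
        chunk_b = b[start:start + lane_width]
        partial_sums.append(chunk_a + chunk_b)   # one short-vector addition
    return np.concatenate(partial_sums)          # combine the short results

a = np.arange(100, dtype=np.float32)
b = np.ones(100, dtype=np.float32)
assert np.allclose(simd_long_vector_add(a, b), a + b)
```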
• SIMT (single instruction, multiple threads) means that the same instruction is issued to multiple threads at the same time, with each thread operating on its own data.
• SMT (synchronous multithreading) means that the processor can run instructions from multiple threads in the same clock cycle; when a thread is blocked, a context switch can be used to run another thread's instructions.
• FIG. 1 provides a calculation method, which may be performed by a computing system that includes: a control unit, a computing group, and a total storage unit; the control unit includes: a first memory, decoding logic, and a controller; the computing group includes: a group controller and a plurality of computing units; the total storage unit is configured to store data; the calculation method includes the following steps:
  • Step S101 The controller of the computing system receives the first level instruction sequence, and splits the first level instruction sequence into multiple second level instruction sequences.
  • the computing system can also directly receive multiple second-level instruction sequences.
  • the second level instruction sequence is a sequence of instructions whose integration level is one level lower than the first level instruction sequence, that is, the first level instruction sequence may include or integrate multiple second level instruction sequences.
  • the above-described manner of inclusion or integration is not limited in this application.
  • the first level instruction sequence may be: a super long instruction
  • the second level instruction sequence includes: an instruction sequence.
  • the first-level instruction sequence may be: an instruction sequence
  • the second-level instruction sequence may be: a micro-instruction sequence.
  • the above is only for the purpose of illustration. For a sequence of instructions in a specific implementation, only the first level instruction sequence needs to include a set of second level instruction sequences.
  • Step S102 The controller of the computing system opens M threads for the plurality of second-level instruction sequences, and the controller of the computing system allocates an independent storage space and configures an independent addressing function for each of the M threads.
  • the M value ranges from an integer greater than or equal to 1;
• Step S103 The group controller of the computing system acquires multiple calculation types of the plurality of second-level instruction sequences, obtains a fusion calculation manner corresponding to the calculation types according to the plurality of calculation types, and the plurality of calculation units adopt the fusion calculation manner and invoke the M threads to perform calculations on the plurality of second-level instruction sequences to obtain a final result.
  • This application presents a SIMD, SMT, and SIMT fusion computing system and method with VLIW as an optional aid.
• This application fully exploits the parallelism of computing. Against the background of the rise of deep learning, the scale of vector calculation is getting larger and larger, and the technical solution provided by the present application can obtain the processing result faster, so it has the advantage of improving the calculation speed.
• For example, if the VLIW is parsed to obtain 25 vector addition instructions, SIMT can invoke 5 threads and each thread executes 5 vector addition instructions by the SIMD method, so the time to complete the 25 vector addition instructions can be reduced to 5t (the switching time is ignored here); it can thus be seen that the calculation speed of the calculation method provided by the present application is increased by nearly 5 times compared with the existing method. A toy sketch of this 5-by-5 partition is given below.
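• A toy illustration of the 5-thread-by-5-instruction partition in the example above, assuming the VLIW has already been parsed into 25 independent vector additions; a Python thread pool merely simulates the SIMT threads, and all names are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Assume the VLIW has already been decoded into 25 independent vector additions.
vector_add_jobs = [(np.random.rand(64), np.random.rand(64)) for _ in range(25)]

def run_thread(jobs):
    # Each SIMT thread executes its 5 additions in SIMD fashion (NumPy here).
    return [a + b for a, b in jobs]

# Split the 25 instructions across 5 threads, 5 instructions per thread.
groups = [vector_add_jobs[i:i + 5] for i in range(0, 25, 5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    results = [r for batch in pool.map(run_thread, groups) for r in batch]

assert len(results) == 25
```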
• the group controller invokes a fusion calculation method of single instruction multiple data (SIMD) and single instruction multiple threads (SIMT) for instruction sequences of the same calculation type, and invokes the M threads to perform calculation to obtain a final result, which specifically includes:
• the decoding logic splits the M threads into N thread groups, converts the plurality of second-level instruction sequences into a plurality of second control signals, and assigns the plurality of second control signals and the N thread groups to the plurality of computing units;
• the plurality of computing units use the allocated thread groups and second control signals to extract the corresponding data, the plurality of computing units perform operations on the data to obtain a plurality of intermediate results, and the plurality of intermediate results are combined to obtain the final result.
• the group controller invokes a fusion calculation method of single instruction multiple data (SIMD) and synchronous multithreading (SMT), and invokes the M threads to perform calculation to obtain a final result, which specifically includes:
• the group controller splits the M threads into N thread groups, converts the plurality of second-level instruction sequences into a plurality of second control signals, and assigns second-level instruction sequences of different calculation types to different thread groups of the N thread groups;
• the group controller acquires the function types of the computing units; for example, if the function type of computing unit A is the same as the calculation type of instruction sequence A among the plurality of second-level instruction sequences, the control signal A corresponding to instruction sequence A is assigned to
• computing unit A to obtain an intermediate result; if the function type of a computing unit is different from the calculation type of the second-level instruction sequence, the plurality of second control signals and the N thread groups are allocated to the plurality of computing units, and the plurality of computing units use the allocated
• thread groups and second control signals to extract the corresponding data, the plurality of computing units perform operations to obtain a plurality of intermediate results, and all the intermediate results are spliced together to obtain the final result.
  • the method further includes:
• the controller adds the thread group A to the waiting queue; once the data of the thread group A has been extracted, the thread group A is added to the preparation queue, where the preparation queue is the queue from which thread groups are scheduled for execution when computing resources are idle. A minimal sketch of this queue movement is given below.
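• A minimal sketch of the waiting-queue/preparation-queue movement described above; the data structures and function names are illustrative assumptions.

```python
from collections import deque

waiting_queue = deque()      # thread groups blocked while their data is fetched
preparation_queue = deque()  # thread groups whose data is ready

def block(thread_group):
    """Called when a thread group's operands are not yet available."""
    waiting_queue.append(thread_group)

def on_data_ready(thread_group):
    """Move a thread group whose data has been extracted to the ready queue."""
    waiting_queue.remove(thread_group)
    preparation_queue.append(thread_group)

def schedule_next():
    """When a computing resource becomes idle, pick the next ready group."""
    return preparation_queue.popleft() if preparation_queue else None

block("group_A")
on_data_ready("group_A")
print(schedule_next())   # -> "group_A"
```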
• FIG. 2 provides a computing system including: a control unit 20, a computing group 21, and a total storage unit 22. As shown in FIG. 2A, the control unit includes: a first memory 301, decoding logic 302, a controller 303, and a scheduler 304; referring to FIG. 2B, the computing group includes: a group controller 305 and a plurality of computing units 306; the total storage unit 22 is configured to store data;
  • the controller 303 is configured to receive a first level instruction sequence and to control the first memory 301 and the decoding logic 302;
  • the decoding logic 302 is configured to split the first level instruction sequence into multiple second level instruction sequences
• the controller 303 is further configured to: open M threads for the plurality of second-level instruction sequences, allocate an independent storage space and configure an independent addressing function for each of the M threads, where the M value is an integer greater than or equal to 1; and convert the plurality of second-level instruction sequences into a plurality of control signals and send them to the group controller;
• the group controller 305 is configured to receive the plurality of control signals, acquire the plurality of calculation types of the plurality of control signals, divide the M threads into N thread groups, and allocate the N thread groups and the plurality of control signals to the plurality of computing units according to the plurality of calculation types;
• the computing unit 306 is configured to extract data from the total storage unit 22 by using the allocated thread group and control signal, and to perform an operation on the data to obtain an intermediate result;
  • the group controller 305 is configured to splicing all intermediate results to obtain a final calculation result.
  • the plurality of computing units 306 include: an addition calculator, a multiplication calculator, an activation calculator, or a dedicated calculator.
  • the dedicated calculator includes: a face recognition calculation calculator, a graph calculator, a fingerprint calculator, or a neural network calculator.
  • the group controller is specifically configured to: if the calculation types of the multiple control signals are graphic calculation, fingerprint recognition, face recognition, or neural network operation, respectively assign the multiple control signals to the face recognition calculation Calculator, graphing calculator, fingerprint calculator or neural network calculator.
  • the first level instruction sequence includes: a super long instruction
  • the second level instruction sequence includes: a sequence of instructions
  • the computing system can include a control unit 20, a computing group 21, and a storage unit 22.
  • the control unit is responsible for the distribution of instructions, the development of threads, the decoding of common instructions and very long instruction words, and the issuance of control signals.
• the control unit includes: local storage, decoding logic, a scheduler, and a controller. The local storage is used to store instructions; the decoding logic can decode very long instruction words and ordinary instructions; the scheduler is responsible for thread context switching; and the controller calls the stored code to control the behavior of each submodule in the control unit (for example, the local storage, the decoding logic, and the scheduler).
  • the computing group can include: a group controller and a plurality of computing units.
• the group controller receives the control signal from the control unit, converts it into an intra-group control signal, and transmits the intra-group control signal to one or more of the plurality of computing units, which perform the calculation indicated by the intra-group control signal.
  • the computing unit may include a variety of functional components, and in particular, vector computing components and various optimized computing components for specialized algorithms (such as dedicated components for machine learning or graphics processing, etc.).
  • the computing unit can also include: a unit controller and local storage. The unit controller is used to control the behavior of various functional components within the computing unit, and the local storage is used to cache data.
  • the storage unit is used to store user input data, calculation group output data, and the like.
  • the computing group can extract suitable data from the storage unit by various addressing modes under the control of the control unit.
• the very long instruction word is taken as an example to illustrate the functions that the computing system can perform. It should be noted that the very long instruction word is used here for illustrative purposes only; in practical applications, the technical solution of the present application does not limit the specific form of the above instruction, which may be, for example, a sequence of instructions.
• An ultra-long vector is a vector whose length is very large.
  • the vector can include multiple pieces of data.
  • the computing system can perform different operations on each segment of multiple pieces of data, or perform the same operations on multiple pieces of data.
  • the compiler packs the storage information of each segment of the super-long vector and the information of the required operation into a very long instruction word and sends it to the control unit.
• the control unit decodes the very long instruction word into a series of micro control instruction sequences. (Note that the very long instruction word is optional:
• the local storage of the control unit may instead store ordinary instruction sequences, which are decoded by the decoding logic into micro control instruction sequences.
• the micro control instruction sequence is also optional, and the instruction sequence can also be executed directly by the controller.
• local storage is likewise optional and can be replaced by the storage unit.)
• the computing system adopts a calculation method in which SIMT is fused with SIMD.
  • the controller unit opens up multiple threads for the micro-control instruction sequence, each thread having independent storage space and being independently addressable.
  • the appropriate number of threads are packaged into thread groups based on the number of compute units in the compute group, such that the computing system will get one or more thread groups (typically multiple thread groups).
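• A small sketch of packing threads into thread groups sized by the number of computing units, as described above; the sizing rule shown (one thread per computing unit in each group) is an assumption used only for illustration.

```python
def make_thread_groups(thread_ids, num_compute_units):
    """Pack threads into groups whose size matches the number of computing
    units, so a whole group can be issued to the computing group at once."""
    group_size = num_compute_units  # illustrative sizing rule
    return [thread_ids[i:i + group_size]
            for i in range(0, len(thread_ids), group_size)]

# 32 threads and 8 computing units -> 4 thread groups of 8 threads each
groups = make_thread_groups(list(range(32)), num_compute_units=8)
print(len(groups), [len(g) for g in groups])
```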
• the scheduler receives the thread allocation information and, in cooperation with the decoding logic, converts the micro control instruction sequences in the threads into control signals that are sent to the group controller of the computing group.
• the group controller receives the control signals from the control unit, converts them into intra-group control signals, and sends the intra-group control signals to the appropriate computing units.
  • the computing unit reads the vector operand from the storage unit and performs vector calculation.
  • the intermediate result can be temporarily stored locally, and the final result is stored in the storage unit.
• if a thread group is blocked, the computing group performs the computing operations of other thread groups through context switching, and the blocked thread group enters the waiting queue.
• when the operands of the blocked thread group are ready, the thread group moves from the waiting queue to the preparation queue.
  • the thread group in the preparation queue can be scheduled to execute when the computing resource is idle.
  • the number of threads contained in a thread group is generally constant.
  • the computing system takes the approach of SMT and SIMD fusion.
  • the computing system assigns a sequence of micro-control instructions for different operations to threads in different thread groups.
  • the computing system can perform context switching to perform other operations on the thread group.
• the above calculation may be performed cooperatively by several computing units. For example, for a video compression calculation, the prediction, transformation, quantization, and entropy coding stages of the calculation process may be allocated to different computing units, and the computing units transfer their results to one another, thereby constituting a pipeline. A schematic sketch of this stage decomposition is given below.
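• A schematic sketch of the stage decomposition mentioned above for video compression; the stage functions are placeholders, and in the device each stage would run on a different computing unit, with the units passing results to one another so that several blocks are in flight at once.

```python
# Placeholder stage functions; in the device each stage would run on a
# different computing unit, and the units would pass results to one another.
def prediction(block):     return f"pred({block})"
def transform(block):      return f"dct({block})"
def quantization(block):   return f"quant({block})"
def entropy_coding(block): return f"code({block})"

stages = [prediction, transform, quantization, entropy_coding]

def compress(block):
    for stage in stages:       # hand the result of one unit to the next
        block = stage(block)
    return block

print([compress(b) for b in ["block0", "block1"]])
```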
  • FIG. 2C is a schematic diagram of a hardware structure of a group controller and a plurality of computing units.
• the computing system may further include a tree module 401, which may have an n-ary tree structure, where n is an integer greater than or equal to 2. Specifically, the tree module includes: a root port and a plurality of branch ports; the root port of the tree module is connected to the group controller 305, and each of the plurality of branch ports of the tree module is connected to one computing unit 306 of the plurality of computing units 306;
  • the tree module is configured to forward a data block, a thread group, or a sequence of instructions between the group controller 305 and the plurality of computing units.
  • FIG. 2D is another schematic diagram of a hardware structure of a group controller and a plurality of computing units, where the computing system includes: a branch processing circuit.
  • the branch processing circuit is connected between the group controller and the plurality of computing units;
  • the branch processing circuit is configured to forward data, a thread group, or a sequence of instructions between the group controller and the plurality of computing units.
• the computing unit includes: a multiplication processing circuit; the multiplication processing circuit performs a product operation on the received data to obtain a product result; the computing unit may further include: an accumulation processing circuit, where the accumulation processing circuit performs an accumulation operation on the product result to obtain the intermediate result.
  • the above calculation unit may also be another hardware structure, as shown in FIG. 3A, the controller unit 311 and the operation unit 312, wherein the controller unit 311 is connected to the operation unit 312, and the operation unit 312 includes : a main processing circuit and a plurality of slave processing circuits;
  • the controller unit 311 is configured to acquire data, a thread group, and instructions.
• the data includes: input neuron data, weight data, and output neuron data;
• in an alternative manner, the data, the thread group, and the instruction may be obtained through a data input/output unit, and the data input/output unit may specifically be one or more data I/O interfaces or I/O pins.
  • the above instructions include, but are not limited to, a forward operation instruction or a reverse training instruction, or other neural network operation instruction, etc., such as a convolution operation instruction, and the specific embodiment of the present application does not limit the specific expression form of the above calculation instruction.
  • the controller unit 311 is further configured to parse the instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the data to the main processing circuit;
  • the main processing circuit 3101 is configured to perform pre-processing on the data, and transmit data and an operation instruction to the plurality of slave processing circuits;
• a plurality of slave processing circuits 3102, configured to perform operations in parallel according to the data and the operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate data results, and to transmit the plurality of intermediate data results to the main processing circuit;
  • the main processing circuit 3101 is configured to perform subsequent processing on the plurality of intermediate data results to obtain an instruction result of the instruction.
  • the foregoing calculating unit may further include: the storage unit 310 and a direct memory access unit, where the storage unit may include: one of a register and a cache, or any combination thereof.
  • the cache is configured to store the operation instruction.
  • the register is used to store a thread group, an instruction, a data, or a scalar; the cache is a cache.
  • the direct memory access unit is used to read or store data from the storage unit 310.
  • the controller unit includes: an instruction storage unit, an instruction processing unit, and a storage queue unit;
  • An instruction storage unit for storing instructions
  • the instruction processing unit is configured to parse the calculation instruction to obtain a plurality of operation instructions
  • the storage queue unit is configured to store a queue, and the queue may be an instruction queue.
  • the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed in a sequence of the queues.
  • the controller unit may further include: a dependency processing unit;
• a dependency processing unit, configured to determine, when there are a plurality of operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction; if the first operation instruction has an association relationship with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the execution of the zeroth operation instruction is completed, the first operation instruction is transmitted from the instruction storage unit to the operation unit;
• the determining whether the first operation instruction has an association relationship with the zeroth operation instruction before the first operation instruction comprises:
• extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, determining that the first operation instruction has an association relationship with the zeroth operation instruction; if the first storage address interval does not overlap the zeroth storage address interval, determining that the first operation instruction has no association relationship with the zeroth operation instruction. A minimal sketch of this overlap test is given below.
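• A minimal sketch of the storage-address-interval overlap test described above; the interval representation and the function name are illustrative assumptions.

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True if the storage address interval of the first operation
    instruction overlaps that of the zeroth operation instruction, in which
    case the first instruction must wait for the zeroth one to finish."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start <= zeroth_end and zeroth_start <= first_end

# Example: first instruction touches [100, 163], zeroth touches [128, 191].
assert has_dependency((100, 163), (128, 191))    # overlapping -> dependent
assert not has_dependency((0, 63), (128, 191))   # disjoint -> independent
```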
  • the structure of an operation unit includes: a tree module, the tree module includes: a root port and a plurality of branch ports, The root port of the tree module is connected to the main processing circuit, and the plurality of branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits, and the tree module has a transceiver function.
  • the tree module is configured to forward data blocks, weights, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
• the operation unit 312, as shown in FIG. 3C, may include a branch processing circuit; the specific connection structure is shown in FIG. 3C, where
  • the main processing circuit 3101 is connected to the branch processing circuit 3103, and the branch processing circuit 3103 is connected to the plurality of slave processing circuits 3102;
  • the branch processing circuit 3103 is configured to forward data or instructions between the main processing circuit 3101 and the slave processing circuit 3102.
• the floating-point number is usually converted into a fixed-point number for calculation, because the bit width of a fixed-point number is generally smaller than that of a floating-point number, so the memory capacity can be reduced and the calculation speed can be improved.
  • a fixed-point number is a data format that can specify the position of a decimal point.
  • the bit width of a 16-bit fixed point number is 16.
• the precision of the data is related to the range of numbers that can be represented: the higher the precision that can be represented, the smaller the range of numbers that can be represented. As shown in FIG. 1A, in the fixed-point data format,
• the first bit is a sign bit,
• the integer part occupies x bits,
• the fractional part occupies s bits,
• and the maximum fixed-point precision S that the fixed-point data format can represent is 2^(-s).
  • data can be represented in a fixed-point data format.
• the data of the L-th layer includes the input neuron X(l), the output neuron Y(l), and the weight W(l).
• the data of the L-th layer also includes the input neuron gradient, the output neuron gradient, and the weight gradient.
• the above data can be represented in the fixed-point data format, and operations on the data represented in the fixed-point data format can be performed with fixed-point numbers. A minimal conversion sketch is given below.
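• A minimal sketch of quantizing a floating-point value into the fixed-point format of FIG. 1A (1 sign bit, x integer bits, s fractional bits, precision 2^(-s)); the bit-width choices and function names are illustrative assumptions.

```python
def float_to_fixed(value, integer_bits, fractional_bits):
    """Quantize a float to a signed fixed-point code with `integer_bits`
    integer bits and `fractional_bits` fractional bits; the precision of the
    format is 2 ** -fractional_bits."""
    scale = 1 << fractional_bits
    max_code = (1 << (integer_bits + fractional_bits)) - 1  # sign bit excluded
    code = int(round(value * scale))
    return max(-max_code - 1, min(max_code, code))          # saturate

def fixed_to_float(code, fractional_bits):
    return code / (1 << fractional_bits)

# A 16-bit format: 1 sign bit, x = 10 integer bits, s = 5 fractional bits.
q = float_to_fixed(3.14159, integer_bits=10, fractional_bits=5)
print(fixed_to_float(q, 5))  # 3.15625, within the precision 2**-5 = 0.03125
```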
  • the training process in the neural network usually includes two steps of forward operation and reverse operation.
• during the reverse operation, the precision required for the input neuron gradient, the weight gradient, and the output neuron gradient may change, and may decrease as the training process proceeds. If the precision of the fixed-point numbers is redundant, the operation overhead is increased and computing resources are wasted.
• in the training process, the input neurons, weights, and output neurons involved in the forward operation, as well as the input neuron gradients, weight gradients, and output neuron gradients involved in the reverse training, will change.
  • the accuracy of input neurons, weights, output neurons, input neuron gradients, weight gradients, and output neuron gradients expressed in fixed-point data format may need to be increased or decreased.
  • the present application proposes a neural network operation module and method for dynamically adjusting the accuracy of the above data in the process of performing neural network operations, so as to reduce the error of the operation result and improve the accuracy of the calculation result while satisfying the operation requirement.
• Embodiments of the present application achieve the purpose of adjusting the accuracy of the data by adjusting the bit width of the above data. The precision of the fixed-point data format is determined by the bit width of its fractional part, so the precision can be adjusted by increasing or decreasing the bit width of the fractional part. For example, when the precision of the fixed-point data format exceeds the requirement of the operation, the bit width of the fractional part in the fixed-point data format can be reduced, that is, s in FIG. 1A is reduced, which enlarges the precision value S = 2^(-s) of the fixed-point data format (i.e., makes the representation coarser), thereby reducing the precision redundancy of the fixed-point data format, reducing the computational overhead, and avoiding waste of computing resources.
  • FIG. 1B is a schematic structural diagram of a neural network operation module according to an embodiment of the present invention.
  • the neural network computing module is used to perform operations on a multi-layer neural network.
  • the neural network operation module 100 includes:
  • the storage unit 101 is configured to store input neuron precision, weight precision, and output neuron gradient accuracy.
• the controller unit 102 is configured to: acquire, from the storage unit 101, the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision of the L-th layer of the multi-layer neural network,
• where L is an integer greater than 0; obtain a gradient update precision T according to the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision; and, when the gradient update precision T is less than the preset precision T r, adjust the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision.
  • the storage unit 101 is further configured to store input neurons, weights and output neurons, and output neuron gradients
• the controller unit 102 obtains the L-th layer input neurons, weights, and output neuron gradients from the storage unit 101, and obtains the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision according to the L-th layer input neurons, weights, and output neuron gradients.
• the bit width of the fixed-point data format representing the input neurons and the bit width of the fixed-point data format representing the weights are the first bit width;
• the bit width of the fixed-point data format representing the output neuron gradient is the second bit width.
  • the second bit width is greater than the first bit width.
  • the second bit width is twice the width of the first bit width to facilitate processing by an electronic computer.
  • first bit width may be selected as 8 bits
  • second bit width may be selected as 16 bits
• the preset precision T r may be set in advance empirically by the user; T r matching the input parameters may also be obtained according to a second preset formula by changing the input parameters; T r may also be obtained by machine learning.
• optionally, the controller unit 102 sets the preset precision T r according to the learning rate and the batch size (the number of samples in batch processing).
• further optionally, the controller unit 102 sets the preset precision T r according to the number of output neurons in the upper layer, the batch size, and the learning rate;
• that is, the larger the number of output neurons in the upper layer, the larger the batch size, and the higher the learning rate, the larger the preset precision T r.
• after the controller unit 102 acquires the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision, it performs a calculation on the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision according to a first preset formula to obtain the gradient update precision T, wherein the first preset formula may be:
• the adjusting, by the controller unit 102, of the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision includes:
• the controller unit 102 keeps the input neuron precision S x(l) and the weight precision S w(l) unchanged, and increases the output neuron gradient precision;
• here, increasing the output neuron gradient precision refers to reducing the fractional part bit width s1 of the fixed-point data format representing the output neuron gradient.
• optionally, the controller unit 102 reduces, according to the value of Tr-T and by a first preset step size N1, the fractional part bit width s1 of the fixed-point data format representing the output neuron gradient.
• specifically, the controller unit 102 reduces the bit width by N1 bits each time, that is, the fractional part bit width becomes s1-N1, obtains the new output neuron gradient precision, and determines, according to the above preset formula, whether the absolute value of the difference between the gradient update precision T and the preset precision Tr becomes smaller; when it is determined that the absolute value becomes smaller, the controller unit 102 continues to reduce the fractional part bit width of the fixed-point data format representing the output neuron gradient by N1, that is, the bit width becomes s1-2*N1, and again obtains the new output neuron gradient precision; the processing continues in this way as long as the absolute value keeps becoming smaller.
• if, in the n-th processing, the absolute value of the difference between the gradient update precision T and the preset precision Tr becomes larger, the controller unit 102 uses the bit width obtained in the (n-1)-th processing, that is, s1-(n-1)*N1, as the fractional part bit width of the fixed-point data format representing the output neuron gradient, and uses the corresponding output neuron gradient precision after the reduction of the fractional part bit width. A loop sketch of this procedure is given below.
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
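• A loop sketch of the step-size-N1 adjustment described above. The first preset formula for the gradient update precision T is not reproduced in this text, so it is abstracted here as a callable compute_T; all names and the example formula at the end are illustrative assumptions, not the patent's formula.

```python
def tune_fractional_bits(s1, compute_T, T_r, N1=1, min_bits=1):
    """Shrink the fractional bit width s1 of the format that represents the
    output neuron gradient by N1 bits at a time while |T - T_r| keeps getting
    smaller; keep the previous width once the difference starts to grow.
    `compute_T(s1)` stands in for the first preset formula, which is not
    reproduced in this text."""
    best_gap = abs(compute_T(s1) - T_r)
    while s1 - N1 >= min_bits:
        candidate = s1 - N1
        gap = abs(compute_T(candidate) - T_r)
        if gap >= best_gap:           # difference grew: keep previous width
            break
        s1, best_gap = candidate, gap
    return s1

# Purely illustrative stand-in for the preset formula (not the patent's formula):
example_T = lambda s1: 2.0 ** -s1
print(tune_fractional_bits(s1=10, compute_T=example_T, T_r=0.01, N1=2))  # -> 6
```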
• optionally, the controller unit 102 reduces the fractional part bit width of the fixed-point data format representing the output neuron gradient in a 2-fold decreasing manner.
• for example, if the fractional part bit width of the fixed-point data format representing the output neuron gradient is 4, that is, the precision is 2^(-4), then after reducing the bit width in a 2-fold decreasing manner
• the fractional part bit width is 2, that is, the reduced output neuron gradient precision is 2^(-2).
• optionally, after the controller unit 102 determines the total reduction amount b of the fractional part bit width of the fixed-point data format representing the output neuron gradient, the controller unit 102 reduces the fractional part bit width of the fixed-point data format in multiple steps, for example by b1 in a first reduction and by b2 in a second reduction;
• the above b1 and b2 may be the same or different.
• optionally, when increasing the output neuron gradient precision, the controller unit 102 also reduces the bit width of the fixed-point data format representing the output neuron gradient.
• further, after the bit width of the fixed-point data format is reduced, the bit width of the integer part remains unchanged, that is, the reduction of the overall bit width is the same as the reduction of the fractional part bit width, thereby ensuring that the maximum value that the fixed-point data format can represent does not change while the bit width of the fractional part is changed.
• for example, the bit width of the fixed-point data format is 9, where the bit width of the sign bit is 1, the bit width of the integer part is 5, and the bit width of the fractional part is 3; after the controller unit 102 reduces the bit widths of the fractional part and of the overall format, the bit width of the fractional part is 2 and the bit width of the integer part is still 5, that is, the bit width of the fractional part is reduced while the bit width of the integer part remains unchanged.
• after the controller unit 102 increases the output neuron gradient precision as described above, the controller unit 102 is further configured to:
• determine whether precision redundancy still exists, that is, whether the output neuron gradient precision is still less than the required precision;
• the reason is that, when the output neuron gradient precision is less than the required precision, precision redundancy exists, which increases the computational overhead and wastes computing resources; therefore, in order to reduce the computational overhead and avoid wasting computing resources, the output neuron gradient precision needs to be increased;
• when it is determined that the output neuron gradient precision is less than the required precision, the controller unit 102 reduces the bit width of the fixed-point data format representing the output neuron gradient, so as to increase the output neuron gradient precision and reduce the precision redundancy.
  • controller unit 102 reduces the bit width of the above-mentioned fixed point data format, specifically, the bit width of the integer part of the fixed point data format.
  • controller unit 102 reduces the bit width of the fixed point data format indicating the output neuron gradient, including:
  • the controller unit 102 reduces the bit width of the fixed point data format indicating the output neuron gradient according to the second preset step size N2, wherein the second preset step size N2 may be 1, 2, 3, 4, 5, 7, 8 or other positive integers.
  • the reduction value of the controller unit 102 each time reducing the bit width of the fixed point data format is the second preset step size N2.
  • the controller unit 102 reduces the bit width of the fixed-point data format indicating the output neuron gradient, including:
  • the controller unit 102 reduces the bit width of the fixed point data format indicating the output neuron gradient described above in a 2x decreasing manner.
• for example, the bit width of the fixed-point data format excluding the sign bit is 8; after the bit width of the fixed-point data format is reduced in a 2-fold decreasing manner, the bit width of the fixed-point data format excluding the sign bit is 4.
• in an optional embodiment, the controller unit 102 adjusts the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision in one of the following ways:
• the controller unit 102 increases the input neuron precision S x(l) and/or the output neuron gradient precision while keeping the weight precision S w(l) unchanged; or,
• the controller unit 102 increases the input neuron precision S x(l) and decreases the output neuron gradient precision while keeping the weight precision S w(l) unchanged, where the increase of the input neuron precision S x(l) is greater than the decrease of the output neuron gradient precision; or,
• the controller unit 102 decreases the output neuron gradient precision and increases the input neuron precision S x(l) while keeping the weight precision S w(l) unchanged, where the decrease of the output neuron gradient precision is smaller than the increase of the input neuron precision S x(l); or,
• the controller unit 102 increases or decreases one of, or any combination of, the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision, so that the absolute value of the difference between the gradient update precision T and the preset precision T r is minimized.
• for the specific process by which the controller unit 102 decreases the weight precision S w(l), the input neuron precision S x(l), and the output neuron gradient precision, reference may be made to the related operations by which the controller unit 102 increases the weight precision S w(l), the input neuron precision S x(l), and the output neuron gradient precision, which are not described again here.
• during the operation, the operation unit 103 represents the L-th layer input neurons, weights, and output neuron gradients in the fixed-point data format according to the adjusted input neuron precision S x(l), weight precision S w(l), and output neuron gradient precision, and then performs the subsequent operations.
  • the frequency at which the controller unit 102 calculates the gradient update precision T can be flexibly set according to requirements.
  • the controller unit 102 can adjust the frequency of calculating the gradient update precision T according to the number of training iterations in the neural network training process.
• the controller unit 102 recalculates the gradient update precision T every iteration in the neural network training process; or recalculates the gradient update precision T every preset number of iterations; or sets the above frequency according to the change of the gradient update precision T.
  • the controller unit 102 sets a frequency for calculating the gradient update precision T according to the number of training iterations in the neural network training.
• the operation unit 103 is configured to represent the L-th layer input neurons and weights according to the increased or decreased input neuron precision S x(l) and weight precision S w(l), and to represent the L-th layer output neuron gradient obtained by the operation according to the increased or decreased output neuron gradient precision.
• that is, the operation unit is configured to use the fixed-point data format with the increased or decreased input neuron precision S x(l) to represent the L-th layer input neurons, use the fixed-point data format with the increased or decreased weight precision S w(l) to represent the L-th layer weights, and use the fixed-point data format with the increased or decreased output neuron gradient precision to represent the L-th layer output neuron gradient, for the subsequent operations.
  • FIG. 1C is a schematic flowchart of a neural network operation method according to an embodiment of the present invention. As shown in FIG. 1C, the method includes:
  • the neural network operation module acquires the L-th input neuron precision, the weight precision, and the output neuron gradient precision of the neural network.
• the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision may all be the same, or some of them may be the same, or they may be pairwise unequal.
  • the above neural network is a multi-layer neural network
  • the L-th layer input neuron precision S x(l) , weight precision S w(l), and output neuron gradient accuracy
• the neural network operation module acquires the input neurons, weights, and output neuron gradients of the L-th layer, and obtains the L-th layer input neuron precision S x(l), weight precision S w(l), and output neuron gradient precision according to the input neurons, weights, and output neuron gradients of the L-th layer.
  • the neural network operation module calculates the gradient update precision T according to the L-th input neuron precision, the weight precision, and the output neuron gradient precision.
• the neural network operation module performs a calculation on the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision according to the first preset formula to obtain the above gradient update precision T,
• where the first preset formula is
• the neural network operation module adjusts the L-th layer input neuron precision, weight precision, and output neuron gradient precision so that the absolute value of the difference between the gradient update precision T and the preset precision T r is minimized.
  • the fixed-point data format for representing the input neuron and the bit width of the fixed-point data format for indicating the weight are the first bit width
  • the bit width of the fixed-point data format for indicating the output neuron gradient is the second bit width
  • the second bit width is greater than the first bit width.
  • the second bit width is twice the width of the first bit width to facilitate processing by an electronic computer.
  • first bit width may be selected as 8 bits
  • second bit width may be selected as 16 bits
• the preset precision T r may be set in advance empirically; T r matching the input parameters may also be obtained according to a second preset formula by changing the input parameters; T r may also be obtained by machine learning.
  • the neural network operation module sets the preset precision T r according to the learning rate and the batchsize (the number of samples in the batch processing).
• optionally, the preset precision T r is set according to the number of output neurons of the upper layer, the batch size, and the learning rate; that is, the larger the number of output neurons in the upper layer, the larger the batch size, and the higher the learning rate, the larger the preset precision T r.
• the adjusting, by the neural network operation module, of the above input neuron precision S x(l), weight precision S w(l), and output neuron gradient precision includes:
• the neural network operation module keeps the input neuron precision S x(l) and the weight precision S w(l) unchanged, and increases the output neuron gradient precision;
• here, increasing the output neuron gradient precision refers to reducing the fractional part bit width s1 of the fixed-point data format representing the output neuron gradient.
• optionally, the neural network operation module reduces, according to the value of Tr-T and by the first preset step size N1, the fractional part bit width s1 of the fixed-point data format representing the output neuron gradient.
• specifically, the neural network operation module reduces the bit width by N1 bits each time, that is, the fractional part bit width becomes s1-N1, obtains the new output neuron gradient precision, and determines, according to the above preset formula, whether the absolute value of the difference between the gradient update precision T and the preset precision Tr becomes smaller; when it is determined that the absolute value becomes smaller, the neural network operation module continues to reduce the fractional part bit width of the fixed-point data format representing the output neuron gradient by N1, that is, the bit width becomes s1-2*N1, and again obtains the new output neuron gradient precision; the processing continues in this way as long as the absolute value keeps becoming smaller.
• if, in the n-th processing, the absolute value of the difference between the gradient update precision T and the preset precision Tr becomes larger, the neural network operation module uses the bit width obtained in the (n-1)-th processing, that is, s1-(n-1)*N1, as the fractional part bit width of the fixed-point data format representing the output neuron gradient, and uses the corresponding output neuron gradient precision after the reduction of the fractional part bit width.
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
• optionally, the neural network operation module reduces the fractional part bit width of the fixed-point data format representing the output neuron gradient in a 2-fold decreasing manner.
• for example, if the fractional part bit width of the fixed-point data format representing the output neuron gradient is 4, that is, the precision is 2^(-4), then after reducing the bit width in a 2-fold decreasing manner
• the fractional part bit width is 2, that is, the reduced output neuron gradient precision is 2^(-2).
• optionally, after the neural network operation module determines the total reduction amount b of the fractional part bit width of the fixed-point data format representing the output neuron gradient,
• the neural network operation module reduces the fractional part bit width of the fixed-point data format in multiple steps,
• for example by b1 in a first reduction and by b2 in a second reduction;
• the above b1 and b2 may be the same or different.
• optionally, when increasing the output neuron gradient precision, the neural network operation module also reduces the bit width of the fixed-point data format representing the output neuron gradient.
• further, after the bit width of the fixed-point data format is reduced, the bit width of the integer part remains unchanged, that is, the reduction of the overall bit width is the same as the reduction of the fractional part bit width, thereby ensuring that the maximum value that the fixed-point data format can represent does not change while the bit width of the fractional part is changed.
• for example, the bit width of the fixed-point data format is 9, where the bit width of the sign bit is 1, the bit width of the integer part is 5, and the bit width of the fractional part is 3;
• after the neural network operation module reduces the bit widths of the fractional part and of the overall format, the bit width of the fractional part is 2 and the bit width of the integer part is still 5, that is, the bit width of the fractional part is reduced while the bit width of the integer part remains unchanged.
• after the neural network operation module increases the output neuron gradient precision as described above, the neural network operation module is further used to:
• determine whether precision redundancy still exists, that is, whether the output neuron gradient precision is still less than the required precision;
• the reason is that, when the output neuron gradient precision is less than the required precision, precision redundancy exists, which increases the computational overhead and wastes computing resources; therefore, in order to reduce the computational overhead and avoid wasting computing resources, the output neuron gradient precision needs to be increased;
• when it is determined that the output neuron gradient precision is less than the required precision, the neural network operation module reduces the bit width of the fixed-point data format representing the output neuron gradient, so as to increase the output neuron gradient precision and reduce the precision redundancy.
  • the above-mentioned neural network operation module reduces the bit width of the above fixed-point data format, specifically, the bit width of the integer part of the fixed-point data format.
  • the neural network operation module reduces the bit width of the fixed point data format indicating the output neuron gradient, including:
  • the neural network operation module reduces the bit width of the fixed point data format indicating the output neuron gradient according to the second preset step size N2, wherein the second preset step size N2 may be 1, 2, 3, 4, 5, 7, 8 or other positive integers.
  • the reduction value of the neural network operation module when reducing the bit width of the fixed point data format is the second preset step size N2.
  • the neural network operation module reduces the bit width of the fixed-point data format indicating the output neuron gradient, including:
  • the neural network operation module reduces the bit width of the fixed-point data format indicating the output neuron gradient described above in a 2x decreasing manner.
• for example, the bit width of the fixed-point data format excluding the sign bit is 8; after the bit width of the fixed-point data format is reduced in a 2-fold decreasing manner, the bit width of the fixed-point data format excluding the sign bit is 4.
• in an optional embodiment, the neural network operation module adjusts the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision in one of the following ways:
• the neural network operation module increases the input neuron precision S x(l) and/or the output neuron gradient precision while keeping the weight precision S w(l) unchanged; or,
• the neural network operation module increases the input neuron precision S x(l) and decreases the output neuron gradient precision while keeping the weight precision S w(l) unchanged, where the increase of the input neuron precision S x(l) is greater than the decrease of the output neuron gradient precision; or,
• the neural network operation module decreases the output neuron gradient precision and increases the input neuron precision S x(l) while keeping the weight precision S w(l) unchanged, where the decrease of the output neuron gradient precision is smaller than the increase of the input neuron precision S x(l); or,
• the neural network operation module increases or decreases one of, or any combination of, the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision, so that the absolute value of the difference between the gradient update precision T and the preset precision T r is minimized.
• for the specific process by which the neural network operation module decreases the weight precision S w(l), the input neuron precision S x(l), and the output neuron gradient precision, reference may be made to the related operations by which the neural network operation module increases the weight precision S w(l), the input neuron precision S x(l), and the output neuron gradient precision, which are not described again here.
• the neural network operation module represents the input neurons and weights of the L-th layer according to the adjusted input neuron precision and weight precision, and represents the L-th layer output neuron gradient obtained by the operation according to the adjusted output neuron gradient precision, for the subsequent operations.
• that is, the operation unit is configured to use the fixed-point data format with the increased or decreased input neuron precision S x(l) to represent the L-th layer input neurons, use the fixed-point data format with the increased or decreased weight precision S w(l) to represent the L-th layer weights, and use the fixed-point data format with the increased or decreased output neuron gradient precision to represent the L-th layer output neuron gradient, for the subsequent operations.
• during the operation, the above neural network operation module recalculates the gradient update precision T; when the gradient update precision T is no longer greater than the preset precision T r, the above neural network operation module decreases the input neuron precision S x(l), the weight precision S w(l), and the output neuron gradient precision in the manner described above with reference to step S203.
  • the frequency of calculating the gradient update precision T by the neural network operation module may be flexibly set according to requirements.
  • the neural network operation module may adjust the frequency of the gradient update precision T according to the number of training iterations in the neural network training process.
  • the neural network operation module recalculates the gradient update precision T every iteration in the neural network training process; or recalculates the gradient update precision T every preset number of iterations; or updates the accuracy according to the gradient
  • the change in T is set to the above frequency.
  • the neural network operation module is configured to calculate a frequency for calculating the gradient update precision T according to the number of training iterations in the neural network training.
• the solution of the embodiment of the present invention dynamically adjusts the input neuron precision S x, the weight precision S w, and the output neuron gradient precision during the neural network operation, so as to reduce precision redundancy, reduce the computational overhead, and avoid wasting computing resources.
• Training calculations are the basis of neural network applications. Training calculations, which are also called pre-training or pre-processing of models, usually require special equipment such as a data center, which makes reducing the amount of computation in training calculations the key to applying training calculations to common devices (such as personal computers and terminal devices).
  • data can be represented and computed in a fixed-point data format.
• the data of the Lth layer includes the input neuron X(l), the output neuron Y(l), and the weight W(l).
• the data of the Lth layer also includes the input neuron gradient, the output neuron gradient, and the weight gradient.
• the above data can be expressed in floating-point numbers or in fixed-point numbers.
  • a fixed-point number is a data format that can specify the position of a decimal point.
  • the bit width of a 16-bit fixed point number is 16.
• for a fixed-point data format of a given bit width, the precision and the range of numbers that can be represented are related: the higher the precision that can be expressed, the smaller the range of numbers that can be represented.
  • the first bit is a sign bit
• the integer part occupies x bits
• the fractional part occupies s bits
• the maximum fixed-point precision S that the fixed-point data format can represent is 2^(-s).
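• The sketch below is illustrative only (it is not the patent's implementation): it quantizes a float into the sign/integer/fraction layout described above and back, showing that the representable precision is 2^(-s). The names to_fixed and from_fixed are hypothetical.

```python
def to_fixed(value: float, x: int, s: int) -> int:
    """Quantize a float to a signed fixed-point integer with x integer bits
    and s fractional bits (total width 1 + x + s)."""
    scaled = round(value * (1 << s))                 # shift the decimal point by s bits
    max_int = (1 << (x + s)) - 1                     # largest representable magnitude
    return max(-max_int - 1, min(max_int, scaled))   # saturate on overflow

def from_fixed(fixed: int, s: int) -> float:
    """Convert the fixed-point integer back to a float."""
    return fixed / (1 << s)

# Example: a 16-bit fixed-point format (1 sign bit, x = 10, s = 5) has precision 2**-5.
q = to_fixed(3.14159, x=10, s=5)
print(q, from_fixed(q, s=5))   # 101  3.15625 -> representation error bounded by 2**-5
```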
  • the training process in the neural network usually includes two steps of forward operation and reverse operation.
• In the reverse operation, the precision required by the input neuron gradient, the weight gradient, and the output neuron gradient may change, and may increase as the training process proceeds; if the precision of the fixed-point numbers is not sufficient, the operation result will contain a large error, and training may even fail.
  • the purpose of adjusting the accuracy of the data is achieved by adjusting the bit width of the above data.
• the bit width of the fractional part in the fixed-point data format can be increased, that is, the s in FIG. 1A is increased, thereby increasing the precision of the above fixed-point data format; since the total bit width of the fixed-point data format is fixed, when the bit width of the fractional part is increased, the bit width of the integer part is reduced, so the data range that the fixed-point data format can represent is reduced.
• Alternatively, the bit width of the fixed-point data format can be enlarged. Since the bit width of the fractional part is kept constant, increasing the bit width of the fixed-point data format can be regarded as increasing the bit width of its integer part, thereby expanding the range of data that the fixed-point data format can represent.
  • FIG. 1B is a schematic structural diagram of a neural network operation module according to an embodiment of the present invention.
  • the neural network computing module is used to perform operations on a multi-layer neural network.
  • the neural network operation module 100 includes:
  • the storage unit 101 is configured to store input neuron precision, weight precision, and output neuron gradient accuracy.
  • the controller unit 102 is configured to acquire, from the storage unit 101, the input neuron precision S x(l) , the weight precision S w(l), and the output neuron gradient precision of the L-th layer of the multi-layer neural network.
• where L is an integer greater than 0; to obtain a gradient update precision T according to the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision; and to adjust the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision when the gradient update precision T is greater than the preset precision T_r.
• the storage unit 101 is further configured to store input neurons, weights, output neurons, and output neuron gradients; the controller unit 102 obtains the Lth-layer input neurons, weights, and output neuron gradients from the storage unit 101, and then obtains the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision according to the Lth-layer input neurons, weights, and output neuron gradients.
• the bit width of the fixed-point data format used to represent the input neurons and the bit width of the fixed-point data format used to represent the weights are a first bit width; the bit width of the fixed-point data format used to represent the output neuron gradient is a second bit width.
  • the second bit width is greater than the first bit width.
  • the second bit width is twice the width of the first bit width to facilitate processing by an electronic computer.
  • first bit width may be selected as 8 bits
  • second bit width may be selected as 16 bits
• the preset precision T_r may be set empirically in advance by the controller unit 102; it may also be obtained by a second preset formula whose input parameters are varied so that the resulting T_r matches the input parameters; or T_r may be obtained by machine learning.
  • the controller unit 102 sets the preset precision T r according to the learning rate and the batchsize (the number of samples in the batch processing).
• alternatively, the controller unit 102 sets the preset precision T_r according to the number of output neurons in the previous layer, the batchsize, and the learning rate; that is, the larger the number of output neurons in the previous layer, the larger the batchsize, and the higher the learning rate, the larger the preset precision T_r.
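• The text only states that T_r grows with the number of output neurons of the previous layer, the batchsize, and the learning rate; it does not give a formula. The function below is a hypothetical monotone heuristic written only to illustrate that relationship, under assumed names and constants.

```python
import math

def preset_precision_tr(prev_layer_outputs: int, batchsize: int, learning_rate: float,
                        base: float = 1e-3) -> float:
    # Any combination that increases with each argument satisfies the stated property;
    # the logarithms and the `base` constant here are illustrative assumptions.
    return base * math.log2(1 + prev_layer_outputs) * math.log2(1 + batchsize) * learning_rate

print(preset_precision_tr(prev_layer_outputs=256, batchsize=64, learning_rate=0.01))
```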
• after acquiring the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision, the controller unit 102 performs a calculation on them according to a first preset formula to obtain the gradient update precision T.
• adjusting the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision by the controller unit 102 includes:
• the controller unit 102 keeps the input neuron precision S_x(l) and the weight precision S_w(l) unchanged, and reduces the output neuron gradient precision.
• reducing the output neuron gradient precision by the controller unit 102 refers to increasing the fractional-part bit width s1 of the fixed-point data format representing the output neuron gradient.
• specifically, the controller unit 102 increases the fractional-part bit width s1 of the fixed-point data format representing the output neuron gradient by a first preset step size N1, according to the value of T_r - T.
• the controller unit 102 increases the fractional-part bit width by N1 each time: after the first increase the fractional-part bit width is s1+N1, and the corresponding output neuron gradient precision is obtained; if the absolute value of the difference between the gradient update precision T and the preset precision T_r becomes smaller, the controller unit 102 continues to increase the fractional-part bit width of the fixed-point data format representing the output neuron gradient by N1, that is, to s1+2*N1, and so on; if, at the nth adjustment, the absolute value of the difference becomes larger, the controller unit 102 uses the bit width obtained by the (n-1)th adjustment, that is, s1+(n-1)*N1, as the fractional-part bit width of the fixed-point data format representing the output neuron gradient, and the output neuron gradient precision after increasing the fractional-part bit width is obtained accordingly.
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
• alternatively, the controller unit 102 increases the fractional-part bit width of the fixed-point data format representing the output neuron gradient in a doubling (2x increment) manner.
• for example, if the fractional-part bit width of the fixed-point data format representing the output neuron gradient is 3, that is, the output neuron gradient precision is 2^(-3), then after a doubling increase the fractional-part bit width is 6, that is, the reduced output neuron gradient precision is 2^(-6); a minimal sketch of this relation follows.
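• A minimal arithmetic sketch of the relation just described, assuming precision = 2^(-s1):

```python
s1 = 3
print(2.0 ** -s1)   # 0.125    -> output neuron gradient precision 2**-3
s1 *= 2             # doubling increment: the fractional part now has 6 bits
print(2.0 ** -s1)   # 0.015625 -> reduced output neuron gradient precision 2**-6
```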
• in one possible embodiment, after the controller unit 102 determines the total increase amount b of the fractional-part bit width of the fixed-point data format representing the output neuron gradient, the controller unit 102 increases the fractional-part bit width in multiple steps, for example first by b1 and then by b2, where b1 and b2 may be the same or different.
• optionally, when the controller unit 102 reduces the output neuron gradient precision, it also increases the bit width of the fixed-point data format representing the output neuron gradient; after the bit width of the fixed-point data format increases, the bit width of the integer part remains unchanged, that is, the increase in the total bit width is the same as the increase in the fractional-part bit width.
• for example, suppose the bit width of the fixed-point data format is 9, of which the sign bit occupies 1 bit, the integer part occupies 5 bits, and the fractional part occupies 3 bits; after the controller unit 102 increases the fractional-part bit width to 6, the integer part still occupies 5 bits, that is, the fractional-part bit width is increased while the integer-part bit width remains unchanged (see the sketch below).
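• A minimal sketch of the bit-width bookkeeping in the example above (the variable names are illustrative): widening the fractional part while keeping the integer part fixed widens the whole format by the same amount.

```python
sign, integer, frac = 1, 5, 3          # total width 9, precision 2**-3
frac += 3                              # fractional part grows from 3 to 6 bits
total = sign + integer + frac
print(total, 2.0 ** -frac)             # 12 bits, precision 2**-6; integer part still 5 bits
```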
• in one possible embodiment, after reducing the output neuron gradient precision, the controller unit 102 is further configured to increase the bit width of the fixed-point data format representing the output neuron gradient. When the controller unit 102 reduces the output neuron gradient precision, the range of data that the fixed-point data format representing the output neuron gradient can express is reduced; therefore, after reducing the output neuron gradient precision, the controller unit 102 judges whether the output neuron gradient overflows when expressed in that fixed-point data format, and when overflow is determined, the controller unit 102 increases the bit width of the fixed-point data format, thereby expanding the range of data it can represent so that the output neuron gradient does not overflow.
  • controller unit 102 increases the bit width of the above-mentioned fixed point data format, specifically, increases the bit width of the integer part of the fixed point data format.
  • controller unit 102 increases the bit width of the fixed point data format indicating the output neuron gradient, including:
  • the controller unit 102 increases the bit width of the fixed point data format indicating the output neuron gradient according to the second preset step size N2, wherein the second preset step size N2 may be 1, 2, 3, 4, 5, 7, 8 or other positive integers.
  • the controller unit 102 increases the bit width of the fixed point data format each time by the second preset step size N2.
  • the controller unit 102 increases the bit width of the fixed-point data format indicating the output neuron gradient, including:
  • the controller unit 102 increases the bit width of the fixed-point data format indicating the output neuron gradient described above in a double-increasing manner.
• for example, if the bit width of the fixed-point data format excluding the sign bit is 16, then after the bit width is increased in a doubling manner, the bit width of the fixed-point data format excluding the sign bit becomes 32.
• in one possible embodiment, the controller unit 102 adjusts the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision in any of the following ways:
• the controller unit 102 reduces the input neuron precision S_x(l) and/or the output neuron gradient precision while keeping the weight precision S_w(l) unchanged; or
• the controller unit 102 reduces the input neuron precision S_x(l) and increases the output neuron gradient precision while keeping the weight precision S_w(l) unchanged, where the magnitude by which the input neuron precision S_x(l) is reduced is greater than the magnitude by which the output neuron gradient precision is increased; or
• the controller unit 102 increases the output neuron gradient precision and reduces the input neuron precision S_x(l) while keeping the weight precision S_w(l) unchanged, where the magnitude by which the output neuron gradient precision is increased is less than the magnitude by which the input neuron precision S_x(l) is reduced; or
• the controller unit 102 increases or decreases one or any combination of the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision, so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized.
• for the specific process by which the controller unit 102 reduces or otherwise adjusts the weight precision S_w(l), the input neuron precision S_x(l), and the output neuron gradient precision, reference may be made to the above-described process by which the controller unit 102 increases the weight precision S_w(l), the input neuron precision S_x(l), and the output neuron gradient precision; the related operations are not described again here.
• during the operation, the operation unit 103 expresses the input neurons, weights, and output neuron gradients of the Lth layer in fixed-point data formats according to the adjusted input neuron precision S_x(l), weight precision S_w(l), and output neuron gradient precision, and then performs subsequent operations.
  • the frequency at which the controller unit 102 calculates the gradient update precision T can be flexibly set according to requirements.
  • the controller unit 102 can adjust the frequency of calculating the gradient update precision T according to the number of training iterations in the neural network training process.
• specifically, the controller unit 102 recalculates the gradient update precision T every iteration, or recalculates the gradient update precision T every preset number of iterations, or sets the above frequency according to the change in the gradient update precision T.
  • the controller unit 102 sets a frequency for calculating the gradient update precision T according to the number of training iterations in the neural network training.
• the operation unit 103 is configured to represent the input neurons and weights of the Lth layer according to the increased or decreased input neuron precision S_x(l) and weight precision S_w(l), and to represent the Lth-layer output neuron gradient obtained by the operation according to the increased or decreased output neuron gradient precision.
• specifically, the operation unit 103 is configured to represent the Lth-layer input neurons in a fixed-point data format with the increased or decreased input neuron precision S_x(l), to represent the Lth-layer weights in a fixed-point data format with the increased or decreased weight precision S_w(l), and to represent the Lth-layer output neuron gradient in a fixed-point data format with the increased or decreased output neuron gradient precision, for subsequent operations.
  • the controller unit 102 acquires an Lth layer output neuron gradient of the multi-layer neural network.
• specifically, the controller unit 102 acquires the output neurons of the Lth layer and the output neurons of the (L-1)th layer, and then obtains the Lth-layer output neuron gradient according to the output neurons of the Lth layer and those of the (L-1)th layer.
• the controller unit 102 obtains the proportion data a, that is, the proportion of the Lth-layer output neuron gradients whose absolute value is smaller than a first preset threshold.
• the first preset threshold may be 0, 0.01, 0.05, 0.1, 0.12, or other values.
  • the above ratio data may be 50%, 60%, 65%, 70%, 80%, 85%, 90% or other values.
  • the above ratio data is 80%.
• the controller unit 102 then reduces the Lth-layer output neuron gradient precision.
• when the controller unit 102 reduces the Lth-layer output neuron gradient precision, it increases the bit width of the fixed-point data format representing the Lth-layer output neuron gradient.
• after the controller unit 102 reduces the Lth-layer output neuron gradient precision, the controller unit 102 is further configured to increase the bit width of the fixed-point data format representing the Lth-layer output neuron gradient.
  • the controller unit 102 increases the bit width of the fixed point data format indicating the gradient of the Lth layer output neuron, including:
  • the controller unit 102 increases the bit width of the fixed point data format indicating the gradient of the Lth layer output neuron according to the third preset step size N3.
  • the controller unit 102 increases the bit width of the fixed point data format indicating the gradient of the Lth layer output neuron, including:
• the controller unit 102 increases the bit width of the fixed-point data format representing the Lth-layer output neuron gradient in a doubling (2x increment) manner.
• for the specific process by which the controller unit 102 reduces the output neuron gradient precision, see the related description above; it is not repeated here.
• during the operation, the operation unit 103 expresses the Lth-layer output neuron gradient in fixed-point form according to the adjusted output neuron gradient precision, and then performs subsequent operations.
  • the error of the output neuron is reduced, thereby ensuring normal training.
  • FIG. 1C is a schematic flowchart of a neural network operation method according to an embodiment of the present invention. As shown in FIG. 1C, the method includes:
  • the neural network operation module acquires the L-th input neuron precision, the weight precision, and the output neuron gradient precision of the neural network.
• the values of the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision may all be the same, partially the same, or pairwise different.
  • the above neural network is a multi-layer neural network
• specifically, the neural network operation module acquires the input neurons, weights, and output neurons of the Lth layer, and obtains the Lth-layer input neuron precision S_x(l), weight precision S_w(l), and output neuron gradient precision according to the input neurons, weights, and output neurons of the Lth layer.
  • the neural network operation module calculates the gradient update precision T according to the L-th input neuron precision, the weight precision, and the output neuron gradient precision.
• specifically, the neural network operation module performs a calculation on the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision according to the first preset formula to obtain the gradient update precision T.
  • the first preset formula is
• the neural network operation module adjusts the Lth-layer input neuron precision, weight precision, and output neuron gradient precision so that the absolute value of the difference between the gradient update precision T and the preset precision T_r is minimized; a minimal sketch of this adjustment loop follows.
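• A minimal sketch of the adjustment loop described above, written under two assumptions: the first preset formula (not reproduced in this text) is replaced by a placeholder that simply takes the largest of the three precision values, and the first preset step size N1 is 1. It only illustrates the control flow: recompute T and add fractional bits to the output neuron gradient format until T is no longer greater than T_r.

```python
def gradient_update_precision(s_x: float, s_w: float, s_grad: float) -> float:
    # Placeholder assumption for the first preset formula:
    # here simply the largest of the three precision values.
    return max(s_x, s_w, s_grad)

def adjust_output_gradient_precision(s_x, s_w, frac_bits_grad, t_r, n1=1, max_steps=8):
    # Reduce the output neuron gradient precision (add fractional bits, step N1)
    # until the gradient update precision T is no longer greater than T_r.
    s_grad = 2.0 ** -frac_bits_grad
    for _ in range(max_steps):
        t = gradient_update_precision(s_x, s_w, s_grad)
        if t <= t_r:
            break
        frac_bits_grad += n1
        s_grad = 2.0 ** -frac_bits_grad
    return s_grad, frac_bits_grad

print(adjust_output_gradient_precision(s_x=2**-8, s_w=2**-8, frac_bits_grad=3, t_r=2**-6))
# -> (0.015625, 6): the fractional part grows from 3 to 6 bits
```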
• the bit width of the fixed-point data format used to represent the input neurons and the bit width of the fixed-point data format used to represent the weights are a first bit width;
  • the bit width of the fixed-point data format for indicating the output neuron gradient is the second bit width
  • the second bit width is greater than the first bit width.
  • the second bit width is twice the width of the first bit width to facilitate processing by an electronic computer.
  • first bit width may be selected as 8 bits
  • second bit width may be selected as 16 bits
• the preset precision T_r may be set empirically in advance; it may also be obtained by a second preset formula whose input parameters are varied so that the resulting T_r matches the input parameters; or T_r may be obtained by machine learning.
  • the neural network operation module sets the preset precision T r according to the learning rate and the batchsize (the number of samples in the batch processing).
• alternatively, the preset precision T_r is set according to the number of output neurons in the previous layer, the batchsize, and the learning rate; that is, the larger the number of output neurons in the previous layer, the larger the batchsize, and the higher the learning rate, the larger the preset precision T_r.
• adjusting the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision by the neural network operation module includes:
• reducing the output neuron gradient precision by the neural network operation module refers to increasing the fractional-part bit width s1 of the fixed-point data format representing the output neuron gradient.
• specifically, the controller unit of the neural network operation module increases the fractional-part bit width s1 of the fixed-point data format representing the output neuron gradient by the first preset step size N1, according to the value of T_r - T.
• the neural network operation module increases the fractional-part bit width by N1 each time: after the first increase the fractional-part bit width is s1+N1, and the corresponding output neuron gradient precision is obtained; if the absolute value of the difference between the gradient update precision T and the preset precision T_r becomes smaller, processing continues in the same manner; if, at the nth adjustment, the absolute value of the difference between the gradient update precision T and the preset precision T_r becomes larger, the neural network operation module uses the bit width obtained at the (n-1)th adjustment, that is, s1+(n-1)*N1, as the fractional-part bit width of the fixed-point data format representing the output neuron gradient, and the output neuron gradient precision after increasing the fractional-part bit width is obtained accordingly.
  • the first preset step size N1 is 1, 2, 4, 6, 7, 8, or other positive integers.
• alternatively, the neural network operation module increases the fractional-part bit width of the fixed-point data format representing the output neuron gradient in a doubling (2x increment) manner.
• for example, if the fractional-part bit width of the fixed-point data format representing the output neuron gradient is 3, that is, the output neuron gradient precision is 2^(-3), then after the fractional-part bit width is increased in a doubling manner it becomes 6, that is, the reduced output neuron gradient precision is 2^(-6).
• in one possible embodiment, after the neural network operation module determines the total increase amount b of the fractional-part bit width of the fixed-point data format representing the output neuron gradient, the neural network operation module increases the fractional-part bit width in multiple steps, for example first by b1 and then by b2, where b1 and b2 may be the same or different.
• optionally, the neural network operation module also increases the bit width of the fixed-point data format representing the output neuron gradient. Reducing the output neuron gradient precision is achieved by increasing the fractional-part bit width of the fixed-point data format representing the output neuron gradient; since the total bit width of that format is otherwise constant, increasing the fractional-part bit width would reduce the integer-part bit width and therefore the range of data the format can represent. Therefore, after the neural network operation module reduces the output neuron gradient precision, it increases the total bit width of the fixed-point data format while keeping the integer-part bit width unchanged, that is, the increase in the total bit width is the same as the increase in the fractional-part bit width.
• for example, suppose the bit width of the fixed-point data format is 9, of which the sign bit occupies 1 bit, the integer part occupies 5 bits, and the fractional part occupies 3 bits; after the neural network operation module increases the fractional-part bit width to 6, the integer part still occupies 5 bits, that is, the fractional-part bit width is increased while the integer-part bit width remains unchanged.
• in one possible embodiment, after reducing the output neuron gradient precision, the neural network operation module is further configured to increase the bit width of the fixed-point data format representing the output neuron gradient. When the output neuron gradient precision is reduced, the range of data that the fixed-point data format representing the output neuron gradient can express is reduced; therefore, after reducing the output neuron gradient precision, the neural network operation module determines whether the output neuron gradient overflows when expressed in that fixed-point data format, and when overflow is determined, the neural network operation module increases the bit width of the fixed-point data format, thereby expanding the range of data it can represent so that the output neuron gradient does not overflow.
  • the above-mentioned neural network operation module increases the bit width of the above fixed-point data format, specifically, increases the bit width of the integer part of the fixed-point data format.
• increasing the bit width of the fixed-point data format representing the output neuron gradient by the neural network operation module includes:
  • the neural network operation module increases the bit width of the fixed point data format indicating the output neuron gradient according to the second preset step size N2, wherein the second preset step size N2 may be 1, 2, 3, 4, 5, 7, 8 or other positive integers.
  • the increase value of the neural network operation module when increasing the bit width of the fixed point data format is the second preset step size N2.
  • the neural network operation module increases the bit width of the fixed-point data format indicating the output neuron gradient, including:
  • the neural network operation module increases the bit width of the fixed-point data format indicating the output neuron gradient described above in a two-fold increment manner.
• for example, if the bit width of the fixed-point data format excluding the sign bit is 16, then after the bit width is increased in a doubling manner, the bit width of the fixed-point data format excluding the sign bit becomes 32.
  • the neural network operation module adjusts the input neuron precision S x(l) , the weight precision S w(l), and the output neuron gradient accuracy.
• when the neural network operation module reduces or otherwise adjusts the weight precision S_w(l), the input neuron precision S_x(l), or the output neuron gradient precision, the specific process of any such operation may refer to the above-described process by which the neural network operation module increases the weight precision S_w(l), the input neuron precision S_x(l), and the output neuron gradient precision; the related operations are not described again here.
• the neural network operation module represents the input neurons and weights of the Lth layer according to the adjusted input neuron precision and weight precision, and represents the Lth-layer output neuron gradient obtained by the operation according to the adjusted output neuron gradient precision, for use in subsequent operations.
• the above operation unit is configured to represent the Lth-layer input neurons in a fixed-point data format with the increased or decreased input neuron precision S_x(l), to represent the Lth-layer weights in a fixed-point data format with the increased or decreased weight precision S_w(l), and to represent the Lth-layer output neuron gradient in a fixed-point data format with the increased or decreased output neuron gradient precision, for subsequent operations.
• the above neural network operation module reduces the input neuron precision S_x(l), the weight precision S_w(l), and the output neuron gradient precision in the manner described above with reference to step S203.
  • the frequency of calculating the gradient update precision T by the neural network operation module may be flexibly set according to requirements.
  • the neural network operation module may adjust the frequency of the gradient update precision T according to the number of training iterations in the neural network training process.
• the neural network operation module recalculates the gradient update precision T every iteration in the neural network training process; or recalculates the gradient update precision T every preset number of iterations; or sets the above frequency according to the change in the gradient update precision T.
• the neural network operation module is configured to set the frequency at which the gradient update precision T is calculated according to the number of training iterations in the neural network training.
  • the input neuron precision S x , the weight precision S w and the output neuron gradient precision are dynamically adjusted in the neural network operation process.
• the error of the computation results and the computational overhead are thereby reduced, and computing resources are saved.
  • FIG. 1D is a schematic flowchart diagram of a neural network operation method according to an embodiment of the present invention. As shown in FIG. 1D, the method includes:
  • the neural network operation module acquires an Lth layer output neuron gradient.
• specifically, the neural network operation module acquires the output neurons of the Lth layer and the output neurons of the (L-1)th layer, and then obtains the Lth-layer output neuron gradient according to the output neurons of the Lth layer and those of the (L-1)th layer.
• the neural network operation module acquires the proportion data a, that is, the proportion of the Lth-layer output neuron gradients whose absolute value is less than a first preset threshold.
• the first preset threshold may be 0, 0.01, 0.05, 0.1, 0.12, or other values.
  • the above ratio data may be 50%, 60%, 65%, 70%, 80%, 85%, 90% or other values.
  • the above ratio data is 80%.
  • the neural network operation module reduces the precision of the L-th output neuron gradient.
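• An illustrative sketch of the proportion test described above; the 0.01 threshold and 80% ratio are example values taken from the text, and the function name is hypothetical.

```python
def small_gradient_ratio(grads, threshold=0.01):
    # proportion data a: share of gradients whose absolute value is below the threshold
    small = sum(1 for g in grads if abs(g) < threshold)
    return small / len(grads)

grads = [0.003, -0.0001, 0.002, 0.0002, -0.004]
a = small_gradient_ratio(grads)
print(a)            # 1.0
if a > 0.8:         # proportion exceeds the preset ratio (80% in the example above)
    reduce_precision = True   # here the module would reduce the output neuron gradient precision
```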
• when the neural network operation module reduces the Lth-layer output neuron gradient precision, it increases the bit width of the fixed-point data format representing the Lth-layer output neuron gradient.
• after the neural network operation module reduces the Lth-layer output neuron gradient precision, the neural network operation module is further configured to increase the bit width of the fixed-point data format representing the Lth-layer output neuron gradient.
• increasing the bit width of the fixed-point data format representing the Lth-layer output neuron gradient by the neural network operation module includes:
  • the bit width of the fixed point data format representing the gradient of the output layer of the Lth layer is increased according to the third preset step size N3.
• alternatively, increasing the bit width of the fixed-point data format representing the Lth-layer output neuron gradient by the neural network operation module includes:
  • the bit width of the fixed point data format representing the gradient of the output layer of the Lth layer is increased in a 2-fold increment manner.
• after the output neuron gradient precision is adjusted according to the above method, the neural network operation module, during the operation, expresses the Lth-layer output neuron gradient in a fixed-point data format according to the adjusted output neuron gradient precision, and then performs subsequent operations.
• in this way, the output neuron gradient precision is adjusted according to the output neuron gradient, which reduces the error of the output neurons and ensures normal training.
• Neural networks are also called artificial neural networks. Artificial neural networks are widely used in the fields of pattern recognition, image processing, function approximation, and optimization calculation. In recent years, multi-layer artificial neural networks have received increasing attention from academia and industry due to their high recognition accuracy and good parallelism. Artificial neural networks involve a variety of algorithms; the fully connected layer is an important algorithm in artificial neural networks and is widely used in various artificial neural network models.
• Existing neural network operations are performed on general-purpose processors, which only support floating-point data; since neural network operations, and training in particular, involve relatively complicated computations, the amount of computation is large and the memory requirements are high. Because existing neural network operations are based on floating-point data and have high memory requirements, existing schemes have high energy consumption and high cost.
• the electronic device may include various handheld devices having wireless communication capabilities, in-vehicle devices, wireless headsets, computing devices, or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and the like; the electronic device can be, for example, a smartphone, a tablet, a headset, or the like.
• the above-mentioned devices are collectively referred to as electronic devices.
• the above electronic device can be applied to the following (including but not limited to) scenarios: data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, camcorders, projectors, watches, headsets, mobile storage, wearable devices, and other electronic products; aircraft, ships, vehicles, and other means of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods, and other household appliances; as well as nuclear magnetic resonance apparatuses, B-ultrasound machines, electrocardiographs, and other medical equipment.
  • a neural network computing device for performing a neural network training calculation, the neural network training calculation comprising a neural network multi-layer training operation, wherein the multi-layer training operation includes an i-th layer, At least part of the data in the forward or reverse operation of the i-th layer is used for fixed-point data operations, and the above i is an integer greater than or equal to 1;
• the computing device includes: a controller unit 11, an operation unit 12, and a conversion unit 13, where the controller unit 11 is connected to the operation unit 12 and the conversion unit 13 (the conversion unit may be provided separately, or may be integrated in the controller unit or the operation unit);
• the i-th layer training operation includes the i-th layer forward operation and the i-th layer reverse operation;
  • the ith layer forward operation can include:
• the controller unit 11 is configured to acquire the i-th layer input neuron data, the i-th layer weight data, and the i-th layer forward calculation instruction. In an alternative, the controller unit may acquire the input neuron data and the calculation instruction through a data input/output unit, which may be one or more data I/O interfaces or I/O pins; the data input/output unit is used to read input neuron data or forward calculation instructions from an external device or external memory.
  • the foregoing forward calculation instructions include, but are not limited to, a convolution operation instruction, a matrix multiplication instruction, a vector multiplication instruction, an activation instruction, and the like.
  • the specific embodiment of the present application does not limit the specific expression or specific category of the above forward calculation instruction.
• the controller unit 11 is further configured to parse the i-th layer calculation instruction to obtain a plurality of forward operation instructions, to send the i-th layer input neuron data and the i-th layer weight data to the conversion unit 13, and to send the plurality of forward operation instructions to the operation unit 12;
• the conversion unit 13 is configured to perform conversion between floating-point and fixed-point types on all or part of the i-th layer input neuron data and the i-th layer weight data to obtain all-fixed-point data or mixed data, and to send the all-fixed-point data or mixed data to the operation unit, where the mixed data includes partial fixed-point data and partial floating-point data;
  • the operation unit 12 is configured to perform a fixed point operation on all the fixed point data or perform a mixed operation on the mixed data according to the plurality of forward operation instructions to obtain a forward output result of the i-th layer.
  • the ith layer inverse operation can include:
• the controller unit 11 is configured to acquire the i-th layer input neuron data, the i-th layer weight data, the i-th layer input neuron gradient, and the i-th layer reverse calculation instruction. In an alternative, the controller unit may acquire the input neuron data and the calculation instruction through a data input/output unit, which may be one or more data I/O interfaces or I/O pins; the data input/output unit is used to read input neuron data or reverse calculation instructions from an external device or external memory.
  • the foregoing reverse calculation instructions include, but are not limited to, a matrix multiplication instruction or a vector multiplication instruction, etc., and the specific embodiment of the present application does not limit the specific expression or specific category of the above reverse calculation instruction.
• the controller unit 11 is further configured to parse the i-th layer calculation instruction to obtain a plurality of reverse operation instructions, to send the i-th layer input neuron data, the i-th layer weight data, and the i-th layer input neuron gradient to the conversion unit 13, and to send the plurality of reverse operation instructions to the operation unit 12;
  • the converting unit 13 is configured to perform all or part of the i-th layer input neuron data, the i-th layer weight data, and the i-th layer input neuron gradient to perform floating point type and fixed point type conversion to obtain all fixed point data or mixed data.
• the operation unit 12 is configured to perform a fixed-point operation on the all-fixed-point data, or a mixed operation on the mixed data, according to the plurality of reverse operation instructions, to obtain the i-th layer weight gradient and the i-th layer output result gradient; the operation unit then updates the i-th layer weights with the i-th layer weight gradient.
  • the hybrid operation includes performing a fixed point operation on the partial fixed point data and a floating point operation on the partial floating point data.
• the technical solution provided by the present application provides a conversion unit which, when performing the i-th layer training operation of the neural network, can convert all or part of the input neuron data, the weight data, and the input neuron gradient into fixed-point data or mixed data. Compared with floating-point data, fixed-point data requires less storage space, so the training of the neural network can be realized with a smaller memory space; the computing device provided by the present application can therefore reduce the required memory capacity and reduce cost.
• in addition, since at least part of the operation is performed on fixed-point data, the amount of computation is reduced and the calculation is faster compared with operating on floating-point data.
• the training calculation in neural network training can be the training operation of one layer of the neural network, that is, the training operation of the i-th layer; for the other layers of the multi-layer neural network, a conventional training operation method can be used, or a training calculation method similar to the i-th layer training calculation method in this application can be adopted.
• in the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the computing device takes the output neurons calculated in the operation unit as the input neurons of the next layer for operation (or performs some operation, including but not limited to an activation operation, on those output neurons before using them as the input neurons of the next layer); at the same time, the computing device replaces the weights of the previous layer with the weights of the next layer.
• in the reverse operation, when the reverse operation of the next layer of the artificial neural network is completed, the computing device takes the output neuron gradient (that is, the output result gradient) calculated in the operation unit as the input neuron gradient of the previous layer for operation (or performs some operation on that output neuron gradient before using it as the input neuron gradient of the previous layer); at the same time, the computing device replaces the weights and input neuron data with the weights and input neuron data of the forward operation of the previous layer.
• it should be noted that the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons.
• the conversion unit 13 is specifically configured to convert part of the i-th layer input neuron data into partial fixed-point input neuron data and convert part of the i-th layer weight data into partial fixed-point weight data; the partial fixed-point input neuron data and the partial fixed-point weight data are sent to the operation unit, and the remaining partial input neuron data and partial weight data (the floating-point data on which no floating-point-to-fixed-point conversion was performed) are also sent to the operation unit;
• the operation unit is specifically configured to perform fixed-point operations on the partial fixed-point input neuron data and the partial fixed-point weight data to obtain a partial fixed-point forward output result, and to send the partial fixed-point forward output result to the conversion unit;
  • the conversion unit is specifically configured to perform the fixed-point and floating-point conversion of the partial fixed-point forward output result to obtain the first partial floating-point forward output result, and send the first partial floating-point forward output result to the arithmetic unit;
• the operation unit is specifically configured to perform a floating-point operation on the remaining partial input neuron data and partial weight data to obtain a second partial floating-point forward operation result, and to combine the first partial floating-point forward operation result and the second partial floating-point forward operation result to obtain the i-th layer forward output result; a rough sketch of this mixed forward pass follows.
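• A rough sketch (under assumed helper names, not the device's implementation) of the mixed forward pass just described: part of the operands is quantized and multiplied in fixed point, converted back to floating point, and merged with the part kept in floating point.

```python
def to_fixed(v: float, s: int) -> int:
    # quantize with s fractional bits
    return round(v * (1 << s))

def from_fixed(q: int, s: int) -> float:
    return q / (1 << s)

def mixed_dot(x, w, fixed_idx, s=8):
    # fixed-point partial result: products of quantized operands (2*s fractional bits each)
    fixed_acc = sum(to_fixed(x[i], s) * to_fixed(w[i], s) for i in fixed_idx)
    part1 = from_fixed(fixed_acc, 2 * s)          # fixed-point -> floating-point conversion
    # floating-point partial result on the remaining elements
    part2 = sum(x[i] * w[i] for i in range(len(x)) if i not in fixed_idx)
    return part1 + part2                          # combined forward output (one neuron)

print(mixed_dot([0.5, -1.25, 2.0], [0.1, 0.2, 0.3], fixed_idx={0, 1}))  # ~0.4
```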
• for the reverse operation, the conversion unit 13 is specifically configured to convert part of the i-th layer input neuron data into partial fixed-point input neuron data, convert part of the i-th layer weight data into partial fixed-point weight data, and convert part of the i-th layer input neuron gradient into a partial fixed-point input neuron gradient; the partial fixed-point input neuron data, partial fixed-point input neuron gradient, and partial fixed-point weight data are sent to the operation unit, and the remaining partial input neuron data, partial input neuron gradient, and partial weight data (the floating-point data on which no floating-point-to-fixed-point conversion was performed) are also sent to the operation unit;
• the operation unit is specifically configured to perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point input neuron data to obtain a partial i-th layer weight gradient, to perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point weight data to obtain a partial i-th layer output result gradient, and to send the partial i-th layer weight gradient and the partial i-th layer output result gradient to the conversion unit;
• the conversion unit is specifically configured to perform fixed-point-to-floating-point conversion on the partial i-th layer weight gradient and the partial i-th layer output result gradient to obtain a first partial i-th layer weight gradient and a first partial i-th layer output result gradient, and to send the first partial i-th layer weight gradient and the first partial i-th layer output result gradient to the operation unit;
• the operation unit is specifically configured to perform floating-point operations on the remaining partial input neuron gradient and partial input neuron data to obtain a second partial i-th layer weight gradient, and on the partial input neuron gradient and partial weight data to obtain a second partial i-th layer output result gradient; the first partial i-th layer weight gradient and the second partial i-th layer weight gradient are combined to obtain the i-th layer weight gradient, and the first partial i-th layer output result gradient and the second partial i-th layer output result gradient are combined to obtain the i-th layer output result gradient.
• the conversion unit 13 is specifically configured to determine the decimal point position (point) of the fixed-point representation of the floating-point data, where width is the bit width of the fixed-point number and maxabs is the maximum absolute value in the floating-point data to be converted, that is, the maximum absolute value among the elements of the i-th layer input neuron data and the i-th layer weight data; the point position is chosen as the minimum point value for which the fixed-point number can still represent a maximum value greater than maxabs.
• in the conversion between the two formats, int denotes the fixed-point value, float denotes the floating-point value, and point denotes the position of the fixed-point decimal point.
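• The exact point-position formula appears only in the source figure; the sketch below uses a common choice that is consistent with the description (the smallest point value for which maxabs still fits in width bits, with float ≈ int * 2^point), so it is an assumption rather than the patent's exact formula.

```python
import math

def choose_point(maxabs: float, width: int) -> int:
    # width includes the sign bit, so (width - 1) bits remain for the magnitude
    return math.ceil(math.log2(maxabs)) - (width - 1)

def float_to_int(x: float, point: int) -> int:
    return round(x / (2 ** point))

def int_to_float(i: int, point: int) -> float:
    return i * (2 ** point)

point = choose_point(maxabs=6.0, width=8)     # -4: values are stored as int * 2**-4
print(point, float_to_int(5.75, point), int_to_float(92, point))   # -4 92 5.75
```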
  • the foregoing method for obtaining an i-th layer input neuron gradient may include:
• the i-th layer input neuron gradient = f' * the (i+1)-th layer output result gradient,
  • f' is the derivative of the activation function f.
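• An elementwise sketch of the relation above. The choice of ReLU as f, and evaluating f' at the layer's pre-activations, are assumptions made only for illustration.

```python
def relu_derivative(z: float) -> float:
    return 1.0 if z > 0 else 0.0

def input_neuron_gradient(pre_activations, next_layer_output_gradient):
    # i-th layer input neuron gradient = f'(z) * (i+1)-th layer output result gradient
    return [relu_derivative(z) * g
            for z, g in zip(pre_activations, next_layer_output_gradient)]

print(input_neuron_gradient([0.3, -1.2, 2.0], [0.5, 0.4, -0.1]))  # [0.5, 0.0, -0.1]
```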
  • the foregoing operation unit may include: a main processing circuit 3101 and a plurality of slave processing circuits 3102, where
• the main processing circuit 3101 is configured to perform pre-processing on the data (the data including one or any combination of input neuron data, weight data, and input neuron gradient, and the data may be fixed-point data or floating-point data) and to transmit data and operation instructions to the plurality of slave processing circuits;
• the plurality of slave processing circuits 3102 are configured to perform intermediate operations in parallel according to the data (which may be fixed-point data or floating-point data) and the operation instructions transmitted from the main processing circuit, to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main processing circuit;
• the main processing circuit 3101 is configured to obtain the i-th layer forward output result, the i-th layer output result gradient, and the i-th layer weight gradient according to the plurality of intermediate results, and to update the i-th layer weights according to the i-th layer weight gradient.
  • the activation function f is any one of a nonlinear function sigmoid, tanh, relu, softmax or a linear function;
  • the operation instructions include: a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, or a MOVE instruction.
• the main processing circuit includes a first storage unit, a first operation unit, and a first data dependency determination unit, where:
  • a neuron buffer unit for buffering input data and output data used by the main processing circuit in the calculation process
  • a first arithmetic unit that performs various computing functions of the main processing circuit
• a first data dependency determining unit configured to read the input neuron vector from the first storage unit and send it to the slave processing circuits through the interconnect module, and to receive the intermediate result vector from the interconnect module and send the intermediate result vector to the first operation unit.
  • the first operation unit includes: a vector addition unit and an activation operation unit;
  • the vector adding unit is configured to add offset data to the intermediate result to obtain an offset result
  • the activation operation unit is configured to perform an activation function operation on the bias result.
• each slave processing circuit includes a second operation unit, a second data dependency determination unit, a second storage unit, and a third storage unit, wherein:
  • a second operation unit configured to perform an arithmetic logic operation
  • a second data dependency determining unit configured to perform a read/write operation on the second storage unit and the third storage unit
  • a second storage unit configured to buffer data of the input neuron vector and the output neuron value calculated by the processing circuit
  • a third storage unit configured to buffer a weight vector required by the processing circuit in the calculation process.
• the second operation unit includes: a vector multiplication unit and an accumulation unit;
  • the vector multiplication unit is configured to perform a vector multiplication operation in a dot product operation
  • the accumulating unit is configured to perform an accumulating operation in a dot product operation.
  • the process of updating the above weights may include:
• the main processing circuit 3101 is specifically configured to send the i-th layer input neuron data to each slave processing circuit and to transmit the i-th layer input neuron gradient to each slave processing circuit 3102; each slave processing circuit 3102 multiplies the scalar data corresponding to that slave processing circuit in the i-th layer input neuron gradient in_gradient by the i-th layer input neuron data to obtain the original weight update gradient vector dw_original of the i-th layer for that slave processing circuit;
• after the original weight update gradients of all layers are calculated, the main processing circuit may limit (clip) them. Specifically, the main processing circuit is configured to calculate the sum of the squares of all the original weight update gradients, sumsq_diff, and then take the square root of sumsq_diff to obtain l2norm_diff. If l2norm_diff is larger than clip_gradient (a preset positive constant), the main processing circuit calculates the scaling factor scale_factor = clip_gradient / l2norm_diff, and multiplies each original weight update gradient dw_original by the scaling factor scale_factor to obtain the weight update gradient dw'; the main processing circuit transmits the weight update gradient dw' to each slave processing circuit, and each slave processing circuit updates the weights of its portion of the i-th layer using the weight update gradient dw'.
• clip_gradient is a preset positive constant.
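• A sketch of the clipping step described above, following the names used in the text (sumsq_diff, l2norm_diff, clip_gradient, scale_factor, dw_original, dw'); the learning-rate update rule at the end is an assumption, since the text only says the weights are updated with dw'.

```python
import math

def clip_and_update(weights, dw_original, clip_gradient, lr=0.01):
    sumsq_diff = sum(g * g for g in dw_original)      # sum of squares of the gradients
    l2norm_diff = math.sqrt(sumsq_diff)               # square root of sumsq_diff
    if l2norm_diff > clip_gradient:
        scale_factor = clip_gradient / l2norm_diff
        dw = [g * scale_factor for g in dw_original]  # weight update gradient dw'
    else:
        dw = dw_original
    # assumed update rule: each slave processing circuit applies dw' to its slice of weights
    return [w - lr * g for w, g in zip(weights, dw)]

print(clip_and_update([0.5, -0.2], [3.0, 4.0], clip_gradient=1.0))  # [0.494, -0.208]
```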
  • the technical solution provided by the present application sets the arithmetic unit into a master multi-slave structure, and for the calculation instruction of the forward operation, the structure can split the data according to the calculation instruction of the forward operation, so that the plurality of slave processing circuits can Parallel operation is performed on the portion with a large amount of calculation, thereby increasing the operation speed, saving the calculation time, and further reducing the power consumption.
• for the calculation instructions of the reverse operation, the data can also be split in a manner similar to the forward operation, which can likewise improve the operation speed.
• the foregoing main processing circuit and slave processing circuits may each include a storage module configured to store data of the main processing circuit or the slave processing circuit.
• the main processing circuit and the slave processing circuits may share the foregoing storage modules; that is, one or more regions in the storage module of the main processing circuit may be designated as shared regions whose storage space can be shared (read or written) by multiple slave processing circuits, and one or more regions in the storage module of a slave processing circuit may be designated as shared regions whose storage space can be shared (read or written) by the main processing circuit.
• this technical solution adopts a region-sharing scheme for the storage modules. Compared with a fixed storage-module scheme, sharing the storage modules between the interconnected main processing circuit and the plurality of slave processing circuits can prevent the calculation from failing because of insufficient storage area. In addition, storage-module sharing can effectively reduce the storage space that needs to be provisioned in the main processing circuit, which greatly reduces the cost of the main processing circuit. Furthermore, the solution reduces the overhead of reading and writing data compared with fetching data from an external device: when the computing device reads or writes data from an external device, the data needs to be forwarded through components such as the controller unit and the conversion unit, so the data passes through multiple components, and the read/write overhead and energy consumption are high. By setting some shared regions in the main processing circuit and the slave processing circuits, when the space of the storage module of the main processing circuit or of a slave processing circuit is insufficient, the data does not need to be stored in an external device but can be stored directly inside the operation unit, thereby greatly reducing the overhead.
  • the foregoing computing device may further include: the storage unit 10 and the direct memory access unit 50.
• the storage unit 10 may include: one or any combination of the register 201 and the cache 202.
  • the cache 202 for storing the calculation instruction;
  • the register 201 is configured to store the input neuron data, weight data, input neuron gradient, and scalar; and
  • the cache 202 is a cache.
  • the direct memory access unit 50 is for reading or storing data from the storage unit 10.
  • the controller unit 11 includes: an instruction cache unit 110, an instruction processing unit 111, and a storage queue unit 113;
  • the instruction cache unit 110 is configured to store calculation instructions associated with the artificial neural network operation;
  • the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions; and
  • the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed arranged in the order of the queue.
  • the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically for decoding the instruction into a micro instruction.
  • the slave arithmetic processing circuit may also include another controller unit, the other controller unit including the slave instruction processing unit, specifically for receiving and processing the microinstructions.
  • the microinstruction may be the next-level instruction below the instruction; it may be obtained by splitting or decoding the instruction, and it can be further decoded into control signals for individual components, units or processing circuits.
  • the structure of the calculation instructions can be as shown in the following table.
  • the ellipsis in the above table indicates that multiple registers or immediate data can be included.
  • the calculation instructions can include: one or more operational domains and an opcode.
  • the calculation instructions can include neural network operational instructions. Taking the neural network operation instruction as an example, as shown in the following table, the register number 0, the register number 1, the register number 2, the register number 3, and the register number 4 may be the operation domain. Among them, the register number 0, the register number 1, the register number 2, the register number 3, and the register number 4 may be the numbers of one or more registers.
  • the above register may be an off-chip memory. Of course, in an actual application, it may also be an on-chip memory for storing data.
  • the data is one-dimensional data, that is, a vector.
  • the data is 2-dimensional data, that is, a matrix.
  • the arithmetic unit 12 may include a main processing circuit 3101 and a plurality of slave processing circuits 3102.
  • a plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits among the plurality of slave processing circuits;
  • the k slave processing circuits are: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
  • the k slave processing circuits shown in the figure are the slave processing circuits that are directly connected to the main processing circuit.
  • the above conversion unit may be disposed in the main processing circuit.
  • the main processing circuit may further include:
  • An addition processing circuit for performing an addition operation or an accumulation operation.
  • the main processing circuit is configured to determine that the input neuron data is broadcast data and the weight data is distribution data, split the distribution data into a plurality of data blocks, and send at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the slave processing circuits;
  • the plurality of slave processing circuits are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the intermediate result to the main processing circuit;
  • the main processing circuit is configured to receive an i-th layer forward output result, an i-th layer output result gradient, an i-th layer weight gradient, and update the i-th layer weight according to the i-th layer weight gradient.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result
  • a forwarding processing circuit (optional) for forwarding received data blocks or product results.
  • an accumulation processing circuit, wherein the accumulation processing circuit is configured to perform an accumulation operation on the product results to obtain the intermediate result.
  • the operation instruction may be a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or a similar calculation instruction.
  • the apparatus may further include: a tree module 40, the tree module includes: a root port 401 and a plurality of branch ports 404, the tree type a root port of the module is connected to the main processing circuit, and a plurality of branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits;
  • the tree module has a transceiving function: the tree module can perform a transmitting function, and the tree module 40 can also perform a receiving function.
  • the tree module is configured to forward data between the main processing circuit and the plurality of slave processing circuits and an operation instruction.
  • the tree module is an optional structure of the computing device and may include at least one layer of nodes; the nodes form a line structure with a forwarding function, and a node itself may have no computing capability. If the tree module has zero layers of nodes, the tree module is not needed.
  • the tree module may be an n-ary tree structure, for example the binary tree structure shown in FIG. 4C, or a ternary tree structure; n may be an integer greater than or equal to 2.
  • the specific embodiments of the present application do not limit the specific value of n; the number of layers may be two, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer of nodes.
  • the main processing circuit in the foregoing operation unit may carry a separate cache.
  • specifically, the separate cache may include a neuron cache unit that buffers the input neuron vector data and the output neuron value data of the slave processing circuits.
  • the main processing circuit may further include: a weight buffer unit for buffering weight data required by the processing circuit in the calculation process.
  • the operation unit 12, as shown in FIG. 3C may include a branch processing circuit 3103; the specific connection structure is as shown in FIG. 3C, where
  • the main processing circuit 3101 is connected to one or more branch processing circuits 3103, and the branch processing circuit 3103 is connected to one or more slave processing circuits 3102;
  • the branch processing circuit 3103 is configured to forward data or instructions between the main processing circuit 3101 and the slave processing circuit 3102.
  • a storage module may be disposed in the branch processing circuit 3103; the storage module may be divided into one or more shared areas, and the main processing circuit and the slave processing circuits are specifically configured to perform data write or read operations on those shared areas.
  • providing the shared area in the branch processing circuit 3103 makes it convenient for the main processing circuit and the slave processing circuits to store data with a small read/write overhead, which can save capacity in the storage modules of the slave processing circuits and the main processing circuit and reduce the cost of the computing device.
  • taking the fully connected operation of a neural network, y = f(wx + b), as an example, where f is the activation function, which can be any of the sigmoid, tanh, relu and softmax functions: it is assumed here that the tree module is a binary tree structure and that the operation unit has eight slave processing circuits; the method can be implemented as follows:
  • the controller unit acquires the input neuron matrix x, the weight matrix w, and the fully connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w, and the fully connected operation instruction to the main processing circuit;
  • the main processing circuit determines the input neuron matrix x to be broadcast data and the weight matrix w to be distribution data, splits the weight matrix w into eight sub-matrices, and then distributes the eight sub-matrices through the tree module to the eight slave processing circuits;
  • the main processing circuit arranges the 8 intermediate results to obtain the operation result of wx, performs the offset-b operation on that result, performs the activation operation to obtain the final result y, and sends the final result y to the controller unit; the controller unit outputs the final result y or stores it in the storage unit.
  • a specific way of arranging the eight intermediate results to obtain the operation result of wx may be: for the matrix-multiply-matrix operation, determine the partial elements of the input neuron matrix x corresponding to the eight sub-matrices, and extract the minimum row number and the minimum column number of the partial elements of the eight sub-matrices; the minimum row number and the minimum column number give the position of each intermediate result in the operation result.
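The following NumPy sketch illustrates the split-and-reassemble flow described above for y = f(wx + b): the weight matrix is split into eight row blocks, each block is multiplied against the broadcast input as a slave processing circuit would, and the partial results are stitched back together by row position before the offset and activation are applied. It is only a software illustration of the data flow; the function and variable names are not taken from the patent, and relu is used as one possible choice of f.

```python
import numpy as np

def fully_connected_master_slave(x, w, b, num_slaves=8):
    """Sketch of the master/slave split for y = f(w @ x + b)."""
    row_blocks = np.array_split(w, num_slaves, axis=0)   # distribution data: 8 sub-matrices
    partials = [block @ x for block in row_blocks]       # each slave multiplies its block by the broadcast x
    wx = np.concatenate(partials, axis=0)                # master reassembles the intermediate results by row
    return np.maximum(wx + b, 0.0)                       # offset b, then relu as one possible activation f

# toy usage
x = np.random.randn(16)
w = np.random.randn(32, 16)
b = np.random.randn(32)
print(fully_connected_master_slave(x, w, b).shape)       # (32,)
```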
  • the method for executing the neural network forward operation instruction by the computing device shown in FIG. 4 may specifically be:
  • the controller unit extracts a neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction, and at least one operation code from the instruction storage unit; the controller unit transmits the operation domain to the data access unit and sends the at least one operation code to the operation unit.
  • the controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted) and transmits the weight w and the offset b to the main processing circuit of the operation unit; the controller unit also extracts the input data Xi from the storage unit and transmits the input data Xi to the main processing circuit.
  • the main processing circuit determines, according to the at least one operation code, that the operation is a multiplication, converts the input data Xi into fixed-point input data Xi and the weight data into fixed-point weight data, determines the fixed-point input data Xi to be broadcast data and the fixed-point weight data to be distribution data, and splits the fixed-point weight w into n fixed-point data blocks;
  • the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code and sends them to the main processing circuit; the main processing circuit broadcasts the multiplication instruction and the input data Xi to the plurality of slave processing circuits and distributes the n fixed-point data blocks to the plurality of slave processing circuits (for example, with n slave processing circuits, each slave processing circuit receives one data block); the plurality of slave processing circuits perform fixed-point multiplication on the received fixed-point data blocks according to the multiplication instruction to obtain fixed-point intermediate results and send these intermediate results to the main processing circuit; the main processing circuit accumulates the intermediate results sent by the slave processing circuits according to the accumulation instruction to obtain an accumulation result, converts the accumulation result into a floating-point accumulation result, and, according to the offset instruction, adds the offset b to the floating-point accumulation result to obtain the final result, which is sent to the controller unit.
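A minimal sketch of the mixed fixed-point/floating-point flow just described: inputs and weights are quantized to fixed point, the block-wise multiplications and accumulation are done on integers, and the accumulated result is converted back to floating point before the offset b is added. The quantization scale, block count and helper names are illustrative assumptions, not values from the patent.

```python
import numpy as np

def quantize(x, frac_bits=8):
    # float -> fixed point with `frac_bits` fractional bits (step 2**-frac_bits)
    return np.round(x * (1 << frac_bits)).astype(np.int32)

def forward_fixed_point(xi, w, b, frac_bits=8, n_blocks=4):
    xi_q = quantize(xi, frac_bits)                              # broadcast data
    w_q = quantize(w, frac_bits)                                # distribution data
    blocks = np.array_split(w_q, n_blocks, axis=0)              # n fixed-point data blocks
    partials = [blk.astype(np.int64) @ xi_q for blk in blocks]  # fixed-point multiply on each "slave"
    acc = np.concatenate(partials)                              # fixed-point intermediate results
    acc_float = acc / float(1 << (2 * frac_bits))               # back to float (product carries 2*frac_bits fraction bits)
    return acc_float + b                                        # offset b added in floating point

xi = np.random.randn(16)
w = np.random.randn(8, 16)
b = np.random.randn(8)
print(forward_fixed_point(xi, w, b))
```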
  • the technical solution provided by the present application realizes the multiplication operation and the offset operation of the neural network through a single instruction, i.e. the neural network operation instruction, without storing or extracting intermediate results of the neural network calculation; this reduces the storage and extraction of intermediate data, so it reduces the corresponding operation steps and improves the computational performance of the neural network.
  • the present application also discloses a neural network device that includes one or more of the computing devices mentioned in the present application, for acquiring data to be processed and control information from other processing devices, performing specified neural network training calculations, and passing the execution result to peripheral devices via the I/O interface.
  • Peripherals such as cameras, monitors, mice, keyboards, network cards, wifi interfaces, and servers.
  • these computing devices can be linked and transferred by a specific structure, such as interconnecting and transmitting data over a PCIE bus to support larger scale machine learning operations.
  • these computing devices can share the same control system, or they can have separate control systems; they can share memory, or each accelerator can have its own memory.
  • the manner in which these computing devices are interconnected can be any interconnected topology.
  • the neural network device has high compatibility and can be connected to various types of servers through a PCIE interface.
  • the present application also provides a combined processing device that includes the neural network device described above, a universal interconnect interface, and other processing devices.
  • the neural network device interacts with other processing devices to perform user-specified operations.
  • 4E is a schematic diagram of a combined processing apparatus.
  • other processing devices include one or more types of general-purpose and/or dedicated processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like. This application does not limit the number of processors included in the other processing devices.
  • Other processing devices serve as an interface between the neural network device and external data and control, including data handling, and complete basic control such as opening and stopping of the neural network device; other processing devices can also cooperate with the neural network device to complete the computing task.
  • a universal interconnect interface for transmitting data and control commands between the neural network device and other processing devices.
  • the neural network device acquires the required input data from other processing devices and writes it to the on-chip storage device of the neural network device; it can obtain control commands from other processing devices and write them to the on-chip control cache of the neural network device; it can also read the data in the storage module of the neural network device and transmit it to other processing devices.
  • the structure is as shown in FIG. 4, and may further include a storage device, where the storage device is respectively connected to the neural network device and the other processing device.
  • the storage device is used to store data in the neural network device and the other processing device, and is particularly suitable for the case where the data to be calculated cannot be completely stored in the internal storage of the machine learning arithmetic device or other processing device.
  • the combined processing device can be used as a SOC on-chip system for mobile phones, robots, drones, video monitoring devices, etc., effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
  • in some embodiments, the universal interconnection interface of the combined processing device is coupled to certain components of the device, such as a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface.
  • a chip is also disclosed that includes the above-described neural network computing device or combined processing device.
  • a chip package structure is claimed that includes the chip described above.
  • a board is claimed that includes the chip package structure described above.
  • FIG. 5 provides a board card, which may include other supporting components in addition to the chip 389 described above, including but not limited to: a storage device 390, an interface device 391, and a control device 392;
  • the memory device 390 is connected to a chip in the chip package structure via a bus for storing data.
  • the storage device 390 may include multiple groups of storage units 393. Each group of storage units is connected to the chip via a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • the storage device may include four groups of storage units. Each group of storage units may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of storage units, the theoretical data-transmission bandwidth can reach 25600 MB/s (a back-of-envelope check of this figure is sketched after the DDR description below).
  • each set of said memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each of the storage units.
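A quick arithmetic check of the 25600 MB/s theoretical bandwidth quoted above, under the usual assumption that one DDR4-3200 channel performs 3200 mega-transfers per second over a 64-bit data path (the 8 ECC bits carry no payload data):

```python
transfers_per_second = 3200e6        # DDR4-3200: 3200 MT/s
bytes_per_transfer = 64 / 8          # 64 data bits per transfer; ECC bits excluded
bandwidth_mb_per_s = transfers_per_second * bytes_per_transfer / 1e6
print(bandwidth_mb_per_s)            # 25600.0 MB/s, matching the quoted figure
```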
  • the interface device is electrically connected to a chip within the chip package structure.
  • the interface device is configured to implement data transfer between the chip and an external device, such as a server or a computer.
  • the interface device can be a standard PCIE interface.
  • the data to be processed is transferred to the chip by the server through a standard PCIE interface to implement data transfer.
  • the interface device may also be another interface.
  • the application does not limit the specific expression of the other interfaces, and the interface unit can implement the transfer function.
  • the calculation result of the chip is still transmitted by the interface device back to an external device (for example, a server).
  • the control device is electrically coupled to the chip.
  • the control device is for monitoring the status of the chip.
  • the chip and the control device can be electrically connected through an SPI interface.
  • the control device may include a Micro Controller Unit (MCU). The chip may include multiple processing chips, multiple processing cores or multiple processing circuits and can therefore drive multiple loads, so the chip can be in different operating states such as heavy load and light load.
  • the control device can regulate the operating states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
  • an electronic device that includes the above-described card.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, camera modules, servers, cloud servers, still cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the vehicle includes an airplane, a ship, and/or a vehicle;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, a range hood;
  • the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.

Abstract

The present application provides a calculation method and related products; the calculation method performs machine learning calculations in a fused manner. The technical solution of the present application has the advantages of a small amount of calculation and low power consumption.

Description

计算方法以及相关产品 技术领域
本申请涉及神经网络领域,尤其涉及一种计算方法以及相关产品。
背景技术
神经网络是一种运算模型,由大量的节点(或称神经元)之间相互联接构成。每个节点代表一种特定的输出函数,称为激励函数(activation function)。每两个节点间的连接都代表一个对于通过该连接信号的加权值,称之为权重,这相当于人工神经网络的记忆。网络的输出则依网络的连接方式、权重值和激励函数的不同而不同。而网络自身通常都是对自然界某种算法或者函数的逼近,也可能是对一种逻辑策略的表达。
神经网络的计算方式包括但不限于:加法运算、乘法运算、激活运算等等运算方式。神经网络现有的计算方式无法实现对神经网络数据的快速运算,影响运算速度。
发明内容
本申请提供一种计算方法及相关产品,对现有的集成电路芯片,具有可实现提升运算速度的优点。
第一方面,提供一种计算方法,所述计算方法应用于计算系统,所述计算系统包括:控制单元、计算群和总存储单元,所述控制单元包括:第一存储器、译码逻辑和控制器,所述计算群包括:群控制器和多个计算单元;所述总存储单元,用于存储数据;所述计算方法包括如下步骤:
所述控制器接收第一级指令序列,所述译码逻辑将该第一级指令序列拆分成多个第二级指令序列,
控制器为所述多个第二级指令序列开辟M个线程,控制器为所述M个线程中每个线程分配独立的寄存器以及配置独立寻址功能;所述M取值范围为大于等于1的整数;
群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果。
可选的,所述群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
如所述计算类型代表相同类型的计算操作,群控制器调用相同类型的单指令多数据流SIMD结合单指令多线程SIMT的组合计算方式,并采用所述M个线程执行组合计算方式得到最终结果,具体包括:
译码逻辑将M个线程拆分成N个线程组分配给多个计算单元,群控制器将所述多个第二指令序列转换成多个第二控制信号并发送给多个计算单元,多个计算单元调用分配的线 程组以及第二控制信号依据所述独立寻址功能提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将多个中间结果拼接起来得到最终结果。
可选的,所述群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
如所述计算类型代表不同类型的计算操作,群控制器调用同步多线程SMT以及所述M个线程执行计算得到最终结果,具体包括:
译码逻辑将M个线程拆分成N个线程组,将所述多个第二指令序列转换成多个第二控制信号,群控制器获取多个计算单元支持的计算类型,控制器将N个线程组以及多个第二控制信号,分配给支持该线程组以及第二控制信号的计算类型对应的计算单元,多个计算单元调用分配的线程组以及第二控制信号,多个计算单元提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将所有中间结果拼接起来得到最终结果。
可选的,所述方法还包括:
如多个线程组中的线程组A阻塞,将线程组A加入等待队列,如线程组A的数据已被提取,将线程组A加入到准备队列,所述准备队列为计算资源空闲时被调度执行的线程组所在的队列。
可选的,所述第一级指令序列包括:超长指令,所述第二级指令序列包括:指令序列。
可选的,所述计算系统还包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述群控制器,所述树型模块的多个支端口分别连接多个计算单元中的一个计算单元;
所述树型模块转发所述群控制器与所述多个计算单元之间的数据块、线程组或指令序列。
可选的,所述树型模块为n叉树,所述n为大于等于2的整数。
可选的,所述计算系统还包括:分支处理电路,
所述分支处理电路连接在所述群控制器与所述多个计算单元之间;
所述分支处理电路转发所述群控制器与所述多个计算单元之间的数据、线程组或指令序列。
第二方面,提供一种计算系统,所述计算系统包括:控制单元、计算群和总存储单元,所述控制单元包括:第一存储器、译码逻辑和控制器,所述计算群包括:群控制器和多个计算单元;所述总存储单元,用于存储数据;
所述控制器,用于接收第一级指令序列以及用于控制所述第一存储器和所述译码逻辑;
所述译码逻辑,用于将该第一级指令序列拆分成多个第二级指令序列;
所述控制器,还用于为所述多个第二级指令序列开辟M个线程;为所述M个线程中每个线程分配独立的寄存器以及配置独立寻址功能;所述M取值范围为大于等于1的整数,将所述多个第二级指令序列转换成多个控制信号发送给所述群控制器;
所述群控制器,用于接收所述多个控制信号,获取所述多个控制信号的多个计算类型,将M个线程划分成N个线程组,依据该多个计算类型为多个计算单元分配N个线程组以及多个控制信号;
多个计算单元,用于通过分配的线程组以及控制信号从所述总存储单元提取数据执行运算得到中间结果,
所述群控制器,用于拼接所有中间结果得到最终计算结果。
可选的,所述多个计算单元包括:加法计算器、乘法计算器、激活计算器或专用计算器。
可选的,所述专用计算器包括:人脸识别计算器、图形计算器、指纹计算器或神经网络计算器。
可选的,所述群控制器,具体用于:如多个控制信号的计算类型为图形计算、指纹识别、人脸识别或神经网络运算时,将该多个控制信号分别分配给人脸识别计算器、图形计算器、指纹计算器或神经网络计算器。
可选的,所述第一级指令序列包括:超长指令,所述第二级指令序列包括:指令序列。
可选的,所述计算系统包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述群控制器,所述树型模块的多个支端口分别连接多个计算单元中的一个计算单元;
所述树型模块,用于转发所述群控制器与所述多个计算单元之间的数据块、线程组或指令序列。
可选的,所述树型模块为n叉树,所述n为大于等于2的整数。
可选的,所述计算系统包括:分支处理电路,
所述分支处理电路连接在所述群控制器与所述多个计算单元之间;
所述分支处理电路,用于转发所述群控制器与所述多个计算单元之间的数据、线程组或指令序列。
第三方面,本发明提供了一种神经网络运算模块,该神经网络运算模块用于进行多层神经网络的运算,包括:
存储单元,用于存储输入神经元精度、权重精度和输出神经元梯度精度;
控制器单元,用于从所述存储单元获取所述多层神经网络第L层的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000001
其中,所述L为大于0的整数;根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000002
获取梯度更新精度T;当所述梯度更新精度T小于预设精度T r时,调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000003
以使所述梯度更新精度T与所述预设精度T r的差值的绝对值最小;
运算单元,用于根据调整后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输出神经元和权重,根据调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000004
来表示运算得到的第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述控制器单元根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000005
获取梯度更新精度T,具体包括:
所述控制器单元根据预设公式对所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000006
进行计算,以得到所述梯度更新精度T;
其中,所述第一预设公式为:
Figure PCTCN2019085844-appb-000007
在一种可行的实施例中,所述控制器单元调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000008
包括:
所述控制器单元保持所述输入神经元精度S x(l)和所述权重精度S w(l)不变,增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000009
在一种可行的实施例中,所述控制器单元增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000010
时,减少表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000011
后,所述控制器单元还用于:
判断所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000012
是否小于需求精度,所述需求精度为进行多层神经网络运算时输出神经元梯度的最小精度;
当所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000013
小于所述需求精度时,减少表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元减少表示所述输出神经元梯度的定点数据格式的位宽,包括:
所述控制器单元按照第一预设步长N1减少所述表示所述输出神经元梯度的定点数据格式的位宽;
其中,所述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
在一种可行的实施例中,所述控制器单元减少表示所述输出神经元梯度的定点数据格式的位宽,包括:
所述控制器单元按照2倍递减的方式减少所述表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元还用于:
根据机器学习的方法获取所述预设精度T r,或者;
根据第L-1层输出神经元的个数、学习率和批处理时的样本数量获取所述预设精度T r;且所述第L-1层输出神经元的个数和批处理时的样本数量越多以及学习率越高,所述预设精度T r越大。
第四方面,本发明实施例提供了一种神经网络运算方法,包括:
获取神经网络的第L层的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000014
根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000015
计算得到梯度更新精度T;
当梯度更新精度T小于预设精度T r时,调整输入神经元精度S x(l)、权重精度S w(l)和输 出神经元梯度
Figure PCTCN2019085844-appb-000016
以使所述梯度更新精度T与所述预设精度T r的差值的绝对值最小;
根据调整后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输出神经元和权重;根据调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000017
来表示运算得到的第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000018
计算得到梯度更新精度T,包括:
根据预设公式对所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000019
进行计算,以得到所述梯度更新精度T;
其中,所述预设公式为:
Figure PCTCN2019085844-appb-000020
在一种可行的实施例中,所述调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000021
包括:
保持所述输入神经元精度S x(l)和所述权重精度S w(l)不变,增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000022
在一种可行的实施例中,所述增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000023
时,所述方法还包括:减少表示所述输出神经元梯度的定点数据格式的位宽
在一种可行的实施例中,所述增大所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000024
后,所述方法还包括:
判断所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000025
是否小于需求精度,所述需求精度为进行多层神经网络运算时输出神经元梯度的最小精度;
当所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000026
小于所述需求精度时,减少表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述减少表示所述输出神经元梯度的定点数据格式的位宽,包括:
按照第一预设步长N1减少所述表示所述输出神经元梯度的定点数据格式的位宽;
其中,所述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
在一种可行的实施例中,所述减少表示所述输出神经元梯度的定点数据格式的位宽,包括:
按照2倍递减的方式减少所述表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述方法还包括:
根据机器学习的方法获取所述预设精度T r,或者;
根据第L-1层的输出神经元的个数、学习率和批处理时的样本数量获取所述预设精度T r;且所述第L-1层输出神经元的个数和批处理时的样本数量越多以及学习率越高,所述预设精度T r越大。
第五方面,本发明提供了一种神经网络运算模块,该神经网络运算模块用于进行多层神经网络的运算,包括:
存储单元,用于存储输入神经元精度、权重精度和输出神经元梯度精度;
控制器单元,用于从所述存储单元获取所述多层神经网络第L层的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000027
其中,所述L为大于0的整数;根据所 述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000028
获取梯度更新精度T;当所述梯度更新精度T大于预设精度T r时,调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000029
以使所述梯度更新精度T与所述预设精度T r的差值的绝对值最小;
运算单元,用于根据调整后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输出神经元和权重,根据调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000030
来表示运算得到的第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述控制器单元根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000031
获取梯度更新精度T,具体包括:
所述控制器单元根据预设公式对所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000032
进行计算,以得到所述梯度更新精度T;
其中,所述第一预设公式为:
Figure PCTCN2019085844-appb-000033
在一种可行的实施例中,所述控制器单元调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000034
包括:
所述控制器单元保持所述输入神经元精度S x(l)和所述权重精度S w(l)不变,减小所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000035
在一种可行的实施例中,所述控制器单元减小所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000036
时,增加表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元增加所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000037
后,所述控制器单元还用于:
判断所述输出神经元梯度以所述输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元增加表示所述输出神经元梯度的定点数据格式的位宽,包括:
所述控制器单元按照第一预设步长N1增加所述表示所述输出神经元梯度的定点数据格式的位宽;
其中,所述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
在一种可行的实施例中,所述控制器单元增加表示所述输出神经元梯度的定点数据格式的位宽,包括:
所述控制器单元按照2倍递增的方式增加所述表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元还用于:
根据机器学习的方法获取所述预设精度T r,或者;
根据第L-1层输出神经元的个数、学习率和批处理时的样本数量获取所述预设精度T r;且所述第L-1层输出神经元的个数和批处理时的样本数量越多以及学习率越高,所述预设精度T r越大。
第六方面,本发明实施例提供了一种神经网络运算模块,该神经网络运算模块用于进行多层神经网络的运算,包括:
存储单元,用于存储所述多层神经网络的输出神经元梯度;
控制器单元,用于从所述存储单元获取所述多层神经网络的第L层的输入神经元梯度;所述L为大于0的整数;获取所述第L层输出神经元梯度中绝对值小于第一预设阈值的输出神经元梯度的个数n1;根据所述个数n1和所述第L层输出神经元梯度的个数n2获取比例数据a,其中,a=n1/n2;当所述比例数据a大于第二预设阈值时,减小所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000038
运算单元,用于根据减小后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000039
表示第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述控制器单元增大所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000040
时,增加表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元减小所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000041
后,所述控制器单元还用于:
判断所述第L层输出神经元梯度以所述第L层输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述增加表示所述第L层输出神经元梯度的定点数据格式的位宽,包括:
所述控制器单元按照第二预设步长N2增加所述表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述控制器单元增加表示所述第L层输出神经元梯度的定点数据格式的位宽,,包括:
所述控制器单元按照2倍递增的方式增加所述表示所述第L层输出神经元梯度的定点数据格式的位宽。
第七方面,本发明实施例提供了一种神经网络运算方法,包括:
获取神经网络的第L层输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000042
根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000043
计算得到梯度更新精度T;
当梯度更新精度T大于预设精度T r时,调整输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度
Figure PCTCN2019085844-appb-000044
以使所述梯度更新精度T与所述预设精度T r的差值的绝对值最小;
根据调整后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输出神经元和权重;根据调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000045
来表示运算得到的第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所 述输出神经元梯度精度
Figure PCTCN2019085844-appb-000046
计算得到梯度更新精度T,包括:
根据预设公式对所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000047
进行计算,以得到所述梯度更新精度T;
其中,所述预设公式为:
Figure PCTCN2019085844-appb-000048
在一种可行的实施例中,所述调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000049
包括:
保持所述输入神经元精度S x(l)和所述权重精度S w(l)不变,减小所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000050
在一种可行的实施例中,所述减小所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000051
时,增加表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述减小所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000052
后,所述方法还包括:
判断所述输出神经元梯度以所述输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述增大表示所述输出神经元梯度的定点数据格式的位宽,包括:
按照第一预设步长N1增加所述表示所述输出神经元梯度的定点数据格式的位宽;
其中,所述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
在一种可行的实施例中,所述增加表示所述输出神经元梯度的定点数据格式的位宽,包括:
按照2倍递增的方式增加所述表示所述输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述方法还包括:
根据机器学习的方法获取所述预设精度T r,或者;
根据第L-1层输出神经元的个数、学习率和批处理时的样本数量获取所述预设精度T r;且所述第L-1层输出神经元的个数和批处理时的样本数量越多以及学习率越高,所述预设精度T r越大。
第八方面,本申请实施例提供了一种神经网络运算方法,包括:
获取所述多层神经网络的第L层的输入神经元梯度,所述L为大于0的整数;
获取所述第L层输出神经元梯度中绝对值小于第一预设阈值的输出神经元梯度的个数n1;
根据所述个数n1和所述第L层输出神经元梯度的个数n2获取比例数据a,其中,a=n1/n2;
当所述比例数据a大于第二预设阈值时,减小所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000053
根据减小后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000054
表示第L层输出神经元梯度,以进行后续运算。
在一种可行的实施例中,所述减小所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000055
时,增加表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述减小所述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000056
后,所述方法 还包括:
判断所述权重以表示所述第L层输出神经元梯度的定点数据格式时是否溢出;
当确定溢出时,增加表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述增加表示所述第L层输出神经元梯度的定点数据格式的位宽,,包括:
按照第三预设步长N2增加所述表示所述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,所述增加表示所述第L层输出神经元梯度的定点数据格式的位宽,,包括:按照2倍递增的方式增加所述表示所述第L层输出神经元梯度的定点数据格式的位宽。
第九方面,本发明提供了一种神经网络计算装置,所述装置用于执行人工神经网络训练计算;所述神经网络训练计算包括神经网络多层训练运算,所述多层训练运算包括第i层,所述第i层的正向运算或反向运算中至少有部分数据用于定点数据运算,上述i为大于等于1的整数;所述计算装置包括:控制器单元、运算单元和转换单元,其中,控制器单元与运算单元以及转换单元连接;所述第i层训练运算包括第i层的正向运算和第i层的反向运算;制器单元,用于获取第i层的输入神经元数据、第i层权值数据以及第i层正向计算指令;
控制器单元,还用于解析该第i层正向计算指令得到多个正向运算指令,将第i层输入神经元数据以及第i层权值数据发送给转换单元,将该多个运算指令发送给运算单元;
转换单元,用于对该第i层输入神经元数据以及第i层权值数据中的全部或部分数据执行浮点类型与定点类型转换以得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,所述混合数据包括:部分定点数据以及部分浮点数据;
运算单元,用于依据正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的正向输出结果;
所述混合运算包括:对部分定点数据执行定点运算以及对部分浮点数据执行浮点运算。
可选的,控制器单元,还用于获取第i层的输入神经元数据、第i层权值数据、第i层输入神经元梯度以及第i层反向计算指令;
控制器单元,还用于解析该第i层计算指令得到多个反向运算指令,将第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度发送给转换单元,将该多个运算指令发送给运算单元;
转换单元,还用于对该第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度中的全部或部分数据执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,该混合数据包括:部分定点数据以及部分浮点数据;
运算单元,还用于依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的权值梯度以及第i层输出结果梯度;采用第i层的权值梯度对第i层权值更新。
可选的,所述转换单元,具体用于将第i层输入神经元数据的部分转换成部分定点输入神经元数据以及将第i层权值数据的部分转换成部分定点权值数据;将部分定点输入神经元数据以及部分定点权值数据发送给运算单元,将部分输入神经元数据和部分权值数据发送给运算单元;
运算单元,具体用于将部分定点输入神经元数据以及部分定点权值数据执行定点数据运算得到部分定点正向输出结果,将部分定点正向输出结果发送给转换单元,
转换单元,具体用于将该部分定点正向输出结果执行定点与浮点转换得到第一部分浮点正向输出结果,将第一部分浮点正向输出结果发送给运算单元;
运算单元,具体用于将部分输入神经元数据和部分权值数据执行运算得到第二部分浮点正向运算结果,将第一部分浮点正向运算结果和第二部分浮点正向运算结果结合起来得到第i层正向输出结果。
可选的,所述转换单元,具体用于将第i层输入神经元数据的部分转换成部分定点输入神经元数据、将第i层权值数据的部分转换成部分定点权值数据以及将第i层输入神经元梯度转换成部分定点输入神经元梯度;将部分定点输入神经元数据、部分定点输入神经元梯度以及部分定点权值数据发送给运算单元,将部分输入神经元数据、部分输入神经元梯度和部分权值数据发送给运算单元;
运算单元,具体用于将部分定点输入神经元梯度以及部分定点输入数据执行定点数据运算得到部分第i层权值梯度,将部分定点输入神经元梯度与部分定点权值数据执行定点数据运算得到部分第i层输出结果梯度,将部分第i层权值梯度以及部分第i层输出结果梯度发送给转换单元,
转换单元,具体用于将该部分第i层权值梯度以及部分第i层输出结果梯度执行定点与浮点转换得到第一部分第i层权值梯度以及第一部分第i层输出结果梯度,将第一部分第i层权值梯度以及第一部分第i层输出结果梯度发送给运算单元;
运算单元,具体用于将部分输入神经元梯度以及部分输入数据执行运算得到第二部分第i层权值梯度,将部分输入神经元梯度与部分权值数据执行运算得到第二部分第i层输出结果梯度,将第一部分第i层权值梯度和第二部分第i层权值梯度结合起来得到第i层权值梯度,将第一部分第i层输出结果梯度和第二部分第i层输出结果梯度结合起来得到第i层输出结果梯度。
Optionally, the conversion unit is specifically configured to determine the decimal point position point of the floating-point numbers;
Figure PCTCN2019085844-appb-000057
where maxabs is the largest absolute value among the floating-point data to be converted and width is the bit width of the fixed-point number;
Figure PCTCN2019085844-appb-000058
where float = int * 2^point; float is the value of the floating-point number and int is the value of the fixed-point number.
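The point-selection formula itself is rendered only as an image in this text, so the sketch below assumes one common choice, point = ceil(log2(maxabs)) - (width - 1), which scales the data so that the largest magnitude roughly fills a signed width-bit integer; the conversion back follows the stated relation float = int * 2^point. All helper names are illustrative.

```python
import math
import numpy as np

def choose_point(data, width):
    # assumed point formula (the original appears only as an image in the source)
    maxabs = float(np.max(np.abs(data)))
    if maxabs == 0.0:
        return 0
    return math.ceil(math.log2(maxabs)) - (width - 1)

def float_to_fixed(data, width=16):
    point = choose_point(data, width)
    scale = 2.0 ** point
    int_max = 2 ** (width - 1) - 1
    ints = np.clip(np.round(data / scale), -int_max - 1, int_max).astype(np.int32)
    return ints, point

def fixed_to_float(ints, point):
    # float = int * 2^point, as stated in the claim
    return ints.astype(np.float64) * (2.0 ** point)

data = np.random.randn(8) * 10
q, point = float_to_fixed(data, width=16)
print(np.max(np.abs(data - fixed_to_float(q, point))))   # quantization error
```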
可选的,所述获取第i层输入神经元梯度的方法具体包括:
所述控制器单元,具体用于接收第i+1层输出结果梯度,将第i+1层输出结果梯度发送至运算单元;
所述运算单元,具体用于依据第i+1层输出结果梯度得到第i层输入神经元梯度;
第i层输入神经元梯度=f′*第i+1层输出结果梯度;
其中f′为激活函数f的导函数。
可选的,所述运算单元包括:主处理电路和多个从处理电路;其中,
所述主处理电路,用于对数据执行前序处理,以及向所述多个从处理电路传输数据以及运算指令;
多个从处理电路,用于依据从所述主处理电路传输的数据以及运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给所述主处理电路;
主处理电路,用于依据多个中间结果得到第i层正向输出结果、第i层输出结果梯度、第i层权值梯度,并依据第i层权值梯度对第i层权值进行更新。
可选的,所述主处理电路,具体用于将该第i层输入神经元数据分发发送给各个从处理电路,将第i层输入神经元梯度传送到各个从处理电路,每个从处理电路将第i层输入神经元梯度in_gradient中与该从处理电路相对应的标量数据以及第i层输入神经元数据相乘,得到每个从处理电路的第i层的原始权值以更新梯度向量dw_original,采用该原始权值更新梯度向量dw_original与每个从处理电路的权值相乘得到各个从处理电路的更新权值。
可选的,所述主处理电路,具体用于在计算得到所有层的原始权值更新梯度向量后,计算所有层的原始权值更新梯度的平方和sumsq_diff,然后对sumsq_diff进行开方得到l2norm_diff,如果l2norm_diff大于clip_gradient,计算缩放因子scale_factor=clip_gradient/l2norm_diff,将所有的原始权值更新梯度dw_original分别乘以缩放因子scale_factor,得到权值更新梯度dw’,将更新梯度dw’发送给每个从处理电路;
从处理电路,具体用于使用权值更新梯度dw’乘以权值得到第i层各个从处理电路的更新权值。
可选的,所述主处理电路以及从处理电路均包括存储模块;
所述存储模块,用于存储数据;
所述存储模块还包括至少一个共享区域,所述共享区域为主处理电路或从处理电路共享使用的存储空间。
可选的,所述运算单元还包括:分支处理电路;
所述分支处理电路设置在主处理电路与多个从处理电路之间,实现主处理电路与多个从处理电路之间的数据以及运算指令的转发。
可选的,所述分支处理电路包括:存储模块,所述存储模块包括至少一个共享区域,所述共享区域为主处理电路和从处理电路共享使用的存储空间。
可选的,所述装置还包括树型模块,例如该树型模块可以为互连模块,互联模块为由 多个节点构成的n叉树通路,所述n叉树的上游节点的数据发送至下游的n个节点,以及互联模块将下游的n个节点返回的数据进行合并后发送给上游节点,所述n为大于等于2的整数。
可选的,所述激活函数f是非线性函数sigmoid,tanh,relu,softmax中的任一个或线性函数;
所述运算指令包括:CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令或MOVE指令。
可选的,所述主处理电路包括第一存储单元、第一运算单元、第一数据依赖关系判定单元和第一存储单元,其中:
神经元缓存单元,用于缓存主处理电路在计算过程中用到的输入数据和输出数据;
第一运算单元,用于完成主处理电路的各种运算功能;
第一数据依赖关系判定单元,用于从第一存储单元读取输入的神经元向量,并通过互连模块发送给从处理电路;以及接收互连模块的中间结果向量,将中间结果向量发送到第一运算单元。
可选的,所述第一运算单元包括:向量加法单元和激活运算单元;
所述向量加法单元,用于将偏置数据与所述中间结果对位相加得到偏置结果;
所述激活运算单元,用于将所述偏置结果执行激活函数操作。
可选的,每个从处理电路包括第二运算单元、第二数据依赖关系判定单元、第二存储单元和第三存储单元,其中:
第二运算单元,用于执行算数逻辑运算;
第二数据依赖关系判定单元,用于对第二存储单元和第三存储单元的执行读写操作;
第二存储单元,用于缓存输入神经元向量的数据以及该从处理电路计算得到的输出神经元值;
第三存储单元,用于缓存该从处理电路在计算过程中需要的权值向量。
可选的,所述第二运算单元包括:向量乘单元和累加单元;
所述向量乘单元,用于执行点积运算中的向量乘运算;
所述累加单元,用于执行点积运算中的累加运算。
第十方面,提供一种神经网络训练方法,所述方法用于神经网络计算装置;所述神经网络训练计算括神经网络多层训练运算,所述多层训练运算中包括第i层,所述第i层的正向运算或反向运算中至少有部分数据用于定点数据运算,上述i为大于等于1的整数;所述计算装置包括:控制器单元、运算单元和转换单元,其中,控制器单元与运算单元以及转换单元连接;所述第i层训练运算中包括第i层正向运算和第i层反向运算;
所述第i层正向运算包括:
控制器单元获取第i层的输入神经元数据、第i层权值数据以及第i层正向计算指令;解析该第i层计算指令得到多个正向运算指令,将第i层输入神经元数据以及第i层权值数据发送给转换单元,将该多个正向运算指令发送给运算单元;
转换单元将该第i层输入神经元数据以及第i层权值数据中的全部或部分执行浮点类型 与定点类型转换得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,所述混合数据包括:部分定点数据以及部分浮点数据;
运算单元依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的正向输出结果;
所述混合运算包括:对部分定点数据执行定点运算以及对部分浮点数据执行浮点运算。
可选的,所述第i层反向运算包括:
控制器单元获取第i层的输入神经元数据、第i层权值数据、第i层输入神经元梯度以及第i层反向计算指令;解析该第i层计算指令得到多个反向运算指令,将第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度发送给转换单元,将该多个反向运算指令发送给运算单元;
转换单元将该第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度中的全部或部分执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,该混合数据包括:部分定点数据以及部分浮点数据;
运算单元依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的权值梯度以及第i层输出结果梯度;采用第i层的权值梯度对第i层权值更新。
可选的,所述转换单元将该第i层输入神经元数据以及第i层权值数据中的全部或部分执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据和混合数据发送给运算单元,所述混合数据包括:部分定点数据以及部分浮点数据;运算单元依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的正向输出结果具体包括:
所述转换单元将第i层输入神经元数据的部分转换成部分定点输入神经元数据以及将第i层权值数据的部分转换成部分定点权值数据;将部分定点输入神经元数据以及部分定点权值数据发送给运算单元,将部分输入神经元数据和部分权值数据发送给运算单元;
运算单元将部分定点输入神经元数据以及部分定点权值数据执行定点数据运算得到部分定点正向输出结果,将部分定点正向输出结果发送给转换单元,
转换单元将该部分定点正向输出结果执行定点与浮点转换得到第一部分浮点正向输出结果,将第一部分浮点正向输出结果发送给运算单元;
运算单元将部分输入神经元数据和部分权值数据执行运算得到第二部分浮点正向运算结果,将第一部分浮点正向运算结果和第二部分浮点正向运算结果结合起来得到第i层正向输出结果。
可选的,所述转换单元将该第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度中的全部或部分执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据和混合数据发送给运算单元,该混合数据包括:部分定点数据以及部分浮点数据;运算单元依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的权值梯度以及第i层输出结果梯度;采用第i层的权值梯度与第i层权值进行更新具体包括:
所述转换单元将第i层输入神经元数据的部分转换成部分定点输入神经元数据、将第i层权值数据的部分转换成部分定点权值数据以及将第i层输入神经元梯度转换成部分定点输入神经元梯度;将部分定点输入神经元数据、部分定点输入神经元梯度以及部分定点权值数据发送给运算单元,将部分输入神经元数据、部分输入神经元梯度和部分权值数据发送给运算单元;
运算单元将部分定点输入神经元梯度以及部分定点输入数据执行定点数据运算得到部分第i层权值梯度,将部分定点输入神经元梯度与部分定点权值数据执行定点数据运算得到部分第i层输出结果梯度,将部分第i层权值梯度以及部分第i层输出结果梯度发送给转换单元;
转换单元将该部分第i层权值梯度以及部分第i层输出结果梯度执行定点与浮点转换得到第一部分第i层权值梯度以及第一部分第i层输出结果梯度,将第一部分第i层权值梯度以及第一部分第i层输出结果梯度发送给运算单元;
运算单元将部分输入神经元梯度以及部分输入数据执行运算得到第二部分第i层权值梯度,将部分输入神经元梯度与部分权值数据执行运算得到第二部分第i层输出结果梯度,将第一部分第i层权值梯度和第二部分第i层权值梯度结合起来得到第i层权值梯度,将第一部分第i层输出结果梯度和第二部分第i层输出结果梯度结合起来得到第i层输出结果梯度。
第十一方面,提供一种神经网络训练装置,所述神经网络训练装置包括第五方面提供的计算装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的运算,将执行结果通过I/O接口传递给其他处理装置;
当所述神经网络训练装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行连接并传输数据;
其中,多个所述计算装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的神经网络训练运算;多个所述计算装置共享同一控制系统或拥有各自的控制系统;多个所述计算装置共享内存或者拥有各自的内存;多个所述计算装置的互联方式是任意互联拓扑。
第十二方面,提供一种组合处理装置,所述组合处理装置包括第七方面所述的神经网络训练装置,通用互联接口和其他处理装置;
所述神经网络训练装置与所述其他处理装置进行交互,共同完成用户指定的计算操作。
第十三方面,提供一种神经网络芯片,所述神经网络芯片包括第五方面提供的计算装置或第七方面神经网络训练装置或第八方面所述的组合处理装置。
第十四方面,提供一种电子设备,所述电子设备包括第九方面提供的芯片。
第十五方面,提供一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及第九方面提供的神经网络芯片;
其中,所述神经网络芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
所述存储器件,用于存储数据;
所述接口装置,用于实现所述芯片与外部设备之间的数据传输;
所述控制器件,用于对所述芯片的状态进行监控。
可选的,所述存储器件包括:多组存储单元,每一组所述存储单元与所述芯片通过总线连接,所述存储单元为:DDR SDRAM;
所述芯片包括:DDR控制器,用于对每个所述存储单元的数据传输与数据存储的控制;
所述接口装置为:标准PCIE接口。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请提供的一种计算方法的流程示意图。
图1A是定点数据格式示意图。
图1B为本发明实施例提供的一种神经网络运算模块的结构示意图。
图1C为本发明实施例提供的一种神经网络运算方法的流程示意图。
图1D为本发明实施例提供的另一种神经网络运算方法的流程示意图。
图2是本申请提供的一种计算系统的结构示意图。
图2A为本申请的控制单元的结构示意图。
图2B为本申请的计算群的结构示意图。
图2C为群控制器与多个计算单元的一种硬件结构示意图。
图2D为群控制器与多个计算单元的另一种硬件结构示意图。
图3A为一种计算单元的结构示意图。
图3B为一种运算单元的结构示意图。
图3C为另一种运算单元的结构示意图。
图4示出了根据本申请实施例的神经网络计算装置的整体结构的示例框图。
图4A示意性示出了根据本申请实施例的运算单元的结构示意图。
图4B示意性示出了根据本申请实施例的运算单元的另一结构示意图。
图4C示意性示出了根据本申请实施例的树型模块的发送示意图。
图4D示意性示出了根据本申请实施例的树型模块的接收示意图。
图4E示意性示出了根据本申请实施例的组合处理装置的结构示意图。
图5示意性示出了根据本申请实施例的板卡的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地 描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结果或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
对于神经网络的计算方式,其一般分为多种方式,具体的,包括但不限于:单指令多数据流(SIMD)、单指令多线程(SIMT)和同步多线程(SMT)。
单指令多数据流(SIMD)是指计算机对多个数据同时执行单条指令所确定的操作。例如,当需要进行一个或两个长向量的加法运算时,在SIMD的情景下,可以将一个或两个长向量拆成若干个短向量,使多个向量加法部件并行执行若干个短向量的加法运算,随后,将若干个短向量的加法结果合并,即得到长向量的加法运算结果,在SIMD模型中,任意时刻指令流都是单一的,即执行的指令流可以是同一指令,但是执行的数据可以不同。
单指令多线程(SIMT)是指多个线程运行同一条指令,但每一个线程可以有不同的数据。在单指令多线程的情况下,我们常常把线程合成线程组(warp),每次运行同一个线程组中的线程,当一个线程的处理数据被阻塞时,我们通过上下文切换(context switch)将该处理数据切换到另一个线程组的线程进行执行。例如,第一线程组等待访存操作返回操作数时,切换为第二线程组,当操作数准备好之后,可以切换回来至第一线程组。
同步多线程(SMT)是指处理器在同一个时钟周期内可以运行多个来自多个线程的指令。当一个线程被阻塞时,我们可以通过上下文切换来运行另一个线程的指令。
参阅图1,图1提供了一种计算方法,该计算方法可以由计算系统来执行,该计算系统包括:控制单元、计算群和总存储单元,所述控制单元包括:第一存储器、译码逻辑和控制器,所述计算群包括:群控制器和多个计算单元;所述总存储单元,用于存储数据;所述计算方法包括如下步骤:
步骤S101、计算系统的控制器接收第一级指令序列,将该第一级指令序列拆分成多个第二级指令序列,
当然在实际应用中,计算系统也可以直接接收多个第二级指令序列。上述第二级指令序列为集成度比第一级指令序列低一级别的指令序列,即第一级指令序列可以包括或集成多个第二级指令序列。上述包括或集成的方式本申请并不限定。
上述第一级指令序列具体可以为:超长指令,该第二级指令序列包括:指令序列。当然在实际应用中,上述第一级指令序列具体可以为:指令序列,该第二级指令序列可以为:微指令序列。上述仅仅是为了举例说明,对于具体的实现方式中的指令序列只需第一级指令序列包含第二级指令序列的集合即可。
步骤S102、计算系统的控制器为所述多个第二级指令序列开辟M个线程,计算系统的控制器为所述M个线程中每个线程分配独立的存储空间以及配置独立寻址功能;所述M取值范围为大于等于1的整数;
步骤S103、计算系统的群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果。
本申请给出了一种SIMD、SMT和SIMT融合的计算系统和方法,将VLIW作为可选的辅助工具。本申请充分挖掘了计算的并行能力。在深度学习兴起的大背景下,向量计算的计算量越来越大,采用本申请提供的技术方案的能够更快的得到处理结果,所以其具有提高计算速度的优点。
The advantage of the present application is illustrated below with a practical example. Suppose there are 25 vector addition instruction sequences, and the 25 vector addition instruction sequences are combined into one VLIW. With the conventional approach, the VLIW is parsed into 25 vector addition instructions and these 25 vector addition instructions are computed in SIMD fashion to obtain 25 intermediate results; assuming each vector addition instruction takes time t and the SIMD execution is serial, the required time is 25t. With the calculation method provided by the present application, the VLIW is parsed into 25 vector addition instructions, 5 threads can be invoked through SIMT, and each thread executes 5 vector addition instructions in SIMD fashion, so the time to finish the 25 vector addition instructions can be 5t (ignoring switching time). It can be seen that the calculation speed of the method provided by the present application is nearly 5 times higher than that of the existing approach.
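A small Python sketch of the scheduling structure in this example: the 25 additions are grouped into 5 threads of 5 SIMD-style vector additions each. It only illustrates the grouping and reassembly of results, not the hardware timing, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# 25 vector additions obtained from one VLIW
vectors_a = [np.random.randn(1024) for _ in range(25)]
vectors_b = [np.random.randn(1024) for _ in range(25)]

def run_thread(indices):
    # each addition processes a whole vector at once (SIMD-style)
    return [vectors_a[i] + vectors_b[i] for i in indices]

# SIMT-style grouping: 5 threads, each executing 5 vector additions
thread_groups = [range(k, k + 5) for k in range(0, 25, 5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    partials = list(pool.map(run_thread, thread_groups))

results = [r for group in partials for r in group]   # 25 results stitched back together
print(len(results))                                   # 25
```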
可选的,所述依据所述多个计算类型获取计算类型对应的融合计算方式,采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
如所述计算类型代表相同类型的计算操作,群控制器调用相同类型的单指令多数据流SIMD、单指令多线程SIMT的组合计算方式,调用所述M个线程执行计算得到最终结果,具体包括:
译码逻辑将M个线程拆分成N个线程组,将所述多个第二指令序列转换成多个第二控制信号,将多个第二控制信号以及N个线程组分配给多个计算单元,多个计算单元调用分配的线程组以及第二控制信号提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将多个中间结果拼接起来得到最终结果。
可选的,所述依据所述多个计算类型获取计算类型对应的融合计算方式,采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
如所述计算类型代表不同类型的计算操作,群控制器调用相同类型的单指令多数据流SIMD、同步多线程SIM的组合计算方式,调用所述M个线程执行计算得到最终结果,具体包括:
群控制器将M个线程拆分成N个线程组,将所述多个第二指令序列转换成多个第二控 制信号,对不同类型的计算操作的第二指令序列分配N个线程组中不同的线程组,群控制器获取计算单元的功能类型,如计算单元A的功能类型与该多个第二指令序列的指令序列A的类型相同,将该指令序列A对应的控制信号A分配给计算单元A执行得到中间结果;如计算单元的功能类型与第二指令序列的计算类型不相同,将多个第二控制信号以及N个线程组分配给多个计算单元,多个计算单元调用分配的线程组以及第二控制信号提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将所有中间结果拼接起来得到最终结果。
可选的,所述方法还包括:
如多个线程组中的线程组A阻塞,控制器将线程组A加入等待队列,如线程组A的数据已被提取,将线程组A加入到准备队列,所述准备队列为计算资源空闲时被调度执行的线程组所在的队列。
参阅图2,图2提供一种计算系统,所述控制单元20、计算群21和总存储单元22,如图2A所示,所述控制单元包括:第一存储器301、译码逻辑302、控制器303和调度器304,参阅图2B,计算群包括:群控制器305和多个计算单元306;所述总存储单元22,用于存储数据;
控制器303,用于接收第一级指令序列以及用于控制所述第一存储器301和所述译码逻辑302;
所述译码逻辑302,用于将该第一级指令序列拆分成多个第二级指令序列;
所述控制器303,还用于为所述多个第二级指令序列开辟M个线程;为所述M个线程中每个线程分配独立的存储空间以及配置独立寻址功能;所述M取值范围为大于等于1的整数,将所述多个第二级指令序列转换成多个控制信号发送给所述群控制器;
所述群控制器305,用于接收所述多个控制信号,获取所述多个控制信号的多个计算类型,将M个线程划分成N个线程组,依据该多个计算类型为多个计算单元分配N个线程组以及多个控制信号;
计算单元306,用于通过分配的线程组以及控制信令从所述总存储单元22提取数据执行运算得到中间结果,
所述群控制器305,用于将所有中间结果拼接得到最终计算结果。
可选的,所述多个计算单元306包括:加法计算器、乘法计算器、激活计算器或专用计算器。
可选的,所述专用计算器包括:人脸识别计算计算器、图形计算器、指纹计算器或神经网络计算器。
可选的,所述群控制器,具体用于:如多个控制信号的计算类型为图形计算、指纹识别、人脸识别或神经网络运算,将该多个控制信号分别分配给人脸识别计算计算器、图形计算器、指纹计算器或神经网络计算器。
可选的,所述第一级指令序列包括:超长指令,所述第二级指令序列包括:指令序列。
计算系统可以包括:控制单元20、计算群21、存储单元22。控制单元负责指令的分 发、线程的开辟、普通指令和超长指令字的译码、控制信号的发出等。控制单元包括:本地存储、译码逻辑、调度器和控制器。其中,本地存储用于存储指令,译码逻辑可对超长指令字和普通指令进行译码,调度器负责线程的上下文切换,控制器调用存储的代码控制控制单元中各子模块(例如本地存储、译码逻辑和调度器)的行为。
计算群可以包括:群控制器和多个计算单元。群控制器接收来自控制单元的控制信号并将其转换为群内控制信号,将该群内控制信号发送给多个计算单元中的一个或多个计算单元以对该群内控制信号进行计算。计算单元可以包括多种功能部件,具体的,包括向量运算部件和各种针对专用算法的优化计算部件(如针对机器学习或图形处理的专用部件等)。计算单元还可以包括:单元控制器和本地存储。单元控制器用于控制计算单元内的各功能部件行为,本地存储用于缓存数据。
存储单元用于存储用户输入数据、计算群输出数据等。计算群可在控制单元的控制下通过多种寻址方式从存储单元提取合适的数据。
下面以超长指令字为例来说明该计算系统可以完成的功能,需要说明的是,上述超长指令字仅仅是为了举例说明,在实际应用中,本申请的技术方案并不限制上述指令的具体形式,例如指令序列。
超长向量是一个长度非常长的向量,该向量可以包括多段数据,计算系统可以对多段数据的每一段执行不同的操作,也可以对多段数据执行相同的操作。当计算系统需要对一个或多个超长向量进行计算时,首先编译器将超长向量各段的存储信息和所需操作的信息打包成超长指令字发送给控制单元。控制单元对超长指令字进行译码,将超长指令字解码为一系列微控制指令序列。(注意,超长指令字是可选项,当不使用超长指令字的时候,控制单元的本地存储中存储的是指令序列,由译码逻辑将它们译码为微控制指令序列。注意,微控制指令序列也是可选的,指令序列也可以直接由控制器开辟线程执行。注意,本地存储也是可选项,可由存储单元替代。)对于一系列涉及向量的相同类型的计算操作,计算系统采取SIMT和SIMD融合的计算方式。控制器单元为微控制指令序列开辟多个线程,每个线程有独立的存储空间并且可以独立寻址。根据计算群中计算单元数目,将适当数量的线程打包为线程组,这样计算系统将会得到一个或多个线程组(一般为多个线程组)。调度器接收线程分配信息,协同译码逻辑将线程中的微控制指令序列转化为控制信号发往计算群的群控制单元。群控制单元接收来自控制单元的控制信号,并将控制信号转化为群内控制信号发往合适的计算单元。计算单元从存储单元读取向量操作数并进行向量计算,中间结果可暂存在本地存储,最终结果存储在存储单元中。当线程组因为访存而阻塞时,通过上下文切换,计算群执行其他线程组的计算操作,阻塞的线程组进入等待队列,当阻塞的线程组的操作数准备好后,线程组从等待队列进入到准备队列。准备队列中的线程组可在计算资源空闲时被调度执行。线程组内包含的线程数量一般是恒定的,若剩余线程数不足一个线程组,则用非活跃线程填充至恒定值。对于一系列涉及向量的不同类型的计算操作,计算系统采取SMT和SIMD融合的计算方式。计算系统将不同操作的微控制指令序列分配给处于不同线程组的线程。在计算过程中,若一个线程组阻塞则计算系统可以进行上下文切换从而执行其他操作的线程组。上述计算可以由几个计算单元协同完 成,如对于一个视频压缩计算,可将计算过程的预测、变换、量化和熵编码过程分配给不同的计算单元,计算单元之间可以互相传递结果,从而构成流水线。
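The waiting-queue / ready-queue behaviour described in the preceding paragraph can be summarized with a small scheduling sketch (a software illustration under the stated assumptions; the queue and function names are not from the patent):

```python
from collections import deque

ready, waiting = deque(), deque()

def on_block(group):
    waiting.append(group)                 # thread group blocked on a memory access

def on_operands_ready(group):
    waiting.remove(group)
    ready.append(group)                   # operands fetched: eligible to run again

def schedule_next():
    # scheduled for execution when computing resources are idle
    return ready.popleft() if ready else None
```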
参阅图2C,图2C为群控制器与多个计算单元的一种硬件结构示意图,该计算系统还可以包括树型模块401,该树型模块可以为n叉树结构,该n为大于2的整数,具体的,树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述群控制器305,所述树型模块的多个支端口分别连接多个计算单元306中的一个计算单元306;
所述树型模块,用于转发所述群控制器305与所述多个计算单元之间的数据块、线程组或指令序列。
参阅图2D,图2D为群控制器与多个计算单元的另一种硬件结构示意图,所述计算系统包括:分支处理电路,
所述分支处理电路连接在所述群控制器与所述多个计算单元之间;
所述分支处理电路,用于转发所述群控制器与所述多个计算单元之间的数据、线程组或指令序列。
所述计算单元包括:乘法处理电路;乘法处理电路对接收到的数据执行乘积运算得到乘积结果;所述计算单元还包括:累加处理电路,所述累加处理电路对所述乘积结果执行累加运算得到所述中间结果。
需要说明的是,上述计算单元还可以是另外一种硬件的结构,如图3A所示,控制器单元311和运算单元312,其中,控制器单元311与运算单元312连接,该运算单元312包括:一个主处理电路和多个从处理电路;
控制器单元311,用于获取数据、线程组以及指令,在执行人工神经网络模型运算时,所述数据包括:输入神经元数据、权值数据和输出神经元数据;在一种可选方案中,具体的,获取数据、线程组以及指令可以通过数据输入输出单元得到,该数据输入输出单元具体可以为一个或多个数据I/O接口或I/O引脚。
上述指令包括但不限于:正向运算指令或反向训练指令,或其他神经网络运算指令等等,例如卷积运算指令,本申请具体实施方式并不限制上述计算指令的具体表现形式。
控制器单元311,还用于解析该指令得到多个运算指令,将该多个运算指令以及所述数据发送给所述主处理电路;
主处理电路3101,用于对所述数据执行前序处理,以及向所述多个从处理电路传输数据以及运算指令;
多个从处理电路3102,用于依据从所述主处理电路传输的数据以及运算指令并行执行中间运算得到多个中间数据结果,并将多个中间数据结果传输给所述主处理电路;
主处理电路3101,用于对所述多个中间数据结果执行后续处理得到所述指令的指令结果。
可选的,上述计算单元还可以包括:该存储单元310和直接内存访问单元,存储单元可以包括:寄存器、缓存中的一个或任意组合,具体的,所述缓存,用于存储所述运算指令;所述寄存器,用于存储线程组、指令、数据或标量;所述缓存为高速暂存缓存。直接内存访问单元用于从存储单元310读取或存储数据。
可选的,该控制器单元包括:指令存储单元、指令处理单元和存储队列单元;
指令存储单元,用于存储指令;
所述指令处理单元,用于对所述计算指令解析得到多个运算指令;
存储队列单元,用于存储队列,该队列可以为指令队列,该指令队列包括:按该队列排列的前后顺序待执行的多个运算指令或计算指令。
可选的,该控制器单元还可以包括:依赖关系处理单元;
依赖关系处理单元,用于在具有多个运算指令时,确定第一运算指令与所述第一运算指令之前的第零运算指令是否存在关联关系,如所述第一运算指令与所述第零运算指令存在关联关系,则将所述第一运算指令缓存在所述指令存储单元内,在所述第零运算指令执行完毕后,从所述指令存储单元提取所述第一运算指令传输至所述运算单元;
所述确定该第一运算指令与第一运算指令之前的第零运算指令是否存在关联关系,包括:
依据所述第一运算指令提取所述第一运算指令所需数据(例如矩阵)的第一存储地址区间,依据所述第零运算指令提取所述第零运算指令中所需矩阵的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第零运算指令具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第零运算指令不具有关联关系。
在一种可选的实施方案中,如图3B所示,为一种运算单元的结构,该运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;上述树型模块具有收发功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据块、权值以及运算指令。
在一种可选实施例中,运算单元12如图3C所示,可以包括分支处理电路;其具体的连接结构如图3C所示,其中,
主处理电路3101与分支处理电路3103连接,分支处理电路3103与多个从处理电路3102连接;
分支处理电路3103,用于转发主处理电路3101与从处理电路3102之间的数据或指令。
在运算中,为了减少计算量以及提高计算速度,通常将浮点数转换成定点数进行计算,因为定点数的比特位一般比浮点数小,因此能够降低内存容量,并且提高计算的速度。
A fixed-point number is a data format in which the position of the decimal point can be specified; the bit width is usually used to denote the data length of a fixed-point number. For example, the bit width of a 16-bit fixed-point number is 16. For a fixed-point number of a given bit width, the representable precision and the representable numeric range are related: the larger the representable precision value, the smaller the representable numeric range. As shown in FIG. 1A, for a fixed-point data format with bit width bitnum, the first bit is the sign bit, the integer part occupies x bits, and the fractional part occupies s bits, so the maximum fixed-point precision S that this format can represent is 2^(-s). The range this fixed-point data format can represent is [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(-s) and neg = -(2^(bitnum-1)) * 2^(-s).
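A short sketch of the precision and range formulas just given (illustrative only; it evaluates S = 2^(-s), pos and neg for a chosen bit width and fractional width):

```python
def fixed_point_range(bitnum, s):
    precision = 2.0 ** -s                      # maximum fixed-point precision S = 2^(-s)
    pos = (2 ** (bitnum - 1) - 1) * precision  # largest representable value
    neg = -(2 ** (bitnum - 1)) * precision     # smallest representable value
    return precision, neg, pos

print(fixed_point_range(16, 8))                # (0.00390625, -128.0, 127.99609375)
```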
在神经网络运算中,数据可以用定点数据格式进行表示例如,在正向运算过程中,第L层的数据包括输入神经元X (l)、输出神经元Y (l)、权重W (l)。在反向运算过程中,第L层的数据包括输入神经元梯度
Figure PCTCN2019085844-appb-000059
输出神经元梯度
Figure PCTCN2019085844-appb-000060
权重梯度
Figure PCTCN2019085844-appb-000061
可以将上面的数据均用定点数进行表示,也可以将用定点数据格式进行表示的数据用定点数进行运算。
在神经网络的训练过程通常包括正向运算和反向运算两个步骤,在反向运算时,输入神经元梯度、权重梯度和输出神经元梯度所需要的精度可能会出现变化,可能随着训练的过程减小,如果定点数的精度冗余,则会增加运算开销,浪费运算资源。
在神经网络运算的过程中,由于经过加减乘除和卷积等一系列运算,正向运算过程包括的输入神经元、权重和输出神经元和反向训练过程包括的输入神经元梯度、权重梯度和输出神经元梯度会发生变化。以定点数据格式表示的输入神经元、权重、输出神经元、输入神经元梯度、权重梯度和输出神经元梯度的精度有可能需要增大或者减小。如果输入神经元、权重、输出神经元、输入神经元梯度、权重梯度和输出神经元梯度的精度不够,会导致运算结果出现较大的误差,甚至会导致反向训练失败;如果输入神经元、权重、输出神经元、输入神经元梯度、权重梯度和输出神经元梯度的精度冗余,则会增大不必要的的运算开销,浪费运算资源。本申请提出了一种神经网络运算模块及方法,在进行神经网络运算的过程中动态调整上述数据的精度,以在满足运算需求的同时,减少运算结果的误差,提高运算结果的精度。
本申请的实施例通过调整上述数据的位宽来达到调整该数据精度的目的。比如在定点数据格式的精度超过运算的需求时,可以通过将定点数据格式中的小数部分的位宽减少,即减小图1A中的s,从而降低上述定点数据格式的精度;但是定点数据格式的精度与其小数部分的位宽相关,即可通过增大或者减少小数部分的位宽来调整定点数据格式的精度。因此定点数据格式的精度小于需求精度时,可以减少小数部分的位宽,从而增大定点数据格式的精度,进而降低定点数据格式的精度冗余,减少运算开销,避免运算资源的浪费。
请参阅图1B,图1B是为本发明实施例提供的一种神经网络运算模块的结构示意图。该神经网络运算模块用于进行多层神经网络的运算。如图1B所示,该神经网络运算模块100包括:
存储单元101,用于存储输入神经元精度、权重精度和输出神经元梯度精度。
控制器单元102,用于从所述存储单元101获取所述多层神经网络第L层的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000062
其中,所述L为大于0的整数;根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000063
获取梯度更新精度T;当所述梯度更新精度T小于预设精度T r时,调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000064
在一种可行的实施例中,上述存储单元101还用于存储输入神经元、权重和输出神经元以及输出神经元梯度,上述控制器单元102从上述存储单元101获取第L层输入神经元、权重和输出神经元梯度,该控制器单元102根据上述第L层输入神经元、权重和输出神经元梯度获取上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000065
其中,用于表示输入神经元的定点数据个数的位宽和用于表示权重的定点数据格式的 位宽为第一位宽,用于表示上述输出神经元梯度的定点数据格式的位宽为第二位宽。
可选地,上述第二位宽大于上述第一位宽。
进一步地,上述第二位宽为上述第一位宽的两倍,以便于电子计算机进行处理。
进一步地,上述第一位宽可选为8位,上述第二位宽可选为16位。
其中,上述控制器单元102可以根据由用户进行预先设置,将精度预设为T r;也可依据第二预设公式,通过改变输入参数的方式获得与输入参数匹配的预设精度T r;还可以通过机器学习的方法获取T r
可选地,上述控制器单元102根据学习率、batchsize(批处理时的样本数量)设置上述预设精度T r
进一步地,如果该神经网络中存在参数共享层(如卷积层和循环神经网络层),则上述控制器单元102根据上一层输出神经元的个数以及batchsize、学习率来设置上述预设精度T r,即上一层的输出神经元的个数越高以及batchsize越大、学习率越高,上述预设精度T r越大。
具体地,上述控制器单元102获取上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000066
后,根据第一预设公式对上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000067
进行计算,以得到上述梯度更新精度T,其中,上述第一预设公式可以为:
Figure PCTCN2019085844-appb-000068
其中,上述控制器单元102调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000069
包括:
上述控制器单元102保持上述输入神经元精度S x(l)和权重精度S w(l)不变,增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000070
需要指出的是,由于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000071
上述控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000072
是指减小表示该输出神经元梯度的定点数据格式的小数部分位宽s1。
可选地,上述控制器单元102根据Tr-T的值按照第一预设步长N1减小上述表示权重的定点数据格式的小数部分位宽s1。
具体地,对于表示上述输出神经元梯度的定点数据格式的小数部分位宽s1,上述控制器单元102每次减小N1位,即小数部分的位宽为s1-N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000073
再根据上述预设公式
Figure PCTCN2019085844-appb-000074
判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;当确定该梯度更新精度T与上述预设精度Tr的差值的绝对值变小时,上述控制器单元102继续对表示上述输出神经元梯度的定点数据格式的小数部分位宽减小N1,即位宽为s1-2*N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000075
并继续判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;若变小,则继续按照上述方法进行处理;若在第n次处理时上述梯度更新精度T与上述预设精度Tr的差值的绝对值变大,上述控制器单元102则将第n-1次处理得到的位宽,即s1-(n-1)*N1作为表示上述输出神经元梯度的定点数据格式的小数部分的位宽,减小小数部分位宽后的输出神经元梯度精度为
Figure PCTCN2019085844-appb-000076
可选地,上述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
可选地,上述控制器单元102按照2倍递减的方式,减小表示上述输出神经元梯度的定点数据格式的小数部分位宽。
比如表示上述输出神经元梯度的定点数据格式的小数部分位宽为4，即输出神经元梯度的精度为2^(-4)，则按照2倍递减的方式减少位宽后的表示上述输出神经元梯度的定点数据格式的小数部分位宽为2，即减小后的输出神经元梯度精度为2^(-2)。
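The adjustment loop described above (shrink the fractional bit width of the output-neuron-gradient format step by step while |T - Tr| keeps decreasing, and keep the last width that still reduced the gap) can be sketched as follows. The gradient-update-precision formula itself appears only as an image in the source, so it is passed in as a callable here, and all names are illustrative:

```python
def adjust_output_gradient_precision(s1, compute_T, T_target, step=1, max_iters=32):
    """Reduce the fractional bit width s1 by `step` while |T - Tr| shrinks."""
    best_s1 = s1
    best_gap = abs(compute_T(s1) - T_target)
    for _ in range(max_iters):
        candidate = best_s1 - step
        if candidate < 0:
            break
        gap = abs(compute_T(candidate) - T_target)
        if gap >= best_gap:          # the gap stopped shrinking: keep the previous width
            break
        best_s1, best_gap = candidate, gap
    return best_s1
```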
在一种可行的实施例中,上述控制器单元102确定对表示上述输出神经元梯度的定点数据格式的小数部分位宽的减少幅度b后,上述控制器单元102分多次减少上述定点数据格式的小数部分位宽,比如上述控制器单元102分两次减少上述定点数据格式的小数部分位宽,第一次减少的幅度为b1,第二次减少的幅度为b2,且b=b1+b2。
其中,上述b1与b2可以相同或者不相同。
可选地,上述控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000077
时,减少表示该输出神经元梯度的定点数据格式的位宽。
进一步地,由于增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000078
是通过减少表示上述输出神经元梯度的定点数据格式的小数部分位宽来实现的,且由于表示上述输出神经元梯度的定点数据格式的位宽不变,若小数部分位宽减少,则整数部分位宽增大,该定点数据格式表示的数据范围会增大,但是该定点数据格式表示的精度也增大,因此在控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000079
后,该控制器单元102减少上述定点数据格式的位宽,且该定点数据格式的位宽减少后,其整数部分的位宽保持不变,即整数部分位宽的减少值与小数部分位宽的减少值相同,由此保证了在小数部分位宽改变的情况下,该定点数据格式表示的最大值不变。
举例说明,上述定点数据格式的位宽为9,其中符号位的位宽为1,整数部分的位宽为5,小数部分的位宽为4,上述控制器单元102减小上述小数部分的位宽和整数部分的位宽后,小数部分的位宽为2,整数部分的位宽为5,即减少上述小数部分的位宽,整数部分的位宽保持不变。
在一种可行的实施例中,上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000080
后,该控制器单元102还用于:
判断所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000081
是否小于需求精度,所述需求精度为进行多层神经网络运算时输出神经元梯度的最小精度;
当所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000082
小于所述需求精度时,减少表示所述输出神经元梯度的定点数据格式的位宽。
需要指出的是,上述控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000083
的原因是该输出神经元梯度精度
Figure PCTCN2019085844-appb-000084
小于上述需求精度,即存在精度冗余,此时会增大运算开销,浪费运算资源。因此为了减小运算开销,避免运算资源的浪费,需要增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000085
具体地,由上述相关描述可知,上述控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000086
后,需要进一步判断是否存在精度冗余,即判断输出神经元梯度精度
Figure PCTCN2019085844-appb-000087
是否小于需求精度。当确定上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000088
小于上述需求精度时,减少表示所述输出 神经元梯度的定点数据格式的位宽,以增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000089
降低精度冗余。
需要指出的是,上述控制器单元102减少上述定点数据格式的位宽具体是减少该定点数据格式的整数部分的位宽。
进一步地,上述控制器单元102减少所述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照第二预设步长N2减少所述表示所述输出神经元梯度的定点数据格式的位宽,其中,第二预设步长N2可为1、2、3、4、5、7、8或者其他正整数。
具体地,当确定减少上述定点数据格式的位宽时,上述控制器单元102每次减少该定点数据格式的位宽时的减少值为上述第二预设步长N2。
在一种可行的实施例中,上述控制器单元102减少上述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照2倍递减的方式减少上述表示上述输出神经元梯度的定点数据格式的位宽。
举例说明,上述定点数据格式除去符号位的位宽为16,则按照2倍递减的方式减少该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为8;再次按照2倍递减的方式减少该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为4。
在一种可行的实施例中,上述控制器单元102调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000090
包括
上述控制器单元102增大上述输入神经元精度S x(l)和/或上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000091
保持上述权重精度S w(l)不变,或者;
上述控制器单元102增大上述输入神经元精度S x(l),减少上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000092
保持上述权重精度S w(l)不变,且上述输入神经元精度S x(l)增大的幅度大于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000093
的减小幅度,或者;
上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000094
增大上述输入神经元精度S x(l),保持上述权重精度S w(l)不变,且上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000095
减小的幅度小于上述输入神经元精度S x(l)的增大幅度,或者;
上述控制器单元102增大或减小上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000096
中的一个或者任意组合,以使上述梯度更新精度T与上述预设精度T r的差值的绝对值最小。
在此需要说明的是,上述控制器单元102对上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000097
中的任意一个的进行减小操作的具体过程可参见上述控制器单元102增大上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000098
的相关操作,在此不再叙述。
按照上述方法调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000099
后,上述运算单元103在运算过程中,按照调整后的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000100
以定点数据格式形式表示上述第L层的输入神经元、权重和输出神经元梯度,然后进行后续的运算。
需要说明的是,上述控制器单元102计算上述梯度更新精度T的频率可以根据需求灵活设置。
其中,上述控制器单元102可根据神经网络训练过程中的训练迭代次数调整计算上述梯度更新精度T的频率。
可选地,上述控制器单元102在神经网络训练过程中,每迭代一轮就重新计算上述梯度更新精度T;或者每迭代预设次数就重新计算上述梯度更新精度T;或者根据上述梯度更新精度T的变化进行设置上述频率。
可选地,上述控制器单元102根据神经网络训练中的训练迭代次数来设置计算上述梯度更新精度T的频率。
运算单元103,用于根据增大或者减小后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输入神经元和权重;根据增大或者减小后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000101
来表示运算得到的第L层输出神经元梯度。
换句话说,上述运算单元,用于增大或者减小输入神经元精度S x(l)的定点数据格式来表示上述第L层输入神经元,用增大或者减小权重精度S w(l)的定点数据格式来表示上述第L层的权重,用增大或者减小输出神经元梯度精度
Figure PCTCN2019085844-appb-000102
的定点数据格式来表示上述第L层的输出神经元梯度,以进行后续的运算。
通过在神经网络运算过程中,动态调整(包括增大或者减小)上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000103
以在满足运算需求的同时,降低精度冗余,减小运算开销,避免对运算资源造成浪费。
参见图1C,图1C为本发明实施例提供的一种神经网络运算方法的流程示意图,如图1C所示,该方法包括:
S201、神经网络运算模块获取神经网络的第L层输入神经元精度、权重精度和输出神经元梯度精度。
其中,上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000104
的取值可以相同,也可以是部分相同或者两两互不相等。
其中,上述神经网络为多层神经网络,上述第L层输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000105
分别为上述多层神经网络的任一层的输入神经元精度、权重精度和输出神经元梯度精度。
在一种可行的实施例中,上述神经网络运算模块获取上述第L层的输入神经元、权重和输出神经元;根据上述第L层的输入神经元、权重和输出神经元,获取上述第L层输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000106
S202、神经网络运算模块根据第L层输入神经元精度、权重精度和输出神经元梯度精度,计算得到梯度更新精度T。
具体地,上述神经网络运算模块根据第一预设公式对上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000107
进行计算,以得到上述梯度更新精度T。
其中,上述第一预设公式为
Figure PCTCN2019085844-appb-000108
S203、当梯度更新精度T小于预设精度T r时,神经网络运算模块调整第L层输入神经 元精度、权重精度和输出神经元梯度,以使梯度更新精度T与预设精度T r的差值的绝对值最小。
其中,用于表示输入神经元的定点数据格式和用于表示权重的定点数据格式的位宽为第一位宽,用于表示输出神经元梯度的定点数据格式的位宽为第二位宽。
可选地,上述第二位宽大于上述第一位宽。
进一步地,上述第二位宽为上述第一位宽的两倍,以便于电子计算机进行处理。
进一步地,上述第一位宽可选为8位,上述第二位宽可选为16位。
其中,上述预设精度T r可以根据经验进行预先设置;也可以根据第二预设公式,通过改变输入参数的方式获得与输入参数匹配的T r;还可以通过机器学习的方法获取T r
可选地,上述神经网络运算模块根据学习率、batchsize(批处理时的样本数量)设置上述预设精度T r
进一步地,如果该神经网络中存在参数共享层(如卷积层和循环神经网路层),则根据上一层输出神经元的个数以及batchsize、学习率来设置上述预设精度T r,即上一层的输出神经元的个数越高以及batchsize越大、学习率越高,预设精度T r越大。
其中,神经网络运算模块调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000109
包括:
上述神经网络运算模块保持上述输入神经元精度S x(l)和权重精度S w(l)不变,增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000110
需要指出的是,由于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000111
上述神经网络运算模块增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000112
是指减小表示该输出神经元梯度的定点数据格式的小数部分位宽s1。
可选地,上述神经网络运算模块控制器单元根据Tr-T的值按照第一预设步长N1减小上述表示权重的定点数据格式的小数部分位宽s1。
具体地,对于表示上述输出神经元梯度的定点数据格式的小数部分位宽s1,上述神经网络运算模块每次减小N1位,即小数部分的位宽为s1-N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000113
再根据上述预设公式
Figure PCTCN2019085844-appb-000114
判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;当确定该梯度更新精度T与上述预设精度Tr的差值的绝对值变小时,上述神经网络运算模块继续对表示上述输出神经元梯度的定点数据格式的小数部分位宽减小N1,即位宽为s1-2*N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000115
并继续判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;若变小,则继续按照上述方法进行处理;若在第n次处理时上述梯度更新精度T与上述预设精度Tr的差值的绝对值变大,上述神经网络运算模块则将第n-1次处理得到的位宽,即s1-(n-1)*N1作为表示上述输出神经元梯度的定点数据格式的小数部分的位宽,减小小数部分位宽后的输出神经元梯度精度为
Figure PCTCN2019085844-appb-000116
可选地,上述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
可选地,上述神经网络运算模块按照2倍递减的方式,减小表示上述输出神经元梯度的定点数据格式的小数部分位宽。
比如表示上述输出神经元梯度的定点数据格式的小数部分位宽为4，即输出神经元梯度的精度为2^(-4)，则按照2倍递减的方式减少位宽后的表示上述输出神经元梯度的定点数据格式的小数部分位宽为2，即减小后的输出神经元梯度精度为2^(-2)。
在一种可行的实施例中,上述神经网络运算模块确定对表示上述输出神经元梯度的定点数据格式的小数部分位宽的减少幅度b后,上述神经网络运算模块分多次减少上述定点数据格式的小数部分位宽,比如上述神经网络运算模块分两次减少上述定点数据格式的小数部分位宽,第一次减少的幅度为b1,第二次减少的幅度为b2,且b=b1+b2。
其中,上述b1与b2可以相同或者不相同。
可选地,上述神经网络运算模块增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000117
时,减少表示该输出神经元梯度的定点数据格式的位宽。
进一步地,由于增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000118
是通过减少表示上述输出神经元梯度的定点数据格式的小数部分位宽来实现的,且由于表示上述输出神经元梯度的定点数据格式的位宽不变,若小数部分位宽减少,则整数部分位宽增大,该定点数据格式表示的数据范围会增大,但是该定点数据格式表示的精度也增大,因此在神经网络运算模块增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000119
后,该神经网络运算模块减少上述定点数据格式的位宽,且该定点数据格式的位宽减少后,其整数部分的位宽保持不变,即整数部分位宽的减少值与小数部分位宽的减少值相同,因此保证了在小数部分位宽改变的情况下,该定点数据格式表示的最大值不变。
举例说明,上述定点数据格式的位宽为9,其中符号位的位宽为1,整数部分的位宽为5,小数部分的位宽为3,上述神经网络运算模块减小上述小数部分的位宽和整数部分的位宽后,小数部分的位宽为2,则整数部分的位宽为5,即减少上述小数部分的位宽,整数部分的位宽保持不变。
在一种可行的实施例中,上述神经网络运算模块减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000120
后,该神经网络运算模块还用于:
判断所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000121
是否小于需求精度,所述需求精度为进行多层神经网络运算时输出神经元梯度的最小精度;
当所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000122
小于所述需求精度时,减少表示所述输出神经元梯度的定点数据格式的位宽。
需要指出的是,上述神经网络运算模块增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000123
的原因是该输出神经元梯度精度
Figure PCTCN2019085844-appb-000124
小于上述需求精度,即存在精度冗余,此时会增大运算开销,浪费运算资源。因此为了减小运算开销,避免运算资源的浪费,需要增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000125
具体地,由上述相关描述可知,上述神经网络运算模块增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000126
后,需要进一步判断是否存在精度冗余,即判断输出神经元梯度精度
Figure PCTCN2019085844-appb-000127
是否小于需求精度。当确定上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000128
小于上述需求精度时,减少表示所述输出神经元梯度的定点数据格式的位宽,以增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000129
降低精度冗余。
需要指出的是,上述神经网络运算模块减少上述定点数据格式的位宽具体是减少该定 点数据格式的整数部分的位宽。
进一步地,上述神经网络运算模块减少所述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述神经网络运算模块按照第二预设步长N2减少所述表示所述输出神经元梯度的定点数据格式的位宽,其中,第二预设步长N2可为1、2、3、4、5、7、8或者其他正整数。
具体地,当确定减少上述定点数据格式的位宽时,上述神经网络运算模块每次减少该定点数据格式的位宽时的减少值为上述第二预设步长N2。
在一种可行的实施例中,上述神经网络运算模块减少上述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述神经网络运算模块按照2倍递减的方式减少上述表示上述输出神经元梯度的定点数据格式的位宽。
举例说明,上述定点数据格式除去符号位的位宽为16,则按照2倍递减的方式减少该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为8;再次按照2倍递减的方式减少该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为4。
在一种可行的实施例中,上述神经网络运算模块调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000130
包括
上述神经网络运算模块增大上述输入神经元精度S x(l)和/或上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000131
保持上述权重精度S w(l)不变,或者;
上述神经网络运算模块增大上述输入神经元精度S x(l),减少上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000132
保持上述权重精度S w(l)不变,且上述输入神经元精度S x(l)增大的幅度大于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000133
的减小幅度,或者;
上述神经网络运算模块减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000134
增大上述输入神经元精度S x(l),保持上述权重精度S w(l)不变,且上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000135
的减小幅度小于上述输入神经元精度S x(l)的增大幅度,或者;
上述神经网络运算模块增大或减小上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000136
中的一个或者任意组合,以使上述梯度更新精度T与上述预设精度T r的差值的绝对值最小。
在此需要说明的是,上述神经网络运算模块对上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000137
中的任意一个的进行减小操作的具体过程可参见上述神经网络运算模块增大上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000138
的相关操作,在此不再叙述。
S204、神经网络运算模块根据调整后的输入神经元精度和权重精度来表示第L层的输入神经元和权重;根据调整后的输出神经元梯度精度来表示运算得到的第L层输出神经元梯度,以进行后续运算。
换句话说,上述运算单元,用于增大或者减小输入神经元精度S x(l)的定点数据格式来表示上述第L层输入神经元,用增大或者减小权重精度S w(l)的定点数据格式来表示上述第L层的权重,用增大或者减小输出神经元梯度精度
Figure PCTCN2019085844-appb-000139
的定点数据格式来表示上述第L层 的输出神经元梯度,以进行后续的运算。
按照上述方法调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000140
后,上述神经网络运算模块重新计算上述梯度更新精度T;当该梯度更新精度不再大于上述预设精度T r时,上述神经网络运算模块参照上述步骤S203的方式减小上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000141
需要说明的是,上述神经网络运算模块计算上述梯度更新精度T的频率可以根据需求灵活设置。
其中,上述神经网络运算模块可根据神经网络训练过程中的训练迭代次数调整计算上述梯度更新精度T的频率。
可选地,上述神经网络运算模块在神经网络训练过程中,每迭代一轮就重新计算上述梯度更新精度T;或者每迭代预设次数就重新计算上述梯度更新精度T;或者根据上述梯度更新精度T的变化进行设置上述频率。
可选地,上述神经网络运算模块根据神经网络训练中的训练迭代次数来设置计算上述梯度更新精度T的频率。
可以看出,本发明实施例的方案在神经网络运算过程中,动态调整上述输入神经元精度S x、权重精度S w和输出神经元梯度精度
Figure PCTCN2019085844-appb-000142
以在满足运算需求的同时,降低精度冗余,减小运算开销,避免对运算资源造成浪费。
在神经网络领域,训练计算为神经网络应用的基础,对于训练计算又称为模型的预训练或预处理,由于训练计算的运算量大,通常需要专用的设备(例如数据中心)处理,这使得如何降低训练计算的运算量成为将训练计算应用到普通设备(例如个人计算机、终端设备)的关键。
在神经网络运算中,数据可以用定点数据格式进行表示、运算。例如,在正向运算过程中,第L层的数据包括输入神经元X (l)、输出神经元Y (l)、权重W (l)。在反向运算过程中,第L层的数据包括输入神经元梯度
Figure PCTCN2019085844-appb-000143
输出神经元梯度
Figure PCTCN2019085844-appb-000144
权重梯度
Figure PCTCN2019085844-appb-000145
可以将上面的数据均用定点数进行表示,也可以用定点数进行运算。
定点数是一种可以指定小数点位置的数据格式,我们通常用位宽来表示一个定点数的数据长度。例如,16位定点数的位宽就是16。对于给定位宽的定点数,可表示数据的精度和可表示的数字范围是有关联的:小数部分位宽越大,可表示的精度越高,但可表示的数字范围就越小。如图1A所示,对于位宽为bitnum的定点数据格式,第一位为符号位,整数部分占x位,小数部分占s位,则该定点数据格式能够表示的最大定点精度S为2^-s。该定点数据格式可以表示的范围为[neg,pos],其中pos=(2^(bitnum-1)-1)*2^-s,neg=-(2^(bitnum-1))*2^-s。
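下面给出一段示意性的Python代码,按上文的定义计算给定位宽bitnum与小数部分位宽s的定点数据格式所能表示的精度与数值范围,数值仅作演示。

```python
def fixed_point_spec(bitnum, s):
    """返回 (精度, 可表示的最小值 neg, 可表示的最大值 pos)。
    精度 S = 2^-s, pos = (2^(bitnum-1) - 1) * 2^-s, neg = -(2^(bitnum-1)) * 2^-s。"""
    precision = 2.0 ** (-s)
    pos = (2 ** (bitnum - 1) - 1) * precision
    neg = -(2 ** (bitnum - 1)) * precision
    return precision, neg, pos

# 例如 16 位定点数、小数部分 5 位:
print(fixed_point_spec(16, 5))   # (0.03125, -1024.0, 1023.96875)
```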
在神经网络的训练过程通常包括正向运算和反向运算两个步骤,在反向运算时,输入神经元梯度、权重梯度和输出神经元梯度所需要的精度可能会出现变化,可能随着训练的过程增大,如果定点数的精度不够,会导致运算结果出现较大误差,甚至会导致训练失败。
在本申请的实施例中通过调整上述数据的位宽来达到调整该数据精度的目的。比如在定点数据格式的精度无法满足运算的需求时,可以通过将定点数据格式中的小数部分的位宽增大,即增大图1A中的s,从而增大上述定点数据格式的精度;但是由于定点数据格式 的位宽是固定的,当增大小数部分的位宽时,整数部分的位宽则会减小,故该定点数据格式能够表示的数据范围则会缩小,此时,可增大该定点数据格式的位宽,由于小数部分的位宽不变,因此增大该定点数据格式的位宽可以看作是增大该定点数据格式的整数部分的位宽,从而扩大定点数据格式能够表示数据的范围。
请参阅图1B,图1B是为本发明实施例提供的一种神经网络运算模块的结构示意图。该神经网络运算模块用于进行多层神经网络的运算。如图1B所示,该神经网络运算模块100包括:
存储单元101,用于存储输入神经元精度、权重精度和输出神经元梯度精度。
控制器单元102,用于从所述存储单元101获取所述多层神经网络第L层的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000146
其中,所述L为大于0的整数;根据所述输入神经元精度S x(l)、所述权重精度S w(l)和所述输出神经元梯度精度
Figure PCTCN2019085844-appb-000147
获取梯度更新精度T;当所述梯度更新精度T大于预设精度T r时,调整所述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000148
在一种可行的实施例中,上述存储单元101还用于存储输入神经元、权重和输出神经元以及输出神经元梯度,上述控制器单元102从上述存储单元101中获取第L层输入神经元、权重和输出神经元梯度,该控制器单元102根据上述第L层输入神经元、权重和输出神经元梯度获取上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000149
其中,用于表示输入神经元的定点数据格式的位宽和用于表示权重的定点数据格式的位宽为第一位宽,用于表示上述输出神经元梯度的定点数据格式的位宽为第二位宽。
可选地,上述第二位宽大于上述第一位宽。
进一步地,上述第二位宽为上述第一位宽的两倍,以便于电子计算机进行处理。
进一步地,上述第一位宽可选为8位,上述第二位宽可选为16位。
其中,上述预设精度T r可以由用户根据经验在上述控制器单元102中预先设置;也可以根据第二预设公式,通过改变输入参数的方式获得与输入参数匹配的预设精度T r;还可以通过机器学习的方法获取T r。
可选地,上述控制器单元102根据学习率、batchsize(批处理时的样本数量)设置上述预设精度T r
进一步地,如果该神经网络中存在参数共享层(如卷积层和循环神经网络层),则上述控制器单元102根据上一层输出神经元的个数以及batchsize、学习率来设置上述预设精度T r,即上一层的输出神经元的个数越高以及batchsize越大、学习率越高,上述预设精度T r越大。
具体地,上述控制器单元102获取上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000150
后,根据第一预设公式对上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000151
进行计算,以得到上述梯度更新精度T,其中,上述第一预设公式可以为:
Figure PCTCN2019085844-appb-000152
其中,上述控制器单元102调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000153
包括:
上述控制器单元102保持上述输入神经元精度S x(l)和权重精度S w(l)不变,减小上述输 出神经元梯度精度
Figure PCTCN2019085844-appb-000154
需要指出的是,由于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000155
上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000156
是指增加表示该输出神经元梯度的定点数据格式的小数部分位宽s1。
可选地,上述控制器单元102根据Tr-T的值按照第一预设步长N1增加上述表示输出神经元梯度的定点数据格式的小数部分位宽s1。
具体地,对于表示上述输出神经元梯度的定点数据格式的小数部分位宽s1,上述控制器单元102每次增加N1位,即小数部分的位宽为s1+N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000157
再根据上述预设公式
Figure PCTCN2019085844-appb-000158
判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;当确定该梯度更新精度T与上述预设精度Tr的差值的绝对值变小时,上述控制器单元102继续对表示上述输出神经元梯度的定点数据格式的小数部分位宽增加N1,即位宽为s1+2*N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000159
并继续判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;若变小,则继续按照上述方法进行处理;若在第n次处理时上述梯度更新精度T与上述预设精度Tr的差值的绝对值变大,上述控制器单元102则将第n-1次处理得到的位宽,即s1+(n-1)*N1作为表示上述输出神经元梯度的定点数据格式的小数部分的位宽,增加小数部分位宽后的输出神经元梯度精度为
Figure PCTCN2019085844-appb-000160
可选地,上述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
可选地,上述控制器单元102按照2倍递增的方式,增加表示上述输出神经元梯度的定点数据格式的小数部分位宽。
比如表示上述输出神经元梯度的定点数据格式的小数部分位宽为3,即输出神经元梯度精度为2^-3,则按照2倍递增的方式增加位宽后的表示上述输出神经元梯度的定点数据格式的小数部分位宽为6,即减小后的输出神经元梯度精度为2^-6。
在一种可行的实施例中,上述控制器单元102确定对表示上述输出神经元梯度的定点数据格式的小数部分位宽的增加幅度b后,上述控制器单元102分多次增加上述定点数据格式的小数部分位宽,比如上述控制器单元102分两次增加上述定点数据格式的小数部分位宽,第一次增加的幅度为b1,第二次增加的幅度为b2,且b=b1+b2。
其中,上述b1与b2可以相同或者不相同。
可选地,上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000161
时,增加表示该输出神经元梯度的定点数据格式的位宽。
进一步地,由于增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000162
是通过增加表示上述输出神经元梯度的定点数据格式的小数部分位宽来实现的,且由于表示上述输出神经元梯度的定点数据格式的位宽不变,若小数部分位宽增加,则整数部分位宽减少,该定点数据格式表示的数据范围会缩小,因此在控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000163
后,该控制器单元102增加上述定点数据格式的位宽,且该定点数据格式的位宽增加后,其整数部分的位宽保持不变,即定点数据格式位宽的增加值与小数部分位宽的增加值相同。
举例说明,上述定点数据格式的位宽为9,其中符号位的位宽为1,整数部分的位宽为5,小数部分的位宽为3,上述控制器单元102增加上述小数部分的位宽和定点数据格式的位宽后,小数部分的位宽为6,整数部分的位宽仍为5,即只增加上述小数部分的位宽(以及相应的总位宽),整数部分的位宽保持不变。
在一种可行的实施例中,上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000164
后,该控制器单元102还用于:
判断所述输出神经元梯度以所述输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示所述输出神经元梯度的定点数据格式的位宽。
具体地,由上述相关描述可知,上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000165
时,上述表示该输出神经元梯度的定点数据格式表示数据的范围会缩小,因此当上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000166
后,判断上述输出神经元梯度以上述定点数据格式表示时是否溢出;当确定溢出时,上述控制器单元102增加上述定点数据格式的位宽,从而扩大上述定点数据格式表示数据的范围,使得上述输出神经元梯度以上述定点数据格式表示时不会溢出。
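以下Python片段示意"小数部分位宽增加导致可表示范围缩小后,先判断以该定点数据格式表示输出神经元梯度是否溢出,溢出时再按第二预设步长增加位宽(计入整数部分)"的处理流程;函数划分与步长取值均为示意性假设。

```python
def widen_if_overflow(grad_max_abs, bitnum, s, step_n2=8):
    """若 |梯度| 超出当前定点格式可表示的最大值 pos,则按第二预设步长增加位宽(示意)。"""
    pos = (2 ** (bitnum - 1) - 1) * 2.0 ** (-s)
    while grad_max_abs > pos:                  # 以该定点格式表示时会溢出
        bitnum += step_n2                      # 增加的位宽计入整数部分
        pos = (2 ** (bitnum - 1) - 1) * 2.0 ** (-s)
    return bitnum

# 例如:小数位宽从 3 增到 6 后,若最大梯度绝对值超过可表示范围,则扩展总位宽
new_bitnum = widen_if_overflow(grad_max_abs=600.0, bitnum=9, s=6, step_n2=8)
```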
需要指出的是,上述控制器单元102增加上述定点数据格式的位宽具体是增加该定点数据格式的整数部分的位宽。
进一步地,上述控制器单元102增加所述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照第二预设步长N2增加所述表示所述输出神经元梯度的定点数据格式的位宽,其中,第二预设步长N2可为1、2、3、4、5、7、8或者其他正整数。
具体地,当确定增加上述定点数据格式的位宽时,上述控制器单元102每次增加该定点数据格式的位宽时的增加值为上述第二预设步长N2。
在一种可行的实施例中,上述控制器单元102增加上述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照2倍递增的方式增加上述表示上述输出神经元梯度的定点数据格式的位宽。
举例说明,上述定点数据格式除去符号位的位宽为8,则按照2倍递增的方式增加该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为16;再次按照2倍递增的方式增加该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为32。
在一种可行的实施例中,上述控制器单元102调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000167
包括
上述控制器单元102减小上述输入神经元精度S x(l)和/或上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000168
保持上述权重精度S w(l)不变,或者;
上述控制器单元102减小上述输入神经元精度S x(l),增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000169
保持上述权重精度S w(l)不变,且上述输入神经元精度S x(l)减小的幅度大于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000170
的增大幅度,或者;
上述控制器单元102增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000171
减小上述输入神经元精度S x(l),保持上述权重精度S w(l)不变,且上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000172
增大的幅度小于上述 输入神经元精度S x(l)的减小幅度,或者;
上述控制器单元102增大或减小上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000173
中的一个或者任意组合,以使上述梯度更新精度T与上述预设精度T r的差值的绝对值最小。
在此需要说明的是,上述控制器单元102对上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000174
中的任意一个的进行增大操作的具体过程可参见上述控制器单元102增大上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度 的相关操作,在此不再叙述。
按照上述方法调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000176
后,上述运算单元103在运算过程中,按照调整后的输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000177
以定点数据格式形式表示上述第L层的输入神经元、权重和输出神经元梯度,然后进行后续的运算。
需要说明的是,上述控制器单元102计算上述梯度更新精度T的频率可以根据需求灵活设置。
其中,上述控制器单元102可根据神经网络训练过程中的训练迭代次数调整计算上述梯度更新精度T的频率。
可选地,上述控制器单元102在神经网络训练过程中,每迭代一轮控制器单元就重新计算上述梯度更新精度T;或者每迭代预设次数就重新计算上述梯度更新精度T;或者根据上述梯度更新精度T的变化进行设置上述频率。
可选地,上述控制器单元102根据神经网络训练中的训练迭代次数来设置计算上述梯度更新精度T的频率。
运算单元103,用于根据增大或者减小后的输入神经元精度S x(l)和权重精度S w(l)来表示第L层的输入神经元和权重;根据增大或者减小后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000178
来表示运算得到的第L层输出神经元梯度。
换句话说,上述运算单元103,用于增大或者减小输入神经元精度S x(l)的定点数据格式来表示上述第L层输入神经元,用增大或者减小权重精度S w(l)的定点数据格式来表示上述第L层的权重,用增大或者减小输出神经元梯度精度
Figure PCTCN2019085844-appb-000179
的定点数据格式来表示上述第L层的输出神经元梯度,以进行后续的运算。
通过在神经网络运算过程中,动态调整(包括增大或者减小)上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000180
以在满足运算需求的同时,减少运算结果的误差和运算开销,节省运算资源。
在另一个可选的实施例中,上述控制器单元102获取上述多层神经网络的第L层输出神经元梯度。
在一种可行的实施例中,上述控制器单元102获取第L层的输出神经元和第L-1层的输出神经元,然后根据上述第L层的输出神经元和第L-1层的输出神经元获取上述第L层输出神经元梯度。
上述控制器单元102获取输出神经元梯度中绝对值小于第一预设阈值的输出神经元梯度的比例数据a。
可选地,上述第一预设阈值可为0、0.01、0.05、0.1、0.12或者其他值。
具体地,上述控制器单元102获取上述第L层输出神经元梯度后,获取该第L层输出神经元梯度中绝对值小于上述第一预设阈值的梯度值的个数n1,然后根据该个数n1和上述第L层输出神经元梯度的个数n2获取上述比例数据a,即a=n1/n2。
可选地,上述比例数据可为50%、60%、65%、70%、80%、85%、90%或者其他值。
可选地,上述比例数据为80%。
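下面的Python片段示意比例数据a=n1/n2的计算方式,以及据此与第二预设阈值比较的判断过程;阈值取值仅作演示。

```python
def small_gradient_ratio(gradients, threshold=0.01):
    """a = n1 / n2:绝对值小于第一预设阈值的梯度个数占全部梯度个数的比例。"""
    n2 = len(gradients)
    n1 = sum(1 for g in gradients if abs(g) < threshold)
    return n1 / n2 if n2 else 0.0

grads = [0.001, -0.002, 0.5, 0.0001, -0.3]
a = small_gradient_ratio(grads, threshold=0.01)
need_lower_precision = a > 0.8   # 第二预设阈值取 80% 时的判断(示意)
```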
当比例数据a大于第二预设阈值时,控制器单元102减小上述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000181
在一种可行的实施例中,上述控制器单元102减小上述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000182
时,增加表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述控制器单元102减小上述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000183
后,上述控制器单元102还用于:
判断上述第L层输出神经元梯度以上述第L层输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述控制器单元102增加表示上述第L层输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照第三预设步长N3增加所述表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述控制器单元102增加表示上述第L层输出神经元梯度的定点数据格式的位宽,包括:
上述控制器单元102按照2倍递增的方式增加所述表示上述第L层输出神经元梯度的定点数据格式的位宽。
在此需要说明的是,上述控制器单元102减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000184
的具体过程可见上述相关描述,在此不再叙述。
按照上述方法调整上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000185
后,上述运算单元103在运算过程中,按照调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000186
以定点数形式表示上述第L层的输出神经元梯度,然后进行后续的运算。
通过在神经网络运算过程中根据输出神经元梯度来调整输出神经元梯度精度的大小,从而减小输出神经元的误差,进而保证训练正常进行。
参见图1C,图1C为本发明实施例提供的一种神经网络运算方法的流程示意图,如图1C所示,该方法包括:
S201、神经网络运算模块获取神经网络的第L层输入神经元精度、权重精度和输出神经元梯度精度。
其中,上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000187
的取值可以相同,也可以是部分相同或者两两互不相等。
其中,上述神经网络为多层神经网络,上述第L层输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000188
分别为上述多层神经网络的任一层的输入神经元精度、权重精度和输出神经元梯度精度。
在一种可行的实施例中,上述神经网络运算模块获取上述第L层的输入神经元、权重和输出神经元;根据上述第L层的输入神经元、权重和输出神经元,获取上述第L层输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000189
S202、神经网络运算模块根据第L层输入神经元精度、权重精度和输出神经元梯度精度,计算得到梯度更新精度T。
具体地,上述神经网络运算模块根据第一预设公式对上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000190
进行计算,以得到上述梯度更新精度T。
其中,上述第一预设公式为
Figure PCTCN2019085844-appb-000191
S203、当梯度更新精度T大于预设精度T r时,神经网络运算模块调整第L层输入神经元精度、权重精度和输出神经元梯度精度,以使梯度更新精度T与预设精度T r的差值的绝对值最小。
其中,用于表示输入神经元的定点数据格式和用于表示权重的定点数据格式的位宽为第一位宽,用于表示输出神经元梯度的定点数据格式的位宽为第二位宽。
可选地,上述第二位宽大于上述第一位宽。
进一步地,上述第二位宽为上述第一位宽的两倍,以便于电子计算机进行处理。
进一步地,上述第一位宽可选为8位,上述第二位宽可选为16位。
其中,上述预设精度T r可以根据经验进行预先设置;也可以根据第二预设公式,通过改变输入参数的方式获得与输入参数匹配的T r;还可以通过机器学习的方法获取T r。
可选地,上述神经网络运算模块根据学习率、batchsize(批处理时的样本数量)设置上述预设精度T r
进一步地,如果该神经网络中存在参数共享层(如卷积层和循环神经网络层),则根据上一层输出神经元的个数以及batchsize、学习率来设置上述预设精度T r,即上一层的输出神经元的个数越多、batchsize越大、学习率越高,预设精度T r越大。
其中,上述神经网络运算模块调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000192
包括:
保持上述输入神经元精度S x(l)和权重精度S w(l)不变,减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000193
需要指出的是,上述神经网络运算模块减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000194
是指增加表示该输出神经元梯度的定点数据格式的小数部分位宽s1。
可选地,上述神经网络运算模块控制器单元根据Tr-T的值按照第一预设步长N1增大上述表示输出神经元梯度的定点数据格式的小数部分位宽s1。
具体地,对于表示上述输出神经元梯度的定点数据格式的小数部分位宽s1,上述神经 网络运算模块每次增加N1,即小数部分的位宽为s1+N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000195
再根据上述预设公式
Figure PCTCN2019085844-appb-000196
判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;当确定该梯度更新精度T与上述预设精度Tr的差值的绝对值变小时,上述神经网络运算模块继续对表示上述输出神经元梯度的定点数据格式的小数部分位宽增加N1,即位宽为s1+2*N1,并得到输出神经元梯度精度
Figure PCTCN2019085844-appb-000197
并继续判断上述梯度更新精度T与上述预设精度Tr的差值的绝对值是否变小;若变小,则继续按照上述方法进行处理;若在第n次处理时上述梯度更新精度T与上述预设精度Tr的差值的绝对值变大,上述神经网络运算模块则将第n-1次处理得到的位宽,即s1+(n-1)*N1作为表示上述输出神经元梯度的定点数据格式的小数部分的位宽,增加小数部分位宽后的输出神经元梯度精度为
Figure PCTCN2019085844-appb-000198
可选地,上述第一预设步长N1为1、2、4、6、7、8或者其他正整数。
可选地,上述神经网络运算模块按照2倍递增的方式,增加表示上述输出神经元梯度的定点数据格式的小数部分位宽。
比如表示上述输出神经元梯度的定点数据格式的小数部分位宽为3,即输出神经元梯度精度为2^-3,则按照2倍递增的方式增加后的表示上述输出神经元梯度的定点数据格式的小数部分位宽为6,即减小后的输出神经元梯度精度为2^-6。
在一种可行的实施例中,上述神经网络运算模块确定对表示上述输出神经元梯度的定点数据格式的小数部分位宽的增加幅度b后,上述神经网络运算模块分多次增大上述定点数据格式的小数部分位宽,比如上述神经网络运算模块分两次增大上述定点数据格式的小数部分位宽,第一次增加的幅度为b1,第二次增加的幅度为b2,且b=b1+b2。
其中,上述b1与b2可以相同或者不相同。
可选地,上述神经网络运算模块减小上述输出神经元梯度精度时,增加表示该输出神经元梯度的定点数据格式的位宽。
进一步地,由于减小上述输出神经元梯度精度是通过增加表示上述输出神经元梯度的定点数据格式的小数部分位宽来实现的,且由于表示上述输出神经元梯度的定点数据格式的位宽不变,若小数部分位宽增加,则整数部分位宽减少,该定点数据格式表示的数据范围会缩小,因此在神经网络运算模块减小上述输出神经元梯度精度后,该神经网络运算模块增加上述定点数据格式的位宽,且该定点数据格式的位宽增加后,其整数部分的位宽保持不变,即定点数据格式位宽的增加值与小数部分位宽的增加值相同。
举例说明,上述定点数据格式的位宽为9,其中符号位的位宽为1,整数部分的位宽为5,小数部分的位宽为3,上述神经网络运算模块增加上述小数部分的位宽和定点数据格式的位宽后,小数部分的位宽为6,整数部分的位宽仍为5,即只增加上述小数部分的位宽(以及相应的总位宽),整数部分的位宽保持不变。
在一种可行的实施例中,上述神经网络运算模块减小上述输出神经元梯度精度后,该神经网络运算模块还用于:
判断所述输出神经元梯度以所述输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示所述输出神经元梯度的定点数据格式的位宽。
具体地,由上述相关描述可知,上述神经网络运算模块减小上述输出神经元梯度的精度时,上述表示该输出神经元梯度的定点数据格式表示数据的范围会缩小,因此当上述神经网络运算模块减小上述输出神经元梯度的精度后,判断上述输出神经元梯度以上述定点数据格式表示时是否溢出;当确定溢出时,上述神经网络运算模块增加上述定点数据格式的位宽,从而扩大上述定点数据格式表示数据的范围,使得上述输出神经元梯度以上述定点数据格式表示时不会溢出。
需要指出的是,上述神经网络运算模块增加上述定点数据格式的位宽具体是增加该定点数据格式的整数部分的位宽。
进一步地,上述神经网络运算模块增加所述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述神经网络运算模块按照第二预设步长N2增加所述表示所述输出神经元梯度的定点数据格式的位宽,其中,第二预设步长N2可为1、2、3、4、5、7、8或者其他正整数。
具体地,当确定增加上述定点数据格式的位宽时,上述神经网络运算模块每次增加该定点数据格式的位宽时的增加值为上述第二预设步长N2。
在一种可行的实施例中,上述神经网络运算模块增加上述表示上述输出神经元梯度的定点数据格式的位宽,包括:
上述神经网络运算模块按照2倍递增的方式增加上述表示上述输出神经元梯度的定点数据格式的位宽。
举例说明,上述定点数据格式除去符号位的位宽为8,则按照2倍递增的方式增加该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为16;再次按照2倍递增的方式增大该定点数据格式的位宽后,该定点数据格式除去符号位的位宽为32。
在一种可行的实施例中,上述神经网络运算模块调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000199
包括:
减小上述输入神经元精度S x(l)和/或上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000200
保持上述权重精度S w(l)不变,或者;
减小上述输入神经元精度S x(l),增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000201
保持上述权重精度S w(l)不变,且上述输入神经元精度S x(l)减小的幅度大于上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000202
的增大幅度,或者;
增大上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000203
减小上述输入神经元精度S x(l),保持上述权重精度S w(l)不变,且上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000204
增大的幅度小于上述输入神经元精度S x(l)的减小幅度,或者;
增大或减小上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000205
中的一个或者任意组合,以使上述梯度更新精度T与上述预设精度T r的差值的绝对值最小。
在此需要说明的是,上述神经网络运算模块对上述权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000206
中的任意一个进行增大操作的具体过程,可参见上述神经网络运算模块增大权重精度S w(l)、上述输入神经元精度S x(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000207
的相关操作,在此不再叙述。
S204、神经网络运算模块根据调整后的输入神经元精度和权重精度来表示第L层的输入神经元和权重;根据调整后的输出神经元梯度精度来表示运算得到的第L层输出神经元梯度,以进行后续运算。
换句话说,上述运算单元,用于增大或者减小输入神经元精度S x(l)的定点数据格式来表示上述第L层输入神经元,用增大或者减小权重精度S w(l)的定点数据格式来表示上述第L层的权重,用增大或者减小输出神经元梯度精度
Figure PCTCN2019085844-appb-000208
的定点数据格式来表示上述第L层的输出神经元梯度,以进行后续的运算。
按照上述方法调整上述输入神经元精度S x(l)、权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000209
后,上述神经网络运算模块重新计算上述梯度更新精度T;当该梯度更新精度不再大于上述预设精度T r时,上述神经网络运算模块参照上述步骤S203的方式减小上述输入神经元精度S x(l),权重精度S w(l)和输出神经元梯度精度
Figure PCTCN2019085844-appb-000210
需要说明的是,上述神经网络运算模块计算上述梯度更新精度T的频率可以根据需求灵活设置。
其中,上述神经网络运算模块可根据神经网络训练过程中的训练迭代次数调整计算上述梯度更新精度T的频率。
可选地,上述神经网络运算模块在神经网络训练过程中,每迭代一轮就重新计算上述梯度更新精度T;或者每迭代预设次数就重新计算上述梯度更新精度T;或者根据上述梯度更新精度T的变化进行设置上述频率。
可选地,上述神经网络运算模块根据神经网络训练中的训练迭代次数来设置计算上述梯度更新精度T的频率。
可以看出,在本发明实施例的方案中在神经网络运算过程中,动态调整上述输入神经元精度S x、权重精度S w和输出神经元梯度精度
Figure PCTCN2019085844-appb-000211
以在满足运算需求的同时,减少运算结果的误差和运算开销,节省运算资源。
参见图1D,图1D为本发明实施例提供的一种神经网络运算方法的流程示意图。如图1D所示,该方法包括:
S301、神经网络运算模块获取第L层输出神经元梯度。
在一种可行的实施例中,上述神经网络运算模块获取第L层的输出神经元和第L-1层的输出神经元,然后根据上述第L层的输出神经元和第L-1层的输出神经元获取上述第L层输出神经元梯度。
S302、神经网络运算模块获取第L层输出神经元梯度中绝对值小于第一预设阈值的比例数据a。
可选地,上述第一预设阈值可为0、0.01、0.05、0.1、0.12或者其他值。
具体地,上述神经网络运算模块获取上述第L层输出神经元梯度后,获取该第L层输出神经元梯度中绝对值小于上述第一预设阈值的梯度值的个数n1,然后根据该个数n1和上述第L层输出神经元梯度的个数n2获取上述比例数据a,即a=n1/n2。
可选地,上述比例数据可为50%、60%、65%、70%、80%、85%、90%或者其他值。
可选地,上述比例数据为80%。
S303、当比例数据a大于第二预设阈值时,神经网络运算模块减小上述第L层输出神经元梯度的精度。
在一种可行的实施例中,上述神经网络运算模块减小上述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000212
时,增加表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述神经网络运算模块减小上述第L层输出神经元梯度精度
Figure PCTCN2019085844-appb-000213
后,上述神经网络运算模块还用于:
判断上述第L层输出神经元梯度以上述第L层输出神经元梯度的定点数据格式表示时是否溢出;
当确定溢出时,增加表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述神经网络运算模块增加表示上述第L层输出神经元梯度的定点数据格式的位宽,包括:
按照第三预设步长N3增加所述表示上述第L层输出神经元梯度的定点数据格式的位宽。
在一种可行的实施例中,上述神经网络运算模块增加表示上述第L层输出神经元梯度的定点数据格式的位宽,包括:
按照2倍递增的方式增加所述表示上述第L层输出神经元梯度的定点数据格式的位宽。
在此需要说明的是,上述神经网络运算模块减小上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000214
的具体过程可见上述相关描述,在此不再叙述。
按照上述方法调整上述输出神经元梯度精度
Figure PCTCN2019085844-appb-000215
后,上述神经网络运算模块在运算过程中,按照调整后的输出神经元梯度精度
Figure PCTCN2019085844-appb-000216
以定点数据格式表示上述第L层的输出神经元梯度,然后进行后续的运算。
可以看出,在本发明实施例的方案中在神经网络运算过程中,根据输出神经元梯度来调整其精度的大小,从而减小输出神经元的误差,进而保证训练正常进行。
神经网络又称人工神经网络,人工神经网络被广泛应用于模式识别,图像处理,函数逼近和优化计算等领域,多层人工神经网络在近年来由于其较高的识别准确度和较好的可并行性,受到学术界和工业界越来越广泛的关注。人工神经网络涉及到多种算法,其中全连接层作为人工神经网络中的一种重要算法,被广泛的应用在各种人工神经网络模型中。
现有的神经网络运算基于通用处理器进行,而现有的通用处理器仅支持浮点数据的运算;神经网络运算往往涉及较复杂的运算,运算量大、对内存的要求高,基于浮点数据进行运算更是加大了内存开销,因此现有方案能耗高、成本高。
电子设备可以包括各种具有无线通信功能的手持设备、车载设备、无线耳机、计算设备或连接到无线调制解调器的其他处理设备,以及各种形式的用户设备(user equipment,UE),移动台(mobile station,MS),终端设备(terminal device)等等,电子设备例如可以为智能手机、平板电脑、耳机盒等等。为方便描述,上面提到的设备统称为电子设备或电子装置。
上述电子设备或电子装置可以应用于以下(包括但不限于)场景中:数据处理、机器人、电脑、打印机、扫描仪、电话、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备等各类电子产品;飞机、轮船、车辆等各类交通工具;电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机等各类家用电器;以及包括核磁共振仪、B超、心电图仪等各类医疗设备。
下面对本申请实施例进行详细介绍。
首先介绍本申请使用的计算装置。参阅图4,提供了一种神经网络计算装置,该计算装置用于执行神经网络训练计算,上述神经网络训练计算包括神经网络多层训练运算,上述多层训练运算中包括第i层,所述第i层的正向运算或反向运算中至少有部分数据用于定点数据运算,上述i为大于等于1的整数;该计算装置包括:控制器单元11、运算单元12和转换单元13,其中,控制器单元11与运算单元12以及转换单元13(上述转换单元可以单独设置,也可以集成在控制器单元或运算单元内)连接;上述第i层训练运算中包括第i层正向运算和第i层反向运算;
该第i层正向运算可以包括:
控制器单元11,用于获取第i层的输入神经元数据、第i层权值数据以及第i层正向计算指令;在一种可选方案中,具体的,控制器单元获取输入神经元数据以及计算指令方式可以通过数据输入输出单元得到,该数据输入输出单元具体可以为一个或多个数据I/O接口或I/O引脚;数据输入输出单元,用于从外部设备或外部存储器读取输入神经元数据或正向计算指令。
上述正向计算指令包括但不限于:卷积运算指令、矩阵乘法指令、向量乘法指令、激活指令等等,本申请具体实施方式并不限制上述正向计算指令的具体表现形式或具体的类别。
控制器单元11,还用于解析该第i层计算指令得到多个正向运算指令,将第i层输入神经元数据以及第i层权值数据发送给转换单元13,将该多个正向运算指令发送给运算单元12;
转换单元13,用于将该第i层输入神经元数据以及第i层权值数据中的全部或部分执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,该混合数据包括:部分定点数据以及部分浮点数据;
运算单元12,用于依据多个正向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的正向输出结果。
该第i层反向运算可以包括:
控制器单元11,用于获取第i层的输入神经元数据、第i层权值数据、第i层输入神经元梯度以及第i层反向计算指令;在一种可选方案中,具体的,控制器单元获取输入神经元数据以及计算指令方式可以通过数据输入输出单元得到,该数据输入输出单元具体可以为一个或多个数据I/O接口或I/O引脚;数据输入输出单元,用于从外部设备或外部存储器读取输入神经元数据或反向计算指令。
上述反向计算指令包括但不限于:矩阵乘法指令或向量乘法指令等等,本申请具体实施方式并不限制上述反向计算指令的具体表现形式或具体的类别。
控制器单元11,还用于解析该第i层计算指令得到多个反向运算指令,将第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度发送给转换单元13,将该多个反向运算指令发送给运算单元12;
转换单元13,用于将该第i层输入神经元数据、第i层权值数据以及第i层输入神经元梯度中的全部或部分执行浮点类型与定点类型转换得到全部定点数据或混合数据,将全部定点数据或混合数据发送给运算单元,该混合数据包括:部分定点数据以及部分浮点数据;
运算单元12,用于依据多个反向运算指令对全部定点数据执行定点运算或对混合数据执行混合运算得到第i层的权值梯度以及第i层输出结果梯度;运算单元采用第i层的权值梯度对第i层权值进行更新。
该混合运算包括:对部分定点数据执行定点运算以及对部分浮点数据执行浮点运算。
本申请提供的技术方案设置了转换单元,该转换单元在执行神经网络的第i层训练运算时,可以将输入神经元数据、权值数据、输入数据神经元梯度中的全部或部分转换成定点数据或混合数据,相对于浮点数据,定点数据的存储空间较少,这样通过较小的内存空间即能够实现神经网络的训练,因此本申请提供的计算装置可以降低内存的容量,降低成本,另外,本申请提供的技术方案的i层训练运算中存在至少部分定点数据的运算,相对于浮点数据的运算,具有运算量降低,运算快的优点。
神经网络训练中的训练计算可以为神经网络中的一层的训练运算,即第i层的训练运算,对于其他层的训练运算可以采用常规的训练运算方法,也可以采用本申请中类似第i层的训练计算方法。对于多层神经网络,训练计算方法实现过程是,在正向运算中,当上一层的人工神经网络正向运算执行完成之后,计算装置会将运算单元中计算出的输出神经元(即正向输出结果)作为下一层的输入神经元进行运算(或者是对该输出神经元进行某些操作再作为下一层的输入神经元),上述某些操作包括但不限于:激活操作等操作,同时,计算装置将上一层的权值也替换为下一层的权值。在反向运算中,当下一层的人工神经网络的反向运算执行完成后,计算装置会将运算单元中计算出的输出神经元梯度(即输出结果梯度)作为上一层的输入神经元梯度进行运算(或者是对该输出神经元梯度进行某些操作再作为上一层的输入神经元梯度),同时计算装置将权值以及输入神经元数据替换为上一层的正向运算的权值以及输入神经元数据。
对于人工神经网络运算,如果该人工神经网络运算具有多层运算,多层运算的输入神经元和输出神经元并非是指整个神经网络的输入层中神经元和输出层中神经元,而是指对于网络中任意相邻的两层,处于网络正向运算下层中的神经元即为输入神经元,处于网络正向运算上层中的神经元即为输出神经元。以卷积神经网络为例,设一个卷积神经网络有L层,K=1,2,...,L-1,对于第K层和第K+1层来说,我们将第K层称为输入层,第K层中的神经元为所述输入神经元,第K+1层称为输出层,第K+1层中的神经元为所述输出神经元。即除最顶层外,每一层都可以作为输入层,其下一层为对应的输出层。
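下面用一段示意性的Python伪代码概括上述逐层训练流程,其中layer.forward、layer.backward、layer.update等接口均为假设的占位实现,仅用于说明数据在相邻层之间的传递关系,并非本申请限定的实现。

```python
def train_step(layers, x, loss_grad):
    # 正向:逐层把上一层的输出神经元(可附加激活等操作)作为下一层的输入神经元
    activations = [x]
    for layer in layers:
        x = layer.forward(x)
        activations.append(x)
    # 反向:逐层把下一层算出的输出结果梯度作为上一层的输入神经元梯度,并用权值梯度更新权值
    grad = loss_grad
    for layer, inp in zip(reversed(layers), reversed(activations[:-1])):
        grad, w_grad = layer.backward(inp, grad)
        layer.update(w_grad)
    return activations[-1]
```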
可选的,转换单元13,具体用于将第i层输入神经元数据的部分转换成部分定点输入神经元数据以及将第i层权值数据的部分转换成部分定点权值数据;将部分定点输入神经元数据以及部分定点权值数据发送给运算单元,将部分输入神经元数据(未执行浮点与定点转换的剩余浮点数据)和部分权值数据(未执行浮点与定点转换的剩余浮点数据)发送给运算单元;
运算单元,具体用于将部分定点输入神经元数据以及部分定点权值数据执行定点数据运算得到部分定点正向输出结果,将部分定点正向输出结果发送给转换单元,
转换单元,具体用于将该部分定点正向输出结果执行定点与浮点转换得到第一部分浮点正向输出结果,将第一部分浮点正向输出结果发送给运算单元;
运算单元,具体用于将部分输入神经元数据和部分权值数据执行运算(浮点运算)得到第二部分浮点正向运算结果,将第一部分浮点正向运算结果和第二部分浮点正向运算结果结合起来得到第i层正向输出结果。
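以下Python片段示意上述正向混合运算的思路:一部分输入神经元数据与权值数据先转换为定点数做乘加,所得部分定点结果转回浮点,与剩余浮点部分的乘加结果合并得到第i层正向输出结果;其中point取值与数据划分方式均为示意性假设(浮点与定点的换算方式见下文)。

```python
def mixed_forward(x_fixed, w_fixed, x_float, w_float, point=-4):
    """示意:部分数据做定点乘加,部分数据保留浮点乘加,最后合并两部分结果。"""
    q = lambda v: round(v / 2 ** point)            # 浮点 -> 定点(int = Round(float / 2^point))
    fixed_acc = sum(q(x) * q(w) for x, w in zip(x_fixed, w_fixed))
    part_fixed = fixed_acc * 2.0 ** (2 * point)    # 乘积的小数点位置为 2*point,转回浮点
    part_float = sum(x * w for x, w in zip(x_float, w_float))
    return part_fixed + part_float

y_i = mixed_forward([1.3, -0.7], [0.5, 0.25], [0.01], [2.0])
```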
可选的,转换单元13,具体用于将第i层输入神经元数据的部分转换成部分定点输入神经元数据、将第i层权值数据的部分转换成部分定点权值数据以及将第i层输入神经元梯度转换成部分定点输入神经元梯度;将部分定点输入神经元数据、部分定点输入神经元梯度以及部分定点权值数据发送给运算单元,将部分输入神经元数据(未执行浮点与定点转换的剩余浮点数据)、部分输入神经元梯度和部分权值数据(未执行浮点与定点转换的剩余浮点数据)发送给运算单元;
运算单元,具体用于将部分定点输入神经元梯度以及部分定点输入数据执行定点数据运算得到部分第i层权值梯度,将部分定点输入神经元梯度与部分定点权值数据执行定点数据运算得到部分第i层输出结果梯度,将部分第i层权值梯度以及部分第i层输出结果梯度发送给转换单元;
转换单元,具体用于将该部分第i层权值梯度以及部分第i层输出结果梯度执行定点与浮点转换得到第一部分第i层权值梯度以及第一部分第i层输出结果梯度,将第一部分第i层权值梯度以及第一部分第i层输出结果梯度发送给运算单元;
运算单元,具体用于将部分输入神经元梯度以及部分输入数据执行运算(浮点)得到第二部分第i层权值梯度,将部分输入神经元梯度与部分权值数据执行运算得到第二部分第i层输出结果梯度,将第一部分第i层权值梯度和第二部分第i层权值梯度结合起来得到第i层权值梯度,将第一部分第i层输出结果梯度和第二部分第i层输出结果梯度结合起来得到第i层输出结果梯度。
可选的,转换单元13,具体用于确定浮点数的point;
Figure PCTCN2019085844-appb-000217
其中width为定点数的位宽值。
其中,maxabs为需要转换的浮点数据中的最大绝对值,即第i层输入神经元数据以及第i层权值数据的元素中的绝对值最大值。point取为使定点数能够表示的最大值大于maxabs的最小的point(小数点位置)值。
对于已知的point和width,浮点数与定点数之间可按如下方式换算:
Figure PCTCN2019085844-appb-000218
Round表示四舍五入。
其中,float=int*2^point。
int为定点数值,float为浮点数值,point为定点数的小数点位置。
例如,width=8,maxabs(一组数的绝对值的最大值)=2.9,则可以计算出这组数的point=-4。当point=-4时,对于float=1.3,可以推算出int=21。
可选的,上述获取第i层输入神经元梯度的方法具体可以包括:
第i层输入神经元梯度=f′*第i+1层输出结果梯度;
其中f′为激活函数f的导函数。
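下面给出一段示意性的Python代码,演示上文的浮点/定点换算:int=Round(float/2^point),float=int*2^point;其中choose_point仅为一种能够复现文中示例(width=8、maxabs=2.9时point=-4,float=1.3时int=21)的假设性取法,并非本申请限定的point确定公式。

```python
import math

def choose_point(maxabs, width):
    """示意:选择小数点位置 point,使 maxabs 能落入可表示范围并保留少量余量(假设性取法)。"""
    return math.ceil(math.log2(maxabs)) - (width - 2)

def float_to_int(f, point):
    return round(f / 2 ** point)          # int = Round(float / 2^point)

def int_to_float(i, point):
    return i * 2 ** point                 # float = int * 2^point

point = choose_point(2.9, 8)              # -4
print(point, float_to_int(1.3, point))    # -4 21
```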
可选的,参阅图3B,上述运算单元可以包括:一个主处理电路3101和多个从处理电路3102,其中,
主处理电路3101,用于对数据(包括输入神经元数据、权值数据、输入神经元梯度中的一种或任意组合,另外,该数据可以为定点数据或浮点数据)执行前序处理以及向所述多个从处理电路传输数据以及运算指令;
多个从处理电路3102,用于依据从所述主处理电路传输的数据(可以为定点数据也可以为浮点数据)以及运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给所述主处理电路;
主处理电路3101,用于依据多个中间结果得到第i层正向输出结果、第i层输出结果梯度、第i层权值梯度,并依据第i层权值梯度对第i层权值进行更新。
可选的,上述激活函数f是非线性函数sigmoid,tanh,relu,softmax中的任一个或线性函数;
所述运算指令包括:CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令或MOVE指令。
可选的,主处理电路包括第一存储单元(即神经元缓存单元)、第一运算单元和第一数据依赖关系判定单元,其中:
神经元缓存单元,用于缓存主处理电路在计算过程中用到的输入数据和输出数据;
第一运算单元,完成主处理电路的各种运算功能;
第一数据依赖关系判定单元,用于从第一存储单元读取输入的神经元向量,并通过互连模块发送给从处理电路;以及接收互连模块的中间结果向量,将中间结果向量发送到第一运算单元。
可选的,第一运算单元包括:向量加法单元和激活运算单元;
所述向量加法单元,用于将偏置数据与所述中间结果对位相加得到偏置结果;
所述激活运算单元,用于将所述偏置结果执行激活函数操作。
可选的,每个从处理电路包括第二运算单元、第二数据依赖关系判定单元、第二存储单元和第三存储单元,其中:
第二运算单元,用于执行算数逻辑运算;
第二数据依赖关系判定单元,用于对第二存储单元和第三存储单元执行读写操作;
第二存储单元,用于缓存输入神经元向量的数据以及该从处理电路计算得到的输出神经元值;
第三存储单元,用于缓存该从处理电路在计算过程中需要的权值向量。
可选的,所述第二运算单元包括:向量乘单元和累加单元;
所述向量乘单元,用于执行点积运算中的向量乘运算;
所述累加单元,用于执行点积运算中的累加运算。
上述权值更新的过程可以包括:
主处理电路3101,具体用于将该第i层输入神经元数据分别发送给各个从处理电路,将第i层输入神经元梯度传送到各个从处理电路3102,每个从处理电路3102将第i层输入神经元梯度in_gradient中与该从处理电路相对应的标量数据以及第i层输入神经元数据相乘,得到每个从处理电路的第i层的原始权值更新梯度向量dw_original,在算出所有层的原始权值更新梯度向量之后,为了限制权值的梯度范围,主处理电路可以对原始权值更新梯度进行限制处理,具体的,主处理电路,具体用于计算所有层的原始权值更新梯度的平方和sumsq_diff,然后对sumsq_diff进行开方得到l2norm_diff,如果l2norm_diff大于clip_gradient(一个设定的正常数),主处理电路计算缩放因子scale_factor=clip_gradient/l2norm_diff,将所有的原始权值更新梯度dw_original分别乘以缩放因子scale_factor,得到权值更新梯度dw’,主处理电路将更新梯度dw’发送给每个从处理电路;从处理电路,具体用于使用权值更新梯度dw’乘以权值得到第i层各个从处理电路的更新权值。
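以下Python片段示意上述对原始权值更新梯度的范围限制过程:先计算所有层原始权值更新梯度的平方和sumsq_diff并开方得到l2norm_diff,若其大于设定常数clip_gradient,则以scale_factor=clip_gradient/l2norm_diff对各原始权值更新梯度进行缩放得到dw';代码为示意性实现,权值更新本身仍按上文所述方式进行。

```python
import math

def clip_weight_gradients(dw_original_all, clip_gradient=1.0):
    """对所有层的原始权值更新梯度 dw_original 做范围限制,返回缩放后的 dw'(示意)。"""
    sumsq_diff = sum(g * g for grads in dw_original_all for g in grads)
    l2norm_diff = math.sqrt(sumsq_diff)
    if l2norm_diff > clip_gradient:                      # 超出设定常数时按比例缩放
        scale_factor = clip_gradient / l2norm_diff
        return [[g * scale_factor for g in grads] for grads in dw_original_all]
    return dw_original_all

dw_prime = clip_weight_gradients([[0.9, -1.2], [2.0, 0.5]], clip_gradient=1.0)
```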
本申请提供的技术方案将运算单元设置成一主多从结构,对于正向运算的计算指令,该结构可以将依据正向运算的计算指令将数据进行拆分,这样通过多个从处理电路即能够对计算量较大的部分进行并行运算,从而提高运算速度,节省运算时间,进而降低功耗,对于反向运算,也可以将数据进行拆分,类似正向运算,也可以提高运算速度。
可选的,上述主处理电路以及从处理电路均可以包括:存储模块,该存储模块,用于存储主处理电路或从处理电路的数据。需要说明的是,主处理电路与从处理电路可以共享上述存储模块,即在主处理电路的存储模块中划分一个或多个区域为共享区域,该共享区域的存储空间可以由多个从处理电路共享使用(包括读取或写入数据);从处理电路的存储模块中也可以划分一个或多个区域为共享区域,该共享区域的存储空间可以由主处理电路共享使用(包括读取或写入数据)。
此技术方案设置了存储模块的区域共享的方案,相对于存储模块固定的方案,互相连接的主处理电路与多个从处理电路之间的存储模块共享,能够避免因为存储区域不足导致计算无法进行的问题,另外,存储模块共享能够有效的降低主处理电路的存储区域的存储 空间的设置,这样大大降低了主处理电路的成本。另外,本方案相对于从外部设备提取数据来说,减少了数据读入或写入的开销,对于本计算装置,如从外部设备读入或写入数据,数据需要经过控制器单元、转换单元等部件的转发,这样对于神经网络运算来说,数据需要经过多个部件,由此使得数据读写的开销和能耗都很高,而适当的在主处理电路以及从处理电路中设置一部分共享区域,这样在主处理电路或从处理电路的存储模块的空间不够时,无需将数据存储在外部设备,而是直接存储在运算单元内部即可,由此可大大降低开销。
可选的,参阅图4A,上述计算装置还可以包括:该存储单元10和直接内存访问单元50,存储单元10可以包括:寄存器201、缓存202中的一个或任意组合,具体的,所述缓存202,用于存储所述计算指令;所述寄存器201,用于存储所述输入神经元数据、权值数据、输入神经元梯度和标量;所述缓存202为高速暂存缓存。直接内存访问单元50用于从存储单元10读取或存储数据。
可选的,该控制器单元11包括:指令缓存单元110、指令处理单元111和存储队列单元113;
指令缓存单元110,用于存储所述人工神经网络运算关联的计算指令;
所述指令处理单元111,用于对所述计算指令解析得到多个运算指令;
存储队列单元113,用于存储指令队列,该指令队列包括:按该队列的前后顺序排列的待执行的多个运算指令或计算指令。
举例说明,在一个可选的技术方案中,主运算处理电路也可以包括一个控制器单元,该控制器单元可以包括主指令处理单元,具体用于将指令译码成微指令。当然在另一种可选方案中,从运算处理电路也可以包括另一个控制器单元,该另一个控制器单元包括从指令处理单元,具体用于接收并处理微指令。上述微指令可以为指令的下一级指令,该微指令可以通过对指令的拆分或解码后获得,能被进一步解码为各部件、各单元或各处理电路的控制信号。
在一种可选方案中,该计算指令的结构可以如下表所示。
操作码 寄存器或立即数 寄存器/立即数 ...
上表中的省略号表示可以包括多个寄存器或立即数。
在另一种可选方案中,该计算指令可以包括:一个或多个操作域以及一个操作码。该计算指令可以包括神经网络运算指令。以神经网络运算指令为例,如下表所示,其中,寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以为操作域。其中,寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以是一个或者多个寄存器的号码。
Figure PCTCN2019085844-appb-000219
Figure PCTCN2019085844-appb-000220
上述寄存器可以为片外存储器,当然在实际应用中,也可以为片内存储器,用于存储数据,该数据具体可以为n维数据,n为大于等于1的整数,例如,n=1时,该数据为1维数据,即向量,如n=2时,该数据为2维数据,即矩阵,如n=3或3以上时,该数据为多维张量。
在另一种可选实施例中,参阅图3B,该运算单元12如图3B所示,可以包括一个主处理电路3101和多个从处理电路3102。在一个实施例里,如图4B所示,多个从处理电路102呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路101连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,需要说明的是,如图4B所示的k个从处理电路仅包括第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,即该k个从处理电路为多个从处理电路中直接与主处理电路连接的从处理电路。
k个从处理电路,用于转发所述主处理电路以及多个从处理电路之间的数据以及指令。
可选的,上述转换单元可以设置在主处理电路内。
上述主处理电路还可以包括:
激活处理电路,用于执行主处理电路内数据的激活运算或激活求导运算;
加法处理电路,用于执行加法运算或累加运算。
所述主处理电路,用于确定所述输入神经元数据为广播数据、权值数据为分发数据,将分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述从处理电路;
所述多个从处理电路,用于对接收到的数据块依据该运算指令执行运算得到中间结果,并将中间结果传输给所述主处理电路;
所述主处理电路,用于接收第i层正向输出结果、第i层输出结果梯度、第i层权值梯度,并依据第i层权值梯度对第i层权值进行更新。
所述从处理电路包括:乘法处理电路;
所述乘法处理电路,用于对接收到的数据块执行乘积运算得到乘积结果;
转发处理电路(可选的),用于转发接收到的数据块或乘积结果。
累加处理电路,所述累加处理电路,用于对该乘积结果执行累加运算得到该中间结果。
另一个实施例里,该运算指令为矩阵乘以矩阵的指令、累加指令、激活指令等等计算指令。
下面通过神经网络运算指令来说明如图4所示的计算装置的具体计算方法。对于神经网络运算指令来说,其实际需要执行的公式可以为:s=s(∑w·x_i+b),即将权值w乘以输入数据x_i并求和,然后加上偏置b后做激活运算s(h),得到最终的输出结果s。
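下面用一小段Python代码按该公式示意一次神经元运算:先做权值与输入数据的乘加求和,再加上偏置b,最后做激活运算s(h);此处以sigmoid作为激活函数仅为示例。

```python
import math

def neuron_forward(weights, inputs, bias):
    h = sum(w * x for w, x in zip(weights, inputs)) + bias   # sum(w * x_i) + b
    return 1.0 / (1.0 + math.exp(-h))                        # 激活运算 s(h),此处取 sigmoid

y = neuron_forward([0.2, -0.5], [1.0, 2.0], 0.1)
```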
在一种可选的实施方案中,如图4C所示,所述装置还可以包括:树型模块40,所述树型模块包括:一个根端口401和多个支端口404,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
上述树型模块具有收发功能,例如如图4C所示,该树型模块即为发送功能,如图4D所示,该树型模块40即为接收功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据以及运算指令。
可选的,该树型模块为计算装置的可选择结构,其可以包括至少1层节点,该节点为具有转发功能的线结构,该节点本身可以不具有计算功能。如树型模块具有零层节点,即无需该树型模块。
可选的,该树型模块可以为n叉树结构,例如,如图4C所示的二叉树结构,当然也可以为三叉树结构,该n可以为大于等于2的整数。本申请具体实施方式并不限制上述n的具体取值,上述层数也可以为2,从处理电路可以连接除倒数第二层节点以外的其他层的节点。
可选的,上述运算单元内的主处理电路可以携带单独的缓存,具体的,可以包括:神经元缓存单元,该神经元缓存单元缓存该从处理电路的输入神经元向量数据和输出神经元值数据。该主处理电路还可以包括:权值缓存单元,用于缓存该从处理电路在计算过程中需要的权值数据。
在一种可选实施例中,运算单元12如图3C所示,可以包括分支处理电路3103;其具体的连接结构如图3C所示,其中,
主处理电路3101与一个或多个分支处理电路3103连接,分支处理电路3103与一个或多个从处理电路3102连接;
分支处理电路3103,用于转发主处理电路3101与从处理电路3102之间的数据或指令。
可选的,上述分支处理电路3103内可以设置存储模块,该存储模块可以划分一个或多个共享区域,主处理电路以及从处理电路,具体用于对该共享区域执行数据的写入或读取操作。在分支处理电路3103内设置该共享区域能够方便主处理电路以及从处理电路存储数据,并且数据存储的读取或写入的开销很小,这样能够节省从处理电路以及主处理电路的存储模块的容量,降低计算装置的成本。
在一种可选实施例中,以神经网络运算中的全连接运算为例,过程可以为: y=f(wx+b),其中,x为输入神经元矩阵,w为权值矩阵,b为偏置标量,f为激活函数,具体可以为:sigmoid函数、tanh、relu、softmax函数中的任意一个。这里假设为二叉树结 构,运算单元具有8个从处理电路,其实现的方法可以为:
控制器单元从存储单元内获取输入神经元矩阵x、权值矩阵w以及全连接运算指令,将输入神经元矩阵x、权值矩阵w以及全连接运算指令传输给主处理电路;
主处理电路确定该输入神经元矩阵x为广播数据,确定权值矩阵w为分发数据,将权值矩阵w拆分成8个子矩阵,然后将8个子矩阵通过树型模块分发给8个从处理电路,将输入神经元矩阵x广播给8个从处理电路;
从处理电路并行执行8个子矩阵与输入神经元矩阵x的乘法运算和累加运算,得到8个中间结果,将8个中间结果发送给主处理电路;
主处理电路,用于将8个中间结果排序得到wx的运算结果,将该运算结果执行偏置b的运算后执行激活操作得到最终结果y,将最终结果y发送至控制器单元,控制器单元将该最终结果y输出或存储至存储单元内。
上述将8个中间结果排列得到wx的运算结果的实现具体方式可以为,对于矩阵乘以矩阵,确定8个子矩阵对应的输入神经元矩阵x的部分元素,提取8个子矩阵中行数最小值、部分元素的列数最小值,行数最小值以及列数最小值即为中间结果在运算结果中的位置。
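以下Python片段用列表模拟上述拆分方式:将权值矩阵w按行拆成若干子矩阵分发给各个从处理电路,将输入神经元矩阵x广播给它们,各自完成乘加后再按子矩阵在原矩阵中的位置拼接出wx的运算结果;从处理电路个数、数据规模均为示意。

```python
def fc_split_compute(w, x, num_slaves=8):
    """w: m×n 的权值矩阵(按行拆分), x: 长度为 n 的输入向量;返回 wx。"""
    m = len(w)
    rows_per_slave = (m + num_slaves - 1) // num_slaves
    partial_results = []
    for s in range(num_slaves):                       # 每个从处理电路处理一个子矩阵
        sub_w = w[s * rows_per_slave:(s + 1) * rows_per_slave]
        partial = [sum(wi * xi for wi, xi in zip(row, x)) for row in sub_w]
        partial_results.append(partial)
    # 主处理电路按子矩阵在原矩阵中的行位置把中间结果拼接(排序)成完整结果
    return [v for partial in partial_results for v in partial]

wx = fc_split_compute([[1, 2], [3, 4], [5, 6]], [1.0, 0.5], num_slaves=8)
```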
如图4所示的计算装置执行神经网络正向运算指令的方法具体可以为:
控制器单元从指令存储单元内提取神经网络正向运算指令、神经网络运算指令对应的操作域以及至少一个操作码,控制器单元将该操作域传输至数据访问单元,将该至少一个操作码发送至运算单元。
控制器单元从存储单元内提取该操作域对应的权值w和偏置b(当b为0时,不需要提取偏置b),将权值w和偏置b传输至运算单元的主处理电路,控制器单元从存储单元内提取输入数据Xi,将该输入数据Xi发送至主处理电路。
主处理电路依据该至少一个操作码确定该操作码为乘法运算,将输入数据Xi转换为定点输入数据Xi,将权值数据转换为定点权值数据,确定定点输入数据Xi为广播数据,确定定点权值数据为分发数据,将定点权值w拆分成n个定点数据块;
控制器单元的指令处理单元依据该至少一个操作码确定乘法指令、偏置指令和累加指令,将乘法指令、偏置指令和累加指令发送至主处理电路,主处理电路将该乘法指令、输入数据Xi以广播的方式发送给多个从处理电路,将该n个定点数据块分发给该多个从处理电路(例如具有n个从处理电路,那么每个从处理电路发送一个数据块);多个从处理电路,用于依据该乘法指令将该定点输入数据Xi与接收到的定点数据块执行定点乘法运算得到定点中间结果,将该定点中间结果发送至主处理电路,该主处理电路依据该累加指令将多个从处理电路发送的中间结果执行累加运算得到累加结果,将该累加结果转换成浮点累加结果,依据该偏置指令将该浮点累加结果加上偏置b得到最终结果,将该最终结果发送至该控制器单元。
本申请提供的技术方案通过一个指令即神经网络运算指令实现了神经网络的乘法运算以及偏置运算,无需存储或提取神经网络计算的中间结果,减少了中间数据的存储以及提取操作,所以其具有减少对应的操作步骤,提高神经网络的计算效果的优点。
本申请还揭露了一个神经网络装置,其包括一个或多个在本申请中提到的计算装置,用于从其他处理装置中获取待运算数据和控制信息,执行指定的神经网络训练计算,执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头、显示器、鼠标、键盘、网卡、wifi接口和服务器。当包含一个以上计算装置时,这些计算装置之间可以通过特定的结构进行链接并传输数据,譬如,通过PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算。此时,这些计算装置可以共享同一控制系统,也可以有各自独立的控制系统;可以共享内存,也可以每个加速器有各自的内存。此外,这些计算装置的互联方式可以是任意互联拓扑。
该神经网络装置具有较高的兼容性,可通过PCIE接口与各种类型的服务器相连接。
本申请还提供了一个组合处理装置,其包括上述的神经网络装置,通用互联接口,和其他处理装置。神经网络装置与其他处理装置进行交互,共同完成用户指定的操作。图4E为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。本申请不对其他处理装置所包括的处理器数量做限制。其他处理装置作为神经网络装置与外部数据和控制的接口,包括数据搬运,完成对本神经网络装置的开启、停止等基本控制;其他处理装置也可以和神经网络装置协作共同完成运算任务。
通用互联接口,用于传输所述神经网络装置与其他处理装置之间的数据和控制指令。该神经网络装置从其他处理装置中获取所需的输入数据,写入神经网络装置片上的存储装置;可以从其他处理装置中获取控制指令,写入神经网络装置片上的控制缓存;也可以读取神经网络装置的存储模块中的数据并传输给其他处理装置。
可选的,该结构如图4所示,还可以包括存储装置,存储装置分别与所述神经网络装置和所述其他处理装置连接。存储装置用于保存在所述神经网络装置和所述其他处理装置的数据,尤其适用于所需要运算的数据在本机器学习运算装置或其他处理装置的内部存储中无法全部保存的情况。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上系统,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一些实施例里,还申请了一种芯片,其包括了上述神经网络运算装置或组合处理装置。
在一些实施例里,申请了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,申请了一种板卡,其包括了上述芯片封装结构。参阅图5,图5提供了一种板卡,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述芯片通过总线连接。可 以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
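上述25600MB/s的理论带宽可按如下方式粗略估算(示意):DDR4-3200的等效数据速率为3200MT/s,64bit数据位宽相当于每次传输8字节,故理论带宽约为3200MT/s×8Byte=25600MB/s。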
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于控制每个所述存储单元的数据传输与数据存储。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。可选的,当采用PCIE 3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的,所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路,可以带动多个负载。因此,所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述芯片中多个处理芯片、多个处理核和/或多个处理电路的工作状态的调控。
在一些实施例里,申请了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (17)

  1. 一种计算方法,其特征在于,所述计算方法应用于计算系统,所述计算系统包括:控制单元、计算群和总存储单元,所述控制单元包括:第一存储器、译码逻辑和控制器,所述计算群包括:群控制器和多个计算单元;所述总存储单元,用于存储数据;所述计算方法包括如下步骤:
    所述控制器接收第一级指令序列,所述译码逻辑将该第一级指令序列拆分成多个第二级指令序列,
    控制器为所述多个第二级指令序列开辟M个线程,控制器为所述M个线程中每个线程分配独立的寄存器以及配置独立寻址功能;所述M取值范围为大于等于1的整数;
    群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果。
  2. 根据权利要求1所述的方法,其特征在于,所述群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
    如所述计算类型代表相同类型的计算操作,群控制器调用相同类型的单指令多数据流SIMD结合单指令多线程SIMT的组合计算方式,并采用所述M个线程执行组合计算方式得到最终结果,具体包括:
    译码逻辑将M个线程拆分成N个线程组分配给多个计算单元,群控制器将所述多个第二指令序列转换成多个第二控制信号并发送给多个计算单元,多个计算单元调用分配的线程组以及第二控制信号依据所述独立寻址功能提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将多个中间结果拼接起来得到最终结果。
  3. 根据权利要求1所述的方法,其特征在于,所述群控制器获取所述多个第二级指令序列的多个计算类型,依据所述多个计算类型获取计算类型对应的融合计算方式,多个计算单元采用该融合计算方式调用所述M个线程对所述多个第二指令序列执行计算得到最终结果:
    如所述计算类型代表不同类型的计算操作,群控制器调用同步多线程SMT以及所述M个线程执行计算得到最终结果具体包括:
    译码逻辑将M个线程拆分成N个线程组,将所述多个第二指令序列转换成多个第二控制信号,群控制器获取多个计算单元支持的计算类型,控制器将N个线程组以及多个第二控制信号,分配给支持该线程组以及第二控制信号的计算类型对应的计算单元,多个计算单元调用分配的线程组以及第二控制信号,多个计算单元提取对应的数据,多个计算单元将该数据执行运算得到多个中间结果,将所有中间结果拼接起来得到最终结果。
  4. 根据权利要求2或3所述的方法,其特征在于,所述方法还包括:
    如多个线程组中的线程组A阻塞,将线程组A加入等待队列,如线程组A的数据已被提取,将线程组A加入到准备队列,所述准备队列为计算资源空闲时被调度执行的线程组 所在的队列。
  5. 根据权利要求1所述的方法,其特征在于,
    所述第一级指令序列包括:超长指令,所述第二级指令序列包括:指令序列。
  6. 根据权利要求1所述的方法,其特征在于,所述计算系统还包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述群控制器,所述树型模块的多个支端口分别连接多个计算单元中的一个计算单元;
    所述树型模块转发所述群控制器与所述多个计算单元之间的数据块、线程组或指令序列。
  7. 根据权利要求6所述的方法,其特征在于,所述树型模块为n叉树,所述n为大于等于2的整数。
  8. 根据权利要求1所述的方法,其特征在于,所述计算系统还包括:分支处理电路,
    所述分支处理电路连接在所述群控制器与所述多个计算单元之间;
    所述分支处理电路转发所述群控制器与所述多个计算单元之间的数据、线程组或指令序列。
  9. 一种计算系统,其特征在于,所述计算系统包括:控制单元、计算群和总存储单元,所述控制单元包括:第一存储器、译码逻辑和控制器,所述计算群包括:群控制器和多个计算单元;所述总存储单元,用于存储数据;
    控制器,用于接收第一级指令序列以及用于控制所述第一存储器和所述译码逻辑;
    所述译码逻辑,用于将该第一级指令序列拆分成多个第二级指令序列;
    所述控制器,还用于为所述多个第二级指令序列开辟M个线程;为所述M个线程中每个线程分配独立的寄存器以及配置独立寻址功能;所述M取值范围为大于等于1的整数,将所述多个第二级指令序列转换成多个控制信号发送给所述群控制器;
    所述群控制器,用于接收所述多个控制信号,获取所述多个控制信号的多个计算类型,将M个线程划分成N个线程组,依据该多个计算类型为多个计算单元分配N个线程组以及多个控制信号;
    多个计算单元,用于通过分配的线程组以及控制信号从所述总存储单元提取数据执行运算得到中间结果,
    所述群控制器,用于拼接所有中间结果得到最终计算结果。
  10. 根据权利要求9所述的计算系统,其特征在于,
    所述多个计算单元包括:加法计算器、乘法计算器、激活计算器或专用计算器。
  11. 根据权利要求9所述的计算系统,其特征在于,
    所述专用计算器包括:人脸识别计算器、图形计算器、指纹计算器或神经网络计算器。
  12. 根据权利要求11所述的计算系统,其特征在于,
    所述群控制器,具体用于如多个控制信号的计算类型为图形计算、指纹识别、人脸识别或神经网络运算时,将该多个控制信号分别分配给人脸识别计算器、图形计算器、指纹 计算器或神经网络计算器。
  13. 根据权利要求9所述的计算系统,其特征在于,
    所述第一级指令序列包括:超长指令,所述第二级指令序列包括:指令序列。
  14. 根据权利要求9所述的计算系统,其特征在于,所述计算系统包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述群控制器,所述树型模块的多个支端口分别连接多个计算单元中的一个计算单元;
    所述树型模块,用于转发所述群控制器与所述多个计算单元之间的数据块、线程组或指令序列。
  15. 根据权利要求14所述的计算系统,其特征在于,所述树型模块为n叉树,所述n为大于等于2的整数。
  16. 根据权利要求9所述的计算系统,其特征在于,所述计算系统包括:分支处理电路,
    所述分支处理电路连接在所述群控制器与所述多个计算单元之间;
    所述分支处理电路,用于转发所述群控制器与所述多个计算单元之间的数据、线程组或指令序列。
  17. 一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可通过操作使计算机执行如权利要求1-8任一项所述的方法。
PCT/CN2019/085844 2018-05-18 2019-05-07 计算方法以及相关产品 WO2019218896A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP19803375.5A EP3624020A4 (en) 2018-05-18 2019-05-07 CALCULATION PROCEDURES AND RELATED PRODUCTS
US16/718,742 US11409575B2 (en) 2018-05-18 2019-12-18 Computation method and product thereof
US16/720,145 US11442785B2 (en) 2018-05-18 2019-12-19 Computation method and product thereof
US16/720,171 US11442786B2 (en) 2018-05-18 2019-12-19 Computation method and product thereof

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201810479540.0A CN110503179B (zh) 2018-05-18 2018-05-18 计算方法以及相关产品
CN201810479540.0 2018-05-18
CN201811041573.3 2018-09-06
CN201811040961.X 2018-09-06
CN201811040961.XA CN110880037A (zh) 2018-09-06 2018-09-06 神经网络运算模块及方法
CN201811041573.3A CN110880033A (zh) 2018-09-06 2018-09-06 神经网络运算模块及方法
CN201811592249.0A CN111368987B (zh) 2018-12-25 2018-12-25 一种神经网络计算装置和方法
CN201811592249.0 2018-12-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/718,742 Continuation-In-Part US11409575B2 (en) 2018-05-18 2019-12-18 Computation method and product thereof

Publications (1)

Publication Number Publication Date
WO2019218896A1 true WO2019218896A1 (zh) 2019-11-21

Family

ID=68539478

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085844 WO2019218896A1 (zh) 2018-05-18 2019-05-07 计算方法以及相关产品

Country Status (3)

Country Link
US (3) US11409575B2 (zh)
EP (1) EP3624020A4 (zh)
WO (1) WO2019218896A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552563A (zh) * 2020-04-20 2020-08-18 南昌嘉研科技有限公司 一种多线程数据架构、多线程消息传递方法及系统
US12008369B1 (en) 2021-08-31 2024-06-11 Apple Inc. Load instruction fusion

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11437032B2 (en) 2017-09-29 2022-09-06 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
KR102252137B1 (ko) 2018-02-13 2021-05-13 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 계산 장치 및 방법
US11663002B2 (en) 2018-02-13 2023-05-30 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN116991226A (zh) 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 处理器的控制装置、方法及设备
WO2019218896A1 (zh) 2018-05-18 2019-11-21 上海寒武纪信息科技有限公司 计算方法以及相关产品
WO2020001438A1 (zh) 2018-06-27 2020-01-02 上海寒武纪信息科技有限公司 片上代码断点调试方法、片上处理器及芯片断点调试系统
JP6867518B2 (ja) 2018-08-28 2021-04-28 カンブリコン テクノロジーズ コーポレイション リミティド データ前処理方法、装置、コンピュータ機器及び記憶媒体
EP3859488A4 (en) 2018-09-28 2022-06-29 Shanghai Cambricon Information Technology Co., Ltd Signal processing device, signal processing method and related product
CN111385462A (zh) 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 信号处理装置、信号处理方法及相关产品
US20200334522A1 (en) 2019-04-18 2020-10-22 Cambricon Technologies Corporation Limited Data processing method and related products
CN111832739B (zh) 2019-04-18 2024-01-09 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品
US11676029B2 (en) 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
EP3772022A1 (en) 2019-06-12 2021-02-03 Shanghai Cambricon Information Technology Co., Ltd Method for determining quantization parameters in neural network and related products
EP4020321A4 (en) 2019-08-23 2024-01-17 Anhui Cambricon Information Technology Co., Ltd. DATA PROCESSING METHOD, APPARATUS, COMPUTER APPARATUS AND STORAGE MEDIUM
CN113867791B (zh) * 2020-06-30 2023-09-26 上海寒武纪信息科技有限公司 一种计算装置、芯片、板卡、电子设备和计算方法
KR20220033713A (ko) * 2020-09-10 2022-03-17 에스케이하이닉스 주식회사 데이터 처리 시스템 및 그 동작 방법
KR20230063791A (ko) * 2021-11-02 2023-05-09 리벨리온 주식회사 인공지능 코어, 인공지능 코어 시스템 및 인공지능 코어 시스템의 로드/스토어 방법

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140164737A1 (en) * 2012-12-06 2014-06-12 Kalray Execution efficiency in a single-program, multiple-data processor
CN105389158A (zh) * 2014-08-28 2016-03-09 想象技术有限公司 组合路径
CN106406812A (zh) * 2015-10-02 2017-02-15 上海兆芯集成电路有限公司 微处理器和微处理器内的执行融合复合算术运算的方法

Family Cites Families (173)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0375860A (ja) 1989-08-18 1991-03-29 Hitachi Ltd パーソナライズド端末
US5052043A (en) * 1990-05-07 1991-09-24 Eastman Kodak Company Neural network with back propagation controlled through an output confidence measure
US6144977A (en) 1995-07-10 2000-11-07 Motorola, Inc. Circuit and method of converting a floating point number to a programmable fixed point number
GB9602701D0 (en) 1996-02-09 1996-04-10 Canon Kk Image manipulation
US7242414B1 (en) 1999-07-30 2007-07-10 Mips Technologies, Inc. Processor having a compare extension of an instruction set architecture
JP2000293371A (ja) * 1999-04-09 2000-10-20 Hitachi Ltd マイクロプログラム制御方法及び装置
US6671796B1 (en) 2000-02-25 2003-12-30 Sun Microsystems, Inc. Converting an arbitrary fixed point value to a floating point value
US6931639B1 (en) * 2000-08-24 2005-08-16 International Business Machines Corporation Method for implementing a variable-partitioned queue for simultaneous multithreaded processors
JP4424907B2 (ja) 2000-09-07 2010-03-03 新日本製鐵株式会社 Sn系、Al系めっき鋼板用6価クロムフリー表面処理薬剤および表面処理鋼板
US20020138714A1 (en) 2001-03-22 2002-09-26 Sun Microsystems, Inc. Scoreboard for scheduling of instructions in a microprocessor that provides out of order execution
WO2002086817A1 (en) 2001-04-19 2002-10-31 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive memory allocation
US20030167460A1 (en) * 2002-02-26 2003-09-04 Desai Vipul Anil Processor instruction set simulation power estimation method
US7236995B2 (en) 2002-12-27 2007-06-26 Arm Limited Data processing apparatus and method for converting a number between fixed-point and floating-point representations
DE10316381A1 (de) 2003-04-10 2004-10-28 Bayer Technology Services Gmbh Verfahren zum Training von neuronalen Netzen
JP4202244B2 (ja) * 2003-12-22 2008-12-24 Necエレクトロニクス株式会社 Vliw型dsp,及びその動作方法
US20060161375A1 (en) 2004-12-30 2006-07-20 Allen Duberstein Optimizing processing speed based on measured temperatures
US7721128B2 (en) 2005-11-29 2010-05-18 International Business Machines Corporation Implementation of thermal throttling logic
US20070242973A1 (en) 2006-04-12 2007-10-18 Konica Minolta Business Technologies, Inc. Developing device and image forming apparatus
CN1851668A (zh) 2006-06-01 2006-10-25 北京天碁科技有限公司 片上系统芯片、片上系统芯片的跟踪调试系统及方法
DE102006059156B4 (de) 2006-12-14 2008-11-06 Advanced Micro Devices, Inc., Sunnyvale Verfahren zum Testen eines integrierten Schaltkreischips mit zumindest zwei Schaltungskernen sowie integrierter Schaltkreischip und Testsystem
US20110060587A1 (en) 2007-03-07 2011-03-10 Phillips Michael S Command and control utilizing ancillary information in a mobile voice-to-speech application
US8560591B2 (en) 2007-04-25 2013-10-15 International Business Machines Corporation Detection of potential need to use a larger data format in performing floating point operations
US8051117B2 (en) 2007-04-26 2011-11-01 International Business Machines Corporation Shift significand of decimal floating point data
US8051118B2 (en) 2007-04-26 2011-11-01 International Business Machines Corporation Composition of decimal floating point data
US8190664B2 (en) 2007-04-26 2012-05-29 International Business Machines Corporation Employing a mask field of an instruction to encode a sign of a result of the instruction
JP5184824B2 (ja) 2007-06-15 2013-04-17 キヤノン株式会社 演算処理装置及び方法
JP2009110353A (ja) 2007-10-31 2009-05-21 Hitachi Ltd マイクロコントローラ及び制御システム
US7904287B2 (en) 2007-11-13 2011-03-08 International Business Machines Corporation Method and system for real-time prediction of power usage for a change to another performance state
JP4998794B2 (ja) 2007-11-29 2012-08-15 Nkワークス株式会社 画像補正方法と画像補正装置
US20100073068A1 (en) 2008-09-22 2010-03-25 Hanwoo Cho Functional block level thermal control
CN101572829B (zh) 2009-06-10 2011-02-02 中国联合网络通信集团有限公司 Iptv视频质量监测方法、装置和系统
EP2336882A1 (en) 2009-12-18 2011-06-22 Telefonaktiebolaget L M Ericsson (PUBL) Technique for run-time provision of executable code using off-device services
CN102985673B (zh) 2010-04-21 2015-06-17 丰田自动车株式会社 内燃机的控制装置
JP2011253374A (ja) 2010-06-02 2011-12-15 Sony Corp 情報処理装置、および情報処理方法、並びにプログラム
US8452463B2 (en) 2010-06-04 2013-05-28 Apple Inc. Adjusting the thermal behavior of a computing system using indirect information about ambient temperature
US8694572B2 (en) 2010-07-06 2014-04-08 Silminds, Llc, Egypt Decimal floating-point fused multiply-add unit
US8924455B1 (en) 2011-02-25 2014-12-30 Xilinx, Inc. Multiplication of matrices using systolic arrays
CN102761509B (zh) 2011-04-27 2016-01-06 联芯科技有限公司 Ofdm系统的接收系统及降低接收系统内存的方法
KR20140002034A (ko) 2011-05-12 2014-01-07 애플 인크. 존재 감지
CN102789413B (zh) 2011-05-23 2016-02-17 同济大学 一种并行程序的调试系统及方法
US8594982B2 (en) 2011-06-09 2013-11-26 Pulsar Informatics, Inc. Systems and methods for distributed calculation of fatigue-risk prediction and optimization
CN102404673B (zh) 2011-11-24 2013-12-18 苏州上声电子有限公司 数字化扬声器系统通道均衡与声场控制方法和装置
CN103152673B (zh) 2011-12-07 2015-07-08 中国科学院声学研究所 基于四元码动态失配整形的数字扬声器驱动方法和装置
CN102684701B (zh) 2012-04-27 2014-07-09 苏州上声电子有限公司 基于编码转换的数字扬声器驱动方法和装置
DE102012009502A1 (de) * 2012-05-14 2013-11-14 Kisters Ag Verfahren zum Trainieren eines künstlichen neuronalen Netzes
US9417891B2 (en) 2012-06-11 2016-08-16 Vmware, Inc. Unified storage/VDI provisioning methodology
US9063731B2 (en) 2012-08-27 2015-06-23 Samsung Electronics Co., Ltd. Ultra low power apparatus and method to wake up a main processor
CN102903089B (zh) 2012-09-07 2014-12-17 山东大学 一种Linux环境下生成遥感图像快视图的方法
US9412366B2 (en) 2012-09-18 2016-08-09 Adobe Systems Incorporated Natural language image spatial and tonal localization
CN102981854A (zh) 2012-11-16 2013-03-20 天津市天祥世联网络科技有限公司 基于浮点数运算内联函数库的神经网络优化方法
JP5706040B2 (ja) 2012-11-22 2015-04-22 学校法人慶應義塾 アクリル系共重合体、光学フィルム、偏光板および液晶表示装置
US9720732B1 (en) 2013-02-11 2017-08-01 Amazon Technologies, Inc. Parameter selection for optimization of task execution based on execution history for prior tasks
JP2014170295A (ja) 2013-03-01 2014-09-18 Honda Motor Co Ltd 物体認識システム及び物体認識方法
US20190138372A1 (en) * 2013-04-29 2019-05-09 Moogsoft, Inc. System for managing an instructure with security
JP6184891B2 (ja) 2014-03-12 2017-08-23 東芝メモリ株式会社 情報処理装置、半導体チップ、情報処理方法およびプログラム
US9507405B2 (en) 2014-06-18 2016-11-29 Oracle International Corporation System and method for managing power in a chip multiprocessor using a proportional feedback mechanism
US9575537B2 (en) 2014-07-25 2017-02-21 Intel Corporation Adaptive algorithm for thermal throttling of multi-core processors with non-homogeneous performance states
US10282100B2 (en) 2014-08-19 2019-05-07 Samsung Electronics Co., Ltd. Data management scheme in virtualized hyperscale environments
US9916130B2 (en) 2014-11-03 2018-03-13 Arm Limited Apparatus and method for vector processing
FR3030077B1 (fr) 2014-12-10 2016-12-02 Arnault Ioualalen Procede d'ajustement de la precision d'un programme d'ordinateur manipulant au moins un nombre a virgule.
EP3035204B1 (en) 2014-12-19 2018-08-15 Intel Corporation Storage device and method for performing convolution operations
US20170061279A1 (en) 2015-01-14 2017-03-02 Intel Corporation Updating an artificial neural network using flexible fixed point representation
US20160328645A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Reduced computational complexity for fixed point neural network
US10083395B2 (en) 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
CN104899641B (zh) 2015-05-25 2018-07-13 杭州朗和科技有限公司 深度神经网络学习方法、处理器和深度神经网络学习系统
CN115100018A (zh) 2015-06-10 2022-09-23 无比视视觉技术有限公司 用于处理图像的图像处理器和方法
CN104978303B (zh) 2015-06-19 2019-06-04 上海兆芯集成电路有限公司 单芯片整合的传感器集线器和多传感器管理方法
CN106469291A (zh) 2015-08-19 2017-03-01 中兴通讯股份有限公司 图像处理方法及终端
US10031765B2 (en) * 2015-09-24 2018-07-24 Intel Corporation Instruction and logic for programmable fabric hierarchy and cache
US10812831B2 (en) 2015-09-30 2020-10-20 Piksel, Inc. Video stream delivery via adaptive quality enhancement using error correction models
CN106570559A (zh) 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 一种基于神经网络的数据处理方法和装置
WO2017087568A1 (en) 2015-11-17 2017-05-26 Eman Bayani A digital image capturing device system and method
CN106814639A (zh) 2015-11-27 2017-06-09 富泰华工业(深圳)有限公司 语音控制系统及方法
CN105893419A (zh) 2015-11-30 2016-08-24 乐视致新电子科技(天津)有限公司 一种多媒体照片生成方法、装置、设备及手机
US10699186B2 (en) 2015-12-02 2020-06-30 Google Llc Determining orders of execution of a neural network
CN106991478B (zh) 2016-01-20 2020-05-08 中科寒武纪科技股份有限公司 用于执行人工神经网络反向训练的装置和方法
CN106997236B (zh) 2016-01-25 2018-07-13 亮风台(上海)信息科技有限公司 基于多模态输入进行交互的方法和设备
US10803401B2 (en) 2016-01-27 2020-10-13 Microsoft Technology Licensing, Llc Artificial intelligence engine having multiple independent processes on a cloud based platform configured to scale
US10497089B2 (en) 2016-01-29 2019-12-03 Fotonation Limited Convolutional neural network
JP2017156511A (ja) 2016-03-01 2017-09-07 ソニー株式会社 情報処理装置、情報処理方法、およびプログラム
US10103714B2 (en) 2016-03-01 2018-10-16 Qualcomm Incorporated Adjust voltage for thermal mitigation
US10019779B2 (en) 2016-03-08 2018-07-10 Amazon Technologies, Inc. Browsing interface for item counterparts having different scales and lengths
CN107330515A (zh) 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 一种用于执行人工神经网络正向运算的装置和方法
US10552119B2 (en) 2016-04-29 2020-02-04 Intel Corporation Dynamic management of numerical representation in a distributed matrix processor architecture
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US10187568B1 (en) 2016-05-02 2019-01-22 Bao Tran Video smart phone
CN105978611B (zh) 2016-05-12 2019-09-17 京信通信系统(中国)有限公司 一种频域信号压缩方法及装置
AU2016203619A1 (en) 2016-05-31 2017-12-14 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
EP3252949B1 (en) 2016-06-01 2020-03-18 Intel IP Corporation Methods and devices for predistortion of signals
US20170357910A1 (en) 2016-06-10 2017-12-14 Apple Inc. System for iteratively training an artificial intelligence using cloud-based metrics
CN107545889B (zh) 2016-06-23 2020-10-23 华为终端有限公司 适用于模式识别的模型的优化方法、装置及终端设备
CN106156310A (zh) 2016-06-30 2016-11-23 努比亚技术有限公司 一种图片处理装置和方法
US10372588B2 (en) 2016-07-08 2019-08-06 International Business Machines Corporation Providing debug information on production containers using debug containers
DE102016214786A1 (de) 2016-08-09 2018-02-15 Fujitsu Limited Anwendungsprofiling-Jobmanagement-System, -Programm und -Verfahren
US20180046903A1 (en) 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN107688855B (zh) 2016-08-12 2021-04-13 赛灵思公司 针对于复杂神经网络的分层量化方法与装置
CN106354568A (zh) 2016-08-23 2017-01-25 京信通信技术(广州)有限公司 一种不同进程间的通信方法及通信装置
CN107797913A (zh) 2016-09-07 2018-03-13 大陆汽车电子(连云港)有限公司 一种实时系统的软件分析系统与方法
US11907760B2 (en) 2016-09-23 2024-02-20 Apple Inc. Systems and methods of memory allocation for neural networks
CN106650922B (zh) 2016-09-29 2019-05-03 清华大学 硬件神经网络转换方法、计算装置、软硬件协作系统
US20180096243A1 (en) 2016-09-30 2018-04-05 General Electric Company Deep learning for data driven feature representation and anomaly detection
WO2018071546A1 (en) 2016-10-11 2018-04-19 The Research Foundation For The State University Of New York System, method, and accelerator to process convolutional neural network layers
CN106485316B (zh) 2016-10-31 2019-04-02 北京百度网讯科技有限公司 神经网络模型压缩方法以及装置
CN106502626A (zh) 2016-11-03 2017-03-15 北京百度网讯科技有限公司 数据处理方法和装置
US10216479B2 (en) 2016-12-06 2019-02-26 Arm Limited Apparatus and method for performing arithmetic operations to accumulate floating-point numbers
US10997492B2 (en) 2017-01-20 2021-05-04 Nvidia Corporation Automated methods for conversions to a lower precision data format
CN106951587A (zh) 2017-02-15 2017-07-14 芯启源(南京)半导体科技有限公司 Fpga调试系统及方法
CN106951962B (zh) 2017-03-22 2020-09-01 南京地平线机器人技术有限公司 用于神经网络的复合运算单元、方法和电子设备
US10380039B2 (en) * 2017-04-07 2019-08-13 Intel Corporation Apparatus and method for memory management in a graphics processing environment
US10332302B2 (en) 2017-04-17 2019-06-25 Intel Corporation Scatter gather engine
US10402932B2 (en) 2017-04-17 2019-09-03 Intel Corporation Power-based and target-based graphics quality adjustment
CN107025629B (zh) 2017-04-27 2021-03-26 维沃移动通信有限公司 一种图像处理方法及移动终端
US11842280B2 (en) * 2017-05-05 2023-12-12 Nvidia Corporation Loss-scaling for deep neural network training with reduced precision
US10019668B1 (en) 2017-05-19 2018-07-10 Google Llc Scheduling neural network processing
US11144828B2 (en) 2017-06-09 2021-10-12 Htc Corporation Training task optimization system, training task optimization method and non-transitory computer readable medium for operating the same
US10944902B2 (en) 2017-06-20 2021-03-09 Adobe Inc. Digital image generation using capture support data
US20200097799A1 (en) 2017-06-30 2020-03-26 Intel Corporation Heterogeneous multiplier
CN107451654B (zh) 2017-07-05 2021-05-18 深圳市自行科技有限公司 卷积神经网络的加速运算方法、服务器及存储介质
US10427306B1 (en) 2017-07-06 2019-10-01 X Development Llc Multimodal object identification
CN107729990B (zh) 2017-07-20 2021-06-08 上海寒武纪信息科技有限公司 支持离散数据表示的用于执行正向运算的装置及方法
CN107451658B (zh) 2017-07-24 2020-12-15 杭州菲数科技有限公司 浮点运算定点化方法及系统
CN107688849B (zh) 2017-07-28 2021-04-13 赛灵思电子科技(北京)有限公司 一种动态策略定点化训练方法及装置
US11481218B2 (en) 2017-08-02 2022-10-25 Intel Corporation System and method enabling one-hot neural networks on a machine learning compute platform
WO2019031858A1 (en) 2017-08-08 2019-02-14 Samsung Electronics Co., Ltd. METHOD AND APPARATUS FOR DETERMINING MEMORY NEEDS IN A NETWORK
US20190050710A1 (en) 2017-08-14 2019-02-14 Midea Group Co., Ltd. Adaptive bit-width reduction for neural networks
CN107644254A (zh) 2017-09-09 2018-01-30 复旦大学 一种卷积神经网络权重参数量化训练方法及系统
US10223114B1 (en) 2017-09-29 2019-03-05 Intel Corporation Fixed point to floating point conversion
KR102317958B1 (ko) 2017-09-29 2021-10-27 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 화상처리장치 및 방법
US11437032B2 (en) 2017-09-29 2022-09-06 Shanghai Cambricon Information Technology Co., Ltd Image processing apparatus and method
US10224954B1 (en) 2017-09-29 2019-03-05 Intel Corporation Floating point to fixed point conversion
US11450319B2 (en) 2017-09-29 2022-09-20 Cambricon (Xi'an) Semiconductor Co., Ltd. Image processing apparatus and method
JP6540770B2 (ja) 2017-10-17 2019-07-10 富士通株式会社 演算処理回路、演算処理回路を含む演算処理装置、演算処理装置を含む情報処理装置、および方法
US10410121B2 (en) 2017-10-25 2019-09-10 SparkCognition, Inc. Adjusting automated neural network generation based on evaluation of candidate neural networks
US20210061028A1 (en) 2017-10-26 2021-03-04 Applied Mechatronic Products Apparatus and method for vehicular monitoring, analysis, and control
US10783634B2 (en) 2017-11-22 2020-09-22 General Electric Company Systems and methods to deliver point of care alerts for radiological findings
US10803379B2 (en) 2017-12-12 2020-10-13 Amazon Technologies, Inc. Multi-memory on-chip computational network
CN108053028B (zh) 2017-12-21 2021-09-14 深圳励飞科技有限公司 数据定点化处理方法、装置、电子设备及计算机存储介质
US11636327B2 (en) 2017-12-29 2023-04-25 Intel Corporation Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
US11373088B2 (en) 2017-12-30 2022-06-28 Intel Corporation Machine learning accelerator mechanism
US20190251429A1 (en) 2018-02-12 2019-08-15 Kneron, Inc. Convolution operation device and method of scaling convolution input for convolution neural network
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11106598B2 (en) 2018-02-13 2021-08-31 Shanghai Cambricon Information Technology Co., Ltd. Computing device and method
US11663002B2 (en) 2018-02-13 2023-05-30 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN116991226A (zh) 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 处理器的控制装置、方法及设备
JP7056225B2 (ja) 2018-02-26 2022-04-19 富士通株式会社 演算処理装置、情報処理装置、情報処理方法、およびプログラム
US10628275B2 (en) 2018-03-07 2020-04-21 Nxp B.V. Runtime software-based self-test with mutual inter-core checking
US11475306B2 (en) 2018-03-22 2022-10-18 Amazon Technologies, Inc. Processing for multiple input data sets
CN108510067B (zh) 2018-04-11 2021-11-09 西安电子科技大学 基于工程化实现的卷积神经网络量化方法
US11562213B2 (en) 2018-04-17 2023-01-24 Intel Corporation Methods and arrangements to manage memory in cascaded neural networks
US10691413B2 (en) 2018-05-04 2020-06-23 Microsoft Technology Licensing, Llc Block floating point computations using reduced bit-width vectors
WO2019218896A1 (zh) 2018-05-18 2019-11-21 上海寒武纪信息科技有限公司 计算方法以及相关产品
CN108717570A (zh) 2018-05-23 2018-10-30 电子科技大学 一种脉冲神经网络参数量化方法
US10360304B1 (en) 2018-06-04 2019-07-23 Imageous, Inc. Natural language processing interface-enabled building conditions control system
CN109062540B (zh) 2018-06-06 2022-11-25 北京理工大学 一种基于cordic算法的可重构浮点运算装置
CN109063820A (zh) 2018-06-07 2018-12-21 中国科学技术大学 利用时频联合长时循环神经网络的数据处理方法
WO2020001438A1 (zh) 2018-06-27 2020-01-02 上海寒武纪信息科技有限公司 片上代码断点调试方法、片上处理器及芯片断点调试系统
CN110728364A (zh) 2018-07-17 2020-01-24 上海寒武纪信息科技有限公司 一种运算装置和运算方法
JP6867518B2 (ja) 2018-08-28 2021-04-28 カンブリコン テクノロジーズ コーポレイション リミティド データ前処理方法、装置、コンピュータ機器及び記憶媒体
EP3859488A4 (en) 2018-09-28 2022-06-29 Shanghai Cambricon Information Technology Co., Ltd Signal processing device, signal processing method and related product
CN109685202B (zh) 2018-12-17 2023-03-21 腾讯科技(深圳)有限公司 数据处理方法及装置、存储介质和电子装置
US10699465B1 (en) * 2018-12-28 2020-06-30 Intel Corporation Cluster of scalar engines to accelerate intersection in leaf node
CN111385462A (zh) 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 信号处理装置、信号处理方法及相关产品
CN109902745A (zh) 2019-03-01 2019-06-18 成都康乔电子有限责任公司 一种基于cnn的低精度训练与8位整型量化推理方法
CN109993296B (zh) 2019-04-01 2020-12-29 安徽寒武纪信息科技有限公司 量化实现方法及相关产品
CN110059733A (zh) 2019-04-01 2019-07-26 苏州科达科技股份有限公司 卷积神经网络的优化及快速目标检测方法、装置
US20200334522A1 (en) 2019-04-18 2020-10-22 Cambricon Technologies Corporation Limited Data processing method and related products
CN111832739B (zh) 2019-04-18 2024-01-09 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品
US11676029B2 (en) 2019-06-12 2023-06-13 Shanghai Cambricon Information Technology Co., Ltd Neural network quantization parameter determination method and related products
WO2021036908A1 (zh) 2019-08-23 2021-03-04 安徽寒武纪信息科技有限公司 数据处理方法、装置、计算机设备和存储介质
EP4020321A4 (en) 2019-08-23 2024-01-17 Anhui Cambricon Information Technology Co., Ltd. DATA PROCESSING METHOD, APPARATUS, COMPUTER APPARATUS AND STORAGE MEDIUM
WO2021036890A1 (zh) 2019-08-23 2021-03-04 安徽寒武纪信息科技有限公司 数据处理方法、装置、计算机设备和存储介质
WO2021036905A1 (zh) 2019-08-27 2021-03-04 安徽寒武纪信息科技有限公司 数据处理方法、装置、计算机设备和存储介质
CN110780845B (zh) 2019-10-17 2021-11-30 浙江大学 一种用于量化卷积神经网络的可配置近似乘法器及其实现方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140164737A1 (en) * 2012-12-06 2014-06-12 Kalray Execution efficiency in a single-program, multiple-data processor
CN105389158A (zh) * 2014-08-28 2016-03-09 想象技术有限公司 组合路径
CN106406812A (zh) * 2015-10-02 2017-02-15 上海兆芯集成电路有限公司 微处理器和微处理器内的执行融合复合算术运算的方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552563A (zh) * 2020-04-20 2020-08-18 南昌嘉研科技有限公司 一种多线程数据架构、多线程消息传递方法及系统
US12008369B1 (en) 2021-08-31 2024-06-11 Apple Inc. Load instruction fusion

Also Published As

Publication number Publication date
US20200142748A1 (en) 2020-05-07
EP3624020A1 (en) 2020-03-18
US11409575B2 (en) 2022-08-09
US20200160163A1 (en) 2020-05-21
US11442786B2 (en) 2022-09-13
US11442785B2 (en) 2022-09-13
US20200183752A1 (en) 2020-06-11
EP3624020A4 (en) 2021-05-05

Similar Documents

Publication Publication Date Title
WO2019218896A1 (zh) 计算方法以及相关产品
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
KR102354718B1 (ko) 계산 장치 및 방법
CN110163362B (zh) 一种计算装置及方法
CN111488976B (zh) 神经网络计算装置、神经网络计算方法及相关产品
CN110059797B (zh) 一种计算装置及相关产品
CN111930681B (zh) 一种计算装置及相关产品
CN110059809B (zh) 一种计算装置及相关产品
US20200242455A1 (en) Neural network computation device and method
CN111368967B (zh) 一种神经网络计算装置和方法
CN109740730B (zh) 运算方法、装置及相关产品
CN111178492B (zh) 计算装置及相关产品、执行人工神经网络模型的计算方法
CN111367567B (zh) 一种神经网络计算装置和方法
CN111368986B (zh) 一种神经网络计算装置和方法
CN111368987B (zh) 一种神经网络计算装置和方法
CN111368990B (zh) 一种神经网络计算装置和方法
CN111198714B (zh) 重训练方法及相关产品
CN111368985B (zh) 一种神经网络计算装置和方法
CN111222632A (zh) 计算装置、计算方法及相关产品
CN111291884A (zh) 神经网络剪枝方法、装置、电子设备及计算机可读介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019803375

Country of ref document: EP

Effective date: 20191209

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19803375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE