CN111368986B - Neural network computing device and method - Google Patents


Info

Publication number
CN111368986B
CN111368986B (application CN201811592246.7A)
Authority
CN
China
Prior art keywords: data, ith layer, fixed point, gradient, unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811592246.7A
Other languages
Chinese (zh)
Other versions
CN111368986A (en)
Inventor
Name not published at inventor's request (请求不公布姓名)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201811592246.7A
Publication of CN111368986A
Application granted
Publication of CN111368986B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a neural network computing device and method, where the device is used to perform artificial neural network training operations; the neural network training operation comprises a multi-layer neural network training operation. The technical solution provided by the application has the advantages of low cost and low energy consumption.

Description

Neural network computing device and method
Technical Field
The present application relates generally to artificial neural networks, and in particular to a neural network computing device and method.
Background
A neural network, also called an artificial neural network, is widely applied in fields such as pattern recognition, image processing, function approximation, and optimization. In recent years, multilayer artificial networks have attracted increasing attention from academia and industry because of their high recognition accuracy and good parallelism. An artificial neural network involves a variety of algorithms; among them, the fully connected layer is an important algorithm that is widely used in various artificial neural network models.
Existing neural network operations are performed on general-purpose processors, which support only floating-point data operations. Neural network operations, and in particular the more complex ones, involve a large amount of computation and place high demands on memory. Because existing neural network operations are based on floating-point data and therefore require substantial memory, existing schemes are high in energy consumption and cost.
Disclosure of Invention
One aspect of the present application provides a neural network computing device and method. The device and method use fixed-point data to perform operations; compared with floating-point data, fixed-point data saves memory and reduces the amount of computation, so the device and method have the advantages of reduced energy consumption and cost.
In one aspect, a neural network computing device is provided for performing an artificial neural network training operation; the neural network training operation comprises a multi-layer neural network training operation, the multi-layer training operation comprises at least an ith layer, at least part of the data in the forward operation or the reverse operation of the ith layer is fixed-point data, and i is an integer greater than or equal to 1; the computing device comprises a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit; the ith layer training operation comprises an ith layer forward operation and an ith layer reverse operation;
the ith layer of forward operation comprises:
the controller unit is used for acquiring input neuron data of the ith layer, weight data of the ith layer and a forward calculation instruction of the ith layer;
the controller unit is also used for parsing the ith layer forward calculation instruction to obtain a plurality of forward operation instructions, sending the ith layer input neuron data and the ith layer weight data to the conversion unit, and sending the plurality of forward operation instructions to the operation unit;
a conversion unit, configured to perform floating point type and fixed point type conversion on all or part of the i-th layer input neuron data and i-th layer weight data to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to an arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the conversion unit is further configured to perform the conversion between the floating-point type and the fixed-point type according to float = int × scale × 2^point, where float is the floating-point value, int is the fixed-point data value, scale is the fixed-point scaling value, and point is the decimal point position value;
the arithmetic unit is used for executing fixed point operation on all fixed point data or executing mixed operation on mixed data according to a plurality of forward operation instructions to obtain a forward output result of the ith layer;
the ith layer of inverse operation comprises:
the controller unit is used for acquiring input neuron data of the ith layer, weight data of the ith layer, input neuron gradient of the ith layer and a backward calculation instruction of the ith layer;
the controller unit is also used for parsing the ith layer reverse calculation instruction to obtain a plurality of reverse operation instructions, sending the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to the conversion unit, and sending the plurality of reverse operation instructions to the operation unit;
a conversion unit, configured to perform floating point type and fixed point type conversion on all or part of the ith layer of input neuron data, the ith layer of weight data, and the ith layer of input neuron gradient to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to an arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the operation unit is used for performing fixed-point operations on all fixed-point data or mixed operations on mixed data according to the plurality of reverse operation instructions to obtain the ith layer weight gradient and the ith layer output result gradient, and for updating the ith layer weight using the ith layer weight gradient;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
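For illustration only (the following sketch is not part of the patent text), the forward data flow described above can be modeled in a few lines of Python; the ConversionUnit and OperationUnit names, the 2^point quantization with scale = 1, and the simple matrix-vector layer are assumptions made for the example.

    import numpy as np

    class ConversionUnit:
        # turns floating-point neuron/weight data into fixed-point data (scale = 1 assumed)
        def __init__(self, point=-8):
            self.point = point
        def to_fixed(self, data):
            return np.round(np.asarray(data) / 2.0 ** self.point).astype(np.int32)

    class OperationUnit:
        # executes the fixed-point operation described by the forward operation instructions
        def fixed_matvec(self, w_fix, x_fix, point):
            acc = w_fix.astype(np.int64) @ x_fix.astype(np.int64)  # fixed-point multiply-accumulate
            return acc * 2.0 ** (2 * point)                        # rescale back to floating point

    conv, op = ConversionUnit(), OperationUnit()
    x = np.array([0.5, -1.0])
    w = np.array([[1.0, 2.0], [0.25, -0.5]])
    print(op.fixed_matvec(conv.to_fixed(w), conv.to_fixed(x), conv.point))  # approximately w @ x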
In another aspect, a neural network training method is provided, the method being used for a neural network computing device; the neural network training operation comprises a multi-layer neural network training operation, the multi-layer training operation comprises at least an ith layer, at least part of the data in the forward operation or the reverse operation of the ith layer is fixed-point data, and i is an integer greater than or equal to 1; the computing device comprises a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit; the ith layer training operation comprises an ith layer forward operation and an ith layer reverse operation;
the ith layer of forward operation comprises:
the controller unit acquires input neuron data of an ith layer, weight data of the ith layer and a forward calculation instruction of the ith layer; analyzing the ith layer of calculation instruction to obtain a plurality of forward operation instructions, sending the ith layer of input neuron data and the ith layer of weight data to a conversion unit, and sending the plurality of operation instructions to an operation unit;
the conversion unit performs floating point type and fixed point type conversion on the ith layer input neuron data and all or part of the ith layer weight data to obtain all fixed point data or mixed data, and sends all the fixed point data or the mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
the arithmetic unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of forward operation instructions to obtain a forward output result of the ith layer;
the ith layer of inverse operations include:
the controller unit acquires the ith layer input neuron data, the ith layer weight data, the ith layer input neuron gradient and the ith layer reverse calculation instruction; parses the ith layer reverse calculation instruction to obtain a plurality of reverse operation instructions; sends the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to the conversion unit; and sends the plurality of reverse operation instructions to the operation unit;
the conversion unit performs floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to obtain all fixed point data or mixed data, and sends the all fixed point data or mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
performing the conversion between the floating-point type and the fixed-point type specifically includes:
the conversion unit performs the conversion between the floating-point type and the fixed-point type according to float = int × scale × 2^point, where float is the floating-point value, int is the fixed-point data value, scale is the fixed-point scaling value, and point is the decimal point position value;
the arithmetic unit executes fixed-point operations on all fixed-point data or mixed operations on mixed data according to the plurality of reverse operation instructions to obtain the ith layer weight gradient and the ith layer output result gradient, and updates the ith layer weight using the ith layer weight gradient;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
In yet another aspect, a neural network training arithmetic device is provided, where the neural network training arithmetic device includes one or more computing devices of the first aspect, and is configured to obtain data to be operated and control information from other processing devices, execute a specified operation, and transmit an execution result to the other processing devices through an I/O interface;
when the neural network training arithmetic device comprises a plurality of computing devices, the computing devices can be connected through a specific structure and transmit data;
the computing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale neural network training operations; the plurality of computing devices may share the same control system or have their own control systems, may share memory or have their own memories, and may be interconnected in any interconnection topology.
In a further aspect, a combined processing device is provided, which includes a neural network training arithmetic device of the further aspect, a universal interconnection interface and other processing devices;
and the neural network training operation device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
In a next aspect, a neural network chip is provided that includes the computing device of the first aspect, the neural network training arithmetic device of the preceding aspect, or the combined processing device of the preceding aspect.
Further, an electronic device comprising the chip of claim 24 is also provided.
Finally, a board card is provided, the board card comprising: memory device, interface device and control device and the above-mentioned neural network chip;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
Drawings
For a more complete understanding of the present application and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates an example block diagram of an overall architecture of a neural network computing device in accordance with an embodiment of the present application.
Fig. 2 schematically shows a structural diagram of another neural network computing device according to an embodiment of the present application.
Fig. 2a schematically shows a schematic structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 2b schematically shows another structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 2c schematically shows a transmission diagram of a tree module according to an embodiment of the present application.
Fig. 2d schematically shows a receiving schematic diagram of a tree module according to an embodiment of the present application.
Fig. 3 schematically shows another structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 4 schematically shows a structural diagram of a combined processing device according to an embodiment of the present application.
Fig. 5 schematically illustrates a structural diagram of a board card according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The electronic devices may include various handheld devices having wireless communication functions, in-vehicle devices, wireless headsets, computing devices or other processing devices connected to wireless modems, as well as various forms of User Equipment (UE), mobile Stations (MS), terminal devices (terminal device), and the like, and may be, for example, smart phones, tablets, earphone boxes, and the like. For convenience of description, the above-mentioned apparatuses are collectively referred to as electronic apparatuses or electronic devices.
The electronic device or the electronic apparatus described above may be applied in the following (including but not limited to) scenarios: the system comprises various electronic products such as a data processing device, a robot, a computer, a printer, a scanner, a telephone, a tablet computer, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device and a wearable device; various vehicles such as airplanes, ships, vehicles, and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
The following describes embodiments of the present application in detail.
First, a computing device as used herein is described. Referring to fig. 1, a neural network computing device is provided, where the computing device is configured to perform a neural network training calculation, the neural network training calculation includes a neural network multi-layer training calculation, the multi-layer training calculation includes at least an ith layer, at least part of data in a forward operation or a reverse operation of the ith layer is a fixed point data operation, and i is an integer greater than or equal to 1; the computing device includes: a controller unit 11, an arithmetic unit 12 and a conversion unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12 and the conversion unit 13 (the conversion unit can be arranged independently or integrated in the controller unit or the arithmetic unit); the ith layer of training operation comprises the ith layer of forward operation and the ith layer of reverse operation;
the ith layer forward operation may include:
the controller unit 11 is configured to obtain input neuron data of an ith layer, ith layer weight data, and an ith layer forward calculation instruction; in an alternative, the input neuron data and the calculation instruction may be obtained through a data input/output unit, where the data input/output unit may be one or more data I/O interfaces or I/O pins; and the data input and output unit is used for reading input neuron data or forward computing instructions from external equipment or an external memory.
The forward computing instruction includes, but is not limited to: convolution operation instructions, matrix multiplication instructions, vector multiplication instructions, activation instructions, etc., and the specific embodiments of the present application do not limit the specific representation form or the specific category of the forward calculation instructions.
The controller unit 11 is further configured to analyze the ith layer calculation instruction to obtain a plurality of forward calculation instructions, send the ith layer input neuron data and the ith layer weight data to the conversion unit 13, and send the plurality of calculation instructions to the calculation unit 12;
a conversion unit 13, configured to perform floating point type and fixed point type conversion on all or part of the i-th layer input neuron data and the i-th layer weight data to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to the operation unit, where the mixed data includes: partial fixed point data and partial floating point data;
and the arithmetic unit 12 is used for performing fixed-point operation on all fixed-point data or performing mixed operation on mixed data according to a plurality of forward operation instructions to obtain a forward output result of the ith layer.
The ith layer of inverse operations may include:
the controller unit 11 is configured to obtain input neuron data of an ith layer, ith layer weight data, ith layer input neuron gradient, and an ith layer inverse computation instruction; in an alternative, the manner of acquiring input neuron data and calculating an instruction may be obtained by a data input/output unit, which may be one or more data I/O interfaces or I/O pins; and the data input and output unit is used for reading input neuron data or a reverse calculation instruction from an external device or an external memory.
The above-mentioned reverse calculation instruction includes but is not limited to: matrix multiply instructions, vector multiply instructions, etc., and the embodiments of the present application do not limit the particular representation or the particular class of the above-described inverse compute instructions.
The controller unit 11 is further configured to analyze the ith layer calculation instruction to obtain a plurality of inverse calculation instructions, send the ith layer input neuron data, the ith layer weight data, and the ith layer input neuron gradient to the conversion unit 13, and send the plurality of calculation instructions to the calculation unit 12;
a conversion unit 13, configured to perform floating point type and fixed point type conversion on all or part of the ith layer of input neuron data, ith layer of weight data, and ith layer of input neuron gradient to obtain all fixed point data or mixed data, and send all fixed point data or mixed data to the arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
a conversion unit 13, specifically configured to perform the conversion between the floating-point type and the fixed-point type according to float = int × scale × 2^point, where float is the floating-point value, int is the fixed-point data value, scale is the fixed-point scaling value, and point is the decimal point position value;
Optionally, an offset may be added to the above formula, specifically float = int × scale × 2^point - offset, where offset is an offset value used to express the deviation of int × scale × 2^point from float.
The decimal point position value point is determined from the fixed-point bit width and maxabs according to a formula that the original presents only as an image (not reproduced here), where width is the bit width of the fixed-point number.
Scale = δ × 2^point / maxabs;
δ is an empirical value (set by the manufacturer); it is an integer less than the maximum value representable at the fixed-point bit width. For example, when the fixed-point bit width is 8 bits, that maximum value is 255.
maxabs is the maximum absolute value among the floating-point data to be converted, that is, the maximum absolute value among the elements of the ith layer input neuron data and the ith layer weight data. point is thus the smallest decimal point position for which the fixed-point number can represent a maximum value greater than maxabs.
For example, if width = 8 and maxabs (the maximum absolute value of a set of numbers) = 2.9, then the point of that set of numbers can be calculated as -4. With point = -4, a floating-point value float = 1.3 corresponds approximately to the fixed-point value int = 21.
For a known point and width, the conversion between a floating-point number and a fixed-point number is given by a formula that the original presents only as an image (not reproduced here).
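For illustration only, a small Python sketch of the conversion just described. Because the point formula appears only as an image in the original, the choice point = ceil(log2(maxabs)) - (width - 2) below is an assumption reconstructed to match the worked example (width = 8, maxabs = 2.9 gives point = -4), and scale is taken as 1.

    import math

    def choose_point(maxabs, width=8):
        # Assumed reconstruction of the point formula (the original shows it only as an image):
        # with width = 8 and maxabs = 2.9 this yields -4, as in the worked example above.
        return math.ceil(math.log2(maxabs)) - (width - 2)

    def float_to_fixed(value, point, width=8, scale=1.0):
        # float = int * scale * 2**point  =>  int = round(float / (scale * 2**point)),
        # clipped to the signed range of the given bit width.
        max_int = 2 ** (width - 1) - 1
        int_val = round(value / (scale * 2 ** point))
        return max(-max_int - 1, min(max_int, int_val))

    def fixed_to_float(int_val, point, scale=1.0):
        return int_val * scale * 2 ** point

    p = choose_point(2.9)                 # -4
    print(p, float_to_fixed(1.3, p))      # -4 21
    print(fixed_to_float(21, p))          # 1.3125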
The operation unit 12 is configured to perform fixed-point operations on all fixed-point data or mixed operations on mixed data according to the plurality of reverse operation instructions to obtain the ith layer weight gradient and the ith layer output result gradient, and to update the ith layer weight using the ith layer weight gradient.
The blending operation includes: performing fixed point operations on portions of fixed point data and floating point operations on portions of floating point data.
The technical solution provided by the application includes the conversion unit: when the ith layer training operation of the neural network is executed, all or part of the input neuron data, weight data and input neuron gradients can be converted into fixed-point data or mixed data. Because fixed-point data occupies less storage space than floating-point data, training of the neural network can be achieved with a smaller memory space.
The training operation in neural network training may be the training operation of a single layer of the neural network, i.e., the ith layer; the training operations of the other layers may use a conventional training method or a method similar to that of the ith layer described in this application. In the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit (i.e., the forward output result) as the input neuron of the next layer (or performs certain operations, such as an activation operation, on the output neuron before using it as the input neuron of the next layer), and at the same time replaces the weight of the previous layer with the weight of the next layer. In the reverse operation, after the reverse operation of the next layer of the artificial neural network is completed, the operation instruction of the previous layer takes the output neuron gradient calculated in the operation unit (i.e., the output result gradient) as the input neuron gradient of the previous layer (or performs certain operations on the output neuron gradient before using it as the input neuron gradient of the previous layer), and at the same time replaces the weight and input neuron data with those of the forward operation of the previous layer.
For an artificial neural network operation with multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and output layer of the whole network. For any two adjacent layers, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons. Taking a convolutional neural network as an example, suppose the network has L layers; for K = 1, 2, ..., L-1, the Kth layer is called the input layer, whose neurons are the input neurons, and the (K+1)th layer is called the output layer, whose neurons are the output neurons. That is, every layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the converting unit 13 is specifically configured to convert a part of the i-th layer input neuron data into a part of fixed point input neuron data and convert a part of the i-th layer weight data into a part of fixed point weight data; sending part of the fixed point input neuron data and part of the fixed point weight data to an arithmetic unit, and sending part of the input neuron data (the residual floating point data which is not subjected to floating point and fixed point conversion) and part of the weight data (the residual floating point data which is not subjected to floating point and fixed point conversion) to the arithmetic unit;
the operation unit is specifically used for executing fixed point data operation on part of fixed point input neuron data and part of fixed point weight data to obtain part of fixed point forward output results, sending the part of fixed point forward output results to the conversion unit,
the conversion unit is specifically used for performing fixed point and floating point conversion on the part of fixed point forward output results to obtain a first part of floating point forward output results and sending the first part of floating point forward output results to the arithmetic unit;
and the operation unit is specifically used for performing operation (floating point operation) on part of the input neuron data and part of the weight data to obtain a second part of floating point forward operation results, and combining the first part of floating point forward operation results and the second part of floating point forward operation results to obtain the ith layer of forward output results.
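As a hedged illustration of the mixed forward operation just described (not the patent's implementation), the sketch below splits the input features of a simple matrix-vector layer with a boolean mask, processes the masked part in fixed point and the rest in floating point, and combines the two partial results; the mask-based split, scale = 1, and the additive combination are assumptions of the example.

    import numpy as np

    def mixed_forward(neurons, weights, fixed_mask, point=-8):
        # Fixed-point part: quantize, multiply-accumulate in integers, rescale back to float.
        n_fix = np.round(neurons[fixed_mask] / 2.0 ** point).astype(np.int64)
        w_fix = np.round(weights[:, fixed_mask] / 2.0 ** point).astype(np.int64)
        part_fixed = (w_fix @ n_fix) * 2.0 ** (2 * point)
        # Floating-point part: computed directly on the remaining data.
        part_float = weights[:, ~fixed_mask] @ neurons[~fixed_mask]
        return part_fixed + part_float   # combined i-th layer forward output result

    rng = np.random.default_rng(0)
    x, w = rng.normal(size=4), rng.normal(size=(3, 4))
    mask = np.array([True, True, False, False])
    print(mixed_forward(x, w, mask))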
Optionally, the converting unit 13 is specifically configured to convert a part of the ith layer of input neuron data into a part of fixed point input neuron data, convert a part of the ith layer of weight data into a part of fixed point weight data, and convert the ith layer of input neuron gradient into a part of fixed point input neuron gradient; sending part of fixed point input neuron data, part of fixed point input neuron gradient and part of fixed point weight data to an arithmetic unit, and sending part of input neuron data (residual floating point data without floating point and fixed point conversion), part of input neuron gradient and part of weight data (residual floating point data without floating point and fixed point conversion) to the arithmetic unit;
the operation unit is specifically used for executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point input data to obtain part of ith layer weight gradients, executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point weight data to obtain part of ith layer output result gradients, and sending part of ith layer weight gradients and part of ith layer output result gradients to the conversion unit,
the conversion unit is specifically used for performing fixed point and floating point conversion on the part of the ith layer weight gradient and the part of the ith layer output result gradient to obtain a first part of the ith layer weight gradient and a first part of the ith layer output result gradient, and sending the first part of the ith layer weight gradient and the first part of the ith layer output result gradient to the operation unit;
and the operation unit is specifically used for performing operation (floating point) on part of input neuron gradients and part of input data to obtain an ith layer weight gradient of a second part, performing operation on part of input neuron gradients and part of weight data to obtain an ith layer output result gradient of the second part, combining the ith layer weight gradient of the first part and the ith layer weight gradient of the second part to obtain an ith layer weight gradient, and combining the ith layer output result gradient of the first part and the ith layer output result gradient of the second part to obtain an ith layer output result gradient.
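Similarly, for illustration only, a sketch of the mixed reverse operation for a simple linear layer; the backward formulas dw = outer(in_grad, x) and out_grad = w^T · in_grad, the mask-based split, and scale = 1 are assumptions, since the patent text does not spell these formulas out.

    import numpy as np

    def mixed_reverse(x, w, in_grad, fixed_mask, point=-8):
        def q(a):
            return np.round(a / 2.0 ** point).astype(np.int64)   # float -> fixed, scale = 1

        rescale = 2.0 ** (2 * point)
        # fixed-point partial results (rescaled back to float) plus floating-point partial results
        dw = np.outer(q(in_grad), q(x * fixed_mask)) * rescale + np.outer(in_grad, x * ~fixed_mask)
        out_grad = (q(w * fixed_mask).T @ q(in_grad)) * rescale + (w * ~fixed_mask).T @ in_grad
        return dw, out_grad   # i-th layer weight gradient, i-th layer output result gradient

    x = np.array([0.5, -1.0, 2.0])
    w = np.array([[1.0, 0.5, -0.25], [0.75, -0.5, 1.5]])
    g = np.array([0.2, -0.1])
    mask = np.array([True, False, True])
    dw, out_grad = mixed_reverse(x, w, g, mask)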
Optionally, the method for obtaining the gradient of the i-th layer of input neurons specifically may include:
the ith layer input neuron gradient = f′ × the (i+1)th layer output result gradient;
where f' is the derivative of the activation function f.
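A small worked example of this relationship, for illustration only (sigmoid is chosen here as the activation f; the patent lists it among several possible activation functions):

    import numpy as np

    def ith_layer_input_neuron_gradient(z, next_layer_output_gradient):
        # i-th layer input neuron gradient = f'(z) * (i+1)-th layer output result gradient,
        # with f = sigmoid, so f'(z) = f(z) * (1 - f(z)).
        fz = 1.0 / (1.0 + np.exp(-z))
        return fz * (1.0 - fz) * next_layer_output_gradient

    print(ith_layer_input_neuron_gradient(np.array([0.0, 2.0]), np.array([1.0, -0.5])))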
Optionally, referring to fig. 2a, the operation unit may include: a master processing circuit 101 and a plurality of slave processing circuits 102, wherein,
a master processing circuit 101, configured to perform preliminary processing on data (including one or any combination of input neuron data, weight data and input neuron gradients; the data may be fixed-point or floating-point data) and to transmit data and operation instructions to and from the plurality of slave processing circuits;
a plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to data (fixed-point data or floating-point data) and an operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to obtain an ith layer forward output result, an ith layer output result gradient, and an ith layer weight gradient according to the plurality of intermediate results, and update an ith layer weight according to the ith layer weight gradient.
Optionally, the activation function f is any one of nonlinear functions sigmoid, tanh, relu and softmax or a linear function;
the operation instruction comprises: CONFIG instruction, COMPUTE instruction, IO instruction, NOP instruction, JUMP instruction, or MOVE instruction.
Optionally, the main processing circuit includes a first storage unit (a neuron cache unit), a first arithmetic unit and a first data dependency relationship determination unit, wherein:
the neuron cache unit is used for caching input data and output data used by the main processing circuit in the calculation process;
a first arithmetic unit for completing various arithmetic functions of the main processing circuit;
the first data dependency relationship determination unit is used for reading the input neuron vector from the first storage unit, sending the neuron vector to the slave processing circuits through the interconnection module, and receiving the intermediate result vector from the interconnection module and sending it to the first arithmetic unit.
Optionally, the first arithmetic unit includes: a vector addition unit and an activation operation unit;
the vector addition unit is used for adding the offset data and the intermediate result in a counterpoint manner to obtain an offset result;
and the activation arithmetic unit is used for executing activation function operation on the bias result.
Optionally, each of the slave processing circuits includes a second arithmetic unit, a second data dependency relationship determination unit, a second storage unit and a third storage unit, wherein:
a second arithmetic unit for performing arithmetic logic operations;
the second data dependency relation judgment unit is used for executing read-write operation on the second storage unit and the third storage unit;
a second storage unit for caching the input neuron vector data and the output neuron value calculated by the slave processing circuit;
and the third storage unit is used for caching the weight vector required by the slave processing circuit in the calculation process.
Optionally, the main computing unit includes: a vector multiplication unit and an accumulation unit;
the vector multiplication unit is used for executing vector multiplication operation in dot product operation;
and the accumulation unit is used for executing accumulation operation in dot product operation.
The process of updating the weight value may include:
the master processing circuit 101 is specifically configured to send the ith layer of input neuron data to each slave processing circuit, transmit the ith layer of input neuron gradient to each slave processing circuit 102, each slave processing circuit 102 multiplies scalar data corresponding to the slave processing circuit in the ith layer of input neuron gradient in _ gradient by the ith layer of input neuron data to obtain an original weight update gradient vector dw _ original of the ith layer of each slave processing circuit, after calculating the original weight update gradient vectors of all layers, the master processing circuit may perform a limiting process on the original weight update gradient in order to limit a gradient range of a weight, and specifically, the master processing circuit is specifically configured to calculate a square of the original weight update gradient of all layers and a sumsq _ diff, then perform a squaring on the sumsq _ diff to obtain l2norm _ diff, and if l2norm _ diff is greater than a clip _ gradient (a set normal number), calculate a scale factor = clip _ factor/2 gradient, and send each original weight update gradient to each slave processing circuit, and each slave processing circuit multiplies the original weight update gradient by the corresponding gradient vector update gradient; and the slave processing circuit is specifically configured to multiply the weight by the weight update gradient dw' to obtain an update weight of each slave processing circuit in the ith layer.
The technical solution provided by the application configures the operation unit as a one-master, multiple-slave structure. For the calculation instructions of the forward operation, data can be split according to the forward calculation instruction, so that the computation-heavy part is executed in parallel by the plurality of slave processing circuits; this increases the operation speed, saves operation time and in turn reduces power consumption. For the reverse operation, data can likewise be split, and the operation speed can similarly be improved.
Optionally, the master processing circuit and each slave processing circuit may include a storage module for storing data of the master processing circuit or the slave processing circuit. It should be noted that the storage module may be shared between the master processing circuit and the slave processing circuits: one or more regions of the master processing circuit's storage module may be designated as a shared region whose storage space can be shared (for reading or writing data) by a plurality of slave processing circuits, and likewise one or more regions of a slave processing circuit's storage module may be designated as a shared region that the master processing circuit can share (for reading or writing data).
This technical solution sets up region sharing of the storage modules. Compared with a fixed storage-module scheme, sharing the storage modules between the interconnected master processing circuit and the plurality of slave processing circuits avoids the problem that computation cannot proceed because the storage region is insufficient. In addition, storage-module sharing can effectively reduce the storage space required in the master processing circuit, which greatly reduces its cost. Furthermore, compared with fetching data from an external device, this scheme reduces the overhead of reading and writing data: if data is read from or written to the outside, it must be forwarded through components such as the controller unit and the conversion unit, so a neural network operation would have to pass through multiple components, with large overhead and energy consumption for data reads and writes. By appropriately arranging shared regions in the master processing circuit and the slave processing circuits, when the space of a storage module of the computing device is insufficient, data need not be stored in an external device but can be kept directly within the computing unit, which greatly reduces the overhead.
Optionally, referring to fig. 2, the computing apparatus may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input neuron data, the weight data, the input neuron gradient and the scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used to read or store data from the storage unit 10.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising: and a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue.
For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in the following table.
Operation code | Register or immediate | Register/immediate | ...
The ellipses in the above table indicate that multiple registers or immediate numbers may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.
[Table 1, presented as images in the original and not reproduced here: the format of the neural network operation instruction, consisting of an operation code and the operation domains register number 0 through register number 4.]
The register may be an off-chip memory; in practical applications it may also be an on-chip memory for storing data. The data may be n-dimensional data, where n is an integer greater than or equal to 1: for example, when n = 1 the data is 1-dimensional data, i.e., a vector; when n = 2 it is 2-dimensional data, i.e., a matrix; and when n is 3 or greater it is a multidimensional tensor.
In an alternative embodiment, referring to fig. 2a, the arithmetic unit 12 may comprise a master processing circuit 101 and a plurality of slave processing circuits 102. In one embodiment, as shown in fig. 2b, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits. As shown in fig. 2b, the k slave processing circuits include only the n slave processing circuits in the 1st row, the n slave processing circuits in the mth row, and the m slave processing circuits in the 1st column; that is, the k slave processing circuits are the slave processing circuits that are directly connected to the master processing circuit.
And the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits.
Alternatively, the above-described conversion unit may be provided in the main processing circuit 101.
The main processing circuit may further include:
an activation processing circuit 111 for performing an activation operation or an activation derivation operation of data in the main processing circuit;
and an addition processing circuit 112 for performing addition operation or accumulation operation.
The master processing circuit is configured to determine that the input neuron data is broadcast data and the weight data is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;
the plurality of slave processing circuits are used for executing operation on the received data blocks according to the operation instruction to obtain intermediate results and transmitting the intermediate results to the main processing circuit;
and the main processing circuit is used for updating the ith layer weight according to the ith layer weight gradient.
The slave processing circuit includes: a multiplication processing circuit;
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
forwarding processing circuitry (optional) for forwarding the received data block or the product result.
And the accumulation processing circuit is used for performing accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.
The following describes a specific calculation method of the computing apparatus shown in fig. 1 by way of a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(∑ w·x_i + b): the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result.
In an alternative embodiment, as shown in fig. 2c, the apparatus may further comprise: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 2c, the tree module is a transmitting function, and as shown in fig. 2d, the tree module is a receiving function.
And the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional structure of the computing device, and may include at least 1 layer of nodes, where the nodes are line structures with forwarding function, and the nodes themselves may not have computing function. If the tree module has zero-level nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 2c, or may have a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment is not limited to the specific value of n, the number of layers may be 2, and the slave processing circuit may be connected to nodes in other layers than the node in the penultimate layer.
Optionally, the main processing circuit in the arithmetic unit may carry a separate cache, and specifically, the method may include: a neuron buffer unit that buffers the input neuron vector data and the output neuron value data of the slave processing circuit. The main processing circuit may further include: and the weight buffer unit is used for buffering weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 3, may include a branch processing circuit 103; the specific connection structure is shown in fig. 3, wherein,
the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;
a branch processing circuit 103 for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
Alternatively, the branch processing circuit 103 may be configured with a storage module, and the storage module may be divided into one or more shared areas; the master processing circuit and the slave processing circuits are specifically configured to perform write or read operations on the data in the shared areas. Arranging the shared areas in the branch processing circuit 103 makes it convenient for the main processing circuit and the slave processing circuits to store data at low storage cost, which saves the capacity of the storage modules of the slave processing circuits and the main processing circuit and thus reduces the cost of the computing device.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f (wx + b), where x is an input neuron matrix, w is a weight matrix, b is a bias scalar, and f is an activation function, and may specifically be: sigmoid function, tanh, relu, softmax function. Here, a binary tree structure is assumed, and there are 8 slave processing circuits, and the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrixes, then distributes the 8 sub-matrixes to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit executes multiplication and accumulation operation of the 8 sub-matrixes and the input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
and the main processing circuit is used for sequencing the 8 intermediate results to obtain a wx operation result, executing the offset b operation on the operation result, executing the activation operation to obtain a final result y, sending the final result y to the controller unit, and outputting or storing the final result y into the storage unit by the controller unit.
A specific way of arranging the 8 intermediate results to obtain the wx operation result is as follows: for matrix-times-matrix, the partial elements of the input neuron matrix x that correspond to each of the 8 sub-matrices are determined; the minimum row index of that sub-matrix and the minimum column index of those partial elements are extracted, and this pair of minimum row and column indices gives the position of the corresponding intermediate result within the operation result.
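A minimal sketch of this 8-way fully connected flow, for illustration only; the row-wise split of w, the tanh activation, and the use of NumPy as a stand-in for the tree module and the master/slave circuits are assumptions of the example.

    import numpy as np

    def fully_connected_8_slaves(x, w, b, f=np.tanh):
        sub_ws = np.array_split(w, 8, axis=0)             # master distributes w as 8 sub-matrices
        intermediates = [sub_w @ x for sub_w in sub_ws]   # each "slave": multiply-accumulate with broadcast x
        wx = np.concatenate(intermediates, axis=0)        # master arranges the 8 intermediate results
        return f(wx + b)                                  # bias, then activation -> final result y

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)
    w = rng.normal(size=(24, 16))
    print(fully_connected_8_slaves(x, w, b=0.1).shape)    # (24,)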
The method for executing the neural network forward operation instruction by the computing device shown in fig. 1 may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, converts input data Xi into fixed point input data Xi, converts weight data into fixed point weight data, determines the fixed point input data Xi as broadcast data, determines the fixed point weight data as distribution data, and splits the fixed point weight w into n fixed point data blocks;
the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit, the master processing circuit sends the multiplication instruction and the input data Xi to a plurality of slave processing circuits in a broadcasting mode, and distributes the n fixed-point data blocks to the plurality of slave processing circuits (for example, if n slave processing circuits are provided, each slave processing circuit sends one data block); the plurality of slave processing circuits are used for executing fixed-point multiplication operation on the fixed-point input data Xi and the received fixed-point data block according to the multiplication instruction to obtain a fixed-point intermediate result, sending the fixed-point intermediate result to the master processing circuit, executing accumulation operation on the intermediate results sent by the plurality of slave processing circuits according to the accumulation instruction to obtain an accumulation result, converting the accumulation result into a floating-point accumulation result, executing offset b on the floating-point accumulation result according to the offset instruction to obtain a final result, and sending the final result to the controller unit.
According to the technical scheme, the multiplication and offset operations of the neural network are implemented by a single instruction, namely the neural network operation instruction, so the intermediate results of the neural network calculation do not need to be stored or extracted, which reduces the storage and extraction of intermediate data; the scheme therefore has the advantages of reducing the corresponding operation steps and improving the computational efficiency of the neural network.
The application also discloses a neural network arithmetic device which comprises one or more computing devices mentioned in the application, and which is used for acquiring data to be operated on and control information from other processing devices, executing the specified neural network training calculation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case the computing devices may share the same control system or have separate control systems, and may share memory or have a separate memory for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The neural network arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the neural network arithmetic device, a universal interconnection interface and other processing devices. The neural network arithmetic device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 4 is a schematic view of the combined processing device.
Other processing devices include one or more of general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network arithmetic device and external data and control, including data transfer, and complete basic control such as starting and stopping of the neural network arithmetic device; the other processing devices can also cooperate with the neural network arithmetic device to complete the operation task.
And the universal interconnection interface is used for transmitting data and control instructions between the neural network arithmetic device and other processing devices. The neural network arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the neural network arithmetic device chip; control instructions can be obtained from other processing devices and written into a control cache on a neural network arithmetic device chip; the data in the storage module of the neural network arithmetic device can also be read and transmitted to other processing devices.
Optionally, as shown in fig. 4, the structure may further include a storage device, and the storage device is connected to the neural network operation device and the other processing device, respectively. The storage device is used for storing data in the neural network arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the local machine learning arithmetic device or the other processing device.
The combined processing device can serve as an SOC (system on chip) for equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card or Wi-Fi interface.
In some embodiments, a chip including the above neural network operation device or the combined processing device is also provided.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 5, fig. 5 provides a board card that may include other supporting components in addition to the chip 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of memory cells is connected with the chip through a bus. It is understood that each group of memory cells may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of memory cells. Each group of memory cells may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
In one embodiment, each group of memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and is used for controlling the data transmission and data storage of each memory cell.
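As a rough sanity check on the 25600 MB/s figure quoted above (an illustrative calculation, not taken from the patent text): DDR4-3200 performs 3200 mega-transfers per second, and each transfer over the 64-bit data portion carries 8 bytes.

```python
# Theoretical DDR4-3200 bandwidth over a 64-bit data path (illustrative check)
transfers_per_second = 3200 * 10**6   # 3200 MT/s
bytes_per_transfer = 64 // 8          # 64 data bits = 8 bytes per transfer
bandwidth_mb_per_s = transfers_per_second * bytes_per_transfer / 10**6
print(bandwidth_mb_per_s)             # 25600.0
```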
The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted to the chip by the server through the standard PCIE interface, thereby realizing the data transfer. Preferably, when a PCIe 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the concrete form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single chip microcomputer (MCU). Since the chip may include a plurality of processing chips, a plurality of processing cores or a plurality of processing circuits, it can drive a plurality of loads and can therefore be in different working states such as multi-load and light-load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one physical part, or may be distributed on a plurality of physical parts. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer readable memory, which may include: a flash disk, a read-only memory (ROM), a Random Access Memory (RAM), or an optical disk.
The foregoing detailed description of the embodiments of the present application has been presented, and specific examples have been applied in the present application to explain the principles and implementations of the present application, and the above description of the embodiments is only used to help understand the method and the core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (30)

1. A neural network computing device, the device being arranged to perform an artificial neural network training calculation; the neural network training operation comprises a neural network multi-layer training operation, the multi-layer training operation comprises at least one ith layer, at least part of data in the forward operation or the reverse operation of the ith layer is fixed point data operation, and i is an integer greater than or equal to 1; the computing device includes: the device comprises a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit; the ith layer of training operation comprises the ith layer of forward operation and the ith layer of reverse operation;
the controller unit is used for acquiring input neuron data of the ith layer, weight data of the ith layer and a forward calculation instruction of the ith layer;
the controller unit is also used for analyzing the ith layer of calculation instructions to obtain a plurality of forward calculation instructions, sending the ith layer of input neuron data and the ith layer of weight data to the conversion unit, and sending the plurality of calculation instructions to the calculation unit;
a conversion unit, configured to perform floating point type and fixed point type conversion on all or part of the i-th layer input neuron data and i-th layer weight data to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to an arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the conversion unit is further configured to perform the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point, wherein float is the floating point value, int is the fixed point data value, scale is the fixed point scaling value, and point is the decimal point position value;
the arithmetic unit is used for executing fixed point operation on all fixed point data or executing mixed operation on mixed data according to a plurality of forward operation instructions to obtain a forward output result of the ith layer;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
2. The apparatus of claim 1, further comprising:
the controller unit is also used for acquiring input neuron data of the ith layer, weight data of the ith layer, input neuron gradient of the ith layer and a backward calculation instruction of the ith layer;
the controller unit is also used for analyzing the ith layer of calculation instruction to obtain a plurality of reverse calculation instructions, sending the ith layer of input neuron data, the ith layer of weight data and the ith layer of input neuron gradient to the conversion unit, and sending the plurality of calculation instructions to the calculation unit;
the conversion unit is further configured to perform floating point type and fixed point type conversion on all or part of the ith layer of input neuron data, the ith layer of weight data, and the ith layer of input neuron gradient to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to the arithmetic unit, where the mixed data includes: part of fixed point data and part of floating point data;
the operation unit is also used for executing fixed point operation on all fixed point data or executing mixed operation on mixed data according to a plurality of forward operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer; and updating by adopting the weight gradient of the ith layer and the weight of the ith layer.
3. The apparatus of claim 1,
the conversion unit is specifically used for converting part of the ith layer of input neuron data into partial fixed point input neuron data and converting part of the ith layer of weight data into partial fixed point weight data; sending part of the fixed point input neuron data and part of the fixed point weight data to the arithmetic unit, and sending part of the input neuron data and part of the weight data to the arithmetic unit;
the operation unit is specifically used for executing fixed point data operation on part of fixed point input neuron data and part of fixed point weight data to obtain part of fixed point forward output results, sending the part of fixed point forward output results to the conversion unit,
the conversion unit is specifically used for performing fixed point and floating point conversion on the part of fixed point forward output results to obtain a first part of floating point forward output results, and sending the first part of floating point forward output results to the arithmetic unit;
and the operation unit is specifically used for executing operation on part of the input neuron data and part of the weight data to obtain a second part of floating point forward operation results, and combining the first part of floating point forward operation results and the second part of floating point forward operation results to obtain an ith layer of forward output results.
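An illustrative sketch of the mixed operation recited in claim 3 (the half/half split of the weight rows, the quantization parameters and the helper names are assumptions made for the example, not claim language): one part of the data is processed in fixed point and converted back to floating point, the other part is processed directly in floating point, and the two partial results are combined.

```python
import numpy as np

def mixed_forward(x, w, point=-8):
    half = w.shape[0] // 2                             # illustrative split of the weight rows
    w_fx = np.round(w[:half] / 2.0 ** point)           # partial fixed point weight data
    x_fx = np.round(x / 2.0 ** point)                  # partial fixed point input neuron data
    part_fixed = (w_fx @ x_fx) * (2.0 ** point) ** 2   # fixed point partial result, converted to float
    part_float = w[half:] @ x                          # floating point partial result
    return np.concatenate([part_fixed, part_float])    # combined ith layer forward output result
```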
4. The computing device of any of claims 1-3,
the conversion unit is specifically configured to convert a part of the ith layer of input neuron data into partial fixed point input neuron data, convert a part of the ith layer of weight data into partial fixed point weight data, and convert the ith layer of input neuron gradient into partial fixed point input neuron gradient; sending part of fixed point input neuron data, part of fixed point input neuron gradient and part of fixed point weight data to an operation unit, and sending part of input neuron data, part of input neuron gradient and part of weight data to the operation unit;
the operation unit is specifically used for executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point input data to obtain part of ith layer weight gradients, executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point weight data to obtain part of ith layer output result gradients, and sending part of ith layer weight gradients and part of ith layer output result gradients to the conversion unit,
the conversion unit is specifically used for performing fixed-point and floating-point conversion on the part of the ith layer weight gradient and the part of the ith layer output result gradient to obtain a first part of the ith layer weight gradient and a first part of the ith layer output result gradient, and sending the first part of the ith layer weight gradient and the first part of the ith layer output result gradient to the operation unit;
and the operation unit is specifically used for performing operation on part of input neuron gradients and part of input data to obtain a second part ith layer weight gradient, performing operation on part of input neuron gradients and part of weight data to obtain a second part ith layer output result gradient, combining the first part ith layer weight gradient and the second part ith layer weight gradient to obtain an ith layer weight gradient, and combining the first part ith layer output result gradient and the second part ith layer output result gradient to obtain an ith layer output result gradient.
5. The computing device of claim 1,
the conversion unit is specifically configured to convert the data into a digital signal according to float = int scale 2 point -the offset performs a conversion of the floating point type to the fixed point type, wherein the offset is an offset value.
6. The apparatus of claim 1, wherein the method for obtaining the i-th layer input neuron gradient specifically comprises:
the controller unit is specifically used for receiving the output result gradient of the (i + 1) th layer and sending the output result gradient of the (i + 1) th layer to the arithmetic unit;
the operation unit is specifically used for obtaining the gradient of the input neuron at the ith layer according to the gradient of the output result at the (i + 1) th layer;
the ith layer input neuron gradient = f′ × the (i + 1)th layer output result gradient;
where f' is the derivative of the activation function f.
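A minimal sketch of the relation stated in claim 6, assuming a sigmoid activation for f (the activation choice and the array shapes are illustrative assumptions):

```python
import numpy as np

def ith_layer_input_neuron_gradient(z_i, next_layer_output_grad):
    """ith layer input neuron gradient = f'(z_i) * (i+1)th layer output result gradient,
    shown here for f = sigmoid, whose derivative is f(z) * (1 - f(z))."""
    fz = 1.0 / (1.0 + np.exp(-z_i))
    return fz * (1.0 - fz) * next_layer_output_grad
```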
8. The apparatus according to claim 1, wherein the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits; wherein:
the main processing circuit is used for performing preamble processing on data and transmitting data and operation instructions with the plurality of slave processing circuits;
the slave processing circuits are used for executing intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results and transmitting the intermediate results to the master processing circuit;
and the main processing circuit is used for obtaining an ith layer forward output result, an ith layer output result gradient and an ith layer weight gradient according to the plurality of intermediate results, and updating the ith layer weight according to the ith layer weight gradient.
8. The apparatus of claim 7,
the master processing circuit is specifically configured to send the ith layer of input neuron data to each slave processing circuit, transmit the ith layer of input neuron gradient to each slave processing circuit, multiply scalar data corresponding to the slave processing circuit and the ith layer of input neuron data in _ gradient by each slave processing circuit to obtain an original weight update gradient vector dw _ original of the ith layer of each slave processing circuit, and multiply the weight of each slave processing circuit by the original weight update gradient vector dw _ original to obtain an update weight of each slave processing circuit.
9. The apparatus of claim 8,
the main processing circuit is specifically configured to calculate a sum of squares of original weight update gradients of all layers and a sumsq _ diff after the original weight update gradient vectors of all layers are calculated, then perform evolution on the sumsq _ diff to obtain an l2norm _ diff, calculate a scale factor scale _ factor = clip _ gradient/l2norm _ diff if the l2norm _ diff is greater than the clip _ gradient, multiply all the original weight update gradients dw _ original by the scale factor scale _ factor respectively to obtain a weight update gradient dw ', and send the update gradient dw' to each slave processing circuit;
and the slave processing circuit is specifically configured to multiply the weight by the weight update gradient dw' to obtain an update weight of each slave processing circuit in the ith layer.
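The scaling recited in claim 9 reads as a form of global gradient-norm clipping; a minimal sketch under that reading (variable names follow the claim, the threshold value and list layout are assumptions):

```python
import numpy as np

def clip_weight_update_gradients(dw_original_list, clip_gradient=1.0):
    """Scale all original weight update gradients so that their global L2 norm
    (l2norm_diff, the square root of sumsq_diff) does not exceed clip_gradient."""
    sumsq_diff = sum(float(np.sum(dw ** 2)) for dw in dw_original_list)
    l2norm_diff = np.sqrt(sumsq_diff)
    if l2norm_diff > clip_gradient:
        scale_factor = clip_gradient / l2norm_diff
        return [dw * scale_factor for dw in dw_original_list]
    return dw_original_list
```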
10. The apparatus of any of claims 7-9, wherein the master processing circuit and the slave processing circuit each comprise a memory module;
the storage module is used for storing data;
the memory module further comprises at least one shared area, and the shared area is used by the main processing circuit or the auxiliary processing circuit in a shared mode.
11. The apparatus according to any one of claims 7-9, wherein the arithmetic unit further comprises: a branch processing circuit;
the branch processing circuit is arranged between the main processing circuit and the plurality of slave processing circuits, and realizes the forwarding of data and operation instructions between the main processing circuit and the plurality of slave processing circuits.
12. The apparatus of claim 11, wherein the branch processing circuit comprises a storage module, the storage module comprises at least one shared area, and the shared area is shared by the main processing circuit and the slave processing circuits.
13. The apparatus according to claim 12, further comprising a tree module, wherein the interconnection module is an n-way tree composed of a plurality of nodes, data of an upstream node of the n-way tree is transmitted to n downstream nodes in the same way, and data returned by the n downstream nodes is merged and transmitted to the upstream node, and n is an integer greater than or equal to 2.
14. The apparatus of claim 6, wherein the activation function f is any one of a nonlinear function sigmoid, tanh, relu, softmax, or a linear function;
the operation instruction comprises: CONFIG instruction, COMPUTE instruction, IO instruction, NOP instruction, JUMP instruction, or MOVE instruction.
15. The apparatus according to any one of claims 7 to 9, wherein the main processing circuit includes a first storage unit, a first arithmetic unit and a first data dependency determination unit, wherein:
the first storage unit, namely a neuron cache unit, is used for caching the input data and output data used by the main processing circuit in the calculation process;
a first arithmetic unit for completing various arithmetic functions of the main processing circuit;
the first data dependency relation judging unit is used for reading the input neuron vectors from the first storage unit and sending the neuron vectors to the slave processing circuit through the interconnection module; and receiving the intermediate result vector of the interconnection module and sending the intermediate result vector to the first arithmetic unit.
16. The apparatus of claim 15, wherein the first arithmetic unit comprises: a vector addition unit and an activation operation unit;
the vector addition unit is used for adding the offset data to the intermediate result element by element to obtain an offset result;
and the activation arithmetic unit is used for executing activation function operation on the bias result.
17. The apparatus according to any one of claims 7 to 9, wherein each slave processing circuit comprises a second arithmetic unit, a second data dependency determination unit, a second storage unit, and a third storage unit, wherein:
a second arithmetic unit for performing arithmetic logic operations;
the second data dependency relation judging unit is used for executing read-write operation on the second storage unit and the third storage unit;
a second storage unit for caching the data of the input neuron vector and the output neuron value calculated by the slave processing circuit;
and the third storage unit is used for caching the weight vector required by the slave processing circuit in the calculation process.
18. The apparatus of claim 17, wherein the second computing unit comprises: a vector multiplication unit and an accumulation unit;
the vector multiplication unit is used for executing vector multiplication operation in dot product operation;
and the accumulation unit is used for executing accumulation operation in dot product operation.
19. A neural network training method, for a neural network computing device; the neural network training operation comprises a neural network multi-layer training operation, the multi-layer training operation comprises at least one ith layer, at least part of data in the forward operation or the reverse operation of the ith layer is fixed point data operation, and i is an integer greater than or equal to 1; the computing device includes: the device comprises a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit; the ith layer of training operation comprises the ith layer of forward operation and the ith layer of reverse operation;
the ith layer of forward operation comprises:
the controller unit acquires input neuron data of an ith layer, weight data of the ith layer and a forward calculation instruction of the ith layer; analyzing the ith layer of calculation instruction to obtain a plurality of forward operation instructions, sending the ith layer of input neuron data and the ith layer of weight data to a conversion unit, and sending the plurality of operation instructions to an operation unit;
the conversion unit performs floating point type and fixed point type conversion on the ith layer input neuron data and all or part of the ith layer weight data to obtain all fixed point data or mixed data, and sends all the fixed point data or the mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
the executing the conversion between the floating point type and the fixed point type specifically comprises:
the conversion unit performs the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point, wherein float is the floating point value, int is the fixed point data value, scale is the fixed point scaling value, and point is the decimal point position value;
the arithmetic unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of forward operation instructions to obtain a forward output result of the ith layer;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
20. The method of claim 19, wherein the i-th layer inverse operation comprises:
the controller unit acquires input neuron data of an ith layer, weight data of the ith layer, gradient of input neurons of the ith layer and a reverse calculation instruction of the ith layer; analyzing the ith layer of calculation instruction to obtain a plurality of reverse calculation instructions, sending the ith layer of input neuron data, the ith layer of weight data and the ith layer of input neuron gradient to a conversion unit, and sending the plurality of calculation instructions to a calculation unit;
the conversion unit performs floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to obtain all fixed point data or mixed data, and sends the all fixed point data or mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
the arithmetic unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of forward operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer; and updating by adopting the weight gradient of the ith layer and the weight of the ith layer.
21. The method of claim 19, wherein the conversion unit performs floating point type and fixed point type conversion on the i-th layer input neuron data and all or part of the i-th layer weight data to obtain all fixed point data or mixed data, and sends all fixed point data and mixed data to the arithmetic unit, and the mixed data comprises: part of fixed point data and part of floating point data; the step of the arithmetic unit performing fixed-point arithmetic on all fixed-point data or performing mixed arithmetic on mixed data according to a plurality of forward arithmetic instructions to obtain the forward output result of the ith layer specifically comprises:
the conversion unit converts part of the ith layer of input neuron data into partial fixed point input neuron data and converts part of the ith layer of weight data into partial fixed point weight data; sending part of the fixed point input neuron data and part of the fixed point weight data to an arithmetic unit, and sending part of the input neuron data and part of the weight data to the arithmetic unit;
the arithmetic unit executes fixed point data operation on part of fixed point input neuron data and part of fixed point weight data to obtain part of fixed point forward output results, and sends the part of fixed point forward output results to the conversion unit,
the conversion unit performs fixed point and floating point conversion on the part of fixed point forward output results to obtain a first part of floating point forward output results, and sends the first part of floating point forward output results to the arithmetic unit;
the arithmetic unit executes operation on part of input neuron data and part of weight data to obtain a second part of floating point forward operation result, and combines the first part of floating point forward operation result and the second part of floating point forward operation result to obtain an ith layer of forward output result.
22. The method of claim 19, wherein the conversion unit performs floating point type and fixed point type conversion on all or part of the i-th layer input neuron data, i-th layer weight data and i-th layer input neuron gradient to obtain all fixed point data or mixed data, and sends all fixed point data and mixed data to the operation unit, and the mixed data comprises: part of fixed point data and part of floating point data; the arithmetic unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of forward operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer; the updating by using the weight gradient of the ith layer and the weight of the ith layer specifically comprises the following steps:
the conversion unit converts part of the ith layer of input neuron data into partial fixed point input neuron data, converts part of the ith layer of weight data into partial fixed point weight data, and converts the ith layer of input neuron gradient into partial fixed point input neuron gradient; sending part of fixed point input neuron data, part of fixed point input neuron gradient and part of fixed point weight data to an operation unit, and sending part of input neuron data, part of input neuron gradient and part of weight data to the operation unit;
the arithmetic unit executes fixed point data operation on part of fixed point input neuron gradients and part of fixed point input data to obtain part of ith layer weight gradients, executes fixed point data operation on part of fixed point input neuron gradients and part of fixed point weight data to obtain part of ith layer output result gradients, and sends part of ith layer weight gradients and part of ith layer output result gradients to the conversion unit,
the conversion unit performs fixed point and floating point conversion on the part of the ith layer weight gradient and the part of the ith layer output result gradient to obtain a first part of the ith layer weight gradient and a first part of the ith layer output result gradient, and sends the first part of the ith layer weight gradient and the first part of the ith layer output result gradient to the operation unit;
the operation unit performs operation on part of input neuron gradients and part of input data to obtain a second part ith layer weight gradient, performs operation on part of input neuron gradients and part of weight data to obtain a second part ith layer output result gradient, combines the first part ith layer weight gradient and the second part ith layer weight gradient to obtain an ith layer weight gradient, and combines the first part ith layer output result gradient and the second part ith layer output result gradient to obtain an ith layer output result gradient.
23. The method according to claim 19, wherein the performing of the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point specifically comprises:
the conversion unit performing the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point − offset, wherein offset is an offset value.
24. A neural network training arithmetic device, wherein the neural network training arithmetic device comprises one or more computing devices according to any one of claims 1 to 18, and is used for acquiring data to be operated and control information from other processing devices, executing specified operations, and transmitting the execution results to other processing devices through an I/O interface;
when the neural network training operation device comprises a plurality of computing devices, the computing devices can be connected through a specific structure and transmit data;
the computing devices are interconnected through a PCIE (Peripheral Component Interconnect Express) bus and transmit data so as to support larger-scale neural network training operations; a plurality of the computing devices share the same control system or have their own control systems; the computing devices share memory or have their own memories; and the interconnection mode of the plurality of computing devices is any interconnection topology.
25. A combined processing device, characterized in that the combined processing device comprises the neural network training arithmetic device as claimed in claim 24, a universal interconnection interface and other processing devices;
and the neural network training operation device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
26. The combined processing device of claim 25, further comprising: and the storage device is respectively connected with the neural network training arithmetic device and the other processing devices and is used for storing the data of the neural network training arithmetic device and the other processing devices.
27. A neural network chip comprising the computing device of claim 1 or the neural network training computational device of claim 24 or the combinatorial processing device of claim 25.
28. An electronic device, characterized in that it comprises a chip according to claim 27.
29. A board card, characterized in that the board card comprises: a memory device, an interface device, a control device and the neural network chip as claimed in claim 27;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
30. The card of claim 29,
the memory device includes: a plurality of groups of memory cells, each group of memory cells is connected with the chip through a bus, and the memory cells are: DDR SDRAM;
the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;
the interface device is as follows: a standard PCIE interface.
CN201811592246.7A 2018-12-25 2018-12-25 Neural network computing device and method Active CN111368986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811592246.7A CN111368986B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811592246.7A CN111368986B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Publications (2)

Publication Number Publication Date
CN111368986A CN111368986A (en) 2020-07-03
CN111368986B true CN111368986B (en) 2023-03-10

Family

ID=71205993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811592246.7A Active CN111368986B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Country Status (1)

Country Link
CN (1) CN111368986B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052939A (en) * 2020-08-19 2020-12-08 国网山西省电力公司 Active early warning system based on neural network algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017124641A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Device and method for executing reversal training of artificial neural network
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing full articulamentum neural metwork training
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10373050B2 (en) * 2015-05-08 2019-08-06 Qualcomm Incorporated Fixed point neural network based on floating point neural network quantization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017124641A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Device and method for executing reversal training of artificial neural network
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing full articulamentum neural metwork training

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Reconfigurable hardware implementation of neural networks based on FPGA; Li Lige et al.; Journal of Henan University of Science and Technology (Natural Science Edition); 2009-02-15 (No. 01); full text *
Real-time spike classification algorithm based on probabilistic neural networks; Zhu Xiaoping et al.; Journal of South China University of Technology (Natural Science Edition); 2012-06-15 (No. 06); full text *

Also Published As

Publication number Publication date
CN111368986A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109740739B (en) Neural network computing device, neural network computing method and related products
CN109522052B (en) Computing device and board card
WO2019218896A1 (en) Computing method and related product
CN109740754B (en) Neural network computing device, neural network computing method and related products
CN110163363B (en) Computing device and method
CN109685201B (en) Operation method, device and related product
CN111047022A (en) Computing device and related product
CN111045728B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN110059809B (en) Computing device and related product
CN111930681B (en) Computing device and related product
CN110059797B (en) Computing device and related product
CN109753319B (en) Device for releasing dynamic link library and related product
CN109711540B (en) Computing device and board card
CN111368967B (en) Neural network computing device and method
CN111079908A (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN109740730B (en) Operation method, device and related product
CN111368986B (en) Neural network computing device and method
CN111368987B (en) Neural network computing device and method
CN109711538B (en) Operation method, device and related product
CN111368990B (en) Neural network computing device and method
CN111367567B (en) Neural network computing device and method
CN111047021A (en) Computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant