CN111367567B - Neural network computing device and method - Google Patents

Neural network computing device and method

Info

Publication number
CN111367567B
CN111367567B · CN201811592237.8A · CN201811592237A
Authority
CN
China
Prior art keywords
data
ith layer
gradient
unit
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811592237.8A
Other languages
Chinese (zh)
Other versions
CN111367567A (en)
Inventor
Request not to publish the name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201811592237.8A priority Critical patent/CN111367567B/en
Publication of CN111367567A publication Critical patent/CN111367567A/en
Application granted granted Critical
Publication of CN111367567B publication Critical patent/CN111367567B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a neural network computing device and method. The device is used to execute the backward operation of an artificial neural network, and the technical solution provided by the application has the advantages of low cost and low energy consumption.

Description

Neural network computing device and method
Technical Field
The present application relates generally to artificial neural networks, and in particular to a neural network computing device and method.
Background
A neural network, also called an artificial neural network, is widely applied in fields such as pattern recognition, image processing, function approximation and optimization computation. In recent years, multilayer artificial networks have attracted increasing attention from academia and industry because of their high recognition accuracy and good parallelism. Artificial neural networks involve many algorithms, among which the fully connected layer, as an important algorithm, is widely used in various artificial neural network models.
Existing neural network operations are performed on general-purpose processors, which support only floating-point data operations. Neural network operations, especially the relatively complex ones, involve a large amount of computation and place high demands on memory; because existing neural network operations are based on floating-point data, their memory requirements are high, so the existing schemes suffer from high energy consumption and high cost.
Disclosure of Invention
One aspect of the present application provides a neural network computing device and method. The device and method use fixed-point data to perform operations; compared with floating-point data, fixed-point data can save memory and reduce the amount of computation, so the device and method have the advantages of reduced energy consumption and reduced cost.
In one aspect, an apparatus is provided for performing at least one layer i of inverse operations in an artificial neural network training calculation; at least part of data in the reverse operation of the ith layer is fixed point data operation, and i is an integer greater than or equal to 1; the computing device includes: the device comprises a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit;
the ith layer of inverse operation comprises:
the controller unit is used for acquiring input neuron data of the ith layer, weight data of the ith layer, input neuron gradient of the ith layer and a backward calculation instruction of the ith layer;
the controller unit is also used for analyzing the ith layer of calculation instruction to obtain a plurality of reverse calculation instructions, sending the ith layer of input neuron data, the ith layer of weight data and the ith layer of input neuron gradient to the conversion unit, and sending the plurality of calculation instructions to the calculation unit;
a conversion unit, configured to perform floating point type and fixed point type conversion on all or part of the ith layer of input neuron data, the ith layer of weight data, and the ith layer of input neuron gradient to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to an arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the operation unit is used for performing fixed-point operation on all fixed-point data or performing mixed operation on mixed data according to a plurality of reverse operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
In another aspect, a neural network inverse operation method is provided, where the neural network inverse operation includes at least one ith layer of inverse operation, and i is an integer greater than or equal to 1; the method is applied to a computing device, and the computing device includes: a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit;
the ith layer of inverse operations include:
the controller unit acquires input neuron data of an ith layer, weight data of the ith layer, gradient of input neurons of the ith layer and a reverse calculation instruction of the ith layer; analyzing the ith layer of calculation instructions to obtain a plurality of reverse calculation instructions, sending the ith layer of input neuron data, the ith layer of weight data and the ith layer of input neuron gradient to a conversion unit, and sending the plurality of calculation instructions to a calculation unit;
the conversion unit performs floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to obtain all fixed point data or mixed data, and sends the all fixed point data or mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
the operation unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of reverse operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer;
the blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
In another aspect, a neural network inverse operation device is provided, where the neural network inverse operation device includes one or more computing devices of the first aspect, and is configured to obtain data to be operated and control information from other processing devices, perform a specified operation, and transmit an execution result to the other processing devices through an I/O interface;
when the neural network inverse operation device comprises a plurality of computing devices, the computing devices can be connected through a specific structure and transmit data;
the computing devices are interconnected through a PCIE bus of a fast peripheral equipment interconnection bus and transmit data so as to support larger-scale neural network training operation; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the interconnection mode of the computing devices is any interconnection topology.
In a further aspect, a combined processing device is provided, which includes the neural network inverse operation device of the further aspect, a universal interconnection interface and other processing devices;
and the neural network reverse operation device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
In a further aspect, a neural network chip is provided, the neural network chip comprising the computing device of the first aspect, the neural network inverse operation device of the other aspect, or the combined processing device of the still further aspect.
Further, an electronic device comprising the chip of claim 24 is provided.
Finally, a board card is provided, the board card comprising: memory device, interface device and control device and the above-mentioned neural network chip;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
Drawings
For a more complete understanding of the present application and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates an example block diagram of an overall architecture of a neural network computing device in accordance with an embodiment of the present application.
Fig. 2 schematically shows a structural diagram of another neural network computing device according to an embodiment of the present application.
Fig. 2a schematically shows a schematic structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 2b schematically shows another structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 2c schematically shows a transmission diagram of a tree module according to an embodiment of the present application.
Fig. 2d schematically shows a receiving schematic diagram of a tree module according to an embodiment of the present application.
Fig. 3 schematically shows another structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 4 schematically shows a structural diagram of a combined processing device according to an embodiment of the present application.
Fig. 5 schematically shows a structural diagram of a board card according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The electronic devices may include various handheld devices having wireless communication functions, in-vehicle devices, wireless headsets, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and the like; examples include smart phones, tablets and earphone boxes. For convenience of description, the above-mentioned apparatuses are collectively referred to as electronic apparatuses or electronic devices.
The electronic device or electronic apparatus described above may be applied in the following (including but not limited to) scenarios: the system comprises various electronic products such as a data processing device, a robot, a computer, a printer, a scanner, a telephone, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device and a wearable device; various vehicles such as airplanes, ships, vehicles, and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
The following describes embodiments of the present application in detail.
First, a computing device as used herein is described. Referring to fig. 1, a neural network computing device is provided, where the computing device is configured to perform at least one ith layer of inverse operation in a neural network training calculation, where the neural network training calculation includes a neural network multilayer training operation, the multilayer training operation includes at least one ith layer, at least part of data in the ith layer of inverse operation is a fixed-point data operation, and i is an integer greater than or equal to 1; the computing device includes: a controller unit 11, an arithmetic unit 12 and a conversion unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12 and the conversion unit 13 (the conversion unit can be arranged independently or integrated in the controller unit or the arithmetic unit);
the ith layer of inverse operations may include:
the controller unit 11 is configured to obtain input neuron data of an ith layer, ith layer weight data, ith layer input neuron gradient, and an ith layer inverse computation instruction; in an alternative, the input neuron data and the calculation instruction may be obtained through a data input/output unit, where the data input/output unit may be one or more data I/O interfaces or I/O pins; and the data input and output unit is used for reading input neuron data or a reverse calculation instruction from an external device or an external memory.
The above-mentioned reverse calculation instruction includes but is not limited to: matrix multiply instructions, vector multiply instructions, etc., and the embodiments of the present application do not limit the particular representation or the particular class of the above-described inverse compute instructions.
The controller unit 11 is further configured to analyze the ith layer calculation instruction to obtain a plurality of inverse calculation instructions, send the ith layer input neuron data, the ith layer weight data, and the ith layer input neuron gradient to the conversion unit 13, and send the plurality of calculation instructions to the calculation unit 12;
a converting unit 13, configured to perform floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data, and the ith layer input neuron gradient to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to the arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the operation unit 12 is configured to perform fixed-point operation on all fixed-point data or perform mixed operation on mixed data according to a plurality of reverse operation instructions to obtain a weight gradient of an ith layer and an output result gradient of the ith layer; and (4) updating (optional) by adopting the weight gradient of the ith layer and the weight of the ith layer.
The blending operation includes: performing fixed-point operations on portions of fixed-point data and floating-point operations on portions of floating-point data.
In the technical solution provided by the present application, a conversion unit is provided. When the ith layer training operation of the neural network is executed, the conversion unit can convert all or part of the input neuron data, the weight data and the input neuron gradients into fixed-point data or mixed data. Compared with floating-point data, fixed-point data occupies less storage space, so the training of the neural network can be realized with a smaller memory space.
The inverse operation in the neural network training may be the inverse operation of one layer in the neural network, that is, the inverse operation of the ith layer; the inverse operations of other layers may adopt a conventional inverse operation method, or may adopt an inverse operation method similar to that of the ith layer in the present application. In the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit (i.e., the forward output result) as the input neuron of the next layer (or performs some operation, including but not limited to an activation operation, on the output neuron and then takes it as the input neuron of the next layer), and at the same time replaces the weight of the previous layer with the weight of the next layer. In the inverse operation, after the inverse operation of the next layer of the artificial neural network is completed, the operation instruction of the previous layer takes the output neuron gradient calculated in the operation unit (i.e., the output result gradient) as the input neuron gradient of the previous layer (or performs some operation on the output neuron gradient and then takes it as the input neuron gradient of the previous layer), and at the same time replaces the weight and the input neuron data with the weight and the input neuron data of the forward operation of the previous layer.
For an artificial neural network operation with multilayer operations, the input neurons and output neurons of the multilayer operations do not refer to the neurons in the input layer and the neurons in the output layer of the whole neural network. For any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers; for the Kth layer and the (K+1)th layer, K = 1, 2, ..., L-1, the Kth layer is called the input layer, in which the neurons are the input neurons, and the (K+1)th layer is called the output layer, in which the neurons are the output neurons. That is, each layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the conversion unit 13 is specifically configured to convert a part of the ith layer of input neuron data into partial fixed-point input neuron data, convert a part of the ith layer of weight data into partial fixed-point weight data, and convert a part of the ith layer of input neuron gradient into a partial fixed-point input neuron gradient; to send the partial fixed-point input neuron data, the partial fixed-point input neuron gradient and the partial fixed-point weight data to the arithmetic unit; and to send the partial input neuron data, the partial input neuron gradient and the partial weight data (the remaining floating-point data that did not undergo floating-point to fixed-point conversion) to the arithmetic unit;
the arithmetic unit is specifically configured to perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point input data to obtain a partial ith layer weight gradient, perform fixed-point data operations on the partial fixed-point input neuron gradient and the partial fixed-point weight data to obtain a partial ith layer output result gradient, and send the partial ith layer weight gradient and the partial ith layer output result gradient to the conversion unit;
the conversion unit is specifically configured to perform fixed-point to floating-point conversion on the partial ith layer weight gradient and the partial ith layer output result gradient to obtain a first-part ith layer weight gradient and a first-part ith layer output result gradient, and send them to the arithmetic unit;
and the arithmetic unit is specifically configured to perform (floating-point) operations on the partial input neuron gradient and the partial input data to obtain a second-part ith layer weight gradient, perform operations on the partial input neuron gradient and the partial weight data to obtain a second-part ith layer output result gradient, combine the first-part ith layer weight gradient and the second-part ith layer weight gradient to obtain the ith layer weight gradient, and combine the first-part ith layer output result gradient and the second-part ith layer output result gradient to obtain the ith layer output result gradient, as sketched below.
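To make the mixed-operation data flow easier to follow, a minimal sketch is given below. It is only an illustration, not the patent's implementation: the quantizer, the outer-product form of the weight gradient, and the reading of "combining" as assembling the two portions of the gradient matrix are all assumptions, and every function and variable name is hypothetical.

```python
import numpy as np

def to_fixed(x, point, scale=1.0):
    """Hypothetical quantizer based on float = int * scale * 2**point."""
    return np.round(x / (scale * 2.0 ** point)).astype(np.int64)

def to_float(q, point, scale=1.0):
    """Inverse of the quantizer above."""
    return q.astype(np.float32) * scale * 2.0 ** point

def mixed_weight_gradient(in_grad, x_fx_part, x_fp_part, point):
    # Fixed-point portion: quantize, integer outer product, convert the result back to float.
    q_grad = to_fixed(in_grad, point)
    q_x = to_fixed(x_fx_part, point)
    dw_first = to_float(np.outer(q_grad, q_x), 2 * point)   # first-part i-th layer weight gradient
    # Floating-point portion: ordinary float outer product.
    dw_second = np.outer(in_grad, x_fp_part)                 # second-part i-th layer weight gradient
    # "Combining" is taken here (an assumption) to mean assembling the two portions of the
    # weight-gradient matrix that correspond to the two portions of the input data.
    return np.concatenate([dw_first, dw_second], axis=1)
```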
Optionally, the conversion unit 13 is specifically adapted to perform the conversion between the floating-point type and the fixed-point type according to float = int × scale × 2^point; where float is the floating-point value, int is the fixed-point data value, scale is the fixed-point scaling value, and point is the decimal point position value.
Optionally, an offset may be added to the above formula, specifically float = int × scale × 2^point + offset, where offset is an offset value used to represent the deviation of int × scale × 2^point from float.
[Formula image GDA0004034836670000071: formula for calculating the decimal point position value point from the bit width value width and maxabs]
Wherein width is the bit width value of the fixed point number.
maxabs is the maximum absolute value in the floating-point data that needs to be converted, that is, the maximum absolute value among the elements of the ith layer input neuron data and the ith layer weight data. In this way, the maximum value that the fixed-point number can represent is greater than maxabs while the point (decimal point position) value is kept as small as possible.
scale = δ × 2^point / maxabs;
The delta (δ) is an empirical value (set by the manufacturer); it is an integer and is less than the maximum value that the fixed-point bit width can represent. For example, when the fixed-point bit width is 8 bits, the corresponding maximum value is 255.
For point, for example, if width = 8 and maxabs (the maximum absolute value of a set of numbers) = 2.9, then point = -4 can be calculated for that set of numbers. If point = -4, then int = 21 can be obtained for float = 1.3.
For a known point and width, the conversion between a floating-point number and a fixed-point number is:
[Formula image GDA0004034836670000081: conversion formula between the floating-point number and the fixed-point number for a given point and width]
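The conversion described by the above formulas can be sketched as follows. This is only an illustration: the choice of point is an assumed reconstruction (the original formula is available only as an image) stated here as one formula that reproduces the worked example width = 8, maxabs = 2.9, point = -4, and scale is taken as 1 for simplicity.

```python
import math
import numpy as np

def choose_point(maxabs, width):
    # Assumed reconstruction of the point formula; it reproduces the worked example
    # width = 8, maxabs = 2.9 -> point = -4 from the text.
    return math.ceil(math.log2(maxabs)) - (width - 2)

def float_to_fixed(x, point, width, scale=1.0):
    # int = round(float / (scale * 2**point)), clipped to the signed range of `width` bits.
    q = np.round(np.asarray(x, dtype=np.float64) / (scale * 2.0 ** point))
    lim = 2 ** (width - 1) - 1
    return np.clip(q, -lim - 1, lim).astype(np.int32)

def fixed_to_float(q, point, scale=1.0, offset=0.0):
    # float = int * scale * 2**point (+ offset), as in the formulas above.
    return np.asarray(q, dtype=np.float64) * scale * 2.0 ** point + offset

# Worked example from the text: width = 8, maxabs = 2.9 -> point = -4; float = 1.3 -> int = 21.
point = choose_point(2.9, 8)
print(point, float_to_fixed(1.3, point, 8))   # -4  21
```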
optionally, the method for obtaining the gradient of the i-th layer of input neurons specifically may include:
the gradient of the input neuron at the ith layer = f' × the gradient of the output result at the (i+1)th layer;
where f' is the derivative of the activation function f.
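As a small illustration of this relation, the sketch below propagates the (i+1)th layer output result gradient through the derivative of an elementwise activation. The sigmoid is chosen arbitrarily, and evaluating f' at the pre-activation values is an assumption about where the derivative is taken.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def input_neuron_gradient(pre_activation_i, output_result_gradient_next):
    # i-th layer input neuron gradient = f'(.) * (i+1)-th layer output result gradient
    return sigmoid_derivative(pre_activation_i) * output_result_gradient_next
```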
Optionally, referring to fig. 2a, the operation unit may include: a master processing circuit 101 and a plurality of slave processing circuits 102, wherein,
a master processing circuit 101, configured to perform preprocessing on data (including one or any combination of input neuron data, weight data and input neuron gradient; in addition, the data may be fixed-point data or floating-point data), and to transmit data and operation instructions with the plurality of slave processing circuits;
a plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to data (fixed-point data or floating-point data) and an operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to obtain an ith layer forward output result, an ith layer output result gradient, and an ith layer weight gradient according to the plurality of intermediate results, and update an ith layer weight according to the ith layer weight gradient.
Optionally, the activation function f is any one of a nonlinear function sigmoid, tanh, relu, and softmax, or a linear function;
the operation instruction comprises: CONFIG instruction, COMPUTE instruction, IO instruction, NOP instruction, JUMP instruction, or MOVE instruction.
Optionally, the main processing circuit includes a first storage unit, a first arithmetic unit and a first data dependency relationship determination unit, where:
the first storage unit, i.e. a neuron cache unit, is used for caching input data and output data used by the main processing circuit in the calculation process;
a first arithmetic unit for completing various arithmetic functions of the main processing circuit;
the first data dependency relation judging unit is used for reading the input neuron vectors from the first storage unit and sending the neuron vectors to the slave processing circuit through the interconnection module; and receiving the intermediate result vector of the interconnection module and sending the intermediate result vector to the first arithmetic unit.
Optionally, the first arithmetic unit includes: a vector addition unit and an activation operation unit;
the vector addition unit is used for adding offset data and the intermediate result counterpoint to obtain an offset result;
and the activation arithmetic unit is used for executing activation function operation on the bias result.
Optionally, each of the slave processing circuits includes a second arithmetic unit, a second data dependency relationship determination unit, a second storage unit and a third storage unit, where:
a second arithmetic unit for performing arithmetic logic operations;
the second data dependency relation judging unit is used for executing read-write operation on the second storage unit and the third storage unit;
a second storage unit for caching data of the input neuron vector and the output neuron value calculated from the processing circuit;
and the third storage unit is used for caching the weight vector required by the slave processing circuit in the calculation process.
Optionally, the second arithmetic unit includes: a vector multiplication unit and an accumulation unit;
the vector multiplication unit is used for executing vector multiplication operation in dot product operation;
and the accumulation unit is used for executing accumulation operation in dot product operation.
The process of updating the weight value may include:
The master processing circuit 101 is specifically configured to send the ith layer of input neuron data to each slave processing circuit and to transmit the ith layer of input neuron gradient to each slave processing circuit 102; each slave processing circuit 102 multiplies the scalar data corresponding to that slave processing circuit in the ith layer of input neuron gradient in_gradient by the ith layer of input neuron data to obtain the original weight update gradient vector dw_original of the ith layer for that slave processing circuit. After the original weight update gradient vectors of all layers have been calculated, the master processing circuit may perform a limiting process on the original weight update gradients in order to limit the gradient range of the weights. Specifically, the master processing circuit is configured to calculate the sum of squares sumsq_diff of the original weight update gradients of all layers, take the square root of sumsq_diff to obtain l2norm_diff, and, if l2norm_diff is greater than clip_gradient (a set positive constant), calculate a scale factor scale_factor = clip_gradient / l2norm_diff and send each original weight update gradient to each slave processing circuit, where each slave processing circuit multiplies the original weight update gradient by the scale factor to obtain the corresponding weight update gradient dw'; the slave processing circuit is specifically configured to multiply the weight by the weight update gradient dw' to obtain the updated weight of each slave processing circuit in the ith layer.
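The clipping portion of this procedure can be summarized by the following single-process sketch (the master/slave split is not modeled). Writing the final update as an ordinary gradient-descent step with a learning rate is an assumption; the text itself only states that the weight and dw' are multiplied to obtain the updated weight.

```python
import math
import numpy as np

def clip_and_update(weights, dw_originals, clip_gradient, lr=1.0):
    # sumsq_diff: sum of squares of the original weight update gradients of all layers.
    sumsq_diff = sum(float(np.sum(dw * dw)) for dw in dw_originals)
    l2norm_diff = math.sqrt(sumsq_diff)
    # Scale down only when the L2 norm exceeds the set positive constant clip_gradient.
    scale_factor = clip_gradient / l2norm_diff if l2norm_diff > clip_gradient else 1.0
    updated = []
    for w, dw_original in zip(weights, dw_originals):
        dw = dw_original * scale_factor      # weight update gradient dw'
        updated.append(w - lr * dw)          # assumed plain gradient-descent step
    return updated
```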
In the technical solution provided by the present application, the arithmetic unit is arranged in a one-master multi-slave structure. For the calculation instruction of the forward operation, the data can be split according to the calculation instruction, so that the part with the larger amount of computation can be operated on in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time and in turn reducing power consumption. For the backward operation, the data can also be split, and, similarly to the forward operation, the operation speed can likewise be improved.
Optionally, the master processing circuit and the slave processing circuit may each include: and the storage module is used for storing the data of the main processing circuit or the slave processing circuit. It should be noted that the memory module may be shared between the master processing circuit and the slave processing circuit, that is, one or more regions are divided into a shared region in the memory module of the master processing circuit, and a memory space of the shared region may be shared by a plurality of slave processing modules (including reading or writing data); one or more areas can be divided into shared areas in the storage module of the slave processing circuit, and the storage space of the shared areas can be shared and used (including reading or writing data) by the master processing module.
This technical solution provides a scheme for regional sharing of the storage modules. Compared with a fixed storage module scheme, sharing the storage modules between the interconnected master processing circuit and the plurality of slave processing circuits can avoid the problem that computation cannot proceed because the storage area is insufficient. In addition, sharing the storage modules can effectively reduce the storage space that must be provided in the master processing circuit, which greatly reduces the cost of the master processing circuit. Moreover, compared with fetching data from an external device, this scheme reduces the overhead of reading or writing data: when the computing device reads or writes data externally, the data needs to be forwarded through components such as the controller unit and the conversion unit, so a neural network operation has to pass through multiple components, incurring large overhead and large energy consumption for data reads and writes. By appropriately providing a shared area in the master processing circuit and the slave processing circuits, when the space of a storage module of the computing device is insufficient, the data does not need to be stored in an external device but can be stored directly within the operation unit, which greatly reduces the overhead.
Optionally, referring to fig. 2, the computing apparatus may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input neuron data, the weight data, the input neuron gradient and the scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used to read or store data from the memory unit 10.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising: and a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue.
For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in the following table.
Operation code | Register or immediate data | Register/immediate | ...
The ellipses in the above table indicate that multiple registers or immediate numbers may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.
[Table 1 image GDA0004034836670000111: format of the neural network operation instruction — operation code followed by the operation domains register number 0 to register number 4]
The register may be an off-chip memory, but in practical applications, the register may also be an on-chip memory for storing data, and the data may specifically be n-dimensional data, where n is an integer greater than or equal to 1, for example, when n =1, the data is 1-dimensional data, that is, a vector, when n =2, the data is 2-dimensional data, that is, a matrix, and when n =3 or greater, the data is a multidimensional tensor.
In an alternative embodiment, referring to fig. 2a, the arithmetic unit 12 may comprise a master processing circuit 101 and a plurality of slave processing circuits 102. In one embodiment, as shown in fig. 2b, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits. It should be noted that, as shown in fig. 2b, the k slave processing circuits include only the n slave processing circuits in the 1st row, the n slave processing circuits in the mth row and the m slave processing circuits in the 1st column; that is, the k slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
The k slave processing circuits are used for forwarding data and instructions between the master processing circuit and the plurality of slave processing circuits.
Alternatively, the above-described conversion unit may be provided within the main processing circuit 101.
The main processing circuit may further include:
an activation processing circuit 111 for performing an activation operation or an activation derivation operation of data in the main processing circuit;
and an addition processing circuit 112 for performing addition operation or accumulation operation.
The master processing circuit is configured to determine that the input neuron data is broadcast data and the weight data is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;
the plurality of slave processing circuits are used for executing operation on the received data blocks according to the operation instruction to obtain intermediate results and transmitting the intermediate results to the main processing circuit;
and the main processing circuit is used for updating the ith layer weight according to the ith layer weight gradient.
The slave processing circuit includes: a multiplication processing circuit;
the multiplication processing circuit is used for performing product operation on the received data block to obtain a product result;
forwarding processing circuitry (optional) for forwarding the received data block or the product result.
And the accumulation processing circuit is used for performing accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.
The following describes a specific calculation method of the computing device shown in fig. 1 through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be: s = s(∑ w·x_i + b), in which the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
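A minimal numeric illustration of this formula, with the sigmoid chosen arbitrarily as the activation s:

```python
import numpy as np

def neuron_output(w, x, b):
    # s = s(sum_i w_i * x_i + b): weighted sum, add the bias, then apply the activation.
    h = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-h))   # sigmoid as an example activation s(h)

print(neuron_output(np.array([0.5, -0.2]), np.array([1.0, 2.0]), 0.1))
```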
In an alternative embodiment, as shown in fig. 2c, the apparatus may further comprise: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 2c, the tree module is a transmitting function, and as shown in fig. 2d, the tree module is a receiving function.
And the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional structure of the computing device, and may include at least 1 layer of nodes, where the nodes are line structures with forwarding function, and the nodes themselves may not have computing function. If the tree module has zero level nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 2c, or may have a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment is not limited to the specific value of n, the number of layers may be 2, and the slave processing circuit may be connected to nodes in other layers than the node in the penultimate layer.
Optionally, the main processing circuit in the arithmetic unit may carry a separate cache, and specifically, the method may include: a neuron buffer unit that buffers the input neuron vector data and the output neuron value data of the slave processing circuit. The main processing circuit may further include: and the weight buffer unit is used for buffering weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 3, may include a branch processing circuit 103; the specific connection structure is shown in fig. 3, wherein,
the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;
a branch processing circuit 103, configured to forward data or instructions between the main processing circuit 101 and the slave processing circuits 102.
Alternatively, the branch processing circuit 103 may be provided with a storage module, the storage module may be divided into one or more shared areas, and the master processing circuit and the slave processing circuits are specifically configured to perform write or read operations on the data in the shared areas. Arranging the shared area inside the branch processing circuit 103 allows the master processing circuit and the slave processing circuits to store data conveniently at low data storage overhead, so that the capacities of the storage modules of the slave processing circuits and of the master processing circuit can be saved and the cost of the computing device can be reduced.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f (wx + b), where x is an input neuron matrix, w is a weight matrix, b is a bias scalar, and f is an activation function, and may specifically be: sigmoid function, tanh, relu, softmax function. Here, a binary tree structure is assumed, and there are 8 slave processing circuits, and the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrixes, then distributes the 8 sub-matrixes to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit executes multiplication and accumulation operation of the 8 sub-matrixes and the input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
and the main processing circuit is used for sequencing the 8 intermediate results to obtain a wx operation result, executing offset b operation on the operation result, executing activation operation to obtain a final result y, sending the final result y to the controller unit, and outputting or storing the final result y into the storage unit by the controller unit.
A specific implementation of arranging the 8 intermediate results to obtain the wx operation result may be: for the matrix-by-matrix multiplication, determine the partial elements of the input neuron matrix x corresponding to the 8 sub-matrices, and extract the minimum row number of the 8 sub-matrices and the minimum column number of those partial elements; the minimum row number and the minimum column number give the position of each intermediate result within the operation result.
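The distribution scheme of this example can be mimicked functionally in a few lines. This is only a simulation sketch: it splits w by rows (one possible split), ignores the tree-module transport, and all names are hypothetical.

```python
import numpy as np

def simulated_fc_forward(x, w, b, f=np.tanh, num_slaves=8):
    # Master: treat x as broadcast data and split the weight matrix w into 8 sub-matrices (by rows here).
    sub_ws = np.array_split(w, num_slaves, axis=0)
    # Slaves: each multiplies its sub-matrix with the broadcast x in parallel -> 8 intermediate results.
    intermediates = [sub_w @ x for sub_w in sub_ws]
    # Master: arrange the intermediate results into wx, add the bias b, apply the activation f.
    wx = np.concatenate(intermediates, axis=0)
    return f(wx + b)

y = simulated_fc_forward(np.random.randn(16), np.random.randn(32, 16), np.random.randn(32))
```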
The application also discloses a neural network inverse operation device which comprises one or more computing devices mentioned in the application and is used for acquiring data to be operated and control information from other processing devices, executing the specified neural network inverse computation and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral devices such as cameras, displays, mice, keyboards, network cards, wifi interfaces, servers. When more than one computing device is included, the computing devices may be linked and transmit data through a specific structure, such as through a PCIE bus, to support larger-scale operations of the neural network. At this time, the same control system may be shared, or there may be separate control systems; the memory may be shared or there may be separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The neural network reverse operation device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the neural network inverse operation device, the universal interconnection interface and other processing devices. The neural network reverse operation device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 4 is a schematic view of a combined treatment apparatus.
Other processing devices include one or more of general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network reverse operation device and external data and control, performing data transfer and completing basic control of the neural network reverse operation device such as starting and stopping; the other processing devices can also cooperate with the neural network reverse operation device to complete operation tasks together.
And the universal interconnection interface is used for transmitting data and control instructions between the neural network reverse operation device and other processing devices. The neural network reverse operation device acquires required input data from other processing devices and writes the input data into a storage device on the neural network reverse operation device chip; control instructions can be obtained from other processing devices and written into a control cache on a neural network reverse operation device chip; the data in the storage module of the neural network arithmetic device can also be read and transmitted to other processing devices.
Optionally, the structure may further include a storage device, as shown in fig. 4, and the storage device is connected to the neural network inverse operation device and the other processing device, respectively. The storage device is used for storing the data in the neural network inverse operation device and the other processing devices, and is particularly suitable for the data which needs to be operated and cannot be completely stored in the internal storage of the neural network inverse operation device or the other processing devices.
The combined processing device can be used as an SoC (system on chip) for equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
In some embodiments, a chip including the neural network inverse operation device or the combined processing device is also provided.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 5, fig. 5 provides a card, which may include other components besides the chip 389, including but not limited to: memory device 390, interface device 391 and control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency: DDR allows data to be read out on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of the storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 of the 72 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are adopted in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
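The 25600 MB/s figure follows from the stated configuration; a quick check, assuming 3200 mega-transfers per second for DDR4-3200 and the 64 data bits of the 72-bit controller:

```python
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 mega-transfers per second
data_bits_per_transfer = 64           # 64 of the 72 controller bits carry data (8 bits are ECC)
bandwidth_mb_s = transfers_per_second * data_bits_per_transfer / 8 / 10**6
print(bandwidth_mb_s)                 # 25600.0
```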
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And arranging a controller for controlling DDR in the chip, wherein the controller is used for controlling data transmission and data storage of each storage unit.
The interface device is electrically connected with a chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement data transfer. Preferably, when the PCIE 3.0X16 interface is adopted for transmission, the theoretical bandwidth can reach 16000MB/s. In another embodiment, the interface device may also be another interface, and the present application does not limit the concrete expression of the other interface, and the interface unit may implement the switching function. In addition, the calculation result of the chip is still transmitted back to an external device (e.g., a server) by the interface device.
The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single chip Microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may carry a plurality of loads. Therefore, the chip can be in different working states such as multi-load and light load. The control device can realize the regulation and control of the working states of a plurality of processing chips, a plurality of processing circuits and/or a plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided, which includes the above board.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; for instance, the above division of units is only one kind of logical function division, and other division manners may be adopted in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of physical parts. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
Those skilled in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable memory, and the memory may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), or an optical disk.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may, according to the ideas of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (26)

1. A neural network computing device, wherein the device is configured to perform at least one ith layer of inverse operations in an artificial neural network training calculation; at least part of the data in the ith layer of inverse operations is operated on as fixed point data, and i is an integer greater than or equal to 1; the computing device includes: a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit;
the ith layer of inverse operations include:
the controller unit is used for acquiring input neuron data of the ith layer, weight data of the ith layer, input neuron gradient of the ith layer and a backward calculation instruction of the ith layer;
the controller unit is further used for parsing the ith layer calculation instruction to obtain a plurality of reverse operation instructions, sending the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to the conversion unit, and sending the plurality of reverse operation instructions to the arithmetic unit;
a conversion unit, configured to perform floating point type and fixed point type conversion on all or part of the ith layer of input neuron data, the ith layer of weight data, and the ith layer of input neuron gradient to obtain all fixed point data or mixed data, and send all the fixed point data or mixed data to an arithmetic unit, where the mixed data includes: partial fixed point data and partial floating point data;
the conversion unit is further configured to perform the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point, wherein float is a floating point type value, int is a fixed point type data value, scale is a fixed point type scaling value, and point is the decimal point position value;
the operation unit is used for performing fixed-point operation on all fixed-point data or performing mixed operation on mixed data according to a plurality of reverse operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer;
the mixed operation includes: performing fixed-point operations on the partial fixed point data and performing floating-point operations on the partial floating point data.
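To make the conversion formula of claim 1 concrete, the following sketch quantizes a floating point value according to float = int × scale × 2^point and then recovers it. It is only an illustration of the formula; the function names, the example parameters and the round-to-nearest choice are assumptions, not part of the claimed device.

    # Illustrative fixed-point <-> floating-point conversion per float = int * scale * 2**point.
    def to_fixed(value, scale, point):
        # Solve the formula for the fixed point integer; round to the nearest integer (assumed).
        return round(value / (scale * 2 ** point))

    def to_float(fixed, scale, point):
        # Reconstruct the (approximate) floating point value from the fixed point integer.
        return fixed * scale * 2 ** point

    scale, point = 1.0, -8               # assumed parameters: quantization step of 2**-8
    x = 0.71875
    q = to_fixed(x, scale, point)        # 184
    print(q, to_float(q, scale, point))  # 184 0.71875 (exactly representable at this step)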
2. The computing device of claim 1,
the conversion unit is specifically configured to convert a part of the ith layer of input neuron data into partial fixed point input neuron data, convert a part of the ith layer of weight data into partial fixed point weight data, and convert the ith layer of input neuron gradient into partial fixed point input neuron gradient; sending part of fixed point input neuron data, part of fixed point input neuron gradient and part of fixed point weight data to an operation unit, and sending part of input neuron data, part of input neuron gradient and part of weight data to the operation unit;
the operation unit is specifically used for executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point input data to obtain part of ith layer weight gradients, executing fixed point data operation on part of fixed point input neuron gradients and part of fixed point weight data to obtain part of ith layer output result gradients, and sending part of ith layer weight gradients and part of ith layer output result gradients to the conversion unit,
the conversion unit is specifically used for performing fixed point and floating point conversion on the part of the ith layer weight gradient and the part of the ith layer output result gradient to obtain a first part of the ith layer weight gradient and a first part of the ith layer output result gradient, and sending the first part of the ith layer weight gradient and the first part of the ith layer output result gradient to the operation unit;
and the operation unit is specifically used for performing operation on part of input neuron gradients and part of input data to obtain a second part ith layer weight gradient, performing operation on part of input neuron gradients and part of weight data to obtain a second part ith layer output result gradient, combining the first part ith layer weight gradient and the second part ith layer weight gradient to obtain an ith layer weight gradient, and combining the first part ith layer output result gradient and the second part ith layer output result gradient to obtain an ith layer output result gradient.
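Claim 2 splits the data so that one part is processed as fixed point data and the remainder as floating point data, and the two partial ith layer weight gradients are then combined. The following NumPy sketch mimics that flow; the row-wise split, the quantization step and the shapes are assumptions for illustration only.

    import numpy as np

    def quantize(x, point=-8):
        return np.round(x / 2.0 ** point).astype(np.int32)

    def dequantize(q, point=-8):
        return q.astype(np.float32) * 2.0 ** point

    rng = np.random.default_rng(0)
    neurons = rng.standard_normal((4, 8)).astype(np.float32)   # ith layer input neuron data
    grad_in = rng.standard_normal((4, 3)).astype(np.float32)   # ith layer input neuron gradient

    # First half of the batch handled as (simulated) fixed point, second half as floating point.
    fixed_part = dequantize(quantize(neurons[:2])).T @ dequantize(quantize(grad_in[:2]))
    float_part = neurons[2:].T @ grad_in[2:]

    # Combine the first and second partial ith layer weight gradients.
    weight_gradient = fixed_part + float_part
    print(weight_gradient.shape)                               # (8, 3)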
3. The computing device of claim 1,
the conversion unit is specifically configured to perform the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point - offset, wherein offset is an offset value.
4. The apparatus of claim 1, wherein the method for obtaining the i-th layer input neuron gradient specifically comprises:
the controller unit is specifically used for receiving the output result gradient of the (i + 1) th layer and sending the output result gradient of the (i + 1) th layer to the arithmetic unit;
the operation unit is specifically used for obtaining the gradient of the input neuron of the ith layer according to the gradient of the output result of the (i + 1) th layer;
the ith layer input neuron gradient = f' × the (i+1)th layer output result gradient;
where f' is the derivative of the activation function f.
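A minimal numerical illustration of the relation in claim 4, with a sigmoid activation assumed (claim 12 also allows tanh, relu, softmax or a linear function) and made-up example values:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    pre_activation = np.array([0.5, -1.2, 2.0])           # assumed ith layer pre-activation values
    output_gradient_next = np.array([0.1, -0.3, 0.05])    # (i+1)th layer output result gradient

    # ith layer input neuron gradient = f'(pre-activation) * (i+1)th layer output result gradient
    input_neuron_gradient = sigmoid_derivative(pre_activation) * output_gradient_next
    print(input_neuron_gradient)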
5. The apparatus according to claim 1, wherein the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits; wherein,
the main processing circuit is used for performing preamble processing on data and transmitting data and operation instructions with the plurality of slave processing circuits;
the slave processing circuits are used for executing intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results and transmitting the intermediate results to the master processing circuit;
and the main processing circuit is used for obtaining the ith layer output result gradient and the ith layer weight gradient according to the plurality of intermediate results.
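The master/slave arrangement of claim 5 can be pictured as the master distributing data, the slaves producing intermediate results in parallel, and the master combining them. The sketch below mimics this with a matrix-vector product split by rows; the row-wise split and the sequential loop standing in for parallel slaves are assumptions for illustration only.

    import numpy as np

    def master_slave_matvec(weight_matrix, vector, num_slaves=2):
        # Master processing circuit: distribute row blocks of the weights to the slaves.
        row_blocks = np.array_split(weight_matrix, num_slaves)

        # Slave processing circuits: each computes its intermediate result independently.
        intermediate_results = [block @ vector for block in row_blocks]

        # Master processing circuit: combine the intermediate results into the final output.
        return np.concatenate(intermediate_results)

    W = np.arange(12, dtype=np.float64).reshape(4, 3)
    x = np.array([1.0, 0.0, -1.0])
    print(master_slave_matvec(W, x))   # identical to W @ x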
6. The apparatus of claim 5,
the master processing circuit is specifically configured to send the ith layer input neuron data to each slave processing circuit, transmit the ith layer input neuron gradient to each slave processing circuit, multiply the scalar data corresponding to each slave processing circuit in the ith layer input neuron gradient in_gradient by the ith layer input neuron data to obtain the original weight update gradient vector dw_original of the ith layer of each slave processing circuit, and multiply the weight of each slave processing circuit by the original weight update gradient vector dw_original to obtain the update weight of each slave processing circuit.
7. The apparatus of claim 6,
the main processing circuit is specifically configured to, after the original weight update gradient vectors of all layers have been calculated, compute the sum of squares sumsq_diff of the original weight update gradients of all layers, and then take the square root of sumsq_diff to obtain l2norm_diff; if l2norm_diff is greater than clip_gradient, compute a scaling factor scale_factor = clip_gradient/l2norm_diff, multiply all the original weight update gradients dw_original by the scaling factor scale_factor respectively to obtain the weight update gradient dw', and send the update gradient dw' to each slave processing circuit;
and the slave processing circuit is specifically configured to multiply the weight by the weight update gradient dw' to obtain an update weight of each slave processing circuit in the ith layer.
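Claims 6 and 7 together describe computing per-slave weight update gradients and clipping them by their global L2 norm before the weights are updated. The sketch below mirrors the clipping flow with NumPy; the clip_gradient value and the subtractive update at the end are assumptions for illustration, since claim 7 itself states that each slave multiplies its weight by dw' to obtain the update weight.

    import numpy as np

    def clip_and_apply(weights, dw_original, clip_gradient=1.0):
        # sumsq_diff: sum of squares of the original weight update gradients of all layers.
        sumsq_diff = sum(float(np.sum(g ** 2)) for g in dw_original)
        l2norm_diff = np.sqrt(sumsq_diff)

        # If the global norm exceeds clip_gradient, scale every gradient by scale_factor.
        if l2norm_diff > clip_gradient:
            scale_factor = clip_gradient / l2norm_diff
            dw = [g * scale_factor for g in dw_original]
        else:
            dw = dw_original

        # Each slave applies its clipped gradient dw' to its own weights (subtraction assumed).
        return [w - g for w, g in zip(weights, dw)]

    weights = [np.ones((2, 2)), np.ones(3)]
    dw_original = [np.full((2, 2), 0.5), np.full(3, 2.0)]
    print([w.shape for w in clip_and_apply(weights, dw_original)])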
8. The apparatus of any of claims 5-7, wherein the master processing circuit and the slave processing circuit each comprise a memory module;
the storage module is used for storing data;
the memory module further comprises at least one shared area, and the shared area is used by the main processing circuit or the auxiliary processing circuit in a shared mode.
9. The apparatus according to any one of claims 5-7, wherein the arithmetic unit further comprises: a branch processing circuit;
the branch processing circuit is arranged between the main processing circuit and the plurality of slave processing circuits, and realizes the forwarding of data and operation instructions between the main processing circuit and the plurality of slave processing circuits.
10. The apparatus of claim 9, wherein the branch processing circuit comprises: a storage module, the storage module comprising at least one shared area, and the shared area is a storage space shared and used by the main processing circuit and the slave processing circuits.
11. The apparatus according to claim 1, further comprising a tree module, wherein the tree module is an n-way tree composed of a plurality of nodes; data of an upstream node of the n-way tree is sent to the n downstream nodes in the same way, data returned by the n downstream nodes is merged and sent to the upstream node, and n is an integer greater than or equal to 2.
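The n-way tree of claim 11 broadcasts data from an upstream node to its n downstream nodes and merges the returned data on the way back up. A toy sketch of that pattern follows; the merge-by-summation and the leaf computation are assumptions chosen only to keep the example short.

    class TreeNode:
        def __init__(self, children=None):
            self.children = children or []   # the n downstream nodes; empty for a leaf

        def run(self, data):
            if not self.children:
                # A leaf stands in for a slave processing circuit; here it just sums its data.
                return sum(data)
            # Broadcast the same data to all n downstream nodes ...
            results = [child.run(data) for child in self.children]
            # ... and merge the returned results before passing them upstream.
            return sum(results)

    root = TreeNode([TreeNode(), TreeNode()])   # a 2-way tree (n = 2)
    print(root.run([1, 2, 3]))                  # 12: each leaf returns 6, merged at the root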
12. The apparatus of claim 4, wherein the activation function f is any one of a nonlinear function sigmoid, tanh, relu, softmax, or a linear function;
the operation instruction comprises: CONFIG instruction, COMPUTE instruction, IO instruction, NOP instruction, JUMP instruction, or MOVE instruction.
13. The apparatus according to any one of claims 5 to 7, wherein the main processing circuit includes a first storage unit, a first arithmetic unit and a first data dependency determination unit, wherein:
the first storage unit is used for caching the input data and output data used by the main processing circuit in the calculation process;
a first arithmetic unit for completing various arithmetic functions of the main processing circuit;
the first data dependency relation judging unit is used for reading the input neuron vectors from the first storage unit and sending the neuron vectors to the slave processing circuit through the interconnection module; and receiving the intermediate result vector of the interconnection module and sending the intermediate result vector to the first arithmetic unit.
14. The apparatus according to claim 13, wherein the first arithmetic unit comprises: a vector addition unit and an activation operation unit;
the vector addition unit is used for adding bias data to the intermediate result element by element to obtain a bias result;
and the activation operation unit is used for executing activation function operation on the bias result.
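A brief sketch of the two steps in claim 14, with relu assumed as the activation function and made-up values:

    import numpy as np

    def bias_and_activate(intermediate_result, bias):
        biased = intermediate_result + bias    # vector addition unit: element-wise bias addition
        return np.maximum(biased, 0.0)         # activation operation unit (relu assumed)

    print(bias_and_activate(np.array([0.2, -1.5, 3.0]), np.array([0.1, 0.1, -0.5])))
    # [0.3 0.  2.5]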
15. The apparatus of any one of claims 5-7, wherein each slave processing circuit comprises a second arithmetic unit, a second data dependency determination unit, a second storage unit, and a third storage unit, wherein:
a second arithmetic unit for performing arithmetic logic operations;
the second data dependency relation judging unit is used for executing read-write operation on the second storage unit and the third storage unit;
a second storage unit for caching data of the input neuron vector and the output neuron value calculated from the processing circuit;
and the third storage unit is used for caching the weight vector required by the slave processing circuit in the calculation process.
16. The apparatus of claim 15, wherein the second arithmetic unit comprises: a vector multiplication unit and an accumulation unit;
the vector multiplication unit is used for executing vector multiplication operation in dot product operation;
and the accumulation unit is used for executing accumulation operation in dot product operation.
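The split of a dot product between the vector multiplication unit and the accumulation unit in claim 16 can be illustrated as follows (a sketch only; the input vectors are made up):

    def dot_product(a, b):
        products = [x * y for x, y in zip(a, b)]   # vector multiplication unit: element-wise products
        return sum(products)                        # accumulation unit: sum the partial products

    print(dot_product([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))   # 4.5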
17. A neural network inverse operation method, wherein the neural network inverse operation comprises at least one ith layer of inverse operations, and i is an integer greater than or equal to 1; the method is applied to a computing device, and the computing device includes: a controller unit, an arithmetic unit and a conversion unit, wherein the controller unit is connected with the arithmetic unit and the conversion unit;
the ith layer of inverse operations include:
the controller unit acquires the ith layer input neuron data, the ith layer weight data, the ith layer input neuron gradient and the ith layer backward calculation instruction; parses the ith layer calculation instruction to obtain a plurality of reverse operation instructions, sends the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to the conversion unit, and sends the plurality of reverse operation instructions to the arithmetic unit;
the conversion unit performs floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to obtain all fixed point data or mixed data, and sends the all fixed point data or mixed data to the operation unit, wherein the mixed data comprises: partial fixed point data and partial floating point data;
the executing the conversion between the floating point type and the fixed point type specifically comprises:
the conversion unit performs the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point, wherein float is a floating point type value, int is a fixed point type data value, scale is a fixed point type scaling value, and point is the decimal point position value;
the operation unit executes fixed point operation on all fixed point data or mixed operation on mixed data according to a plurality of reverse operation instructions to obtain the weight gradient of the ith layer and the output result gradient of the ith layer;
the mixed operation includes: performing fixed-point operations on the partial fixed point data and performing floating-point operations on the partial floating point data.
18. The method of claim 17, wherein the step in which the conversion unit performs floating point type and fixed point type conversion on all or part of the ith layer input neuron data, the ith layer weight data and the ith layer input neuron gradient to obtain all fixed point data or mixed data and sends the all fixed point data or the mixed data to the arithmetic unit, the mixed data comprising partial fixed point data and partial floating point data, and the step in which the arithmetic unit performs fixed-point operations on the all fixed point data or performs mixed operations on the mixed data according to the plurality of reverse operation instructions to obtain the ith layer weight gradient and the ith layer output result gradient specifically include:
the conversion unit converts part of the ith layer of input neuron data into partial fixed point input neuron data, converts part of the ith layer of weight data into partial fixed point weight data and converts the ith layer of input neuron gradient into partial fixed point input neuron gradient; sending part of fixed point input neuron data, part of fixed point input neuron gradient and part of fixed point weight data to an operation unit, and sending part of input neuron data, part of input neuron gradient and part of weight data to the operation unit;
the arithmetic unit executes fixed point data operation on part of fixed point input neuron gradients and part of fixed point input data to obtain part of ith layer weight gradients, executes fixed point data operation on part of fixed point input neuron gradients and part of fixed point weight data to obtain part of ith layer output result gradients, and sends part of ith layer weight gradients and part of ith layer output result gradients to the conversion unit,
the conversion unit performs fixed point and floating point conversion on the part of the ith layer weight gradient and the part of the ith layer output result gradient to obtain a first part of the ith layer weight gradient and a first part of the ith layer output result gradient, and sends the first part of the ith layer weight gradient and the first part of the ith layer output result gradient to the operation unit;
the operation unit executes operation on part of input neuron gradients and part of input data to obtain a second part ith layer weight gradient, executes operation on part of input neuron gradients and part of weight data to obtain a second part ith layer output result gradient, combines the first part ith layer weight gradient and the second part ith layer weight gradient to obtain an ith layer weight gradient, and combines the first part ith layer output result gradient and the second part ith layer output result gradient to obtain an ith layer output result gradient.
19. The method of claim 17, wherein performing the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point specifically includes:
performing the conversion between the floating point type and the fixed point type according to float = int × scale × 2^point - offset, wherein offset is an offset value.
20. A neural network inverse operation device, wherein the neural network inverse operation device comprises one or more computing devices as claimed in any one of claims 1 to 16, and is configured to obtain data to be operated and control information from other processing devices, perform specified operations, and transmit the execution results to other processing devices through an I/O interface;
when the neural network inverse operation device comprises a plurality of computing devices, the computing devices can be connected through a specific structure and transmit data;
the computing devices are interconnected through a peripheral component interconnect express (PCIE) bus and transmit data so as to support larger-scale neural network inverse operations; the plurality of computing devices share the same control system or have their own control systems; the computing devices share a memory or have their own memories; and the plurality of computing devices are interconnected in an arbitrary interconnection topology.
21. A combined processing device, wherein the combined processing device comprises the neural network inverse operation device as claimed in claim 20, a universal interconnection interface and other processing devices;
and the neural network reverse operation device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
22. The combination processing device of claim 21, further comprising: and the storage device is respectively connected with the neural network reverse operation device and the other processing devices and is used for storing the data of the neural network reverse operation device and the other processing devices.
23. A neural network chip comprising the computing device of claim 1 or the neural network inverse operation device of claim 20 or the combined processing device of claim 21.
24. An electronic device, characterized in that it comprises a chip according to claim 23.
25. A board card, wherein the board card includes: a storage device, an interface device, a control device and the neural network chip as claimed in claim 23;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
26. The board card of claim 25, wherein,
the memory device includes: a plurality of groups of memory cells, each group of memory cells is connected with the chip through a bus, and the memory cells are: DDR SDRAM;
the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;
the interface device is as follows: a standard PCIE interface.
CN201811592237.8A 2018-12-25 2018-12-25 Neural network computing device and method Active CN111367567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811592237.8A CN111367567B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811592237.8A CN111367567B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Publications (2)

Publication Number Publication Date
CN111367567A CN111367567A (en) 2020-07-03
CN111367567B true CN111367567B (en) 2023-03-07

Family

ID=71208097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811592237.8A Active CN111367567B (en) 2018-12-25 2018-12-25 Neural network computing device and method

Country Status (1)

Country Link
CN (1) CN111367567B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112959326B (en) * 2021-03-29 2022-06-07 深圳市优必选科技股份有限公司 Method and device for solving positive kinematics of robot, readable storage medium and robot

Citations (3)

Publication number Priority date Publication date Assignee Title
WO2017124641A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Device and method for executing reversal training of artificial neural network
CN107844322A (en) * 2017-07-20 2018-03-27 上海寒武纪信息科技有限公司 Apparatus and method for performing artificial neural network forward operation
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP2009110353A (en) * 2007-10-31 2009-05-21 Hitachi Ltd Microcontroller and control system

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2017124641A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Device and method for executing reversal training of artificial neural network
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN107844322A (en) * 2017-07-20 2018-03-27 上海寒武纪信息科技有限公司 Apparatus and method for performing artificial neural network forward operation

Non-Patent Citations (2)

Title
GPU-accelerated artificial neural network training supporting OpenCL; Zhu Weihua et al.; Computer Systems & Applications; 2011-07-15 (Issue 07); full text *
Research on floating-to-fixed-point conversion and word-length co-design of SoC fixed-point accelerators; Zhou Fan et al.; Journal of Applied Sciences; 2007-03-15 (Issue 02); full text *

Also Published As

Publication number Publication date
CN111367567A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109740739B (en) Neural network computing device, neural network computing method and related products
CN109522052B (en) Computing device and board card
WO2019218896A1 (en) Computing method and related product
CN109740754B (en) Neural network computing device, neural network computing method and related products
CN110163363B (en) Computing device and method
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN110059797B (en) Computing device and related product
CN111047022A (en) Computing device and related product
CN109753319B (en) Device for releasing dynamic link library and related product
CN111045728B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN111079908A (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN111930681B (en) Computing device and related product
CN110059809B (en) Computing device and related product
CN109711540B (en) Computing device and board card
CN111368967B (en) Neural network computing device and method
CN111367567B (en) Neural network computing device and method
CN111368986B (en) Neural network computing device and method
CN111368990B (en) Neural network computing device and method
CN111368987B (en) Neural network computing device and method
CN109740730B (en) Operation method, device and related product
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN111047021A (en) Computing device and related product
CN111198714B (en) Retraining method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant