CN110059797A - Computing device and related product - Google Patents
- Publication number
- CN110059797A (Application CN201811181151.6A)
- Authority
- CN
- China
- Prior art keywords
- layer
- circuit
- processing circuit
- convolution
- reversed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present application provides a computing device and a related product. The computing device is configured to perform convolutional neural network training operations, and has the advantages of low cost and low power consumption.
Description
Technical field
The present application relates to the field of information processing technology, and in particular to a computing device and a related product.
Background technique
With the continuous development of information technology and people's ever-growing demands, requirements for the timeliness of information keep rising. At present, terminals acquire and process information based on general-purpose processors.

In practice, it has been found that this way of processing information, namely running software programs on a general-purpose processor, is limited by the operating speed of the general-purpose processor. Especially when the general-purpose processor is heavily loaded, information-processing efficiency is low and latency is high. For a computation model used in information processing, such as the training of its convolution layers, the amount of computation involved in convolution training is large; a general-purpose processor takes a long time to complete convolution training, with low efficiency and high power consumption.
Summary of the invention
The embodiments of the present application provide a computing device and a related product, which can increase the processing speed of convolution training operations, improve efficiency, and reduce power consumption.
In a first aspect, a computing device is provided. The computing device is configured to perform convolutional neural network training operations. The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. The computing device includes an operation unit and a controller unit; the operation unit includes a main processing circuit and slave processing circuits. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The computing device is configured to perform the i-th-layer convolution forward operation and to perform the i-th-layer convolution backward operation.
Performing the i-th-layer convolution forward operation specifically includes:

the controller unit is configured to obtain the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction;

the controller unit is further configured to parse the forward computation instruction to obtain a plurality of forward operation instructions, and to send the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit is configured to broadcast the input data to the slave processing circuits, split the convolution kernel into a plurality of kernel data blocks, distribute the kernel data blocks to the slave processing circuits, and send the plurality of operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and to transfer the operation result to the main processing circuit;

the main processing circuit is configured to splice the operation results to obtain the convolution result.
Performing the i-th-layer convolution backward operation specifically includes:

the controller unit is further configured to obtain the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction;

the controller unit is further configured to parse the backward computation instruction to obtain a plurality of backward operation instructions, and to send the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit is further configured to select, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcast the i-th-layer output data gradient to the slave processing circuits, split the i-th-layer backward input data into a plurality of backward input data blocks, and distribute the backward input data blocks and the backward operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and to return the vector operation result to the main processing circuit;

the main processing circuit is configured to determine the i-th-layer convolution kernel gradient from the vector operation results, and to perform an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
In a second aspect, an embodiment of the present application provides a convolution training apparatus. The convolution training apparatus includes one or more computing devices as provided in the first aspect, and is configured to obtain operand data and control information from other processing devices, perform the specified convolution operation, and pass the execution result to the other processing devices through an I/O interface.

When the convolution training apparatus includes a plurality of computing devices, the computing devices may be connected to and transmit data with one another through a specific structure.

Specifically, the computing devices may be interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the computing devices may share one control system or have their own respective control systems; the computing devices may share memory or have their own respective memories; and the computing devices may be interconnected in any interconnection topology.
In a third aspect, a combined processing apparatus is provided. The combined processing apparatus includes the convolution training apparatus of the second aspect, a universal interconnection interface, and other processing devices. The convolution training apparatus interacts with the other processing devices to jointly complete the computing operation specified by the user.
In a fourth aspect, a neural network chip is provided. The neural network chip includes the computing device provided in the first aspect, the convolution training apparatus provided in the second aspect, or the combined processing apparatus provided in the third aspect.
In a fifth aspect, an electronic device is provided. The electronic device includes the chip provided in the fourth aspect.
In a sixth aspect, a board card is provided. The board card includes a memory device, an interface device, a control device, and the neural network chip provided in the fourth aspect;

the neural network chip is connected to the memory device, the control device, and the interface device respectively;

the memory device is configured to store data;

the interface device is configured to implement data transmission between the chip and an external device;

the control device is configured to monitor the state of the chip.
In a seventh aspect, an embodiment of the present application further provides a convolutional neural network training method applied to a computing device. The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. The computing device includes an operation unit and a controller unit; the operation unit includes a main processing circuit and slave processing circuits. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The convolutional neural network training method includes at least performing the i-th-layer convolution forward operation and performing the i-th-layer convolution backward operation.

Performing the i-th-layer convolution forward operation includes:

the controller unit obtains the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction, parses the forward computation instruction to obtain a plurality of forward operation instructions, and sends the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into a plurality of kernel data blocks, distributes the kernel data blocks to the slave processing circuits, and sends the plurality of operation instructions to the slave processing circuits;

each slave processing circuit performs, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and transfers the operation result to the main processing circuit;

the main processing circuit splices the operation results to obtain the convolution result.

Performing the i-th-layer convolution backward operation includes:

the controller unit obtains the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction, parses the backward computation instruction to obtain a plurality of backward operation instructions, and sends the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit selects, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcasts the i-th-layer output data gradient to the slave processing circuits, splits the i-th-layer backward input data into a plurality of backward input data blocks, and distributes the backward input data blocks and the backward operation instructions to the slave processing circuits;

each slave processing circuit performs, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and returns the vector operation result to the main processing circuit;

the main processing circuit determines the i-th-layer convolution kernel gradient from the vector operation results, and performs an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
In some embodiments, the electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, earphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1A is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Fig. 1B is a structural diagram of a computing device provided by an embodiment of the present application.
Fig. 1C is a structural diagram of a computing device provided by another embodiment of the present application.
Fig. 1D is a structural diagram of a main processing circuit provided by an embodiment of the present application.
Fig. 1E is a structural diagram of another computing device provided by an embodiment of the present application.
Fig. 1F is a schematic structural diagram of a tree module provided by an embodiment of the present application.
Fig. 1G is a structural diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1H is a structural diagram of still another computing device provided by an embodiment of the present application.
Fig. 2 is a structural diagram of a combined processing apparatus provided by an embodiment of the present application.
Fig. 2A is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Fig. 3 is a structural diagram of another combined processing apparatus provided by an embodiment of the present application.
Fig. 3A is a schematic structural diagram of a board card provided by an embodiment of the present application.
Fig. 4A is a layer-hierarchy diagram of a convolutional neural network provided by an embodiment of the present application.
Fig. 4B is a schematic diagram of the i-th-layer forward operation provided by an embodiment of the present application.
Fig. 4C is a schematic diagram of the i-th-layer backward operation provided by an embodiment of the present application.
Fig. 4D is a schematic diagram of a splicing result provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

The terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The computing device used in the present application is introduced first. Referring to Fig. 1A, a computing device is provided. The computing device is configured to perform convolutional neural network training operations, and includes a controller unit 11 and an operation unit 12, wherein the controller unit 11 is connected to the operation unit 12. The operation unit 12 includes a main processing circuit 101 and slave processing circuits 102 (there may be one or more slave processing circuits, and a plurality of slave processing circuits is preferred).

It should be noted that the main processing circuit itself includes a memory (such as a RAM or registers) that can store some data of the main processing circuit; the slave processing circuits may optionally carry memories as well.
The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. A schematic diagram of the multilayer structure of the convolutional neural network may be as shown in Fig. 4A. It should be noted that the i-th layer may be any one of the α layers; for convenience of illustration, the i-th layer in Fig. 4A is taken as a middle layer. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The computing device is configured to perform the i-th-layer convolution forward operation and the i-th-layer convolution backward operation.
Performing the i-th-layer convolution forward operation specifically includes:

the controller unit is configured to obtain the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction;

the controller unit is further configured to parse the forward computation instruction to obtain a plurality of forward operation instructions, and to send the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit is configured to broadcast the input data to the slave processing circuits, split the convolution kernel into a plurality of kernel data blocks, distribute the kernel data blocks to the slave processing circuits, and send the plurality of operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and to transfer the operation result to the main processing circuit;

the main processing circuit is configured to splice the operation results to obtain the convolution result.
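The forward data flow described above can be sketched in software. The sketch below is illustrative only: NumPy arrays stand in for the hardware circuits, and all shapes are hypothetical. The "main circuit" broadcasts the input and splits the kernels into blocks, each "slave circuit" convolves the broadcast input with its kernel block, and the "main circuit" splices the partial results into the full convolution result.

```python
import numpy as np

def conv2d(x, k):
    """Plain 2-D convolution, stride 1, no padding (reference computation)."""
    co, ci, kh, kw = k.shape
    _, h, w = x.shape
    out = np.zeros((co, h - kh + 1, w - kw + 1))
    for o in range(co):
        for r in range(h - kh + 1):
            for c in range(w - kw + 1):
                out[o, r, c] = np.sum(x[:, r:r + kh, c:c + kw] * k[o])
    return out

def forward_master_slave(x, kernels, n_slaves):
    # "Main circuit": broadcast x; split kernels into blocks by output channel.
    blocks = np.array_split(kernels, n_slaves, axis=0)
    # Each "slave circuit" convolves the broadcast input with its kernel block.
    partials = [conv2d(x, blk) for blk in blocks if blk.size]
    # "Main circuit": splice the partial results into the convolution result.
    return np.concatenate(partials, axis=0)

x = np.random.rand(3, 8, 8)      # hypothetical input: 3 channels, 8x8
k = np.random.rand(6, 3, 3, 3)   # hypothetical: 6 output channels, 3x3 kernels
ref = conv2d(x, k)               # single-circuit reference
par = forward_master_slave(x, k, n_slaves=4)
assert np.allclose(ref, par)     # splicing reproduces the full convolution
```

Splitting along the output-channel dimension is one natural choice here, since each slave's partial result is then a disjoint slice of the output that can simply be concatenated.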
Performing the i-th-layer convolution backward operation specifically includes:

the controller unit is further configured to obtain the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction.
The i-th-layer output data gradient may be obtained from the backward input data gradient of the (i+1)-th layer. For example, the backward input data gradient of the (i+1)-th layer may be used directly as the i-th-layer output data gradient; alternatively, the result of multiplying the backward input data gradient of the (i+1)-th layer by h'(x) may be used as the i-th-layer output data gradient, where h'(x) is the derivative of the activation function of the i-th layer.
The controller unit is further configured to parse the backward computation instruction to obtain a plurality of backward operation instructions, and to send the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit is further configured to select, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcast the i-th-layer output data gradient to the slave processing circuits, split the i-th-layer backward input data into a plurality of backward input data blocks, and distribute the backward input data blocks and the backward operation instructions to the slave processing circuits.
Selecting, according to the convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data may specifically be selecting the i-th-layer backward input data according to information such as the size of the convolution window and its moving stride. For example, the i-th-layer input data may be cropped according to the size and moving stride of the convolution window to obtain the i-th-layer backward input data corresponding to that convolution window size and moving stride.
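As an illustrative sketch of this cropping (a single-channel 2-D input and specific window size and stride are assumed for simplicity), each convolution-window position yields one patch of the input data:

```python
import numpy as np

def select_backward_input(x, win, stride):
    """Crop the input data into one patch per convolution-window position,
    given the window size and moving stride."""
    h, w = x.shape
    patches = []
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            patches.append(x[r:r + win, c:c + win])
    return np.stack(patches)

x = np.arange(36, dtype=float).reshape(6, 6)   # hypothetical 6x6 input
p = select_backward_input(x, win=3, stride=1)
print(p.shape)   # (16, 3, 3): 4x4 window positions, each a 3x3 patch
```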
Each slave processing circuit is configured to perform, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and to return the vector operation result to the main processing circuit;

the main processing circuit is configured to determine the i-th-layer convolution kernel gradient from the vector operation results, and to perform an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
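The vector-by-vector multiplications described above effectively accumulate, for each output-gradient element, the product of that element with the input patch it was computed from. A minimal single-channel, stride-1 sketch (shapes are assumed purely for illustration):

```python
import numpy as np

def kernel_gradient(x, grad_out, kh, kw):
    """dL/dK for a stride-1, no-padding, single-channel convolution:
    each output-gradient element scales the input patch it came from."""
    oh, ow = grad_out.shape
    dk = np.zeros((kh, kw))
    for r in range(oh):
        for c in range(ow):
            # per-window "vector-multiply-vector", accumulated into dk
            dk += x[r:r + kh, c:c + kw] * grad_out[r, c]
    return dk

x = np.random.rand(5, 5)    # hypothetical 5x5 input
g = np.random.rand(3, 3)    # output gradient for a 3x3 kernel on 5x5 input
dk = kernel_gradient(x, g, 3, 3)
print(dk.shape)             # (3, 3), same shape as the convolution kernel
```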
Optionally, determining the i-th-layer convolution kernel gradient from the vector operation results specifically includes:

the main processing circuit is specifically configured to compute the square mean c corresponding to the convolution kernel gradients dw of all the slave computing modules of the i-th layer; when c is greater than a threshold t, all gradients are scaled as dw' = dw / c * t, and the value of the convolution kernel is updated according to the scaled convolution kernel gradient. Here dw is the convolution kernel gradient of a slave computing module.
Optionally, performing the update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel may specifically be, for example: updated i-th-layer convolution kernel = i-th-layer convolution kernel − i-th-layer convolution kernel gradient × learning rate. Of course, other update modes may also be used; the present application does not limit the specific method of the update.
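Taken together, the optional scaling and update steps resemble what is commonly called gradient clipping by norm followed by a gradient-descent update. The sketch below assumes a root-of-sum-of-squares norm for c, an SGD-style update rule, and hypothetical threshold and learning-rate values:

```python
import numpy as np

def clip_and_update(kernel, dw, threshold, lr):
    # c: square root of the sum of squares of the kernel-gradient elements
    # (one plausible reading of the "square mean" in the text)
    c = np.sqrt(np.sum(dw ** 2))
    if c > threshold:
        dw = dw / c * threshold    # dw' = dw / c * t
    return kernel - lr * dw        # assumed SGD-form update rule

kernel = np.ones((3, 3))
dw = np.full((3, 3), 10.0)         # gradient norm = 30, above the threshold
new_k = clip_and_update(kernel, dw, threshold=3.0, lr=0.1)
print(new_k[0, 0])   # 0.9: gradient was rescaled from norm 30 down to norm 3
```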
It should be noted that the training of the convolutional neural network above is described by taking the training of the i-th convolutional layer as an example only; the other layers of the convolutional neural network may be trained using conventional training methods, and the present application does not limit the training methods for the layers of the convolutional neural network other than the i-th layer.
In the technical solution provided by the present application, the operation unit is arranged in a one-master-multiple-slaves structure. For the forward operation, the data can be split according to the computation instruction of the forward operation, so that the part with the larger amount of computation can be computed in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption. For the backward operation, the i-th-layer backward input data is split and then distributed to the slave processing circuits, which perform the vector-by-vector multiplications; in this way the computation load can likewise be distributed to a plurality of processing circuits (the master-slave structure) for parallel computation, increasing the operation speed, saving operation time, and in turn reducing power consumption. The solution therefore has the advantages of a shorter training time and lower power consumption.
The training of the i-th layer may be one layer of operation in the convolutional neural network. For a multilayer convolutional neural network, the realization process is as follows. In the forward operation, after the forward operation of the previous layer is completed, the operation instruction of the next layer takes the output neurons computed in the operation unit as the input neurons of the next layer for operation (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer is completed, the operation instruction of the next layer takes the input-neuron gradients computed in the operation unit as the output-neuron gradients of the next layer for operation (or performs certain operations on those input-neuron gradients before using them as the output-neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
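The layer-by-layer chaining described above can be sketched as an ordinary training loop, in which each layer's output feeds the next layer forward and each layer's input gradient is handed backward as the previous layer's output gradient. The toy `DenseLayer` below is a hypothetical stand-in for one layer's forward and backward operations:

```python
import numpy as np

class DenseLayer:
    """Toy stand-in for one network layer with forward/backward operations."""
    def __init__(self, n_in, n_out, lr=0.1):
        self.w = np.random.randn(n_in, n_out) * 0.1
        self.lr = lr

    def forward(self, x):
        return x @ self.w

    def backward(self, inp, grad_out):
        # Input gradient, handed back as the previous layer's output gradient.
        grad_in = grad_out @ self.w.T
        # Weight update for this layer.
        self.w -= self.lr * np.outer(inp, grad_out)
        return grad_in

def train_step(layers, x, target):
    # Forward: each layer's output neurons are the next layer's input neurons.
    acts = [x]
    for layer in layers:
        acts.append(layer.forward(acts[-1]))
    # Backward: each layer's input gradient becomes the previous layer's
    # output gradient; weights are replaced layer by layer along the way.
    grad = acts[-1] - target            # gradient of 0.5 * ||y - t||^2
    for layer, inp in zip(reversed(layers), reversed(acts[:-1])):
        grad = layer.backward(inp, grad)
    return 0.5 * np.sum((acts[-1] - target) ** 2)

np.random.seed(0)
net = [DenseLayer(4, 8), DenseLayer(8, 2)]
x, t = np.random.rand(4), np.array([1.0, 0.0])
losses = [train_step(net, x, t) for _ in range(50)]
assert losses[-1] < losses[0]   # loss decreases over training steps
```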
For artificial neural network operation, if the artificial neural network operation has multiple layers of operation, the input neurons and output neurons of the multilayer operation do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons. Taking a convolutional neural network as an example, suppose the convolutional neural network has L layers, with K = 1, 2, ..., L−1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, and the neurons therein are the input neurons; the (K+1)-th layer is called the output layer, and the neurons therein are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the computing device may further include a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one or any combination of a register and a cache. Specifically, the cache is configured to store the computation instruction; the register is configured to store the input data and scalars; and the cache is a scratchpad cache. The direct memory access unit 50 is configured to read data from or store data to the storage unit 10.
Optionally, the controller unit includes an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;

the instruction storage unit 110 is configured to store computation instructions associated with the artificial neural network operation;

the instruction processing unit 111 is configured to parse the computation instruction to obtain a plurality of operation instructions;

the storage queue unit 113 is configured to store an instruction queue, the instruction queue including a plurality of operation instructions or computation instructions to be executed in the front-to-back order of the queue.
For example, in an optional technical solution, the main operation processing circuit may also include a controller unit, and this controller unit may include a master instruction processing unit specifically configured to decode instructions into micro-instructions. Of course, in another optional solution, each slave operation processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process micro-instructions. A micro-instruction may be the next-level instruction below an instruction: it can be obtained by splitting or decoding an instruction, and can be further decoded into control signals for the various components, units, or processing circuits.
In one optional solution, the structure of the computation instruction may be as shown in the following table.

Operation code | Register or immediate | Register/immediate | ... |

The ellipsis in the table above indicates that the instruction may include multiple registers or immediates.
In another optional solution, the computation instruction may include one or more operation fields and one operation code. The computation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation fields, and each of register number 0 through register number 4 may be the number of one or more registers.
The register may be an off-chip memory or, of course, in practical applications, an on-chip memory, for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n = 1 the data is 1-dimensional, i.e., a vector; when n = 2 it is 2-dimensional, i.e., a matrix; and when n = 3 or more it is a multidimensional tensor.
Optionally, the computing device may further include:
a dependency processing unit 108, which, when there are multiple operation instructions, is used to determine whether a first operation instruction has a dependency on a zeroth operation instruction that precedes it. If the first operation instruction and the zeroth operation instruction have a dependency, the first operation instruction is buffered in the instruction storage unit; after the zeroth operation instruction has finished executing, the first operation instruction is fetched from the instruction storage unit and transmitted to the arithmetic unit.
Determining whether the first operation instruction has a dependency on the zeroth operation instruction that precedes it includes:
extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction (for example, a matrix), and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction. If the first storage address interval and the zeroth storage address interval have an overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have a dependency; if the two intervals have no overlapping region, it is determined that they have no dependency.
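The dependency check above reduces to an interval-overlap test on operand addresses. A minimal sketch, assuming instructions are represented as dictionaries with an `addr_range` field (a hypothetical representation, not part of the patent):

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b)
    share at least one address."""
    return start_a < end_b and start_b < end_a

def has_dependency(first_instr, zeroth_instr):
    """A later instruction depends on an earlier one iff their operand
    storage address intervals overlap, so it must wait for the earlier
    instruction to finish."""
    a0, a1 = first_instr["addr_range"]
    b0, b1 = zeroth_instr["addr_range"]
    return intervals_overlap(a0, a1, b0, b1)

# A dependent pair: the first instruction touches addresses the zeroth uses.
i0 = {"addr_range": (0x1000, 0x1400)}
i1 = {"addr_range": (0x1200, 0x1800)}
assert has_dependency(i1, i0)                              # overlap -> buffer and wait
assert not has_dependency({"addr_range": (0x2000, 0x2400)}, i0)  # disjoint -> issue freely
```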
In another alternative embodiment, the arithmetic unit 12 may include a main processing circuit 101 and multiple slave processing circuits 102, as shown in Figure 1C. In one embodiment, as shown in Figure 1C, the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k of the multiple slave processing circuits. The k slave processing circuits are: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1. It should be noted that the k slave processing circuits shown in Figure 1C include only the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1; that is, the k slave processing circuits are those slave processing circuits, among the multiple slave processing circuits, that are directly connected to the main processing circuit.
The k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, as shown in Figure 1D, the main processing circuit may further include one of, or any combination of: a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
the conversion processing circuit 110 is used to perform, on the data block or intermediate result received by the main processing circuit, an exchange between a first data structure and a second data structure (for example, a conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (for example, a conversion between fixed-point and floating-point types);
the activation processing circuit 111 is used to execute the activation operation on data in the main processing circuit;
the addition processing circuit 112 is used to execute addition or accumulation operations.
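The fixed-point/floating-point exchange mentioned for the conversion processing circuit can be sketched in software. A minimal illustration, assuming a signed fixed-point format with a chosen number of fractional bits (the patent does not specify the format):

```python
def float_to_fixed(x, frac_bits=8):
    """Quantize a float to a signed fixed-point integer with frac_bits
    fractional bits: one possible first-type -> second-type exchange."""
    return int(round(x * (1 << frac_bits)))

def fixed_to_float(q, frac_bits=8):
    """The reverse exchange: fixed-point integer back to float."""
    return q / (1 << frac_bits)

q = float_to_fixed(1.5)            # 1.5 * 256 = 384
assert q == 384
assert fixed_to_float(q) == 1.5    # exactly representable, round-trips losslessly
```

Values that are not multiples of 2^-frac_bits round to the nearest representable fixed-point value, which is the usual cost of such a conversion.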
The main processing circuit is used to determine that the input data is broadcast data and the convolution kernel is distribution data, split the convolution kernel into multiple kernel data blocks, and send at least one of the multiple kernel data blocks and at least one of the multiple operation instructions to the k slave processing circuits;
the k slave processing circuits are used to forward the kernel data blocks, input data, and operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute the convolution operation on the received kernel data blocks and input data according to the operation instructions to obtain operation results, and transfer the operation results to the k slave processing circuits;
the main processing circuit is used to splice the operation results sent by the k slave processing circuits to obtain the convolution result, and send the convolution result to the controller unit;
the main processing circuit is also used to broadcast the layer-i output result gradient to the k slave processing circuits, select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the k slave processing circuits;
the k slave processing circuits are also used to forward the reverse input data blocks, vector operation results, layer-i output result gradient, and reverse operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the k slave processing circuits;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
The slave processing circuit includes a multiplication processing circuit;
the multiplication processing circuit is used to execute a product operation on the received data blocks to obtain a product result;
a forwarding processing circuit (optional) is used to forward the received data blocks or the product result;
an accumulation processing circuit is used to execute an accumulation operation on the product results to obtain an intermediate result.
In another embodiment, the operation instruction is a matrix computation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction, or an activation instruction.
The specific computation method of the computing device shown in Figure 1A is illustrated below using a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed can be: s = s(Σ w·x_i + b); that is, the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and then the activation operation s(h) is applied to obtain the final output result s.
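The formula s = s(Σ w·x_i + b) can be sketched directly. A minimal illustration, using tanh as a stand-in activation (the patent does not fix a particular activation function):

```python
import math

def neuron_forward(weights, inputs, bias, activation=math.tanh):
    """s = s(sum_i w_i * x_i + b): multiply each weight by its input,
    sum the products, add the bias, then apply the activation."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(h)

# h = 0.5*2.0 + (-0.25)*4.0 + 1.0 = 1.0, so the output is tanh(1.0)
out = neuron_forward([0.5, -0.25], [2.0, 4.0], 1.0)
assert abs(out - math.tanh(1.0)) < 1e-12
```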
In an optional embodiment, referring to Figure 1E, the arithmetic unit includes a tree module 40. The tree module includes a root port 401 and multiple branch ports 404; the root port of the tree module is connected to the main processing circuit, and each of the multiple branch ports of the tree module is connected to one of the multiple slave processing circuits.
The above tree module has transmitting and receiving functions; for example, referring to Figure 1E, the tree module transmits, and as shown in Figure 2A, the tree module receives.
The tree module is used to forward data blocks, weights, and operation instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. A node is a line structure with forwarding capability; the node itself may have no computing capability. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in Figure 1F, or a ternary tree structure, where n may be an integer greater than or equal to 2. The specific embodiments of the application do not limit the specific value of n. The number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in Figure 1F.
Optionally, the above arithmetic unit may carry separate caches; as shown in Figure 1G, it may include a neuron cache unit 63, which caches the input neuron vector data and output neuron value data of the slave processing circuits.
As shown in Figure 1H, the arithmetic unit may also include a weight cache unit 64, for caching the weight data needed by the slave processing circuits during the computation.
In an alternative embodiment, the arithmetic unit 12 may include a branch processing circuit 103 as shown in Figure 1B; its specific connection structure is shown in Figure 1B, wherein
the branch processing circuit 103 may include a memory. As shown in Figure 1B, the size of the memory of the branch processing circuit 103 can be between 2 and 2.5 times the maximum data capacity that a single slave processing circuit needs to store. With this arrangement, the slave processing circuits do not need their own memories: relative to one branch processing circuit, only 2.5*R needs to be provided (R being the capacity required by a single slave processor circuit), whereas without a branch processing circuit 4*R would need to be provided, and register utilization would also be low. This structure can therefore effectively reduce the total memory capacity and reduce cost.
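The capacity comparison above can be made concrete. A minimal sketch, assuming one branch circuit serves four slave circuits (the 4*R figure in the text suggests this, but the fan-out is not stated explicitly):

```python
def total_memory(num_slaves, R, with_branch=True):
    """Total buffer capacity needed: one shared branch memory of 2.5*R
    versus a private memory of R in every slave processing circuit."""
    return 2.5 * R if with_branch else num_slaves * R

R = 1024  # bytes needed by a single slave circuit (illustrative value)
assert total_memory(4, R, with_branch=True) == 2560.0   # shared branch memory
assert total_memory(4, R, with_branch=False) == 4096    # 4*R without a branch
```

Under these assumptions the shared branch memory saves 1.5*R per group of four slaves, which is the cost reduction the text claims.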
The main processing circuit is specifically used to determine that the input data is broadcast data and the convolution kernel is distribution data, split the convolution kernel into multiple kernel data blocks, and send at least one of the multiple kernel data blocks, the input data, and at least one of the multiple operation instructions to the branch processing circuit;
the branch processing circuit is used to forward the kernel data blocks, input data, and operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute the convolution operation on the received kernel data blocks and input data according to the operation instructions to obtain operation results, and transfer the operation results to the branch processing circuit;
the main processing circuit is used to splice the operation results sent by the branch processing circuit to obtain the convolution result;
the main processing circuit is also used to broadcast the layer-i output result gradient to the branch processing circuit, select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the branch processing circuit;
the branch processing circuit is also used to forward the reverse input data blocks, vector operation results, layer-i output result gradient, and reverse operation instructions between the main processing circuit and the multiple slave processing circuits;
the slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the branch processing circuit;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results forwarded by the branch processing circuit, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
On the other hand, the application also provides a convolutional neural network training method, the method being applied to a computing device. The convolutional neural network includes α layers, and at least layer i of the α layers is a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The convolutional neural network training method includes at least: executing a layer-i convolution forward operation and executing a layer-i convolution reverse operation.
Executing the layer-i convolution forward operation, whose forward operation schematic is shown in Figure 4B, includes:
the controller unit obtains the layer-i input data, the layer-i convolution kernel, and a layer-i forward computation instruction; it parses the forward computation instruction to obtain multiple forward operation instructions, and sends the multiple operation instructions, the input data, and the convolution kernel to the main processing circuit;
the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into multiple kernel data blocks, distributes the multiple kernel data blocks to the slave processing circuits, and sends the multiple operation instructions to the slave processing circuits;
the slave processing circuits execute the convolution operation on the input data and the received kernel data blocks according to the operation instructions to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit performs splicing processing on the operation results to obtain the convolution result.
Executing the layer-i convolution reverse operation, whose reverse operation schematic is shown in Figure 4C, includes:
the controller unit obtains the layer-i output data gradient, the layer-i convolution kernel, the layer-i input data, and a reverse computation instruction; it parses the reverse computation instruction to obtain multiple reverse operation instructions, and sends the reverse operation instructions, the layer-i output data gradient, the layer-i convolution kernel, and the layer-i input data to the main processing circuit;
the main processing circuit selects the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, broadcasts the layer-i output data gradient to the slave processing circuits, splits the layer-i reverse input data into multiple reverse input data blocks, and distributes the multiple reverse input data blocks and multiple reverse operation instructions to the slave processor circuits;
the slave processing circuits execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit determines the layer-i convolution kernel gradient according to the vector operation results, and executes an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
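The forward/reverse pair above can be sketched for a single 2-D input channel. A minimal single-threaded illustration (no actual distribution to slave circuits; the learning rate 0.01 and the all-ones gradient are illustrative assumptions):

```python
import numpy as np

def conv2d_forward(x, k):
    """Valid 2-D convolution (cross-correlation) of input x with kernel k:
    each output element is a window-times-kernel product sum."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv2d_kernel_grad(x, dout):
    """Kernel gradient of the forward pass above: for each kernel element,
    the vector-times-vector product of the output gradient with the
    matching window of reverse input data (i.e., cross-correlate x with dout)."""
    H, W = x.shape
    oh, ow = dout.shape
    dk = np.empty((H - oh + 1, W - ow + 1))
    for r in range(dk.shape[0]):
        for c in range(dk.shape[1]):
            dk[r, c] = np.sum(x[r:r + oh, c:c + ow] * dout)
    return dk

x = np.arange(16, dtype=float).reshape(4, 4)   # layer-i input data
k = np.ones((2, 2))                            # layer-i convolution kernel
y = conv2d_forward(x, k)                       # forward operation result
dout = np.ones_like(y)                         # layer-i output data gradient
k_new = k - 0.01 * conv2d_kernel_grad(x, dout) # update operation on the kernel
assert y.shape == (3, 3) and k_new.shape == (2, 2)
```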
Referring to Figure 4D, Figure 4D is a splicing schematic diagram of the splicing processing of the operation results into the convolution result. The splicing mode is shown in Figure 4D: for each operation result, determine the minimum column number and minimum row number among the input data elements that produced it; the position of that operation result in the convolution result is then the column given by the column-number minimum and the row given by the row-number minimum. Traversing all operation results and splicing them according to the above principle yields the convolution result.
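The splicing rule can be sketched as a scatter of partial results into their (min row, min column) positions. A minimal illustration, assuming each partial result is already tagged with those minima (the tagging scheme is an assumption; the patent only describes the placement rule):

```python
def splice_results(partial_results):
    """Splice per-block operation results into the convolution result.
    Each entry is (min_row, min_col, value): the minimum row/column index
    of the input elements that produced it, which is its output position."""
    max_r = max(r for r, _, _ in partial_results)
    max_c = max(c for _, c, _ in partial_results)
    out = [[0.0] * (max_c + 1) for _ in range(max_r + 1)]
    for min_row, min_col, value in partial_results:
        out[min_row][min_col] = value
    return out

# Four partial results arriving in arbitrary order still land correctly.
parts = [(1, 1, 4.0), (0, 0, 1.0), (1, 0, 3.0), (0, 1, 2.0)]
assert splice_results(parts) == [[1.0, 2.0], [3.0, 4.0]]
```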
The application also discloses a convolution training device, which includes one or more of the computing devices mentioned in this application, and is used to obtain data to be operated on and control information from other processing devices, execute the specified machine learning operations, and pass the execution results to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices can be linked through a specific structure and transmit data, for example interconnected and transmitting data via a PCIe bus, to support larger-scale convolutional neural network training operations. In this case, the computing devices may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection mode can be any interconnection topology.
The convolution training device has high compatibility and can be connected to various types of servers through a PCIe interface.
The application also discloses a combined processing device, which includes the above convolution training device, a universal interconnection interface, and other processing devices. The machine learning computation device interacts with the other processing devices to jointly complete the operation specified by the user. Figure 2 is a schematic diagram of the combined processing device.
The other processing devices include one or more of general-purpose/special-purpose processors such as central processing units (CPU), graphics processing units (GPU), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computation device and external data and control, including data transfer, and perform basic control of the machine learning computation device such as starting and stopping; the other processing devices may also cooperate with the machine learning computation device to jointly complete computation tasks.
The universal interconnection interface is used to transmit data and control instructions between the convolution training device and the other processing devices. The convolution training device obtains the required input data from the other processing devices and writes it to the on-chip storage of the convolution training device; it can obtain control instructions from the other processing devices and write them to the on-chip control cache of the convolution training device; it can also read data from the storage module of the convolution training device and transfer it to the other processing devices.
Optionally, the structure is as shown in Figure 3 and may also include a storage device, which is connected to the convolution training device and the other processing devices respectively. The storage device is used to store data held in the convolution training device and the other processing devices, and is especially suitable for data to be operated on that cannot be fully held in the internal storage of the convolution training device or the other processing devices.
The combined processing device can serve as a system on chip (SoC) for equipment such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, improving processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.
In some embodiments, a chip is also claimed, which includes the above convolution training device or combined processing device.
In some embodiments, a chip packaging structure is claimed, which includes the above chip.
In some embodiments, a board card is claimed, which includes the above chip packaging structure. Referring to Figure 3A, which provides a board card: in addition to the above chip 389, the board card may also include other supporting components, which include but are not limited to a memory device 390, an interface device 391, and a control device 392;
the memory device 390 is connected to the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units can be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising edge and the falling edge of the clock pulse; the speed of DDR is thus twice that of standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units, and each group of storage units may include multiple DDR4 particles (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers; of the 72 bits, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
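The 25600 MB/s figure follows directly from the data-path width and transfer rate. A one-line sketch of the calculation:

```python
def ddr_bandwidth_mb_s(transfer_rate_mt_s, data_width_bits):
    """Theoretical bandwidth = transfers per second * bytes per transfer.
    The DDR4-3200 rating already counts both clock edges (3200 MT/s)."""
    return transfer_rate_mt_s * data_width_bits // 8

# 64-bit data path (72-bit controller minus 8 ECC bits) at DDR4-3200:
assert ddr_bandwidth_mb_s(3200, 64) == 25600  # MB/s, matching the text
```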
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip packaging structure. The interface device is used to realize data transmission between the chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device can be a standard PCIe interface; data to be processed is transferred from the server to the chip through the standard PCIe interface, realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device can also be another interface; the application does not limit the specific form of the above other interfaces, provided the interface unit can realize the transfer function. In addition, the calculation results of the chip are still sent back to the external equipment (such as a server) by the interface device.
The control device is electrically connected to the chip. The control device is used to monitor the state of the chip. Specifically, the chip can be electrically connected to the control device through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and can drive multiple loads; the chip can therefore be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, an electronic device is claimed, which includes the above board card.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an aircraft, ship, and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; the medical device includes a nuclear magnetic resonance instrument, B-ultrasound instrument, and/or electrocardiograph.
It should be noted that, for each of the foregoing method embodiments, for simplicity of description they are all expressed as a series of action combinations, but those skilled in the art should understand that the application is not limited by the described action sequence, because according to the application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be realized in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and in actual implementation there may be another division manner: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the application may be integrated in one processing unit, or each unit may physically exist separately, or two or more units may be integrated in one unit. The above integrated unit can be realized in the form of hardware or in the form of a software program module.
If the integrated unit is realized in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of each embodiment of the application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), mobile hard disk, magnetic disk, or optical disk.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable memory, and the memory may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and the like.
The embodiments of the application have been described in detail above; specific examples have been used herein to explain the principles and implementations of the application, and the descriptions of the above embodiments are only intended to help understand the method of the application and its core idea. At the same time, those skilled in the art may, according to the idea of the application, make changes in the specific implementations and scope of application. In summary, the contents of this specification should not be construed as limiting the application.
Claims (20)
1. A computing device, characterized in that the computing device is used for executing a convolutional neural network training operation, the convolutional neural network includes α layers, and at least layer i of the α layers is a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α; the computing device is used for executing a layer-i convolution forward operation and executing a layer-i convolution reverse operation;
executing the layer-i convolution forward operation specifically includes:
the controller unit is used to obtain the layer-i input data, the layer-i convolution kernel, and a layer-i forward computation instruction;
the controller unit is also used to parse the forward computation instruction to obtain multiple forward operation instructions, and send the multiple operation instructions, the input data, and the convolution kernel to the main processing circuit;
the main processing circuit is used to broadcast the input data to the slave processing circuits, split the convolution kernel into multiple kernel data blocks, distribute the multiple kernel data blocks to the slave processing circuits, and send the multiple operation instructions to the slave processing circuits;
the slave processing circuits are used to execute the convolution operation on the input data and the received kernel data blocks according to the operation instructions to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit is used to perform splicing processing on the operation results to obtain the convolution result;
executing the layer-i convolution reverse operation specifically includes:
the controller unit is also used to obtain the layer-i output data gradient, the layer-i convolution kernel, the layer-i input data, and a reverse computation instruction;
the controller unit is also used to parse the reverse computation instruction to obtain multiple reverse operation instructions, and send the reverse operation instructions, the layer-i output data gradient, the layer-i convolution kernel, and the layer-i input data to the main processing circuit;
the main processing circuit is also used to select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, broadcast the layer-i output data gradient to the slave processing circuits, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the slave processor circuits;
the slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
2. The apparatus according to claim 1, characterized in that determining the layer-i convolution kernel gradient according to the vector operation result specifically includes:
the main processing circuit is specifically used to compute the quadratic mean c of the convolution kernel gradients dw of all slave computing modules of layer i; when c is greater than a threshold t, all gradients are scaled as dw' = dw/c*t, and the value of the convolution kernel is updated according to the scaled convolution kernel gradient, where dw is the convolution kernel gradient of a slave computing module.
3. The device according to claim 1, characterized in that the computing device further includes a storage unit and a direct memory access unit, the storage unit including any combination of a register and a cache;
the cache is configured to store the input data and the convolution kernel;
the register is configured to store scalar data within the input data;
the cache includes a scratch pad memory;
the controller unit includes an instruction storage unit, an instruction processing unit and a storage queue unit;
the instruction storage unit is configured to store the computation instructions associated with the convolutional neural network training operation;
the instruction processing unit is configured to parse the computation instructions to obtain a plurality of operation instructions;
the storage queue unit is configured to store an instruction queue, the instruction queue including a plurality of operation instructions or computation instructions to be executed in the order of the queue;
the main processing circuit includes a dependency processing unit;
the dependency processing unit is configured to determine whether a first operation instruction and a zeroth operation instruction preceding the first operation instruction have a dependency relationship; if the first operation instruction and the zeroth operation instruction have a dependency relationship, the first operation instruction is buffered in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is fetched from the instruction storage unit and transmitted to the arithmetic unit;
determining whether the first operation instruction and the zeroth operation instruction preceding the first operation instruction have a dependency relationship includes:
extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have a dependency relationship; if the first storage address interval and the zeroth storage address interval have no overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have no dependency relationship.
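The claim-3 dependency test reduces to an interval-overlap check on the two instructions' storage address ranges. A minimal sketch, assuming half-open (start, end) intervals, which the claim does not specify:

```python
def has_dependency(first_interval, zeroth_interval):
    """True iff the first instruction's address interval overlaps the zeroth's."""
    s1, e1 = first_interval
    s0, e0 = zeroth_interval
    return s1 < e0 and s0 < e1   # standard overlap test for half-open intervals

# Overlapping intervals -> the first instruction must wait for the zeroth.
print(has_dependency((0, 64), (32, 96)))   # True: bytes 32..63 are shared
print(has_dependency((0, 32), (32, 64)))   # False: disjoint half-open ranges
```

When the check returns True, the first instruction is parked in the instruction storage unit until the zeroth completes, which is exactly the buffering behavior the claim describes.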
4. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the arithmetic unit includes a tree module, the tree module including one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and the branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits;
the tree module is configured to forward input data, convolution kernels, forward operation instructions, operation results, backward operation instructions and input data gradients between the main processing circuit and the plurality of slave processing circuits.
5. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
the main processing circuit is specifically configured to determine that the input data is broadcast data and the convolution kernel is distribution data, to split the convolution kernel into a plurality of kernel data blocks, and to send at least one kernel data block of the plurality of kernel data blocks, the input data and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits are configured to forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the operation instruction, a convolution operation on the received kernel data block and input data to obtain an operation result, and to transfer the operation result to the branch processing circuit;
the main processing circuit is configured to splice the operation results sent by the branch processing circuits to obtain the convolution result;
the main processing circuit is further configured to broadcast the i-th layer output data gradient to the branch processing circuits, to select, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, to split the i-th layer backward input data into a plurality of backward input data blocks, and to distribute the plurality of backward input data blocks and the plurality of backward operation instructions to the branch processing circuits;
the branch processing circuits are further configured to forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
each slave processing circuit is configured to perform, according to the backward operation instruction it receives, a vector-times-vector operation on the backward input data block it receives and the i-th layer output data gradient to obtain a vector operation result, and to return the vector operation result to the branch processing circuit;
the main processing circuit is configured to determine the i-th layer convolution kernel gradient according to the vector operation results forwarded by the branch processing circuits, and to perform an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
6. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column;
the k slave processing circuits are configured to forward data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is configured to determine that the input data is broadcast data and the convolution kernel is distribution data, to split the convolution kernel into a plurality of kernel data blocks, and to send at least one kernel data block of the plurality of kernel data blocks and at least one of the plurality of operation instructions to the k slave processing circuits;
the k slave processing circuits are configured to forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the operation instruction, a convolution operation on the received kernel data block and input data to obtain an operation result, and to transfer the operation result to the k slave processing circuits;
the main processing circuit is configured to splice the operation results sent by the k slave processing circuits to obtain the convolution result, and to send the convolution result to the controller unit;
the main processing circuit is further configured to broadcast the i-th layer output data gradient to the k slave processing circuits, to select, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, to split the i-th layer backward input data into a plurality of backward input data blocks, and to distribute the plurality of backward input data blocks and the plurality of backward operation instructions to the k slave processing circuits;
the k slave processing circuits are further configured to forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the backward operation instruction received, a vector-times-vector operation on the received backward input data block and the i-th layer output data gradient to obtain a vector operation result, and to return the vector operation result to the k slave processing circuits;
the main processing circuit is configured to determine the i-th layer convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and to perform an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
7. The device according to any one of claims 5 to 6, characterized in that the main processing circuit is specifically configured to combine and order the operation results sent by the plurality of processing circuits to obtain the convolution result.
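The splicing step of claim 7 can be sketched as follows: the master combines the partial operation results and orders them (here, per-output-channel feature maps tagged with their channel index) into the convolution result. The tagging scheme is an illustrative assumption, not specified by the claim.

```python
import numpy as np

def splice_results(tagged_results):
    """tagged_results: list of (channel_index, feature_map) pairs from the slaves."""
    ordered = sorted(tagged_results, key=lambda pair: pair[0])  # combine + order
    return np.stack([fm for _, fm in ordered])                  # splice channels

# Results may arrive out of order; splicing restores channel order.
parts = [(1, np.ones((2, 2))), (0, np.zeros((2, 2)))]
y = splice_results(parts)
print(y.shape)   # (2, 2, 2), with channel 0 first
```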
8. The device according to any one of claims 5 to 6, characterized in that the main processing circuit includes a conversion processing circuit;
the conversion processing circuit is configured to perform conversion processing on data, specifically: to perform an exchange between a first data structure and a second data structure on the input data, convolution kernel or convolution result received by the main processing circuit; or to perform an exchange between a first data type and a second data type on the input data, convolution kernel or convolution result received by the main processing circuit.
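The two conversions of claim 8 can be sketched concretely. The claim leaves the structures and types abstract; here the data-structure exchange is interpreted as an NCHW-to-NHWC layout transpose and the data-type exchange as float32-to-float16, both of which are illustrative assumptions:

```python
import numpy as np

def convert_layout(x_nchw):
    """First data structure (NCHW) -> second data structure (NHWC)."""
    return np.transpose(x_nchw, (0, 2, 3, 1))

def convert_dtype(x, dtype=np.float16):
    """First data type (e.g. float32) -> second data type (e.g. float16)."""
    return x.astype(dtype)

x = np.zeros((1, 3, 8, 8), dtype=np.float32)
print(convert_layout(x).shape)   # (1, 8, 8, 3)
print(convert_dtype(x).dtype)    # float16
```

Such a circuit lets the device accept data in whatever layout and precision the host provides and normalize it before distribution to the slave processing circuits.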
9. The device according to claim 5 or 6, characterized in that each slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit;
the multiplication processing circuit is configured to perform a product operation on the element values of the received kernel data block and the element values at the corresponding positions of the input data to obtain product results;
the accumulation processing circuit is configured to perform an accumulation operation on the product results to obtain the convolution result.
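The multiply-then-accumulate pipeline of claim 9 reduces to a single fused step per output element. A minimal sketch with illustrative names:

```python
import numpy as np

def slave_mac(kernel_block, input_block):
    """One slave's work for one output element."""
    products = kernel_block * input_block   # multiplication processing circuit
    return products.sum()                   # accumulation processing circuit

k = np.array([[1.0, 0.0],
              [0.0, 1.0]])                  # kernel data block
x = np.array([[2.0, 3.0],
              [4.0, 5.0]])                  # corresponding input positions
print(slave_mac(k, x))                      # 2.0 + 5.0 = 7.0
```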
10. The device according to claim 4, characterized in that the tree module is an n-ary tree structure, n being an integer greater than or equal to 2.
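The point of the n-ary tree of claims 4 and 10 is that a broadcast from the root (the main processing circuit) reaches p leaves (slave processing circuits) in a number of forwarding levels that grows only logarithmically in p. A sketch of this latency model, which is an illustrative interpretation rather than the patent's stated metric:

```python
def broadcast_levels(num_slaves, n):
    """Forwarding levels for an n-ary tree whose leaves are the slaves."""
    assert n >= 2, "claim 10 requires n >= 2"
    levels, reach = 0, 1
    while reach < num_slaves:   # each level multiplies fan-out by n
        reach *= n
        levels += 1
    return levels

print(broadcast_levels(16, 2))  # binary tree: 4 levels
print(broadcast_levels(16, 4))  # quad tree: 2 levels
```

Raising n trades wider (costlier) ports for fewer forwarding hops, which is the design knob the n-ary parameter exposes.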
11. A convolution training device, characterized in that the convolution training device includes one or more computing devices according to any one of claims 1 to 10, and is configured to acquire the data to be operated on and control information from other processing devices, to execute the specified convolution operation, and to pass the execution result to the other processing devices through an I/O interface;
when the convolution training device includes a plurality of the computing devices, the plurality of computing devices may be connected to one another and transmit data through a specific structure;
wherein the plurality of computing devices are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus so as to support larger-scale machine learning operations; the plurality of computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memories; and the interconnection mode of the plurality of computing devices is any interconnection topology.
12. A combined processing device, characterized in that the combined processing device includes the convolution training device according to claim 11, a universal interconnection interface and other processing devices;
the convolution training device interacts with the other processing devices to jointly complete the computing operation specified by the user.
13. The combined processing device according to claim 12, characterized by further including a storage device, the storage device being respectively connected to the convolution training device and the other processing devices, and configured to save the data of the convolution training device and the other processing devices.
14. A neural network chip, characterized in that the neural network chip includes the computing device according to claim 1, or the convolution training device according to claim 11, or the combined processing device according to claim 13.
15. An electronic apparatus, characterized in that the electronic apparatus includes the chip according to claim 14.
16. A board card, characterized in that the board card includes a memory device, an interface device, a control device and the neural network chip according to claim 15;
wherein the neural network chip is respectively connected to the memory device, the control device and the interface device;
the memory device is configured to store data;
the interface device is configured to realize data transmission between the chip and external equipment;
the control device is configured to monitor the state of the chip.
17. The board card according to claim 16, characterized in that:
the memory device includes a plurality of groups of storage units, each group of storage units being connected to the chip by a bus, the storage units being DDR SDRAM;
the chip includes a DDR controller configured to control the data transmission to and the data storage of each storage unit;
the interface device is a standard PCIE interface.
18. A convolutional neural network training method, characterized in that the method is applied to a computing device; the convolutional neural network includes α layers, at least the i-th of the α layers being a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α; the convolutional neural network training method includes at least: executing the i-th layer convolution forward operation and executing the i-th layer convolution backward operation;
executing the i-th layer convolution forward operation includes:
the controller unit acquires the i-th layer input data, the i-th layer convolution kernel and the i-th layer forward computation instruction; parses the forward computation instruction to obtain a plurality of forward operation instructions; and sends the input data, the convolution kernel and the plurality of operation instructions to the main processing circuit;
the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into a plurality of kernel data blocks, distributes the plurality of kernel data blocks to the slave processing circuits, and sends the plurality of operation instructions to the slave processing circuits;
the slave processing circuits perform, according to the operation instructions, a convolution operation on the input data and the received kernel data blocks to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit splices the operation results to obtain the convolution result;
executing the i-th layer convolution backward operation includes:
the controller unit acquires the i-th layer output data gradient, the i-th layer convolution kernel, the i-th layer input data and the backward computation instruction; parses the backward computation instruction to obtain a plurality of backward operation instructions; and sends the backward operation instructions, the i-th layer output data gradient, the i-th layer convolution kernel and the i-th layer input data to the main processing circuit;
the main processing circuit selects, according to a convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, broadcasts the i-th layer output data gradient to the slave processing circuits, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the slave processing circuits;
the slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
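The forward half of claim 18 can be sketched end to end for one layer: the master broadcasts the input, splits the kernels into blocks (here one output channel per block), each "slave" convolves its block, and the master splices the per-channel results. The per-output-channel partitioning and all names are illustrative assumptions; the patent leaves the split granularity open.

```python
import numpy as np

def correlate2d(x, k):
    """Valid cross-correlation of a 2-D input with a 2-D kernel."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def forward_layer(x, kernels, num_slaves=2):
    blocks = np.array_split(kernels, num_slaves)          # master: split kernels
    partial = [np.stack([correlate2d(x, k) for k in blk]) # each slave: convolve
               for blk in blocks if len(blk)]
    return np.concatenate(partial)                        # master: splice results

x = np.arange(16.0).reshape(4, 4)                 # broadcast input data
kernels = np.stack([np.eye(2), np.ones((2, 2))])  # two output-channel kernels
y = forward_layer(x, kernels)
print(y.shape)   # (2, 3, 3): two output channels of 3x3
```

The backward half then reuses the same distribution machinery with roles swapped: the output gradient is broadcast, windowed input patches are distributed, and the spliced products form the kernel gradient.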
19. The method according to claim 18, characterized in that when the number of slave processing circuits is plural, the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
executing the i-th layer convolution forward operation specifically includes:
the main processing circuit determines that the input data is broadcast data and the convolution kernel is distribution data, splits the convolution kernel into a plurality of kernel data blocks, and sends at least one kernel data block of the plurality of kernel data blocks, the input data and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the operation instructions, a convolution operation on the received kernel data blocks and input data to obtain operation results, and transfer the operation results to the branch processing circuits;
the main processing circuit splices the operation results sent by the branch processing circuits to obtain the convolution result;
executing the i-th layer convolution backward operation specifically includes:
the main processing circuit broadcasts the i-th layer output data gradient to the branch processing circuits, selects, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the branch processing circuits;
the branch processing circuits forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the branch processing circuits;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results forwarded by the branch processing circuits, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
20. The method according to claim 18, characterized in that when the number of slave processing circuits is plural, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column;
the k slave processing circuits forward data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
executing the i-th layer convolution forward operation specifically includes:
the main processing circuit determines that the input data is broadcast data and the convolution kernel is distribution data, splits the convolution kernel into a plurality of kernel data blocks, and sends at least one kernel data block of the plurality of kernel data blocks and at least one of the plurality of operation instructions to the k slave processing circuits;
the k slave processing circuits forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the operation instructions, a convolution operation on the received kernel data blocks and input data to obtain operation results, and transfer the operation results to the k slave processing circuits;
the main processing circuit splices the operation results sent by the k slave processing circuits to obtain the convolution result, and sends the convolution result to the controller unit;
executing the i-th layer convolution backward operation specifically includes:
the main processing circuit broadcasts the i-th layer output data gradient to the k slave processing circuits, selects, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the k slave processing circuits;
the k slave processing circuits forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the k slave processing circuits;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811181151.6A CN110059797B (en) | 2018-10-10 | 2018-10-10 | Computing device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059797A true CN110059797A (en) | 2019-07-26 |
CN110059797B CN110059797B (en) | 2020-03-10 |
Family
ID=67315787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811181151.6A Active CN110059797B (en) | 2018-10-10 | 2018-10-10 | Computing device and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059797B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717583A (en) * | 2019-09-30 | 2020-01-21 | 上海寒武纪信息科技有限公司 | Convolution circuit, processor, chip, board card and electronic equipment |
CN110990302A (en) * | 2019-11-22 | 2020-04-10 | 北京云宽志业网络技术有限公司 | Data caching method and device, electronic equipment and storage medium |
CN113837922A (en) * | 2021-09-26 | 2021-12-24 | 安徽寒武纪信息科技有限公司 | Computing device, data processing method and related product |
WO2022257980A1 (en) * | 2021-06-10 | 2022-12-15 | 寒武纪(西安)集成电路有限公司 | Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104809426A (en) * | 2014-01-27 | 2015-07-29 | 日本电气株式会社 | Convolutional neural network training method and target identification method and device |
US20170024849A1 (en) * | 2015-07-23 | 2017-01-26 | Sony Corporation | Learning convolution neural networks on heterogeneous cpu-gpu platform |
US20170103308A1 (en) * | 2015-10-08 | 2017-04-13 | International Business Machines Corporation | Acceleration of convolutional neural network training using stochastic perforation |
CN107239826A (en) * | 2017-06-06 | 2017-10-10 | 上海兆芯集成电路有限公司 | Computational methods and device in convolutional neural networks |
CN107341547A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for being used to perform convolutional neural networks training |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing full articulamentum neural metwork training |
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | 清华大学 | Convolutional neural networks accelerating method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110059797B (en) | 2020-03-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543832A (en) | A kind of computing device and board | |
CN109522052A (en) | A kind of computing device and board | |
CN109657782A (en) | Operation method, device and Related product | |
CN109189473A (en) | Processing with Neural Network device and its method for executing vector exchange instruction | |
CN109685201A (en) | Operation method, device and Related product | |
CN110059797A (en) | A kind of computing device and Related product | |
CN110163362A (en) | A kind of computing device and method | |
CN109032670A (en) | Processing with Neural Network device and its method for executing vector duplicate instructions | |
CN110383300A (en) | A kind of computing device and method | |
CN109670581A (en) | A kind of computing device and board | |
CN110147249A (en) | A kind of calculation method and device of network model | |
CN109739703A (en) | Adjust wrong method and Related product | |
CN109753319A (en) | A kind of device and Related product of release dynamics chained library | |
CN111079908B (en) | Network-on-chip data processing method, storage medium, computer device and apparatus | |
CN110119807A (en) | Operation method, device, computer equipment and storage medium | |
CN110163349A (en) | A kind of calculation method and device of network model | |
CN109726822A (en) | Operation method, device and Related product | |
CN110059809A (en) | A kind of computing device and Related product | |
CN113010845A (en) | Computing device and method for executing matrix multiplication and related products | |
CN109711540A (en) | A kind of computing device and board | |
CN109740729A (en) | Operation method, device and Related product | |
CN110472734A (en) | A kind of computing device and Related product | |
CN109740730A (en) | Operation method, device and Related product | |
CN111381882B (en) | Data processing device and related product | |
CN109711538A (en) | Operation method, device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant after: Zhongke Cambrian Technology Co., Ltd Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd. |
GR01 | Patent grant | ||