CN110059797A - Computing device and related product - Google Patents
- Publication number
- CN110059797A (Application CN201811181151.6A)
- Authority
- CN
- China
- Prior art keywords
- layer
- circuit
- processing circuit
- convolution
- reversed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present application provides a computing device and a related product. The computing device is configured to perform convolutional neural network training operations, and has the advantages of low cost and low power consumption.
Description
Technical field
The present application relates to the field of information processing technology, and in particular to a computing device and a related product.
Background technique
With the continuous development of information technology and people's ever-growing demands, requirements for the timeliness of information keep rising. At present, terminals acquire and process information based on general-purpose processors.

In practice, it has been found that this way of processing information, namely running software programs on a general-purpose processor, is limited by the operating speed of the general-purpose processor. Especially when the general-purpose processor is heavily loaded, information-processing efficiency is low and latency is high. For a computation model used in information processing, such as the training of its convolution layers, the amount of computation involved in convolution training is large; a general-purpose processor takes a long time to complete convolution training, with low efficiency and high power consumption.
Summary of the invention
The embodiments of the present application provide a computing device and a related product, which can increase the processing speed of convolution training operations, improve efficiency, and reduce power consumption.
In a first aspect, a computing device is provided. The computing device is configured to perform convolutional neural network training operations. The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. The computing device includes an operation unit and a controller unit; the operation unit includes a main processing circuit and slave processing circuits. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The computing device is configured to perform the i-th-layer convolution forward operation and to perform the i-th-layer convolution backward operation.
Performing the i-th-layer convolution forward operation specifically includes:

the controller unit is configured to obtain the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction;

the controller unit is further configured to parse the forward computation instruction to obtain a plurality of forward operation instructions, and to send the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit is configured to broadcast the input data to the slave processing circuits, split the convolution kernel into a plurality of kernel data blocks, distribute the kernel data blocks to the slave processing circuits, and send the plurality of operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and to transfer the operation result to the main processing circuit;

the main processing circuit is configured to splice the operation results to obtain the convolution result.
Performing the i-th-layer convolution backward operation specifically includes:

the controller unit is further configured to obtain the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction;

the controller unit is further configured to parse the backward computation instruction to obtain a plurality of backward operation instructions, and to send the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit is further configured to select, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcast the i-th-layer output data gradient to the slave processing circuits, split the i-th-layer backward input data into a plurality of backward input data blocks, and distribute the backward input data blocks and the backward operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and to return the vector operation result to the main processing circuit;

the main processing circuit is configured to determine the i-th-layer convolution kernel gradient from the vector operation results, and to perform an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
In a second aspect, an embodiment of the present application provides a convolution training apparatus. The convolution training apparatus includes one or more computing devices as provided in the first aspect, and is configured to obtain operand data and control information from other processing devices, perform the specified convolution operation, and pass the execution result to the other processing devices through an I/O interface.

When the convolution training apparatus includes a plurality of computing devices, the computing devices may be connected to and transmit data with one another through a specific structure.

Specifically, the computing devices may be interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the computing devices may share one control system or have their own respective control systems; the computing devices may share memory or have their own respective memories; and the computing devices may be interconnected in any interconnection topology.
In a third aspect, a combined processing apparatus is provided. The combined processing apparatus includes the convolution training apparatus of the second aspect, a universal interconnection interface, and other processing devices. The convolution training apparatus interacts with the other processing devices to jointly complete the computing operation specified by the user.
In a fourth aspect, a neural network chip is provided. The neural network chip includes the computing device provided in the first aspect, the convolution training apparatus provided in the second aspect, or the combined processing apparatus provided in the third aspect.
In a fifth aspect, an electronic device is provided. The electronic device includes the chip provided in the fourth aspect.
In a sixth aspect, a board card is provided. The board card includes a memory device, an interface device, a control device, and the neural network chip provided in the fourth aspect;

the neural network chip is connected to the memory device, the control device, and the interface device respectively;

the memory device is configured to store data;

the interface device is configured to implement data transmission between the chip and an external device;

the control device is configured to monitor the state of the chip.
In a seventh aspect, an embodiment of the present application further provides a convolutional neural network training method applied to a computing device. The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. The computing device includes an operation unit and a controller unit; the operation unit includes a main processing circuit and slave processing circuits. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The convolutional neural network training method includes at least performing the i-th-layer convolution forward operation and performing the i-th-layer convolution backward operation.

Performing the i-th-layer convolution forward operation includes:

the controller unit obtains the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction, parses the forward computation instruction to obtain a plurality of forward operation instructions, and sends the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into a plurality of kernel data blocks, distributes the kernel data blocks to the slave processing circuits, and sends the plurality of operation instructions to the slave processing circuits;

each slave processing circuit performs, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and transfers the operation result to the main processing circuit;

the main processing circuit splices the operation results to obtain the convolution result.

Performing the i-th-layer convolution backward operation includes:

the controller unit obtains the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction, parses the backward computation instruction to obtain a plurality of backward operation instructions, and sends the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit selects, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcasts the i-th-layer output data gradient to the slave processing circuits, splits the i-th-layer backward input data into a plurality of backward input data blocks, and distributes the backward input data blocks and the backward operation instructions to the slave processing circuits;

each slave processing circuit performs, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and returns the vector operation result to the main processing circuit;

the main processing circuit determines the i-th-layer convolution kernel gradient from the vector operation results, and performs an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
In some embodiments, the electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, earphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1A is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Fig. 1B is a structural diagram of a computing device provided by an embodiment of the present application.
Fig. 1C is a structural diagram of a computing device provided by another embodiment of the present application.
Fig. 1D is a structural diagram of a main processing circuit provided by an embodiment of the present application.
Fig. 1E is a structural diagram of another computing device provided by an embodiment of the present application.
Fig. 1F is a schematic structural diagram of a tree module provided by an embodiment of the present application.
Fig. 1G is a structural diagram of yet another computing device provided by an embodiment of the present application.
Fig. 1H is a structural diagram of still another computing device provided by an embodiment of the present application.
Fig. 2 is a structural diagram of a combined processing apparatus provided by an embodiment of the present application.
Fig. 2A is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Fig. 3 is a structural diagram of another combined processing apparatus provided by an embodiment of the present application.
Fig. 3A is a schematic structural diagram of a board card provided by an embodiment of the present application.
Fig. 4A is a layer-hierarchy diagram of a convolutional neural network provided by an embodiment of the present application.
Fig. 4B is a schematic diagram of the i-th-layer forward operation provided by an embodiment of the present application.
Fig. 4C is a schematic diagram of the i-th-layer backward operation provided by an embodiment of the present application.
Fig. 4D is a schematic diagram of a splicing result provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

The terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The computing device used in the present application is introduced first. Referring to Fig. 1A, a computing device is provided. The computing device is configured to perform convolutional neural network training operations, and includes a controller unit 11 and an operation unit 12, wherein the controller unit 11 is connected to the operation unit 12. The operation unit 12 includes a main processing circuit 101 and slave processing circuits 102 (there may be one or more slave processing circuits, and a plurality of slave processing circuits is preferred).

It should be noted that the main processing circuit itself includes a memory (such as a RAM or registers) that can store some data of the main processing circuit; the slave processing circuits may optionally carry memories as well.
The convolutional neural network includes α layers, at least the i-th of which is a convolutional layer. A schematic diagram of the multilayer structure of the convolutional neural network may be as shown in Fig. 4A. It should be noted that the i-th layer may be any one of the α layers; for convenience of illustration, the i-th layer in Fig. 4A is taken as a middle layer. α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The computing device is configured to perform the i-th-layer convolution forward operation and the i-th-layer convolution backward operation.
Performing the i-th-layer convolution forward operation specifically includes:

the controller unit is configured to obtain the i-th-layer input data, the i-th-layer convolution kernel, and an i-th-layer forward computation instruction;

the controller unit is further configured to parse the forward computation instruction to obtain a plurality of forward operation instructions, and to send the input data, the convolution kernel, and the plurality of operation instructions to the main processing circuit;

the main processing circuit is configured to broadcast the input data to the slave processing circuits, split the convolution kernel into a plurality of kernel data blocks, distribute the kernel data blocks to the slave processing circuits, and send the plurality of operation instructions to the slave processing circuits;

each slave processing circuit is configured to perform, according to the operation instructions, a convolution operation on the input data and the kernel data block it received to obtain an operation result, and to transfer the operation result to the main processing circuit;

the main processing circuit is configured to splice the operation results to obtain the convolution result.
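The forward data flow described above can be sketched in software. The sketch below is illustrative only: NumPy arrays stand in for the hardware circuits, and all shapes are hypothetical. The "main circuit" broadcasts the input and splits the kernels into blocks, each "slave circuit" convolves the broadcast input with its kernel block, and the "main circuit" splices the partial results into the full convolution result.

```python
import numpy as np

def conv2d(x, k):
    """Plain 2-D convolution, stride 1, no padding (reference computation)."""
    co, ci, kh, kw = k.shape
    _, h, w = x.shape
    out = np.zeros((co, h - kh + 1, w - kw + 1))
    for o in range(co):
        for r in range(h - kh + 1):
            for c in range(w - kw + 1):
                out[o, r, c] = np.sum(x[:, r:r + kh, c:c + kw] * k[o])
    return out

def forward_master_slave(x, kernels, n_slaves):
    # "Main circuit": broadcast x; split kernels into blocks by output channel.
    blocks = np.array_split(kernels, n_slaves, axis=0)
    # Each "slave circuit" convolves the broadcast input with its kernel block.
    partials = [conv2d(x, blk) for blk in blocks if blk.size]
    # "Main circuit": splice the partial results into the convolution result.
    return np.concatenate(partials, axis=0)

x = np.random.rand(3, 8, 8)      # hypothetical input: 3 channels, 8x8
k = np.random.rand(6, 3, 3, 3)   # hypothetical: 6 output channels, 3x3 kernels
ref = conv2d(x, k)               # single-circuit reference
par = forward_master_slave(x, k, n_slaves=4)
assert np.allclose(ref, par)     # splicing reproduces the full convolution
```

Splitting along the output-channel dimension is one natural choice here, since each slave's partial result is then a disjoint slice of the output that can simply be concatenated.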
Performing the i-th-layer convolution backward operation specifically includes:

the controller unit is further configured to obtain the i-th-layer output data gradient, the i-th-layer convolution kernel, the i-th-layer input data, and a backward computation instruction.
The i-th-layer output data gradient may be obtained from the backward input data gradient of the (i+1)-th layer. For example, the backward input data gradient of the (i+1)-th layer may be used directly as the i-th-layer output data gradient; alternatively, the result of multiplying the backward input data gradient of the (i+1)-th layer by h'(x) may be used as the i-th-layer output data gradient, where h'(x) is the derivative of the activation function of the i-th layer.
The controller unit is further configured to parse the backward computation instruction to obtain a plurality of backward operation instructions, and to send the backward operation instructions, the i-th-layer output data gradient, the i-th-layer convolution kernel, and the i-th-layer input data to the main processing circuit;

the main processing circuit is further configured to select, according to a convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data, broadcast the i-th-layer output data gradient to the slave processing circuits, split the i-th-layer backward input data into a plurality of backward input data blocks, and distribute the backward input data blocks and the backward operation instructions to the slave processing circuits.
Selecting, according to the convolution window, the i-th-layer backward input data for the backward operation from the i-th-layer input data may specifically be selecting the i-th-layer backward input data according to information such as the size of the convolution window and its moving stride. For example, the i-th-layer input data may be cropped according to the size and moving stride of the convolution window to obtain the i-th-layer backward input data corresponding to that convolution window size and moving stride.
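As an illustrative sketch of this cropping (a single-channel 2-D input and specific window size and stride are assumed for simplicity), each convolution-window position yields one patch of the input data:

```python
import numpy as np

def select_backward_input(x, win, stride):
    """Crop the input data into one patch per convolution-window position,
    given the window size and moving stride."""
    h, w = x.shape
    patches = []
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            patches.append(x[r:r + win, c:c + win])
    return np.stack(patches)

x = np.arange(36, dtype=float).reshape(6, 6)   # hypothetical 6x6 input
p = select_backward_input(x, win=3, stride=1)
print(p.shape)   # (16, 3, 3): 4x4 window positions, each a 3x3 patch
```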
Each slave processing circuit is configured to perform, according to the backward operation instructions it received, a vector-by-vector multiplication of the backward input data block it received and the i-th-layer output data gradient to obtain a vector operation result, and to return the vector operation result to the main processing circuit;

the main processing circuit is configured to determine the i-th-layer convolution kernel gradient from the vector operation results, and to perform an update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel.
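The vector-by-vector multiplications described above effectively accumulate, for each output-gradient element, the product of that element with the input patch it was computed from. A minimal single-channel, stride-1 sketch (shapes are assumed purely for illustration):

```python
import numpy as np

def kernel_gradient(x, grad_out, kh, kw):
    """dL/dK for a stride-1, no-padding, single-channel convolution:
    each output-gradient element scales the input patch it came from."""
    oh, ow = grad_out.shape
    dk = np.zeros((kh, kw))
    for r in range(oh):
        for c in range(ow):
            # per-window "vector-multiply-vector", accumulated into dk
            dk += x[r:r + kh, c:c + kw] * grad_out[r, c]
    return dk

x = np.random.rand(5, 5)    # hypothetical 5x5 input
g = np.random.rand(3, 3)    # output gradient for a 3x3 kernel on 5x5 input
dk = kernel_gradient(x, g, 3, 3)
print(dk.shape)             # (3, 3), same shape as the convolution kernel
```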
Optionally, determining the i-th-layer convolution kernel gradient from the vector operation results specifically includes:

the main processing circuit is specifically configured to compute the square mean c corresponding to the convolution kernel gradients dw of all the slave computing modules of the i-th layer; when c is greater than a threshold t, all gradients are scaled as dw' = dw / c * t, and the value of the convolution kernel is updated according to the scaled convolution kernel gradient. Here dw is the convolution kernel gradient of a slave computing module.
Optionally, performing the update operation on the i-th-layer convolution kernel with the i-th-layer convolution kernel gradient to obtain the updated i-th-layer convolution kernel may specifically be, for example: updated i-th-layer convolution kernel = i-th-layer convolution kernel − i-th-layer convolution kernel gradient × learning rate. Of course, other update modes may also be used; the present application does not limit the specific method of the update.
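Taken together, the optional scaling and update steps resemble what is commonly called gradient clipping by norm followed by a gradient-descent update. The sketch below assumes a root-of-sum-of-squares norm for c, an SGD-style update rule, and hypothetical threshold and learning-rate values:

```python
import numpy as np

def clip_and_update(kernel, dw, threshold, lr):
    # c: square root of the sum of squares of the kernel-gradient elements
    # (one plausible reading of the "square mean" in the text)
    c = np.sqrt(np.sum(dw ** 2))
    if c > threshold:
        dw = dw / c * threshold    # dw' = dw / c * t
    return kernel - lr * dw        # assumed SGD-form update rule

kernel = np.ones((3, 3))
dw = np.full((3, 3), 10.0)         # gradient norm = 30, above the threshold
new_k = clip_and_update(kernel, dw, threshold=3.0, lr=0.1)
print(new_k[0, 0])   # 0.9: gradient was rescaled from norm 30 down to norm 3
```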
It should be noted that the training of the convolutional neural network above is described by taking the training of the i-th convolutional layer as an example only; the other layers of the convolutional neural network may be trained using conventional training methods, and the present application does not limit the training methods for the layers of the convolutional neural network other than the i-th layer.
In the technical solution provided by the present application, the operation unit is arranged in a one-master-multiple-slaves structure. For the forward operation, the data can be split according to the computation instruction of the forward operation, so that the part with the larger amount of computation can be computed in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption. For the backward operation, the i-th-layer backward input data is split and then distributed to the slave processing circuits, which perform the vector-by-vector multiplications; in this way the computation load can likewise be distributed to a plurality of processing circuits (the master-slave structure) for parallel computation, increasing the operation speed, saving operation time, and in turn reducing power consumption. The solution therefore has the advantages of a shorter training time and lower power consumption.
The training of the i-th layer may be one layer of operation in the convolutional neural network. For a multilayer convolutional neural network, the realization process is as follows. In the forward operation, after the forward operation of the previous layer is completed, the operation instruction of the next layer takes the output neurons computed in the operation unit as the input neurons of the next layer for operation (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer is completed, the operation instruction of the next layer takes the input-neuron gradients computed in the operation unit as the output-neuron gradients of the next layer for operation (or performs certain operations on those input-neuron gradients before using them as the output-neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
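The layer-by-layer chaining described above can be sketched as an ordinary training loop, in which each layer's output feeds the next layer forward and each layer's input gradient is handed backward as the previous layer's output gradient. The toy `DenseLayer` below is a hypothetical stand-in for one layer's forward and backward operations:

```python
import numpy as np

class DenseLayer:
    """Toy stand-in for one network layer with forward/backward operations."""
    def __init__(self, n_in, n_out, lr=0.1):
        self.w = np.random.randn(n_in, n_out) * 0.1
        self.lr = lr

    def forward(self, x):
        return x @ self.w

    def backward(self, inp, grad_out):
        # Input gradient, handed back as the previous layer's output gradient.
        grad_in = grad_out @ self.w.T
        # Weight update for this layer.
        self.w -= self.lr * np.outer(inp, grad_out)
        return grad_in

def train_step(layers, x, target):
    # Forward: each layer's output neurons are the next layer's input neurons.
    acts = [x]
    for layer in layers:
        acts.append(layer.forward(acts[-1]))
    # Backward: each layer's input gradient becomes the previous layer's
    # output gradient; weights are replaced layer by layer along the way.
    grad = acts[-1] - target            # gradient of 0.5 * ||y - t||^2
    for layer, inp in zip(reversed(layers), reversed(acts[:-1])):
        grad = layer.backward(inp, grad)
    return 0.5 * np.sum((acts[-1] - target) ** 2)

np.random.seed(0)
net = [DenseLayer(4, 8), DenseLayer(8, 2)]
x, t = np.random.rand(4), np.array([1.0, 0.0])
losses = [train_step(net, x, t) for _ in range(50)]
assert losses[-1] < losses[0]   # loss decreases over training steps
```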
For artificial neural network operation, if the artificial neural network operation has multiple layers of operation, the input neurons and output neurons of the multilayer operation do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons. Taking a convolutional neural network as an example, suppose the convolutional neural network has L layers, with K = 1, 2, ..., L−1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, and the neurons therein are the input neurons; the (K+1)-th layer is called the output layer, and the neurons therein are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the computing device may further include a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one or any combination of a register and a cache. Specifically, the cache is configured to store the computation instruction; the register is configured to store the input data and scalars; and the cache is a scratchpad cache. The direct memory access unit 50 is configured to read data from or store data to the storage unit 10.
Optionally, the controller unit includes an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;

the instruction storage unit 110 is configured to store computation instructions associated with the artificial neural network operation;

the instruction processing unit 111 is configured to parse the computation instruction to obtain a plurality of operation instructions;

the storage queue unit 113 is configured to store an instruction queue, the instruction queue including a plurality of operation instructions or computation instructions to be executed in the front-to-back order of the queue.
For example, in an optional technical solution, the main operation processing circuit may also include a controller unit, and this controller unit may include a master instruction processing unit specifically configured to decode instructions into micro-instructions. Of course, in another optional solution, each slave operation processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process micro-instructions. A micro-instruction may be the next-level instruction below an instruction: it can be obtained by splitting or decoding an instruction, and can be further decoded into control signals for the various components, units, or processing circuits.
In one optional solution, the structure of the computation instruction may be as shown in the following table.

Operation code | Register or immediate | Register/immediate | ... |

The ellipsis in the table above indicates that the instruction may include multiple registers or immediates.
In another optional solution, the computation instruction may include one or more operation fields and one operation code. The computation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation fields, and each of register number 0 through register number 4 may be the number of one or more registers.
The register may be an off-chip memory or, of course, in practical applications, an on-chip memory, for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n = 1 the data is 1-dimensional, i.e., a vector; when n = 2 it is 2-dimensional, i.e., a matrix; and when n = 3 or more it is a multidimensional tensor.
Optionally, the computing device may further include:
a dependency processing unit 108, which, when there are multiple operation instructions, is used to determine whether a first operation instruction has a dependency on a zeroth operation instruction that precedes it. If the first operation instruction and the zeroth operation instruction have a dependency, the first operation instruction is buffered in the instruction storage unit; after the zeroth operation instruction has finished executing, the first operation instruction is fetched from the instruction storage unit and transmitted to the arithmetic unit.
Determining whether the first operation instruction has a dependency on the zeroth operation instruction that precedes it includes:
extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction (for example, a matrix), and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction. If the first storage address interval and the zeroth storage address interval have an overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have a dependency; if the two intervals have no overlapping region, it is determined that they have no dependency.
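The dependency check above reduces to an interval-overlap test on operand addresses. A minimal sketch, assuming instructions are represented as dictionaries with an `addr_range` field (a hypothetical representation, not part of the patent):

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b)
    share at least one address."""
    return start_a < end_b and start_b < end_a

def has_dependency(first_instr, zeroth_instr):
    """A later instruction depends on an earlier one iff their operand
    storage address intervals overlap, so it must wait for the earlier
    instruction to finish."""
    a0, a1 = first_instr["addr_range"]
    b0, b1 = zeroth_instr["addr_range"]
    return intervals_overlap(a0, a1, b0, b1)

# A dependent pair: the first instruction touches addresses the zeroth uses.
i0 = {"addr_range": (0x1000, 0x1400)}
i1 = {"addr_range": (0x1200, 0x1800)}
assert has_dependency(i1, i0)                              # overlap -> buffer and wait
assert not has_dependency({"addr_range": (0x2000, 0x2400)}, i0)  # disjoint -> issue freely
```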
In another alternative embodiment, the arithmetic unit 12 may include a main processing circuit 101 and multiple slave processing circuits 102, as shown in Figure 1C. In one embodiment, as shown in Figure 1C, the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k of the multiple slave processing circuits. The k slave processing circuits are: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1. It should be noted that the k slave processing circuits shown in Figure 1C include only the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1; that is, the k slave processing circuits are those slave processing circuits, among the multiple slave processing circuits, that are directly connected to the main processing circuit.
The k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, as shown in Figure 1D, the main processing circuit may further include one of, or any combination of: a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
the conversion processing circuit 110 is used to perform, on the data block or intermediate result received by the main processing circuit, an exchange between a first data structure and a second data structure (for example, a conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (for example, a conversion between fixed-point and floating-point types);
the activation processing circuit 111 is used to execute the activation operation on data in the main processing circuit;
the addition processing circuit 112 is used to execute addition or accumulation operations.
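The fixed-point/floating-point exchange mentioned for the conversion processing circuit can be sketched in software. A minimal illustration, assuming a signed fixed-point format with a chosen number of fractional bits (the patent does not specify the format):

```python
def float_to_fixed(x, frac_bits=8):
    """Quantize a float to a signed fixed-point integer with frac_bits
    fractional bits: one possible first-type -> second-type exchange."""
    return int(round(x * (1 << frac_bits)))

def fixed_to_float(q, frac_bits=8):
    """The reverse exchange: fixed-point integer back to float."""
    return q / (1 << frac_bits)

q = float_to_fixed(1.5)            # 1.5 * 256 = 384
assert q == 384
assert fixed_to_float(q) == 1.5    # exactly representable, round-trips losslessly
```

Values that are not multiples of 2^-frac_bits round to the nearest representable fixed-point value, which is the usual cost of such a conversion.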
The main processing circuit is used to determine that the input data is broadcast data and the convolution kernel is distribution data, split the convolution kernel into multiple kernel data blocks, and send at least one of the multiple kernel data blocks and at least one of the multiple operation instructions to the k slave processing circuits;
the k slave processing circuits are used to forward the kernel data blocks, input data, and operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute the convolution operation on the received kernel data blocks and input data according to the operation instructions to obtain operation results, and transfer the operation results to the k slave processing circuits;
the main processing circuit is used to splice the operation results sent by the k slave processing circuits to obtain the convolution result, and send the convolution result to the controller unit;
the main processing circuit is also used to broadcast the layer-i output result gradient to the k slave processing circuits, select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the k slave processing circuits;
the k slave processing circuits are also used to forward the reverse input data blocks, vector operation results, layer-i output result gradient, and reverse operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the k slave processing circuits;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
The slave processing circuit includes a multiplication processing circuit;
the multiplication processing circuit is used to execute a product operation on the received data blocks to obtain a product result;
a forwarding processing circuit (optional) is used to forward the received data blocks or the product result;
an accumulation processing circuit is used to execute an accumulation operation on the product results to obtain an intermediate result.
In another embodiment, the operation instruction is a matrix computation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction, or an activation instruction.
The specific computation method of the computing device shown in Figure 1A is illustrated below using a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed can be: s = s(Σ w·x_i + b); that is, the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and then the activation operation s(h) is applied to obtain the final output result s.
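The formula s = s(Σ w·x_i + b) can be sketched directly. A minimal illustration, using tanh as a stand-in activation (the patent does not fix a particular activation function):

```python
import math

def neuron_forward(weights, inputs, bias, activation=math.tanh):
    """s = s(sum_i w_i * x_i + b): multiply each weight by its input,
    sum the products, add the bias, then apply the activation."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(h)

# h = 0.5*2.0 + (-0.25)*4.0 + 1.0 = 1.0, so the output is tanh(1.0)
out = neuron_forward([0.5, -0.25], [2.0, 4.0], 1.0)
assert abs(out - math.tanh(1.0)) < 1e-12
```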
In an optional embodiment, referring to Figure 1E, the arithmetic unit includes a tree module 40. The tree module includes a root port 401 and multiple branch ports 404; the root port of the tree module is connected to the main processing circuit, and each of the multiple branch ports of the tree module is connected to one of the multiple slave processing circuits.
The above tree module has transmitting and receiving functions; for example, referring to Figure 1E, the tree module transmits, and as shown in Figure 2A, the tree module receives.
The tree module is used to forward data blocks, weights, and operation instructions between the main processing circuit and the multiple slave processing circuits.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. A node is a line structure with forwarding capability; the node itself may have no computing capability. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in Figure 1F, or a ternary tree structure, where n may be an integer greater than or equal to 2. The specific embodiments of the application do not limit the specific value of n. The number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in Figure 1F.
Optionally, the above arithmetic unit may carry separate caches; as shown in Figure 1G, it may include a neuron cache unit 63, which caches the input neuron vector data and output neuron value data of the slave processing circuits.
As shown in Figure 1H, the arithmetic unit may also include a weight cache unit 64, for caching the weight data needed by the slave processing circuits during the computation.
In an alternative embodiment, the arithmetic unit 12 may include a branch processing circuit 103 as shown in Figure 1B; its specific connection structure is shown in Figure 1B, wherein
the branch processing circuit 103 may include a memory. As shown in Figure 1B, the size of the memory of the branch processing circuit 103 can be between 2 and 2.5 times the maximum data capacity that a single slave processing circuit needs to store. With this arrangement, the slave processing circuits do not need their own memories: relative to one branch processing circuit, only 2.5*R needs to be provided (R being the capacity required by a single slave processor circuit), whereas without a branch processing circuit 4*R would need to be provided, and register utilization would also be low. This structure can therefore effectively reduce the total memory capacity and reduce cost.
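The capacity comparison above can be made concrete. A minimal sketch, assuming one branch circuit serves four slave circuits (the 4*R figure in the text suggests this, but the fan-out is not stated explicitly):

```python
def total_memory(num_slaves, R, with_branch=True):
    """Total buffer capacity needed: one shared branch memory of 2.5*R
    versus a private memory of R in every slave processing circuit."""
    return 2.5 * R if with_branch else num_slaves * R

R = 1024  # bytes needed by a single slave circuit (illustrative value)
assert total_memory(4, R, with_branch=True) == 2560.0   # shared branch memory
assert total_memory(4, R, with_branch=False) == 4096    # 4*R without a branch
```

Under these assumptions the shared branch memory saves 1.5*R per group of four slaves, which is the cost reduction the text claims.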
The main processing circuit is specifically used to determine that the input data is broadcast data and the convolution kernel is distribution data, split the convolution kernel into multiple kernel data blocks, and send at least one of the multiple kernel data blocks, the input data, and at least one of the multiple operation instructions to the branch processing circuit;
the branch processing circuit is used to forward the kernel data blocks, input data, and operation instructions between the main processing circuit and the multiple slave processing circuits;
the multiple slave processing circuits are used to execute the convolution operation on the received kernel data blocks and input data according to the operation instructions to obtain operation results, and transfer the operation results to the branch processing circuit;
the main processing circuit is used to splice the operation results sent by the branch processing circuit to obtain the convolution result;
the main processing circuit is also used to broadcast the layer-i output result gradient to the branch processing circuit, select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the branch processing circuit;
the branch processing circuit is also used to forward the reverse input data blocks, vector operation results, layer-i output result gradient, and reverse operation instructions between the main processing circuit and the multiple slave processing circuits;
the slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the branch processing circuit;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results forwarded by the branch processing circuit, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
On the other hand, the application also provides a convolutional neural network training method, the method being applied to a computing device. The convolutional neural network includes α layers, and at least layer i of the α layers is a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α. The convolutional neural network training method includes at least: executing a layer-i convolution forward operation and executing a layer-i convolution reverse operation.
Executing the layer-i convolution forward operation, whose forward operation schematic is shown in Figure 4B, includes:
the controller unit obtains the layer-i input data, the layer-i convolution kernel, and a layer-i forward computation instruction; it parses the forward computation instruction to obtain multiple forward operation instructions, and sends the multiple operation instructions, the input data, and the convolution kernel to the main processing circuit;
the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into multiple kernel data blocks, distributes the multiple kernel data blocks to the slave processing circuits, and sends the multiple operation instructions to the slave processing circuits;
the slave processing circuits execute the convolution operation on the input data and the received kernel data blocks according to the operation instructions to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit performs splicing processing on the operation results to obtain the convolution result.
Executing the layer-i convolution reverse operation, whose reverse operation schematic is shown in Figure 4C, includes:
the controller unit obtains the layer-i output data gradient, the layer-i convolution kernel, the layer-i input data, and a reverse computation instruction; it parses the reverse computation instruction to obtain multiple reverse operation instructions, and sends the reverse operation instructions, the layer-i output data gradient, the layer-i convolution kernel, and the layer-i input data to the main processing circuit;
the main processing circuit selects the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, broadcasts the layer-i output data gradient to the slave processing circuits, splits the layer-i reverse input data into multiple reverse input data blocks, and distributes the multiple reverse input data blocks and multiple reverse operation instructions to the slave processor circuits;
the slave processing circuits execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit determines the layer-i convolution kernel gradient according to the vector operation results, and executes an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
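The forward/reverse pair above can be sketched for a single 2-D input channel. A minimal single-threaded illustration (no actual distribution to slave circuits; the learning rate 0.01 and the all-ones gradient are illustrative assumptions):

```python
import numpy as np

def conv2d_forward(x, k):
    """Valid 2-D convolution (cross-correlation) of input x with kernel k:
    each output element is a window-times-kernel product sum."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv2d_kernel_grad(x, dout):
    """Kernel gradient of the forward pass above: for each kernel element,
    the vector-times-vector product of the output gradient with the
    matching window of reverse input data (i.e., cross-correlate x with dout)."""
    H, W = x.shape
    oh, ow = dout.shape
    dk = np.empty((H - oh + 1, W - ow + 1))
    for r in range(dk.shape[0]):
        for c in range(dk.shape[1]):
            dk[r, c] = np.sum(x[r:r + oh, c:c + ow] * dout)
    return dk

x = np.arange(16, dtype=float).reshape(4, 4)   # layer-i input data
k = np.ones((2, 2))                            # layer-i convolution kernel
y = conv2d_forward(x, k)                       # forward operation result
dout = np.ones_like(y)                         # layer-i output data gradient
k_new = k - 0.01 * conv2d_kernel_grad(x, dout) # update operation on the kernel
assert y.shape == (3, 3) and k_new.shape == (2, 2)
```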
Referring to Figure 4D, Figure 4D is a splicing schematic diagram of the splicing processing of the operation results into the convolution result. The splicing mode is shown in Figure 4D: for each operation result, determine the minimum column number and minimum row number among the input data elements that produced it; the position of that operation result in the convolution result is then the column given by the column-number minimum and the row given by the row-number minimum. Traversing all operation results and splicing them according to the above principle yields the convolution result.
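The splicing rule can be sketched as a scatter of partial results into their (min row, min column) positions. A minimal illustration, assuming each partial result is already tagged with those minima (the tagging scheme is an assumption; the patent only describes the placement rule):

```python
def splice_results(partial_results):
    """Splice per-block operation results into the convolution result.
    Each entry is (min_row, min_col, value): the minimum row/column index
    of the input elements that produced it, which is its output position."""
    max_r = max(r for r, _, _ in partial_results)
    max_c = max(c for _, c, _ in partial_results)
    out = [[0.0] * (max_c + 1) for _ in range(max_r + 1)]
    for min_row, min_col, value in partial_results:
        out[min_row][min_col] = value
    return out

# Four partial results arriving in arbitrary order still land correctly.
parts = [(1, 1, 4.0), (0, 0, 1.0), (1, 0, 3.0), (0, 1, 2.0)]
assert splice_results(parts) == [[1.0, 2.0], [3.0, 4.0]]
```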
The application also discloses a convolution training device, which includes one or more of the computing devices mentioned in this application, and is used to obtain data to be operated on and control information from other processing devices, execute the specified machine learning operations, and pass the execution results to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices can be linked through a specific structure and transmit data, for example interconnected and transmitting data via a PCIe bus, to support larger-scale convolutional neural network training operations. In this case, the computing devices may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection mode can be any interconnection topology.
The convolution training device has high compatibility and can be connected to various types of servers through a PCIe interface.
The application also discloses a combined processing device, which includes the above convolution training device, a universal interconnection interface, and other processing devices. The machine learning computation device interacts with the other processing devices to jointly complete the operation specified by the user. Figure 2 is a schematic diagram of the combined processing device.
The other processing devices include one or more of general-purpose/special-purpose processors such as central processing units (CPU), graphics processing units (GPU), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computation device and external data and control, including data transfer, and perform basic control of the machine learning computation device such as starting and stopping; the other processing devices may also cooperate with the machine learning computation device to jointly complete computation tasks.
The universal interconnection interface is used to transmit data and control instructions between the convolution training device and the other processing devices. The convolution training device obtains the required input data from the other processing devices and writes it to the on-chip storage of the convolution training device; it can obtain control instructions from the other processing devices and write them to the on-chip control cache of the convolution training device; it can also read data from the storage module of the convolution training device and transfer it to the other processing devices.
Optionally, the structure is as shown in Figure 3 and may also include a storage device, which is connected to the convolution training device and the other processing devices respectively. The storage device is used to store data held in the convolution training device and the other processing devices, and is especially suitable for data to be operated on that cannot be fully held in the internal storage of the convolution training device or the other processing devices.
The combined processing device can serve as a system on chip (SoC) for equipment such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the die area of the control portion, improving processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.
In some embodiments, a chip is also claimed, which includes the above convolution training device or combined processing device.
In some embodiments, a chip packaging structure is claimed, which includes the above chip.
In some embodiments, a board card is claimed, which includes the above chip packaging structure. Referring to Figure 3A, which provides a board card: in addition to the above chip 389, the board card may also include other supporting components, which include but are not limited to a memory device 390, an interface device 391, and a control device 392;
the memory device 390 is connected to the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units can be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without raising the clock frequency. DDR allows data to be read on both the rising edge and the falling edge of the clock pulse; the speed of DDR is thus twice that of standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units, and each group of storage units may include multiple DDR4 particles (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers; of the 72 bits, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
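The 25600 MB/s figure follows directly from the data-path width and transfer rate. A one-line sketch of the calculation:

```python
def ddr_bandwidth_mb_s(transfer_rate_mt_s, data_width_bits):
    """Theoretical bandwidth = transfers per second * bytes per transfer.
    The DDR4-3200 rating already counts both clock edges (3200 MT/s)."""
    return transfer_rate_mt_s * data_width_bits // 8

# 64-bit data path (72-bit controller minus 8 ECC bits) at DDR4-3200:
assert ddr_bandwidth_mb_s(3200, 64) == 25600  # MB/s, matching the text
```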
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip packaging structure. The interface device is used to realize data transmission between the chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device can be a standard PCIe interface; data to be processed is transferred from the server to the chip through the standard PCIe interface, realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device can also be another interface; the application does not limit the specific form of the above other interfaces, provided the interface unit can realize the transfer function. In addition, the calculation results of the chip are still sent back to the external equipment (such as a server) by the interface device.
The control device is electrically connected to the chip. The control device is used to monitor the state of the chip. Specifically, the chip can be electrically connected to the control device through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and can drive multiple loads; the chip can therefore be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, an electronic device is claimed, which includes the above board card.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an aircraft, ship, and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; the medical device includes a nuclear magnetic resonance instrument, B-ultrasound instrument, and/or electrocardiograph.
It should be noted that, for each of the foregoing method embodiments, for simplicity of description they are all expressed as a series of action combinations, but those skilled in the art should understand that the application is not limited by the described action sequence, because according to the application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be realized in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and in actual implementation there may be another division manner: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the application may be integrated in one processing unit, or each unit may physically exist separately, or two or more units may be integrated in one unit. The above integrated unit can be realized in the form of hardware or in the form of a software program module.
If the integrated unit is realized in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of each embodiment of the application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), mobile hard disk, magnetic disk, or optical disk.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable memory, and the memory may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and the like.
The embodiments of the application have been described in detail above; specific examples have been used herein to explain the principles and implementations of the application, and the descriptions of the above embodiments are only intended to help understand the method of the application and its core idea. At the same time, those skilled in the art may, according to the idea of the application, make changes in the specific implementations and scope of application. In summary, the contents of this specification should not be construed as limiting the application.
Claims (20)
1. A computing device, characterized in that the computing device is used for executing a convolutional neural network training operation, the convolutional neural network includes α layers, and at least layer i of the α layers is a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α; the computing device is used for executing a layer-i convolution forward operation and executing a layer-i convolution reverse operation;
executing the layer-i convolution forward operation specifically includes:
the controller unit is used to obtain the layer-i input data, the layer-i convolution kernel, and a layer-i forward computation instruction;
the controller unit is also used to parse the forward computation instruction to obtain multiple forward operation instructions, and send the multiple operation instructions, the input data, and the convolution kernel to the main processing circuit;
the main processing circuit is used to broadcast the input data to the slave processing circuits, split the convolution kernel into multiple kernel data blocks, distribute the multiple kernel data blocks to the slave processing circuits, and send the multiple operation instructions to the slave processing circuits;
the slave processing circuits are used to execute the convolution operation on the input data and the received kernel data blocks according to the operation instructions to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit is used to perform splicing processing on the operation results to obtain the convolution result;
executing the layer-i convolution reverse operation specifically includes:
the controller unit is also used to obtain the layer-i output data gradient, the layer-i convolution kernel, the layer-i input data, and a reverse computation instruction;
the controller unit is also used to parse the reverse computation instruction to obtain multiple reverse operation instructions, and send the reverse operation instructions, the layer-i output data gradient, the layer-i convolution kernel, and the layer-i input data to the main processing circuit;
the main processing circuit is also used to select the layer-i reverse input data of the reverse operation from the layer-i input data according to the convolution window, broadcast the layer-i output data gradient to the slave processing circuits, split the layer-i reverse input data into multiple reverse input data blocks, and distribute the multiple reverse input data blocks and multiple reverse operation instructions to the slave processor circuits;
the slave processing circuits are used to execute, according to the received reverse operation instructions, a vector-times-vector operation on the received reverse input data blocks and the layer-i output result gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit is used to determine the layer-i convolution kernel gradient according to the vector operation results, and execute an update operation on the layer-i convolution kernel with the layer-i convolution kernel gradient to obtain the updated layer-i convolution kernel.
2. The apparatus according to claim 1, characterized in that determining the layer-i convolution kernel gradient according to the vector operation result specifically includes:
the main processing circuit is specifically used to compute the quadratic mean c of the convolution kernel gradients dw of all slave computing modules of layer i; when c is greater than a threshold t, all gradients are scaled as dw' = dw/c*t, and the value of the convolution kernel is updated according to the scaled convolution kernel gradient, where dw is the convolution kernel gradient of a slave computing module.
3. The device according to claim 1, characterized in that the computing device further includes a storage unit and a direct memory access unit, the storage unit including any combination of a register and a cache;
the cache is configured to store the input data and the convolution kernel;
the register is configured to store scalar data within the input data;
the cache includes a scratch pad memory;
the controller unit includes an instruction storage unit, an instruction processing unit and a storage queue unit;
the instruction storage unit is configured to store the computation instructions associated with the convolutional neural network training operation;
the instruction processing unit is configured to parse the computation instructions to obtain a plurality of operation instructions;
the storage queue unit is configured to store an instruction queue, the instruction queue including a plurality of operation instructions or computation instructions to be executed in the order of the queue;
the main processing circuit includes a dependency processing unit;
the dependency processing unit is configured to determine whether a first operation instruction and a zeroth operation instruction preceding the first operation instruction have a dependency relationship; if the first operation instruction and the zeroth operation instruction have a dependency relationship, the first operation instruction is buffered in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is fetched from the instruction storage unit and transmitted to the arithmetic unit;
determining whether the first operation instruction and the zeroth operation instruction preceding the first operation instruction have a dependency relationship includes:
extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have a dependency relationship; if the first storage address interval and the zeroth storage address interval have no overlapping region, it is determined that the first operation instruction and the zeroth operation instruction have no dependency relationship.
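The claim-3 dependency test reduces to an interval-overlap check on the two instructions' storage address ranges. A minimal sketch, assuming half-open (start, end) intervals, which the claim does not specify:

```python
def has_dependency(first_interval, zeroth_interval):
    """True iff the first instruction's address interval overlaps the zeroth's."""
    s1, e1 = first_interval
    s0, e0 = zeroth_interval
    return s1 < e0 and s0 < e1   # standard overlap test for half-open intervals

# Overlapping intervals -> the first instruction must wait for the zeroth.
print(has_dependency((0, 64), (32, 96)))   # True: bytes 32..63 are shared
print(has_dependency((0, 32), (32, 64)))   # False: disjoint half-open ranges
```

When the check returns True, the first instruction is parked in the instruction storage unit until the zeroth completes, which is exactly the buffering behavior the claim describes.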
4. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the arithmetic unit includes a tree module, the tree module including one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and the branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits;
the tree module is configured to forward input data, convolution kernels, forward operation instructions, operation results, backward operation instructions and input data gradients between the main processing circuit and the plurality of slave processing circuits.
5. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
the main processing circuit is specifically configured to determine that the input data is broadcast data and the convolution kernel is distribution data, to split the convolution kernel into a plurality of kernel data blocks, and to send at least one kernel data block of the plurality of kernel data blocks, the input data and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits are configured to forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the operation instruction, a convolution operation on the received kernel data block and input data to obtain an operation result, and to transfer the operation result to the branch processing circuit;
the main processing circuit is configured to splice the operation results sent by the branch processing circuits to obtain the convolution result;
the main processing circuit is further configured to broadcast the i-th layer output data gradient to the branch processing circuits, to select, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, to split the i-th layer backward input data into a plurality of backward input data blocks, and to distribute the plurality of backward input data blocks and the plurality of backward operation instructions to the branch processing circuits;
the branch processing circuits are further configured to forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
each slave processing circuit is configured to perform, according to the backward operation instruction it receives, a vector-times-vector operation on the backward input data block it receives and the i-th layer output data gradient to obtain a vector operation result, and to return the vector operation result to the branch processing circuit;
the main processing circuit is configured to determine the i-th layer convolution kernel gradient according to the vector operation results forwarded by the branch processing circuits, and to perform an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
6. The device according to claim 1, characterized in that when the number of slave processing circuits is plural, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column;
the k slave processing circuits are configured to forward data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is configured to determine that the input data is broadcast data and the convolution kernel is distribution data, to split the convolution kernel into a plurality of kernel data blocks, and to send at least one kernel data block of the plurality of kernel data blocks and at least one of the plurality of operation instructions to the k slave processing circuits;
the k slave processing circuits are configured to forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the operation instruction, a convolution operation on the received kernel data block and input data to obtain an operation result, and to transfer the operation result to the k slave processing circuits;
the main processing circuit is configured to splice the operation results sent by the k slave processing circuits to obtain the convolution result, and to send the convolution result to the controller unit;
the main processing circuit is further configured to broadcast the i-th layer output data gradient to the k slave processing circuits, to select, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, to split the i-th layer backward input data into a plurality of backward input data blocks, and to distribute the plurality of backward input data blocks and the plurality of backward operation instructions to the k slave processing circuits;
the k slave processing circuits are further configured to forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are configured to perform, according to the backward operation instruction received, a vector-times-vector operation on the received backward input data block and the i-th layer output data gradient to obtain a vector operation result, and to return the vector operation result to the k slave processing circuits;
the main processing circuit is configured to determine the i-th layer convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and to perform an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
7. The device according to any one of claims 5 to 6, characterized in that the main processing circuit is specifically configured to combine and order the operation results sent by the plurality of processing circuits to obtain the convolution result.
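The splicing step of claim 7 can be sketched as follows: the master combines the partial operation results and orders them (here, per-output-channel feature maps tagged with their channel index) into the convolution result. The tagging scheme is an illustrative assumption, not specified by the claim.

```python
import numpy as np

def splice_results(tagged_results):
    """tagged_results: list of (channel_index, feature_map) pairs from the slaves."""
    ordered = sorted(tagged_results, key=lambda pair: pair[0])  # combine + order
    return np.stack([fm for _, fm in ordered])                  # splice channels

# Results may arrive out of order; splicing restores channel order.
parts = [(1, np.ones((2, 2))), (0, np.zeros((2, 2)))]
y = splice_results(parts)
print(y.shape)   # (2, 2, 2), with channel 0 first
```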
8. The device according to any one of claims 5 to 6, characterized in that the main processing circuit includes a conversion processing circuit;
the conversion processing circuit is configured to perform conversion processing on data, specifically: to perform an exchange between a first data structure and a second data structure on the input data, convolution kernel or convolution result received by the main processing circuit; or to perform an exchange between a first data type and a second data type on the input data, convolution kernel or convolution result received by the main processing circuit.
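The two conversions of claim 8 can be sketched concretely. The claim leaves the structures and types abstract; here the data-structure exchange is interpreted as an NCHW-to-NHWC layout transpose and the data-type exchange as float32-to-float16, both of which are illustrative assumptions:

```python
import numpy as np

def convert_layout(x_nchw):
    """First data structure (NCHW) -> second data structure (NHWC)."""
    return np.transpose(x_nchw, (0, 2, 3, 1))

def convert_dtype(x, dtype=np.float16):
    """First data type (e.g. float32) -> second data type (e.g. float16)."""
    return x.astype(dtype)

x = np.zeros((1, 3, 8, 8), dtype=np.float32)
print(convert_layout(x).shape)   # (1, 8, 8, 3)
print(convert_dtype(x).dtype)    # float16
```

Such a circuit lets the device accept data in whatever layout and precision the host provides and normalize it before distribution to the slave processing circuits.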
9. The device according to claim 5 or 6, characterized in that each slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit;
the multiplication processing circuit is configured to perform a product operation on the element values of the received kernel data block and the element values at the corresponding positions of the input data to obtain product results;
the accumulation processing circuit is configured to perform an accumulation operation on the product results to obtain the convolution result.
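The multiply-then-accumulate pipeline of claim 9 reduces to a single fused step per output element. A minimal sketch with illustrative names:

```python
import numpy as np

def slave_mac(kernel_block, input_block):
    """One slave's work for one output element."""
    products = kernel_block * input_block   # multiplication processing circuit
    return products.sum()                   # accumulation processing circuit

k = np.array([[1.0, 0.0],
              [0.0, 1.0]])                  # kernel data block
x = np.array([[2.0, 3.0],
              [4.0, 5.0]])                  # corresponding input positions
print(slave_mac(k, x))                      # 2.0 + 5.0 = 7.0
```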
10. The device according to claim 4, characterized in that the tree module is an n-ary tree structure, n being an integer greater than or equal to 2.
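The point of the n-ary tree of claims 4 and 10 is that a broadcast from the root (the main processing circuit) reaches p leaves (slave processing circuits) in a number of forwarding levels that grows only logarithmically in p. A sketch of this latency model, which is an illustrative interpretation rather than the patent's stated metric:

```python
def broadcast_levels(num_slaves, n):
    """Forwarding levels for an n-ary tree whose leaves are the slaves."""
    assert n >= 2, "claim 10 requires n >= 2"
    levels, reach = 0, 1
    while reach < num_slaves:   # each level multiplies fan-out by n
        reach *= n
        levels += 1
    return levels

print(broadcast_levels(16, 2))  # binary tree: 4 levels
print(broadcast_levels(16, 4))  # quad tree: 2 levels
```

Raising n trades wider (costlier) ports for fewer forwarding hops, which is the design knob the n-ary parameter exposes.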
11. A convolution training device, characterized in that the convolution training device includes one or more computing devices according to any one of claims 1 to 10, and is configured to acquire the data to be operated on and control information from other processing devices, to execute the specified convolution operation, and to pass the execution result to the other processing devices through an I/O interface;
when the convolution training device includes a plurality of the computing devices, the plurality of computing devices may be connected to one another and transmit data through a specific structure;
wherein the plurality of computing devices are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus so as to support larger-scale machine learning operations; the plurality of computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memories; and the interconnection mode of the plurality of computing devices is any interconnection topology.
12. A combined processing device, characterized in that the combined processing device includes the convolution training device according to claim 11, a universal interconnection interface and other processing devices;
the convolution training device interacts with the other processing devices to jointly complete the computing operation specified by the user.
13. The combined processing device according to claim 12, characterized by further including a storage device, the storage device being respectively connected to the convolution training device and the other processing devices, and configured to save the data of the convolution training device and the other processing devices.
14. A neural network chip, characterized in that the neural network chip includes the computing device according to claim 1, or the convolution training device according to claim 11, or the combined processing device according to claim 13.
15. An electronic apparatus, characterized in that the electronic apparatus includes the chip according to claim 14.
16. A board card, characterized in that the board card includes a memory device, an interface device, a control device and the neural network chip according to claim 15;
wherein the neural network chip is respectively connected to the memory device, the control device and the interface device;
the memory device is configured to store data;
the interface device is configured to realize data transmission between the chip and external equipment;
the control device is configured to monitor the state of the chip.
17. The board card according to claim 16, characterized in that:
the memory device includes a plurality of groups of storage units, each group of storage units being connected to the chip by a bus, the storage units being DDR SDRAM;
the chip includes a DDR controller configured to control the data transmission to and the data storage of each storage unit;
the interface device is a standard PCIE interface.
18. A convolutional neural network training method, characterized in that the method is applied to a computing device; the convolutional neural network includes α layers, at least the i-th of the α layers being a convolutional layer; the computing device includes an arithmetic unit and a controller unit; the arithmetic unit includes a main processing circuit and slave processing circuits; α is an integer greater than or equal to 2, and i is an integer less than or equal to α; the convolutional neural network training method includes at least: executing the i-th layer convolution forward operation and executing the i-th layer convolution backward operation;
executing the i-th layer convolution forward operation includes:
the controller unit acquires the i-th layer input data, the i-th layer convolution kernel and the i-th layer forward computation instruction; parses the forward computation instruction to obtain a plurality of forward operation instructions; and sends the input data, the convolution kernel and the plurality of operation instructions to the main processing circuit;
the main processing circuit broadcasts the input data to the slave processing circuits, splits the convolution kernel into a plurality of kernel data blocks, distributes the plurality of kernel data blocks to the slave processing circuits, and sends the plurality of operation instructions to the slave processing circuits;
the slave processing circuits perform, according to the operation instructions, a convolution operation on the input data and the received kernel data blocks to obtain operation results, and transfer the operation results to the main processing circuit;
the main processing circuit splices the operation results to obtain the convolution result;
executing the i-th layer convolution backward operation includes:
the controller unit acquires the i-th layer output data gradient, the i-th layer convolution kernel, the i-th layer input data and the backward computation instruction; parses the backward computation instruction to obtain a plurality of backward operation instructions; and sends the backward operation instructions, the i-th layer output data gradient, the i-th layer convolution kernel and the i-th layer input data to the main processing circuit;
the main processing circuit selects, according to a convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, broadcasts the i-th layer output data gradient to the slave processing circuits, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the slave processing circuits;
the slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the main processing circuit;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
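The forward half of claim 18 can be sketched end to end for one layer: the master broadcasts the input, splits the kernels into blocks (here one output channel per block), each "slave" convolves its block, and the master splices the per-channel results. The per-output-channel partitioning and all names are illustrative assumptions; the patent leaves the split granularity open.

```python
import numpy as np

def correlate2d(x, k):
    """Valid cross-correlation of a 2-D input with a 2-D kernel."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def forward_layer(x, kernels, num_slaves=2):
    blocks = np.array_split(kernels, num_slaves)          # master: split kernels
    partial = [np.stack([correlate2d(x, k) for k in blk]) # each slave: convolve
               for blk in blocks if len(blk)]
    return np.concatenate(partial)                        # master: splice results

x = np.arange(16.0).reshape(4, 4)                 # broadcast input data
kernels = np.stack([np.eye(2), np.ones((2, 2))])  # two output-channel kernels
y = forward_layer(x, kernels)
print(y.shape)   # (2, 3, 3): two output channels of 3x3
```

The backward half then reuses the same distribution machinery with roles swapped: the output gradient is broadcast, windowed input patches are distributed, and the spliced products form the kernel gradient.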
19. The method according to claim 18, characterized in that when the number of slave processing circuits is plural, the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
executing the i-th layer convolution forward operation specifically includes:
the main processing circuit determines that the input data is broadcast data and the convolution kernel is distribution data, splits the convolution kernel into a plurality of kernel data blocks, and sends at least one kernel data block of the plurality of kernel data blocks, the input data and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the operation instructions, a convolution operation on the received kernel data blocks and input data to obtain operation results, and transfer the operation results to the branch processing circuits;
the main processing circuit splices the operation results sent by the branch processing circuits to obtain the convolution result;
executing the i-th layer convolution backward operation specifically includes:
the main processing circuit broadcasts the i-th layer output data gradient to the branch processing circuits, selects, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the branch processing circuits;
the branch processing circuits forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the branch processing circuits;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results forwarded by the branch processing circuits, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
20. The method according to claim 18, characterized in that when the number of slave processing circuits is plural, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of the 1st row, the n slave processing circuits of the m-th row and the m slave processing circuits of the 1st column;
the k slave processing circuits forward data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
executing the i-th layer convolution forward operation specifically includes:
the main processing circuit determines that the input data is broadcast data and the convolution kernel is distribution data, splits the convolution kernel into a plurality of kernel data blocks, and sends at least one kernel data block of the plurality of kernel data blocks and at least one of the plurality of operation instructions to the k slave processing circuits;
the k slave processing circuits forward the kernel data blocks, input data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the operation instructions, a convolution operation on the received kernel data blocks and input data to obtain operation results, and transfer the operation results to the k slave processing circuits;
the main processing circuit splices the operation results sent by the k slave processing circuits to obtain the convolution result, and sends the convolution result to the controller unit;
executing the i-th layer convolution backward operation specifically includes:
the main processing circuit broadcasts the i-th layer output data gradient to the k slave processing circuits, selects, according to the convolution window, the i-th layer backward input data of the backward operation from the i-th layer input data, splits the i-th layer backward input data into a plurality of backward input data blocks, and distributes the plurality of backward input data blocks and the plurality of backward operation instructions to the k slave processing circuits;
the k slave processing circuits forward the backward input data blocks, vector operation results, i-th layer output data gradient and backward operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform, according to the backward operation instructions received, a vector-times-vector operation on the received backward input data blocks and the i-th layer output data gradient to obtain vector operation results, and return the vector operation results to the k slave processing circuits;
the main processing circuit determines the i-th layer convolution kernel gradient according to the vector operation results sent by the k slave processing circuits, and performs an update operation on the i-th layer convolution kernel with the i-th layer convolution kernel gradient to obtain the updated i-th layer convolution kernel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811181151.6A CN110059797B (en) | 2018-10-10 | 2018-10-10 | Computing device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059797A true CN110059797A (en) | 2019-07-26 |
CN110059797B CN110059797B (en) | 2020-03-10 |
Family
ID=67315787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811181151.6A Active CN110059797B (en) | 2018-10-10 | 2018-10-10 | Computing device and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059797B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717583A (en) * | 2019-09-30 | 2020-01-21 | 上海寒武纪信息科技有限公司 | Convolution circuit, processor, chip, board card and electronic equipment |
CN110990302A (en) * | 2019-11-22 | 2020-04-10 | 北京云宽志业网络技术有限公司 | Data caching method and device, electronic equipment and storage medium |
CN113837922A (en) * | 2021-09-26 | 2021-12-24 | 安徽寒武纪信息科技有限公司 | Computing device, data processing method and related product |
WO2022257980A1 (en) * | 2021-06-10 | 2022-12-15 | 寒武纪(西安)集成电路有限公司 | Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104809426A (en) * | 2014-01-27 | 2015-07-29 | 日本电气株式会社 | Convolutional neural network training method and target identification method and device |
US20170024849A1 (en) * | 2015-07-23 | 2017-01-26 | Sony Corporation | Learning convolution neural networks on heterogeneous cpu-gpu platform |
US20170103308A1 (en) * | 2015-10-08 | 2017-04-13 | International Business Machines Corporation | Acceleration of convolutional neural network training using stochastic perforation |
CN107239826A (en) * | 2017-06-06 | 2017-10-10 | 上海兆芯集成电路有限公司 | Computational methods and device in convolutional neural networks |
CN107341547A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for being used to perform convolutional neural networks training |
CN107341541A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing full articulamentum neural metwork training |
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | 清华大学 | Convolutional neural networks accelerating method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110059797B (en) | 2020-03-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543832A (en) | A kind of computing device and board | |
CN109522052A (en) | A kind of computing device and board | |
CN109657782A (en) | Operation method, device and Related product | |
CN109189473A (en) | Processing with Neural Network device and its method for executing vector exchange instruction | |
CN109685201A (en) | Operation method, device and Related product | |
CN110059797A (en) | A kind of computing device and Related product | |
CN110163362A (en) | A kind of computing device and method | |
CN109032670A (en) | Processing with Neural Network device and its method for executing vector duplicate instructions | |
CN110383300A (en) | A kind of computing device and method | |
CN109670581A (en) | A kind of computing device and board | |
CN110147249A (en) | A kind of calculation method and device of network model | |
CN109739703A (en) | Adjust wrong method and Related product | |
CN109753319A (en) | A kind of device and Related product of release dynamics chained library | |
CN111079908B (en) | Network-on-chip data processing method, storage medium, computer device and apparatus | |
CN110119807A (en) | Operation method, device, computer equipment and storage medium | |
CN110163349A (en) | A kind of calculation method and device of network model | |
CN109726822A (en) | Operation method, device and Related product | |
CN110059809A (en) | A kind of computing device and Related product | |
CN113010845A (en) | Computing device and method for executing matrix multiplication and related products | |
CN109711540A (en) | A kind of computing device and board | |
CN109740729A (en) | Operation method, device and Related product | |
CN110472734A (en) | A kind of computing device and Related product | |
CN109740730A (en) | Operation method, device and Related product | |
CN111381882B (en) | Data processing device and related product | |
CN109711538A (en) | Operation method, device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant after: Zhongke Cambrian Technology Co., Ltd Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd. |
GR01 | Patent grant | ||