CN108170640A - Neural network operation device and operation method using same - Google Patents

Neural network operation device and operation method using same

Info

Publication number
CN108170640A
CN108170640A (application CN201711452014.7A)
Authority
CN
China
Prior art keywords
operation unit
data
unit group
operation
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711452014.7A
Other languages
Chinese (zh)
Other versions
CN108170640B (en)
Inventor
周聖元
陈云霁
陈天石
刘少礼
郭崎
杜子东
刘道福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201711452014.7A
Publication of CN108170640A
Application granted
Publication of CN108170640B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/153 Multidimensional correlation or convolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The disclosure provides a neural network operation device and method. The device includes: an operation section for performing convolution operations, comprising multiple operation unit groups distributed in an array of X rows and Y columns, with data transferred between the operation unit groups in an S-shaped direction and/or an inverse S-shaped direction, X and Y each being a positive integer; and a cache for transmitting data to the operation unit groups and receiving the data after the operation unit groups complete their operations. By completing data transfer among the operation units along S-shaped and inverse S-shaped paths, the device effectively accelerates neural network operations while reducing the repeated reading of weights and partial sums and the memory-access power consumption caused by repeated accesses.

Description

Neural network operation device and operation method using same
Technical field
The present disclosure relates to the field of computers, and further to the field of artificial intelligence.
Background art
Deep neural networks are the foundation of many current artificial intelligence applications, and have found breakthrough uses in speech recognition, image processing, data analysis, advertisement recommendation systems, autonomous driving, and other areas, bringing deep neural networks into many aspects of daily life. However, the enormous amount of computation required by deep neural networks has always constrained their faster development and wider application. When accelerator designs are considered for speeding up deep neural network operations, the enormous computational load inevitably brings a very large energy overhead, which likewise restricts the further widespread application of accelerators.
A common existing approach is to use a general-purpose processor (CPU), which supports neural network algorithms by executing general-purpose instructions using general-purpose register files and general-purpose functional units. One disadvantage of this approach is that the computational performance of a single general-purpose processor is relatively low and cannot meet the performance requirements of neural network operations; and when multiple general-purpose processors execute in parallel, the communication between them in turn becomes a performance bottleneck. Another known approach is to use a graphics processing unit (GPU), which supports the above algorithms by executing general-purpose SIMD instructions using general-purpose register files and general-purpose stream processing units. Since the GPU is a device specialized for graphics, image, and scientific computation, its on-chip cache is small, so off-chip bandwidth becomes the main performance bottleneck, bringing a huge power overhead.
Summary of the disclosure
(1) Technical problem to be solved
In view of this, the present disclosure aims to provide a reconfigurable S-shaped operation device and operation method so as to solve at least some of the technical problems described above.
(2) Technical solution
According to one aspect of the present disclosure, a neural network operation device for performing convolution operations is provided, comprising:
an operation section for performing the convolution operations, comprising multiple operation unit groups distributed in an array of X rows and Y columns, with data transferred between the operation unit groups in an S-shaped direction and/or an inverse S-shaped direction, X and Y each being a positive integer; and
a cache for transmitting data to the operation unit groups and receiving the data after the operation unit groups complete their operations.
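For illustration only, the following is a minimal Python sketch of the S-shaped ("snake") ordering described above, in which even rows are traversed left to right and odd rows right to left, so that the last element of one row hands data to the first element of the next; the function names are illustrative and do not come from the disclosure.

    def s_shaped_order(X, Y):
        # Visit an X-row, Y-column array in snake order: even rows run
        # left-to-right, odd rows right-to-left.
        order = []
        for row in range(X):
            cols = range(Y) if row % 2 == 0 else reversed(range(Y))
            for col in cols:
                order.append((row, col))
        return order

    def inverse_s_shaped_order(X, Y):
        # The inverse S-shaped direction is the same path walked backwards.
        return list(reversed(s_shaped_order(X, Y)))

    print(s_shaped_order(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]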
In a further embodiment, the device further includes a control section for controlling the operation section and the cache, so that the two cooperate to complete the required functions.
In a further embodiment, each operation unit group includes multiple operation units distributed in an array of M rows and N columns, with data transferred between operation units in an S-shaped direction and/or an inverse S-shaped direction, M and N each being a positive integer.
In a further embodiment, each operation unit includes: two or more multipliers; two or more adders; and at least one internal storage section provided in the operation unit, the internal storage section being connected to the multipliers and/or adders.
In a further embodiment, each operation unit also includes two selectors for skipping the multipliers and adders in the operation unit: when the operation unit needs to perform computation, the selectors select the adder result as the output of the operation unit; or, when the operation unit does not need to perform computation, the selectors output the input data directly.
In a further embodiment, each operation unit group can also broadcast data to the cache independently and, under the control of the control section, select different output channels, so as to work in series or in parallel.
According to another aspect of the present disclosure, a method of performing convolution operations using any of the neural network operation devices described above is provided, including: setting a convolution kernel whose size is larger than the number of operation units in one operation unit group; and combining multiple operation unit groups into one operation unit cluster, so that the operation unit groups within the operation unit cluster transfer data and operate according to a serial mode of operation, while data is transferred and operated on between operation unit clusters according to a parallel mode of operation.
In a further embodiment, the method further includes: sending the weight data corresponding to an output feature map into the internal storage section of each operation unit; feeding the neurons to be operated on into the operation units for multiplication and addition; and passing each addition result along the S-shaped or inverse S-shaped direction to the next operation unit for further operation.
In a further embodiment, when the number of operation units in one operation unit group equals the convolution kernel size, an activation operation is further applied to the operation result.
In a further embodiment, when the number of operation units in one operation unit group is smaller than the convolution kernel size, the operation result is treated as temporary data and sent into the next operation unit group to continue the operation.
(3) Advantageous effects
(1) The neural network operation device of the present disclosure uses S-shaped and inverse S-shaped paths to transfer data among the operation units, and combines this with the weight-sharing property of neural networks, so that while effectively accelerating neural network operations it also reduces the repeated reading of weights and partial sums and the memory-access power consumption caused by repeated accesses.
(2) The neural network operation device of the present disclosure has multiple operation unit groups that support parallel computation, so that the operation unit groups can read and share the same set of neuron data while computing the data of multiple output feature maps at the same time, improving the utilization of neuron data and the efficiency of operation.
(3) The neural network operation device of the present disclosure can combine multiple operation unit groups, adjusting the transfer paths of operand data and result data under the control of the control section. During computation this accommodates different weight scales within the same operation network, further broadening the applicability of the operation section, raising the utilization of the operation units in the device, and accelerating neural network operation.
Description of the drawings
Fig. 1 is a schematic diagram of the neural network operation device of one embodiment of the disclosure.
Fig. 2 is a schematic diagram of the data flow direction of the neural network operation device of one embodiment of the disclosure.
Fig. 3 is a schematic diagram of the data flow direction of the neural network operation device of another embodiment of the disclosure.
Fig. 4 is a schematic diagram of an operation unit group in Fig. 1.
Fig. 5 is a schematic diagram of the operation performed within one operation unit group of Fig. 1.
Fig. 6 is a schematic diagram of three operation unit groups combined into one operation unit cluster.
Fig. 7 is a schematic diagram of one operation unit in Fig. 1.
Specific embodiments
To make the purpose, technical solution, and advantages of the present disclosure clearer, the disclosure is described in further detail below with reference to specific embodiments and the accompanying drawings.
The primary structure of the disclosure, shown in Fig. 1, is broadly divided into an operation section and a storage section. The operation section is used to complete the operations and contains multiple operation unit groups; each operation unit group contains multiple operation units and two or more arithmetic logic units (ALUs). The storage section is used to preserve data and includes an external storage section and internal storage sections: the external storage section, located outside the operation units, can be divided into several regions used respectively to hold input data, output data, and temporary buffers, while the internal storage sections, located inside the operation section, hold the data awaiting operation. In a preferred arrangement the device also includes a control section that controls its various parts so that they cooperate to complete the required functions.
The operation section includes X*Y operation unit groups (X and Y being arbitrary positive integers), arranged as a two-dimensional array of X rows and Y columns, with data transferred between operation unit groups in the S-shaped or inverse S-shaped direction. Each operation unit group can broadcast data to the cache and, under the control of the control section, select different output channels, so that the operation unit groups can work in series or in parallel. That is, when working in series, each operation unit group receives the data transferred from the operation unit group on its left/right and, after operating on it, transfers its output data to the operation unit group on its right/left; the last operation unit group sends the final result through the cache into the storage module for preservation, with the data flow direction shown in Fig. 2. The operation unit groups can also work in parallel: the initial data is delivered to each operation unit group along the original S-shaped path, the groups share the operand data and each carries out its own operation; every operation unit group then transfers its own operation result directly into the cache to be buffered and arranged, and when the operations are finished the data in the cache is exported into the storage module for preservation, with the data flow direction shown in Fig. 3.
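A minimal functional sketch (Python, with illustrative names not taken from the disclosure) of the two working modes just described: in series (Fig. 2), each group transforms the data and hands it to the next group along the S-shaped path, and only the last group's result is preserved; in parallel (Fig. 3), all groups share the same operand data and each writes its own result to the cache.

    def run_serial(groups, data):
        # Serial mode (Fig. 2): data snakes through every group; only the
        # final group's output is preserved.
        for group in groups:
            data = group(data)
        return [data]

    def run_parallel(groups, data):
        # Parallel mode (Fig. 3): every group reads the shared data and each
        # buffers its own result in the cache.
        return [group(data) for group in groups]

    groups = [lambda x, k=k: x + k for k in range(3)]  # stand-ins for unit groups
    print(run_serial(groups, 0))    # [3]
    print(run_parallel(groups, 0))  # [0, 1, 2]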
As shown in Fig. 4, each operation unit group includes M*N operation units (M and N being positive integers, preferably M=N=3 or M=N=5), arranged as a two-dimensional array of M rows and N columns, with data transferred between operation units in the S-shaped or inverse S-shaped direction. Each operation unit includes two or more multipliers (the first multiplier, second multiplier, and so on being denoted "X1", "X2", etc.), two or more adders (denoted "+0", "+1", etc.), and one internal storage unit. Each time, a multiplier in the operation unit multiplies the data read in from outside by data in the internal storage unit and feeds the product into an adder. The adder adds the data transferred along the S-shaped or inverse S-shaped path to the multiplier's product, and the result is transferred along the S-shaped or inverse S-shaped path into the adder of the next operation unit. The even-numbered adders (the zeroth, second, and so on) receive the data transferred along the S-shaped direction, perform the addition, and pass the result onward in the S-shaped direction; the odd-numbered adders (the first, third, and so on) receive the data transferred along the inverse S-shaped direction and pass their results onward along the inverse S-shaped path. When the operation reaches the last operation unit, the result can either be passed back along the inverse S-shaped path to continue the operation, or be transferred to the storage unit for preservation.
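Sketched below, under the assumptions of this paragraph, is the multiply-accumulate step of a single operation unit (Python; names are illustrative, not from the disclosure): the weight is held in the unit's internal storage, the neuron arrives from outside, and the partial sum arrives along the S-shaped or inverse S-shaped chain, so chaining the units reduces a kernel to a running sum of products.

    def unit_step(weight, neuron, partial_sum=0.0):
        # Multiplier ("X1"): external neuron times the stored weight;
        # adder ("+0"/"+1"): incoming partial sum plus the product,
        # forwarded to the next unit along the chain.
        return partial_sum + weight * neuron

    # Chaining 9 units over a 3x3 kernel:
    weights = [1, 0, 1, 0, 1, 0, 1, 0, 1]   # contents of internal storage
    neurons = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # neurons fed in from outside
    acc = 0.0
    for w, n in zip(weights, neurons):
        acc = unit_step(w, n, acc)
    print(acc)  # 20.0: one partial result for one output position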
As shown in Fig. 5, a weight datum is denoted W(o,i,x,y), meaning that the datum corresponds to the o-th output feature map, the i-th input feature map, and the position at row x, column y. Suppose the number of operation units is 3*3 and the kernel size is 3*3. The first group of weight data corresponding to the first output feature map and the second output feature map is sent into the internal storage section of each operation unit, as shown in Fig. 5. The neurons to be operated on are fetched from the storage section and fed into the operation units to be multiplied; each product is then fed into an adder for addition. For the data awaiting addition, "+0" of operation unit 0 and "+1" of operation unit 8 can either fetch the data directly from the storage unit for the addition, or be initialized to 0 so that the product is added to 0. The addition results of the operation units are then transferred in the prescribed directions: the addition result of "+0" of operation unit 0 is passed along the S-shaped path to the input of "+0" of operation unit 1, and the addition result of "+0" of operation unit 2 is passed along the S-shaped path to the input of "+0" of operation unit 3; the addition result of "+1" of operation unit 6 is passed along the inverse S-shaped path to the input of "+1" of operation unit 5, and the addition result of "+1" of operation unit 5 is passed along the inverse S-shaped path to the input of "+1" of operation unit 4. Next, the second group of neuron data is sent to the data processing module; in each operation unit it is multiplied with the weights and added to the previously transferred partial sums, and the results continue to be transferred in the assigned directions until all operations are complete. The operation result of "+1" of operation unit 0 and the operation result of "+0" of operation unit 8 can be written directly back to the designated location in the storage section. If the scale of the kernel exceeds the number of operation units, the result may be a temporary partial sum, which is stored in the temporary storage region of the storage unit; after the weight data has been updated according to the control instruction, the partial sum is sent to the inputs of "+0" of operation unit 0 and "+1" of operation unit 8 to continue the addition. If the result obtained is final and an activation operation is required, the result is input into the ALU for the activation operation and then written back to the storage section; otherwise it is written directly into the storage section for preservation. In this way, the weight-sharing property of convolutional neural networks can be fully exploited, avoiding the memory-access power consumption brought by repeatedly reading the weights. Meanwhile, the same set of neuron data is read once while the data of two output feature maps is computed simultaneously, improving the utilization of neuron data. In addition, multiple operation units can operate in parallel, greatly increasing the operation speed.
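The dataflow of this example can be modeled functionally as follows (a Python sketch assuming NumPy; names are illustrative). Each sliding window of neurons is read once and shared, and each kernel accumulates its own products, matching how the two output feature maps are computed from one read of the neuron data.

    import numpy as np

    def conv2d_shared_neurons(inp, kernels):
        # Convolve one input feature map with several kernels while reading
        # each window of neurons once, as the unit groups share neuron data.
        K = kernels[0].shape[0]
        H, W = inp.shape
        outs = [np.zeros((H - K + 1, W - K + 1)) for _ in kernels]
        for y in range(H - K + 1):
            for x in range(W - K + 1):
                window = inp[y:y + K, x:x + K]       # neurons read once
                for o, kern in enumerate(kernels):   # one MAC chain per map
                    outs[o][y, x] = np.sum(window * kern)
        return outs

    inp = np.arange(25.0).reshape(5, 5)
    k0, k1 = np.eye(3), np.ones((3, 3))
    out0, out1 = conv2d_shared_neurons(inp, [k0, k1])
    print(out0.shape, out1.shape)  # (3, 3) (3, 3)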
Under the control of the control section, this operation device can combine operation units to form an operation unit cluster, so as to adapt to layers of different scales within the same network model. That is, when the convolution kernel is larger than the number of operation units in one operation unit group, the control device can control the direction of data transfer and combine multiple operation unit groups into one operation unit cluster, so that the operation unit groups within the cluster transfer data and operate according to a serial mode of operation while data is transferred and operated on between clusters according to a parallel mode of operation. In other words, each operation unit group in an operation unit cluster completes the multiplications and additions of its data in the original order of operation (S-shaped or inverse S-shaped); the data in the cluster is then passed in turn to the adjacent operation unit group within the cluster for further operation; and when the operation is finished, the result is exported into the cache through the output path of the last operation unit group in the cluster.
Take the AlexNet network as an example. The kernel size of its first convolutional layer is 11*11, that of the second convolutional layer is 5*5, and that of the third convolutional layer is 3*3. We therefore initially configure each operation unit group to contain 3*3 operation units, i.e. M=N=3, with 15 operation unit groups in total, i.e. X=3 and Y=5. When processing the third convolutional layer (kernel 3*3), each operation unit group handles one kernel and the groups operate in parallel, each exporting its own operation result into the cache. When processing the second convolutional layer (kernel 5*5), every three operation unit groups are combined into one operation unit cluster, giving 5 clusters in all; data is transferred sequentially within each cluster while the clusters operate on their data in parallel. When processing the first convolutional layer (kernel 11*11), all the operation units can complete the operation in sequence. The control section controls the direction of data transfer, achieving dynamic combination and adjustment, so that layers of different scales within the same network are accommodated and the utilization of the operation device is improved. Fig. 6 shows three operation unit groups combined into one operation unit cluster: the original input data and intermediate results are transferred in turn along the S-shaped path to the operation units on the left/right, and each cluster, acting as one basic unit, obtains its operation result and outputs it into the cache. When the operations are finished, the data in the cache is output to the storage section.
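The grouping rule in this example can be expressed as a small calculation (Python sketch; names are illustrative): a cluster must chain enough M*N groups to cover kernel*kernel units, and the remaining groups form further clusters that run in parallel.

    import math

    def cluster_layout(kernel, M=3, N=3, num_groups=15):
        # Groups chained per cluster so the cluster holds at least
        # kernel*kernel operation units; clusters then run in parallel.
        groups_per_cluster = max(1, math.ceil(kernel * kernel / (M * N)))
        return groups_per_cluster, num_groups // groups_per_cluster

    print(cluster_layout(3))   # (1, 15): every group works independently
    print(cluster_layout(5))   # (3, 5): three groups per cluster, 5 clusters
    print(cluster_layout(11))  # (14, 1): one cluster; the text simply chains all 15 groups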
Preferably, as shown in Fig. 7, each operation unit also contains two selectors for skipping the multipliers and adders in that operation unit. When the operation unit needs to perform computation, the selectors select the adder result as the output of the operation unit; when the operation unit does not need to perform computation, the selectors output the input data directly. For example, when the scale of the convolution kernel is smaller than the number of operation units in an operation unit group, the surplus operation units can simply be skipped: their selectors pass the input straight through to the output without performing the multiply-add operation. Likewise, when multiple operation unit groups are combined and the total number of operation units after combination exceeds what the convolution kernel requires, the surplus operation units can output the input data directly through their selectors, while the other operation units output their adder results as the final results.
Specifically, when M=N=3 there are 9 operation units in an operation unit group. When the convolution kernel to be processed is 2*2, only 4 operation units are needed: these 4 units send the input data in turn into their multipliers and adders, and their selectors output the adder results as the units' results, while the other 5 operation units need not operate and their selectors output the input data directly as their results. When the convolution kernel to be processed is 5*5, three operation unit groups must be combined into one large operation unit group. The kernel then needs only 5*5=25 operation units, while the combined group contains 3*3*3=27 operation units, so two operation units are idle; these two units output the input data directly through their selectors without passing through the multipliers and adders.
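A minimal sketch of the selector behavior under these assumptions (Python; names are illustrative): an active unit outputs its adder result, while an idle unit's selector passes the incoming data straight through, so surplus units in the chain leave the partial sum untouched.

    def unit_output(active, weight, neuron, partial_sum):
        # Selector: adder result if the unit computes, otherwise the input
        # data is forwarded unchanged (multiplier and adder skipped).
        return partial_sum + weight * neuron if active else partial_sum

    # A 3x3 group (9 units) processing a 2x2 kernel: 4 active units, 5 idle.
    active = [True] * 4 + [False] * 5
    weights = [1, 2, 3, 4] + [0] * 5
    neurons = [10, 20, 30, 40] + [0] * 5
    acc = 0.0
    for a, w, n in zip(active, weights, neurons):
        acc = unit_output(a, w, n, acc)
    print(acc)  # 300.0 = 1*10 + 2*20 + 3*30 + 4*40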
In some embodiments a chip is disclosed, which includes the above neural network operation device.
In some embodiments a chip packaging structure is disclosed, which includes the above chip.
In some embodiments a board card is disclosed, which includes the above chip packaging structure.
In some embodiments an electronic device is disclosed, which includes the above board card.
The electronic device may include a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an airplane, ship, and/or automobile; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and/or range hood; the medical device includes a nuclear magnetic resonance instrument, B-mode ultrasound instrument, and/or electrocardiograph.
It should be understood that the disclosed related devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
Each functional unit/module may be hardware; for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structure include, but are not limited to, physical devices, which include but are not limited to transistors, memristors, and so on. The computing module in the computing device may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The storage unit may be any appropriate magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, or HMC.
The specific embodiments described above further explain the purpose, technical solution, and advantageous effects of the present disclosure in detail. It should be understood that the foregoing are merely specific embodiments of the present disclosure and do not limit it; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (10)

1. A neural network operation device for performing convolution operations, characterized by comprising:
an operation section for performing the convolution operations, comprising multiple operation unit groups distributed in an array of X rows and Y columns, with data transferred between operation unit groups in an S-shaped direction and/or an inverse S-shaped direction, X and Y each being a positive integer, wherein each operation unit group comprises multiple operation units distributed in an array of M rows and N columns, with data transferred between operation units in an S-shaped direction and/or an inverse S-shaped direction, M and N each being a positive integer; and
a cache for transmitting data to the operation unit groups and receiving the data after the operation unit groups complete their operations.
2. The neural network operation device according to claim 1, characterized by further comprising a control section for controlling the operation section and the cache so that the two cooperate to complete the required functions.
3. The neural network operation device according to claim 1, characterized in that each operation unit comprises:
two or more multipliers;
two or more adders; and
at least one internal storage section provided in the operation unit, the internal storage section being connected to the multipliers and/or the adders.
4. The neural network operation device according to claim 3, characterized in that each operation unit further comprises two selectors for skipping the multipliers and adders in the operation unit:
when the operation unit needs to perform computation, the selectors select the adder result as the output of the operation unit;
or, when the operation unit does not need to perform computation, the selectors output the input data directly.
5. The neural network operation device according to claim 1, characterized in that each operation unit group is further configured to broadcast data to the cache independently and, under the control of the control section, to select different output channels so as to work in series or in parallel.
6. A method of performing convolution operations using the neural network operation device of any one of claims 1-5, characterized by comprising:
setting a convolution kernel whose size is larger than the number of operation units in one operation unit group; and
combining multiple operation unit groups into one operation unit cluster, so that the operation unit groups within the operation unit cluster transfer data and operate according to a serial mode of operation, while data is transferred and operated on between operation unit clusters according to a parallel mode of operation.
7. The method according to claim 6, characterized by comprising:
sending the weight data corresponding to an output feature map into the internal storage section of each operation unit;
feeding the neurons to be operated on into the operation units for multiplication and addition; and
passing each addition result along the S-shaped or inverse S-shaped direction to the next operation unit for further operation.
8. The method according to claim 7, characterized in that when the number of operation units in one operation unit group equals the convolution kernel size, an activation operation is further applied to the operation result.
9. The method according to claim 8, characterized in that when the number of operation units in one operation unit group is smaller than the convolution kernel size, the operation result is treated as temporary data and sent into the next operation unit group to continue the operation.
10. The method according to claim 6, characterized in that transferring and operating on data between operation unit clusters according to the parallel mode of operation comprises:
each operation unit group in one operation unit cluster completing the multiplications and additions of its data according to the S-shaped or inverse S-shaped order of operation;
the data in the operation unit cluster being passed in turn to the adjacent operation unit group within the cluster for operation, until the operation is finished; and
exporting the result into the cache through the output path of the last operation unit group in the operation unit cluster.
CN201711452014.7A 2017-10-17 2017-10-17 Neural network operation device and operation method using same Active CN108170640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711452014.7A CN108170640B (en) 2017-10-17 2017-10-17 Neural network operation device and operation method using same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710967772.6A CN107632965B (en) 2017-10-17 2017-10-17 Reconfigurable S-shaped operation device and operation method
CN201711452014.7A CN108170640B (en) 2017-10-17 2017-10-17 Neural network operation device and operation method using same

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710967772.6A Division CN107632965B (en) 2017-10-17 2017-10-17 Reconfigurable S-shaped operation device and operation method

Publications (2)

Publication Number Publication Date
CN108170640A true CN108170640A (en) 2018-06-15
CN108170640B CN108170640B (en) 2020-06-09

Family

ID=61105558

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201711452014.7A Active CN108170640B (en) 2017-10-17 2017-10-17 Neural network operation device and operation method using same
CN201710967772.6A Active CN107632965B (en) 2017-10-17 2017-10-17 Reconfigurable S-shaped operation device and operation method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710967772.6A Active CN107632965B (en) 2017-10-17 2017-10-17 Reconfigurable S-shaped operation device and operation method

Country Status (1)

Country Link
CN (2) CN108170640B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111024108A (en) * 2019-12-20 2020-04-17 中国科学院计算技术研究所 Intelligent route planning display device
CN111290787A (en) * 2019-06-19 2020-06-16 锐迪科(重庆)微电子科技有限公司 Arithmetic device and arithmetic method
CN114004343A (en) * 2021-12-31 2022-02-01 之江实验室 Method and device for obtaining shortest path based on memristor pulse coupling neural network

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764468A (en) * 2018-05-03 2018-11-06 中国科学院计算技术研究所 Artificial neural network processor for intelligent recognition
CN111078623B (en) * 2018-10-18 2022-03-29 上海寒武纪信息科技有限公司 Network-on-chip processing system and network-on-chip data processing method
CN109583580B (en) * 2018-11-30 2021-08-03 上海寒武纪信息科技有限公司 Operation method, device and related product
CN110096308B (en) * 2019-04-24 2022-02-25 北京探境科技有限公司 Parallel storage operation device and method thereof
CN111832717B (en) * 2020-06-24 2021-09-28 上海西井信息科技有限公司 Chip and processing device for convolution calculation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402415A (en) * 2011-10-21 2012-04-04 清华大学 Device and method for buffering data in dynamic reconfigurable array
CN102646262A (en) * 2012-02-28 2012-08-22 西安交通大学 Reconfigurable visual preprocessor and visual processing system
CN103019656A (en) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
US20160085721A1 (en) * 2014-09-22 2016-03-24 International Business Machines Corporation Reconfigurable array processor for pattern matching
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402415A (en) * 2011-10-21 2012-04-04 清华大学 Device and method for buffering data in dynamic reconfigurable array
CN102646262A (en) * 2012-02-28 2012-08-22 西安交通大学 Reconfigurable visual preprocessor and visual processing system
CN103019656A (en) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
US20160085721A1 (en) * 2014-09-22 2016-03-24 International Business Machines Corporation Reconfigurable array processor for pattern matching
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUNJI CHEN et al.: "DaDianNao: A Machine-Learning Supercomputer", 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture *
尹勇生: "Research on reconfigurable multi-pipeline computing systems", China Excellent Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology series *
方睿 et al.: "Design of an FPGA parallel acceleration scheme for convolutional neural networks", Computer Engineering and Applications *
陈云霁: "From artificial intelligence to neural network processors", Leadership Science Forum *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290787A (en) * 2019-06-19 2020-06-16 锐迪科(重庆)微电子科技有限公司 Arithmetic device and arithmetic method
CN111024108A (en) * 2019-12-20 2020-04-17 中国科学院计算技术研究所 Intelligent route planning display device
CN114004343A (en) * 2021-12-31 2022-02-01 之江实验室 Method and device for obtaining shortest path based on memristor pulse coupling neural network
CN114004343B (en) * 2021-12-31 2022-10-14 之江实验室 Shortest path obtaining method and device based on memristor pulse coupling neural network

Also Published As

Publication number Publication date
CN107632965B (en) 2019-11-29
CN108170640B (en) 2020-06-09
CN107632965A (en) 2018-01-26

Similar Documents

Publication Publication Date Title
CN107632965B (en) Reconfigurable S-shaped operation device and operation method
US11656910B2 (en) Data sharing system and data sharing method therefor
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN105930902B (en) Neural network processing method and system
CN109189474A (en) Neural network processing device and method for executing vector addition instruction
CN110502330A (en) Processor and processing method
CN110245752A (en) Fully connected operation method and device
CN108205700A (en) Neural network computing device and method
CN112612521A (en) Apparatus and method for performing matrix multiplication operation
CN109032670A (en) Neural network processing device and method for executing vector copy instructions
CN111461311A (en) Convolutional neural network operation acceleration method and device based on many-core processor
CN110276447A (en) Computing device and method
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
CN110163350A (en) Computing device and method
CN109754062A (en) Execution method of convolution extended instruction and related product
CN110909872A (en) Integrated circuit chip device and related product
CN109389213B (en) Storage device and method, data processing device and method, and electronic device
TW201931216A Integrated circuit chip device and related products, comprising a compression mapping circuit for compressing data and a main processing circuit for executing each successive operation in the neural network operation
CN109389209A (en) Processing unit and processing method
CN110472734A (en) Computing device and related product
CN108960415A (en) Processing unit and processing system
TWI768168B (en) Integrated circuit chip device and related products
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
TW201937412A Integrated circuit chip device and related product with a small amount of computation and low power consumption

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant