CN110399976A

CN110399976A - Computing device and calculation method

Info

Publication number: CN110399976A
Application number: CN201810376471.0A
Authority: CN
Inventors: 梁晓峣; 景乃锋; 崔晓松; 陈云
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2018-04-25
Filing date: 2018-04-25
Publication date: 2019-11-01
Anticipated expiration: 2038-04-25
Also published as: WO2019206162A1; CN110399976B

Abstract

This application provides a computing device and a calculation method that can improve data utilization. The computing device includes multiple arithmetic units for performing multiple convolution operations on multiple data, the multiple arithmetic units including a first arithmetic unit and a second arithmetic unit. The first arithmetic unit is configured to: in the i-th convolution operation of the multiple convolution operations, multiply first weight data by first data in the multiple data, where i is an integer greater than 0; receive second data sent by the second arithmetic unit, where the second data is the data on which the second arithmetic unit performs the convolution operation in the i-th convolution operation; and in the (i+1)-th convolution operation of the multiple convolution operations, multiply the first weight data by the second data.

Description

Computing device and calculation method

Technical field

This application relates to the field of data processing, and in particular to a computing device and a calculation method.

Background

The convolution operation is the most common computation in convolutional neural networks and is used mainly in convolutional layers. In a convolution operation, each weight in the convolution kernel is multiplied by its corresponding input data, and the products are summed to give the output of one convolution operation. The kernel is then slid according to the stride of the convolutional layer, and the operation is repeated. Convolutional neural networks usually process very large amounts of data, and high data throughput requires high access bandwidth. For example, a computing device for convolution typically includes multiple arithmetic units that process data in parallel and a register file that stores the data to be convolved. In every convolution operation, each arithmetic unit must read the data it needs from the register file. Because the amount of data processed by convolution is usually very large, the arithmetic units must read a large amount of data from the register file during convolution.

Summary of the invention

This application provides a computing device and a calculation method that can improve data utilization.

In a first aspect, a computing device is provided, including multiple arithmetic units for performing multiple convolution operations on multiple data, the multiple arithmetic units including a first arithmetic unit and a second arithmetic unit. The first arithmetic unit is configured to: in the i-th convolution operation of the multiple convolution operations, multiply first weight data by first data in the multiple data, where i is an integer greater than 0; receive second data sent by the second arithmetic unit, where the second data is the data on which the second arithmetic unit performs the convolution operation in the i-th convolution operation; and in the (i+1)-th convolution operation of the multiple convolution operations, multiply the first weight data by the second data.

In the embodiments of this application, the first arithmetic unit can obtain from the second arithmetic unit the data that is repeated across two convolution operations, without reading the repeated data to be convolved from the register file again. This improves data utilization, reduces the amount of data exchanged between the arithmetic units and the register file, and reduces the access bandwidth overhead between them.

In one possible implementation, the computing device further includes multiple register groups for storing the multiple data. The multiple arithmetic units are further configured to obtain, from the multiple register groups, the multiple data needed for the multiple convolution operations, where each of the multiple arithmetic units can obtain the data stored in any one of the multiple register groups.

In the embodiments of this application, any of the multiple arithmetic units can access any register group. The multiple register groups therefore need to read in and store the data to be convolved only once, rather than repeatedly reading from memory and storing the large amount of data that is repeated across multiple convolution operations. This saves storage space in the register groups, relieves the access bandwidth pressure between the register groups and the memory, and improves the parallelism of data access.

In one possible implementation, the first arithmetic unit is configured to obtain, in the i-th convolution operation, the first data from a first register group of the multiple register groups; the second arithmetic unit is configured to obtain, in the i-th convolution operation, the second data from a second register group of the multiple register groups; and the first register group and the second register group are different register groups.

In one possible implementation, during the i-th convolution operation, the data that the multiple arithmetic units read from the multiple register groups is located in different register groups of the multiple register groups.

In the embodiments of this application, a flexible data layout is used in the multiple register groups, so that as much data as possible can be read in one clock cycle, data read conflicts are avoided, and the convolution proceeds without stalls.

In one possible implementation, the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data across the K*K register groups satisfies the following condition: the data in row x, column s of the multiple data is stored in the b-th register group, and the data in row x+a, column s of the multiple data is stored in the ([b + (a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ b ≥ 1, and M, N, x, s, b, and a are positive integers.
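
The row-to-register-group rule above can be sketched in a few lines. This is only an illustration of the stated formula: the function name `bank_of_row_offset` is hypothetical, and the value is returned exactly as the formula gives it. How a result of 0 maps onto the 1-based group numbering is not spelled out in the text and is left open here.

```python
def bank_of_row_offset(b: int, a: int, K: int) -> int:
    """Register group holding the datum at row x+a, column s,
    given that the datum at row x, column s is in register group b.

    Implements [b + (a*K)] mod (K*K) literally (raw remainder).
    """
    return (b + a * K) % (K * K)


# With K = 3 and b = 2, three consecutive rows of one column
# land in three different register groups.
K, b = 3, 2
groups = [bank_of_row_offset(b, a, K) for a in range(K)]
print(groups)
```

Under this rule, data that are K rows apart in the same column map back to the same register group, while the K vertically adjacent data covered by one kernel column map to distinct groups.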

In a second aspect, a calculation method is provided. The computing device to which the calculation method is applied includes multiple arithmetic units for performing multiple convolution operations on multiple data, the multiple arithmetic units including a first arithmetic unit and a second arithmetic unit. The method includes: in the i-th convolution operation of the multiple convolution operations, the first arithmetic unit multiplies first weight data by first data in the multiple data, where i is an integer greater than 0; the first arithmetic unit receives second data sent by the second arithmetic unit, where the second data is the data on which the second arithmetic unit performs the convolution operation in the i-th convolution operation; and in the (i+1)-th convolution operation of the multiple convolution operations, the first arithmetic unit multiplies the first weight data by the second data.

In one possible implementation, the computing device further includes multiple register groups for storing the multiple data, where each of the multiple arithmetic units can obtain the data stored in any one of the multiple register groups.

In one possible implementation, the method further includes: in the i-th convolution operation, the first arithmetic unit obtains the first data from a first register group of the multiple register groups; and in the i-th convolution operation, the second arithmetic unit obtains the second data from a second register group of the multiple register groups, where the first register group and the second register group are different register groups.

In one possible implementation, the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data across the K*K register groups satisfies the following condition: the data in row x, column s of the multiple data is stored in the r-th register group, and the data in row x+a, column s of the multiple data is stored in the ([r + (a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ r ≥ 1, and M, N, x, s, r, and a are positive integers.

In a third aspect, a computing device is provided, including multiple arithmetic units and multiple register groups, where each arithmetic unit is connected to each of the multiple register groups. The multiple register groups are configured to store multiple data on which multiple convolution operations are to be performed. The multiple arithmetic units are configured to obtain the multiple data from the multiple register groups and to perform multiple convolution operations on the multiple data, where each of the multiple arithmetic units is configured to obtain the data stored in any one of the multiple register groups.

In one possible implementation, during the i-th convolution operation of the multiple convolution operations, the data that the multiple arithmetic units read from the multiple register groups is located in different register groups of the multiple register groups, where i is an integer greater than 0.

In one possible implementation, the multiple arithmetic units include a first arithmetic unit and a second arithmetic unit, and the first arithmetic unit is configured to: in the i-th convolution operation of the multiple convolution operations, multiply first weight data by first data in the multiple data, where i is an integer greater than 0; receive second data sent by the second arithmetic unit, where the second data is the data on which the second arithmetic unit performs the convolution operation in the i-th convolution operation; and in the (i+1)-th convolution operation of the multiple convolution operations, multiply the first weight data by the second data.

In a fourth aspect, a chip is provided, on which is arranged the computing device described in the first aspect or in any possible implementation of the first aspect, or the computing device described in the third aspect or in any possible implementation of the third aspect.

In a fifth aspect, a computer storage medium is provided. The computer storage medium stores program code for execution by a computing device, and the program code includes instructions for performing the method of the second aspect.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of a computing device provided by an embodiment of this application.

Fig. 2 is a schematic diagram of a convolution kernel performing convolution operations on the data to be convolved in different cycles.

Fig. 3 is a schematic structural diagram of a computing device according to an embodiment of this application.

Fig. 4 is a schematic structural diagram of a computing device according to another embodiment of this application.

Fig. 5 is a schematic diagram of the connection between the register groups and the arithmetic units in an embodiment of this application.

Fig. 6 is a schematic diagram of the connection between the register groups and the arithmetic units in another embodiment of this application.

Fig. 7 is a schematic diagram of the arrangement of the data to be convolved in the multiple register groups in an embodiment of this application.

Fig. 8 is a schematic diagram of the process of performing a convolution operation in an embodiment of this application.

Fig. 9 is a schematic flowchart of the calculation method in an embodiment of this application.

Detailed description

The technical solutions in this application are described below with reference to the accompanying drawings.

For ease of understanding, the concept of the convolution operation and the related art are first described in detail.

The convolution operation is commonly used in convolutional neural networks, which can be used to process image data. The data on which convolution is to be performed may be called the input matrix or input data, and may be image data. A convolution operation corresponds to a convolution kernel. The convolution kernel is a matrix of size K*K, where K is an integer greater than or equal to 1, and each element of the kernel is a weight. During convolution, the kernel slides over the input matrix according to the stride. The sliding window divides the input matrix into many submatrices of the same size as the kernel; each submatrix is multiplied element by element with the kernel, and the products are accumulated to give the result of the convolution operation. It should be noted that the width and length of the convolution kernel may be equal or unequal. In the embodiments of this application, the description assumes that the width and length are equal, i.e. that the kernel size is K*K. Those skilled in the art will understand that the computing device and calculation method in the embodiments of this application can also be applied when the width and length of the kernel are unequal.
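
As a minimal sketch of the sliding-window computation described above (plain Python lists, no padding; the function name `conv2d` and the sample values are illustrative and not taken from the application):

```python
def conv2d(inp, kernel, stride=1):
    """Slide a K*K kernel over inp and return the matrix of results.

    Each output element is the sum of the element-wise products of the
    kernel and the K*K submatrix currently under the sliding window.
    """
    K = len(kernel)
    rows, cols = len(inp), len(inp[0])
    out = []
    for r in range(0, rows - K + 1, stride):
        out_row = []
        for c in range(0, cols - K + 1, stride):
            acc = 0
            for i in range(K):
                for j in range(K):
                    acc += kernel[i][j] * inp[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


result = conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
print(result)  # [[6, 8], [12, 14]]
```

Each inner `acc += ...` corresponds to one multiplication of a weight by a datum, which is the work the text assigns to a single arithmetic unit.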

Fig. 1 is a schematic structural diagram of a computing device 100 for convolution operations provided by an embodiment of this application. As shown in Fig. 1, the computing device 100 generally includes a convolution processing unit 110, a register file 120, a control unit 130, and a memory 140. The convolution processing unit 110 may include multiple arithmetic units, each of which performs the multiplication of data to be convolved by weight data. The convolution processing unit 110 may also include an adder tree unit, which accumulates the outputs of the arithmetic units according to the rules of convolution to obtain the result of the convolution operation. The memory 140 stores data. The register file 120 reads the data to be convolved from the memory 140 over a bus and stores it. The control unit 130 controls the convolution processing unit 110 and the register file 120 to carry out these operations. In the related art, the arithmetic units run independently of one another, i.e. there is no connection between them, so each arithmetic unit must read its data to be convolved from the register file 120 separately. Across multiple convolution operations, the data to be convolved contains a large amount of repeated data, yet this repeated data must still be read from the register file 120 again and again. Data utilization is therefore low.

For ease of understanding, the convolution process is described below with reference to Fig. 2.

Fig. 2 is a schematic diagram of a convolution kernel performing convolution operations on the data to be convolved in different cycles. As shown in Fig. 2, the convolution kernel is a 2*2 weight matrix, and the data to be convolved, also called the input data or input matrix, consists of multiple rows and columns. Fig. 2 shows the data processed in two adjacent convolution operations. The kernel slides over the input matrix according to the stride, and the data covered by the hatched area in Fig. 2 is the data repeated in the two operations, which illustrates the data reuse inherent in convolution. In the related art, however, a computing device performing convolution usually uses multiple arithmetic units to process data in parallel. The arithmetic units are independent of one another and cannot transfer data between themselves. In every convolution operation, therefore, each arithmetic unit must read the data that operation needs from the register file, and data repeated across two convolution operations is read from the register file repeatedly. For example, assume that a first arithmetic unit and a second arithmetic unit among the multiple arithmetic units perform a first convolution operation and a second convolution operation respectively, and that the two operations are adjacent. The first arithmetic unit reads the data to be convolved 10, 27, 35, and 82 from the register file in the cycle of the first convolution operation. The second arithmetic unit reads the data to be convolved 27, 69, 82, and 75 from the register file in the cycle of the second convolution operation. Data 27 and 82 are thus read from the register file twice. When the amount of data processed by convolution is large, a large amount of data is read repeatedly, which increases the pressure on access bandwidth.
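
The overlap in this example can be reproduced with a short sketch. The layout of the input matrix below is an assumption: the text lists only the values read in each operation, and the sketch assumes they come from a 2-row matrix read in row-major order with a stride of 1. The helper name `window` is illustrative.

```python
def window(inp, r, c, K):
    """Return the K*K data under the kernel whose top-left corner is (r, c)."""
    return [inp[r + i][c + j] for i in range(K) for j in range(K)]


# Assumed layout of the values mentioned in the Fig. 2 example.
inp = [[10, 27, 69],
       [35, 82, 75]]

first = window(inp, 0, 0, 2)   # data of the first convolution operation
second = window(inp, 0, 1, 2)  # data of the adjacent second operation
repeated = sorted(set(first) & set(second))
print(first, second, repeated)
```

With a K*K kernel and a stride of 1, K*(K-1) of the K*K data are shared between adjacent operations, so half of the reads in this 2*2 example are redundant.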

The embodiments of this application propose a computing device that reduces the data read repeatedly from the register file and thereby lowers the demand on access bandwidth. In each cycle, weight data or data to be convolved can be loaded into every arithmetic unit simultaneously.

Fig. 3 is a schematic structural diagram of a computing device 300 provided by an embodiment of the present invention. As shown in Fig. 3, the computing device 300 includes:

Multiple arithmetic units 310 for performing multiple convolution operations on multiple data, the multiple arithmetic units 310 including a first arithmetic unit 311 and a second arithmetic unit 312. The first arithmetic unit 311 is configured to: in the i-th convolution operation of the multiple convolution operations, multiply first weight data by first data in the multiple data, where i is an integer greater than 0; receive second data sent by the second arithmetic unit 312, where the second data is the data on which the second arithmetic unit 312 performs the convolution operation in the i-th convolution operation; and in the (i+1)-th convolution operation of the multiple convolution operations, multiply the first weight data by the second data.

Optionally, receiving the second data sent by the second arithmetic unit 312 may mean that the first arithmetic unit 311 obtains the second data directly through a connection to the second arithmetic unit 312. It may also mean that the first arithmetic unit 311 obtains the second data through an indirect connection to the second arithmetic unit 312. For example, the first arithmetic unit 311 is connected to a third arithmetic unit, the third arithmetic unit is connected to the second arithmetic unit 312, the second arithmetic unit 312 sends the second data to the third arithmetic unit, and the third arithmetic unit then forwards the second data to the first arithmetic unit 311.

The multiple arithmetic units 310 correspond one to one with the multiple weight data in the convolution kernel and, in one convolution operation, each performs the multiplication corresponding to its weight data. For example, the first weight data is the weight data corresponding to the first arithmetic unit 311, and second weight data is the weight data corresponding to the second arithmetic unit 312.

Optionally, the computing device 300 may further include multiple register groups 320 for storing the multiple data used in the multiple convolution operations. The multiple arithmetic units 310 are further configured to obtain, from the multiple register groups 320, the multiple data processed by the multiple convolution operations. For example, some of the arithmetic units 310 may obtain from the multiple register groups 320 the data of the current convolution operation that does not repeat data of the previous convolution operation, while the remaining arithmetic units 310 may obtain from other arithmetic units 310 the data of the current convolution operation that repeats data of the previous one. Optionally, in some convolution operations there is no data repeated between that operation and the previous one, and all arithmetic units 310 must then obtain the data to be convolved from the register groups 320. For example, in the first convolution operation, or when the kernel is moving in the row direction and moves to a new row, or when the kernel is moving in the column direction and moves to a new column, all arithmetic units 310 must obtain the data to be convolved from the register groups 320.

The multiple data include the first data and the second data. In one example, assume that the first data is stored in a first register group of the multiple register groups and the second data is stored in a second register group of the multiple register groups, the first and second register groups being different register groups. If the data processed in the (i-1)-th convolution operation includes neither the first data nor the second data, then in the i-th convolution operation the first arithmetic unit obtains the first data from the first register group and the second arithmetic unit obtains the second data from the second register group.

Optionally, the multiple register groups 320 may also be called a register file. Each register group includes multiple registers, and each register stores one datum. Each register group may be provided with one read port, i.e. each of the multiple arithmetic units can read one datum from one register group in each clock cycle.

In some examples, the multiple arithmetic units 310 perform the multiplications of each convolution operation. For example, assume that the convolution kernel corresponding to the convolution operation includes K*K weight data, where K is an integer greater than 1. The multiple arithmetic units 310 may then include K*K arithmetic units, which respectively perform the K*K multiplications corresponding one to one with the K*K weight data.

The K*K arithmetic units may be divided into K groups of K arithmetic units each. In one example, the K groups of arithmetic units correspond to the K columns of the K*K weight data, and the K arithmetic units in each group correspond in order, one to one, with the K weight data in the respective column. Alternatively, the K groups of arithmetic units may correspond to the K rows of the K*K weight data, and the K arithmetic units in each group correspond in order, one to one, with the K weight data in the respective row. The embodiments of this application are described using the example of the convolution kernel moving along the row direction of the input matrix. Those skilled in the art will understand that the scheme in which the kernel moves along the column direction of the input matrix is similar, and it is not described again here.

The K groups of arithmetic units may be configured to perform, in the i-th convolution operation, the K*K multiplications corresponding to the K*K weight data. After the i-th convolution operation has been performed, the (K-t)-th to K-th groups of the K groups of arithmetic units obtain the data needed for the (i+1)-th convolution operation from the multiple register groups 320, where t is the sliding stride of the convolution operation and is an integer greater than or equal to 1. The y-th group of the K groups of arithmetic units reads the second data from the (y+t)-th group of arithmetic units, where y is a positive integer and 1 ≤ y ≤ K-t. Specifically, the d-th arithmetic unit in the y-th group reads the second data from the d-th arithmetic unit in the (y+t)-th group. The second data is the data to be convolved that is repeated between the i-th and (i+1)-th convolution operations.
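
The group-to-group handover can be simulated with a short sketch. Several things here are assumptions made for illustration: groups are 0-based and correspond to kernel columns, the kernel moves along one row band, and the sketch takes the groups that have no (y+t)-th neighbor (the last t groups) as the ones that fetch from the register groups. The function name `slide_along_row` and the sample matrix are hypothetical.

```python
def slide_along_row(inp, K, t=1, r=0):
    """Slide a K*K kernel along row band r with stride t.

    Returns (snapshots, register_reads). snapshots[n][g][d] is the datum
    held by the d-th unit of group g during the n-th convolution.
    """
    cols = len(inp[0])
    # First convolution: every unit reads from the register groups.
    held = [[inp[r + d][g] for d in range(K)] for g in range(K)]
    register_reads = K * K
    snapshots = [held]
    c = 0
    while c + t + K <= cols:
        c += t
        nxt = []
        for g in range(K):
            if g + t < K:
                # Reuse: the d-th unit of group g takes the datum held by
                # the d-th unit of group g+t in the previous convolution.
                nxt.append(list(held[g + t]))
            else:
                # Groups without such a neighbor fetch new data.
                nxt.append([inp[r + d][c + g] for d in range(K)])
                register_reads += K
        held = nxt
        snapshots.append(held)
    return snapshots, register_reads


snaps, reads = slide_along_row([[10, 27, 69], [35, 82, 75]], K=2)
print(snaps, reads)
```

In this 2*2 example the two convolutions need 6 register reads instead of the 8 required when every unit reads independently; in general each step after the first costs t*K reads rather than K*K.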

Optionally, the first arithmetic unit in Fig. 3 may be any arithmetic unit in the y-th group of arithmetic units. The second arithmetic unit in Fig. 3 may be any arithmetic unit in the (K-t)-th to K-th groups of arithmetic units.

In some examples, adjacent arithmetic units in each row of the K*K arithmetic units are connected to each other. This connection allows data to slide between arithmetic units. Some of the arithmetic units can then obtain from the multiple register groups the data that is not repeated across two convolution operations, while the remaining arithmetic units obtain from adjacent arithmetic units the data that is repeated across two consecutive convolution operations. This avoids reading repeated data from the register groups and reduces the amount of data exchanged between the arithmetic units and the register file.

It should be noted that the connections among the K*K arithmetic units are not limited to connections within each row; various other connection schemes may be used. For example, adjacent arithmetic units in each column may be connected to each other, or all the arithmetic units in each row may be connected to one another. Other connection schemes are also possible, provided that the scheme allows the first arithmetic unit 311 to obtain the second data from the second arithmetic unit 312.

As an example, Fig. 4 shows a detailed structural diagram of a computing device 400 in one embodiment of this application. As shown in Fig. 4, the computing device 400 includes a convolution processing unit 31 and multiple register groups 320. Assume that the convolution kernel includes K*K weight data. The convolution processing unit 31 includes K*K arithmetic units 310 and an adder tree unit 316. The K*K arithmetic units 310 correspond one to one with the K*K weight data and respectively perform the corresponding multiplications. The K*K arithmetic units 310 are arranged in K rows and K columns, and adjacent arithmetic units in each row are connected. Specifically, each arithmetic unit 310 may include a first register 3101, a second register 3102, and a multiplier 3103, the multiplier 3103 being connected to the first register 3101 and the second register 3102. The first register 3101 stores the weight data corresponding to the arithmetic unit 310, and the second register 3102 may store the data to be convolved that corresponds to the arithmetic unit 310 in one convolution operation. The multiplier 3103 multiplies the data stored in the first register 3101 by the data stored in the second register 3102. The adder tree unit 316 receives the multiplication results from the K*K arithmetic units 310 and accumulates them to obtain the result of one convolution operation.
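
A rough software model of this structure is sketched below. It is not the hardware design: the class name `ArithmeticUnit`, its field names, the pairwise reduction used for the adder tree, and the sample weights are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ArithmeticUnit:
    weight: int = 0  # models the first register (weight datum)
    data: int = 0    # models the second register (datum to be convolved)

    def multiply(self) -> int:
        """Models the multiplier fed by the two registers."""
        return self.weight * self.data


def adder_tree(values):
    """Accumulate the products by summing them pairwise, level by level."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes through
        values = paired
    return values[0] if values else 0


# A 2*2 kernel: four units, each holding one weight and one datum.
units = [ArithmeticUnit(w, d)
         for w, d in zip([1, 2, 3, 4], [10, 27, 35, 82])]
conv_result = adder_tree(u.multiply() for u in units)
print(conv_result)  # 497
```

Between convolution operations only the `data` field of a unit changes; the `weight` field stays fixed, which is what makes it possible to hand data from one unit to its neighbor.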

Optionally, the computing device 400 may further include a memory 330 and a control unit 340. The memory 330 may be connected to the multiple register groups 320 by a bus. The memory 330 stores the data to be convolved and supplies it to the multiple register groups over the bus. The control unit 340 controls the convolution processing unit 31 and the multiple register groups 320 to carry out the operations described above.

The embodiments of this application place no limitation on the type of memory. For example, the memory may be double data rate synchronous dynamic random access memory (DDR SDRAM).

Optionally, the embodiments of this application place no limitation on how the data to be convolved is stored in the multiple register groups 320. For example, the multiple register groups 320 may be divided into partitions. Each partition corresponds to one arithmetic unit 310, which reads the data to be convolved that it needs only from that partition and does not read data from other partitions.

As an example, Fig. 5 shows the connection between the multiple arithmetic units 310 and the multiple register groups 320 in an embodiment of this application. The multiple arithmetic units 310 correspond one to one with the multiple register groups 320, and each arithmetic unit is connected to one register group. Each register group may store the data to be convolved that corresponds to one arithmetic unit. In each clock cycle, each arithmetic unit reads one datum from its corresponding register group.

In the partitioned storage scheme, however, each arithmetic unit can read data only from its own partition. If the data to be convolved of several arithmetic units overlaps, several partitions of the multiple register groups 320 must each read that data from the memory 330 and store it. This storage scheme takes up storage space in the multiple register groups 320 and also increases the bandwidth the register groups 320 need for reading data from the memory. For example, referring again to Fig. 2, suppose that in one convolution operation the first arithmetic unit 311 reads in data 10 and the second arithmetic unit 312 reads in data 27. In the convolution operation performed after the kernel moves one column to the right, the first arithmetic unit 311 must read in data 27 and the second arithmetic unit 312 must read in data 69. Data 27 is then read in and stored in both the partition corresponding to the first arithmetic unit 311 and the partition corresponding to the second arithmetic unit 312.
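
The storage cost of partitioning can be estimated with a small sketch. It assumes one partition per arithmetic unit, identified by the unit's position (i, j) in the kernel, and counts every datum that unit ever needs over all window positions. The function name `partition_storage` is hypothetical.

```python
def partition_storage(rows, cols, K, stride=1):
    """Return (regions, total) for a rows*cols input and a K*K kernel.

    regions[(i, j)] is the set of input positions that the unit at kernel
    position (i, j) needs, i.e. what its partition must store. total is
    the number of stored entries summed over all partitions.
    """
    regions = {}
    for r in range(0, rows - K + 1, stride):
        for c in range(0, cols - K + 1, stride):
            for i in range(K):
                for j in range(K):
                    regions.setdefault((i, j), set()).add((r + i, c + j))
    total = sum(len(s) for s in regions.values())
    return regions, total


regions, total = partition_storage(rows=2, cols=3, K=2)
print(total, 2 * 3)  # stored entries with partitioning vs. unique data
```

For the 2*3 example, position (0, 1), which holds the value 27 in the example, appears in the partitions of two different units, so 8 entries are stored for only 6 distinct data.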

Alternatively, the multiple register groups 320 may be left unpartitioned, so that each arithmetic unit can access the data stored in any register group. In the unpartitioned scheme, each of the multiple arithmetic units 310 is configured to obtain the data stored in any one of the multiple register groups 320.

As an example, Fig. 6 shows the connection between the multiple arithmetic units 310 and the multiple register groups 320 in another embodiment of this application. It should be noted that the connection scheme or structure shown in Fig. 6 can be applied in the computing devices provided by this application, such as the computing device of Fig. 1, Fig. 3, or Fig. 4. It can also be applied in any other computing device; the embodiments of this application place no limitation on this.

As shown in Fig. 6, a fully connected switching network can be arranged between the multiple arithmetic elements 310 and the multiple register groups 320. The fully connected switching network provides full connectivity between the arithmetic elements 310 and the register groups 320. Through this network, each of the arithmetic elements 310 can be connected to any one of the multiple register groups 320. In other words, each arithmetic element of the multiple arithmetic elements can obtain the data stored in any one of the multiple register groups.

For example, in the i-th convolution operation the first arithmetic element 311 can obtain the first data from a first register group of the multiple register groups 320, and in the j-th convolution operation the first arithmetic element 311 can obtain third data from a second register group of the multiple register groups 320, where i and j are positive integers and i ≠ j.

In the embodiments of this application, the multiple arithmetic elements 310 and the multiple register groups 320 are connected through a fully connected switching network, so that any arithmetic element 310 can access any register group. The register groups 320 therefore only need to read in and store the data to be convolved once, rather than repeatedly reading from the memory 330 and storing the large amount of data that is repeated across multiple convolution operations. This saves storage space in the register groups 320, relieves the pressure on the access bandwidth between the register groups 320 and the memory 330, and improves the parallelism of data access.

However, in the unpartitioned storage mode, each register group 320 has one read port, so each arithmetic element 310 can read only one datum from one register group 320 in each clock cycle. If several of the data to be convolved in one convolution operation are stored in the same register group 320, a data read conflict occurs, and the arithmetic elements 310 then need multiple clock cycles to read all of the required data. This storage mode therefore reduces the processing speed of the arithmetic elements 310.
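The effect of a read conflict can be illustrated with a simple model. The Python sketch below assumes that reads to different register groups proceed in parallel and that reads to the same register group are served one per clock cycle, so the number of cycles equals the largest number of requests aimed at any single group. The row-major placement in `naive_group` is a hypothetical layout chosen only to provoke a conflict; it is not taken from the embodiment.

```python
from collections import Counter


def read_cycles(requested_groups):
    """Clock cycles needed when each register group serves one read per cycle."""
    return max(Counter(requested_groups).values())


def naive_group(row, col, n_cols=16, n_groups=9):
    """Hypothetical row-major placement: consecutive data go to consecutive groups."""
    return (row * n_cols + col) % n_groups


# Register groups touched by the 3*3 window in the top-left corner (0-based indices).
window = [naive_group(r, c) for r in range(3) for c in range(3)]
print(window, read_cycles(window))
```

With this naive placement two of the nine data fall into the same register group twice over, so the window cannot be fetched in a single cycle.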

To solve the above problem, the embodiments of this application further propose a way of storing the data to be convolved in the multiple register groups 320, which is described below.

In some embodiments, the multiple data to be convolved in the multiple convolution operations can be arranged so that, during the i-th convolution operation, the data that the multiple arithmetic elements 310 read from the multiple register groups 320 are located in different register groups of the multiple register groups 320.

Here, the i-th convolution operation can be any one of the multiple convolution operations. The data that the multiple arithmetic elements 310 read from the multiple register groups 320 are the data that need to be read from the register groups 320 in the i-th convolution operation. These data can be all of the data processed in the i-th convolution operation, or only part of them. For example, if the i-th convolution operation and the (i-1)-th convolution operation share some data, the shared data can be obtained inside the multiple arithmetic elements 310, and only the data that are not shared need to be obtained from the register groups 320. In that case, the data to be read from the register groups 320 in the i-th convolution operation are the non-shared data. If the i-th convolution operation shares no data with the (i-1)-th convolution operation, the data to be read from the register groups 320 in the i-th convolution operation are all of the data to be processed in that operation.

The multiple data to be convolved can be arranged in a variety of specific ways. In one example, the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data in the K*K register groups satisfies the following condition:

The data in row x, column s of the multiple data is stored in the b-th register group, and the data in row x+a, column s of the multiple data is stored in the ([b+(a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ b ≥ 1, and M, N, x, s, b and a are positive integers.

Optionally, on the basis of the above condition, each row of the M rows of data can be arranged cyclically in sequence across the multiple register groups. For example, the first row of data can be arranged in sequence from the first register group to the (K*K)-th register group, and then continue cyclically from the first register group again.

Optionally, the multiple data can be arranged so that any K consecutive data in each row of the M rows of data are stored in different register groups, and/or any K consecutive data in each column of the N columns of data are stored in different register groups.
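One placement that satisfies the condition and both optional properties can be written as a closed-form index. The Python sketch below is a minimal illustration under stated assumptions: the first datum of the first row is placed in register group 1, each row is laid out cyclically, and a remainder of 0 in the formula is read as register group K*K. The function names are hypothetical.

```python
def register_group(x, s, k):
    """1-based register group for the datum in row x, column s (both 1-based)."""
    return ((x - 1) * k + (s - 1)) % (k * k) + 1


def satisfies_condition(m_rows, n_cols, k):
    """Check the row-shift condition and both K-consecutive properties."""
    groups = k * k
    for x in range(1, m_rows + 1):
        for s in range(1, n_cols + 1):
            b = register_group(x, s, k)
            # Row shift: row x+a lands in group [b + a*K] mod (K*K); 0 is read as K*K.
            for a in range(1, m_rows - x + 1):
                expected = (b + a * k) % groups or groups
                if register_group(x + a, s, k) != expected:
                    return False
            # K consecutive data along the row must use K different groups.
            if s + k - 1 <= n_cols:
                if len({register_group(x, s + c, k) for c in range(k)}) != k:
                    return False
            # K consecutive data along the column must use K different groups.
            if x + k - 1 <= m_rows:
                if len({register_group(x + a, s, k) for a in range(k)}) != k:
                    return False
    return True


print(all(satisfies_condition(16, 16, k) for k in (2, 3, 5)))
```

The check is exhaustive for the sizes passed in, so it confirms the properties only for those sizes; it is not a general proof.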

The above arrangement is only an example; in some embodiments, the arrangement of the multiple data can be modified or transformed with reference to the above formula. For example, the multiple register groups need not be limited to K*K register groups, as long as the data to be convolved in one convolution operation are stored in different register groups. Alternatively, the rows and columns of the multiple data can be swapped, and the data can still be stored according to the above arrangement.

Fig. 7 is a schematic diagram of the arrangement of the multiple data to be convolved in the multiple register groups in an embodiment of this application. For ease of understanding, a specific arrangement of the data to be convolved is described below with reference to Fig. 7. As shown in Fig. 7, the convolution kernel is a weight matrix of size 3*3. The multiple arithmetic elements may include 3*3 arithmetic elements, and the multiple register groups may include 3*3 register groups. The arithmetic elements and the register groups are connected through the fully connected switching network. The register groups can form a register array, in which each row represents one register group and each register group includes several registers. The columns of the register array can be denoted by column numbers r1, r2, ..., rn, and the rows of the register array can be denoted by b1, b2, ..., b9. Fig. 7 also shows the multiple data to be convolved that are required by the multiple convolution operations. The multiple data include multiple rows and multiple columns of data. For ease of representation, the data are denoted 1, 2, 3, ..., 256.

The multiple data can be arranged in the multiple register groups in the manner described above. For example, the first row of data 1, 2, ..., 16 can be stored cyclically in sequence in the first register group b1 through the ninth register group b9. If the first datum of the first row is stored in the b-th register group, the second row of data is loaded starting from the ([b+(a*K)] mod (K*K))-th register group, here with a = 1. For example, the first datum 1 of the first row is located in the first register group, that is, b = 1; according to the above calculation, the first datum 17 of the second row is stored in register group [1+(1*3)] mod 9 = 4, and the first datum 33 of the third row should be stored in register group [1+(2*3)] mod 9 = 7. As an optional arrangement, the second row of data can be stored cyclically in sequence in the multiple register groups starting from the fourth register group, and the third row of data can be stored cyclically starting from the seventh register group. With this arrangement, the multiple arithmetic elements can read all of the data required for one convolution operation within one clock cycle. In addition, as shown in Fig. 7, assume that the 3*3 weight data are q1 to q9. The weight data can likewise be stored in different register groups of the multiple register groups. With this arrangement, before the multiple convolution operations start, the multiple arithmetic elements can read all of the weight data in one clock cycle and then read the data to be convolved during the convolution operations.
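The Fig. 7 example can be reproduced numerically. The Python sketch below assumes the data 1 to 256 form 16 rows of 16 (consistent with the first row being 1 to 16 and the second row starting at 17), a stride of 1, and 0-based indices internally; the variable names are illustrative and not taken from the figure.

```python
K = 3
ROWS = COLS = 16  # data 1..256 arranged as 16 rows of 16 (assumed)

# data[x][s] with 0-based indices holds the values 1..256 row by row.
data = [[x * COLS + s + 1 for s in range(COLS)] for x in range(ROWS)]


def group_of(x, s):
    """0-based register group of the datum in 0-based row x, column s."""
    return (x * K + s) % (K * K)


# Fill the register array: groups[g] lists the data held by register group b(g+1).
groups = [[] for _ in range(K * K)]
for x in range(ROWS):
    for s in range(COLS):
        groups[group_of(x, s)].append(data[x][s])

# First datum of rows 1..3 and the 1-based register group it starts in.
starts = [(data[x][0], group_of(x, 0) + 1) for x in range(3)]
print(starts)

# Every 3*3 window must touch nine different register groups.
conflict_free = all(
    len({group_of(top + a, left + c) for a in range(K) for c in range(K)}) == K * K
    for top in range(ROWS - K + 1)
    for left in range(COLS - K + 1)
)
print(conflict_free)
```

The starting groups match the values 4 and 7 computed in the text, and under these assumptions no 3*3 window places two data in the same register group.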

For ease of understanding, the process by which the computing device in the embodiments of this application performs convolution operations is described in detail below with reference to Fig. 8. Fig. 8 is a specific example of the convolution operation in another embodiment of this application. Assume that the convolution kernel in Fig. 8 is a 3*3 convolution kernel, and that the computing device shown in Fig. 3 or Fig. 4 and the data arrangement of Fig. 7 are used. The convolution operation proceeds as follows:

S801, load the weight data corresponding to the convolution kernel from the memory 330 into the multiple register groups 320; the 3*3 weight data are stored in different register groups 320.

S802, load the multiple data corresponding to the multiple convolution operations into the multiple register groups 320, arranged as shown in Fig. 7.

S803, read the weight data corresponding to the 3*3 arithmetic elements 310 from the multiple register groups 320, and store the weight data in the first register 3101 of each arithmetic element 310. Because the weight data are arranged in order, they can be read without conflict when loaded into the arithmetic elements 310.

S804, in the first convolution operation, read the first 3 data of each of the first 3 rows of the data to be convolved into the multiple arithmetic elements 310, 3*3 data in total. Owing to the above arrangement of the data to be convolved, these 3*3 data are read from different register groups 320 among the 3*3 register groups 320, so the data required for the first convolution operation can be read in one clock cycle. The 3*3 arithmetic elements 310 each multiply their datum by the corresponding one of the 3*3 weight data, one to one, and output the results.

S805, each arithmetic element 310 sends its output result to the adder tree unit 316 (accumulation unit), which produces the operation result of the first convolution operation; the operation result can be written back to the register file. The register file may include the multiple register groups 320 and may also include other register groups. When the operation result is written back to the register file, its storage location does not conflict with the locations of the previously stored data to be convolved.

S806, before the second convolution operation, the third-column arithmetic elements 310 read the 4th datum of each of the first 3 rows of the multiple data from the multiple register groups 320.

S807, the third-column arithmetic elements 310 transfer the data they processed in the first convolution operation to the second-column arithmetic elements 310, and the second-column arithmetic elements 310 transfer the data they processed in the first convolution operation to the first-column arithmetic elements 310. Then each arithmetic element 310 performs the multiplication of the second convolution operation.

S808, each arithmetic element 310 sends its output result to the adder tree unit 316 (accumulation unit), which produces the operation result of the second convolution operation; the operation result can be written back to the multiple register groups 320. Subsequent convolution operations repeat the above steps.
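Steps S804 to S808 can be simulated in software to check that passing data between columns gives the same results as recomputing every window. The Python sketch below is a behavioural model only, not a description of the hardware: it assumes a stride of 1, covers a single horizontal pass over the first 3 rows, uses placeholder values for the weights q1 to q9, and counts register-group reads with a simple counter. All function and variable names are assumptions.

```python
def direct_row(data, weights, top=0):
    """Reference: recompute every K*K window from scratch."""
    k = len(weights)
    return [
        sum(weights[a][c] * data[top + a][left + c] for a in range(k) for c in range(k))
        for left in range(len(data[0]) - k + 1)
    ]


def sliding_row(data, weights, top=0):
    """Slide the kernel along one band of rows, passing data between columns."""
    k = len(weights)
    # S804: the first operation reads K*K data from the register groups.
    held = [[data[top + a][c] for c in range(k)] for a in range(k)]
    reads = k * k
    results = []
    for left in range(len(data[0]) - k + 1):
        if left > 0:
            for a in range(k):
                # S806: the last column reads one new datum per row.
                new = data[top + a][left + k - 1]
                # S807: each column hands its datum to the column on its left.
                held[a] = held[a][1:] + [new]
            reads += k
        # Multiplications followed by the adder tree (S805 / S808).
        results.append(sum(weights[a][c] * held[a][c] for a in range(k) for c in range(k)))
    return results, reads


data = [[r * 16 + c + 1 for c in range(16)] for r in range(16)]
weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # placeholder values for q1..q9
results, reads = sliding_row(data, weights)
print(results == direct_row(data, weights), reads)
```

In this model only the first operation reads nine data from the register groups; each later operation reads three, which is the reduction in register traffic that the sliding scheme aims for.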

The method embodiments of this application are described below. The method embodiments correspond to the device embodiments, so for parts not described in detail, refer to the foregoing device embodiments.

Fig. 9 is a schematic flowchart of the calculation method of an embodiment of this application. The computing device to which the calculation method is applied includes multiple arithmetic elements, the multiple arithmetic elements are configured to perform multiple convolution operations on multiple data, and the multiple arithmetic elements include a first arithmetic element and a second arithmetic element.

The method includes:

S901, in the i-th convolution operation of the multiple convolution operations, the first arithmetic element performs a multiplication on first weight data and first data of the multiple data, where i is an integer greater than 0;

S902, the first arithmetic element receives second data sent by the second arithmetic element, where the second data are the data on which the second arithmetic element performs the convolution operation in the i-th convolution operation;

S903, in the (i+1)-th convolution operation of the multiple convolution operations, the first arithmetic element performs a multiplication on the first weight data and the second data.
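The three steps can be traced with two arithmetic elements in a few lines of Python. This is a minimal sketch with hypothetical names: the class `ArithmeticElement`, its methods, and the numeric values are illustrative assumptions, and the model ignores timing and the adder tree.

```python
class ArithmeticElement:
    """Holds one weight and the datum currently being processed."""

    def __init__(self, weight):
        self.weight = weight
        self.data = None

    def multiply(self):
        return self.weight * self.data

    def receive(self, value):
        self.data = value


first = ArithmeticElement(weight=2)    # first weight data (arbitrary value)
second = ArithmeticElement(weight=5)
first.data, second.data = 10, 27       # first data and second data (arbitrary values)

product_i = first.multiply()           # S901: i-th convolution operation
first.receive(second.data)             # S902: the second element forwards its datum
product_next = first.multiply()        # S903: (i+1)-th operation with the same weight
print(product_i, product_next)
```

The first arithmetic element keeps its weight across both operations and obtains its new datum from its neighbour rather than from a register group.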

Optionally, the computing device in Fig. 9 further includes multiple register groups for storing the multiple data, and the calculation method in Fig. 9 further includes: controlling the multiple arithmetic elements to obtain, from the multiple register groups of the computing device, the multiple data required for the multiple convolution operations.

Optionally, in the method of Fig. 9, each arithmetic element of the multiple arithmetic elements is configured to obtain the data stored in any one of the multiple register groups.

Optionally, in the method of Fig. 9, during the i-th convolution operation, the data that the multiple arithmetic elements read from the multiple register groups are located in different register groups of the multiple register groups.

Optionally, in the method of Fig. 9, the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data in the K*K register groups satisfies the following condition: the data in row x, column s of the multiple data is stored in the r-th register group, and the data in row x+a, column s of the multiple data is stored in the ([r+(a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ r ≥ 1, and M, N, x, s, r and a are positive integers.

The embodiments of this application provide a computing device that supports sliding the data to be convolved between arithmetic elements: multiple arithmetic elements can read data to be convolved from neighbouring arithmetic elements, which avoids reading a large amount of repeated data from the multiple register groups, reduces the amount of data exchanged between the arithmetic elements and the register groups, reduces data access conflicts, and improves the computing performance of the architecture. Further, the multiple arithmetic elements and the multiple register groups can be connected through a fully connected switching network, so that any arithmetic element can access any register group. The register groups therefore only need to read in and store the data to be convolved once, rather than repeatedly reading from memory and storing the large amount of data that is repeated across multiple convolution operations, which saves storage space in the register groups, relieves the pressure on the access bandwidth between the register groups and the memory, and improves the parallelism of data access. In addition, a flexible data arrangement can be used in the multiple register groups, so that as much data as possible can be read within one clock cycle, data read conflicts are avoided, and a non-blocking convolution process is achieved.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of this application can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

1. A computing device, characterized by comprising:

multiple arithmetic elements, configured to perform multiple convolution operations on multiple data, the multiple arithmetic elements including a first arithmetic element and a second arithmetic element;

the first arithmetic element being configured to:

in the i-th convolution operation of the multiple convolution operations, perform a multiplication on first weight data and first data of the multiple data, where i is an integer greater than 0;

receive second data sent by the second arithmetic element, where the second data are the data on which the second arithmetic element performs the convolution operation in the i-th convolution operation;

in the (i+1)-th convolution operation of the multiple convolution operations, perform a multiplication on the first weight data and the second data.

2. The device according to claim 1, characterized by further comprising:

multiple register groups, configured to store the multiple data;

the multiple arithmetic elements being further configured to obtain, from the multiple register groups, the multiple data required for the multiple convolution operations, where each arithmetic element of the multiple arithmetic elements can obtain the data stored in any one of the multiple register groups.

3. The device according to claim 2, characterized in that

the first arithmetic element is specifically configured to obtain, in the i-th convolution operation, the first data from a first register group of the multiple register groups;

the second arithmetic element is specifically configured to obtain, in the i-th convolution operation, the second data from a second register group of the multiple register groups, where the first register group and the second register group are different register groups.

4. The device according to claim 2 or 3, characterized in that

during the i-th convolution operation, the data that the multiple arithmetic elements read from the multiple register groups are located in different register groups of the multiple register groups.

5. The device according to claim 4, characterized in that the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data in the K*K register groups satisfies the following condition:

the data in row x, column s of the multiple data is stored in the b-th register group, and the data in row x+a, column s of the multiple data is stored in the ([b+(a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ b ≥ 1, and M, N, x, s, b and a are positive integers.

6. A calculation method, characterized in that the computing device to which the calculation method is applied includes multiple arithmetic elements, the multiple arithmetic elements are configured to perform multiple convolution operations on multiple data, and the multiple arithmetic elements include a first arithmetic element and a second arithmetic element;

the method includes:

in the i-th convolution operation of the multiple convolution operations, the first arithmetic element performs a multiplication on first weight data and first data of the multiple data, where i is an integer greater than 0;

the first arithmetic element receives second data sent by the second arithmetic element, where the second data are the data on which the second arithmetic element performs the convolution operation in the i-th convolution operation;

in the (i+1)-th convolution operation of the multiple convolution operations, the first arithmetic element performs a multiplication on the first weight data and the second data.

7. The calculation method according to claim 6, characterized in that the computing device further includes multiple register groups for storing the multiple data, where each arithmetic element of the multiple arithmetic elements can obtain the data stored in any one of the multiple register groups.

8. The calculation method according to claim 7, characterized in that the method further includes:

in the i-th convolution operation, the first arithmetic element obtains the first data from a first register group of the multiple register groups;

in the i-th convolution operation, the second arithmetic element obtains the second data from a second register group of the multiple register groups, where the first register group and the second register group are different register groups.

9. The calculation method according to claim 7 or 8, characterized in that, during the i-th convolution operation, the data that the multiple arithmetic elements read from the multiple register groups are located in different register groups of the multiple register groups.

10. The calculation method according to claim 9, characterized in that the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data in the K*K register groups satisfies the following condition:

the data in row x, column s of the multiple data is stored in the r-th register group, and the data in row x+a, column s of the multiple data is stored in the ([r+(a*K)] mod (K*K))-th register group, where K is an integer not less than 1, mod denotes the remainder operation, M ≥ x ≥ 1, N ≥ s ≥ 1, K*K ≥ r ≥ 1, and M, N, x, s, r and a are positive integers.

11. A computing device, characterized by comprising multiple arithmetic elements and multiple register groups, where each arithmetic element is connected to each of the multiple register groups;

the multiple register groups are configured to store multiple data on which multiple convolution operations are to be performed;

the multiple arithmetic elements are configured to obtain the multiple data from the multiple register groups and to perform the multiple convolution operations on the multiple data, where each arithmetic element of the multiple arithmetic elements is configured to obtain the data stored in any one of the multiple register groups.

12. The device according to claim 11, characterized in that, during the i-th convolution operation of the multiple convolution operations, the data that the multiple arithmetic elements read from the multiple register groups are located in different register groups of the multiple register groups, where i is an integer greater than 0.

13. The device according to claim 11 or 12, characterized in that the convolution kernel corresponding to the convolution operation has size K*K, the multiple register groups include K*K register groups, the multiple data include M rows and N columns of data, and the distribution of the multiple data in the K*K register groups satisfies the following condition:

14. The device according to any one of claims 11 to 13, characterized in that the multiple arithmetic elements include a first arithmetic element and a second arithmetic element,

the first arithmetic element being configured to:

in the i-th convolution operation of the multiple convolution operations, perform a multiplication on first weight data and first data of the multiple data, where i is an integer greater than 0;

in the (i+1)-th convolution operation of the multiple convolution operations, perform a multiplication on the first weight data and the second data.