WO2019084788A1 - Computation apparatus, circuit and relevant method for neural network - Google Patents


Info

Publication number
WO2019084788A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
input feature
window
processing
column
Prior art date
Application number
PCT/CN2017/108640
Other languages
French (fr)
Chinese (zh)
Inventor
谷骞
高明明
李涛
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to CN201780013527.XA priority Critical patent/CN108780524A/en
Priority to PCT/CN2017/108640 priority patent/WO2019084788A1/en
Publication of WO2019084788A1 publication Critical patent/WO2019084788A1/en
Priority to US16/727,677 priority patent/US20200134435A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of neural networks and, more particularly, to an arithmetic device, circuit and related method for a neural network.
  • the convolutional neural network is formed by stacking multiple layers.
  • the output of one layer consists of output feature maps (OFMs), which serve as the input feature maps of the next layer.
  • the output feature maps of the intermediate layers have many channels, and the images are relatively large.
  • an output feature map is therefore usually divided into multiple feature map segments; the segments are output sequentially, and each segment is output in parallel by columns. For example, a complete output feature map may be divided into three feature picture segments, each of which is output column by column in turn.
  • a line buffer is usually used to implement data input of a convolution layer operation or a pooling layer operation.
  • the structure of the line buffer requires the input data to arrive in raster order, row by row (or column by column). Taking the height of the pooling window as k and the width of the input feature matrix as W, the line buffer needs a depth of k*W; that is, it must buffer k*W input data before any operation on the data can begin, which increases the latency of data processing.
  • the existing image processing scheme therefore requires a large buffer space and incurs a large data processing delay.
  • the present application provides an arithmetic device, a circuit, and a related method for a neural network, which can effectively save buffer space and reduce the delay of data processing.
  • a first aspect provides an arithmetic device for a neural network, comprising: a first processing unit, configured to perform a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and a second processing unit, configured to perform a second arithmetic operation on k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
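The two-stage computation named in the first aspect can be illustrated with a minimal sketch; the pooling window, the max operation, and the data values below are assumptions for illustration, not the patented hardware:

```python
k1, k2 = 2, 2
window = [[7, 1], [3, 9]]  # one k1 x k2 calculation window (hypothetical data)
# first arithmetic operation: reduce the k1 values of each window column
intermediates = [max(window[r][c] for r in range(k1)) for c in range(k2)]
# second arithmetic operation: reduce the k2 intermediate results
result = max(intermediates)
assert intermediates == [7, 9] and result == 9
```

Because the first operation consumes one column at a time, the second stage can run as soon as k2 column results exist, rather than after a full k1 × W buffer fills.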
  • the computing device includes M first processing units and M second processing units, the M first processing units corresponding one-to-one to the M second processing units, M being a positive integer greater than 1; the computing device further includes a pre-processing unit, configured to receive the input feature matrix by column and to process the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values; the pre-processing unit is further configured to input the M sets of data one-to-one into the M first processing units.
  • a second aspect provides a circuit for processing a neural network, comprising: a first processing circuit, configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window, to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and a second processing circuit, configured to perform a second arithmetic operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • the circuit includes M first processing circuits and M second processing circuits, the M first processing circuits corresponding one-to-one to the M second processing circuits, M being a positive integer greater than 1; the circuit further includes a pre-processing circuit for receiving the input feature matrix by column and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values; the pre-processing circuit is further configured to input the M sets of data one-to-one into the M first processing circuits.
  • a third aspect provides a method for processing a neural network, the method comprising: performing a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and performing a second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • the method further includes: receiving the input feature matrix by column, and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values.
  • performing the first arithmetic operation on the k1 input feature data according to the size of the calculation window to obtain the intermediate result then includes: performing the first arithmetic operation on each of the M sets of data according to the size of the calculation window to obtain a corresponding intermediate result; performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain the calculation result includes: for every k2 intermediate results obtained by the first arithmetic operation corresponding to each of the M sets of data, performing the second arithmetic operation to obtain a corresponding calculation result.
  • a fourth aspect provides a computer readable storage medium having stored thereon a computer program which, when executed by a computer, implements: performing a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and performing a second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window, to obtain a calculation result.
  • when the computer program is executed by the computer, it further implements: receiving the input feature matrix by column, and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values.
  • performing the first arithmetic operation on the k1 input feature data according to the size of the calculation window to obtain the intermediate result includes: performing the first arithmetic operation on each of the M sets of data according to the size of the calculation window to obtain a corresponding intermediate result; performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain the calculation result includes: for every k2 intermediate results obtained by the first arithmetic operation corresponding to each of the M sets of data, performing the second arithmetic operation to obtain a corresponding calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • Figure 1 is a schematic diagram of a neural network convolutional layer operation.
  • FIG. 2 is a schematic diagram of a neural network pooling layer operation.
  • FIG. 3 is a schematic block diagram of an operation apparatus for a neural network according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of an operation apparatus for a neural network according to another embodiment of the present application.
  • FIG. 5 is a schematic block diagram of an operation device for a neural network according to still another embodiment of the present application.
  • FIG. 6 is a schematic diagram of a circuit for processing a neural network according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a circuit for a neural network according to another embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for processing a neural network according to an embodiment of the present application.
  • the following describes the convolution layer operation and the pooling layer operation in the convolutional neural network.
  • the operation of the convolution layer operation is to slide a fixed-size window across the entire image plane, and multiply and accumulate the data covered in the window at each moment.
  • the window slides in steps of 1.
  • Figure 1 is a schematic diagram of a convolution layer operation.
  • in FIG. 1, the input image has a height H1 of 3 and a width W1 of 4; the convolution window has a height k1 of 2 and a width k2 of 2.
  • the convolution layer operation slides a 2 × 2 convolution window over the 3 × 4 image with a step size of 1; the 4 data covered by the convolution window at each position are multiplied and accumulated to obtain one output result, and all the output results constitute the output image.
  • the output image has a height H2 of 2 and a width W2 of 3.
  • the operation mode of the operator op is multiply-accumulate.
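The FIG. 1 example can be reproduced with a short sketch; the `conv2d` helper and the input values are hypothetical, chosen only to match the sizes above:

```python
import numpy as np

def conv2d(image, kernel, step=1):
    """Slide a k1 x k2 window over the image, multiply-accumulating the
    covered data at each position (the plain convolution layer operation)."""
    k1, k2 = kernel.shape
    h2 = (image.shape[0] - k1) // step + 1   # output height H2
    w2 = (image.shape[1] - k2) // step + 1   # output width W2
    out = np.empty((h2, w2))
    for i in range(h2):
        for j in range(w2):
            window = image[i * step:i * step + k1, j * step:j * step + k2]
            out[i, j] = np.sum(window * kernel)   # multiply-accumulate
    return out

image = np.arange(12, dtype=float).reshape(3, 4)   # H1 = 3, W1 = 4
kernel = np.ones((2, 2))                           # k1 = 2, k2 = 2
result = conv2d(image, kernel)
assert result.shape == (2, 3)                      # H2 = 2, W2 = 3, as in FIG. 1
```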
  • the pooling layer operation slides a fixed-size window across the entire image plane and reduces the data covered by the window at each moment to a maximum value or an average value as the output.
  • the step size of the window slide is equal to the height (or width) of the window.
  • FIG. 2 is a schematic diagram of the pooling layer operation.
  • in FIG. 2, the input image has a height H1 of 6 and a width W1 of 8; the pooling window has a height k1 of 2 and a width k2 of 2.
  • the pooling layer operation slides a 2 × 2 pooling window over the 6 × 8 image with a step size of 2; the 4 data covered by the window at each position yield one output result, and all the output results constitute the output image.
  • the output image has a height H2 of 3 and a width W2 of 4.
  • the operation mode of the operator op is to take the maximum value (max) or the average value (avg).
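The output sizes in both figures follow from the usual sliding-window formula; the `out_size` helper below is a hypothetical illustration:

```python
def out_size(in_size, window, step):
    """Number of window positions along one dimension."""
    return (in_size - window) // step + 1

# FIG. 1: 3x4 image, 2x2 convolution window, step 1 -> 2x3 output
assert (out_size(3, 2, 1), out_size(4, 2, 1)) == (2, 3)
# FIG. 2: 6x8 image, 2x2 pooling window, step 2 -> 3x4 output
assert (out_size(6, 2, 2), out_size(8, 2, 2)) == (3, 4)
```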
  • in the present application, the process of "first taking a window and then performing the calculation" is decomposed into a column operation and a row operation.
  • in one case, the process of "first taking a window and then performing the calculation" is decomposed into a column operation performed first, followed by a row operation.
  • specifically, the data of the same column in the window is calculated first to obtain an intermediate result; then the intermediate results of all the columns in the window are calculated to obtain a calculation result.
  • in the other case, the process of "first taking a window and then performing the calculation" is decomposed into a row operation performed first, followed by a column operation.
  • specifically, the data of the same row in the window is calculated first to obtain an intermediate result; then the intermediate results of all the rows in the window are calculated to obtain a calculation result.
  • the data cache mode can be flexibly set according to the input mode of the input data. For example, if the input data is input by column, it is cached by column, and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row, and the cached data undergoes the row operation first and then the column operation.
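As a software illustration (not the patented circuit) that this decomposition is exact for a convolution: each window column is multiply-accumulated against the matching kernel column first, and the k2 per-column intermediate results are then accumulated across columns. The `conv2d_decomposed` helper and its data are assumptions for the sketch:

```python
import numpy as np

def conv2d_decomposed(image, kernel):
    """Stride-1 convolution computed as column operations then a row operation."""
    k1, k2 = kernel.shape
    h1, w1 = image.shape
    # column operation: MAC k1 vertical neighbours of every image column
    # against each kernel column, giving intermediate results
    inter = np.empty((h1 - k1 + 1, w1, k2))
    for j in range(w1):
        for i in range(h1 - k1 + 1):
            for c in range(k2):
                inter[i, j, c] = np.dot(image[i:i + k1, j], kernel[:, c])
    # row operation: accumulate k2 intermediate results across columns
    out = np.empty((h1 - k1 + 1, w1 - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = sum(inter[i, j + c, c] for c in range(k2))
    return out

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1., 2.], [3., 4.]])
# the decomposition must agree with the direct windowed multiply-accumulate
direct = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel)
                    for j in range(3)] for i in range(2)])
assert np.allclose(conv2d_decomposed(image, kernel), direct)
```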
  • FIG. 3 is a schematic block diagram of an arithmetic device 300 for a neural network provided by the present application.
  • the computing device 300 includes:
  • the first processing unit 310 is configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers.
  • the second processing unit 320 is configured to perform a second arithmetic operation on the k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  • in one case, the first processing unit 310 is configured to perform the first arithmetic operation on k1 input feature data in a column of input feature values in the input feature matrix, where k1 represents the height of the calculation window and k2 represents its width; the second processing unit 320 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing unit, which is equivalent to performing the second arithmetic operation on the intermediate results of different columns, to obtain a calculation result.
  • in this case, the first processing unit 310 may be referred to as a column processing unit and the first arithmetic operation as a column operation; the second processing unit 320 may be referred to as a row processing unit and the second arithmetic operation as a row operation.
  • in the other case, the first processing unit 310 is configured to perform the first arithmetic operation on k1 input feature data in a row of input feature values in the input feature matrix, where k1 represents the width of the calculation window and k2 represents its height; the second processing unit 320 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing unit, which is equivalent to performing the second arithmetic operation on the intermediate results of different rows, to obtain a calculation result.
  • in this case, the first processing unit 310 may be referred to as a row processing unit and the first arithmetic operation as a row operation; the second processing unit 320 may be referred to as a column processing unit and the second arithmetic operation as a column operation.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; the input feature matrix can be buffered by row or by column while the operations are performed simultaneously, without first caching a certain amount of two-dimensional input data as in the prior art before the calculation can be performed. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • when the calculation window is a convolution window, the operation mode of the first arithmetic operation is multiply-accumulate, and the operation mode of the second arithmetic operation is accumulation.
  • this embodiment can improve the convolution operation efficiency of the neural network.
  • when the calculation window is a pooling window, the operation mode of the first arithmetic operation is taking a maximum value or an average value, and the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • this embodiment can improve the pooling operation efficiency of the neural network.
  • the computing device includes M first processing units 310 and M second processing units 320, the M first processing units 310 corresponding one-to-one to the M second processing units 320, M being a positive integer greater than 1.
  • the computing device 300 further includes:
  • the pre-processing unit 330, configured to receive the input feature matrix by column and to process the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values;
  • the pre-processing unit is further configured to input the M sets of data one-to-one into the M first processing units.
  • the pre-processing unit 330 receives the first column of input feature values in the input feature matrix, processes it into M sets of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 output M intermediate results, which are input one-to-one into the M second processing units 320.
  • the pre-processing unit 330 then receives the second column of input feature values in the input feature matrix, processes it into M sets of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 again output M intermediate results, which are input one-to-one into the M second processing units 320.
  • when the pre-processing unit 330 receives the k2-th column of input feature values, it likewise processes it into M sets of data, inputs them into the M first processing units 310 for column processing, and the resulting M intermediate results are input one-to-one into the M second processing units 320. At this point, each of the M second processing units 320 has received k2 intermediate results, so each second processing unit 320 performs a row operation on its k2 intermediate results to obtain an operation result; that is, the M second processing units 320 obtain M operation results. Subsequently, the pre-processing unit 330 may continue to receive column input feature values and repeat the above process to obtain the next M operation results, which is not described here again.
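The dataflow just described can be sketched in software; max pooling with a k1 × k2 window, matching strides, and the sizes of FIG. 2 are assumed for illustration:

```python
import numpy as np

H, W, k1, k2 = 6, 8, 2, 2
M = H // k1                              # one unit pair per vertical window
matrix = np.arange(H * W, dtype=float).reshape(H, W)

pending = [[] for _ in range(M)]         # intermediates held by each second unit
output_columns = []
for column in matrix.T:                  # the input feature matrix arrives by column
    groups = column.reshape(M, k1)       # pre-processing: M groups of k1 values
    for m in range(M):
        pending[m].append(groups[m].max())          # first unit m: column operation
    if len(pending[0]) == k2:            # after k2 columns the second units fire
        output_columns.append([max(p) for p in pending])  # row operation
        pending = [[] for _ in range(M)]

result = np.array(output_columns).T      # shape (M, W // k2)
assert np.array_equal(result, matrix.reshape(M, k1, W // k2, k2).max(axis=(1, 3)))
```

Each incoming column immediately produces M intermediate results, and a full column of M outputs appears every k2 input columns, without a k1 × W line buffer.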
  • as noted above, an output feature map is generally divided into a plurality of feature picture segments; the feature picture segments are output sequentially, and each feature picture segment is output in parallel by columns.
  • for example, a complete output feature map is divided into three feature picture segments, and each feature picture segment is output column by column in turn.
  • in contrast, in the prior art the line buffer requires input by rows, while the data of the feature picture segments is input by columns; the data corresponding to a feature picture segment is input in parallel, while the line buffer processes data serially. The input and output rates are therefore mismatched and the data throughput is too low, which becomes the bottleneck of the accelerator and reduces its speed.
  • in the present application, the pre-processing unit 330 receives a feature picture segment by columns, the M first processing units perform column operations on the column input feature values of the feature picture segment, and the M second processing units perform row operations on the intermediate results output by the M first processing units, thereby obtaining the calculation result of the feature picture segment, that is, the neural network processing result of the feature picture segment.
  • the data cache mode can be flexibly set according to the input mode of the input data: if the input data is input by column, it is cached by column and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row and the cached data undergoes the row operation first and then the column operation, thereby improving the data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the number M of first processing units 310 and second processing units 320 included in the computing device 300 is determined by the size of the input feature matrix and the size of the calculation window.
  • take the case where the first processing unit 310 performs column processing and the second processing unit 320 performs row processing as an example: for a convolution operation, if the height of the input feature matrix is H (H being an integer greater than or equal to k1) and the convolution window has a height of k1 and a width of k2, then M = H - (k1 - 1).
  • in this case, the M sets of data include all the data in the column of input feature values; that is, the computing device 300 provided by the present application can implement parallel processing of an entire column of input feature values.
  • again take the case where the first processing unit 310 performs column processing and the second processing unit 320 performs row processing as an example: for a pooling operation, if the height of the input feature matrix is H (H being an integer greater than or equal to k1) and the pooling window has a height of k1 and a width of k2, then M = H / k1 when H is evenly divisible by k1.
  • in this case, the M sets of data include all the data in the column of input feature values; that is, the arithmetic device 300 provided by the present application can implement parallel processing of an entire column of input feature values.
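A quick check of the two unit counts, using the sizes of FIG. 1 and FIG. 2 (the helper names are hypothetical):

```python
def m_convolution(H, k1):
    """Stride-1 convolution: the vertical windows overlap."""
    return H - (k1 - 1)

def m_pooling(H, k1):
    """Pooling with stride k1: the vertical windows tile; any remainder
    rows are held in the pre-processing unit's buffer."""
    return H // k1

assert m_convolution(3, 2) == 2   # the FIG. 1 image: H1 = 3, k1 = 2
assert m_pooling(6, 2) == 3       # the FIG. 2 image: H1 = 6, k1 = 2
```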
  • in other embodiments, the M sets of data are only part of the column of input feature values; the pre-processing unit 330 further includes a buffer, and the pre-processing unit 330 is further configured to store the remaining data other than the M sets of data in the buffer.
  • for example, the data of the last few rows of the input feature matrix needs to be cached in the buffer first and processed separately later.
  • as described above, each feature picture segment is output in parallel by columns.
  • suppose an output feature map is divided into two feature picture segments.
  • if the height of the first feature picture segment is not evenly divisible by the height k1 of the pooling window, the last few rows of the first feature picture segment are first cached in the buffer; when the input of the second feature picture segment becomes valid, the cached data is read from the buffer and spliced with the current data (the data of the second feature picture segment) into a new feature picture segment, which is then remapped into the M first processing units 310 for processing.
  • FIG. 5 is a schematic block diagram of an arithmetic device 500 for a neural network provided by the present application.
  • the computing device 500 includes a pre-processing unit 510, M column processing units 520, and M row processing units 530, the M column processing units 520 being in one-to-one correspondence with the M row processing units 530.
  • the pre-processing unit 510 is configured to receive input data and pre-process the input data according to the calculation window to obtain M sets of data, each set of data including k1 input feature values, and to input the M sets of data one-to-one into the M column processing units; the calculation window has a height of k1 and a width of k2.
  • receiving the input data specifically includes: the pre-processing unit 510 receives the input feature matrix by columns.
  • the column processing unit 520 is configured to perform a column operation on the input k1 input feature values to obtain an intermediate result, and to input the intermediate result into the corresponding row processing unit 530.
  • for a pooling operation, the column operation refers to taking the maximum value or the average value.
  • for a convolution operation, the column operation refers to a multiply-accumulate operation.
  • the row processing unit 530 is configured to buffer the intermediate results output by the corresponding column processing unit 520, and, once k2 intermediate results have been received, to perform a row operation on them to obtain a calculation result.
  • for a pooling operation, the operation mode of the row operation is the same as that of the column operation.
  • for a convolution operation, the row operation refers to an accumulation operation.
  • the calculation results of the M line processing units 530 constitute the output data of the arithmetic unit 500.
  • the input data received by the pre-processing unit 510 is a feature picture segment obtained by dividing the input feature map to be processed.
  • the number M of column processing units 520 and row processing units 530 is determined by the size of the input feature matrix received by the pre-processing unit 510 and the size of the computation window.
  • the input feature matrix is a feature picture segment.
  • the pre-processing unit 510 is configured to sequentially receive the plurality of feature picture segments.
  • the sliding window (that is, the calculation window) may cover part of the data of the upper and lower feature picture segments at the same time.
  • in this case, the pre-processing unit 510 is configured to cache the last few rows of data of the previous feature picture segment covered by the window in the buffer of the pre-processing unit 510 (as shown in FIG. 5), and to wait for the next feature picture segment.
  • when the next feature picture segment arrives, the cached data is read out from the buffer and spliced with the current data (that is, the currently input feature picture segment) into a new feature picture segment, and the new data is then remapped into the M column processing units 520.
  • in this way, the cache space can be effectively saved, thereby saving hardware resources.
  • for example, as shown in FIG. 5, each column processing unit 520 can process 2 rows of data of the same column, and each row processing unit 530 can process 2 columns of data of the same row; in this case only three column processing units 520 and three row processing units 530 need to be provided in the arithmetic device.
  • suppose the input feature map is divided into two feature picture segments, segment1 and segment2, each of height h = 14, and the pooling window size is 3 × 3 with a step size of 2.
  • when processing segment1, the pre-processing unit 510 needs to cache the last two rows of segment1 into the buffer; after receiving segment2, it splices these rows with segment2 into a new feature picture segment of height 16, which is then remapped into the column processing units 520.
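The splicing step of this example can be sketched as follows; the `split_leftover` helper and the segment width are hypothetical illustrations of which rows are carried over:

```python
import numpy as np

K1, STEP = 3, 2                     # pooling window height and step, as in the example

def split_leftover(segment):
    """Rows of a segment not yet consumed by the sliding window; they are
    spliced onto the start of the next segment."""
    h = segment.shape[0]
    next_start = ((h - K1) // STEP + 1) * STEP   # first window needing more rows
    return segment[next_start:]

segment1 = np.zeros((14, 8))        # h = 14 (the width 8 is an arbitrary assumption)
segment2 = np.ones((14, 8))

carry = split_leftover(segment1)         # the last two rows of segment1
spliced = np.vstack([carry, segment2])   # new feature picture segment of height 16
assert carry.shape[0] == 2 and spliced.shape[0] == 16
```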
  • in summary, the present application decomposes the window operations of the neural network into column operations and row operations, so that the calculation can be started as soon as one row or one column of input data is received, without having to cache a certain amount of two-dimensional input data as in the prior art before the calculation can be performed; the delay of data processing can thus be effectively reduced, so that real-time data processing can be realized.
  • the data cache mode can be flexibly set according to the input mode of the input data. For example, if the input data is input by column, it is cached by column, and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row, and the cached data undergoes the row operation first and then the column operation.
  • the computing device provided by the present application requires less buffer space than the prior art, thereby saving hardware overhead.
  • the computing device provided by some embodiments can implement parallel processing of multiple computing windows, thereby improving the data throughput rate and overcoming the bottleneck of the neural network accelerator.
  • the embodiment of the present application further provides a circuit 600 for processing a neural network.
  • the circuit 600 may correspond to the arithmetic device 300 or 500 provided by the above embodiment.
  • the circuit 600 includes:
  • the first processing circuit 610 is configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window, to obtain an intermediate result, where the size of the calculation window is k1×k2, and both k1 and k2 are positive integers;
  • the second processing circuit 620 is configured to perform a second arithmetic operation on the k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  • the first processing circuit 610 is configured to perform the first arithmetic operation on k1 input feature data of a column of input feature values in the input feature matrix, where k1 represents the height of the calculation window and k2 represents the width of the calculation window.
  • the second processing circuit 620 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing circuit 610, which is equivalent to performing the second arithmetic operation on intermediate results from different columns, to obtain a calculation result.
  • the first processing circuit 610 may be referred to as a column processing circuit, and correspondingly, the first arithmetic operation is referred to as a column operation;
  • the second processing circuit 620 may be referred to as a row processing circuit, and correspondingly, the second arithmetic operation is referred to as a row operation.
  • the first processing circuit 610 is configured to perform the first arithmetic operation on k1 input feature data of a row of input feature values in the input feature matrix, where k1 represents the width of the calculation window and k2 represents the height of the calculation window.
  • the second processing circuit 620 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing circuit, which is equivalent to performing the second arithmetic operation on intermediate results from different rows, to obtain a calculation result.
  • the first processing circuit 610 may be referred to as a row processing circuit, and correspondingly, the first operation operation is referred to as a row operation;
  • the second processing circuit 620 may be referred to as a column processing circuit, and correspondingly, the second arithmetic operation is referred to as a column operation.
  • the calculation can be started as long as one row or one column of input data is received; in other words, the input feature matrix can be cached by rows or by columns while operations are performed at the same time, without first caching a block of two-dimensional input data. This effectively reduces the latency of data processing, improves the data processing efficiency of the neural network, and also saves storage resources, thereby saving hardware resources.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
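For the convolution case, a hedged sketch (stride 1, no padding, correlation-style indexing; the helper names are illustrative and not from the patent): the first operation multiply-accumulates one kernel column against a k1-tall column slice, and the second operation accumulates the k2 per-column intermediates.

```python
def conv_decomposed(matrix, kernel):
    """k1 x k2 convolution (stride 1, no padding) decomposed into a
    column multiply-accumulate followed by a row accumulation."""
    k1, k2 = len(kernel), len(kernel[0])
    h, w = len(matrix), len(matrix[0])

    def col_mac(i, j, kcol):
        # First operation: MAC of one kernel column with a column slice.
        return sum(matrix[i + d][j] * kernel[d][kcol] for d in range(k1))

    # Second operation: accumulate one intermediate per kernel column.
    return [[sum(col_mac(i, j + kcol, kcol) for kcol in range(k2))
             for j in range(w - k2 + 1)]
            for i in range(h - k1 + 1)]

print(conv_decomposed([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
# [[5]]
```

For pooling, the same structure applies with both operations replaced by max (or mean), matching the bullet above that the second operation uses the same operator as the first.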
  • the circuit 600 includes M first processing circuits 610 and M second processing circuits 620, the M first processing circuits 610 corresponding one-to-one with the M second processing circuits 620, and M is a positive integer greater than 1.
  • the circuit 600 further includes: a pre-processing circuit 630, configured to receive the input feature matrix by columns and process the received column of input feature values according to the calculation window to obtain M groups of data, wherein each group of data includes k1 input feature values; the pre-processing circuit 630 is further configured to input the M groups of data one-to-one into the M first processing circuits 610.
  • the pre-processing circuit 630 receives the first column of input feature values in the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results and input them one-to-one into the M second processing circuits 620.
  • the pre-processing circuit 630 receives the second column of input feature values in the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620.
  • when the pre-processing circuit 630 receives the k2-th column of input feature values, it processes it into M groups of data and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620. At this point, each of the M second processing circuits 620 has received k2 intermediate results, so each second processing circuit 620 performs a row operation on its k2 received intermediate results to obtain an operation result; that is, the M second processing circuits 620 obtain M operation results. Subsequently, the pre-processing circuit 630 can continue to receive column input feature values and repeat the above process to obtain the next M operation results, which is not described again here.
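The column-by-column pipeline just described can be modeled in a few lines of Python (a behavioral sketch only, with stride 1 assumed and max pooling as the example operation; the M circuits run in parallel in hardware but are looped over here, and all names are invented):

```python
from collections import deque

def stream_columns(columns, k1, k2, op=max):
    """Behavioral model of M first circuits feeding M second circuits.
    Each arriving column yields M = H - k1 + 1 intermediates (column
    operation); once k2 columns have arrived, M results are produced
    per column (row operation), as in the flow described above."""
    H = len(columns[0])
    M = H - k1 + 1
    history = [deque(maxlen=k2) for _ in range(M)]  # per-circuit window
    results = []
    for col in columns:
        for m in range(M):                          # M first circuits
            history[m].append(op(col[m:m + k1]))
        if len(history[0]) == k2:                   # M second circuits
            results.append([op(h) for h in history])
    return results

# Columns of a 3x3 input, 2x2 max-pooling window:
print(stream_columns([[1, 4, 7], [2, 5, 8], [3, 6, 9]], 2, 2))
# [[5, 8], [6, 9]]
```

The first output batch appears as soon as the k2-th column arrives, which is the latency advantage claimed over a full line buffer.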
  • the pre-processing circuit 630 receives a feature map segment by columns; the M first processing circuits perform column operations on the column input feature values of the feature map segment, and the M second processing circuits perform row operations on the intermediate results output by the M first processing circuits, thereby obtaining the calculation results of the feature map segment, that is, the neural network processing results of the feature map segment.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation, thereby improving data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the number M of the first processing circuit 610 and the second processing circuit 620 included in the computing device 300 is determined according to the size of the input feature matrix and the size of the calculation window.
  • take as an example the case where the first processing circuit 610 performs column processing and the second processing circuit 620 performs row processing.
  • H is an integer greater than or equal to k1
  • the height of the convolution window is k1
  • the width is k2
  • M = H - (k1 - 1).
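A quick worked check of M = H - (k1 - 1) (the helper name is hypothetical): it counts the vertical positions a window of height k1 can occupy in a column of height H at stride 1.

```python
def num_window_positions(H, k1):
    """M = H - (k1 - 1): vertical window positions at stride 1."""
    assert H >= k1
    return H - (k1 - 1)

print(num_window_positions(16, 3))  # 14
```

For example, a spliced segment of height 16 with a window height of 3 needs 14 parallel column processing circuits to cover a whole column at once.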
  • the M group data includes all the data in the column input feature values, that is, the computing device 300 provided by the present application can implement parallel processing of a column of input feature values.
  • take as an example the case where the first processing circuit 610 performs column processing and the second processing circuit 620 performs row processing.
  • H is an integer greater than or equal to k1
  • the height of the convolution window is k1
  • the width is k2
  • M = H / k1.
  • the M group data includes all data in the column input feature values, that is, the arithmetic device 300 provided by the present application can implement parallel processing of a column of input feature values.
  • the M sets of data include all of the data in the column of input feature values.
  • the M groups of data are part of the column of input feature values; the pre-processing circuit 630 further includes a buffer; the pre-processing circuit 630 is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  • the M group data is part of the input feature values of the column.
  • the data of the last few rows of the input feature matrix needs to be first buffered in the buffer, and then processed separately.
  • each feature map segment is output in parallel by columns.
  • an output feature map is divided into two feature map segments.
  • the height of the first feature map segment is not divisible by the height k1 of the pooling window, so the last few rows of the first feature map segment are first buffered in the buffer; when the input of the second feature map segment becomes valid, the buffered data is read from the buffer and spliced with the current data (the data of the second feature map segment) into a new feature map segment, which is re-mapped into the M first processing circuits 610 for processing.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation, thereby improving data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the input feature matrix represents a feature map segment of the image to be processed; the pre-processing circuit 630 is specifically configured to sequentially receive the feature map segments of the image to be processed.
  • the circuit 600 further includes a communication interface for receiving image data to be processed, and for outputting a calculation result of the second processing circuit, that is, outputting image data.
  • in the technical solution provided by the present application, by decomposing the window operations of the neural network into column operations and row operations, computation can start as soon as one row or one column of input data is received, without first caching a block of two-dimensional input data as in the prior art; therefore, the latency of data processing can be effectively reduced, so that real-time data processing can be realized.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation.
  • the computing device provided by the present application requires less buffer space than the prior art, thereby saving hardware overhead.
  • the computing device provided by some embodiments can implement parallel processing of multiple computing windows, thereby improving the data throughput rate and overcoming the bottleneck of the neural network accelerator.
  • the embodiment of the present application further provides a method 800 for processing a neural network.
  • the method 800 can be performed by the operation device provided in the foregoing embodiment, and the description of the technical solutions and the technical effects in the foregoing embodiments can be applied to the embodiment. For brevity, the description is not repeated herein.
  • the method 800 includes the following steps.
  • the size of the calculation window is k1×k2, and both k1 and k2 are positive integers.
  • step 810 can be performed by the first processing unit 310 in the above embodiment.
  • step 820 can be performed by the second processing unit 320 in the above embodiment.
  • the calculation can be started as long as one row or one column of input data is received; in other words, the input feature matrix can be cached by rows or by columns while operations are performed at the same time, without first caching a block of two-dimensional input data. This effectively reduces the latency of data processing, improves the data processing efficiency of the neural network, and also saves storage resources, thereby saving hardware resources.
  • the method 800 further includes: receiving an input feature matrix by columns, and processing the received column input feature values according to the calculation window to obtain M sets of data, wherein each set of data includes k1 Input feature values.
  • step 810 specifically includes: performing the first arithmetic operation on each of the M groups of data according to the size of the calculation window, to obtain the corresponding intermediate results.
  • the first arithmetic operations may be performed on the M groups of data by the M first processing units 310 in the foregoing embodiment, to obtain the corresponding intermediate results.
  • step 820 specifically includes: for the first arithmetic operation corresponding to each group of data in the M groups of data, each time k2 intermediate results are obtained, performing the second arithmetic operation to obtain a corresponding calculation result.
  • the second arithmetic operations may be performed on the intermediate results by the M second processing units 320 in the foregoing embodiment, to obtain the corresponding calculation results.
  • the value of M is related to the size of the input feature matrix and the size of the calculation window.
  • the M group data includes all data in the column input feature values.
  • the M groups of data are part of the column of input feature values; the method 800 further includes: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • the input feature matrix represents a feature map segment of the image to be processed; receiving the input feature matrix by columns includes: sequentially receiving each feature map segment of the image to be processed.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, a first arithmetic operation is performed on k1 input feature data according to the size of the calculation window to obtain intermediate results, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and according to the size of the calculation window, a second arithmetic operation is performed on the k2 intermediate results obtained by the first arithmetic operation, to obtain a calculation result.
  • when the computer program is executed by the computer, the following is further implemented: receiving the input feature matrix by columns, and processing the received column of input feature values according to the calculation window to obtain M groups of data, wherein each group of data includes k1 input feature values. Performing the first arithmetic operation on k1 input feature data according to the size of the calculation window to obtain intermediate results includes: performing the first arithmetic operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. Performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain a calculation result includes: for the first arithmetic operation corresponding to each group of data in the M groups of data, each time k2 intermediate results are obtained, performing the second arithmetic operation to obtain a corresponding calculation result.
  • the value of M is related to the size of the input feature matrix and the size of the calculation window.
  • the M group data includes all data in the column input feature values.
  • the M groups of data are part of the column of input feature values; when the computer program is executed by the computer, the following is further implemented: storing the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • the input feature matrix represents a feature map segment of the image to be processed; when the computer program is executed by the computer, receiving the input feature matrix by columns includes: sequentially receiving each feature map segment of the image to be processed.
  • the present application is applicable to a convolutional neural network (CNN) hardware accelerator, applied in the form of an IP core, and may also be applied to other types of neural network accelerators/processors that include a pooling layer.
  • CNN convolutional neural network
  • in the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, the implementation may be in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or another programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions can be from a website site, computer, server or data center Transmission to another website site, computer, server or data center via wired (eg coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg infrared, wireless, microwave, etc.).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)).
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

Provided are a computation apparatus, circuit, and relevant method for a neural network. The computation apparatus comprises: a first processing unit, which is used for performing a first computation operation on k1 pieces of input feature data according to the size of a calculation window, to obtain intermediate results, wherein the size of the calculation window is k1×k2, k1 and k2 being both positive integers; and a second processing unit, which is used for performing, according to the size of the calculation window, a second computation operation on k2 intermediate results output by the first processing unit, to obtain a calculation result. The computation apparatus can effectively save on a cache space, so as to save on hardware resources, and can also reduce the delay of data processing.

Description

用于神经网络的运算装置、电路及相关方法Arithmetic device, circuit and related method for neural network
版权申明Copyright statement
本专利文件披露的内容包含受版权保护的材料。该版权为版权所有人所有。版权所有人不反对任何人复制专利与商标局的官方记录和档案中所存在的该专利文件或者该专利披露。The disclosure of this patent document contains material that is subject to copyright protection. This copyright is the property of the copyright holder. The copyright owner has no objection to the reproduction of the patent document or the patent disclosure in the official records and files of the Patent and Trademark Office.
技术领域Technical field
本申请涉及神经网络领域,并且更为具体地,涉及一种用于神经网络的运算装置、电路及相关方法。The present application relates to the field of neural networks and, more particularly, to an arithmetic device, circuit and related method for a neural network.
背景技术Background technique
卷积神经网络由多个层叠加在一起形成,上一层的结果为输出特征图(output feature maps,OFMs),作为下一层的输入特征图。通常中间层的输出特征图的通道非常多,图像也比较大。卷积神经网络的硬件加速器在处理特征图数据时,由于片上系统缓存大小和带宽的限制,通常将一张输出特征图分割成多个特征图片段(feature map segment),依次输出每个特征图片段,并且每个特征图片段按列并行输出。例如,一个完整的输出特征图被分割成3个特征图片段,每个特征图片段按列依次输出。The convolutional neural network is formed by superimposing multiple layers. The result of the upper layer is output feature maps (OFMs), which is used as the input feature map of the next layer. Usually, the output feature map of the middle layer has many channels and the image is relatively large. When the hardware accelerator of the convolutional neural network processes the feature map data, due to the limitation of the on-chip system cache size and bandwidth, an output feature map is usually divided into multiple feature map segments, and each feature image is sequentially output. Segments, and each feature picture segment is output in parallel in columns. For example, a complete output feature map is divided into three feature image segments, and each feature image segment is sequentially output in columns.
目前,在图像处理过程中,通常使用线缓冲器(line buffer)来实现卷积层运算或池化层运算的数据输入。线缓冲器的结构要求输入数据按照行(或列)优先以光栅化的顺序输入。以池化窗口的高度为k,输入特征矩阵的宽度为W为例,则线缓冲器需要缓存的深度k*W,即线缓冲器必须缓存大小为k*W的输入数据后,才可以进行数据运算,这样会增大数据处理的时延。Currently, in the image processing process, a line buffer is usually used to implement data input of a convolution layer operation or a pooling layer operation. The structure of the line buffer requires that the input data be input in the order of rasterization in the order of rows (or columns). Taking the height of the pooled window as k and the width of the input feature matrix as W, the line buffer needs to be buffered by the depth k*W, that is, the line buffer must buffer the input data of size k*W before the line buffer can be cached. Data operations, which increase the latency of data processing.
上述可知,现有的图像处理方案需要的缓存空间较大,同时数据处理的时延也较大。As can be seen from the above, the existing image processing scheme requires a large buffer space and a large delay in data processing.
发明内容Summary of the invention
本申请提供一种用于神经网络的运算装置、电路及相关方法,可以有效节省缓存空间,同时可以减小数据处理的时延。The present application provides an arithmetic device, a circuit, and a related method for a neural network, which can effectively save buffer space and reduce the delay of data processing.
第一方面,提供一种用于神经网络的运算装置,所述运算装置包括:第一处理单元,用于根据计算窗口的大小对k1个输入特征数据进行第一运算 操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;第二处理单元,用于根据所述计算窗口的大小对所述第一处理单元输出的k2个中间结果进行第二运算操作,获得计算结果。In a first aspect, an arithmetic device for a neural network is provided, the computing device comprising: a first processing unit, configured to perform a first operation on k1 input feature data according to a size of a calculation window Operation, obtaining an intermediate result, the size of the calculation window is k1 × k2, k1 and k2 are both positive integers; the second processing unit is configured to output k2 of the first processing unit according to the size of the calculation window The intermediate result performs a second arithmetic operation to obtain a calculation result.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, by decomposing the window operation of the neural network into a column operation and a row operation, the calculation can be started as long as one row or one column of input data is received, in other words, the row or column buffer can be used. Input the feature matrix, and can perform operations at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can be performed, so that the delay of data processing can be effectively reduced, and the neural network can be effectively improved. Data processing efficiency, while also saving storage resources, thereby saving hardware resources.
结合第一方面,在第一方面的一种可能的实现方式中,所述运算装置包括M个所述第一处理单元与M个所述第二处理单元,且所述M个第一处理单元与所述M个第二处理单元一一对应,M为大于1的正整数;所述运算装置还包括:预处理单元,用于按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,所述预处理单元还用于将所述M组数据一对一地输入到所述M个第一处理单元中。With reference to the first aspect, in a possible implementation manner of the first aspect, the computing device includes M first processing units and M second processing units, and the M first processing units One-to-one corresponding to the M second processing units, M is a positive integer greater than 1; the computing device further includes: a pre-processing unit, configured to receive the input feature matrix by column, and receive the received image according to the calculation window The column input feature values are processed to obtain M sets of data, wherein each set of data includes k1 input feature values, and the pre-processing unit is further configured to input the M sets of data one-to-one to the M firsts Processing unit.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第二方面,提供一种用于处理神经网络的电路,所述电路包括:第一处理电路,用于根据计算窗口的大小对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;第二处理电路,用于根据所述计算窗口的大小对所述第一处理电路输出的k2个中间结果进行第二运算操作,获得计算结果。In a second aspect, a circuit for processing a neural network is provided, the circuit comprising: a first processing circuit, configured to perform a first operation operation on the k1 input feature data according to a size of the calculation window, to obtain an intermediate result, The size of the calculation window is k1×k2, and k1 and k2 are both positive integers; the second processing circuit is configured to perform a second operation operation on the k2 intermediate results output by the first processing circuit according to the size of the calculation window, Get the calculation results.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, by decomposing the window operation of the neural network into a column operation and a row operation, the calculation can be started as long as one row or one column of input data is received, in other words, the row or column buffer can be used. Input the feature matrix, and can perform operations at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can be performed, so that the delay of data processing can be effectively reduced, and the neural network can be effectively improved. Data processing efficiency, while also saving storage resources, thereby saving hardware resources.
结合第二方面,在第二方面的一种可能的实现方式中,所述电路包括M个所述第一处理电路与M个所述第二处理电路,且所述M个第一处理电路 与所述M个第二处理电路一一对应,M为大于1的正整数;所述电路还包括:预处理电路,用于按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,所述预处理电路还用于将所述M组数据一对一地输入到所述M个第一处理电路中。With reference to the second aspect, in a possible implementation manner of the second aspect, the circuit includes M first processing circuits and M second processing circuits, and the M first processing circuits One-to-one correspondence with the M second processing circuits, M is a positive integer greater than 1; the circuit further includes: a pre-processing circuit for receiving an input feature matrix by column, and receiving the column according to the calculation window The input feature values are processed to obtain M sets of data, wherein each set of data includes k1 input feature values, and the pre-processing circuit is further configured to input the M sets of data one-to-one to the M first processes In the circuit.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第三方面,提供一种用于处理神经网络的方法,所述方法包括:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果。In a third aspect, a method for processing a neural network is provided. The method comprises: performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
结合第三方面,在第三方面的一种可能的实现方式中,所述方法还包括:按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值;所述根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,包括:根据计算窗口的大小,分别对所述M组数据进行所述第一运算操作,获得对应的中间结果;所述根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果,包括:分别针对所述M组数据中的每一组数据对应的第一运算操作,每获得k2个中间结果,进行所述第二运算操作,获得对应的计算结果。With reference to the third aspect, in a possible implementation of the third aspect, the method further includes: receiving an input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values. Performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. Performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results are obtained, to obtain a corresponding calculation result.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第四方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果。In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a computer, the computer program implements: performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.
结合第四方面,在第四方面的一种可能的实现方式中,所述计算机程序被计算机执行时还用于实现:按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值;所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,包括:所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,分别对所述M组数据进行所述第一运算操作,获得对应的中间结果;所述计算机程序被计算机执行时用于实现:根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果,包括:所述计算机程序被计算机执行时用于实现:分别针对所述M组数据中的每一组数据对应的第一运算操作,每获得k2个中间结果,进行所述第二运算操作,获得对应的计算结果。With reference to the fourth aspect, in a possible implementation of the fourth aspect, when executed by a computer, the computer program further implements: receiving an input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values. The step of performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. The step of performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results are obtained, to obtain a corresponding calculation result.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
综上所述,在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In summary, in the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1为神经网络卷积层运算的示意图。Figure 1 is a schematic diagram of a neural network convolutional layer operation.
图2为神经网络池化层运算的示意图。2 is a schematic diagram of a neural network pooling layer operation.
图3为本申请实施例提供的用于神经网络的运算装置的示意性框图。FIG. 3 is a schematic block diagram of an operation apparatus for a neural network according to an embodiment of the present application.
图4为本申请另一实施例提供的用于神经网络的运算装置的示意性框图。FIG. 4 is a schematic block diagram of an operation apparatus for a neural network according to another embodiment of the present application.
图5为本申请再一实施例提供的用于神经网络的运算装置的示意性框图。FIG. 5 is a schematic block diagram of an operation device for a neural network according to still another embodiment of the present application.
图6为本申请实施例提供的用于处理神经网络的电路的示意图。FIG. 6 is a schematic diagram of a circuit for processing a neural network according to an embodiment of the present application.
图7为本申请另一实施例提供的用于神经网络的电路的示意性框图。FIG. 7 is a schematic block diagram of a circuit for a neural network according to another embodiment of the present application.
图8为本申请实施例提供的用于处理神经网络的方法的示意性流程图。FIG. 8 is a schematic flowchart of a method for processing a neural network according to an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION
为了便于理解本申请提供的技术方案,下面首先介绍卷积神经网络中的卷积层运算与池化层运算。In order to facilitate the understanding of the technical solution provided by the present application, the following describes the convolution layer operation and the pooling layer operation in the convolutional neural network.
1)卷积层运算1) Convolution layer operation
卷积层运算的运算过程为,将一个固定大小的窗口滑动过整个图像平面,在每个时刻对窗口内覆盖的数据进行乘累加运算。卷积层运算中,窗口滑动的步长为1。The operation of the convolution layer operation is to slide a fixed-size window across the entire image plane, and multiply and accumulate the data covered in the window at each moment. In a convolutional layer operation, the window slides in steps of 1.
图1为卷积层运算的示意图。输入图像的高H1为3,宽W1为4;卷积窗口的高k1为2,宽k2为2。卷积层运算为一个2×2的卷积窗口在3×4的图像上以步长为1的间隔滑动,每个卷积窗口覆盖的4个数据进行乘累加运算,得到一个输出结果,所有输出结果构成输出图像。如图1所示,输出图像的高H2为2,宽W2为3。FIG. 1 is a schematic diagram of a convolution layer operation. The input image has a height H1 of 3 and a width W1 of 4; the convolution window has a height k1 of 2 and a width k2 of 2. In the convolution layer operation, a 2×2 convolution window slides over the 3×4 image with a stride of 1, and the 4 data values covered by the window at each position are multiplied and accumulated to obtain one output result; all output results constitute the output image. As shown in FIG. 1, the output image has a height H2 of 2 and a width W2 of 3.
图1中所示的输出结果o1是通过如下公式得到的:The output result o1 shown in Fig. 1 is obtained by the following formula:
o1=op{d1,d2,d3,d4},o1=op{d1,d2,d3,d4},
其中,运算符op的运算方式为乘累加。Among them, the operation mode of the operator op is multiply and accumulate.
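As a concrete illustration of the multiply-accumulate window operation above, the following Python sketch (not part of the patent; the kernel `w` and input values are assumed for illustration) reproduces the Fig. 1 configuration: a 2×2 window sliding with stride 1 over a 3×4 input.

```python
# Illustrative sketch of the direct convolution-layer window operation of
# Fig. 1: a k1 x k2 window slides with stride 1 over an H1 x W1 input,
# multiply-accumulating the data it covers. Weights `w` are hypothetical.
def conv_layer(image, w):
    k1, k2 = len(w), len(w[0])
    h1, w1 = len(image), len(image[0])
    out = []
    for i in range(h1 - k1 + 1):
        row = []
        for j in range(w1 - k2 + 1):
            # multiply-accumulate over the 4 values covered by the window
            acc = 0
            for a in range(k1):
                for b in range(k2):
                    acc += image[i + a][j + b] * w[a][b]
            row.append(acc)
        out.append(row)
    return out

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]   # H1 = 3, W1 = 4 as in Fig. 1
w = [[1, 0],
     [0, 1]]              # hypothetical 2x2 kernel
res = conv_layer(img, w)  # H2 = 2 rows by W2 = 3 columns, as in Fig. 1
```

With this hypothetical kernel, each output is the sum of the top-left and bottom-right values under the window, and the output size is 2×3, matching H2 and W2 of Fig. 1.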
2)池化层运算2) Pool layer operation
池化层运算的运算过程为,将一个固定大小的窗口滑动过整个图像平面,在每个时刻对窗口内覆盖的数据进行运算,求最大值或者求平均值作为输出。池化层运算中,窗口滑动的步长等于窗口的高(或宽)。The pooling layer operation slides a fixed-size window across the entire image plane and, at each position, operates on the data covered by the window, taking the maximum or the average as the output. In a pooling layer operation, the stride of the window slide is equal to the height (or width) of the window.
图2为池化层运算的示意图。输入图像的高H1为6,宽W1为8;池化窗口的高k1为2,宽k2为2。池化层运算为一个2×2的池化窗口在6×8的图像上以步长为2的间隔滑动,每个窗口覆盖的4个数据得到一个输出结果,所有输出结果构成输出图像。如图2所示,输出图像的高H2为3,宽W2为4。FIG. 2 is a schematic diagram of a pooling layer operation. The input image has a height H1 of 6 and a width W1 of 8; the pooling window has a height k1 of 2 and a width k2 of 2. In the pooling layer operation, a 2×2 pooling window slides over the 6×8 image with a stride of 2, and the 4 data values covered by each window yield one output result; all output results constitute the output image. As shown in FIG. 2, the output image has a height H2 of 3 and a width W2 of 4.
图2中所示的输出结果o1是通过如下公式得到的:The output result o1 shown in Fig. 2 is obtained by the following formula:
o1=op{d1,d2,d3,d4},o1=op{d1,d2,d3,d4},
其中,根据配置不同,运算符op的运算方式为求最大值(max)或求平均值(avg)。Among them, depending on the configuration, the operation mode of the operator op is to find the maximum value (max) or the average value (avg).
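The pooling operation above admits a similar sketch (illustrative only, not from the patent): a 2×2 window slides with stride 2 over a 6×8 input, and each window yields its maximum (or average) depending on configuration.

```python
# Illustrative sketch of the direct pooling-layer window operation of
# Fig. 2: a k1 x k2 window slides with stride equal to its own size,
# and op is max or average depending on configuration.
def pool_layer(image, k1, k2, op="max"):
    h1, w1 = len(image), len(image[0])
    out = []
    for i in range(0, h1 - k1 + 1, k1):        # vertical stride k1
        row = []
        for j in range(0, w1 - k2 + 1, k2):    # horizontal stride k2
            window = [image[i + a][j + b]
                      for a in range(k1) for b in range(k2)]
            row.append(max(window) if op == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out

# hypothetical 6x8 input (H1 = 6, W1 = 8 as in Fig. 2)
img = [[r * 8 + c for c in range(8)] for r in range(6)]
res = pool_layer(img, 2, 2, op="max")
# output is H2 = 3 rows by W2 = 4 columns, matching Fig. 2
```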
在现有的神经网络计算过程(卷积运算或池化运算)中,通常是“先取窗口,再进行计算”。以图2所示的池化运算为例,先获取池化窗口覆盖的4个输入数据,然后,再对4个输入数据进行计算。In the existing neural network calculation process (convolution operation or pooling operation), it is usually "first take the window and then perform the calculation". Taking the pooling operation shown in FIG. 2 as an example, the four input data covered by the pooled window are first acquired, and then the four input data are calculated.
在本申请中,将“先取窗口,再进行计算”的过程分解为列操作与行操作。In the present application, the process of "first taking a window and then performing calculation" is decomposed into a column operation and a row operation.
可选地,作为一种实现方式,将“先取窗口,再进行计算”的过程分解为先列操作,再行操作。Optionally, as one implementation, the process of "first taking the window, then computing" is decomposed into column operations first, followed by row operations.
具体地,先对窗口内的同一列的数据进行计算,得到中间结果;然后对窗口内所有列的中间结果进行计算,得到计算结果。Specifically, the data of the same column in the window is first calculated to obtain an intermediate result; then the intermediate result of all the columns in the window is calculated to obtain a calculation result.
以图1所示窗口2×2为例,d1,d2,d3,d4参与运算,得到结果o1=op{d1,d2,d3,d4}。在本申请中,将如图1所示的窗口2×2的操作o1=op{d1,d2,d3,d4}分解为:先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1与op2的运算方式均为乘累加,op3的运算方式为累加。Taking the 2×2 window shown in FIG. 1 as an example, d1, d2, d3 and d4 participate in the operation, giving the result o1=op{d1,d2,d3,d4}. In the present application, the 2×2 window operation o1=op{d1,d2,d3,d4} of FIG. 1 is decomposed as follows: a column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1 and op2 are both multiply-accumulate operations, and op3 is an accumulation operation.
以图2所示窗口2×2为例,d1,d2,d3,d4参与运算,得到结果o1=op{d1,d2,d3,d4}。在本申请中,将如图2所示的窗口2×2的操作o1=op{d1,d2,d3,d4}分解为:先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1、op2、op3的运算方式均为求最大值或求平均值。Taking the 2×2 window shown in FIG. 2 as an example, d1, d2, d3 and d4 participate in the operation, giving the result o1=op{d1,d2,d3,d4}. In the present application, the 2×2 window operation o1=op{d1,d2,d3,d4} of FIG. 2 is decomposed as follows: a column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1, op2 and op3 are all maximum or averaging operations.
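The column-then-row decomposition just described can be checked on a single window. The sketch below (values hypothetical, not from the patent) uses the pooling case, where op1, op2 and op3 are all `max`; for the convolution case, op1 and op2 would be multiply-accumulates against the kernel columns and op3 an accumulation.

```python
# Illustrative sketch: decomposing one 2x2 max-pooling window into
# column operations followed by a row operation.
d1, d3 = 5, 9      # first column of the window (hypothetical values)
d2, d4 = 7, 3      # second column of the window

# column operations: one intermediate result per column
p1 = max(d1, d3)   # op1
p2 = max(d2, d4)   # op2

# row operation over the per-column intermediate results
o1 = max(p1, p2)   # op3

# identical to the direct window operation o1 = op{d1, d2, d3, d4}
assert o1 == max(d1, d2, d3, d4)
```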
可选地,作为一种实现方式,将“先取窗口,再进行计算”的过程分解为先行操作,再列操作。Optionally, as another implementation, the process of "first taking the window, then computing" is decomposed into row operations first, followed by column operations.
具体地,先对窗口内的同一行的数据进行计算,得到中间结果;然后对窗口内所有行的中间结果进行计算,得到计算结果。Specifically, the data of the same row in the window is first calculated to obtain an intermediate result; then the intermediate result of all the rows in the window is calculated to obtain a calculation result.
上述可知,在本申请中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延。同时,可以随输入数据的输入方式灵活设置数据缓存方式,例如,输入数据是按列输入的,则按列缓存,且对缓存的数据先进行列操作再进行行操作;再例如,输入数据是按行输入的,则按行缓存,且对缓存的数据先进行行操作再进行列操作。As can be seen from the above, in the present application, by decomposing the window operation of the neural network into column operations and row operations, computation can start as soon as one row or one column of input data is received, without first caching a certain amount of two-dimensional input data as required in the prior art; therefore, the latency of data processing can be effectively reduced. Meanwhile, the data caching scheme can be set flexibly according to how the input data arrives: for example, if the input data arrives column by column, it is cached by column, and column operations are performed on the cached data before row operations; if the input data arrives row by row, it is cached by row, and row operations are performed on the cached data before column operations.
下文对本申请提供的用于神经网络的运算装置、电路及相关方法进行详细描述。The arithmetic device, circuit and related method for neural network provided by the present application are described in detail below.
图3为本申请提供的用于神经网络的运算装置300的示意性框图。该运算装置300包括:FIG. 3 is a schematic block diagram of an arithmetic device 300 for a neural network provided by the present application. The computing device 300 includes:
第一处理单元310,用于根据计算窗口的大小对k1个输入特征数据进行第一运算操作,获得中间结果,该计算窗口的大小为k1×k2,k1与k2均为正整数。The first processing unit 310 is configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers.
第二处理单元320,用于根据该计算窗口的大小对该第一处理单元输出的k2个中间结果进行第二运算操作,获得计算结果。The second processing unit 320 is configured to perform a second operation on the k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
可选地,第一处理单元310用于对输入特征矩阵中的列输入特征值中的k1个输入特征数据进行第一运算操作,其中,k1表示该计算窗口的高度,k2表示该计算窗口的宽度;第二处理单元320用于对第一处理单元输出的k2个中间结果进行第二运算操作,相当于是对不同列的中间结果做第二运算操作,获得计算结果。Optionally, the first processing unit 310 is configured to perform the first operation on k1 input feature data within a column of input feature values of the input feature matrix, where k1 represents the height of the calculation window and k2 represents its width; the second processing unit 320 is configured to perform the second operation on the k2 intermediate results output by the first processing unit, which amounts to performing the second operation on the intermediate results of different columns, to obtain the calculation result.
在本实施例中,第一处理单元310可称为列处理单元,对应地,第一运算操作称为列操作;第二处理单元320可称为行处理单元,对应地,第二运算操作称为行操作。In this embodiment, the first processing unit 310 may be referred to as a column processing unit and, correspondingly, the first operation as a column operation; the second processing unit 320 may be referred to as a row processing unit and, correspondingly, the second operation as a row operation.
可选地,第一处理单元310用于对输入特征矩阵中的行输入特征值中的k1个输入特征数据进行第一运算操作,其中,k1表示该计算窗口的宽度,k2表示该计算窗口的高度;第二处理单元320用于对第一处理单元输出的k2个中间结果进行第二运算操作,相当于是对不同行的中间结果做第二运算操作,获得计算结果。Optionally, the first processing unit 310 is configured to perform the first operation on k1 input feature data within a row of input feature values of the input feature matrix, where k1 represents the width of the calculation window and k2 represents its height; the second processing unit 320 is configured to perform the second operation on the k2 intermediate results output by the first processing unit, which amounts to performing the second operation on the intermediate results of different rows, to obtain the calculation result.
在本实施例中,第一处理单元310可称为行处理单元,对应地,第一运算操作称为行操作;第二处理单元320可称为列处理单元,对应地,第二运算操作称为列操作。In this embodiment, the first processing unit 310 may be referred to as a row processing unit and, correspondingly, the first operation as a row operation; the second processing unit 320 may be referred to as a column processing unit and, correspondingly, the second operation as a column operation.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
下文主要以先列处理后行处理为例进行描述,但本申请实施例并非限定于此。根据实际需要,也可以先行处理再列处理。The following description mainly takes column processing followed by row processing as an example, but the embodiments of the present application are not limited thereto. According to actual needs, row processing may also be performed first, followed by column processing.
可选地,作为一个实施例,该计算窗口为卷积窗口,该第一运算操作的运算方式为乘累加运算,该第二运算操作的运算方式为累加运算。Optionally, as an embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
以图1所示的输入图像与卷积窗口为例,且以先列操作后行操作为例。先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1与op2的运算方式均为乘累加,op3的运算方式为累加。Take the input image and convolution window shown in FIG. 1 as an example, with column operations performed first and row operations second. A column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1 and op2 are both multiply-accumulate operations, and op3 is an accumulation operation.
本实施例可以提高神经网络的卷积运算效率。This embodiment can improve the convolution operation efficiency of the neural network.
可选地,作为另一个实施例,该计算窗口为池化窗口,该第一运算操作的运算方式为求最大值或求平均值,该第二运算操作的运算方式与该第一运算操作的运算方式相同。Optionally, as another embodiment, the calculation window is a pooling window, the first operation is a maximum or averaging operation, and the second operation uses the same operation mode as the first operation.
以图2所示的输入图像与池化窗口为例,且以先列操作后行操作为例。先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1、op2、op3的运算方式均为求最大值或求平均值。Take the input image and pooling window shown in FIG. 2 as an example, with column operations performed first and row operations second. A column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1, op2 and op3 are all maximum or averaging operations.
本实施例可以提高神经网络的池化运算效率。This embodiment can improve the pooling operation efficiency of the neural network.
可选地,如图4所示,该运算装置包括M个该第一处理单元310与M个该第二处理单元320,且该M个第一处理单元310与该M个第二处理单元320一一对应,M为大于1的正整数;Optionally, as shown in FIG. 4, the computing device includes M first processing units 310 and M second processing units 320, the M first processing units 310 corresponding one-to-one to the M second processing units 320, where M is a positive integer greater than 1.
该运算装置300还包括:The computing device 300 further includes:
预处理单元330,用于按列接收输入特征矩阵,并根据该计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,该预处理单元还用于将该M组数据一对一地输入到该M个第一处理单元中。The pre-processing unit 330 is configured to receive the input feature matrix column by column and process each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values; the pre-processing unit is further configured to input the M groups of data one-to-one into the M first processing units.
具体地,预处理单元330接收到输入特征矩阵中的第一列输入特征值,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中。预处理单元330接收到输入特征矩阵中的第二列输入特征值,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中。以此类推,当预处理单元330接收到第k2列输入特征值时,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中,这时,M个第二处理单元320中的每个第二处理单元320已经接收到k2个中间结果了,第二处理单元320对接收到的k2个中间结果进行行操作,得到运算结果,即M个第二处理单元320得到M个运算结果。后续,预处理单元330可以继续接收列输入特征值,重复执行上文描述的流程,得到下一次的M个运算结果,这里不再赘述。Specifically, the pre-processing unit 330 receives the first column of input feature values of the input feature matrix, processes it into M groups of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 output M intermediate results and input them one-to-one into the M second processing units 320. The pre-processing unit 330 then receives the second column of input feature values, processes it into M groups of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 again output M intermediate results and input them one-to-one into the M second processing units 320. By analogy, when the pre-processing unit 330 receives the k2-th column of input feature values and processes it in the same way, each of the M second processing units 320 has received k2 intermediate results; each second processing unit 320 then performs the row operation on its k2 intermediate results to obtain an operation result, so that the M second processing units 320 obtain M operation results. Subsequently, the pre-processing unit 330 may continue to receive columns of input feature values and repeat the above process to obtain the next M operation results, which is not described here again.
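A minimal simulation of this column-by-column flow, assuming the pooling case with stride equal to the window size (the function name and data layout below are illustrative, not from the patent), shows M column-unit/row-unit pairs each consuming k1 values per input column and emitting one result per k2 intermediate results:

```python
# Illustrative sketch: column-by-column streaming through M parallel
# (column-unit, row-unit) pairs for k1 x k2 max pooling, where each
# column unit handles k1 rows and each row unit buffers k2 intermediates.
def stream_columns(columns, k1, k2):
    H = len(columns[0])
    M = H // k1                       # one unit pair per window row
    buffers = [[] for _ in range(M)]  # per-row-unit intermediate buffers
    results = [[] for _ in range(M)]
    for col in columns:               # one input column per step
        for m in range(M):
            group = col[m * k1:(m + 1) * k1]  # k1 values for unit m
            p = max(group)                    # column operation
            buffers[m].append(p)
            if len(buffers[m]) == k2:         # row operation once k2
                results[m].append(max(buffers[m]))  # intermediates arrive
                buffers[m].clear()
    return results

# the 6x8 input of Fig. 2 fed as 8 columns of height 6 (values assumed)
img = [[r * 8 + c for c in range(8)] for r in range(6)]
cols = [[img[r][c] for r in range(6)] for c in range(8)]
out_rows = stream_columns(cols, 2, 2)
# out_rows[m] is row m of the 3x4 output image of Fig. 2
```

Note that computation begins as soon as the first column arrives; no two-dimensional block of input has to be cached beforehand, which is the point of the decomposition.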
上文已述,当前技术中,通常将一张输出特征图分割成多个特征图片段,依次输出每个特征图片段,并且每个特征图片段按列并行输出。例如,一个完整的输出特征图被分割成3个特征图片段,每个特征图片段被按列依次输出。现有技术中,特征图片段的数据是按列输入,线缓冲器是按行输入,相当于特征图片段的数据是并行输入的,但是线缓冲器的方式是串行处理数据,会导致输入输出速率不匹配,吞吐数据的速率太低,会成为加速器的瓶颈,降低加速器的速率。As described above, in the current technology, an output feature map is usually divided into multiple feature map segments, the segments are output in sequence, and each segment is output column by column in parallel. For example, a complete output feature map may be divided into 3 feature map segments, each output column by column in turn. In the prior art, the data of a feature map segment is input column by column while a line buffer takes input row by row; the segment data arrives in parallel but the line buffer processes data serially, so the input and output rates do not match, the data throughput is too low, and the line buffer becomes the bottleneck of the accelerator, lowering its speed.
在本申请中,预处理单元330按列接收一个特征图片段,M个第一处理单元对该特征图片段的列输入特征值进行列操作,M个第二处理单元根据M个第一处理单元输出的中间结果进行行操作,从而得到该特征图片段的计算结果,即该特征图片段的神经网络处理结果。In the present application, the pre-processing unit 330 receives a feature map segment column by column, the M first processing units perform column operations on the columns of input feature values of the segment, and the M second processing units perform row operations on the intermediate results output by the M first processing units, thereby obtaining the calculation result of the segment, i.e., the neural network processing result of the feature map segment.
在本申请提供的技术方案中,可以随输入数据的输入方式灵活设置数据缓存方式,例如,输入数据是按列输入的,则按列缓存,且对缓存的数据先进行列操作再进行行操作;再例如,输入数据是按行输入的,则按行缓存,且对缓存的数据先进行行操作再进行列操作,从而可以提高数据吞吐率。同时,本实施例提供的运算装置可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, the data caching scheme can be set flexibly according to how the input data arrives: for example, if the input data arrives column by column, it is cached by column, and column operations are performed on the cached data before row operations; if the input data arrives row by row, it is cached by row, and row operations are performed on the cached data before column operations. This improves the data throughput. Meanwhile, the computing device provided in this embodiment can process image data in parallel, thereby effectively improving the efficiency of data processing.
可选地,在本实施例中,运算装置300所包括的第一处理单元310以及第二处理单元320的数量M是根据输入特征矩阵的大小与计算窗口的大小确定的。Optionally, in the embodiment, the number M of the first processing unit 310 and the second processing unit 320 included in the computing device 300 is determined according to the size of the input feature matrix and the size of the calculation window.
以计算窗口为卷积窗口为例,以第一处理单元310进行列处理,第二处理单元320进行行处理为例。假设输入特征矩阵的行数为H,H为大于或等于k1的整数,卷积窗口的高为k1,宽为k2,则M=H-(k1-1)。Take the case where the calculation window is a convolution window, the first processing unit 310 performs column processing, and the second processing unit 320 performs row processing. Suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the convolution window has height k1 and width k2; then M=H-(k1-1).
在本实施例中,该M组数据包括该列输入特征值中的所有数据,即本申请提供的运算装置300可以实现一列输入特征值的并行处理。In this embodiment, the M group data includes all the data in the column input feature values, that is, the computing device 300 provided by the present application can implement parallel processing of a column of input feature values.
以计算窗口为池化窗口为例,以第一处理单元310进行列处理,第二处理单元320进行行处理为例。假设输入特征矩阵的行数为H,H为大于或等于k1的整数,池化窗口的高为k1,宽为k2,则M=⌊H/k1⌋。Take the case where the calculation window is a pooling window, the first processing unit 310 performs column processing, and the second processing unit 320 performs row processing. Suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the pooling window has height k1 and width k2; then M=⌊H/k1⌋.
当H可以被k1整除时,该M组数据包括该列输入特征值中的所有数据,即本申请提供的运算装置300可以实现一列输入特征值的并行处理。When H is divisible by k1, the M groups of data include all data of the column of input feature values, i.e., the computing device 300 provided by the present application can process a whole column of input feature values in parallel.
当H不被k1整除时,该M组数据为该列输入特征值中的部分数据;该预处理单元330还包括缓冲器;该预处理单元330还用于,将该列输入特征值中除该M组数据之外的剩余数据存入该缓冲器。When H is not divisible by k1, the M groups of data contain only part of the column of input feature values. The pre-processing unit 330 further includes a buffer, and the pre-processing unit 330 is further configured to store the remaining data of the column, other than the M groups of data, into the buffer.
这种场景下,需要将输入特征矩阵的最后几行的数据先缓存在缓冲器中,后续单独进行处理。In this scenario, the data of the last few rows of the input feature matrix needs to be cached in the buffer first, and then processed separately.
例如,在将一张输出特征图分割成多个特征图片段,每个特征图片段按列并行输出的场景中。假设将一张输出特征图分割成2个特征图片段,第一个特征图片段的高度不被池化窗口的高k1整除,则第一个特征图片段的最后几行数据先缓存在缓冲器中,等到第二个特征图片段输入有效时,从缓冲器中读出缓存的数据,与当前数据(第二个特征图片段的数据)拼接成一个新的特征图片段,重新映射到M个第一处理单元310中进行处理。For example, consider the scenario in which an output feature map is divided into multiple feature map segments and each segment is output column by column in parallel. Suppose an output feature map is divided into 2 feature map segments and the height of the first segment is not divisible by the pooling-window height k1; the last few rows of data of the first segment are first cached in the buffer. When the input of the second segment becomes valid, the cached data is read out of the buffer, spliced with the current data (the data of the second segment) into a new feature map segment, and remapped onto the M first processing units 310 for processing.
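A sketch of this segment-splicing behavior (function name and data hypothetical, not from the patent): rows that do not fill a whole pooling-window height are held back in a buffer and prepended to the next segment.

```python
# Illustrative sketch: when the segment height H is not divisible by the
# pooling-window height k1, the bottom H % k1 rows are buffered and
# spliced onto the front of the next feature-map segment.
def split_segment(segment, k1):
    H = len(segment)
    M = H // k1                  # rows that map onto the M unit pairs
    mapped = segment[:M * k1]    # processed immediately
    leftover = segment[M * k1:]  # buffered until the next segment arrives
    return mapped, leftover

seg1 = [[r] * 4 for r in range(5)]     # hypothetical 5-row segment, k1 = 2
mapped, buf = split_segment(seg1, 2)   # 4 rows mapped, 1 row buffered
seg2 = [[r] * 4 for r in range(5, 8)]  # hypothetical next 3-row segment
new_segment = buf + seg2               # spliced into a new 4-row segment
```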
FIG. 5 is a schematic block diagram of a computing apparatus 500 for a neural network provided in this application. The computing apparatus 500 includes a preprocessing unit 510, M column processing units 520, and M row processing units 530, where the M column processing units 520 correspond one-to-one with the M row processing units 530.

The preprocessing unit 510 is configured to receive input data, preprocess the input data according to the calculation window to obtain M groups of data, each group containing k1 input feature values, and input the M groups of data one-to-one into the M column processing units, where the calculation window has height k1 and width k2.

Specifically, the preprocessing unit 510 being configured to receive input data includes: the preprocessing unit 510 receives the input feature matrix column by column.

The column processing unit 520 is configured to perform a column operation on the k1 input feature values it receives to obtain an intermediate result, and to input the intermediate result into the corresponding row processing unit 530.
Specifically, for a pooling-layer operation, the column operation is taking the maximum or the average; for a convolutional-layer operation, the column operation is a multiply-accumulate operation.

The row processing unit 530 is configured to buffer the intermediate results output by the corresponding column processing unit 520; each time k2 intermediate results have been received, it performs a row operation on those k2 intermediate results to obtain a calculation result.

Specifically, for a pooling-layer operation, the row operation uses the same arithmetic as the column operation; for a convolutional-layer operation, the row operation is an accumulation.
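For max pooling, the column/row decomposition just described can be checked with a short sketch (pure Python; the window contents are assumed for illustration): taking the maximum of each k1-tall column gives k2 intermediate results, and taking the maximum of those reproduces the maximum of the full k1×k2 window.

```python
def window_max_decomposed(window):
    """Max pooling of one window via a column operation then a row operation."""
    # Column operation: one intermediate result per column (max of k1 values).
    intermediates = [max(col) for col in zip(*window)]
    # Row operation: same arithmetic (max), applied to the k2 intermediates.
    return max(intermediates)

window = [[3, 1, 4],
          [1, 5, 9],
          [2, 6, 5]]                     # one 3x3 pooling window
result = window_max_decomposed(window)   # 9, the max of all nine values
```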
As shown in FIG. 5, the calculation results of the M row processing units 530 constitute the output data of the computing apparatus 500.
Optionally, in this embodiment, the input data received by the preprocessing unit 510 is a feature map segment obtained from the input feature map to be processed.

Optionally, in some embodiments, the number M of column processing units 520 and row processing units 530 is determined by the size of the input feature matrix received by the preprocessing unit 510 and the size of the calculation window.

Specifically, the input feature matrix is a feature map segment.

Suppose a complete input feature map is divided into several feature map segments. The preprocessing unit 510 is configured to receive these feature map segments in sequence.
In some cases, the sliding window (i.e., the calculation window) may simultaneously cover part of the data of two adjacent feature map segments. In this case, the preprocessing unit 510 is configured to cache the last few rows of the previous feature map segment covered by the window in the buffer of the preprocessing unit 510 (as shown in FIG. 5); when the input of the next feature map segment becomes valid, the cached data is read out of the buffer and spliced with the current data (i.e., the currently input feature map segment) into a new feature map segment, and the new data is remapped to the M column processing units 520.

This embodiment effectively saves cache space, and therefore hardware resources.

As an example, take the input feature map of height 6 and width 8 shown in FIG. 2 and a 2×2 pooling window with stride 2. Each column processing unit 520 in FIG. 5 can process 2 rows of data in the same column, and each row processing unit 530 can process 2 columns of data in the same row, so the computing apparatus shown in FIG. 5 only needs to be provided with 3 column processing units 520 and 3 row processing units 530.

As another example, suppose the input feature map is divided into two feature map segments, segment1 and segment2, each of height h = 14, with a 3×3 pooling window and a stride of 2. When processing segment1, the preprocessing unit 510 first caches the last two rows of segment1 in the buffer; after segment2 is received, these two rows are spliced with the 14 rows of segment2 into a new feature map segment of height 16, which is then remapped to the column processing units 520.
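The number of rows that must be carried over between segments follows from the window height and stride. The helper below is an assumed illustration (not part of the patent) that reproduces the figures in this example: with h = 14, a window height of 3, and a stride of 2, two rows are buffered, and splicing them onto the next 14-row segment gives a height of 16.

```python
def rows_to_buffer(h, k, stride):
    """Rows at the bottom of a segment that the next (incomplete) window
    of height k, sliding with the given stride, still needs."""
    if h < k:
        return h                        # no complete window fits at all
    n_windows = (h - k) // stride + 1   # complete vertical window positions
    return h - n_windows * stride       # rows left over for the next window

leftover = rows_to_buffer(h=14, k=3, stride=2)  # 2 rows buffered
spliced_height = leftover + 14                  # 16-row spliced segment
```

When the stride equals the window height (non-overlapping pooling), this reduces to the h mod k remainder used elsewhere in the text.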
By decomposing the window operations of the neural network into column operations and row operations, this application allows computation to begin as soon as a single row or column of input data is received, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, enabling real-time data processing. At the same time, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation. In addition, the computing apparatus provided by this application requires less cache space than the prior art, saving hardware overhead. The computing apparatus provided by some embodiments can process multiple calculation windows in parallel, increasing the data throughput rate and overcoming the bottleneck of neural network accelerators.
As shown in FIG. 6, an embodiment of this application further provides a circuit 600 for processing a neural network. The circuit 600 may correspond to the computing apparatus 300 or 500 provided in the above embodiments. As shown in FIG. 6, the circuit 600 includes:

a first processing circuit 610, configured to perform a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers; and

a second processing circuit 620, configured to perform a second operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
Optionally, the first processing circuit 610 is configured to perform the first operation on k1 input feature data within a column of input feature values of the input feature matrix, where k1 is the height of the calculation window and k2 is its width. The second processing circuit 620 is configured to perform the second operation on k2 intermediate results output by the first processing circuit 610, which amounts to performing the second operation on intermediate results from different columns, to obtain the calculation result.

In this embodiment, the first processing circuit 610 may be called a column processing circuit and, correspondingly, the first operation is called a column operation; the second processing circuit 620 may be called a row processing circuit and, correspondingly, the second operation is called a row operation.

Optionally, the first processing circuit 610 is configured to perform the first operation on k1 input feature data within a row of input feature values of the input feature matrix, where k1 is the width of the calculation window and k2 is its height. The second processing circuit 620 is configured to perform the second operation on k2 intermediate results output by the first processing circuit, which amounts to performing the second operation on intermediate results from different rows, to obtain the calculation result.

In this embodiment, the first processing circuit 610 may be called a row processing circuit and, correspondingly, the first operation is called a row operation; the second processing circuit 620 may be called a column processing circuit and, correspondingly, the second operation is called a column operation.
In the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received. In other words, the input feature matrix can be cached by row or by column while the operations proceed, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, the data-processing efficiency of the neural network is improved, and storage resources, and hence hardware resources, are saved.

Optionally, in some embodiments, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.

Optionally, in some embodiments, the calculation window is a pooling window, the first operation is taking the maximum or the average, and the second operation uses the same arithmetic as the first operation.
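The convolution case can be sketched the same way: the first (column) operation is a multiply-accumulate of k1 input values against the matching kernel column, and the second (row) operation accumulates the k2 column results, which together equal the full window convolution. The window and kernel values below are assumed for illustration.

```python
def window_conv_decomposed(window, kernel):
    """Convolve one k1-by-k2 window via column MACs plus row accumulation."""
    k1, k2 = len(window), len(window[0])
    # First operation: multiply-accumulate each column with the kernel column.
    intermediates = [sum(window[i][j] * kernel[i][j] for i in range(k1))
                     for j in range(k2)]
    # Second operation: accumulate the k2 intermediate results.
    return sum(intermediates)

window = [[1, 2],
          [3, 4]]
kernel = [[1, 0],
          [0, 1]]
result = window_conv_decomposed(window, kernel)  # 1*1 + 2*0 + 3*0 + 4*1 = 5
```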
Optionally, as shown in FIG. 7, in one embodiment the circuit 600 includes M first processing circuits 610 and M second processing circuits 620 in one-to-one correspondence, where M is a positive integer greater than 1. The circuit 600 further includes a preprocessing circuit 630, configured to receive the input feature matrix column by column and to process each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values; the preprocessing circuit 630 is further configured to input the M groups of data one-to-one into the M first processing circuits 610.
Specifically, the preprocessing circuit 630 receives the first column of input feature values of the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620. The preprocessing circuit 630 then receives the second column of input feature values, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 again output M intermediate results, which are input one-to-one into the M second processing circuits 620. Continuing in this way, when the preprocessing circuit 630 receives the k2-th column of input feature values, processes it into M groups of data, and inputs them into the M first processing circuits 610, the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620; at this point each of the M second processing circuits 620 has received k2 intermediate results, so each performs a row operation on its k2 intermediate results to obtain an operation result, i.e., the M second processing circuits 620 obtain M operation results.

Subsequently, the preprocessing circuit 630 can continue to receive columns of input feature values and repeat the flow described above to obtain the next M operation results, which is not described again here.
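The column-by-column flow just described can be simulated behaviorally. The sketch below is an illustration under assumed sizes (a 4×4 matrix, k1 = k2 = 2, hence M = 2 lanes for pooling), not a model of the actual circuit: each arriving column is split into M groups, each group is reduced by the column operation, and a lane emits a result once it has collected k2 intermediate results.

```python
def pooled_columns(matrix, k1, k2):
    """Simulate max pooling driven column by column, one lane per window row."""
    H, W = len(matrix), len(matrix[0])
    M = H // k1                        # number of parallel lanes
    pending = [[] for _ in range(M)]   # intermediates held per lane
    results = []
    for j in range(W):                 # columns arrive one at a time
        column = [matrix[i][j] for i in range(H)]
        for m in range(M):
            group = column[m * k1:(m + 1) * k1]  # k1 values for lane m
            pending[m].append(max(group))        # column operation
            if len(pending[m]) == k2:            # k2 intermediates collected
                results.append(max(pending[m]))  # row operation
                pending[m].clear()
    return results

out = pooled_columns([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12],
                      [13, 14, 15, 16]], k1=2, k2=2)  # [6, 14, 8, 16]
```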
In this application, the preprocessing circuit 630 receives a feature map segment column by column; the M first processing circuits perform column operations on the columns of input feature values of the segment, and the M second processing circuits perform row operations on the intermediate results output by the M first processing circuits, thereby obtaining the calculation result of the feature map segment, i.e., the neural network processing result of the segment.
In the technical solution provided by this application, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and the cached data undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation, which improves data throughput. Moreover, the circuit provided by this embodiment can process image data in parallel, effectively improving data-processing efficiency.

Optionally, in this embodiment, the number M of first processing circuits 610 and second processing circuits 620 included in the circuit 600 is determined according to the size of the input feature matrix and the size of the calculation window.
Taking the case where the calculation window is a convolution window, the first processing circuit 610 performs column processing, and the second processing circuit 620 performs row processing as an example: suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the convolution window has height k1 and width k2; then M = H − (k1 − 1).

In this embodiment, the M groups of data include all of the data in the column of input feature values; that is, the circuit 600 provided in this application can process an entire column of input feature values in parallel.
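The formula M = H − (k1 − 1) is simply the number of positions a k1-tall window can occupy while sliding down an H-row column with stride 1, which a few assumed values confirm:

```python
def conv_lanes(H, k1, stride=1):
    """Number of vertical window positions (parallel lanes M) for a window
    of height k1 sliding over H rows."""
    return (H - k1) // stride + 1

# With stride 1 this equals H - (k1 - 1):
m = conv_lanes(H=6, k1=3)   # 6 - (3 - 1) = 4 lanes
```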
Taking the case where the calculation window is a pooling window, the first processing circuit 610 performs column processing, and the second processing circuit 620 performs row processing as an example: suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the pooling window has height k1 and width k2; then M = ⌊H/k1⌋ (H divided by k1, rounded down).

When H is divisible by k1, the M groups of data include all of the data in the column of input feature values; that is, the circuit 600 provided in this application can process an entire column of input feature values in parallel.

In this embodiment, the M groups of data include all of the data in the column of input feature values.

When H is not divisible by k1, the M groups of data contain only part of the column of input feature values. In this case the preprocessing circuit 630 further includes a buffer, and the preprocessing circuit 630 is further configured to store the remaining data of the column, beyond the M groups, into the buffer.

In this embodiment, the M groups of data are part of the column of input feature values; in this scenario, the last few rows of the input feature matrix are first cached in the buffer and processed separately later.
For example, consider a scenario in which an output feature map is divided into multiple feature map segments, each segment being output in parallel by column. Suppose a feature map is divided into two segments and the height of the first segment is not divisible by the pooling-window height k1. The last few rows of the first segment are first cached in the buffer; when the input of the second segment becomes valid, the cached data is read out of the buffer and spliced with the current data (the data of the second segment) into a new feature map segment, which is remapped to the M first processing circuits 610 for processing.

In the technical solution provided by this application, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and the cached data undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation, which improves data throughput. Moreover, the circuit provided by this embodiment can process image data in parallel, effectively improving data-processing efficiency.
Optionally, in some embodiments, the input feature matrix represents one feature map segment of the image to be processed; the preprocessing circuit 630 is specifically configured to receive the feature map segments of the image to be processed in sequence.

Optionally, the circuit 600 further includes a communication interface, configured to receive the image data to be processed and to output the calculation results of the second processing circuits, i.e., the output image data.
In summary, in the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, enabling real-time data processing. At the same time, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation. In addition, the computing apparatus provided by this application requires less cache space than the prior art, saving hardware overhead. The computing apparatus provided by some embodiments can process multiple calculation windows in parallel, increasing the data throughput rate and overcoming the bottleneck of neural network accelerators.
As shown in FIG. 8, an embodiment of this application further provides a method 800 for processing a neural network. Optionally, the method 800 may be performed by the computing apparatus provided in the above embodiments; the descriptions of the technical solutions and technical effects in those embodiments all apply to this embodiment and, for brevity, are not repeated here. As shown in FIG. 8, the method 800 includes the following steps.

810: Perform a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers.

Specifically, step 810 may be performed by the first processing unit 310 in the above embodiments.

820: Perform a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.

Specifically, step 820 may be performed by the second processing unit 320 in the above embodiments.
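Steps 810 and 820 together reproduce an ordinary 2D max-pooling pass, which can be verified against a naive reference implementation. The sketch below is an illustration only; the function names, the 6×8 data, and the 2×2 window are all assumed for the example, with the pooling stride equal to the window size as in the earlier examples.

```python
def method_800_max_pool(matrix, k1, k2):
    """Max pooling via step 810 (first operation over each k1-tall column
    slice) followed by step 820 (second operation over k2 intermediates)."""
    H, W = len(matrix), len(matrix[0])
    out = []
    for i in range(0, H - k1 + 1, k1):
        row_out = []
        for j in range(0, W - k2 + 1, k2):
            # Step 810: first operation on each column slice of the window.
            intermediates = [max(matrix[i + di][j + dj] for di in range(k1))
                             for dj in range(k2)]
            # Step 820: second operation on the k2 intermediate results.
            row_out.append(max(intermediates))
        out.append(row_out)
    return out

def naive_max_pool(matrix, k1, k2):
    """Reference: maximum over each full k1-by-k2 window."""
    H, W = len(matrix), len(matrix[0])
    return [[max(matrix[i + di][j + dj]
                 for di in range(k1) for dj in range(k2))
             for j in range(0, W - k2 + 1, k2)]
            for i in range(0, H - k1 + 1, k1)]

data = [[r * 8 + c for c in range(8)] for r in range(6)]  # 6x8 feature map
pooled = method_800_max_pool(data, 2, 2)  # 3x4 output, matches the reference
```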
In the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received. In other words, the input feature matrix can be cached by row or by column while the operations proceed, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, the data-processing efficiency of the neural network is improved, and storage resources, and hence hardware resources, are saved.

Optionally, in this embodiment, the method 800 further includes: receiving the input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values. Step 810 then specifically includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain the corresponding intermediate results; specifically, the M first processing units 310 of the above embodiments may each perform the first operation on one of the M groups of data. Step 820 specifically includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain the corresponding calculation result; specifically, this may be done by the M second processing units 320 of the above embodiments.

In the technical solution provided by this application, image data can be processed in parallel, effectively improving data-processing efficiency.
Optionally, in this embodiment, the value of M is related to the size of the input feature matrix and the size of the calculation window.

Optionally, in this embodiment, the M groups of data include all of the data in the column of input feature values.

Optionally, in this embodiment, the M groups of data are part of the column of input feature values, and the method 800 further includes: storing the remaining data of the column, beyond the M groups, into a buffer.

Optionally, in this embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.

Optionally, in this embodiment, the calculation window is a pooling window, the first operation is taking the maximum or the average, and the second operation uses the same arithmetic as the first operation.

Optionally, in this embodiment, the input feature matrix represents one feature map segment of the image to be processed, and receiving the input feature matrix column by column includes: receiving the feature map segments of the image to be processed in sequence.
An embodiment of this application further provides a computer-readable storage medium storing a computer program. When executed by a computer, the computer program implements: performing a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.

The descriptions of the technical solutions and technical effects in the above embodiments all apply to this embodiment and, for brevity, are not repeated here.

Optionally, in this embodiment, the computer program, when executed by a computer, further implements: receiving the input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values. Performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result then includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain the corresponding intermediate results. Performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result then includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain the corresponding calculation result.
Optionally, in this embodiment, the value of M depends on the size of the input feature matrix and the size of the calculation window.
Optionally, in this embodiment, the M groups of data include all of the data in the column of input feature values.
Optionally, in this embodiment, the M groups of data are part of the column of input feature values, and the computer program, when executed by the computer, further implements: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
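One plausible reading of this buffering scheme can be sketched as follows. The helper name `split_column`, the stride-1 overlap between groups, and the choice of which values count as "remaining" are our assumptions for illustration, not details taken from the application:

```python
def split_column(column, k1, m, buffer):
    """Take m overlapping groups of k1 values (stride 1 assumed) from
    the head of a received column; stash the column values not covered
    by those m groups in `buffer` for later processing."""
    groups = [column[i:i + k1] for i in range(m)]
    # The m stride-1 windows cover indices 0 .. m + k1 - 2;
    # everything after that is the "remaining data".
    buffer.extend(column[m + k1 - 1:])
    return groups
```

For example, a 10-value column with `k1 = 3` and `m = 4` yields four groups covering the first six values, and the last four values are buffered.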
Optionally, in this embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
Optionally, in this embodiment, the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
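The two configurations can be illustrated with a small worked example for a 3×3 window (k1 = k2 = 3). The input values and the kernel column `[1, 0, -1]` are arbitrary illustrations, not values taken from the application:

```python
def mac(values, weights):
    # First operation for a convolution window: multiply-accumulate one
    # column of k1 input values with the matching kernel column.
    return sum(v * w for v, w in zip(values, weights))

# Convolution window: first op = multiply-accumulate, second op = accumulation.
col_intermediates = [mac([1, 2, 3], [1, 0, -1]),
                     mac([4, 5, 6], [1, 0, -1]),
                     mac([7, 8, 9], [1, 0, -1])]
conv_result = sum(col_intermediates)   # second operation: accumulate k2 intermediates

# Pooling window: first and second operations are the same reduction (max here).
pool_result = max(max([1, 2, 3]), max([4, 5, 6]), max([7, 8, 9]))
```

Here each column contributes 1·a + 0·b + (-1)·c = a - c as its intermediate result, and the convolution output is the sum of the three column intermediates, while the pooling output is the maximum over the whole 3×3 window.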
Optionally, in this embodiment, the input feature matrix represents one feature map segment of an image to be processed, and receiving the input feature matrix column by column includes: sequentially receiving the feature map segments of the image to be processed.
The present application is applicable to convolutional neural network (CNN) hardware accelerators, for example in the form of an IP core, and may also be applied to other types of neural network accelerators/processors that include a pooling layer.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The foregoing are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (32)

  1. An arithmetic apparatus for a neural network, comprising:
    a first processing unit, configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    a second processing unit, configured to perform a second operation on k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  2. The arithmetic apparatus according to claim 1, wherein the arithmetic apparatus comprises M first processing units and M second processing units, the M first processing units corresponding one-to-one with the M second processing units, M being a positive integer greater than 1;
    the arithmetic apparatus further comprising:
    a pre-processing unit, configured to receive an input feature matrix column by column and to process the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values, the pre-processing unit being further configured to input the M groups of data one-to-one into the M first processing units.
  3. The arithmetic apparatus according to claim 2, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  4. The arithmetic apparatus according to claim 2 or 3, wherein the M groups of data comprise all of the data in the column of input feature values.
  5. The arithmetic apparatus according to claim 2 or 3, wherein the M groups of data are part of the column of input feature values;
    the pre-processing unit further comprises a buffer; and
    the pre-processing unit is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  6. The arithmetic apparatus according to any one of claims 1 to 5, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  7. The arithmetic apparatus according to any one of claims 1 to 5, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  8. The arithmetic apparatus according to any one of claims 2 to 5, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    the pre-processing unit is specifically configured to sequentially receive the feature map segments of the image to be processed.
  9. A circuit for processing a neural network, comprising:
    a first processing circuit, configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    a second processing circuit, configured to perform a second operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  10. The circuit according to claim 9, wherein the circuit comprises M first processing circuits and M second processing circuits, the M first processing circuits corresponding one-to-one with the M second processing circuits, M being a positive integer greater than 1;
    the circuit further comprising:
    a pre-processing circuit, configured to receive an input feature matrix column by column and to process the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values, the pre-processing circuit being further configured to input the M groups of data one-to-one into the M first processing circuits.
  11. The circuit according to claim 10, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  12. The circuit according to claim 10 or 11, wherein the M groups of data comprise all of the data in the column of input feature values.
  13. The circuit according to claim 10 or 11, wherein the M groups of data are part of the column of input feature values;
    the pre-processing circuit further comprises a buffer and is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  14. The circuit according to any one of claims 9 to 13, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  15. The circuit according to any one of claims 9 to 13, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  16. The circuit according to claim 10, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    the pre-processing circuit is specifically configured to sequentially receive the feature map segments of the image to be processed.
  17. A method for processing a neural network, comprising:
    performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    performing, according to the size of the calculation window, a second operation on k2 intermediate results obtained from the first operation, to obtain a calculation result.
  18. The method according to claim 17, further comprising:
    receiving an input feature matrix column by column, and processing the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values;
    wherein performing the first operation on the k1 input feature data according to the size of the calculation window to obtain an intermediate result comprises:
    performing the first operation on each of the M groups of data according to the size of the calculation window, to obtain corresponding intermediate results; and
    wherein performing the second operation on the k2 intermediate results obtained from the first operation according to the size of the calculation window to obtain a calculation result comprises:
    for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain a corresponding calculation result.
  19. The method according to claim 18, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  20. The method according to claim 18 or 19, wherein the M groups of data comprise all of the data in the column of input feature values.
  21. The method according to claim 18 or 19, wherein the M groups of data are part of the column of input feature values;
    the method further comprising: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  22. The method according to any one of claims 17 to 21, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  23. The method according to any one of claims 17 to 21, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  24. The method according to any one of claims 17 to 21, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    receiving the input feature matrix column by column comprises: sequentially receiving the feature map segments of the image to be processed.
  25. A computer-readable storage medium having stored thereon a computer program which, when executed by a computer, implements:
    performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    performing, according to the size of the calculation window, a second operation on k2 intermediate results obtained from the first operation, to obtain a calculation result.
  26. The computer-readable storage medium according to claim 25, wherein the computer program, when executed by a computer, further implements:
    receiving an input feature matrix column by column, and processing the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values;
    wherein implementing the first operation on the k1 input feature data according to the size of the calculation window to obtain an intermediate result comprises:
    performing the first operation on each of the M groups of data according to the size of the calculation window, to obtain corresponding intermediate results; and
    wherein implementing the second operation on the k2 intermediate results obtained from the first operation according to the size of the calculation window to obtain a calculation result comprises:
    for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain a corresponding calculation result.
  27. The computer-readable storage medium according to claim 26, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  28. The computer-readable storage medium according to claim 26 or 27, wherein the M groups of data comprise all of the data in the column of input feature values.
  29. The computer-readable storage medium according to claim 26 or 27, wherein the M groups of data are part of the column of input feature values;
    the computer program, when executed by a computer, further implementing: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  30. The computer-readable storage medium according to any one of claims 25 to 29, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  31. The computer-readable storage medium according to any one of claims 25 to 29, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  32. The computer-readable storage medium according to any one of claims 26 to 29, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    receiving the input feature matrix column by column comprises: sequentially receiving the feature map segments of the image to be processed.
PCT/CN2017/108640 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network WO2019084788A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201780013527.XA CN108780524A (en) 2017-10-31 2017-10-31 Arithmetic unit, circuit and correlation technique for neural network
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network
US16/727,677 US20200134435A1 (en) 2017-10-31 2019-12-26 Computation apparatus, circuit and relevant method for neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/727,677 Continuation US20200134435A1 (en) 2017-10-31 2019-12-26 Computation apparatus, circuit and relevant method for neural network

Publications (1)

Publication Number Publication Date
WO2019084788A1 true WO2019084788A1 (en) 2019-05-09

Family

ID=64034073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network

Country Status (3)

Country Link
US (1) US20200134435A1 (en)
CN (1) CN108780524A (en)
WO (1) WO2019084788A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3089664A1 (en) * 2018-12-05 2020-06-12 Stmicroelectronics (Rousset) Sas Method and device for reducing the computational load of a microprocessor intended to process data by a convolutional neural network
CN110647978B (en) * 2019-09-05 2020-11-03 北京三快在线科技有限公司 System and method for extracting convolution window in convolution neural network
CN110991609B (en) * 2019-11-27 2023-12-26 天津大学 Line buffer for data transmission
CN110956258B (en) * 2019-12-17 2023-05-16 深圳鲲云信息科技有限公司 Neural network acceleration circuit and method
US11507831B2 (en) * 2020-02-24 2022-11-22 Stmicroelectronics International N.V. Pooling unit for deep learning acceleration
CN113255897B (en) * 2021-06-11 2023-07-07 西安微电子技术研究所 Pooling calculation unit of convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140122551A1 (en) * 2012-10-31 2014-05-01 Mobileye Technologies Limited Arithmetic logic unit
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016154A (en) * 1991-07-10 2000-01-18 Fujitsu Limited Image forming apparatus
US9767565B2 (en) * 2015-08-26 2017-09-19 Digitalglobe, Inc. Synthesizing training data for broad area geospatial object detection
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445420A (en) * 2020-04-09 2020-07-24 北京爱芯科技有限公司 Image operation method and device of convolutional neural network and electronic equipment
CN111445420B (en) * 2020-04-09 2023-06-06 北京爱芯科技有限公司 Image operation method and device of convolutional neural network and electronic equipment

Also Published As

Publication number Publication date
CN108780524A (en) 2018-11-09
US20200134435A1 (en) 2020-04-30

Similar Documents

Publication Publication Date Title
WO2019084788A1 (en) Computation apparatus, circuit and relevant method for neural network
US20200285446A1 (en) Arithmetic device for neural network, chip, equipment and related method
US9367892B2 (en) Processing method and apparatus for single-channel convolution layer, and processing method and apparatus for multi-channel convolution layer
US20210073569A1 (en) Pooling device and pooling method
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US9813502B1 (en) Data transfers in columnar data systems
US11734554B2 (en) Pooling processing method and system applied to convolutional neural network
US20180137407A1 (en) Convolution operation device and convolution operation method
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110737401B (en) Method, apparatus and computer program product for managing redundant array of independent disks
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
US9342564B2 (en) Distributed processing apparatus and method for processing large data through hardware acceleration
CN106227506A (en) A kind of multi-channel parallel Compress softwares system and method in memory compression system
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
US20210201122A1 (en) Data processing methods, apparatuses, devices, storage media and program products
CN103902614B (en) A kind of data processing method, equipment and system
US11467973B1 (en) Fine-grained access memory controller
WO2023071566A1 (en) Data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
US20160299800A1 (en) System and method for redundant database communication
US8473679B2 (en) System, data structure, and method for collapsing multi-dimensional data
TWI586144B (en) Multiple stream processing for video analytics and encoding
WO2021128820A1 (en) Data processing method, apparatus and device, and storage medium and computer program product
US10841405B1 (en) Data compression of table rows
TWI753728B (en) Architecture and cluster of processing elements and method of convolution operation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17930644

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17930644

Country of ref document: EP

Kind code of ref document: A1