WO2021179289A1 - Operation method, apparatus, device and storage medium for convolutional neural network - Google Patents

Operation method, apparatus, device and storage medium for convolutional neural network

Info

Publication number
WO2021179289A1
WO2021179289A1 · PCT/CN2020/079221 · CN2020079221W
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
data
unit
external memory
image
Prior art date
Application number
PCT/CN2020/079221
Other languages
English (en)
French (fr)
Inventor
罗岚
韩峰
杨康
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to CN202080004317.6A priority Critical patent/CN112602096A/zh
Priority to PCT/CN2020/079221 priority patent/WO2021179289A1/zh
Publication of WO2021179289A1 publication Critical patent/WO2021179289A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management

Definitions

  • the present invention relates to the field of artificial intelligence technology, and in particular to an operation method, device, equipment and storage medium of a convolutional neural network.
  • CNN: Convolutional Neural Network
  • BN: Batch Normalization
  • IFM: Input Feature Map
  • OFM: Output Feature Map
  • a and b respectively represent the scaling factor and displacement factor of the scale layer.
  • the scale layer involves a large amount of calculation.
  • in related technologies, the calculation processing of the scale layer is generally completed by a central processing unit (Central Processing Unit, referred to as CPU), a graphics processing unit (Graphics Processing Unit, referred to as GPU), or a digital signal processor (Digital Signal Processor, referred to as DSP).
  • the CPU, GPU, and DSP are general-purpose processors that can handle many types of tasks; because of this generality, their underlying processing logic is complicated. As a result, calculating the scale layer on a general-purpose processor is slow and its calculation performance is low.
  • the invention provides a convolutional neural network operation method, device, equipment and storage medium, which can improve the calculation performance of the network layer.
  • the first aspect of the present invention provides an operation method of a convolutional neural network, which is applied to a processor that calculates a network layer in a convolutional neural network.
  • the processor includes a reading unit, a plurality of calculation units, and a writing unit, and the method includes:
  • the feature image is split according to the computing capabilities of the multiple computing units, and the multiple sets of image data and the calculation parameters obtained after the split are transferred to the multiple computing units, so that the multiple computing units each calculate the image data they receive;
  • the calculation results of the multiple calculation units are combined by the writing unit, and the combined calculation results are stored in the external memory.
  • the reading unit includes a first storage unit, and the reading of the characteristic image and calculation parameters input to the network layer from the external memory by the reading unit includes:
  • the sequentially read data are sequentially stored in the first storage unit.
  • the reading unit further includes a second storage unit
  • the method further includes:
  • the first storage unit sequentially transmits the read data to the second storage unit with a set delay.
  • the alternate reading of the characteristic image and the calculation parameter from the external memory includes:
  • D-1 image data of the preset data amount are sequentially read from the external memory.
  • the method further includes:
  • the configuration information corresponding to the characteristic image includes a starting storage address and a storage address length of the characteristic image in the external memory;
  • the configuration information corresponding to the calculation parameter includes a storage address of the calculation parameter in the memory.
  • the external memory includes a plurality of storage addresses, the plurality of storage addresses satisfy a set condition, and the plurality of storage addresses respectively store image data of the preset data amount from the characteristic image.
  • the computing capabilities of the multiple computing units reflect the amount of data that each computing unit can calculate at one time and the total number of the multiple computing units, and the splitting of the characteristic image according to the computing capabilities of the multiple computing units includes:
  • splitting the image data of the preset data amount according to the amount of data that each calculation unit can calculate at one time and the total number of units.
  • the splitting the image data of the preset data amount according to the amount of data that can be calculated by each calculation unit at a time and the total number of units includes:
  • the image data of the preset data amount is divided into equal parts according to the number of divisions.
  • the network layer includes an ELTWISE layer or a scale layer.
  • the second aspect of the present invention provides an arithmetic device for a convolutional neural network.
  • the device includes a reading unit, a plurality of calculation units, and a writing unit, wherein:
  • the reading unit is used to read the characteristic image and calculation parameters of the input network layer from an external memory, split the characteristic image according to the calculation capabilities of the multiple calculation units, and transfer the multiple sets of image data obtained after splitting, together with the calculation parameters, to the plurality of calculation units;
  • the multiple calculation units are used to calculate the image data received respectively;
  • the writing unit is configured to combine the calculation results of the multiple calculation units, and store the combined calculation results in the external memory.
  • the reading unit includes a first storage unit, and the first storage unit is configured to:
  • the reading unit further includes a second storage unit, and the first storage unit is further configured to:
  • the read data is sequentially transmitted to the second storage unit with a set delay.
  • the first storage unit is configured to:
  • D-1 image data of the preset data amount are sequentially read from the external memory.
  • the device further includes an acquiring unit configured to acquire configuration information corresponding to the characteristic image and the calculation parameters;
  • the first storage unit is configured to read the characteristic image and the calculation parameter from the external memory based on the configuration information.
  • the configuration information corresponding to the characteristic image includes a starting storage address and a storage address length of the characteristic image in the external memory;
  • the configuration information corresponding to the calculation parameter includes a storage address of the calculation parameter in the memory.
  • the external memory includes a plurality of storage addresses, the plurality of storage addresses satisfy a set condition, and the plurality of storage addresses respectively store image data of the preset data amount from the characteristic image.
  • the computing capabilities of the multiple computing units reflect the amount of data that each computing unit can calculate at one time and the total number of the multiple computing units, and the reading unit is configured to:
  • split the image data of the preset data amount according to the amount of data that each calculation unit can calculate at one time and the total number of units.
  • the reading unit is configured to:
  • the image data of the preset data amount is divided into equal parts according to the number of divisions.
  • the network layer includes an ELTWISE layer or a scale layer.
  • a third aspect of the present invention provides an electronic device including the computing device of the convolutional neural network described in the second aspect and a memory external to the computing device of the convolutional neural network.
  • a computer-readable storage medium is provided.
  • the storage medium is a computer-readable storage medium in which program instructions are stored, and the program instructions are used to implement the operation method of the convolutional neural network described in the above first aspect.
  • a dedicated processor provided with a plurality of computing units is used to perform network layer arithmetic processing.
  • multiple sets of image data can be processed at the same time, which increases the speed of network layer calculations and thus improves the computational performance of the processor's network layer processing.
  • Fig. 1 is a schematic diagram showing a scale layer arithmetic processing according to an exemplary embodiment
  • Fig. 2 is a schematic structural diagram of a processor according to an exemplary embodiment
  • Fig. 3 is a schematic flow chart showing an operation method of a convolutional neural network according to an exemplary embodiment
  • Fig. 4 is a schematic structural diagram of a processor according to an exemplary embodiment
  • Fig. 5 is a schematic flow chart showing an operation method of a convolutional neural network according to an exemplary embodiment
  • Fig. 6 is a schematic diagram showing an aligned storage according to an exemplary embodiment
  • Fig. 7 is a schematic diagram showing a sequence of reading data according to an exemplary embodiment
  • Fig. 8 is a schematic diagram showing a sequence of data transmission according to an exemplary embodiment
  • Fig. 9 is a schematic structural diagram showing a processor according to an exemplary embodiment
  • Fig. 10 is a schematic structural diagram of a processor according to an exemplary embodiment
  • Fig. 11 is a schematic structural diagram showing a processor according to an exemplary embodiment
  • Fig. 12 is a schematic structural diagram of a processor according to an exemplary embodiment
  • Fig. 13 is a schematic structural diagram of a processor according to an exemplary embodiment
  • Fig. 14 is a schematic structural diagram showing a processor according to an exemplary embodiment
  • Fig. 15 is a schematic structural diagram showing an electronic device according to an exemplary embodiment.
  • An exemplary embodiment of the present invention provides an operation method of a convolutional neural network.
  • Convolutional Neural Network is abbreviated as CNN.
  • the BN layer can be used in combination with the scale layer.
  • the arithmetic processing of the BN layer includes:
  • μ_B is the mean value
  • x i is the i-th pixel in the feature image
  • m is the number of all pixels in the feature image.
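  The bullets above list the symbols of the BN statistics without the formulas themselves. As a sketch under standard batch-normalization assumptions (mean μ_B over the m pixels of a feature image, then variance normalization; not necessarily the exact form used in the original), the computation is:

```python
import numpy as np

def batch_norm(feature, eps=1e-5):
    """Normalize one feature image: subtract the mean mu_B of its m pixels
    and divide by the standard deviation (standard BN statistics)."""
    mu_b = feature.mean()    # mean over all m pixels
    var_b = feature.var()    # variance over all m pixels
    return (feature - mu_b) / np.sqrt(var_b + eps)
```

  After this normalization the data has roughly zero mean and unit variance, which is exactly the distribution the subsequent scale layer then restores flexibility to.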
  • the arithmetic processing of the scale layer can be performed.
  • the scale layer plays the role of restoring the feature distribution, adding flexibility on the basis of regularization and making the training of the network more efficient. Taking the sigmoid activation function as an example, regularization may confine the data to the near-linear region of the sigmoid, losing the distribution learned by the previous layer and the non-linear characteristics of the sigmoid; adding the scale layer can solve this problem.
  • the arithmetic processing of the scale layer includes:
  • the input of the scale layer may include an input feature map (Input Feature Map, abbreviated as IFM) and calculation parameters.
  • the calculation parameters include a and b, where a is the scaling factor and b is the displacement.
  • the output of the scale layer may include an output feature image (Output Feature Map, abbreviated as OFM).
  • the IFM can be a three-dimensional array, and the three-dimensional array corresponds to multiple feature images.
  • the image size can be recorded as H × W × N (image height × image width × image channels).
  • the calculation parameter can be a one-dimensional array, and the parameter size can be denoted as N (channel).
  • OFM and IFM have the same dimensions and are also three-dimensional arrays.
  • the arithmetic processing of the scale layer mainly involves multiplication and addition operations.
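  The multiply-add of the scale layer (y = a·x + b, with one pair (a, b) shared by the whole H × W feature image of each channel) can be sketched as follows; the array shapes and function name are illustrative, not the patent's API:

```python
import numpy as np

def scale_layer(ifm, a, b):
    """ifm: H x W x N input feature map; a, b: length-N scale and shift.
    Each channel n is computed as OFM[:, :, n] = a[n] * IFM[:, :, n] + b[n]."""
    return ifm * a.reshape(1, 1, -1) + b.reshape(1, 1, -1)
```

  The OFM produced this way has the same H × W × N dimensions as the IFM, matching the description above.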
  • An exemplary embodiment of the present invention provides a convolutional neural network operation method, which can be applied to a processor that calculates a scale layer.
  • the processor includes a reading unit 210, a plurality of calculation units 220, and a writing unit 230.
  • the processor may be a processor dedicated to performing calculations on the network layer in the convolutional neural network, and the network layer may be a scale layer or an ELTWISE layer.
  • through multiplexing, the same processor can be used to perform either ELTWISE layer arithmetic processing or scale layer arithmetic processing.
  • the processing flow of the method may include the following steps:
  • step S301 the feature image and calculation parameters of the input network layer are read from the external memory by the reading unit 210.
  • the processor can read the IFM and calculation parameters from the external memory through the reading unit 210.
  • the external memory may be various types of memory, such as random access memory (Random-Access Memory, abbreviated as RAM), and the RAM may include DSRAM.
  • the IFM and calculation parameters can be read from the external memory, so that the subsequent calculation processing of the scale layer can be performed based on the IFM and calculation parameters.
  • the IFM includes multiple characteristic images. In the actual data reading process, a part of the data can be read from the IFM in a zigzag scanning manner one by one.
  • the IFM can be read in stages, and only a part of the data in the IFM can be read each time until the entire IFM is read into the processor.
  • the data volume of the calculation parameter is small, and the calculation parameter can be read into the processor at one time if the read bandwidth of the external memory allows. It should be noted that in the process of reading data from the external memory, the entire IFM is continuously read.
  • the IFM includes multiple feature images, and each feature image is on a channel. In the subsequent process of arithmetic processing, the arithmetic processing can be performed with a feature image and the corresponding calculation parameter as the unit, because a feature image can use the same a and b to perform multiplication and addition operations.
  • the reading unit 210 may include a first storage unit 411 and a second storage unit 412.
  • the first storage unit 411 may be a first-in first-out (FIFO) storage unit, which is represented by IFM_FIFO in the embodiment of the present invention.
  • the data that the processor reads from the outside can first be temporarily stored in IFM_FIFO.
  • the second storage unit 412 may be a ping-pong register, which is represented by a ping-pong buffer in the embodiment of the present invention.
  • the reading unit 210 may include one storage unit or multiple storage units. The types of the multiple storage units may be the same or different, and the number and types of storage units may be set according to actual needs.
  • a first storage unit 411 and a second storage unit 412 may be provided in the reading unit 210.
  • the rate at which the processor reads data from the external memory is the first rate, and the rate at which the processor performs arithmetic processing on the data is the second rate.
  • if the first rate is greater than the second rate, the read rate and the processing rate do not match: data is read faster than it is actually processed. The first storage unit 411 can therefore be provided to solve this problem.
  • the data read by the processor from the external memory can be temporarily stored in the first storage unit 411, and the first storage unit 411 can transmit the temporarily stored data to the second storage unit 412 at a lower rate.
  • the second storage unit 412 may transmit the temporarily stored data to the calculation unit 220 according to the calculation capability of the calculation unit 220. In this way, even if data is read from the external memory at a relatively large rate, the data will not be lost due to the low computing power of the computing unit 220.
  • data transmission is controlled according to the computing power of the calculation unit 220, at no more than the rate the calculation unit 220 can consume, so no read/write bandwidth is wasted: only as much data as the calculation unit 220 can process is transmitted to it for calculation processing.
  • the step of reading the feature image of the input scale layer and the calculation parameters from the external memory through the reading unit 210 may include: alternately reading the feature image and the calculation parameters from the external memory, wherein each time the feature image is read from the feature image Image data of a preset amount of data is read; the sequentially read data are sequentially stored in the first storage unit 411.
  • the preset amount of data can be recorded as M. The processor can alternately initiate read data requests to the external memory for the characteristic image and for the calculation parameters: for example, the first request reads the characteristic image, the second request reads the calculation parameters, and the third request reads the characteristic image again. Because the amount of data in the feature image is large, each read operation reads M bits of image data of the feature image from the external memory. As each read data request returns data from the external memory, the read data can be sequentially stored in IFM_FIFO. M can be a preset value that is manually configured or taken as a default of the external memory; the preset data amount M can also be the data bit width of the external memory.
  • the step of alternately reading the characteristic image and calculating the parameters from the external memory may include: determining the number of reading times D corresponding to the characteristic image according to the total data amount of the characteristic image and the preset data amount; through one reading operation, Read the image data of the preset data amount from the external memory; read the calculation parameters from the external memory through one read operation; read D-1 preset data from the external memory sequentially through D-1 read operations Set the amount of image data.
  • the processor may determine the total data amount N of the feature image, which refers to a feature image in the IFM, or a feature image on a channel in the IFM.
  • each time, the processor can read M bits of image data of the feature image from the external memory, so if the entire feature image needs to be read, a total of D = ⌈N/M⌉ read data requests need to be initiated to the external memory, that is, a total of D read operations are needed to read the entire feature image.
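  The read count follows directly from the total data amount N and the per-read amount M; a minimal sketch of the ceiling division described above:

```python
def read_count(total_n, per_read_m):
    """Number of read operations D needed to fetch a feature image of
    total_n units when each read returns per_read_m units: ceil(N / M)."""
    return -(-total_n // per_read_m)  # ceiling division without floats
```

  For example, a feature image of 1000 units read 64 units at a time needs 16 read operations, the last one partially filled.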
  • after the processor determines the number of reading times D required to read the entire feature image, it can alternately initiate read data requests to the external memory.
  • the processor can initiate a read data request for the feature image the first time, a read data request for the calculation parameters corresponding to the feature image the second time, and a read data request for the feature image again the third time.
  • the read data requests for the characteristic image are repeated until the last part of the characteristic image is read by the request initiated the D-th time.
  • the read data can be sequentially stored in IFM_FIFO.
  • the image data A of M bits in the feature image of the i-th channel can be stored in the first position in IFM_FIFO, and the calculation parameters of a and b corresponding to the i-th channel can be stored in the second position in IFM_FIFO.
  • an over-limit signal is sent to prompt the processor to suspend reading the data.
  • if the processor receives the over-limit signal sent by the external memory, it suspends initiating read requests to the external memory, so that the reading of data from the external memory is suspended.
  • the processor is temporarily prevented from continuing to issue read requests to the external memory.
  • the method provided by the embodiment of the present invention may further include: the first storage unit 411 sequentially transmits the read data to the second storage unit 412 with a set delay.
  • the IFM_FIFO can transmit to the ping-pong buffer with a set delay.
  • the IFM_FIFO can be implemented through software or hardware to transmit the read data to the ping-pong buffer with a set delay. For example, if it is implemented by software, a timer can be set. When the timer reaches the preset duration, the IFM_FIFO transmits the data read once to the ping-pong buffer. If implemented by hardware, a delay component can be set between IFM_FIFO and ping-pong buffer, and IFM_FIFO transmits the read data to the ping-pong buffer through the delay component. IFM_FIFO solves the problem of the mismatch between the data read rate and the processing rate inside and outside the processor.
  • the second storage unit 412 may include multiple ping-pong buffers, and the processor may sequentially store the data in the IFM_FIFO into the multiple ping-pong buffers. For example, the M bits of image data A in the feature image of the i-th channel are stored in the first ping-pong buffer, the calculation parameters a and b corresponding to the i-th channel are stored in the second ping-pong buffer, the M bits of image data B in the feature image of the i-th channel are stored in the third ping-pong buffer, the M bits of image data C are stored in the fourth ping-pong buffer, and so on, until the entire feature image of the i-th channel has been output.
  • the method provided by the embodiment of the present invention may further include: acquiring configuration information corresponding to the characteristic image and the calculation parameters, so that the first storage unit 411 reads the characteristic image and calculation parameters from the external memory based on the configuration information.
  • the technician can input the first configuration information corresponding to the characteristic image and the second configuration information corresponding to the calculation parameters, and can input the first configuration information and the second configuration information into the processor through a preset bus interface.
  • the preset bus interface may be a peripheral bus (Advanced Peripheral Bus, abbreviated as APB) interface.
  • the processor may receive the first configuration information and the second configuration information from the APB interface, and temporarily store the first configuration information and the second configuration information in a third storage unit, which may be represented by INSTR_FIFO.
  • Each IFM corresponds to one configuration information. If multiple IFMs need to be read, the configuration information corresponding to the multiple IFMs can be obtained at the same time, and the configuration information corresponding to the multiple IFMs can be sequentially stored in the INSTR_FIFO.
  • the processor may also analyze the configuration information stored in the INSTR_FIFO, so that the first storage unit 411 reads the characteristic image and calculation parameters from the external memory based on the analyzed configuration information.
  • the first configuration information may include the data amount M of the image data read from the feature image each time. After parsing the first configuration information, the first storage unit 411 can follow its instructions to read M bits of image data from the feature image each time.
  • the first configuration information may include the initial storage address and the storage address length of the feature image in the external memory; the second configuration information may include the storage address of the calculation parameter in the memory.
  • the characteristic image can be stored in an external memory, specifically in multiple storage addresses
  • the calculation parameters can also be stored in the external memory, specifically at one storage address. The first storage unit 411 can therefore determine from which storage address of the external memory the feature image is read according to the initial storage address of the feature image in the external memory. If the feature image is continuously stored in the external memory, the processor can read the image data stored at multiple consecutive storage addresses starting from the initial storage address to obtain the entire feature image. For example, if the starting storage address is 0 and the storage address length is 8, then the storage addresses are 0, 1, 2, 3, 4, 5, 6, and 7, and the processor can read the image data stored at storage addresses 0 through 7.
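  For illustration only, expanding the configuration (starting storage address plus storage address length, as in the 0 through 7 example above) into the concrete read sequence could look like this; the function name is invented:

```python
def addresses_to_read(start_addr, addr_len):
    """Expand the configuration (starting storage address + storage address
    length) into the list of consecutive addresses holding the feature image."""
    return list(range(start_addr, start_addr + addr_len))
```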
  • the second configuration information corresponding to the calculation parameter may only include the storage address of the calculation parameter in the memory. Since the data volume of the calculation parameter is small, it is possible to store all the calculation parameters with only one storage address corresponding to the storage space, instead of using multiple storage addresses for storage.
  • the configuration information corresponding to the characteristic image and the calculation parameter may also include operation mode information.
  • eltwise_mode may be used to represent the operation mode information.
  • if the processor is an ELTWISE layer processor, the ELTWISE layer processor can be multiplexed to realize the scale layer operation processing.
  • the calculation mode information can be used to indicate the calculation mode of the current processor. If the calculation mode information indicates that the calculation mode of the current processor is the ELTWISE layer calculation, the ELTWISE layer calculation is implemented by the ELTWISE layer processor; if it indicates the scale layer operation, the scale layer operation is realized through the ELTWISE layer processor.
  • the operation mode switching function can be realized through the operation mode information to multiplex the ELTWISE layer processor.
  • the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and the multiple storage addresses respectively store image data of a preset data amount in the characteristic image.
  • the IFM can be stored in the external memory in a certain format, and specifically can be stored in the external memory in an aligned manner. Multiple storage addresses can be selected in the external memory for the IFM. When the multiple storage addresses meet the preset conditions, the IFM can be considered to be stored in the external memory in an aligned manner.
  • the storage address is represented by W binary digits. If the highest-order part of the multiple storage addresses increases continuously while the lower W-1 digits are the same, the multiple storage addresses can be considered to meet the set condition. For example, as shown in Figure 6, suppose each storage address is represented by 4 hexadecimal digits; if the lower 3 digits of the 4 storage addresses are 0 and the highest-order digit is 1, 2, 3, 4 respectively, that is, the 4 storage addresses are 0x1000, 0x2000, 0x3000 and 0x4000, then the 4 storage addresses meet the set condition.
  • each storage address can store the values of 8 pixels, that is, the values of every 8 pixels are aligned and stored at one storage address, and all storage addresses are kept 64-bit aligned.
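  The set condition of the 0x1000/0x2000/0x3000/0x4000 example can be sketched as a check that all addresses share the same low-order bits; the 12-bit low field below is taken from that example and is an assumption, not a fixed parameter of the patent:

```python
def satisfies_set_condition(addrs, low_bits=12):
    """Check the aligned-storage condition from the example: all addresses
    share the same low-order bits (here 12, matching the 0x1000 stride),
    differing only in the high-order part."""
    mask = (1 << low_bits) - 1
    return all(a & mask == addrs[0] & mask for a in addrs)
```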
  • in step S302, the feature image is split according to the computing capabilities of the multiple computing units 220, and the multiple sets of image data and calculation parameters obtained after the split are transferred to the multiple computing units 220, so that the multiple computing units 220 each calculate the image data they receive.
  • the feature image can be split according to the computing capabilities of the multiple computing units 220, so that the split feature image is consistent with the computing capabilities of the multiple computing units 220. In this way, it is possible to avoid transmitting too much data to the calculation units 220, which would leave them unable to process all the data immediately and waste data transmission bandwidth, and also to avoid transmitting too little data, which would leave some of the multiple calculation units 220 idle and reduce the calculation speed.
  • each calculation unit 220 can receive its own set of to-be-calculated image data and calculation parameters, and each calculation unit 220 can perform scale layer calculation processing based on the received image data and calculation parameters.
  • the computing capabilities of the multiple computing units 220 reflect the amount of data that each computing unit 220 can calculate at one time and the total number of units of the multiple computing units 220.
• the step of splitting the feature image according to the computing capabilities of the multiple computing units 220 may include: determining, according to the amount of data that each computing unit 220 can calculate at one time and the total number of units, whether to split the read image data of the preset data amount; and if it is determined to split the image data of the preset data amount, splitting it according to the amount of data that each calculation unit 220 can calculate at one time and the total number of units.
• the amount of data that each calculation unit 220 can calculate at one time, that is, in one clock cycle, is 8 bits
• the total number of units of the multiple calculation units 220 can be denoted P (the formula here is an image placeholder in the original; with the values in Figure 5, P = 4), and the amount of image data read from the external memory each time is M bits. The product of the amount of data that each calculation unit 220 can calculate at one time and the total number of units is 8·P bits, which is the amount of data that all calculation units 220 can calculate simultaneously. Since M bits of image data are read from the external memory each time, all the read image data cannot be processed in one pass; M/(8·P) passes are needed. The preset data amount can therefore be divided by the product 8·P to obtain the number of splits.
  • the processor may divide the image data of the preset amount of data into equal parts according to the number of divided parts, and transmit one part of the divided image data and calculation parameters to the multiple calculation units 220 each time.
  • the processor may also perform serial-to-parallel conversion processing on each split image data, so as to convert each split image data into multiple sets of image data consistent with the total number of units of the multiple computing units 220.
• as shown in Figure 5, each split portion of image data, of M/2 bits, can undergo serial-to-parallel conversion to obtain 4 sets of image data, each of M/8 bits (the exact amounts are image placeholders in the original). When M is 64 bits, M/8 is 8 bits, so the data amount of each set of image data is 8 bits, exactly the amount of data that each calculation unit 220 can calculate at one time; therefore each calculation unit 220 can finish one set of image data in one pass.
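The split-and-distribute arithmetic above can be sketched in a few lines of Python. This is a toy model under the stated assumptions (8 bits per unit, 4 units, M divisible by the per-pass amount); the function names are hypothetical.

```python
def split_count(block_bits, unit_bits=8, num_units=4):
    """How many passes are needed: M / (unit_bits * num_units)."""
    per_pass = unit_bits * num_units  # amount all units can calculate at once
    assert block_bits % per_pass == 0, "M must divide evenly for equal splits"
    return block_bits // per_pass


def serial_to_parallel(portion, num_units=4):
    """Deal one split portion (a list of 8-bit samples) out as one group per unit."""
    group = len(portion) // num_units
    return [portion[i * group:(i + 1) * group] for i in range(num_units)]
```

For M = 64 bits this gives 2 passes, and a 4-sample portion is dealt out as one 8-bit sample per calculation unit.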
  • the process of data transmission and splitting can be completed according to the following sequence:
• (1) IFM_FIFO reads M bits of image data A from the feature image of the i-th channel and stores image data A in the first ping-pong buffer.
• (2) IFM_FIFO reads the a and b calculation parameters corresponding to the i-th channel and stores them in the second ping-pong buffer.
• (3) Image data A in the first ping-pong buffer is split, the calculation parameters are fetched from the second ping-pong buffer, and the split image data A and the calculation parameters are transmitted to the calculation units 220 in two passes. Meanwhile, IFM_FIFO reads M bits of image data B from the feature image of the i-th channel and stores image data B in the third ping-pong buffer.
• (4) Image data B in the third ping-pong buffer is split, the calculation parameters are fetched from the second ping-pong buffer, and the split image data B and the calculation parameters are transmitted to the calculation units 220 in two passes. Meanwhile, IFM_FIFO reads M bits of image data C from the feature image of the i-th channel and stores image data C in the fourth ping-pong buffer.
  • eltwise_rd_addr is a storage address signal
  • ifm1_addr represents a storage address signal of image data of a preset data amount in a feature image of the input scale layer
  • ifm2_addr represents a storage address signal of a calculation parameter.
  • eltwise_rd_en is the enable signal. At high level, the eltwise_rd_addr signal is valid, and at low level, the eltwise_rd_addr signal is invalid.
  • eltwise_rd_data represents the data read signal
  • ifm1_data1_blk1 represents the first image data signal
  • b1 and a1 represent calculation parameter signals, respectively
  • ifm1_data2_blk1 represents the second image data signal
  • ifm1_data3_blk1 represents the third image data signal.
  • Figure 8 shows the control logic timing of IFM_FIFO and ping-pong buffer.
  • the first image data is stored in the first storage location in IFM_FIFO.
  • the calculation parameters are stored in the second storage location in IFM_FIFO.
• based on ifm1_data2_blk1, the second image data is stored in the third storage location in IFM_FIFO. ifm1_data1_blk1 in IFM_FIFO is transmitted to the ping-pong buffer, where it is split to obtain ifm1_data1_h and ifm1_data1_l. Next, b1 and a1 in IFM_FIFO are transmitted to the ping-pong buffer; then ifm1_data2_blk1 is transmitted and split to obtain ifm1_data2_h and ifm1_data2_l.
• at the same time, the ping-pong buffer outputs ifm1_data1_h, ifm1_data1_l, ifm1_data2_h, ifm1_data2_l, b1 and a1, so that the calculation units 220 perform arithmetic processing based on the ping-pong buffer's output.
• by arranging multiple computing units 220 to perform arithmetic processing on multiple sets of image data in parallel, multiple sets of image data can be processed at the same time, which increases the speed of the scale-layer arithmetic processing.
  • the processor may convert the data format of the image data in advance, and input the converted image data and calculation parameters into the calculation unit 220 for calculation processing.
• the image data may be 8-bit data, and it can be shifted to convert it into data consistent with the data format of the calculation parameters. For example, 8-bit image data can be right-shifted and converted into 32-bit fixed-point image data; the calculation parameters are also 32-bit fixed-point data at this point, and two kinds of data in the same data format can be multiplied and added directly.
  • the technician can also configure the data format of the output characteristic image, which can be carried in the configuration information and input to the processor.
  • the processor may determine the target data format of the output characteristic image, and convert the image data output by the calculation unit 220 into data in the target data format.
• for example, the target data format can be set to 8-bit fixed-point data. The image data directly output by the calculation unit 220 is 32-bit fixed-point data, which can be shifted to convert its data format into the target data format: left-shifting the 32-bit fixed-point image data converts it into 8-bit fixed-point image data. After the shift, the numerical value of the original data is unchanged; only the precision of the data changes.
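The two shift-based conversions can be sketched as below. The exact fixed-point format is not specified in the text, so the 16 fractional bits and the shift directions here follow common fixed-point practice and are assumptions, not the patent's definition.

```python
FRAC_BITS = 16  # assumed number of fractional bits in the 32-bit fixed-point format


def to_fix32(x8):
    """Widen an 8-bit integer sample into the 32-bit fixed-point format."""
    return x8 << FRAC_BITS


def scale_fix32(x32, a32, b32):
    """y = a*x + b entirely in 32-bit fixed point (one rescale after the multiply)."""
    return ((a32 * x32) >> FRAC_BITS) + b32


def to_int8(y32):
    """Shift the fraction back out to recover an 8-bit result; the value is
    unchanged when representable, only the precision changes."""
    return (y32 >> FRAC_BITS) & 0xFF
```

The round trip `to_int8(to_fix32(x))` returns `x`, illustrating the text's point that shifting changes precision but not the numerical value.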
• in step S303, the calculation results of the multiple calculation units 220 are combined by the writing unit 230, and the combined calculation result is stored in the external memory.
  • each calculation unit 220 may perform calculation processing based on the image data and calculation parameters received, and output calculation results.
  • the writing unit 230 may combine the calculation results output by different calculation units 220.
• for example, assuming there are 4 calculation units 220 and each outputs a calculation result of M/8 bits (the exact amounts are image placeholders in the original), the combined calculation result of the 4 units is M/2 bits. Then two such combined results can be combined again, and the result of this second combination is M bits.
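The two-stage combining can be modelled as bit concatenation. A minimal Python sketch follows, assuming M = 64 and lowest-lane-first packing (the actual lane order is not specified in the text):

```python
def merge_results(parts, part_bits):
    """Concatenate equal-width results into one wider word, lowest lane first."""
    word = 0
    mask = (1 << part_bits) - 1
    for lane, part in enumerate(parts):
        word |= (part & mask) << (lane * part_bits)  # place each lane side by side
    return word
```

Merging four M/8 = 8-bit results yields an M/2 = 32-bit word; merging two such words yields M = 64 bits.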
• the writing unit 230 may obtain the initial storage address and storage-address length corresponding to the combined calculation result in the external memory, and store the combined calculation result in the external memory according to that initial storage address and storage-address length.
• the external memory may include a first external memory and a second external memory. The writing unit 230 stores the combined calculation result in the first external memory, and when the amount of calculation-result data stored in the first external memory reaches a preset threshold, the first external memory transfers the stored data to the second external memory.
• the first external memory may be a memory such as DSRAM, and the second memory may be a memory such as a double data rate (DDR) synchronous dynamic random-access memory.
  • the preset threshold may be the total amount of data of the OFM.
  • the entire OFM is gradually stored in the first external memory, and the entire OFM is transferred from the first external memory to the second external memory. In this way, repeated reading of data in the first external memory can be reduced, and data reading and writing bandwidth overhead can be saved.
  • the processor may also include four modules: Eltwise_Instr_Proc, FM_Loader, FM_Proc_Unit, and FM_Write_Back.
• in the Eltwise_Instr_Proc module, an INSTR_FIFO can also be set; the configuration information can be transmitted to INSTR_FIFO through the APB interface, and the Eltwise_Instr_Proc module parses the configuration information.
• in the FM_Loader module, an IFM_FIFO can also be set. IFM_FIFO reads the feature image and calculation parameters from the external memory according to the parsed configuration information and can transmit them to the FM_Proc_Unit module.
  • the FM_Proc_Unit module is used to perform calculation processing based on the characteristic image and calculation parameters.
  • the FM_Proc_Unit module outputs the calculation result to the OFM_FIFO in the FM_Write_Back module, and the OFM_FIFO stores the calculation result in the external memory according to the configuration information.
  • the Eltwise_Instr_Proc module may include INSTR_FIFO and INSTR_DECODER, and INSTR_DECODER is used to parse the configuration information.
  • the Eltwise_Instr_Proc module can be connected to the three modules FM_Loader, FM_Proc_Unit, and FM_Write_Back at the same time.
  • the FM_Loader module includes IFM_FIFO, ifm_fifo_rd, flip-flops and ifm_rdata_pp_buffer.
  • ⁇ eltwise_rd_vld, eltwise_rd_data ⁇ represents the signal input to the IFM_FIFO
  • eltwise_rd_data represents the feature image and calculation parameters.
  • ifm_fifo_rd outputs the ifm_fifo_ren signal to IFM_FIFO, which is used to control IFM_FIFO.
• the flip-flops are used to delay the transmission of the feature image and calculation parameters to ifm_rdata_pp_buffer.
• the feature image is split by ifm_rdata_pp_buffer to obtain the two signals ifm1_data and ifm2_data; after the flip-flops, the two calculation-parameter signals scale_a and scale_b are output directly.
  • the FM_Proc_Unit module includes multiple Pre_Fix_Point, EltWise_Proc_Unit and Post_Fix_Point.
  • Pre_Fix_Point is used to perform data format conversion processing on the input image data and calculation parameters
  • EltWise_Proc_Unit is used to perform calculation processing based on the converted image data and calculation parameters.
  • Post_Fix_Point is used to convert the data format of the calculation result.
  • the input of the entire FM_Proc_Unit module is the ifm1_data and ifm2_data signals, where ifm1_data represents the feature image, and ifm2_data represents the calculation parameter.
  • the output of the entire FM_Proc_Unit module is the ofm_data signal, and ofm_data can represent the calculated result after conversion.
• in one possible implementation, the internal circuit structure of the EltWise_Proc_Unit module is shown in Fig. 13.
  • the inputs include ifm1_data, ifm2_data, coeff_a, and scale_b signals, where coeff_a and scale_b represent calculation parameters.
  • EltWise_Proc_Unit can include multiple flip-flops (registers), multiplexers, multipliers and other electrical components. Extension in FIG. 13 represents an expandable part.
  • the FM_Write_Back module may include OFM_FIFO, OFM_DATA_PACKER, and ofm_wr_addr_gen.
  • OFM_FIFO is used to collect calculation results
  • OFM_DATA_PACKER is used to merge calculation results
• ofm_wr_addr_gen is used to store the combined calculation results at the storage address in DSRAM indicated by the configuration information.
  • a dedicated processor provided with a plurality of calculation units is used to perform scale-level operation processing.
• by using multiple computing units to perform arithmetic processing on multiple sets of image data in parallel, multiple sets of image data can be processed at the same time, which increases the speed of scale-layer arithmetic processing and thereby improves the computational performance of the processor for scale-layer processing.
  • the device may be the processor in the foregoing embodiment.
  • the device may include a reading unit 210, a plurality of calculation units 220, and a writing unit 230, where:
  • the reading unit 210 is configured to read the feature image and calculation parameters input to the scale layer from an external memory; split the feature image according to the computing capabilities of the multiple computing units 220, and split the The multiple sets of image data and the calculation parameters obtained later are transferred to the multiple calculation units 220;
  • the multiple calculation units 220 are configured to perform calculations on the image data received respectively;
  • the writing unit 230 is configured to combine the calculation results of the multiple calculation units 220 and store the combined calculation result in the external memory.
• the reading unit 210 includes a first storage unit, and the first storage unit is configured to: alternately read the feature image and the calculation parameters from the external memory, reading image data of a preset data amount from the feature image each time; and store the sequentially read data.
• the reading unit 210 further includes a second storage unit, and the first storage unit is further configured to: transmit the read data sequentially to the second storage unit with a set delay.
• the first storage unit is configured to: determine, according to the total data amount of the characteristic image and the preset data amount, a read count D corresponding to the characteristic image; read image data of the preset data amount from the external memory through one read operation; read the calculation parameters from the external memory through one read operation; and sequentially read D-1 pieces of image data of the preset data amount from the external memory through D-1 read operations.
  • the device further includes an acquiring unit configured to acquire configuration information corresponding to the characteristic image and the calculation data;
  • the first storage unit is configured to read the characteristic image and the calculation parameter from the external memory based on the configuration information.
  • the configuration information corresponding to the characteristic image includes a starting storage address and a storage address length of the characteristic image in the external memory;
  • the configuration information corresponding to the calculation parameter includes a storage address of the calculation parameter in the memory.
• the external memory includes a plurality of storage addresses, the plurality of storage addresses satisfy a set condition, and the plurality of storage addresses respectively store image data of the preset data amount from the characteristic image.
• the computing capabilities of the multiple computing units 220 reflect the amount of data that can be calculated by each computing unit 220 at one time and the total number of units of the multiple computing units 220, and the reading unit 210 is configured to: determine, according to the amount of data that each computing unit 220 can calculate at one time and the total number of units, whether to split the read image data of the preset data amount; and when it is determined to split, split the image data of the preset data amount accordingly.
• the reading unit 210 is configured to: determine the product of the amount of data that each calculation unit 220 can calculate at one time and the total number of units; divide the preset data amount by the product to obtain a number of splits; and equally divide the image data of the preset data amount according to the number of splits.
• the arithmetic device of the convolutional neural network shown in Fig. 2 can execute the methods of the embodiments shown in Figs. 1 to 14. For the parts not described in detail in this embodiment, as well as the implementation process and technical effects of this technical solution, refer to the related descriptions of the embodiments shown in Figs. 1 to 14, which are not repeated here.
  • an embodiment of the present invention provides an electronic device.
  • the electronic device includes a convolutional neural network computing device 1501 and a memory 1502 external to the convolutional neural network computing device 1501.
• the convolutional neural network computing device 1501 is the computing device of the convolutional neural network shown in FIG. 2 described above.
• the embodiment of the present invention also provides a computer-readable storage medium; the storage medium is a computer-readable storage medium storing program instructions, and the program instructions are used to implement the methods of the embodiments shown in FIG. 1 to FIG. 14.

Abstract

The present invention relates to a computing method, apparatus, device and storage medium for a convolutional neural network, belonging to the technical field of artificial intelligence. The method is applied to a processor that performs computation for a network layer in a convolutional neural network, the processor including a reading unit, multiple calculation units and a writing unit. The method includes: reading, through the reading unit, a feature image and calculation parameters input to the network layer from an external memory; splitting the feature image according to the computing capabilities of the multiple calculation units, and transferring the multiple sets of image data obtained after splitting, together with the calculation parameters, to the multiple calculation units so that the multiple calculation units perform calculation on the image data they each receive; and combining, through the writing unit, the calculation results of the multiple calculation units and storing the combined calculation result in the external memory. The present invention can increase the speed of network-layer arithmetic processing and thereby improve the computational performance of the processor for network-layer processing.

Description

Computing Method, Apparatus, Device and Storage Medium for a Convolutional Neural Network
Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to a computing method, apparatus, device and storage medium for a convolutional neural network.
Background
Extracting feature information from images has long been an important research direction in computer vision. A trained convolutional neural network (CNN) can accurately extract feature information from an image and classify the image based on that information. A CNN can be composed of different types of network layers, including convolutional layers, pooling layers, activation layers, fully connected layers, batch-normalization (BN) layers, scale layers, and so on.
The scale layer mainly performs the operation y = ax + b, where x denotes the input data of the scale layer, i.e. the input feature map (IFM), y denotes the output data of the scale layer, i.e. the output feature map (OFM), and a and b denote the scaling coefficient and the shift coefficient of the scale layer, respectively.
The scale layer involves a large amount of data computation. In the related art, scale-layer processing is generally performed by a central processing unit (CPU), a graphics processing unit (GPU) or a digital signal processor (DSP). CPUs, GPUs and DSPs are general-purpose processors that can handle many types of tasks; because of this variety, their underlying processing logic is relatively complex. Performing scale-layer computation with a general-purpose processor is therefore slow, and its computational performance is low.
Summary of the Invention
The present invention provides a computing method, apparatus, device and storage medium for a convolutional neural network, which can improve the computational performance of a network layer.
A first aspect of the present invention provides a computing method for a convolutional neural network, applied to a processor that performs computation for a network layer in the convolutional neural network, the processor including a reading unit, multiple calculation units and a writing unit, the method including:
reading, through the reading unit, a feature image and calculation parameters input to the network layer from an external memory;
splitting the feature image according to the computing capabilities of the multiple calculation units, and transferring the multiple sets of image data obtained after splitting, together with the calculation parameters, to the multiple calculation units, so that the multiple calculation units perform calculation on the image data they each receive;
combining, through the writing unit, the calculation results of the multiple calculation units, and storing the combined calculation result in the external memory.
Optionally, the reading unit includes a first storage unit, and reading the feature image and calculation parameters input to the network layer from the external memory through the reading unit includes:
alternately reading the feature image and the calculation parameters from the external memory, wherein image data of a preset data amount is read from the feature image each time;
sequentially storing the sequentially read data in the first storage unit.
Optionally, the reading unit further includes a second storage unit, and the method further includes:
the first storage unit sequentially transmitting the read data to the second storage unit with a set delay.
Optionally, alternately reading the feature image and the calculation parameters from the external memory includes:
determining, according to the total data amount of the feature image and the preset data amount, a read count D corresponding to the feature image;
reading image data of the preset data amount from the external memory through one read operation;
reading the calculation parameters from the external memory through one read operation;
sequentially reading D-1 pieces of image data of the preset data amount from the external memory through D-1 read operations.
Optionally, the method further includes:
acquiring configuration information corresponding to the feature image and the calculation data, so that the first storage unit reads the feature image and the calculation parameters from the external memory based on the configuration information.
Optionally, the configuration information corresponding to the feature image includes a starting storage address and a storage-address length of the feature image in the external memory;
the configuration information corresponding to the calculation parameters includes a storage address of the calculation parameters in the memory.
Optionally, the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and the multiple storage addresses respectively store image data of the preset data amount from the feature image.
Optionally, the computing capabilities of the multiple calculation units reflect the amount of data that each calculation unit can calculate at one time and the total number of the multiple calculation units, and splitting the feature image according to the computing capabilities of the multiple calculation units includes:
determining, according to the amount of data that each calculation unit can calculate at one time and the total number of units, whether to split the read image data of the preset data amount;
if it is determined to split the image data of the preset data amount, splitting the image data of the preset data amount according to the amount of data that each calculation unit can calculate at one time and the total number of units.
Optionally, splitting the image data of the preset data amount according to the amount of data that each calculation unit can calculate at one time and the total number of units includes:
determining the product of the amount of data that each calculation unit can calculate at one time and the total number of units;
dividing the preset data amount by the product to obtain a number of splits;
equally dividing the image data of the preset data amount according to the number of splits.
Optionally, the network layer includes an ELTWISE layer or a scale layer.
A second aspect of the present invention provides a computing apparatus for a convolutional neural network, the apparatus including a reading unit, multiple calculation units and a writing unit, wherein:
the reading unit is configured to read a feature image and calculation parameters input to a network layer from an external memory, split the feature image according to the computing capabilities of the multiple calculation units, and transfer the multiple sets of image data obtained after splitting, together with the calculation parameters, to the multiple calculation units;
the multiple calculation units are configured to perform calculation on the image data they each receive;
the writing unit is configured to combine the calculation results of the multiple calculation units and store the combined calculation result in the external memory.
Optionally, the reading unit includes a first storage unit configured to:
alternately read the feature image and the calculation parameters from the external memory, wherein image data of a preset data amount is read from the feature image each time;
store the sequentially read data.
Optionally, the reading unit further includes a second storage unit, and the first storage unit is further configured to:
transmit the read data sequentially to the second storage unit with a set delay.
Optionally, the first storage unit is configured to:
determine, according to the total data amount of the feature image and the preset data amount, a read count D corresponding to the feature image;
read image data of the preset data amount from the external memory through one read operation;
read the calculation parameters from the external memory through one read operation;
sequentially read D-1 pieces of image data of the preset data amount from the external memory through D-1 read operations.
Optionally, the apparatus further includes an acquiring unit configured to acquire configuration information corresponding to the feature image and the calculation data;
the first storage unit is configured to read the feature image and the calculation parameters from the external memory based on the configuration information.
Optionally, the configuration information corresponding to the feature image includes a starting storage address and a storage-address length of the feature image in the external memory;
the configuration information corresponding to the calculation parameters includes a storage address of the calculation parameters in the memory.
Optionally, the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and the multiple storage addresses respectively store image data of the preset data amount from the feature image.
Optionally, the computing capabilities of the multiple calculation units reflect the amount of data that each calculation unit can calculate at one time and the total number of the multiple calculation units, and the reading unit is configured to:
determine, according to the amount of data that each calculation unit can calculate at one time and the total number of units, whether to split the read image data of the preset data amount;
when it is determined to split the image data of the preset data amount, split the image data of the preset data amount according to the amount of data that each calculation unit can calculate at one time and the total number of units.
Optionally, the reading unit is configured to:
determine the product of the amount of data that each calculation unit can calculate at one time and the total number of units;
divide the preset data amount by the product to obtain a number of splits;
equally divide the image data of the preset data amount according to the number of splits.
Optionally, the network layer includes an ELTWISE layer or a scale layer.
A third aspect of the present invention provides an electronic device including the computing apparatus for a convolutional neural network of the second aspect and a memory external to the computing apparatus.
A fourth aspect of the present invention provides a computer-readable storage medium; the storage medium is a computer-readable storage medium storing program instructions, and the program instructions are used to implement the computing method for a convolutional neural network of the first aspect.
The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:
In the method provided by the embodiments of the present invention, a dedicated processor provided with multiple calculation units performs network-layer arithmetic processing. By having the multiple calculation units perform arithmetic processing on multiple sets of image data in parallel, multiple sets of image data can be processed at the same time, which increases the speed of network-layer processing and thereby improves the computational performance of the processor for network-layer processing.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and constitute a part of the present invention; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a schematic diagram of scale-layer arithmetic processing according to an exemplary embodiment;
Fig. 2 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 3 is a schematic flowchart of a computing method for a convolutional neural network according to an exemplary embodiment;
Fig. 4 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 5 is a schematic flowchart of a computing method for a convolutional neural network according to an exemplary embodiment;
Fig. 6 is a schematic diagram of aligned storage according to an exemplary embodiment;
Fig. 7 is a timing diagram of reading data according to an exemplary embodiment;
Fig. 8 is a timing diagram of transmitting data according to an exemplary embodiment;
Fig. 9 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 10 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 11 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 12 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 13 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 14 is a schematic structural diagram of a processor according to an exemplary embodiment;
Fig. 15 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the present invention.
An exemplary embodiment of the present invention provides a computing method for a convolutional neural network. A convolutional neural network (CNN) can be composed of different types of network layers such as convolutional layers, pooling layers, activation layers, fully connected layers, ELTWISE layers, batch-normalization (BN) layers and scale layers. The BN layer can be used in combination with the scale layer.
As networks become deeper, changes in the distribution of each layer's input data hinder the training of deep networks; because the data distribution keeps changing, only a lower learning rate can be used, making training increasingly difficult. This phenomenon is called internal covariate shift. The BN layer was therefore proposed on the basis of the whitening operation; it regularizes the mean and variance of each layer's input, which effectively speeds up the training of deep networks, weakens the dependency of the calculation parameters, and allows a higher learning rate.
The arithmetic processing of the BN layer includes the following (the original shows the formulas as image placeholders; they are reconstructed below in standard batch-normalization notation):

μ_B = (1/m) Σ_{i=1..m} x_i    (1)

where μ_B is the mean, x_i is the i-th pixel in the feature image, and m is the number of all pixels in the feature image.

σ_B² = (1/m) Σ_{i=1..m} (x_i − μ_B)²    (2)

where σ_B² is the variance.

x̂_i = (x_i − μ_B) / √(σ_B² + ε)    (3)

where x̂_i is the normalized value of x_i.
After the BN-layer processing, scale-layer processing can be performed. The scale layer restores the feature distribution, adding flexibility on top of regularization and making network training more efficient. Taking the sigmoid activation function as an example, the distribution of the regularized data may lose the nonlinear character of the sigmoid passed from the previous layer; adding the scale layer solves exactly this problem.
The arithmetic processing of the scale layer includes:

y_i = a · x̂_i + b    (4)

For the scale layer, the input may include an input feature map (IFM) and calculation parameters. In formula (4), x̂_i is the IFM; the calculation parameters include a and b, where a is the scaling ratio and b is the shift. The output of the scale layer may include an output feature map (OFM). The IFM may be a three-dimensional array corresponding to multiple feature images, with size recorded as H×W×N (image width × image height × image channels). The calculation parameters may be a one-dimensional array of size N (channels). The OFM has the same dimensions as the IFM and is also a three-dimensional array. Scale-layer processing mainly involves multiplication and addition.
As shown in Fig. 1, suppose a = 1 and b = 2 for the first channel. A multiply-add operation is performed in turn on each pixel of the first channel of the IFM to obtain the value of the corresponding pixel of the first feature image of the OFM; sliding the calculation window to the bottom-right corner of the IFM yields the first feature image of the OFM. Repeating this process for each channel (different channels may use different calculation parameters a and b) yields the complete OFM.
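The per-channel multiply-add walk just described (Fig. 1) can be sketched in Python; `scale_layer` is a hypothetical name, and the IFM is modelled as a channel-major nested list:

```python
def scale_layer(ifm, a, b):
    """y = a*x + b applied pixel-wise; ifm[c][row][col], with per-channel a[c], b[c]."""
    return [[[a[c] * x + b[c] for x in row]  # slide over every pixel of channel c
             for row in ifm[c]]
            for c in range(len(ifm))]
```

With a = 1 and b = 2 for the first channel, every pixel is simply incremented by 2, as in the Fig. 1 example.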
An exemplary embodiment of the present invention provides a computing method for a convolutional neural network, which can be applied to a processor that performs scale-layer computation. As shown in Fig. 2, the processor includes a reading unit 210, multiple calculation units 220 and a writing unit 230. The processor may be dedicated to computing a network layer of a convolutional neural network; the network layer may be a scale layer or an ELTWISE layer, and through multiplexing, the same processor can perform either ELTWISE-layer or scale-layer processing.
As shown in Fig. 3, the processing flow of the method may include the following steps:
Step S301: read the feature image and calculation parameters input to the network layer from the external memory through the reading unit 210.
In implementation, the processor may read the IFM and the calculation parameters from the external memory through the reading unit 210. The external memory may be any type of memory, such as random-access memory (RAM); the RAM may include DSRAM.
Before the processor actually performs the scale operation, the IFM and calculation parameters can be read from the external memory so that scale-layer processing can subsequently be performed based on them. The IFM includes multiple feature images; when actually reading data, a portion of the IFM can be read at a time, image by image, each image scanned in a zigzag order.
Since the IFM is large and the read bandwidth of the external memory is limited, the IFM can be read in multiple passes, each pass reading only part of the IFM, until the whole IFM has been read into the processor. The calculation parameters are small and, if the read bandwidth allows, can be read into the processor in one pass. Note that during reading, the whole IFM is read continuously; the IFM includes multiple feature images, one per channel. Subsequent processing can be performed per feature image with its corresponding calculation parameters, because one feature image uses the same a and b for its multiply-add operations.
As shown in Fig. 4, the reading unit 210 may include a first storage unit 411 and a second storage unit 412. The first storage unit 411 may be a first-in-first-out (FIFO) storage unit, denoted IFM_FIFO in the embodiments of the present invention; data read in from outside the processor is first buffered in IFM_FIFO. The second storage unit 412 may be a ping-pong register, denoted ping-pong buffer. Note that the reading unit 210 may include one or more storage units, of the same or different types; the number and kind of storage units can be set according to actual requirements.
In the embodiments of the present invention, the first storage unit 411 and the second storage unit 412 can be provided in the reading unit 210. The processor reads data from the external memory at a first rate and processes data at a second rate; if the first rate is greater than the second rate, the two rates do not match, data is read faster than it can be processed, and the first storage unit 411 is provided to solve this problem. In practice, data read from the external memory can be buffered in the first storage unit 411, which transfers the buffered data to the second storage unit 412 at a lower rate; the second storage unit 412 transfers the buffered data to the calculation units 220 according to their computing capability. In this way, even if data is read from the external memory at a high rate, no data is lost because of the limited computing capability of the calculation units 220; at the same time, the rate at which data is delivered to the calculation units 220 is controlled by their computing capability, so no read/write bandwidth is wasted: exactly as much data is delivered as the calculation units 220 can process.
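The rate-matching role of the FIFO plus ping-pong buffer can be illustrated with a toy producer/consumer model in Python; the FIFO depth and the one-block-per-cycle drain rate are assumptions for illustration only, not the patent's parameters:

```python
from collections import deque


def stream_blocks(blocks, fifo_depth=4):
    """Toy model: reads from 'external memory' are gated by FIFO occupancy,
    while the compute side drains one block per cycle, preserving order."""
    fifo = deque()
    delivered = []
    pending = list(blocks)
    while pending or fifo:
        if pending and len(fifo) < fifo_depth:  # back-pressure: stop reading when full
            fifo.append(pending.pop(0))
        if fifo:                                # ping-pong side hands a block to the units
            delivered.append(fifo.popleft())
    return delivered
```

However fast blocks arrive, none are dropped and order is preserved, which is the behaviour the text attributes to IFM_FIFO and the ping-pong buffer.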
Optionally, reading the feature image and calculation parameters input to the scale layer from the external memory through the reading unit 210 may include: alternately reading the feature image and the calculation parameters from the external memory, reading image data of a preset data amount from the feature image each time, and sequentially storing the read data in the first storage unit 411.
In implementation, the preset data amount can be denoted M. The processor can alternately issue read requests to the external memory for the feature image and for the calculation parameters: for example, a first request for the feature image, a second request for the calculation parameters, and a third request for the feature image again. Because the feature image is large, each read operation can read M bits of image data from it. Data returned by each request can be stored in IFM_FIFO in order. M may be a preset value, configured manually or taken as the external memory's default; the preset data amount M may also be the data bit-width of the external memory.
Optionally, alternately reading the feature image and the calculation parameters from the external memory may include: determining a read count D for the feature image from its total data amount and the preset data amount; reading M bits of image data from the external memory through one read operation; reading the calculation parameters from the external memory through one read operation; and sequentially reading the remaining D-1 pieces of M-bit image data through D-1 read operations.
In implementation, the processor can determine the total data amount N of the feature image, where the feature image is one feature image of the IFM, i.e. the feature image on one channel. Since each read operation reads M bits, reading the whole feature image requires D = ⌈N/M⌉ read requests to the external memory, i.e. ⌈N/M⌉ read operations in total (the formulas here are image placeholders in the original; ⌈N/M⌉ is the natural reconstruction).
After determining the read count D required for the whole feature image, the processor can alternately issue read requests to the external memory: first a request for the feature image, second a request for the corresponding calculation parameters, third a request for the feature image again, and so on, until the last portion of the feature image is read by the D-th feature-image request.
As shown in Fig. 5, as the processor alternately issues read requests, the data read each time can be stored into IFM_FIFO in order. For example, M bits of image data A of the i-th channel's feature image can be stored at the first position in IFM_FIFO, the a and b calculation parameters of the i-th channel at the second position, M bits of image data B at the third position, M bits of image data C at the fourth position, and so on, until the entire feature image of the i-th channel has been output.
Optionally, when the read requests issued by the processor exceed a limit, the external memory issues an over-limit signal to prompt the processor to postpone reading data.
If the processor receives the over-limit signal from the external memory, it suspends issuing read requests, thereby suspending reads from the external memory.
If the data in IFM_FIFO reaches a preset data-amount threshold, reading from the external memory is suspended. For example, when the data buffered in IFM_FIFO reaches the IFM_FIFO almost-full watermark, the processor is temporarily blocked from issuing further read requests to the external memory.
Optionally, the method provided by the embodiments of the present invention may further include: the first storage unit 411 sequentially transmitting the read data to the second storage unit 412 with a set delay.
In implementation, after IFM_FIFO receives the data read from the external memory, it can transmit the data to the ping-pong buffer with a set delay. This can be implemented in software or in hardware. In software, a timer can be set: whenever the timer reaches a preset duration, IFM_FIFO transmits one batch of read data to the ping-pong buffer. In hardware, a delay element can be placed between IFM_FIFO and the ping-pong buffer, through which IFM_FIFO transmits the read data. IFM_FIFO solves the mismatch between the data-read rate and the processing rate inside and outside the processor.
As shown in Fig. 5, in the embodiments of the present invention the second storage unit 412 may include multiple ping-pong buffers, and the processor can store the data in IFM_FIFO into them in turn: M bits of image data A of the i-th channel's feature image into the first ping-pong buffer, the a and b calculation parameters of the i-th channel into the second, M bits of image data B into the third, M bits of image data C into the fourth, and so on, until the entire feature image of the i-th channel has been output.
Optionally, the method may further include: acquiring configuration information corresponding to the feature image and the calculation data, so that the first storage unit 411 reads the feature image and the calculation parameters from the external memory based on the configuration information.
In implementation, how data is read from the external memory can be configured. A technician can input first configuration information corresponding to the feature image and second configuration information corresponding to the calculation data, which can be input to the processor through a preset bus interface. The preset bus interface may be an Advanced Peripheral Bus (APB) interface.
The processor can receive the first and second configuration information from the APB interface and buffer them in a third storage unit, denoted INSTR_FIFO. Each IFM corresponds to one piece of configuration information; if multiple IFMs are to be read, the configuration information for each can be acquired at the same time and stored in INSTR_FIFO in order. The processor can also parse the configuration information stored in INSTR_FIFO, so that the first storage unit 411 reads the feature image and calculation parameters from the external memory based on the parsed configuration information.
The first configuration information may include the data amount M of the image data read from the feature image each time; after parsing, the first storage unit 411 reads M bits of image data from the feature image each time, as indicated by the first configuration information.
Optionally, the first configuration information may include the starting storage address and storage-address length of the feature image in the external memory, and the second configuration information may include the storage address of the calculation parameters in the memory.
In implementation, the feature image may be stored in the external memory, specifically at multiple storage addresses, and the calculation parameters may also be stored in the external memory, specifically at one storage address. The first storage unit 411 can therefore determine from the starting storage address of the feature image which address of the external memory to read the feature image from. If the feature image is stored contiguously in the external memory, the processor can read the image data stored at multiple consecutive addresses starting from the starting address to obtain the whole feature image. For example, if the starting address is 0 and the address length is 8, the addresses are 0, 1, 2, 3, 4, 5, 6 and 7, and the processor can read the image data stored at addresses 0 through 7 in order.
The second configuration information corresponding to the calculation parameters may include only the storage address of the calculation parameters in the memory. Since the calculation parameters are small, all of them can be stored in the storage space of a single address rather than at multiple addresses.
Optionally, the configuration information corresponding to the feature image and the calculation parameters may further include operation-mode information.
In implementation, the operation-mode information is denoted eltwise_mode in the embodiments of the present invention. If the processor is an ELTWISE-layer processor, the ELTWISE-layer processor can be multiplexed to implement scale-layer processing. The operation-mode information can indicate the current operation mode of the processor: if it indicates ELTWISE-layer operation, the ELTWISE-layer processor performs ELTWISE-layer computation; if it indicates scale-layer operation, the ELTWISE-layer processor performs scale-layer computation.
A mode-switching function can thus be implemented through the operation-mode information, allowing the ELTWISE-layer processor to be multiplexed.
Optionally, the external memory includes multiple storage addresses satisfying a set condition, each storing image data of the preset data amount from the feature image.
In implementation, the IFM can be stored in the external memory in a certain format, specifically in an aligned manner. Multiple storage addresses can be selected for the IFM in the external memory; when they satisfy a preset condition, the IFM can be considered to be stored aligned. In one possible implementation, assume a storage address is represented by W binary values; if the highest bit of the multiple storage addresses changes consecutively while the lower W-1 bits are identical, the multiple storage addresses can be considered to satisfy the preset condition. For example, as shown in Fig. 6, with a storage address represented by 4 values, if the lower 3 digits of 4 storage addresses are all 0 and the highest digit is 1, 2, 3, 4 in turn, that is, the 4 addresses are 0x1000, 0x2000, 0x3000 and 0x4000, then the 4 storage addresses satisfy the set condition.
The amount of data that can be stored at each storage address is fixed. Assume each address can store M = 64 bits and each pixel of the feature image occupies 8 bits; then each address can store 64/8 = 8 pixel values. That is, the values of every 8 pixels are stored aligned at one storage address, and all storage addresses remain 64-bit aligned.
Storing the IFM in the external memory in an aligned manner facilitates read operations on the external memory and improves read performance.
Step S302: split the feature image according to the computing capabilities of the multiple calculation units 220, and transfer the multiple sets of image data obtained after splitting, together with the calculation parameters, to the multiple calculation units 220, so that the multiple calculation units 220 perform calculation on the image data they each receive.
In implementation, after receiving the feature image and calculation parameters, the processor can split the feature image according to the computing capabilities of the multiple calculation units 220 so that the split feature image matches those capabilities. This avoids transmitting too much data to the calculation units 220, which would waste transmission bandwidth because a calculation unit 220 cannot immediately process all of it, and also avoids transmitting too little data, which would leave some of the calculation units 220 idle and reduce the computation speed. Splitting the feature image yields multiple sets of image data; each set, together with the calculation parameters, can be transmitted to one calculation unit 220, so that each calculation unit 220 receives its own set of to-be-calculated image data and the calculation parameters and performs scale-layer processing based on them.
Optionally, the computing capabilities of the multiple calculation units 220 reflect the amount of data that each calculation unit 220 can calculate at one time and the total number of the multiple calculation units 220. Splitting the feature image according to these capabilities may include: determining, according to the per-unit amount and the total number of units, whether to split the read image data of the preset data amount; and if it is determined to split, splitting the image data of the preset data amount accordingly.
In implementation, multiplying the amount of data that each calculation unit 220 can calculate at one time by the total number of units gives the total amount the multiple calculation units 220 can calculate at once. If this total is equal to or greater than the preset data amount, the image data of the preset data amount need not be split; if it is smaller, the image data of the preset data amount must be split.
Assume each calculation unit 220 can calculate 8 bits at one time, that is, 8 bits per clock cycle; the total number of calculation units 220 is P (the formula in the original is an image placeholder; with the values below, P = 4); and M bits of image data are read from the external memory each time. The product of the per-unit amount and the unit count is 8·P bits, which is the amount of data all calculation units 220 can calculate simultaneously. Since M bits of image data are read from the external memory each time, all the read image data cannot be processed in one pass; M/(8·P) passes are needed to process all of it. The preset data amount can therefore be divided by the product 8·P to obtain the number of splits. The processor can divide the image data of the preset data amount equally according to the number of splits and transmit one split portion, together with the calculation parameters, to the multiple calculation units 220 each time.
The processor can also perform serial-to-parallel conversion on each split portion to convert it into a number of sets of image data equal to the total number of calculation units 220. As shown in Fig. 5, each split portion of M/2 bits can be serial-to-parallel converted into 4 sets of image data, each set of M/8 bits. When M is 64 bits, M/8 is 8 bits, so the data amount of each set of image data is 8 bits, exactly the amount each calculation unit 220 can calculate at one time; therefore each calculation unit 220 can finish one set of image data in one pass.
In terms of timing, the feature image can be split while IFM_FIFO is transmitting data to the ping-pong buffers, without waiting for the entire feature image to be transferred first; this increases processing speed. In one possible implementation, data transfer and splitting can be completed in the following order:
(1) IFM_FIFO reads M bits of image data A from the i-th channel's feature image and stores image data A in the first ping-pong buffer.
(2) IFM_FIFO reads the a and b calculation parameters of the i-th channel and stores them in the second ping-pong buffer.
(3) Image data A in the first ping-pong buffer is split, the calculation parameters are fetched from the second ping-pong buffer, and the split image data A and the calculation parameters are transmitted to the calculation units 220 in two passes. Meanwhile, IFM_FIFO reads M bits of image data B from the i-th channel's feature image and stores image data B in the third ping-pong buffer.
(4) Image data B in the third ping-pong buffer is split, the calculation parameters are fetched from the second ping-pong buffer, and the split image data B and the calculation parameters are transmitted to the calculation units 220 in two passes. Meanwhile, IFM_FIFO reads M bits of image data C from the i-th channel's feature image and stores image data C in the fourth ping-pong buffer.
Operations similar to (1)-(4) are performed until the entire feature image of the i-th channel has been read and processed.
The circuit timing of these operations is shown in Figs. 7 and 8. In Fig. 7, eltwise_rd_addr is the storage-address signal; ifm1_addr is the storage-address signal of the preset-data-amount image data of the feature image input to the scale layer; ifm2_addr is the storage-address signal of the calculation parameters. eltwise_rd_en is the enable signal: when high, eltwise_rd_addr is valid; when low, it is invalid. eltwise_rd_data is the data-read signal; ifm1_data1_blk1, ifm1_data2_blk1 and ifm1_data3_blk1 are the first, second and third image-data signals, and b1 and a1 are calculation-parameter signals.
Fig. 8 shows the control-logic timing of IFM_FIFO and the ping-pong buffers. Based on ifm1_data1_blk1, the first image data is stored at the first storage position in IFM_FIFO; based on b1 and a1, the calculation parameters are stored at the second storage position; based on ifm1_data2_blk1, the second image data is stored at the third storage position. ifm1_data1_blk1 is transmitted from IFM_FIFO to the ping-pong buffer and split there to obtain ifm1_data1_h and ifm1_data1_l. Next, b1 and a1 in IFM_FIFO are transmitted to the ping-pong buffer. Then ifm1_data2_blk1 in IFM_FIFO is transmitted to the ping-pong buffer and split to obtain ifm1_data2_h and ifm1_data2_l. At the same time, the ping-pong buffer outputs ifm1_data1_h, ifm1_data1_l, ifm1_data2_h, ifm1_data2_l, b1 and a1, so that the calculation units 220 perform arithmetic processing based on the ping-pong buffer's output.
By providing multiple calculation units 220 that perform arithmetic processing on multiple sets of image data in parallel, multiple sets of image data can be processed at the same time, which increases the speed of scale-layer arithmetic processing.
Considering the difference between the data formats of the image data and the calculation parameters, optionally, the processor can convert the data format of the image data in advance and input the converted image data and the calculation parameters into the calculation units 220 for processing.
In implementation, the image data may be 8-bit data and can be shifted to convert it into data consistent with the format of the calculation parameters. For example, 8-bit image data can be right-shifted and converted into 32-bit fixed-point image data; the calculation parameters are also 32-bit fixed-point data at this point, and two kinds of data in the same format can be multiplied and added directly.
A technician can also configure the data format of the output feature image, which can be carried in the configuration information input to the processor. Optionally, the processor can determine the target data format of the output feature image and convert the image data output by the calculation units 220 into data of that target format.
In implementation, the target data format can be set to 8-bit fixed-point data. The image data directly output by a calculation unit 220 is 32-bit fixed-point data, which can be shifted to convert its format into the target data format. For example, left-shifting converts the 32-bit fixed-point image data into 8-bit fixed-point image data. After the shift, the numerical value of the original data is unchanged; only the precision of the data changes.
步骤S303,通过写入单元230将多个计算单元220的计算结果合并,并将合并后的计算结果存入外部存储器。
在实施中,每个计算单元220可以基于各自接收到的图像数据和计算参数,进行运算处理,输出计算结果。写入单元230可以对不同计算单元220输出的计算结果进行合并。
For example, suppose there are 4 calculation units 220 and each outputs an M/4-bit calculation result. Merging the outputs of the 4 calculation units pairwise yields merged results of M/2 bits each; merging these two merged results again yields a final result of M bits.
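The two-stage pairwise merge can be sketched as bit concatenation. A minimal Python sketch for four partial results (e.g. M = 32 with 8-bit parts); the widths and the function name are illustrative:

```python
def merge_results(parts, part_bits):
    """Merge four partial results into one word by bit concatenation:
    two pairwise merges produce two half-width (M/2-bit) words, and a
    final merge produces the full-width (M-bit) result."""
    assert len(parts) == 4
    half0 = (parts[0] << part_bits) | parts[1]   # first M/2-bit word
    half1 = (parts[2] << part_bits) | parts[3]   # second M/2-bit word
    return (half0 << (2 * part_bits)) | half1    # final M-bit word
```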
After obtaining the merged result, the write unit 230 can obtain the initial storage address and storage-address length corresponding to the merged result in the external memory, and store the merged result in the external memory accordingly.
Optionally, the external memory includes a first external memory and a second external memory. The write unit 230 stores the merged results in the first external memory; when the amount of result data stored in the first external memory reaches a preset threshold, the first external memory transfers the stored data to the second external memory.
In an implementation, the first external memory may be a memory such as DSRAM, and the second external memory may be a memory such as double data rate synchronous dynamic random-access memory (DDR).
The preset threshold may be the total data amount of the OFM. As the calculation units 220 complete the scale-layer operations, the first external memory gradually accumulates the entire OFM, which it then transfers to the second external memory. This reduces repeated reads of data in the first external memory and saves data read/write bandwidth.
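The threshold-triggered transfer between the two external memories can be modelled as follows. A minimal Python sketch; the class and member names (`WriteBack`, `dsram`, `ddr`) are illustrative stand-ins, not the patent's hardware interfaces:

```python
class WriteBack:
    """Model of the two-level write-back path: results accumulate in a
    fast first external memory and are flushed to the second external
    memory in one bulk transfer once the preset threshold (e.g. the
    full OFM size) is reached."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.dsram = []   # stands in for the first external memory
        self.ddr = []     # stands in for the second external memory

    def write(self, chunk):
        self.dsram.extend(chunk)
        if len(self.dsram) >= self.threshold:
            # one bulk transfer instead of many small ones saves
            # read/write bandwidth on the second memory
            self.ddr.extend(self.dsram)
            self.dsram.clear()
```

Batching the transfer is what yields the bandwidth saving the text describes: the second memory sees one sequential write of the whole OFM rather than many small ones.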
In one possible implementation, as shown in Figure 9, the processor may also comprise four modules: Eltwise_Instr_Proc, FM_Loader, FM_Proc_Unit and FM_Write_Back. An INSTR_FIFO can be provided in the Eltwise_Instr_Proc module; configuration information can be transferred to INSTR_FIFO through an APB interface and parsed by the Eltwise_Instr_Proc module. An IFM_FIFO can be provided in the FM_Loader module; IFM_FIFO reads the feature image and calculation parameters from the external memory according to the parsed configuration information and transfers them to the FM_Proc_Unit module, which performs the operations based on the feature image and calculation parameters. The FM_Proc_Unit module outputs the calculation results to the OFM_FIFO in the FM_Write_Back module, and OFM_FIFO stores the results in the external memory according to the configuration information.
As shown in Figure 10, the Eltwise_Instr_Proc module may include INSTR_FIFO and INSTR_DECODER, where INSTR_DECODER parses the configuration information. The Eltwise_Instr_Proc module can be connected simultaneously to the FM_Loader, FM_Proc_Unit and FM_Write_Back modules.
As shown in Figure 11, the FM_Loader module includes IFM_FIFO, ifm_fifo_rd, flip-flops and ifm_rdata_pp_buffer. Here, {eltwise_rd_vld, eltwise_rd_data} denotes the signal input to IFM_FIFO, with eltwise_rd_data carrying the feature image and the calculation parameters. ifm_fifo_rd outputs the ifm_fifo_ren signal to IFM_FIFO to control it. The flip-flops delay the transfer of the feature image and calculation parameters to ifm_rdata_pp_buffer. The feature image is split by ifm_rdata_pp_buffer into two signals, ifm1_data and ifm2_data. After passing through the flip-flops, the two calculation-parameter signals scale_a and scale_b are output directly.
As shown in Figure 12, the FM_Proc_Unit module includes multiple Pre_Fix_Point, EltWise_Proc_Unit and Post_Fix_Point blocks. Pre_Fix_Point converts the data format of the input image data and calculation parameters; EltWise_Proc_Unit performs operations based on the converted image data and calculation parameters; Post_Fix_Point converts the data format of the calculation results. The inputs of the FM_Proc_Unit module as a whole are the ifm1_data and ifm2_data signals, where ifm1_data denotes the feature image and ifm2_data denotes the calculation parameters. Its output is the ofm_data signal, which denotes the converted calculation result.
In one possible implementation, the internal circuit structure of the EltWise_Proc_Unit module is shown in Figure 13. Its inputs include the ifm1_data, ifm2_data, coeff_a and scale_b signals, where coeff_a and scale_b denote the calculation parameters. EltWise_Proc_Unit may include electrical components such as flip-flops (registers), multiplexers and multipliers. "Extension" in Figure 13 denotes an extensible part.
As shown in Figure 14, the FM_Write_Back module may include OFM_FIFO, OFM_DATA_PACKER and ofm_wr_addr_gen. OFM_FIFO collects the calculation results, OFM_DATA_PACKER merges them, and ofm_wr_addr_gen stores the merged results at the storage address in DSRAM indicated by the configuration information.
In the method provided by this embodiment of the present invention, a dedicated processor equipped with multiple calculation units performs the scale-layer operations. By having the multiple calculation units operate on multiple groups of image data in parallel, several groups of image data can be processed simultaneously, which speeds up the scale-layer operations and thereby improves the processor's computing performance for the scale layer.
Another exemplary embodiment of the present invention provides a convolutional neural network operation apparatus, which may be the processor in the above embodiments. As shown in Figure 2, the apparatus may include a read unit 210, multiple calculation units 220 and a write unit 230, where:
the read unit 210 is configured to read, from an external memory, the feature image and calculation parameters input to the scale layer; split the feature image according to the computing capability of the multiple calculation units 220; and pass the groups of image data obtained by the splitting and the calculation parameters to the multiple calculation units 220;
the multiple calculation units 220 are configured to perform calculations on the image data they respectively receive; and
the write unit 230 is configured to merge the calculation results of the multiple calculation units 220 and store the merged calculation result in the external memory.
Optionally, the read unit 210 includes a first storage unit configured to:
alternately read the feature image and the calculation parameters from the external memory, wherein a preset amount of image data is read from the feature image each time; and
store the sequentially read data.
Optionally, the read unit 210 further includes a second storage unit, and the first storage unit is further configured to:
transfer the read data to the second storage unit in sequence with a set delay.
Optionally, the first storage unit is configured to:
determine, according to the total data amount of the feature image and the preset data amount, the number of reads D corresponding to the feature image;
read the preset amount of image data from the external memory in one read operation;
read the calculation parameters from the external memory in one read operation; and
read D-1 further blocks of the preset amount of image data from the external memory in sequence in D-1 read operations.
Optionally, the apparatus further includes an acquisition unit configured to acquire the configuration information corresponding to the feature image and the calculation parameters;
the first storage unit is configured to read the feature image and the calculation parameters from the external memory based on the configuration information.
Optionally, the configuration information corresponding to the feature image includes the starting storage address and the storage-address length of the feature image in the external memory; and
the configuration information corresponding to the calculation parameters includes the storage address of the calculation parameters in the memory.
Optionally, the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and each of the multiple storage addresses stores the preset amount of image data of the feature image.
Optionally, the computing capability of the multiple calculation units 220 reflects the amount of data each calculation unit 220 can calculate at a time and the total number of the multiple calculation units 220, and the read unit 210 is configured to:
determine, according to the amount of data each calculation unit 220 can calculate at a time and the total number of units, whether to split the read preset amount of image data; and
when it is determined to split the preset amount of image data, split the preset amount of image data according to the amount of data each calculation unit 220 can calculate at a time and the total number of units.
Optionally, the read unit 210 is configured to:
determine the product of the amount of data each calculation unit 220 can calculate at a time and the total number of units;
divide the preset data amount by the product to obtain the number of splits; and
split the preset amount of image data into equal parts according to the number of splits.
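The split rule just described (split count = preset data amount divided by the product of per-unit capacity and unit count, then an equal split) can be sketched as follows. A minimal Python sketch; `split_image_data` and its parameters are illustrative names, and even divisibility is assumed:

```python
def split_image_data(data, per_unit_amount, n_units):
    """Split a preset-size block of image data: the split count is the
    block size divided by (per-unit capacity x unit count), and the
    block is then cut into that many equal parts."""
    product = per_unit_amount * n_units
    assert len(data) % product == 0, "illustrative: assume even division"
    n_splits = len(data) // product          # number of splits
    part = len(data) // n_splits             # equal part size (== product)
    return [data[i:i + part] for i in range(0, len(data), part)]
```

Each part then carries exactly one round's worth of work for the full set of calculation units.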
The convolutional neural network operation apparatus shown in Figure 2 can perform the methods of the embodiments shown in Figures 1-14. For parts not described in detail in this embodiment, refer to the related descriptions of the embodiments shown in Figures 1-14; the execution process and technical effects of this technical solution are described there and are not repeated here.
In addition, an embodiment of the present invention provides an electronic device. As shown in Figure 15, the electronic device includes a convolutional neural network operation apparatus 1501 and a memory 1502 external to the convolutional neural network operation apparatus 1501, where the apparatus 1501 is the convolutional neural network operation apparatus shown in Figure 2 above.
An embodiment of the present invention further provides a computer-readable storage medium. The storage medium is a computer-readable storage medium storing program instructions, the program instructions being used to implement the methods of the embodiments shown in Figures 1-14.
The technical solutions and technical features of the above embodiments may be used alone or in combination as long as they do not conflict; all such variations within the understanding of those skilled in the art are equivalent embodiments within the scope of protection of the present invention.
The above are merely embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate, not limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or substitute equivalents for some or all of their technical features, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (22)

  1. A convolutional neural network operation method, applied to a processor that performs calculations for a network layer in a convolutional neural network, the processor comprising a read unit, multiple calculation units and a write unit, the method comprising:
    reading, by the read unit, a feature image and calculation parameters input to the network layer from an external memory;
    splitting the feature image according to the computing capability of the multiple calculation units, and passing the groups of image data obtained by the splitting and the calculation parameters to the multiple calculation units, so that the multiple calculation units perform calculations on the image data they respectively receive; and
    merging, by the write unit, the calculation results of the multiple calculation units, and storing the merged calculation result in the external memory.
  2. The method according to claim 1, wherein the read unit includes a first storage unit, and the reading, by the read unit, of the feature image and calculation parameters input to the network layer from the external memory comprises:
    alternately reading the feature image and the calculation parameters from the external memory, wherein a preset amount of image data is read from the feature image each time; and
    storing the sequentially read data in the first storage unit in order.
  3. The method according to claim 2, wherein the read unit further includes a second storage unit, and the method further comprises:
    transferring, by the first storage unit, the read data to the second storage unit in sequence with a set delay.
  4. The method according to claim 2, wherein the alternately reading the feature image and the calculation parameters from the external memory comprises:
    determining, according to the total data amount of the feature image and the preset data amount, the number of reads D corresponding to the feature image;
    reading the preset amount of image data from the external memory in one read operation;
    reading the calculation parameters from the external memory in one read operation; and
    reading D-1 further blocks of the preset amount of image data from the external memory in sequence in D-1 read operations.
  5. The method according to claim 2, further comprising:
    acquiring configuration information corresponding to the feature image and the calculation parameters, so that the first storage unit reads the feature image and the calculation parameters from the external memory based on the configuration information.
  6. The method according to claim 5, wherein the configuration information corresponding to the feature image includes the starting storage address and the storage-address length of the feature image in the external memory; and
    the configuration information corresponding to the calculation parameters includes the storage address of the calculation parameters in the memory.
  7. The method according to claim 2, wherein the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and each of the multiple storage addresses stores the preset amount of image data of the feature image.
  8. The method according to claim 2, wherein the computing capability of the multiple calculation units reflects the amount of data each calculation unit can calculate at a time and the total number of the multiple calculation units, and the splitting the feature image according to the computing capability of the multiple calculation units comprises:
    determining, according to the amount of data each calculation unit can calculate at a time and the total number of units, whether to split the read preset amount of image data; and
    if it is determined to split the preset amount of image data, splitting the preset amount of image data according to the amount of data each calculation unit can calculate at a time and the total number of units.
  9. The method according to claim 8, wherein the splitting the preset amount of image data according to the amount of data each calculation unit can calculate at a time and the total number of units comprises:
    determining the product of the amount of data each calculation unit can calculate at a time and the total number of units;
    dividing the preset data amount by the product to obtain the number of splits; and
    splitting the preset amount of image data into equal parts according to the number of splits.
  10. The method according to any one of claims 1-9, wherein the network layer includes an ELTWISE layer or a scale layer.
  11. A convolutional neural network operation apparatus, comprising a read unit, multiple calculation units and a write unit, wherein:
    the read unit is configured to read a feature image and calculation parameters input to a network layer from an external memory, split the feature image according to the computing capability of the multiple calculation units, and pass the groups of image data obtained by the splitting and the calculation parameters to the multiple calculation units;
    the multiple calculation units are configured to perform calculations on the image data they respectively receive; and
    the write unit is configured to merge the calculation results of the multiple calculation units and store the merged calculation result in the external memory.
  12. The apparatus according to claim 11, wherein the read unit includes a first storage unit configured to:
    alternately read the feature image and the calculation parameters from the external memory, wherein a preset amount of image data is read from the feature image each time; and
    store the sequentially read data.
  13. The apparatus according to claim 12, wherein the read unit further includes a second storage unit, and the first storage unit is further configured to:
    transfer the read data to the second storage unit in sequence with a set delay.
  14. The apparatus according to claim 12, wherein the first storage unit is configured to:
    determine, according to the total data amount of the feature image and the preset data amount, the number of reads D corresponding to the feature image;
    read the preset amount of image data from the external memory in one read operation;
    read the calculation parameters from the external memory in one read operation; and
    read D-1 further blocks of the preset amount of image data from the external memory in sequence in D-1 read operations.
  15. The apparatus according to claim 12, further comprising an acquisition unit configured to acquire configuration information corresponding to the feature image and the calculation parameters;
    the first storage unit being configured to read the feature image and the calculation parameters from the external memory based on the configuration information.
  16. The apparatus according to claim 15, wherein the configuration information corresponding to the feature image includes the starting storage address and the storage-address length of the feature image in the external memory; and
    the configuration information corresponding to the calculation parameters includes the storage address of the calculation parameters in the memory.
  17. The apparatus according to claim 12, wherein the external memory includes multiple storage addresses, the multiple storage addresses satisfy a set condition, and each of the multiple storage addresses stores the preset amount of image data of the feature image.
  18. The apparatus according to claim 12, wherein the computing capability of the multiple calculation units reflects the amount of data each calculation unit can calculate at a time and the total number of the multiple calculation units, and the read unit is configured to:
    determine, according to the amount of data each calculation unit can calculate at a time and the total number of units, whether to split the read preset amount of image data; and
    when it is determined to split the preset amount of image data, split the preset amount of image data according to the amount of data each calculation unit can calculate at a time and the total number of units.
  19. The apparatus according to claim 18, wherein the read unit is configured to:
    determine the product of the amount of data each calculation unit can calculate at a time and the total number of units;
    divide the preset data amount by the product to obtain the number of splits; and
    split the preset amount of image data into equal parts according to the number of splits.
  20. The apparatus according to any one of claims 11-19, wherein the network layer includes an ELTWISE layer or a scale layer.
  21. An electronic device, comprising: the convolutional neural network operation apparatus according to any one of claims 11-19 and a memory external to the convolutional neural network operation apparatus.
  22. A computer-readable storage medium, wherein the storage medium stores program instructions, the program instructions being used to implement the convolutional neural network operation method according to any one of claims 1-10.
PCT/CN2020/079221 2020-03-13 2020-03-13 Convolutional neural network operation method, apparatus, device and storage medium WO2021179289A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004317.6A 2020-03-13 2020-03-13 Convolutional neural network operation method, apparatus, device and storage medium
PCT/CN2020/079221 2020-03-13 2020-03-13 Convolutional neural network operation method, apparatus, device and storage medium





