Detailed Description
So that the manner in which the features and aspects of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
Fig. 1 is a schematic flow chart illustrating the implementation of a neural-network-based operation method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
step S101, obtaining the calculation types of a neural network model, including the calculation type of the current layer and the calculation type of the next layer;
step S102, judging whether the calculation types meet a preset condition, and obtaining a judgment result;
step S103, if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer, performing the hybrid calculation, and obtaining a hybrid calculation result; and writing the hybrid calculation result into a memory, thereby shortening the operation time; wherein the calculation types include at least a convolution calculation type and a hybrid calculation type.
In another embodiment, after judging whether the calculation type satisfies the preset condition and obtaining the judgment result, the method includes: if the judgment result is that the condition is not satisfied, writing the convolution calculation result of the current layer back to the memory, and performing the calculation when the calculation task of the next cycle is started.
In another embodiment, the case in which the judgment result is not satisfied includes: the current layer is of the convolution calculation type, and the next layer is also of the convolution calculation type.
In another embodiment, if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer and performing the hybrid calculation includes: the current layer is of the convolution calculation type and the next layer is of the hybrid calculation type; and taking out, from the cache region of the convolution calculation result, the input data required by the hybrid calculation of the next layer to perform the hybrid calculation.
In another embodiment, after writing the hybrid calculation result into the memory, the method further includes: processing the input data stream in a pipeline structure.
In another embodiment, before obtaining the calculation types of the neural network model, the method includes: taking out data according to the convolution size and stride, and performing the convolution calculation.
In another embodiment, before the data are taken out according to the convolution size and stride and the convolution calculation is performed, the method further includes: acquiring the data, and caching the data in blocks for storage according to a convolution rule.
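For illustration only, the following Python sketch summarizes the scheduling decision of steps S101 to S103 and the embodiments above; the type and function names (CalcType, meets_preset_condition, schedule_layer) are hypothetical and do not appear in the described hardware.

```python
from enum import Enum, auto

class CalcType(Enum):
    CONVOLUTION = auto()   # conventional convolution, separable convolution, deconvolution
    HYBRID = auto()        # pooling, upsampling, route, concat, etc.

def meets_preset_condition(current: CalcType, nxt: CalcType) -> bool:
    """Preset condition: the current layer is convolution and the next layer is hybrid."""
    return current is CalcType.CONVOLUTION and nxt is CalcType.HYBRID

def schedule_layer(current: CalcType, nxt: CalcType) -> str:
    if meets_preset_condition(current, nxt):
        # Take the next layer's input directly from the convolution-result cache
        # region, run the hybrid calculation, and write the hybrid result to memory,
        # shortening the overall operation time.
        return "fuse: convolution result -> hybrid calculation -> memory"
    # Otherwise the convolution result of the current layer is written back to
    # memory and the next layer is computed when the next cycle's task starts.
    return "write back: convolution result -> memory"
```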
In another embodiment, Fig. 2 is a block diagram of the system of the present invention. The inference calculation platform is an FPGA chip, inside which a general cache storage mechanism is designed to support inference calculation of different neural network models. The inference calculation of a neural network model is in fact the implementation of a specific calculation operator for each layer of the model. The calculation operators fall into a conventional convolution class, including conventional convolution, separable convolution, and deconvolution calculations, and a hybrid calculation class, including operators such as pooling downsampling, upsampling, route point-wise addition of corresponding channels, and concat channel-wise concatenation.
B4096 represents the calculation parallelism, a schematic diagram of which is shown in Fig. 3. Processing 16 pixels simultaneously in the depth direction of the input image forms a group of Input Channel Parallelism (ICP); processing 16 weight filters simultaneously forms a group of Output Channel Parallelism (OCP), corresponding to 16 pixels in the depth direction of the output feature map; and processing 8 pixels simultaneously in the column direction of the input image forms a group of Pixel Parallelism (PP), corresponding to 8 pixels in the column direction of the output feature map. The calculation parallelism is B = PP × ICP × OCP × 2, where the multiplication and the addition each count as one operation, so B4096 = 8 × 16 × 16 × 2 = 4096, that is, 4096 multiply-accumulate operations can be completed in each clock cycle clk when the convolution operation is performed with this calculation parallelism. The operating clock frequency of the system is 300 MHz, so the computing power of one calculation unit reaches 4096 × 300 M ≈ 1.2 TOP/s, and if several calculation units are operated the computing power of the system is multiplied accordingly.
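As a check on the figures above, the following back-of-the-envelope Python sketch reproduces the parallelism and peak computing power; the variable names are illustrative only.

```python
# Calculation parallelism of one computing unit (values taken from the description above).
PP, ICP, OCP = 8, 16, 16        # pixel, input-channel and output-channel parallelism
OPS_PER_MAC = 2                 # one multiplication plus one addition

ops_per_clk = PP * ICP * OCP * OPS_PER_MAC     # = 4096, i.e. B4096
clock_hz = 300e6                               # 300 MHz operating clock

peak_ops = ops_per_clk * clock_hz              # ≈ 1.2288e12, roughly 1.2 TOP/s per unit
print(ops_per_clk, peak_ops / 1e12)            # 4096 1.2288
```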
The general design block diagram is divided into a PS system and a PL subsystem. An ARM processor core and a memory controller are embedded in the PS, which is responsible for converting the parameters of the neural network structure model into an instruction set, issuing the instruction set, and performing subsequent processing on the results whose accelerated computation the PL has completed. The functions of the PL subsystem are divided as follows:
1. Instruction set and state feedback function module: acquires and parses the instruction set written into the DDR space by the PS, converts it into control signals and configuration parameters, and transmits them to the read-write controller, the PE array convolution calculation module, the hybrid calculation module, and the buffer block management module. It also collects the execution state and feedback parameters of each execution module and notifies the PS through AXI4-lite to fetch the calculation result in time.
2. Convolution calculation function module: after receiving an instruction, starts the convolution, separable convolution, and deconvolution calculation processes and writes the intermediate calculation results into the cache blocks. During calculation, the data are split or packed according to the input channel, output channel, and pixel parallelism, and data acquisition and writing are performed in cooperation with the cache pool management function module.
3. Cache pool management function module: in the convolution calculation mode, receives an instruction to start data-transmission read-write control, reads data from the DDR and writes them into the corresponding cache blocks, reads the input image data, convolution kernel data, and parameters from the corresponding cache block storage spaces according to the data structure of each layer and the characteristics of the convolution kernels, transmits them to the PE array for calculation, writes the convolution calculation result into the cache block when the calculation is completed, and starts the DDR write-transmission control. In the hybrid calculation mode, it acquires data in the same way as in the convolution calculation mode, performs the hybrid calculation, and writes the result into the DDR when the calculation is completed.
4. Hybrid calculation function module: after receiving an instruction, calls the corresponding hybrid operator module according to the operator type, starts the calculation, and cooperates with the cache pool management function module to acquire and write data.
5. Bus transmission control function module: after receiving an instruction, reads data from or writes data back to the DDR in cooperation with the cache management module, and completes the DDR data control and arbitration functions.
The above functions are executed according to the execution flow of Fig. 4. During execution, the PL makes a judgment according to the obtained calculation types of the current layer and the next layer of the neural network model. If the current layer is of the convolution calculation type and the next layer is of the hybrid calculation type, the input data required by the hybrid calculation of the next layer are taken out from the cache region of the convolution calculation result according to Fig. 4 for the hybrid calculation, and the hybrid calculation result is finally written into the DDR; the input data stream is processed in a pipeline structure, so the overall operation time is shortened and the execution efficiency is greatly improved. If the current layer is of the convolution calculation type and the next layer is also of the convolution calculation type, the convolution operation unit is still occupied by the convolution calculation task of the current layer and has no time to run the convolution calculation task of the next layer, so the convolution calculation result of the current layer must be written back to the DDR and the calculation is performed when the calculation task of the next cycle is started.
Most of the calculation process revolves around the convolution calculation and the hybrid calculation. In the convolution calculation, the data processing objects are parameters such as the input image, weights, and bias, and the output is a feature image. Fig. 6 is a data input/output flow diagram of the calculation process. As can be seen from the diagram, in the convolution calculation the input image, weights, bias, and other misc parameters must be loaded from the designated storage space in the external DDR memory into their respective buffer cache spaces, and then these input data are read out from the buffer cache spaces in parallel according to the calculation parallelism, convolution window, and convolution data format to perform the convolution operation and the activated output. Taking the neural network structure model YOLOv3 as an example, the convolution calculation process of each convolution kernel is shown in Fig. 7. After the result of a convolution layer has been calculated, if the calculation type of the next layer is judged to still be the convolution calculation type, the finished convolution result is output and stored in the output1 buffer cache space, the DDR burst write-back process is started, the output result stored in the output1 buffer is burst-written back to the external DDR memory, and the calculation task of the next cycle is started. If the calculation type of the next layer is judged to be the hybrid calculation type, the partial convolution result is stored in the output1 buffer cache space and the hybrid calculation process is started. In the hybrid calculation, the input data object is the feature image output by the convolution calculation: the convolution calculation result of the previous layer is read out from the output1 buffer cache space and transferred to the hybrid calculation array for the hybrid calculation; common hybrid calculations include pooling, upsample, route, concat, and the like, as shown in Fig. 8. The normalized feature image output after the hybrid calculation is stored in the output2 buffer cache space, the DDR write-back process is started, and the output result stored in the output2 buffer is written back to the external DDR memory in burst mode. The convolution and hybrid calculation processes are repeated for each layer of the inference implementation of the neural network model. From the above analysis it can be seen that storage spaces such as the image buffer, weight buffer, misc buffer, output1 buffer, and output2 buffer are required for the convolution calculation and the hybrid calculation. Because the on-chip BRAM storage space in the FPGA device is limited, the image buffer and the like cannot hold the convolution windows of all input channel images, so a middle buffer space for storing the intermediate partial convolution results must also be set up for buffering during the convolution calculation.
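The per-layer data flow just described can be summarized by the sketch below. It is a structural outline only: the callables passed in stand for the loading, PE-array, hybrid-array, and DDR write-back stages named in the text and are not an actual API of the platform.

```python
def run_layer(next_layer_type, load, convolve, hybrid_calc, write_back):
    """One inference layer: convolution first, then either fuse the hybrid
    calculation on-chip or burst-write the convolution result back to DDR."""
    image, weights, misc = load()              # DDR -> image / weight / misc buffers
    conv_out = convolve(image, weights, misc)  # PE array, buffered via middle1 / output1
    if next_layer_type == "hybrid":
        # pooling / upsample / route / concat on the output1 buffer contents,
        # buffered via middle2 / output2, then burst-written back to DDR
        write_back(hybrid_calc(conv_out))
    else:
        # next layer is convolution again: burst-write output1 back to DDR
        write_back(conv_out)

# Example wiring with trivial stand-ins for the hardware stages:
run_layer("hybrid",
          load=lambda: ([1, 2, 3], [0.5], [0]),
          convolve=lambda img, w, m: [p * w[0] + m[0] for p in img],
          hybrid_calc=lambda feat: [max(feat)],
          write_back=print)
```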
In the system, the higher the calculation parallelism, the higher the computing power and the execution efficiency, but the more BRAM, DSP, register, and routing resources are consumed and the more difficult placement and routing become. According to the characteristics of neural network models (shallow layers have more input data, more intermediate results, more output results, and fewer weight data, while deep layers have less input data, fewer intermediate results, fewer output results, and more weight data) and the characteristics of the hardware resources in the FPGA device, the input channel parallelism and the output channel parallelism are designed to be 16 and the pixel parallelism is designed to be 8. Once a calculation parallelism structure is fixed in the FPGA design, the corresponding peak computing power is also fixed. Under this calculation parallelism, the data volume and format of each calculation are 8 pixels in the column direction of the input image, 16 pixels in the depth direction, and 16 weight parameter data in the depth direction of each of 16 weight filters. The depth, width, and height of the input image of each layer in a neural network structure model, and the number and size of the weight filters, all differ, so it is important to design an on-chip cache and scheduling mechanism based on this calculation parallelism to adapt to the different data volumes and formats of different neural network structure models. The quantized weight parameters are INT8, and the bit width of the input/output data bus of the computing unit is designed to be 16 × 8 = 128 bits, so that the DDR bus bit width and the channel parallelism match each other and no additional cache memory has to be consumed for bit-width conversion of the calculated data as it flows in and out, which saves storage space and facilitates the design.
The buffer spaces with the 7 logic functions, such as the Image buffer, are implemented in the embedded BRAM on-chip memory of the FPGA device. A BRAM has only one group of effective read-write ports at a time, and the size of a buffer space is determined jointly by its bit width and depth. When convolution calculation or hybrid calculation of multiple channels and multiple pixels is carried out at the same time, the design of each logic-function cache space must not only satisfy the read-write requirements of the parallelism of multiple groups of ports, but also provide enough space to keep the data flowing continuously between the preceding and following stages. Fig. 9 shows the setting parameters of each functional cache space of the system. The Image buffer stores the input image of each convolution layer of the neural network model. The depth direction D, column direction W, and row direction H of the input image data change from layer to layer: D increases with the number of convolution kernels of each layer, and W and H change with the convolution window and stride. When padding is not considered, the relation between the input and output sizes is o = (i - k)/s + 1, where i is the input image size (width or height), k is the convolution window size, s is the stride of the sliding convolution window, and o is the output size after convolution. According to the characteristics of neural network models and statistics of the parameters of most conventional neural network models, the depth D of the input image gradually increases while the width and height gradually decrease, the product W × D of the column-direction size W and the depth-direction size D does not exceed 2048 × 16 = 32768, and with the INT8 fixed-point quantization each pixel occupies 8 bits. One row of image data in the plane formed by the depth direction and the column direction is therefore at most 32768 pixels × 8 bits. To support the simultaneous read-out of a convolution window with 8-pixel parallelism and 16-input-channel parallelism, the input parallelism of 16 pixels × 8 bits = 128 bits is designed as the data bit width of the image buffer ram, and 8 rams are set to support the 8-pixel-parallel output. The maximum ram depth occupied by one row of data is 32768 × 8 bits / (8 × 16 × 8 bits) = 256. To support convolution operations with a window size of 7 × 7 or even larger, at least 7 rows of data must be stored, and one additional row of adjustment space is added to support the pipeline structure, so the depth is designed as 8 rows × 256 = 2048. Fig. 10 is a schematic diagram of the image buffer ram space: the image buffer is configured with 8 BRAMs, each 128 bits wide with a depth of 2048, and the depth direction is used to distinguish the boundaries of the input image data rows. The storage order of the input image in the DDR is shown in Fig. 11. One row of data is loaded from the DDR at a time and written into the 8 BRAMs of the image buffer in order of input channel, convolution kernel stride S, and column direction.
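The image buffer dimensions quoted above follow from the short calculation below, a sketch under the stated assumptions of INT8 pixels and a per-row bound of W × D ≤ 2048 × 16.

```python
# Image buffer sizing (all figures taken from the description above).
ICP, PP = 16, 8                        # input-channel and pixel parallelism
MAX_ROW_PIXELS = 2048 * 16             # W * D upper bound = 32768 pixels per image row
RAM_WIDTH_BITS = ICP * 8               # 16 pixels * 8 bit = 128-bit ram data width
NUM_RAMS = PP                          # 8 rams give the 8-pixel-parallel read-out

# Depth occupied by one image row spread across the 8 rams.
row_depth = MAX_ROW_PIXELS * 8 // (NUM_RAMS * RAM_WIDTH_BITS)   # = 256
rows_buffered = 7 + 1                  # 7 rows for a 7x7 window plus 1 spare row for pipelining
ram_depth = rows_buffered * row_depth  # = 2048
assert (RAM_WIDTH_BITS, row_depth, ram_depth) == (128, 256, 2048)
```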
In this way all the channel data of one row can be completely cached, data do not need to be read from the DDR repeatedly in the channel direction during the calculation, the DDR bandwidth is saved, the multiplexing rate of each row of data is improved, and at the same time convolution operations with different window sizes such as K = 3 × 3, 5 × 5, and 7 × 7 and with a variable stride can be flexibly supported. Each BRAM has one group of read-write ports, so new data can be written while data are read out from the address space. For example, when the convolution window slides and one row of data has been used up, its address space can be overwritten by the data of a following row and thus recycled, which greatly improves the utilization of the cache space and shortens or even hides the time for loading the image data.
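The row-recycling behaviour can be pictured with the minimal sketch below; the slot arithmetic is an assumption used only to illustrate how a consumed row's address space is overwritten by a later row.

```python
ROWS_BUFFERED = 8   # 7 window rows plus 1 spare row, as designed above

def row_slot(row_index: int) -> int:
    """Slot of the image buffer that holds image row `row_index`.
    Rows are written round-robin, so once a row has been consumed by the
    sliding convolution window its slot is reused by a later row."""
    return row_index % ROWS_BUFFERED

# Example for a 3x3 window, stride 1: while rows 5..7 are being read,
# rows 8 and 9 may already overwrite the slots of the consumed rows 0 and 1.
print([row_slot(r) for r in (5, 6, 7)])   # [5, 6, 7]
print(row_slot(8), row_slot(9))           # 0 1
```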
The Weight buffer cache space stores the weight parameters. During neural network model inference, the number of weight filters gradually increases with the number of network layers, and the depth direction D of the feature image output by the convolution calculation increases correspondingly. Based on statistics of the parameters of most conventional neural network models, the Weight buffer is designed with 16 rams that simultaneously support reading of data for 16 output channels, and the data width of each ram is designed as 16 × 8 bits = 128 bits to support reading and writing of data for 16 input channels. The depth of each ram is 16384/16 = 1024. Fig. 12 is a schematic diagram of the Weight buffer ram space and Fig. 13 is a schematic diagram of the storage of the weight parameters in the DDR; the parameters of 16 weight filters loaded from the DDR are written into the 16 rams in order of input channel, column direction, and number of output channels. Since the weight buffer ram depth of 1024 is not filled in most cases, for example for a layer with weight filter parameters K × K = 3 × 3 and channel number CH = 256, one group of 16 weight filters only occupies about (3 × 3 × 256)/(1024 × 16) ≈ 1/8 of the ram depth, which means the ram space can hold about 8 groups of 16 weight filter parameters at the same time. Similarly, the read-write ports and the vacant space of the weight buffer ram also support pipelined operation: for example, while the convolution calculation reads out the first 1-4 groups of weight filter parameters, the later 5-8 groups can be fetched from the off-chip DDR and written into the remaining vacant space of the weight buffer ram, and the parameters of an earlier group can be overwritten once they have been used, so the address space is recycled, the utilization of the cache space is greatly improved, and the time for repeatedly loading the weight filter data is shortened or even hidden.
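As a worked illustration of the occupancy described above (using the hypothetical helper below and the K × K = 3 × 3, CH = 256 layer from the text), the depth actually used by one group of 16 filters can be computed as follows; only a small fraction of the 1024-entry depth is occupied, leaving room for further filter groups and for the pipelined pre-loading described above.

```python
# Weight buffer: 16 rams (one per output channel of a group), each 128 bits wide.
RAM_WIDTH_WEIGHTS = 16                  # 16 input channels * 8-bit INT8 weights per entry
RAM_DEPTH = 16384 // RAM_WIDTH_WEIGHTS  # = 1024 entries per ram

def depth_for_filter_group(k: int, ch: int) -> int:
    """Ram entries needed to hold one group of 16 weight filters of size k*k*ch."""
    return k * k * ch // RAM_WIDTH_WEIGHTS

used = depth_for_filter_group(3, 256)   # = 144 entries
free = RAM_DEPTH - used                 # = 880 entries left for further groups
print(used, free)
```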
The Misc buffer caches parameters such as bias and scale. In the convolution calculation process of Fig. 7, the bias is accumulated after the multiply-add calculation of the convolution window, and the scale is the coefficient multiplied after all accumulation has finished. The number of bias and scale parameters in each layer equals the number of weight filters; one bias parameter is 32 bits wide and one scale parameter is 16 bits wide, so the data volume of the bias and scale parameters is relatively small. Combining the 128-bit DDR bus width with the output channel parallelism of 16, 4 BRAMs are set with a data bit width of 128 bits per ram and a ram depth of 512. Fig. 14 is a schematic diagram of the Misc buffer ram space. In use, every 4 bias parameters of 32 bits each are spliced into one group; each 16-bit scale parameter is extended to 32 bits by padding 16 zero bits in front, and 4 zero-padded scales are then spliced into one group, so every group of data is 128 bits and contains 4 parameters. 4 groups of bias parameters are read and written into the ram in sequence, then 4 groups of scale parameters are read and written into the ram, and during the convolution calculation the ram is written and read alternately. This both satisfies the output parallelism and makes full use of the storage space while simplifying the read-write operations on the bias and scale parameters.
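A minimal sketch of the bias/scale packing described above is given below; the helper names are hypothetical, and the padding and grouping simply follow the description.

```python
def pack_group(values_32bit):
    """Splice four 32-bit parameters into one 128-bit Misc buffer word."""
    word = 0
    for i, v in enumerate(values_32bit[:4]):
        word |= (v & 0xFFFFFFFF) << (32 * i)   # two's-complement masking handles negative bias
    return word

def pad_scale(scale_16bit):
    """Zero-extend a 16-bit scale parameter to 32 bits (16 zero bits in front)."""
    return scale_16bit & 0xFFFF

# One 128-bit word of 4 bias parameters and one of 4 zero-padded scale parameters;
# bias groups and scale groups are written and read alternately during convolution.
bias_word = pack_group([10, -3, 7, 0])
scale_word = pack_group([pad_scale(s) for s in (256, 300, 128, 512)])
```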
In the FPGA computing platform, during the convolution calculation the multiply-accumulate of all pixels of one convolution kernel cannot be completed at once when the convolution kernel is large or the number of channels is large: in one clock cycle only the amount of data given by the calculation parallelism can be read from the image buffer ram and the weight buffer ram for the convolution calculation, which produces partial intermediate results. These intermediate results have to be cached temporarily in the middle1 buffer cache space and read out again for accumulation when the products of the next pixels of the convolution kernel arrive; after the data over the whole convolution kernel size and all channels have been accumulated in this loop, the final output result is produced and written into the output1 buffer. The bit width and depth of the middle1 buffer depend on the calculation parallelism, the convolution kernel size, and the number of channels. Fig. 15 is a schematic diagram of the middle buffer ram space. The middle1 buffer ram is designed as PP × OCP × D @ 32-bit dual port: each image pixel of the convolution calculation is 8 bits and each weight is 8 bits, so one product is 16 bits, and if the maximum accumulated size of a convolution kernel is taken as K × K × CH = 1024, the bit width of the accumulated products of all channel pixels in the convolution window will not exceed 16 bits + 10 bits + 6 bits = 32 bits. The bit width of the intermediate result generated by each convolution window is therefore 32 bits, and in each clock cycle 8 pixel-parallel and 16 output-channel-parallel intermediate results are output, corresponding to the calculation parallelism. As shown in the figure, during operation one group of output channel parallelism, 16 × 32 bits = 512 bits, is spliced together, and 8 middle1 buffer rams are designed, each 512 bits wide. During the convolution calculation the convolution window slides from left to right until one row is finished and one row of intermediate results is produced; in this system the maximum column direction W of one row is 1024, so the ram depth is W/8 = 128. The read-write ports support writing one set of intermediate results while another set is read out. In the hybrid calculation the input data are the feature image output by the convolution result, with a data bit width of 8 bits. Unlike the multiply-accumulate of a convolution window, the hybrid calculation in most cases takes the maximum or the average within a window; if the maximum window size is K × K = 8 × 8, the bit width of an intermediate result of the hybrid calculation does not exceed 8 bits + 8 bits = 16 bits. The middle2 buffer ram is therefore designed as PP × OCP × D @ 16-bit dual port; corresponding to the calculation parallelism, 8 pixel-parallel and 16 output-channel-parallel intermediate results are output in each clock cycle, and similarly 16 × 16 bits = 256 bits of output channel parallelism are spliced into one group, 8 middle2 buffers are designed, each ram is 256 bits wide, and the ram depth is W/8 = 128. The read-write ports support the pipeline structure and guarantee continuous reading for the calculation.
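The 32-bit and 16-bit intermediate widths above can be verified with the short worst-case bound below; the 6-bit term is the design margin stated in the text, and the 8 × 8 window is the maximum hybrid window assumed there.

```python
import math

PIXEL_BITS = WEIGHT_BITS = 8
product_bits = PIXEL_BITS + WEIGHT_BITS                 # 8-bit * 8-bit -> 16-bit product
MAX_ACCUM_TERMS = 1024                                  # K*K*CH bound used in the text
growth_bits = math.ceil(math.log2(MAX_ACCUM_TERMS))     # 10 extra bits for 1024 accumulations
MARGIN_BITS = 6                                         # design margin from the text
middle1_bits = product_bits + growth_bits + MARGIN_BITS # = 32 (middle1 buffer word width)

# Hybrid calculation only compares or accumulates 8-bit pixels within an 8x8 window,
# so 8 + 8 = 16 bits are enough for its intermediate results (middle2 buffer).
middle2_bits = 8 + 8
assert (middle1_bits, middle2_bits) == (32, 16)
```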
When the final result of the convolution calculation is output and buffered, 8-pixel parallelism and 16-output-channel parallelism are output simultaneously, and each pixel is 8 bits, so 16 output channels × 8 bits = 128 bits is set as the data bit width of the output1 buffer ram, and the read ports of 8 rams support the simultaneous output of 8 pixels; Fig. 16 is a schematic diagram of the output buffer ram space. The maximum data volume of each line of the output feature image is D × W = 16 × 1024, so the maximum ram depth occupied by one line of data is 16 × 1024 × 8 bits / (8 × 16 × 8 bits) = 128. If the hybrid calculation process is to support operations with a window size of 7 × 7 or even larger, at least 7 lines of data must be stored, and one additional line of adjustment space is added to support the pipeline structure, so the depth of the output1 buffer ram is designed as 8 × 128 = 1024. After a partial convolution result has been stored in the output1 buffer ram, it serves as the input data cache space of the hybrid calculation type and the hybrid calculation process is started; the way the hybrid calculation fetches data from the output1 buffer is similar to the way the convolution calculation fetches data from the image buffer. Corresponding to the calculation parallelism, the intermediate results compared or accumulated within the hybrid calculation window are output, the maximum or average value is obtained, and the final calculation result is stored in the output2 buffer. 16 output channels × 8 bits = 128 bits is set as the data bit width of the output2 buffer ram, and the write ports of 8 rams support the simultaneous writing of 8 pixels. The maximum data volume of each line of the normalized output feature image is D × W = 16 × 1024, and the maximum ram depth occupied by one line of data is 16 × 1024 × 8 bits / (8 × 16 × 8 bits) = 128; the output2 buffer starts the DDR write-back process after a certain number of calculation results have been stored, and based on the burst characteristic of the DDR and the requirement for pipeline buffering space, 8 × 128 = 1024 is set as the depth of each ram. After a certain number of feature results have been stored in the output2 buffer ram and the data size satisfies the DDR burst length, the DDR transmission is started and the data in the 8 rams are read out and written into the DDR in sequence, while the hybrid calculation results keep being written into the output2 buffer ram at the same time as data are read out and written into the DDR. The pipeline structure ensures that data continuously flow in and out without waiting for data loading or interruption, which greatly improves the efficiency of data calculation and scheduling.
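The output buffer dimensions can be reproduced in the same way as those of the image buffer; the sketch below uses the maximum feature-image line size D × W = 16 × 1024 quoted above.

```python
OCP, PP = 16, 8
MAX_LINE_PIXELS = 16 * 1024            # D * W for one output feature-map line
RAM_WIDTH_BITS = OCP * 8               # 16 output channels * 8 bit = 128-bit ram word
NUM_RAMS = PP                          # 8 rams read or write 8 pixels in parallel

line_depth = MAX_LINE_PIXELS * 8 // (NUM_RAMS * RAM_WIDTH_BITS)   # = 128 entries per line
output1_depth = (7 + 1) * line_depth   # 7 lines for a 7x7 hybrid window + 1 spare = 1024
output2_depth = 8 * line_depth         # burst / pipeline buffering for DDR write-back = 1024
assert (RAM_WIDTH_BITS, output1_depth, output2_depth) == (128, 1024, 1024)
```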
In another embodiment, the apparatus includes: an acquisition unit configured to obtain the calculation types of a neural network model, including the calculation type of the current layer and the calculation type of the next layer; a judging unit configured to judge whether the calculation types meet a preset condition and obtain a judgment result; and an operation unit configured to, if the judgment result is that the condition is satisfied, take out the input data required by the hybrid calculation of the next layer, perform the hybrid calculation, obtain a hybrid calculation result, and write the hybrid calculation result into a memory, thereby shortening the operation time; wherein the calculation types include at least a convolution calculation type and a hybrid calculation type.
In another embodiment, the apparatus includes: a memory, a processor, and an executable program stored in the memory and run by the processor, wherein the processor performs the steps of the above operation method when running the executable program.
It should be noted that, in the data processing apparatus provided in the above embodiment, the division into the above program modules is described merely as an example; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the data processing apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the data processing apparatus provided in the above embodiment and the data processing method embodiment belong to the same concept, and the specific implementation process is described in the method embodiment and is not repeated here.
Fig. 5 is a first schematic structural diagram of a data processing device in an embodiment of the present invention. As shown in Fig. 5, the data processing device 500 may be a handle, a mouse, a trackball, a mobile phone, a smart pen, a smart watch, a smart ring, a smart bracelet, a smart glove, or the like. The data processing device 500 shown in Fig. 5 includes: at least one processor 501, a memory 502, at least one network interface 504, and a user interface 503. The various components in the data processing device 500 are coupled together by a bus system 505. It can be understood that the bus system 505 is used to enable connection and communication between these components. In addition to a data bus, the bus system 505 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 505 in Fig. 5.
The user interface 503 may include a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, a touch screen, or the like.
It will be appreciated that the memory 502 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 502 described in connection with the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 502 in the embodiments of the present invention is used to store various types of data to support the operation of the data processing device 500. Examples of such data include: any computer programs for operating on the data processing device 500, such as an operating system 5021 and application programs 5022; music data; animation data; book information; video and drawing information; and the like. The operating system 5021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 5022 may contain various applications, such as a media player and a browser, for implementing various application services. The program for implementing the method according to the embodiment of the present invention may be included in the application programs 5022.
The method disclosed by the above embodiments of the present invention may be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In an implementation, the steps of the above method may be performed by integrated hardware logic circuits in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 501 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium in the memory 502, and the processor 501 reads the information in the memory 502 and performs the steps of the above method in combination with its hardware.
In an exemplary embodiment, the data processing device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the foregoing method.
Specifically, when the processor 501 runs the computer program, it performs: obtaining the calculation types of a neural network model, including the calculation type of the current layer and the calculation type of the next layer; judging whether the calculation types meet a preset condition and obtaining a judgment result; if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer, performing the hybrid calculation, and obtaining a hybrid calculation result; and writing the hybrid calculation result into a memory; wherein the calculation types include at least a convolution calculation type and a hybrid calculation type.
When the processor 501 runs the computer program, it further performs: after judging whether the calculation type meets the preset condition and obtaining the judgment result, the method includes: if the judgment result is that the condition is not satisfied, writing the convolution calculation result of the current layer back to the memory, and performing the calculation when the calculation task of the next cycle is started.
When the processor 501 runs the computer program, it further performs: the case in which the judgment result is not satisfied includes: the current layer is of the convolution calculation type, and the next layer is also of the convolution calculation type.
When the processor 501 runs the computer program, it further performs: if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer and performing the hybrid calculation includes: the current layer is of the convolution calculation type and the next layer is of the hybrid calculation type; and taking out, from the cache region of the convolution calculation result, the input data required by the hybrid calculation of the next layer to perform the hybrid calculation.
When the processor 501 runs the computer program, it further performs: after writing the hybrid calculation result into the memory, the method further includes: processing the input data stream in a pipeline structure.
When the processor 501 runs the computer program, it further performs: before obtaining the calculation types of the neural network model, the method includes: taking out data according to the convolution size and stride, and performing the convolution calculation.
When the processor 501 runs the computer program, it further performs: before the data are taken out according to the convolution size and stride and the convolution calculation is performed, the method further includes: acquiring the data, and caching the data in blocks for storage according to a convolution rule.
In an exemplary embodiment, the present invention further provides a computer-readable storage medium, for example the memory 502 including a computer program, which can be executed by the processor 501 of the data processing device 500 to perform the steps of the aforementioned method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be any device including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
A computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, performs: obtaining the calculation types of a neural network model, including the calculation type of the current layer and the calculation type of the next layer; judging whether the calculation types meet a preset condition and obtaining a judgment result; if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer, performing the hybrid calculation, and obtaining a hybrid calculation result; and writing the hybrid calculation result into a memory; wherein the calculation types include at least a convolution calculation type and a hybrid calculation type.
The computer program, when executed by the processor, further performs: after judging whether the calculation type meets the preset condition and obtaining the judgment result, the method includes: if the judgment result is that the condition is not satisfied, writing the convolution calculation result of the current layer back to the memory, and performing the calculation when the calculation task of the next cycle is started.
The computer program, when executed by the processor, further performs: the case in which the judgment result is not satisfied includes: the current layer is of the convolution calculation type, and the next layer is also of the convolution calculation type.
The computer program, when executed by the processor, further performs: if the judgment result is that the condition is satisfied, taking out the input data required by the hybrid calculation of the next layer and performing the hybrid calculation includes: the current layer is of the convolution calculation type and the next layer is of the hybrid calculation type; and taking out, from the cache region of the convolution calculation result, the input data required by the hybrid calculation of the next layer to perform the hybrid calculation.
The computer program, when executed by the processor, further performs: after writing the hybrid calculation result into the memory, the method further includes: processing the input data stream in a pipeline structure.
The computer program, when executed by the processor, further performs: before obtaining the calculation types of the neural network model, the method includes: taking out data according to the convolution size and stride, and performing the convolution calculation.
The computer program, when executed by the processor, further performs: before the data are taken out according to the convolution size and stride and the convolution calculation is performed, the method further includes: acquiring the data, and caching the data in blocks for storage according to a convolution rule.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.