CN111931917A - Forward computing implementation method and device, storage medium and electronic device - Google Patents

Forward computing implementation method and device, storage medium and electronic device

Info

Publication number
CN111931917A
CN111931917A CN202010845682.1A
Authority
CN
China
Prior art keywords
neural network
target neural
calculation
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010845682.1A
Other languages
Chinese (zh)
Inventor
陈梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202010845682.1A priority Critical patent/CN111931917A/en
Publication of CN111931917A publication Critical patent/CN111931917A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for realizing forward calculation, a storage medium and an electronic device, wherein the method comprises the following steps: obtaining a calculation parameter corresponding to forward calculation of a target neural network, wherein the calculation parameter comprises at least one of the following: the structure of the target neural network, and the weight of the target neural network; performing a low-precision quantization process on the calculation parameters to obtain quantized data; the quantized data are processed through the digital signal processing DSP end to achieve forward calculation of the target neural network.

Description

Forward computing implementation method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of communications, and in particular, to a method and an apparatus for implementing forward computation, a storage medium, and an electronic apparatus.
Background
With the development of technology, deep learning is widely applied in fields such as image processing, voice recognition, text processing and security monitoring. It is usually deployed in embedded vision, so the real-time requirements on intelligent services are high.
A target neural network (CNN) is a feed-forward neural network for deep learning. It performs excellently in large-scale image processing and is currently widely used in fields such as image classification and localization. Compared with other neural network structures, the target neural network requires relatively few parameters, which is why it is widely applied and deeply researched. In practical image processing and pattern recognition applications, however, the target neural network is usually implemented with many network layers and has high computational complexity, including a large amount of intensive image convolution operations, which directly affects algorithm performance. When the CNN algorithm is implemented on a CPU processor, the limited computation rate means the operation performance of the algorithm cannot be significantly improved.
The related art provides a technical scheme that combines the hardware characteristics of a graphics processing chip, mainly improving data input efficiency and balancing the resource ratio between data input and data operation. One such scheme uses on-chip shared memory to optimize general matrix multiplication, but the hardware brings a large cost.
For the problem in the related art that, when a CPU performs the forward calculation of the target neural network, the calculation performance cannot be effectively improved due to the CPU's limited computing capability, no effective technical scheme has yet been provided.
Disclosure of Invention
The embodiment of the invention provides a forward calculation implementation method and device, a storage medium and an electronic device, which are used for at least solving the problems that the calculation performance cannot be effectively improved due to the limited calculation capability of a CPU (central processing unit) when the CPU is used for calculating a target neural network in the related art.
The embodiment of the invention provides a method for realizing forward calculation, which comprises the following steps: obtaining a calculation parameter corresponding to forward calculation of a target neural network, wherein the calculation parameter comprises at least one of the following: the structure of the target neural network, and the weight of the target neural network; performing a low-precision quantization process on the calculation parameters to obtain quantized data; and processing the quantized data through a digital signal processing DSP end to realize the forward calculation of the target neural network.
In this embodiment of the present invention, performing a low-precision quantization process on the calculation parameters to obtain quantized data includes: an int8 quantization process is performed on the calculated parameters to obtain quantized data.
In this embodiment of the present invention, performing int8 quantization on the calculation parameter to obtain quantized data includes: acquiring the floating-point type calculation parameters; and carrying out int8 quantization on the calculation parameters of the floating point type according to a first target formula to obtain quantized data.
In this embodiment of the present invention, performing the int8 quantization process on the floating-point calculation parameters according to a first target formula to obtain quantized data includes: acquiring the floating-point structure of the target neural network, C = X × Y + B, and the weight Y of the target neural network, wherein X is the input of the target neural network, Y is the weight of the target neural network, B is the bias of the target neural network, and X, Y and B are all floating-point data; converting both X and Y to int8 data by the formulas x = X × s1 and y = Y × s2; and obtaining the quantized data c by the formulas b = B × s1 × s2 and c = (x × y + b) × s / s1 / s2, wherein s, s1 and s2 are all fixed values, s1 and s2 are floating-point data, and s is the input quantization factor of the layer following the current layer of the target neural network.
In this embodiment of the present invention, before the quantized data is processed by a digital signal DSP end to implement forward computation of the target neural network, the method further includes: obtaining a code execution operator of the forward computation configured for the target neural network; and transmitting the code execution operator to the DSP end through a target function, so that the DSP end processes the quantized data in parallel according to the code execution operator to realize the forward calculation of the target neural network.
In this embodiment of the present invention, after obtaining the calculation parameters corresponding to the forward calculation of the target neural network, the method further includes: traversing the target neural network; in the event that a target network segment is detected to be present in the target neural network, combining weights of a target layer in the target neural network to a convolutional layer of the target neural network.
In an embodiment of the present invention, merging weights of a target layer in the target neural network to a convolutional layer of the target neural network comprises: merging weights of a BN layer and a Scale layer into a convolutional layer of the target neural network when the network segment of Conv, BN and Scale is detected to exist in the target neural network; incorporating weights of a BN layer into a convolutional layer of the target neural network upon detecting the presence of Conv and a network segment of a BN in the target neural network.
According to another embodiment of the present invention, there is also provided an apparatus for implementing forward computation, including: an obtaining module, configured to obtain a calculation parameter corresponding to forward calculation of a target neural network, where the calculation parameter includes at least one of: the structure of the target neural network, and the weight of the target neural network; the quantization module is used for executing a low-precision quantization process on the calculation parameters to obtain quantized data; and the processing module is used for processing the quantized data through a digital signal processing DSP end so as to realize the forward calculation of the target neural network.
In this embodiment of the present invention, the quantization module is further configured to perform an int8 quantization process on the calculation parameter to obtain quantized data.
In the embodiment of the present invention, the quantization module is further configured to obtain the floating-point type calculation parameter; and carrying out int8 quantization on the calculation parameters of the floating point type according to a first target formula to obtain quantized data.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, the calculation parameters corresponding to the forward calculation of the target neural network are obtained, wherein the calculation parameters comprise at least one of the following parameters: the structure of the target neural network, and the weight of the target neural network; performing a low-precision quantization process on the calculation parameters to obtain quantized data; the quantized data are processed through the digital signal processing DSP end to achieve forward calculation of the target neural network, namely the quantized data corresponding to the forward calculation can be processed through the DSP end, and then the forward calculation of the target neural network can be achieved through the DSP end.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal for implementing a forward computing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for implementing forward computation according to an embodiment of the present invention;
FIG. 3 is a flow chart diagram of a method for implementing forward computation in accordance with an alternative embodiment of the present invention;
fig. 4 is a block diagram of an apparatus for implementing forward computation according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method provided by the embodiment of the application can be executed on a computer terminal or a similar computing device. It should be noted that the computer terminal here is a computer terminal that includes a CPU + DSP heterogeneous platform. Taking execution on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal for the forward computing implementation method according to an embodiment of the present invention. As shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and in an exemplary embodiment may also include a transmission device 106 for communication functions and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in fig. 1, or have a different configuration with equivalent or greater functionality than that shown in fig. 1.
The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the forward computing implementation in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, thereby implementing the above-mentioned methods. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
According to an embodiment of the present invention, there is provided a method for implementing forward computation, and fig. 2 is a flowchart of a method for implementing forward computation according to an embodiment of the present invention, as shown in fig. 2, including:
step S202, obtaining a calculation parameter corresponding to forward calculation of the target neural network, wherein the calculation parameter includes at least one of the following: the structure of the target neural network, and the weight of the target neural network;
step S204, executing a low-precision quantization process on the calculation parameters to obtain quantized data;
and step S206, processing the quantized data through a digital signal processing DSP end to realize the forward calculation of the target neural network.
Through the technical scheme of the embodiment of the invention, the calculation parameters corresponding to the forward calculation of the target neural network are obtained, wherein the calculation parameters comprise at least one of the following parameters: the structure of the target neural network, and the weight of the target neural network; performing a low-precision quantization process on the calculation parameters to obtain quantized data; the quantized data are processed through the digital signal processing DSP end to achieve forward calculation of the target neural network, namely the quantized data corresponding to the forward calculation can be processed through the DSP end, and then the forward calculation of the target neural network can be achieved through the DSP end.
It should be noted that the low-precision quantization process may be an int8 quantization process, an int4 quantization process, or an int16 quantization process, and the following embodiment of the present invention takes an int8 quantization process as an example, and an implementation flow after the int4 quantization process and the int16 quantization process is similar to the int8 quantization process.
It should be noted that current CNN training models are basically floating-point float32 data, and converting float32 data into int8 can reduce the size of the model and increase speed without losing too much precision. The embodiment of the present invention introduces the conversion of float32 data into int8 data into the implementation process of forward calculation. Based on this, in an optional implementation of step S204, the following scheme can be implemented:
acquiring the floating-point calculation parameters; optionally, the int8 quantization process is performed on the floating-point calculation parameters according to a first target formula to obtain quantized data, which can be implemented by the following scheme: acquiring the floating-point structure of the target neural network, C = X × Y + B, and the weight Y of the target neural network, wherein X is the input of the target neural network, Y is the weight of the target neural network, B is the bias of the target neural network, and X, Y and B are all floating-point data; converting both X and Y to int8 data by the formulas x = X × s1 and y = Y × s2; and obtaining the quantized data c by the formulas b = B × s1 × s2 and c = (x × y + b) × s / s1 / s2, wherein s, s1 and s2 are all fixed values, s1 and s2 are floating-point data, and s is the input quantization factor of the layer following the current layer of the target neural network.
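As an illustration of the first target formula above, the following is a minimal C sketch of the per-layer quantized computation. The helper names (clamp_to_int8, quantized_dot) are hypothetical and only mirror the symbols X, Y, B, s, s1 and s2 used in this embodiment, and the float inputs are quantized on the fly purely for clarity (in the real flow the weights are quantized offline by the conversion tool).

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: round a float and clamp it to the int8 range. */
static int8_t clamp_to_int8(float v) {
    if (v > 127.0f) v = 127.0f;
    if (v < -128.0f) v = -128.0f;
    return (int8_t)lrintf(v);
}

/* Quantized computation of one output element of C = X*Y + B:
 *   x = X*s1, y = Y*s2 (both int8), b = B*s1*s2 (int32),
 *   c = (x*y + b) * s / s1 / s2, requantized with the next
 *   layer's input quantization factor s. */
static int8_t quantized_dot(const float *X, const float *Y, float B,
                            int n, float s1, float s2, float s) {
    int32_t acc = 0;
    int32_t b = (int32_t)lrintf(B * s1 * s2);   /* b = B * s1 * s2 */
    for (int i = 0; i < n; ++i) {
        int8_t x = clamp_to_int8(X[i] * s1);    /* x = X * s1 */
        int8_t y = clamp_to_int8(Y[i] * s2);    /* y = Y * s2 */
        acc += (int32_t)x * (int32_t)y;         /* int8 products accumulated in int32 */
    }
    return clamp_to_int8((float)(acc + b) * s / s1 / s2);  /* c = (x*y + b)*s/s1/s2 */
}
```

The int32 accumulation is what allows the DSP to batch many int8 multiply-add operations per SIMD instruction, as described below for the DSP end.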
It should be noted that the first target formula may also be another technical solution that can implement a process of int8 quantization on a floating-point type calculation parameter, which is not limited in the embodiment of the present invention.
In order for the DSP end to perform the forward calculation of the target neural network, the following preparation may be made before the DSP end carries out the forward calculation: obtaining a code execution operator of the forward computation configured for the target neural network; and transmitting the code execution operator to the DSP end through a target function, so that the DSP end processes the quantized data in parallel according to the code execution operator to implement the forward calculation of the target neural network, wherein the CPU forward framework calls the target function to enter the DSP for execution, which is not limited in the embodiment of the present invention.
In this embodiment of the present invention, after obtaining the calculation parameters corresponding to the forward calculation of the target neural network, the method further includes: traversing the target neural network; merging weights of a target layer in the target neural network into a convolutional layer of the target neural network in a case where a target network segment is detected to exist in the target neural network, specifically, merging weights of a BN layer and a Scale layer into a convolutional layer of the target neural network in a case where network segments of Conv, BN and Scale are detected to exist in the target neural network; incorporating weights of a BN layer into a convolutional layer of the target neural network upon detecting the presence of Conv and a network segment of a BN in the target neural network.
By the technical scheme, the calculation of the BN layer and the Scale layer can be reduced in the quantization process, and the data copying between the layers is reduced, so that the bandwidth is reduced.
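The merging formulas themselves are not spelled out in the text; the sketch below uses the standard batch-normalization folding identity, w' = w · gamma / sqrt(var + eps) and b' = (b − mean) · gamma / sqrt(var + eps) + beta, as an assumed reading of merging the BN and Scale weights into the convolutional layer, with hypothetical function and parameter names.

```c
#include <math.h>

/* Fold BN statistics (mean, var) and Scale parameters (gamma, beta) into
 * the convolution weights and bias of one output channel, so that a
 * Conv+BN+Scale segment collapses into a single convolution.
 * Assumed identities: w' = w * gamma / sqrt(var + eps)
 *                     b' = (b - mean) * gamma / sqrt(var + eps) + beta */
static void fold_bn_scale_into_conv(float *weights, int weights_per_channel,
                                    float *bias, float mean, float var,
                                    float gamma, float beta, float eps) {
    float k = gamma / sqrtf(var + eps);
    for (int i = 0; i < weights_per_channel; ++i)
        weights[i] *= k;
    *bias = (*bias - mean) * k + beta;
}
```

For the Conv+BN case without a Scale layer, the same routine applies with gamma = 1 and beta = 0.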
The DSP in the embodiment of the invention is a special-purpose processor for accelerating vision processing. It is programmable, supports vector fixed-point and (single-precision) floating-point operations, and a series of basic operation functions for intelligent analysis algorithms, as well as complex algorithms, can be developed on it. The DSP has a wide VLIW SIMD architecture, a highly optimized instruction set and a well-adapted image library, offers good optimization support for the int8 and fp32 data types, and can greatly increase the calculation rate and improve the overall performance of the device.
In summary, the technical solution of the embodiments of the present invention shows how, when multiple intelligent algorithms are deployed and CPU performance is limited, a Digital Signal Processor (DSP for short) can be used to achieve balanced resource optimization, improve the efficiency with which the CPU and the DSP cooperatively execute the target neural network algorithm, improve cooperative computing capability, overcome the shortage of computing resources, and ensure that the final algorithm result is essentially no different from the standard execution.
The following explains the implementation flow of the forward calculation with reference to several alternative embodiments, but is not intended to limit the technical solution of the embodiments of the present invention.
An alternative embodiment of the present invention provides a method for optimizing a target neural network algorithm under a CPU + DSP heterogeneous platform, as shown in fig. 3, the method including:
step S302, training a target neural network by using Caffe or other deep learning frameworks, which is not limited by the optional embodiment of the invention;
step S304, exporting the structure and weight of the trained target neural network to a file;
Step S306: the FP32 data model of the target neural network is converted into an int8 data model. Specifically, taking the target detection network as an example, the original FP32 weight/activation floating-point tensors are converted into int8/uint8 tensors for processing, and each layer of the target neural network receives int8 quantized input. Multiplication or multiply-add calculations that were originally of float32 type can then be performed at the DSP end using the int8 type, so that more calculations are done in one batch (SIMD instruction set), effectively improving the calculation performance of CNN inference.
In addition, since the weight range of each layer is basically determined and does not fluctuate much, the weights are suitable for quantization compression, which reduces memory access and the amount of calculation. The trained fp32 target neural network structure and weights are converted by quantization tool software into the int8 data type, and an int8 inference model is generated by setting the quantization picture set, the model preprocessing mode, the RGB channel order and so on. The quantization scheme is described as follows:
and (4) floating, wherein C is X, Y and B, X is input, Y is weight, B is bias, all three are Float type data, and s1 and s2 are floating point numbers.
X s1, changing X to int 8;
y — Y × s2, changing Y also to int 8;
b-bs 1-s 2(B is int32 type, the converting tool is complete);
c ═ y + b ═ s/s1/s2 (s s1 s2 are all constant values, s is the input blob quantization factor of the next layer), where c is the type of Int8 after quantization, and the layer without calculation parameters like relu and concat in the general target neural network can be calculated by using Int8, otherwise, it is fp32, and it can be dynamically set according to the situation, specifically, if the output layer is a convolution layer, a full-link layer or other layers not involved in calculation (such as permute, concat, etc.), the output data type of the network layer is set to Int8, otherwise, it is set to float 32. Regarding the selection of the quantization factors, fp32 type reasoning is carried out according to partial pictures in the training set, the maximum value of the absolute value of input data of each convolutional layer or full connected layer is counted, and then the quantization factors s1 and s2 are calculated by utilizing a global _ scale algorithm.
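The global_scale computation is not detailed further in the text; as one hedged reading, the sketch below derives a symmetric int8 quantization factor from the per-layer maximum absolute value collected during the fp32 calibration runs described above (the function name is illustrative).

```c
#include <math.h>
#include <stddef.h>

/* Derive a symmetric int8 quantization factor for one layer: run fp32
 * inference on part of the training set, track the maximum absolute
 * value of the layer's input data, and map that range onto [-127, 127]. */
static float global_scale_factor(const float *inputs, size_t count) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float a = fabsf(inputs[i]);
        if (a > max_abs) max_abs = a;
    }
    return (max_abs > 0.0f) ? 127.0f / max_abs : 1.0f;  /* e.g. s1 or s2 */
}
```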
For the CPU end + DSP end heterogeneous platform, the following technical scheme is realized at the CPU end:
Step S308: the CPU-end program loads the code model into the CNN framework to perform the forward calculation of the convolutional neural network. The target detection network is trained based on yolov3, the structure and weights of the convolutional network are converted into a corresponding int8 quantization model by the int8 quantization tool, and the model is then imported into the CNN framework. During the experiment, image frames are extracted from video; to give the CNN model higher detection accuracy, preprocessing operations such as scaling and mean subtraction are generally performed on the loaded images. The preprocessed image is then used as the network input, the forward calculation of the neural network is carried out according to the trained network structure, and detection results with higher confidence are screened out of the neural network output; the selection method in the target detection network is the post-processing logic of the detectout layer in yolov3.
The memory of the CPU (i.e., the main control ARM chip) is allocated through a cached interface to improve data processing efficiency. The forward framework mainly consists of four parts: NetParse (network parsing), Layer (layer implementation), Workspace (tensor management) and Runtime (platform operation set). The network parsing module parses the custom converted int8 quantization model file; the network layer module implements the forward process for different platforms; Workspace uniformly manages the memory of all input and output tensors; and Runtime is the platform- and hardware-related operation set, including the merging of specific layers. The platform supports multiple kinds of hardware forward acceleration, mainly including hardware-accelerated implementations for a general-purpose CPU, ARM, DSP and the like. The framework supports int8 quantization: int8 quantization of the floating-point weight data is done at the model stage, quantization and dequantization of the input and output are done for each layer during the forward process, and mixed forward computation of the CPU implementation (including ARM/DSP optimized implementations) and other hardware-accelerated implementations is supported.
In the embodiment of the present invention, most layers of the target neural network can use DSP instructions to perform vectorization operations, and if there are layers that are not suitable for vectorization, it is also considered to use, for example, arm-end NEON optimization, which is not limited in the embodiment of the present invention.
For the CPU program code, loop-nest optimization is performed and unnecessary cache-flush operations are reduced. The called functions are profiled to obtain the time-consumption proportion of each, and the corresponding core code is optimized and modified accordingly, for example by loop unrolling, loop fusion and inline substitution, or part of the functions are moved into the DSP for optimized implementation, improving the optimization effect of the program. In addition, different layers of the neural network can be represented by different member functions in the framework; the member-function parameters include the specification parameters of the layer, the input feature map and the weights of the layer, with the function return value serving as the output feature map. The execution of the layers is then chained together to obtain the output of the neural network, and the DSP optimization interface is called directly when a specific layer operator operation is invoked.
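To make the layer interface described above concrete, here is a hypothetical C sketch of such an operator signature; the type and field names are illustrative only, and for simplicity the output feature map is written through a pointer rather than returned.

```c
#include <stdint.h>

/* Hypothetical tensor type managed by Workspace: payload plus NCHW
 * dimensions and the quantization factor used for int8 tensors. */
typedef struct {
    void  *data;          /* int8 or fp32 payload, depending on the layer */
    int    n, c, h, w;    /* NCHW dimensions */
    float  scale;         /* quantization factor when data is int8 */
} Tensor;

/* Hypothetical per-layer specification parameters. */
typedef struct {
    int kernel_h, kernel_w;
    int stride_h, stride_w;
    int pad_h, pad_w;
} LayerSpec;

/* One network layer corresponds to one operator function: layer spec,
 * input feature map and layer weights in, output feature map out.
 * The DSP-optimized implementation sits directly behind this interface
 * when the operator runs on the DSP end. */
typedef int (*LayerForwardFn)(const LayerSpec *spec,
                              const Tensor *input,
                              const Tensor *weights,
                              Tensor *output);
```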
Step S310: the DSP vectorizes and executes the forward-calculation code of each layer to obtain the forward results, after which the subsequent algorithm processing flow runs. The DSP-end program implements the following scheme. The DSP performs the forward execution of various layers such as the convolutional layer, the FC layer, the pooling layer and eltwise; the convolutional layer can directly merge the parameters of the BN, scale and relu layers to simplify the operation. The operator optimization of each layer on the DSP executes the operator calculation following the forward code of the standard caffe framework and uses vectorized parallel computation. Each operator is passed to the DSP end as a separate message through arm. A very important structure used here is the ping-pong buffer, which can be understood as follows: two identical objects are used as buffers (the object type can be arbitrary) and the two objects are read and written alternately. The DSP internally contains two dram banks of 180 KB each; DMA is used to move data from the DDR directly into a bank in ping-pong buffer fashion, the calculation is performed, and the output data is then moved out to the external DDR. The dma data transfer needs very few cycles; on the operator-processing side, DSP assembly instructions are mainly used to reduce the consumed cycles, and efficiency is improved by performing data transfer and data processing in parallel.
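The following C-style sketch illustrates the ping-pong arrangement just described: two on-chip banks are filled and drained alternately so that DMA transfer overlaps with operator computation. The dma_* and compute_tile calls are stand-ins for the platform's actual DMA and operator APIs, and the in-place tile computation is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

#define BANK_SIZE (180 * 1024)   /* two on-chip banks of 180 KB, as described */

/* Hypothetical platform hooks: asynchronous DMA in/out, a barrier that
 * waits for outstanding DMA, and one tile of the layer operator. */
extern void dma_copy_in_async(void *dst, const void *src, size_t bytes);
extern void dma_copy_out_async(void *dst, const void *src, size_t bytes);
extern void dma_wait(void);
extern void compute_tile(int8_t *tile, size_t bytes);

static int8_t bank[2][BANK_SIZE];

/* Process a layer tile by tile: while tile i is computed in one bank,
 * tile i+1 is already being fetched into the other bank. */
static void forward_layer_pingpong(const int8_t *ddr_in, int8_t *ddr_out,
                                   size_t tiles, size_t tile_bytes) {
    dma_copy_in_async(bank[0], ddr_in, tile_bytes);        /* prefetch tile 0 */
    for (size_t i = 0; i < tiles; ++i) {
        dma_wait();                                        /* tile i is on-chip */
        if (i + 1 < tiles)                                 /* prefetch next tile */
            dma_copy_in_async(bank[(i + 1) & 1],
                              ddr_in + (i + 1) * tile_bytes, tile_bytes);
        compute_tile(bank[i & 1], tile_bytes);             /* run the operator */
        dma_copy_out_async(ddr_out + i * tile_bytes,       /* write result back */
                           bank[i & 1], tile_bytes);
    }
    dma_wait();                                            /* last write-back */
}
```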
The assembly instructions used in the DSP mainly include: load instructions that load 64 int8, 32 short or 16 int data in parallel; store instructions that store accumulated data in parallel; logic instructions that implement parallel AND/OR/NOR bit operations; multiply instructions that perform parallel multiplication or multiply-add on data; add instructions that perform the corresponding parallel additions and then accumulate; data-conversion instructions that convert flexibly between floating-point and integer types; select instructions that dynamically adjust the arrangement order of data within a single vector or a pair of vectors; and gather/scatter instructions that bring data at a given bank address into a register or scatter data from a register to the corresponding bank addresses. The specific application is as follows:
the optimization of several common operators in yolo networks is listed below
1) Convolutional layer:
the forward time consumption of the neural network is mainly determined by the time consumption of convolution, and the mainstream method is im2col + GEMM universal matrix multiplication. Outputc is an output channel, inputc is an input channel, kernel and kernel are convolution kernel widths and heights, and the optimization in the method mainly comprises the following steps:
a. To reduce the conversion step at run time, the convolution weights are rearranged in advance into the order outputc, kernelh, kernelw, inputc_aligned, where inputc_aligned = ((inputc + 3) / 4) × 4; a new piece of memory is cache-allocated for them and the cache is flushed for storage.
b. For convolution with special parameters, for example a large number of channels, the weight data may not fit into the RAM at one time; in that case the output channels need to be segmented, and the input data needs to be transported into the RAM multiple times.
c. The convolution calculation part uses different core functions under different configurations according to kernel parameters and stride parameters and carries out special DSP vectorization assembly instruction optimization
d. The input/output quantization and dequantization operations are all executed on the DSP, reducing the computation load on the cpu end. For example, after the specific convolution operation of each layer, depending on whether per-channel quantization is used, the int32 activation value is dequantized to fp32 after adding the bias parameter, or int32 is quantized to int8 (int8 range -128 to 127).
e. If a pad operation is included, pure C code can be used directly, because the pad initialization consumes few resources; some layers following the convolution, such as relu, can be merged into the convolution.
The relevant DSP vectorization instructions include the following operations. In the conversion from fp32 to int8, IVP_LAVN_2XF32_XP is used four times to load 64 fp32 data at once, IVP_FIRUNNDN_2XF32 rounds 16 fp32 data into an integer-shaped format, IVP_TRUNCN_2XF32 truncates 16 fp32 data into int32 data, and then instructions such as IVP_MAXN_2X32 (take maximum), IVP_MOVNX16_FROMN_2X32 (mov operation) and IVP_SEL2NX8I_S0 (select operation) are used to store 64 int8 data into the corresponding ram memory.
The int32-to-int8 conversion uses IVP_LAN_2X32_IP four times to load 64 int32 data at once, adds the int32-type bias, shifts right with IVP_SRAN_2X32 to convert the int32 data, and then obtains 64 int8 data through a series of operations such as IVP_MAXN_2X32, IVP_MINN_2X32, IVP_MOV2NX8_FROMNX16 and IVP_SEL2NX8I_S0.
The int32-to-fp32 flow is similar to the int32-to-int8 flow: 64 int32 data are loaded with four IVP_LAN_2X32_IP instructions, the bias is added and the result is multiplied by the quantization coefficient, and the output data is stored in ram using IVP_FLOATN_2X32, IVP_MAXN_2XF32 and IVP_SAN_2XF32_IP.
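To make the im2col + GEMM path concrete, here is a plain-C scalar reference of the core that the vector instructions above accelerate: int8 inputs and weights accumulated into int32, assuming no padding and that the weights have already been rearranged as in step a. It is a sketch of the computation, not the vectorized implementation.

```c
#include <stdint.h>

/* im2col (no padding): unpack the input so that the convolution becomes
 * a GEMM of shape [outputc] x [inputc*kernelh*kernelw] x [outh*outw]. */
static void im2col_int8(const int8_t *in, int inputc, int inh, int inw,
                        int kh, int kw, int stride, int outh, int outw,
                        int8_t *col /* [inputc*kh*kw][outh*outw] */) {
    for (int c = 0; c < inputc; ++c)
        for (int ky = 0; ky < kh; ++ky)
            for (int kx = 0; kx < kw; ++kx) {
                int row = (c * kh + ky) * kw + kx;
                for (int oy = 0; oy < outh; ++oy)
                    for (int ox = 0; ox < outw; ++ox)
                        col[row * outh * outw + oy * outw + ox] =
                            in[(c * inh + oy * stride + ky) * inw
                               + ox * stride + kx];
            }
}

/* GEMM core: weights [outputc][K] times columns [K][N] with int32
 * accumulators; this inner loop is what the multiply-add vector
 * instructions batch 64 int8 elements at a time. */
static void gemm_int8(const int8_t *w, const int8_t *col,
                      const int32_t *bias, int32_t *out,
                      int outputc, int K, int N) {
    for (int oc = 0; oc < outputc; ++oc)
        for (int n = 0; n < N; ++n) {
            int32_t acc = bias ? bias[oc] : 0;
            for (int k = 0; k < K; ++k)
                acc += (int32_t)w[oc * K + k] * (int32_t)col[k * N + n];
            out[oc * N + n] = acc;   /* requantized to int8 or fp32 afterwards */
        }
}
```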
2) Pooling layer
The target detection network uses max pooling. According to the comparison between the input/output featuremap size and the ram memory size, the implementation is split into two branches: one handles the case where the ram can hold the input/output featuremap, and the other divides the height of the input/output featuremap by 2 so that the ram can hold the input and output data at the same time, and then continues processing the remaining unprocessed height of the input data. Special instruction packaging is done for different kernel sizes, strides and input/output data types (int8 or fp32). The main instructions are as follows: IVP_MAX2NX8 compares two vectors of 64 int8 data pairwise to obtain the maximum values and stores them in vector form; IVP_MOV2NX8T is a mov operation with flag data that copies the filling data when the corresponding flag bit is valid; IVP_SEL2NX8I can flexibly select and generate part of the data by setting a macro; IVP_DSEL2NX8I and IVP_DSELN_2XF32I select and cross-interleave two int8 or fp32 vectors to generate the target vector data; IVP_SAV2NX8_XP stores 64 int8 data into a designated ram address space; and IVP_MAXN_2XF32 compares two vectors of 16 fp32 data pairwise to obtain the maximum values.
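A scalar reference for the max-pooling branch where the whole feature map fits in ram, assuming int8 data and no padding; in the optimized version the inner comparisons are replaced by IVP_MAX2NX8-style pairwise-max instructions.

```c
#include <stdint.h>

/* Scalar max pooling over one channel of an int8 feature map,
 * with outh = (inh - ksize) / stride + 1 and outw likewise. */
static void maxpool_int8(const int8_t *in, int inh, int inw,
                         int ksize, int stride,
                         int8_t *out, int outh, int outw) {
    for (int oy = 0; oy < outh; ++oy)
        for (int ox = 0; ox < outw; ++ox) {
            int8_t best = in[(oy * stride) * inw + ox * stride];
            for (int ky = 0; ky < ksize; ++ky)
                for (int kx = 0; kx < ksize; ++kx) {
                    int8_t v = in[(oy * stride + ky) * inw + ox * stride + kx];
                    if (v > best) best = v;
                }
            out[oy * outw + ox] = best;
        }
}
```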
3) eltwise layer
In the processing-dimension division, h × c is treated as one dimension, and the maximum processable size maxProcessDim2 is calculated from the input/output featuremap size and the ram memory: maxProcessDim2 = LOCAL_MEM_SIZE / (inBlobPitch × inBlobByte × inBlobNum + outBlobPitch × outBlobByte), where LOCAL_MEM_SIZE is 180 KB, inBlobPitch is the span of the input feature, inBlobByte is the number of bytes per input data element, inBlobNum is the number of input blobs, outBlobPitch is the span of the output blob, and outBlobByte is the number of bytes per output data element. The loop processes an input height of maxProcessDim2 at a time, including moving data in and out and executing the operator computation, until the h × c height of data has been processed. Eltwise has configuration modes such as taking the maximum, taking the product and summation, and is packaged separately for the different modes; the instructions used are similar to those of the convolution and are not described here.
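The tile-size computation above can be read as the following sketch, which returns how many rows of the processing dimension fit in local memory at once; the names follow the variables in the text and the 180 KB value is the local memory size given there.

```c
#include <stddef.h>

#define LOCAL_MEM_SIZE (180 * 1024)   /* on-chip memory available to the operator */

/* maxProcessDim2 = LOCAL_MEM_SIZE /
 *   (inBlobPitch * inBlobByte * inBlobNum + outBlobPitch * outBlobByte) */
static size_t max_process_dim2(size_t in_pitch, size_t in_byte, size_t in_num,
                               size_t out_pitch, size_t out_byte) {
    size_t per_row = in_pitch * in_byte * in_num + out_pitch * out_byte;
    return per_row ? (size_t)LOCAL_MEM_SIZE / per_row : 0;
}
```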
4) Concat layer: for int8 data input, the concat layer uses IVP_LAV2NX8_XP to load 64 int8 data at a time, and IVP_SAV2NX8_XP stores 64 int8 data into ram memory.
In short, first, when the length of the input and output data exceeds the range of the ram, the data must be segmented and processed in blocks; second, because the quantized model takes both integer and floating-point inputs, data transfer and processing must be handled in the program, and different instruction-optimization approaches are adopted for integer and floating-point data. Most of the forward time is spent in the convolution part: when optimizing with the DSP single-instruction multiple-data technique, the weight data needs to be rearranged in memory, and the calculation process is then optimized with vectorized instructions to reduce the convolution time. In addition, the number of DSP cores enabled when actually running the forward network can be adjusted dynamically; in theory, as long as the bandwidth is sufficient, multiple cores accelerate better than a single core.
5) When all layers of the target neural network are executed on the DSP, the results of all layers are compared with the standard output results; essentially no difference indicates normal operation. The accuracy and recall on a large test data set are then counted and compared, and the CNN forward result is post-processed by the algorithm to obtain the final algorithm result.
In the embodiment of the invention, support for the type of convolutional neural network is not limited to a lightweight target neural network, and the convolution processing is not limited to 1x1 convolution and 3x3 depthwise separable convolution, so the method is more general. Other layers such as pooling are also well parallelized. Through the parallelized processing of core-loop matrix multiplication operations such as convolution, the two resources, cpu and DSP, are fully utilized, achieving balanced resource optimization; with limited resources, the vectorized and optimized neural network calculation can maximally utilize the hardware resources of the embedded device.
Through this technical scheme, the 8-bit low-precision inference quantization converts the fp32 model into an int8 inference model, which reduces memory bandwidth and storage space and lowers system latency with almost no loss of precision.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for implementing forward computation is also provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and details of which have been already described are omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of an apparatus for implementing forward computation according to an embodiment of the present invention, as shown in fig. 4, the apparatus includes:
an obtaining module 40, configured to obtain a calculation parameter corresponding to a forward calculation of a target neural network, where the calculation parameter includes at least one of: the structure of the target neural network, and the weight of the target neural network;
a quantization module 42, configured to perform a low-precision quantization process on the calculation parameters to obtain quantized data;
and the processing module 44 is configured to process the quantized data through a digital signal processing DSP end to implement forward calculation of the target neural network.
Through the technical scheme of the embodiment of the invention, the calculation parameters corresponding to the forward calculation of the target neural network are obtained, wherein the calculation parameters comprise at least one of the following parameters: the structure of the target neural network, and the weight of the target neural network; performing a low-precision quantization process on the calculation parameters to obtain quantized data; the quantized data are processed through the digital signal processing DSP end to achieve forward calculation of the target neural network, namely the quantized data corresponding to the forward calculation can be processed through the DSP end, and then the forward calculation of the target neural network can be achieved through the DSP end.
In the embodiment of the present invention, the quantization module is further configured to obtain the floating-point type calculation parameter; and carrying out int8 quantization on the calculation parameters of the floating point type according to a first target formula to obtain quantized data.
Optionally, the quantization module is further configured to: acquire the floating-point structure of the target neural network, C = X × Y + B, and the weight Y of the target neural network, wherein X is the input of the target neural network, Y is the weight of the target neural network, B is the bias of the target neural network, and X, Y and B are all floating-point data; convert both X and Y to int8 data by the formulas x = X × s1 and y = Y × s2; and obtain the quantized data c by the formulas b = B × s1 × s2 and c = (x × y + b) × s / s1 / s2, wherein s, s1 and s2 are all fixed values, s1 and s2 are floating-point data, and s is the input quantization factor of the layer following the current layer of the target neural network.
It should be noted that the first target formula may also be another technical solution that can implement a process of int8 quantization on a floating-point type calculation parameter, which is not limited in the embodiment of the present invention.
In order for the DSP end to carry out the forward calculation of the target neural network, the processing module 44 is further configured to obtain a code execution operator of the forward computation configured for the target neural network; and to transmit the code execution operator to the DSP end through a target function, so that the DSP end processes the quantized data in parallel according to the code execution operator to implement the forward calculation of the target neural network, wherein the CPU forward framework calls the target function to enter the DSP for execution, which is not limited in the embodiment of the present invention.
In this embodiment of the present invention, the obtaining module 40 is further configured to traverse the target neural network; merging weights of a target layer in the target neural network into a convolutional layer of the target neural network in a case where a target network segment is detected to exist in the target neural network, specifically, merging weights of a BN layer and a Scale layer into a convolutional layer of the target neural network in a case where network segments of Conv, BN and Scale are detected to exist in the target neural network; incorporating weights of a BN layer into a convolutional layer of the target neural network upon detecting the presence of Conv and a network segment of a BN in the target neural network.
By the technical scheme, the calculation of the BN layer and the Scale layer can be reduced in the quantization process, and the data copying between the layers is reduced, so that the bandwidth is reduced.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, obtaining a calculation parameter corresponding to the forward calculation of the target neural network, wherein the calculation parameter comprises at least one of the following: the structure of the target neural network, and the weight of the target neural network;
s2, performing a low-precision quantization process on the calculation parameters to obtain quantized data;
and S3, processing the quantized data through a digital signal processing DSP end to realize the forward calculation of the target neural network.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, obtaining a calculation parameter corresponding to the forward calculation of the target neural network, wherein the calculation parameter comprises at least one of the following: the structure of the target neural network, and the weight of the target neural network;
s2, performing a low-precision quantization process on the calculation parameters to obtain quantized data;
and S3, processing the quantized data through a digital signal processing DSP end to realize the forward calculation of the target neural network.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for implementing forward computing, comprising:
obtaining a calculation parameter corresponding to forward calculation of a target neural network, wherein the calculation parameter comprises at least one of the following: the structure of the target neural network, and the weight of the target neural network;
performing a low-precision quantization process on the calculation parameters to obtain quantized data;
and processing the quantized data through a digital signal processing DSP end to realize the forward calculation of the target neural network.
2. The method of claim 1, wherein performing a low-precision quantization process on the calculated parameters to obtain quantized data comprises:
an int8 quantization process is performed on the calculated parameters to obtain quantized data.
3. The method of claim 1, wherein performing int8 quantization on the computation parameters to obtain quantized data comprises:
acquiring the floating-point type calculation parameters;
and carrying out int8 quantization on the calculation parameters of the floating point type according to a first target formula to obtain quantized data.
4. The method according to claim 3, wherein the int8 quantization process of the calculation parameters of the floating point type according to a first target formula to obtain quantized data comprises:
acquiring the floating-point structure of the target neural network, C = X × Y + B, and the weight Y of the target neural network, wherein X is the input of the target neural network, Y is the weight of the target neural network, B is the bias of the target neural network, and X, Y and B are all floating-point data;
converting both X and Y to int8 data by the formulas x = X × s1 and y = Y × s2;
obtaining the quantized data c by the formulas b = B × s1 × s2 and c = (x × y + b) × s / s1 / s2, wherein s, s1 and s2 are all fixed values, s1 and s2 are floating-point data, and s is the input quantization factor of the layer following the current layer of the target neural network.
5. The method of claim 1, wherein before processing the quantized data by a digital signal DSP port to achieve forward computation of the target neural network, the method further comprises:
obtaining a code execution operator of the forward computation configured for the target neural network;
and transmitting the code execution operator to the DSP end through a target function, so that the DSP end processes the quantized data in parallel according to the code execution operator to realize the forward calculation of the target neural network.
6. The method of claim 1, wherein after obtaining the corresponding calculation parameters for the forward calculation of the target neural network, the method further comprises:
traversing the target neural network;
in the event that a target network segment is detected to be present in the target neural network, combining weights of a target layer in the target neural network to a convolutional layer of the target neural network.
7. The method of claim 6, wherein merging weights of a target layer in the target neural network to a convolutional layer of the target neural network comprises:
merging weights of a BN layer and a Scale layer into a convolutional layer of the target neural network when the network segment of Conv, BN and Scale is detected to exist in the target neural network;
incorporating weights of a BN layer into a convolutional layer of the target neural network upon detecting the presence of Conv and a network segment of a BN in the target neural network.
8. An apparatus for implementing forward computation, comprising:
an obtaining module, configured to obtain a calculation parameter corresponding to forward calculation of a target neural network, where the calculation parameter includes at least one of: the structure of the target neural network, and the weight of the target neural network;
a quantization module, configured to perform a low-precision quantization process on the calculation parameters to obtain quantized data;
and a processing module, configured to process the quantized data through a digital signal processing (DSP) end to realize the forward calculation of the target neural network.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
CN202010845682.1A 2020-08-20 2020-08-20 Forward computing implementation method and device, storage medium and electronic device Pending CN111931917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845682.1A CN111931917A (en) 2020-08-20 2020-08-20 Forward computing implementation method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN111931917A (en) 2020-11-13

Family

ID=73304853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845682.1A Pending CN111931917A (en) 2020-08-20 2020-08-20 Forward computing implementation method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111931917A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491926A (en) * 2018-03-05 2018-09-04 东南大学 A kind of hardware-accelerated design method of the efficient depth convolutional neural networks of low bit based on logarithmic quantization, module and system
US20200218962A1 (en) * 2019-01-09 2020-07-09 Samsung Electronics Co., Ltd. Method and apparatus for neural network quantization
CN110852434A (en) * 2019-09-30 2020-02-28 成都恒创新星科技有限公司 CNN quantization method, forward calculation method and device based on low-precision floating point number
CN111178514A (en) * 2019-12-31 2020-05-19 翱捷智能科技(上海)有限公司 Neural network quantification method and system
CN111176853A (en) * 2020-02-19 2020-05-19 珠海市杰理科技股份有限公司 Data quantization method and device, computer equipment and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561933A (en) * 2020-12-15 2021-03-26 深兰人工智能(深圳)有限公司 Image segmentation method and device
CN112580492A (en) * 2020-12-15 2021-03-30 深兰人工智能(深圳)有限公司 Vehicle detection method and device
CN112506523A (en) * 2020-12-21 2021-03-16 上海携旅信息技术有限公司 BERT model optimization method and system, electronic device and storage medium
CN113570033A (en) * 2021-06-18 2021-10-29 北京百度网讯科技有限公司 Neural network processing unit, neural network processing method and device
CN113570034A (en) * 2021-06-18 2021-10-29 北京百度网讯科技有限公司 Processing device, neural network processing method and device
JP2022116266A (en) * 2021-06-18 2022-08-09 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Neural network processing unit, neural network processing method and device thereof
CN113570034B (en) * 2021-06-18 2022-09-27 北京百度网讯科技有限公司 Processing device, neural network processing method and device
JP7408723B2 (en) 2021-06-18 2024-01-05 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Neural network processing unit, neural network processing method and device
CN113537340A (en) * 2021-07-14 2021-10-22 深圳思悦创新有限公司 Yolo target detection model compression method, system and storage medium
CN115018076A (en) * 2022-08-09 2022-09-06 聚时科技(深圳)有限公司 AI chip reasoning quantification method for intelligent servo driver
CN115018076B (en) * 2022-08-09 2022-11-08 聚时科技(深圳)有限公司 AI chip reasoning quantification method for intelligent servo driver

Similar Documents

Publication Publication Date Title
CN111931917A (en) Forward computing implementation method and device, storage medium and electronic device
CN107657316B (en) Design of cooperative system of general processor and neural network processor
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN107451659B (en) Neural network accelerator for bit width partition and implementation method thereof
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
KR20180083030A (en) Convolutional neural network system having binary parameter and operation method thereof
KR102335955B1 (en) Convolution neural network system and operation method thereof
CN110929865B (en) Network quantification method, service processing method and related product
CN108012156A (en) A kind of method for processing video frequency and control platform
US20220004858A1 (en) Method for processing artificial neural network, and electronic device therefor
CN111401550A (en) Neural network model quantification method and device and electronic equipment
CN110764885A (en) Method for splitting and unloading DNN (digital network) tasks of multiple mobile devices
CN115081588A (en) Neural network parameter quantification method and device
CN113849293A (en) Data processing method, device, system and computer readable storage medium
CN114698395A (en) Quantification method and device of neural network model, and data processing method and device
CN111694643A (en) Task scheduling execution system and method for graph neural network application
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN114071141A (en) Image processing method and equipment
CN109242091B (en) Image recognition method, device, equipment and readable storage medium
CN117273092A (en) Model quantization method and device, electronic equipment and storage medium
CN111382839A (en) Method and device for pruning neural network
CN109542513B (en) Convolutional neural network instruction data storage system and method
CN112561050A (en) Neural network model training method and device
CN112532251A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination