CN113298259B - CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform - Google Patents

CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform

Info

Publication number
CN113298259B
CN113298259B
Authority
CN
China
Prior art keywords
function
layer
pooling
cnn
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110647708.6A
Other languages
Chinese (zh)
Other versions
CN113298259A (en)
Inventor
王嘎
杨洋
唐强
韩文俊
丁琳琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 14 Research Institute
Original Assignee
CETC 14 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 14 Research Institute
Priority to CN202110647708.6A
Publication of CN113298259A
Application granted
Publication of CN113298259B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention belongs to the technical field of radar information processing and discloses a CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism on an embedded platform. A model file trained by a deep learning framework is read, and the weight and bias parameters are extracted from it and defined with pointer variables; the operations in the CNN network are each encapsulated into operation kernel functions, and a general programming interface is designed for constructing the prediction function of the CNN network reasoning framework; the prediction function is bound to core numbers of the multi-core processor based on the processor's multithreading mechanism; a corresponding VSIPL static library is written according to the platform type on which the CNN network reasoning framework is to be deployed; and the CNN network reasoning framework is deployed on each operating system. The invention establishes a reasoning framework for neural network models on embedded platforms, meets the real-time processing requirements of application scenarios, supports various chips, and is compatible with various operating systems.

Description

CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform
Technical Field
The invention belongs to the technical field of radar information processing, and particularly relates to a CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of an embedded platform.
Background
On the DSP platforms of embedded systems, the main challenge facing deep learning is the lack of a unified, general high-performance reasoning framework. Google's deep learning framework TensorFlow is only applicable to mobile CPUs and GPUs, and Baidu's artificial intelligence framework PaddlePaddle is likewise unsuitable for DSP platforms. Deploying low-storage, low-complexity deep learning frameworks on low-cost, low-energy-consumption, computing-power-limited embedded systems remains challenging. Hardware manufacturers are striving to develop special-purpose artificial intelligence chips, accelerate in hardware, optimize deep learning algorithms for embedded devices, and perform high-performance parallel computation on embedded platforms, so establishing a deep learning framework for the embedded platform has become a new solution. At present, DSP platforms lack a deep learning framework: on an embedded DSP platform, a neural network algorithm either has no inference framework or the inference framework's real-time performance is poor.
Disclosure of Invention
Aiming at the prior-art problems that a neural network algorithm has no inference framework or that the inference framework's real-time performance is poor, the invention provides a CNN (convolutional neural network) inference framework design method supporting multi-core parallelism of an embedded platform, which establishes an inference framework for neural network models on embedded platforms such as CPUs (central processing units) and DSPs (digital signal processors) to meet the real-time processing requirements of application scenarios.
Specifically, the invention is realized by adopting the following technical scheme.
The invention provides a CNN network reasoning framework design method supporting multi-core parallelism of an embedded platform, which comprises the following steps:
CNN network model loading: reading a model file trained by the deep learning framework, extracting weights and bias parameters from the model file, and outputting the model weights and bias parameters defined by pointer variables;
CNN network function encapsulation: the convolution, pooling, activation and full-connection operations in the CNN network are each encapsulated into operation kernel functions using a vector instruction set, assembly language and the C language; the input of each operation kernel function is the model weights and biases defined by the pointer variables, and the outputs are a convolution layer function, a pooling layer function, an activation function and a full-connection layer function, respectively; basic block and view objects are designed based on the VSIPL standard, and the encapsulated convolution layer function, pooling layer function, activation function and full-connection layer function, including their function parameters, are unified to form operation kernel functions with a general neural network programming interface; the prediction function of the CNN (convolutional neural network) reasoning framework is constructed from the operation kernel functions with the general neural network programming interface;
And (3) carrying out parallelization design: based on a multithreading mechanism of the multi-core processor, taking a prediction function of the CNN reasoning framework as a thread function, creating a plurality of tasks or a plurality of threads, designing thread synchronization and communication, dividing data of an input test data set based on a load balancing principle, and binding each task or thread to a core number of the multi-core processor through a thread binding function;
Performing cross-platform design: writing a corresponding VSIPL static library according to the platform type on which the CNN network reasoning framework is to be deployed; the CNN network reasoning framework is deployed on the VxWorks, Linux, Windows, SylixOS or ReWorks operating system.
Further, the trained model file is a binary file containing model parameters and consists of control header parameters and data;
The control header parameters are integers: the 1st word of the control header is the number of layers of the neural network; the 2nd, 3rd and 4th words are the dimensions of the weight matrix of the first-layer neural network model; the 5th, 6th and 7th words are the dimensions of the first-layer pooling; the 8th word is the dimension of the first-layer pooling bias; the 9th, 10th and 11th words are the dimensions of the second-layer pooling; the 12th word is the dimension of the second-layer pooling bias; and so on, up to the last layer of the neural network;
The data consist of the weights and bias data of the neural network models from the first layer to the last layer, stored in the binary file in sequence according to the values of the control header parameters.
Further, the convolution layer function carries out one-dimensional, two-dimensional or three-dimensional convolution, with the dimension of the convolution operation and the number and size of the convolution kernels set as parameters;
the pooling layer function performs one-dimensional, two-dimensional or three-dimensional pooling, with the dimension, pooling type, interval and step length of the pooling operation set as parameters;
the full-connection layer function sets the dimension of the weight matrix;
the activation function sets the activation function type.
Further, designing the basic block and view objects based on the VSIPL standard, performing the programming interface design on the encapsulated convolution layer function, pooling layer function, activation function and full-connection layer function, including their function parameters, and unifying them into the operation kernel functions with the general neural network programming interface comprises:
defining basic blocks and views based on the VSIPL computing middleware standard, binding the pointer variables output by the CNN network model loading into basic blocks, and extracting data from the basic blocks and binding it into views, the views being matrices or vectors; and calling the operation kernel functions with the converted matrices or vectors as input parameters.
Further, the data partitioning of the input test data set based on the load balancing principle includes:
The input test data set is divided evenly into N parts, where N is the number of cores of the multi-core processor; N tasks or threads are created and bound to the N cores of the multi-core processor, and execution proceeds in a data-parallel manner.
The CNN network reasoning framework design method supporting embedded-platform multi-core parallelism has the following beneficial effects:
A processing framework for convolutional neural network reasoning on embedded platforms is established, and convolutional neural network models are built quickly from the encapsulated kernel functions, lowering the threshold for developing artificial intelligence algorithms on embedded platforms and improving the reasoning efficiency of convolutional neural networks on CPU and DSP platforms;
The artificial intelligence algorithm is given a multi-core parallel design within the embedded-platform reasoning framework; the multi-core parallel reasoning framework automatically divides data and computing tasks and maps them onto hardware threads, ensuring high-speed real-time processing of convolutional neural networks on CPU and DSP platforms and fully exploiting the hardware resources of the DSP platform;
A general convolutional neural network interface is provided; using the custom convolutional neural network operation kernel function operators, the programming interface and the low-level assembly function library, developers can carry out secondary development against the programming interface provided by the invention, improving programming efficiency.
Drawings
Fig. 1 is a schematic diagram of a CNN network reasoning framework design method supporting multi-core parallelism of an embedded platform according to this embodiment.
Fig. 2 is a schematic diagram of the data arrangement in the model file of the present embodiment.
Fig. 3 is a forward reasoning flowchart of the CNN model of the present embodiment.
Detailed Description
The invention is described in further detail below with reference to the examples and the accompanying drawings.
Example 1:
This embodiment of the invention relates to a CNN network reasoning framework design method supporting DSP multi-core parallelism. As shown in Fig. 1, the CNN network reasoning framework design method supporting DSP multi-core parallelism of the present embodiment comprises:
1. CNN network model loading
The C language is used for file reading and writing: the model file trained by the deep learning framework (for example, TensorFlow or PyTorch) is read, the weight and bias parameters are extracted from it, the model weights and biases defined by pointer variables are output, and they are written to a new file. The trained model file is a binary file containing the model parameters, takes .bin as its suffix, and consists of control header parameters and data. The control header parameters are integers; as shown in Fig. 2, the 1st word of the control header is the number of layers of the neural network, the 2nd, 3rd and 4th words are the dimensions of the weight matrix of the first-layer neural network model, the 5th, 6th and 7th words are the dimensions of the first-layer pooling, the 8th word is the dimension of the first-layer pooling bias, the 9th, 10th and 11th words are the dimensions of the second-layer pooling, and the 12th word is the dimension of the second-layer pooling bias; and so on, up to the last layer of the neural network. The data consist of the weights and bias data of the neural network models from the first layer to the last layer, stored in the binary file in sequence according to the values of the control header parameters.
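As an illustration of this loading step, a minimal C sketch is given below. It is reconstructed from the header layout of Fig. 2 under the assumption of a regular per-layer header record (three weight-dimension words, three pooling words, one bias word); the names LayerParams and load_model, and the use of 32-bit integers and single-precision floats, are assumptions rather than the patent's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int   weight_dims[3];  /* e.g. words 2-4: weight matrix dimensions */
    int   pool_dims[3];    /* e.g. words 5-7: pooling dimensions       */
    int   bias_dim;        /* e.g. word 8: bias dimension              */
    float *weights;        /* pointer variables later bound to blocks  */
    float *bias;
} LayerParams;

int load_model(const char *path, LayerParams **out, int *nlayers)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    fread(nlayers, sizeof(int), 1, fp);          /* word 1: layer count */
    LayerParams *L = calloc(*nlayers, sizeof(LayerParams));

    for (int i = 0; i < *nlayers; i++) {         /* rest of the header  */
        fread(L[i].weight_dims, sizeof(int), 3, fp);
        fread(L[i].pool_dims,   sizeof(int), 3, fp);
        fread(&L[i].bias_dim,   sizeof(int), 1, fp);
    }
    for (int i = 0; i < *nlayers; i++) {         /* data follows header */
        long n = (long)L[i].weight_dims[0] * L[i].weight_dims[1]
               * L[i].weight_dims[2];
        L[i].weights = malloc(n * sizeof(float));
        L[i].bias    = malloc(L[i].bias_dim * sizeof(float));
        fread(L[i].weights, sizeof(float), n, fp);
        fread(L[i].bias,    sizeof(float), L[i].bias_dim, fp);
    }
    fclose(fp);
    *out = L;
    return 0;
}
```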
2. CNN network function encapsulation
(1) A vector instruction set, assembly language and the C language are used to encapsulate the operations in the CNN network (convolution, pooling, activation and full-connection operations, and the prediction function) into operation kernel functions. The input of each operation kernel function is the model weights and biases defined by pointer variables, and the outputs are the convolution layer function, pooling layer function, activation function, normalization function, full-connection layer function, and so on.
The kernel functions defined in the present invention are shown in Table 1.
Table 1 Convolutional neural network unified programming interface
The operation kernel functions comprise the convolution layer function, pooling layer function, activation function, full-connection layer function and prediction function. Specifically:
A convolution layer function performs one-dimensional, two-dimensional or three-dimensional convolution, with the dimension of the convolution operation and the number and size of the convolution kernels set as parameters, and is encapsulated into an operation kernel function;
A pooling layer function performs one-dimensional, two-dimensional or three-dimensional pooling, with the dimension, pooling type, interval and step length of the pooling operation set as parameters, and is encapsulated into an operation kernel function;
The activation functions are encapsulated, with ReLU, Softmax, Tanh and other activation functions designed;
The full-connection layer function is encapsulated, with the dimension of the weight matrix designed;
The prediction function is constructed from the convolution layer function, pooling layer function, activation function and full-connection layer function.
(2) Basic block and view objects are designed based on the VSIPL standard, a programming interface design is carried out for each encapsulated operation kernel function (convolution layer function, pooling layer function, activation function, full-connection layer function), including its function parameters, and they are unified into operation kernel functions with a general neural network programming interface. Namely:
Basic blocks (block) and views (view) are defined based on the VSIPL (Vector Signal and Image Processing Library) computing middleware standard; the output of the previous step (neural network model loading), i.e. the pointer variables, is bound as blocks, and data is extracted from each block and bound as a view, where a view is a matrix or a vector. Through block and view binding, the model parameters are converted into matrices and vectors, and the converted matrices or vectors are used as input parameters (see p1 in Table 1) to call the operation kernel functions defined by the invention.
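A minimal C sketch of this block/view binding follows. The block and view calls (vsip_blockbind_f, vsip_blockadmit_f, vsip_mbind_f) are standard VSIPL; the helper name bind_weights and the row-major layout are assumptions made for illustration.

```c
#include <vsip.h>

/* Bind a weight pointer (rows x cols, row-major) produced by the model-
 * loading step into a VSIPL block, then extract a matrix view that can be
 * passed as parameter p1 to the operation kernel functions. */
vsip_mview_f *bind_weights(float *weights, int rows, int cols)
{
    /* Bind the user pointer into a block (vsip_init must have been
     * called once before any VSIPL use). */
    vsip_block_f *blk = vsip_blockbind_f(weights,
                                         (vsip_length)rows * cols,
                                         VSIP_MEM_NONE);
    vsip_blockadmit_f(blk, VSIP_TRUE);   /* hand the data over to VSIPL */

    /* Row-major matrix view: column stride = cols, row stride = 1. */
    return vsip_mbind_f(blk, 0, cols, rows, 1, cols);
}
```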
The operation kernel functions defined by the invention are designed based on the VSIPL standard: their parameters are in a standard data format, the data parameters are matrices or vectors, and the remaining parameters are application-related, so during secondary development only the algorithm needs attention and only the parameters need to be set. The operation kernel functions are optimized in assembly language, which fully improves pipeline efficiency and the cache hit rate.
The operation kernel functions are written in the assembly languages of the different architectures and can support x86, MIPS or PowerPC.
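The patent does not publish its assembly kernels; as a hedged illustration of the kind of vectorization meant here, the following C sketch uses x86 SSE intrinsics for the dot product at the core of convolution and full connection (a MIPS or PowerPC port would use that architecture's own intrinsics or assembly).

```c
#include <immintrin.h>

/* Vectorized dot product: processes 4 floats per iteration, then a
 * scalar tail. Illustrative of the kernel-level vectorization only. */
float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)          /* remaining elements */
        sum += a[i] * b[i];
    return sum;
}
```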
(3) The CNN network prediction function is constructed from the operation kernel functions (convolution layer function, pooling layer function, activation function and full-connection layer function) with the general neural network programming interface.
3. Parallelization design for CNN network prediction function
Based on the multithreading mechanisms of multi-core processors such as CPUs and DSPs, the CNN network prediction function is used as the thread function: several tasks (task) or threads (pthread) are created, thread synchronization and communication are designed, the input test data set is partitioned based on the load-balancing principle, and each task or thread is bound to a core number of the multi-core processor through a thread-binding function. That is, the input test data set is divided evenly into N parts, where N is the number of cores of the multi-core processor and also the parallelism of the parallel processing, and the tasks (task) or threads (pthread) are bound to the N cores of the multi-core processor and executed in a data-parallel manner. The data to be predicted is divided evenly into several parts, the model parameters loaded in the model-loading step are called, the thread function is written with the encapsulated operation kernel functions, and the thread functions are bound to the cores of the multi-core processor for execution, realizing the automatic mapping of data and computing tasks onto the multi-core processor.
The prediction function is given a parallel design and provided with a parallelism parameter, so the forward-propagation computing task of the convolutional neural network can be mapped onto the multi-core processor; the test data is partitioned, independent computing tasks are divided among tasks, and high-speed real-time processing is realized.
On the hardware platform, the data can be partitioned according to the validation data set; a multithreading or multitasking mechanism is created, data and computing tasks are automatically mapped onto the hardware, and single-machine multi-core operation, data parallelism and task parallelism are supported. A sketch of this scheme follows.
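The following C sketch illustrates the data-parallel scheme on a Linux pthread target; pthread_setaffinity_np is a GNU extension (VxWorks would use its task-spawning and CPU-affinity routines instead), and the names predict_fn, Slice, NCORES and run_parallel are illustrative assumptions, not the patent's.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NCORES 4

typedef struct {
    const float *samples;   /* this thread's slice of the test set */
    int          count;     /* number of samples in the slice      */
    int         *labels;    /* where predicted labels are written  */
} Slice;

/* Thread function wrapping the CNN prediction function (assumed). */
extern void *predict_fn(void *arg);

void run_parallel(const float *data, int total, int sample_len, int *labels)
{
    pthread_t tid[NCORES];
    Slice     slice[NCORES];
    int per_core = total / NCORES;          /* load-balanced even split */

    for (int i = 0; i < NCORES; i++) {
        slice[i].samples = data + (long)i * per_core * sample_len;
        slice[i].count   = (i == NCORES - 1) ? total - i * per_core
                                             : per_core;
        slice[i].labels  = labels + i * per_core;
        pthread_create(&tid[i], NULL, predict_fn, &slice[i]);

        cpu_set_t set;                      /* bind thread i to core i */
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < NCORES; i++)
        pthread_join(tid[i], NULL);
}
```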
Based on the parallel running mechanisms of different operating systems, the prediction function can run on Linux, Windows, VxWorks, SylixOS, ReWorks and other operating systems.
4. Performing cross-platform design
Different platforms call different VSIPL static libraries; a corresponding VSIPL static library is written according to the platform type on which the CNN network reasoning framework is to be deployed, and the CNN network reasoning framework is deployed on the VxWorks, Linux, Windows, SylixOS or ReWorks operating system.
Convolutional neural network reasoning code is written against the neural network programming interface defined by the invention; a secondary-development user can write any convolutional neural network reasoning code as required and link the static library of the corresponding platform, so that the operation kernel functions can be deployed and run on different hardware platforms, achieving cross-platform high performance. Meanwhile, the VSIPL static library is implemented in the assembly languages and C language of the different instruction sets; this high-performance function library fully exploits the computing resources and improves software performance.
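For example, a single source tree might select the platform-specific VSIPL static library at build time with conditional compilation, as in the hedged sketch below; the macro tests are standard predefined platform macros, but the header and library file names are illustrative assumptions.

```c
/* Select the platform-specific VSIPL build (library names assumed). */
#if defined(__VXWORKS__)
#  include "vsip_vxworks.h"      /* link against libvsipl_dsp.a  */
#elif defined(__linux__)
#  include "vsip_linux.h"        /* link against libvsipl_x86.a  */
#elif defined(_WIN32)
#  include "vsip_win.h"          /* link against vsipl_x86.lib   */
#else
#  error "unsupported platform for the CNN inference framework"
#endif
```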
Example 2:
Another embodiment of the invention relates to a multi-core parallel processing method for CNN (convolutional neural network) reasoning on a DSP. In this embodiment, the convolutional-neural-network-based target recognition reasoning part can run on a DSP platform under the VxWorks operating system or on an x86 CPU platform under the Linux operating system.
As shown in Fig. 3, the forward reasoning flow of the neural network model is based on a convolutional neural network for identifying aircraft models; the model has four neural network layers in total. A piece of data read from the training data set, a vector of length 150, is used as the input for forward reasoning. The first convolution layer performs 6-channel convolution and average pooling with a step length of 2; the second convolution layer performs 12-channel convolution and average pooling with a step length of 2; the third layer is a full-connection layer; the fourth layer is the output layer.
The multi-core parallel processing method of CNN (convolutional neural network) reasoning on the DSP comprises the following steps:
1. CNN network model loading
The CNN network model file trained with TensorFlow is loaded; the model file is a .bin file whose data arrangement is shown in Fig. 2. The fourth (output) layer has no pooling, and its pooling parameter defaults to [1 1 1]. The header and data are read in C to obtain a 4-layer model: the first layer has 6 convolution kernels of size [1 5], a pooling step length of 2 and a bias of 6; the second layer has 12 convolution kernels of size [6 5], a pooling step length of 2 and a bias of 12; the third layer is a full-connection layer with a weight size of [384 50] and a bias of 50; the fourth layer is the output layer with a weight size of [50 10] and a bias of 10.
2. CNN network function encapsulation
A vector instruction set, assembly language and the C language are used to encapsulate the operations in the CNN network (convolution, pooling, activation and full-connection operations, and the prediction function) into operation kernel functions. These include the vsip_conv1d function (first-layer convolution), vsip_avgpool_f function (first-layer pooling), vsip_active_f function (first-layer activation), vsip_conv1d function (second-layer convolution), vsip_avgpool_f function (second-layer pooling), vsip_active_f function (second-layer activation), vsip_fullnet_f function (third-layer full connection) and vsip_fullnet_f function (fourth-layer output), where the output layer (the fourth layer) outputs the confidence probabilities of the aircraft-model classes (10 probability values).
In Fig. 2, D1, D2, D3 and D4 represent the weight data of the first layer (L1), second layer (L2), third layer (L3) and fourth layer (L4), respectively. The D1, D2, D3 and D4 data are bound to blocks, views (matrices or vectors) are obtained from the blocks, and the views are then used as inputs to call the vsip_conv1d, vsip_avgpool_f and vsip_active_f functions for the first and second layers and the vsip_fullnet_f functions for the third and fourth layers. A programming interface is designed for the encapsulated operation kernel functions, including their function parameters, based on the VSIPL-standard block and view object designs, and unified into the general neural network programming interface.
The prediction function of the convolutional neural network inference framework is constructed from the encapsulated vsip_conv1d (first-layer convolution), vsip_avgpool_f (first-layer pooling), vsip_active_f (first-layer activation), vsip_conv1d (second-layer convolution), vsip_avgpool_f (second-layer pooling), vsip_active_f (second-layer activation), vsip_fullnet_f (third-layer full connection) and vsip_fullnet_f (fourth-layer output) functions, as sketched below.
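A sketch of the resulting four-layer prediction function is given below, chaining the kernels in the order just listed. The patent names these functions but does not publish their parameter lists, so every signature, the Model and Scratch structures, and the VSIP_RELU selector are assumptions made for illustration.

```c
#include <vsip.h>

typedef struct {                /* loaded model parameters (assumed form) */
    vsip_mview_f *w1, *w2, *w3, *w4;
    vsip_vview_f *b1, *b2, *b3, *b4;
} Model;

typedef struct {                /* pre-allocated intermediate views */
    vsip_vview_f *c1, *p1, *c2, *p2, *f3;
} Scratch;

/* Assumed operator signatures: input view, parameters, output view. */
typedef enum { VSIP_RELU, VSIP_SOFTMAX, VSIP_TANH } vsip_act_t;
extern void vsip_conv1d(const vsip_vview_f *in, const vsip_mview_f *w,
                        const vsip_vview_f *b, vsip_vview_f *out);
extern void vsip_avgpool_f(const vsip_vview_f *in, int step, vsip_vview_f *out);
extern void vsip_active_f(const vsip_vview_f *in, vsip_act_t f, vsip_vview_f *out);
extern void vsip_fullnet_f(const vsip_vview_f *in, const vsip_mview_f *w,
                           const vsip_vview_f *b, vsip_vview_f *out);

void cnn_predict(const vsip_vview_f *x, const Model *m,
                 Scratch *s, vsip_vview_f *probs /* 10 confidences */)
{
    vsip_conv1d(x, m->w1, m->b1, s->c1);        /* layer 1: 6 kernels [1 5]  */
    vsip_avgpool_f(s->c1, 2, s->p1);            /* average pool, step 2      */
    vsip_active_f(s->p1, VSIP_RELU, s->p1);     /* activation                */

    vsip_conv1d(s->p1, m->w2, m->b2, s->c2);    /* layer 2: 12 kernels [6 5] */
    vsip_avgpool_f(s->c2, 2, s->p2);
    vsip_active_f(s->p2, VSIP_RELU, s->p2);

    vsip_fullnet_f(s->p2, m->w3, m->b3, s->f3); /* layer 3: FC [384 50]      */
    vsip_fullnet_f(s->f3, m->w4, m->b4, probs); /* layer 4: output [50 10]   */
}
```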
3. Parallelization design
The prediction function of the convolutional neural network reasoning framework is used as the thread function; the data to be predicted is divided evenly into N parts (N being the parallelism of the multithreaded parallel processing), N threads are created and bound to the N cores of the multi-core processor, and execution proceeds in a data-parallel manner.
4. Performing cross-platform design
A corresponding VSIPL static library is written according to the platform type on which the CNN network reasoning framework is to be deployed; the reasoning framework is deployed on the VxWorks or Linux operating system, completing the cross-platform design, predicting the labels of the test data set and improving the accuracy of target recognition and classification.
The CNN network reasoning framework design method supporting DSP multi-core parallelism provided by the invention is based on a convolutional neural network reasoning framework for DSP platforms. By means of model loading, a defined operation-kernel-function operator library and low-level assembly optimization, together with the multi-core parallel software design method and encapsulation behind a general neural network interface, the neural network layers, pooling layers and full-connection layers are encapsulated and data-parallelized, meeting the high-performance target recognition processing requirements of a radar system with cross-platform, high-performance, parallelized and easy-to-use characteristics.
(1) Cross-platform: general convolutional neural network operation kernel function operators are designed based on VSIPL standard interfaces; the operation kernel functions are encapsulated and a unified programming interface is established, enabling write-once, run-anywhere operation; Linux, Windows, VxWorks, SylixOS, ReWorks and other operating systems are supported, as are the x86, MIPS and PowerPC architectures.
(2) High performance: targeting the characteristics of the different instruction sets, the operation kernel function operators adopt vectorization, parallelization and pipeline designs, fully exploiting the multi-stage pipelines of CPUs and DSPs, improving vector-processor efficiency and reducing the cache miss rate, thereby realizing high-performance computing functions and improving reasoning speed.
(3) Parallelization: through the multithreaded design, the algorithm is mapped onto the multi-core processor, and multi-core parallel operation on CPU and DSP platforms is supported.
(4) Ease of use: the TensorFlow and PyTorch frameworks are supported, as are convolutional neural network models.
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software may include instructions and certain data that, when executed by one or more processors, operate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer-readable storage medium may include, for example, a magnetic or optical disk storage device, a solid state storage device such as flash memory, cache, random access memory (RAM), or another non-volatile memory device. Executable instructions stored on a non-transitory computer-readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executed by one or more processors.
A computer-readable storage medium may include any storage medium or combination of storage media that can be accessed by a computer system during use to provide instructions and/or data to the computer system. Such storage media may include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-ray disc), magnetic media (e.g., floppy disk, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical system (MEMS) based storage media. The computer-readable storage medium may be embedded in a computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB) based flash memory), or coupled to the computer system via a wired or wireless network (e.g., network-accessible storage (NAS)).
Note that not all of the activities or elements in the above general description are required, that a portion of a particular activity or device may not be required, and that one or more further activities or included elements may be performed in addition to those described. Still further, the order in which the activities are listed need not be the order in which they are performed. Moreover, these concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or element of any or all the claims. Furthermore, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter.

Claims (4)

1. A CNN network reasoning framework design method supporting embedded-platform multi-core parallelism, characterized by comprising the following steps:
CNN network model loading: reading a model file trained by the deep learning framework, extracting weight and bias parameters from the model file, and outputting the model weights and biases defined by pointer variables; the trained model file is a binary file containing model parameters and consists of control header parameters and data; the control header parameters are integers: the 1st word of the control header is the number of layers of the neural network; the 2nd, 3rd and 4th words are the dimensions of the weight matrix of the first-layer neural network model; the 5th, 6th and 7th words are the dimensions of the first-layer pooling; the 8th word is the dimension of the first-layer pooling bias; the 9th, 10th and 11th words are the dimensions of the second-layer pooling; the 12th word is the dimension of the second-layer pooling bias; and so on, up to the last layer of the neural network; the data consist of the weights and bias data of the neural network models from the first layer to the last layer, stored in the binary file in sequence according to the values of the control header parameters;
CNN network function encapsulation: encapsulating the convolution, pooling, activation and full-connection operations in the CNN network into operation kernel functions respectively, using a vector instruction set, assembly language and the C language, the input of each operation kernel function being the model weights and biases defined by the pointer variables, and the outputs being a convolution layer function, a pooling layer function, an activation function and a full-connection layer function, respectively; designing basic block and view objects based on the VSIPL standard, and unifying the encapsulated convolution layer function, pooling layer function, activation function and full-connection layer function, including their function parameters, to form operation kernel functions with a general neural network programming interface; and constructing the prediction function of the CNN network reasoning framework from the operation kernel functions with the general neural network programming interface;
and (3) carrying out automatic parallelization design: based on a multithreading mechanism of the multi-core processor, taking a prediction function of the CNN reasoning framework as a thread function, creating a plurality of tasks or a plurality of threads, designing thread synchronization and communication, dividing data of an input test data set based on a load balancing principle, and binding each task or thread to a core number of the multi-core processor through a thread binding function;
Performing cross-platform design: writing a corresponding VSIPL static library according to the platform type on which the CNN network reasoning framework is to be deployed; the CNN network reasoning framework is deployed on the VxWorks, Linux, Windows, SylixOS or ReWorks operating system.
2. The CNN network reasoning framework design method supporting multi-core parallelism of embedded platform of claim 1, wherein,
the convolution layer function carries out one-dimensional, two-dimensional or three-dimensional convolution, with the dimension of the convolution operation and the number and size of the convolution kernels set as parameters;
the pooling layer function performs one-dimensional, two-dimensional or three-dimensional pooling, with the dimension, pooling type, interval and step length of the pooling operation set as parameters;
the full-connection layer function sets the dimension of the weight matrix;
the activation function sets the activation function type.
3. The CNN network reasoning framework design method supporting multi-core parallelism of an embedded platform according to claim 1, wherein designing the basic block and view objects based on the VSIPL standard, performing the programming interface design on the encapsulated convolution layer function, pooling layer function, activation function and full-connection layer function, including their function parameters, and unifying them into the operation kernel functions with the general neural network programming interface comprises:
defining basic blocks and views based on the VSIPL computing middleware standard, binding the pointer variables output by the CNN network model loading into basic blocks, and extracting data from the basic blocks and binding it into views, the views being matrices or vectors; and calling the operation kernel functions with the converted matrices or vectors as input parameters.
4. The CNN network reasoning framework design method supporting multi-core parallelism of the embedded platform according to claim 1, wherein the data partitioning of the input test data set based on the load balancing principle comprises:
dividing the input test data set evenly into N parts, where N is the number of cores of the multi-core processor; creating N tasks or threads, binding them to the N cores of the multi-core processor, and executing in a data-parallel manner.
CN202110647708.6A 2021-06-10 2021-06-10 CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform Active CN113298259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110647708.6A CN113298259B (en) 2021-06-10 2021-06-10 CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110647708.6A CN113298259B (en) 2021-06-10 2021-06-10 CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform

Publications (2)

Publication Number Publication Date
CN113298259A CN113298259A (en) 2021-08-24
CN113298259B true CN113298259B (en) 2024-04-26

Family

ID=77327859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110647708.6A Active CN113298259B (en) 2021-06-10 2021-06-10 CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform

Country Status (1)

Country Link
CN (1) CN113298259B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611332B (en) * 2021-10-09 2022-01-18 聊城中赛电子科技有限公司 Intelligent control switching power supply method and device based on neural network
CN116991564B (en) * 2023-09-28 2024-01-09 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784320A (en) * 2017-09-27 2018-03-09 电子科技大学 Radar range profile's target identification method based on convolution SVMs
CN108549935A (en) * 2018-05-03 2018-09-18 济南浪潮高新科技投资发展有限公司 A kind of device and method for realizing neural network model
CN110070178A (en) * 2019-04-25 2019-07-30 北京交通大学 A kind of convolutional neural networks computing device and method
CN110766017A (en) * 2019-10-22 2020-02-07 国网新疆电力有限公司信息通信公司 Mobile terminal character recognition method and system based on deep learning
CN111709522A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Deep learning target detection system based on server-embedded cooperation
CN112734040A (en) * 2021-01-22 2021-04-30 中国人民解放军军事科学院国防科技创新研究院 Embedded artificial intelligence computing framework and application method
CN112748953A (en) * 2020-07-02 2021-05-04 腾讯科技(深圳)有限公司 Data processing method and device based on neural network model and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599978B2 (en) * 2017-11-03 2020-03-24 International Business Machines Corporation Weighted cascading convolutional neural networks
US11580386B2 (en) * 2019-03-18 2023-02-14 Electronics And Telecommunications Research Institute Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784320A (en) * 2017-09-27 2018-03-09 电子科技大学 Radar range profile's target identification method based on convolution SVMs
CN108549935A (en) * 2018-05-03 2018-09-18 济南浪潮高新科技投资发展有限公司 A kind of device and method for realizing neural network model
CN110070178A (en) * 2019-04-25 2019-07-30 北京交通大学 A kind of convolutional neural networks computing device and method
CN110766017A (en) * 2019-10-22 2020-02-07 国网新疆电力有限公司信息通信公司 Mobile terminal character recognition method and system based on deep learning
CN111709522A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Deep learning target detection system based on server-embedded cooperation
CN112748953A (en) * 2020-07-02 2021-05-04 腾讯科技(深圳)有限公司 Data processing method and device based on neural network model and electronic equipment
CN112734040A (en) * 2021-01-22 2021-04-30 中国人民解放军军事科学院国防科技创新研究院 Embedded artificial intelligence computing framework and application method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
天津滨海迅腾 [Tianjin Binhai Xunteng]. 《TensorFlow项目式案例实战》 [TensorFlow Project-Based Practical Cases]. 天津大学出版社 [Tianjin University Press], 2020, 1st ed., pp. 99-104. *

Also Published As

Publication number Publication date
CN113298259A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
US10942716B1 (en) Dynamic computational acceleration using a heterogeneous hardware infrastructure
CN113298259B (en) CNN (convolutional neural network) reasoning framework design method supporting multi-core parallelism of embedded platform
US11803404B2 (en) Deep learning algorithm compiling method, device, and related product
US11275615B2 (en) Data processing offload using in-storage code execution
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US11915149B2 (en) System for managing calculation processing graph of artificial neural network and method of managing calculation processing graph by using the same
CN111160551A (en) Computation graph execution method, computer device, and storage medium
WO2021000971A1 (en) Method and device for generating operation data and related product
CA3114635A1 (en) System and method for automated precision configuration for deep neural networks
CN110689116B (en) Neural network pruning method and device, computer equipment and storage medium
US11656880B2 (en) Function evaluation using multiple values loaded into registers by a single instruction
US20210073625A1 (en) Partitioning control dependency edge in computation graph
US10564947B2 (en) Computer system and method for multi-processor communication
US20220076095A1 (en) Multi-level sparse neural networks with dynamic rerouting
EP4318319A1 (en) Model processing method and apparatus
CN115576561A (en) Deep neural network model compiling and optimizing method based on Shenwei processor
CN116228515B (en) Hardware acceleration system, method and related device
WO2022078400A1 (en) Device and method for processing multi-dimensional data, and computer program product
WO2023287702A1 (en) Method and apparatus for accelerated inference of machine-learning models
US11573777B2 (en) Method and apparatus for enabling autonomous acceleration of dataflow AI applications
CN113887730A (en) Quantum simulator implementation method and device, related equipment and quantum simulation method
US11941383B1 (en) Compilation with caching of code analysis result
US11809849B1 (en) Global modulo allocation in neural network compilation
US20230121052A1 (en) Resource resettable deep neural network accelerator, system, and method
KR20170081952A (en) Multi-core simulation system and method based on shared translation block cache

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant