CN112308762A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112308762A
CN112308762A (application CN202011148331.1A)
Authority
CN
China
Prior art keywords
intermediate image
sub
layer
logic module
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011148331.1A
Other languages
Chinese (zh)
Inventor
柴双林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202011148331.1A
Publication of CN112308762A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 Indexing scheme for image data processing or generation, in general
    • G06T 2200/28 Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The processor can acquire image data collected by an image sensor and store it in the cache module. When the image data is processed through the data processing model, for each layer of the model the processor acquires that layer's model parameters and loads them into each logic module, acquires the intermediate image obtained by the previous layer from the cache module, divides the intermediate image to obtain sub-intermediate images, loads each sub-intermediate image into the logic modules, and operates on the model parameters and the sub-intermediate images through the logic modules to obtain the intermediate image of the layer, which is stored in the cache module. By storing the image data and the intermediate images of each layer in the cache, the method alleviates the problem of limited data-transmission bandwidth, makes effective use of the processor's computational resources, and reduces the time consumed.

Description

Data processing method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a data processing method and apparatus.
Background
Currently, in terms of hardware support, the operation of a machine learning model can be implemented by a processor and a memory. The processor may include a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), and the like, and the memory may include components such as Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM).
Taking target detection on image data as an example, with the machine learning model being a Convolutional Neural Network (CNN): the image data, the model parameters of the CNN, and the operation result of each CNN layer can be stored in the DDR SDRAM. For the TPU to operate on any CNN layer, the model parameters and the image data (or the operation result of the previous layer) are first loaded sequentially from the DDR SDRAM into the TPU; the TPU then performs the operation according to the model parameters and the image data, and the operation result is stored back in the DDR SDRAM.
Because image data, CNN operation results, and similar data are large, loading them from the DDR SDRAM into the TPU requires a high bandwidth to support data transmission, yet the transmission bandwidth between the DDR SDRAM and the TPU is limited, so data loading takes a long time. In addition, since computational power and bandwidth demand are positively correlated, when bandwidth is limited the computational resources in the TPU cannot be used effectively: the TPU operates slowly, and target detection on the image data takes a long time.
Disclosure of Invention
Embodiments of the present specification provide a data processing method and apparatus, so as to partially solve the above problems in the prior art.
The embodiment of the specification adopts the following technical scheme:
in a data processing method provided in this specification, a processor includes a cache module and a plurality of logic modules, and the method includes:
the processor acquires image data acquired by the image sensor, stores the image data in the cache module, and processes the image data through a data processing model;
for each layer of the data processing model, obtaining the model parameters of the layer stored in a memory and loading the model parameters into each logic module; acquiring, from the cache module, the intermediate image obtained by the previous layer, segmenting the intermediate image according to predetermined information of each logic module to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module;
and operating on the model parameters and the sub-intermediate images through each logic module to obtain an operation result as the intermediate image obtained by the layer, and storing the intermediate image obtained by the layer into the cache module.
Optionally, acquiring, from the cache module, the intermediate image obtained by the previous layer specifically includes:
and if the layer is the first layer of the data processing model, acquiring the image data from the cache module.
Optionally, the predetermining information of each logic module specifically includes:
determining the size of an intermediate image required to be processed in each layer in the data processing model and the information of model parameters of each layer according to the model structure of the data processing model;
and determining the number of logic modules contained in the processor according to the size of the intermediate image required to be processed by each layer in the data processing model and/or the information of the model parameters of each layer.
Optionally, loading the model parameters into each logic module specifically includes:
aiming at each parameter value in the obtained parameter matrix of the model parameter, determining each logic module needing to load the parameter value in each logic module according to the position of the parameter value in the parameter matrix of the model parameter;
and loading the parameter values into the determined logic modules in parallel.
Optionally, loading each sub-intermediate image into each logic module specifically includes:
and sequentially loading each sub-intermediate image into each logic module according to the position of each sub-intermediate image in the intermediate image.
Optionally, sequentially loading each sub-intermediate image into each logic module specifically includes:
for each sub-intermediate image, when the sub-intermediate image is loaded into the logic module, performing the following operations:
aiming at each pixel value in the sub-intermediate image, determining each logic module needing to load the pixel value in each logic module according to the position information of the pixel value in the sub-intermediate image;
and loading the pixel values into the determined logic modules in parallel.
Optionally, operating on the model parameters and the sub-intermediate images through each logic module to obtain an operation result as the intermediate image obtained by the layer specifically includes:
aiming at each sub-intermediate image, calculating the model parameters and the sub-intermediate image through each logic module to obtain the calculation result of the sub-intermediate image;
and taking the operation result of each sub intermediate image as the intermediate image obtained by the layer.
Optionally, the calculating the model parameter and the sub-intermediate image through each logic module to obtain a calculation result of the sub-intermediate image specifically includes:
and performing parallel operation on the model parameters loaded into each logic module and the sub-intermediate image through each logic module to obtain the operation result of each logic module on the sub-intermediate image.
Optionally, the calculating the model parameter and the sub-intermediate image through each logic module to obtain a calculation result of the sub-intermediate image specifically includes:
convolving the model parameters and the sub-intermediate image through each logic module;
and performing pooling operation on the result after convolution to obtain an operation result of the sub-intermediate image.
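The convolve-then-pool step described in the claim above can be sketched as follows. This is a minimal Python sketch for a single logic module's work on one sub-intermediate image; the valid (no-padding) convolution and the 2x2 max-pooling with stride 2 are assumptions for illustration, since the patent does not fix the kernel, stride, or pooling configuration:

```python
import numpy as np

def process_sub_image(sub_image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve a sub-intermediate image with the layer's parameter matrix,
    then pool the result (assumed: valid convolution, 2x2 max pooling)."""
    kh, kw = kernel.shape
    oh, ow = sub_image.shape[0] - kh + 1, sub_image.shape[1] - kw + 1
    conv = np.zeros((oh, ow))
    for i in range(oh):                 # slide the kernel over the sub-image
        for j in range(ow):
            conv[i, j] = np.sum(sub_image[i:i + kh, j:j + kw] * kernel)
    # 2x2 max pooling with stride 2 over the convolution result
    ph, pw = oh // 2, ow // 2
    return conv[:ph * 2, :pw * 2].reshape(ph, 2, pw, 2).max(axis=(1, 3))
```

A 6x6 sub-image convolved with a 3x3 kernel yields a 4x4 result, which pools down to 2x2.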
The present specification provides a data processing system, the system comprising: the image sensor, the processor and the memory, wherein the processor comprises a cache module and a plurality of logic modules;
the image sensor is used for acquiring image data;
the processor is configured to store the image data in the cache module, obtain the model parameters of the data processing model stored in the memory, and load the model parameters into each logic module; acquire an intermediate image from the cache module, segment the intermediate image according to predetermined information of each logic module to obtain a plurality of sub-intermediate images, load each sub-intermediate image into each logic module, operate on the model parameters and each sub-intermediate image through each logic module to obtain an operation result as an intermediate image, and store the intermediate image into the cache module;
and the memory is used for storing the model parameters of the data processing model.
This specification provides a data processing apparatus, a processor where the apparatus is located includes a cache module and a plurality of logic modules, the apparatus includes:
the acquisition module is used for acquiring image data collected by an image sensor, storing the image data in the cache module of the processor where the apparatus is located, and processing the image data through a data processing model;
the loading module is used for acquiring, for each layer of the data processing model, the model parameters of the layer stored in the memory and loading the model parameters into each logic module; acquiring, from the cache module, the intermediate image obtained by the previous layer, segmenting the intermediate image according to predetermined information of each logic module to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module;
and the operation module is used for operating on the model parameters and the sub-intermediate images through the logic modules to obtain an operation result as the intermediate image obtained by the layer, and storing the intermediate image obtained by the layer into the cache module.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described data processing method.
The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:
in this specification, the processor may include a cache module and a plurality of logic modules. The processor may acquire image data collected by the image sensor and store it in the cache module. When the image data is processed by the data processing model, for each layer of the model the processor acquires the model parameters of that layer stored in the memory, loads them into each logic module, acquires the intermediate image obtained by the previous layer from the cache module, divides the intermediate image according to predetermined information of each logic module to obtain sub-intermediate images, loads each sub-intermediate image into each logic module, operates on the model parameters and each sub-intermediate image through each logic module to obtain the intermediate image of the layer, and stores that intermediate image in the cache module. Because the image data and the intermediate images of each layer are stored in the cache, and the on-chip data transmission bandwidth of the processor is far larger than the data transmission bandwidth between the memory and the processor, only the model parameters need to be loaded in real time from the memory; the bandwidth required for the model parameters is about one twentieth of that required for the image data. This specification therefore alleviates the problem of limited data-transmission bandwidth, in particular the bandwidth bottleneck of image data transmission at full computational load, makes full use of the processor's computational resources, greatly improves the frame rate, greatly reduces algorithm latency, and shortens the time consumed in processing image data through the data processing model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of it, illustrate embodiments of the specification and together with the description serve to explain the specification; they are not intended to limit it. In the drawings:
FIG. 1 is a block diagram of a prior art process for processing image data;
fig. 2 is a block diagram for processing image data according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a data processing method provided in an embodiment of the present specification;
FIG. 4 is a schematic diagram of loading pixel values in sub-intermediate images into logic modules in parallel according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be clearly and completely described below with reference to specific embodiments and the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without creative effort fall within the protection scope of the present specification.
In processing image data with a machine learning model, the underlying hardware requires cooperation between a processor and a memory. Generally, the processor may include a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), a Field Programmable Gate Array (FPGA), and the like. Memory is a kind of computer-readable storage medium and may include main memory, a hard disk, and so on, where main memory may include components such as Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM); for convenience of description, this memory is referred to below as DDR.
In the prior art, take the machine learning model to be a Convolutional Neural Network (CNN) and use a TPU, a processor designed for neural network operations, to perform target detection on image data through the CNN.
Fig. 1 is a block diagram illustrating image data processing according to the prior art. In Fig. 1, the information stored in the DDR may include the image data collected by an image sensor, the operation result of each CNN layer, and the model parameters of the CNN. When the TPU performs an operation, the CNN model parameters stored in the DDR are first loaded serially into the TPU; then the operation result of the layer above the current layer, also stored in the DDR, is loaded into the TPU. The TPU performs the operation using the loaded model parameters and the previous layer's result to obtain the current layer's result, which is stored back in the DDR. If the current layer is the first layer, the previous layer's result is the image data collected by the image sensor.
Because the image data, the CNN operation results, and similar data are large, loading the previous layer's result from the DDR into the TPU requires a high bandwidth; the transfer is limited by the transmission bandwidth between the DDR and the TPU, so data loading takes a long time. In addition, since the model parameters are loaded serially from the DDR into the TPU, many clock cycles, and hence additional time, are required. Considering that the processor's computational load is positively correlated with its bandwidth demand, that is, the greater the computational load the higher the bandwidth required, the TPU's computational resources cannot be used effectively when bandwidth is limited. For these reasons, target detection on image data by the TPU takes a long time.
Therefore, the present specification provides a data processing method to solve the problems in the prior art.
First, information such as the bottom hardware requirement of the data processing method provided in this specification is described.
The hardware required in this specification also includes a processor and a memory. Here the processor may include an FPGA, a Complex Programmable Logic Device (CPLD), and the like; such processors are semi-custom circuits that can be programmed according to actual requirements to implement the required functions. Taking the FPGA as an example, its basic structure may include configurable logic blocks, embedded block RAM, and so on. The embedded block RAM (i.e., the cache module in this specification) serves as the FPGA's on-chip cache, mainly comprising BRAM and URAM, and is characterized by high bandwidth but small capacity. The configurable logic blocks (i.e., the logic modules in this specification, which those skilled in the art may use as Artificial Intelligence (AI) computational cores) perform the operations.
Fig. 2 is a block diagram of processing image data according to an embodiment of the present disclosure. In Fig. 2, the image data acquired by the image sensor and the operation results of the data processing model may be stored in the cache module, while the model parameters of the data processing model may be stored in the memory. When the processor processes image data through the data processing model, for each layer of the model, the layer's model parameters stored in the memory are loaded into the logic modules, and the operation result of the previous layer stored in the cache module is loaded into the logic modules. The logic modules perform the operation using the model parameters and the previous layer's result to obtain this layer's result, which is stored in the cache module, overwriting the previous layer's result (i.e., the result most recently stored in the cache module). When the layer is the first layer, the previous layer's result is the image data collected by the image sensor.
Accordingly, the data processing system provided in this specification may include an image sensor, a processor, and a memory, where the processor includes a cache module and a plurality of logic modules and may run on an embedded platform (which may include a PetaLinuxAPP). The image sensor collects image data; the embedded platform loads the image data collected by the image sensor into the cache module in the processor and loads the model parameters of the data processing model stored in the memory into the logic modules; the processor loads the image data from the cache into the logic modules; the logic modules perform the operation according to the loaded model parameters and image data; and the resulting operation result is stored in the cache module.
Then, the specific contents of the data processing method provided in the present specification will be described again. The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 3 is a flowchart of a data processing method provided in an embodiment of the present specification, which may specifically include the following steps:
s100: the processor acquires image data acquired by the image sensor, stores the image data in the cache module, and processes the image data through the data processing model.
S102: aiming at each layer of the data processing model, obtaining model parameters of the layer of the data processing model stored in a memory, and loading the model parameters into each logic module; and acquiring an intermediate image obtained by a layer above the layer from the cache module, segmenting the intermediate image according to the information of each logic module determined in advance to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module.
In this specification, the processor includes a cache module (comprising BRAM/URAM) and a plurality of logic modules. The processor can acquire image data collected by the image sensor and store it in the cache module. Specifically, the image data collected in real time by the image sensor can be loaded directly into the cache module through the embedded platform, or the image data stored in the memory can be loaded into the cache module; that is, the image data collected by the image sensor is first stored in the memory and then loaded into the cache module by the embedded platform, so that the processor can process it.
The model structure of a CNN is chained: the operation data of each layer comes from the operation result of the previous layer. The previous layer's result is therefore stored in the cache module, and after the next layer's operation finishes, that layer's result is stored in the cache module, overwriting the previous layer's result. If the model structure of the data processing model is more complex, that is, the result of a certain layer is used not only by the next layer but also by later layers, the result of that layer is additionally stored in the memory so that those later layers can retrieve it.
The data processing model is one of machine learning models, and may be a neural network such as CNN, or may be another machine learning model, for example, a Decision Tree (Decision Tree) model, and the like.
In general, the processor processes the image data through the data processing model to obtain a data processing result, which can be described as follows: the image sensor collects image data, which is stored in the cache module; the model parameters of each layer of the data processing model are stored in the memory; the logic modules operate on the image data and the first layer's model parameters, and the first layer's result is stored in the cache module; the logic modules then operate on the first layer's result and the second layer's model parameters to obtain the second layer's result; and so on, until the last layer's result is the data processing result of the image data.
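The layer-by-layer flow described above, including the extra spill to memory for a result that later layers also consume, can be sketched as follows. This is a minimal Python sketch; the `Layer` type, the `reused_later` flag, and the dict-based cache and memory are illustrative assumptions, not interfaces defined by the patent:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Layer:
    name: str
    compute: Callable[[float], float]   # stand-in for the logic-module operation
    reused_later: bool = False          # output also consumed by a non-adjacent layer

def run_model(layers: List[Layer], image, cache: dict, memory: dict):
    """Chain layers through the on-chip cache, overwriting the previous
    layer's result each time; spill to memory only when a later layer
    needs the result again (the complex-graph case described above)."""
    cache["intermediate"] = image       # the first layer consumes the raw image data
    for layer in layers:
        out = layer.compute(cache["intermediate"])
        cache["intermediate"] = out     # overwrite the previous result in the cache
        if layer.reused_later:
            memory[layer.name] = out    # additionally persist in DDR for later layers
    return cache["intermediate"]        # last layer's result = data processing result
```

Only results flagged `reused_later` ever cross the memory bandwidth boundary; everything else stays on-chip.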
Thus, the operations of the processor are described using one of the layers of the data processing model as an example.
First, the processor includes logic modules.
In this specification, a processor may include a number of logic modules, where the number of logic modules included in the processor may be predetermined.
Specifically, according to the model structure of the data processing model, the size of an intermediate image required to be processed in each layer in the data processing model and the information of model parameters of each layer can be determined; the number of logic modules included in the processor is determined based on information on the size of the intermediate image to be processed for each layer in the data processing model and/or model parameters for each layer.
For example, based on the model structure of the data processing model, it may be determined that the minimum size of the intermediate images to be processed in the data processing model is N × N, the parameter matrix of the model parameters is M × M, and the number of logic modules included in the processor may be N × M.
Of course, the above is merely illustrative; in that example, each logic module can be reused to the maximum extent, so the computational capacity can be fully loaded. The number of logic modules may take other values, for example (N+1) × M, but in that case not every logic module is reused to the maximum extent and computational resources are not fully utilized. Whatever the number of logic modules, it is determined from information such as the size of the intermediate image to be processed in each layer and the dimensions of each layer's parameter matrix, so that the model parameters and the previous layer's result (i.e., the intermediate image) can be loaded into the logic modules according to that number. Considering that the data processing model in this solution may be any machine learning model, the number of logic modules may also be determined empirically; loading of the model parameters and intermediate images must likewise be based on that number.
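The sizing rule in the N × M example above can be sketched as follows. This is a minimal Python sketch under the stated assumptions (minimum intermediate-image side N, parameter-matrix dimension M); the function name and the use of shape lists are illustrative, not from the patent:

```python
from typing import List, Tuple

def logic_module_count(layer_shapes: List[Tuple[int, int]],
                       param_shapes: List[Tuple[int, int]]) -> int:
    """Pick the logic-module count from the model structure: N is the smallest
    intermediate-image side across layers, M the largest parameter-matrix
    dimension (assumed reading of the N x M example in the text)."""
    n = min(h for h, _ in layer_shapes)   # N: smallest intermediate image to process
    m = max(k for k, _ in param_shapes)   # M: parameter matrix dimension
    return n * m                          # N x M modules, fully reusable per the example
```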
The intermediate image obtained from the previous layer is then loaded into the logic module.
If the current layer is the first layer of the data processing model, the image data collected by the image sensor and stored in the cache module is loaded into each logic module. If the current layer is the second layer or a later layer, the result obtained by the previous layer and stored in the cache module (i.e., the intermediate image obtained by the previous layer) is loaded into each logic module. For convenience of description, the image data collected by the image sensor is hereinafter also referred to as an intermediate image. Note that in this specification an intermediate image may in fact be a block of intermediate feature values, and the pixel values of an intermediate image may be feature values.
In a preferred embodiment, the image data collected by the image sensor and required by the first layer of the data processing model can be loaded directly into each logic module, without first storing it in the cache module in the processor and then loading it from there. This avoids reloading the image data and saves one frame of image-data access latency. In practical engineering implementations, since the size of the image data collected by the sensor usually does not match the input size of YOLOv3, this step also performs image scaling, i.e., the image is scaled from 1980 x 1020 to 416 x 416.
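The scaling fused into the load step could look like the following. This is a minimal Python sketch; the patent does not specify the interpolation method, so nearest-neighbour sampling is an assumption chosen only for illustration:

```python
import numpy as np

def scale_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour scaling, as might be fused into the direct load of
    sensor data into the logic modules (interpolation method assumed)."""
    h, w = image.shape[:2]
    rows = (np.arange(out_h) * h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * w) // out_w   # source column for each output column
    return image[rows][:, cols]
```

For the sizes in the text this would be called as `scale_nearest(frame, 416, 416)` on a 1020 x 1980 frame.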
Since the number of logic modules in the processor is predetermined, the amount of data each logic module can compute is also fixed; that is, the size of intermediate image each logic module can handle is determined. The intermediate image can therefore be divided, according to information such as the per-module computable data volume and the number of logic modules in the processor, into a plurality of sub-intermediate images, and the sub-intermediate images can be loaded into the logic modules in sequence according to their positions in the intermediate image.
It should be noted that the sub-intermediate images are loaded into the logic modules serially: the logic modules operate on one sub-intermediate image, then on the next one loaded, and after all sub-intermediate images have been processed, the operation result of the whole intermediate image, that is, the current layer's result, is obtained.
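The position-ordered splitting described above can be sketched as follows. This is a minimal Python sketch assuming square tiles whose side matches the per-module capacity and an image whose sides divide evenly by the tile size; both assumptions are illustrative:

```python
import numpy as np
from typing import List

def split_intermediate(image: np.ndarray, tile: int) -> List[np.ndarray]:
    """Split an intermediate image into tile x tile sub-intermediate images,
    ordered row-major by position so they can be loaded serially in order."""
    h, w = image.shape
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h, tile)       # walk tiles top-to-bottom
            for c in range(0, w, tile)]      # then left-to-right within a row
```

Each element of the returned list is one sub-intermediate image, in the serial loading order described in the text.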
Each sub-intermediate image is loaded into the logic modules in the same manner, so the loading of one sub-intermediate image is taken as an example here.
For each pixel value in the sub-intermediate image, the logic modules that need to load that pixel value are determined from the position of the pixel value in the sub-intermediate image, and the pixel value is then loaded into all of those logic modules in parallel.
Fig. 4 is a schematic diagram of loading the pixel values of a sub-intermediate image into the logic modules in parallel, according to an embodiment of the present specification. In fig. 4, based on the position of each pixel value in the sub-intermediate image and the position of each logic module in the area where the logic modules are located, the logic modules that need to load the pixel value -6 in the first row and second column of the sub-intermediate image may be determined to be the first and second logic modules in the first row of that area, and the logic modules that need to load the pixel value 2 in the first row and first column may be determined to be the first logic module in the first row and the first logic module in the second row. The pixel values can thus be loaded in parallel, and each pixel value, when loaded, is loaded in parallel into every logic module that needs it.
Of course, this specification also supports loading the pixel values in parallel while loading each pixel value serially into the logic modules that need it, as well as loading the pixel values serially while loading each pixel value in parallel into the logic modules that need it. The fully parallel mode shown in fig. 4 is faster and takes less time than either of these two modes.
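As a sketch of how a pixel's position can determine the set of logic modules that need it: assuming each logic module computes one output position of a stride-1, unpadded k x k convolution (this is an interpretation of fig. 4, not stated explicitly in the specification), pixel (r, c) is needed by every module whose input window covers it:

```python
def modules_needing_pixel(r, c, k=3, out_h=13, out_w=13):
    """Grid positions of the logic modules whose k x k input window covers
    pixel (r, c); stride 1 and no padding are assumptions."""
    return [(i, j)
            for i in range(max(0, r - k + 1), min(out_h, r + 1))
            for j in range(max(0, c - k + 1), min(out_w, c + 1))]

print(modules_needing_pixel(0, 0))       # corner pixel feeds one module
print(len(modules_needing_pixel(2, 2)))  # interior pixel feeds 9 modules
```

Broadcasting each pixel to all of its returned modules at once is the fully parallel mode of fig. 4.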
The model parameters of the data processing model are then loaded into the logic module.
In this specification, the model parameters of the data processing model may include the weights of each layer, and the parameter values may include weight values. From the model structure of the data processing model, the information of the model parameters of each layer, such as the dimensions of the parameter matrix, can be determined; the values of each layer's parameter matrix are obtained by training the data processing model in advance. The parameter matrices of each layer may be stored in the memory, and pre-training the data processing model may follow existing training approaches.
When the model parameters of the current layer of the data processing model are loaded into the logic modules, for each parameter value in the obtained parameter matrix of the model parameters, the logic modules that need to load that parameter value are determined from the position of the parameter value in the parameter matrix, and the parameter value is then loaded into all of the determined logic modules in parallel.
Specifically, in this specification, the model parameters may be loaded from the memory into each logic module by Direct Memory Access (DMA), or loaded from the memory into each logic module by the processor.
Referring again to fig. 4, if the sub-intermediate image stored in the cache in fig. 4 is replaced with the parameter matrix of the model parameters stored in the memory, then for each parameter value in the parameter matrix, the logic modules that need that parameter value can be determined from its position in the parameter matrix and the position of each logic module in the area where the logic modules are located. The parameter values can then be loaded in parallel, with each parameter value loaded in parallel into every logic module that needs it.
Similarly, this specification also supports loading the parameter values in parallel while loading each parameter value serially into the logic modules that need it, and loading the parameter values serially while loading each parameter value in parallel into the logic modules that need it; the former takes less time than the latter.
In this specification, the pixel values of the sub-intermediate image and the parameter values in the parameter matrix of the model parameters are loaded into the logic modules in parallel, so both the pixel values and the parameter values can be reused. After reuse, the bandwidth required for the pixel values of the sub-intermediate image drops from 3456.0 GB/s to 384.9 GB/s, and the bandwidth required for the parameter values in the parameter matrix drops from 3456.0 GB/s to 20.5 GB/s. The bandwidth requirement is thus greatly reduced, and computing resources can be used effectively.
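The quoted bandwidth figures are consistent with simple reuse factors, under the assumption that each pixel is shared by all 3 x 3 convolution windows covering it and each weight is shared by all 13 x 13 output positions of one tile (this interpretation is an assumption; the small residual differences from the quoted numbers are not explained in the source):

```python
# Reuse factors implied by the parallel-loading scheme (interpretation assumed).
total_bw = 3456.0            # GB/s required with no reuse
pixel_reuse = 3 * 3          # each pixel feeds up to 3x3 kernel windows
weight_reuse = 13 * 13       # each weight serves 13x13 outputs per tile
print(round(total_bw / pixel_reuse, 1))   # 384.0 (quoted: 384.9 GB/s)
print(round(total_bw / weight_reuse, 1))  # 20.4 (quoted: 20.5 GB/s)
```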
S104: the model parameters and each sub-intermediate image are operated on by the logic modules, the operation result is taken as the intermediate image obtained by this layer, and the intermediate image obtained by this layer is stored in the cache module.
Since the intermediate image is divided into several sub-intermediate images, the result of processing the intermediate image through the data processing model is determined by the results of processing each sub-intermediate image. For each sub-intermediate image, the logic modules operate on the model parameters and that sub-intermediate image to obtain its operation result, and the operation results of all sub-intermediate images together form the intermediate image obtained by this layer. In this specification, the sub-intermediate images are loaded into the logic modules in sequence, and the logic modules operate in parallel: each logic module operates on the model parameters loaded into it and the sub-intermediate image, producing that module's operation result for the sub-intermediate image.
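The serial-over-tiles, parallel-within-tile schedule can be modeled as a loop in which each call to `process_tile` stands for one fully parallel pass of all logic modules over one sub-intermediate image (the function name and the `(row, col, height, width)` tile format are illustrative assumptions):

```python
import numpy as np

def run_layer(intermediate, tiles, process_tile):
    """Serially iterate over sub-intermediate images; each process_tile call
    models one fully parallel pass of all logic modules over that tile."""
    results = {}
    for (r, c, h, w) in tiles:
        results[(r, c)] = process_tile(intermediate[r:r + h, c:c + w])
    return results

out = run_layer(np.arange(16).reshape(4, 4),
                [(0, 0, 2, 2), (0, 2, 2, 2), (2, 0, 2, 2), (2, 2, 2, 2)],
                lambda t: int(t.sum()))
print(out[(0, 0)], out[(2, 2)])  # 10 50
```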
Of course, in this specification the logic modules may also operate in sequence, one module starting after another finishes; however, since each logic module already holds the parameter values and pixel values it needs, the operations of the logic modules can in fact be performed entirely in parallel.
When the logic modules operate on the model parameters and a sub-intermediate image, they may convolve the model parameters with the sub-intermediate image and apply a pooling operation to the convolution result to obtain the operation result of the sub-intermediate image.
Specifically, each logic module operates on the model parameters and pixel values loaded into it; performing these operations with digital signal processing (DSP) resources and look-up table (LUT) resources may follow conventional implementations.
For the pooling operation, average pooling, maximum pooling, and the like can be chosen. In a preferred embodiment, X x X maximum pooling is applied to the convolution result, reducing its data volume to 1/X², and the pooled result is then stored in the cache module as the intermediate image obtained by the current layer. This reduces the data volume of the intermediate image, and the time consumed using the on-chip data transmission bandwidth of the processor is far less than the time that would be consumed storing the intermediate image in the memory.
After the convolution operation, other operations such as batch normalization, scale, bias, and ReLU may also be performed. In this specification these operations are merged into each logic module and completed in one pass: after a logic module convolves the parameter values and pixel values loaded into it, it can directly perform batch normalization, scale, bias, max pooling, ReLU, and so on, and only the final result is stored in the cache module as the intermediate image obtained by the current layer. The result of each individual operation does not need to be written to the cache module and read back for the next operation; that is, intermediate results need not be repeatedly read and written when performing the various operations through the logic modules, which avoids the bandwidth consumption of those repeated reads and writes and reduces algorithm latency. The specific implementation of each operation in the logic modules may follow existing approaches and is not detailed in this specification.
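The fused per-module pipeline can be sketched end to end. The op order, the folding of batch normalization into a scale and bias, and the stride/padding choices below are all assumptions for illustration, not the patent's stated implementation:

```python
import numpy as np

def fused_layer_op(tile, kernel, scale=1.0, bias=0.0, pool=2):
    """Valid convolution, then scale/bias (standing in for folded batch
    normalization), then pool x pool max pooling, then ReLU, chained in one
    pass with no intermediate writes back to the cache module."""
    kh, kw = kernel.shape
    oh, ow = tile.shape[0] - kh + 1, tile.shape[1] - kw + 1
    conv = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            conv[i, j] = np.sum(tile[i:i + kh, j:j + kw] * kernel)
    x = conv * scale + bias                              # scale + bias
    ph, pw = oh // pool, ow // pool
    x = x[:ph * pool, :pw * pool].reshape(ph, pool, pw, pool).max(axis=(1, 3))
    return np.maximum(x, 0.0)                            # ReLU

out = fused_layer_op(np.ones((15, 15)), np.ones((3, 3)))
print(out.shape)  # (6, 6): 13x13 conv output, 2x2 max-pooled
```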
In a specific implementation, this specification adopts the YOLOv3 neural network and an FPGA as the processor. From the model structure of the network, the minimum size of the intermediate images processed by the network is determined to be 13 x 13 and the model parameters form 3 x 3 parameter matrices; the number of logic modules in the processor can therefore be 13 x 9 (other numbers can be chosen; in engineering practice computing power is the main target). For the operation of any layer of the network, the intermediate image is stored in the cache module and the parameter matrices of the model parameters are stored in the memory; the intermediate image is divided into several sub-intermediate images of size 13 x 13, and for each sub-intermediate image, its pixel values are loaded into the logic modules in parallel.
Because the processor is a Xilinx ZU7EV, its peak computing power (at full load) is 6.9 TOPS, and the total bandwidth required for the model parameters and the intermediate images would be 2 x 3456.0 GB/s. In this scheme, the model parameters are stored in the memory, the intermediate images are stored in the cache module, and both are loaded into the logic modules in parallel. The bandwidth requirement of the model parameters is reduced to 20.5 GB/s, which the PL DDR4 (42 GB/s) and PS DDR4 (34 GB/s) on the ZCU106 board can support; the bandwidth requirement of the intermediate images is reduced to 384.9 GB/s, which the on-board BRAM (1404.0 GB/s) and URAM (864.0 GB/s) of the ZCU106 can support. When the YOLOv3 Tiny (128 GOPS) target detection algorithm is implemented, the frame rate is 1617 FPS and the delay is 61.8 microseconds; when YOLOv3 (3.84 TOPS) is implemented, the frame rate is 54 FPS and the delay is 18.5 milliseconds.
In addition, a 2 x 2 maximum pooling operation can be applied after convolving the intermediate image, reducing the maximum caching requirement of the intermediate image from 2,768,896 bytes to a quarter of that, 692,224 bytes. The bandwidth required to store the intermediate image is thus reduced to a quarter of the original, and the cache module space occupied is reduced accordingly.
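The quarter-size figure follows directly from 2 x 2 max pooling keeping one value per four:

```python
# 2x2 max pooling keeps one value per 2x2 block, so the cached intermediate
# image shrinks by a factor of four (byte counts are from the specification).
before_bytes = 2_768_896
after_bytes = before_bytes // (2 * 2)
print(after_bytes)  # 692224
```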
Therefore, in this specification, the image data acquired by the image sensor and the intermediate images obtained by each layer of the data processing model are stored directly in the cache module on the processor chip, while the model parameters are stored in the memory, so only the model parameters need to be loaded from the memory in real time, and the bandwidth they require is only about 1/20 of the bandwidth required for loading the intermediate images. This removes the read/write bandwidth bottleneck on the image data when the computing resources are fully loaded; that is, it solves the problem that limited bandwidth prevents the computing power from being fully used. In addition, a parallel loading mode is adopted when the model parameters are loaded into the logic modules, and operations such as convolution and pooling are all processed inside the processor's logic modules, achieving minimal processing delay, using the processor's computing resources effectively, and shortening the time consumed by processing the image data through the data processing model.
Based on the data processing method described above, an embodiment of the present specification further provides a schematic structural diagram of a data processing apparatus, as shown in fig. 5.
Fig. 5 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present specification, where a processor of the apparatus includes a cache module and a plurality of logic modules, and the apparatus includes:
an obtaining module 501, configured to obtain, by the processor where the apparatus is located, image data acquired by an image sensor, store the image data in the cache module, and process the image data through a data processing model;
a loading module 502, configured to obtain, for each layer of the data processing model, a model parameter of the layer of the data processing model stored in a memory, and load the model parameter into each logic module; acquiring an intermediate image obtained by the layer above the layer from the cache module, segmenting the intermediate image according to the information of each logic module determined in advance to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module;
and an operation module 503, configured to perform an operation on the model parameters and each sub-intermediate image through each logic module to obtain an operation result as an intermediate image obtained by the layer, and store the intermediate image obtained by the layer in the cache module.
In the present specification, the image data or the intermediate image obtained by each layer of the data processing model is stored in the cache, and the on-chip data transmission bandwidth of the processor is much larger than the data transmission bandwidth between the memory and the processor, so that the present specification solves the problem that the data transmission bandwidth is limited, thereby effectively using the computational power resources of the processor, and achieving the effect of shortening the time consumed for processing the image data through the data processing model.
Optionally, the loading module 502 is specifically configured to, if the layer is the first layer of the data processing model, obtain the image data from the cache module.
Optionally, the apparatus further comprises a determining module 504;
the determining module 504 is specifically configured to determine, according to a model structure of the data processing model, a size of an intermediate image that needs to be processed in each layer in the data processing model and information of a model parameter of each layer; and determining the number of logic modules contained in the processor according to the size of the intermediate image required to be processed by each layer in the data processing model and/or the information of the model parameters of each layer.
Optionally, the loading module 502 is specifically configured to, for each parameter value in the obtained parameter matrix of the model parameter, determine, in each logic module, each logic module that needs to load the parameter value according to a position of the parameter value in the parameter matrix of the model parameter; and loading the parameter values into the determined logic modules in parallel.
Optionally, the loading module 502 is specifically configured to sequentially load each sub-intermediate image into each logic module according to the position of each sub-intermediate image in the intermediate image.
Optionally, the loading module 502 is specifically configured to, when each sub-intermediate image is loaded into the logic module, perform the following operations: aiming at each pixel value in the sub-intermediate image, determining each logic module needing to load the pixel value in each logic module according to the position information of the pixel value in the sub-intermediate image; and loading the pixel values into the determined logic modules in parallel.
Optionally, the operation module 503 is specifically configured to, for each sub-intermediate image, perform an operation on the model parameter and the sub-intermediate image through each logic module to obtain an operation result of the sub-intermediate image; and taking the operation result of each sub intermediate image as the intermediate image obtained by the layer.
Optionally, the operation module 503 is specifically configured to perform parallel operation on the model parameters loaded into each logic module and the sub-intermediate image through each logic module, so as to obtain an operation result of each logic module on the sub-intermediate image.
Optionally, the operation module 503 is specifically configured to perform convolution on the model parameter and the sub-intermediate image through each logic module; and performing pooling operation on the result after convolution to obtain an operation result of the sub-intermediate image.
The present specification also provides a computer readable storage medium, which stores a computer program, and the computer program can be used to execute the data processing method described above. Where a computer-readable storage medium includes the memory described above, the computer-readable medium can be, for example, permanent and non-permanent, removable and non-removable media implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (12)

1. A data processing method is characterized in that a processor comprises a cache module and a plurality of logic modules, and the method comprises the following steps:
the processor acquires image data acquired by the image sensor, stores the image data in the cache module, and processes the image data through a data processing model;
aiming at each layer of the data processing model, obtaining model parameters of the layer of the data processing model stored in a memory, and loading the model parameters into each logic module; acquiring an intermediate image obtained by the layer above the layer from the cache module, segmenting the intermediate image according to the information of each logic module determined in advance to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module;
and calculating the model parameters and the sub-intermediate images through each logic module to obtain a calculation result as an intermediate image obtained by the layer, and storing the intermediate image obtained by the layer into the cache module.
2. The method of claim 1, wherein obtaining the intermediate image from the layer above the layer from the cache module specifically comprises:
and if the layer is the first layer of the data processing model, acquiring the image data from the cache module.
3. The method of claim 1, wherein predetermining information for each logic module specifically comprises:
determining the size of an intermediate image required to be processed in each layer in the data processing model and the information of model parameters of each layer according to the model structure of the data processing model;
and determining the number of logic modules contained in the processor according to the size of the intermediate image required to be processed by each layer in the data processing model and/or the information of the model parameters of each layer.
4. The method of claim 1, wherein loading the model parameters into the logic modules specifically comprises:
aiming at each parameter value in the obtained parameter matrix of the model parameter, determining each logic module needing to load the parameter value in each logic module according to the position of the parameter value in the parameter matrix of the model parameter;
and loading the parameter values into the determined logic modules in parallel.
5. The method of claim 1, wherein loading each sub-intermediate image into each logic module specifically comprises:
and sequentially loading each sub-intermediate image into each logic module according to the position of each sub-intermediate image in the intermediate image.
6. The method of claim 5, wherein loading each sub-intermediate image into each logic module in sequence, specifically comprises:
for each sub-intermediate image, when the sub-intermediate image is loaded into the logic module, performing the following operations:
aiming at each pixel value in the sub-intermediate image, determining each logic module needing to load the pixel value in each logic module according to the position information of the pixel value in the sub-intermediate image;
and loading the pixel values into the determined logic modules in parallel.
7. The method of claim 1, wherein the performing, by each logic module, an operation on the model parameter and each sub-intermediate image to obtain an operation result as the intermediate image obtained by the layer specifically comprises:
aiming at each sub-intermediate image, calculating the model parameters and the sub-intermediate image through each logic module to obtain the calculation result of the sub-intermediate image;
and taking the operation result of each sub intermediate image as the intermediate image obtained by the layer.
8. The method of claim 7, wherein the performing, by each logic module, an operation on the model parameter and the sub-intermediate image to obtain an operation result of the sub-intermediate image comprises:
and performing parallel operation on the model parameters loaded into each logic module and the sub-intermediate image through each logic module to obtain the operation result of each logic module on the sub-intermediate image.
9. The method of claim 7, wherein the performing, by each logic module, an operation on the model parameter and the sub-intermediate image to obtain an operation result of the sub-intermediate image comprises:
convolving the model parameters and the sub-intermediate image through each logic module;
and performing pooling operation on the result after convolution to obtain an operation result of the sub-intermediate image.
10. A data processing system, characterized in that the system comprises: the image sensor, the processor and the memory, wherein the processor comprises a cache module and a plurality of logic modules;
the image sensor is used for acquiring image data;
the processor is configured to store the image data in the cache module, obtain the model parameters of the data processing model stored in the memory, and load the model parameters into each logic module; acquire an intermediate image from the cache module, segment the intermediate image according to predetermined information of each logic module to obtain a plurality of sub-intermediate images, load each sub-intermediate image into each logic module, perform an operation on the model parameters and each sub-intermediate image through each logic module to obtain an operation result as an intermediate image, and store the intermediate image in the cache module;
and the memory is used for storing the model parameters of the data processing model.
11. A data processing device is characterized in that a processor of the device comprises a cache module and a plurality of logic modules, and the device comprises:
the acquisition module is used for acquiring image data acquired by an image sensor and storing the image data in the cache module by the processor where the device is located and processing the image data through a data processing model;
the loading module is used for acquiring the model parameters of each layer of the data processing model stored in the memory and loading the model parameters into each logic module; acquiring an intermediate image obtained by the layer above the layer from the cache module, segmenting the intermediate image according to the information of each logic module determined in advance to obtain a plurality of sub-intermediate images, and loading each sub-intermediate image into each logic module;
and the operation module is used for operating the model parameters and the sub-intermediate images through the logic modules to obtain an operation result as the intermediate image obtained by the layer, and storing the intermediate image obtained by the layer into the cache module.
12. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-9.
CN202011148331.1A 2020-10-23 2020-10-23 Data processing method and device Pending CN112308762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011148331.1A CN112308762A (en) 2020-10-23 2020-10-23 Data processing method and device


Publications (1)

Publication Number Publication Date
CN112308762A true CN112308762A (en) 2021-02-02

Family

ID=74327519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011148331.1A Pending CN112308762A (en) 2020-10-23 2020-10-23 Data processing method and device

Country Status (1)

Country Link
CN (1) CN112308762A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012118738A (en) * 2010-11-30 2012-06-21 Fuji Xerox Co Ltd Print document processing system, cache apparatus, data processing apparatus, and program
US20160077758A1 (en) * 2014-09-15 2016-03-17 Apple Inc. Securely sharing cached data
CN105654419A (en) * 2016-01-25 2016-06-08 上海华力创通半导体有限公司 Operation processing system and operation processing method of image
CN108369725A (en) * 2017-03-13 2018-08-03 深圳市大疆创新科技有限公司 Handle method, chip, processor, computer system and the mobile device of image
WO2019076109A1 (en) * 2017-10-19 2019-04-25 格力电器(武汉)有限公司 Method and device for pooling image information, storage medium and processor
WO2019136762A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Artificial intelligence processor and processing method applied thereto
US20190303762A1 (en) * 2018-03-30 2019-10-03 Xilinx, Inc. Methods of optimization of computational graphs of neural networks
CN110458279A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of binary neural network accelerated method and system based on FPGA
WO2020062299A1 (en) * 2018-09-30 2020-04-02 华为技术有限公司 Neural network processor, data processing method and related device
WO2020118608A1 (en) * 2018-12-13 2020-06-18 深圳鲲云信息科技有限公司 Deconvolutional neural network hardware acceleration method, apparatus, and electronic device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126750A (en) * 2023-02-24 2023-05-16 Zhejiang Lab Data processing method and device based on hardware characteristics
CN116126750B (en) * 2023-02-24 2023-08-22 Zhejiang Lab Data processing method and device based on hardware characteristics

Similar Documents

Publication Publication Date Title
CN107742150B (en) Data processing method and device of convolutional neural network
CN108573305B (en) Data processing method, equipment and device
CN107766292B (en) Neural network processing method and processing system
CN109993293B Deep learning accelerator suitable for stacked hourglass networks
CN110795226B (en) Method for processing task using computer system, electronic device and storage medium
CN112486901A (en) Memory computing system and method based on ping-pong buffer
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN112991142A (en) Matrix operation method, device, equipment and storage medium of image data
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN112308762A (en) Data processing method and device
CN110750363B (en) Computer storage management method and device, electronic equipment and storage medium
KR20230081697A Method and apparatus for accelerating dilated convolution calculation
CN112200310A (en) Intelligent processor, data processing method and storage medium
CN113887719B (en) Model compression method and device
US20220222318A1 (en) Performing tensor operations using a programmable control engine
CN112837256B (en) Circuit system and detection method for Harris corner detection
CN113468469A (en) Convolution processing method and device of feature graph executed by computer and electronic equipment
CN114298329A (en) Model training method, device, equipment and storage medium
KR20220049325A (en) Accelerator and electronic device including the same
CN112712457A (en) Data processing method and artificial intelligence processor
US20230168809A1 (en) Intelligence processor device and method for reducing memory bandwidth
CN113052292B Convolutional neural network processing method, device and computer readable storage medium
US11748251B2 (en) Storing tensors in memory based on depth
CN111831207A (en) Data processing method, device and equipment
CN116107636B (en) Hardware acceleration method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination