CN113706366B - Image feature data extraction method, system and related device

Info

Publication number: CN113706366B
Application number: CN202110873716.2A
Authority: CN
Inventors: 蒋东东; 董刚; 赵雅倩
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2024-02-27
Anticipated expiration: 2041-07-30
Also published as: CN113706366A

Abstract

The application provides a method for extracting image feature data, comprising: acquiring image feature data and determining the corresponding number of parallel channels; taking the number of parallel channels as the depth of the image feature data, and increasing the height of the image feature data; using a preset number of RAMs of the FPGA as a first-level cache of the DDR for data multiplexing; and configuring registers corresponding to the convolution kernel at the back end of the first-level cache, and outputting the image feature data row by row using the registers. The method maximizes DDR read/write efficiency, achieves uninterrupted pipelined data output, reduces the data-read pressure on the DDR, and meets the input requirements of a high-bandwidth back-end convolution computing unit. The application also provides a system for extracting image feature data, a computer-readable storage medium and an electronic device, which have the above beneficial effects.

Description

Image feature data extraction method, system and related device

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, a system, and a related device for extracting image feature data.

Background

Currently, there are mainly two implementations of the convolution data extraction process in a CNN (Convolutional Neural Network):

1. The image feature data is cached in the off-chip DDR (double data rate synchronous dynamic random access memory) of an FPGA (Field-Programmable Gate Array), and only a small 3*3 block of data is read for each convolution. By reading small ranges from the DDR many times, this scheme reduces the pressure on the FPGA's storage resources, reduces routing difficulty, and raises the rate of the back-end pipelined convolution multiplication.

Because a 3*3 block is equivalent to three 3*1 segments, reading one 3*3 block requires sending 3 read and refresh commands to the DDR, and the address has to jump between them. Reading and writing such short segments at non-contiguous addresses greatly reduces the DDR read/write rate, typically to below 10%. Even though some optimized algorithms can perform pipelined convolution on 3*11 blocks of data, the read/write capability of the DDR still cannot be fully released after each such block is read, and this becomes the bottleneck of the system's computing speed.

2. All of the data is read into the FPGA. The advantage is that a 3*3 block at any position can be read in 1 cycle and used for the back-end convolution calculation. The disadvantage is that the FPGA's internal RAM resources are very expensive and very small, generally rarely reaching 5 MB. One input channel of data is generally smaller than 512 x 8bit, so the data of at most 20 input channels can be stored; if ping-pong caching is performed on 16 channels of data, the RAM resources inside the FPGA are excessively occupied. Moreover, because the RAMs of the FPGA are uniformly distributed, large-area serial routing is required, which causes routing congestion and makes the design extremely difficult to implement and inefficient. The more input channels there are, the less suitable this scheme becomes, so it does not scale.

Disclosure of Invention

The invention aims to provide a method and a system for extracting image feature data, a computer-readable storage medium and an electronic device, which enable high-speed pipelined multi-dimensional convolution calculation at the back end.

In order to solve the technical problems, the application provides an extraction method of image feature data, which comprises the following specific technical scheme:

acquiring image characteristic data and determining the corresponding parallel channel number;

taking the parallel channel number as the depth of the image characteristic data, and increasing the height of the image characteristic data;

using a preset number of RAMs of the FPGA as a first-level cache of the DDR for data multiplexing; each RAM stores a row of lateral data;

configuring registers corresponding to the convolution kernel at the back end of the first-level cache, and outputting the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed.

Optionally, before using the preset number of RAMs of the FPGA as the first-level cache of the DDR for data multiplexing, the method further includes:

determining the preset number according to the size of the convolution kernel, wherein the preset number is larger than the size of the convolution kernel.

Optionally, after configuring the registers corresponding to the convolution kernel at the back end of the first-level cache, the method further includes:

adding corresponding padding based on the registers and the image feature data.

Optionally, when outputting the image feature data row by row using the registers, before each line feed of the registers, the method further includes:

resetting the values of the registers and multiplexing the repeated data in the RAM.

The application also provides an extraction system of image feature data, comprising:

the data acquisition module is used for acquiring image characteristic data and determining the corresponding parallel channel number;

the data format changing module is used for taking the parallel channel number as the depth of the image characteristic data and increasing the height of the image characteristic data;

the data multiplexing module is used for multiplexing data by using a preset number of RAMs of the FPGA as a first-level cache of the DDR; each RAM stores a row of lateral data;

the data extraction module is used for configuring registers corresponding to the convolution kernel at the back end of the first-level cache and outputting the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed.

Optionally, the system further comprises:

the quantity determining module is used for determining the preset number according to the size of the convolution kernel, the preset number being larger than the size of the convolution kernel.

Optionally, the system further comprises:

the extraction preparation module is used for adding corresponding padding based on the registers and the image feature data.

Optionally, the system further comprises:

the reset module is used for resetting the values of the registers before each line feed of the registers when the registers are used to output the image feature data row by row, and for multiplexing the repeated data in the RAM.

The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.

The application also provides an electronic device comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.

The application provides a method for extracting image feature data, comprising: acquiring image feature data and determining the corresponding number of parallel channels; taking the number of parallel channels as the depth of the image feature data, and increasing the height of the image feature data; using a preset number of RAMs of the FPGA as a first-level cache of the DDR for data multiplexing, each RAM storing one row of horizontal data; and configuring registers corresponding to the convolution kernel at the back end of the first-level cache, and outputting the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed.

The method adjusts the format of the image feature data and uses register shifting: rearranging the data maximizes DDR read/write efficiency, and combining the first-level RAM cache with the register shift array unit achieves automatic packing and multiplexing of the image feature data. This yields uninterrupted pipelined data output, reduces the data-read pressure on the DDR, and meets the input requirements of a high-bandwidth back-end convolution computing unit.

The application further provides a system for extracting image feature data, a computer-readable storage medium and an electronic device, which have the above beneficial effects and are not described again here.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.

FIG. 1 is a schematic diagram of a three-dimensional convolution calculation process provided herein;

FIG. 2 is a flowchart of a method for extracting image feature data according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an original format of image feature data according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a rearranged format of image feature data according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a first level cache structure according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a register output process for a first clock cycle according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a second clock cycle register output process according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a register output process after line feed according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an extraction system of image feature data according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The three-dimensional convolution calculation performed in a CNN proceeds as follows. Referring to FIG. 1, a schematic diagram of the three-dimensional convolution calculation process provided in the present application, suppose a color image is 6×6×3, where 3 refers to the three color channels; it can be regarded as a stack of three 6×6 images. To detect edges or other features of the image, it is convolved with a three-dimensional filter of dimensions 3×3×3; the filter also has three layers, corresponding to the red, green and blue channels. The first 6 of the image dimensions is the height, the second 6 is the width, and 3 is the number of channels. The filter likewise has a height, a width and a number of channels, and the number of channels of the image must equal the number of channels of the filter. The result of this convolution is a 4×4×1 image. Convolution kernels of sizes 3*3, 7*7, 5*5 and 1*1 are in common use and their calculation principles are similar, so the following mainly takes the 3*3 convolution kernel as an example, but the application is also compatible with 7*7, 5*5 and 1*1 convolution kernels.
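The shape arithmetic of this example can be checked with a short sketch. The function name `conv3d_valid` and the all-ones test data are illustrative assumptions, not part of the application; the sketch assumes stride 1 and no padding, which is what produces a 4×4 result from a 6×6 input and a 3×3 filter.

```python
import numpy as np

def conv3d_valid(image, kernel):
    """Slide a (kh, kw, c) kernel over an (h, w, c) image with stride 1 and no padding."""
    h, w, c = image.shape
    kh, kw, kc = kernel.shape
    if c != kc:
        raise ValueError("image and filter must have the same number of channels")
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the 3D patch by the 3D filter and sum over all three axes.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

image = np.ones((6, 6, 3))    # height 6, width 6, 3 color channels
kernel = np.ones((3, 3, 3))   # filter with the same channel count
result = conv3d_valid(image, kernel)
print(result.shape)  # (4, 4)
```

Each output value sums 3×3×3 = 27 products, so one filter collapses the three channels into a single output plane.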

Referring to fig. 2, fig. 2 is a flowchart of an image feature data extraction method provided in an embodiment of the present application, where the specific technical scheme is as follows:

s101: acquiring image characteristic data and determining the corresponding parallel channel number;

s102: taking the parallel channel number as the depth of the image characteristic data, and increasing the height of the image characteristic data;

s103: using a preset number of RAMs of the FPGA as a first-level cache of the DDR for data multiplexing;

s104: and configuring a register corresponding to the convolution kernel at the back end of the first-level cache, and outputting the image characteristic data row by utilizing the register.

First, the image feature data is acquired and the number of parallel channels is determined; the number of parallel channels depends on the depth of the image feature data. In general, if the depth of the image feature data in CNN calculation is an integer multiple of 64, a power of 2 can be taken as the number of parallel channels. Step S102 converts image feature data of larger depth into data of smaller depth but greater height, which helps improve the DDR bandwidth when the image feature data is read.
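One way to picture the format change of step S102 is the following minimal sketch. It assumes a height × width × depth array layout and assumes that channel groups are stacked one after another along the height axis; the exact ordering used in the application's figures is not reproduced here, and the name `rearrange` is hypothetical.

```python
import numpy as np

def rearrange(feature, parallel):
    """Turn (h, w, c) data into (c // parallel * h, w, parallel) data."""
    h, w, c = feature.shape
    if c % parallel != 0:
        raise ValueError("depth must be a multiple of the parallel channel number")
    groups = c // parallel
    # Split the depth into groups, move the group axis to the front,
    # then stack the groups along the height axis.
    grouped = feature.reshape(h, w, groups, parallel).transpose(2, 0, 1, 3)
    return grouped.reshape(groups * h, w, parallel)

feature = np.arange(4 * 4 * 64).reshape(4, 4, 64)  # small stand-in for a feature map
out = rearrange(feature, 8)
print(out.shape)  # (32, 4, 8)
```

With a depth of 64 and 8 parallel channels, the depth shrinks to 8 while the height grows by a factor of 8, so the total amount of data is unchanged.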

The preset number is not limited here. Before the preset number of RAMs of the FPGA are used as the first-level cache of the DDR for data multiplexing, the preset number may be determined according to the size of the convolution kernel, the preset number being larger than the size of the convolution kernel. Since 7*7, 5*5, 3*3 and 1*1 convolution kernels are typically employed, 8 RAMs used as the first-level cache of the DDR are sufficient to be compatible with the common convolution kernels. Not all of the RAMs are necessarily used; those that are used serve to multiplex data. The width of the RAM is not limited in this embodiment and may be set according to the image feature data and the FPGA computing resources used. Note that each RAM holds one row of horizontal data, and that DDR in this step refers to double data rate synchronous dynamic random access memory.

Thereafter, registers corresponding to the convolution kernel are designed at the back end of the first-level cache. A 3*3 convolution kernel requires 9 registers, and a 5*5 convolution kernel requires 25 registers. The registers are used to output the image feature data and to multiplex data, and the corresponding padding is added automatically based on the registers and the image feature data. Padding here refers to the zero values added around the border of the image feature data; among the 9 registers on the left side of FIG. 6, the registers holding the five 0s are the added padding.

The image feature data is then output row by row using the registers, so that uninterrupted pipelined data output can be achieved.

In this embodiment, the format of the image feature data is adjusted and register shifting is used: rearranging the data maximizes DDR read/write efficiency, and combining the first-level RAM cache with the register shift array unit achieves automatic packing and multiplexing of the image feature data. This yields uninterrupted pipelined data output, reduces the data-read pressure on the DDR, and meets the input requirements of a high-bandwidth back-end convolution computing unit.

To better describe the above embodiments, the following exemplifies the above procedure:

Referring to FIG. 3, in CNN calculation the depth of the image feature data is an integer multiple of 64 (except for the original image, which has 3 layers), and the width and height are equal, being integers of at most 224. If the depth direction is divided into units of 8 (the input-channel calculation parallelism; 8-channel parallel calculation is used as the example here), the original format of the feature data is as shown in FIG. 3.

To maximize the DDR read/write bandwidth when reading the image feature data, the storage format of the data in the DDR needs to be rearranged; the invention first calculates all of the image feature data of the first 8 input channels. It should be noted that the number of input channels, 8, may be adjusted arbitrarily, and is generally set according to the depth of the image feature data. The data format rearranged in the DDR is shown in FIG. 4.

The data at one DDR address may hold n x 8 channels of data (n is determined by the DDR data bit width), and n is necessarily an integer. Therefore, when the data is extracted, the addresses are all sequential, and the DDR can be operated at its maximum read-write bandwidth.
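The claim that extraction touches only sequential addresses can be illustrated with a small address model. This is a sketch under stated assumptions: the function `ddr_address`, the raster ordering inside a channel group, and the sizes used (a 4×8 feature map, 2 channel groups, n = 2) are all hypothetical, and the row width is assumed to be a multiple of n.

```python
def ddr_address(group, row, col, height, width, pixels_per_word):
    """Word address of pixel (row, col) of one 8-channel group in the rearranged layout."""
    # Pixels are stored group by group, row by row, column by column.
    pixel_index = (group * height + row) * width + col
    return pixel_index // pixels_per_word

height, width, groups = 4, 8, 2
n = 2  # pixels of 8 channels each per DDR word; hypothetical value

addresses = []
for g in range(groups):
    for r in range(height):
        for c in range(0, width, n):  # one DDR word is fetched per step
            addresses.append(ddr_address(g, r, c, height, width, n))

print(addresses == list(range(len(addresses))))  # True
```

Reading the data in the order in which it is consumed therefore visits addresses 0, 1, 2, and so on without any jump, which is the access pattern that suits burst reads.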

In this embodiment, taking 3*3 as an example, only the first 4 of the 8 RAMs are used, as shown in FIG. 5; one RAM more than the kernel size is used as ping-pong redundancy so that the data can keep flowing. The RAMs are used to multiplex previously read data. Each RAM has a width of 8B, corresponding to 8 channels of data (adjustable according to the FPGA computing resources and cache resources used), and a depth of 256 (> 224). Each RAM stores one whole row of data in the W direction.
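The sizing in this paragraph can be worked through as follows. The rule "kernel size plus one spare RAM" is an assumption generalized from the 3*3 example (4 RAMs in use); the constant and function names are hypothetical.

```python
TOTAL_RAMS = 8        # RAMs provisioned as the first-level cache
RAM_WIDTH_BYTES = 8   # one byte per channel, 8 parallel channels
RAM_DEPTH = 256       # entries per RAM; must cover the widest row
MAX_ROW_WIDTH = 224   # largest feature-map width in the example

def rams_in_use(kernel_size):
    """Rows needed by the kernel window plus one spare row for ping-pong refill (assumed rule)."""
    return kernel_size + 1

bytes_per_ram = RAM_WIDTH_BYTES * RAM_DEPTH
total_cache_bytes = TOTAL_RAMS * bytes_per_ram

print(rams_in_use(3), bytes_per_ram, total_cache_bytes)  # 4 2048 16384
```

Under this rule a 7*7 kernel would use all 8 RAMs, which is consistent with 8 RAMs being enough for the common kernel sizes, and the whole first-level cache amounts to 16 KiB.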

At the back end of the first-level cache, 9 registers are designed, taking the 3*3 convolution kernel as the example (7*7 and 5*5 convolution kernels are handled similarly), to multiplex and automatically pad the data of one input channel. The initial value of the 9 registers is 0. To implement data multiplexing, padding=1 and data output, as shown for the 3*3 convolution in FIG. 6, the first output requires padding to be added at the upper-left corner of the image feature data. Data output can start once the first 2 rows of the image feature data have been buffered in the first-level RAM cache: the first 2 data of each of the first 2 rows are output to the corresponding positions of the 9 registers, and with the initial 0 values of the registers serving as padding, the first valid 3*3 block can be output. Meanwhile, the first-level cache can continue to cache the third and fourth rows, so that the data cache keeps streaming.

After the first output, the second clock cycle needs to produce the 3*3 data on the left side of FIG. 7. The 9 registers are first shifted to the right as a whole while receiving the third data output by the first-level cache, so that the data is output and the previous data is multiplexed. The time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed. These steps are repeated until the packing, selective multiplexing and output of the first two rows of image feature data are complete; a line feed is then performed, the 9 register values are reset, and the first and second rows of data in the RAM cache are multiplexed.

In this way, continuous data output and maximum multiplexing are ensured, and the multiplexing, packing and high-speed continuous output of the entire image feature data can be completed in sequence.
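The register behavior described above can be modeled in software as a rough sketch. It covers a single input channel, treats each row buffer as a plain array, and picks a shift direction purely as a modeling choice; the function name `sliding_windows` is hypothetical and the code is a behavioral model, not the hardware design of the application.

```python
import numpy as np

def sliding_windows(feature, k=3, pad=1):
    """Produce k*k windows in raster order with a shift-register model and zero padding."""
    h, w = feature.shape
    windows = []
    for r in range(h):
        # Line feed: the registers are reset to 0, which also supplies the left padding.
        regs = np.zeros((k, k), dtype=feature.dtype)
        # Rows feeding the window; rows outside the image act as zero padding.
        rows = [feature[i] if 0 <= i < h else np.zeros(w, dtype=feature.dtype)
                for i in range(r - pad, r - pad + k)]
        for c in range(w + pad):
            # One new column per cycle; past the row end a zero column is fed (right padding).
            new_col = np.array([row[c] if c < w else 0 for row in rows],
                               dtype=feature.dtype)
            # Shift by one column so the previous k-1 columns are reused.
            regs[:, :-1] = regs[:, 1:].copy()
            regs[:, -1] = new_col
            if c >= k - 1 - pad:  # enough columns loaded for a valid window
                windows.append(regs.copy())
    return windows

f = np.arange(1, 13).reshape(3, 4)   # small 3x4 single-channel example
wins = sliding_windows(f)
print(len(wins))  # 12
print(wins[0])
```

For the 3*3 case, the first window is emitted after only two columns of data have been loaded, and it contains five zeros, matching the five padding registers described for FIG. 6. Every later window in the same row loads one new column and reuses the other two.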

With this application, address jumps are avoided when reading the DDR, and DDR read/write efficiency is maximized. The first-level cache occupies less than 3% of the RAM resources of the VU7 (an FPGA). Because all required data is multiplexed, the mismatch between the back-end computing capacity and the DDR data output bandwidth is effectively resolved. In addition, because part of the input channel data is calculated first, all filter kernels can be multiplexed, which reduces the data transmission bandwidth required for the filter kernels.

The following describes an image feature data extraction system provided in the embodiments of the present application, and the image feature data extraction system described below and the image feature data extraction method described above may be referred to correspondingly.

Referring to FIG. 9, a schematic structural diagram of an extraction system of image feature data provided in an embodiment of the present application, the present application further provides an extraction system of image feature data, including:

a data acquisition module 100, configured to acquire image feature data and determine a corresponding number of parallel channels;

a data format changing module 200, configured to take the number of parallel channels as the depth of the image feature data and to increase the height of the image feature data;

the data multiplexing module 300 is configured to perform data multiplexing by using a preset number of RAMs of the FPGA as a first level buffer of the DDR; each RAM stores a row of lateral data;

the data extraction module 400 is configured to configure registers corresponding to the convolution kernel at the back end of the first-level cache, and to output the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed.

Based on the above embodiment, as a preferred embodiment, further comprising:

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the steps provided by the above embodiments. The storage medium may include: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

The application also provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the foregoing embodiments when calling the computer program in the memory. Of course the electronic device may also include various network interfaces, power supplies, etc.

In the description, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others, so identical or similar parts of the embodiments may be referred to one another. Because the system provided by the embodiment corresponds to the method provided by the embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.

Specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the description of the examples above is only intended to assist in understanding the methods of the present application and their core ideas. It should be noted that it would be obvious to those skilled in the art that various improvements and modifications can be made to the present application without departing from the principles of the present application, and such improvements and modifications fall within the scope of the claims of the present application.

It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. An extraction method of image feature data, characterized by comprising the following steps:

using a preset number of RAMs of the FPGA as a first-level cache of the DDR for data multiplexing; each RAM stores a row of lateral data; wherein the storage format of the data in the DDR is rearranged, and the DDR is operated with a maximum read-write bandwidth when the data is extracted from the DDR, and the addresses in the DDR are sequentially arranged;

configuring registers corresponding to the convolution kernel at the back end of the first-level cache, and outputting the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed;

before the data multiplexing is performed by using the preset number of RAMs of the FPGA as the first-level buffer memory of the DDR, the method further comprises the following steps:

determining the preset number according to the size of the convolution kernel, wherein the preset number is larger than the size of the convolution kernel;

when the registers are used to output the image feature data row by row, the method further comprises, before each line feed of the registers:

2. The extraction method according to claim 1, further comprising, after the first level cache back end is configured with a register corresponding to a convolution kernel:

3. An extraction system of image feature data, comprising:

the data multiplexing module is used for multiplexing data by using a preset number of RAMs of the FPGA as a first-level cache of the DDR; each RAM stores a row of lateral data; wherein the storage format of the data in the DDR is rearranged, and the DDR is operated with a maximum read-write bandwidth when the data is extracted from the DDR, and the addresses in the DDR are sequentially arranged;

the data extraction module is used for configuring registers corresponding to the convolution kernel at the back end of the first-level cache and outputting the image feature data row by row using the registers; wherein the time taken to output the image feature data once is taken as one clock cycle, and from the second clock cycle onward the image feature data of the previous clock cycle is multiplexed;

wherein, the extraction system further includes:

the quantity determining module is used for determining the preset quantity according to the size of the convolution kernel, and the preset quantity is larger than the size of the convolution kernel;

4. The extraction system of claim 3, further comprising:

5. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the image feature data extraction method according to any one of claims 1-2.

6. An electronic device comprising a memory and a processor, the memory having a computer program stored therein, the processor, when calling the computer program in the memory, implementing the steps of the method for extracting image feature data according to any one of claims 1-2.