CN114359683B

CN114359683B - Text positioning-oriented single-core HOG efficient heterogeneous acceleration method

Info

Publication number: CN114359683B
Application number: CN202111671159.2A
Authority: CN
Inventors: 阎波; 张国宁; 孙王超; 陈俊希; 覃昊洁
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2023-10-20
Anticipated expiration: 2041-12-31
Also published as: CN114359683A

Abstract

The invention discloses a text positioning-oriented single-core HOG efficient heterogeneous acceleration method. The method comprises: allocating a work item to each pixel, convolving the pixels around each pixel, calculating the magnitude and phase of the convolved pixel, calculating the discrete gradient direction of the pixel by bilinear interpolation, storing the discrete gradient direction in the local memory of the hardware, and releasing the work items allocated to the pixels; allocating a work item to each cell and performing global indexing on the hardware; calculating the voting results for the discrete gradient directions and completing the statistics for each row of pixels; normalizing and summing the counted pixels to form HOG feature vectors and obtain the feature vector of the image; and implementing the method on a heterogeneous platform to complete heterogeneous acceleration. The invention meets the real-time and low-energy requirements of text positioning and can further improve the reliability of scene text recognition.

Description

Text positioning-oriented single-core HOG efficient heterogeneous acceleration method

Technical Field

The invention relates to the field of scene character recognition, in particular to a single-core HOG efficient heterogeneous acceleration method for text positioning.

Background

With the widespread adoption of smart handheld devices and the rapid development of artificial intelligence, images and video have become the main carriers of media information. This media contains a large number of natural scenes, and the text within them has significant application value. Accurately and rapidly extracting text from natural scenes is therefore very important, and text localization is a key technology for doing so.

Because text localization involves high-complexity algorithms and continuously growing data volumes, the real-time performance of text localization algorithms is under pressure. The HOG (Histogram of Oriented Gradients) algorithm is the most commonly used algorithm in text localization. Existing multi-kernel HOG acceleration schemes use several kernels on the device side, with global synchronization between them, to perform the pixel gradient calculation, cell gradient statistics, and block normalization of the HOG features. This, however, incurs costly loop operations, and the overhead of global synchronization and global-memory access is also high. In heterogeneous system implementations, multi-kernel acceleration schemes can also suffer from significant power consumption problems.

Disclosure of Invention

To address the deficiencies of the prior art, the invention provides a single-core HOG efficient heterogeneous acceleration method for text positioning, which solves the prior-art problems of high memory-access overhead and heavy computation.

In order to achieve the aim of the invention, the invention adopts the following technical scheme:

The text positioning-oriented single-core HOG efficient heterogeneous acceleration method comprises the following steps:

S1. Acquire the pixels of a grayscale image and allocate a work item to each pixel; each connected pixel region of uniform size Cx×Cy forms a cell;

S2. In each work item, perform row convolution and column convolution on the pixels surrounding each pixel, using a differential template and its transpose;

S3. Calculate the magnitude and phase of the convolved pixel;

S4. Using the obtained magnitude and phase, calculate the discrete gradient direction of the pixel by bilinear interpolation, store the discrete gradient direction in the local memory of the hardware, and release the work items allocated to the pixels;

S5. Allocate a work item to each cell and perform global indexing on the hardware;

S6. Create statistical variables for counting the discrete gradient directions of the pixels, add the discrete gradient direction of each pixel directly to the statistical variables in parallel, and use one group of variables to compute the voting result for the discrete gradient directions of each row of pixels in a cell, thereby completing the statistics for each row of pixels and obtaining the count of each discrete gradient direction in each cell;

S7. Synchronize the local memory of the hardware; based on the voting results, have each work item count the pixels of one row in a cell, and perform parallel reduction to obtain the gradient statistics of all cells; store the discrete gradient results of the gradient statistics in the local memory of the hardware, and release the work items allocated to the cells;

S8. Use a work item to normalize the discrete gradients of a cell after gradient statistics, sum the normalized results of each cell to obtain a sum value for each cell, cache the sum values of the cells in the same block in the local memory of the hardware, and perform local synchronization on the hardware to obtain the local direction gradient of each block; one image comprises a plurality of blocks;

S9. Combine the local direction gradients of the blocks into the HOG feature vector to obtain the feature vector of the image;

S10. Load the above steps onto a heterogeneous platform to realize heterogeneous acceleration.
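As an illustration of step S4, the sketch below shows one common way that bilinear interpolation can split a pixel's magnitude between the two nearest discrete directions. The 9 bins over 0 to 180 degrees, the bin-centre convention, and the name `bilinear_vote` are assumptions made for illustration; the source does not specify them.

```python
import math

def bilinear_vote(magnitude, phase_deg, n_bins=9):
    """Split one pixel's magnitude between its two nearest orientation bins.

    Assumption: unsigned orientation in [0, 180) degrees and bin centres at
    (i + 0.5) * bin_width. Neither is stated in the source.
    """
    bin_width = 180.0 / n_bins
    # Position measured in bins, relative to the bin centres.
    pos = phase_deg / bin_width - 0.5
    lo = math.floor(pos)
    frac = pos - lo
    lower = lo % n_bins          # wraps around, so 0 degrees shares with the last bin
    upper = (lo + 1) % n_bins
    return [(lower, magnitude * (1.0 - frac)), (upper, magnitude * frac)]

# A phase of 20 degrees lies halfway between the centres of bins 0 and 1.
votes = bilinear_vote(10.0, 20.0)
```

The two weights always sum to the pixel's magnitude, so a pixel near a bin boundary contributes to both neighbouring directions instead of flipping abruptly between them.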

Further, in step S1, the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell is a connected pixel region of uniform size Cx×Cy, the total pixel size of the window image is Wx×Wy, and the generated two-dimensional index is (Wx, Wy).

Further, the differential template in step S2 is [-1, 0, 1].
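A minimal sketch of steps S2 and S3 for a single pixel follows, written as plain Python rather than as an OpenCL kernel. The border clamping, the sign convention (right minus left, down minus up), the unsigned 0 to 180 degree phase, and the name `gradient_at` are assumptions; the source only gives the template [-1, 0, 1] and its transpose.

```python
import math

def gradient_at(img, x, y):
    """Apply [-1, 0, 1] along the row and its transpose along the column at (x, y).

    img is a list of rows of gray values. Border pixels reuse the nearest
    valid neighbour (an assumption, not specified in the source).
    """
    h, w = len(img), len(img[0])
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    up = img[max(y - 1, 0)][x]
    down = img[min(y + 1, h - 1)][x]
    gx = right - left            # row template [-1, 0, 1]
    gy = down - up               # column template (the transpose)
    magnitude = math.hypot(gx, gy)
    # Unsigned orientation folded into [0, 180) degrees (assumption).
    phase = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, phase

# A horizontal edge: gray values change only from top to bottom.
img = [[0, 0, 0],
       [0, 0, 0],
       [10, 10, 10]]
mag, ph = gradient_at(img, 1, 1)
```

Each call reads only the four neighbours of its own pixel and writes nothing shared, which is why one work item per pixel can run this stage without synchronization.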

Further, the global index in step S5 has a size of (Wx/Cx, Wy/Cy).

Further, the size of the global index during parallel reduction in step S7 is (Wx/Cx, Wy), and a total of Wx×Wy/Cx work items are used.
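The index sizes stated for steps S1, S5 and S7 can be worked through numerically. The window and cell sizes used here (64×128 pixels, 8×8 cells) and the name `index_sizes` are hypothetical values chosen for illustration; the source leaves Wx, Wy, Cx and Cy symbolic.

```python
def index_sizes(wx, wy, cx, cy):
    """Return the global index sizes for the per-pixel, per-cell and reduction stages."""
    if wx % cx or wy % cy:
        raise ValueError("window size must be a multiple of the cell size")
    return {
        "S1_pixels": (wx, wy),               # one work item per pixel
        "S5_cells": (wx // cx, wy // cy),    # one work item per cell
        "S7_reduction": (wx // cx, wy),      # one work item per row of each cell
        "S7_work_items": wx * wy // cx,
    }

# Hypothetical example: 64x128 window, 8x8 cells.
sizes = index_sizes(64, 128, 8, 8)
```

For these values the reduction stage uses 8 × 128 = 1024 work items, which matches Wx×Wy/Cx = 64 × 128 / 8.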

Further, the first parallel reduction in step S7 uses a total of Cy/2 work items to count the gradients of two columns; the second parallel reduction uses a total of Cy/4 work items to count the gradients of two columns; each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.
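The halving pattern described for step S7 can be sketched as a tree reduction over Cy partial histograms. This is a sequential Python simulation, not the patented kernel: the inner loop stands in for work items that would run in parallel, and the name `tree_reduce` and the power-of-two restriction are assumptions.

```python
def tree_reduce(partials):
    """Pairwise-reduce a list of partial histograms.

    Returns the combined histogram and the number of active work items
    used at each step (Cy/2, Cy/4, ..., 1).
    """
    n = len(partials)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of partials")
    work = [list(p) for p in partials]
    steps = []
    active = n // 2
    while active >= 1:
        steps.append(active)
        # Each i stands for one work item; on a device these run in parallel,
        # followed by a local-memory synchronization before the next step.
        for i in range(active):
            work[i] = [a + b for a, b in zip(work[i], work[i + active])]
        active //= 2
    return work[0], steps

# Cy = 8 partial histograms with 2 directions each (toy data).
rows = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 0], [0, 3], [1, 0], [0, 1]]
total, steps = tree_reduce(rows)
```

With Cy = 8 the active work-item counts are 4, 2 and 1, so the cell total is obtained in three steps instead of seven sequential additions.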

The beneficial effects of the invention are as follows:

1. In the gradient statistics process, a work item is allocated to each cell rather than to each pixel, which resolves the access-conflict problem;

2. In step S7, each work item counts the pixels of one row, which ensures contiguous memory access by the work items, increases parallelism by a factor of Cy, raises GPU resource utilization, fully exploits the parallel processing capability of the GPU, and reduces overhead;

3. In the gradient statistics process, the GPU avoids local-memory access conflicts by means of a high-cost atomic function; the FPGA avoids local-memory access conflicts through alternating access to multiple physical memory blocks;

4. In step S8, the summation results are cached in the local memory of the hardware, which reduces global-memory accesses, shortens the reduction time, and saves computation time;

5. Through local memory synchronization, the corresponding computing task is completed with a single device kernel, reducing resource consumption by more than 50%; relative to a CPU, the energy efficiency ratios of the scheme on the GPU and FPGA platforms are 22.8 and 42.5 respectively, so device energy consumption is effectively reduced;

6. The voting approach avoids the complex reduction operations of the traditional statistical method, reducing the algorithm's computation time by more than 50%; relative to a CPU, the speedups of the scheme on the GPU and FPGA platforms are 28 and 6.9 respectively, so computation time is effectively reduced;

7. The computation time of the HOG algorithm on the GPU and FPGA platforms is 25 ms and 102 ms respectively, and the energy consumption is 4 J and 2.14 J respectively, which meets the real-time and low-energy requirements of text positioning and can further improve the reliability of text recognition in scenes (including images and videos).
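The figures quoted in items 5 to 7 can be cross-checked with simple arithmetic. The sketch below assumes that the speedup means CPU time divided by device time and that the energy efficiency ratio means CPU energy divided by device energy; the source does not define either ratio, and the derived CPU baseline and average power values are not stated in the source.

```python
# Figures taken from the text above.
gpu = {"time_ms": 25.0, "energy_j": 4.0, "speedup": 28.0, "energy_ratio": 22.8}
fpga = {"time_ms": 102.0, "energy_j": 2.14, "speedup": 6.9, "energy_ratio": 42.5}

def implied_cpu_baseline(platform):
    """CPU time (ms) and energy (J) implied by a platform's ratios (derived, not sourced)."""
    cpu_time_ms = platform["time_ms"] * platform["speedup"]
    cpu_energy_j = platform["energy_j"] * platform["energy_ratio"]
    return cpu_time_ms, cpu_energy_j

def average_power_w(platform):
    """Average power in watts, computed as energy divided by run time."""
    return platform["energy_j"] / (platform["time_ms"] / 1000.0)

gpu_cpu_time, gpu_cpu_energy = implied_cpu_baseline(gpu)
fpga_cpu_time, fpga_cpu_energy = implied_cpu_baseline(fpga)
gpu_power = average_power_w(gpu)
fpga_power = average_power_w(fpga)
```

Under these assumptions both platforms imply nearly the same CPU baseline (about 700 ms and about 91 J), so the quoted times, energies and ratios are mutually consistent. The implied average power is about 160 W for the GPU and about 21 W for the FPGA, which is in line with the FPGA being slower but more energy-efficient.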

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a design block diagram of the present invention.

Detailed Description

The following description of embodiments of the invention is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, all variations within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept, are protected.

As shown in FIG. 1 and FIG. 2, the text positioning-oriented single-core HOG efficient heterogeneous acceleration method comprises the following steps:

S3. Calculate the magnitude and phase of the convolved pixel;

In step S1, the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell is a connected pixel region of uniform size Cx×Cy, the total pixel size of the window image is Wx×Wy, and the generated two-dimensional index is (Wx, Wy).

The differential template in step S2 is [-1, 0, 1].

The global index in step S5 has a size of (Wx/Cx, Wy/Cy).

The global index size during parallel reduction in step S7 is (Wx/Cx, Wy), and a total of Wx×Wy/Cx work items are used.
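The row-wise voting of steps S6 and S7 can be sketched as follows, with one simulated work item per (cell column, pixel row) pair, matching the (Wx/Cx, Wy) index above. This is a sequential Python illustration under assumed names (`vote_rows`, `bins`); each work item writes only to its own group of counters, which is how the sketch avoids write conflicts.

```python
def vote_rows(bins, cx, n_bins):
    """Count discrete gradient directions per row segment of each cell.

    bins[y][x] is the discrete direction index of pixel (x, y).
    Returns votes[y][cell_x], a list of n_bins counters that is private to
    the work item handling row y of cell column cell_x.
    """
    wy, wx = len(bins), len(bins[0])
    n_cells_x = wx // cx
    votes = [[[0] * n_bins for _ in range(n_cells_x)] for _ in range(wy)]
    for y in range(wy):
        for cell_x in range(n_cells_x):      # one (cell_x, y) pair == one work item
            counters = votes[y][cell_x]      # private group of variables
            for x in range(cell_x * cx, (cell_x + 1) * cx):
                counters[bins[y][x]] += 1    # one vote per pixel
    return votes

# Toy input: a 4x2 image of direction indices, cells 2 pixels wide, 3 directions.
bins = [[0, 0, 1, 2],
        [1, 1, 2, 2]]
votes = vote_rows(bins, cx=2, n_bins=3)
```

Summing the Cy row results that belong to the same cell (for instance with a pairwise reduction) then yields that cell's histogram.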

In step S7, the first parallel reduction uses a total of Cy/2 work items to count the gradients of two rows; the second parallel reduction uses a total of Cy/4 work items to count the gradients of two columns; each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.

In step S3 there is no data interaction between work items, so neither global nor local synchronization is required.

The high-level OpenCL description of steps S1 to S9 is converted into a hardware description language by AOCL, which generates the specific hardware circuit.

The scheme is implemented on both a CPU+GPU and a CPU+FPGA heterogeneous platform. The CPU serves as the host and performs system scheduling, while the GPU and the FPGA respectively serve as the device. The platform and device are first initialized and configured. The host then starts the device and directs it to perform the remaining operations. After the results are obtained, the final classification computation is completed on the host side. Experiments show that the scheme meets the real-time and low-energy requirements of text positioning and can further improve the reliability of scene text recognition.

According to the invention, in the gradient statistics process a work item is allocated to each cell rather than to each pixel, which resolves the access-conflict problem.

In step S7, each work item counts the pixels of one row, which ensures contiguous memory access by the work items, increases parallelism by a factor of Cy, raises GPU resource utilization, fully exploits the parallel processing capability of the GPU, and reduces overhead.

In the gradient statistics process, the GPU avoids local-memory access conflicts by means of high-cost atomic functions. OpenCL atomic functions perform atomic operations on 32-bit signed and unsigned integers in global or local memory: while one work item is accessing a memory location, other work items cannot access it. In step S6, when several pixels in a cell have the same discrete gradient direction, they write to the same memory location in parallel, which creates a race condition and can lose data; atomic functions solve this problem.

In the gradient statistics process, the FPGA avoids local-memory access conflicts through alternating access to multiple physical memory blocks. Multiple on-chip M9K blocks of the FPGA serve as local memory, allowing the work items of the same work group to access them alternately and thereby avoiding local-memory access conflicts. In step S6, by partitioning the work groups appropriately, the pixel voting results are stored in local memory, so the FPGA avoids the high cost of atomic floating-point addition.

In step S8, the summation results are cached in the local memory of the hardware, which reduces global-memory accesses, shortens the reduction time, and saves computation time.

Through local memory synchronization, the corresponding computing task is completed with a single device kernel, reducing resource consumption by more than 50%; relative to a CPU, the energy efficiency ratios of the scheme on the GPU and FPGA platforms are 22.8 and 42.5 respectively, so device energy consumption is effectively reduced.

The voting approach avoids the complex reduction operations of the traditional statistical method, reducing the algorithm's computation time by more than 50%; relative to a CPU, the speedups of the scheme on the GPU and FPGA platforms are 28 and 6.9 respectively, so computation time is effectively reduced.

The computation time of the HOG algorithm on the GPU and FPGA platforms is 25 ms and 102 ms respectively, and the energy consumption is 4 J and 2.14 J respectively, which meets the real-time and low-energy requirements of text positioning and can further improve the reliability of text recognition in scenes (including images and videos).
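As an illustration of the block stage (step S8) referred to above, the sketch below computes one partial sum per cell, treats those sums as the values cached in local memory, and then normalizes the block. The choice of an L2 norm, the small `eps` term, the 2×2 block of cells, and the name `normalize_block` are all assumptions; the source does not state which norm is used or the exact order of operations.

```python
import math

def normalize_block(cell_hists, eps=1e-6):
    """L2-normalize the concatenated histograms of the cells in one block.

    Stage 1 mimics one work item per cell producing a partial sum of squares
    (the per-cell "sum value"); stage 2 combines the cached partial sums after
    what would be a local synchronization on the device.
    """
    cell_sums = [sum(v * v for v in hist) for hist in cell_hists]  # cached per cell
    norm = math.sqrt(sum(cell_sums) + eps)
    return [v / norm for hist in cell_hists for v in hist]

# Hypothetical block of 2x2 cells with 2 directions per cell.
block = [[3, 0], [0, 4], [0, 0], [0, 0]]
features = normalize_block(block)
```

Because only one number per cell has to be shared within the block, the per-cell sums fit naturally in local memory, which is the access pattern the text credits with reducing global-memory traffic.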

Claims

1. A text positioning-oriented single-core HOG efficient heterogeneous acceleration method is characterized by comprising the following steps of:

S3. calculating the magnitude and the phase of the convolved pixel;

S4. calculating the discrete gradient direction of the pixel from the obtained magnitude and phase by a bilinear interpolation algorithm, storing the discrete gradient direction in a local memory of the hardware, synchronizing the local memory of the hardware, and releasing the work item allocated to the pixel;

2. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 1, wherein in step S1 the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell is a connected pixel region of uniform size Cx×Cy, the total pixel size of the window image is Wx×Wy, and the generated two-dimensional index is (Wx, Wy).

3. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 1, wherein the differential template in step S2 is [-1, 0, 1].

4. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 2, wherein the global index in step S5 has a size of (Wx/Cx, Wy/Cy).

5. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 4, wherein the global index size during parallel reduction in step S7 is (Wx/Cx, Wy), and a total of Wx×Wy/Cx work items are used.

6. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 5, wherein the first parallel reduction in step S7 uses a total of Cy/2 work items to count the gradients of two columns; the second parallel reduction uses a total of Cy/4 work items to count the gradients of two columns; each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.