CN114359683B - Text positioning-oriented single-core HOG efficient heterogeneous acceleration method - Google Patents

Info

Publication number
CN114359683B
Authority
CN
China
Prior art keywords
cell
pixels
hog
pixel
hardware
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111671159.2A
Other languages
Chinese (zh)
Other versions
CN114359683A (en)
Inventor
阎波
张国宁
孙王超
陈俊希
覃昊洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111671159.2A
Publication of CN114359683A
Application granted
Publication of CN114359683B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a text positioning-oriented single-core HOG efficient heterogeneous acceleration method. The method comprises: allocating a work item to each pixel, convolving the pixels around each pixel, calculating the amplitude and phase of the convolved pixel, calculating the discrete gradient direction of the pixel through a bilinear interpolation algorithm, storing the discrete gradient direction in the local memory of the hardware, and releasing the work items allocated to the pixels; allocating a work item to each cell and performing global indexing on the hardware; calculating the voting results of the discrete gradient directions and completing the statistics for each row of pixels; normalizing and summing the statistics to form HOG feature vectors and obtain the feature vector of the image; and implementing the method on a heterogeneous platform to complete the heterogeneous acceleration. The invention meets the real-time and low-energy requirements of text positioning and can further improve the reliability of scene character recognition.

Description

Text positioning-oriented single-core HOG efficient heterogeneous acceleration method
Technical Field
The invention relates to the field of scene character recognition, in particular to a single-core HOG efficient heterogeneous acceleration method for text positioning.
Background
With the widespread adoption of smart handheld devices and the rapid development of artificial intelligence, images and videos have become the main carriers of media information. This media contains a large number of natural scenes, and the text within those scenes has important application value. Extracting text accurately and quickly from natural scenes is therefore of great importance, and text positioning is a key technology for doing so.
Text positioning relies on high-complexity algorithms and must handle continuously growing volumes of data, which challenges the real-time performance of text positioning algorithms. The HOG (Histogram of Oriented Gradients) algorithm is the most commonly used algorithm in text positioning. Existing multi-kernel HOG acceleration schemes run several kernels on the device, with global synchronization between them, to implement the pixel gradient calculation, cell gradient statistics, and block normalization of HOG features. However, this produces costly loop operations, and the overhead of global synchronization and global memory access is high. In heterogeneous system implementations, such multi-kernel acceleration schemes can also cause significant power consumption problems.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides a text positioning-oriented single-core HOG efficient heterogeneous acceleration method, which solves the problems of high memory access overhead and heavy computation in the prior art.
To achieve this aim, the invention adopts the following technical scheme:
The text positioning-oriented single-core HOG efficient heterogeneous acceleration method comprises the following steps:
S1, acquiring the pixels of a grayscale image and allocating a work item to each pixel; each connected, uniformly sized region of Cx×Cy pixels forms a cell;
S2, in each work item, performing row convolution and column convolution on the pixels around each pixel using a differential template and its transpose;
S3, calculating the amplitude and the phase of the convolved pixel;
S4, calculating the discrete gradient direction of the pixel from the obtained amplitude and phase through a bilinear interpolation algorithm, storing the discrete gradient direction in the local memory of the hardware, and releasing the work item allocated to the pixel;
S5, allocating a work item to each cell and performing global indexing on the hardware;
S6, creating statistical variables for counting the discrete gradient directions of the pixels, adding the discrete gradient direction of each pixel to the statistical variables directly and in parallel, and using one group of variables for each row of pixels in a cell to compute the voting result of the discrete gradient directions of that row, thereby completing the statistics for each row of pixels and obtaining the count of each discrete gradient direction in each cell;
S7, synchronizing the local memory of the hardware, using each work item to count one row of pixels in the cell based on the voting results, performing parallel reduction to obtain the gradient statistics of all cells, storing the discrete gradient results of the gradient statistics in the local memory of the hardware, and releasing the work items allocated to the cells;
S8, using a work item to normalize the discrete gradients of a cell after gradient statistics, summing the normalized results of each cell to obtain a sum value for each cell, caching the sum values of the cells in the same block in the local memory of the hardware, and performing local synchronization on the hardware to obtain the local oriented gradient of each block; one image comprises a plurality of blocks;
S9, combining the local oriented gradients of all blocks into HOG feature vectors to obtain the feature vector of the image;
S10, loading the above steps onto a heterogeneous platform to realize heterogeneous acceleration.
Further, in step S1, the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell comprises a connected, uniformly sized region of Cx×Cy pixels; the total pixel size of the window image is Wx×Wy; and the generated two-dimensional index is (Wx, Wy).
Further, the differential template in step S2 is [-1, 0, 1].
Further, the size of the global index in step S5 is (Wx/Cx, Wy/Cy).
Further, the size of the global index during the parallel reduction in step S7 is (Wx/Cx, Wy), so a total of Wx×Wy/Cx work items are used.
Further, the first parallel reduction in step S7 uses a total of Cy/2 work items to count two columns of gradients; the second parallel reduction uses a total of Cy/4 work items to count two columns of gradients; each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.
The beneficial effects of the invention are as follows:
1. In the gradient statistics process, a work item is allocated to each cell rather than to each pixel, which resolves memory access conflicts.
2. In step S7, each work item counts the pixels of one row, which ensures contiguous memory access by the work items, increases parallelism by a factor of Cy, improves GPU resource utilization, makes full use of the GPU's parallel processing capability, and reduces overhead.
3. In the gradient statistics process, the GPU avoids local memory access conflicts by using atomic functions, which are costly; the FPGA avoids local memory access conflicts by alternating accesses across multiple physical memory blocks.
4. In step S8, the summation results are cached in the local memory of the hardware, which reduces global memory accesses, shortens the reduction time, and saves computation time.
5. Through local memory synchronization, the computing task is completed with a single device kernel, reducing resource consumption by more than 50%; compared with a CPU, the energy efficiency ratios of the scheme on the GPU and FPGA platforms are 22.8 and 42.5 respectively, so device energy consumption is effectively reduced.
6. The voting approach avoids the complex reduction operations of the traditional statistical method, reducing the algorithm's computation time by more than 50%; compared with a CPU, the speedups of the scheme on the GPU and FPGA platforms are 28 and 6.9 respectively, so computation time is effectively reduced.
7. The computation times of the HOG algorithm on the GPU and FPGA platforms are 25 ms and 102 ms respectively, and the energy consumption is 4 J and 2.14 J respectively, which meets the real-time and low-energy requirements of text positioning and can further improve the reliability of character recognition in scenes (including images and videos).
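As a quick consistency check on the figures in items 5 to 7, the reported device times and energies can be multiplied by the reported ratios to recover the CPU baseline they imply. This is only a sketch: it assumes that the "acceleration ratio" means CPU time divided by device time and that the "energy efficiency ratio" means CPU energy divided by device energy. The text defines neither ratio and does not state the CPU baseline itself; the names `REPORTED` and `implied_cpu_baseline` are illustrative.

```python
# Figures quoted in the text; the dictionary layout is illustrative only.
REPORTED = {
    "GPU": {"time_ms": 25.0, "energy_j": 4.0, "speedup": 28.0, "energy_ratio": 22.8},
    "FPGA": {"time_ms": 102.0, "energy_j": 2.14, "speedup": 6.9, "energy_ratio": 42.5},
}


def implied_cpu_baseline(platform):
    """Return (cpu_time_ms, cpu_energy_j) implied by one platform's figures.

    Assumes speedup = cpu_time / device_time and
    energy_ratio = cpu_energy / device_energy. This is an interpretation
    of the quoted ratios, not something the text states.
    """
    p = REPORTED[platform]
    return p["time_ms"] * p["speedup"], p["energy_j"] * p["energy_ratio"]


if __name__ == "__main__":
    for name in REPORTED:
        cpu_ms, cpu_j = implied_cpu_baseline(name)
        print(f"{name}: implied CPU time {cpu_ms:.1f} ms, implied CPU energy {cpu_j:.2f} J")
```

Under that reading, both platforms imply nearly the same CPU baseline (about 700 ms and about 91 J), which suggests that the quoted ratios and the quoted absolute figures are mutually consistent.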
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of the design of the present invention.
Detailed Description
The following description of specific embodiments is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, all variations that fall within the spirit and scope of the invention as defined in the appended claims are apparent, and every invention that makes use of the inventive concept is protected.
As shown in fig. 1 and fig. 2, the text positioning-oriented single-core HOG efficient heterogeneous acceleration method includes the following steps:
S1, acquiring the pixels of a grayscale image and allocating a work item to each pixel; each connected, uniformly sized region of Cx×Cy pixels forms a cell;
S2, in each work item, performing row convolution and column convolution on the pixels around each pixel using a differential template and its transpose;
S3, calculating the amplitude and the phase of the convolved pixel;
S4, calculating the discrete gradient direction of the pixel from the obtained amplitude and phase through a bilinear interpolation algorithm, storing the discrete gradient direction in the local memory of the hardware, and releasing the work item allocated to the pixel;
S5, allocating a work item to each cell and performing global indexing on the hardware;
S6, creating statistical variables for counting the discrete gradient directions of the pixels, adding the discrete gradient direction of each pixel to the statistical variables directly and in parallel, and using one group of variables for each row of pixels in a cell to compute the voting result of the discrete gradient directions of that row, thereby completing the statistics for each row of pixels and obtaining the count of each discrete gradient direction in each cell;
S7, synchronizing the local memory of the hardware, using each work item to count one row of pixels in the cell based on the voting results, performing parallel reduction to obtain the gradient statistics of all cells, storing the discrete gradient results of the gradient statistics in the local memory of the hardware, and releasing the work items allocated to the cells;
S8, using a work item to normalize the discrete gradients of a cell after gradient statistics, summing the normalized results of each cell to obtain a sum value for each cell, caching the sum values of the cells in the same block in the local memory of the hardware, and performing local synchronization on the hardware to obtain the local oriented gradient of each block; one image comprises a plurality of blocks;
S9, combining the local oriented gradients of all blocks into HOG feature vectors to obtain the feature vector of the image;
S10, loading the above steps onto a heterogeneous platform to realize heterogeneous acceleration.
In step S1, the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell comprises a connected, uniformly sized region of Cx×Cy pixels; the total pixel size of the window image is Wx×Wy; and the generated two-dimensional index is (Wx, Wy).
The differential template in step S2 is [-1, 0, 1].
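Steps S2 and S3 can be sketched on the host side with NumPy, as a reference for what each per-pixel work item computes. This is a minimal sketch under stated assumptions, not the OpenCL kernel of the invention: the function name `pixel_gradients`, the replicated-border handling, and the folding of the phase into the unsigned range [0, 180) degrees are all assumptions that the text does not specify.

```python
import numpy as np


def pixel_gradients(gray):
    """Apply [-1, 0, 1] along rows and its transpose along columns,
    then return the per-pixel gradient magnitude and phase in degrees."""
    img = np.asarray(gray, dtype=np.float64)
    # Assumption: border pixels are replicated so every pixel has two neighbours.
    padded = np.pad(img, 1, mode="edge")
    # Correlation form of the template: right neighbour minus left neighbour.
    gx = padded[1:-1, 2:] - padded[1:-1, :-2]
    # Transposed template: lower neighbour minus upper neighbour.
    gy = padded[2:, 1:-1] - padded[:-2, 1:-1]
    magnitude = np.hypot(gx, gy)
    # Assumption: unsigned orientation, folded into [0, 180) degrees.
    phase = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, phase


if __name__ == "__main__":
    ramp = np.tile(np.arange(5.0), (4, 1))  # intensity grows left to right
    mag, ph = pixel_gradients(ramp)
    print(mag[1, 2], ph[1, 2])
```

Each output element depends only on the four neighbours of one pixel, which is why one work item per pixel needs no communication with the others at this stage.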
The size of the global index in step S5 is (Wx/Cx, Wy/Cy).
The size of the global index during the parallel reduction in step S7 is (Wx/Cx, Wy), so a total of Wx×Wy/Cx work items are used.
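The index sizes stated for steps S1, S5 and S7 can be tabulated for a concrete window. The sketch below simply evaluates those expressions; the helper name `work_item_counts` and the 64×128 window with 8×8 cells used in the example are hypothetical values chosen for illustration, not figures from the text.

```python
def work_item_counts(wx, wy, cx, cy):
    """Evaluate the global index sizes given in the text for a Wx x Wy
    window divided into Cx x Cy cells."""
    if wx % cx or wy % cy:
        raise ValueError("window size must be a multiple of the cell size")
    return {
        "per_pixel": (wx, wy),              # S1: one work item per pixel
        "per_cell": (wx // cx, wy // cy),   # S5: one work item per cell
        "reduction": (wx // cx, wy),        # S7: one work item per cell row
        "reduction_total": wx * wy // cx,   # S7: Wx * Wy / Cx work items
    }


if __name__ == "__main__":
    # Hypothetical example values: 64x128 window, 8x8 cells.
    for stage, size in work_item_counts(64, 128, 8, 8).items():
        print(stage, size)
```

With these example values, the reduction stage launches 1024 work items compared with 128 in the per-cell stage, which matches the Cy-fold increase in parallelism that the text attributes to step S7.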
In step S7, the first parallel reduction uses a total of Cy/2 work items to count two columns of gradients; the second parallel reduction uses a total of Cy/4 work items to count two columns of gradients; each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.
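The halving pattern of the parallel reduction can be sketched sequentially: after each work item has counted one row of a cell, Cy/2 work items each merge a pair of partial histograms, then Cy/4, and so on. This is a minimal sketch, assuming that Cy is a power of two and that each row's votes are held as a small histogram; the function name `reduce_cell_rows` and the three-bin example are illustrative, and the barrier comment marks where a local memory synchronization would sit in a real kernel.

```python
import numpy as np


def reduce_cell_rows(row_hists):
    """Tree-reduce the per-row histograms of one cell, shape (Cy, bins),
    into a single histogram for the cell."""
    partial = np.array(row_hists, dtype=np.float64)
    active = partial.shape[0]
    if active == 0 or active & (active - 1):
        raise ValueError("this sketch assumes Cy is a non-zero power of two")
    while active > 1:
        half = active // 2
        # 'half' work items each add one partner slot into their own slot.
        partial[:half] += partial[half:active]
        active = half
        # A local-memory barrier would be needed here in an OpenCL kernel.
    return partial[0]


if __name__ == "__main__":
    rows = np.arange(8 * 3).reshape(8, 3)  # Cy = 8 rows, 3 bins (illustrative)
    print(reduce_cell_rows(rows))
```

The loop runs log2(Cy) passes, so a cell with eight rows is reduced in three passes using 4, 2 and 1 active work items.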
In step S3, there is no data interaction between the work items, so neither global nor local synchronization is required.
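Because the per-pixel work is independent, the discretization in step S4 can also be written as a pure function of one pixel's amplitude and phase. The sketch below shows one common way to interpolate a vote between the two nearest orientation bins. It is an assumption-laden illustration rather than the method of the invention: the text does not give the number of bins, the angular range, or the bin-centre convention, so the nine bins over [0, 180) degrees, the centres at 10, 30, ... degrees, and the name `bilinear_vote` are all hypothetical.

```python
import numpy as np


def bilinear_vote(magnitude, phase_deg, n_bins=9):
    """Split one pixel's magnitude between the two orientation bins whose
    centres are nearest to its phase, weighted by angular distance."""
    bin_width = 180.0 / n_bins
    # Position measured in bins, relative to the centre of bin 0.
    pos = phase_deg / bin_width - 0.5
    lower = int(np.floor(pos))
    frac = pos - lower
    votes = np.zeros(n_bins)
    # Orientation is circular, so indices wrap around at both ends.
    votes[lower % n_bins] += magnitude * (1.0 - frac)
    votes[(lower + 1) % n_bins] += magnitude * frac
    return votes


if __name__ == "__main__":
    print(bilinear_vote(2.0, 20.0))  # halfway between the centres of bins 0 and 1
```

A phase exactly on a bin centre gives that bin the whole magnitude, and the votes always sum to the magnitude, so the cell histogram in later steps conserves total gradient energy.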
The high-level OpenCL description of steps S1 to S9 is converted into a hardware description language by AOCL, which generates the specific hardware circuit.
The scheme is implemented on two heterogeneous platforms, CPU+GPU and CPU+FPGA. The CPU acts as the host and performs system scheduling, while the GPU and the FPGA respectively act as the device. The platform and the device are first initialized and configured. The host then starts the device and controls it to perform the remaining operations. After the results are obtained, the final classification calculation is completed on the host. Related experiments show that the scheme meets the real-time and low-energy requirements of text positioning and can further improve the reliability of scene character recognition.
According to the invention, a work item is allocated to each cell in the gradient statistics process rather than to each pixel, which resolves memory access conflicts.
In step S7, each work item counts the pixels of one row, which ensures contiguous memory access by the work items, increases parallelism by a factor of Cy, improves GPU resource utilization, makes full use of the GPU's parallel processing capability, and reduces overhead.
In the gradient statistics process, the GPU avoids local memory access conflicts by using atomic functions, which are costly. OpenCL atomic functions can perform atomic operations on 32-bit signed and unsigned integers in global or local memory: while one work item is accessing a memory location, other work items cannot access it. In step S6, when several pixels in a cell have the same discrete gradient direction, they write to the same memory location in parallel, which creates a race condition and can lose data; atomic functions solve this problem.
In the gradient statistics process, the FPGA avoids local memory access conflicts by alternating accesses across multiple physical memory blocks. Multiple on-chip M9K blocks of the FPGA serve as the local memory, which allows the work items of the same work group to access them alternately and avoids local memory access conflicts. In step S6, the work groups are partitioned so that the pixel voting results are stored in local memory, which lets the FPGA avoid costly atomic floating-point additions.
In step S8, the summation results are cached in the local memory of the hardware, which reduces global memory accesses, shortens the reduction time, and saves computation time.
Through local memory synchronization, the computing task is completed with a single device kernel, reducing resource consumption by more than 50%; compared with a CPU, the energy efficiency ratios of the scheme on the GPU and FPGA platforms are 22.8 and 42.5 respectively, so device energy consumption is effectively reduced.
The voting approach avoids the complex reduction operations of the traditional statistical method, reducing the algorithm's computation time by more than 50%; compared with a CPU, the speedups of the scheme on the GPU and FPGA platforms are 28 and 6.9 respectively, so computation time is effectively reduced.
The computation times of the HOG algorithm on the GPU and FPGA platforms are 25 ms and 102 ms respectively, and the energy consumption is 4 J and 2.14 J respectively, which meets the real-time and low-energy requirements of text positioning and can further improve the reliability of character recognition in scenes (including images and videos).
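The lost-update problem that the atomic functions address in step S6 has a convenient single-threaded analogue in NumPy. This is an analogy only, not OpenCL and not the implementation of the invention: a buffered fancy-index add reads all targets before writing, so repeated indices overwrite one another much like unsynchronized work items voting for the same direction, while `np.add.at` accumulates every occurrence and plays the role of the atomic add. The function names and the three-bin example are illustrative.

```python
import numpy as np


def vote_without_atomics(bins, n_bins):
    """Buffered update: pixels sharing a bin overwrite each other's vote."""
    hist = np.zeros(n_bins, dtype=np.int32)
    hist[bins] += 1  # every repeated index is read once, then written once
    return hist


def vote_with_atomics(bins, n_bins):
    """Unbuffered update: every vote is accumulated, as with an atomic add."""
    hist = np.zeros(n_bins, dtype=np.int32)
    np.add.at(hist, bins, 1)
    return hist


if __name__ == "__main__":
    # Discrete directions of four pixels in one cell; three of them share bin 0.
    directions = np.array([0, 0, 1, 0])
    print(vote_without_atomics(directions, 3))  # votes for bin 0 are lost
    print(vote_with_atomics(directions, 3))     # all four votes are counted
```

Giving each row of a cell its own group of variables, as step S6 describes, reduces how many writers can target the same location at once.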

Claims (6)

1. A text positioning-oriented single-core HOG efficient heterogeneous acceleration method is characterized by comprising the following steps of:
S1, acquiring the pixels of a grayscale image and allocating a work item to each pixel; each connected, uniformly sized region of Cx×Cy pixels forms a cell;
S2, in each work item, performing row convolution and column convolution on the pixels around each pixel using a differential template and its transpose;
S3, calculating the amplitude and the phase of the convolved pixel;
S4, calculating the discrete gradient direction of the pixel from the obtained amplitude and phase through a bilinear interpolation algorithm, storing the discrete gradient direction in the local memory of the hardware, synchronizing the local memory of the hardware, and releasing the work item allocated to the pixel;
S5, allocating a work item to each cell and performing global indexing on the hardware;
S6, creating statistical variables for counting the discrete gradient directions of the pixels, adding the discrete gradient direction of each pixel to the statistical variables directly and in parallel, and using one group of variables for each row of pixels in a cell to compute the voting result of the discrete gradient directions of that row, thereby completing the statistics for each row of pixels and obtaining the count of each discrete gradient direction in each cell;
S7, synchronizing the local memory of the hardware, using each work item to count one row of pixels in the cell based on the voting results, performing parallel reduction to obtain the gradient statistics of all cells, storing the discrete gradient results of the gradient statistics in the local memory of the hardware, and releasing the work items allocated to the cells;
S8, using a work item to normalize the discrete gradients of a cell after gradient statistics, summing the normalized results of each cell to obtain a sum value for each cell, caching the sum values of the cells in the same block in the local memory of the hardware, and performing local synchronization on the hardware to obtain the local oriented gradient of each block; one image comprises a plurality of blocks;
S9, combining the local oriented gradients of all blocks into HOG feature vectors to obtain the feature vector of the image;
S10, loading the above steps onto a heterogeneous platform to realize heterogeneous acceleration.
2. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 1, wherein in step S1 the work item is the smallest unit of work in OpenCL; the cell is the smallest unit into which the image is divided; each cell comprises a connected, uniformly sized region of Cx×Cy pixels; the total pixel size of the window image is Wx×Wy; and the generated two-dimensional index is (Wx, Wy).
3. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 1, wherein the differential template in step S2 is [-1, 0, 1].
4. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 2, wherein the size of the global index in step S5 is (Wx/Cx, Wy/Cy).
5. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 4, wherein the size of the global index during the parallel reduction in step S7 is (Wx/Cx, Wy), and a total of Wx×Wy/Cx work items are used.
6. The text positioning-oriented single-core HOG efficient heterogeneous acceleration method according to claim 5, wherein the first parallel reduction in step S7 uses a total of Cy/2 work items to count two columns of gradients; the second parallel reduction uses a total of Cy/4 work items to count two columns of gradients; and each subsequent reduction uses half as many work items as the previous one, until the gradient statistics are complete.
CN202111671159.2A 2021-12-31 2021-12-31 Text positioning-oriented single-core HOG efficient heterogeneous acceleration method Active CN114359683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111671159.2A CN114359683B (en) 2021-12-31 2021-12-31 Text positioning-oriented single-core HOG efficient heterogeneous acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111671159.2A CN114359683B (en) 2021-12-31 2021-12-31 Text positioning-oriented single-core HOG efficient heterogeneous acceleration method

Publications (2)

Publication Number Publication Date
CN114359683A (en) 2022-04-15
CN114359683B (en) 2023-10-20

Family

ID=81104866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111671159.2A Active CN114359683B (en) 2021-12-31 2021-12-31 Text positioning-oriented single-core HOG efficient heterogeneous acceleration method

Country Status (1)

Country Link
CN (1) CN114359683B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750131A (en) * 2012-06-07 2012-10-24 中国科学院计算机网络信息中心 Graphics processing unit (GPU) oriented bitonic merge sort method
CN104598929A (en) * 2015-02-03 2015-05-06 南京邮电大学 HOG (Histograms of Oriented Gradients) type quick feature extracting method
CN106095583A (en) * 2016-06-20 2016-11-09 国家海洋局第海洋研究所 Principal and subordinate's nuclear coordination calculation and programming framework based on new martial prowess processor
CN106780360A (en) * 2016-11-10 2017-05-31 西安电子科技大学 Quick full variation image de-noising method based on OpenCL standards
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN109767637A (en) * 2019-02-28 2019-05-17 杭州飞步科技有限公司 The method and apparatus of the identification of countdown signal lamp and processing
CN112232372A (en) * 2020-09-18 2021-01-15 南京理工大学 Monocular stereo matching and accelerating method based on OPENCL

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8824742B2 (en) * 2012-06-19 2014-09-02 Xerox Corporation Occupancy detection for managed lane enforcement based on localization and classification of windshield images
US11004205B2 (en) * 2017-04-18 2021-05-11 Texas Instruments Incorporated Hardware accelerator for histogram of oriented gradients computation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Guoning Zhang et al. Efficient Heterogeneous Acceleration Using Single-core Histograms of Oriented Gradients. 2021 International Conference on UK-China Emerging Technologies (UCET). 2022, 209-214. *
刘毅飞 等.形状模型分割中形状对齐GPU加速的OpenCL实现.信息技术.2016,(第03期),28-30+40. *
胡辉 等.基于多处理机平台并行扩维DFT算法的实现研究.遥测遥控.2002,(第02期),44-50. *
贺江.面向场景字符识别关键算法的多平台异构加速研究.中国优秀硕士学位论文全文数据库 信息科技辑.2018,(第(2018)02期),I138-1511. *

Also Published As

Publication number Publication date
CN114359683A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN108765247B (en) Image processing method, device, storage medium and equipment
CN107341127B (en) Convolutional neural network acceleration method based on OpenCL standard
CN108388537B (en) Convolutional neural network acceleration device and method
US9235769B2 (en) Parallel object detection method for heterogeneous multithreaded microarchitectures
CN109885407B (en) Data processing method and device, electronic equipment and storage medium
CN102647588B (en) GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation
WO2019184888A1 (en) Image processing method and apparatus based on convolutional neural network
Wai et al. GPU acceleration of real time Viola-Jones face detection
KR20200043617A (en) Artificial neural network module and scheduling method thereof for highly effective operation processing
Poostchi et al. Efficient GPU implementation of the integral histogram
CN117785480B (en) Processor, reduction calculation method and electronic equipment
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN114359683B (en) Text positioning-oriented single-core HOG efficient heterogeneous acceleration method
CN110796244B (en) Core computing unit processor for artificial intelligence device and accelerated processing method
CN108960203B (en) Vehicle detection method based on FPGA heterogeneous computation
Ibrahim et al. Gaussian Blur through Parallel Computing.
CN110322389A (en) Pond method, apparatus and system, computer readable storage medium
CN114600128A (en) Three-dimensional convolution in a neural network processor
Jinguji et al. Weight sparseness for a feature-map-split-cnn toward low-cost embedded fpgas
Li et al. VNet: a versatile network to train real-time semantic segmentation models on a single GPU
CN111860540B (en) Neural network image feature extraction system based on FPGA
CN111612685B (en) GPU dynamic self-adaptive acceleration method for remote sensing image
TWI798591B (en) Convolutional neural network operation method and device
US20220222509A1 (en) Processing non-power-of-two work unit in neural processor circuit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant