CN115695806A - High-efficiency low-delay HEVC encoder data caching and processing method

Info

Publication number: CN115695806A
Application number: CN202211277789.6A
Authority: CN
Inventors: 陈志峰; 陈业旺; 吴林煌; 钟昌标
Original assignee: Fuzhou Shixin Technology Co ltd
Current assignee: Fuzhou Shixin Technology Co ltd
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2023-02-03

Abstract

The invention provides a high-efficiency, low-delay data caching and processing method for an HEVC encoder. It aims to let an H.265/HEVC hardware encoder reduce encoding delay as far as possible while relieving the growing pressure on storage space and data bandwidth caused by ever higher video quality, and it provides an efficient access scheme for encoding data. The method adapts to the input video resolution and uses it to control data caching and the edge-padding width of the reconstructed image; it reads the reconstructed pixels of the search window through a sliding window, reducing the DDR4 bandwidth requirement and the data read delay; and by suitably controlling the DDR4 burst length of each read and write and storing data in units of CTU blocks, it reduces DDR4 row changes and bank switching, further improving the efficiency of the storage system.

Description

High-efficiency low-delay HEVC encoder data caching and processing method

Technical Field

The invention belongs to the technical field of video coding and decoding, and in particular relates to a high-efficiency, low-delay HEVC encoder data caching and processing method, namely an efficient HEVC video coding data access scheme for heterogeneous FPGAs.

Background

At the same video compression quality, the High Efficiency Video Coding standard H.265/HEVC improves coding efficiency by about 50 percent compared with the previous-generation coding standard H.264, but the complexity of H.265 relative to H.264 also increases ^[1]. To compress and encode ultra-high-definition video, an H.265 hardware encoder must solve an important problem: how to sustain the multiplied data throughput while keeping access delay and storage bandwidth pressure as low as possible. Because the H.265 standard allows a CTU to be dynamically partitioned into blocks ranging from 8 × 8 to 64 × 64 and supports 33 angular intra prediction modes ^[3], the demand for memory access increases further.

To reduce coding delay and memory bandwidth as far as possible, the prior art proposes various solutions, for example:

(1) A novel nonvolatile memory, such as ReRAM or STT-RAM, is combined with DRAM to form a layered memory structure, which saves total memory area and reduces the power consumed by accessing coded data. However, the DRAM and the nonvolatile memory need different memory controllers to manage memory access, which increases hardware cost ^[4]. The scheme also handles the data needed for deblocking filtering and for luminance and chrominance prediction as independent data access events, which increases the number of DDR accesses and prevents memory access efficiency from being improved further.

(2) Video data to be coded are received through a PCIe interface expanded from the FPGA and cached in DDR; the data preprocessed by the FPGA are sent to a DSP chip through an SRIO (Serial RapidIO) interface, and HEVC compression coding is completed on the DSP chip ^[5]. This method has a high hardware cost; connecting several interfaces to different hardware circuits makes the system less stable, increases video coding delay, and leaves the transmission bandwidth poorly utilized.

(3) A three-level DDR-BRAM-LUT physical storage structure is designed, in which DDR buffers frames, BRAM buffers lines, and LUTs buffer reference pixels ^[6]. This relieves the bandwidth pressure on the off-chip memory to some extent, but it does not consider the relation between the storage arrangement of CTUs in memory and DDR access efficiency, so data access efficiency is not high.

Reference documents

[1] J. Vanne, M. Viitanen, T. D. Hamalainen, A. Hallapuro, "Comparative Rate-Distortion-Complexity Analysis of HEVC and AVC Video Codecs," IEEE Trans. on CAS for Video Technology, vol. 22, no. 12, pp. 1885-1898, Dec. 2012.

[2] S. Bouaafia, R. Khemiri, S. Messaoud, et al., "Deep CNN Co-design for HEVC CU Partition Prediction on FPGA-SoC," Neural Process Lett (2022). https://doi.org/10.1007/s11063-022-10765-1

[3] S. Bouaafia, R. Khemiri, S. Messaoud, et al., "Deep CNN Co-design for HEVC CU Partition Prediction on FPGA-SoC," Neural Process Lett (2022). https://doi.org/10.1007/s11063-022-10765-1

[4] D. S. Silveira, A. Mativi, M. S. Porto and S. Bampi, "Energy Savings with Non-Volatile Memory System for High Definition Video Encoders," 2019 17th IEEE International New Circuits and Systems Conference (NEWCAS), 2019, pp. 1-4, doi: 10.1109/NEWCAS44328.2019.8961238.

[5] Cheiloyo. Video processing system design based on h.265 code [ D ]. University of capitalization, 2017.

[6] Li Shen, chaiZhiLei, yanwei, summer jade, zhao Jian bin H.265 intraframe mode decision parallel computing method research and implementation [ J ] small-sized microcomputer system, 2018,39 (11): 2523-2527.

Disclosure of Invention

To fill the gaps and remedy the defects of the prior art, so that an H.265/HEVC hardware encoder can reduce encoding delay as far as possible while relieving the growing pressure on storage space and data bandwidth caused by ever higher video quality, the invention provides a high-efficiency, low-delay HEVC encoder data caching and processing method, which is an efficient access scheme for encoding data. The method adapts to the input video resolution and uses it to control data caching and the edge-padding width of the reconstructed image; it reads the reconstructed pixels of the search window through a sliding window, reducing the DDR4 bandwidth requirement and the data read delay; and by suitably controlling the DDR4 burst length of each read and write and storing data in units of 64 × 64 CTU blocks, it reduces DDR4 row changes and bank switching, further improving the efficiency of the storage system.

The invention specifically adopts the following technical scheme:

a high-efficiency low-latency HEVC encoder data caching and processing method, characterized by comprising the following steps:

step S1: receiving video data, converting the RGB888 data format into YUV444, and then down-sampling to YUV420 format;

step S2: calculating the resolution of the current input video and outputting the result for use by the subsequent steps;

step S3: storing the luminance Y component and the chrominance U and V components of the YUV420 video data separately, in 64 × 64 CTB blocks, into a given address space of DDR4;

step S4: when the input video data reaches 64 lines of effective pixels, sending a start signal to the H.265_Encoder (HEVC encoder) core and reading the 64 × 64 original data out to the H.265_Encoder in a pipelined manner;

step S5: storing the reconstructed data output by the encoder into a given address space of DDR4 in 64 × 64 CTU blocks through DMA transfer;

step S6: during inter-frame coding, reading the reconstructed image data in units of CTUs (Coding Tree Units), performing edge padding on the image according to the current resolution, and then reading the search-window pixels in a sliding manner;

step S7: storing the coded bitstream into DDR4 in real time, generating an interrupt each time 1 KB of data has been stored, notifying the CPU (Central Processing Unit) to read the bitstream data, packaging the data into RTP (Real-time Transport Protocol) format, and sending them to the target IP address over UDP (User Datagram Protocol).

Further, step S1 specifically includes the following steps:

step S11: receiving video data and outputting the video data in a native video signal format;

step S12: after the color space conversion module RGB_to_YUV420 converts the RGB888 data into YUV444, the data are down-sampled in turn to YUV422 and YUV420; the color conversion formula is as follows:

wherein Y is the luminance component; U and V are the chrominance components; R is the red channel; G is the green channel; B is the blue channel. Because the color conversion is implemented in the FPGA, its floating-point operations are converted to fixed-point operations.
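The fixed-point conversion and the two-stage down-sampling of step S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: it assumes BT.601 full-range coefficients scaled by 256 (8 fractional bits), which may differ from the formula used in the invention, and the names `clamp8`, `rgb_to_yuv_fixed` and `downsample_420` are hypothetical.

```python
def clamp8(value):
    """Saturate an integer to the 8-bit range 0..255."""
    return max(0, min(255, value))


def rgb_to_yuv_fixed(r, g, b):
    """Convert one RGB888 pixel to YUV444 using integer arithmetic only.

    Assumed coefficients: BT.601 full range, multiplied by 256 and rounded.
    The +128 inside each sum rounds to nearest before the shift by 8.
    """
    y = (77 * r + 150 * g + 29 * b + 128) >> 8
    u = ((-43 * r - 85 * g + 128 * b + 128) >> 8) + 128
    v = ((128 * r - 107 * g - 21 * b + 128) >> 8) + 128
    return clamp8(y), clamp8(u), clamp8(v)


def downsample_420(plane):
    """Reduce a chroma plane from 4:4:4 to 4:2:0 in two stages.

    Stage 1 (4:4:4 -> 4:2:2) averages horizontal pairs; stage 2
    (4:2:2 -> 4:2:0) averages vertical pairs. Width and height must be even.
    """
    half_width = [
        [(row[i] + row[i + 1] + 1) >> 1 for i in range(0, len(row), 2)]
        for row in plane
    ]
    return [
        [(top + bottom + 1) >> 1
         for top, bottom in zip(half_width[j], half_width[j + 1])]
        for j in range(0, len(half_width), 2)
    ]
```

Multiplications by constants followed by a right shift map directly onto FPGA multipliers and wiring, which is why fixed-point coefficients are preferred over floating-point ones in hardware.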

Furthermore, in step S2, resolution calculation is started according to the line and field synchronization signals of the received video, and the width and height of the effective pixels of the video are counted to obtain the resolution value; when the video signal is lost for two consecutive frames, the last measured result continues to be output, which avoids misjudging a loss of video signal caused by jitter at the input interface.
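The hold behaviour of step S2 can be modelled in software as below. This is a sketch under assumptions: each frame is represented as a list of per-line effective-pixel counts (empty or `None` when the signal is lost), the class name `ResolutionDetector` is hypothetical, and what happens after more than two lost frames is not specified in the text, so the sketch simply keeps holding the last value and exposes the loss counter.

```python
class ResolutionDetector:
    """Software model of a resolution counter that holds its last result."""

    def __init__(self):
        self.resolution = None   # last measured (width, height)
        self.lost_frames = 0     # consecutive frames without a valid signal

    def on_frame(self, active_lines):
        """Update with one frame and return the resolution to output.

        active_lines: per-line counts of effective pixels for the frame,
        or None / empty when no valid sync was seen (assumed model).
        """
        if not active_lines:
            # Signal lost: keep outputting the previous measurement.
            self.lost_frames += 1
            return self.resolution
        self.lost_frames = 0
        self.resolution = (max(active_lines), len(active_lines))
        return self.resolution
```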

Further, step S3 specifically includes the following steps:

step S31: storing the luminance component Y in units of 64 × 64;

step S32: storing the two 32 × 32 chrominance components U and V together as one data unit;

step S33: writing the data into DDR4 through the AXI_HP port by DMA, under address-bit control.
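One possible address mapping for the block-based layout of steps S31 to S33 is sketched below. The details are assumptions made for illustration: 8-bit samples, CTBs laid out in raster order, raster order inside each block, the U block placed before the V block within a chrominance unit, and offsets counted from the start of each component's region. The function names are hypothetical.

```python
CTB = 64          # luminance block edge in pixels
HALF = CTB // 2   # chrominance block edge in 4:2:0


def luma_offset(x, y, width):
    """Byte offset of luminance pixel (x, y) inside the Y region."""
    ctbs_per_row = (width + CTB - 1) // CTB
    ctb_index = (y // CTB) * ctbs_per_row + (x // CTB)
    return ctb_index * CTB * CTB + (y % CTB) * CTB + (x % CTB)


def chroma_offset(cx, cy, width, plane):
    """Byte offset of chrominance sample (cx, cy) inside the UV region.

    plane is 0 for U and 1 for V; each data unit holds one 32 x 32 U block
    followed by one 32 x 32 V block (assumed order).
    """
    ctbs_per_row = (width + CTB - 1) // CTB
    unit_index = (cy // HALF) * ctbs_per_row + (cx // HALF)
    unit_base = unit_index * 2 * HALF * HALF
    return unit_base + plane * HALF * HALF + (cy % HALF) * HALF + (cx % HALF)
```

Under these assumptions every 64 × 64 luminance block occupies 4096 contiguous bytes and every U/V pair occupies 2048 contiguous bytes, so a whole block can be fetched with sequential bursts instead of one strided access per picture line.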

Further, in step S4, two 64 × 64 ping-pong RAMs are instantiated on the FPGA to store the CU and PU data that are read from DDR4 and belong to the same CTU, so that a small amount of logic resources is used to relieve the bandwidth pressure on the off-chip memory and reduce the coding delay of the encoder.
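The ping-pong arrangement can be illustrated with a small software model. It is only a behavioural sketch: the class and method names are hypothetical, and in hardware the two banks are accessed concurrently rather than in sequence.

```python
class PingPongBuffer:
    """Two banks: one is filled from DDR4 while the other feeds the encoder."""

    def __init__(self):
        self.banks = [None, None]
        self.write_sel = 0  # index of the bank currently being filled

    def load(self, ctu_block):
        """Producer side: store one 64 x 64 block read from DDR4."""
        self.banks[self.write_sel] = ctu_block

    def swap(self):
        """Exchange the roles of the two banks once a block is complete."""
        self.write_sel ^= 1

    def fetch(self):
        """Consumer side: read the bank that is not being written."""
        return self.banks[self.write_sel ^ 1]
```

Because the encoder always reads a bank that was completely filled earlier, the DDR4 read of the next CTU overlaps with the encoding of the current one.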

Further, step S6 specifically includes the following steps:

step S61: receiving coordinate values of a current prediction block in an image frame;

step S62: initiating an AXI4 burst read signal, and reading corresponding reconstructed image data from DDR4 according to block coordinates;

step S63: storing the read data in an 80-row RAM;

step S64: performing edge padding on the image;

step S65: outputting the 64 padded lines of search-window data;

step S66: continuing to read data to fill the 16 free rows.
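Steps S63 to S66 can be read as a circular row buffer of 80 rows, of which 64 form the current search window and 16 are free for prefetching. The sketch below models that reading; the step size of 16 rows and the class name `SlidingRowBuffer` are assumptions, since the text does not state how far the window advances each time.

```python
class SlidingRowBuffer:
    """Circular buffer of picture rows with a 64-row output window."""

    def __init__(self, capacity=80, window=64):
        self.capacity = capacity
        self.window = window
        self.rows = [None] * capacity
        self.top = 0      # picture row index at the top of the window
        self.loaded = 0   # number of picture rows written so far

    def free_rows(self):
        """Rows that can be overwritten without touching the window."""
        return self.capacity - (self.loaded - self.top)

    def write_row(self, data):
        if self.free_rows() == 0:
            raise BufferError("no free row; slide the window first")
        self.rows[self.loaded % self.capacity] = data
        self.loaded += 1

    def read_window(self):
        if self.loaded - self.top < self.window:
            raise BufferError("window not fully loaded yet")
        return [self.rows[(self.top + i) % self.capacity]
                for i in range(self.window)]

    def slide(self, step=16):
        """Advance the window, releasing `step` rows for new data."""
        self.top += step
```

With this arrangement only 16 new rows are fetched from DDR4 per window position instead of all 64, which is where the bandwidth saving of the sliding read comes from.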

Compared with the prior art, the invention and its preferred schemes adapt to the input video resolution and use it to control data caching and the edge-padding width of the reconstructed image; they read the reconstructed pixels of the search window through a sliding window, reducing the DDR4 bandwidth requirement and the data read delay; and by suitably controlling the DDR4 burst length of each read and write and storing data in units of 64 × 64 CTU blocks, they reduce DDR4 row changes and bank switching, further improving the efficiency of the storage system.

Drawings

Fig. 1 is a schematic diagram illustrating DDR4 memory space partitioning according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a Y component storage arrangement in the embodiment of the present invention.

FIG. 3 is a diagram illustrating the AXI_HP interface to the memory according to an embodiment of the present invention.

Fig. 4 is a filling diagram of a reconstructed image in the embodiment of the invention.

FIG. 5 is a diagram illustrating search-window data movement according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of Y-component sliding reading in an embodiment of the present invention.

FIG. 7 is a timing diagram of an encoder reading a data block according to an embodiment of the present invention.

FIG. 8 is a diagram illustrating the effect of encoding and decoding in the embodiment of the present invention.

Fig. 9 is a schematic workflow diagram in an embodiment of the present invention.

Detailed Description

In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:

it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

As shown in Figs. 1-9, the coding system of this embodiment may use a heterogeneous FPGA platform comprising PL (Programmable Logic) and PS (Processing System), accelerating the coding operations through the highly parallel PL hardware and controlling the coding parameters of the encoder through the PS ^[2]. In this way the encoder both speeds up the coding operations and provides a control interface to the PS-side CPU without malfunctioning.

This yields a hardware architecture that effectively improves the data access efficiency of the H.265 encoder. The number of DDR accesses is reduced, and hardware cost falls because multiple modules share the same memory. The DDR4 hardware structure and the H.265 coding characteristics are considered together: data are accessed in units of 64 × 64 CTU blocks, the storage locations of data in memory are arranged sensibly, and a scheme is provided that adapts to the input resolution and reads and pads the reconstructed coded image in a sliding manner, so that the architecture better meets the high-efficiency and low-delay requirements of an H.265 encoder.

The designed high-efficiency low-delay HEVC encoder data caching and processing method specifically comprises the following steps:

step S1: video data are received through the HDMI_RX module, and the RGB_to_YUV420 module converts the RGB888 data format into YUV444 and then down-samples it to YUV420 format;

step S2: the Resolution_Detection module calculates the resolution of the current input video and outputs the result for use by the subsequent modules;

step S3: the Original_Data_Write module stores the luminance Y component and the chrominance UV components of the YUV420 video data separately, in 64 × 64 CTB blocks, into a specific DDR4 address space by DMA transfer;

step S4: when the input video data reaches 64 lines of effective pixels, a start signal is sent to the H.265_Encoder core, and the Original_Data_read module reads the 64 × 64 original data out to the H.265_Encoder in a pipelined manner;

step S5: the Recovery_Data_Write module stores the reconstructed data output by the encoder into a specific DDR4 address space in 64 × 64 CTU blocks by DMA transfer;

step S6: during inter-frame coding, the Recovery_Data_Read module reads the reconstructed image data in units of CTUs, performs edge padding on the image according to the current resolution, and then reads the search-window pixels in a sliding manner;

step S7: the Bit_Stream_Write module stores the coded bitstream into DDR4 in real time, generates an interrupt each time 1 KB of data has been stored, notifies the CPU to read the bitstream data, packs them into RTP format, and sends them to the target IP address over UDP.
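The CPU-side packing of step S7 can be sketched as follows. The 12-byte fixed RTP header layout follows RFC 3550; everything else is an assumption made for illustration: the dynamic payload type 96, the function names, and the plain 1 KB chunking, which ignores the NAL-unit-aware packetization that RFC 7798 defines for HEVC payloads.

```python
import struct

RTP_VERSION = 2
CHUNK_BYTES = 1024  # one interrupt per 1 KB of stored bitstream


def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=False):
    """Prepend a 12-byte RTP fixed header (no padding, extension or CSRC)."""
    first = RTP_VERSION << 6
    second = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def packetize(bitstream, timestamp, ssrc, first_seq=0):
    """Split a bitstream into 1 KB chunks and wrap each one in RTP."""
    packets = []
    for i, start in enumerate(range(0, len(bitstream), CHUNK_BYTES)):
        chunk = bitstream[start:start + CHUNK_BYTES]
        packets.append(rtp_packet(chunk, first_seq + i, timestamp, ssrc))
    return packets
```

Each resulting packet could then be handed to a UDP socket, for example with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, (target_ip, port))`, where `target_ip` and `port` stand for the destination configured on the PS side.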

As a preferred scheme, step S1 specifically includes the following steps:

step S11: the HDMI_RX module receives video data and outputs them in a native video signal format;

wherein Y is the luminance component; U and V are the chrominance components; R is the red channel; G is the green channel; B is the blue channel. Because the color conversion is implemented in the FPGA, its floating-point operations must be converted to fixed-point operations.

Step S2 specifically includes the following features:

(1) The Resolution_Detection module starts resolution calculation according to the line and field synchronization signals of the received video, and counts the width and height of the effective pixels of the video to obtain the resolution value;

(2) When the video signal is lost for two consecutive frames, the last measured result continues to be output, which avoids misjudging a loss of video signal caused by jitter at the input interface.

The data storage scheme of the embodiment has the following characteristics:

each DDR4 read or write uses the 64-bit AXI4 data bus with a burst length of 32, which improves the efficiency of the AXI4 bus;

every module of the system reads and writes DDR4 with contiguous memory accesses, which improves access efficiency;

each module reads and writes data in units of one contiguous 2 KB memory page, so that a whole page can be read at a time, which reduces the memory access delay caused by memory page jumps;
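The relation between the stated bus width, burst length, page size and block sizes can be checked with a few lines of arithmetic. The only assumption added here is 8-bit samples; the helper names are hypothetical.

```python
BUS_BYTES = 64 // 8           # 64-bit AXI4 data bus
BURST_LEN = 32                # beats per burst
PAGE_BYTES = 2 * 1024         # contiguous memory page used as access unit
BURST_BYTES = BUS_BYTES * BURST_LEN


def bursts_for(num_bytes):
    """Number of full-length bursts needed to move num_bytes."""
    return -(-num_bytes // BURST_BYTES)  # ceiling division


def pages_for(num_bytes):
    """Number of 2 KB pages occupied by num_bytes."""
    return -(-num_bytes // PAGE_BYTES)


luma_ctb_bytes = 64 * 64            # one luminance block, 8-bit samples
chroma_unit_bytes = 2 * 32 * 32     # one U block plus one V block
```

Under these figures one burst moves 256 bytes, a 2 KB page is exactly 8 bursts, a 64 × 64 luminance block fills exactly 2 pages, and a U/V pair fills exactly 1 page, so block accesses never straddle a page boundary.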

As a preferred scheme, the Original_Data_Write module in step S3 specifically includes the following step:

step S33: the data are written into DDR4 through the AXI_HP port by DMA, under address-bit control.

The Original_Data_read module in step S4 has the following characteristic:

two 64 × 64 ping-pong RAMs are instantiated on the FPGA to store the CU (Coding Unit) and PU (Prediction Unit) data that are read from DDR4 and belong to the same CTU (Coding Tree Unit), so that a small amount of logic resources is used to relieve the bandwidth pressure on the off-chip memory and reduce the coding delay of the encoder;

The Recovery_Data_Read module in step S6 performs the following steps:

step S62: initiating an AXI4 burst read and reading the corresponding reconstructed image data from DDR4 according to the block coordinates;

step S63: storing the read data in an 80-row RAM;

step S64: performing edge padding on the image;

step S65: outputting the 64 padded lines of search-window data;

step S66: continuing to read data to fill the 16 free rows.
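The edge padding of step S64 can be illustrated by border replication, which is the conventional way to extend a reference picture beyond its edges. The text does not state the padding method or width, so both the replication and the `pad` parameter are assumptions of this sketch, and `pad_edges` is a hypothetical name.

```python
def pad_edges(plane, pad):
    """Extend a 2-D plane by `pad` samples on every side.

    Border samples are replicated outwards (assumed padding method).
    """
    padded_rows = [
        [row[0]] * pad + list(row) + [row[-1]] * pad
        for row in plane
    ]
    top = [list(padded_rows[0]) for _ in range(pad)]
    bottom = [list(padded_rows[-1]) for _ in range(pad)]
    return top + padded_rows + bottom
```

Since the padded width and height depend on the picture size, the padding logic needs the value produced by the resolution detection of step S2 to know where the picture edges lie.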

With this design, a 64 × 64 data block can be sent to the encoder in full in only 64 clock cycles; the captured timing waveform of the storage system reading data out to the encoder is shown in Fig. 7. A high-definition camera was used to shoot a test video played on a notebook computer; the camera output 1080P video at 30 frames per second to the encoder for encoding, the bitstream was decoded and played by in-house software on a PC, and the encoding and decoding result is shown in Fig. 8.

The program design related to the algorithm of this embodiment can be stored in coded form on a computer-readable storage medium and implemented as a computer program; the basic parameter information required for the calculation is input through computer hardware, and the calculation result is output.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowchart illustrations, and combinations of flows in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.

The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification or equivalent change made to the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.

The present invention is not limited to the above preferred embodiments, and other efficient and low-latency HEVC encoder data buffering and processing methods can be derived by anyone based on the teaching of the present invention.

Claims

1. A high-efficiency low-latency HEVC encoder data caching and processing method is characterized by comprising the following steps:

step S3: storing the luminance Y component and the chrominance U and V components of the YUV420 video data separately, in 64 × 64 CTB blocks, into a given address space of DDR4;

step S4: when the input video data reaches 64 lines of effective pixels, sending a start signal to the H.265_Encoder (HEVC encoder) core and reading the 64 × 64 original data out to the H.265_Encoder in a pipelined manner;

step S6: during inter-frame coding, reading the reconstructed image data in units of CTUs (Coding Tree Units), performing edge padding on the image according to the current resolution, and then reading the search-window pixels in a sliding manner;

step S7: storing the coded bitstream into DDR4 in real time, generating an interrupt each time 1 KB of data has been stored, notifying the CPU to read the bitstream data, packaging the data into RTP format, and sending them to the target IP address over UDP.

2. The HEVC encoder data caching and processing method with high efficiency and low latency according to claim 1, wherein: the step S1 specifically includes the following steps:

wherein Y is the luminance component; U and V are the chrominance components; R is the red channel; G is the green channel; B is the blue channel; the color conversion is implemented in the FPGA, with its floating-point operations converted to fixed-point operations.

3. The HEVC encoder data caching and processing method with high efficiency and low latency according to claim 1, wherein: in step S2, resolution calculation is started according to the line and field synchronization signals of the received video, and the width and height of the effective pixels of the video are counted to obtain the resolution value; when the video signal is lost for two consecutive frames, the last measured result continues to be output, which avoids misjudging a loss of video signal caused by jitter at the input interface.

4. The HEVC encoder data caching and processing method with high efficiency and low latency according to claim 1, wherein: the step S3 specifically includes the following steps:

step S33: writing the data into DDR4 through the AXI_HP port by DMA, under address-bit control.

5. The HEVC encoder data caching and processing method with high efficiency and low latency according to claim 1, wherein: in step S4, two 64 × 64 ping-pong RAMs are instantiated on the FPGA to store the CU and PU data that are read from DDR4 and belong to the same CTU, so that a small amount of logic resources is used to relieve the bandwidth pressure on the off-chip memory and reduce the coding delay of the encoder.

6. The HEVC encoder data caching and processing method with high efficiency and low latency according to claim 1, wherein: step S6 specifically includes the following steps:

step S63: storing the read data in an 80-row RAM;

step S64: performing edge padding on the image;

step S65: outputting the 64 padded lines of search-window data;

step S66: continuing to read data to fill the 16 free rows.