Background
Frame Buffer Compression (FBC) is an effective scheme for reducing bandwidth in current high-bandwidth application scenarios. In particular, now that video image processing must reach 4K at 60 frames per second or more, saving the bandwidth of double data rate synchronous dynamic random access memory (DDR) has become one of the important factors to be considered by every video processing service unit.
This technology is mainly applied in image and video processing units such as Image Signal Processing (ISP), the video codec, the Video Processing Subsystem (VPSS), the Video Display unit (VDP), and the like. With frame buffer compression, each frame of pixel data is compressed before being placed in the DDR, and when the data is subsequently processed or used, the image pixel data is recovered through a decompression unit. Because the data is compressed, the data bandwidth required to fetch it for decompression is greatly reduced. The requirements of each of the above scenarios are briefly introduced below.
(1) The ISP is a unit mainly used to process the output signal of the front-end image sensor. It commonly processes data in the raw image domain (RAW data). The processing order is raster scan within a frame, and the processing bit depth is chip dependent. The data is stored in a packed format (pack mode).
(2) The video codec mainly fetches reference frame data from the DDR for inter-frame prediction. This application scenario is complex: each video protocol allows a large number of reference frames, the range of predicted motion vectors (MVs) is arbitrary, and the MVs of the luminance and chrominance components are not equal. This imposes a critical requirement, random access, which in turn requires cache resources to be integrated internally. Two types of cache are mainly involved: a compressed header cache and a compressed data cache.
(3) Video processing subsystem (VPSS) and video display (VDP)
The main application scenario is raster-scan processing within a frame; a small amount of random access is also used in VPSS.
At present, the available algorithms are broadly similar, with minor differences, and all perform compression and decompression on the basis of blocks. Such a block is called a compression unit. The compression unit is internally divided into sub-blocks, each of which is the actual unit of compression processing. The size of the compression unit is determined by factors such as the algorithm principle, the application scenario, the hardware implementation area, and DDR access efficiency. After compression, a compressed data block consists of a compressed data block header (compression head) and a compressed data block body (compression body); during decoding, the body is decoded according to the header. The compression head mainly contains information such as the size of each sub-block, a compression flag, a compression mode, and a Direct Current (DC) value. The compression body mainly stores residual data between pixels.
Random access scenarios are particularly sensitive to the size of the data block header, which directly affects the size of the header cache (head cache). Specifically, a larger header greatly increases the number of cache ways. If performance of 4K at 60 frames per second or more must be guaranteed, the cache size and the hardware timing become major bottlenecks.
A specific example is as follows:
For the data block header cache (head cache), improving the hit rate as much as possible for a given cache size helps guarantee the performance of each processing unit.
Consider the data block header size in relation to the DDR burst. The header is small, about 16 to 64 bits, and the cache line is generally sized in units of image lines. Assume that the algorithm compresses 64 × 2 pixel blocks (64 pixels wide and 2 lines high). For the cache miss case, the maximum number of image lines of head data that the bus can request at a time is analyzed as follows, taking the most optimistic head size of 2 bytes (16 bits), an image width of 4K, and reference frames for video encoding and decoding at 4K resolution.
The head of each 64 × 2 pixel block is 16 bits, and 2 image lines contain 4K/64 = 64 compressed blocks, corresponding to 64 heads, i.e. 64 × 16 bits = 1024 bits. For a bus width of 128 bits, the required number of burst beats is therefore 1024/128 = 8. Since the maximum AXI burst length of the system is 16, the cache line holds at most the head data corresponding to 4 image lines.
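The arithmetic above can be checked with a short sketch. The function and parameter names are illustrative and not part of the original text, and "4K" is taken to mean 4096 pixels, which is what 4K/64 = 64 implies.

```python
def head_lines_per_cache_line(image_width=4096, block_w=64, block_h=2,
                              head_bits=16, bus_bits=128, max_burst_beats=16):
    """Return (burst beats per block row, image lines covered by one max burst)."""
    blocks_per_row = image_width // block_w          # 4096 / 64 = 64 blocks
    head_bits_per_row = blocks_per_row * head_bits   # 64 * 16 = 1024 bits
    beats_per_row = head_bits_per_row // bus_bits    # 1024 / 128 = 8 beats
    block_rows = max_burst_beats // beats_per_row    # 16 / 8 = 2 block rows
    return beats_per_row, block_rows * block_h       # 2 rows * 2 lines = 4 lines


print(head_lines_per_cache_line())  # (8, 4)
```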
For a 4K image with 2 reference frames per frame, at least one row of 64-size CTBs (64 image lines) must be buffered. The total is (64 luminance lines + 32 chrominance lines) × 2 reference frames = 192 lines, and 192 lines / 4 lines per cache line = 48 ways. The number of cache ways is therefore very large, which makes the cache itself large; and the more ways there are, the worse the timing of the hardware implementation. In unfavorable conditions, it can be difficult to close timing at the higher clock frequencies needed to meet the performance requirements of the processing units.
That is, the size of the head directly constrains the cache resources and the timing of the hardware implementation, and ultimately makes the area and performance of the hardware implementation of the algorithm a bottleneck.
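As a rough check of the way count, the following sketch reproduces the calculation. The names are hypothetical, and the chrominance height is assumed to be half the luminance height, as the 64 + 32 figure in the text suggests.

```python
def required_cache_ways(ctb_size=64, ref_frames=2, lines_per_cache_line=4):
    """Number of cache lines (ways) needed to buffer one CTB row of reference data."""
    luma_lines = ctb_size            # 64 luminance lines
    chroma_lines = ctb_size // 2     # 32 chrominance lines (assumed half height)
    total_lines = (luma_lines + chroma_lines) * ref_frames   # 192 lines
    return -(-total_lines // lines_per_cache_line)           # ceiling division


print(required_cache_ways())  # 48
```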
There are two main hardware solutions available.
Scheme 1: the compressed data block header (compression head) and the compressed data block body (compression body) are stored separately; the body is stored according to the size of the compressed data and is byte aligned. The flow for storing the compressed data of one component (Y or UV) in YUV420 semi-planar format is shown in FIG. 1.
Each compression unit block is compressed by an internal compression processing unit, and each compressed component (Y or UV) is placed in two areas of the DDR, namely a head area and a body area, where each head occupies a fixed N bits and the size of the body data is not fixed.
During decompression, the header content is obtained first, then the body data with the corresponding size is obtained, and finally the decompression is carried out.
The scheme has the following defects:
The head of each block has a large number of bits, generally at least 16 bits and typically 32 to 64 bits. As described above, the size of the head directly constrains the cache resources and the timing of the hardware implementation, and ultimately makes the area and performance of the hardware implementation a bottleneck.
The block data length is not fixed, so during compression the bus writes to the DDR using strobes (the signal in the AXI write data protocol indicating which byte locations of the write data are valid), and strobe-related read problems also arise during decompression.
The advantage of this scheme is that it saves both DDR memory space and bandwidth. However, it cannot satisfy the random access applications of the VPU. Because the data has a strobe alignment problem, the hardware complexity of the data cache is greatly increased: every data cache read must take the strobe flags into account.
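To illustrate why byte-aligned bodies complicate the data path, the sketch below computes a per-beat byte-lane mask for a write of arbitrary length on a 16-byte bus, in the spirit of AXI write strobes. It is a simplified model under stated assumptions (hypothetical function name, one mask bit per byte lane, bit i for lane i), not a description of any particular bus implementation.

```python
def write_strobes(start_addr, num_bytes, bus_bytes=16):
    """Return one strobe mask per bus beat; bit i set means byte lane i is valid."""
    masks = []
    addr = start_addr
    end = start_addr + num_bytes
    while addr < end:
        beat_base = addr - (addr % bus_bytes)           # start of the aligned beat
        lo = addr - beat_base                           # first valid lane in this beat
        hi = min(end, beat_base + bus_bytes) - beat_base  # one past the last valid lane
        masks.append(((1 << (hi - lo)) - 1) << lo)
        addr = beat_base + bus_bytes                    # advance to the next beat
    return masks


# An aligned 16-byte write needs one full beat; an unaligned 20-byte write
# starting at byte 5 needs two partial beats with different masks.
print([hex(m) for m in write_strobes(0, 16)])   # ['0xffff']
print([hex(m) for m in write_strobes(5, 20)])   # ['0xffe0', '0x1ff']
```

With bodies aligned to the bus width, every beat carries a full mask, which is the simplification Scheme 2 and the invention rely on.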
Scheme 2:
The main change is to the size of the data block body (body size): the DDR space reserved for each body is the space it would occupy without compression, so the memory footprint is unchanged and only the bandwidth is reduced.
This scheme has no strobe alignment problem and greatly reduces the complexity of the hardware implementation. However, the size of the head still directly constrains the cache resources and the timing of the hardware implementation, and ultimately makes the area and performance of the hardware implementation a bottleneck.
Disclosure of Invention
In view of the above, the present invention provides a method and a device for the compressed storage of video data. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
The invention provides a compression storage method of video data, which comprises the following steps:
compressing the video data to obtain a data block head (compression head) and a data block body (compression body);
storing the data block head and the data block body into a preset storage space;
setting an indication identifier for indicating the compressed size of the corresponding data block, and storing the indication identifier in a memory in raster-scan order with a fixed number of bits;
and determining a basic storage unit of the compressed data block based on the bus bandwidth, and storing the compressed data block according to the basic storage unit and the raster-scan order.
Storing the data block head and the data block body in a predetermined storage space specifically means that the data block head and the data block body are merged together, with the data block head serving as a prefix of the data block body (body prefix) and thus forming part of the body.
If the size of the compressed data block is K×L, where K = 2^k and k = 1, 2, 3, …, the number of bits of the indication identifier is:
[formula not reproduced in the source text]
where K is the number of columns, L is the number of rows, N is the pixel bit depth, and 2^m is the bus bit width.
The compression storage method further comprises the following steps:
when data is stored, the size (body size) of the compressed data block is obtained from a DDR header field (head domain), and the body of the compressed data block is obtained according to the size so as to be placed at a corresponding position.
The basic unit of storage is determined based on the bus bandwidth: if the bus bandwidth is 2^m bits, then the basic unit of storage is
[formula not reproduced in the source text]
units of the bus bit width, where K is the number of columns, L is the number of rows, and N is the pixel bit depth.
The method further comprises the step of finding and retrieving the stored data blocks:
reading the corresponding address data in a cache according to a preset horizontal coordinate offset and vertical offset, and then extracting the bits at the corresponding offset from the data at each address to obtain the indication identifier (HEAD_NEW) of the compression unit;
and acquiring the data block body with the corresponding length through a bus based on the specification of the data block body, and finding the corresponding data block body according to the coordinate of the indication identifier.
The present invention also provides a storage device for storing video data, comprising:
a storage unit for storing video compression data; the video compression data comprises a data block header and a data block body;
a storage unit for storing an indication identifier (HEAD_NEW) indicating the size of the corresponding video compressed data unit, the number of bits of the indication identifier being determined according to the size of the compressed data unit;
the video compression data is stored according to a storage basic unit and an indication mark indication, wherein the storage basic unit is determined based on the bus bandwidth.
The storage device stores compressed data units, and a data block head and a data block body of the compressed data units are combined together, wherein the data block head takes a prefix (body prefix) of the data block body as a part of a new data block body.
If the size of the compressed data block is K×L, where K = 2^k and k = 1, 2, 3, …, the number of bits of the indication identifier is:
[formula not reproduced in the source text]
where K is the number of columns, L is the number of rows, N is the pixel bit depth, and 2^m is the bus bit width.
The basic unit of the compressed data unit is determined according to the bus bandwidth: if the bus bandwidth is 2^m bits, then the basic unit of storage is
[formula not reproduced in the source text]
units of the bus bit width, where K is the number of columns, L is the number of rows, and N is the pixel bit depth.
In summary, in the technical solution provided by the present invention, the data block head and the data block body are stored in a predetermined storage space, and HEAD_NEW is set to indicate the effective size of the corresponding data block body, which is aligned to the bus bandwidth. Therefore, under the same performance requirement, the required cache space (size) is 1/4 and the number of cache ways is 1/4 of those of the prior art. The large reduction in the number of ways directly reduces the logic depth of the hardware implementation, which benefits back-end timing closure, allows operation at a higher frequency, and significantly improves performance.
For the purposes of the foregoing and related ends, the one or more embodiments include the features hereinafter fully described and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative aspects and are indicative of but a few of the various ways in which the principles of the various embodiments may be employed. Other benefits and novel features will become apparent from the following detailed description when considered in conjunction with the drawings and the disclosed embodiments are intended to include all such aspects and their equivalents.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. The scope of embodiments of the invention encompasses the full ambit of the claims, as well as all available equivalents of the claims. Embodiments of the invention may be referred to herein, individually or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed.
In order to achieve the object of the present invention, the present invention provides a method for compressing and storing video data, referring to fig. 2, including the following steps:
S01, compressing the video data to obtain a data block head (compression head) and a data block body (compression body);
S02, storing the data block head and the data block body in a preset storage space;
S03, setting an indication identifier (HEAD_NEW) to indicate the compressed size of the corresponding data block, and storing the indication identifier in the memory in raster-scan order with a fixed number of bits, wherein the number of bits of the indication identifier is determined according to the size of the compressed data block;
S04, determining the basic storage unit of the compressed data block based on the bus bandwidth, and storing the compressed data block according to the basic storage unit and the raster-scan order.
In step S02, storing the data block head and the data block body in a predetermined storage space specifically means that the data block head (compression head) is merged with the data block body (compression body), with the data block head serving as a prefix of the data block body (body prefix) and thus forming part of the body.
If the size of the compressed data block is K×L, where K = 2^k and k = 1, 2, 3, …, the number of bits of the indication identifier is:
[formula not reproduced in the source text]
where K is the number of columns, L is the number of rows, N is the pixel bit depth, and 2^m is the bus bit width.
The compression storage method specifically comprises the following steps:
when data is stored, the size (body size) of the compressed data block is obtained from the DDR head domain, and the compressed data block body is then obtained according to that size and placed at the corresponding position.
The basic unit of storage is determined based on the bus bandwidth: if the bus bandwidth is 2^m bits, then the basic unit of storage is
[formula not reproduced in the source text]
bytes, where 2^k is the number of columns, L is the number of rows, N is the pixel bit depth, and [ ] indicates rounding.
Further, the method comprises the steps of searching and extracting the stored data block:
reading the corresponding address data in a cache according to a preset horizontal coordinate offset and vertical offset, and then extracting the bits at the corresponding offset from the data at each address to obtain the indication identifier (HEAD_NEW) of the compression unit;
and fetching a data block body of the corresponding length through the bus based on the size (body size) of the data block body, and locating the corresponding data block body according to the coordinates of the indication identifier.
The present invention also provides a storage device for storing video data, comprising:
a storage unit for storing video compression data; the video compression data comprises a data block header (compression head) and a data block body (compression body);
a storage unit for storing an indication identifier (HEAD_NEW), the indication identifier indicating the compressed size of the corresponding video compressed data unit and being stored in the memory in raster-scan order with a fixed number of bits, the number of bits of the indication identifier being determined according to the size of the compressed data unit.
The data block header (compression head) and the data block body (compression body) are merged together, with the data block header serving as a prefix of the data block body (body prefix) and thus forming part of the new data block body.
The video compression data is stored according to a basic storage unit and the indication given by the indication identifier, wherein the basic storage unit is determined based on the bus bandwidth. The address offset between the data block bodies of successive compression units is the size of the original (uncompressed) compression unit, and the bodies are likewise stored in raster-scan order.
If the size of the compressed data block is K×L, where K = 2^k, the number of bits of the indication identifier is:
[formula not reproduced in the source text]
where K is the number of columns, L is the number of rows, N is the pixel bit depth, and 2^m is the bus bit width.
The basic unit of the compressed data unit is determined according to the bus bandwidth: if the bus bandwidth is 2^m bits, then the basic unit of storage is
[formula not reproduced in the source text]
bytes, where K is the number of columns, L is the number of rows, and N is the pixel bit depth.
In order to make the principle, features and advantages of the present invention more apparent, the following detailed description of the embodiments of the present invention is provided.
Referring to fig. 3, the compressed data block header (compression head) output by a conventional compression unit is merged into the compressed data block body; that is, it becomes a body prefix that forms part of the body, and the original body becomes the body core. After merging, the increase in body bandwidth is almost negligible, since bus accesses are aligned to at least 16 bytes. For the hardware implementation, the indication identifier (HEAD_NEW) information that must be known in advance is therefore very small: only the size of the compressed data block body is needed. The body size of the compression unit is obtained from the DDR header domain, the data block body is then fetched according to that size, and finally decoding is carried out. For this size, if the bus bandwidth is 128 bits, 16-byte alignment is used; that is, the basic unit is 16 bytes rather than a single byte. The contents of the indication identifier (HEAD_NEW) are set as follows:
For YUV420 semi-planar, luminance and chrominance are stored and compressed separately. Take the 64x2 compression unit with 16-byte alignment as an example. A 64x2 unit occupies at most 64x2 bytes = 8x16 bytes (the uncompressed case), i.e. at most 8 basic units, so 3 bits suffice to represent its size.
For hardware implementation convenience, one reserved bit may be added, so that the newly set indication identifier HEAD_NEW is 4 bits. Correspondingly, following the earlier analysis, for a 4K frame the cache line holds the HEAD_NEW data of 16 image lines (i.e. 8 HEAD lines, because each compression unit is 2 lines high).
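The merging, alignment and HEAD_NEW sizing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, pixels are 8 bits deep (as the 64x2-byte figure implies), padding is done with zero bytes, and the size field is assumed to store the unit count minus one so that 1 to 8 units fit in 3 bits.

```python
def pack_body(body_prefix: bytes, body_core: bytes, unit_bytes: int = 16):
    """Merge head (as body prefix) and body core, padded to whole basic units."""
    merged = body_prefix + body_core
    units = max(1, -(-len(merged) // unit_bytes))      # ceiling division
    padded = merged.ljust(units * unit_bytes, b"\x00")  # zero padding is an assumption
    return padded, units


def head_new_bits(block_w=64, block_h=2, pixel_bits=8, bus_bits=128,
                  reserved_bits=1):
    """Bits needed for HEAD_NEW: size field plus reserved bits."""
    max_units = (block_w * block_h * pixel_bits) // bus_bits  # 1024 / 128 = 8
    size_bits = (max_units - 1).bit_length()                  # 3 bits for 0..7
    return size_bits + reserved_bits


padded, units = pack_body(b"\x01\x02", bytes(30))
print(len(padded), units)   # 32 2
print(head_new_bits())      # 4
```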
In a specific application such as a video codec, the scheme by which the prediction branch fetches and partially decompresses reference frame data is analyzed as follows with reference to fig. 4:
(1) Assume that the reference prediction data block (also called a prediction unit, PU) currently required for decoding is a reference block of size 16x4, that the currently requested line (cur_line) of the reference block is line 0, and that its coordinates are P(x, y) = P(A, B). The compression unit (64x2 block) containing this line can then be located; its coordinates are U(x, y) = U(A >> 2, B >> 1);
(2) Whether the cache hits is judged from the U(x, y) coordinates. Assuming a cache miss, the HEAD_NEW data is first fetched and written into the HEAD_NEW cache. Since the size of HEAD_NEW is fixed at 4 bits, the HEAD_NEW data for one cache line (16 image lines, i.e. 8 HEAD lines) can be fetched according to the coordinates, shown in fig. 4 as HEAD_NEW lines 0 to 7; this is the cache line corresponding to the current line of the current prediction unit (PU).
(3) The corresponding HEAD_NEW is then read from the cache according to the coordinates of the current line (cur_line);
For steps (2) and (3), the correspondence between image lines (or HEAD_NEW lines) and cache addresses is shown in fig. 5.
Cache write process: in fig. 5, the number of bits occupied by one head line of a 4K image is 4K/64 × 4 bits = 256 bits = 2 × 128 bits, i.e. 2 bus burst beats. One head line therefore corresponds to 2 cache addresses, labeled 1/2 in the figure, and the entire 16-beat bus burst corresponds to addr0 to addr15.
Cache read process: each address holds 1/2 of a HEAD line, and the horizontal coordinate offset and vertical offset of U(x, y) within the HEAD lines are known, so after the corresponding cache address is read, the HEAD_NEW of the compression unit is obtained by extracting the bits at the corresponding offset from the data at that address.
(4) After HEAD_NEW is obtained, a data block body of the corresponding length is fetched over the bus according to the body size. Because the memory reserved for each data block body is the same as without compression, i.e. each 64x2 unit still occupies 64x2 bytes after compression, the corresponding data block body can be located from the coordinates of the indication identifier HEAD_NEW. In this way, bandwidth is saved significantly.
(5) Finally, once the body has been fetched, the compressed body prefix (body_prefix) is decoded first, and the image pixel data is then restored from the body_prefix content and the compressed body core (body_core) content.
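Steps (3) and (4) can be sketched as address arithmetic. The following is an illustrative model only: all names are hypothetical, HEAD_NEW entries are assumed to be packed least-significant-bit first within each 128-bit cache word, the low 3 bits of HEAD_NEW are assumed to store the unit count minus one, pixels are assumed to be 8 bits deep, and bodies are assumed to sit in a linear raster-order region starting at a caller-supplied base address.

```python
BUS_BITS = 128                      # bus width, one cache address per beat
HEAD_BITS = 4                       # HEAD_NEW size including the reserved bit
BLOCK_W, BLOCK_H = 64, 2            # compression unit size in pixels
BLOCKS_PER_ROW = 4096 // BLOCK_W    # 64 compression units per 4K row
ADDRS_PER_HEAD_LINE = BLOCKS_PER_ROW * HEAD_BITS // BUS_BITS  # 256 / 128 = 2
HEAD_LINES_PER_CACHE_LINE = 8       # 16 image lines
SLOT_BYTES = BLOCK_W * BLOCK_H      # uncompressed slot: 128 bytes at 8 bits/pixel
UNIT_BYTES = BUS_BITS // 8          # 16-byte basic unit


def head_new_location(ux, uy):
    """Cache address (0..15) and bit offset of HEAD_NEW for unit U(ux, uy)."""
    head_line = uy % HEAD_LINES_PER_CACHE_LINE
    bit_pos = ux * HEAD_BITS
    addr = head_line * ADDRS_PER_HEAD_LINE + bit_pos // BUS_BITS
    return addr, bit_pos % BUS_BITS


def extract_head_new(cache_word, bit_offset):
    """Cut the 4-bit HEAD_NEW out of a 128-bit cache word (LSB-first assumed)."""
    return (cache_word >> bit_offset) & ((1 << HEAD_BITS) - 1)


def body_request(base_addr, ux, uy, head_new):
    """Start address and byte length of the bus read for the unit's body."""
    units = (head_new & 0x7) + 1                        # assumed encoding: units - 1
    start = base_addr + (uy * BLOCKS_PER_ROW + ux) * SLOT_BYTES
    return start, units * UNIT_BYTES


print(head_new_location(33, 3))          # (7, 4)
print(body_request(0x1000, 1, 0, 2))     # (4224, 48)
```

Because every body starts at a fixed slot, locating it needs no per-block pointer; only the read length depends on HEAD_NEW.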
For each of the above schemes, taking a 64 × 2 compression unit as an example, the comparison is as follows under the same performance requirements:
| Item | Prior Art | The invention |
| Header (HEAD_NEW) bit number | At least 16 bits | 4 bits |
| Required cache space size | At least 4N | N |
| Number of cache ways required | Over 48, typical value 64 | 12 or more, typical value 16 |
Conclusion: with the technical scheme provided by the invention, under the same performance requirement, the required cache space size is 1/4 and the number of cache ways is 1/4 of those of the prior art. The large reduction in the number of ways directly reduces the logic depth of the hardware implementation, benefits back-end timing closure, allows operation at a higher frequency, and significantly improves performance.
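The 1/4 ratio for the number of ways can be reproduced from the figures used earlier in the description. In this sketch the function name is hypothetical, and the 192 buffered lines, 128-bit bus and 16-beat maximum burst are taken from the 4K example above rather than from any general rule.

```python
def cache_requirements(head_bits, image_width=4096, block_w=64, block_h=2,
                       bus_bits=128, max_burst_beats=16, buffered_lines=192):
    """Return (image lines per cache line, cache ways) for a given head size."""
    bits_per_head_line = (image_width // block_w) * head_bits
    head_lines = (max_burst_beats * bus_bits) // bits_per_head_line
    image_lines = head_lines * block_h
    ways = -(-buffered_lines // image_lines)   # ceiling division
    return image_lines, ways


print(cache_requirements(16))  # (4, 48)   prior-art 16-bit head
print(cache_requirements(4))   # (16, 12)  4-bit HEAD_NEW
```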
Those of skill in the art will understand that the various exemplary method steps and apparatus elements described in connection with the embodiments disclosed herein can be implemented as electronic hardware, software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative steps and elements have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative elements described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method described in connection with the embodiments disclosed above may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a subscriber station. In the alternative, the processor and the storage medium may reside as discrete components in a subscriber station.
The disclosed embodiments are provided to enable those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope or spirit of the invention. The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.