CN110191344B

CN110191344B - Intelligent coding method for light field image

Info

Publication number: CN110191344B
Application number: CN201910492720.7A
Authority: CN
Inventors: 雷建军; 张凯明; 刘晓寰; 何景逸; 石雅南; 张宗千
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2021-11-02
Anticipated expiration: 2039-06-06
Also published as: CN110191344A

Abstract

The invention discloses an intelligent coding method for light field images, comprising the following steps: according to the parameters of the light field camera, vertical and horizontal coordinate transformations are applied to the macro-pixels, and adaptive pixel interpolation fills the blank areas that lack pixel values, completing the light field image transformation; exploiting the spatial similarity between adjacent macro-pixels in the transformed light field image, downsampling is performed in units of macro-pixels; the downsampled low-resolution coding tree units are encoded and transmitted; at the decoder, upsampling based on macro-pixel replication restores the coding tree units to their original resolution; to further improve the quality of the reconstructed coding tree units, feature extraction obtains the local features and parallax features of the macro-pixels, which are then fused together (coarse-fine fusion); and image reconstruction maps the fused features to RGB space, achieving refined macro-pixel-level upsampling reconstruction.

Description

Intelligent coding method for light field image

Technical Field

The invention relates to the field of video image coding, in particular to an intelligent light field image coding method.

Background

In recent years, with the rapid development of true three-dimensional display technology and ever-rising user expectations, light field cameras have attracted great attention thanks to their unique ability to capture light fields. Unlike conventional cameras, light field cameras based on microlens arrays record both the intensity and the propagation direction of light in three-dimensional space in a single exposure. After post-processing and calibration, captured light field images can be used for fatigue-free true 3D display, 3D television, saliency detection, object recognition, and other applications, and can improve imaging quality. However, light field images contain far more information than 2D images, which poses a significant challenge for data storage and transmission. Research on compression algorithms for light field images therefore has important theoretical and practical value.

Owing to the unique imaging characteristics of light field cameras, captured light field images also differ from ordinary 2D images. A light field image consists of a large number of approximately hexagonal macro-pixels; each macro-pixel originates from a cone-shaped light beam and corresponds to the projection through one microlens, and adjacent macro-pixels exhibit properties such as displacement similarity. However, conventional image/video coding standards do not fully exploit these characteristics of light field data, and how to use them to improve the compression efficiency of light field images has become an active research topic.

Li et al. proposed a displacement intra prediction coding method that removes spatial redundancy within a light field image by introducing a motion-compensated prediction tool into intra prediction, using the already reconstructed region of the current frame as the reference. Jin et al. proposed a reversible light field image transformation based on the hexagonal structure of macro-pixels, which aligns the boundary of each hexagonal macro-pixel with block-shaped coding units and increases the spatial correlation between adjacent blocks by reducing the texture complexity inside each block, thereby reducing spatial redundancy. Aggoun et al. proposed a 3D discrete wavelet transform method that applies wavelet decomposition within individual viewpoint images and across viewpoints, improving rate-distortion performance by removing both spatial and inter-view redundancy. Liu et al. arranged the sub-aperture images into a pseudo-video sequence according to the spatial positions of the viewpoints and improved compression efficiency by studying the coding order, prediction structure, and bit allocation of the sequence.

In the process of implementing the invention, the inventor finds that at least the following disadvantages and shortcomings exist in the prior art:

prior-art methods often design image downsampling at the pixel level without considering the connection between the data structure of macro-pixels and upsampling reconstruction; and existing methods generally perform upsampling reconstruction based on deep, fine-grained image features, without investigating coarse-fine fused features.

Disclosure of Invention

The invention provides an intelligent coding method of a light field image, which designs a macro-pixel-level down-sampling intra-frame intelligent coding method by analyzing the data structure characteristics of the light field image, reduces intra-frame information redundancy, improves the coding efficiency of the light field image, and is described in detail as follows:

A light field image intelligent encoding method, the method comprising the steps of:

according to the parameters of the light field camera, applying vertical and horizontal coordinate transformations to the macro-pixels, and applying adaptive pixel interpolation to blank areas lacking pixel values, to complete the light field image transformation;

according to the spatial similarity between adjacent macro-pixels in the transformed light field image, performing downsampling in units of macro-pixels;

encoding and transmitting the downsampled low-resolution coding tree units; at the decoder, restoring the coding tree units to their original resolution by upsampling based on macro-pixel replication;

to further improve the quality of the reconstructed coding tree units, obtaining the local features and parallax features of the macro-pixels by feature extraction; fusing the local features and the parallax features to achieve coarse-fine fusion; and mapping the fused features to RGB space by image reconstruction, achieving refined macro-pixel-level upsampling reconstruction.

The downsampling with the macro-pixel as the unit specifically comprises the following steps:

$$LR^{2\times 2} = \mathrm{Center}\left(HR^{4\times 4}\right)$$

where LR denotes the downsampled coding tree unit, HR denotes the original coding tree unit, 2 × 2 and 4 × 4 denote the numbers of macro-pixels in each coding tree unit, and the Center function extracts the macro-pixels at the center of HR.

The upsampling based on macro-pixel replication specifically includes:

$$HR'^{4\times 4} = \mathrm{Upscale}\left(LR'^{2\times 2}\right)$$

where HR′ denotes the reconstructed high-resolution coding tree unit, LR′ denotes the reconstructed low-resolution coding tree unit, and the Upscale function upsamples the reconstructed low-resolution coding tree unit to high resolution.

Further, the method comprises: comparing the rate-distortion costs of all coding modes through a rate-distortion optimization process.

The technical scheme provided by the invention has the beneficial effects that:

1. By analyzing and exploiting the data structure of light field images, the invention provides an intelligent light field image coding method that saves bit rate while keeping objective quality unchanged;

2. The invention reduces the amount of data at the CTU (coding tree unit) level with a downsampling mechanism while preserving complete macro-pixel structure information, and uses a convolutional neural network to improve the quality of the reconstructed image, improving overall coding performance.

Drawings

FIG. 1 is a flow chart of the light field image intelligent encoding method;

FIG. 2 shows the encoding RD (rate-distortion) curves.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

Example 1

To overcome the shortcomings of the prior art, this embodiment of the invention provides an intra-frame intelligent coding method for light field images based on macro-pixel-level downsampling. By analyzing and exploiting the data structure of the light field image, the method first transforms the light field image, then downsamples each CTU at the macro-pixel level; at the decoder it upsamples using both a macro-pixel replication method and a convolutional neural network method, and finally selects the best coding mode through rate-distortion optimization. Referring to FIG. 1, the method comprises the following steps:

One, light field image transformation

To align the macro-pixels vertically with the 16 × 16 coding units, once the light field camera parameters have been obtained, the macro-pixels first undergo a vertical coordinate transformation:

$$L_t(x,\, y + lk) = L(x, y), \quad x, y \in \mathbb{Z},\; x \in [1, W],\; y \in [1, H]$$

where L_t is the image generated by the vertical coordinate transformation, L is the original image, (x, y) are the pixel coordinates in the image, W and H are the width and height of the image before transformation, n is the coding unit size (16 in this embodiment), v is the vertical distance between the centers of two adjacent macro-pixel rows, y₀ is the vertical distance from the center of the first macro-pixel row to the top boundary, Z is the set of integers, l is the vertical offset required to separate two adjacent macro-pixel rows, and k is the macro-pixel row index.
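The transformation formula can be illustrated with a minimal NumPy sketch. The extracted text defines n, v, and y₀ but the expressions linking them to l and k did not survive, so the sketch makes two labeled assumptions: l = n - v (so the center of macro-pixel row k moves from y₀ + kv to y₀ + kn, one macro-pixel row per n-pixel row of coding units), and each pixel row belongs to the macro-pixel row whose center is nearest. It ignores the hexagonal stagger of real lenslet images and uses 0-based rows; the function name is hypothetical.

```python
import numpy as np

def vertical_transform(img, n, v, y0):
    """Hypothetical sketch of L_t(x, y + l*k) = L(x, y).

    Assumptions (not from the patent): l = n - v, and pixel row y belongs to
    macro-pixel row k = floor((y - y0 + v/2) / v), clamped at 0.
    Returns the shifted image and a mask of rows that received real pixels;
    the remaining rows are the blank areas filled later by interpolation.
    """
    h = img.shape[0]
    l = n - v                                   # assumed extra offset per row
    ys = np.arange(h)
    k = np.maximum(np.floor((ys - y0 + v / 2) / v).astype(int), 0)
    new_y = ys + l * k                          # target row of every source row
    out_h = int(new_y[-1]) + 1
    out = np.zeros((out_h,) + img.shape[1:], dtype=img.dtype)
    valid = np.zeros((out_h, img.shape[1]), dtype=bool)
    out[new_y] = img                            # L_t(x, y + l*k) = L(x, y)
    valid[new_y] = True
    return out, valid
```

For example, with n = 4, v = 3, y₀ = 1 an 8-row image grows to 10 rows, and output rows 3 and 7 are left blank between the shifted macro-pixel rows.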

Building on the vertical coordinate transformation, the macro-pixels are then aligned horizontally according to the light field camera parameters by aligning their calibrated centers.

After the macro-pixels are shifted, each 16 × 16 coding unit contains one complete macro-pixel plus a few blank boundary pixels that lack actual pixel values. To maximize the continuity of pixel values between adjacent macro-pixels, adaptive pixel interpolation is applied to the blank areas:

$$L_t(x, y) = W_v\left(L_t(x,\, y + v_d) + v_d \times V_{\mathrm{detector}}\right) + W_h\left(L_t(x + h_r,\, y) + h_r \times H_{\mathrm{detector}}\right)$$

where W_v and W_h are the weights of the vertical and horizontal predictions, v_d is the shortest distance from the pixel to be interpolated to the valid pixel below it, V_detector is the vertical gradient of pixel values in the interpolated region, h_r is the shortest distance from the pixel to be interpolated to the valid pixel on its right, and H_detector is the horizontal gradient of pixel values in the interpolated region.
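The interpolation formula for a single blank pixel can be sketched as follows. The text does not say how the gradients or the adaptive weights are computed, so the sketch takes the gradients as inputs and, when no weights are given, uses a hypothetical inverse-distance rule (the nearer valid pixel gets the larger weight, weights summing to 1). Array indexing is img[y, x]; the function names are illustrative.

```python
import numpy as np

def nearest_valid_offset(valid_line):
    """Distance (>= 1) to the first valid sample in valid_line, or None."""
    idx = np.flatnonzero(valid_line)
    return int(idx[0]) + 1 if idx.size else None

def interpolate_blank_pixel(img, valid, x, y, v_grad, h_grad, w_v=None, w_h=None):
    """Sketch of L_t(x,y) = W_v(L_t(x,y+v_d) + v_d*V) + W_h(L_t(x+h_r,y) + h_r*H)."""
    v_d = nearest_valid_offset(valid[y + 1:, x])   # valid pixel below, same column
    h_r = nearest_valid_offset(valid[y, x + 1:])   # valid pixel to the right, same row
    if v_d is None or h_r is None:
        raise ValueError("needs a valid pixel both below and to the right")
    if w_v is None or w_h is None:
        # Hypothetical default weights: inverse-distance, summing to 1.
        w_v = h_r / (v_d + h_r)
        w_h = v_d / (v_d + h_r)
    pred_v = img[y + v_d, x] + v_d * v_grad        # vertical prediction
    pred_h = img[y, x + h_r] + h_r * h_grad        # horizontal prediction
    return w_v * pred_v + w_h * pred_h
```

With valid pixels 10 (two rows below) and 20 (two columns to the right) and zero gradients, the blank pixel is predicted as 15.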

Two, macro-pixel based down-sampling

After the light field image transformation, the macro-pixels are arranged orthogonally without overlapping one another and are aligned with the coding units. On this basis, and given the very high spatial similarity between adjacent macro-pixels in a light field image, this embodiment proposes a macro-pixel-based downsampling method that reduces the image resolution in units of macro-pixels.

Because a single fixed sampling rate is not suitable for all image regions, this embodiment performs downsampling at the CTU level. To downsample without destroying the macro-pixel structure, each CTU is processed as follows:

$$LR^{2\times 2} = \mathrm{Center}\left(HR^{4\times 4}\right)$$

where LR denotes the downsampled CTU, HR denotes the original CTU, 2 × 2 and 4 × 4 denote the numbers of macro-pixels in each CTU, and the Center function extracts the macro-pixels at the center of HR.

Specifically, after the image transformation each original CTU contains 4 × 4 orthogonally arranged macro-pixels; the 2 × 2 (i.e., 4) macro-pixels at its center are extracted as the downsampled data, which replaces the original high-resolution CTU in the subsequent encoding process.
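The Center function amounts to a crop at macro-pixel granularity. Since each 16 × 16 coding unit holds one macro-pixel and a CTU holds 4 × 4 of them, the sketch assumes a 64 × 64 CTU (mp = 16 pixels per macro-pixel side) and keeps macro-pixel rows and columns 1 and 2. The function name and the mp parameter are illustrative.

```python
import numpy as np

def center_downsample(ctu, mp=16):
    """Sketch of LR^{2x2} = Center(HR^{4x4}).

    Assumes the transformed CTU holds 4x4 macro-pixels of mp x mp pixels each
    (mp = 16 gives a 64x64 CTU). Works on (H, W) or (H, W, C) arrays.
    """
    if ctu.shape[0] != 4 * mp or ctu.shape[1] != 4 * mp:
        raise ValueError("expected a CTU of 4x4 macro-pixels")
    # Keep macro-pixel rows/columns 1 and 2, i.e. pixels mp .. 3*mp - 1.
    return ctu[mp:3 * mp, mp:3 * mp].copy()
```

Because whole macro-pixels are kept, the downsampled CTU is 32 × 32 and still consists of complete, orthogonally arranged macro-pixels.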

Three, downsampled CTU coding

In this embodiment, the downsampled low-resolution CTU replaces the original high-resolution CTU in conventional coding and transmission. The coding pipeline mainly consists of predictive coding, transform coding, quantization, and entropy coding, which convert pixel values into a binary bitstream and thereby compress the image data. Specifically:

first, all 35 conventional intra prediction modes are traversed for the low-resolution CTU and the best one is selected; then, the prediction residual is converted into transform coefficients by the discrete cosine transform, completing transform coding; quantization then maps the transform coefficients to a limited set of discrete levels, reducing their dynamic range; finally, entropy coding converts the discrete symbols into a binary bitstream for transmission.
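The transform and quantization steps can be illustrated with a floating-point sketch: an orthonormal 2D DCT-II followed by uniform scalar quantization with the HEVC-style step size Qstep = 2^((QP - 4) / 6). This is a simplification, not the codec's actual path: HEVC/SCM use integer transform approximations and rate-distortion-optimized quantization, and intra mode search and entropy coding are omitted here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def transform_quantize(residual, qp):
    """2D DCT of a square residual block, then uniform quantization."""
    c = dct_matrix(residual.shape[0])
    coeff = c @ residual @ c.T                  # separable 2D DCT-II
    qstep = 2.0 ** ((qp - 4) / 6.0)             # step doubles every 6 QP
    levels = np.round(coeff / qstep).astype(int)
    return levels, qstep

def dequantize_inverse(levels, qstep):
    """Decoder side: rescale the levels and apply the inverse 2D DCT."""
    c = dct_matrix(levels.shape[0])
    return c.T @ (levels * qstep) @ c
```

A flat residual block concentrates all its energy in the DC coefficient, which is why smooth, well-predicted regions cost few bits after quantization.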

Four, upsampling based on macro-pixel replication

For each decoded low-resolution reconstructed CTU, this embodiment proposes an upsampling method based on macro-pixel replication that restores the low-resolution CTU to the original CTU size, as follows:

$$HR'^{4\times 4} = \mathrm{Upscale}\left(LR'^{2\times 2}\right)$$

where HR′ denotes the reconstructed high-resolution CTU, LR′ denotes the reconstructed low-resolution CTU, and the Upscale function upsamples the reconstructed low-resolution CTU to high resolution.

Specifically, working in units of macro-pixels, each decoded macro-pixel is copied to the 3 nearest macro-pixel positions that need to be filled. Taking the first macro-pixel of the low-resolution CTU as an example, it is copied to the three macro-pixel positions to its left, above it, and to its upper left, completing the fill.
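Macro-pixel replication can be sketched as follows, under the same assumption as the downsampling sketch (4 × 4 macro-pixels of mp × mp pixels per CTU). Each decoded macro-pixel sits at one of the four central positions of the 4 × 4 grid and is copied to the other three positions of its own 2 × 2 quadrant; for the first macro-pixel these are its left, top, and upper-left neighbours. The function name is illustrative.

```python
import numpy as np

def replicate_upsample(lr, mp=16):
    """Sketch of HR'^{4x4} = Upscale(LR'^{2x2}) by macro-pixel replication."""
    out = np.empty((4 * mp, 4 * mp) + lr.shape[2:], dtype=lr.dtype)
    for i in range(2):
        for j in range(2):
            # Decoded macro-pixel (i, j) of the low-resolution CTU.
            block = lr[i * mp:(i + 1) * mp, j * mp:(j + 1) * mp]
            # It lands at HR position (i + 1, j + 1) and fills its 2x2 quadrant.
            for r in (2 * i, 2 * i + 1):
                for c in (2 * j, 2 * j + 1):
                    out[r * mp:(r + 1) * mp, c * mp:(c + 1) * mp] = block
    return out
```

Seen this way, the method is nearest-neighbour upscaling by a factor of 2 with whole macro-pixels as the unit, so no macro-pixel is split or blended; the CNN branch described next refines this result.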

Five, macro-pixel level up-sampling based on convolutional neural network

To obtain a more accurate reconstructed CTU, this embodiment designs a 12-layer upsampling convolutional neural network consisting of three parts: feature extraction, coarse-fine fusion, and image reconstruction. During upsampling, feature extraction first obtains the local features and parallax features of the macro-pixels; the local and parallax features are then fused together (coarse-fine fusion); finally, image reconstruction maps the fused features to RGB space, achieving refined macro-pixel-level upsampling reconstruction.

1) Feature extraction

Feature extraction consists of one convolutional layer and two residual blocks (RBs). The convolutional layer first maps the luminance component of the image to feature space; two cascaded RBs then complete the feature extraction. Each RB contains two residual units, and each residual unit consists of two cascaded convolutional layers.

Each RB is expressed as:

$$H^{u} = F\left(H^{u-1}, W^{u}\right) + H^{u-1}$$

where u ∈ {1, 2} is the index of the residual block, H^{u-1}, H^{u}, and W^{u} are the input, output, and weights of the u-th residual block, respectively, and F is the residual mapping function.
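A residual block of this form can be sketched in plain NumPy. The text gives neither kernel sizes nor activations, so the sketch assumes 3 × 3 zero-padded convolutions with a ReLU between the two convolutions of each unit, and applies the identity skip only at the block level exactly as the formula is written (whether each residual unit also has its own skip is not stated). If only convolutional layers are counted, 1 + 2 × 2 × 2 + 2 + 1 = 12, which matches the stated 12-layer depth. All names here are hypothetical.

```python
import numpy as np

def conv3x3_same(x, w, b):
    """3x3 cross-correlation with zero padding (as in CNN frameworks).

    x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,) -> (C_out, H, W)
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd)) + b[:, None, None]
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oi,ihw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def conv_pair(x, params):
    """One residual unit's branch: conv -> ReLU (assumed) -> conv."""
    (w1, b1), (w2, b2) = params
    return conv3x3_same(np.maximum(conv3x3_same(x, w1, b1), 0.0), w2, b2)

def residual_block(h_prev, unit_params):
    """H^u = F(H^{u-1}, W^u) + H^{u-1}, with F = two cascaded units."""
    f = h_prev
    for params in unit_params:
        f = conv_pair(f, params)
    return f + h_prev
```

With all-zero weights the block reduces to the identity, which is the usual motivation for residual learning: the block only has to learn a correction to its input.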

2) Coarse-fine fusion

As the network deepens, more macro-pixel parallax features are extracted, but macro-pixel local features are gradually lost. To retain both the local and parallax features of the macro-pixels, this embodiment designs a coarse-fine feature fusion module composed of a Concat layer and two convolutional layers. The Concat layer is expressed as:

$$Y_c = \mathrm{Concat}\left(H_1, H_2\right)$$

where H₁ and H₂ are the outputs of the two RBs, Concat denotes concatenation along the channel dimension, and Y_c is the concatenated feature. The concatenated features are fed to the two subsequent convolutional layers to complete feature fusion, i.e., the coarse-fine fusion of the local and parallax features.

3) Image reconstruction

Finally, a convolutional layer maps the coarse-fine fused features to RGB space to complete image reconstruction. The residual map Res of the image is obtained by:

$$Res = W * R_f + B$$

where Res is the residual map, W and B are the weights and bias of the convolutional layer, * denotes convolution, and R_f is the fused feature. The reconstructed CTU image Rec is then obtained by adding the network input X to the residual map:

$$Rec = Res + X$$
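The tail of the network (Concat, the two fusion convolutions, the reconstruction convolution, and the global skip Rec = Res + X) can be sketched in a few lines. To stay short, the sketch assumes 1 × 1 convolutions throughout and a ReLU between the two fusion layers; the real kernel sizes and activations are not given. It also assumes a single output channel so that Res matches the shape of the input X. Function names are hypothetical.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: x (C_in, H, W), w (C_out, C_in), b (C_out,)."""
    return np.einsum("oi,ihw->ohw", w, x) + b[:, None, None]

def fuse_and_reconstruct(h1, h2, x, fuse_params, w_rec, b_rec):
    """Sketch of Y_c = Concat(H1, H2), fusion, Res = W * R_f + B, Rec = Res + X."""
    y_c = np.concatenate([h1, h2], axis=0)      # concat along channels
    (w1, b1), (w2, b2) = fuse_params
    # Two fusion convolutions (1x1 and ReLU are assumptions).
    r_f = conv1x1(np.maximum(conv1x1(y_c, w1, b1), 0.0), w2, b2)
    res = conv1x1(r_f, w_rec, b_rec)            # residual map Res
    return x + res                              # global residual: Rec = Res + X
```

Predicting only the residual on top of the replicated CTU X means the network learns the correction to macro-pixel replication rather than the full image.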

Six, rate-distortion optimization (RDO)

For each CTU, in order to decide the best coding mode, the embodiment of the present invention performs rate distortion cost comparison on all coding modes through an RDO (rate distortion optimization) process.

First, for the downsampled CTU, the rate-distortion costs of upsampling based on macro-pixel replication and of CNN-based macro-pixel-level upsampling are calculated; the method with the lower cost is selected as the optimal upsampling method and signaled with a flag bit. Then, the rate-distortion cost of the proposed downsampling coding method is compared with that of the conventional intra prediction coding mode to decide whether the current CTU uses the proposed downsampling method, and another flag bit records this decision for the decoder.
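The two-stage decision can be sketched with the standard Lagrangian cost J = D + λR. How distortion, rate, and λ are measured inside the codec is not specified here, so the sketch simply takes them as inputs; the dictionary keys standing in for the two flag bits are hypothetical, and ties go to the simpler option (replication, then conventional intra).

```python
from dataclasses import dataclass

@dataclass
class ModeResult:
    distortion: float  # e.g. SSE between original and reconstructed CTU
    bits: float        # bits spent coding the CTU in this mode

def rd_cost(m: ModeResult, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return m.distortion + lam * m.bits

def decide_ctu(copy_up: ModeResult, cnn_up: ModeResult, intra: ModeResult, lam: float) -> dict:
    # Stage 1: pick the cheaper upsampling method for the downsampled CTU.
    use_cnn = rd_cost(cnn_up, lam) < rd_cost(copy_up, lam)
    best_down = cnn_up if use_cnn else copy_up
    # Stage 2: downsampled coding vs. conventional intra coding.
    use_down = rd_cost(best_down, lam) < rd_cost(intra, lam)
    # The upsampling flag is only meaningful when the CTU was downsampled.
    return {"downsample_flag": int(use_down),
            "upsample_flag": int(use_cnn) if use_down else None}
```

For example, with λ = 1, replication at (D = 100, R = 50), CNN at (80, 50), and intra at (200, 40), the costs are 150, 130, and 240, so the CTU is downsampled and upsampled with the CNN.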

Example 2

The feasibility of the scheme of Example 1 is verified below with reference to FIG. 2:

Experiments were performed on the HEVC-SCC reference software HTM16.18-SCM8.7. The test used the Ankylosaurus_&_Diplodocus_1 image from the JPEG Pleno dataset, with a resolution of 7728 × 5368, under the all-intra configuration with quantization parameters {32, 37, 42, 47}. To make the results comparable, the embodiment of the invention was compared with the IBC (intra block copy) algorithm in SCM under the same experimental conditions.

As shown in FIG. 2, the experimental results show that, compared with the IBC algorithm, the method improves quality by 0.97 dB at the same bit rate.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of a preferred embodiment, and that the numbering of the above embodiments is for description only and does not indicate their relative merits.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A light field image intelligent coding method is characterized by comprising the following steps:

2. The light-field image intelligent encoding method according to claim 1, wherein the downsampling in units of macro-pixels is specifically:

$$LR^{2\times 2} = \mathrm{Center}\left(HR^{4\times 4}\right)$$

3. The light field image intelligent encoding method according to claim 1, wherein the upsampling based on macro-pixel replication is specifically:

$$HR'^{4\times 4} = \mathrm{Upscale}\left(LR'^{2\times 2}\right)$$

4. The light field image intelligent encoding method according to any one of claims 1-3, further comprising: comparing the rate-distortion costs of all coding modes through a rate-distortion optimization process.