CN112601092A - Video coding and decoding method and device - Google Patents


Info

Publication number
CN112601092A
CN112601092A
Authority
CN
China
Prior art keywords
block
coding
blocks
determining
matching block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110224136.0A
Other languages
Chinese (zh)
Inventor
罗伟节
向国庆
滕波
葛强
洪一帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Smart Video Security Innovation Center Co Ltd
Original Assignee
Zhejiang Smart Video Security Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Smart Video Security Innovation Center Co Ltd filed Critical Zhejiang Smart Video Security Innovation Center Co Ltd
Priority to CN202110224136.0A priority Critical patent/CN112601092A/en
Publication of CN112601092A publication Critical patent/CN112601092A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: … using predictive coding
    • H04N19/597: … using predictive coding specially adapted for multi-view video sequence encoding
    • H04N19/10: … using adaptive coding
    • H04N19/102: … using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103: Selection of coding mode or of prediction mode
    • H04N19/105: Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • H04N19/169: … using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17: … the unit being an image region, e.g. an object
    • H04N19/176: … the region being a block, e.g. a macroblock
    • H04N19/42: … characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/593: … using predictive coding involving spatial prediction techniques

Abstract

The invention discloses a video encoding and decoding method and apparatus in the field of 3D/stereoscopic video coding. The encoding method comprises the following steps: dividing the current texture frame into a plurality of coding blocks according to depth information; determining candidate matching blocks among coding blocks whose depth is the same as or similar to that of the current coding block, and determining a best matching block from the candidates; and performing intra-frame prediction encoding based on the best matching block. The method has a low computational cost and improves compression efficiency.

Description

Video coding and decoding method and device
Technical Field
The present disclosure relates to the field of 3D video encoding/stereoscopic video encoding, and in particular, to a method and an apparatus for video encoding and decoding.
Background
Stereoscopic video is a form of multi-view video: two video signals, corresponding to a viewer's left and right eyes, together produce a three-dimensional effect. Stereoscopic video has several different representation formats, among which the "texture + depth" format is widely used.
The "texture + depth" representation comprises two related kinds of data: two-dimensional video data (the texture image/video) and a corresponding depth map. The two-dimensional video data (texture image) can be encoded and decoded by conventional video codecs, including standardized technologies such as MPEG-2, MPEG-4, H.264, H.265, and H.266, all of which employ intra-frame prediction coding.
The basic idea of intra prediction is to remove spatial redundancy by exploiting the correlation between neighboring pixels. Unlike in inter prediction, "neighboring pixels" here are the reconstructed pixels of already-encoded blocks surrounding the current block. In earlier intra prediction, the predicted value of the current block is computed from the immediately adjacent left column and upper row of reference samples; the VVC (H.266) standard introduces the multiple reference line (MRL) technique, which extends the usable reference lines to three. However, using only the correlation between adjacent pixels or lines of pixels, while ignoring the correlation between non-adjacent regions farther apart within the frame, limits video compression performance. And simply applying an inter-prediction-style matching block search inside the frame demands a large amount of computation, which limits its application and yields low compression efficiency.
Disclosure of Invention
In view of the above technical problems in the prior art, the embodiments of the present disclosure provide a video encoding and decoding method and apparatus to address limited video compression performance, high computational cost, and low compression efficiency.
A first aspect of the disclosed embodiments discloses a method for video encoding, where the method includes:
dividing the current texture frame into a plurality of coding blocks according to the depth information;
determining candidate matching blocks among coding blocks whose depth is the same as or similar to that of the current coding block, and determining a best matching block from the candidate matching blocks;
performing intra-prediction encoding based on the best matching block.
In some embodiments, dividing the current texture frame into a plurality of coding blocks according to depth information may instead be performed as: dividing the current texture frame into a plurality of regions according to the depth information to obtain at least one texture sub-frame; and partitioning each texture sub-frame to form a plurality of coding blocks.
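As an illustrative sketch (not taken from the patent), the depth-based region split can be emulated by quantizing each block's mean depth into coarse bands; the function name, block size, and band width are all hypothetical:

```python
def partition_by_depth(depth_map, block_size, depth_band):
    """Group fixed-size blocks of a texture frame into regions
    (texture sub-frames) by quantizing each block's mean depth."""
    h, w = len(depth_map), len(depth_map[0])
    regions = {}
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            pixels = [depth_map[y][x]
                      for y in range(by, min(by + block_size, h))
                      for x in range(bx, min(bx + block_size, w))]
            mean_depth = sum(pixels) / len(pixels)
            band = int(mean_depth // depth_band)  # coarse depth band -> region id
            regions.setdefault(band, []).append((by, bx))
    return regions
```

Blocks falling into the same band would then be block-partitioned and searched together.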
In some embodiments, the method further comprises:
comparing the depth information of the current coding block with that of other coding blocks;
and taking each coded block whose depth difference satisfies a threshold condition as a candidate matching block.
In some embodiments, the determining the best matching block from the candidate matching blocks specifically comprises:
taking the pixel value of the candidate matching block as a first predicted value;
calculating a cost function over the residual and/or motion vector of the first predicted value, and selecting the candidate with the smallest cost as the best matching block.
In some embodiments, the cost function is any one of SAD (sum of absolute differences), SATD (sum of absolute transformed differences, using a Hadamard transform), SSD (sum of squared differences), MAD (mean absolute difference), or MSD (mean squared difference).
In some embodiments, the method further comprises: taking the pixel value of the best matching block as a second predicted value, and acquiring residual data according to the second predicted value and the pixel value of the current coding block;
and performing an encoding operation on the residual data to generate a video encoding stream.
In some embodiments, indication information for the best matching block is placed into the video coding stream.
A second aspect of the embodiments of the present disclosure discloses a method for video decoding, where the method includes:
decoding the video coding stream to obtain the best matching block;
determining candidate matching blocks according to the best matching block, and determining, among the candidate matching blocks, the coding block corresponding to the best matching block;
and splicing at least one coding block according to the depth information to obtain the current texture frame.
A third aspect of the disclosed embodiments discloses an apparatus for video encoding, the apparatus comprising:
the dividing module is used for dividing the current texture frame into a plurality of coding blocks according to the depth information;
a best matching block determining module, used for determining candidate matching blocks among coding blocks whose depth is the same as or similar to that of the current coding block, and for determining a best matching block from the candidate matching blocks;
an encoding module that performs intra prediction encoding based on the best matching block.
In some embodiments, the partitioning module is further configured to: dividing a current texture frame into a plurality of areas according to depth information to obtain at least one texture subframe; and partitioning the texture sub-frame to form a plurality of coding blocks.
In some embodiments, the best matching block determining module is specifically configured to: take the pixel value of a candidate matching block as a first predicted value;
calculate a cost function over the residual and/or motion vector of the first predicted value, and select the candidate with the smallest cost as the best matching block.
In some embodiments, the apparatus further includes a residual data module, configured to use a pixel value of the best matching block as a second prediction value, and obtain residual data according to the second prediction value and the pixel value of the current coding block;
and performing an encoding operation on the residual data to generate a video encoding stream.
A fourth aspect of the disclosed embodiments discloses an apparatus for video decoding, the apparatus comprising:
a decoding module, used for decoding the video coding stream to obtain the best matching block;
a coding block determining module, used for determining candidate matching blocks according to the best matching block and determining, among them, the coding block corresponding to the best matching block;
and the splicing module is used for splicing at least one coding block according to the depth information to obtain the current texture frame.
The embodiments of the present disclosure disclose a video encoding and decoding method and apparatus: the current texture frame is divided into a plurality of coding blocks according to depth information, a best matching block is determined among coding blocks of the same or similar depth, and intra-frame prediction encoding is performed based on that best matching block. The method does not limit video compression performance, requires little computation, and improves the efficiency of video encoding and decoding.
Drawings
The features and advantages of the present disclosure will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the disclosure in any way, and in which:
the features and advantages of the present disclosure will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the disclosure in any way, and in which:
FIG. 1 is a diagram illustrating multiple reference line intra prediction according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram of an image frame shown in accordance with some embodiments of the present disclosure;
fig. 3 is a flow diagram of a method of video encoding, according to some embodiments of the present disclosure;
FIG. 4 is an exemplary depth map shown in accordance with some embodiments of the present disclosure;
FIG. 5 is an exemplary depth map shown in accordance with some embodiments of the present disclosure;
fig. 6 is a flow diagram of a method of video decoding, shown in accordance with some embodiments of the present disclosure;
fig. 7 is a schematic diagram of an apparatus for video encoding according to some embodiments of the present disclosure;
fig. 8 is a schematic diagram of an apparatus for video decoding according to some embodiments of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to some embodiments of the present disclosure.
Detailed Description
In the following detailed description, numerous specific details of the disclosure are set forth by way of examples in order to provide a thorough understanding of the relevant disclosure. However, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. It should be understood that the use of the terms "system," "apparatus," "unit" and/or "module" in this disclosure is a method for distinguishing between different components, elements, portions or assemblies at different levels of sequence. However, these terms may be replaced by other expressions if they can achieve the same purpose.
It will be understood that when a device, unit or module is referred to as being "on" … … "," connected to "or" coupled to "another device, unit or module, it can be directly on, connected or coupled to or in communication with the other device, unit or module, or intervening devices, units or modules may be present, unless the context clearly dictates otherwise. For example, as used in this disclosure, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit its scope. As used in the specification and claims of this disclosure, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" specify the presence of the identified features, integers, steps, operations, elements, and/or components, but do not constitute an exclusive list of them.
These and other features and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will be better understood by reference to the following description and drawings, which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the disclosure. It will be understood that the figures are not drawn to scale.
Various block diagrams are used in this disclosure to illustrate various variations of embodiments according to the disclosure. It should be understood that the foregoing and following structures are not intended to limit the present disclosure. The protection scope of the present disclosure is subject to the claims.
In the prior art, a multiple reference line intra prediction mode is usually adopted, that is, intra prediction using multiple reference lines. As shown in fig. 1, not only the nearest reconstructed samples but also the second, third and fourth adjacent lines can serve as reference samples. The MRL technique adopted into the standard, however, does not use all four of these lines: only the first, second and fourth adjacent lines are used (shown as Reference line 0, Reference line 1 and Reference line 3 in the figure), a design that balances complexity against performance. Specifically, when predicting with the angular modes (MRL applies to angular modes only), all three reference lines are tried, the line with the smallest rate-distortion cost is selected, and the index of that reference line is sent to the decoding end together with the mode index. The decoding end then selects the corresponding reference line for prediction according to the reference line index.
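The reference-line selection step described above can be sketched as follows. This hypothetical Python fragment substitutes a plain SAD over a few boundary samples for the true rate-distortion cost, and all names are illustrative:

```python
def select_reference_line(current_top_row, reference_lines):
    """Pick the reference line whose samples best predict the current
    block's top row (SAD stands in for the rate-distortion cost)."""
    def sad(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    costs = [sad(current_top_row, line) for line in reference_lines]
    best_idx = min(range(len(costs)), key=costs.__getitem__)
    # best_idx plays the role of mrl_idx and would be signalled to the decoder
    return best_idx, costs[best_idx]
```

In a real encoder the cost would combine distortion with the bits needed to signal the index.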
The reference line index (mrl_idx) of multi-line prediction must be transmitted to the decoding end so that the corresponding intra reference pixels can be generated; that is, the intra multi-line coding mode information is "prediction mode + reference line index". Clearly, using only the correlation between adjacent pixels or a few adjacent lines, while neglecting the correlation between non-adjacent areas at greater distances within the frame, greatly limits video compression performance. And simply applying an inter-prediction-style matching block search inside the frame requires a large amount of computation, which limits its application.
As shown in fig. 2, a frame of image is shown in which a rectangular object is partially occluded by triangular object a. The image frame is divided into 6 × 6 = 36 coding blocks (coding sub-blocks). When coding sub-block b, if an intra search based on matching-block prediction is applied, i.e. a best matching block is searched for over the whole image frame, the already-coded block a is the best matching block of b; assuming b is sufficiently similar to a, the prediction residual will be small. Compared with the existing method of predicting from intra reference pixels, intra matching-block prediction can thus achieve higher compression performance, but searching the whole frame for the best matching block takes a large amount of processing time.
Because the same object in a video has strong spatial correlation, depth information can be used to identify the same object within a video frame, so that the intra search is confined as far as possible to data blocks belonging to that object. This limits the search range of the intra matching block and keeps the amount of search computation within an acceptable range. At the same time, because of the strong spatial correlation within the same object, the matching block found in this way correlates more strongly with the current block, which helps reduce the prediction residual. Accordingly, an embodiment of the present disclosure provides a method for video encoding; as shown in fig. 3, the method includes:
s301, dividing a current texture frame into a plurality of coding blocks according to depth information;
s302, determining candidate matching blocks among coding blocks whose depth is the same as or similar to that of the current coding block, and determining a best matching block from the candidate matching blocks;
and S303, executing intra-frame prediction coding based on the best matching block.
In some embodiments, the block partitioning methods of VVC and HEVC may be used for the current texture frame. A frame of image is divided into one or more coding tree units. Under the HEVC standard, a coding tree unit is divided into coding units by quadtree partitioning, and each coding unit is further divided into prediction units and transform units; VVC no longer distinguishes between coding, prediction, and transform units. Either way, in the present invention, a block of video data that is intra-prediction encoded is referred to as a coding block, sub-block, or partition.
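A minimal sketch of the HEVC-style quadtree partitioning mentioned above, assuming a caller-supplied split decision (in a real encoder this decision is driven by rate-distortion optimization); all names are hypothetical:

```python
def quadtree_split(x, y, size, should_split, min_size):
    """Recursively split a coding tree unit into coding units
    (HEVC-style quadtree); returns leaf blocks as (x, y, size)."""
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):          # top/bottom halves
        for dx in (0, half):      # left/right halves
            blocks += quadtree_split(x + dx, y + dy, half,
                                     should_split, min_size)
    return blocks
```

For example, splitting a 64x64 coding tree unit exactly once yields four 32x32 coding blocks.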
In some embodiments, the depth information corresponding to each coding block may be determined by the depth map information.
In particular, the same object in a video corresponds, with high probability, to a region of the depth map with identical or slowly varying depth values, so the depth map reflects object contours well. The depth map also carries the position of each pixel in the video, so depth reflects where an object appears in the image.
As shown in fig. 4, an exemplary depth map: the depth value of object B is 1 m, while sub-blocks a and b have a depth of 2 m.
In some embodiments, the depth value of a sub-block is the average depth value of all pixels in the sub-block, as shown in fig. 5: sub-blocks a and b share the same depth value of 2 m, while sub-block c has a depth value of 1.8 m.
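The averaging rule above can be written directly; a hypothetical sketch in which plain Python lists stand in for the depth map:

```python
def block_mean_depth(depth_map, x0, y0, size):
    """Depth value of a sub-block = average depth of all its pixels."""
    total = sum(depth_map[y][x]
                for y in range(y0, y0 + size)
                for x in range(x0, x0 + size))
    return total / (size * size)
```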
Further, the method further comprises:
comparing the depth information of the current coding block with the depth information of other coding blocks;
and taking the coded block corresponding to the difference value meeting the threshold condition as a candidate matching block.
For example, assuming that the threshold is 1m, sub-block c is the candidate matching block for sub-block b.
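A sketch of the threshold test, with hypothetical names and the already-coded blocks represented as (label, depth) pairs:

```python
def candidate_matching_blocks(current_depth, coded_blocks, threshold):
    """Keep already-coded blocks whose depth differs from the current
    block's depth by no more than the threshold."""
    return [blk for blk, depth in coded_blocks
            if abs(depth - current_depth) <= threshold]
```

With a current depth of 2 m and a threshold of 1 m, a block at 1.8 m qualifies while one at 5 m does not.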
In some embodiments, the determining the best matching block from the candidate matching blocks specifically comprises:
taking the pixel value of the candidate matching block as a first predicted value;
a cost function is calculated over the residual and/or motion vector of the first predicted value, and the candidate with the lowest cost is selected as the best matching block (reference block).
In some embodiments, the cost function is any one of SAD (sum of absolute differences), SATD (sum of absolute transformed differences, using a Hadamard transform), SSD (sum of squared differences), MAD (mean absolute difference), or MSD (mean squared difference).
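The listed cost functions are simple pixel-wise sums. A sketch of SAD, SSD, and MAD over flattened pixel blocks, with hypothetical names (SATD is omitted because it additionally requires a Hadamard transform):

```python
def sad(block, ref):
    """Sum of absolute differences between two flattened pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block, ref))

def ssd(block, ref):
    """Sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(block, ref))

def mad(block, ref):
    """Mean absolute difference."""
    return sad(block, ref) / len(block)

def best_match(current, candidates, cost=sad):
    """Index of the candidate block with the smallest cost."""
    return min(range(len(candidates)),
               key=lambda i: cost(current, candidates[i]))
```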
In some embodiments, the method further comprises: taking the pixel value of the best matching block as a second predicted value, and acquiring residual data according to the second predicted value and the pixel value of the current coding block;
in some embodiments, the method further comprises: after the pixel value of the best matching block is adjusted to a certain extent, a second predicted value is determined, and residual data are obtained according to the second predicted value and the pixel value of the current coding block; such an approach may result in a smaller energy residual than simply taking the pixel values of the best matching block as the predicted values, and thus a better compression ratio. The manner in which the reference block data is adjusted may generally be implemented using a prediction filter. For example, affine transform motion compensation prediction based on blocks is proposed in h.266/VVC, taking into account non-translational motion such as scaling, rotation, etc. And determining a predicted value of the current block based on the reference block, and then comparing the pixel data of the current coding block with the predicted value to obtain a residual signal.
In the prior art, video coding standards such as MPEG-2, MPEG-4, H.264, H.265, and H.266/VVC support block-based motion-prediction coding and the coding of the residual data it produces. Generally, the residual data is first scanned, converting the two-dimensional residual into one-dimensional data; a DCT then converts it into frequency-domain data, which is quantized. Because the human eye is differently sensitive to image data of different frequencies, different quantization parameters can be used for different frequencies. The quantized data is further compressed with entropy coding techniques such as variable-length coding and arithmetic coding, finally forming the compressed video data.
This prior-art processing of residual data can be retained. In some embodiments, the intra prediction encoding further comprises encoding the intra prediction residual data: sequentially performing scanning, DCT, quantization, entropy coding, and similar operations.
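The scan/quantize/entropy-code chain can be caricatured as below. This is a loose illustrative sketch, not a codec: raster scanning replaces the zig-zag scan, uniform quantization replaces frequency-dependent quantization, run-length pairs replace true entropy coding, and the DCT step is omitted entirely. All names are hypothetical:

```python
def scan_raster(block2d):
    """Flatten 2-D residual data to 1-D (raster order stands in
    for the zig-zag scan used by real codecs)."""
    return [v for row in block2d for v in row]

def quantize(coeffs, qstep):
    """Uniform quantization; real codecs vary qstep with frequency."""
    return [round(c / qstep) for c in coeffs]

def run_length(values):
    """Minimal entropy-coding stand-in: (value, run-count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out
```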
In some embodiments, when determining the best match block, best match block indication information is generated and placed into the video coding stream.
The embodiment of the present disclosure further discloses a method for video decoding, as shown in fig. 6, the method includes:
s601, decoding the video coding stream to obtain the best matching block;
s602, determining candidate matching blocks according to the best matching block, and determining, among the candidate matching blocks, the coding block corresponding to the best matching block;
and S603, splicing at least one coding block according to the depth information to obtain the current texture frame.
The embodiment of the present disclosure also discloses an apparatus 700 for video encoding, as shown in fig. 7, the apparatus includes:
a dividing module 701, configured to divide a current texture frame into multiple coding blocks according to depth information;
a best matching block determining module 702, configured to determine candidate matching blocks among coding blocks whose depth is the same as or similar to that of the current coding block, and to determine a best matching block from among the candidate matching blocks;
an encoding module 703 that performs intra prediction encoding based on the best matching block.
In some embodiments, the partitioning module is further configured to: dividing a current texture frame into a plurality of areas according to depth information to obtain at least one texture subframe; and partitioning the texture sub-frame to form a plurality of coding blocks.
In some embodiments, the best matching block determining module is specifically configured to: take the pixel value of a candidate matching block as a first predicted value;
calculate a cost function over the residual and/or motion vector of the first predicted value, and select the candidate with the smallest cost as the best matching block.
In some embodiments, the apparatus further includes a residual data module, configured to use a pixel value of the best matching block as a second prediction value, and obtain residual data according to the second prediction value and the pixel value of the current coding block;
and performing an encoding operation on the residual data to generate a video encoding stream.
The embodiment of the present disclosure further discloses a device 800 for video decoding, as shown in fig. 8, the device includes:
a decoding module 801, configured to decode the video coding stream to obtain the best matching block;
a coding block determining module 802, configured to determine candidate matching blocks according to the best matching block, and to determine, among the candidate matching blocks, the coding block corresponding to the best matching block;
and a splicing module 803, configured to splice at least one coding block according to the depth information to obtain a current texture frame.
Referring to fig. 9, a schematic diagram of an electronic device according to an embodiment of the disclosure is provided. Wherein, this electronic equipment 900 includes:
a memory 930 and one or more processors 910;
wherein the memory 930 is communicatively coupled to the one or more processors 910, and the memory 930 stores instructions 932 executable by the one or more processors 910, the instructions 932 being executable by the one or more processors 910 to cause the one or more processors 910 to perform the methods of the foregoing embodiments of the disclosure.
Specifically, the processor 910 and the memory 930 may be connected by a bus or other means, such as bus 940 in fig. 9. The processor 910 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or any combination thereof.
The memory 930, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules. The processor 910 performs various functional applications of the processor and data processing by executing non-transitory software programs, instructions, and modules 932 stored in the memory 930.
The memory 930 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 910, and the like. Further, the memory 930 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 930 may optionally include memory located remotely from processor 910 and such remote memory may be coupled to processor 910 via a network, such as through communications interface 920. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present disclosure also provides a computer-readable storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are executed to perform the method in the foregoing embodiment of the present disclosure.
The foregoing computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media specifically include, but are not limited to, a USB flash drive, a removable hard drive, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile disk (DVD), HD-DVD, Blu-ray or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
While the subject matter described herein is presented in the general context of execution with an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may also be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like, as well as distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure.
In summary, the present disclosure provides a video encoding and decoding method and apparatus, together with an electronic device and a computer-readable storage medium. The current texture frame is divided into a plurality of coding blocks according to depth information, a best matching block is determined among coding blocks of the same or similar depth, and intra-frame prediction encoding is performed based on the best matching block. The method does not restrict video compression performance, imposes a low computational load, and improves the efficiency of video encoding and decoding.
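The encoder-side flow summarized above can be sketched as follows. This is a minimal, illustrative Python sketch and not the claimed implementation: the block size, the depth-similarity threshold, and all function names are hypothetical; frames are assumed to be grayscale NumPy arrays, and SAD (one of the cost functions named in claim 5) stands in for the generic cost function.

```python
# Illustrative sketch of depth-guided intra prediction (hypothetical names).
import numpy as np

BLOCK = 8            # hypothetical block size
DEPTH_THRESHOLD = 4  # hypothetical "same or similar depth" threshold

def split_into_blocks(frame, depth_map, block=BLOCK):
    """Divide the texture frame into coding blocks, tagging each block
    with the mean depth of the co-located region of the depth map."""
    blocks = []
    h, w = frame.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            blocks.append({
                "pos": (y, x),
                "pixels": frame[y:y+block, x:x+block],
                "depth": float(depth_map[y:y+block, x:x+block].mean()),
            })
    return blocks

def sad(a, b):
    """Sum of absolute differences: one of the costs named in claim 5."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def best_match(current, coded_blocks, threshold=DEPTH_THRESHOLD):
    """Restrict candidates to already-coded blocks of the same or similar
    depth, then keep the candidate with the smallest cost."""
    candidates = [b for b in coded_blocks
                  if abs(b["depth"] - current["depth"]) <= threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda b: sad(current["pixels"], b["pixels"]))

def encode_block(current, coded_blocks):
    """Use the best match as predictor; the residual, together with the
    match indication, is what would be written to the coding stream."""
    match = best_match(current, coded_blocks)
    if match is None:
        return None, current["pixels"].astype(np.int32)
    residual = (current["pixels"].astype(np.int32)
                - match["pixels"].astype(np.int32))
    return match["pos"], residual
```

In this sketch, the position returned by `encode_block` plays the role of the best-matching-block indication information of claim 7, and the residual is what an entropy coder would place into the video coding stream.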
It is to be understood that the above-described specific embodiments are merely illustrative of the principles of the present disclosure and are not to be construed as limiting it. Accordingly, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present disclosure shall fall within its protection scope. Furthermore, the appended claims are intended to cover all such variations and modifications that fall within their scope and bounds, or equivalents of such scope and bounds.

Claims (13)

1. A method of video encoding, the method comprising:
dividing the current texture frame into a plurality of coding blocks according to the depth information;
determining candidate matching blocks among a plurality of coding blocks having the same or similar depth as the current coding block, and determining a best matching block from the candidate matching blocks;
performing intra-prediction encoding based on the best matching block.
2. The method of claim 1, wherein dividing the current texture frame into the plurality of coding blocks according to the depth information comprises: dividing the current texture frame into a plurality of regions according to the depth information to obtain at least one texture subframe; and partitioning the texture subframe to form the plurality of coding blocks.
3. The method of claim 1, further comprising:
comparing the depth information of the current coding block with the depth information of other coding blocks;
and taking the coding blocks whose depth difference values satisfy a threshold condition as candidate matching blocks.
4. The method of claim 1, wherein determining the best matching block from the candidate matching blocks comprises:
taking the pixel value of the candidate matching block as a first predicted value;
calculating a cost function on the residual and/or motion vector of each first predicted value, and selecting as the best matching block the candidate with the smallest cost.
5. The method of claim 4, wherein the cost function is any one of SAD (sum of absolute differences), SATD (sum of absolute transformed differences, after a Hadamard transform), SSD (sum of squared differences), MAD (mean absolute difference), and MSD (mean squared difference).
6. The method of claim 1, further comprising: taking the pixel value of the best matching block as a second predicted value, and acquiring residual data according to the second predicted value and the pixel value of the current coding block;
and performing an encoding operation on the residual data to generate a video encoding stream.
7. The method of claim 1, wherein indication information of the best matching block is placed in the video coding stream.
8. A method of video decoding, the method comprising:
decoding a video coding stream to obtain a best matching block;
determining candidate matching blocks according to the best matching block, and determining, among the candidate matching blocks, the coding block corresponding to the best matching block;
and splicing at least one coding block according to the depth information to obtain the current texture frame.
9. An apparatus for video encoding, the apparatus comprising:
the dividing module is used for dividing the current texture frame into a plurality of coding blocks according to the depth information;
the optimal matching module determining module is used for determining candidate matching blocks in a plurality of coding blocks with the same or similar depths as the coding blocks and determining an optimal matching block from the candidate matching blocks;
an encoding module that performs intra prediction encoding based on the best matching block.
10. The apparatus of claim 9, wherein the partitioning module is further configured to: dividing a current texture frame into a plurality of areas according to depth information to obtain at least one texture subframe; and partitioning the texture sub-frame to form a plurality of coding blocks.
11. The apparatus of claim 9, wherein the best matching block determining module is specifically configured to: take pixel values of the candidate matching blocks as first predicted values;
calculate a cost function on the residual and/or motion vector of each first predicted value, and select as the best matching block the candidate with the smallest cost.
12. The apparatus according to claim 9, further comprising a residual data module configured to use the pixel value of the best matching block as a second prediction value, and obtain residual data according to the second prediction value and the pixel value of the current coding block;
and performing an encoding operation on the residual data to generate a video encoding stream.
13. An apparatus for video decoding, the apparatus comprising:
the decoding module is used for decoding the video coding stream to obtain a best matching block;
the coding block determining module is used for determining candidate matching blocks according to the best matching block, and for determining, among the candidate matching blocks, the coding block corresponding to the best matching block;
and the splicing module is used for splicing at least one coding block according to the depth information to obtain the current texture frame.
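The decoder side described in claims 8 and 13 can be sketched similarly. Again, this is a hypothetical illustration rather than the claimed implementation: it assumes each entry of the coding stream carries the block's position, the position of its best matching block (or none), and a residual, and that blocks are spliced back into the texture frame by position.

```python
# Illustrative decoder-side sketch (hypothetical stream layout and names):
# reconstruct each coding block from its best matching block plus residual,
# then splice the blocks back into the texture frame.
import numpy as np

def decode_frame(coded_stream, frame_shape, block=8):
    """coded_stream: list of (pos, match_pos_or_None, residual) tuples."""
    frame = np.zeros(frame_shape, dtype=np.int32)
    decoded = {}
    for pos, match_pos, residual in coded_stream:
        if match_pos is None:
            pixels = residual                        # no predictor available
        else:
            pixels = decoded[match_pos] + residual   # predictor + residual
        decoded[pos] = pixels
        y, x = pos
        frame[y:y+block, x:x+block] = pixels         # splice into the frame
    return frame.astype(np.uint8)
```

The splicing step corresponds to the splicing module of claim 13: once every coding block is reconstructed, writing each block back at its position yields the current texture frame.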
CN202110224136.0A 2021-03-01 2021-03-01 Video coding and decoding method and device Pending CN112601092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110224136.0A CN112601092A (en) 2021-03-01 2021-03-01 Video coding and decoding method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110224136.0A CN112601092A (en) 2021-03-01 2021-03-01 Video coding and decoding method and device

Publications (1)

Publication Number Publication Date
CN112601092A true CN112601092A (en) 2021-04-02

Family

ID=75207650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110224136.0A Pending CN112601092A (en) 2021-03-01 2021-03-01 Video coding and decoding method and device

Country Status (1)

Country Link
CN (1) CN112601092A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101588487A (en) * 2009-06-10 2009-11-25 武汉大学 Video intraframe predictive coding method
CN106134198A (en) * 2014-03-28 2016-11-16 庆熙大学校产学协力团 Utilize video coding apparatus and the method thereof of depth information
CN106331727A (en) * 2016-08-26 2017-01-11 天津大学 Simplified search method for depth modeling modes
WO2017123133A1 (en) * 2016-01-12 2017-07-20 Telefonaktiebolaget Lm Ericsson (Publ) Video coding using hybrid intra prediction
CN110290388A (en) * 2019-06-17 2019-09-27 浙江大华技术股份有限公司 Intra-frame prediction method, method for video coding, computer equipment and storage device
CN110650346A (en) * 2019-09-26 2020-01-03 西安邮电大学 3D-HEVC depth map motion estimation parallel implementation method and structure


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
T. K. Lee, Y. L. Chan, and W. C. Siu, "Adaptive Search Range by Depth Variant Decaying Weights for HEVC Inter Texture Coding," 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China: IEEE. *

Similar Documents

Publication Publication Date Title
US10856006B2 (en) Method and system using overlapped search space for bi-predictive motion vector refinement
US10728573B2 (en) Motion compensated boundary pixel padding
KR102173475B1 (en) Picture prediction method and picture prediction apparatus
KR102447241B1 (en) Image encoding/decoding method and device
US20130163669A1 (en) Method and apparatus for processing a video signal
GB2519616A (en) Method and apparatus for displacement vector component prediction in video coding and decoding
US10602155B2 (en) Intra prediction method and apparatus
JP2020523818A (en) Method and apparatus for encoding or decoding video data in FRUC mode with reduced memory access
KR20170084213A (en) Systems and methods for processing a block of a digital image
KR102553665B1 (en) Inter prediction method and apparatus in video coding system
JP7364936B2 (en) Encoding method, encoding device, and program
AU2016228181A1 (en) Method for inducing a merge candidate block and device using same
KR102234712B1 (en) Methods For Encoding/Decoding Image And Apparatus For Encoder/Decoder Using The Same
CN112601092A (en) Video coding and decoding method and device
KR20140124441A (en) Method for encoding and decoding video using interpolation based on edge direction, and apparatus thereof
CN110677645B (en) Image prediction method and device
KR20120086131A (en) Methods for predicting motion vector and methods for decording motion vector
CN112601093A (en) Video coding and decoding method and device
KR20210036328A (en) Methods For Encoding/Decoding Image And Apparatus For Encoder/Decoder Using The Same
NZ760521B2 (en) Motion vector refinement for multi-reference prediction
TH1301007142B (en) Friendly parallelization of joins for video encoding.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402