WO2024027616A1 - Intra prediction method and apparatus, computer device, and readable medium - Google Patents

Intra prediction method and apparatus, computer device, and readable medium

Info

Publication number
WO2024027616A1
WO2024027616A1, PCT/CN2023/110099, CN2023110099W
Authority
WO
WIPO (PCT)
Prior art keywords
image block
image
sequence
predicted
prediction
Prior art date
Application number
PCT/CN2023/110099
Other languages
English (en)
French (fr)
Inventor
任聪
徐科
孔德辉
杨维
曹洲
Original Assignee
深圳市中兴微电子技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2024027616A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N 19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Definitions

  • the present disclosure relates to the technical field of video coding and decoding, and specifically relates to an intra prediction method, device, computer equipment and readable medium.
  • among coding techniques, intra-frame prediction is particularly important: among the various video frame types, I frames (intra-coded frames) are coded entirely with intra prediction, and the compression ratio of I frames is usually lower than that of P frames (predictive-coded frames) and B frames (bidirectional predictive-coded frames), so the efficiency of intra-frame predictive coding has a large impact on the overall average bit rate of the video.
  • moreover, an I frame is usually used as a reference frame when decoding P frames and B frames; if an error occurs in the encoding of an I frame, not only is that I frame erroneous, but the P frames and B frames that reference it also cannot be decoded correctly.
  • the present disclosure provides an intra prediction method, device, computer equipment and readable medium.
  • an intra prediction method is provided.
  • the method is applied to a Transformer network and includes: dividing an image to be predicted into a preset number of image blocks, and generating an image block sequence including the image blocks; performing dimension processing on the image block sequence to obtain an image block embedding output sequence; encoding the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; decoding the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and generating a predicted image according to the current image block prediction sequence.
  • an intra prediction device is provided.
  • the device is a Transformer network device and includes a dividing module, a dimension processing module, an encoding module, a decoding module and a generating module. The dividing module is configured to divide an image to be predicted into a preset number of image blocks and generate an image block sequence including the image blocks; the dimension processing module is configured to perform dimension processing on the image block sequence to obtain an image block embedding output sequence; the encoding module is configured to encode the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; the decoding module is configured to decode the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and the generating module is configured to generate a predicted image according to the current image block prediction sequence.
  • a computer device including: one or more processors; and a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors implement the intra prediction method described above.
  • a computer-readable medium is provided with a computer program stored thereon, wherein when the program is executed, the intra prediction method as described above is implemented.
  • Figure 1 is a schematic diagram comparing intra prediction using a Transformer network with traditional intra prediction according to an embodiment of the present disclosure;
  • Figure 2 is a schematic diagram of an intra prediction process according to an embodiment of the present application;
  • Figure 3 is a schematic diagram of an encoding process according to an embodiment of the present application;
  • Figure 4 is a schematic diagram of a decoding process according to an embodiment of the present application;
  • Figure 5 is a schematic flowchart of generating an image block sequence according to an embodiment of the present application;
  • Figure 6 is a first schematic flowchart of generating a predicted image according to an embodiment of the present application;
  • Figure 7 is a second schematic flowchart of generating a predicted image according to an embodiment of the present application;
  • Figure 8 is a schematic flowchart of determining the numbers of encoding and decoding times according to an embodiment of the present application;
  • Figure 9 is a first schematic structural diagram of an intra prediction device according to an embodiment of the present application;
  • Figure 10 is a second schematic structural diagram of an intra prediction device according to an embodiment of the present application.
  • Embodiments described herein may be described with reference to plan and/or cross-sectional illustrations, with the aid of idealized schematic illustrations of the present disclosure. Accordingly, example illustrations may be modified based on manufacturing techniques and/or tolerances. Therefore, the embodiments are not limited to those shown in the drawings but include modifications of configurations formed based on the manufacturing process. Accordingly, the regions illustrated in the figures are of a schematic nature and the shapes of the regions shown in the figures are illustrative of the specific shapes of regions of the element and are not intended to be limiting.
  • the traditional intra-frame prediction method, taking H.265 as an example, defines 35 prediction modes on the basis of the PU (Prediction Unit); a PU can be divided into TUs (Transform Units) in the form of a quadtree, and all TUs within a PU share the same prediction mode.
  • the H.265 intra prediction process is as follows: determine whether the adjacent reference pixels of the current TU are available and perform corresponding processing, filter the reference pixels, and calculate the predicted pixel value of the current TU based on the filtered reference pixels.
  • Traditional intra prediction has multiple prediction modes, resulting in high computational overhead.
  • the right part of Figure 1 shows a schematic structural diagram of a Transformer network according to an embodiment of the present disclosure.
  • the embodiment of the present disclosure uses the Transformer network to perform intra-frame prediction coding, replacing the traditional intra-frame estimation and intra-frame prediction coding, and the output prediction image is used for subsequent quantization operations.
  • the intra prediction process of the embodiment of the present disclosure will be described in detail below with reference to FIG. 1 and FIG. 2 .
  • An embodiment of the present disclosure provides an intra prediction method, which is applied to a Transformer network. As shown in Figure 1 and Figure 2, the method includes the following steps S11 to S15.
  • step S11 the image to be predicted is divided into a preset number of image blocks, and an image block sequence including the image blocks is generated.
  • the width of the image to be predicted is W and the height is H.
  • step S12 dimensionality processing is performed on the image block sequence to obtain an image block embedding output sequence.
  • step S13 the image to be predicted is encoded according to the image block embedding output sequence and the first position information of the image block to obtain an image block encoding output sequence, which includes the first intra-frame global information.
  • the image to be predicted is encoded N times.
  • the number of encoding times N is preconfigured, and N is an integer greater than 1.
  • the Encoder module encodes the image to be predicted based on the image block embedding output sequence Px and the first position information of the image blocks, obtaining the image block encoding output sequence Pe = [Pe_1, Pe_2, ..., Pe_s].
  • the image block encoding output sequence Pe includes the first intra-frame global information, which is generated during the encoding of the image to be predicted.
  • in step S14, the image block encoding output sequence is decoded according to the second position information of the image blocks and the previously predicted image block prediction sequence, obtaining the current image block prediction sequence.
  • the current image block prediction sequence includes the second intra-frame global information.
  • the Decoder module shown in Figure 1 (which is composed of a stack of M identical decoder sub-modules) decodes the image block encoding output sequence Pe .
  • the number of decoding times M is preconfigured, and M is an integer greater than 1.
  • the third feature sequence Pd_i of each image block constitutes the current image block prediction sequence Pd.
  • the current image block prediction sequence P d includes the second intra-frame global information, and the second intra-frame global information is generated during the decoding process of the image block encoding output sequence P e .
  • step S15 a predicted image is generated based on the current image block prediction sequence.
  • the Fusion module shown in Figure 1 performs dimension conversion processing on the current image block prediction sequence Pd to unify its dimensions, and then splices the results to obtain a predicted image of width W and height H.
  • the intra-frame prediction method provided by the embodiments of the present disclosure is applied to a Transformer network and includes: dividing the image to be predicted into a preset number of image blocks, and generating an image block sequence including the image blocks; performing dimension processing on the image block sequence to obtain an image block embedding output sequence; encoding the image to be predicted according to the image block embedding output sequence and the first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes the first intra-frame global information; decoding the image block encoding output sequence according to the second position information of the image blocks and the previously predicted image block prediction sequence to obtain the current image block prediction sequence, where the current image block prediction sequence includes the second intra-frame global information; and generating a predicted image according to the current image block prediction sequence.
  • the embodiments of the present disclosure implement intra-frame prediction coding through the Transformer network, which not only utilizes the local information within each image block, but also uses the self-attention layers in the Transformer to obtain intra-frame global information, effectively overcoming the limitations caused by convolutional inductive bias. This makes the information interaction more complete, so that the intra-coded predicted image is obtained more accurately.
  • the self-attention mechanism (self-attention) is the core of the Transformer network and an important means for obtaining global information within a frame in this embodiment of the disclosure.
  • An exemplary process of obtaining global information in the first frame and global information in the second frame will be described below with reference to Figures 3 and 4 respectively.
  • the first intra-frame global information is calculated based on the image block embedding output sequence P x to which the first position information has been added.
  • the encoder sub-module first performs self-attention processing on the image block embedding output sequence Px to which the first position information has been added, obtaining the first intra-frame global information; after addition with Px and normalization, followed by feed-forward processing and a further addition step, the encoding of the image to be predicted is completed and the image block encoding output sequence Pe is obtained.
  • the second intra-frame global information is calculated based on the image block encoding output sequence Pe and the predicted image block prediction sequence Pd ' to which the second position information has been added.
  • the decoder sub-module first performs a first self-attention processing on the previously predicted image block prediction sequence Pd' to which the second position information has been added; after addition with Pd' and normalization, a second self-attention processing is performed based on the image block encoding output sequence Pe; after addition with Pe, normalization and feed-forward processing, the decoding process is completed and the current image block prediction sequence Pd is obtained.
  • the first intra-frame global information can be calculated by the following formula (1):
  • Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_t)) · V    (1)
  • where Attention is the first intra-frame global information, d_t is the dimension of the first feature sequence Px_i, softmax() is the activation function, and the Q, K, V matrices represent the weight values (i.e., dependencies) between the first feature sequences Px_i of each dimension to which the first position information has been added.
  • the Q, K, V matrices are obtained by matrix transformation (multiplication) of the image block embedding output sequence Px with three preset matrices, and the parameters in the three preset matrices can be obtained in a learnable manner. It can be seen from this that the Q, K, V matrices are related only to the image block embedding output sequence Px, i.e., this is self-attention over the image block embedding output sequence Px.
  • in some embodiments, dividing the image to be predicted into a preset number of image blocks and generating an image block sequence including the image blocks (i.e., step S11) includes the following steps S111 to S112.
  • step S111 the image to be predicted is divided into a preset number of equal-sized image blocks.
  • step S112 the image blocks are sorted from left to right and from top to bottom to generate an image block sequence.
  • through steps S111-S112, the traditional coding approach that requires computing a CTU (Coding Tree Unit) block partition is abandoned; the image is instead divided directly into equal blocks, which improves the efficiency of image block division and solves the problem of the high computational overhead of traditional intra-frame prediction.
  • generating a predicted image according to the current image block prediction sequence includes the following steps S151 to S153.
  • step S151 the current image block prediction sequence is linearized to obtain a first sequence.
  • the first sequence includes a one-dimensional array of each image block.
  • step S152 the one-dimensional array of each image block is converted into a two-dimensional matrix, and a second sequence is generated based on the two-dimensional matrix.
  • step S153 according to the second sequence, each of the two-dimensional matrices is spliced in order from left to right and from top to bottom to obtain a predicted image.
  • the Fusion module can include three processing units: Linear, Reshape and Concat.
  • Linear linearizes the input current image block prediction sequence Pd = [Pd_1, Pd_2, ..., Pd_s] to obtain the first sequence PL = [PL_1, PL_2, ..., PL_s], where the dimension of each PL_i (i = 1, 2, ..., S) is H*W/S.
  • the first sequence PL is input to Reshape, which converts each one-dimensional array PL_i into a two-dimensional matrix PR_i of width W/num and height H/num; the matrices PR_i form the second sequence PR.
  • the second sequence PR is input to Concat, which splices the second sequence PR in order from left to right and top to bottom into the predicted image of width W and height H, the final predicted image output by the Transformer network, which is provided for subsequent quantization processing.
  • in the case where different images to be predicted are encoded a different number of times and/or decoded a different number of times, N and M are determined through the following steps S21 and S22.
  • step S21 the texture complexity of the image to be predicted is calculated.
  • in some embodiments, the texture complexity is the variance σ² of the gray-level histogram of the image, and it is calculated, for example, by the texture complexity estimation module shown in Figure 1. The texture complexity can be calculated by the following formula (2):
  • σ² = Σ_{i=0}^{L-1} (z_i - m)² · p(z_i)    (2)
  • where z represents the gray level of the image to be predicted, p(z_i) is the corresponding histogram, and L is the number of gray levels; m is the mean value of z, which can be calculated by the following formula (3):
  • m = Σ_{i=0}^{L-1} z_i · p(z_i)    (3)
  • texture complexity is not limited to the above calculation method, and can also be calculated using other methods, such as gradient-based calculation, deep learning-based method, etc.
  • N and M are determined according to the texture complexity and a preset reference threshold, and N and M are respectively one of the preconfigured thresholds.
  • in some embodiments, the reference threshold includes a first reference threshold and a second reference threshold, and determining N and M according to the texture complexity and the preset reference thresholds includes the following steps: when the texture complexity is less than the first reference threshold, N is determined to be the preconfigured first encoding threshold N1 and M is determined to be the preconfigured first decoding threshold M1; when the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, N is determined to be the preconfigured second encoding threshold N2 and M is determined to be the preconfigured second decoding threshold M2; and when the texture complexity is greater than the second reference threshold, N is determined to be the preconfigured third encoding threshold N3 and M is determined to be the preconfigured third decoding threshold M3, where N3>N2>N1 and M3>M2>M1.
  • for example, the threshold judgment module in Figure 1 performs the texture complexity judgment. If the texture complexity is less than the first reference threshold, the image to be predicted has a weak texture, and the number of encoding times N and the number of decoding times M are set to smaller values (N1 and M1); if the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, the image to be predicted has a medium texture, and N and M are set to intermediate values (N2 and M2); if the texture complexity is greater than the second reference threshold, the image to be predicted has a strong texture, and N and M are set to larger values (N3 and M3).
  • through the above steps S21-S22, dynamic adjustment of the number of encoding times N (i.e., the number of stacked encoder sub-modules) and the number of decoding times M (i.e., the number of stacked decoder sub-modules) can be achieved. In this case, the Transformer network includes a texture complexity estimation module and a threshold judgment module, and the Transformer network is a Dynamic Transformer network.
  • it should be noted that N and M can also be configured as constants, without dynamic adjustment according to the image to be predicted; in that case, the Transformer network does not include the texture complexity estimation module and the threshold judgment module.
  • the first position information and the second position information may be calculated by the following formulas (4) and (5):
  • PE(pos, 2i) = sin(pos / 10000^(2i/d_t))    (4)
  • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_t))    (5)
  • where pos represents the number of the image block, PE(pos, 2i) is the position of even-numbered image blocks, PE(pos, 2i+1) is the position of odd-numbered image blocks, and i represents the index over the d_t dimensions.
  • first position information and the second position information can also be obtained using deep learning.
  • cross entropy can be used as the loss function to train the Transformer network, and other loss functions can also be used, such as L1 loss function, L2 loss function, etc.
  • the embodiments of the present disclosure are applied to a Transformer network, which can be replaced with other Transformer variants, such as swin-Transformer, Sparse Transformer and Image Transformer networks.
  • the embodiments of the present disclosure provide an intra prediction coding method based on a Dynamic Transformer network. It abandons the traditional coding approach that requires computing a CTU block partition and instead divides the image directly into equal patches, improving block division efficiency. Intra prediction coding is realized through the Transformer network, which utilizes the local information within each image block and obtains intra-frame global information through the self-attention layers of the Transformer network, making the information interaction in the network more complete, so that the intra-coded predicted image is obtained more accurately and the coding quality is guaranteed.
  • the Dynamic Transformer combines an evaluation of intra-frame texture complexity with reference-threshold judgments, so that the number of stacked encoder/decoder sub-modules can be adjusted adaptively, dynamically changing the network depth and reducing the computing resources required.
  • compared with a CNN (convolutional neural network), the Transformer network can obtain intra-frame global information with a shallower model depth and consumes fewer overall computing resources.
  • based on the same technical concept, the embodiments of the present disclosure further provide an intra prediction device. The intra prediction device is a Transformer network device. As shown in Figure 9, the intra prediction device includes a dividing module 101, a dimension processing module 102, an encoding module 103, a decoding module 104 and a generation module 105.
  • the dividing module 101 is configured to divide the image to be predicted into a preset number of image blocks and generate an image block sequence including the image blocks.
  • the dimension processing module 102 is configured to perform dimension processing on the image block sequence to obtain an image block embedding output sequence.
  • the encoding module 103 is configured to encode the image to be predicted according to the image block embedding output sequence and the first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes the first intra-frame global information.
  • the decoding module 104 is configured to decode the image block encoding output sequence according to the second position information of the image blocks and the previously predicted image block prediction sequence to obtain the current image block prediction sequence, where the current image block prediction sequence includes the second intra-frame global information.
  • the generation module 105 is configured to generate a predicted image according to the current image block prediction sequence.
  • the first intra-frame global information is calculated based on an image block embedding output sequence to which the first position information has been added.
  • the second intra-frame global information is calculated based on the image block encoding output sequence and a predicted image block prediction sequence to which the second position information has been added.
  • the dividing module 101 is configured to divide the image to be predicted into a preset number of equal-sized image blocks; sort the image blocks in order from left to right and top to bottom to generate the sequence of image blocks.
  • the generation module 105 is configured to linearize the current image block prediction sequence to obtain a first sequence, where the first sequence includes a one-dimensional array for each image block; to convert the one-dimensional array of each image block into a two-dimensional matrix and generate a second sequence from the two-dimensional matrices; and, according to the second sequence, to splice the two-dimensional matrices in order from left to right and top to bottom to obtain the predicted image.
  • the encoding module 103 is configured to encode the image to be predicted N times; the decoding module 104 is configured to decode the image block encoding output sequence M times.
  • the N and M are preconfigured, and N and M are integers greater than 1.
  • the intra prediction device further includes a codec number determination module 106.
  • the codec number determination module 106 is configured to calculate the texture complexity of the image to be predicted, and to determine N and M according to the texture complexity and preset reference thresholds, where N and M are each one of the preconfigured thresholds.
  • the reference threshold includes a first reference threshold and a second reference threshold.
  • the encoding and decoding times determination module 106 is configured to: when the texture complexity is less than the first reference threshold, determine N to be the preconfigured first encoding threshold N1 and M to be the preconfigured first decoding threshold M1; when the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, determine N to be the preconfigured second encoding threshold N2 and M to be the preconfigured second decoding threshold M2; and when the texture complexity is greater than the second reference threshold, determine N to be the preconfigured third encoding threshold N3 and M to be the preconfigured third decoding threshold M3, where N3>N2>N1 and M3>M2>M1.
  • Embodiments of the present disclosure also provide a computer device.
  • the computer device includes one or more processors and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the intra prediction method provided in the foregoing embodiments.
  • Embodiments of the present disclosure also provide a computer-readable medium on which a computer program is stored, wherein when the computer program is executed, the intra prediction method as provided in the foregoing embodiments is implemented.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data.
  • computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media .
  • Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a general illustrative sense only and not for purpose of limitation. In some instances, it will be apparent to those skilled in the art that features, characteristics and/or elements described in connection with a particular embodiment may be used alone, or may be used in conjunction with other embodiments, unless expressly stated otherwise. Features and/or components used in combination. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the invention as set forth in the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides an intra prediction method applied to a Transformer network, including: dividing an image to be predicted into a preset number of image blocks, and generating an image block sequence including the image blocks; performing dimension processing on the image block sequence to obtain an image block embedding output sequence; encoding the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; decoding the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and generating a predicted image according to the current image block prediction sequence. The present disclosure further provides an intra prediction apparatus, a computer device and a readable medium.

Description

Intra prediction method and apparatus, computer device, and readable medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application CN202210914755.7, entitled "Intra prediction method and apparatus, computer device, and readable medium" and filed on August 1, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of video encoding and decoding, and in particular to an intra prediction method and apparatus, a computer device and a readable medium.
Background
As users' demand for high-definition video grows, the amount of video data in video multimedia keeps increasing. Because video contains a great deal of redundant information, uncompressed video is difficult to use in practical storage and transmission, so coding techniques are needed to compress video and relieve the pressure on storage and transmission. Among coding techniques, intra-frame prediction is particularly important: among the various video frame types, I frames (intra-coded frames) are coded entirely with intra prediction, and the compression ratio of I frames is usually lower than that of P frames (predictive-coded frames) and B frames (bidirectional predictive-coded frames), so the efficiency of intra-frame predictive coding has a large impact on the overall average bit rate of the video. Moreover, an I frame is usually used as a reference frame when decoding P frames and B frames; if an error occurs in the encoding of an I frame, not only is that I frame erroneous, but the P frames and B frames that reference it also cannot be decoded correctly.
Summary
The present disclosure provides an intra prediction method and apparatus, a computer device and a readable medium.
In one aspect of the present disclosure, an intra prediction method is provided. The method is applied to a Transformer network and includes: dividing an image to be predicted into a preset number of image blocks, and generating an image block sequence including the image blocks; performing dimension processing on the image block sequence to obtain an image block embedding output sequence; encoding the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; decoding the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and generating a predicted image according to the current image block prediction sequence.
In another aspect of the present disclosure, an intra prediction apparatus is provided. The apparatus is a Transformer network device and includes a dividing module, a dimension processing module, an encoding module, a decoding module and a generating module. The dividing module is configured to divide an image to be predicted into a preset number of image blocks and generate an image block sequence including the image blocks; the dimension processing module is configured to perform dimension processing on the image block sequence to obtain an image block embedding output sequence; the encoding module is configured to encode the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; the decoding module is configured to decode the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and the generating module is configured to generate a predicted image according to the current image block prediction sequence.
In a further aspect of the present disclosure, a computer device is provided, including: one or more processors; and a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors implement the intra prediction method described above.
In yet another aspect of the present disclosure, a computer-readable medium is provided, having a computer program stored thereon, where the program, when executed, implements the intra prediction method described above.
Brief Description of the Drawings
Figure 1 is a schematic diagram comparing intra prediction using a Transformer network with traditional intra prediction according to an embodiment of the present disclosure;
Figure 2 is a schematic diagram of an intra prediction process according to an embodiment of the present application;
Figure 3 is a schematic diagram of an encoding process according to an embodiment of the present application;
Figure 4 is a schematic diagram of a decoding process according to an embodiment of the present application;
Figure 5 is a schematic flowchart of generating an image block sequence according to an embodiment of the present application;
Figure 6 is a first schematic flowchart of generating a predicted image according to an embodiment of the present application;
Figure 7 is a second schematic flowchart of generating a predicted image according to an embodiment of the present application;
Figure 8 is a schematic flowchart of determining the numbers of encoding and decoding times according to an embodiment of the present application;
Figure 9 is a first schematic structural diagram of an intra prediction apparatus according to an embodiment of the present application;
Figure 10 is a second schematic structural diagram of an intra prediction apparatus according to an embodiment of the present application.
Detailed Description
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but the example embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will further be understood that the terms "comprise" and/or "made of", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
The embodiments described herein may be described with reference to plan and/or cross-sectional views by way of idealized schematic illustrations of the present disclosure. Accordingly, the example illustrations may be modified according to manufacturing techniques and/or tolerances. Therefore, the embodiments are not limited to those shown in the drawings but include modifications of configurations formed on the basis of manufacturing processes. The regions illustrated in the figures thus have schematic properties, and the shapes of the regions shown in the figures illustrate the specific shapes of regions of elements but are not intended to be limiting.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will further be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the related art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The traditional intra prediction approach, taking H.265 as an example, defines 35 prediction modes on the basis of the PU (Prediction Unit); a PU can be divided into TUs (Transform Units) in the form of a quadtree, and all TUs within a PU share the same prediction mode. The H.265 intra prediction process is as follows: determine whether the adjacent reference pixels of the current TU are available and process them accordingly, filter the reference pixels, and calculate the predicted pixel values of the current TU from the filtered reference pixels. Traditional intra prediction has many prediction modes, which leads to a high computational overhead.
In the current related art, besides the traditional intra prediction approach, there are also intra prediction approaches that use convolutional neural networks from deep learning. In such approaches, however, convolution extracts features through local receptive fields, which to some extent ignores the correlation between coding blocks over longer distances. To obtain more accurate intra prediction values, the embodiments of the present disclosure propose an intra prediction method for video coding based on the Transformer network architecture.
The right part of Figure 1 shows a schematic structural diagram of a Transformer network according to an embodiment of the present disclosure.
As shown in Figure 1, the embodiments of the present disclosure use the Transformer network to perform intra prediction coding in place of traditional intra estimation and intra prediction coding, and the output predicted image is used for subsequent quantization operations. The intra prediction process of the embodiments of the present disclosure is described in detail below with reference to Figures 1 and 2.
An embodiment of the present disclosure provides an intra prediction method applied to a Transformer network. As shown in Figures 1 and 2, the method includes the following steps S11 to S15.
In step S11, the image to be predicted is divided into a preset number of image blocks, and an image block sequence including the image blocks is generated.
The width of the image to be predicted is W and its height is H. In this step, for example, the Extracted Patches module shown in Figure 1 divides the image to be predicted of size W*H into a preset number S of equally sized image blocks (patches) Pi (i = 1, 2, ..., S), and the S image blocks form the image block sequence P = [P1, P2, ..., Ps].
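By way of illustration only, the patch extraction of step S11 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes a single-channel W*H image stored as an H x W numpy array, and the function name extract_patches is invented here.

```python
import numpy as np

def extract_patches(image: np.ndarray, num: int):
    """Split an H x W image into S = num*num equal image blocks, ordered
    left to right, top to bottom (raster order), as in steps S111-S112."""
    H, W = image.shape
    ph, pw = H // num, W // num          # each block is (H/num) x (W/num)
    blocks = []
    for r in range(num):                 # top to bottom
        for c in range(num):             # left to right
            blocks.append(image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw])
    return blocks                        # the image block sequence P = [P1, ..., Ps]
```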
In step S12, dimension processing is performed on the image block sequence to obtain an image block embedding output sequence.
In this step, for example, the Embedding module shown in Figure 1 passes each image block Pi in the image block sequence P = [P1, P2, ..., Ps] obtained in step S11 through a fully connected layer to obtain a one-dimensional first feature sequence Px_i = [x1, x2, ..., xt] of dimension dt; the dimension of each image block is dt, and the Embedding module outputs the image block embedding output sequence Px = [Px_1, Px_2, ..., Px_s].
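The embedding step can likewise be sketched as a single linear projection per flattened block; in this illustrative sketch the weight matrix W_emb is a random placeholder standing in for the learned parameters of the fully connected layer.

```python
import numpy as np

def embed_patches(blocks, d_t: int, seed: int = 0):
    """Project each flattened image block to a d_t-dimensional first
    feature sequence Px_i, forming the (S, d_t) sequence Px."""
    rng = np.random.default_rng(seed)
    flat = np.stack([b.reshape(-1) for b in blocks])     # (S, H*W/S)
    W_emb = rng.normal(0.0, 0.02, (flat.shape[1], d_t))  # placeholder for learned weights
    return flat @ W_emb                                  # Px: (S, d_t)
```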
In step S13, the image to be predicted is encoded according to the image block embedding output sequence and the first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes the first intra-frame global information.
For example, the Encoder module shown in Figure 1 (which is formed by stacking N identical encoder sub-modules) encodes the image to be predicted N times; the number of encoding times N is preconfigured, and N is an integer greater than 1. In this step, the Encoder module encodes the image to be predicted according to the image block embedding output sequence Px and the first position information of the image blocks (Positional Encoding 1), obtaining the image block encoding output sequence Pe = [Pe_1, Pe_2, ..., Pe_s]. Each image block yields a one-dimensional second feature sequence Pe_i (i = 1, 2, ..., S) of dimension dt, and the second feature sequences Pe_i of the image blocks form the image block encoding output sequence Pe.
It should be noted that the image block encoding output sequence Pe includes the first intra-frame global information, which is generated during the encoding of the image to be predicted.
In step S14, the image block encoding output sequence is decoded according to the second position information of the image blocks and the previously predicted image block prediction sequence, obtaining the current image block prediction sequence, where the current image block prediction sequence includes the second intra-frame global information.
For example, the Decoder module shown in Figure 1 (which is formed by stacking M identical decoder sub-modules) decodes the image block encoding output sequence Pe; the number of decoding times M is preconfigured, and M is an integer greater than 1. In this step, the Decoder module decodes the image block encoding output sequence Pe according to the second position information of the image blocks (Positional Encoding 2) and the previously predicted image block prediction sequence Pd' = [Pd_1, Pd_2, ..., Pt-1_s], obtaining the current image block prediction sequence Pd = [Pd_1, Pd_2, ..., Pd_s]. Each image block yields a one-dimensional third feature sequence Pd_i (i = 1, 2, ..., S) of dimension dt, and the third feature sequences Pd_i of the image blocks form the current image block prediction sequence Pd.
It should be noted that the current image block prediction sequence Pd includes the second intra-frame global information, which is generated during the decoding of the image block encoding output sequence Pe.
In step S15, a predicted image is generated according to the current image block prediction sequence.
For example, the Fusion module shown in Figure 1 performs dimension conversion processing on the current image block prediction sequence Pd to unify its dimensions and then splices the results, obtaining a predicted image of width W and height H.
The intra prediction method provided by the embodiments of the present disclosure is applied to a Transformer network and includes: dividing an image to be predicted into a preset number of image blocks, and generating an image block sequence including the image blocks; performing dimension processing on the image block sequence to obtain an image block embedding output sequence; encoding the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes first intra-frame global information; decoding the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, where the current image block prediction sequence includes second intra-frame global information; and generating a predicted image according to the current image block prediction sequence. The embodiments of the present disclosure implement intra prediction coding through the Transformer network, which both utilizes the local information within each image block and uses the self-attention layers in the Transformer to obtain intra-frame global information, effectively overcoming the limitations caused by convolutional inductive bias, making the information interaction more complete and thus obtaining the intra-coded predicted image more accurately.
The self-attention mechanism is the core of the Transformer network and is also an important means by which the embodiments of the present disclosure obtain intra-frame global information. Exemplary processes for obtaining the first intra-frame global information and the second intra-frame global information are described below with reference to Figures 3 and 4, respectively.
In some embodiments, the first intra-frame global information is calculated from the image block embedding output sequence Px to which the first position information has been added. As shown in Figure 3, the encoder sub-module first performs self-attention processing on the image block embedding output sequence Px with the first position information added, obtaining the first intra-frame global information; after addition with Px and normalization, followed by feed-forward processing and a further addition step, the encoding of the image to be predicted is completed and the image block encoding output sequence Pe is obtained.
In some embodiments, the second intra-frame global information is calculated from the image block encoding output sequence Pe and the previously predicted image block prediction sequence Pd' to which the second position information has been added. As shown in Figure 4, the decoder sub-module first performs a first self-attention processing on the previously predicted image block prediction sequence Pd' with the second position information added; after addition with Pd' and normalization, it performs a second self-attention processing based on the image block encoding output sequence Pe; after addition with Pe, normalization and feed-forward processing, the decoding process is completed and the current image block prediction sequence Pd is obtained.
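For illustration, the decoder sub-module flow just described can be sketched with standard building blocks. The following PyTorch-style sketch is an assumption-laden illustration rather than the disclosed implementation: single-head attention, a feed-forward width of 4*dt, and the class name DecoderSubmodule are all choices made here for clarity.

```python
import torch
import torch.nn as nn

class DecoderSubmodule(nn.Module):
    def __init__(self, d_t: int, n_heads: int = 1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_t, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_t, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_t, 4 * d_t), nn.ReLU(),
                                 nn.Linear(4 * d_t, d_t))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_t) for _ in range(3))

    def forward(self, pd_prev, pe):
        # first self-attention on Pd' (second position information already added)
        x = self.norm1(pd_prev + self.self_attn(pd_prev, pd_prev, pd_prev)[0])
        # second attention based on the encoder output Pe
        x = self.norm2(x + self.cross_attn(x, pe, pe)[0])
        # feed-forward, then the final addition and normalization
        return self.norm3(x + self.ffn(x))
```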
Taking the first intra-frame global information as an example, its calculation process is described below. The first intra-frame global information can be calculated by the following formula (1):

Attention(Q, K, V) = softmax(Q·K^T / sqrt(dt)) · V    (1)

where Attention is the first intra-frame global information, dt is the dimension of the first feature sequence Px_i, softmax() is the activation function, and the Q, K, V matrices represent the weight values (i.e., dependencies) between the first feature sequences Px_i of each dimension to which the first position information has been added. The Q, K, V matrices are obtained by matrix transformation (multiplication) of the image block embedding output sequence Px with three preset matrices, and the parameters in the three preset matrices can be obtained in a learnable manner. It can be seen from this that the Q, K, V matrices are related only to the image block embedding output sequence Px, i.e., this is self-attention over the image block embedding output sequence Px.
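As a minimal illustration, formula (1) can be transcribed directly into numpy; the sketch below assumes Px has shape (S, dt) and that Wq, Wk and Wv stand in for the three preset (learnable) matrices.

```python
import numpy as np

def self_attention(P_x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Compute formula (1) over the embedding output sequence Px."""
    Q, K, V = P_x @ Wq, P_x @ Wk, P_x @ Wv          # matrix transformation (multiplication)
    d_t = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_t)                 # Q·K^T / sqrt(d_t)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax()
    return weights @ V                              # Attention(Q, K, V)
```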
In some embodiments, as shown in Figure 5, dividing the image to be predicted into a preset number of image blocks and generating an image block sequence including the image blocks (i.e., step S11) includes the following steps S111 to S112.
In step S111, the image to be predicted is divided into a preset number of equally sized image blocks.
The size of the image to be predicted is W*H. If the image of size W*H is divided into, for example, num*num = S image blocks, then the size of each image block is (W/num)*(H/num), i.e., each image block has width W/num and height H/num.
In step S112, the image blocks are sorted in order from left to right and top to bottom to generate the image block sequence.
Through steps S111-S112, the traditional coding approach that requires computing a CTU (Coding Tree Unit) block partition is abandoned; instead, the image is directly divided into equal blocks, which improves the efficiency of image block division and solves the problem of the high computational overhead of traditional intra prediction.
In some embodiments, as shown in Figure 6, generating a predicted image according to the current image block prediction sequence (i.e., step S15) includes the following steps S151 to S153.
In step S151, the current image block prediction sequence is linearized to obtain a first sequence, where the first sequence includes a one-dimensional array for each image block.
In step S152, the one-dimensional array of each image block is converted into a two-dimensional matrix, and a second sequence is generated from the two-dimensional matrices.
In step S153, according to the second sequence, the two-dimensional matrices are spliced in order from left to right and top to bottom to obtain the predicted image.
As shown in Figures 6 and 7, the Fusion module may include three processing units: Linear, Reshape and Concat. Linear linearizes the input current image block prediction sequence Pd = [Pd_1, Pd_2, ..., Pd_s] to obtain the first sequence PL = [PL_1, PL_2, ..., PL_s], where the dimension of each PL_i (i = 1, 2, ..., S) is H*W/S. The first sequence PL is input to Reshape, which converts each one-dimensional array PL_i into a two-dimensional matrix PR_i of width W/num and height H/num; the two-dimensional matrices form the second sequence PR = [PR_1, PR_2, ..., PR_s]. The second sequence PR is input to Concat, which splices the second sequence PR in order from left to right and top to bottom into the predicted image of width W and height H, the final predicted image output by the Transformer network, which is provided for subsequent quantization processing.
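The Linear, Reshape and Concat stages can be sketched end to end as follows; this illustrative sketch assumes Pd has shape (S, dt) and W_lin stands in for the learned Linear weights of shape (dt, H*W/S).

```python
import numpy as np

def fuse(P_d: np.ndarray, W_lin: np.ndarray, W: int, H: int, num: int) -> np.ndarray:
    """Fusion module sketch: Linear -> Reshape -> Concat -> W x H image."""
    S = num * num
    P_L = P_d @ W_lin                         # Linear: each PL_i has H*W/S elements
    P_R = P_L.reshape(S, H // num, W // num)  # Reshape: 1-D arrays -> 2-D matrices
    rows = [np.concatenate(P_R[r * num:(r + 1) * num], axis=1)  # left to right
            for r in range(num)]
    return np.concatenate(rows, axis=0)       # then top to bottom -> (H, W) image
```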
In some embodiments, when different images to be predicted are encoded a different number of times and/or decoded a different number of times, N and M are determined through the following steps S21 and S22.
In step S21, the texture complexity of the image to be predicted is calculated.
In some embodiments, the texture complexity is the variance σ² of the gray-level histogram of the image, and it is calculated, for example, by the texture complexity estimation module shown in Figure 1. The texture complexity can be calculated by the following formula (2):

σ² = Σ_{i=0}^{L-1} (z_i - m)² · p(z_i)    (2)

where z represents the gray level of the image to be predicted, p(z_i) is the corresponding histogram, and L is the number of gray levels. m is the mean value of z, which can be calculated by the following formula (3):

m = Σ_{i=0}^{L-1} z_i · p(z_i)    (3)
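For illustration, formulas (2) and (3) amount to the variance of the normalized gray-level histogram; the following sketch assumes an 8-bit image (L = 256 gray levels), and the function name texture_complexity is chosen here for clarity.

```python
import numpy as np

def texture_complexity(image: np.ndarray, L: int = 256) -> float:
    """Variance of the gray-level histogram, per formulas (2) and (3)."""
    hist, _ = np.histogram(image, bins=L, range=(0, L))
    p = hist / hist.sum()                    # p(z_i): normalized histogram
    z = np.arange(L)                         # gray levels z_i
    m = np.sum(z * p)                        # formula (3): mean of z
    return float(np.sum((z - m) ** 2 * p))   # formula (2): variance sigma^2
```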
It should be noted that the texture complexity is not limited to the above calculation; it can also be computed in other ways, such as gradient-based calculation or deep-learning-based methods.
In step S22, N and M are determined according to the texture complexity and preset reference thresholds, where N and M are each one of the preconfigured thresholds.
In some embodiments, the reference thresholds include a first reference threshold and a second reference threshold, and determining N and M according to the texture complexity and the preset reference thresholds (i.e., step S22) includes the following steps: when the texture complexity is less than the first reference threshold, determining N to be the preconfigured first encoding threshold N1 and M to be the preconfigured first decoding threshold M1; when the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, determining N to be the preconfigured second encoding threshold N2 and M to be the preconfigured second decoding threshold M2; and when the texture complexity is greater than the second reference threshold, determining N to be the preconfigured third encoding threshold N3 and M to be the preconfigured third decoding threshold M3, where N3>N2>N1 and M3>M2>M1, and N1, N2, N3 and M1, M2, M3 are set according to the actual application.
For example, the threshold judgment module in Figure 1 performs the texture complexity judgment; a sketch of this selection logic is given below. If the texture complexity is less than the first reference threshold, the image to be predicted has a weak texture, and the number of encoding times N and the number of decoding times M are set to smaller values (N1 and M1); if the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, the image to be predicted has a medium texture, and N and M are set to intermediate values (N2 and M2); if the texture complexity is greater than the second reference threshold, the image to be predicted has a strong texture, and N and M are set to larger values (N3 and M3).
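In this sketch the concrete threshold and depth values in the signature are placeholders, since the disclosure leaves N1-N3, M1-M3 and the reference thresholds to be set per application.

```python
def select_depths(sigma2: float, t1: float, t2: float,
                  N=(2, 4, 6), M=(2, 4, 6)):
    """Pick the encoder/decoder stack depths N, M from preconfigured values
    according to the texture complexity sigma2 (placeholder depths)."""
    if sigma2 < t1:            # weak texture
        return N[0], M[0]      # N1, M1
    if sigma2 <= t2:           # medium texture (t1 <= sigma2 <= t2)
        return N[1], M[1]      # N2, M2
    return N[2], M[2]          # strong texture: N3, M3
```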
Through the above steps S21-S22, dynamic adjustment of the number of encoding times N (i.e., the number of stacked encoder sub-modules) and the number of decoding times M (i.e., the number of stacked decoder sub-modules) can be achieved. In this case, the Transformer network includes the texture complexity estimation module and the threshold judgment module, and the Transformer network is a Dynamic Transformer network. It should be noted that N and M can also be configured as constants, without dynamic adjustment according to the image to be predicted; in that case, the Transformer network does not include the texture complexity estimation module and the threshold judgment module.
In some embodiments, the first position information and the second position information may be calculated by the following formulas (4) and (5):

PE(pos, 2i) = sin(pos / 10000^(2i/dt))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/dt))    (5)

where pos represents the number of the image block, PE(pos, 2i) is the position of even-numbered image blocks, PE(pos, 2i+1) is the position of odd-numbered image blocks, and i represents the index over the dt dimensions.
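Formulas (4) and (5) match the sinusoidal position encoding commonly used with Transformer networks; the following sketch assumes an even dt.

```python
import numpy as np

def position_encoding(S: int, d_t: int) -> np.ndarray:
    """Sinusoidal position information per formulas (4) and (5); even d_t assumed."""
    pe = np.zeros((S, d_t))
    pos = np.arange(S)[:, None]                     # image block number
    i = np.arange(0, d_t, 2)[None, :]               # dimension index
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / d_t))  # formula (4), even positions
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / d_t))  # formula (5), odd positions
    return pe
```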
It should be noted that the first position information and the second position information can also be obtained by deep learning.
In the embodiments of the present disclosure, cross entropy can be used as the loss function to train the Transformer network; other loss functions, such as the L1 loss function or the L2 loss function, can also be used.
The embodiments of the present disclosure are applied to a Transformer network, which can be replaced with other Transformer variants, such as swin-Transformer, Sparse Transformer and Image Transformer networks.
The embodiments of the present disclosure provide an intra prediction coding method based on a Dynamic Transformer network. It abandons the traditional coding approach that requires computing a CTU block partition and instead divides the image directly into equal patches, improving block division efficiency. Intra prediction coding is realized through the Transformer network, which utilizes the local information within each image block and obtains intra-frame global information through the self-attention layers of the Transformer network, making the information interaction in the network more complete, so that the intra-coded predicted image is obtained more accurately and the coding quality is guaranteed. The Dynamic Transformer combines an evaluation of intra-frame texture complexity with reference-threshold judgments, so that the number of stacked encoder/decoder sub-modules can be adjusted adaptively, dynamically changing the network depth and reducing the computing resources required. Compared with a CNN (convolutional neural network), the Transformer network can obtain intra-frame global information with a shallower model depth and occupies fewer overall computing resources.
Based on the same technical concept, an embodiment of the present disclosure further provides an intra prediction apparatus. The intra prediction apparatus is a Transformer network device. As shown in Figure 9, the intra prediction apparatus includes a dividing module 101, a dimension processing module 102, an encoding module 103, a decoding module 104 and a generating module 105.
The dividing module 101 is configured to divide an image to be predicted into a preset number of image blocks and generate an image block sequence including the image blocks.
The dimension processing module 102 is configured to perform dimension processing on the image block sequence to obtain an image block embedding output sequence.
The encoding module 103 is configured to encode the image to be predicted according to the image block embedding output sequence and the first position information of the image blocks to obtain an image block encoding output sequence, where the image block encoding output sequence includes the first intra-frame global information.
The decoding module 104 is configured to decode the image block encoding output sequence according to the second position information of the image blocks and the previously predicted image block prediction sequence to obtain the current image block prediction sequence, where the current image block prediction sequence includes the second intra-frame global information.
The generating module 105 is configured to generate a predicted image according to the current image block prediction sequence.
In some embodiments, the first intra-frame global information is calculated from the image block embedding output sequence to which the first position information has been added.
In some embodiments, the second intra-frame global information is calculated from the image block encoding output sequence and the previously predicted image block prediction sequence to which the second position information has been added.
In some embodiments, the dividing module 101 is configured to divide the image to be predicted into a preset number of equally sized image blocks, and to sort the image blocks in order from left to right and top to bottom to generate the image block sequence.
In some embodiments, the generating module 105 is configured to linearize the current image block prediction sequence to obtain a first sequence, where the first sequence includes a one-dimensional array for each image block; to convert the one-dimensional array of each image block into a two-dimensional matrix and generate a second sequence from the two-dimensional matrices; and, according to the second sequence, to splice the two-dimensional matrices in order from left to right and top to bottom to obtain the predicted image.
In some embodiments, the encoding module 103 is configured to encode the image to be predicted N times, and the decoding module 104 is configured to decode the image block encoding output sequence M times; N and M are preconfigured, and N and M are integers greater than 1.
In some embodiments, as shown in Figure 10, the intra prediction apparatus further includes a codec number determination module 106, which is configured to calculate the texture complexity of the image to be predicted and to determine N and M according to the texture complexity and preset reference thresholds, where N and M are each one of the preconfigured thresholds.
In some embodiments, the reference thresholds include a first reference threshold and a second reference threshold.
The codec number determination module 106 is configured to: when the texture complexity is less than the first reference threshold, determine N to be the preconfigured first encoding threshold N1 and M to be the preconfigured first decoding threshold M1; when the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, determine N to be the preconfigured second encoding threshold N2 and M to be the preconfigured second decoding threshold M2; and when the texture complexity is greater than the second reference threshold, determine N to be the preconfigured third encoding threshold N3 and M to be the preconfigured third decoding threshold M3, where N3>N2>N1 and M3>M2>M1.
An embodiment of the present disclosure further provides a computer device, including one or more processors and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the intra prediction method provided in the foregoing embodiments.
An embodiment of the present disclosure further provides a computer-readable medium having a computer program stored thereon, where the computer program, when executed, implements the intra prediction method provided in the foregoing embodiments.
Those of ordinary skill in the art will understand that all or some of the steps in the methods disclosed above and the functional modules/units in the apparatus can be implemented as software, firmware, hardware and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, it is well known to those of ordinary skill in the art that communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a general, illustrative sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless expressly stated otherwise, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention as set forth in the appended claims.

Claims (11)

  1. An intra prediction method, the method being applied to a Transformer network and comprising:
    dividing an image to be predicted into a preset number of image blocks, and generating an image block sequence comprising the image blocks;
    performing dimension processing on the image block sequence to obtain an image block embedding output sequence;
    encoding the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, wherein the image block encoding output sequence comprises first intra-frame global information;
    decoding the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, wherein the current image block prediction sequence comprises second intra-frame global information; and
    generating a predicted image according to the current image block prediction sequence.
  2. The method of claim 1, wherein the first intra-frame global information is calculated from an image block embedding output sequence to which the first position information has been added.
  3. The method of claim 1, wherein the second intra-frame global information is calculated from the image block encoding output sequence and a previously predicted image block prediction sequence to which the second position information has been added.
  4. The method of claim 1, wherein dividing the image to be predicted into a preset number of image blocks and generating an image block sequence comprising the image blocks comprises:
    dividing the image to be predicted into a preset number of equally sized image blocks; and
    sorting the image blocks in order from left to right and top to bottom to generate the image block sequence.
  5. The method of claim 1, wherein generating a predicted image according to the current image block prediction sequence comprises:
    linearizing the current image block prediction sequence to obtain a first sequence, wherein the first sequence comprises a one-dimensional array for each image block;
    converting the one-dimensional array of each image block into a two-dimensional matrix, and generating a second sequence from the two-dimensional matrices; and
    splicing the two-dimensional matrices, according to the second sequence, in order from left to right and top to bottom to obtain the predicted image.
  6. The method of any one of claims 1-5, wherein encoding the image to be predicted comprises encoding the image to be predicted N times;
    decoding the image block encoding output sequence comprises decoding the image block encoding output sequence M times; and
    N and M are preconfigured, and N and M are integers greater than 1.
  7. The method of claim 6, wherein, when different images to be predicted are encoded a different number of times and/or decoded a different number of times, N and M are determined by:
    calculating a texture complexity of the image to be predicted; and
    determining N and M according to the texture complexity and preset reference thresholds, wherein N and M are each one of the preconfigured thresholds.
  8. The method of claim 7, wherein the reference thresholds comprise a first reference threshold and a second reference threshold, and determining N and M according to the texture complexity and the preset reference thresholds comprises:
    when the texture complexity is less than the first reference threshold, determining N to be a preconfigured first encoding threshold N1 and determining M to be a preconfigured first decoding threshold M1;
    when the texture complexity is greater than or equal to the first reference threshold and less than or equal to the second reference threshold, determining N to be a preconfigured second encoding threshold N2 and determining M to be a preconfigured second decoding threshold M2; and
    when the texture complexity is greater than the second reference threshold, determining N to be a preconfigured third encoding threshold N3 and determining M to be a preconfigured third decoding threshold M3;
    wherein N3>N2>N1 and M3>M2>M1.
  9. An intra prediction apparatus, the apparatus being a Transformer network device and comprising a dividing module, a dimension processing module, an encoding module, a decoding module and a generating module, wherein the dividing module is configured to divide an image to be predicted into a preset number of image blocks and generate an image block sequence comprising the image blocks;
    the dimension processing module is configured to perform dimension processing on the image block sequence to obtain an image block embedding output sequence;
    the encoding module is configured to encode the image to be predicted according to the image block embedding output sequence and first position information of the image blocks to obtain an image block encoding output sequence, wherein the image block encoding output sequence comprises first intra-frame global information;
    the decoding module is configured to decode the image block encoding output sequence according to second position information of the image blocks and a previously predicted image block prediction sequence to obtain a current image block prediction sequence, wherein the current image block prediction sequence comprises second intra-frame global information; and
    the generating module is configured to generate a predicted image according to the current image block prediction sequence.
  10. A computer device, comprising:
    one or more processors; and
    a storage device having one or more programs stored thereon;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the intra prediction method of any one of claims 1-8.
  11. A computer-readable medium having a computer program stored thereon, wherein the program, when executed, implements the intra prediction method of any one of claims 1-8.
PCT/CN2023/110099 2022-08-01 2023-07-31 Intra prediction method and apparatus, computer device, and readable medium WO2024027616A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210914755.7 2022-08-01
CN202210914755.7A CN117544774A (zh) Intra prediction method and apparatus, computer device, and readable medium

Publications (1)

Publication Number Publication Date
WO2024027616A1 (zh) 2024-02-08

Family

ID=89794429

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110099 WO2024027616A1 (zh) 2022-08-01 2023-07-31 Intra prediction method and apparatus, computer device, and readable medium

Country Status (2)

Country Link
CN (1) CN117544774A (zh)
WO (1) WO2024027616A1 (zh)

Citations (4)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180376147A1 (en) * 2016-02-17 2018-12-27 Nippon Hoso Kyokai Encoding device, decoding device, and program
CN110650349A (zh) * 2018-06-26 2020-01-03 中兴通讯股份有限公司 一种图像编码方法、解码方法、编码器、解码器及存储介质
CN114286093A (zh) * 2021-12-24 2022-04-05 杭州电子科技大学 一种基于深度神经网络的快速视频编码方法
CN114550033A (zh) * 2022-01-29 2022-05-27 珠海横乐医学科技有限公司 视频序列导丝分割方法、装置、电子设备及可读介质

Also Published As

Publication number Publication date
CN117544774A (zh) 2024-02-09

Similar Documents

Publication Publication Date Title
Liu et al. Parallel fractal compression method for big video data
US10009611B2 (en) Visual quality measure for real-time video processing
CN105706449A (zh) 样本自适应偏移控制
CN105721878A (zh) Hevc视频编解码中执行帧内预测的图像处理装置及方法
US20120183043A1 (en) Method for Training and Utilizing Separable Transforms for Video Coding
US9014499B2 (en) Distributed source coding using prediction modes obtained from side information
CN103188494A (zh) 跳过离散余弦变换对深度图像编码/解码的设备和方法
US11394966B2 (en) Video encoding and decoding method and apparatus
US20230362378A1 (en) Video coding method and apparatus
Suzuki et al. Image pre-transformation for recognition-aware image compression
JP2021150955A (ja) 訓練方法、画像符号化方法、画像復号化方法及び装置
WO2021114100A1 (zh) 帧内预测方法、视频编码、解码方法及相关设备
Rhee et al. Channel-wise progressive learning for lossless image compression
WO2024027616A1 (zh) 帧内预测方法、装置、计算机设备及可读介质
He et al. End-to-end facial image compression with integrated semantic distortion metric
US20240163485A1 (en) Multi-distribution entropy modeling of latent features in image and video coding using neural networks
WO2023203509A1 (en) Image data compression method and device using segmentation and classification
TW202337211A (zh) 條件圖像壓縮
CN116193140A (zh) 基于lcevc的编码方法、解码方法及译码设备
EP3154023A1 (en) Method and apparatus for de-noising an image using video epitome
WO2022061563A1 (zh) 视频编码方法、装置及计算机可读存储介质
Sun et al. YOCO: Light-weight rate control model learning
WO2022183345A1 (zh) 编码方法、解码方法、编码器、解码器以及存储介质
WO2024083249A1 (en) Method, apparatus, and medium for visual data processing
WO2023092388A1 (zh) 解码方法、编码方法、解码器、编码器和编解码系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23849324

Country of ref document: EP

Kind code of ref document: A1