WO2024077797A1

WO2024077797A1 - Method and system for retargeting image

Info

Publication number: WO2024077797A1
Application number: PCT/CN2022/144301
Authority: WO
Inventors: Olgierd Stankiewicz; Slawomir ROZEK; Slawomir Mackowiak; Tomasz Grajek; Marek Domanski
Original assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date: 2022-10-11
Filing date: 2022-12-30
Publication date: 2024-04-18

Abstract

Methods and systems for retargeting an image based on regions of interest. According to certain embodiments, an exemplary encoding method may include: determining one or more regions of interest (ROIs) in an image; performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and compressing the downscaled image.

Description

METHOD AND SYSTEM FOR RETARGETING IMAGE

TECHNICAL FIELD

The present disclosure generally relates to image data processing, and more particularly, to methods and systems of compressing image data by using image retargeting techniques based on regions of interest.

BACKGROUND

Limited communication bandwidth poses challenge to transmitting images or videos (hereinafter collectively referred to as “image data” ) at desired quality. Although various encoding techniques have been developed to compress the image data before it is transmitted, the increasing demand for massive amount of image data has put a constant pressure on the image/video systems, particularly mobile devices that frequently need to work in networks with limited bandwidth.

SUMMARY

Embodiments of the present disclosure relate to methods of compressing image data by retargeting an image based on regions of interest. In some embodiments, a computer-implemented encoding method is provided. The method may include: determining one or more regions of interest (ROIs) in an image; performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and compressing the downscaled image.

In some embodiments, a device for encoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: determining one or more regions of interest (ROIs) in an image; performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and compressing the downscaled image.

In some embodiments, a computer-implemented decoding method is provided. The method may include: decoding a bitstream including: position information associated with one or more regions of interest (ROIs) , and information indicating one or more downscaling factors associated with the one or more ROIs; determining a reconstructed image; and performing, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.

In some embodiments, a device for decoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: decoding a bitstream including: position information associated with one or more regions of interest (ROIs) , and information indicating one or more downscaling factors associated with the one or more ROIs; determining a reconstructed image; and performing, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.

Aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.

Additional objects and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The objects and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an exemplary system for encoding and decoding image data, consistent with embodiments of the present disclosure.

FIG. 2 is a block diagram showing an exemplary encoding process, consistent with embodiments of the present disclosure.

FIG. 3 is a block diagram showing an exemplary decoding process, consistent with embodiments of the present disclosure.

FIG. 4 is a block diagram of an exemplary apparatus for encoding or decoding image data, consistent with embodiments of the present disclosure.

FIG. 5 is a schematic diagram illustrating an exemplary method for compressing image data using the image retargeting technique, consistent with embodiments of the present disclosure.

FIG. 6 is a schematic diagram illustrating an image including multiple ROIs, consistent with embodiments of the present disclosure.

FIG. 7 is a schematic diagram illustrating an exemplary process for performing retargeting and inverse retargeting of an image, consistent with embodiments of the present disclosure.

FIG. 8 is a flowchart of an exemplary method for encoding image data, consistent with embodiments of the present disclosure.

FIG. 9 is a schematic diagram illustrating an input image partitioned into a plurality of blocks by a grid, consistent with embodiments of the present disclosure.

FIG. 10A is a schematic diagram illustrating a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 9, consistent with embodiments of the present disclosure.

FIG. 10B is a schematic diagram illustrating a process for assigning initial vertical scaling factors to the plurality of blocks partitioned in FIG. 9, consistent with embodiments of the present disclosure.

FIG. 11A is a schematic diagram illustrating a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 9, consistent with embodiments of the present disclosure.

FIG. 11B is a schematic diagram illustrating a process for determining a commonly-agreeable vertical downscaling factor for each row of the blocks in FIG. 9, consistent with embodiments of the present disclosure.

FIG. 12 is a schematic diagram illustrating a process for performing downscaling and upscaling of a grid, consistent with embodiments of the present disclosure.

FIG. 13 is a schematic diagram illustrating a process for performing retargeting and inverse retargeting of an image, consistent with embodiments of the present disclosure.

FIG. 14 is a flowchart of an exemplary method for decoding image data, consistent with embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.

Example embodiments of present disclosure provide image data compression methods and systems that use image retargeting techniques based on regions of interest. Image retargeting involves rescaling (i.e., resizing) an image based on the contents of the image, so as to preserve important regions and minimize distortions. It can be used to increase the ratio of the sizes of certain regions of interests (ROIs) over the entire size of the rescaled image. For example, during retargeting, the sizes of the ROIs in an image may remain the same or be downscaled at a less degree, whereas the rest of the image is more heavily downscaled. This way, after the retargeting, the ROIs can occupy a greater ratio of the retargeted image. As illustrated by the following exemplary embodiments, the image retargeting may be performed before an image (i.e., herein referred to as “original image” ) is compressed (i.e., encoded) , such that more bits are allocated to encoding the ROIs in the retargeted image. At the decoder side, the retargeted image is decoded and reconstructed. In some embodiments, the reconstructed images may be directly used and analyzed by downstream machine vision applications, such as object detection and recognition. In some embodiments, the reconstructed image is first inversely retargeted to restore the original image, before being used and analyzed by the downstream machine vision applications.

FIG. 1 is a block diagram illustrating a system 100 for encoding and decoding image data, according to some disclosed embodiments. The image data may include an image (also called a “picture” or “frame” ) , multiple images, or a video. An image is a static picture. Multiple images may be related or unrelated, either spatially or temporary. A video includes a set of images arranged in a temporal sequence.

As shown in FIG. 1, system 100 includes a source device 120 that provides encoded image data to be decoded by a destination device 140. Consistent with the disclosed embodiments, each of source device 120 and destination device 140 may include any of a wide range of devices, including a desktop computer, a notebook (e.g., laptop) computer, a server, a tablet computer, a set-top box, a mobile phone, a vehicle, a camera, an image sensor, a robot, a television, a camera, a wearable device (e.g., a smart watch or a wearable camera) , a display device, a digital media player, a video gaming console, a video streaming device, or the like. Source device 120 and destination device 140 may be equipped for wireless or wired communication.

Referring to FIG. 1, source device 120 may include an image/video source 122, an image/video encoder 124, and an output interface 126. Destination device 140 may include an input interface 142, an image/video decoder 144, and one or more image/video applications 146. Image/video source 122 of source device 120 may include an image/video capture device, such as a camera, an image/video archive containing previously captured video, or an image/video feed interface to receive image/video data from a content provider. As a further alternative, image/video source 122 may generate computer graphics-based data as the source image/video, or a combination of live image/video, archived image/video, and computer-generated image/video. The captured, pre-captured, or computer-generated image data may be encoded by image/video encoder 124. The encoded image data may then be output by output interface 126 onto a communication medium 160. Image/video encoder 124 encodes the input image data and outputs an encoded bitstream 162 via output interface 126. Encoded bitstream 162 is transmitted through a communication medium 160, and received by input interface 142. Image/video decoder 144 then decodes encoded bitstream 162 to generate decoded data, which can be utilized by image/video applications 146.

Image/video encoder 124 and image/video decoder 144 each may be implemented as any of a variety of suitable encoder or decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs) , application specific integrated circuits (ASICs) , field programmable gate arrays (FPGAs) , discrete logic, software, hardware, firmware, or any combinations thereof. When the encoding or decoding is implemented partially in software, image/video encoder 124 or image/video decoder 144 may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques consistent this disclosure. Each of image/video encoder 124 or image/video decoder 144 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.

Image/video encoder 124 and image/video decoder 144 may operate according to any video coding standard, such as Advanced Video Coding (AVC) , High Efficiency Video Coding (HEVC) , Versatile Video Coding (VVC) , AOMedia Video 1 (AV1) , Joint Photographic Experts Group (JPEG) , Moving Picture Experts Group (MPEG) , etc. Alternatively, image/video encoder 124 and image/video decoder 144 may be customized devices that do not comply with the existing standards. Although not shown in FIG. 1, in some embodiments, image/video encoder 124 and image/video decoder 144 may each be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of both audio and video in a common data stream or separate data streams.

Output interface 126 may include any type of medium or device capable of transmitting encoded bitstream 162 from source device 120 to destination device 140. For example, output interface 126 may include a transmitter or a transceiver configured to transmit encoded bitstream 162 from source device 120 directly to destination device 140 in real-time. Encoded bitstream 162 may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 140.

Communication medium 160 may include transient media, such as a wireless broadcast or wired network transmission. For example, communication medium 160 may include a radio frequency (RF) spectrum or one or more physical transmission lines (e.g., a cable) . Communication medium 160 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. In some embodiments, communication medium 160 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 120 to destination device 140. For example, a network server (not shown) may receive encoded bitstream 162 from source device 120 and provide encoded bitstream 162 to destination device 140, e.g., via network transmission.

Communication medium 160 may also be in the form of a storage media (e.g., non-transitory storage media) , such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded image data. In some embodiments, a computing device of a medium production facility, such as a disc stamping facility, may receive encoded image data from source device 120 and produce a disc containing the encoded video data.

Input interface 142 may include any type of medium or device capable of receiving information from communication medium 160. The received information includes encoded bitstream 162. For example, input interface 142 may include a receiver or a transceiver configured to receive encoded bitstream 162 in real-time.

Image/video applications 146 include various hardware and/or software for utilizing the decoded image data generated by image/video decoder 144. For example, image/video applications 146 may include a display device that displays the decoded image data to a user and may include any of a variety of display devices such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , a plasma display, an organic light emitting diode (OLED) display, or another type of display device. As another example, image/video applications 146 may include one or more processors configured to use the decoded image data to perform various machine-vision applications, such as object recognition and tracking, face recognition, images matching, image/video search, augmented reality, robot vision and navigation, autonomous driving, 3-dimension structure construction, stereo correspondence, motion tracking, etc.

As described above, image/video encoder 124 and image/video decoder 144 may or may not operate according to a video coding standard. An exemplary encoding process performable by image/video encoder 124 and an exemplary decoding process performable by image/video decoder 144 are described below in connection with FIG. 2 and FIG. 3, respectively.

FIG. 2 is a block diagram showing an exemplary encoding process 200, according to some embodiments of the present disclosure. Encoding process 200 can be performed by an encoder, such as image/video encoder 124 (FIG. 1) . Consistent with the disclosed embodiments, encoding process 200 may be performed after an input image (also referred to as “original image” ) is retargeted. Referring to FIG. 2, encoding process 200 includes a picture partitioning stage 210, an inter prediction stage 220, an intra prediction stage 225, a transform stage 230, a quantization stage 235, a rearrangement stage 260, an entropy coding stage 265, an inverse quantization stage 240, an inverse transform stage 245, a filter stage 250, and a buffer stage 255.

Picture partitioning stage 210 partitions an input picture 205 into at least one processing unit. The processing unit may be a prediction unit (PU) , a transform unit (TU) , or a coding unit (CU) . Picture partitioning stage 210 partitions a picture into a combination of a plurality of coding units, prediction units, and transform units. A picture may be encoded by selecting a combination of a coding unit, a prediction unit, and a transform unit based on a predetermined criterion (e.g., a cost function) .

For example, one picture may be partitioned into a plurality of coding units. In order to partition the coding units in a picture, a recursive tree structure such as a quad tree structure may be used. Specifically, with the recursive tree structure, a picture serves a root node and is recursively partitioned into smaller regions, i.e., child nodes. A coding unit that is not partitioned any more according to a predetermined restriction becomes a leaf node. For example, when it is assumed that only square partitioning is possible for a coding unit, the coding unit is partitioned into up to four different coding units.

It may be determined whether to use inter prediction or to perform intra prediction for a prediction unit, and determine specific information (e.g., intra prediction mode, motion vector, reference picture, etc. ) according to the prediction method. A processing unit for performing prediction may be different from a processing unit for determining a prediction method and specific content. For example, a prediction method and a prediction mode may be determined in a prediction unit, and prediction may be performed in a transform unit. A residual coefficient (residual block) between the reconstructed prediction block and the original block may be input into transform stage 230. In addition, prediction mode information, motion vector information and the like used for prediction may be encoded by the entropy coding stage 265 together with the residual coefficient and transferred to a decoder. In some implementations, an original block may be encoded and transmitted to a decoder without generating a prediction block through inter prediction stage 220 and intra prediction stage 225.

Inter prediction stage 220 predicts a prediction unit based on information in at least one picture before or after the current picture. In some embodiments, inter prediction stage 220 predicts a prediction unit based on information on a partial area that has been encoded in the current picture. Inter prediction stage 220 further includes a reference picture interpolation stage, a motion prediction stage, and a motion compensation stage (not shown) .

The reference picture interpolation stage receives reference picture information from buffer stage 255 and generates pixel information of an integer number of pixels or less from the reference picture. In the case of a luminance pixel, a DCT-based 8-tap interpolation filter with a varying filter coefficient is used to generate pixel information of an integer number of pixels or less by the unit of 1/4 pixels. In the case of a color difference signal, a DCT-based 4-tap interpolation filter with a varying filter coefficient is used to generate pixel information of an integer number of pixels or less by the unit of 1/8 pixels.

The motion prediction stage performs motion prediction based on the reference picture interpolated by the reference picture interpolation stage. Various methods such as a full search-based block matching algorithm (FBMA) , a three-step search (TSS) , and a new three-step search algorithm (NTS) may be used as a method of calculating a motion vector. The motion vector may have a motion vector value of a unit of 1/2 or 1/4 pixels based on interpolated pixels. The motion prediction stage predicts a current prediction unit based on a motion prediction mode. Various methods such as a skip mode, a merge mode, an advanced motion vector prediction (AMVP) mode, an intra-block copy mode and the like may be used as the motion prediction mode.

Intra prediction stage 225 generates a prediction unit based on the information on reference pixels in the neighborhood of the current block, which is pixel information in the current picture. When a block in the neighborhood of the current prediction unit is a block on which inter prediction has been performed and thus the reference pixel is a pixel on which inter prediction has been performed, the reference pixel included in the block on which inter prediction has been performed may be used in place of reference pixel information of a block in the neighborhood on which intra prediction has been performed. That is, when a reference pixel is unavailable, at least one reference pixel among available reference pixels may be used in place of unavailable reference pixel information.

In the intra prediction, the prediction mode may have an angular prediction mode that uses reference pixel information according to a prediction direction, and a non-angular prediction mode that does not use directional information when performing prediction. A mode for predicting luminance information may be different from a mode for predicting color difference information, and intra prediction mode information used to predict luminance information or predicted luminance signal information may be used to predict the color difference information.

If the size of the prediction unit is the same as the size of the transform unit when intra prediction is performed, the intra prediction may be performed for the prediction unit based on a pixel on the left side, a pixel on the top-left side, and a pixel on the top of the prediction unit. If the size of the prediction unit is different from the size of the transform unit when the intra prediction is performed, the intra prediction may be performed using a reference pixel based on the transform unit.

The intra prediction method generates a prediction block after applying an Adaptive Intra Smoothing (AIS) filter to the reference pixel according to a prediction mode. The type of the AIS filter applied to the reference pixel may vary. In order to perform the intra prediction method, the intra prediction mode of the current prediction unit may be predicted from the intra prediction mode of the prediction unit existing in the neighborhood of the current prediction unit. When a prediction mode of the current prediction unit is predicted using the mode information predicted from the neighboring prediction unit, the following method may be performed. If the intra prediction modes of the current prediction unit is the same as the prediction unit in the neighborhood, information indicating that the prediction modes of the current prediction unit are the same as the prediction unit in the neighborhood may be transmitted using predetermined flag information. If the prediction modes of the current prediction unit and the prediction unit in the neighborhood are different from each other, prediction mode information of the current block may be encoded by performing entropy coding.

In addition, a residual block, which is a difference value of the prediction unit with the original block, is generated. The generated residual block is input into transform stage 230.

Transform stage 230 transforms the residual block using a transform method such as Discrete Cosine Transform (DCT) or Discrete Sine Transform (DST) . The DCT transform core includes at least one among DCT2 and DCT8, and the DST transform core includes DST7. Whether or not to apply DCT or DST to transform the residual block may be determined based on intra prediction mode information of a prediction unit used to generate the residual block. The transform on the residual block may be skipped. A flag indicating whether or not to skip the transform on the residual block may be encoded. The transform skip may be allowed for a residual block having a size smaller than or equal to a threshold, a luma component, or a chroma component under the 4: 4: 4 format.

Quantization stage 235 quantizes values transformed into the frequency domain by transform stage 230. Quantization coefficients may vary according to the block or the importance of a video. A value calculated by the quantization stage 235 is provided to inverse quantization stage 240 and the rearrangement stage 260.

Rearrangement stage 260 rearranges values of the quantized residual coefficients. Rearrangement stage 260 changes coefficients of a two-dimensional block shape into a one-dimensional vector shape through a coefficient scanning method. For example, rearrangement stage 260 may scan DC coefficients up to high-frequency domain coefficients using a zig-zag scan method, and change the coefficients into a one-dimensional vector shape. According to the size of the transform unit and the intra prediction mode, a vertical scan of scanning the coefficients of a two-dimensional block shape in the column direction and a horizontal scan of scanning the coefficients of a two-dimensional block shape in the row direction may be used instead of the zig-zag scan. In some embodiments, according to the size of the transform unit and the intra prediction mode, a scan method that will be used may be determined among the zig-zag scan, the vertical direction scan, and the horizontal direction scan.

Entropy coding stage 265 performs entropy coding based on values determined by rearrangement stage 260. Entropy coding may use various encoding methods such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , Context-Adaptive Binary Arithmetic Coding (CABAC) , and the like.

Entropy coding stage 265 encodes various information such as residual coefficient information and block type information of a coding unit, prediction mode information, partitioning unit information, prediction unit information and transmission unit information, motion vector information, reference frame information, block interpolation information, and filtering information input from rearrangement stage 260, inter prediction stage 220, and intra prediction stage 225. Entropy coding stage 265 may also entropy-encode the coefficient value of a coding unit input from rearrangement stage 260. The output of entropy coding stage 265 forms an encoded bitstream 270, which may be transmitted to a decoding device.

Inverse quantization stage 240 and inverse transform stage 245 inverse-quantize the values quantized by quantization stage 235 and inverse-transform the values transformed by transform stage 230. The residual coefficient generated by inverse quantization stage 240 and inverse transform stage 245 may be combined with the prediction unit predicted through motion estimation, motion compensation, and/or intra prediction to generate a reconstructed block.

Filter stage 250 includes at least one among a deblocking filter, an offset correction unit, and an adaptive loop filter (ALF) .

The deblocking filter removes block distortion generated by the boundary between blocks in the reconstructed picture. In order to determine whether or not to perform deblocking, information of the pixels included in several columns or rows included in the block may be used. A strong filter or a weak filter may be applied according to the deblocking filtering strength needed when the deblocking filter is applied to a block. In addition, when vertical direction filtering and horizontal direction filtering are performed in applying the deblocking filter, horizontal direction filtering and vertical direction filtering may be processed in parallel.

The offset correction unit corrects an offset to the original video by the unit of pixel for a video on which the deblocking has been performed. Offset correction for a specific picture may be performed by dividing pixels included in the video into a certain number of areas, determining an area to perform offset, and applying the offset to the area. Alternatively, the offset correction may be performed by applying an offset considering edge information of each pixel.

Adaptive Loop Filtering (ALF) is performed based on a value obtained by comparing the reconstructed and filtered video with the original video. After dividing the pixels included in the video into predetermined groups, one filter to be applied to a corresponding group is determined, and filtering may be performed differently for each group. A luminance signal, which is the information related to whether or not to apply ALF, may be transmitted for each coding unit (CU) , and the shape and filter coefficient of an ALF filter to be applied varies according to each block. In some implementations, an ALF filter of the same type (fixed type) is applied regardless of the characteristic of a block to be applied.

Buffer stage 255 temporarily stores the reconstructed blocks or pictures calculated through filter stage 250, and provides them to inter prediction stage 220 when inter prediction is performed.

FIG. 3 is a block diagram showing an exemplary decoding process 300, according to some embodiments of the present disclosure. Decoding process 300 can be performed by a decoder, such as image/video decoder 144 (FIG. 1) . Consistent with the disclosed embodiments, decoding process 300 may be used to decode a bitstream including one or more retargeted images. After the one or more retargeted images are decoded and reconstructed, they can be inversely retargeted to restore the original images. Referring to FIG. 3, decoding process 300 includes an entropy decoding stage 310, a rearrangement stage 315, an inverse quantization stage 320, an inverse transform stage 325, an inter prediction stage 330, an intra prediction stage 335, a filter stage 340, and a buffer stage 345.

Specifically, an input bitstream 302 is decoded in a procedure opposite to that of an encoding process, e.g., encoding process 200 (FIG. 2) . Specifically, as shown in FIG. 3, entropy decoding stage 310 may perform entropy decoding in a procedure opposite to that performed in an entropy coding stage. For example, entropy decoding stage 310 may use various methods corresponding to those performed in the entropy coding stage, such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , and Context-Adaptive Binary Arithmetic Coding (CABAC) . Entropy decoding stage 310 decodes information related to intra prediction and inter prediction.

Rearrangement stage 315 performs rearrangement on the output of entropy decoding stage 310 based on the rearrangement method used in the corresponding encoding process. The coefficients expressed in a one-dimensional vector shape are reconstructed and rearranged as coefficients of two-dimensional block shape. Rearrangement stage 315 receives information related to coefficient scanning performed in the corresponding encoding process and performs reconstruction through a method of inverse-scanning based on the scanning order performed in the corresponding encoding process.

Inverse quantization stage 320 performs inverse quantization based on a quantization parameter provided by the encoder and a coefficient value of the rearranged block.

Inverse transform stage 325 performs inverse DCT or inverse DST. The DCT transform core may include at least one among DCT2 and DCT8, and the DST transform core may include DST7. Alternatively, when the transform is skipped in the corresponding encoding process, inverse transform stage 325 may be skipped. The inverse transform may be performed based on a transmission unit determined in the corresponding encoding process. Inverse transform stage 325 may selectively perform a transform technique (e.g., DCT or DST) according to a plurality of pieces of information such as a prediction method, a size of a current block, a prediction direction and the like.

Inter prediction stage 330 and/or intra prediction stage 335 generate a prediction block based on information related to generation of a prediction block (provided by entropy decoding stage 310) and information of a previously decoded block or picture (provided by buffer stage 345) . The information related to generation of a prediction block may include intra prediction mode, motion vector, reference picture, etc.

A prediction type selection process may be used to determine whether to use one or both of inter prediction stage 330 and intra prediction stage 335. The prediction type selection process receives various information such as prediction unit information input from the entropy decoding stage 310, prediction mode information of the intra prediction method, information related to motion prediction of an inter prediction method, and the like. The prediction type selection process identifies the prediction unit from the current coding unit and determines whether the prediction unit performs inter prediction or intra prediction. Inter prediction stage 330 performs inter prediction on the current prediction unit based on information included in at least one picture before or after the current picture including the current prediction unit by using information necessary for inter prediction of the current prediction unit. In some embodiments, the inter prediction stage 330 may perform inter prediction based on information on a partial area previously reconstructed in the current picture including the current prediction unit. In order to perform inter prediction, it is determined, based on the coding unit, whether the motion prediction method of the prediction unit included in a corresponding coding unit is a skip mode, a merge mode, a motion vector prediction mode (AMVP mode) , or an intra-block copy mode.

Intra prediction stage 335 generates a prediction block based on the information on the pixel in the current picture. When the prediction unit is a prediction unit that has performed intra prediction, the intra prediction may be performed based on intra prediction mode information of the prediction unit. Intra prediction stage 335 may include an Adaptive Intra Smoothing (AIS) filter, a reference pixel interpolation stage, and a DC filter. The AIS filter is a stage that performs filtering on the reference pixel of the current block, and may determine whether or not to apply the filter according to the prediction mode of the current prediction unit and apply the filter. AIS filtering may be performed on the reference pixel of the current block by using the prediction mode and AIS filter information of the prediction unit used in the corresponding encoding process. When the prediction mode of the current block is a mode that does not perform AIS filtering, the AIS filter may not be applied.

When the prediction mode of the prediction unit is intra prediction based on a pixel value obtained by interpolating the reference pixel, the reference pixel interpolation stage generates a reference pixel of a pixel unit having an integer value or less by interpolating the reference pixel. When the prediction mode of the current prediction unit is a prediction mode that generates a prediction block without interpolating the reference pixel, the reference pixel is not interpolated. The DC filter generates a prediction block through filtering when the prediction mode of the current block is the DC mode.

The reconstructed block or picture is provided to filter stage 340. Filter stage 340 may include a deblocking filter, an offset correction unit, and an ALF.

Information on whether a deblocking filter is applied to a corresponding block or picture and information on whether a strong filter or a weak filter is applied when a deblocking filter is applied may be provided by the corresponding encoding process. The deblocking filter at filter stage 340 may be provided with information related to the deblocking filter used in the corresponding encoding process.

The offset correction unit at filter stage 340 may perform offset correction on the reconstructed video based on the offset correction type and offset value information, which may be provided by the encoder.

The ALF at filter stage 340 is applied to a coding unit based on information on whether or not to apply the ALF and information on ALF coefficients, which are provided by the encoder. The ALF information may be provided in a specific parameter set.

Buffer stage 345 stores the reconstructed picture or block as a reference picture or a reference block, and outputs it to a downstream application, e.g., image/video applications 146 in FIG. 1.

The above-described encoding process 200 (FIG. 2) and decoding process 300 (FIG. 3) are provided as examples to illustrate the possible encoding and decoding processes consistent with the present disclosure, and are not meant to limit it. For example, some embodiments encode and decode image data using wavelet transform, rather than directly coding the pixels in blocks.

FIG. 4 shows a block diagram of an exemplary apparatus 400 for encoding or decoding image data, according to some embodiments of the present disclosure. As shown in FIG. 4, apparatus 400 can include processor 402. When processor 402 executes instructions described herein, apparatus 400 can become a specialized machine for processing the image data. Processor 402 can be any type of circuitry capable of manipulating or processing information. For example, processor 402 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like. In some embodiments, processor 402 can also be a set of processors grouped as a single logical component. For example, as shown in FIG. 4, processor 402 can include multiple processors, including processor 402a, processor 402b, ..., and processor 402n.

Apparatus 400 can also include memory 404 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like) . For example, as shown in FIG. 4, the stored data can include program instructions (e.g., program instructions for implementing the processing of image data) and image data to be processed. Processor 402 can access the program instructions and data for processing (e.g., via bus 410) , and execute the program instructions to perform an operation or manipulation on the data. Memory 404 can include a high-speed random-access storage device or a non-volatile storage device. In some embodiments, memory 404 can include any combination of any number of a random-access memory (RAM) , a read-only memory (ROM) , an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a security digital (SD) card, a memory stick, a compact flash (CF) card, or the like. Memory 404 can also be a group of memories (not shown in FIG. 4) grouped as a single logical component.

Bus 410 can be a communication device that transfers data between components inside apparatus 400, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.

For ease of explanation without causing ambiguity, processor 402 and other data processing circuits (e.g., analog/digital converters, application-specific circuits, digital signal processors, interface circuits, not shown in FIG. 4) are collectively referred to as a “data processing circuit” in this disclosure. The data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware. In addition, the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 400.

Apparatus 400 can further include network interface 406 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) . In some embodiments, network interface 406 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.

In some embodiments, optionally, apparatus 400 can further include peripheral interface 408 to provide a connection to one or more peripheral devices. As shown in FIG. 4, the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.

Consistent with the disclosed embodiments, the encoding and decoding of image data can be optimized by allocating more bits to certain contents in an image, such as important contents, while allocating less bits to other contents, such as a background of the image. The important contents are also referred to as regions of interest ( “ROIs” ) . The ROI technique is suitable for processing image data consumed by machine vision tasks, because usually only particular regions in an image are relevant. As an example, for machine vision tasks that involve with recognition or identification of people from an image, only human faces and their surrounding regions need to be encoded and transmitted with high quality, whereas other regions of the image may be distorted without causing negative impact on the machine vision tasks. As another example, in medical imaging, the boundaries of a tumor need to be defined on an image and the image data representing the boundaries needs to be transmitted losslessly or semi losslessly to ensure the size of the tumor can be accurately measured. In contrast, the image regions not containing the tumor may be transmitted in a lossy manner.

To give more weight to the ROIs in the encoding and decoding processes, the disclosed image data compressing methods use image retargeting technique to resize the ROIs and non-ROIs in an image, such that the ROIs collectively occupy a higher percentage of the image after retargeting. FIG. 5 is a schematic diagram illustrating a method 500 for compressing image data using the image retargeting technique, according to some embodiments consistent with the present disclosure. As shown in FIG. 5, method 500 may be performed by an encoder 510 and a decoder 520. Encoder 510 first analyzes an input image 511 to determine one or more ROIs. As described below in more detail, the ROIs can be determined according to any method known in the art, which is not limited by the present disclosure. Encoder 510 then performs retargeting 512 of input image 511 based on the one or more ROIs. Subsequently, encoder 510 performs encoding 514 (i.e., compression) of the retargeted image and generates an encoded bitstream 530, which is transmitted to decoder 520. At the decoder side, decoder 520 performs decoding 522 of bitstream 530 to generate a reconstructed image. Optionally, decoder 520 may perform inverse retargeting 524 on the reconstructed image to generate an output image 525 that has the same scale as input image 511. In some embodiments, inverse retargeting 524 may be skipped and the reconstructed image (i.e., retargeted image) is directly output by decoder 520. Consistent with the disclosed embodiments, encoding 514 and decoding 522 may be performed according to any image data coding standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc.

In some embodiments, the ROIs are defined as rectangular or square regions each of which is assigned a rescaling factor. Thus, each of the ROIs can be defined by a rectangular or square bounding box. The ROIs can have other shapes, such as circles, triangles, or polygons, without departing from the scope of the present disclosure. The rescaling factor associated with an ROI can be a downscaling factor or an upscaling factor. The downscaling factor describes the maximum level the ROI can be downsized, and the upscaling factor describes the maximum level the ROI can be stretched. Without losing generality or limiting the present disclosure, the following description discusses the use cases of image compression and thus assumes the rescaling factor associated with each ROI is a downscaling factor. As described below in more detail, the rectangle or square shaped ROIs simplifies the image retargeting process and therefore reduces the computation complexity.

FIG. 6 is a schematic diagram illustrating an image 610 including multiple ROIs, according to some embodiments consistent with the present disclosure. As shown in FIG. 6, an object detection technique may be used to detect the objects represented by image 610. For example, objects 611 and 613 may be detected from image 610. An ROI 612 is defined as a rectangular or square shaped region encompassing object 611, and an ROI 614 is defined as a rectangular or square shaped region encompassing object 613.

As shown in FIG. 6, an ROI may be a rectangle or square the edges of which are parallel or aligned to the horizontal and vertical axes of the image. The rectangle or square shaped ROI can be described by its position information and downscaling factor. The position information contains information describing the location of the ROI in an image. For example, the position information may include coordinates of the ROI’s left-top corner and right-bottom corner. As another example, the position information may include coordinates of the ROI’s left-top corner and the size (e.g., width and height) of the ROI. Moreover, the downscaling factor specifies the maximum level the ROI can be downscaled. For example, the downscaling factor may be represented by the minimum size (e.g., minimal_retargeted_width and minimal_retargeted_height) that the ROI can be downscaled to. As another example, the downscaling factor may be a value (denoted hereinafter as “S” ) equal to a ratio of the ROI’s original size over its minimum allowable size. In some embodiments, a common S value may be used for both the width and height of an ROI. For example, S=2.0 means that the ROI can be at most halved in size, i.e., each of the height and width of the ROI can be at most halved in size. In some embodiments, separate S values, i.e., S_horizontal and S_vertical, are given to an ROI’s width and height, respectively. The above-described forms of downscaling factor are mathematically interchangeable. For example, their relationships can be expressed as the following Equations 1 and 2:

minimal_retargeted_width = original_width /S_horizontal (Eq. 1)

minimal_retargeted_height = original_height /S_vertical (Eq. 2)

In

Equations

1 and 2, the retargeted width and height, i.e., minimal_retargeted_width and minimal_retargeted_height, can be calculated using any rounding method, such as rounding towards zero, nearest integer, nearest power of 2, etc. The retargeted width and height can also be calculated using a maximum clipping method, i.e., limiting the maximal value of the minimal_retargeted_width or minimal_retargeted_height. The retargeted width and height can also be calculated using a minimal clipping method, i.e., limiting the minimal value of the minimal_retargeted_width or minimal_retargeted_height.

The above-defined downscaling factor corresponds to the maximum level of downscaling a ROI can reach (i.e., the minimum size the ROI can be downscaled to) . But the ROI does not always need to be downscaled at the maximum level. Sometimes, the ROI can be downscaled using a smaller downscaling factor, resulting in a retargeted ROI that is bigger than the minimum size the ROI can be downscaled to. For example, a ROI with original_width=100, original_height=150, S_horizontal=6, and S_vertical=4 can be horizontally scaled down by a factor of 4 (4<S_horizontal=6) and vertically scaled down only by a factor of 3 (3<S_vertical=4) . This results in the retargeted_width = 100 /4 = 25, and retargeted_height = 150 /3 = 50.

In some embodiments, it is required to cover the entire area of an image by at least one ROI. To meet this requirement, a background ROI can be defined to cover the whole area of the image. For example, as shown FIG. 6, a background ROI 616 can be defined as the entire area of image 610.

FIG. 7 is a schematic diagram illustrating a process 700 for performing retargeting and inverse retargeting of an image, according to some embodiments consistent with the present disclosure. As shown in FIG. 7, an original image 710 includes three ROIs: an ROI 712 encompassing an object 711 (i.e., house) , an ROI 714 encompassing an object 713 (i.e., cloud) , and a background ROI 716 covering the entire area of original image 710. For example, the scaling factors associated with

ROIs

712, 714, and 716 may be set as 1.0, 1.0, and 2.0, respectively. This means that

ROIs

712 and 714 cannot be scaled down, while one or both of the width and height of ROI 716 can be scaled down by a factor of 2.0. Thus, retargeting image 710 may result in a retargeted image 730, in which

ROIs

732 and 734 have the same sizes as those of

ROIs

712 and 714 respectively, while ROI 736 is scaled down from ROI 716 by a factor of 2.0. Image 730 can also be inverse retargeted to restore the original image size. Namely, after the inverse retargeting,

ROIs

752, 754, and 756 in the resulted image 750 have the same sizes as those of

ROIs

712, 714, and 716 in the original image 710, respectively.

As shown by the example in FIG. 7, the retargeting is performed by downscaling one or more ROIs, and the inverse retargeting is performed by stretching (i.e., upscaling) one or more ROIs. The downscaling and upscaling are performed based on the positions and rescaling factors of the one or more ROIs.

Next, exemplary encoding and decoding methods consistent with the process 500 are described in detail.

FIG. 8 is a flowchart of an exemplary method 800 for encoding image data, according to some embodiments consistent with the present disclosure. Method 800 may be used to compress image data by retargeting an input image before it is encoded. For example, method 800 may be performed by an image data encoder, such as image/video encoder 124 in FIG. 1. As shown in FIG. 8, method 800 includes the following steps 802-812.

At step 802, the encoder detects one or more objects from an input image. For example, referring to image 710 in FIG. 7, the encoder may detect each of

objects

711 and 713.

In some embodiments, the detection of object (s) from an input image includes three stages: image segmentation (i.e., object isolation or recognition) , feature extraction, and object classification. At the image segmentation stage, the encoder may execute an image segmentation algorithm to partition the input image into multiple segments, e.g., sets of pixels each of which representing a portion of the input image. The image segmentation algorithm may assign a label (i.e., category) to each set of pixels, such that pixels with the same label share certain common characteristics. The encoder may then group the pixels according to their labels and designate each group as a distinct object.

After the input image is segmented, i.e., after at least one object is recognized, the encoder may extract, from the input image, features which are characteristic of the object. The features may be based on the size of the object or on its shape, such as the area of the object, the length and the width of the object, the distance around the perimeter of the object, rectangularity of the object shape, circularity of the object shape. The features associated with each object can be combined to form a feature vector.

At the classification stage, the encoder may classify each recognized object on the basis of the feature vector associated with the respective object. The feature values can be viewed as coordinates of a point in an N-dimensional space (one feature value implies a one-dimensional space, two features imply a two-dimensional space, and so on) , and the classification is a process for determining the sub-space of the feature space to which the feature vector belongs. Since each sub-space corresponds to a distinct object, the classification accomplishes the task of object identification. For example, the encoder may classify a recognized object into one of multiple predetermined classes, such as human, face, animal, automobile or vegetation. In some embodiments, the encoder may perform the classification based on a model. The model may be trained using pre-labeled training data sets that are associated with known objects. In some embodiments, the encoder may perform the classification using a machine learning technology, such as convolutional neural network (CNN) , deep neural network (DNN) , etc.

The above-described method of object detection is for exemplary purpose only. The present disclosure does not limit the way of detecting objects from an image. For example, step 802 may be implemented using any commercially available or open-source software. As another example, step 802 may be partially or entirely performed using a hardware method, such as using sensors (e.g., infrared sensors, sound or ultrasound sensors, lasers, etc. ) to detect the presence and/or positions of objects in an image.

Still referring to FIG. 8, at step 804, the encoder determines one or more ROIs in the input image, based on the one or more detected objects. Specifically, for each of the one or more detected objects, the encoder defines a ROI to encompass the object. The ROI may be a rectangle or square bounding box, or a bounding box with any other shape, e.g., quadrilateral or hexagon. After that, the encoder further determines the position information and downscaling factor associated with each of the ROIs. As described above, the encoder may determine the position information based on each ROI’s coordinates in the input image, and/or each ROI’s size (e.g., width and height) . Moreover, the encoder may determine the downscaling factor associated with each ROI based on its class. In particular, the encoder may classify the one or more ROIs by classifying the objects associated with the ROIs. For example, the encoder may determine each ROI to have the same class as its associated object. Thus, the class is also referred to as “object/ROI class. ” An ROI’s downscaling factor may be set to be correlated to its class. This is because object/ROI sizes and classes may impact machine vision tasks’ precision. For example, when a neural network is trained for recognizing a particular class of objects/ROIs (e.g., human faces) with a dimension of 224×224 pixels, performing inference for objects/ROIs in the trained size (i.e., 224×224 pixels) usually results in the highest precision. Thus, each object/ROI class has a corresponding target object size/ROI that can maximize the machine vision tasks’ precision in analyzing objects/ROIs belonging to the particular class. For example, the following Tables 1-3 show the possible bitrate gains achieved by downscaling various object/ROI classes. The bitrate gain in Table 1 is measured by the quality of detection with respect to score parameter from a neural network used for machine vision task. The bitrate gain in Table 2 is measured by the quality of detection with respect to the best attained Intersection over Union (IoU) . The bitrate gain in Table 3 is measured by the quality of detection with respect to matching categorization with threshold of IoU=0.5. In Tables 1-3, the values ranging from 0 to 1 indicate the ratio of a downscaled object/ROI size versus its original size. As shown in Tables 1-3, each object/ROI class has certain size (s) that can maximize the bitrate gain.

Table 1: Quality of detection versus object/ROI size, with respect to score parameter from neural network

Tabel 2: Quality of detection versus object/ROI size, with respect to best attained IoU

Tabel 3: Quality of detection versus object/ROI size, with respect to matching categorization with threshold of IoU=0.5

Based on the relationship between object/ROI classes and their corresponding target object/ROI sizes, the encoder may determine a minimum allowable size for each of the one or more ROIs detected from the input image. The encoder may further determine a downscaling factor for each ROI based on its original size in the input image and its minimum allowable size. For example, the downscaling factor may be set equal to the ratio of the original size over the minimum allowable size.

At step 806, the encoder partitions the input image into a plurality of blocks using a grid of horizontal lines and vertical lines. The encoder may create the grid based on the determined one or more ROIs. For example, each horizontal grid line is aligned with at least one horizontal edge of the one or more ROIs, and each vertical grid line is aligned with at least one vertical edge of the one or more ROIs.

FIG. 9 is a schematic diagram illustrating an input image 910 partitioned into a plurality of blocks by a grid, according to some exemplary embodiments. As shown in FIG. 9, input image 910 includes three ROIs: an ROI 912, an ROI 914, and a background ROI 916. The downscaling factor associated each ROI may have a horizontal component and a vertical component. The horizontal component is used to downscale the width of the ROI, while the vertical component is used to downscale the height of the ROI. For example, ROI 912 has S_horizontal=S_vertical=1; ROI 914 has S_horizontal=2 and S_vertical=3; and ROI 916 has S_horizontal=5 and S_vertical=6. Moreover, a plurality of horizontal grid lines 920 and a plurality of vertical grid lines 940 are used to partition input image 910 into a plurality of blocks 950. Horizontal grid lines 920 and vertical grid lines 940 may be created based on the edges of the ROIs. Specifically, each horizontal grid line 920 is aligned with at least one horizontal edge of ROIs 912-916, and each vertical grid line 940 is aligned with at least one vertical edge of ROIs 912-916, such that each edge of ROIs 912-916 is aligned with a horizontal grid line 920 or a vertical grid line 940.

Referring back to FIG. 8, at step 808, the encoder determines an allowable downscaling factor for each of the plurality of blocks respectively. The allowable downscaling factors are target downscaling factors used for downscaling the respective blocks, and are made smaller than or equal to the downscaling factors associated with the one or more ROIs, such that downscaling the blocks based on the allowable downscaling factors will satisfy the allowed levels of downscaling for all the ROIs.

Consistent with the present disclosure, any suitable method can be used to determine the allowable downscaling factors for the plurality of blocks. Without losing generality and to illustrate the way of determining an allowable downscaling factor for each of the plurality of blocks respectively, an exemplary method is described below. Specifically, according to some embodiments, the method starts with the encoder assigning an initial downscaling factor to each of the plurality of blocks. The initial downscaling factor for a block may be set to be equal to the smallest downscaling factor associated with the ROIs that overlap with the respective block. This step can be performed independently for horizontal (FIG. 10A) axis and vertical axis (FIG. 10B) if the ROIs use different downscaling factors for their horizontal edges and vertical edges. FIG. 10A is schematic diagram showing a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 9, according to some exemplary embodiments. As shown in FIG. 10A, blocks overlapping with both ROI 912 and ROI916 are assigned S_horizontal=min (1; 5) =1; blocks overlapping with both ROI 914 and ROI916 are assigned S_horizontal=min (2; 5) =2; and the rest of the blocks only overlap with ROI916 and therefore are assigned S_horizontal=5. FIG. 10B is schematic diagram showing a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 9, according to some exemplary embodiments. As shown in FIG. 10B, blocks overlapping with both ROI 912 and ROI916 are assigned S_vertical=min (1; 6) =1; blocks overlapping with both ROI 914 and ROI916 are assigned S_vertical=min (3; 6) =3, and the rest of the blocks only overlap with ROI916 and therefore are assigned S_horizontal=6.

After the initial downscaling factors are assigned, the encoder determines a commonly-agreeable horizontal downscaling factor for each column of the blocks (FIG. 11A) , and determines a commonly-agreeable vertical downscaling factor for each row of the blocks (FIG. 11B) . FIG. 11A is schematic diagram showing a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 9, according to some exemplary embodiments. As shown in FIG. 11A, the commonly-agreeable horizontal downscaling factor for a given column is calculated as a minimum of all horizontal downscaling factors in this column. For example, from left to right, the commonly-agreeable horizontal downscaling factors for the columns are 5, 1, 5, 2, and 5, respectively. This means that the first column from left can be downscaled horizontally by a factor of 5; the second column from left cannot be downscaled horizontally and must remain in the same horizontal dimension; the third column from left can be downscaled horizontally by a factor of 5; the fourth column from left can be downscaled horizontally by a factor of 2; and the fifth column from left can be downscaled horizontally by a factor of 5. On the other hand, FIG. 11B is schematic diagram showing a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 9, according to some exemplary embodiments. As shown in FIG. 11B, the commonly-agreeable vertical downscaling factor for a given row is calculated as a minimum of all vertical downscaling factors in this row. For example, from top to bottom, the commonly-agreeable vertical downscaling factors for the rows are 6, 3, 1, 1, and 6, respectively. This means that the first row from top can be downscaled horizontally by a factor of 6; the second row from top can be downscaled horizontally by a factor of 3; the third and fourth rows from top cannot be downscaled horizontally and must remain in the same horizontal dimensions; and the fifth row from top can be downscaled horizontally by a factor of 6.

To summarize the above-described process for determining the allowable downscaling factors for the plurality of blocks, the encoder may determine one or more ROIs horizontally collocated with a block; and then determine an allowable vertical downscaling factor for the block to be equal to a smallest vertical downscaling factor associated with the one or more ROIs horizontally collocated with the block. And, independently, the encoder may determine one or more ROIs vertically collocated with the block; and then determine an allowable horizontal downscaling factor for the block to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs vertically collocated with the block.

Equivalently, since the plurality of blocks is organized in rows and columns, the above process for determining the allowable downscaling factor can also be performed according to the rows and columns of the blocks. Specifically, the encoder may determine one or more ROIs collocated with a row of blocks; and then determine an allowable vertical downscaling factor for the row of blocks to be equal to a smallest vertical downscaling factor associated with the one or more ROIs collocated with the row of blocks. And, independently, the encoder may determine one or more ROIs collocated with a column of blocks; and then determining an allowable horizontal downscaling factor for the column of blocks to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs collocated with the column of blocks.

Referring back to FIG. 8, at step 810, the encoder performs at least one of horizontal downscaling of vertical downscaling of the plurality of blocks based on the respective allowable downscaling factors. Specifically, the encoder may vertically downscale each block based on the respective allowable vertical downscaling factor. Alternatively or additionally, the encoder may horizontally downscale each block based on the respective allowable horizontal downscaling factor.

For example, as schematically shown in FIG. 12, to perform the downscaling, the grid generated in FIG. 9 can be first downscaled. Each block formed by the downscaled grid has a width and height according to:

rescaled_width = original_width /min_in_column (S_horizontal) (Eq. 3)

rescaled _height = original_height /min_in_row (S_vertical) (Eq. 4)

In Equations 3 and 4, the rescaled width and height, i.e., rescaled_width and rescaled _height, can be calculated using any rounding method, such as rounding towards zero, nearest integer, nearest power of 2, etc. The rescaled width and height can also be calculated using a maximum clipping method, i.e., limiting the maximal size of the individual blocks. rescaled width and height can also be calculated using a minimal clipping method, i.e., limiting the minimal size of the individual blocks

After the grid is downscaled, contents in the original input image can be retargeted by downscaling the image contents in the original grid to fit to the corresponding blocks of the downscaled grid, as schematically shown in FIG. 13.

As shown in FIG. 12, the downscaled grid can also be upscaled (i.e., inversely downscaled) to restore the original grid. Correspondingly, as shown in FIG. 13, the retargeted image can be inversely retargeted to restore the original image.

In some embodiments, instead of downscaling each block individually, the downscaling can be performed on a column or a row of blocks simultaneously. Specifically, the encoder may vertically downscale a row of blocks based on the respective allowable vertical downscaling factor. Alternatively or additionally, the encoder may horizontally downscale a column of blocks based on the respective allowable vertical downscaling factor.

Consistent with the disclosed embodiments, the downscaling can be performed using any suitable technique as known in the art, such as nearest neighbor down-sampling, linear down-sampling, bilinear down-sampling, cubic down-sampling, bicubic down-sampling, spline down-sampling, polynomial down-sampling, affine transformation, graphic card (GPU) based texture down-sampling.

Referring back to FIG. 8, at step 812, the encoder compresses the downscaled image. The encoder may perform the compression (i.e., encoding) according to any image data coding standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc., and generate an encoded bitstream. In some embodiments, the encoder may also encode the information regarding the ROIs into the bitstream, to facilitate decoding. The information regarding the ROIs may include position information associated with one or more ROIs in the original input (i.e., un-retargeted) image. The information regarding the ROIs may also include information indicating the downscaling factors used for downscaling the one or more ROIs. The information indicating the downscaling factors may directly indicate the values of the downscaling factors, or indirectly indicate the values of the downscaling factors by signaling the classes associated with the ROIs.

At the decoder side, the bitstreams output from the encoder can be decoded and used to restore the original image. The construction of the grid is performed identically at both the encoder and decoder side. FIG. 14 is a flowchart of an exemplary method 1400 for decoding a bitstream, according to some embodiments consistent with the present disclosure. For example, method 1400 may be performed by an image data decoder, such as image/video decoder 144 in FIG. 1. As shown in FIG. 14, method 1400 includes the following steps 1402-1406.

At step 1402, the decoder decodes a bitstream that includes position information associated with one or more ROIs, and information indicating one or more downscaling factors associated with the one or more ROIs. The information indicating the downscaling factors may directly indicate the values of the downscaling factors, or indirectly indicate the values of the downscaling factors by signaling the classes associated with the ROIs. When the classes associated with the ROIs are signaled, the decoder determines the one or more downscaling factors based on a predetermined corresponding relationship (e.g., a lookup table) between the classes and the downscaling factors.

At step 1404, the decoder generates a reconstructed image based on the decoded bitstream. The decoder may perform the decoding and reconstruction according to any image data coding standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc.

At step 1406, the decoder performs, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.

In some embodiments, performing the at least one of horizontal upscaling or vertical upscaling of the reconstructed image includes: locating, based on the position information, the one or more ROIs in the reconstructed image; partitioning the reconstructed image into a plurality of blocks using a grid of horizontal lines and vertical lines, wherein each horizontal line of the grid is aligned with at least one horizontal edge of the one or more ROIs, and each vertical line of the grid is aligned with at least one vertical edge of the one or more ROIs; determining an allowable downscaling factor for each of the plurality of blocks respectively, wherein the allowable downscaling factor is smaller than or equal to the one or more downscaling factors associated with the one or more ROIs; horizontally or vertically upscaling each of the plurality of blocks based on the respective allowable downscaling factor.

In some embodiments, performing the horizontal upscaling includes: determining one or more ROIs vertically collocated with a block; determining an allowable horizontal downscaling factor for the block to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs vertically collocated with the block; and horizontally upscaling the block based on the allowable horizontal downscaling factor.

In some embodiments, performing the horizontal upscaling includes: determining one or more ROIs collocated with a column of blocks; determining an allowable horizontal downscaling factor for the column of blocks to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs collocated with the column of blocks; and horizontally upscaling the column of blocks based on the allowable vertical downscaling factor.

In some embodiments, performing the vertical upscaling includes: determining one or more ROIs horizontally collocated with a block; determining an allowable vertical downscaling factor for the block to be equal to a smallest vertical downscaling factor associated with the one or more ROIs horizontally collocated with the block; and vertically upscaling the block based on the allowable vertical downscaling factor.

In some embodiments, performing the vertical upscaling includes: determining one or more ROIs collocated with a row of blocks; determining an allowable vertical downscaling factor for the row of blocks to be equal to a smallest vertical downscaling factor associated with the one or more ROIs collocated with the row of blocks; and vertically upscaling the row of blocks based on the allowable vertical downscaling factor.

With the disclosed encoding and decoding methods, images after retargeting are smaller in size and proportionally more pixels are allocated to important ROIs as opposed to other regions (or background specified as non-important ROI) . The size of an image/video that needs to be compressed/transmitted is smaller and requires less bits. Moreover, the amount of pixels that need to be processed by the encoder and decoder is smaller, thereby reducing computational complexity. Furthermore, transmission of the retargeting parameters is efficient, because retargeting process may be controlled by positions of the ROIs and the allowed rescaling factors.

In some embodiments, a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder) , for performing the above-described methods. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. The device may include one or more processors (CPUs) , an input/output interface, a network interface, and/or a memory.

It should be noted that, the relational terms herein such as “first” and “second” are used only to differentiate an entity or operation from another entity or operation, and do not require or imply any actual relationship or sequence between these entities or operations. Moreover, the words “comprising, ” “having, ” “containing, ” and “including, ” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.

As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention should only be limited by the appended claims.

Claims

An encoding method, comprising:

determining one or more regions of interest (ROIs) in an image;

performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and

compressing the downscaled image.
The method of claim 1, wherein the one or more ROIs comprise a first ROI and a second ROI, and a downscaling factor associated with the first ROI is different from a downscaling factor associated with the second ROI.
The method of claim 2, wherein the first ROI comprises a background of the image, the second ROI comprises an object, and wherein the downscaling factor associated with the first ROI is higher than the downscaling factor associated with the second ROI.
The method of any of claims 1-3, wherein:

determining the one or more ROIs in the image comprises defining at least one of the one or more ROIs to have a rectangular shape; and

performing the at least one of horizontal downscaling or vertical downscaling of the image comprises:

partitioning the image into a plurality of blocks using a grid of horizontal lines and vertical lines, wherein the horizontal lines of the grid are aligned with at least one horizontal edge of the one or more ROIs, and the vertical lines of the grid are aligned with at least one vertical edge of the one or more ROIs; and

horizontally or vertically downscaling at least one the plurality of blocks.
The method of claim 4, wherein horizontally or vertically downscaling at least one of the plurality of blocks further comprises:

determining a target downscaling factor for at least one of the plurality of blocks respectively, wherein the target downscaling factor is smaller than or equal to the one or more downscaling factors associated with the one or more ROIs; and

horizontally or vertically downscaling the at least one of the plurality of blocks based on the respective target downscaling factor.
The method of claim 4, wherein the plurality of blocks comprises a first block, and the method further comprises:

determining one or more ROIs horizontally collocated with the first block;

determining a target vertical downscaling factor for the first block to be equal to a smallest vertical downscaling factor associated with the one or more ROIs horizontally collocated with the first block; and

vertically downscaling the first block based on the target vertical downscaling factor.
The method of claim 4, wherein the plurality of blocks comprises a first block, and the method further comprises:

determining one or more ROIs vertically collocated with the first block;

determining a target horizontal downscaling factor for the first block to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs vertically collocated with the first block; and

horizontally downscaling the first block based on the target horizontal downscaling factor.
The method of claim 4, wherein the plurality of blocks forms a plurality of rows, and the method further comprises:

determining one or more ROIs collocated with a first row of blocks;

determining a target vertical downscaling factor for the first row of blocks to be equal to a smallest vertical downscaling factor associated with the one or more ROIs collocated with the first row of blocks; and

vertically downscaling the first row of blocks based on the target vertical downscaling factor.
The method of claim 4, wherein the plurality of blocks forms a plurality of columns, and the method further comprises:

determining one or more ROIs collocated with a first column of blocks;

determining a target horizontal downscaling factor for the first column of blocks to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs collocated with the first column of blocks; and

horizontally downscaling the first column of blocks based on the target vertical downscaling factor.
The method of any of claims 1-9, wherein performing the at least one of horizontal downscaling or vertical downscaling of the image comprises performing at least one of:

nearest neighbor down-sampling;

linear down-sampling;

bilinear down-sampling;

cubic down-sampling;

bicubic down-sampling;

spline down-sampling;

polynomial down-sampling;

affine transformation; or

texture down-sampling.
The method of any of claims 1-10, further comprising:

classifying the one or more ROIs; and;

determining the one or more downscaling factors associated with the one or more ROIs, based on the classes associated with the one or more ROIs.
The method of any of claims 1-11, further comprising:

generating an encoded bitstream comprising information indicating:

positions of the one or more ROIs in the image; and

the one or more downscaling factors associated with the one or more ROIs.
A device comprising:

a memory storing instructions; and

one or more processor configured to execute the instructions to cause the device to performing operations comprising:

determining one or more regions of interest (ROIs) in an image;

performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and

compressing the downscaled image.
A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:

determining one or more regions of interest (ROIs) in an image;

performing, based on one or more downscaling factors associated with the one or more ROIs, at least one of horizontal downscaling or vertical downscaling of the image; and

compressing the downscaled image.
A decoding method, comprising:

decoding a bitstream comprising:

position information associated with one or more regions of interest (ROIs) , and

information indicating one or more downscaling factors associated with the one or more ROIs;

determining a reconstructed image; and

performing, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.
The method of claim 15, wherein the information indicating the one or more downscaling factors comprises classes associated with the one or more ROIs respectively, and the method further comprises:

determining the one or more downscaling factors based on the classes associated with the one or more ROIs, respectively.
The method of any of claim 15 or 16, wherein the one or more ROIs comprise a first ROI and a second ROI, and a downscaling factor associated with the first ROI is different from a downscaling factor associated with the second ROI.
The method of claim 17, wherein the first ROI comprises a background of the image, the second ROI comprises an object, and wherein the downscaling factor associated with the first ROI is higher than the downscaling factor associated with the second ROI.
The method of any of claims 15-18, wherein at least one of the one or more ROIs has a rectangular shape, and performing at least one of horizontal upscaling or vertical upscaling of the reconstructed image comprises:

locating, based on the position information, the one or more ROIs in the reconstructed image;

partitioning the reconstructed image into a plurality of blocks using a grid of horizontal lines and vertical lines, wherein at least one of the horizontal lines of the grid is aligned with at least one horizontal edge of the one or more ROIs, and at least one of the vertical lines of the grid is aligned with at least one vertical edge of the one or more ROIs; and

horizontally or vertically upscaling at least one of the plurality of blocks.
The method of claim 19, wherein horizontally or vertically upscaling at least one of the plurality of blocks further comprises:

determining a target downscaling factor for at least one of the plurality of blocks respectively, wherein the target downscaling factor is smaller than or equal to the one or more downscaling factors associated with the one or more ROIs; and

horizontally or vertically upscaling the at least one of the plurality of blocks based on the respective target downscaling factor.
The method of claim 19, wherein the plurality of blocks comprises a first block, and the method further comprises:

determining one or more ROIs horizontally collocated with the first block;

determining a target vertical downscaling factor for the first block to be equal to a smallest vertical downscaling factor associated with the one or more ROIs horizontally collocated with the first block; and

vertically upscaling the first block based on the target vertical downscaling factor.
The method of claim 19, wherein the plurality of blocks comprises a first block, and the method further comprises:

determining one or more ROIs vertically collocated with the first block;

determining a target horizontal downscaling factor for the first block to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs vertically collocated with the first block; and

horizontally upscaling the first block based on the target horizontal downscaling factor.
The method of claim 19, wherein the plurality of blocks forms a plurality of rows, and the method further comprises:

determining one or more ROIs collocated with a first row of blocks;

determining a target vertical downscaling factor for the first row of blocks to be equal to a smallest vertical downscaling factor associated with the one or more ROIs collocated with the first row of blocks; and

vertically upscaling the first row of blocks based on the target vertical downscaling factor.
The method of claim 19, wherein the plurality of blocks forms a plurality of columns, and the method further comprises:

determining one or more ROIs collocated with a first column of blocks;

determining a target horizontal downscaling factor for the first column of blocks to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs collocated with the first column of blocks; and

horizontally upscaling the first column of blocks based on the target vertical downscaling factor.
A device comprising:

a memory storing instructions; and

one or more processor configured to execute the instructions to cause the device to performing operations comprising:

decoding a bitstream comprising:

position information associated with one or more regions of interest (ROIs) , and

information indicating one or more downscaling factors associated with the one or more ROIs;

determining a reconstructed image; and

performing, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.
A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:

decoding a bitstream comprising:

position information associated with one or more regions of interest (ROIs) , and

information indicating one or more downscaling factors associated with the one or more ROIs;

determining a reconstructed image; and

performing, based on the one or more downscaling factors, at least one of horizontal upscaling or vertical upscaling of the reconstructed image.