WO2024077798A1

WO2024077798A1 - Image data coding methods and systems

Info

Publication number: WO2024077798A1
Application number: PCT/CN2022/144305
Authority: WO
Inventors: Marek Domanski; Slawomir Mackowiak; Slawomir ROZEK; Olgierd Stankiewicz; Tomasz Grajek
Original assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date: 2022-10-11
Filing date: 2022-12-30
Publication date: 2024-04-18

Abstract

Methods and systems for encoding and decoding image data used by machine vision applications are provided. According to certain embodiments, an exemplary encoding method may include: determining one or more regions of interest (ROIs) in an image; modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following: modifying a first ROI based on a portion of the image surrounding the first ROI; reducing content of a second ROI; or reducing a size of a third ROI; and compressing the modified image.

Description

IMAGE DATA CODING METHODS AND SYSTEMS

TECHNICAL FIELD

The present disclosure generally relates to image data processing, and more particularly, to methods and systems for encoding and decoding image data used by machine vision applications.

BACKGROUND

Limited communication bandwidth poses challenge to transmitting images or videos (hereinafter collectively referred to as “image data” ) at desired quality. Although various coding techniques have been developed to compress the image data before it is transmitted, the increasing demand for massive amount of image data has put a constant pressure on the image/video systems, particularly mobile devices that frequently need to work in networks with limited bandwidth.

Moreover, with the rise of machine learning technologies and machine vision applications, the amount of video and images consumed by machines has rapidly grown. Typical use cases include intelligent transportation, smart city technology, intelligent content management, etc., which incorporate machine vision tasks such as object detection, instance segmentation, and object tracking. Machine vision applications extract and analyze features from the image data. As feature information differs substantially from conventional image or video data, coding technologies and solutions for machine usage could differ from conventional human-viewing-oriented applications to achieve optimized performance. Therefore, there is a need to develop coding methods and systems suitable for machine usage.

SUMMARY

Embodiments of the present disclosure relate to methods and systems of encoding and decoding image data used by machine vision applications. In some embodiments, a computer-implemented encoding method is provided. The method may include: determining one or more regions of interest (ROIs) in an image; modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following: modifying a first ROI based on a portion of the image surrounding the first ROI; reducing content of a second ROI; or reducing a size of a third ROI; reducing content of the background remaining after selection of all ROIs, and compressing the modified image.

In some embodiments, a device for encoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: determining one or more regions of interest (ROIs) in an image; modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following: modifying a first ROI based on a portion of the image surrounding the first ROI; reducing content of a second ROI; or reducing a size of a third ROI; reducing content of the background remaining after selection of all ROIs; and compressing the modified image.

In some embodiments, a computer-implemented decoding method is provided. The method may include: decoding a bitstream comprising metadata descriptive of: one or more regions of interest (ROIs) in an uncompressed image, the background in an uncompressed image and a modification made to the uncompressed image; determining a reconstructed image; and restoring the uncompressed image by restoring content or a size of a ROI based on the metadata.

In some embodiments, a device for decoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: decoding a bitstream comprising metadata descriptive of: one or more regions of interest (ROIs) in an uncompressed image, and a modification made to the uncompressed image; determining a reconstructed image; and restoring the uncompressed image by restoring content or a size of a ROI based on the metadata.

Aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.

Additional objects and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The objects and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an exemplary architecture for a video-coding-for-machines (VCM) encoder, consistent with embodiments of the present disclosure.

FIG. 2 is a schematic diagram illustrating an exemplary image inpainting process performed by the VCM encoder of FIG. 1, consistent with embodiments of the present disclosure.

FIG. 3 is a schematic diagram illustrating an exemplary image simplification process performed by the VCM encoder of FIG. 1, consistent with embodiments of the present disclosure.

FIG. 4 is a schematic diagram illustrating an exemplary image retargeting process performed by the VCM encoder of FIG. 1, consistent with embodiments of the present disclosure.

FIG. 5 is a flowchart of an exemplary encoding method, consistent with embodiments of the present disclosure.

FIG. 6 is a flowchart of an exemplary image inpainting method, consistent with embodiments of the present disclosure.

FIG. 7 is a flowchart of an exemplary image simplification method, consistent with embodiments of the present disclosure.

FIG. 8 is a schematic diagram illustrating an object surrounded by an adjacent band, consistent with embodiments of the present disclosure.

FIG. 9 is a schematic diagram illustrating multiple objects surrounded by respective adjacent bands, consistent with embodiments of the present disclosure.

FIG. 10 is a schematic diagram illustrating an exemplary process for performing retargeting and inverse retargeting of an image, consistent with embodiments of the present disclosure.

FIG. 11 is a flowchart of an exemplary image retargeting method, consistent with embodiments of the present disclosure.

FIG. 12 is a schematic diagram illustrating an input image partitioned into a plurality of blocks by a grid, consistent with embodiments of the present disclosure.

FIG. 13A is a schematic diagram illustrating a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 12, consistent with embodiments of the present disclosure.

FIG. 13B is a schematic diagram illustrating a process for assigning initial vertical scaling factors to the plurality of blocks partitioned in FIG. 12, consistent with embodiments of the present disclosure.

FIG. 14A is a schematic diagram illustrating a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 12, consistent with embodiments of the present disclosure.

FIG. 14B is a schematic diagram illustrating a process for determining a commonly-agreeable vertical downscaling factor for each row of the blocks in FIG. 12, consistent with embodiments of the present disclosure.

FIG. 15 is a schematic diagram illustrating a process for performing downscaling and upscaling of a grid, consistent with embodiments of the present disclosure.

FIG. 16 is a schematic diagram illustrating a process for performing retargeting and inverse retargeting of an image, consistent with embodiments of the present disclosure.

FIG. 17 is a block diagram illustrating an exemplary architecture for a VCM decoder, consistent with embodiments of the present disclosure.

FIG. 18 is a flowchart of an exemplary decoding method, consistent with embodiments of the present disclosure.

FIG. 19 is a schematic diagram illustrating an exemplary process for performing inpainting and region-of-interest (ROI) synthesis of an image, consistent with embodiments of the present disclosure.

FIG. 20 is a block diagram of an exemplary apparatus for encoding or decoding image data, consistent with embodiments of the present disclosure.

FIG. 21A is a plot of mean average precision of detection for an example still image decoded from Versatile Video Coding (VVC) bitstreams.

FIG. 21B is a plot of mean average precision of instant segmentation for an example still image decoded from the VVC bitstreams.

DETAILED DESCRIPTION

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.

The present disclosure provides an encoding and decoding architecture for compressing image data, including image data for machine vision applications. The disclosed architecture includes a Video Coding for Machines (VCM) technique that aims at compressing features for machine-performed tasks such as video object detection, event analysis, object tracking and others. The input (original) and the output (decoded) of the disclosed VCM architecture is a still image or video (i.e., a sequence of images) . The architecture includes an encoder, communication/storage system of any type, and a decoder. The architecture is designed to represent the input image or video with possibly low bitrate but assuring possibly high efficiency of machine vision tasks executed on the output (decoded) image or video. The architecture is suitable for compressing image data used by a variety of machine vision tasks. For example, the architecture can be used to improve the average precision of detection or segmentation executed on the output image or video. As another example, it can used to improve the tracking performance in the output video.

FIG. 1 is a block diagram illustrating an exemplary architecture for a video-coding-for-machines (VCM) encoder 100, consistent with embodiments of the present disclosure. Encoder 100 is capable of compressing an image (also called a “picture” or “frame” ) , multiple images, or a video. An image is a static picture. Multiple images may be related or unrelated, either spatially or temporary. A video includes a set of images arranged in a temporal sequence.

As shown in FIG. 1, encoder 100 may include a region of interest (ROI) determination and classification module 102, an image inpainting module 110, an image simplification module 120, an image retargeting module 130, an ROI metadata formation module 140, and an encoding (i.e., compressing) module 150. The input of encoder 100 may include one or more input images 101, and the output of encoder 100 may include encoded bitstream 160 and/or ROI metadata 162. Consistent with the disclosed embodiments, encoder 100 can be made scalable-some of the above modules may be absent in individual implementations, depending on the actual use case. For example, not all of image inpainting module 110, image simplification module 120, and image retargeting module 130 are required in an individual use case.

Specifically, determination and classification module 102 analyzes an input image 101 and determines one or more ROIs in which objects of interest may be located. An ROI is a region in the input image that contains content critical to the performance or precision of machine vision applications, such as object detection and tracking. In other words, altering or removing the ROI will deteriorate the performance or precision of the machine vision applications. Consistent with the disclosed embodiment, an ROI may further include one or more smaller ROIs, and/or a part or the entirety of the input image’s background. In some embodiments, the detection of object (s) from an input image includes three stages: image segmentation (i.e., object isolation or recognition) , feature extraction, and object classification. At the image segmentation stage, the encoder may execute an image segmentation algorithm to partition the input image into multiple segments, e.g., sets of pixels each of which representing a portion of the input image. The image segmentation algorithm may assign a label (i.e., category) to each set of pixels, such that pixels with the same label share certain common characteristics. The encoder may then group the pixels according to their labels and designate each group as a distinct object.

After the input image is segmented, i.e., after at least one object is recognized, the encoder may extract, from the input image, features which are characteristic of the object. The features may be based on the size of the object or on its shape, such as the area of the object, the length and the width of the object, the distance around the perimeter of the object, rectangularity of the object shape, circularity of the object shape. The features associated with each object can be combined to form a feature vector.

At the classification stage, the encoder may classify each recognized object on the basis of the feature vector associated with the respective object. The feature values can be viewed as coordinates of a point in an N-dimensional space (one feature value implies a one-dimensional space, two features imply a two-dimensional space, and so on) , and the classification is a process for determining the sub-space of the feature space to which the feature vector belongs. Since each sub-space corresponds to a distinct object, the classification accomplishes the task of object identification. For example, the encoder may classify a recognized object into one of multiple predetermined classes, such as human, face, animal, automobile or vegetation. In some embodiments, the encoder may perform the classification based on a model. The model may be trained using pre-labeled training data sets that are associated with known objects. In some embodiments, the encoder may perform the classification using a machine learning technology, such as convolutional neural network (CNN) , deep neural network (DNN) , etc.

The result 103 of ROI determination and classification is used to facilitate the operations of image inpainting module 110, image simplification module 120, and/or image retargeting module 130. In particular, based on result 103, image inpainting module 110 removes certain less important objects from input image 101, and replace them with image elements with simple content, so that no holes are left in the image. The less important, removed objects may be small objects or objects repeatedly showing in the image. As described below in connected with the decoder, the removed objects can be synthesized by the decoder and added back to the decoded image. In exemplary embodiments, the image area originally occupied by a removed object may be inpainted by simple content that is represented by a small number of bits. FIG. 2 is a schematic diagram illustrating an image inpainting process 200, according to some exemplary embodiment consistent with the present disclosure. As shown in FIG. 2, determination and classification module 102 determines

objects

212a, 214a, 216a, and 218a from an original image 210a. Image inpainting module 110 removes

objects

212a, 214a, 216a, and 218a, and inpaints the affected image areas using the content and color surrounding the removed objects, to generate an inpainted image 210b.

In some embodiments, the content (e.g., objects) removed by image inpainting module 110 is notified in the bitstream, such that it can be restored at the decoder side. For example, the notification may include a label from a catalogue of common characteristic objects (like car, man, woman, dog, cat, car, etc. ) . Another example is to notify the location of a similar object in the same image or video, by providing a shift vector pointing to the similar object or providing a latent representation of the object replaced by the inpainted area. Other appropriate approaches for notifying the removed content can also be used.

Image simplification module 120 aims to produce content that may be represented with reduced number of bits after compression without significant deterioration of the efficiency of the machine vision tasks executed on decoded (decompressed) image/video. FIG. 3 is a schematic diagram illustrating an image simplification process 300, according to some exemplary embodiment consistent with the present disclosure. As shown in FIG. 3, determination and classification module 102 determines

objects

312 and 314 from an original image 310a. Image simplification module 120 further determines a buffer region 322, a buffer region 324, and a background region 330 from original image 310a. Buffer region 322 surrounds object 312, buffer region 324 surrounds object 314, and background region 330 is the remaining part of original image 310a. Image simplification module 120 simplifies content of background region 330 by reducing the details therein, such that the simplified background region 330 can be represented by a smaller number bits. In contrast, objects 312, 314 and

buffer regions

322, 324 remain unchanged or only slightly simplified, to preserve the precision of the machine vision tasks. For example, for a big-size object, an inner part of the object can be simplified without negatively affecting the performance of the downstream machine vision tasks. As shown in FIG. 3, in simplified image 310b, background region 330 is significantly simplified, while

objects

312, 314 and

buffer regions

322, 324 are largely unchanged. The goal of the image simplification is to produce image or video that can be represented by a small number of bits (in compressed representation) whereas the object (s) in the image or video can still be well analyzed (recognized, detected, tracked, etc. ) by the machine vision tasks.

Image retargeting module 140 reduces the number of rows and columns of an input image or the video frame according to result 103 of ROI determination and classification 102. Specifically, image retargeting module 140 may remove one or more rows or columns of an input image when their do not comprise samples of objects. FIG. 4 is a schematic diagram illustrating an image retargeting process 400, according to some exemplary embodiment consistent with the present disclosure. As shown in FIG. 4, determination and classification module 102 determines

ROIs

412 and 414 from an input image 410. Image retargeting module 140 determines, in image 410, a row 422 and a column 424 that do not collocate or overlap with

ROIs

412 and 414. Image retargeting module 140 retargets image 410 by removing row 422 and column 424, such that image 410 is downscaled and

ROIs

412, 414 occupy a larger portion of the retargeted image. In some embodiments, the removal of the rows and columns is preceded by a local lowpass (i.e., anti-aliasing) filtering. In some embodiments, rows or columns that comprise samples of the objects may be downscaled according to analysis of the objects, without causing deterioration to the machine vision tasks’ ability to recognize the objects.

Referring back to FIG. 1, after input image 101 is processed by one or more of image inpainting module 110, image simplification module 120, and image retargeting module 130, the output is encoded (i.e., compressed) by an encoding module 150. The inpainting, simplification, and retargeting can be applied regardless of the image/video encoding (compression) techniques used by encoding module 150. For example, encoding module 150 may operate according to any video coding standard, such as Advanced Video Coding (AVC) , High Efficiency Video Coding (HEVC) , Versatile Video Coding (VVC) , AOMedia Video 1 (AV1) , Joint Photographic Experts Group (JPEG) , Moving Picture Experts Group (MPEG) , etc. Encoding module 150 outputs an encoded bitstream 160 including the compressed image/video.

Metadata formation module 140 formats ROI metadata 162 for transmission as side information in addition to bitstream 160 produced by encoding module 150. ROI metadata 162 may include information describing the input image processing performed by the inpainting module 110, simplification module 120, and/or retargeting module 130. Various methods of transmitting ROI metadata 162 can be used to secure frame synchronization between ROI metadata 162 and encoded bitstream 160. For example, ROI metadata 162 can be transmitted in a Picture Parameter Set of encoded bitstream 160, by making necessary modification to the existing Picture Parameter Set syntax used in AVC, HEVC and VVC standards of ISO, IEC and ITU. In particular, to avoid ROI metadata 162 from being transmitted out-of-phase with encoded bitstream 160, Picture Order Count (POC) may be used by the encoder and decoder to synchronize ROI metadata 162 with encoded bitstream 160. For other known standards, the transmission of ROI metadata 162 may be organized in a similar way, without any intervention into the high level syntax of encoded bitstream 160. Consistent with the disclosed embodiments, the format of ROI metadata 162 is standardized, so that it can be unambiguously decoded.

Consistent with the disclosed embodiments, encoder 100 can be used to compress a still image or a video. For a video, the frame structure, e.g., the sequence of I, P, B frames, is configured consistently with the respective temporal configuration of inpainting, simplification and retargeting. As known in the art, a frame coded without referencing another frame (i.e., it is its own reference frame) is referred to as an “I-frame. ” A frame coded using a previous frame as a reference frame is referred to as a “P-frame. ” A frame coded using both a previous frame and a future frame as reference frames (i.e., the reference is “bi-directional” ) is referred to as a “B-frame. ”

Consistent with the disclosed embodiments, encoder 100 is scalable because not all of image inpainting module 110, image simplification module 120, and image retargeting module 130 need to be present in each implementation. Omitting one or two of these modules leads to a simplified system with reduced complexity.

Next, the signal path in encoder 100 is described. Input image (s) 101 is/are uncompressed representation of a still image or video, and can be monochrome or color, standard dynamic range (SDR) or high dynamic range (HDR) , in any color space, etc. Input image (s) 101 can be represented by samples including values of color component samples.

Signal 115 is uncompressed (i.e., unencoded) image data. Signal 115 is generated by inpainting an input image 101 in areas corresponding to objects of less interest, such as less important or typical objects, which are described in ROI metadata 162. For example, an object removed by the inpainting process may be described by an object index to a catalogue (database) including a list of predetermined objects. As another example, a removed object may be described by a pointer (in time or space) to the same or similar object in a previous frame or another place in the current frame. As yet another example, a removed object may be described by a latent representation produced by a neural network for the object. Other methods of the object description may be also used.

Signal 125 is uncompressed (i.e., unencoded) image data. Signal 125 is generated by simplifying the content in certain regions of an image, such as a background region which does not contain objects.

Signal 135 is uncompressed (i.e., unencoded) image data. Signal 135 is generated by retargeting an image. The retargeted image has a smaller size than the original image, because the retargeting removes certain rows and/or columns from the original image. Therefore, the retargeted image includes a smaller number of samples than the original image.

Retargeting parameters 132 are output by image retargeting module 130. Retargeting parameters 132 describe the numbers and locations of rows and columns removed in the retargeting process.

Encoded bitstream 160 is generated by encoding module 150. It can be a legacy bitstream generated based on any existing standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc. Depending on whether image inpainting module 110, image simplification module 120, or image retargeting module 130 is included in encoder 100, encoded bitstream 160 may be generated by encoding

signal

115, 125, or 135.

ROI metadata 162 is transmitted as side information in addition to encoded bitstream 160 produced by encoding module 150. Any mechanism of transmission can be used to transmit ROI metadata 162 as long it secures frame synchronization between ROI metadata 162 and encoded bitstream 160. As an example, ROI metadata 162 may be signaled in Picture Parameter Set as used in AVC, HEVC and VVC standards of ISO, IEC and ITU, respectively. Nevertheless, such implementation needs modification to the high level syntax of these standards. In some embodiments, to avoid ROI metadata 162 from being transmitted out-of-phase with encoded bitstream 160, Picture Order Count (POC) can be used in the encoder and decoder to synchronize the transmission of ROI metadata 162 and encoded bitstream 160. For other known standards, the transmission of ROI metadata 162 may be organized in a similar way, without any intervention into the high level syntax of encoded bitstream 160. Consistent with the disclosed embodiments, the format of ROI metadata 162 may be standardized, so that it can be unambiguously decoded.

ROI metadata 162 may include three categories of metadata, as described below.

Metadata Class 1-Metadata defining ROIs that correspond to the potential objects important for the machine vision tasks on the decoder side. A ROI may be defined in various ways depending on the specific implementation. For example, a ROI may be defined as a bounding box around an object. Other known methods for segmenting an ROI from the rest of the image can also be used. The ROI may be shaped as a square, a rectangle, a polygon, etc.

Metadata Class 2-Metadata that describes retargeting and can be used by a decoder to implement inverse retargeting.

Metadata Class 3-Metadata that defines object representations for the objects being removed and replaced by the image inpainting. The removed objects may be defined in any reasonable way, such as indices indicating the object type and/or the basic features of an object (e.g., a size and/or orientation of the removed object) . The removed objects may also be described by other latent representations.

FIG. 5 is a flowchart of an exemplary encoding method 500 for compressing image data, consistent with embodiments of the present disclosure. For example, method 500 may be performed by encoder 100 (FIG. 1) . As shown in FIG. 5, method 500 includes the following steps 502-506.

At step 502, the encoder determines one or more regions of interest (ROIs) in an image. In some embodiments, the encoder may detect one or more objects from the image, and then determine one or more ROIs based on the objects, with each ROI encompassing one or more of the detected objects. In some embodiments, each ROI may also include a buffer region surrounding an object encompassed by the ROI. In some embodiments, the one or more ROIs may be bounding boxes shaped as triangles, squares, rectangles, polygons, circles, etc.

At step 504, the encoder modifies the image by performing, based on the one or more ROIs, at least one of inpainting, simplification, or retargeting of the image. Exemplary methods for image inpainting, simplification, and retargeting are described below in connection with FIGs. 6, 7, and 10, respectively. The image inpainting is used to remove ROIs that are less critical to the performance of the downstream machine vision tasks. The image simplification is used to simplify content of the image regions not including the ROIs. The image retargeting is used to increase the ratio or weight of the ROIs over the entire image.

At step 506, the encoder compresses the modified image. The encoder may perform the compression (i.e., encoding) according to any image data coding standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc., and generate an encoded bitstream. In some embodiments, the encoder may also encode the metadata regarding the ROIs (e.g., ROI metadata 162 in FIG. 1) into the bitstream, to facilitate decoding. For example, the ROI metadata may be signaled in a header portion (e.g., picture parameter set) of the bitstream.

FIG. 6 is a flowchart of an exemplary image inpainting method 600, consistent with embodiments of the present disclosure. For example, method 600 may be performed by encoder 100 (FIG. 1) . As shown in FIG. 6, method 600 includes the following steps 602-608.

At step 602, the encoder determines a first ROI to be removed from an input image.

At step 604, the encoder generates an inpainting image element. The inpainting image element is an image portion used to replace the first ROI. The inpainting image elements has simpler content than the first ROI. Therefore, the number of bits used for representing the inpainting image element is smaller than the number of bits required for representing the first ROI. In some embodiments, the inpainting image element may be generated based on one or more samples surrounding the first ROI. For example, as shown in FIG. 2, objects 212a-212d in original image 210a are replaced by inpainting image elements that resemble the respective objects’ surrounding area.

Referring back to FIG. 6, at step 606, the encoder replaces the first ROI with the inpainting image element, to generate an inpainted image. For example, as shown in FIG. 2, in inpainted image 210b, the areas originally occluded by objects 212a-212d are inpainted by samples resembling the samples surrounding these objects.

Referring back to FIG. 6, at step 608, the encoder generates metadata describing the removed first ROI. For example, the metadata may include an index associated with a class of the removed ROI, a feature vector descriptive of the characteristics of the removed ROI, and/or a shift vector pointing to an unremoved ROI in the image that resemble the removed first ROI.

FIG. 7 is a flowchart of an exemplary image simplification method 700, consistent with embodiments of the present disclosure. For example, method 700 may be performed by encoder 100 (FIG. 1) . As shown in FIG. 7, method 700 includes the following steps 702-704.

In step 702, the encoder determines, in an image, an object, a buffer region surrounding the object, and a background region surrounding the buffer region. For example, as schematically shown in FIG. 8, the encoder detects an object 810 in an image 800, and defines a buffer region 820 surrounding object 810. In the following description, a “buffer region” is also referred to and used interchangeably as a “band. ” Consistent with the disclosed embodiments, a background region is an image region that does not include any object or band. In some embodiments, after the object and band are determined from an image, the encoder determines the rest of the image to be a background region. For example, in FIG. 8, after object 810 and buffer region 820 are determined, the encoder determines the rest of image 800 to be a background region 830, which surrounds buffer region 820.

Although FIG. 8 only shows one object, the encoder may detect multiple objects from an image and define multiple bands respectively surrounding the multiple objects. For example, as schematically shown in FIG. 9, the encoder first identifies an object 910 and an object 920 in an image 900. The encoder then defines a buffer region 912 surrounding object 910, and defines a buffer region 922 surrounding object 920. After these objects and buffer regions are determined, the encoder determines the rest of image 900 to be a background region 930, which surrounds

buffer regions

912 and 922. Although FIG. 8 and FIG. 9 show that the bands (i.e., buffer regions) are surrounding their respective objects entirely, in the disclosed embodiments a band can also be partially surrounding an associated object. Thus, the term “surrounding” used in the present disclosure means completely or partially surrounding an image region (e.g., an image region occupied by an object) . Moreover, some objects may not have respective bands (i.e., buffer regions) . That is, the bands associated with these objects have a zero size.

In some embodiments, to detect or identify the object (s) in an image, the encoder may execute an image segmentation algorithm to partition an input image into multiple segments, i.e., sets of pixels each of which representing a portion of the image. The image segmentation algorithm may assign a label (i.e., category) to each set of pixels, such that pixels with the same label share certain common characteristics. The encoder may then determine whether each image segment corresponds to an object, such as human body, animal, automobile or vegetation. According to some embodiments, the image segmentation classifies each pixel in an image into one of multiple categories. The encoder may perform the classification based on a model. The model may be trained using pre-labeled training data sets that are associated with known objects. In some embodiments, the encoder may perform the classification using a machine learning technology, such as convolutional neural network (CNN) , deep neural network (DNN) , etc.

In the disclosed embodiments, the width of a band can be defined in various ways. For example, the band surrounding an object may have a uniform width or a variable width. As another example, when multiple objects and bands are determined in an image, the multiple bands may have the same width or different widths. The present disclosure does not limit the ways of defining the width of a band.

In some embodiments, the encoder determines a width of a band based on a characteristic dimension of the object surrounded by the band. For example, the encoder may define the width of the band as a percentage of the characteristic dimension, according to the following Equation 1:

BW=kw·L, (1)

In Equation 1, BW denotes the width of a band. L denotes the characteristic dimension of the object surrounded by the band. And kw denotes a width coefficient, which may be a constant (e.g., 0.15) or a variable. The band width BW and/or characteristic dimension L may be defined and measured using Euclidean metric, Manhattan metric, or any other suitable distance metric. As known in the art, the Euclidean metric defines the distance between two points as being equal to the length of a straight line between these two points, while the Manhattan metric defines the distance between two points as being equal to the sum of the absolute differences of the points’ Cartesian coordinates.

Moreover, the above-described band width BW and/or characteristic dimension L may be defined based on an actual size of the object. Unlike the object’s apparent size shown in an image, the object’s actual size is its prospectively adjusted size based on the image. In some embodiments, to determine an object’s actual size, the encoder may determine an apparent size of the object in an image, and a distance between the object and the image sensor (e.g., camara) that captures the image. Various methods can be used to measure the distance between the object and the camera. In one example, the distance may be measured by a distance sensor. Exemplary distance sensors include, but are not limited to, infrared-ray sensor or laser sensor that are designed to measure the distance based on the time that a light signal travels between the object and the camera. As another example, the distance may be determined using the lens formula shown in Equation 2:

In Equation 2, OD denotes the distance between the object and the center of the camera lens, CL denotes the distance between the center of the camera lens to the image, and f denotes the focal length of the camera lens. Equation 2 is based on the assumption and thus is applicable to the situation that OD is much longer than f. After the apparent size of the object in the image and the distance between the object and the camera are determined, the encoder may determine the object’s actual size based on Equation 3:

In Equation 3, AS denotes the object’s actual size, IS denotes the object’s apparent size in the image, OD denotes the distance between the object and the center of the lens of the camera, and f denotes the focal length of the camera lens. The above examples for determine an object’s actual size are for illustrative purpose only, and are not meant to limit the disclosed embodiments.

In some embodiments, an object’s characteristic dimension may be defined as a largest dimension of the object. For example, the characteristic dimension L in Equation 1 may be defined as the largest dimension of the object. Also for example, when the encoder runs a machine vision algorithm that uses bounding boxes to track objects, the largest object dimension L may be defined as the largest dimension of the bounding box associated with the object. Thus, in the example shown in FIG. 8, the encoder may determine object 810’s largest dimension to be a height of a rectangular bounding box (not shown in FIG. 8) surrounding object 810, the height corresponding to the distance from the person’s head to her feet. Similarly, in the example shown in FIG. 9, the encoder may use bounding boxes (not shown in FIG. 9) to find the respective largest dimensions of

objects

910 and 920. The above-described way of defining the characteristic dimension L is given as an example only, and is not meant to limit the disclosed embodiments.

In some embodiments, the band width BW may be dependent on an object class associated with the object. As described above, image segmentation technique may be used to identify object (s) in an image. During this process, the image segmentation technique may classify segments of the image into predetermined categories. Some of the predetermined categories are used to describe the classes of objects. The object classes may be defined based on the object types (e.g., car, building, people, human face, toy, etc. ) . In some embodiments, the parameter kw in Equation 1 may be determined based on an object class, such that wider or narrower band are assigned to various types of objects (e.g., human face, building, etc. ) . For example, as shown in FIGs. 8 and 9, the encoder may classify object 810 to “human” while classify

objects

910 and 920 to “pets. ” Based on the classification, the encoder may assign a wider band to object 810 than to object 910 or 920.

In some embodiments, the band width BW may be a variable dependent of the local properties of object. For example, the parameter kw in Equation 1 may be a variable depending on the position on the object edge. For example, in FIG. 8, the encoder may increase the value of the parameter kw at areas with rich textures (e.g., animal fur, human face) or sharp curvatures (e.g., joints of a human body) .

In some embodiments, the width coefficient kw in Equation 1 may be a variable depending on the position on the object edge, or depending on the local texture or curvature of the object edge. For example, the value of the width coefficient kw may be set proportional to the value of the local curvature. Alternatively or additionally, the width coefficient kw may be based on the color of the object and/or the color of the background region. For example, the value of the width coefficient kw may be set proportional to the difference between the object’s color hue and the background region’s color hue.

Referring back to FIG. 7, in step 704, the encoder simplifies at least content of the background region. The modifying of the image data reduces the number of bits for representing the compressed image. Specifically, the encoder may modify the image data by simplifying content of the background region, to reduce the amount of bits for representing the background region in the compressed image.

The encoder can simplify the content of the background region in various ways. In some embodiments, the encoder may perform low-pass filtering on image data representing at least a part of the background region. The low-pass filtering may use a constant cut-off frequency. Alternatively, the low-pass filtering may use a variable cut-off frequency inversely correlating to a distance from a nearest band.

In some embodiments, the encoder may modify the image content in the background region, e.g., decaying the image content (i.e., decreasing the brightness and color contrast of the image content) . For example, if the distance d between a pixel in the background region and a nearest object is smaller than a predefined value B, the image data representing the pixel in the background region may be modified according to Equations 4-6:

Y′=0.5+ (Y-0.5) (1- (d-BW) /B) , (4)

CR′= (CR) (1- (d-BW) /B) , (5)

CB′= (CB) (1- (d-BW) /B) (6)

In Equations 4-6, Y is the normalized luma component of a pixel value before modification, and CB and CR are the normalized chroma components of the pixels before modification. The values of Y, CB, and CR satisfy the following the ranges: 0≤Y≤1, -0.5≤CB, CR≤0.5. Moreover, Y’, CB’, and CR’ stand for the respective luma and chroma values after the modification. BW stands for the width of the band. In contrast, if the distance d from the nearest object is greater than BW+B, the image data representing the background region may be modified according to Equations 7-9:

Y′=0.5, (7)

CR’ = 0, (8)

CB′=0 (9)

Moreover, if the distance d from the nearest object is between B and BW+B (i.e., B ≤d≤BW+B) , the image data is unchanged.

In some embodiments, the preprocessing may set the values of the pixels in the entire background region according to Equations 7-9.

In some embodiments, the encoder may also modify the values of pixels located in the object, to reduce the number of bits for representing the object in the compressed image. For example, the encoder may reduce the size of the object according to a scaling factor sc. The scaling factor has a value less than 1. The encoder may set the value of the scaling factor based on an object class associated with the object. The scaling factor may be included in the encoded bitstream that is transmitted to a decoder, so that the decoder can use the scaling factor to return the object to its original size. Alternatively, the value of the scaling factor may be predetermined; thus, there is no need to transmit the scaling factor to the decoder. As another example, the encoder may perform low-pass filtering on image data representing a non-edge part of the object. The non-edge part is a part that does not include edges of the object, and is also referred to as an inner part of the object. The low-pass filtering removes high-frequency components of the pixel values, thereby reducing the number of bits for representing the object. Since the non-edge part is less important in recognizing and tracking an object, performing low-pass filtering on the non-edge part may not reduce the efficiency of machine vision tasks.

In some embodiments, the encoder may also modify the values of pixels located in the band, to reduce the number of bits for representing the band in the compressed image. The encoder can perform the modification in various ways. In some embodiments, if the encoder reduces the size of the object surrounded by the band, the encoder may also reduce the size of the band using the same scaling factor. In some embodiments, the encoder may perform low-pass filtering on image data representing at least a part of the band. For example, the low-pass filtering may be performed on a part of the band that does not contact the object’s edge.

FIG. 10 is a schematic diagram illustrating a process 1000 for performing retargeting and inverse retargeting of an image, according to some embodiments consistent with the present disclosure. As shown in FIG. 10, an original image 1010 includes three ROIs: an ROI 1012 encompassing an object 1011 (i.e., house) , an ROI 1014 encompassing an object 1013 (i.e., cloud) , and a background ROI 1016 covering the entire area of original image. Background ROI 1016 includes ROI 1012, ROI 1014, and the background of original image 1010. For example, the scaling factors associated with

ROIs

1012, 1014, and 1016 may be set as 1.0, 1.0, and 2.0, respectively. This means that

ROIs

1012 and 1014 cannot be scaled down, while one or both of the width and height of ROI 1016 can be scaled shown by a factor of 2.0. Thus, retargeting image 1010 may result in a retargeted image 1030, in which

ROIs

1032 and 1034 have the same sizes as those of

ROIs

1012 and 1014 respectively, while ROI 1036 is scaled down from ROI 1016 by a factor of 2.0. Image 1030 can also be inverse retargeted to restore the original image size. Namely, after the inverse retargeting,

ROIs

1052, 1054, and 1056 in the resulted image 1050 have the same sizes as those of

ROIs

1012, 1014, and 1016 in the original image 1010, respectively.

As shown by the example in FIG. 10, the retargeting is performed by downscaling one or more ROIs, and the inverse retargeting is performed by stretching (i.e., upscaling) one or more ROIs. The downscaling and upscaling are performed based on the positions and rescaling factors of the one or more ROIs.

Next, a more detailed description about an exemplary image retargeting method is provided.

FIG. 11 is a flowchart of an exemplary image retargeting method 1100, consistent with embodiments of the present disclosure. For example, method 1100 may be performed by encoder 100 (FIG. 1) . As shown in FIG. 11, method 1100 includes the following steps 1102-1110.

Still referring to FIG. 11, at step 1102, the encoder determines downscaling factors associated with one or more ROIs in an input image. Each ROI may be a rectangle or square, or any other shape. After that, the encoder further determines the position information and downscaling factor associated with each of the ROIs. The encoder may determine the position information based on each ROI’s coordinates in the input image, and/or each ROI’s size (e.g., width and height) . Moreover, the encoder may determine the downscaling factor associated with each ROI based on its class. In particular, the encoder may classify the one or more ROIs by classifying the objects associated with the ROIs. For example, the encoder may determine each ROI to have the same class as its associated object. Thus, the class is also referred to as “object/ROI class. ” An ROI’s downscaling factor may be set to be correlated to its class. This is because object/ROI sizes and classes may impact machine vision tasks’ precision. For example, when a neural network is trained for recognizing a particular class of objects/ROIs (e.g., human faces) with a dimension of 224×224 pixels, performing inference for objects/ROIs in the trained size (i.e., 224×224 pixels) usually results in the highest precision. Thus, each object/ROI class has a corresponding target object size/ROI that can maximize the machine vision tasks’ precision in analyzing objects/ROIs belonging to the particular class. For example, the following Tables 1-3 show the possible bitrate gains achieved by downscaling various object/ROI classes. The bitrate gain in Table 1 is measured by the quality of detection with respect to score parameter from a neural network used for machine vision task. The bitrate gain in Table 2 is measured by the quality of detection with respect to the best attained Intersection over Union (IoU) . The bitrate gain in Table 3 is measured by the quality of detection with respect to matching categorization with threshold of IoU=0.5. In Tables 1-3, the values ranging from 0 to 1 indicate the ratio of a downscaled object/ROI size versus its original size. As shown in Tables 1-3, each object/ROI class has certain size (s) that can maximize the bitrate gain.

Table 1: Quality of detection versus object/ROI size, with respect to score parameter from neural network

Tabel 2: Quality of detection versus object/ROI size, with respect to best attained IoU

Tabel 3: Quality of detection versus object/ROI size, with respect to matching categorization with threshold of IoU=0.

Based on the relationship between object/ROI classes and their corresponding target object/ROI sizes, the encoder may determine a minimum allowable size for each of the one or more ROIs detected from the input image. The encoder may further determine a downscaling factor for each ROI based on its original size in the input image and its minimum allowable size. For example, the downscaling factor may be set equal to the ratio of the original size over the minimum allowable size.

At step 1104, the encoder partitions the input image into a plurality of blocks using a grid of horizontal lines and vertical lines. The encoder may create the grid based on the determined one or more ROIs. For example, each horizontal grid line is aligned with at least one horizontal edge of the one or more ROIs, and each vertical grid line is aligned with at least one vertical edge of the one or more ROIs.

FIG. 12 is a schematic diagram illustrating an input image 1210 partitioned into a plurality of blocks by a grid, according to some exemplary embodiments. As shown in FIG. 12, input image 1210 includes three ROIs: an ROI 1212, an ROI 1214, and a background ROI 1216. The downscaling factor associated each ROI may have a horizontal component and a vertical component. The horizontal component is used to downscale the width of the ROI, while the vertical component is used to downscale the height of the ROI. For example, ROI 1212 has S_horizontal=S_vertical=1; ROI 1214 has S_horizontal=2 and S_vertical=3; and ROI 1216 has S_horizontal=5 and S_vertical=6. Moreover, a plurality of horizontal grid lines 1220 and a plurality of vertical grid lines 1240 are used to partition input image 1210 into a plurality of blocks 1250. Horizontal grid lines 1220 and vertical grid lines 1240 may be created based on the edges of the ROIs. Specifically, each horizontal grid line 1220 is aligned with at least one horizontal edge of ROIs 1212-1216, and each vertical grid line 1240 is aligned with at least one vertical edge of ROIs 1212-1216, such that each edge of ROIs 1212-1216 is aligned with a horizontal grid line 1220 or a vertical grid line 1240.

Referring back to FIG. 11, at step 1106, the encoder determines an allowable downscaling factor for each of the plurality of blocks respectively. The allowable downscaling factor are target downscaling factors used for downscaling the respective blocks, and are made smaller than or equal to the downscaling factors associated with the one or more ROIs, such that downscaling the blocks based on the allowable downscaling factors will satisfy the allowed levels of downscaling for all the ROIs.

To illustrate the way to determine an allowable downscaling factor for each of the plurality of blocks respectively, an exemplary method is described below. Specifically, the method starts with the encoder assigning an initial downscaling factor to each of the plurality of blocks. The initial downscaling factor for a block may be set to be equal to the smallest downscaling factor associated with the ROIs that overlap with the respective block. This step can be performed independently for horizontal (FIG. 13A) axis and vertical axis (FIG. 13B) if the ROIs use different downscaling factors for their horizontal edges and vertical edges. FIG. 13A is schematic diagram showing a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 12, according to some exemplary embodiments. As shown in FIG. 13A, blocks overlapping with both ROI 1212 and ROI 1216 are assigned S_horizontal=min (1; 5) =1; blocks overlapping with both ROI 1214 and ROI 1216 are assigned S_horizontal=min (2; 5) =2; and the rest of the blocks only overlap with ROI 1216 and therefore are assigned S_horizontal=5. FIG. 13B is schematic diagram showing a process for assigning initial horizontal scaling factors to the plurality of blocks partitioned in FIG. 12, according to some exemplary embodiments. As shown in FIG. 13B, blocks overlapping with both ROI 1212 and ROI 1216 are assigned S_vertical=min (1; 6) =1; blocks overlapping with both ROI 1214 and ROI 1216 are assigned S_vertical=min (3; 6) =3, and the rest of the blocks only overlap with ROI 1216 and therefore are assigned S_horizontal=6.

After the initial downscaling factors are assigned, the encoder determines a commonly-agreeable horizontal downscaling factor for each column of the blocks (FIG. 14A) , and determines a commonly-agreeable vertical downscaling factor for each row of the blocks (FIG. 14B) . FIG. 14A is schematic diagram showing a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 12, according to some exemplary embodiments. As shown in FIG. 14A, the commonly-agreeable horizontal downscaling factor for a given column is calculated as a minimum of all horizontal downscaling factors in this column. For example, from left to right, the commonly-agreeable horizontal downscaling factors for the columns are 5, 1, 5, 2, and 5, respectively. This means that the first column from left can be downscaled horizontally by a factor of 5; the second column from left cannot be downscaled horizontally and must remain in the same horizontal dimension; the third column from left can be downscaled horizontally by a factor of 5; the fourth column from left can be downscaled horizontally by a factor of 2; and the fifth column from left can be downscaled horizontally by a factor of 5. On the other hand, FIG. 14B is schematic diagram showing a process for determining a commonly-agreeable horizontal downscaling factor for each column of the blocks in FIG. 12, according to some exemplary embodiments. As shown in FIG. 14B, the commonly-agreeable vertical downscaling factor for a given row is calculated as a minimum of all vertical downscaling factors in this row. For example, from top to bottom, the commonly-agreeable vertical downscaling factors for the rows are 6, 3, 1, 1, and 6, respectively. This means that the first row from top can be downscaled horizontally by a factor of 6; the second row from top can be downscaled horizontally by a factor of 3; the third and fourth rows from top cannot be downscaled horizontally and must remain in the same horizontal dimensions; and the fifth row from top can be downscaled horizontally by a factor of 6.

To summarize the above-described process for determining the allowable downscaling factors for the plurality of blocks, the encoder may determine one or more ROIs horizontally collocated with a block; and then determine an allowable vertical downscaling factor for the block to be equal to a smallest vertical downscaling factor associated with the one or more ROIs horizontally collocated with the block. And, independently, the encoder may determine one or more ROIs vertically collocated with the block; and then determine an allowable horizontal downscaling factor for the block to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs vertically collocated with the block.

Equivalently, since the plurality of blocks is organized in rows and columns, the above process for determining the allowable downscaling factor can also be performed according to the rows and columns of the blocks. Specifically, the encoder may determine one or more ROIs collocated with a row of blocks; and then determine an allowable vertical downscaling factor for the row of blocks to be equal to a smallest vertical downscaling factor associated with the one or more ROIs collocated with the row of blocks. And, independently, the encoder may determine one or more ROIs collocated with a column of blocks; and then determining an allowable horizontal downscaling factor for the column of blocks to be equal to a smallest horizontal downscaling factor associated with the one or more ROIs collocated with the column of blocks.

Referring back to FIG. 11, at step 1108, the encoder performs at least one of horizontal downscaling of vertical downscaling of the plurality of blocks based on the respective allowable downscaling factors. Specifically, the encoder may vertically downscale each block based on the respective allowable vertical downscaling factor. Alternatively or additionally, the encoder may horizontally downscale each block based on the respective allowable horizontal downscaling factor.

For example, as schematically shown in FIG. 15, to perform the downscaling, the grid generated in FIG. 15 can be first downscaled. Each block formed by the downscaled grid has a width and height according to Equations 10 and 11:

rescaled_width = original_width /min_in_column (S_horizontal) (10)

rescaled _height = original_height /min_in_row (S_vertical) (11)

In Equations 10 and 11, the rescaled width and height, i.e., rescaled_width and rescaled _height, can be calculated using any rounding method, such as rounding towards zero, nearest integer, nearest power of 2, etc. The rescaled width and height can also be calculated using a maximum clipping method, i.e., limiting the maximal size of the individual blocks. rescaled width and height can also be calculated using a minimal clipping method, i.e., limiting the minimal size of the individual blocks

After the grid is downscaled, contents in the original input image can be retargeted by downscaling the image contents in the original grid to fit to the corresponding blocks of the downscaled grid, as schematically shown in FIG. 16.

As shown in FIG. 15, the downscaled grid can also be upscaled (i.e., inversely downscaled) to restore the original grid. Correspondingly, as shown in FIG. 16, the retargeted image can be inversely retargeted to restore the original image.

In some embodiments, instead of downscaling each block individually, the downscaling can be performed on a column or a row of blocks simultaneously. Specifically, the encoder may vertically downscale a row of blocks based on the respective allowable vertical downscaling factor. Alternatively or additionally, the encoder may horizontally downscale a column of blocks based on the respective allowable vertical downscaling factor.

Consistent with the disclosed embodiments, the downscaling can be performed using any suitable technique as known in the art, such as nearest neighbor down-sampling, linear down-sampling, bilinear down-sampling, cubic down-sampling, bicubic down-sampling, spline down-sampling, polynomial down-sampling, affine transformation, graphic card (GPU) based texture down-sampling.

Referring back to FIG. 11, at step 1110, the encoder generates parameters used in the retargeting. For example, the retargeting parameters may include downscaling factors used in the retargeting process.

FIG. 17 is a block diagram illustrating an exemplary architecture for a VCM decoder 1700, consistent with embodiments of the present disclosure. As shown in FIG. 17, the input of decoder 1700 includes an encoded bitstream 1701 and ROI metadata 1702. For example, encoded bitstream 1701 and ROI metadata 1702 may be generated by and transmitted from an encoder, such as encoded bitstream 160 and ROI metadata 162 in FIG. 1. As shown in FIG. 17, decoder 1700 includes a decoding module 1710, an inverse retargeting module 1720, and an ROI synthesis module 1730.

As described above, the disclosed encoding and decoding methods are blind to the image/video compression technology. Thus, decoding module 1710 may perform the decoding according to any existing standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc.

Inverse retargeting module 1720 restores the original size (e.g., the original numbers of rows and columns) of an image or video frame. Inverse retargeting module 1720 may use Metadata Class 2 that includes retargeting parameters generated during the retargeting process at the encoder side. Moreover, Metadata Class 1 can be used to provide the information on ROIs. For a video sequence, the frame sizes after inverse retargeting may be set constant. For example, as required by some video coding standards, frame size needs to remain unchanged for a whole sequence starting with an Instantaneous Decoder Refresh (IDR) frame.

ROI synthesis module 1730 uses Metadata Class 1 and/or Class 3 to restore the objects removed by inpainting. Metadata Class 1 defines ROIs that correspond to the objects, including objects removed by inpainting. Metadata Class 3 defines object representations for the objects being removed and replaced by inpainting. The metadata may be defined in any reasonable way, e.g., as indices indicative of an object type and/or the basic features of an object (e.g., size and orientation) . The metadata can also include latent representation of an object. In some embodiments, the restored objects may be an approximate to the original objects. For example, the restored objects may retain the main features of the original objects, to allow for proper object detection, recognition, classification, or tracking by the downstream machine vision tasks.

Next, the signal path in decoder 1700 is described.

Encoded bitstream 1702 is generated by and received from an encoder, e.g., encoder 100 (FIG. 1) . Encoded bitstream 1702 can be a legacy bitstream generated based on any existing standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc.

ROI metadata 1702 may include one or more of Metadata Classes 1-3, which are described above in connection with encoder 100 (FIG. 1) .

Signal 1715 includes decoded (i.e., decompressed) image data. Some objects (e.g., objects removed by inpainting) are still absent in signal 1715. Moreover, the image/frame size represented by signal 1715 may be reduced with respect to the original image/frame size.

Signal 1725 includes images or video frames in their original size (e.g., the original numbers of rows and columns) , which is restored by inverse retargeting 1720.

Decoded image/video 1740 includes images or video frames in their original size and having the inpainted objects restored by ROI synthesis 1730.

As described above, the disclosed encoding and decoding architecture can be used to compress both still images and videos. For video processing, the following temporal setup may be used for the disclosed encoder and decoder. For example, the disclosed encoder and decoder are made compliant with configurations of the standard video scheme, with respect to the basic video coding parameters (e.g., frame size) . For an I-frame (and IDR frame in particular) , the above-described procedure may be used for encoding and decoding. And for other frame types (e.g., P-frames or B-frames) , the following rules may be observed.

In some embodiments, the set of objects removed by inpainting is fixed for all frames of a sequence (e.g., the full interval between IDR images) . And the inpainted content may remain unchanged for all frames of a sequence (e.g., the full interval between IDR images) . For example, at the encoder side, a same inpainting image element may be used for the sequence of frames. And at the decoder side, a same ROI is synthesized for the sequence of frames.

Simplified background may remain unchanged for all frames of a sequence (e.g., the full interval between IDR images) . For example, in the image simplification performed at the encoder side, a same simplified background region may be used for the sequence of frames.

For retargeting and inverse retargeting, the object-related ROIs may be defined by the sum of pixels of all positions of the object in a sequence. The retargeting and inverse retargeting each are performed in the same way for all frames of a video sequence. For example, at the encoder side, the same set of ROIs and associated downscaling factors are used in each of the sequence of frames to perform the image retargeting. And at the decoder side, the same set of ROIs and associated downscaling factors are used in each of the sequence of frames to perform the inverse retargeting.

FIG. 18 is a flowchart of an exemplary decoding method 1800, consistent with embodiments of the present disclosure. For example, method 1800 may be performed by performed by a VCM decoder, such as decoder 1700 (FIG. 17) . As shown in FIG. 18, decoding method 1800 includes the following steps.

At step 1802, the decoder receives an encoded bitstream. The encoded bitstream includes compressed image data and metadata descriptive of: one or more regions of interest (ROIs) in an uncompressed image (i.e., the original image) , and a modification made to the uncompressed image. For example, the metadata may indicate the positions, shapes, or classes of the ROIs. Moreover, the modification made to the uncompressed image may include at least one of image inpainting, simplification, or retargeting. For example, the metadata may include downscaling factors used for downscaling the ROIs and retargeting the uncompressed image. As another example, the metadata may include information descriptive of one or more ROIs removed from the uncompressed image by inpainting, such as an index associated with a class of the removed ROI, a feature vector descriptive of the removed ROI, or a shift vector pointing to an unremoved ROI in the uncompressed image.

At step 1804, the decoder decodes the bitstream to generate a reconstructed image. The decoder may decode the bitstream based on any existing image or video coding standard, such as AVC, HEVC, VVC, AV1, JPEG, MPEG, etc.

At step 1806, the decoder restores the uncompressed image by performing, based on the metadata, at least one of inverse retargeting or ROI synthesis on the reconstructed image.

To perform the inverse retargeting, the decoder first locates, based on the metadata, the one or more ROIs in the reconstructed image. The decoder then partitions the reconstructed image into a plurality of blocks using a grid of horizontal lines and vertical lines. Each horizontal line of the grid is aligned with at least one horizontal edge of the one or more ROIs, and each vertical line of the grid is aligned with at least one vertical edge of the one or more ROIs. The decoder further determines an allowable downscaling factor for each of the plurality of blocks respectively. The allowable downscaling factor is smaller than or equal to the one or more downscaling factors associated with the one or more ROIs. Finally, the decoder horizontally or vertically upscaling each of the plurality of blocks based on the respective allowable downscaling factor. The inverse retargeting process is illustrated in FIG. 10, which shows that an original image 1010 is retargeted at the encoder side, and then inversely retargeted at the decoder side to restore the original size.

To perform the ROI synthesis, the decoder determines, based on the metadata, an inpainting image element in the reconstructed image. The decoder then synthesizes a first ROI based on the metadata. Finally, the decoder replaces the inpainting image element with the first ROI. FIG. 19 is a schematic diagram illustrating an exemplary process 1900 for performing inpainting and ROI synthesis of an image, consistent with embodiments of the present disclosure. As shown in FIG. 19, an original image 1910 includes, among others, objects (i.e., ROIs) 1912, 1914, 1916, and 1918. At the encoder side, image inpainting is performed on original image 1910, in which inpainting

image elements

1932, 1934, 1936, and 1938 with synthetic content are used to replace

objects

1912, 1914, 1916, and 1918, respectively. The resulted inpainted image 1930 is then compressed by the encoder. At the decoder side, ROI synthesis is performed on inpainted image 1930, in which inpainting

image elements

1932, 1934, 1936, and 1938 are replaced by

ROIs

1952, 1954, 1956, and 1958, respectively.

ROIs

1952, 1954, 1956, and 1958 include objects that are in the same or similar classes as

objects

1912, 1914, 1916, and 1918.

Next, a hardware environment for implementing the disclosed encoder and decoder is described. FIG. 20 shows a block diagram of an exemplary apparatus 2000 for encoding or decoding image data, according to some embodiments of the present disclosure. As shown in FIG. 20, apparatus 2000 can include processor 2002. When processor 2002 executes instructions described herein, apparatus 2000 can become a specialized machine for processing the image data. Processor 2002 can be any type of circuitry capable of manipulating or processing information. For example, processor 2002 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like. In some embodiments, processor 2002 can also be a set of processors grouped as a single logical component. For example, as shown in FIG. 20, processor 2002 can include multiple processors, including processor 2002a, processor 2002b, ..., and processor 2002n.

Apparatus 2000 can also include memory 2004 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like) . For example, as shown in FIG. 20, the stored data can include program instructions (e.g., program instructions for implementing the processing of image data) and image data to be processed. Processor 2002 can access the program instructions and data for processing (e.g., via bus 2010) , and execute the program instructions to perform an operation or manipulation on the data. Memory 2004 can include a high-speed random-access storage device or a non-volatile storage device. In some embodiments, memory 2004 can include any combination of any number of a random-access memory (RAM) , a read-only memory (ROM) , an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a security digital (SD) card, a memory stick, a compact flash (CF) card, or the like. Memory 2004 can also be a group of memories (not shown in FIG. 20) grouped as a single logical component.

Bus 2010 can be a communication device that transfers data between components inside apparatus 2000, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.

For ease of explanation without causing ambiguity, processor 2002 and other data processing circuits (e.g., analog/digital converters, application-specific circuits, digital signal processors, interface circuits, not shown in FIG. 20) are collectively referred to as a “data processing circuit” in this disclosure. The data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware. In addition, the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 2000.

Apparatus 2000 can further include network interface 2006 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) . In some embodiments, network interface 2006 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.

In some embodiments, optionally, apparatus 2000 can further include peripheral interface 2008 to provide a connection to one or more peripheral devices. As shown in FIG. 20, the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.

In some embodiments, a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder) , for performing the above-described methods. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. The device may include one or more processors (CPUs) , an input/output interface, a network interface, and/or a memory.

The disclosed systems and methods for VCM encoding and decoding can be implemented on top of the general image/video coding technology. The solution is blind to the image/video coding technology applied. That is, any current or future image/video coding technology may be applied.

The disclosed systems and methods are agnostic to and compatible with the general image/video compression technology. Moreover, the disclosed systems and methods are efficient for low and very low bitrates where anchor technology (i.e., image compression technique) fails to provide images and videos appropriate for satisfactory efficiency of the machine vision tasks. For example, the image compression technology can be the state-of-the-art video codec VVC (Versatile Video Codec standardized by ISO/IEC as 23090-3 and by ITU as Rec. H. 266, where ISO stands for International Organization for Standardization, IEC for International Electrotechnical Commission and ITU stands for International Telecommunication Union) . The usage of VVC provides good results for relatively light compression where the efficiency of machine vision tasks executed on decoded images/video is nearly the same as for these tasks executed on the original (uncompressed) images/video. However, for low and very low bitrates, when executed on VVC-decoded images/video, the performance of machine vision tasks is unsatisfactory, resulting in low precision of object detection, instant segmentation, etc. FIGs. 21A and 21B are plots of mean average precision of detection and instant segmentation, respectively, for an example still image decoded from VVC bitstreams (the dash line plots marked as “anchor” ) compared with an improvement obtained by the disclosed approach (the solid line plots marked as “proposed” ) . As shown in FIGs. 21A and 21B, the VVC bitstreams are associated with low average precision at the bitrates below 0.1 bit/pixel for an example still image. This situation can be improved by the disclosed approach.

The disclosed solution is also scalable. One or two major building blocks (inpainting, simplification, retargeting) may be omitted thus reducing the complexity of VCM coding.

Moreover, the disclosed solution allows for control of the balance between quality of the output (decoded) image/video perceived by humans and the efficiency of the machine vision tasks executed on the output (decoded) image/video.

Additionally, the disclosed encoder and decoder are compatible with the ISO/IEC standard on Immersive Video Coding (MIV –MPEG-I Part 12 of ISO/IEC) . The disclosed architecture can be used to compress a single 2D video in Immersive Video Coding (MIV) .

It should be noted that, the relational terms herein such as “first” and “second” are used only to differentiate an entity or operation from another entity or operation, and do not require or imply any actual relationship or sequence between these entities or operations. Moreover, the words “comprising, ” “having, ” “containing, ” and “including, ” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.

As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention should only be limited by the appended claims.

Claims

An encoding method, comprising:

determining one or more regions of interest (ROIs) in an image;

modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following:

modifying a first ROI based on a portion of the image surrounding the first ROI;

reducing content of a second ROI; or

reducing a size of a third ROI; and

compressing the modified image.
The method of claim 1, wherein one of the one or more ROIs comprises a background region of the image.
The method of claim 1-2, wherein the second ROI and the third ROI are the same.
The method of claim 1, further comprising generating metadata descriptive of:

the one or more ROIs, and

a modification to the image.
The method of claim 4, wherein the metadata comprises:

at least one of a position, a shape, or a class of an ROI.
The method of any of claim 4 or 5, wherein the metadata comprises:

a downscaling factor associated with reducing the size of the third ROI.
The method of any of claim 4-6, wherein the metadata comprises:

information descriptive of the first ROI.
The method of claim 7, wherein the metadata descriptive of the first ROI comprises at least one of:

an index associated with a class of the first ROI,

a feature vector descriptive of the first ROI, or

a shift vector pointing to a fourth ROI in the image.
The method of any of claims 4-8, further comprising:

outputting a bitstream comprising the compressed image and the metadata, wherein the metadata is signaled in a header portion associated with the compressed image.
The method of any of claims 1-9, wherein modifying the first ROI comprises:

replacing the first ROI with an inpainting image element, wherein the first ROI and the inpainting image element are represented by a first number of bits and a second number of bits respectively, the second number being smaller than the first number.
The method of claim 10, further comprising:

generating the inpainting image element based on one or more samples surrounding the first ROI.
The method of any of claims 1-11, wherein the second ROI comprises a background region, and reducing the content of the second ROI comprises:

determining, in the image, the background region based on the one or more ROIs; and

simplifying content of the background region to reduce a number of bits for representing the background region.
The method of claim 12, wherein simplifying the content of the background region comprises:

low-pass filtering at least a part of the background region.
The method of claim 13, wherein the low-pass filtering uses a variable cut-off frequency inversely correlating to a distance between a sample in the background region and an object nearest to the sample in the background region.
The method of any of claims 12-14, wherein the second ROI comprises:

one or more objects; and

one or more buffer regions respectively surrounding the one or more objects.
The method of claim 15, wherein the one or more buffer regions comprise a first buffer region, and reducing the content of the second ROI comprises:

low-pass filtering at least a part of the first buffer region.
The method of any of claim 15 or 16, wherein the one or more objects comprise a first object, and reducing the content of the second ROI comprises:

low-pass filtering a non-edge part of the first object.
The method of claim 16, wherein the one or more objects comprise a first object surrounded by the first buffer region, and reducing the content of the second ROI comprises:

downscaling the first object and the first buffer region.
The method of any of claims 1-18, wherein reducing the size of the third ROI comprises:

determining at least one of a row or a column of the image, wherein the at least one of row or column does not collocate with the one or more ROIs; and

removing the at least one of row or column from the image.
The method of any of claims 1-18, wherein reducing the size of the third ROI comprises:

defining at least one of the one or more ROIs to have a rectangular shape; and

partitioning the image into a plurality of blocks using a grid of horizontal lines and vertical lines, wherein at least one of the horizontal lines of the grid is aligned with at least one horizontal edge of the one or more ROIs, and at least one of the vertical lines of the grid is aligned with at least one vertical edge of the one or more ROIs; and

horizontally or vertically downscaling at least one of the plurality of blocks.
The method of claim 20, wherein horizontally or vertically downscaling at least one of the plurality of blocks comprises:

determining a target downscaling factor for at least one of the plurality of blocks respectively, wherein the target downscaling factor is smaller than or equal to the one or more downscaling factors associated with the one or more ROIs; and

horizontally or vertically downscaling the at least of the plurality of blocks based on the respective target downscaling factor.
The method of any of claims 1-18, wherein the image is an I frame of a sequence of frames, and the method further comprises at least one of:

applying a same inpainting image element for the sequence of frames;

applying a same simplified background region for the sequence of frames; or

determining the one or more ROIs from at least one of the sequence of frames, and retargeting the sequence of frames based on the one or more ROIs and one or more downscaling factors respectively associated with one or more ROIs, wherein the one or more ROIs and the one or more downscaling factors remain same across the sequence of frames.
A device comprising:

a memory storing instructions; and

one or more processor configured to execute the instructions to cause the device to performing operations comprising:

determining one or more regions of interest (ROIs) in an image;

modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following:

modifying a first ROI based on a portion of the image surrounding the first ROI;

reducing content of a second ROI; or

reducing a size of a third ROI; and

compressing the modified image.
A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:

determining one or more regions of interest (ROIs) in an image;

modifying the image based on the one or more ROIs, wherein modifying the image comprises at least one of the following:

modifying a first ROI based on a portion of the image surrounding the first ROI;

reducing content of a second ROI; or

reducing a size of a third ROI; and

compressing the modified image.
A decoding method, comprising:

decoding a bitstream comprising metadata descriptive of:

one or more regions of interest (ROIs) in an uncompressed image, and

a modification made to the uncompressed image;

determining a reconstructed image; and

restoring the uncompressed image by restoring content or a size of a ROI based on the metadata.
The method of claim 25, wherein the metadata comprises:

at least one of a position, a shape, or a class of an ROI.
The method of any of claim 25 or 26, wherein the metadata comprises:

a downscaling factor associated with the one or more ROIs.
The method of any of claim 25-27, wherein the metadata comprises:

information descriptive of an ROI removed from the uncompressed image.
The method of claim 28, wherein the metadata comprises at least one of:

an index associated with a class of the removed ROI,

a feature vector descriptive of the removed ROI, or

a shift vector pointing to an unremoved ROI in the uncompressed image.
The method of any of claims 25-29, wherein restoring a size of a ROI comprises:

locating, based on the metadata, the ROI in the reconstructed image; and

partitioning the reconstructed image into a plurality of blocks using a grid of horizontal lines and vertical lines, wherein at least one of the horizontal lines of the grid is aligned with at least one horizontal edge of the ROI, and at least one of the vertical lines of the grid is aligned with at least one vertical edge of the ROI; and

horizontally or vertically upscaling at least one of the plurality of blocks.
The method of claim 30, wherein horizontally or vertically upscaling at least one of the plurality of blocks comprises:

determining a target downscaling factor for at least one of the plurality of blocks respectively, wherein the target downscaling factor is smaller than or equal to the one or more downscaling factors associated with the ROI; and

horizontally or vertically upscaling the at least one of the plurality of blocks based on the respective target downscaling factor.
The method of any of claims 25-29, wherein restoring content of a ROI comprises:

determining, based on the metadata, an inpainting image element in the reconstructed image;

synthesizing the ROI based on the metadata; and

replacing the inpainting image element with the ROI.
The method of any of claims 25-32, wherein the image is an I frame of a sequence of frames, and the method further comprises at least one of:

synthesizing a same ROI for the sequence of frames; or

determining the one or more ROIs from at least one of the sequence of frames, and performing inverse retargeting on the sequence of frames based on the one or more ROIs and one or more downscaling factors respectively associated with one or more ROIs, wherein the one or more ROIs and the one or more downscaling factors remain same across the sequence of frames.
A device comprising:

a memory storing instructions; and

one or more processor configured to execute the instructions to cause the device to performing operations comprising:

decoding a bitstream comprising metadata descriptive of:

one or more regions of interest (ROIs) in an uncompressed image, and

a modification made to the uncompressed image;

determining a reconstructed image; and

restoring the uncompressed image by restoring content or a size of a ROI based on the metadata.
A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:

decoding a bitstream comprising metadata descriptive of:

one or more regions of interest (ROIs) in an uncompressed image, and

a modification made to the uncompressed image;

determining a reconstructed image; and

restoring the uncompressed image by restoring content or a size of a ROI based on the metadata.