WO2024066123A1 - Lossless hierarchical coding of image feature attributes - Google Patents

Lossless hierarchical coding of image feature attributes

Info

Publication number
WO2024066123A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
image features
partitioning
feature
medoid
Prior art date
Application number
PCT/CN2022/144351
Other languages
French (fr)
Inventor
Marek Domanski
Slawomir Mackowiak
Olgierd Stankiewicz
Slawomir ROZEK
Tomasz Grajek
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd. filed Critical Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2024066123A1 publication Critical patent/WO2024066123A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N 19/10-H04N 19/85, e.g. fractals
    • H04N 19/94 Vector quantisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • The present disclosure generally relates to image or video processing and, more particularly, to methods and systems for representing, encoding, and decoding attributes of image features.
  • machine vision applications, such as object recognition and tracking, face recognition, image/video search, mobile augmented reality (MAR), autonomous vehicles, Internet of Things (IoT), image matching, 3-dimensional structure construction, stereo correspondence, and motion tracking, focus on the analysis and usage of features extracted from the image/video data, rather than the subjective quality.
  • Embodiments of the present disclosure relate to methods of representing and coding (i.e., encoding or decoding) feature attributes.
  • a computer-implemented method for decoding feature attributes includes: receiving a data stream comprising a plurality of parameters respectively associated with a plurality of image features; partitioning the plurality of image features into one or more clusters; determining a reference value for at least one of the one or more clusters; and determining respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.
  • a computer-implemented method for encoding feature attributes includes: partitioning a plurality of image features into one or more clusters; determining a reference value for at least one of the one or more clusters; and generating a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on: a value of an attribute of the respective image feature, and the reference value for the respective cluster.
  • a device for decoding feature attributes includes a memory storing instructions; and a processor configured to execute the instructions to cause the device to: receive a data stream comprising a plurality of parameters respectively associated with a plurality of image features; partition the plurality of image features into one or more clusters; determine a reference value for at least one of the one or more clusters; and determine respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.
  • a device for encoding feature attributes includes a memory storing instructions; and a processor configured to execute the instructions to cause the device to: partition a plurality of image features into one or more clusters; determine a reference value for at least one of the one or more clusters; and generate a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on: a value of an attribute of the respective image feature, and the reference value for the respective cluster.
  • aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.
  • FIG. 1 is a schematic diagram illustrating an exemplary encoding and decoding system for processing and transmitting image data and feature data, consistent with embodiments of the present disclosure.
  • FIG. 2A is a schematic diagram illustrating an exemplary input image, consistent with embodiments of the present disclosure.
  • FIG. 2B is a schematic diagram illustrating keypoints identified on the input image showing in FIG. 2A, consistent with embodiments of the present disclosure.
  • FIG. 3 is a schematic diagram illustrating an exemplary keypoint in an image scale space, consistent with embodiments of the present disclosure.
  • FIG. 4 is a table of exemplary feature vectors that are used for describing keypoints extracted from image data, consistent with embodiments of the present disclosure.
  • FIG. 5 is a schematic diagram illustrating a cluster of keypoints, consistent with embodiments of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating a convex hull of a set of keypoints, consistent with embodiments of the present disclosure.
  • FIG. 7 is a schematic diagram illustrating an exemplary medoid clustering process, consistent with embodiments of the present disclosure.
  • FIG. 8 is a block diagram of an exemplary apparatus for encoding or decoding image feature attributes, consistent with embodiments of the present disclosure.
  • FIG. 9 is a flowchart of an exemplary method for encoding image feature attributes, consistent with embodiments of the present disclosure.
  • FIG. 10 is a flowchart of an exemplary hierarchical partitioning method for use in the encoding method shown in FIG. 9, consistent with embodiments of the present disclosure.
  • FIG. 11 is a flowchart for determining parameters associated with non-medoid keypoints, consistent with embodiments of the present disclosure.
  • FIG. 12 is a flowchart of an exemplary method for decoding image feature attributes, consistent with embodiments of the present disclosure.
  • FIG. 13 is a flowchart of an exemplary hierarchical partitioning method for use in the decoding method shown in FIG. 12, consistent with embodiments of the present disclosure.
  • FIG. 14 is a flowchart for determining attribute values of non-medoid keypoints, consistent with embodiments of the present disclosure.
  • FIG. 15 is a histogram of the scale values for keypoints extracted from a sample image, without using the disclosed methods for representing image feature attributes.
  • FIG. 16 is a histogram of the scale values for keypoints extracted from a sample image, by using the disclosed methods for representing image feature attributes.
  • FIG. 17 is a flowchart of an exemplary Scale Invariant Feature Transform (SIFT) algorithm for extracting features from image data, consistent with embodiments of the present disclosure.
  • FIG. 18 is a schematic diagram illustrating an exemplary Jarvis march algorithm, consistent with embodiments of the present disclosure.
  • the present disclosure provides systems and methods for representing and coding (i.e., encoding or decoding) attributes of features extracted from image data.
  • the image data may include an image, multiple images, or a video.
  • An image is a static picture (e.g., a frame). Multiple images may be related or unrelated, either spatially or temporally.
  • a video includes a set of images arranged in a temporal sequence.
  • Features of an image are information used to describe local properties of the image. For example, a feature may be a local structure, pattern, texture, or shape in an image.
  • the attributes of features are data quantifying and describing the features extracted from image data, and are also referred to as “image feature attributes, ” “feature attributes, ” or “feature data” in the present disclosure.
  • the features may be any kind of regions of interest or points of interest extracted from the image data, such as keypoints, blobs, edges, corners, vertices, points, ridges, objects, etc.
  • attributes can be used to describe these features, such as scales, orientations, curvatures, peak responses of Laplacian of Gaussian (LoG), distances to the image center, gradients, etc.
  • the feature attributes may be represented using any data structures, such as parameters, coefficients, data arrays, vectors, matrices, or any other data structures suitable for quantifying the features.
  • the feature attributes may be stored or transmitted as encoded data streams, which are also referred to as “feature streams” or “feature bitstreams” in the present disclosure.
  • FIG. 1 is a block diagram illustrating an encoding and decoding system 100 that may be used to process and transmit image data and feature data, according to some disclosed embodiments.
  • system 100 includes a source device 120 that provides encoded image data and/or feature data to be decoded by a destination device 140.
  • each of source device 120 and destination device 140 may include any of a wide range of devices, including a desktop computer, a notebook (e.g., laptop) computer, a server, a tablet computer, a set-top box, a mobile phone, a vehicle, a camera, an image sensor, a robot, a television, a wearable device (e.g., a smart watch or a wearable camera), a display device, a digital media player, a video gaming console, a video streaming device, or the like.
  • Source device 120 and destination device 140 may be equipped for wireless or wired communication.
  • source device 120 may include an image/video encoder 122, a feature extractor 124, a feature encoder 126, and an output interface 128. Consistent with the disclosed embodiments, source device 120 may further include various devices (not shown) for providing raw images or videos to be processed by image/video encoder 122, feature extractor 124, and/or feature encoder 126.
  • the devices for providing the raw images or videos may include an image/video capture device, such as a camera, an image/video storage device containing previously captured images/videos, or an image/video feed interface to receive images/videos from an image/video content provider.
  • source device 120 may generate computer graphics-based data as the source images/videos, or a combination of live images/videos, archived images/videos, and computer-generated images/videos.
  • the captured, pre-captured, or computer-generated images/videos may be encoded by image/video encoder 122 to generate an encoded image/video bitstream 162, or analyzed by feature extractor 124 to extract features.
  • the attributes of the extracted features are encoded by feature encoder 126 to generate an encoded feature bitstream 164.
  • feature extractor 124 may also be configured to extract features from the output of the image/video encoder 122, i.e., encoded image/video bitstream 162.
  • the features extracted by feature extractor 124 may be used by image/video encoder 122 to facilitate the encoding of the image data. For example, prior to the encoding of the image data, image/video encoder 122 may remove certain redundancy from the image data, based on the extracted features, such as size/shape information or luminance gradient. The redundancy may include details (e.g., background luminance, local texture, temporal change, spatial-temporal frequency information, etc.) in the image data that are not perceivable by human eyes. Consistent with the disclosed embodiments, encoded image/video bitstream 162 and encoded feature bitstream 164 are output by output interface 128 onto communication medium 160.
  • Output interface 128 may include any type of medium or device capable of transmitting the encoded image/video bitstream 162 and/or feature bitstream 164 from source device 120 to destination device 140.
  • output interface 128 may include a transmitter or a transceiver configured to transmit encoded image/video bitstream 162 and/or feature bitstream 164 from source device 120 directly to destination device 140.
  • the encoded image/video bitstream 162 and/or feature bitstream 164 may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 140.
  • Communication medium 160 may include transient media, such as a wireless broadcast or wired network transmission.
  • communication medium 160 may include a radio frequency (RF) spectrum or one or more physical transmission lines (e.g., a cable) .
  • Communication medium 160 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet.
  • communication medium 160 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 120 to destination device 140.
  • a network server may receive encoded image/video bitstream 162 and/or feature bitstream 164 from source device 120, and provide the encoded image/video bitstream 162 and/or feature bitstream 164 to destination device 140, e.g., via network transmission.
  • Communication medium 160 may also be in the form of a storage media (e.g., non-transitory storage media) , such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded video data.
  • a computing device of a medium production facility such as a disc stamping facility, may receive encoded video data from source device 120 and produce a disc containing the encoded video data.
  • Destination device 140 may include an input interface 148, an image/video decoder 142, and a feature decoder 146.
  • Input interface 148 of destination device 140 receives encoded image/video bitstream 162 and/or encoded feature bitstream 164 from communication medium 160.
  • Encoded image/video bitstream 162 is decoded by image/video decoder 142 to generate decoded image data
  • encoded feature bitstream 164 is decoded by feature decoder 146 to generate decoded feature attributes.
  • Destination device 140 may also include or be connected to various devices (not shown) for utilizing the decoded image data or feature attributes.
  • the various devices may include a display device that displays the decoded image data to a user and may include any of a variety of display devices such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , a plasma display, an organic light emitting diode (OLED) display, or another type of display device.
  • the various devices may include one or more processors configured to use the decoded feature attributes to perform various computer-vision applications, such as object recognition and tracking, face recognition, image matching, image/video search, augmented reality, robot vision and navigation, autonomous driving, 3-dimensional structure construction, stereo correspondence, motion tracking, etc.
  • Image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 each may be implemented as any of a variety of suitable encoder/decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs) , application specific integrated circuits (ASICs) , field programmable gate arrays (FPGAs) , discrete logic, software, hardware, firmware or any combinations thereof.
  • a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure.
  • Each of image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
  • Image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may operate according to any coding standard, such as MPEG-7 Compact Descriptors for Visual Search (CDVS) standard, the MPEG-7 Compact Descriptors for Video Analysis (CDVA) standard, the MPEG-7 standard on metadata, the Versatile Video Coding (VVC/H. 266) standard, the High Efficiency Video Coding (HEVC/H. 265) standard, the ITU-T H. 264 (also known as MPEG-4) standard, etc.
  • image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of audio data, image data, and/or feature data in a common data stream or separate data streams.
  • encoded image/video bitstream 162 includes syntax information including syntax elements that describe characteristics or processing of coded image blocks or units.
  • the syntax information in encoded image/video bitstream 162 is defined by image/video encoder 122 and used by image/video decoder 142.
  • encoded feature bitstream 164 includes syntax information including syntax elements that describe characteristics or processing of coded feature attributes.
  • the syntax information in encoded feature bitstream 164 is defined by feature encoder 126 and used by feature decoder 146.
  • object tracking is a computer-vision application that involves a sensor or camera extracting features and sending encoded feature attributes to a server for analysis.
  • encoded feature bitstream 164 needs to be transmitted from source device 120 to destination device 140.
  • the techniques for encoding and decoding the feature attributes may be in compliance with video coding standards, such as the MPEG-7 CDVS standard or the MPEG-7 CDVA standard.
  • the MPEG-7 CDVS standard is developed to use local descriptors to describe features in a static image
  • the MPEG-7 CDVA standard is developed to use local descriptors to describe features in a video sequence (i.e., a sequence of consecutive frames) .
  • Both the MPEG-7 CDVS and MPEG-7 CDVA standards are based on the Scale Invariant Feature Transform (SIFT) technique, which extracts keypoints from an image and describes the keypoints with certain pre-defined attributes or descriptors.
  • FIG. 2A is a schematic diagram illustrating an exemplary input image 200, consistent with embodiments of the present disclosure.
  • FIG. 2B shows a plurality of keypoints 210 extracted from input image 200.
  • Each keypoint 210 is shown as a blob surrounding a local area rich in features.
  • the SIFT technique has been used in the art to extract the features and generate keypoints 210 that are invariant to image translation, image rotation, image scaling, and/or changes in, e.g., viewpoint and illumination. Thus, these keypoints are also called “SIFT keypoints.”
  • FIG. 3 is a schematic diagram illustrating an exemplary keypoint 310 in a scale space of an input image, consistent with embodiments of the present disclosure.
  • the scale space is described by a function that is produced from the convolution of a Gaussian kernel (at different scales) and the input image.
  • keypoint 310 may be described by various parameters, including, for example: keypoint position (x, y) in an image scale space in which keypoint 310 is detected, scale σ (a radius of the keypoint in the scale space), and orientation θ (an angle of the keypoint in the image scale space, expressed in radians).
  • the scale or radius σ describes the extent of the area keypoint 310 occupies in the scale space, and depends on the keypoint’s position (x, y) in the image scale space.
  • keypoint 310 is also accompanied by a descriptor (not shown) of the area occupied by keypoint 310.
  • the descriptor is a 3-dimensional spatial histogram of the image gradients characterizing the appearance of the keypoint.
  • the gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighted by the gradient norm and accumulated in a 3-dimensional histogram, which forms the SIFT descriptor of the region.
  • An additional Gaussian weighting function may be applied to the SIFT descriptor to give less importance to gradients farther away from the keypoint.
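  • To make this accumulation concrete, the following sketch is an author-provided simplification, not the patent’s or any standard’s reference implementation; the 16×16 patch size, the 4×4×8 bin layout, and the Gaussian window width are assumptions made for illustration:

```python
import numpy as np

def sift_like_descriptor(patch: np.ndarray) -> np.ndarray:
    """patch: a 16x16 grayscale neighborhood centered on a keypoint."""
    gy, gx = np.gradient(patch.astype(float))          # image gradients
    mag = np.hypot(gx, gy)                             # gradient norms
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)        # gradient orientations
    # Gaussian weighting window: gradients far from the keypoint count less.
    yy, xx = np.mgrid[0:16, 0:16] - 7.5
    w = np.exp(-(xx ** 2 + yy ** 2) / (2 * 8.0 ** 2))
    hist = np.zeros((4, 4, 8))                         # 4x4 spatial x 8 orientation bins
    for y in range(16):
        for x in range(16):
            b = int(ang[y, x] / (2 * np.pi) * 8) % 8   # orientation bin
            hist[y // 4, x // 4, b] += mag[y, x] * w[y, x]
    v = hist.ravel()                                   # 128-element descriptor
    return v / (np.linalg.norm(v) + 1e-12)             # normalize to unit length
```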
  • a plurality of keypoints may be extracted from an image.
  • Each keypoint can be described by a feature vector that includes the keypoint position (x, y), scale σ, orientation θ, and/or descriptor.
  • the scales σ, orientations θ, and descriptors are also referred to as “attributes” of the keypoints.
  • each of the keypoints is described by a feature vector representing the associated feature attributes. A portion of these feature vectors is shown in FIG. 4. Although the feature vectors in FIG. 4 do not include information associated with the keypoint descriptors, it is contemplated that each of these feature vectors may include additional elements representing the associated keypoint descriptor.
  • FIG. 17 is a flowchart of an exemplary Scale Invariant Feature Transform (SIFT) algorithm 1700 for extracting features from image data, consistent with embodiments of the present disclosure.
  • SIFT algorithm 1700 may be performed by a feature encoder, such as feature encoder 126 (FIG. 1) .
  • SIFT algorithm 1700 may include four steps: 1710-Extrema Detection, 1720-Keypoint Localization, 1730-Orientation Assignment, and 1740-Keypoint Descriptor Generation.
  • the feature encoder examines an input image under various scales to isolate points of the input image from their surroundings. These points, called extrema, are potential candidates for image features.
  • the feature encoder selects some of these points (extrema) to be keypoints. This selection rejects extrema that are caused by edges of the picture and by low contrast points.
  • the feature encoder converts each keypoint and its neighborhood into a set of vectors by computing a scale σ and an orientation θ for them.
  • the feature encoder takes a collection of vectors in the neighborhood of each keypoint and consolidates this information into a set of vectors called the descriptor. Each descriptor is converted into a feature by computing a normalized sum of these vectors.
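  • As an illustration only, steps 1710-1740 are available as a single call in off-the-shelf libraries; the sketch below uses OpenCV’s SIFT implementation (the input file name is hypothetical) and converts OpenCV’s diameter/degree conventions to the radius/radian conventions used in this disclosure:

```python
import cv2
import numpy as np

# detectAndCompute bundles extrema detection, keypoint localization,
# orientation assignment, and descriptor generation in one call.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

for kp in keypoints[:5]:
    x, y = kp.pt                      # keypoint position (x, y)
    sigma = kp.size / 2.0             # OpenCV reports a diameter; halve for a radius
    theta = np.deg2rad(kp.angle)      # OpenCV reports degrees; convert to radians
    print(f"({x:.1f}, {y:.1f})  scale={sigma:.2f}  orientation={theta:.2f}")
```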
  • the feature vectors can be analyzed and used by various computer-vision applications, for example, object detection, recognition, and tracking.
  • the feature vectors can be stored/transmitted in their raw form.
  • the feature vectors can be described in stream syntax or encapsulated by additional information (e.g., a checksum) .
  • the present disclosure provides embodiments to improve the efficiency of storing and transmitting feature data by modifying the representation of the feature data to reduce its dynamic range, thereby reducing the cost for storing and transmitting the feature data.
  • the keypoints extracted from an image are partitioned into a plurality of clusters based on the keypoint positions.
  • the keypoints in each cluster have more similar local image properties, and thus less diverse attribute values.
  • the dynamic range of the attribute values in each cluster can be reduced by subtracting, from each of the attribute values, a reference value associated with the respective cluster.
  • the attribute value of a cluster’s medoid is chosen as the cluster’s reference value.
  • The medoid of a cluster is defined as the cluster’s keypoint whose sum of distances to all the other keypoints in the cluster is minimal.
  • the medoid’s attribute value is subtracted from each non-medoid keypoint’s attribute value, while the medoid keypoint’s attribute value is left unchanged because it is representative of the cluster’s attribute values.
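  • A minimal sketch of this differential representation, assuming NumPy and Euclidean distances between keypoint positions (the toy values below are hypothetical):

```python
import numpy as np

def medoid_index(positions: np.ndarray) -> int:
    """Index of the keypoint whose summed distance to all others is minimal."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)     # pairwise Euclidean distances
    return int(np.argmin(dists.sum(axis=1)))

# Toy cluster: (x, y) positions and the scale value of each keypoint.
positions = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 1.0], [2.2, 1.8]])
scales = np.array([3.1, 3.4, 2.9, 3.0])

m = medoid_index(positions)
params = scales - scales[m]    # non-medoid scales become small differences
params[m] = scales[m]          # the medoid keeps its value as the reference
```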
  • FIG. 5 shows a cluster of 8 keypoints represented by solid dark circles.
  • One of the keypoints is determined to be the medoid of the cluster, whose sum of distances to all the other keypoints in the cluster is the smallest.
  • the scale value of the medoid keypoint can be subtracted from the scale values of the non-medoid keypoints to reduce their dynamic range.
  • the medoid keypoint’s scale value is unchanged, so that it can be preserved as the reference value for the cluster.
  • the medoid keypoint’s scale value is a good representation of the scale values of the keypoints in the same cluster.
  • Although FIG. 5 only shows the differential representation of the non-medoid keypoints’ scale values, it is contemplated that the differential representation can also be applied to the orientation values in the same way.
  • a medoid clustering algorithm, for example, a Partitioning Around Medoids (PAM) algorithm, is used to partition the keypoints and determine the medoid of a cluster.
  • the PAM algorithm searches for a set of representative data points in an input data set and then assigns the remaining data points to the closest medoid in order to create clusters.
  • the representative data point of a cluster is also called the “medoid” of the cluster and is the most centrally located data point of the cluster.
  • the remaining data points in the cluster are called “non-medoids. ”
  • the PAM algorithm may be implemented in three steps. In the first step, the PAM algorithm chooses, from the input data set, one or more data points as initial medoids.
  • the one or more initial medoids correspond to one or more clusters, respectively.
  • the PAM algorithm associates each of the remaining data points with its closest initial medoid.
  • the PAM algorithm searches for the optimum medoid for a cluster, by minimizing a cost function that measures the sum of dissimilarities between the medoid and the other data points in the same cluster.
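  • The sketch below implements a simplified k-medoids iteration in the spirit of the PAM steps described above; the full PAM swap search is omitted for brevity, and distinct initial medoid positions are assumed:

```python
import numpy as np

def k_medoids(points: np.ndarray, init_medoids: list[int], iters: int = 20):
    """Cluster `points` around the given initial medoid indices."""
    medoids = list(init_medoids)
    for _ in range(iters):
        # Assign each remaining point to its closest current medoid.
        d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=-1)
        labels = np.argmin(d, axis=1)
        # Within each cluster, pick the point minimizing total dissimilarity.
        new_medoids = []
        for k in range(len(medoids)):
            members = np.where(labels == k)[0]
            intra = np.linalg.norm(
                points[members][:, None, :] - points[members][None, :, :], axis=-1)
            new_medoids.append(int(members[np.argmin(intra.sum(axis=1))]))
        if new_medoids == medoids:   # converged: medoids stopped moving
            break
        medoids = new_medoids
    return labels, medoids
```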
  • the keypoints located on a convex hull of the input keypoints are used as initial medoids of the medoid clustering algorithm. This ensures that the encoder and decoder use the same method to determine initial medoids, and hence that the same clusters and medoids are generated on both the encoder and decoder sides.
  • a convex hull of a set of keypoints is the smallest convex polygon that contains all the keypoints in the set. In FIG. 6, the set of keypoints is shown as gray filled areas and the convex hull is shown as a solid-line polygon encompassing all the keypoints.
  • FIG. 18 is a schematic diagram illustrating an exemplary method for implementing the Jarvis march algorithm.
  • the Jarvis march algorithm starts with the left-most keypoint P0 and looks for the convex-hull keypoints in a counter-clockwise direction. Specifically, the Jarvis march algorithm starts from the left-most keypoint P0 and finds the next convex-hull keypoint P1, which has the largest angle (Angle-0 in FIG. 18) from P0. The Jarvis march algorithm then starts from keypoint P1 and finds the next convex-hull keypoint P2, which has the largest angle (Angle-1 in FIG. 18) from P1. The process repeats in this manner until it returns to the starting keypoint P0.
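  • A compact sketch of the Jarvis march (gift wrapping) over 2-dimensional keypoint positions, returning hull indices in counter-clockwise order; degenerate inputs (e.g., all-collinear points) are not handled:

```python
import numpy as np

def cross(o, a, b) -> float:
    """z-component of (a - o) x (b - o); positive if b is left of segment o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def jarvis_march(points: np.ndarray) -> list[int]:
    """Indices of the convex-hull keypoints, in counter-clockwise order."""
    n = len(points)
    start = int(np.argmin(points[:, 0]))    # the left-most point is on the hull
    hull, p = [], start
    while True:
        hull.append(p)
        q = (p + 1) % n
        for r in range(n):
            # If r is clockwise of segment p->q, it wraps the set more tightly.
            if cross(points[p], points[q], points[r]) < 0:
                q = r
        p = q
        if p == start:                      # wrapped all the way around
            break
    return hull
```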
  • FIG. 7 is a schematic diagram illustrating a medoid clustering process, according to embodiments consistent with the present disclosure. As shown in FIG. 7, a set of keypoints is shown as gray areas. The keypoints located on the convex hull of the set of keypoints are shown as asterisk marks and are used as initial medoids for the medoid clustering process. The cross marks correspond to the medoids of the clusters output by the medoid clustering process.
  • a hierarchical clustering process may be used to partition the keypoints. For example, for any given cluster, the ratio of the total number of keypoints in the cluster to the number of keypoints on the cluster’s convex hull is determined. If the ratio is greater than or equal to the number of keypoints on the cluster’s convex hull, the cluster is further partitioned into smaller clusters; otherwise, the cluster is not further partitioned.
  • the final output of the hierarchical clustering process is a plurality of clusters deemed unpartitionable. This ensures the unambiguity of the partitioning and achieves a granularity at which the reference values of the resulting clusters (e.g., the attribute values of the medoid keypoints) can better match the local scene properties. A sketch of this recursion is given below.
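  • Combining the pieces above, a hedged sketch of the recursive partitioning rule; it reuses the jarvis_march and k_medoids sketches from earlier and assumes non-degenerate keypoint positions:

```python
import numpy as np

def partition_hierarchically(points: np.ndarray, indices: list[int]) -> list[list[int]]:
    """Recursively split a cluster while floor(N / M) >= M."""
    sub = points[indices]
    hull = jarvis_march(sub)          # M initial medoids on the convex hull
    n, m = len(indices), len(hull)
    if m < 2 or n // m < m:
        return [indices]              # unpartitionable: emit as a leaf cluster
    labels, _ = k_medoids(sub, hull)  # split into M clusters
    clusters = []
    for k in range(m):
        members = [indices[i] for i in range(n) if labels[i] == k]
        if members:                   # guard against empty clusters
            clusters.extend(partition_hierarchically(points, members))
    return clusters

# Usage: clusters = partition_hierarchically(all_points, list(range(len(all_points))))
```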
  • the above-described differential representation of non-medoid keypoints’ attribute values is a lossless hierarchical coding method. It does not lose any useful information of the feature attributes, and it can be reversibly encoded and decoded. Moreover, other than the attribute values themselves, no additional information needs to be sent to the decoder to facilitate the decoding. Therefore, the disclosed technique for coding the attribute values is highly efficient. This technique can be applied to tools that use SIFT, such as direct SIFT, CDVS, CDVA, etc.
  • FIG. 8 is a block diagram of an example apparatus 800 for encoding or decoding image feature attributes, consistent with embodiments of the disclosure.
  • feature encoder 126 and feature decoder 146 can each be implemented as apparatus 800 for performing the above-described methods.
  • apparatus 800 can include processor 802.
  • when processor 802 executes instructions described herein, apparatus 800 can become a specialized machine for encoding or decoding feature attributes.
  • Processor 802 can be any type of circuitry capable of manipulating or processing information.
  • processor 802 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like.
  • processor 802 can also be a set of processors grouped as a single logical component. For example, as shown in FIG. 8, processor 802 can include multiple processors.
  • Apparatus 800 can also include memory 804 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like) .
  • the stored data can include program instructions (e.g., program instructions for implementing the encoding or decoding of feature attributes) and data for processing (e.g., image data, features extracted from image data, or encoded feature bitstream) .
  • Processor 802 can access the program instructions and data for processing (e.g., via bus 810) , and execute the program instructions to perform an operation or manipulation on the data for processing.
  • Memory 804 can include a high-speed random-access storage device or a non-volatile storage device.
  • memory 804 can include any combination of any number of a random-access memory (RAM) , a read-only memory (ROM) , an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a security digital (SD) card, a memory stick, a compact flash (CF) card, or the like.
  • Memory 804 can also be a group of memories (not shown in FIG. 8) grouped as a single logical component.
  • Bus 810 can be a communication device that transfers data between components inside apparatus 800, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.
  • processor 802 and other data processing circuits are collectively referred to as a “data processing circuit” in this disclosure.
  • the data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware.
  • the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 800.
  • Apparatus 800 can further include network interface 806 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) .
  • network interface 806 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.
  • apparatus 800 can further include peripheral interface 808 to provide a connection to one or more peripheral devices.
  • the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.
  • FIG. 9 is a flowchart of a method 900 for encoding image feature attributes, according to embodiments consistent with the present disclosure.
  • method 900 may be performed by a feature encoder, such as a processor device in the feature encoder (e.g., processor 802) .
  • method 900 includes the following steps 902-906.
  • the feature encoder partitions a plurality of image features into one or more clusters.
  • the image features may be scale-invariant feature transform (SIFT) keypoints extracted from an image.
  • Each of the keypoints can be described by various attributes, including, for example, a scale and an orientation.
  • the partitioning is based on the positions of the keypoints.
  • a partitioning around medoid (PAM) algorithm is performed by the feature encoder to partition the plurality of image features.
  • a cluster output by the PAM algorithm includes a medoid keypoint and one or more non-medoid keypoints.
  • the feature encoder uses a hierarchical partitioning process to perform the partitioning in step 902.
  • FIG. 10 is a flowchart of an exemplary hierarchical partitioning method 1000 for use in step 902 (FIG. 9) , according to embodiments consistent with the present disclosure.
  • method 1000 may be performed by a processor (e.g., processor 802) in a feature encoder.
  • as shown in FIG. 10, at step 1002, the feature encoder receives a plurality of input keypoints, which are image regions generated by applying a feature-extraction algorithm (e.g., a SIFT algorithm) to an image.
  • the plurality of input keypoints may belong to a cluster generated by a previous iteration of method 1000.
  • the feature encoder determines a total number of the input keypoints. During this counting process, keypoints lying at the same position (x, y) are counted as separate keypoints. The counted total number is denoted as “N” in the following description.
  • the feature encoder determines a convex hull of the input keypoints. For example, a Jarvis march algorithm may be executed to determine the convex hull.
  • the feature encoder identifies the number of keypoints located on the convex hull.
  • the number of keypoints on the convex hull is denoted as “M” in the following description.
  • the feature encoder then determines the quotient N/M, i.e., the ratio of the total number of the input keypoints to the number of the keypoints located on the convex hull. In one embodiment, the feature encoder may round down the value of N/M to the nearest integer. After the quotient is determined, the feature encoder compares it to M, i.e., the number of the keypoints located on the convex hull. If the quotient is greater than or equal to M, the feature encoder determines that the input keypoints can be further partitioned and proceeds to step 1012 to partition the input keypoints; otherwise, the feature encoder determines that the input keypoints form an unpartitionable cluster and proceeds to step 1014 to determine parameters representing the attributes of the input keypoints. The details of the operations involved in step 1014 are explained below in connection with FIG. 11.
  • the feature encoder partitions the input keypoints into a plurality of clusters, such as by executing a Partitioning Around Medoids (PAM) algorithm.
  • the partition is based on the positions of the input keypoints.
  • the M keypoints located on the convex hull may be used as initial medoids for the PAM algorithm, so as to cause the algorithm to partition the input keypoints into M clusters.
  • by defining the initial medoids to be the keypoints located on the convex hull, the encoder and decoder are ensured to partition the keypoints in the same way, without requiring additional coding information to be sent from the encoder to the decoder.
  • information of the initial medoids may be provided to the decoder to facilitate identification of the keypoints.
  • the information provided to the decoder may include the number or initial positions of the initial medoids. Such information allows the decoder to use the same initial medoids as the encoder.
  • after step 1012, the feature encoder returns to step 1002 and initiates a new iteration of method 1000 on each of the plurality of clusters resulting from step 1012. The feature encoder ends method 1000 when no cluster can be further partitioned.
  • the feature encoder determines a reference value for at least one of the one or more clusters.
  • the reference value for a cluster may be determined to be equal to an attribute value (e.g., scale value or orientation value) of the cluster’s medoid keypoint.
  • the feature encoder generates a plurality of parameters respectively associated with the plurality of image features.
  • the plurality of parameters represents the attributes of the keypoints.
  • Each of the plurality of parameters is generated based on a value of an attribute of the respective image feature and the reference value for the respective cluster.
  • the value of the parameter associated with each medoid feature is made equal to the reference value for the respective cluster (i.e., the attribute value of the respective cluster’s medoid) .
  • the value of the parameter associated with each non-medoid feature is made equal to a difference of: an attribute value of the respective non-medoid feature, and the reference value for the respective cluster (i.e., the attribute value of the respective cluster’s medoid).
  • FIG. 11 shows a method 1100 for determining parameters representing the attributes of a cluster of keypoints.
  • method 1100 may be performed by a processor in the feature encoder (e.g., processor 802) .
  • method 1100 includes the following steps 1102-1106.
  • the feature encoder determines a medoid for a cluster of keypoints.
  • the medoid’s attribute value is used as a reference value for the cluster.
  • the attribute of interest may be the keypoint scale, and the medoid’s scale value is used as a reference scale value for the cluster.
  • at step 1104, the feature encoder determines, for a non-medoid keypoint, a parameter representing the attribute value of that keypoint.
  • the value of the parameter is determined to be equal to a difference between the non-medoid keypoint’s attribute value and the respective cluster’s reference value. For example, if the attribute of interest is the keypoint scale, the parameter for each non-medoid keypoint is determined to be equal to σ_non-medoid{i} − σ_medoid_of_cluster, wherein σ_non-medoid{i} denotes the scale value of an ith non-medoid keypoint, and σ_medoid_of_cluster denotes the scale value of the medoid keypoint.
  • similarly, if the attribute of interest is the keypoint orientation, the parameter for each non-medoid keypoint is determined to be equal to θ_non-medoid{i} − θ_medoid_of_cluster, wherein θ_non-medoid{i} denotes the orientation value of an ith non-medoid keypoint, and θ_medoid_of_cluster denotes the orientation value of the medoid keypoint.
  • at step 1106, the feature encoder determines whether a parameter has been determined for all the non-medoid keypoints in the cluster. If not, the feature encoder returns to step 1104 and determines the parameter for the next non-medoid keypoint in the cluster. If the last non-medoid keypoint in the cluster has been reached, the feature encoder ends method 1100 and returns the determined parameters.
  • method 1100 can be executed at step 1014 of method 1000 (FIG. 10) to determine the parameters representing the attributes of non-medoid keypoints. Moreover, the parameters representing the attributes of medoid keypoints are made equal to the medoid keypoints’ attribute values.
  • the generated plurality of parameters can be inserted in a feature bitstream to be transmitted to a decoder. As described above, because the generated plurality of parameters has a smaller data range than the original attribute values’ range, using the generated plurality of parameters to represent the attribute values can reduce the storage and transmission costs.
  • FIG. 12 is a flowchart of a method 1200 for decoding image feature attributes, according to embodiments consistent with the present disclosure.
  • method 1200 may be performed by a feature decoder, such as a processor device in the feature decoder (e.g., processor 802) .
  • as shown in FIG. 12, method 1200 includes the following steps 1202-1208.
  • the feature decoder receives a data stream comprising a plurality of parameters respectively associated with a plurality of image features.
  • the image features may be scale-invariant feature transform (SIFT) keypoints extracted from an image.
  • Each of the keypoints can be described by various attributes, including, for example, a scale and an orientation.
  • the feature decoder partitions the plurality of image features into one or more clusters.
  • the partitioning is based on the positions of the keypoints.
  • a partitioning around medoid (PAM) algorithm is executed by the feature decoder to partition the plurality of image features.
  • a cluster output by the PAM algorithm includes a medoid keypoint and one or more non-medoid keypoints.
  • the feature decoder uses a hierarchical partitioning process to perform the partitioning in step 1204.
  • FIG. 13 is a flowchart of an exemplary hierarchical partitioning method 1300 for use in step 1204 (FIG. 12) , according to embodiments consistent with the present disclosure.
  • method 1300 may be performed by a feature decoder, such as a processor device in the feature decoder (e.g., processor 802) .
  • method 1300 can be executed in multiple iterations.
  • the feature decoder receives a plurality of input keypoints.
  • the plurality of input keypoints may belong to a cluster generated by a previous iteration of method 1300.
  • the feature decoder determines a total number of the input keypoints. During this counting process, keypoints lying at the same position (x, y) are counted separately. The counted total number is denoted as “N” in the following description.
  • the feature decoder determines a convex hull of the input keypoints. For example, a Jarvis march algorithm may be executed to determine the convex hull.
  • the feature decoder identifies the number of keypoints located on the convex hull. The number of keypoints on the convex hull is denoted as “M” in the following description.
  • the feature decoder then determines the quotient N/M, i.e., the ratio of the total number of the input keypoints to the number of the keypoints located on the convex hull. In one embodiment, the feature decoder may round down the value of N/M to the nearest integer. After the quotient is determined, the feature decoder compares it to M, i.e., the number of the keypoints located on the convex hull. If the quotient is greater than or equal to M, the feature decoder determines that the input keypoints can be further partitioned and proceeds to step 1312 to partition the input keypoints; otherwise, the feature decoder determines that the input keypoints form an unpartitionable cluster and proceeds to step 1314 to determine attribute values of the input keypoints. The details of the operations involved in step 1314 are explained below in connection with FIG. 14.
  • the feature decoder partitions the input keypoints into a plurality of clusters, such as by executing a Partitioning Around Medoids (PAM) algorithm.
  • the partition is based on the positions of the input keypoints.
  • the M keypoints located on the convex hull may be used as initial medoids for the PAM algorithm, so as to cause the algorithm to partition the input keypoints into M clusters.
  • the encoder may provide information of the initial medoids directly to the decoder to facilitate identification of the keypoints.
  • the information provided to the decoder may include the number or initial positions of the initial medoids.
  • after step 1312, the feature decoder returns to step 1302 and initiates a new iteration of method 1300 on each of the plurality of clusters resulting from step 1312. The feature decoder ends method 1300 when no cluster can be further partitioned.
  • the feature decoder determines a reference value for at least one of the one or more clusters.
  • the reference value for a cluster may be determined to be equal to a value of the parameter associated with the cluster’s medoid keypoint.
  • the feature decoder determines respective attributes of the plurality of image features.
  • the attribute of the image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.
  • the attribute of each medoid feature is determined to have a value equal to the reference value for the respective cluster.
  • the attribute of each non-medoid feature is determined to have a value equal to a sum of: a value of the parameter associated with the respective non-medoid feature, and the reference value for the respective cluster (i.e., the value of the parameter associated with the respective cluster’s medoid) .
  • the operations involved in step 1208 may include determining the attributes of all of the image features, or determining the attributes of only part of the image features.
  • FIG. 14 shows a method 1400 for determining attribute values of a cluster of keypoints.
  • method 1400 may be performed by a processor in the feature decoder (e.g., processor 802) .
  • the feature decoder determines a medoid for a cluster of keypoints.
  • the value of the parameter associated with the medoid is used as a reference value for the cluster.
  • the feature decoder determines an attribute value of the non-medoid keypoint. Specifically, the attribute value is determined to be equal to a sum of the value of the non-medoid keypoint’s associated parameter and the reference value for the respective cluster.
  • the attribute value of each non-medoid keypoint is determined to be equal to parameter_non-medoid{i} + parameter_medoid_of_cluster, wherein parameter_non-medoid{i} denotes the value of the parameter associated with an ith non-medoid keypoint, and parameter_medoid_of_cluster denotes the value of the parameter associated with the medoid keypoint (i.e., the reference value for the respective cluster).
  • at step 1406, the feature decoder determines whether an attribute value has been determined for all the non-medoid keypoints in the cluster. If not, the feature decoder returns to step 1404 and determines the attribute value of the next non-medoid keypoint in the cluster. If the last non-medoid keypoint in the cluster has been reached, the feature decoder ends method 1400 and returns the determined attribute values.
  • method 1400 can be executed at step 1314 of method 1300 (FIG. 13) to determine the attribute values of non-medoid keypoints. Moreover, the attribute values of medoid keypoints are made equal to the values of the parameters associated with the medoid keypoints, respectively.
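  • A minimal sketch of the decoder-side reconstruction of method 1400, mirroring the encoder-side subtraction sketched earlier (NumPy assumed):

```python
import numpy as np

def decode_cluster(params: np.ndarray, medoid_idx: int) -> np.ndarray:
    """Recover a cluster's attribute values from its transmitted parameters."""
    ref = params[medoid_idx]       # the medoid's parameter is the reference value
    attrs = params + ref           # non-medoids: differential parameter + reference
    attrs[medoid_idx] = ref        # the medoid's attribute was transmitted unchanged
    return attrs

# Round trip with the encoder-side sketch: for params produced by subtracting
# the medoid's scale, decode_cluster(params, m) reproduces the original scales.
```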
  • FIG. 15 shows a histogram of the σ (scale) values of SIFT keypoints extracted from a sample Poznań CarPark frame, without using the disclosed scheme for representing the scale values.
  • FIG. 16 shows a histogram of the σ values of SIFT keypoints extracted from the same sample Poznań CarPark frame, by using the disclosed scheme for representing the scale values.
  • the keypoints in FIG. 16 are partitioned into 24 clusters.
  • the box plot covers 50% of the observations; the dashed line in the box plot is the median of these 50% of observations; and the continuous Kernel Density Estimation (KDE) plot smooths the observations with a Gaussian kernel, producing a continuous density estimate.
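  • For illustration only, plots of this kind can be reproduced as sketched below with matplotlib and SciPy; the gamma-distributed scale values are hypothetical stand-in data, not the Poznań CarPark measurements:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scales = rng.gamma(shape=2.0, scale=2.0, size=500)     # stand-in scale values

fig, (ax_box, ax_hist) = plt.subplots(
    2, 1, sharex=True, gridspec_kw={"height_ratios": [1, 4]})
ax_box.boxplot(scales, vert=False)                     # box spans the middle 50%
ax_hist.hist(scales, bins=50, density=True, alpha=0.5)
xs = np.linspace(scales.min(), scales.max(), 200)
ax_hist.plot(xs, gaussian_kde(scales)(xs))             # Gaussian-kernel KDE curve
ax_hist.set_xlabel("scale")
plt.show()
```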
  • as shown in FIG. 16, the box covering 50% of the observations has a narrow dynamic range.
  • a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder) , for performing the above-described methods.
  • Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, a hard disk, a solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, an NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same.
  • the device may include one or more processors (CPUs) , an input/output interface, a network interface, and/or a memory.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Lossless encoding and decoding of attributes of image features are disclosed. According to certain embodiments, an exemplary decoding method may include: receiving a data stream comprising a plurality of parameters respectively associated with a plurality of image features; partitioning the plurality of image features into one or more clusters; determining a reference value for at least one of the one or more clusters; and determining respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.

Description

LOSSLESS HIERARCHICAL CODING OF IMAGE FEATURE ATTRIBUTES TECHNICAL FIELD
The present disclosure generally relates to image or video processing and, more particularly, to methods and systems for representing, encoding, and decoding attributes of image features.
BACKGROUND
With the rapid development of devices and applications that use visual information, there is an exploding demand for image/video processing and transmission capacities. On one hand, standards such as Advanced Video Coding (AVC) , High Efficiency Video Coding (HEVC) , and Versatile Video Coding (VVC) were developed to significantly reduce the storage size and transmission bandwidth of image/video data, while maximizing the subjective quality perceived by humans under certain specified bitrate conditions.
On the other hand, machine vision applications, such as object recognition and tracking, face recognition, image/video search, mobile augmented reality (MAR), autonomous vehicles, Internet of Things (IoT), image matching, 3-dimensional structure construction, stereo correspondence, and motion tracking, focus on the analysis and usage of features extracted from the image/video data, rather than on the subjective quality. Thus, there is an increasing demand to store and transmit data quantifying and describing the extracted features (i.e., attributes of the extracted features). To that end, the Moving Picture Experts Group (MPEG) set up a Video Coding for Machines (VCM) working group to standardize new codecs and tools directed to encoding and decoding the feature attributes (hereinafter also referred to as “feature data” or “feature streams”). Constant effort has been devoted to improving the efficiency of these codecs and tools.
SUMMARY
Embodiments of the present disclosure relate to methods of representing and coding (i.e., encoding or decoding) feature attributes. In some embodiments, a computer-implemented method for decoding feature attributes is provided. The method includes: receiving a data stream comprising a plurality of parameters respectively associated with a plurality of image features; partitioning the plurality of image features into one or more clusters; determining a reference value for at least one of the one or more clusters; and determining respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.
In some embodiments, a computer-implemented method for encoding feature attributes is provided. The method includes: partitioning a plurality of image features into one or more clusters; determining a reference value for at least one of the one or more clusters; and generating a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on: a value of an attribute of the respective image feature, and the reference value for the respective cluster.
In some embodiments, a device for decoding feature attributes is provided. The device includes a memory storing instructions; and a processor configured to execute the instructions to cause the device to: receive a data stream comprising a plurality of parameters respectively associated with a plurality of image features; partition the plurality of image features into one or more clusters; determine a reference value for at least one of the one or more clusters; and determine respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster.
In some embodiments, a device for encoding feature attributes is provided. The device includes a memory storing instructions; and a processor configured to execute the instructions to cause the device to: partition a plurality of image features into one or more clusters; determine a reference value for at least one of the one or more clusters; and generate a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on: a value of an attribute of the respective image feature, and the reference value for the respective cluster.
Aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.
Additional objects and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The objects and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram illustrating an exemplary encoding and decoding system for processing and transmitting image data and feature data, consistent with embodiments of the present disclosure.
FIG. 2A is a schematic diagram illustrating an exemplary input image, consistent with embodiments of the present disclosure.
FIG. 2B is a schematic diagram illustrating keypoints identified on the input image shown in FIG. 2A, consistent with embodiments of the present disclosure.
FIG. 3 is a schematic diagram illustrating an exemplary keypoint in an image scale space, consistent with embodiments of the present disclosure.
FIG. 4 is a table of exemplary feature vectors that are used for describing keypoints extracted from image data, consistent with embodiments of the present disclosure.
FIG. 5 is a schematic diagram illustrating a cluster of keypoints, consistent with embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating a convex hull of a set of keypoints, consistent with embodiments of the present disclosure.
FIG. 7 is a schematic diagram illustrating an exemplary medoid clustering process, consistent with embodiments of the present disclosure.
FIG. 8 is a block diagram of an exemplary apparatus for encoding or decoding image feature attributes, consistent with embodiments of the present disclosure.
FIG. 9 is a flowchart of an exemplary method for encoding image feature attributes, consistent with embodiments of the present disclosure.
FIG. 10 is a flowchart of an exemplary hierarchical partitioning method for use in the encoding method shown in FIG. 9, consistent with embodiments of the present disclosure.
FIG. 11 is a flowchart for determining parameters associated with non-medoid keypoints, consistent with embodiments of the present disclosure.
FIG. 12 is a flowchart of an exemplary method for decoding image feature attributes, consistent with embodiments of the present disclosure.
FIG. 13 is a flowchart of an exemplary hierarchical partitioning method for use in the decoding method shown in FIG. 12, consistent with embodiments of the present disclosure.
FIG. 14 is a flowchart for determining attribute values of non-medoid keypoints, consistent with embodiments of the present disclosure.
FIG. 15 is a histogram of the scale values for keypoints extracted from a sample image, without using the disclosed methods for representing image feature attributes.
FIG. 16 is a histogram of the scale values for keypoints extracted from a sample image, by using the disclosed methods for representing image feature attributes.
FIG. 17 is a flowchart of an exemplary Scale Invariant Feature Transform (SIFT) algorithm for extracting features from image data, consistent with embodiments of the present disclosure.
FIG. 18 is a schematic diagram illustrating an exemplary Jarvis march algorithm, consistent with embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.
The present disclosure provides systems and methods for representing and coding (i.e., encoding or decoding) attributes of features extracted from image data. The image data may include an image, multiple images, or a video. An image is a static picture (e.g., a frame). Multiple images may be related or unrelated, either spatially or temporally. A video includes a set of images arranged in a temporal sequence. Features of an image are information used to describe local properties of the image. For example, a feature may be a local structure, pattern, texture, or shape in an image. The attributes of features are data quantifying and describing the features extracted from image data, and are also referred to as “image feature attributes,” “feature attributes,” or “feature data” in the present disclosure.
Consistent with the present disclosure, the features may be any kind of regions of interest or points of interest extracted from the image data, such as keypoints, blobs, edges, corners, vertices, points, ridges, objects, etc. Various types of attributes can be used to describe these features, such as scales, orientations, curvatures, peak responses of Laplacian of Gaussian (LoG), distances to the image center, gradients, etc. Specific embodiments regarding encoding and decoding scales of Scale Invariant Feature Transform (SIFT) keypoints are described below. However, it is contemplated that the feature-attribute encoding and decoding systems and methods provided by the present disclosure are not limited to the SIFT keypoints.
Moreover, consistent with the present disclosure, the feature attributes may be represented using any data structures, such as parameters, coefficients, data arrays, vectors, matrices, or any other data structures suitable for quantifying the features. And the feature attributes may be stored or transmitted as encoded data streams, which are also referred to as “feature streams” or “feature bitstreams” in the present disclosure.
FIG. 1 is a block diagram illustrating an encoding and decoding system 100 that may be used to process and transmit image data and feature data, according to some disclosed embodiments. As shown in FIG. 1, system 100 includes a source device 120 that provides encoded image data and/or feature data to be decoded by a destination device 140. Consistent with the disclosed embodiments, each of source device 120 and destination device 140 may include any of a wide range of devices, including a desktop computer, a notebook (e.g., laptop) computer, a server, a tablet computer, a set-top box, a mobile phone, a vehicle, a camera, an image sensor, a robot, a television, a wearable device (e.g., a smart watch or a wearable camera), a display device, a digital media player, a video gaming console, a video streaming device, or the like. Source device 120 and destination device 140 may be equipped for wireless or wired communication.
Referring to FIG. 1, source device 120 may include an image/video encoder 122, a feature extractor 124, a feature encoder 126, and an output interface 128. Consistent with the disclosed embodiments, source device 120 may further include various devices (not shown) for providing raw images or videos to be processed by image/video encoder 122, feature extractor 124, and/or feature encoder 126. The devices for providing the raw images or videos may include an image/video capture device, such as a camera, an image/video storage device containing previously captured images/videos, or an image/video feed interface to receive images/videos from an image/video content provider. As a further alternative, source device 120 may generate computer graphics-based data as the source images/videos, or a combination of live images/videos, archived images/videos, and computer-generated images/videos. The captured, pre-captured, or computer-generated images/videos may be encoded by image/video encoder 122 to generate an encoded image/video bitstream 162, or analyzed by feature extractor 124 to extract features. The attributes of the extracted features are encoded by feature encoder 126 to generate an encoded feature bitstream 164. In some embodiments, feature extractor 124 may also be configured to extract features from the output of the image/video encoder 122, i.e., encoded image/video bitstream 162. Additionally or alternatively, the features extracted by feature extractor 124 may be used by image/video encoder 122 to facilitate the encoding of the image data. For example, prior to the encoding of the image data, image/video encoder 122 may remove certain redundancy from the image data, based on the extracted features, such as size/shape information or luminance gradient. The redundancy may include details (e.g., background luminance, local texture, temporal change, spatial-temporal frequency information, etc.) in the image data that are not perceivable by human eyes. Consistent with the disclosed embodiments, encoded image/video bitstream 162 and encoded feature bitstream 164 are output by output interface 128 onto a communication medium 160.
Output interface 128 may include any type of medium or device capable of transmitting the encoded image/video bitstream 162 and/or feature bitstream 164 from source device 120 to destination device 140. For example, output interface 128 may include a transmitter or a transceiver configured to transmit encoded image/video bitstream 162 and/or feature bitstream 164 from source device 120 directly to destination device 140. The encoded image/video bitstream 162 and/or feature bitstream 164 may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 140.
Communication medium 160 may include transient media, such as a wireless broadcast or wired network transmission. For example, communication medium 160 may include a radio frequency (RF) spectrum or one or more physical transmission lines (e.g., a cable) . Communication medium 160 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. In some embodiments, communication medium 160 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 120 to destination device 140. For example, a network server (not shown) may receive encoded image/video bitstream 162 and/or feature  bitstream 164 from source device 120, and provide the encoded image/video bitstream 162 and/or feature bitstream 164 to destination device 140, e.g., via network transmission.
Communication medium 160 may also be in the form of a storage medium (e.g., a non-transitory storage medium), such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, volatile or non-volatile memory, or any other suitable digital storage medium for storing encoded video data. In some embodiments, a computing device of a medium production facility, such as a disc stamping facility, may receive encoded video data from source device 120 and produce a disc containing the encoded video data.
Destination device 140 may include an input interface 148, an image/video decoder 142, and a feature decoder 146. Input interface 148 of destination device 140 receives encoded image/video bitstream 162 and/or encoded feature bitstream 164 from communication medium 160. Encoded image/video bitstream 162 is decoded by image/video decoder 142 to generate decoded image data, and encoded feature bitstream 164 is decoded by feature decoder 146 to generate decoded feature attributes. Destination device 140 may also include or be connected to various devices (not shown) for utilizing the decoded image data or feature attributes. For example, the various devices may include a display device that displays the decoded image data to a user and may include any of a variety of display devices such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , a plasma display, an organic light emitting diode (OLED) display, or another type of display device. As another example, the various devices may include one or more processors configured to use the decoded feature attributes to perform various computer-vision applications, such as object recognition and tracking, face recognition, images matching, image/video search, augmented reality, robot vision and navigation, autonomous driving, 3-dimension structure construction, stereo correspondence, motion tracking, etc.
Image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 each may be implemented as any of a variety of suitable encoder/decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs) , application specific integrated circuits (ASICs) , field programmable gate arrays (FPGAs) , discrete logic, software, hardware, firmware or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Each of image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
Image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may operate according to any coding standard, such as MPEG-7 Compact Descriptors for Visual Search (CDVS) standard, the MPEG-7 Compact Descriptors for Video Analysis (CDVA) standard, the MPEG-7 standard on metadata, the Versatile Video Coding (VVC/H. 266) standard, the High Efficiency Video Coding (HEVC/H. 265) standard, the ITU-T H. 264 (also known as MPEG-4) standard, etc. Although not shown in FIG. 1, in some embodiments, some or all of image/video encoder 122, feature encoder 126, image/video decoder 142, and feature decoder 146 may be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of audio data, image data, and/or feature data in a common data stream or separate data streams.
Still referring to FIG. 1, encoded image/video bitstream 162 includes syntax information including syntax elements that describe characteristics or processing of coded image blocks or units. The syntax information in encoded image/video bitstream 162 is defined by image/video encoder 122 and used by image/video decoder 142. Moreover, encoded feature bitstream 164 includes syntax information including syntax elements that describe characteristics or processing of coded feature attributes. The syntax information in encoded feature bitstream 164 is defined by feature encoder 126 and used by feature decoder 146. Although FIG. 1 shows that system 100 is capable of compressing and transmitting both the image data and the feature attributes, some embodiments only need to use one of them. For example, object tracking is a computer-vision application that involves a sensor or camera extracting features and sending encoded feature attributes to a server for analysis. Thus, in these embodiments, only encoded feature bitstream 164 needs to be transmitted from source device 120 to destination device 140.
Consistent with the disclosed embodiments, the techniques for encoding and decoding the feature attributes may be in compliance with video coding standards, such as the MPEG-7 CDVS standard or the MPEG-7 CDVA standard. The MPEG-7 CDVS standard was developed to use local descriptors to describe features in a static image, while the MPEG-7 CDVA standard was developed to use local descriptors to describe features in a video sequence (i.e., a sequence of consecutive frames). Both the MPEG-7 CDVS and MPEG-7 CDVA standards are based on the Scale Invariant Feature Transform (SIFT) technique, which extracts keypoints from an image and describes the keypoints with certain pre-defined attributes or descriptors. The SIFT technique is described by D. G. Lowe in “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
To illustrate the concept of keypoints, FIG. 2A is a schematic diagram illustrating an exemplary input image 200, consistent with embodiments of the present disclosure. And FIG. 2B shows a plurality of keypoints 210 extracted from input image 200. Each keypoint 210 is shown as a blob surrounding a local area rich in features. The SIFT technique has been used in the art to extract the features and generate keypoints 210 that are invariant to image translation, image rotation, image scaling, and/or changes in, e.g., viewpoint and illumination. Thus, these keypoints are also called “SIFT keypoints.”
To illustrate the attributes used for describing the keypoints, FIG. 3 is a schematic diagram illustrating an exemplary keypoint 310 in a scale space of an input image, consistent with embodiments of the present disclosure. The scale space is described by a function that is produced from the convolution of a Gaussian kernel (at different scales) and the input image. As shown in FIG. 3, keypoint 310 may be described by various parameters, including, for example: keypoint position (x, y) in an image scale space in which keypoint 310 is detected, scale σ (a radius of the keypoint in the scale space), and orientation θ (an angle of the keypoint in the image scale space, expressed in radians). The scale or radius σ describes the extent of the area keypoint 310 occupies in the scale space, and depends on the keypoint’s position (x, y) in the image scale space. In some embodiments, keypoint 310 is also accompanied by a descriptor (not shown) of the area occupied by keypoint 310. The descriptor is a 3-dimensional spatial histogram of the image gradients characterizing the appearance of the keypoint. The gradient at each pixel is regarded as a sample of a three-dimensional elementary feature vector, formed by the pixel location and the gradient orientation. Samples are weighted by the gradient norm and accumulated in a 3-dimensional histogram, which forms the SIFT descriptor of the region. An additional Gaussian weighting function may be applied to the SIFT descriptor to give less importance to gradients farther away from the keypoint.
Referring back to FIG. 2B, a plurality of keypoints may be extracted from an image. Each keypoint can be described by a feature vector that includes the keypoint position (x, y) , scale σ, orientation θ, and/or descriptor. The scales σ, orientations θ, and descriptors are also referred to as “attributes” of the keypoints. As an example, for a 1920×1088 image, about 18000 keypoints may be extracted and each of the keypoints is described by a feature vector representing the associated feature attributes. A portion of these feature vectors is shown in FIG. 4. Although the feature vectors in FIG. 4 do not include information associated with the keypoint descriptors, it is contemplated that each of these feature vectors may include additional elements representing the associated keypoint descriptor.
FIG. 17 is a flowchart of an exemplary Scale Invariant Feature Transform (SIFT) algorithm 1700 for extracting features from image data, consistent with embodiments of the present disclosure. For example, algorithm 1700 may be performed by a feature encoder, such as feature encoder 126 (FIG. 1). As shown in FIG. 17, SIFT algorithm 1700 may include four steps: 1710-Extrema Detection, 1720-Keypoint Localization, 1730-Orientation Assignment, and 1740-Keypoint Descriptor Generation. Specifically, at step 1710 (Extrema Detection), the feature encoder examines an input image under various scales to isolate points of the input image from their surroundings. These points, called extrema, are potential candidates for image features. At step 1720 (Keypoint Localization), the feature encoder selects some of these points (extrema) to be keypoints. This selection rejects extrema that are caused by edges of the picture and by low contrast points. At step 1730 (Orientation Assignment), the feature encoder converts each keypoint and its neighborhood into a set of vectors by computing a scale σ and an orientation θ for them. At step 1740 (Keypoint Descriptor Generation), the feature encoder takes a collection of vectors in the neighborhood of each keypoint and consolidates this information into a set of vectors called the descriptor. Each descriptor is converted into a feature by computing a normalized sum of these vectors.
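By way of illustration and not limitation, the extraction of such keypoint attributes may be sketched in Python using the third-party OpenCV library; the input file name below is a hypothetical placeholder, and note that OpenCV reports the orientation in degrees and the scale as a neighborhood diameter, unlike the radian and radius conventions used above:

    import cv2

    # Illustrative sketch (assumes OpenCV >= 4.4): extract SIFT keypoints and
    # read out the per-keypoint attributes discussed above.
    img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input path
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    for kp in keypoints:
        x, y = kp.pt            # keypoint position (x, y)
        scale = kp.size / 2.0   # OpenCV gives a neighborhood diameter; halved here as a radius-like scale
        theta = kp.angle        # orientation, in degrees in OpenCV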
The feature vectors can be analyzed and used by various computer-vision applications, for example, object detection, recognition, and tracking. The feature vectors can be stored/transmitted in their raw form. Additionally or alternatively, the feature vectors can be described in stream syntax or encapsulated by additional information (e.g., a checksum).
Increasing demand for the use of feature attributes or feature streams increases the amount of feature data that needs to be stored or transmitted. However, the above-described feature attributes may occupy a large data range, which incurs significant cost for data storage and transmission. Therefore, it is desirable to develop efficient techniques to store and transmit the feature data. The present disclosure provides embodiments to improve the efficiency of storing and transmitting feature data by modifying the representation of the feature data to reduce its dynamic range, thereby reducing the cost for storing and transmitting the feature data.
Specifically, in the disclosed embodiments, the keypoints extracted from an image are partitioned into a plurality of clusters based on the keypoint positions. The keypoints in each cluster have better matching local image properties, and thus have less diverse attribute values. The dynamic range of the attribute values in each cluster can be reduced by subtracting, from each of the attribute values, a reference value associated with the  respective cluster. For example, in some embodiments, the attribute value of a cluster’s medoid is chosen as the cluster’s reference value. Medoid of a cluster is defined as one of the cluster’s keypoints whose sum of distances to all the other keypoints in the cluster is minimal. In a cluster, a non-medoid keypoint’s attribute value is subtracted by the medoid’s attribute value, while the medoid keypoint’s attribute value is unchanged because it is representative of the cluster’s attribute values. By such differential representation of the non-medoid keypoints’ attribute values relative to the respective medoid keypoint’s attribute value, the dynamic range of the attribute values is reduced.
The above concept of differential representation for the non-medoid keypoints is further illustrated in FIG. 5, which shows a cluster of 8 keypoints represented by solid dark circles. One of the keypoints is determined to be the medoid of the cluster, whose sum of distances to all the other keypoints in the cluster is the smallest. The scale value of the medoid keypoint can be subtracted from the scale values of the non-medoid keypoints to reduce their dynamic range. The medoid keypoint’s scale value is unchanged, so that it can be preserved as the reference value for the cluster. Since the keypoints in a cluster are generally detected at the same level of the scale pyramid in the SIFT algorithm, the medoid keypoint’s scale value is a good representation of the scale values of the keypoints in the same cluster. Although FIG. 5 only shows differential representation of the non-medoid keypoints’ scale values, it is contemplated that the differential representation can also be applied to the orientation values in the same way.
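By way of a minimal numerical sketch in Python, the scale values and medoid index below are invented for illustration and are not taken from FIG. 5:

    # Differential representation of a cluster's scale values (illustrative data).
    scales = [4.8, 5.1, 4.9, 5.0, 5.3, 4.7, 5.2, 5.0]   # eight keypoints, as in FIG. 5
    medoid_idx = 3                                       # suppose the 4th keypoint is the medoid
    reference = scales[medoid_idx]                       # the medoid's scale is the reference value

    params = [reference if i == medoid_idx else s - reference
              for i, s in enumerate(scales)]
    # params ≈ [-0.2, 0.1, -0.1, 5.0, 0.3, -0.3, 0.2, 0.0]
    # The non-medoid residuals occupy a much narrower dynamic range than the raw scales.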
In some embodiments, a medoid clustering algorithm, for example, a Partitioning Around Medoids (PAM) algorithm, is used to partition the keypoints and determine the medoid of a cluster. The PAM algorithm searches for a set of representative data points in an input data set and then assigns the remaining data points to the closest medoid in order to create clusters. The representative data point of a cluster is also called the “medoid” of the cluster and is the mathematical center of the cluster. The remaining data points in the cluster are called “non-medoids. ” In some embodiments, the PAM algorithm may be implemented in three steps. In the first step, the PAM algorithm chooses, from the input data set, one or more data points as initial medoids. The one or more initial medoids correspond to one or more clusters, respectively. In the second step, the PAM algorithm associates each of the remaining data points with its closest initial medoid. In the third step, the PAM algorithm searches for the optimum medoid for a cluster, by minimizing a cost function that measures the sum of dissimilarities between the medoid and the other data points in the same cluster.
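A simplified Python sketch of this clustering loop is given below; the function names and the fixed iteration cap are illustrative assumptions, and the full PAM algorithm additionally evaluates swap costs, so this is a sketch of the assignment/update idea rather than a complete implementation:

    import math

    def dist(p, q):
        # Euclidean distance between two keypoint positions (x, y)
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def pam(points, initial_medoids, max_iter=20):
        # Simplified medoid clustering: alternate between (a) assigning each
        # point to its closest medoid and (b) moving each medoid to the cluster
        # member whose summed distance to the other members is minimal.
        medoids = list(initial_medoids)
        for _ in range(max_iter):
            clusters = [[] for _ in medoids]
            for p in points:
                nearest = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
                clusters[nearest].append(p)
            new_medoids = [min(c, key=lambda q: sum(dist(q, r) for r in c)) if c else m
                           for c, m in zip(clusters, medoids)]
            if new_medoids == medoids:   # medoids are stable: clustering has converged
                return medoids, clusters
            medoids = new_medoids
        return medoids, clusters

The keypoints located on the convex hull (described next) may be passed as initial_medoids, mirroring the initialization used by the disclosed methods.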
In some embodiments, the keypoints located on a convex hull of the input keypoints are used as initial medoids of the medoid clustering algorithm. This ensures that the encoder and decoder use the same method to determine the initial medoids, and hence results in the same clusters and medoids being generated on both the encoder and decoder sides. As shown in FIG. 6, a convex hull of a set of keypoints is the smallest convex polygon that contains all the keypoints in the set. Specifically, the set of keypoints is shown as gray filled areas and the convex hull is shown as a solid-line polygon encompassing all the keypoints. A convex hull determination algorithm, for example, a Jarvis march algorithm, can be used to determine the convex hull. FIG. 18 is a schematic diagram illustrating an exemplary method for implementing the Jarvis march algorithm. As shown in FIG. 18, the Jarvis march algorithm starts with the left-most keypoint P0 and looks for the convex-hull keypoints in a counter-clockwise direction. Specifically, the Jarvis march algorithm starts from the left-most keypoint P0 and finds the next convex-hull keypoint P1, which forms the largest angle (Angle-0 in FIG. 18) with P0. The Jarvis march algorithm then starts from keypoint P1 and finds the next convex-hull keypoint P2, which forms the largest angle (Angle-1 in FIG. 18) with P1. The Jarvis march algorithm repeats the above process until it returns to the left-most keypoint P0. The set of keypoints chosen in this way constitutes the convex hull.
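The following Python sketch captures the gift-wrapping idea of the Jarvis march; it assumes at least three points given as (x, y) tuples and, for brevity, does not specially handle collinear or duplicate points:

    def cross(o, a, b):
        # z-component of the cross product (a - o) x (b - o);
        # positive for a counter-clockwise turn, negative for a clockwise turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def convex_hull(points):
        # Jarvis march: start at the left-most point and repeatedly pick the
        # next point so that every other point lies to the left of the current
        # edge, walking the hull counter-clockwise back to the start.
        start = min(points)
        hull, current = [], start
        while True:
            hull.append(current)
            candidate = points[0]
            for p in points:
                if p == current:
                    continue
                if candidate == current or cross(current, candidate, p) < 0:
                    candidate = p
            current = candidate
            if current == start:
                return hull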
The keypoints located on the convex hull can be used as initial medoid for the disclosed clustering process, for example, a PAM algorithm, the output of which is a set of clusters with medoids. FIG. 7 is a schematic diagram illustrating a medoid clustering process, according to embodiments consistent with the present disclosure. As shown in FIG. 7, a set of keypoints is shown as gray areas. The keypoints located on the convex hull of the set of keypoints are shown as asterisk marks and are used as initial medoids for the medoid clustering process. The cross marks correspond to the medoids of the clusters output by the medoid clustering process.
In some embodiments, a hierarchical clustering process may be used to partition the keypoints. For example, for any given cluster, a ratio of the total number of keypoints in the cluster over the number of keypoints on the cluster’s convex hull is determined. If the ratio is greater than or equal to the number of keypoints on the cluster’s convex hull, the cluster is further partitioned into smaller clusters; otherwise, the cluster is not further partitioned. Thus, the final output of the hierarchical clustering process is a plurality of clusters deemed unpartitionable. This ensures the unambiguity of the partitioning, and achieves a level of granularity such that the reference values of the resulting clusters (e.g., the attribute values of the medoid keypoints) can better match the local scene properties.
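In Python, the termination test of this hierarchical scheme might be sketched as follows, with convex_hull as sketched above and the floor division mirroring the rounding used in methods 1000 and 1300 described below:

    def should_partition(cluster_keypoints):
        # Split a cluster further only while floor(N / M) >= M, where N is the
        # number of keypoints in the cluster and M is the number of keypoints
        # on its convex hull; otherwise the cluster is deemed unpartitionable.
        n = len(cluster_keypoints)
        m = len(convex_hull(cluster_keypoints))
        return n // m >= m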
In summary, the above-described differential representation of non-medoid keypoints’ attribute values is a lossless hierarchical coding method. It does not lose any useful information of the feature attributes. It can also be reversibly encoded and decoded. Moreover, other than the attribute values themselves, no additional information needs to be sent to the decoder to facilitate the decoding. Therefore, the disclosed technique for coding the attribute values is highly efficient. This technique can be applied to tools that use SIFT, such as direct SIFT, CDVS, CDVA, etc.
The disclosed technique for coding the attribute values can be implemented in an encoder or a decoder. FIG. 8 is a block diagram of an example apparatus 800 for encoding or decoding image feature attributes, consistent with embodiments of the disclosure. For example, feature encoder 126 and feature decoder 146 can each be implemented as apparatus 800 for performing the above-described methods. As shown in FIG. 8, apparatus 800 can include processor 802. When processor 802 executes instructions described herein, apparatus 800 can become a specialized machine for encoding or decoding feature attributes. Processor 802 can be any type of circuitry capable of manipulating or processing information. For example, processor 802 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like. In some embodiments, processor 802 can also be a set of processors grouped as a single logical component. For example, as shown in FIG. 8, processor 802 can include multiple processors, including processor 802a, processor 802b, and processor 802n.
Apparatus 800 can also include memory 804 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like). For example, as shown in FIG. 8, the stored data can include program instructions (e.g., program instructions for implementing the encoding or decoding of feature attributes) and data for processing (e.g., image data, features extracted from image data, or an encoded feature bitstream). Processor 802 can access the program instructions and data for processing (e.g., via bus 810), and execute the program instructions to perform an operation or manipulation on the data for processing. Memory 804 can include a high-speed random-access storage device or a non-volatile storage device. In some embodiments, memory 804 can include any combination of any number of a random-access memory (RAM), a read-only memory (ROM), an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a secure digital (SD) card, a memory stick, a compact flash (CF) card, or the like. Memory 804 can also be a group of memories (not shown in FIG. 8) grouped as a single logical component.
Bus 810 can be a communication device that transfers data between components inside apparatus 800, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.
For ease of explanation without causing ambiguity, processor 802 and other data processing circuits are collectively referred to as a “data processing circuit” in this disclosure. The data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware. In some implementations, the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 800.
Apparatus 800 can further include network interface 806 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) . In some embodiments, network interface 806 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.
In some embodiments, apparatus 800 can further include peripheral interface 808 to provide a connection to one or more peripheral devices. As shown in FIG. 8, the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.
FIG. 9 is a flowchart of a method 900 for encoding image feature attributes, according to embodiments consistent with the present disclosure. For example, method 900 may be performed by a feature encoder, such as a processor device in the feature encoder (e.g., processor 802) . As shown in FIG. 9, method 900 includes the following steps 902-906.
In step 902, the feature encoder partitions a plurality of image features into one or more clusters. For example, the image features may be scale-invariant feature transform (SIFT) keypoints extracted from an image. Each of the keypoints can be described by various attributes, including, for example, a scale and an orientation. The partitioning is based on the positions of the keypoints. In some embodiments, a partitioning around medoid (PAM) algorithm is performed by the feature encoder to partition the plurality of image features. A cluster output by the PAM algorithm includes a medoid keypoint and one or more non-medoid keypoints.
In some embodiments, the feature encoder uses a hierarchical partitioning process to perform the partitioning in step 902. FIG. 10 is a flowchart of an exemplary hierarchical partitioning method 1000 for use in step 902 (FIG. 9), according to embodiments consistent with the present disclosure. For example, method 1000 may be performed by a processor (e.g., processor 802) in a feature encoder. As shown in FIG. 10, method 1000 can be executed in multiple iterations. In step 1002, the feature encoder receives a plurality of input keypoints, which are image regions generated by applying a feature-extraction algorithm, e.g., a SIFT algorithm, to an image. The plurality of input keypoints may belong to a cluster generated by a previous iteration of method 1000. Then, in step 1004, the feature encoder determines a total number of the input keypoints. During this counting process, keypoints lying at the same position (x, y) are counted as separate keypoints. The counted total number is denoted as “N” in the following description. In step 1006, the feature encoder determines a convex hull of the input keypoints. For example, a Jarvis march algorithm may be executed to determine the convex hull. In step 1008, the feature encoder identifies the number of keypoints located on the convex hull. The number of keypoints on the convex hull is denoted as “M” in the following description. In step 1010, the feature encoder determines the value of N/M, which is the ratio of the total number of the input keypoints over the number of the keypoints located on the convex hull. In one embodiment, the feature encoder may round down the value of N/M to the nearest integer. After the quotient ⌊N/M⌋ is determined, the feature encoder compares ⌊N/M⌋ to M, i.e., the number of the keypoints located on the convex hull. If ⌊N/M⌋ ≥ M, the feature encoder determines that the input keypoints can be further partitioned and proceeds to step 1012 to partition the input keypoints; or if ⌊N/M⌋ < M, the feature encoder determines that the input keypoints form an unpartitionable cluster, and proceeds to step 1014 to determine parameters representing the attributes of the input keypoints. The details of the operations involved in step 1014 are explained below in connection with FIG. 11.
In an embodiment, if N/M > M, the feature encoder determines that the input keypoints can be further partitioned and proceeds to step 1012 to partition the input keypoints; or if N/M ≤ M, the feature encoder determines that the input keypoints form an unpartitionable cluster, and proceeds to step 1014 to determine parameters representing the attributes of the input keypoints.
Referring back to method 1000 (FIG. 10), in step 1012, the feature encoder partitions the input keypoints into a plurality of clusters, such as by executing a Partitioning Around Medoids (PAM) algorithm. The partitioning is based on the positions of the input keypoints. The M keypoints located on the convex hull (determined in step 1008) may be used as initial medoids for the PAM algorithm, so as to cause the algorithm to partition the input keypoints into M clusters. As described above, to perform unambiguous clustering on both the encoder and decoder sides without transmitting additional information to the decoder, it may be desired to precisely define the initial medoids for the PAM algorithm used by the encoder and decoder. Defining the initial medoids to be the keypoints located on the convex hull ensures that the encoder and decoder partition the keypoints in the same way, without requiring additional coding information to be sent from the encoder to the decoder. In some embodiments, information of the initial medoids may be provided to the decoder to facilitate identification of the keypoints. For example, the information provided to the decoder may include the number or initial positions of the initial medoids. Such information allows the decoder to use the same initial medoids as the encoder.
After step 1012, the feature encoder returns to step 1002 and initiates a new iteration of method 1000 on each of the plurality of clusters resulting from step 1012. The feature encoder ends method 1000 when no cluster can be further partitioned.
Referring back to method 900 (FIG. 9), in step 904, the feature encoder determines a reference value for at least one of the one or more clusters. For example, the reference value for a cluster may be determined to be equal to an attribute value (e.g., scale value or orientation value) of the cluster’s medoid keypoint.
In step 906, the feature encoder generates a plurality of parameters respectively associated with the plurality of image features. The plurality of parameters represents the attributes of the keypoints. Each of the plurality of parameters is generated based on a value of an attribute of the respective image feature and the reference value for the respective cluster. According to some embodiments, the value of the parameter associated with each medoid feature is made equal to the reference value for the respective cluster (i.e., the attribute value of the respective cluster’s medoid). And the value of the parameter associated with each non-medoid feature is made to be equal to a difference of: an attribute value of the respective non-medoid feature, and the reference value for the respective cluster (i.e., the attribute value of the respective cluster’s medoid).
The operations involved in steps 904 and 906 are further illustrated in FIG. 11, which shows a method 1100 for determining parameters representing the attributes of a cluster of keypoints. For example, method 1100 may be performed by a processor in the feature encoder (e.g., processor 802). As shown in FIG. 11, method 1100 includes the following steps 1102-1106.
In step 1102, the feature encoder determines a medoid for a cluster of keypoints. The medoid’s attribute value is used as a reference value for the cluster. For example, the attribute of interest may be the keypoint scale, and the medoid’s scale value is used as a reference scale value for the cluster.
In step 1104, for each non-medoid keypoint of the cluster, the feature encoder determines a parameter representing the attribute value of the non-medoid keypoint. In some embodiments, the value of the parameter is determined to be equal to a difference between the non-medoid keypoint’s attribute value and the respective cluster’s reference value. For example, if the attribute of interest is the keypoint scale, the parameter for each non-medoid keypoint is determined to be equal to σ_non-medoid{i} − σ_medoid_of_cluster, wherein σ_non-medoid{i} denotes the scale value of an ith non-medoid keypoint, and σ_medoid_of_cluster denotes the scale value of the medoid keypoint. As another example, if the attribute of interest is the keypoint orientation, the parameter for each non-medoid keypoint is determined to be equal to θ_non-medoid{i} − θ_medoid_of_cluster, wherein θ_non-medoid{i} denotes the orientation value of an ith non-medoid keypoint, and θ_medoid_of_cluster denotes the orientation value of the medoid keypoint.
In step 1106, the feature encoder determines whether a parameter is determined for all the non-medoid keypoints in the cluster. If not, the feature encoder returns to step 1104 and determines the parameter for a next non-medoid keypoint in the cluster. If the last non-medoid keypoint in the cluster has been reached, the feature encoder ends method 1100 and returns the determined parameters.
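Steps 1102-1106 may be gathered into a per-cluster routine such as the following Python sketch, where keypoints are represented as (x, y, σ) tuples; the data format and function names are illustrative assumptions rather than part of the disclosed methods:

    import math

    def encode_cluster(keypoints):
        # keypoints: list of (x, y, sigma) tuples for one cluster (illustrative format)
        def dist(p, q):
            return math.hypot(p[0] - q[0], p[1] - q[1])
        # Step 1102: the medoid minimizes the summed distance to all other
        # keypoints of the cluster; its scale serves as the reference value.
        m = min(range(len(keypoints)),
                key=lambda i: sum(dist(keypoints[i][:2], kp[:2]) for kp in keypoints))
        reference = keypoints[m][2]
        # Steps 1104-1106: the medoid keeps its scale; every non-medoid keypoint
        # is represented by the difference between its scale and the reference.
        params = [reference if i == m else kp[2] - reference
                  for i, kp in enumerate(keypoints)]
        return m, params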
Consistent with the disclosed embodiments, method 1100 can be executed at step 1014 of method 1000 (FIG. 10) to determine the parameters representing the attributes of non-medoid keypoints. Moreover, the parameters representing the attributes of medoid keypoints are made equal to the medoid keypoints’ attribute values. Referring back to method 900 (FIG. 9) , at the end of step 906, the generated plurality of parameters can be inserted in a feature bitstream to be transmitted to a decoder. As described above, because the generated plurality of parameters has a smaller data range than the original attribute values’ range, using the generated plurality of parameters to represent the attribute values can reduce the storage and transmission costs.
FIG. 12 is a flowchart of a method 1200 for decoding image feature attributes, according to embodiments consistent with the present disclosure. For example, method 1200 may be performed by a feature decoder, such as a processor device in the feature decoder (e.g., processor 802) . As shown in FIG. 12, method 1200 includes the following steps 1202-1208.
In step 1202, the feature decoder receives a data stream comprising a plurality of parameters respectively associated with a plurality of image features. For example, the image features may be scale-invariant feature transform (SIFT) keypoints extracted from an image. Each of the keypoints can be described by various attributes, including, for example, a scale and an orientation.
In step 1204, the feature decoder partitions the plurality of image features into one or more clusters. The partitioning is based on the positions of the keypoints. In some embodiments, a partitioning around medoid (PAM) algorithm is executed by the feature decoder to partition the plurality of image features. A cluster output by the PAM algorithm includes a medoid keypoint and one or more non-medoid keypoints.
In some embodiments, the feature decoder uses a hierarchical partitioning process to perform the partitioning in step 1204. FIG. 13 is a flowchart of an exemplary hierarchical partitioning method 1300 for use in step 1204 (FIG. 12), according to embodiments consistent with the present disclosure. For example, method 1300 may be performed by a feature decoder, such as a processor device in the feature decoder (e.g., processor 802). As shown in FIG. 13, method 1300 can be executed in multiple iterations. In step 1302, the feature decoder receives a plurality of input keypoints. The plurality of input keypoints may belong to a cluster generated by a previous iteration of method 1300. Then, in step 1304, the feature decoder determines a total number of the input keypoints. During this counting process, keypoints lying at the same position (x, y) are counted separately. The counted total number is denoted as “N” in the following description. In step 1306, the feature decoder determines a convex hull of the input keypoints. For example, a Jarvis march algorithm may be executed to determine the convex hull. In step 1308, the feature decoder identifies the number of keypoints located on the convex hull. The number of keypoints on the convex hull is denoted as “M” in the following description. In step 1310, the feature decoder determines the value of N/M, which is the ratio of the total number of the input keypoints over the number of the keypoints located on the convex hull. In one embodiment, the feature decoder may round down the value of N/M to the nearest integer. After the quotient ⌊N/M⌋ is determined, the feature decoder compares ⌊N/M⌋ to M, i.e., the number of the keypoints located on the convex hull. If ⌊N/M⌋ ≥ M, the feature decoder determines that the input keypoints can be further partitioned and proceeds to step 1312 to partition the input keypoints; or if ⌊N/M⌋ < M, the feature decoder determines that the input keypoints form an unpartitionable cluster, and proceeds to step 1314 to determine attribute values of the input keypoints. The details of the operations involved in step 1314 are explained below in connection with FIG. 14.
In an embodiment, if N/M > M, the feature decoder determines that the input keypoints can be further partitioned and proceeds to step 1312 to partition the input keypoints; or if N/M ≤ M, the feature decoder determines that the input keypoints form an unpartitionable cluster, and proceeds to step 1314 to determine attribute values of the input keypoints.
Referring back to method 1300 (FIG. 13), in step 1312, the feature decoder partitions the input keypoints into a plurality of clusters, such as by executing a Partitioning Around Medoids (PAM) algorithm. The partitioning is based on the positions of the input keypoints. The M keypoints located on the convex hull (determined in step 1308) may be used as initial medoids for the PAM algorithm, so as to cause the algorithm to partition the input keypoints into M clusters. As described above, to perform unambiguous clustering on both the encoder and decoder sides without transmitting additional information to the decoder, it may be desired to precisely define the initial medoids for the PAM algorithm used by the encoder and decoder. Defining the initial medoids to be the keypoints located on the convex hull ensures that the encoder and decoder partition the keypoints in the same way, without requiring additional coding information to be sent from the encoder to the decoder. In some embodiments, the encoder may provide information of the initial medoids directly to the decoder to facilitate identification of the keypoints. For example, the information provided to the decoder may include the number or initial positions of the initial medoids.
After step 1312, the feature decoder returns to step 1302 and initiates a new iteration of method 1300 on each of the plurality of clusters resulting from step 1312. The feature decoder ends method 1300 when no cluster can be further partitioned.
Referring back to method 1200 (FIG. 12), in step 1206, the feature decoder determines a reference value for at least one of the one or more clusters. For example, the reference value for a cluster may be determined to be equal to a value of the parameter associated with the cluster’s medoid keypoint.
In step 1208, the feature decoder determines respective attributes of the plurality of image features. The attribute of each image feature is determined based on: the parameter associated with the respective image feature, and the reference value for the respective cluster. According to some embodiments, the attribute of each medoid feature is determined to have a value equal to the reference value for the respective cluster. And the attribute of each non-medoid feature is determined to have a value equal to a sum of: a value of the parameter associated with the respective non-medoid feature, and the reference value for the respective cluster (i.e., the value of the parameter associated with the respective cluster’s medoid).
In some embodiments, the operations involved in step 1208 may include determining the attributes of all of the image features, or determining the attributes of only a part of the image features.
The operations involved in steps 1206 and 1208 are further illustrated in FIG. 14, which shows a method 1400 for determining attribute values of a cluster of keypoints. For example, method 1400 may be performed by a processor in the feature decoder (e.g., processor 802). As shown in FIG. 14, method 1400 includes the following steps 1402-1406.
In step 1402, the feature decoder determines a medoid for a cluster of keypoints. The value of the parameter associated with the medoid is used as a reference value for the cluster.
In step 1404, for each non-medoid keypoint of the cluster, the feature decoder determines an attribute value of the non-medoid keypoint. Specifically, the attribute value is determined to be equal to a sum of the value of the non-medoid keypoint’s associated parameter and the reference value for the respective cluster. For example, the scale value of each non-medoid keypoint is determined to be equal to parameter_non-medoid{i} + parameter_medoid_of_cluster, wherein parameter_non-medoid{i} denotes the value of the parameter associated with an ith non-medoid keypoint, and parameter_medoid_of_cluster denotes the value of the parameter associated with the medoid keypoint (i.e., the reference value for the respective cluster).
In step 1406, the feature decoder determines whether an attribute value is determined for all the non-medoid keypoints in the cluster. If not, the feature decoder returns to step 1404 and determines the attribute value of a next non-medoid keypoint in the cluster. If the last non-medoid keypoint in the cluster has been reached, the feature decoder ends method 1400 and returns the determined attribute values.
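A corresponding decoder-side Python sketch of steps 1402-1406 is given below; the medoid index is re-derived from the keypoint positions exactly as on the encoder side, so it need not be transmitted, and the names mirror the illustrative encode_cluster sketch above:

    def decode_cluster(params, medoid_idx):
        # Step 1402: the medoid's parameter is the cluster's reference value.
        reference = params[medoid_idx]
        # Steps 1404-1406: each non-medoid attribute is recovered as the sum of
        # its parameter and the reference value; the medoid keeps its own value.
        return [reference if i == medoid_idx else p + reference
                for i, p in enumerate(params)]

Because this mapping exactly inverts the encoder-side subtraction, calling m, params = encode_cluster(cluster) followed by decode_cluster(params, m) recovers the original attribute values; bit-exact recovery is guaranteed when the attribute values are represented as integers (e.g., quantized scales).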
Consistent with the disclosed embodiments, method 1400 can be executed at step 1314 of method 1300 (FIG. 13) to determine the attribute values of non-medoid keypoints. Moreover, the attribute values of medoid keypoints are made equal to the values of the parameters associated with the medoid keypoints, respectively.
The above-described methods for encoding and decoding feature attributes can achieve a reduction in the aggregate cost of representing attribute values (e.g., scale values or orientation values of SIFT keypoints). This effect is illustrated in FIG. 15 and FIG. 16. Specifically, FIG. 15 shows a histogram for the σ values of SIFT keypoints extracted from a sample Poznań CarPark frame, without using the disclosed scheme for representing the scale values. In contrast, FIG. 16 shows a histogram for the σ values of SIFT keypoints extracted from the same sample Poznań CarPark frame, by using the disclosed scheme for representing the scale values. The conditions for generating the histograms in both FIG. 15 and FIG. 16 include a quantization parameter (QP) of 47 and 5996 keypoints. Moreover, the keypoints in FIG. 16 are partitioned into 24 clusters. In both FIG. 15 and FIG. 16, the box plot covers 50% of the observations; the dashed line in the box plot is the median of these 50% of observations; and the continuous Kernel Density Estimation (KDE) plot smooths the observations with a Gaussian kernel, producing a continuous density estimate. Compared to FIG. 15, the box covering 50% of the observations in FIG. 16 has a narrower dynamic range. Thus, the disclosed methods for representing the keypoints’ attribute values, as well as the disclosed methods for encoding and decoding the feature data, can reduce the dynamic range of the encoded feature data.
In some embodiments, a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder), for performing the above-described methods. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, an NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. The device may include one or more processors (CPUs), an input/output interface, a network interface, and/or a memory.
It should be noted that relational terms herein such as “first” and “second” are used only to differentiate one entity or operation from another, and do not require or imply any actual relationship or sequence between these entities or operations. Moreover, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and open ended, in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention should only be limited by the appended claims.

Claims (81)

  1. A computer-implemented decoding method, comprising:
    receiving a data stream comprising a plurality of parameters respectively associated with a plurality of image features;
    partitioning the plurality of image features into one or more clusters;
    determining a reference value for at least one of the one or more clusters; and
    determining respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on:
    the parameter associated with the respective image feature, and
    the reference value for the respective cluster.
  2. The method of claim 1, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  3. The method of claim 1 or 2, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  4. The method of any of claims 1-3, wherein the partitioning is based on positions of the plurality of image features on an image.
  5. The method of any of claims 1-4, wherein the plurality of image features is partitioned using a partitioning around medoid (PAM) algorithm.
  6. The method of any of claims 1-5, wherein image features in a cluster comprise a medoid feature, and the reference value for the cluster is determined to be a value of the parameter associated with the respective medoid feature.
  7. The method of any of claims 1-6, wherein image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein determining the attributes of the plurality of image features comprises:
    determining the attribute of the medoid feature to have a value equal to the reference value for the respective cluster; and
    determining the attribute of the non-medoid feature to have a value equal to a sum of:
    a value of the parameter associated with the respective non-medoid feature, and
    the reference value for the respective cluster.
  8. The method of claim 7, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  9. The method of any of claims 1-8, further comprising:
    before the partitioning, determining a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  10. The method of claim 9, wherein the convex hull is determined using a Jarvis march algorithm.
  11. The method of any of claims 1-10, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  12. The method of any of claims 1-10, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  13. The method of claim 11 or 12, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
  14. A computer-implemented encoding method, comprising:
    partitioning a plurality of image features into one or more clusters;
    determining a reference value for at least one of the one or more clusters; and
    generating a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on:
    a value of an attribute of the respective image feature, and
    the reference value for the respective cluster.
  15. The method of claim 14, further comprising:
    transmitting the plurality of parameters in a data stream to a decoder.
  16. The method of claim 14 or 15, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  17. The method of any of claims 14-16, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  18. The method of any of claims 14-17, wherein the partitioning is based on positions of the plurality of image features on an image.
  19. The method of any of claims 14-18, wherein the plurality of image features is partitioned using a partitioning around medoid (PAM) algorithm.
  20. The method of any of claims 14-19, wherein image features in a cluster comprise a medoid feature, and the reference value for the cluster is determined to be a value of an attribute of the respective medoid feature.
  21. The method of any of claims 14-20, wherein image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein:
    the parameter associated with the medoid feature has a value equal to the reference value for the respective cluster; and
    the parameter associated with the non-medoid feature has a value equal to a difference of:
    a value of an attribute of the respective non-medoid feature, and
    the reference value for the respective cluster.
  22. The method of claim 21, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  23. The method of any of claims 14-22, further comprising:
    before the partitioning, determining a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  24. The method of claim 23, wherein the convex hull is determined using a Jarvis march algorithm.
  25. The method of any of claims 14-24, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  26. The method of any of claims 14-24, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  27. The method of claim 25 or 26, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
  28. A device comprising:
    a memory storing instructions; and
    a processor configured to execute the instructions to cause the device to:
    receive a data stream comprising a plurality of parameters respectively associated with a plurality of image features;
    partition the plurality of image features into one or more clusters;
    determine a reference value for at least one of the one or more clusters; and
    determine respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on:
    the parameter associated with the respective image feature, and
    the reference value for the respective cluster.
  29. The device of claim 28, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  30. The device of claim 28 or 29, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  31. The device of any of claims 28-30, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features based on positions of the plurality of image features on an image.
  32. The device of any of claims 28-31, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features using a partitioning around medoid (PAM) algorithm.
  33. The device of any of claims 28-32, wherein image features in a cluster comprise a medoid feature, and the processor is configured to execute the instructions to cause the device to determine the reference value for the cluster to be a value of the parameter associated with the respective medoid feature.
  34. The device of any of claims 28-33, wherein image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein in determining the attributes of the plurality of image features, the processor is configured to execute the instructions to cause the device to:
    determine the attribute of the medoid feature to have a value equal to the reference value for the respective cluster; and
    determine the attribute of the non-medoid feature to have a value equal to a sum of:
    a value of the parameter associated with the respective non-medoid feature, and
    the reference value for the respective cluster.
  35. The device of claim 34, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  36. The device of any of claims 28-35, wherein the processor is configured to execute the instructions to cause the device to:
    before the partitioning, determine a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  37. The device of claim 36, wherein the processor is configured to execute the instructions to cause the device to determine the convex hull using a Jarvis march algorithm.
  38. The device of any of claims 28-37, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features by executing an iterative hierarchical partitioning process, the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  39. The device of any of claims 28-37, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features by executing an iterative hierarchical partitioning process, the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  40. The device of claim 38 or 39, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
  41. A device comprising:
    a memory storing instructions; and
    a processor configured to execute the instructions to cause the device to:
    partition a plurality of image features into one or more clusters;
    determine a reference value for at least one of the one or more clusters; and
    generate a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on:
    a value of an attribute of the respective image feature, and
    the reference value for the respective cluster.
  42. The device of claim 41, wherein the processor is configured to execute the instructions to cause the device to:
    transmit the plurality of parameters in a data stream to a decoder.
  43. The device of claim 41 or 42, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  44. The device of any of claims 41-43, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  45. The device of any of claims 41-44, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features based on positions of the plurality of image features on an image.
  46. The device of any of claims 41-45, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features using a partitioning around medoid (PAM) algorithm.
  47. The device of any of claims 41-46, wherein image features in a cluster comprise a medoid feature, and the processor is configured to execute the instructions to cause the device to determine the reference value for the cluster to be a value of an attribute of the respective medoid feature.
  48. The device of any of claims 41-47, wherein image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein:
    the parameter associated with the medoid feature has a value equal to the reference value for the respective cluster; and
    the parameter associated with the non-medoid feature has a value equal to a difference of:
    a value of an attribute of the respective non-medoid feature, and
    the reference value for the respective cluster.
  49. The device of claim 48, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  50. The device of any of claims 41-49, wherein the processor is configured to execute the instructions to cause the device to:
    before the partitioning, determine a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  51. The device of claim 50, wherein the processor is configured to execute the instructions to cause the device to determine the convex hull using a Jarvis march algorithm.
  52. The device of any of claims 41-51, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features by executing an iterative hierarchical partitioning process, the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  53. The device of any of claims 41-51, wherein the processor is configured to execute the instructions to cause the device to partition the plurality of image features by executing an iterative hierarchical partitioning process, the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  54. The device of claim 52 or 53, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
  55. A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:
    receiving a data stream comprising a plurality of parameters respectively associated with a plurality of image features;
    partitioning the plurality of image features into one or more clusters;
    determining a reference value for at least one of the one or more clusters; and
    determining respective attributes of the plurality of image features, wherein the attribute of each image feature is determined based on:
    the parameter associated with the respective image feature, and
    the reference value for the respective cluster.
  56. The medium of claim 55, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  57. The medium of claim 55 or 56, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  58. The medium of any of claims 55-57, wherein the partitioning is based on positions of the plurality of image features on an image.
  59. The medium of any of claims 55-58, wherein the plurality of image features is partitioned using a partitioning around medoid (PAM) algorithm.
  60. The medium of any of claims 55-59, wherein the image features in a cluster comprise a medoid feature, and the reference value for the cluster is determined to be a value of the parameter associated with the respective medoid feature.
  61. The medium of any of claims 55-60, wherein the image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein determining the attributes of the plurality of image features comprises:
    determining the attribute of the medoid feature to have a value equal to the reference value for the respective cluster; and
    determining the attribute of the non-medoid feature to have a value equal to a sum of:
    a value of the parameter associated with the respective non-medoid feature, and
    the reference value for the respective cluster.
  62. The medium of claim 61, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  63. The medium of any of claims 55-62, wherein the set of instructions causes the device to perform:
    before the partitioning, determining a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  64. The medium of claim 63, wherein the convex hull is determined using a Jarvis march algorithm.
  65. The medium of any of claims 55-64, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  66. The medium of any of claims 55-64, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  67. The medium of claim 65 or 66, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
  68. A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:
    partitioning a plurality of image features into one or more clusters;
    determining a reference value for at least one of the one or more clusters; and
    generating a plurality of parameters respectively associated with the plurality of image features, wherein each of the plurality of parameters is generated based on:
    a value of an attribute of the respective image feature, and
    the reference value for the respective cluster.
  69. The medium of claim 68, wherein the set of instructions causes the device to perform:
    transmitting the plurality of parameters in a data stream to a decoder.
  70. The medium of claim 68 or 69, wherein the image features are scale-invariant feature transform (SIFT) keypoints on an image.
  71. The medium of any of claims 68-70, wherein the attributes of the plurality of image features comprise at least one of:
    scales of the plurality of image features; or
    orientations of the plurality of image features.
  72. The medium of any of claims 68-71, wherein the partitioning is based on positions of the plurality of image features on an image.
  73. The medium of any of claims 68-72, wherein the plurality of image features is partitioned using a partitioning around medoid (PAM) algorithm.
  74. The medium of any of claims 68-73, wherein image features in a cluster comprise a medoid feature, and the reference value for the cluster is determined to be a value of an attribute of the respective medoid feature.
  75. The medium of any of claims 68-74, wherein image features in a cluster comprise a medoid feature and one or more non-medoid features, wherein:
    the parameter associated with the medoid feature has a value equal to the reference value for the respective cluster; and
    the parameter associated with the non-medoid feature has a value equal to a difference of:
    a value of an attribute of the respective non-medoid feature, and
    the reference value for the respective cluster.
  76. The medium of claim 75, wherein the medoid feature of the cluster is a medoid of the image features in the respective cluster.
  77. The medium of any of claims 68-76, wherein the set of instructions causes the device to perform:
    before the partitioning, determining a convex hull of the plurality of image features,
    wherein image features on the convex hull are used as initial medoids for the partitioning.
  78. The medium of claim 77, wherein the convex hull is determined using a Jarvis march algorithm.
  79. The medium of any of claims 68-78, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to √N being greater than or equal to M, executing a second iteration of partitioning to partition the first cluster, wherein, when √N is less than M, no further partitioning is performed on the first cluster.
  80. The medium of any of claims 68-78, wherein the partitioning comprises executing an iterative hierarchical partitioning process on the plurality of image features, the executing of the iterative hierarchical partitioning process comprising:
    executing a first iteration of partitioning to generate a first cluster of image features, wherein there are N total image features in the first cluster and M image features located on a convex hull of the first cluster, N and M being positive integers; and
    in response to N being greater than or equal to M*M, executing a second iteration of partitioning to partition the first cluster, wherein, when N is less than M*M, no further partitioning is performed on the first cluster.
  81. The medium of claim 79 or 80, wherein the M image features located on the convex hull of the first cluster are used as initial medoids for partitioning the first cluster.
PCT/CN2022/144351 2022-09-28 2022-12-30 Lossless hierarchical coding of image feature attributes WO2024066123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22461609.4 2022-09-28
EP22461609 2022-09-28

Publications (1)

Publication Number Publication Date
WO2024066123A1 true WO2024066123A1 (en) 2024-04-04

Family

ID=83506117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/144351 WO2024066123A1 (en) 2022-09-28 2022-12-30 Lossless hierarchical coding of image feature attributes

Country Status (1)

Country Link
WO (1) WO2024066123A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150117540A1 (en) * 2013-10-29 2015-04-30 Sony Corporation Coding apparatus, decoding apparatus, coding data, coding method, decoding method, and program
CN105959696A (en) * 2016-04-28 2016-09-21 成都三零凯天通信实业有限公司 Video content safety monitoring method based on SIFT characteristic algorithm
US20170214936A1 (en) * 2016-01-22 2017-07-27 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Keypoint Trajectory Coding on Compact Descriptor for Video Analysis
CN107257980A (en) * 2015-03-18 2017-10-17 英特尔公司 Local change in video detects


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960713

Country of ref document: EP

Kind code of ref document: A1