WO2022182445A1 - Duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features

Duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features

Info

Publication number
WO2022182445A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
features
aggregated
video
Prior art date
Application number
PCT/US2022/013261
Other languages
French (fr)
Inventor
Ruiyuan LIN
Hongyu Sun
Zhebin ZHANG
Rijun LIAO
Binghuang CAI
Jian Sun
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Publication of WO2022182445A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
  • Existing video deduplication methods include frame- and video- level methods.
  • Frame-level methods generate features for each frame.
  • frame descriptors may be used to represent a video.
  • Video-level methods generate a global representation for each video.
  • global video features may be generated with the help of deep metric learning.
  • some frame-wise feature based conventional systems may fail to take the temporal information into consideration. This can limit the performance of the solutions. Further, speed is a limitation in some conventional systems, with such systems taking a very long time to process when there are huge amounts of video data (e.g., in the cloud).
  • the techniques of this disclosure generally relate to tools and techniques for implementing neural network, AI, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
  • a method may be provided for performing duplicate image or video determination. The method may be implemented by a computing system and may comprise performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors.
  • the method may further comprise determining whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
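  • As a non-limiting illustration (not part of the claimed subject matter), the binarization and Hamming-distance comparison described above might be sketched in Python as follows; the 0.5 threshold on the normalized values and the duplicate-decision cut-off max_distance are assumptions made only for the example:

```python
import numpy as np

def binarize(normalized_vec: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Map a normalized feature vector (values in [0, 1]) to a {0, 1} binary signature."""
    return (normalized_vec >= threshold).astype(np.uint8)

def hamming_distance(sig_a: np.ndarray, sig_b: np.ndarray) -> int:
    """Count of differing bits between two equal-length binary signatures."""
    return int(np.count_nonzero(sig_a != sig_b))

def is_duplicate(sig_a: np.ndarray, sig_b: np.ndarray, max_distance: int = 10) -> bool:
    """Declare a duplicate when the signatures differ in at most max_distance bits."""
    return hamming_distance(sig_a, sig_b) <= max_distance

# Example with two 256-D normalized feature vectors (random stand-ins for real features).
rng = np.random.default_rng(0)
v1 = rng.random(256)
v2 = np.clip(v1 + rng.normal(0.0, 0.05, 256), 0.0, 1.0)  # near-duplicate of v1
print(is_duplicate(binarize(v1), binarize(v2)))
```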
  • a system may be provided for performing duplicate image or video determination.
  • the system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • the first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: perform feature discriminability enhancement, using a deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a contrastive loss function to the binarized first normalized feature vector and the binarized second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and determine whether the second image is a duplicate of the first image based on a Hamming distance between the binarized first and second normalized feature vectors.
  • FIG. 1 is a schematic diagram illustrating a system for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • FIGs. 2A-2C are schematic block flow diagrams illustrating various non-limiting examples of duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • FIGs. 3A and 3B are schematic block flow diagrams illustrating non-limiting examples of deep metric learning frameworks with contrastive loss for use during duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • FIGs. 4A-4J are flow diagrams illustrating a method for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • FIG. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
  • Various embodiments provide tools and techniques for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
  • a computing system may perform feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors.
  • the computing system may determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
  • the computing system may comprise at least one of a machine learning system, an artificial intelligence (“AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the first neural network and the second neural network may each comprise at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • the computing system may perform image-specific feature extraction for the first image, by: performing local feature extraction for the first image to extract a plurality of first local features; and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image.
  • the computing system may perform image-specific feature extraction for the second image, by: performing local feature extraction for the second image to extract a plurality of second local features; and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image.
  • the computing system may perform image-specific feature extraction for the first image, by: splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells; performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features; aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image; and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set, wherein the first image-specific aggregated feature set may comprise the first combined aggregated feature set.
  • the computing system may perform image-specific feature extraction for the second image, by: splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number; performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features; aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image; and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set, wherein the second image-specific aggregated feature set may comprise the second combined aggregated feature set.
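  • A rough sketch of this grid-based extraction is given below; ORB keypoints and simple per-cell mean-pooling are stand-ins (the disclosure contemplates detectors such as SIFT/SURF/ORB and aggregation such as Fisher vectors), and the 3x3 grid and 32-D descriptor length are assumptions for the example:

```python
import cv2
import numpy as np

def grid_aggregated_feature(image_bgr: np.ndarray, rows: int = 3, cols: int = 3) -> np.ndarray:
    """Split an image into a grid, aggregate local descriptors per cell, and concatenate."""
    orb = cv2.ORB_create()
    h, w = image_bgr.shape[:2]
    cell_features = []
    for r in range(rows):
        for c in range(cols):
            cell = image_bgr[r * h // rows:(r + 1) * h // rows,
                             c * w // cols:(c + 1) * w // cols]
            gray = cv2.cvtColor(cell, cv2.COLOR_BGR2GRAY)
            _, desc = orb.detectAndCompute(gray, None)
            if desc is None or len(desc) == 0:
                # Cell without any keypoints; see the empty-cell handling discussed later.
                cell_features.append(np.zeros(32, dtype=np.float32))
            else:
                cell_features.append(desc.astype(np.float32).mean(axis=0))  # placeholder aggregation
    return np.concatenate(cell_features)  # combined aggregated feature (rows * cols * 32 dims here)

# Usage: combined = grid_aggregated_feature(cv2.imread("photo.jpg"))
```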
  • the computing system, based on a determination that at least one third sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, may assign each first image-specific feature set corresponding to each of the at least one third sub-image with three extra elements corresponding to average red/green/blue ("RGB") values, or the like; based on a determination that at least one fourth sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, may assign each first image-specific feature set corresponding to each of the at least one fourth sub-image with a negative value, or the like; or may assign a random value to sub-images without any keypoints for other elements that correspond to aggregated local features; or the like.
  • the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets
  • the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets.
  • the computing system may aggregate extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets; may binarize the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector; may aggregate extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets; may binarize the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector; and may determine whether the second video is a duplicate of the first video based on a Hamming distance between the first and second binarized key frame feature vectors.
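  • A minimal sketch of such a video-level signature, assuming per-key-frame aggregated features already lie in [0, 1]; mean pooling over key frames and the thresholds are illustrative choices, not the claimed method:

```python
import numpy as np

def video_signature(key_frame_features: list) -> np.ndarray:
    """Pool per-key-frame aggregated features and binarize into a video-level signature."""
    pooled = np.mean(np.stack(key_frame_features, axis=0), axis=0)  # frame-wise aggregation
    return (pooled >= 0.5).astype(np.uint8)                         # binarized key frame feature vector

def videos_are_duplicates(sig_a: np.ndarray, sig_b: np.ndarray, max_distance: int = 10) -> bool:
    """Compare two binarized video signatures by Hamming distance."""
    return int(np.count_nonzero(sig_a != sig_b)) <= max_distance
```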
  • the computing system may use a second deep metric learning framework to perform feature discriminability enhancement on each of the aggregated extracted features from the one or more first key frames and the aggregated extracted features from the one or more second key frames prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
  • the computing system may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames, the set of parameters comprising a rotation angle parameter, two translation parameters, and a zoom factor parameter; may aggregate the first 4D features for the one or more selected pairs of first frames; may concatenate the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features, wherein generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features; may match feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; may obtain a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames; may aggregate the second 4D features for the one or more selected pairs of second frames; and may concatenate the aggregated second 4D features with the aggregated extracted features from one or more second key frames to generate second combined 4D and key features, wherein generating the second binarized key frame feature vector may comprise binarizing the generated second combined 4D and key features.
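  • The "4D" motion feature described above might be estimated as in the hypothetical sketch below: keypoints are matched between a selected pair of frames and a similarity transform (rotation, two translations, uniform zoom) is fit. OpenCV's estimateAffinePartial2D is used here as one possible estimator; it is not named in the disclosure.

```python
import cv2
import numpy as np

def motion_4d(frame_a_gray: np.ndarray, frame_b_gray: np.ndarray):
    """Return [rotation angle, tx, ty, zoom factor] between two frames, or None on failure."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_a_gray, None)
    kp_b, des_b = orb.detectAndCompute(frame_b_gray, None)
    if des_a is None or des_b is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 4:
        return None
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffinePartial2D(pts_a, pts_b, method=cv2.RANSAC)
    if M is None:
        return None
    a, b = M[0, 0], M[1, 0]                    # M = [[s*cos(t), -s*sin(t), tx], [s*sin(t), s*cos(t), ty]]
    zoom = float(np.hypot(a, b))               # zoom (uniform scale) factor
    angle = float(np.arctan2(b, a))            # rotation angle in radians
    tx, ty = float(M[0, 2]), float(M[1, 2])    # two translation parameters
    return np.array([angle, tx, ty, zoom], dtype=np.float32)
```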
  • the computing system may use a third deep metric learning framework to perform feature discriminability enhancement on each of the generated first combined 4D and key features and the generated second combined 4D and key features prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
  • the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set.
  • the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set.
  • the computing system may concatenate the first feature vectors each corresponding to a respective differently dimensioned first image-specific aggregated feature set to generate a first concatenated feature vector; may transform the generated first concatenated feature vector, using a third neural network, prior to converting into the first normalized feature vector; may concatenate the second feature vectors each corresponding to a respective differently dimensioned second image-specific aggregated feature set to generate a second concatenated feature vector; and may transform the generated second concatenated feature vector, using a fourth neural network, prior to converting into the second normalized feature vector.
  • the first loss function may comprise one of contrastive loss function or triplet loss function, and/or the like.
  • the computing system may train each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function.
  • the second loss function may comprise a cross-entropy loss function, or the like.
  • the computing system may perform image or video deduplication based on a determination that the second image is a duplicate of the first image.
  • the computing system may perform search-by-image functionality by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image, wherein one of the first image or second image may be a query image and the other of the first image or second image may be a stored image.
  • the various embodiments provide a deep metric learning based framework(s) for performing duplicate image or video determination and/or image or video deduplication (or image-based searching).
  • the deep metric learning based framework(s) is lightweight, and does not require a very deep neural network but only a shallow neural network with a few fully connected ("FC") layers. This makes the deep metric learning based framework solution computationally efficient.
  • the deep metric learning based framework(s) may also be capable of running on mobile devices. Additionally, the deep metric learning based framework may generate a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications.
  • the various embodiments provide an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning.
  • the various embodiments provide a video deduplication system that incorporates temporal information by encoding motion. Herein, image or video deduplication is performed to reduce the burden on storage systems, by removing duplicate data (e.g., duplicate images and/or videos) that may be needlessly taking up space on a local mobile drive or on a network-based or server-based computing system.
  • Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, image duplication determination technology, video duplication determination technology, image deduplication technology, video deduplication technology, image search technology, video search technology, machine learning technology, deep learning technology, AI technology, and/or the like.
  • some embodiments can improve the functioning of user equipment or systems themselves (e.g., image duplication determination systems, video duplication determination systems, image deduplication systems, video deduplication systems, image search systems, video search systems, machine learning systems, deep learning systems, AI systems, etc.), for example, for training a deep metric learning framework(s) to perform duplicate image or video determination and for implementing a trained deep metric learning framework(s) to perform duplicate image or video determination, by using a computing system to perform feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; binarizing the first and second normalized feature vectors; and determining whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors, and/or the like.
  • These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, providing a lightweight deep metric learning based framework(s) that is computationally efficient and capable of running on mobile devices, that is capable of generating a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications, whose feature dimension and number of Gaussian components may be adjusted to fit the requirement of different platforms with different computational resources, that provides an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning, and/or that provides a video deduplication system that incorporates temporal information by encoding motion, and/or the like, at least some of which may be observed or measured by users, content developers, system administrators, and/or service providers.
  • Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, as referred to above.
  • the methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments.
  • the description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
  • Fig. 1 is a schematic diagram illustrating a system 100 for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • system 100 may comprise computing system 105, including, but not limited to, at least one of a feature extraction system 105a, a feature aggregation system 105b, a deep metric learning framework(s) 105c, and/or one or more loss function modules 105d, or the like.
  • the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be part of duplicate image or video determination system 110a, or may be separate, yet communicatively coupled with, the duplicate image or video determination system 110a.
  • the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be embodied as an integrated system.
  • the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be embodied as separate, yet communicatively coupled, systems.
  • computing system 105 may include, without limitation, at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, or a processor on the user device, and/or the like.
  • the deep metric learning framework(s) 105c may include at least a first neural network and a second neural network.
  • the first neural network and the second neural network may each include, without limitation, at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • System 100 may further comprise a network-based or server-based duplicate image or video determination system 110b (and corresponding database(s) 115), one or more content sources 120 (and corresponding database(s) 125), and a content distribution system 130 (and corresponding database(s) 135) that communicatively couple with at least one of the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, the one or more loss function modules 105d, and/or the duplicate image or video determination system 110a via network(s) 140.
  • Network-based or server-based duplicate image or video determination system 110b may comprise computing system 105 and/or at least one of feature extraction system 105a, feature aggregation system 105b, deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d, or the like, as described herein with respect to computing system 105 of duplicate image or video determination system 110a.
  • computing system 105 of network-based or server-based duplicate image or video determination system 110b may include, without limitation, at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • System 100 may further comprise one or more user devices 145a-145n (collectively, "user devices 145" or the like) that communicatively couple with at least one of computing system 105, feature extraction system 105a, feature aggregation system 105b, deep metric learning framework(s) 105c, one or more loss function modules 105d, and/or duplicate image or video determination system 110a, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig. 1), or indirectly via network(s) 140 and via wired (not shown) and/or wireless communications links (denoted by lightning bolt symbols in Fig. 1).
  • the user devices 145 may each include, but are not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app")-based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like.
  • At least one of computing system 105, duplicate image or video determination system 110a, and/or network-based or server-based duplicate image or video determination system 110b may be used to train the first and second neural networks to perform duplicate image or video determination (referred to herein as “training” or the like), in accordance with the various embodiments.
  • the computing system may then use the trained first and second neural networks to perform duplicate image or video determination (referred to herein as "inferencing” or the like), in accordance with the various embodiments.
  • the computing system is otherwise described as performing inferencing for determining whether the first and second images or videos are duplicates.
  • the computing system may perform feature discriminability enhancement, using a first deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like), by: transforming a first image-specific aggregated feature set corresponding to a first image (e.g., first image or video 150a, or the like, which may be received from one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like) into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function (including, but not limited to, a sigmoid function, or the like); transforming a second image-specific aggregated feature set corresponding to a second image (e.g., second image or video 150b, or the like, which may also be received from one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like) into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors.
  • the computing system may then determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
  • the duplicate image or video determination may be sent and/or presented to one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like.
  • the first loss function may include, but is not limited to, one of contrastive loss function or triplet loss function, and/or the like.
  • the computing system may train each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function.
  • the second loss function may include, but is not limited to, a cross-entropy loss function, or the like.
  • cross-entropy loss is not directly applied to the normalized feature vectors, but to a transformed normalized feature after one or more fully connected layers.
  • the computing system may perform image or video deduplication based on a determination that the second image is a duplicate of the first image.
  • the computing system may perform search-by-image functionality by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image.
  • one of the first image or second image may be a query image while the other of the first image or second image may be a stored image.
  • the computing system may perform image-specific feature extraction for the first image, by: performing local feature extraction for the first image to extract a plurality of first local features; and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image.
  • the computing system may perform image-specific feature extraction for the second image, by: performing local feature extraction for the second image to extract a plurality of second local features; and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image.
  • the computing system may perform image-specific feature extraction for the first image, by: splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells; performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features; aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image; and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set.
  • the first image-specific aggregated feature set may comprise the first combined aggregated feature set, or vice versa; or the first combined aggregated feature set, instead of the first image-specific aggregated feature set, may be input into the first deep metric learning framework for performing feature discriminability enhancement.
  • the computing system may perform image-specific feature extraction for the second image, by: splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number; performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features; aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image; and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set, wherein the second image-specific aggregated feature set comprises the second combined aggregated feature set.
  • the second image-specific aggregated feature set may comprise the second combined aggregated feature set, or vice versa; or the second combined aggregated feature set, instead of the second image-specific aggregated feature set, may be input into the first deep metric learning framework for performing feature discriminability enhancement.
  • the computing system may assign each first image-specific feature set corresponding to each of the at least one sub-image (or may assign the first or second image that is lacking features or feature keypoints) with three extra elements corresponding to average red/green/blue ("RGB") values, or the like.
  • the computing system may assign a negative value (e.g., -1, or the like) for said sub-image (or for said image), or the like.
  • the computing system may randomly assign any value for those sub-images (or images) without any keypoints.
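  • A small sketch of these fallback options for a grid cell (or image) with no detected keypoints is shown below; the feature length feat_dim, the -1 sentinel, and the normalization of the RGB means are assumptions made for illustration:

```python
import numpy as np

def cell_feature(aggregated, cell_rgb: np.ndarray, feat_dim: int, empty_mode: str = "negative") -> np.ndarray:
    """Per-cell feature: aggregated keypoint feature (feat_dim values) plus three average R/G/B elements.
    When the cell has no keypoints, the keypoint part is filled according to empty_mode."""
    rgb_mean = cell_rgb.reshape(-1, 3).mean(axis=0) / 255.0         # the three extra RGB elements
    if aggregated is None:                                          # no local features / keypoints found
        if empty_mode == "negative":
            aggregated = np.full(feat_dim, -1.0)                    # negative-value option
        else:
            aggregated = np.random.default_rng(0).random(feat_dim)  # random-value option
    return np.concatenate([aggregated, rgb_mean])
```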
  • the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets
  • the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets.
  • the computing system may aggregate extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets; may binarize the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector.
  • the computing system may aggregate extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets; may binarize the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector.
  • the computing system may determine whether the second video is a duplicate of the first video based on a Hamming distance between the first and second binarized key frame feature vectors.
  • the computing system may use a second deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like) to perform feature discriminability enhancement on each of the aggregated extracted features from the one or more first key frames and the aggregated extracted features from the one or more second key frames prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
  • the computing system may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames (the set of parameters including, but not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter, and/or the like); may aggregate the first 4D features for the one or more selected pairs of first frames; may concatenate the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features.
  • generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features.
  • the computing system may match feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; may obtain a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames; may aggregate the second 4D features for the one or more selected pairs of second frames; and may concatenate the aggregated second 4D features with the aggregated extracted features from one or more second key frames to generate second combined 4D and key features.
  • generating the second binarized key frame feature vector may comprise binarizing the generated second combined 4D and key features.
  • the computing system may use a third deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like) to perform feature discriminability enhancement on each of the generated first combined 4D and key features and the generated second combined 4D and key features prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
  • the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set.
  • the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set.
  • the computing system may concatenate the first feature vectors each corresponding to a respective differently dimensioned first image-specific aggregated feature set to generate a first concatenated feature vector; may transform the generated first concatenated feature vector, using a third neural network, prior to converting into the first normalized feature vector; may concatenate the second feature vectors each corresponding to a respective differently dimensioned second image-specific aggregated feature set to generate a second concatenated feature vector; and may transform the generated second concatenated feature vector, using a fourth neural network, prior to converting into the second normalized feature vector.
  • the third and fourth neural networks may each include, without limitation, at least one of a deep learning model-based neural network, a deep metric learning- based neural network, a CNN, or a FCN, and/or the like.
  • the various embodiments provide a deep metric learning based framework(s) for performing duplicate image or video determination and/or image or video deduplication (or image-based searching).
  • the deep metric learning based framework(s) is lightweight, and does not require a very deep neural network but only a shallow neural network with a few fully connected ("FC") layers. This makes the deep metric learning based framework solution computationally efficient.
  • the deep metric learning based framework(s) may also be capable of running on mobile devices. Additionally, the deep metric learning based framework may generate a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications.
  • the various embodiments provide an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning.
  • the various embodiments provide a video deduplication system that incorporates temporal information by encoding motion.
  • image or video deduplication is performed to reduce the burden on storage systems, by removing duplicate data (e.g., duplicate images and/or videos) that may be needlessly taking up space on a local mobile drive or on a network-based or server-based computing system.
  • FIGs. 2A-2C are schematic block flow diagrams illustrating various non-limiting examples 200, 200', and 200" of duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • discriminative binary features may be generated for each of the first and second images (e.g., first and second images or videos 150a and 150b, or the like) and duplicate image or video determination may be performed by computing Hamming distance on the binary features.
  • image or video deduplication - in some cases, deleting of the duplicate first or second image or video - may be performed.
  • search-by-image functionality may be performed by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image.
  • the overall feature extraction system may include two steps as depicted in Fig. 2A: image-specific feature extraction (at blocks 210a and 210b) and deep-metric-learning-based feature discriminability enhancement (at blocks 225a and 225b). These two steps need not be performed together or sequentially. Rather, image-specific feature extraction may be performed beforehand, and at some later time feature discriminability enhancement may be performed (such as shown, e.g., in Fig. 2B, or the like).
  • Turning back to Fig. 2A, a first image or video (e.g., first image or video 150a, or the like) may be input into a first pipeline 205a, which may perform image-specific feature extraction (at block 210a), including, but not limited to, local feature extraction (at block 215a) to extract a plurality of first local features from the first image or video, and feature aggregation (at block 220a) to aggregate the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image.
  • the first pipeline 205a may also perform feature discriminability enhancement (at block 225a), in some cases, using deep metric learning (at block 230a; e.g., using deep metric learning framework(s) 105c of Fig. 1, or the like).
  • the activation function may include, but is not limited to, a sigmoid function, or the like.
  • second image or video (e.g., second image or video 150b, or the like) may be input into a second pipeline 205b, which may perform image-specific feature extraction (at block 210b), including, but not limited to, local feature extraction (at block 215b) to extract a plurality of second local features from the second image or video, and feature aggregation (at block 220b) to aggregate the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image.
  • the second pipeline 205b may also perform feature discriminability enhancement (at block 225b), in some cases, using deep metric learning (at block 230b; e.g., using deep metric learning framework(s) 105c of Fig. 1, or the like).
  • image-specific features may be generated. Each feature is designed to capture important visual information for each image.
  • one may first adopt scale- and rotation-invariant keypoint detectors and/or descriptors in order to extract local features.
  • keypoint detectors and/or descriptors may include, but are not limited to, at least one of scale-invariant feature transform (“SIFT”), speeded-up robust features (“SURF”), oriented FAST and rotated BRIEF (“ORB”), KAZE features, accelerated KAZE (“AKAZE”) features, and/or binary robust invariant scalable keypoints (“BRISK”), and/or the like.
  • in some embodiments, principal components analysis ("PCA") may be applied to reduce the dimension of the extracted local features prior to aggregation.
  • the generated features may be aggregated to form an image-specific feature (e.g., aggregated features for Image 1 255a or for Image 2 255b, as shown in Fig. 2B, which may be examples of the output of image-specific feature extraction (at blocks 210a and 210b of Fig. 2A), or the like).
  • one option may be Fisher vector aggregation, which encodes the deviation from the Gaussian Mixture Model ("GMM").
  • if the PCA output is K_d dimensional and there are n_c Gaussian components in the GMM, the result is a 2 * K_d * n_c dimensional Fisher vector encoding.
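  • That dimensionality can be made concrete with a simplified Fisher vector sketch (first- and second-order deviations only, with standard power/L2 normalization); fitting the PCA and GMM on the same descriptors is a toy shortcut, and K_d = 32, n_c = 16 (giving a 2 * 32 * 16 = 1024-D encoding) are assumed values for the example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors: np.ndarray, k_d: int = 32, n_c: int = 16) -> np.ndarray:
    """Encode local descriptors as a 2 * k_d * n_c dimensional Fisher vector."""
    x = PCA(n_components=k_d).fit_transform(descriptors.astype(np.float64))       # PCA to K_d dims
    gmm = GaussianMixture(n_components=n_c, covariance_type="diag", random_state=0).fit(x)
    gamma = gmm.predict_proba(x)                                                   # (N, n_c) posteriors
    n = x.shape[0]
    diff = x[:, None, :] - gmm.means_[None, :, :]                                  # (N, n_c, k_d)
    sigma = np.sqrt(gmm.covariances_)                                              # (n_c, k_d)
    g_mu = (gamma[:, :, None] * diff / sigma).sum(0) / (n * np.sqrt(gmm.weights_)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 / gmm.covariances_ - 1.0)).sum(0) / (
        n * np.sqrt(2.0 * gmm.weights_)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])                             # 2 * k_d * n_c values
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                                         # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                                       # L2 normalization
```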
  • Other possible aggregation methods may include, but are not limited to, vector of locally aggregated descriptors ("VLAD”) and bag of words (“BoW”).
  • the aggregated image-specific feature may then be binarized to generate binary encoding and Hamming distance may be computed to perform image or video deduplication or image searching, or the like.
  • the dimension of PCA output and the number of Gaussian components may be adjusted to fit some use cases or to fit the requirements of the running platform (e.g., in the cloud or on mobile phones, etc.). These hyperparameter choices can affect the computational cost and the accuracy.
  • Fig. 3B shows deep metric learning framework with contrastive loss for different dimensions of features obtained by adjusting the dimension of PCA output and the number of Gaussian components.
  • the generated aggregated feature can then be concatenated according to the grid location to form the combined feature.
  • contrastive loss may be applied to the first normalized feature vector (i.e., output from the first pipeline 205a after deep metric learning (at block 230a)) and the second normalized feature vector (i.e., output from the second pipeline 205b after deep metric learning (at block 230b)) to decrease Euclidean distance between positive pairs (i.e., in the case that the first and second images are determined to be duplicates) and to increase Euclidean distance between negative pairs (i.e., in the case that the first and second images are determined to not be duplicates) among the first and second sets of values.
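  • One standard form of contrastive loss is sketched below in PyTorch; label = 1 marks a positive (duplicate) pair and 0 a negative pair, and the margin value is an assumption for the example:

```python
import torch

def contrastive_loss(s1: torch.Tensor, s2: torch.Tensor, label: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Pull positive pairs together and push negative pairs at least `margin` apart."""
    dist = torch.norm(s1 - s2, p=2, dim=1)                                   # Euclidean distance per pair
    positive_term = label * dist.pow(2)                                      # shrink distance for duplicates
    negative_term = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)   # enforce margin for non-duplicates
    return 0.5 * (positive_term + negative_term).mean()
```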
  • the application of contrastive loss is shown in Fig. 2A (at optional block 235).
  • each of the first and second neural networks of the deep metric learning framework may be trained based at least in part on applying a cross-entropy loss function to at least one of the first and second normalized feature vectors or the results of the first loss function (at optional block 240).
  • the process as shown in Fig. 2 is otherwise described as performing inferencing for determining whether the first and second images are duplicates, in which case the loss functions (at optional blocks 235 and 240) are not applied.
  • the first pipeline 205a may further perform binarization (at block 235a) to binarize the first normalized feature vector.
  • the second pipeline 205b may further perform binarization (at block 235b) to binarize the second normalized feature vector.
  • Duplicate determination (at block 245) - which may include determining a Hamming distance between the first and second binarized feature vectors (not shown in Fig. 2) - may then be performed to output whether the second image is a duplicate of the first image based on the Hamming distance between the first and second binarized feature vectors (at block 250). Computing Hamming distance on the binary features generated from the first step alone may not be capable of correctly identifying certain positive and negative pairs.
  • the deep metric learning-based framework (at block 230a, and as described above) has been added after step 1 to re-adjust the distance between negative and positive pairs with the help of a shallow neural network.
  • the first pipeline 205a may receive aggregated features for a first image I_1 255a that may be obtained beforehand, rather than immediately following image-specific feature extraction (as described above with respect to block 210a in Fig. 2A, or the like).
  • the second pipeline 205b may receive aggregated features for a second image I_2 255b that may be obtained beforehand, rather than immediately following image-specific feature extraction (as described above with respect to block 210b in Fig. 2A, or the like).
  • the process for feature discriminability enhancement (at blocks 225a and 225b), contrastive loss (at optional block 235), cross-entropy loss (at optional block 240), binarization (at blocks 245a and 245b), and duplication determination (at block 250) may subsequently be performed as described above with respect to Fig. 2A.
  • the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets
  • the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets.
  • image-specific features I_1 - I_M 265a-265m may each correspond to output of image-specific feature extraction 210a (e.g., as shown and described above with respect to Fig. 2A, or the like).
  • frame-wise feature aggregation (at block 270a) may correspond to aggregation of these image-specific features I_1 - I_M 265a-265m for all key or selected frames of the first video.
  • the first pipeline 205a may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video (e.g., first video 150a, or the like); and may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames (e.g., 4D features F_1 - F_P 275a-275p, or the like).
  • the set of parameters may include, but is not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter, and/or the like.
  • the first pipeline 205a may aggregate the first 4D features for the one or more selected pairs of first frames (at block 280a); and may concatenate (at block 285a) the aggregated first 4D features (from block 280a) with the aggregated extracted (frame-wise) features (from block 270a) from one or more first key frames to generate first combined 4D and key features.
  • Deep metric learning (at block 230a) and binarization (at block 245a) may be similar, if not identical, to the corresponding processes in Figs. 2A and 2B.
  • generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features.
  • the second pipeline 205b may perform similar tasks for the second video.
  • contrastive loss (at optional block 235), cross-entropy loss (at optional block 240), and duplication determination (at block 250) may also be similar, if not identical, to the corresponding processes in Figs. 2A and 2B.
  • FIGs. 3A and 3B are schematic block flow diagrams illustrating non-limiting examples 300 and 300' of deep metric learning frameworks with contrastive loss for use during duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • In FIG. 3A, contrastive loss is used as an example to illustrate the deep metric learning framework.
  • the overall strategy is depicted in Fig. 3A.
  • a (Siamese) network 205 (including pipelines or network parts 205a and 205b) may take pairs of samples as input.
  • the inputs may be the aggregated features before deep metric learning (e.g., aggregated features for image 1 (I_1) 255a and for image 2 (I_2) 255b, or the like).
  • the parameters may be shared for different images between the network parts 205a and 205b.
  • a few fully connected (“FC") layers 260a and 260b may transform each feature 255a or 255b into a feature vector (which may be output from each FC3 layer as shown in Fig. 3A).
  • the feature dimension can be adjusted according to different data and the requirement of different platforms (e.g., mobile phone versus cloud).
  • the feature vector dimension may be set to be 256 in the example 300 of Fig. 3A.
  • a Sigmoid function (at blocks 305a and 305b) may be applied to the feature vector output from FC3 to transform the data to [0,1], the sigmoid function output being denoted as "S" in Fig. 3A.
  • contrastive loss at optional block 235 may be computed to make positive pairs close to each other and negative pairs far from each other.
  • cross-entropy loss (at block 240a) may be added as illustrated in Fig. 3A.
  • images that are considered as "duplicates" of each other may be considered to be the same class.
  • the size of FC4 in Fig. 3A may be equal to the number of classes, that is, the number of "unique" images.
  • the sigmoid output features may be expected to have the positive pairs close to each other and negative pairs away from each other.
  • the sigmoid output features S may then be binarized (at blocks 245a and 245b) to generate the binary encoding for each image.
  • Duplication determination (at block 250) may then be performed in a similar manner as described above with respect to Figs. 2A and 2B.
  • in Fig. 3B, multiple aggregated features of different dimensions may be obtained from step 1 (e.g., feature(s) I_1^(1), I_2^(1), through I_N^(1) 315a-315n and feature(s) I_1^(2), I_2^(2), through I_N^(2) 320a-320n, or the like), for example, by adjusting the PCA output dimension and the number of Gaussian components in the case of Fisher vector aggregation.
  • An input branch may be created for each aggregated feature of different input dimension.
  • Each branch (blocks 260a' and 260b') may generate, for example, a 256-D feature, and then the features may be concatenated (at blocks 320a and 320b, denoted "C" in Fig. 3B).
  • the concatenated feature C may be further transformed with 3 more FC layers (i.e., FC5, FC6, and FC7 at blocks 325a and 325b, or the like) before the Sigmoid function is applied (at blocks 305a and 305b).
  • Parameters may be shared for different images but not for different dimensions.
  • the subsequent processes may be similar, if not identical, to those as shown and described above with respect to Fig. 3A.
  • Figs. 4A-4J are flow diagrams illustrating a method 400 for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
  • Method 400 of Fig. 4D continues onto Fig. 4E following the circular marker denoted, "A,” and returns to Fig. 4D following the circular marker denoted, "B,” or continues onto Fig. 4G following the circular marker denoted, "E,” and returns to Fig. 4D following the circular marker denoted, "F.”
  • Method 400 of Fig. 4D continues onto Fig. 4F following the circular marker denoted, "C," and returns to Fig. 4D following the circular marker denoted, "D," or continues onto Fig. 4H following the circular marker denoted, "G," and returns to Fig. 4D following the circular marker denoted, "H."
  • While the method 400 illustrated by Figs. 4A-4J can be implemented by or with (and, in some cases, is described below with respect to) the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation.
  • each of the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B, respectively (or components thereof), can operate according to the method 400 illustrated by Figs. 4A-4J.
  • the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B can each also operate according to other modes of operation and/or perform other suitable procedures.
  • method 400 at block 402a, may comprise a computing system performing image-specific feature extraction for a first image.
  • method 400 may comprise the computing system performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to the first image into a first feature vector, using a first neural network based on a set of parameters (block 406a); and converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function (block 408a).
  • method 400 may comprise performing image-specific feature extraction for a second image.
  • method 400 may comprise the computing system performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a second image-specific aggregated feature set corresponding to the second image into a second feature vector, using a second neural network based on the set of parameters (block 406b); and converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function (block 408b).
  • the computing system may include, without limitation, at least one of a machine learning system, an artificial intelligence (“AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the first neural network and the second neural network may each include, but is not limited to, at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • method 400 may comprise applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs (i.e., in the case that the first and second images are determined to be duplicates) and to increase Euclidean distance between negative pairs (i.e., in the case that the first and second images are determined to not be duplicates) among the first and second sets of values.
  • the first loss function may include, without limitation, a contrastive loss function, and/or the like.
  • method 400 may further comprise training each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function (optional block 412).
  • the second loss function may include, but is not limited to, a cross-entropy loss function, or the like. Unless indicated as training the first and second neural networks, method 400 is otherwise described as performing inferencing for determining whether the first and second images are duplicates, in which case the loss functions (at optional blocks 410 and 412) are not applied.
  • Method 400 may comprise binarizing the first and second normalized feature vectors (block 414); and determining a Hamming distance between the first and second binarized normalized feature vectors (block 416). Method 400 may further comprise, at block 418, determining whether the second image is a duplicate of the first image based on the Hamming distance between the first and second binarized normalized feature vectors.
  • method 400 may further comprise one of: performing image or video deduplication based on a determination that the second image is a duplicate of the first image (block 420a); or performing search-by-image functionality (block 420b).
  • performing search-by-image functionality may comprise matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image.
  • one of the first image or second image may be a query image and the other of the first image or second image may be a stored image.
  • performing image-specific feature extraction for the first image may comprise performing local feature extraction for the first image to extract a plurality of first local features (block 422a); and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image (block 424a).
  • performing image-specific feature extraction for the first image may comprise splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells (block 426a); performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features (block 428a); aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image (block 430a); and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set (block 432a).
  • the first image-specific aggregated feature set may comprise the first combined aggregated feature set, or vice versa; or the first combined aggregated feature set may be input into the processes at block 404-408, instead of the first image-specific aggregated feature set (at block 402a).
  • performing image-specific feature extraction for the second image may comprise performing local feature extraction for the second image to extract a plurality of second local features (block 422b); and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image (block 424b).
  • performing image-specific feature extraction for the second image may comprise splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number (block 426b); performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features (block 428b); aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image (block 430b); and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set (block 432b).
  • the second image-specific aggregated feature set comprises the second combined aggregated feature set, or vice versa; or the second combined aggregated feature set may be input into the processes at block 404-408, instead of the second image-specific aggregated feature set (at block 402b).
  • method 400 may further comprise at least one of: based on a determination that at least one third sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints (or that the first or second image lacks any local features or any feature keypoints), assigning each first image-specific feature set corresponding to each of the at least one third sub-image (or assigning the first or second image lacking features or feature keypoints) with three extra elements corresponding to average red/green/blue ("RGB") values, or the like; based on a determination that at least one fourth sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, assigning each first image-specific feature set corresponding to each of the at least one fourth sub-image with a negative value, or the like; or assigning a random value to sub-images without any keypoints for other elements that correspond to aggregated local features; or the like.
  • the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets
  • the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets.
  • method 400 may further comprise: aggregating extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets (block 434a).
  • Method 400 may continue onto the process at block 436a, the process at block 442a in Fig. 4E following the circular marker denoted, "A," and/or the process at block 444a in Fig. 4G following the circular marker denoted, "E."
  • method 400 may further comprise: aggregating extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets (block 434b).
  • Method 400 may proceed to the process at block 436b, the process at block 442b in Fig. 4F following the circular marker denoted, "C," or the process at block 444b in Fig. 4H following the circular marker denoted, "G."
  • method 400 may comprise binarizing the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector.
  • method 400 may comprise binarizing the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector.
  • Method 400 may further comprise determining a Hamming distance between the first and second binarized key frame feature vectors (block 438); and determining whether the second video is a duplicate of the first video based on the Hamming distance between the first and second binarized key frame feature vectors (block 440).
  • method 400 may comprise using a second deep metric learning framework to perform feature discriminability enhancement on the aggregated extracted features from the one or more first key frames. Method 400 may then return to the process at 436a in Fig. 4D, following the circular marker denoted, "B.”
  • method 400 may comprise using the second deep metric learning framework to perform feature discriminability enhancement on the aggregated extracted features from the one or more second key frames. Method 400 may then return to the process at 436b in Fig. 4D, following the circular marker denoted, "D.”
  • method 400 may comprise matching feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; obtaining a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames, the set of parameters including, but not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter (block 446a); aggregating the first 4D features for the one or more selected pairs of first frames (block 448a); concatenating the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features (block 450a); and using a third deep metric learning framework to perform feature discriminability enhancement on the generated first combined 4D and key features (block 452a).
  • Method 400 may then return to the process at 436a in Fig. 4D, following the circular marker denoted, "F.”
  • generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features.
  • method 400 may comprise matching feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; obtaining a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames (block 446b); aggregating the second 4D features for the one or more selected pairs of second frames (block 448b); concatenating the aggregated second 4D features with the aggregated extracted features from one or more second key frames to generate second combined 4D and key features (block 450b); and using the third deep metric learning framework to perform feature discriminability enhancement on the generated second combined 4D and key features (block 452b).
  • Method 400 may then return to the process at 436b in Fig. 4D, following the circular marker denoted, "H.”
  • generating the second binarized key frame feature vector may comprise binarizing the generated second combined 4D and key features.
  • the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set.
  • the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set.
  • method 400 may further comprise: concatenating the first feature vectors, each corresponding to a respective differently dimensioned first image-specific aggregated feature set, to generate a first concatenated feature vector (block 454a); and transforming the generated first concatenated feature vector, using a third neural network (block 456a), prior to converting into the first normalized feature vector (at block 408a). Method 400 may then return to the process at 408a in Fig. 4A, following the circular marker denoted, "I."
  • method 400 may further comprise: concatenating the second feature vectors, each corresponding to a respective differently dimensioned second image-specific aggregated feature set, to generate a second concatenated feature vector (block 454b); and transforming the generated second concatenated feature vector, using a fourth neural network (block 456b), prior to converting into the second normalized feature vector (at block 408b). Method 400 may then return to the process at 408b in Fig. 4A, following the circular marker denoted, "J."
  • FIG. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, duplicate image or video determination system 110a or 110b, content source(s) 120, content distribution system 130, and user devices 145a-145n, etc.), as described above.
  • Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate.
  • Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing system 105, duplicate image or video determination system 110a or 110b, content source(s) 120, content distribution system 130, and user devices 145a-145n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
  • the computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein.
  • the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
  • the computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above.
  • the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention.
  • some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535.
  • Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525.
  • execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • the terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
  • the instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.

Abstract

Novel tools and techniques are provided for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features. In various embodiments, a computing system may perform feature discriminability enhancement, using a deep metric learning framework, by: transforming first and second image-specific aggregated feature sets into first and second feature vectors using corresponding neural networks; converting each of the first and second feature vectors into first and second normalized feature vectors having values between 0 and 1, using an activation function; applying a loss function to the first and second normalized feature vectors to decrease Euclidean distance between positive pairs and increase Euclidean distance between negative pairs; and binarizing the first and second normalized feature vectors. The computing system may then determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.

Description

DUPLICATE IMAGE OR VIDEO DETERMINATION AND/OR IMAGE OR VIDEO DEDUPLICATION BASED ON DEEP METRIC LEARNING WITH
KEYPOINT FEATURES
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Patent Application Ser. No. 63/285,522 (the " '522 Application"), filed December 3, 2021, by Ruiyuan Lin et al. (attorney docket no. INNOPEAK-1121-178-P), entitled, "An Efficient Method and System of Image and Video Deduplication Based on Deep Metric Learning with Key Points Features," the disclosure of which is incorporated herein by reference in its entirety for all purposes.
COPYRIGHT STATEMENT
[0002] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD
[0003] The present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
BACKGROUND
[0004] Existing video deduplication methods include frame- and video- level methods. Frame-level methods generate features for each frame. For example, frame descriptors may be used to represent a video. Video-level methods generate a global representation for each video. As an example, global video features may be generated with the help of deep metric learning. However, for video deduplication, some frame-wise feature based conventional systems may fail to take the temporal information into consideration. This can limit the performance of the solutions. Further, speed is a limitation in some conventional systems, with such systems taking a very long time to process when there are huge amounts of video data (e.g., in the cloud). Moreover, computing Hamming distance on the binary features generated from the feature extraction and aggregation alone may not be capable of correctly identifying certain positive and negative pairs. Some negative pairs are close to each other in terms of Hamming distance while some positive pairs are too far away from each other. [0005] Hence, there is a need for more robust and scalable solutions for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications.
SUMMARY
[0006] The techniques of this disclosure generally relate to tools and techniques for implementing neural network, AI, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
[0007] In an aspect, a method may be provided for performing duplicate image or video determination. The method may be implemented by a computing system and may comprise performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors. The method may further comprise determining whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors. [0008] In another aspect, a system may be provided for performing duplicate image or video determination. The system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. The first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: perform feature discriminability enhancement, using a deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a contrastive loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors; and determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors. [0009] Various modifications and additions can be made to the embodiments discussed without departing from the scope of the invention.
For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the above-described features.
[0010] The details of one or more aspects of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0012] Fig. 1 is a schematic diagram illustrating a system for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0013] Figs. 2A-2C are schematic block flow diagrams illustrating various non-limiting examples of duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0014] Figs. 3A and 3B are schematic block flow diagrams illustrating non-limiting examples of deep metric learning frameworks with contrastive loss for use during duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0015] Figs. 4A-4J are flow diagrams illustrating a method for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0016] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
[0017] Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
DETAILED DESCRIPTION
[0018] Overview
[0019] Various embodiments provide tools and techniques for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features.
[0020] In various embodiments, a computing system may perform feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors. The computing system may determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
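By way of non-limiting illustration only, the following Python sketch shows one possible realization of the feature discriminability enhancement and duplicate determination steps described in the preceding paragraph, using PyTorch. The layer sizes, the 0.5 binarization threshold, the Hamming-distance threshold, and the names used (e.g., EnhancementNet) are illustrative assumptions rather than details taken from the disclosure.
```python
import torch
import torch.nn as nn

class EnhancementNet(nn.Module):
    """Shallow, shared-weight FC stack (FC1-FC3 style) followed by a sigmoid."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, aggregated_feature: torch.Tensor) -> torch.Tensor:
        # Sigmoid maps the feature vector into [0, 1] (the "S" output).
        return torch.sigmoid(self.fc(aggregated_feature))

def binarize(s: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    return (s > threshold).to(torch.uint8)

def hamming_distance(a: torch.Tensor, b: torch.Tensor) -> int:
    return int((a != b).sum().item())

# Usage: the same network (shared parameters) processes both images.
net = EnhancementNet(in_dim=4096)
f1 = torch.randn(1, 4096)   # first image-specific aggregated feature set
f2 = torch.randn(1, 4096)   # second image-specific aggregated feature set
b1, b2 = binarize(net(f1)), binarize(net(f2))
is_duplicate = hamming_distance(b1, b2) <= 10   # distance threshold is an assumption
```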
[0021] In some embodiments, the computing system may comprise at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the first neural network and the second neural network may each comprise at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
[0022] According to some embodiments, the computing system may perform image-specific feature extraction for the first image, by: performing local feature extraction for the first image to extract a plurality of first local features; and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image. The computing system may perform image-specific feature extraction for the second image, by: performing local feature extraction for the second image to extract a plurality of second local features; and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image.
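As a hedged, non-limiting sketch of the local feature extraction and aggregation just described: ORB keypoint descriptors, PCA dimensionality reduction, and a GMM soft-assignment (Fisher-vector-style) pooling are used here as illustrative stand-ins; the disclosure does not mandate these particular detectors or aggregators, and the pca and gmm objects are assumed to have been fit offline on a corpus of descriptors.
```python
import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def extract_local_features(image_bgr, n_keypoints=500):
    """Detect keypoints and return their descriptors (None if no keypoints found)."""
    orb = cv2.ORB_create(nfeatures=n_keypoints)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, descriptors = orb.detectAndCompute(gray, None)
    return None if descriptors is None else descriptors.astype(np.float32)

def aggregate(descriptors, pca: PCA, gmm: GaussianMixture):
    """Pool per-image local descriptors into one fixed-length aggregated feature."""
    reduced = pca.transform(descriptors)           # (N, d) reduced descriptors
    resp = gmm.predict_proba(reduced)              # (N, K) soft assignments
    pooled = []
    for k in range(gmm.n_components):
        w = resp[:, k:k + 1]
        residual = (w * (reduced - gmm.means_[k])).sum(axis=0) / (w.sum() + 1e-8)
        pooled.append(residual)
    v = np.concatenate(pooled)                     # (K * d,) aggregated feature
    return v / (np.linalg.norm(v) + 1e-8)          # L2 normalization
```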
[0023] Alternatively, the computing system may perform image-specific feature extraction for the first image, by: splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells; performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features; aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image; and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set, wherein the first image-specific aggregated feature set may comprise the first combined aggregated feature set. The computing system may perform image-specific feature extraction for the second image, by: splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number; performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features; aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image; and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set, wherein the second image-specific aggregated feature set may comprise the second combined aggregated feature set.
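The grid-based variant in the preceding paragraph may be sketched as follows; the 3x3 grid size, the zero-vector placeholder for empty cells, and the reuse of the extract_local_features()/aggregate() helpers from the earlier sketch are assumptions (the disclosure's own fallbacks for cells without keypoints are sketched after the next paragraph).
```python
import numpy as np

def grid_aggregated_feature(image_bgr, pca, gmm, rows=3, cols=3):
    """Split an image into rows x cols cells and concatenate per-cell aggregated features."""
    h, w = image_bgr.shape[:2]
    cell_vectors = []
    for r in range(rows):
        for c in range(cols):
            cell = image_bgr[r * h // rows:(r + 1) * h // rows,
                             c * w // cols:(c + 1) * w // cols]
            desc = extract_local_features(cell)
            if desc is None:
                # Simple placeholder for cells with no keypoints.
                cell_vectors.append(np.zeros(pca.n_components_ * gmm.n_components))
            else:
                cell_vectors.append(aggregate(desc, pca, gmm))
    return np.concatenate(cell_vectors)   # the "combined aggregated feature set"
```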
[0024] In some instances, the computing system: based on a determination that at least one third sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, may assign each first image-specific feature set corresponding to each of the at least one third sub-image with three extra elements corresponding to average red/green/blue ("RGB") values, or the like; based on a determination that at least one fourth sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, may assign each first image-specific feature set corresponding to each of the at least one fourth sub-image with a negative value, or the like; or may assign a random value to sub-images without any keypoints for other elements that correspond to aggregated local features; or the like.
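A minimal sketch of the fallback options just described, under the assumption that every cell vector is given three extra slots so that dimensions stay consistent across cells; the fill values and the choice between a negative constant and a random filler are illustrative only.
```python
import numpy as np

def cell_feature_with_fallback(cell_bgr, desc, pca, gmm, fill_value=-1.0, rng=None):
    """Return a per-cell feature; cells without keypoints get a negative or random filler."""
    dim = pca.n_components_ * gmm.n_components
    if desc is None:
        agg = (np.full(dim, fill_value) if rng is None
               else rng.uniform(-1.0, 1.0, size=dim))     # negative or random filler
    else:
        agg = aggregate(desc, pca, gmm)
    avg_bgr = cell_bgr.reshape(-1, 3).mean(axis=0)
    return np.concatenate([agg, avg_bgr[::-1]])           # three extra average R/G/B elements
```
Appending the three average-color elements to every cell (not only the empty ones) is a design choice made here so that all cell vectors share one length; the disclosure itself only requires them for sub-images lacking keypoints.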
[0025] In some embodiments, the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets, while the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets. In such cases, the computing system may aggregate extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets; may binarize the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector; may aggregate extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets; may binarize the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector; and may determine whether the second video is a duplicate of the first video based on a Hamming distance between the first and second binarized key frame feature vectors. In some cases, the computing system may use a second deep metric learning framework to perform feature discriminability enhancement on each of the aggregated extracted features from the one or more first key frames and the aggregated extracted features from the one or more second key frames prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
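A hedged sketch of the video-level path described above: per-key-frame combined aggregated feature sets are pooled into a single vector, mapped into [0, 1], binarized, and compared by Hamming distance. Mean pooling, the sigmoid stand-in for the deep metric learning step, and both thresholds are assumptions.
```python
import numpy as np

def video_signature(key_frame_features, threshold=0.5):
    """key_frame_features: list of per-key-frame combined aggregated feature vectors."""
    pooled = np.mean(np.stack(key_frame_features, axis=0), axis=0)   # frame-wise aggregation
    # In the full pipeline a deep-metric-learning step would map `pooled` into [0, 1]
    # before binarization; a plain sigmoid stands in for that step here.
    s = 1.0 / (1.0 + np.exp(-pooled))
    return (s > threshold).astype(np.uint8)

def videos_are_duplicates(frames_a, frames_b, max_hamming=10):
    sig_a, sig_b = video_signature(frames_a), video_signature(frames_b)
    return int((sig_a != sig_b).sum()) <= max_hamming
```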
[0026] Alternatively, or additionally, the computing system may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames, the set of parameters comprising a rotation angle parameter, two translation parameters, and a zoom factor parameter; may aggregate the first 4D features for the one or more selected pairs of first frames; may concatenate the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features, wherein generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features; may match feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; may obtain a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames; may aggregate the second 4D features for the one or more selected pairs of second frames; and may concatenate the aggregated second 4D features with the aggregated extracted features from one or more second key frames to generate second combined 4D and key features, wherein generating the second binarized key frame feature vector may comprise binarizing the generated second combined 4D and key features. In some cases, the computing system may use a third deep metric learning framework to perform feature discriminability enhancement on each of the generated first combined 4D and key features and the generated second combined 4D and key features prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
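One possible way to obtain the four inter-frame parameters (rotation angle, two translations, zoom factor) described above is to fit a similarity transform to matched keypoints; the sketch below uses OpenCV's ORB matcher and estimateAffinePartial2D as illustrative choices that the disclosure does not specify.
```python
import cv2
import numpy as np

def four_d_feature(frame_a_gray, frame_b_gray):
    """Estimate [rotation angle, tx, ty, zoom] between a selected pair of frames."""
    orb = cv2.ORB_create(nfeatures=500)
    kp_a, des_a = orb.detectAndCompute(frame_a_gray, None)
    kp_b, des_b = orb.detectAndCompute(frame_b_gray, None)
    if des_a is None or des_b is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 3:
        return None
    src = np.float32([kp_a[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    if M is None:
        return None
    angle = np.arctan2(M[1, 0], M[0, 0])     # rotation angle parameter
    zoom = np.hypot(M[0, 0], M[1, 0])        # zoom factor parameter
    tx, ty = M[0, 2], M[1, 2]                # two translation parameters
    return np.array([angle, tx, ty, zoom])   # the 4D feature for this frame pair
```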
[0027] Merely by way of example, in some cases, the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set, wherein the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set. In such cases, the computing system may concatenate the first feature vectors, each corresponding to a respective differently dimensioned first image-specific aggregated feature set, to generate a first concatenated feature vector; may transform the generated first concatenated feature vector, using a third neural network, prior to converting into the first normalized feature vector; may concatenate the second feature vectors, each corresponding to a respective differently dimensioned second image-specific aggregated feature set, to generate a second concatenated feature vector; and may transform the generated second concatenated feature vector, using a fourth neural network, prior to converting into the second normalized feature vector.
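A non-limiting PyTorch sketch of the multi-branch arrangement in the preceding paragraph: one branch per differently dimensioned aggregated feature, each mapped to a 256-D vector, concatenated, and passed through a few more fully connected layers before the sigmoid. All layer widths and names are assumptions.
```python
import torch
import torch.nn as nn

class MultiBranchEnhancementNet(nn.Module):
    def __init__(self, input_dims, branch_dim=256, out_dim=256):
        super().__init__()
        # One input branch per aggregated feature of a different input dimension.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, branch_dim))
            for d in input_dims
        ])
        fused = branch_dim * len(input_dims)
        # Further FC layers (FC5-FC7 style) operating on the concatenated feature "C".
        self.head = nn.Sequential(
            nn.Linear(fused, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, features):
        # `features` is a list of tensors, one per differently dimensioned input.
        c = torch.cat([b(f) for b, f in zip(self.branches, features)], dim=1)
        return torch.sigmoid(self.head(c))
```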
[0028] According to some embodiments, the first loss function may comprise one of contrastive loss function or triplet loss function, and/or the like. In some embodiments, the computing system may train each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function. In some cases, the second loss function may comprise a cross-entropy loss function, or the like.
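A hedged training sketch combining the two losses mentioned above: a margin-based contrastive loss on the paired sigmoid outputs and an auxiliary cross-entropy term over a classification head whose output size equals the number of "unique" images. The margin, loss weighting, and helper names are assumptions.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(s1, s2, is_duplicate, margin=1.0):
    # is_duplicate: 1 for positive (duplicate) pairs, 0 for negative pairs.
    d = F.pairwise_distance(s1, s2)
    pos = is_duplicate * d.pow(2)
    neg = (1 - is_duplicate) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (pos + neg).mean()

def training_step(net, classifier, optimizer, f1, f2, is_duplicate, class_ids, ce_weight=0.1):
    """net: shared-weight enhancement network; classifier: nn.Linear(feat_dim, num_classes)."""
    s1, s2 = net(f1), net(f2)
    loss = contrastive_loss(s1, s2, is_duplicate.float())
    # Auxiliary cross-entropy head (FC4 style), used only during training.
    loss = loss + ce_weight * F.cross_entropy(classifier(s1), class_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```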
[0029] In some embodiments, the computing system may perform image or video deduplication based on a determination that the second image is a duplicate of the first image. Alternatively, the computing system may perform search-by-image functionality by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image, wherein one of the first image or second image may be a query image and the other of the first image or second image may be a stored image.
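For the deduplication and search-by-image uses just described, the binary codes may be compared directly; the sketch below packs the bits into bytes and computes Hamming distances with an XOR and a popcount table. The distance threshold is an assumption.
```python
import numpy as np

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack(code_bits):
    """code_bits: 1-D array of 0/1 values for one image or video signature."""
    return np.packbits(code_bits.astype(np.uint8))

def hamming(packed_a, packed_b):
    return int(_POPCOUNT[np.bitwise_xor(packed_a, packed_b)].sum())

def find_duplicates(query_code, stored_codes, max_distance=10):
    """Return indices of stored images considered duplicates of the query image."""
    q = pack(query_code)
    return [i for i, c in enumerate(stored_codes) if hamming(q, pack(c)) <= max_distance]
```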
[0030] In the various aspects described herein, a deep metric learning based framework(s) is provided for performing duplicate image or video determination and/or image or video deduplication (or image-based searching). The deep metric learning based framework(s) is lightweight, and does not require a very deep neural network but only a shallow neural network with a few fully connected ("FC") layers. This makes the deep metric learning based framework solution computationally efficient. The deep metric learning based framework(s) may also be capable of running on mobile devices. Additionally, the deep metric learning based framework may generate a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications. Further, the feature dimension and the number of Gaussian components (in the case of Fisher vector aggregation, or the like) in feature generation may be adjusted to fit the requirement of different platforms with different computational resources. In some cases, the various embodiments provide an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning. Alternatively, or additionally, the various embodiments provide a video deduplication system that incorporates temporal information by encoding motion. Herein, image or video deduplication is performed to reduce the burden on storage systems, by removing duplicate data (e.g., duplicate images and/or videos) that may be needlessly taking up space on a local mobile drive or on a network-based or server-based computing system.
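As an illustrative example of the platform-dependent tuning mentioned above, the adjustable knobs (PCA output dimension, number of Gaussian components, output feature dimension) could be captured in a small per-platform configuration; the specific values below are assumptions, not figures from the disclosure.
```python
# Hypothetical per-platform profiles for the adjustable feature-generation parameters.
PLATFORM_PROFILES = {
    "mobile": {"pca_dim": 32, "gmm_components": 8,  "feature_dim": 128},
    "cloud":  {"pca_dim": 64, "gmm_components": 32, "feature_dim": 256},
}

def build_profile(platform: str) -> dict:
    """Return the tuning profile for a platform, defaulting to the cloud settings."""
    return PLATFORM_PROFILES.get(platform, PLATFORM_PROFILES["cloud"])
```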
[0031] These and other aspects of the system and method for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features are described in greater detail with respect to the figures.
[0032] The following detailed description illustrates a few embodiments in further detail to enable one of skill in the art to practice such embodiments. The described examples are provided for illustrative purposes and are not intended to limit the scope of the invention. [0033] In the following description, for the purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments of the present invention may be practiced without some of these details. In other instances, some structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0034] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0035] Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, image duplication determination technology, video duplication determination technology, image deduplication technology, video deduplication technology, image search technology, video search technology, machine learning technology, deep learning technology, AI technology, and/or the like. In other aspects, some embodiments can improve the functioning of user equipment or systems themselves (e.g., image duplication determination systems, video duplication determination systems, image deduplication systems, video deduplication systems, image search systems, video search systems, machine learning systems, deep learning systems, AI systems, etc.), for example, for training a deep metric learning framework(s) to perform duplicate image or video determination and for implementing a trained deep metric learning framework(s) to perform duplicate image or video determination, by using a computing system to perform feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; binarizing the first and second normalized feature vectors; and to determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors; and/or the like.
[0036] In particular, to the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as, training and implementing neural networks to perform duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, and/or the like, to name a few examples, that extend beyond mere conventional computer processing operations. These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, providing a lightweight deep metric learning based framework(s) that is computationally efficient and capable of running on mobile devices, that is capable of generating a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications, whose feature dimension and number of Gaussian components may be adjusted to fit the requirement of different platforms with different computational resources, that provides an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning, and/or that provides a video deduplication system that incorporates temporal information by encoding motion, and/or the like, at least some of which may be observed or measured by users, content developers, system administrators, and/or service providers.
[0037] Some Embodiments
[0038] We now turn to the embodiments as illustrated by the drawings. Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing neural network, artificial intelligence ("AI"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, as referred to above. The methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments. The description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
[0039] With reference to the figures, Fig. 1 is a schematic diagram illustrating a system 100 for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0040] In the non-limiting embodiment of Fig. 1, system 100 may comprise computing system 105, including, but not limited to, at least one of a feature extraction system 105a, a feature aggregation system 105b, a deep metric learning framework(s) 105c, and/or one or more loss function modules 105d, or the like. The computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be part of duplicate image or video determination system 110a, or may be separate, yet communicatively coupled with, the duplicate image or video determination system 110a. In some instances, the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be embodied as an integrated system. Alternatively, the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d may be embodied as separate, yet communicatively coupled, systems. In some embodiments, computing system 105 may include, without limitation, at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, or a processor on the user device, and/or the like. In some instances, the deep metric learning framework(s) 105c may include at least a first neural network and a second neural network. In some instances, the first neural network and the second neural network may each include, without limitation, at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
[0041] System 100 may further comprise a network-based or server-based duplicate image or video determination system 110b (and corresponding database(s) 115), one or more content sources 120 (and corresponding database(s) 125), and a content distribution system 130 (and corresponding database(s) 135) that communicatively couple with at least one of the computing system 105, the feature extraction system 105a, the feature aggregation system 105b, the deep metric learning framework(s) 105c, the one or more loss function modules 105d, and/or the duplicate image or video determination system 110a via network(s) 140. Network-based or server-based duplicate image or video determination system 110b may comprise computing system 105 and/or at least one of feature extraction system 105a, feature aggregation system 105b, deep metric learning framework(s) 105c, and/or the one or more loss function modules 105d, or the like, as described herein with respect to computing system 105 of duplicate image or video determination system 110a. In such cases, computing system 105 of network-based or server-based duplicate image or video determination system 110b may include, without limitation, at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
[0042] System 100 may further comprise one or more user devices 145a-145n (collectively, "user devices 145" or the like) that communicatively couple with at least one of computing system 105, feature extraction system 105a, feature aggregation system 105b, deep metric learning framework(s) 105c, one or more loss function modules 105d, and/or duplicate image or video determination system 110a, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig. 1), or indirectly via network(s) 140 and via wired (not shown) and/or wireless communications links (denoted by lightning bolt symbols in Fig. 1). According to some embodiments, the user devices 145 may each include, but is not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app")-based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like.
[0043] At least one of computing system 105, duplicate image or video determination system 110a, and/or network-based or server-based duplicate image or video determination system 110b (collectively, "computing system") may be used to train the first and second neural networks to perform duplicate image or video determination (referred to herein as "training" or the like), in accordance with the various embodiments. The computing system may then use the trained first and second neural networks to perform duplicate image or video determination (referred to herein as "inferencing" or the like), in accordance with the various embodiments. Unless indicated as training the first and second neural networks, the computing system is otherwise described as performing inferencing for determining whether the first and second images or videos are duplicates.
[0044] In operation, the computing system may perform feature discriminability enhancement, using a first deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like), by: transforming a first image-specific aggregated feature set corresponding to a first image (e.g., first image or video 150a, or the like, which may be received from one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like) into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function (including, but not limited to, a sigmoid function, or the like); transforming a second image-specific aggregated feature set corresponding to a second image (e.g., second image or video 150b, or the like, which may also be received from one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like) into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors. The computing system may then determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors. In some cases, the duplicate image or video determination may be sent and/or presented to one or more of the content source(s) 120, the content distribution system 130, and/or one or more of the user devices 145a-145n, or the like.
[0045] According to some embodiments, the first loss function may include, but is not limited to, one of a contrastive loss function or a triplet loss function, and/or the like. In some embodiments, the computing system may train each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function. In some cases, the second loss function may include, but is not limited to, a cross-entropy loss function, or the like. In some cases, cross-entropy loss is not directly applied to the normalized feature vectors, but to a transformed normalized feature after one or more fully connected layers.
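As a purely illustrative sketch of the inference path described above (training losses omitted), the following Python example assumes that the aggregated feature sets are already available as fixed-length vectors; the layer widths, the 0.5 binarization threshold, and the Hamming-distance threshold tau are hypothetical placeholders chosen for the example only, not values prescribed by this disclosure.

```python
# Minimal inference sketch (illustrative only): shared-parameter shallow network,
# sigmoid normalization, binarization, and Hamming-distance comparison.
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    """Shallow fully connected network that maps an aggregated feature set
    to a normalized feature vector with values between 0 and 1."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(x))      # activation function -> values in (0, 1)

def is_duplicate(agg_feat_1, agg_feat_2, net, tau=20):
    """Binarize the two normalized feature vectors and compare their Hamming
    distance against a threshold tau (tau would be tuned on validation data)."""
    with torch.no_grad():
        s1 = net(agg_feat_1)                  # the same parameters are used for both images
        s2 = net(agg_feat_2)
    b1 = (s1 > 0.5).int()                     # binarization (0.5 threshold is illustrative)
    b2 = (s2 > 0.5).int()
    hamming = (b1 != b2).sum().item()         # number of differing bits
    return hamming <= tau

# Usage example with random stand-ins for the aggregated feature sets:
net = FeatureEnhancer(in_dim=1024)
f1, f2 = torch.rand(1024), torch.rand(1024)
print(is_duplicate(f1, f2, net))
```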
[0046] In some embodiments, the computing system may perform image or video deduplication based on a determination that the second image is a duplicate of the first image. Alternatively, the computing system may perform search-by-image functionality by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image. In such cases, one of the first image or second image may be a query image while the other of the first image or second image may be a stored image.
[0047] According to some embodiments, prior to performing feature discriminability enhancement (i.e., in cases, where the computing system receives images without any feature extraction having been performed on the images), the computing system may perform image- specific feature extraction for the first image, by: performing local feature extraction for the first image to extract a plurality of first local features; and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image. Concurrently, or sequentially, the computing system may perform image-specific feature extraction for the second image, by: performing local feature extraction for the second image to extract a plurality of second local features; and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image. [0048] Alternatively, the computing system may perform image-specific feature extraction for the first image, by: splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells; performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features; aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image; and concatenating each first image- specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set. In such cases, the first image-specific aggregated feature set may comprise the first combined aggregated feature set, or vice versa; or the first combined aggregated feature set, instead of the first image-specific aggregated feature set, may be input into the first deep metric learning framework for performing feature discriminability enhancement.
[0049] Concurrently, or sequentially, the computing system may perform image-specific feature extraction for the second image, by: splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number; performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features; aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image; and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set. In such cases, the second image-specific aggregated feature set may comprise the second combined aggregated feature set, or vice versa; or the second combined aggregated feature set, instead of the second image-specific aggregated feature set, may be input into the first deep metric learning framework for performing feature discriminability enhancement.
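A minimal sketch of this grid-based variant follows, assuming a hypothetical extract_and_aggregate(cell) helper that stands in for the per-cell keypoint extraction and feature aggregation described above; the 3x3 grid size is an arbitrary example rather than a required choice.

```python
# Illustrative grid-based variant: split the image into an N x N grid,
# aggregate local features per grid cell, and concatenate by grid location.
import numpy as np

def grid_features(image: np.ndarray, extract_and_aggregate, n: int = 3) -> np.ndarray:
    """image: H x W x 3 array; extract_and_aggregate: hypothetical callable that
    returns a fixed-length aggregated feature for one sub-image (grid cell)."""
    h, w = image.shape[:2]
    cell_feats = []
    for i in range(n):
        for j in range(n):
            cell = image[i * h // n:(i + 1) * h // n,
                         j * w // n:(j + 1) * w // n]
            cell_feats.append(extract_and_aggregate(cell))
    # Concatenation order follows the grid location, so the combined feature
    # also encodes coarse spatial information about where the keypoints lie.
    return np.concatenate(cell_feats)
```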
[0050] In some instances, based on a determination that at least one sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints (or that the first or second image lacks any local features or any feature keypoints), the computing system may extend each first image-specific feature set corresponding to each of the at least one sub-image (or the feature set of the first or second image that is lacking features or feature keypoints) with three extra elements corresponding to average red/green/blue ("RGB") values, or the like. For other sub-images (or images) that do have features or feature keypoints, the computing system may instead assign a negative value (e.g., -1, or the like) for said sub-image (or said image). For the remaining elements that correspond to the aggregated local features, the computing system may randomly assign any value for those sub-images (or images) without any keypoints.
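Under one possible reading of this fallback scheme (the exact placement of the flag values is not mandated above), the handling might be sketched as follows; the -1 flag, the random filler, and the scaling of the RGB averages to [0, 1] are illustrative assumptions only.

```python
# Sketch of the "no keypoints" handling: every cell feature is extended by three
# extra elements. Cells without keypoints carry their average R, G, B values there
# (with random filler in the aggregated-feature slots), while cells that do have
# keypoints carry a negative flag value such as -1. All values are illustrative.
import numpy as np

def cell_feature_with_rgb_fallback(cell_img, aggregated, feat_dim,
                                   rng=np.random.default_rng(0)):
    """cell_img: H x W x 3 sub-image; aggregated: aggregated local features for
    the cell, or None if the cell has no keypoints; feat_dim: aggregated length."""
    if aggregated is None:                        # no keypoints detected in this cell
        filler = rng.random(feat_dim)             # random values for the feature slots
        extra = cell_img.reshape(-1, 3).mean(axis=0) / 255.0   # average R, G, B
    else:
        filler = aggregated
        extra = np.full(3, -1.0)                  # negative flag for cells with keypoints
    return np.concatenate([filler, extra])
```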
[0051] In some embodiments, the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets, while the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets. In such cases, the computing system may aggregate extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets; may binarize the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector. Concurrently, or sequentially, the computing system may aggregate extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets; may binarize the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector.
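As a simple sketch of this video-level step, the example below assumes mean pooling as the frame-wise aggregation and a fixed 0.5 binarization threshold; both are placeholder choices, since the description above leaves the specific aggregation method and threshold open.

```python
# Sketch of a video-level representation: aggregate the per-key-frame combined
# aggregated feature sets (mean pooling is used purely as an illustrative choice)
# and binarize the result into a key frame feature vector.
import numpy as np

def video_binary_signature(key_frame_features, threshold=0.5):
    """key_frame_features: list of 1-D arrays, one per key frame, all the same length."""
    stacked = np.stack(key_frame_features)        # (num_key_frames, feature_dim)
    aggregated = stacked.mean(axis=0)             # frame-wise feature aggregation
    return (aggregated > threshold).astype(np.uint8)   # binarized key frame feature vector
```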
[0052] Subsequently, the computing system may determine whether the second video is a duplicate of the first video based on a Hamming distance between the first and second binarized key frame feature vectors. In some cases, the computing system may use a second deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like) to perform feature discriminability enhancement on each of the aggregated extracted features from the one or more first key frames and the aggregated extracted features from the one or more second key frames prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
[0053] Alternatively, or additionally, the computing system may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames (the set of parameters including, but not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter, and/or the like); may aggregate the first 4D features for the one or more selected pairs of first frames; and may concatenate the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features. In some cases, generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features.
[0054] Concurrently, or sequentially, the computing system may match feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; may obtain a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames; may aggregate the second 4D features for the one or more selected pairs of second frames; and may concatenate the aggregated second 4D features with the aggregated extracted features from one or more second key frames to generate second combined 4D and key features. In some cases, generating the second binarized key frame feature vector may comprise binarizing the generated second combined 4D and key features.
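A hedged sketch of how such a 4D motion descriptor might be computed for one pair of frames is given below; ORB matching and OpenCV's estimateAffinePartial2D (which fits a similarity transform from which a rotation angle, two translations, and a zoom/scale factor can be read off) are assumed here only for concreteness and are not the sole possible realization.

```python
# Illustrative sketch of a 4-D motion feature for a pair of frames: match
# keypoints, fit a similarity transform, and read off rotation angle, two
# translations, and a zoom (scale) factor.
import cv2
import numpy as np

def motion_4d(frame_a, frame_b):
    """frame_a, frame_b: grayscale images (2-D uint8 arrays). Returns
    [rotation_angle, tx, ty, zoom] or None if too few matches are found."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 4:
        return None
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffinePartial2D(pts_a, pts_b)     # 2x3 similarity transform
    if M is None:
        return None
    zoom = np.hypot(M[0, 0], M[1, 0])                    # scale (zoom factor) parameter
    angle = np.arctan2(M[1, 0], M[0, 0])                 # rotation angle (radians)
    tx, ty = M[0, 2], M[1, 2]                            # two translation parameters
    return np.array([angle, tx, ty, zoom], dtype=np.float32)
```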
[0055] According to some embodiments, the computing system may use a third deep metric learning framework (e.g., deep metric learning framework(s) 105c, or the like) to perform feature discriminability enhancement on each of the generated first combined 4D and key features and the generated second combined 4D and key features prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
[0056] Merely by way of example, in some cases, the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set. Similarly, the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set. In such cases, the computing system may concatenate the first feature vectors each corresponding to respective differently dimensioned first image-specific aggregated feature sets to generate a first concatenated feature vector; may transform the generated first concatenated feature vector, using a third neural network, prior to converting into the first normalized feature vector; may concatenate the second feature vectors each corresponding to respective differently dimensioned second image-specific aggregated feature sets to generate a second concatenated feature vector; and may transform the generated second concatenated feature vector, using a fourth neural network, prior to converting into the second normalized feature vector. In some instances, the third and fourth neural networks, like the first neural network and the second neural network, may each include, without limitation, at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a CNN, or a FCN, and/or the like.
[0057] In the various aspects described herein, a deep metric learning based framework(s) is provided for performing duplicate image or video determination and/or image or video deduplication (or image-based searching). The deep metric learning based framework(s) is lightweight, and does not require a very deep neural network but only a shallow neural network with a few fully connected ("FC") layers. This makes the deep metric learning based framework solution computationally efficient. The deep metric learning based framework(s) may also be capable of running on mobile devices. Additionally, the deep metric learning based framework(s) may generate a robust and memory efficient representation for each image or video that may be considered a signature of each image or video and may be adopted in many applications. Further, the feature dimension and the number of Gaussian components (in the case of Fisher vector aggregation, or the like) in feature generation may be adjusted to fit the requirements of different platforms with different computational resources. In some cases, the various embodiments provide an efficient and lightweight image deduplication framework based on keypoint detectors or descriptors, feature aggregation, and deep metric learning. Alternatively, or additionally, the various embodiments provide a video deduplication system that incorporates temporal information by encoding motion. Herein, image or video deduplication is performed to reduce the burden on storage systems, by removing duplicate data (e.g., duplicate images and/or videos) that may be needlessly taking up space on a local mobile drive or on a network-based or server-based computing system.
[0058] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs. 2-4.
[0059] Figs. 2A-2C (collectively, "Fig. 2") are schematic block flow diagrams illustrating various non-limiting examples 200, 200', and 200" of duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0060] With reference to the non-limiting example 200 of Fig. 2A, discriminative binary features may be generated for each of the first and second images (e.g., first and second images or videos 150a and 150b, or the like) and duplicate image or video determination may be performed by computing Hamming distance on the binary features. As described above, based on the duplicate image or video determination, image or video deduplication - in some cases, deleting of the duplicate first or second image or video - may be performed. Alternatively, based on the duplicate image or video determination, search-by-image functionality may be performed by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image. In such cases, one of the first image or second image may be a query image while the other of the first image or second image may be a stored image. Herein, image pairs that are close to each other may be considered duplicates. The overall feature extraction system may include two steps as depicted in Fig. 2A: image-specific feature extraction (at blocks 210a and 210b) and deep-metric-learning-based feature discriminability enhancement (at blocks 225a and 225b). These two steps need not be performed together or sequentially. Rather, image-specific feature extraction may be performed beforehand, and at some later time feature discriminability enhancement may be performed (such as shown, e.g., in Fig. 2B, or the like). [0061] Turning back to Fig. 2 A, first image or video (e.g., first image or video 150a, or the like) may be input into a first pipeline 205a, which may perform image-specific feature extraction (at block 210a), including, but not limited to, local feature extraction (at block 215a) to extract a plurality of first local features from the first image or video, and feature aggregation (at block 220a) to aggregate the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image. The first pipeline 205a may also perform feature discriminability enhancement (at block 225a), in some cases, using deep metric learning (at block 230a; e.g., using deep metric learning framework(s) 105c of Fig. 1, or the like), by: transforming the first image-specific aggregated feature set corresponding to the first image or video (e.g., first image or video 150a, or the like) into a first feature vector, using a first neural network (e.g., first neural network 260a as shown in Fig. 2B, or the like) based on a set of parameters; and converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function. In some instances, the activation function may include, but is not limited to, a sigmoid function, or the like.
[0062] Concurrently, or sequentially, second image or video (e.g., second image or video 150b, or the like) may be input into a second pipeline 205b, which may perform image-specific feature extraction (at block 210b), including, but not limited to, local feature extraction (at block 215b) to extract a plurality of second local features from the second image or video, and feature aggregation (at block 220b) to aggregate the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image. The second pipeline 205b may also perform feature discriminability enhancement (at block 225b), in some cases, using deep metric learning (at block 230b; e.g., using deep metric learning framework(s) 105c of Fig. 1, or the like), by: transforming the second image-specific aggregated feature set corresponding to the second image or video (e.g., second image or video 150b, or the like) into a second feature vector, using a second neural network (e.g., second neural network 260b as shown in Fig. 2B, or the like) based on a set of parameters; and converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function.
[0063] In the first step, image-specific features may be generated. Each feature is designed to capture important visual information for each image. To generate such image-specific features, one may first adopt scale- and rotation-invariant keypoint detectors and/or descriptors in order to extract local features. Such keypoint detectors and/or descriptors may include, but are not limited to, at least one of scale-invariant feature transform ("SIFT"), speeded-up robust features ("SURF"), oriented FAST and rotated BRIEF ("ORB"), KAZE features, accelerated KAZE ("AKAZE") features, and/or binary robust invariant scalable keypoints ("BRISK"), and/or the like. In some embodiments, principal components analysis ("PCA") may be performed on the generated feature to reduce feature dimension. Subsequently, the generated features may be aggregated to form an image-specific feature (e.g., aggregated features for Image 1 255a or for Image 2 255b, as shown in Fig. 2B, which may be examples of the output of image-specific feature extraction (at blocks 210a and 210b of Fig. 2A), or the like).
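By way of illustration only, the snippet below extracts SIFT descriptors (assuming an OpenCV build that provides SIFT; ORB or any of the other listed detectors could be substituted) and reduces them with a PCA model fit offline; the output dimension of 32 used in the usage comment is an arbitrary placeholder.

```python
# Sketch of the local feature extraction step: scale- and rotation-invariant
# keypoint descriptors (SIFT here), followed by PCA to reduce descriptor dimension.
import cv2
import numpy as np
from sklearn.decomposition import PCA

def local_features(image_gray: np.ndarray, pca: PCA) -> np.ndarray:
    """Returns a (num_keypoints, Kd) array of PCA-reduced local descriptors,
    or an empty array if no keypoints are found in the image."""
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(image_gray, None)
    if descriptors is None:
        return np.empty((0, pca.n_components_))
    return pca.transform(descriptors)             # dimension reduction, e.g. 128 -> Kd

# The PCA model would typically be fit once, offline, on descriptors pooled from
# a training image collection, e.g.:
#   pca = PCA(n_components=32).fit(training_descriptors)
```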
[0064] For feature aggregation, one option may be Fisher vector aggregation, which encodes the deviation from the Gaussian Mixture Model ("GMM"). In this case, suppose the PCA output is Kd-dimensional and there are nc Gaussian components for the GMM, resulting in a 2*Kd*nc-dimensional Fisher vector encoding. One can extract and adopt half of the encoding corresponding to the deviation with respect to the means only. Other possible aggregation methods may include, but are not limited to, vector of locally aggregated descriptors ("VLAD") and bag of words ("BoW"). The aggregated image-specific feature may then be binarized to generate binary encoding and Hamming distance may be computed to perform image or video deduplication or image searching, or the like. In some embodiments, the dimension of PCA output and the number of Gaussian components (in the case of Fisher vector aggregation, or the like) may be adjusted to fit some use cases or to fit the requirements of the running platform (e.g., in the cloud or on mobile phones, etc.). These hyperparameter choices can affect the computational cost and the accuracy. Fig. 3B shows a deep metric learning framework with contrastive loss for different dimensions of features obtained by adjusting the dimension of PCA output and the number of Gaussian components. To encode additional information about the location of the keypoints, one can split the original image into an NxN grid and perform the above-mentioned feature extraction and aggregation for each grid cell. The generated aggregated features can then be concatenated according to the grid location to form the combined feature.
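A minimal sketch of the means-only Fisher vector encoding described above follows, assuming a diagonal-covariance GMM fit offline on training descriptors; the component count nc = 16 suggested in the usage comment and the descriptor dimension Kd are placeholders.

```python
# Sketch of Fisher-vector aggregation restricted to the deviation with respect
# to the GMM means, giving a Kd * nc dimensional encoding (half of the full
# 2 * Kd * nc means-plus-variances encoding mentioned above).
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(descriptors: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """descriptors: (N, Kd) PCA-reduced local features for one image or grid cell.
    Returns a flattened (nc * Kd,) Fisher vector (means part only)."""
    n = descriptors.shape[0]
    if n == 0:                                               # no keypoints: zero encoding
        return np.zeros(gmm.n_components * gmm.means_.shape[1])
    gamma = gmm.predict_proba(descriptors)                   # (N, nc) soft assignments
    sigma = np.sqrt(gmm.covariances_)                        # diagonal std devs, (nc, Kd)
    fv = np.zeros_like(gmm.means_)                           # (nc, Kd)
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / sigma[k]      # whitened deviation from mean k
        fv[k] = gamma[:, k] @ diff / (n * np.sqrt(gmm.weights_[k]))
    return fv.ravel()

# Typical offline setup (illustrative sizes):
#   gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(train_descriptors)
```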
[0065] Subsequently, for training, contrastive loss (at optional block 235) may be applied to the first normalized feature vector (i.e., output from the first pipeline 205a after deep metric learning (at block 230a)) and the second normalized feature vector (i.e., output from the second pipeline 205b after deep metric learning (at block 230b)) to decrease Euclidean distance between positive pairs (i.e., in the case that the first and second images are determined to be duplicates) and to increase Euclidean distance between negative pairs (i.e., in the case that the first and second images are determined to not be duplicates) among the first and second sets of values. Although contrastive loss is shown in Fig. 2, the various embodiments are not so limited, and any suitable loss may be used. In some embodiments, for training, each of the first and second neural networks of the deep metric learning framework may be trained based at least in part on applying a cross-entropy loss function to at least one of the first and second normalized feature vectors or the results of the first loss function (at optional block 240). Unless indicated as training the first and second neural networks, the process as shown in Fig. 2 is otherwise described as performing inferencing for determining whether the first and second images are duplicates, in which case the loss functions (at optional blocks 235 and 240) are not applied.
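For concreteness, one common form of contrastive loss that realizes this behavior is sketched below; the margin value is a hypothetical hyperparameter, and the exact loss formulation used in a given embodiment may differ (e.g., a triplet loss could be substituted, as noted earlier).

```python
# Sketch of a contrastive loss on the sigmoid outputs: positive (duplicate) pairs
# are pulled together and negative pairs are pushed at least `margin` apart in
# Euclidean distance. The margin is a placeholder hyperparameter.
import torch

def contrastive_loss(s1: torch.Tensor, s2: torch.Tensor,
                     is_positive: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """s1, s2: (batch, dim) normalized feature vectors; is_positive: (batch,)
    tensor of 1.0 for duplicate pairs and 0.0 for non-duplicate pairs."""
    d = torch.norm(s1 - s2, dim=1)                            # Euclidean distance per pair
    pos_term = is_positive * d.pow(2)                         # shrink distance for positives
    neg_term = (1 - is_positive) * torch.clamp(margin - d, min=0).pow(2)
    return (pos_term + neg_term).mean()
```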
[0066] The first pipeline 205a may further perform binarization (at block 235a) to binarize the first normalized feature vector. Similarly, the second pipeline 205b may further perform binarization (at block 235b) to binarize the second normalized feature vector. Duplicate determination (at block 245) - which may include determining a Hamming distance between the first and second binarized feature vectors (not shown in Fig. 2) - may then be performed to output whether the second image is a duplicate of the first image based on the Hamming distance between the first and second binarized feature vectors (at block 250). Computing Hamming distance on the binary features generated from the first step alone may not be capable of correctly identifying certain positive and negative pairs. Some negative pairs may be close to each other in terms of Hamming distance, while some positive pairs may be too far away from each other. In this case, setting any distance threshold to identify duplicates can result in either a considerable number of false positive pairs or a considerable number of false negative pairs. To address this problem, the deep metric learning-based framework (at block 230a, and as described above) has been added after step 1 to re-adjust the distance between negative and positive pairs with the help of a shallow neural network.
[0067] Turning to the non-limiting example 200' of Fig. 2B, the first pipeline 205a may receive aggregated features for a first image I1 255a that may be obtained beforehand, rather than immediately following image-specific feature extraction (as described above with respect to block 210a in Fig. 2A, or the like). Likewise, the second pipeline 205b may receive aggregated features for a second image I2 255b that may be obtained beforehand, rather than immediately following image-specific feature extraction (as described above with respect to block 210b in Fig. 2A, or the like). The process for feature discriminability enhancement (at blocks 225a and 225b), contrastive loss (at optional block 235), cross-entropy loss (at optional block 240), binarization (at blocks 245a and 245b), and duplication determination (at block 250) may subsequently be performed as described above with respect to Fig. 2A.
[0068] In some embodiments, the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets, while the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets. In such cases, with reference to the non-limiting example 200" of Fig. 2C, image-specific features I1-IM 265a-265m may each correspond to output of image-specific feature extraction 210a (e.g., as shown and described above with respect to Fig. 2A) for each key or selected frame of first video 150a, while framewise feature aggregation (at block 270a) may correspond to aggregation of these image-specific features I1-IM 265a-265m for all key or selected frames of the first video.
[0069] The first pipeline 205a may match feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video (e.g., first video 150a, or the like); and may obtain a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames (e.g., 4D features F1-FP 275a-275p, or the like). In some instances, the set of parameters may include, but is not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter, and/or the like. The first pipeline 205a may aggregate the first 4D features for the one or more selected pairs of first frames (at block 280a); and may concatenate (at block 285a) the aggregated first 4D features (from block 280a) with the aggregated extracted (frame-wise) features (from block 270a) from one or more first key frames to generate first combined 4D and key features. Deep metric learning (at block 230a) and binarization (at block 245a) may be similar, if not identical, to the corresponding processes in Figs. 2A and 2B. In some cases, generating the first binarized key frame feature vector may comprise binarizing the generated first combined 4D and key features.
[0070] Although not shown, the second pipeline 205b may perform similar tasks for the second video. Likewise, although not shown in Fig. 2C, contrastive loss (at optional block 235), cross-entropy loss (at optional block 240), and duplication determination (at block 250) may also be similar, if not identical, to the corresponding processes in Figs. 2A and 2B.
[0071] These and other functions of the examples 200, 200', and 200" (and their components) are described in greater detail herein with respect to Figs. 1, 3, and 4.
[0072] Figs. 3 A and 3B (collectively, "Fig. 3") are schematic block flow diagrams illustrating non-limiting examples 300 and 300' of deep metric learning frameworks with contrastive loss for use during duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments.
[0073] In Fig. 3, contrastive loss is used as an example to illustrate the deep metric learning framework. The overall strategy is depicted in Fig. 3A. A (Siamese) network 205 (including pipelines or network parts 205a and 205b) may take pairs of samples as input. The inputs may be the aggregated features before deep metric learning (e.g., aggregated features I1 255a for image 1 and I2 255b for image 2, or the like). The parameters may be shared for different images between the network parts 205a and 205b. A few fully connected ("FC") layers 260a and 260b may transform each feature 255a or 255b into a feature vector (which may be output from each FC3 layer as shown in Fig. 3A). The feature dimension can be adjusted according to different data and the requirements of different platforms (e.g., mobile phone versus cloud). As an example, the feature vector dimension may be set to be 256 in the example 300 of Fig. 3A. A Sigmoid function (at blocks 305a and 305b) may be applied to the feature vector output from FC3 to transform the data to [0,1], the sigmoid function output being denoted as "S" in Fig. 3A. Subsequently, for training (at optional blocks 310a and 310b), contrastive loss (at optional block 235) may be computed to make positive pairs close to each other and negative pairs far away from each other. For easier convergence and better performance, cross-entropy loss (at block 240a) may be added as illustrated in Fig. 3A. To compute the cross-entropy loss, images that are considered as "duplicates" of each other may be considered to belong to the same class. The size of FC4 in Fig. 3 may be equal to the number of classes, that is, the number of "unique" images. With a properly trained network, the sigmoid output features may be expected to have the positive pairs close to each other and negative pairs away from each other. The sigmoid output features S may then be binarized (at blocks 245a and 245b) to generate the binary encoding for each image. With a properly set threshold for binarization, the number of mistakes may be reduced compared to directly using the binarized aggregated features. Duplication determination (at block 250) may then be performed in a similar manner as described above with respect to Figs. 2A and 2B.
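The arrangement of Fig. 3A might be sketched as follows; the intermediate FC1/FC2 widths, the margin, and the auxiliary-loss weight alpha are illustrative assumptions, while the 256-D FC3 output and the FC4 size (equal to the number of "unique" image classes) follow the description above.

```python
# Sketch of the Fig. 3A training arrangement: three shared FC layers produce a
# 256-D feature, a sigmoid maps it to [0, 1], a contrastive loss acts on the pair
# (same form as sketched earlier), and an auxiliary FC4 + cross-entropy head
# classifies each image into its "unique image" class.
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    def __init__(self, in_dim: int, num_classes: int, feat_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 1024)            # hidden widths are placeholders
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, feat_dim)           # FC3 output is the feature vector
        self.fc4 = nn.Linear(feat_dim, num_classes)   # classification head (training only)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        s = torch.sigmoid(self.fc3(h))                # sigmoid output S in [0, 1]
        return s, self.fc4(s)                         # (feature S, class logits)

def training_step(branch, x1, x2, is_positive, y1, y2, margin=1.0, alpha=1.0):
    """The same `branch` (shared parameters) processes both images of each pair."""
    s1, logits1 = branch(x1)
    s2, logits2 = branch(x2)
    d = torch.norm(s1 - s2, dim=1)
    contrastive = (is_positive * d.pow(2) +
                   (1 - is_positive) * torch.clamp(margin - d, min=0).pow(2)).mean()
    ce = (nn.functional.cross_entropy(logits1, y1) +
          nn.functional.cross_entropy(logits2, y2))
    return contrastive + alpha * ce                   # alpha weights the auxiliary loss
```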
[0074] Referring to the non-limiting example 300' of Fig. 3B, one can generate features of different dimensions from step 1 (e.g., features I1^1, I1^2, through I1^N 315a-315n for image 1 and features I2^1, I2^2, through I2^N 320a-320n for image 2, or the like), by adjusting PCA output dimension and the number of Gaussian components (in the case of Fisher vector aggregation) and then combining them in the deep metric learning framework as illustrated in Fig. 3B. An input branch may be created for each aggregated feature of different input dimension. Each branch (blocks 260a' and 260b') may generate, for example, a 256-D feature, and then the features may be concatenated (at blocks 320a and 320b, denoted "C" in Fig. 3B). The concatenated feature C may be further transformed with three more FC layers (i.e., FC5, FC6, and FC7 at blocks 325a and 325b, or the like) before the Sigmoid function is applied (at blocks 305a and 305b). Parameters may be shared for different images but not for different dimensions. The subsequent processes may be similar, if not identical, to those as shown and described above with respect to Fig. 3A.
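A hedged sketch of this multi-branch variant follows; the per-branch hidden width, the FC5-FC7 sizes, and the example input dimensions are assumptions, while the one-branch-per-dimension structure, the 256-D branch outputs, the concatenation, and the final sigmoid follow the description of Fig. 3B above.

```python
# Sketch of the Fig. 3B multi-dimension variant: one input branch per
# aggregated-feature dimension, each producing a 256-D feature; the branch
# outputs are concatenated and passed through three more FC layers (FC5-FC7)
# before the sigmoid. Branch parameters are shared across images but not
# across dimensions.
import torch
import torch.nn as nn

class MultiDimEnhancer(nn.Module):
    def __init__(self, input_dims, feat_dim: int = 256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                          nn.Linear(512, feat_dim), nn.ReLU())
            for d in input_dims                        # one branch per input dimension
        ])
        concat_dim = feat_dim * len(input_dims)
        self.fc567 = nn.Sequential(                    # FC5, FC6, FC7 (widths illustrative)
            nn.Linear(concat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, features):
        """features: list of tensors, one per dimension, each (batch, input_dims[i])."""
        c = torch.cat([b(f) for b, f in zip(self.branches, features)], dim=1)
        return torch.sigmoid(self.fc567(c))            # normalized feature in [0, 1]

# Usage with two illustrative aggregated-feature dimensions:
net = MultiDimEnhancer(input_dims=[512, 2048])
s = net([torch.rand(4, 512), torch.rand(4, 2048)])     # shape (4, 256)
```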
[0075] These and other functions of the examples 300 and 300' (and their components) are described in greater detail herein with respect to Figs. 1, 2, and 4.
[0076] Figs. 4A-4J (collectively, "Fig. 4") are flow diagrams illustrating a method 400 for implementing duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features, in accordance with various embodiments. Method 400 of Fig. 4D continues onto Fig. 4E following the circular marker denoted, "A," and returns to Fig. 4D following the circular marker denoted, "B," or continues onto Fig. 4G following the circular marker denoted, "E," and returns to Fig. 4D following the circular marker denoted, "F." Method 400 of Fig. 4D continues onto Fig. 4F following the circular marker denoted, "C," and returns to Fig. 4D following the circular marker denoted, "D," or continues onto Fig. 4H following the circular marker denoted, "G," and returns to Fig. 4D following the circular marker denoted, "H." Method 400 of Fig. 4I returns to Fig. 4A following the circular marker denoted, "I." Method 400 of Fig. 4J returns to Fig. 4A following the circular marker denoted, "J."
[0077] While the techniques and procedures are depicted and/or described in a certain order for purposes of illustration, it should be appreciated that certain procedures may be reordered and/or omitted within the scope of various embodiments. Moreover, while the method 400 illustrated by Fig. 4 can be implemented by or with (and, in some cases, is described below with respect to) the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation. Similarly, while each of the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B, respectively (or components thereof), can operate according to the method 400 illustrated by Fig. 4 (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, 200', 200", 300, and 300' of Figs. 1, 2A, 2B, 2C, 3A, and 3B can each also operate according to other modes of operation and/or perform other suitable procedures.
[0078] In the non-limiting embodiment of Fig. 4A, method 400, at block 402a, may comprise a computing system performing image-specific feature extraction for a first image.
At block 404, method 400 may comprise the computing system performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to the first image into a first feature vector, using a first neural network based on a set of parameters (block 406a); and converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function (block 408a).
[0079] Concurrently, method 400, at block 402b, may comprise performing image-specific feature extraction for a second image. Also at block 404, method 400 may comprise the computing system performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a second image-specific aggregated feature set corresponding to the second image into a second feature vector, using a second neural network based on the set of parameters (block 406b); and converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function (block 408b).
[0080] In some embodiments, the computing system may include, without limitation, at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the first neural network and the second neural network may each include, but is not limited to, at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like.
[0081] For training, at optional block 410, method 400 may comprise applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs (i.e., in the case that the first and second images are determined to be duplicates) and to increase Euclidean distance between negative pairs (i.e., in the case that the first and second images are determined to not be duplicates) among the first and second sets of values. In some instances, the first loss function may include, without limitation, a contrastive loss function, and/or the like. In some embodiments, method 400 may further comprise training each of the first and second neural networks based at least in part on applying a second loss function to at least one of the first and second normalized feature vectors or the results of the first loss function (optional block 412). In some cases, the second loss function may include, but is not limited to, a cross-entropy loss function, or the like. Unless indicated as training the first and second neural networks, method 400 is otherwise described as performing inferencing for determining whether the first and second images are duplicates, in which case the loss functions (at optional blocks 410 and 412) are not applied.
[0082] Method 400, at block 414, may comprise binarizing the first and second normalized feature vectors (block 414); and determining a Hamming distance between the first and second binarized normalized feature vectors (block 416). Method 400 may further comprise, at block 418, determining whether the second image is a duplicate of the first image based on the Hamming distance between the first and second binarized normalized feature vectors.
[0083] According to some embodiments, method 400 may further comprise one of: performing image or video deduplication based on a determination that the second image is a duplicate of the first image (block 420a); or performing search-by-image functionality (block 420b). In some instances, performing search-by-image functionality may comprise matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image. In such cases, one of the first image or second image may be a query image and the other of the first image or second image may be a stored image. [0084] With reference to Fig. 4B, performing image-specific feature extraction for the first image (at block 402a) may comprise performing local feature extraction for the first image to extract a plurality of first local features (block 422a); and aggregating the extracted plurality of first local features to form the first image-specific aggregated feature set for the first image (block 424a). Alternatively, performing image-specific feature extraction for the first image (at block 402a) may comprise splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells (block 426a); performing local feature extraction for each first sub-image among the plurality of first sub images to extract a plurality of first local features (block 428a); aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image (block 430a); and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set (block 432a). In such cases, the first image- specific aggregated feature set may comprise the first combined aggregated feature set, or vice versa; or the first combined aggregated feature set may be input into the processes at block 404-408, instead of the first image-specific aggregated feature set (at block 402a).
[0085] Similarly, in Fig. 4C, performing image-specific feature extraction for the second image (at block 402b) may comprise performing local feature extraction for the second image to extract a plurality of second local features (block 422b); and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image (block 424b). Alternatively, performing image-specific feature extraction for the second image (at block 402b) may comprise splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number (block 426b); performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features (block 428b); aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image (block 430b); and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set (block 432b). In such cases, the second image-specific aggregated feature set may comprise the second combined aggregated feature set, or vice versa; or the second combined aggregated feature set may be input into the processes at block 404-408, instead of the second image-specific aggregated feature set (at block 402b).
[0086] Referring to Figs. 4B and 4C, method 400 may further comprise at least one of: based on a determination that at least one third sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints (or that the first or second image lacks any local features or any feature keypoints), assigning each first image-specific feature set corresponding to each of the at least one third sub-image (or assigning the first or second image lacking features or feature keypoints) with extra three elements corresponding to average red/green/blue ("RGB") values, or the like; based on a determination that at least one fourth sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, assigning each first image-specific feature set corresponding to each of the at least one fourth sub-image with a negative value, or the like; or assigning a random value to sub-images without any keypoints for other elements that correspond to aggregated local features; or the like (not shown in Fig. 4).
[0087] In some embodiments, the first image may be a frame among a plurality of first frames in a first video and the first combined aggregated feature set may be among a plurality of first combined aggregated feature sets, while the second image may be a frame among a plurality of second frames in a second video and the second combined aggregated feature set may be among a plurality of second combined aggregated feature sets. In such cases, with reference to Fig. 4D, method 400 may further comprise: aggregating extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets (block 434a). Method 400 may continue onto the process at block 436a, the process at block 442a in Fig. 4E following the circular marker denoted, "A," and/or the process at block 444a in Fig. 4G following the circular marker denoted, "E." Similarly, with reference to Fig. 4D, method 400 may further comprise: aggregating extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets (block 434b). Method 400 may proceed to the process at block 436b, the process at block 442b in Fig. 4F following the circular marker denoted, "C," or the process at block 444b in Fig. 4H following the circular marker denoted, "G."
[0088] At block 436a, method 400 may comprise binarizing the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector. Likewise, at block 436b, method 400 may comprise binarizing the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector.
[0089] Method 400 may further comprise determining a Hamming distance between the first and second binarized key frame feature vectors (block 438); and determining whether the second video is a duplicate of the first video based on the Hamming distance between the first and second binarized key frame feature vectors (block 440).
[0090] At block 442a in Fig. 4E (following the circular marker denoted, "A"), method 400 may comprise using a second deep metric learning framework to perform feature discriminability enhancement on the aggregated extracted features from the one or more first key frames. Method 400 may then return to the process at 436a in Fig. 4D, following the circular marker denoted, "B."
[0091] Likewise, at block 442b in Fig. 4F (following the circular marker denoted, "C"), method 400 may comprise using the second deep metric learning framework to perform feature discriminability enhancement on the aggregated extracted features from the one or more second key frames. Method 400 may then return to the process at 436b in Fig. 4D, following the circular marker denoted, "D."
[0092] Alternatively, or additionally, at block 444a in Fig. 4G (following the circular marker denoted, "E"), method 400 may comprise matching feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; obtaining a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames, the set of parameters including, but not limited to, a rotation angle parameter, two translation parameters, and a zoom factor parameter (block 446a); aggregating the first 4D features for the one or more selected pairs of first frames (block 448a); concatenating the aggregated first 4D features with the aggregated extracted features from one or more first key frames to generate first combined 4D and key features (block 450a); and using a third deep metric learning framework to perform feature discriminability enhancement on the generated first combined 4D and key features (block 452a). Method 400 may then return to the process at 436a in Fig. 4D, following the circular marker denoted, "F." In some cases, generating the first binarized key frame feature vector (at block 436a in Fig. 4D) may comprise binarizing the generated first combined 4D and key features.
[0093] Similarly, at block 444b in Fig. 4H (following the circular marker denoted, "G"), method 400 may comprise matching feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; obtaining a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames (block 446b); aggregating the second 4D features for the one or more selected pairs of second frames (block 448b); concatenating the aggregated second 4D features with the aggregated extracted features from the one or more second key frames to generate second combined 4D and key features (block 450b); and using the third deep metric learning framework to perform feature discriminability enhancement on the generated second combined 4D and key features (block 452b). Method 400 may then return to the process at 436b in Fig. 4D, following the circular marker denoted, "H." In some cases, generating the second binarized key frame feature vector (at block 436b in Fig. 4D) may comprise binarizing the generated second combined 4D and key features.
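Merely by way of illustration, the following non-limiting Python (OpenCV) sketch estimates the rotation angle, two translations, and zoom factor of blocks 446a/446b from matched keypoints of a selected frame pair and aggregates the per-pair 4D features per blocks 448a/448b. The ORB detector, the RANSAC-fitted similarity model, and the mean aggregation are illustrative choices, and the function names are hypothetical.

```python
import cv2
import numpy as np

def frame_pair_4d_features(frame_a, frame_b):
    # Blocks 444a/444b and 446a/446b: match feature keypoints between a
    # selected pair of frames and recover a rotation angle, two translations,
    # and a zoom factor from a RANSAC-fitted similarity transform.
    orb = cv2.ORB_create(nfeatures=1000)
    kps_a, des_a = orb.detectAndCompute(frame_a, None)
    kps_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 4:
        return None
    pts_a = np.float32([kps_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kps_b[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffinePartial2D(pts_a, pts_b, method=cv2.RANSAC)
    if M is None:
        return None
    zoom = float(np.hypot(M[0, 0], M[1, 0]))      # uniform scale factor
    angle = float(np.arctan2(M[1, 0], M[0, 0]))   # rotation angle in radians
    tx, ty = float(M[0, 2]), float(M[1, 2])       # two translation parameters
    return np.array([angle, tx, ty, zoom], dtype=np.float32)

def aggregate_4d_features(per_pair_features):
    # Blocks 448a/448b: aggregate the per-pair 4D features (mean over valid pairs).
    valid = [f for f in per_pair_features if f is not None]
    return np.mean(np.stack(valid), axis=0) if valid else np.zeros(4, np.float32)
```

The aggregated 4D vector could then be concatenated with the aggregated key frame features (blocks 450a/450b), for example with np.concatenate, before the third deep metric learning framework of blocks 452a/452b is applied.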
[0094] Merely by way of example, in some cases, the first image-specific aggregated feature set may be among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set. Likewise, the second image-specific aggregated feature set may be among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set. In such cases, with reference to Fig. 4I, method 400 may further comprise: concatenating the first feature vectors, each corresponding to a respective differently dimensioned first image-specific aggregated feature set, to generate a first concatenated feature vector (block 454a); and transforming the generated first concatenated feature vector, using a third neural network (block 456a), prior to converting into the first normalized feature vector (at block 408a). Method 400 may then return to the process at 408a in Fig. 4A, following the circular marker denoted, "I."
[0095] Similarly, with reference to Fig. 4J, method 400 may further comprise: concatenating the second feature vectors, each corresponding to a respective differently dimensioned second image-specific aggregated feature set, to generate a second concatenated feature vector (block 454b); and transforming the generated second concatenated feature vector, using a fourth neural network (block 456b), prior to converting into the second normalized feature vector (at block 408b). Method 400 may then return to the process at 408b in Fig. 4A, following the circular marker denoted, "J."
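Merely by way of illustration, the following non-limiting Python sketch builds several differently dimensioned aggregated feature sets by varying the PCA output dimension and the number of Gaussian components, then concatenates and transforms the results with a small neural network (blocks 454a/456a and 454b/456b). The soft-assignment (VLAD-style) encoding, the layer sizes, and the function and class names are illustrative assumptions rather than requirements of the method.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def aggregated_feature_set(local_descriptors, pca_dim, n_gaussians, seed=0):
    # One aggregated feature set per (PCA output dimension, Gaussian components)
    # setting: reduce the local descriptors with PCA, fit a small GMM, and pool
    # posterior-weighted residuals so the output dimension is n_gaussians * pca_dim.
    reduced = PCA(n_components=pca_dim, random_state=seed).fit_transform(local_descriptors)
    gmm = GaussianMixture(n_components=n_gaussians, random_state=seed).fit(reduced)
    posteriors = gmm.predict_proba(reduced)                   # (N, K)
    residuals = reduced[:, None, :] - gmm.means_[None, :, :]  # (N, K, D)
    return (posteriors[:, :, None] * residuals).sum(axis=0).ravel()

class ConcatTransform(nn.Module):
    # Blocks 454a/456a (and 454b/456b): concatenate the feature vectors from the
    # differently dimensioned aggregated feature sets and transform the result
    # with a small neural network before normalization.
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, feature_vectors):
        return self.net(torch.cat(feature_vectors, dim=-1))

# Usage with two illustrative (PCA dimension, Gaussian components) settings:
# codes = [aggregated_feature_set(descriptors, d, k) for d, k in [(16, 8), (32, 16)]]
# vectors = [torch.tensor(c, dtype=torch.float32) for c in codes]
# transformed = ConcatTransform(in_dim=sum(v.numel() for v in vectors))(vectors)
```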
[0096] Examples of System and Hardware Implementation
[0097] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, duplicate image or video determination system 110a or 110b, content source(s) 120, content distribution system 130, and user devices 145a-145n, etc.), as described above. It should be noted that Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
[0098] The computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing system 105, duplicate image or video determination system 110a or 110b, content source(s) 120, content distribution system 130, and user devices 145a-145n, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
[0099] The computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, or a solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
[0100] The computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein. In many embodiments, the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
[0101] The computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0102] A set of these instructions and/or code might be encoded and/or stored on a non- transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0103] It will be apparent to those skilled in the art that substantial variations may be made in accordance with particular requirements. For example, customized hardware (such as programmable logic controllers, field-programmable gate arrays, application- specific integrated circuits, and/or the like) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0104] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
[0105] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion. In an embodiment implemented using the computer or hardware system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
[0106] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0107] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0108] The communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
[0109] While particular features and aspects have been described with respect to some embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while particular functionality is ascribed to particular system components, unless the context dictates otherwise, this functionality need not be limited to such and can be distributed among various other system components in accordance with the several embodiments.
[0110] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with — or without — particular features for ease of description and to illustrate some aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method for performing duplicate image or video determination, the method implemented by a computing system and comprising: performing feature discriminability enhancement, using a first deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a first loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors; and determining whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
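Merely by way of illustration, the following non-limiting PyTorch sketch mirrors the framework recited in claim 1: one embedding module whose single set of parameters serves as both the first and second neural networks, a sigmoid activation producing values between 0 and 1, a contrastive loss over Euclidean distances, and a Hamming-distance comparison of the binarized vectors. The layer sizes, margin, 0.5 binarization threshold, decision threshold, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    # The first and second neural networks of claim 1 share one set of
    # parameters, so a single module applied to both inputs plays both roles.
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, aggregated_features):
        # Sigmoid activation yields normalized feature vectors with values in (0, 1).
        return torch.sigmoid(self.fc(aggregated_features))

def contrastive_loss(z1, z2, is_positive, margin=1.0):
    # Decrease Euclidean distance for positive (duplicate) pairs and increase
    # it, up to a margin, for negative pairs.
    dist = F.pairwise_distance(z1, z2)
    positive_term = is_positive * dist.pow(2)
    negative_term = (1.0 - is_positive) * F.relu(margin - dist).pow(2)
    return (positive_term + negative_term).mean()

def is_duplicate(z1, z2, max_hamming=5):
    # Binarize the normalized vectors at 0.5 and compare via Hamming distance;
    # both thresholds here are illustrative.
    bits1, bits2 = (z1 >= 0.5), (z2 >= 0.5)
    return (bits1 != bits2).sum(dim=-1) <= max_hamming
```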
2. The method of claim 1, wherein the computing system comprises at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the first neural network and the second neural network each comprises at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
3. The method of claim 1 or 2, further comprising: performing image-specific feature extraction for the first image, by: performing local feature extraction for the first image to extract a plurality of first local features; and aggregating the extracted plurality of first local features to form the first image- specific aggregated feature set for the first image; and performing image-specific feature extraction for the second image, by: performing local feature extraction for the second image to extract a plurality of second local features; and aggregating the extracted plurality of second local features to form the second image-specific aggregated feature set for the second image.
4. The method of claim 1 or 2, further comprising: performing image-specific feature extraction for the first image, by: splitting the first image into a plurality of first sub-images contained within a corresponding plurality of first grid cells of a first grid, the first grid comprising a first predetermined number of the plurality of first grid cells; performing local feature extraction for each first sub-image among the plurality of first sub-images to extract a plurality of first local features; aggregating the extracted plurality of first local features to form a corresponding first image-specific feature set for each first sub-image; and concatenating each first image-specific feature set corresponding to each first sub-image among the plurality of first sub-images to generate a first combined aggregated feature set, wherein the first image-specific aggregated feature set comprises the first combined aggregated feature set; and performing image-specific feature extraction for the second image, by: splitting the second image into a plurality of second sub-images contained within a corresponding plurality of second grid cells of a second grid, the second grid comprising a second predetermined number of the plurality of second grid cells, the second predetermined number being the same as the first predetermined number; performing local feature extraction for each second sub-image among the plurality of second sub-images to extract a plurality of second local features; aggregating the extracted plurality of second local features to form a corresponding second image-specific feature set for each second sub-image; and concatenating each second image-specific feature set corresponding to each second sub-image among the plurality of second sub-images to generate a second combined aggregated feature set, wherein the second image-specific aggregated feature set comprises the second combined aggregated feature set.
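Merely by way of illustration, the following non-limiting Python (OpenCV) sketch covers the extraction-and-aggregation of claim 3 within each grid cell of claim 4: local features are extracted and aggregated per sub-image, and the per-cell feature sets are concatenated into a combined aggregated feature set; cells without keypoints fall back to average R/G/B values, echoing the handling recited in claim 5 below. The ORB detector, the mean aggregation, the 3x3 grid, and the function names are illustrative assumptions.

```python
import cv2
import numpy as np

def encode_sub_image(cell_bgr, nfeatures=200):
    # Per-cell local feature extraction and aggregation; ORB is used purely as
    # an example of a local feature extractor, and a mean descriptor stands in
    # for the aggregated local features.
    gray = cv2.cvtColor(cell_bgr, cv2.COLOR_BGR2GRAY)
    _, descriptors = cv2.ORB_create(nfeatures=nfeatures).detectAndCompute(gray, None)
    if descriptors is None:
        # No keypoints in this sub-image: use average R/G/B values instead.
        mean_rgb = cell_bgr[..., ::-1].reshape(-1, 3).mean(axis=0).astype(np.float32)
        return np.concatenate([np.zeros(32, np.float32), mean_rgb])
    return np.concatenate([descriptors.astype(np.float32).mean(axis=0), np.zeros(3, np.float32)])

def combined_aggregated_feature_set(image_bgr, grid_rows=3, grid_cols=3):
    # Split the image into a predetermined number of grid cells, encode each
    # sub-image, and concatenate the per-cell image-specific feature sets.
    h, w = image_bgr.shape[:2]
    cell_features = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            cell = image_bgr[r * h // grid_rows:(r + 1) * h // grid_rows,
                             c * w // grid_cols:(c + 1) * w // grid_cols]
            cell_features.append(encode_sub_image(cell))
    return np.concatenate(cell_features)
```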
5. The method of claim 4, further comprising at least one of: based on a determination that at least one third sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, assigning each first image-specific feature set corresponding to each of the at least one third sub-image with three extra elements corresponding to average red/green/blue ("RGB") values; based on a determination that at least one fourth sub-image among at least one of the plurality of first sub-images or the plurality of second sub-images lacks any local features or any feature keypoints, assigning each first image-specific feature set corresponding to each of the at least one fourth sub-image with a negative value; or assigning a random value to sub-images without any keypoints for other elements that correspond to aggregated local features.
6. The method of claim 4, wherein the first image is a frame among a plurality of first frames in a first video and the first combined aggregated feature set is among a plurality of first combined aggregated feature sets, wherein the second image is a frame among a plurality of second frames in a second video and the second combined aggregated feature set is among a plurality of second combined aggregated feature sets, wherein the method further comprises: aggregating extracted features from one or more first key frames among the plurality of first frames in the first video, by aggregating extracted features from one or more corresponding first combined aggregated feature sets among the plurality of first combined aggregated feature sets; binarizing the aggregated extracted features from the one or more first key frames to generate a first binarized key frame feature vector; aggregating extracted features from one or more second key frames among the plurality of second frames in the second video, by aggregating extracted features from one or more corresponding second combined aggregated feature sets among the plurality of second combined aggregated feature sets; binarizing the aggregated extracted features from the one or more second key frames to generate a second binarized key frame feature vector; and determining whether the second video is a duplicate of the first video based on a Hamming distance between the first and second binarized key frame feature vectors.
7. The method of claim 6, further comprising: using a second deep metric learning framework to perform feature discriminability enhancement on each of the aggregated extracted features from the one or more first key frames and the aggregated extracted features from the one or more second key frames prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
8. The method of claim 6, further comprising: matching feature keypoints from one or more selected pairs of first frames among the plurality of first frames in the first video; obtaining a set of parameters between each selected pair of first frames to generate first four dimensional ("4D") features for each selected pair of first frames, the set of parameters comprising a rotation angle parameter, two translation parameters, and a zoom factor parameter; aggregating the first 4D features for the one or more selected pairs of first frames; concatenating the aggregated first 4D features with the aggregated extracted features from the one or more first key frames to generate first combined 4D and key features, wherein generating the first binarized key frame feature vector comprises binarizing the generated first combined 4D and key features; matching feature keypoints from one or more selected pairs of second frames among the plurality of second frames in the second video; obtaining a set of parameters between each selected pair of second frames to generate second 4D features for each selected pair of second frames; aggregating the second 4D features for the one or more selected pairs of second frames; and concatenating the aggregated second 4D features with the aggregated extracted features from the one or more second key frames to generate second combined 4D and key features, wherein generating the second binarized key frame feature vector comprises binarizing the generated second combined 4D and key features.
9. The method of claim 8, further comprising: using a third deep metric learning framework to perform feature discriminability enhancement on each of the generated first combined 4D and key features and the generated second combined 4D and key features prior to respective binarizations to extract discriminative features for each of the first and second videos, respectively.
10. The method of any of claims 1-9, wherein the first image-specific aggregated feature set is among a plurality of first image-specific aggregated feature sets, each having different dimensions obtained by adjusting a principal components analysis ("PCA") output dimension and a number of Gaussian components for each first image-specific aggregated feature set, wherein the second image-specific aggregated feature set is among a plurality of second image-specific aggregated feature sets, each having different dimensions obtained by adjusting a PCA output dimension and a number of Gaussian components for each second image-specific aggregated feature set, wherein the method further comprises: concatenating the first feature vectors, each corresponding to a respective differently dimensioned first image-specific aggregated feature set, to generate a first concatenated feature vector; transforming the generated first concatenated feature vector, using a third neural network, prior to converting into the first normalized feature vector; concatenating the second feature vectors, each corresponding to a respective differently dimensioned second image-specific aggregated feature set, to generate a second concatenated feature vector; and transforming the generated second concatenated feature vector, using a fourth neural network, prior to converting into the second normalized feature vector.
11. The method of any of claims 1-10, wherein the first loss function comprises a contrastive loss function.
12. The method of any of claims 1-11, further comprising: training each of the first and second neural networks based at least in part on applying a second loss function to transformed features generated by transforming the first and second normalized feature vectors with a fully connected layer.
13. The method of claim 12, wherein the second loss function comprises a cross entropy loss function.
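Merely by way of illustration, the following non-limiting PyTorch sketch shows the auxiliary training signal of claims 12 and 13: the normalized feature vectors are transformed with a fully connected layer and a cross entropy loss is added to the contrastive objective. The embedding dimension, class count, margin, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    # Claim 12: a fully connected layer transforms the normalized feature
    # vectors so a second loss can supervise the embedding networks.
    def __init__(self, embed_dim=64, num_classes=1000):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, normalized_features):
        return self.fc(normalized_features)

def training_loss(z1, z2, labels1, labels2, pair_label, head, margin=1.0):
    # Total objective: the contrastive (first) loss on the pair plus a cross
    # entropy (second, claim 13) loss on the transformed features of each image.
    dist = F.pairwise_distance(z1, z2)
    contrastive = (pair_label * dist.pow(2) +
                   (1.0 - pair_label) * F.relu(margin - dist).pow(2)).mean()
    cross_entropy = F.cross_entropy(head(z1), labels1) + F.cross_entropy(head(z2), labels2)
    return contrastive + cross_entropy
```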
14. The method of any of claims 1-13, further comprising: performing image or video deduplication based on a determination that the second image is a duplicate of the first image.
15. The method of any of claims 1-14, further comprising: performing search-by-image functionality by matching the second image with the first image based on a determination regarding whether the second image is a duplicate of the first image, wherein one of the first image or second image is a query image and the other of the first image or second image is a stored image.
16. A system operable to perform the method of claims 1-15, for performing duplicate image or video determination, the system comprising: a computing system, comprising: at least one first processor; and a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: perform feature discriminability enhancement, using a deep metric learning framework, by: transforming a first image-specific aggregated feature set corresponding to a first image into a first feature vector, using a first neural network based on a set of parameters; converting the first feature vector into a first normalized feature vector having a first set of values between 0 and 1, using an activation function; transforming a second image-specific aggregated feature set corresponding to a second image into a second feature vector, using a second neural network based on the set of parameters; converting the second feature vector into a second normalized feature vector having a second set of values between 0 and 1, using the activation function; applying a contrastive loss function to the first normalized feature vector and the second normalized feature vector to decrease Euclidean distance between positive pairs and to increase Euclidean distance between negative pairs among the first and second sets of values; and binarizing the first and second normalized feature vectors; and determine whether the second image is a duplicate of the first image based on a Hamming distance between the first and second binarized normalized feature vectors.
17. The system of claim 16, wherein the computing system comprises at least one of a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, wherein the first neural network and the second neural network each comprises at least one of a deep learning model-based neural network, a deep metric learning-based neural network, a convolutional neural network ("CNN"), or a fully convolutional network ("FCN").
PCT/US2022/013261 (published as WO2022182445A1), filed 2022-01-21, priority date 2021-12-03: Duplicate image or video determination and/or image or video deduplication based on deep metric learning with keypoint features

Applications Claiming Priority

US 63/285,522 (US202163285522P), filed 2021-12-03

Family

ID=83048360

Patent Citations (3)

* Cited by examiner, † Cited by third party

US20170004352A1 * (priority 2015-07-03, published 2017-01-05), Fingerprint Cards Ab: Apparatus and computer-implemented method for fingerprint based authentication
WO2020023467A1 * (priority 2018-07-24, published 2020-01-30), Aquabyte, Inc.: Unique identification of freely swimming fish in an aquaculture environment
WO2020244437A1 * (priority 2019-06-06, published 2020-12-10), 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.): Image processing method and apparatus, and computer device

Legal Events

Code 121, EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 22760178; country of ref document: EP; kind code of ref document: A1)