WO2023149588A1

WO2023149588A1 - Unsupervised hash generation system

Info

Publication number: WO2023149588A1
Application number: PCT/KR2022/001745
Authority: WO
Inventors: Wonju Lee; Minje Park; Seok-Yong Byun
Original assignee: Intel Corporation
Priority date: 2022-02-04
Filing date: 2022-02-04
Publication date: 2023-08-10

Abstract

Systems, apparatuses and methods include technology that trains a neural network by generating high-dimensional representations of a plurality of images, determining a first loss associated with the generation of the high-dimensional representations of the plurality of images, and updating at least one parameter of the neural network based on the first loss.

Description

UNSUPERVISED HASH GENERATION SYSTEM

Embodiments generally relate to generating an enhanced neural network that is trained based on accurate loss calculations. More particularly, embodiments relate to a contrastive learning neural network that updates one or more parameters of a neural network based on low-dimensional losses and high-dimensional losses.

Neural networks may execute hashing techniques (e.g., unsupervised hashing that does not require label information) to map data of arbitrary size to fixed-size codes. Hashing techniques are used to index large amount of data in retrieval applications to access data in nearly constant time per retrieval. In image hash applications, deep neural networks (DNNs) may be used for feature representation capability. Training of such DNNs may be problematic as loss function calculations are limited in scope and accuracy to only encompass high-dimensional binary hard code.

For example, generating hash codes for DNNs may be challenging since optimization contains discrete constraints, and thus conventional backpropagation cannot be directly applicable to update the discrete constraints. For example, a hash code is generally composed of fixed length binary vector, so a DNN performs binarization that includes employing the discrete constraint during DNN optimization. To address this, some conventional designs adopt continuous relaxation to hashing optimization by replacing discrete sign function with a smooth activation function (e.g., hyperbolic tangent or sigmoid). The sign function is used only in a forward pass while the gradients are transmitted to the front layer intactly. However, such asymmetric behavior between forward and backward passes may generate noises during optimization and degrade the quality of generated hash codes.

The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:

FIG. 1 is a diagram of an example of neural network architecture according to an embodiment;

FIG. 2 is a flowchart of an example of a method to generate losses for a neural network according to an embodiment;

FIG. 3 is a diagram of an example of training and deploying a neural network according to an embodiment;

FIG. 4 is a diagram of an example of an overview of a neural network architecture e according to an embodiment;

FIG. 5 is a diagram of an example of a similarity determination process according to an embodiment;

FIG. 6 is a diagram of an example of a conventional neural network training example and an enhanced neural network training example according to an embodiment;

FIG. 7 is a graph of an example of a different losses according to an embodiment;

FIG. 8 is a block diagram of an example of a neural network training system 15 according to an embodiment;

FIG. 9 is an illustration of an example of a semiconductor apparatus according to an embodiment;

FIG. 10 is a block diagram of an example of a processor according to an embodiment; and

FIG. 11 is a block diagram of an example of a multi-processor based computing system according to an embodiment.

Turning now to FIG. 1, a neural network architecture 100 is illustrated. The neural network architecture 100 may be trained through a contrastive learning process that incorporates unsupervised hashing methods. A plurality of images 102 are provided. The neural network architecture 100 may determine a prediction indicating whether different images of the plurality of images are similar to each other and generate a loss function (e.g., a number indicating a correctness of the prediction) to update parameters of the neural network architecture 100. To train the neural network architecture 100, a loss function may generate the loss based on the accuracy of the matching. For example, the loss function may quantify the difference between the expected outcome (e.g., images are similar or dissimilar from each other) and the outcome produced by the machine learning model. From the loss function, the neural network architecture 100 may generate gradients which are then used to update the parameters (e.g., weights biases, backbone parameters, etc.) of the neural network architecture 100. Some embodiments may enhance the loss function by analyzing soft code 110 (e.g., a high-dimensional representation) in addition to hard code 114 (e.g., a low-dimensional representation). Other conventional implementations may be unable to quantify the loss of the backbone layers 106 (e.g., soft layers) and thus may not accurately update the backbone layers 106 or may neglect to do so altogether. The soft code 110 may comprise first soft code 110a and second soft code 110b. The hard code 114 may include first hard code 114a and second hard code 114b.

Thus, some embodiments enhance the quality of generated hash codes based on a soft-to-hard hashing model. In detail, some embodiments propagate the correlations between positives (e.g., matches) determined from the soft code 110 (e.g., a high-dimensional soft code) to the hard code 114 (e.g., a low-dimensional binary hard code). Doing so may effectively suppress noises that may be generated via other methods (e.g., 15 discrete constraint relaxation methods). Some embodiments generate a loss function for the soft code (e.g., in a penultimate block of an AI model) to facilitate more accurate training of the binary hash model while also addressing object mismatching problems to perform joint training for contrastive learning.

During forward propagation, the neural network architecture 100 receives images 104 including a first image v1 104a and a second image v2 104b. The first and second images v1,

v2

104a, 104b may be fed into different portions of the neural network architecture 100. For example, the first image v1 104a may be processed in the upper branch of the neural network architecture 100 while the second image v2 104b may be processed in the lower branch of the neural network architecture 100. The first and second images v1,

v2

104a, 104b may be different from each other.

The first and second images v1,

v2

104a, 104b may then be provided to backbone layers 106. The backbone layers 106 include first and

second backbone layers

106a, 106b. As discussed, the neural network architecture 100 may be undergo a contrastive learning process. Contrastive learning may be a framework that learns similar/dissimilar representations from data that are organized into similar/dissimilar pairs. Contrastive learning groups pairs of similar images (e.g., positive images that originate from a same image) together while repelling dissimilar images (e.g., negative images that originate from different images) away from the pairs of similar images. Contrastive learning leverages input data itself as supervision via instance discrimination and may be utilized in representation learning domains to determine how to properly classify images.

The first and

second backbone layers

106a, 106b may generate differently distorted views from a same image for contrastive learning. For example, the first and

second backbone layers

106a, 106b may transform any given image randomly resulting in two correlated views of the same image (e.g., a positive pair). Various types of augmentations may be applied, including random cropping followed by resize back to the original size, random color distortions, translation, rotation, brightness modification, random Gaussian blur, etc. Thus, the first backbone layer 106a may modify the first image v ₁ 104a to generate a pair of transformed first images, and the second backbone layer 106b may modify the second image v ₂ 104b to generate a pair of transformed second images. As a detailed example, with a batch χ consisting of N samples from the overall dataset, the first and

second backbones

106a, 106b may generate two distorted views v_i and v_j by applying data augmentations following the distribution T.

The augmented view v_i is mapped to vector representations z_i ∈ R^N×C by feedforwarding into one of the soft encoders 108 (e.g., which may be a pre-trained feature extractor). The pair of transformed first images and pair of transformed second images are provided to the soft encoders 108. The soft encoders 108 include a first soft encoder 108a and a second soft encoder 108b. The first soft encoder 108a may vectorize the pair of transformed first images to generate first representation vectors (each vector corresponding to one of the transformed first images). The second soft encoder 108b may vectorize the pair of transformed second images to generate second representation vectors (each vector corresponding to one of the transformed second images). In some embodiments, each pair of transformed first images and pair of transformed second images is mapped to vector representations z_i ∈ R^N×C by feedforwarding into the soft encoders 108 which are pre-trained. Thus, the soft encoders 108 may compute a D-dimensional soft code　q_i∈ R^N×D by a non-linear mapping function g(·).

The first and second representation vectors may be represented by the soft code 110 for further processing. The soft code 110 may be referred to as soft layers. Soft code 110 may be a hash code with a real-valued vector in a high-dimensional space. The neural network architecture 100 may compare the first representation vectors and the second representation vectors to each other (e.g., each representation vector of the first and second representation vectors is compared to all of the other representation vectors) in order to build a correlation matrix 116 showing the correlation between different features of the pair of transformed first images and the pair of transformed second images. The soft code 110 may be in a penultimate block of the neural network architecture 100. The first and second representation vectors may be high-dimensional continuous data representations (e.g., real-valued floating point values with a large and/or infinite number of values).

Some embodiments may follow the information bottleneck principle so that the soft code 110 (e.g., soft code which generates the high-dimensional features) is informative about a semantic similarity between positives (e.g., transformed images that correspond to a same image) while being invariant to distortions inserted into the images. Some embodiments formulate the loss function minimizing the information bottleneck at soft code 110 as follows:

In Equation 1, I(·,·) denotes the mutual information, and β is positive weight parameter. It is further worthwhile to note that I(Q_θ,·) = H(Q_θ)-H(Q_θ|·) for X and V where H is an entropy function. In equation 1, X is an original image and V is a distorted view. Since the soft code Q_θconditioned on the input V with the deterministic soft coder is understood and hence the conditional entropy H(Q_θ,V)is zero, the loss function described in Equation 1 may be rewritten as:

In order to reduce the impact of zero-one properties of the entropy in the intermediate auxiliary loss, embodiments may approximate the first term (argmin_θαH(Q_θ,X) minimizing the information of the soft codes with distortions to the factor maximizing the alignment between soft codes. Similarly, some embodiments approximate the second term ((1-α)H(Q_θ)), maximizing the information of the soft codes itself to minimize a Frobenius norm of a cross-correlation matrix such as a first correlation matrix 116.

Embodiments then may then determine terms of the first correlation matrix 116 as a pair-wise similarity matrix between soft codes q from

. . The q terms may refer to different entries of the first and second representation vectors. The diagonal terms of the first correlation matrix 116 may make distorted embeddings to be invariant, while the off-diagonal terms of the first correlation matrix 116 make embeddings have nonredundant information about the sample. Therefore, the weighted similarity preserving loss (e.g., a first loss) for the soft code can be written as Equation 3:

In Equation 3, I is the NxN identity matrix (e.g., a square matrix having 1s on the main diagonal, and 0s everywhere else). Q is the NxN first correlation matrix 116, where N is the batch-size of each mini-batch. Further, w_i,j is set to 1 for i=j, and β = (1-α)/α for others. α is derived from Equation 2 to maximize the information of the soft codes, such as soft code 110, itself and to minimize a Frobenius norm of a cross-correlation matrix. By replacing β = (1-α)/α, embodiments simplify the equation 2 to equation 3. The diagonal terms of the first correlation matrix Q sets the positive terms to be 1 and others are the negatives and are set to 0, so embodiments compare Q and identity matrix I. For the weighted similarity preserving loss, embodiments exploit the weight w. Some embodiments enhance robustness relative to other approaches that may introduce noise from the discrete constraint relaxation during the backpropagation. For example, some embodiments herein adopt a mean square error for soft codes. Here, the mean square error lies on the information bottleneck principle. In the first correlation matrix 116, the middle diagonal may correspond to a match between different features. For example, the middle diagonal may correspond to features of one of the first and second representation vectors that match each other. The loss of Equation 3 may correspond to how accurately the neural network architecture 100 may determine that the continuous field values of the first and second representation vectors are similar or dissimilar from each other.

Thereafter, hard code layers 112, comprising first and second hard code layers 112a, 112b, may reduce the high dimensionality (e.g., a continuous field such as floating point format) of the first and second representation vectors to generate a low-dimensional representations (e.g., a discrete field such as binary format that has a limited number of values). Hard code may be a fixed-length binary hash code. For example, the hard coder layers 112, may hash the first and second representation vectors to generate low-dimensional representations. The hard coder layers 112 (e.g., hard encoders) may be binary image descriptors. Some embodiments may compute the D-dimensional binary hard code c_i∈{-1,+1}^N×B where C is the hard code, N is the batch size and B is the length of hard code 114. The Sigmoid function σ(·)and sign function sgn(·) may be included to determine the binary hard code as shown in in Equation 4:

The H function is the hard code 112.The hash optimization problem of Equation 4 includes a discrete constraint as follows:

Since Equation 5 includes the discrete constraint, Equation 5 may represent a non-deterministic polynomial-time hardness (NP-hard) problem. Thus, some embodiments include a greedy algorithm which solves the optimization without a discrete constraint to obtain the optimal continuous solution, then finds the closest discrete point to the continuous solution by updating Equation 6 as follows:

In Equation 6, the model parameter θ and the learning rate correspond to the t-th gradient descent iteration. Some embodiments include the greedy optimization algorithm to train the neural network architecture 100.

The hard code 114 may be used to determine the hashing problem from the perspective of multi-label classification, and hence embodiments include a loss for the hard code 114 by applying a general cross-entropy loss (e.g., Hashing with Contrastive Information Bottleneck or CIBHash). The Normalized Temperature-scaled Cross Entropy Loss for distorted samples i and j from a batch x is given by Equation 7:

In Equation 7, 1(·) is the indicator function that outputs 1 for true and 0 for false, C_i,j is the element of the pairwise similarity matrix from

, and τ denotes the temperature of the Normalized Temperature-scaled Cross Entropy Loss.

The contrastive loss for the hard code 114 of the batch x is given by:

Equation 8

Some embodiments may generate the end-to-end loss formulation as follows:

Equation 9

In Equation 9, the positive weighting parameter λ controls the trade-off between the importance of the loss of the soft and the hard code. Thus, embodiments generate two losses for soft and hard codes, and then combine them together.

In this example, the hard code 114 may generate an output represented by the second correlation matrix 118. The second correlation matrix 118 may be interpreted to a binary output, or a "+1" for two images of the pair of transformed first images and the pair of transformed second images being the same, or a "-1" or two images of the pair of transformed first images and the pair of transformed second images being different from each other.

For example, the hard code 114 may operate on based on binary image descriptors generated by the hard coder layers 112. Thus, the hard code 114 may provide not only efficient data storage (since high-dimensional data is compressed to binary values) but also execute rapid estimation of the Hamming distance-based similarity via XOR bit-wise operations between vectors. Thus, the hard code 114 may generate similarity values based on Hamming distances. The neural network architecture 100 may further modify parameters (e.g., via gradient descent) of the neural network architecture 100 based on based on the Equation 9, and in particular L_hard + λL_soft. In some embodiments, different portions of the neural network architecture 100, such as the hard coder layers 112, may be adjusted based on L_hard, while other portions of the neural network architecture 100, such as the backbone layer 106 and/or soft encoder layers 108, may be adjusted based on L_hardand L_soft. Thus, the proposed unsupervised hashing method enhances performance on binary hashing problems compared to conventional methods.

Thus, some embodiments enhance image modifications by the backbone layer 106 and hashing by the soft encoder layers 108 based on a loss that is generated specifically based on an analysis of the outputs of the backbone layer 106 and hashing by the soft encoder layers 108. As such, the neural network architecture 100 may be more accurate and efficient than other neural networks that do not modify the backbone layer 106 and the soft encoder layers 108 or does so based only on outputs of the entire neural network architecture 100 rather than a loss generated based specifically on the outputs of the backbone layer 106 and hashing by the soft encoder layers 108. Embodiments further introduce the joint training of the backbone layers 106 and hard code 114 from the perspective of the self-supervised representation learning by reducing the impact of the noise from the discrete constraint relaxation. This joint training shows significant achievements when the pre-trained backbone yields poor performance on target datasets from the large discrepancy between source and target datasets.

It is further worthwhile to note that the while variations of the weighted MSE loss

were described above some embodiments may be modified. For example, some embodiments may be extended to normalized temperature-scaled crossentropy loss, L2 loss with hard assignment and swapped-prediction loss with Sinkhorn-Knopp algorithm.

Some or all components in the neural network architecture 100 may be implemented using one or more of a CPU, a GPU, an AI accelerator, a FPGA accelerator, an ASIC, and/or via a processor with software, or in a combination of a processor with software and an FPGA or ASIC. More particularly, components of the neural network architecture 100 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in configurable logic such as, for example, programmable logic arrays (PLAs), FPGAs, complex programmable logic devices (CPLDs), in fixed-functionality logic hardware using circuit technology such as, for example, ASIC, complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.

For example, computer program code to carry out operations by the neural network architecture 100 may be written in any combination of one or more programming languages, including an object-oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.

Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).

FIG. 2 shows a method 320 to generate losses for a neural network. The method 320 may be readily combinable with any of the embodiments described herein. For example, the method 320 may implement and/or operate in conjunction with one or more aspects of the neural network architecture 100 (FIG. 1) already discussed. The method 320 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in configurable logic such as, for example, PLAs, FPGAs, CPLDs, in fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS or TTL technology, or any combination thereof.

Illustrated processing block 322 generates a high-dimensional representations of a plurality of images. Illustrated processing block 324 determines a first loss associated with the generation of the high-dimensional representations of the plurality of images. Illustrated processing block 326 updates at least one parameter of the neural network based on the first loss.

In some embodiments, the method 320 further includes generating low-dimensional representations of the plurality of images based on the high-dimensional representations, where the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations. In such embodiments, the method 320 may further include generating a similarity measurement between the plurality of images based on the low-dimensional representations. In some embodiments, the method 320 further includes transforming an image of the plurality of images into two different images and generating the high-dimensional representations based on the two different images. In some embodiments, the method 320 further includes hashing the high-dimensional representations to generate the low-dimensional representations. In some embodiments, the method 320 further includes generating a second loss associated with the similarity measurement and updating the least one parameter of the neural network based on the second loss. In some embodiments, the neural network is a contrastive learning neural network.

FIG. 3 illustrates a training and deployment process 300 of a neural network. The process 300 may be readily combinable with any of the embodiments described herein. For example, the process 300 may implement and/or operate in conjunction with one or more aspects of the neural network architecture 100 (FIG. 1) and/or method 320 (FIG. 2) already discussed.

The process 300 includes a training process 302 and a deployment process 314. Initially, the training process 302 may be an offline preprocessing that includes stored images 304. An enhanced hashing model 306, which may correspond to the neural network architecture 100 (FIG. 1), is trained on the stored images 304. The enhanced hashing model 306 may generate a database of stored hash keys 308. The stored hash keys 308 may be hashes of the stored images 304 and may be stored in association with identifiers of the images. For example, a first hash key may be generated based on a first image (e.g., the first hash key represents the first image), and is thus stored in association with the first image.

During deployment 314, a query image 312 may be provided to the enhanced hashing model 306. The enhanced hashing model 306 may then hash the query image 312 to generate a query hash key. The process 300 may then compare and retrieve 310 a corresponding hash key from the stored hash keys based on the query hash key. For example, the compare and retrieve 310 may compare the query hash key to the stored hash keys 308 to identify an image. For example, the query hash key may match (e.g., be within a predefined Hamming distance) of a first hash key of the stored hash keys 308. The first hash key may be stored in association with a first image, and thus the query hash key may be deemed to correspond to the first image.

Thus, the enhanced hashing model 306 is an example of content-based image retrieval (CBIR) system using improved hashing methods as described herein. The quality of retrieved image of CBIR is mainly determined by hashing and comparisons of hashes of images. Embodiments as described herein include improved hashing methods that may generate enhanced keys for various types of binary hash code comparisons.

FIG. 4 illustrates an overview of a neural network architecture 400. The neural network architecture 400 may be readily combinable with any of the embodiments described herein. For example, the neural network architecture 400 may implement and/or operate in conjunction with one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2) and/or process 300 (FIG. 3) already discussed. In detail, a neural network 412 includes Convolutional Neural Networks (CNNs) 406 and fully connected portions 408 (e.g., full-connected layer or perceptron of neural networks). The neural network 412 may also receive training data 402 and testing data 404. The neural network 412 may generate high-dimensional representations of the training data 402. For example, outputs of the CNNs 406 may be the high-dimensional representations. The neural network 412 may then output hash codes 410 (e.g., low dimensional representations of the high-dimensional representations). The neural network 412 may determine Hamming distances to determine if two images are similar to each other (e.g., a Hamming distance therebetween is lower than a threshold). A match may be output as "person 1" in this example.

FIG. 5 illustrates a similarity determination process 450 to determine whether images 452 are similar to each other. The similarity determination process 450 may be readily combinable with any of the embodiments described herein. For example, the similarity determination process 450 may implement and/or operate in conjunction with one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) and/or neural network architecture 400 (FIG. 4) already discussed. The images 452 may be vectorized into vectors 454. CNNs 456 may generate low-dimensional outputs based on the vectors 454. A fch layer 458 may generate an output 460 (e.g., a positive value for a match for a specific feature of the high-dimensional outputs or a negative value for mismatch between a specific feature of the high-dimensional outputs). The entire series of matches and mismatches of the features may be represented by graph 462. For example, matching features may be indicated with the dark boxes while mismatches may be indicated by the light boxes. A similarity label 464 may be generated based on the graph 462, and whether a number of the matching features meets a threshold.

FIG. 6 illustrates a conventional example 480 and an enhanced example 478. The enhanced example 478 may be readily combinable with any of the embodiments described herein. For example, the enhanced example 478 may implement and/or operate in conjunction with one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) neural network architecture 400 (FIG. 4) and/or similarity determination process 450 (FIG. 5) already discussed.

In the conventional example 480, a contrastive training process is illustrated. A first image may be transformed into the square images of graph 482 (e.g., a high-dimensional representation), while a second image may be transformed into triangle images (e.g., a high-dimensional representation) of the graph 482. In the conventional example, the black arrows correspond to comparison between dissimilar images, while the white arrows correspond to comparisons between similar images. The conventional example 480 may execute forward propagation based on the B= sgn(h) 484 during forward propagation. Based on the output, the conventional example 480 may generate graph 476 (e.g., a low-dimensional representation). Losses may be based on loss L_hard that includes loss L^true (which is the true noise) and noise n_q (which are inaccuracies). That is, asymmetric behavior between forward and backward yields noises n_q, which is then inefficiently formed as part of the loss. Thus, noises n_q are superimposed in contrastive learning in the conventional example 480. During backward propagation 486, a gradient is based on L^hard which as stated is inaccurate in the conventional example 480.

In the enhanced example 478, a first image may be transformed into the square images of graph 488 (e.g., a high-dimensional representation), while a second image may be transformed into triangle images (e.g., a high-dimensional representation) of the graph 488. During forward propagation 494, function H =g(Q) may modify the graph 488 to graph 490. Furthermore, during forward propagation 498, function B = sgn(H) may modify the graph 490 to graph 492. Since embodiments leverage the loss for soft code, contrastive learning is applied to graph 490 and this makes positives to be closer and negatives to be more distant. Graph 492 shows the hard code. That is every symbol in graph 492 should be -1 or 1. So similar symbols are overlapped while dissimilar images are not overlapped. This is the same as the graph 476. It is worthwhile to note that the correlations between positives from the high-dimensional soft code are transmitted downstream to the low-dimensional binary hard code calculation to be incorporated in a final loss calculation.

During backward propagation 478, parameters of one or more layers (e.g., low-dimensional layers) may be updated based on a gradient that is based on loss L^hard. Furthermore, during backward propagation 496, parameters of one or more layers (e.g., of high-dimensional layers) may be updated based on L^hard and L^soft. Notably, the enhanced example does not include noise as part of the loss calculation and is therefore more accurate.

FIG. 7 illustrates a graph 500 that shows L^hard (conv) as determined by

conventional examples, L^true (conv) which is the true loss and L^hard (prop) as determined by embodiments described herein. As illustrated, the L^hard (conv) is markedly distinct from L^true (conv). In contrast, L^hard (prop) is proximate to L^true (conv) and is better approximation for L^true (conv) than L^hard (conv). Thus, embodiments approximate the L^true (conv) with greater accuracy than conventional examples.

Turning now to FIG. 8, an efficiency and training enhanced computing system 158 is shown. The computing system 158 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), etc., or any combination thereof. In the illustrated example, the computing system 158 includes a host processor 134 (e.g., CPU) having an integrated memory controller (IMC) 154 that is coupled to a system memory 144.

The illustrated computing system 158 also includes an input output (IO) module 142 implemented together with the host processor 134, a graphics processor 132 (e.g., GPU), ROM 136, and AI accelerator 148 on a semiconductor die 146 as a system on chip (SoC). The illustrated IO module 142 communicates with, for example, a display 172 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), a network controller 174 (e.g., wired and/or wireless), FPGA 178 and mass storage 176 (e.g., hard disk drive/HDD, optical disk, solid state drive/SSD, flash memory). The network controller 174 may communicate with a plurality of nodes that implement a neural network. Furthermore, the SoC 146 may further include processors (not shown) and/or the AI accelerator 148 dedicated to artificial intelligence (AI) and/or neural network (NN) processing. For example, the system SoC 146 may include a vision processing unit (VPU) 138 and/or other AI/NN-specific processors such as AI accelerator 148, etc. In some embodiments, any aspect of the embodiments described herein may be implemented in the processors and/or accelerators dedicated to AI and/or NN processing such as AI accelerator 148, the graphics processor 132, VPU 138 and/or the host processor 134.

The graphics processor 132 and/or the host processor 134 may execute instructions 156 retrieved from the system memory 144 (e.g., a dynamic random-access memory) and/or the mass storage 176 to implement aspects as described herein. For example, the graphics processor 132 and/or the host processor 134 may train the neural network by generating high-dimensional representations of a plurality of images, determining a first loss associated with the generation of the high-dimensional representations of the plurality of images, and updating at least one parameter of the neural network based on the first loss.

When the instructions 156 are executed, the computing system 158 may implement one or more aspects of the embodiments described herein. For example, the computing system 158 may implement one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) neural network architecture 400 (FIG. 4) similarity determination process 450 (FIG. 5) and/or enhanced example 478 (FIG. 6) already discussed. The illustrated computing system 158 is therefore considered to be efficiency and training enhanced at least to the extent that it enables the computing system 158 to be trained more efficiently based on approximations that approach true loss of the neural network.

FIG. 9 shows a semiconductor apparatus 186 (e.g., chip, die, package). The illustrated apparatus 186 includes one or more substrates 184 (e.g., silicon, sapphire, 15 gallium arsenide) and logic 182 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 184. In an embodiment, the apparatus 186 is operated in an application development stage and the logic 182 performs one or more aspects of the embodiments described herein, for example, one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) neural network architecture 400 (FIG. 4) similarity determination process 450 (FIG. 5) and/or enhanced example 478 (FIG. 6) already discussed. The logic 182 may be implemented at least partly in configurable logic or fixed-functionality hardware logic. In one example, the logic 182 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 184. Thus, the interface between the logic 182 and the substrate(s) 184 may not be an abrupt junction. The logic 182 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 184.

FIG. 10 illustrates a processor core 200 according to one embodiment. The processor core 200 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 200 is illustrated in FIG. 10, a processing element may alternatively include more than one of the processor core 200 illustrated in FIG. 10. The processor core 200 may be a single-threaded core or, for at least one embodiment, the processor core 200 may be multithreaded in that it may include more than one hardware thread context (or "logical processor") per core.

FIG. 10 also illustrates a memory 270 coupled to the processor core 200. The memory 270 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. The memory 270 may include one or more code 213 instruction(s) to be executed by the processor core 200, wherein the code 213 may implement one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) neural network architecture 400 (FIG. 4) similarity determination process 450 (FIG. 5) and/or enhanced example 478 (FIG. 6) already discussed. The processor core 200 follows a program sequence of instructions indicated by the code 213. Each instruction may enter a front end portion 210 and be processed by one or more decoders 220. The decoder 220 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction. The illustrated front end portion 210 also includes register renaming logic 225 and scheduling logic 230, which generally allocate resources and queue the operation corresponding to the convert instruction for execution.

The processor core 200 is shown including execution logic 250 having a set of 20 execution units 255-1 through 255-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 250 performs the operations specified by code instructions.

After completion of execution of the operations specified by the code instructions, back end logic 260 retires the instructions of the code 213. In one embodiment, the processor core 200 allows out of order execution but requires in order retirement of instructions. Retirement logic 265 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 200 is transformed during execution of the code 213, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 225, and any registers (not shown) modified by the execution

logic 250.

Although not illustrated in FIG. 10, a processing element may include other elements on chip with the processor core 200. For example, a processing element may include memory control logic along with the processor core 200. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches.

Referring now to FIG. 11, shown is a block diagram of a computing system 1000 embodiment in accordance with an embodiment. Shown in FIG. 11 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two

processing elements

1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.

The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG. 11 may be implemented as a multi-drop bus rather than point-to-point interconnect.

As shown in FIG. 11, each of

processing elements

1070 and 1080 may be multicore processors, including first and second processor cores (i.e.,

processor cores

1074a and 1074b and

processor cores

1084a and 1084b).

Such cores

1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG. 10.

Each

processing element

1070, 1080 may include at least one shared

cache

1896a, 1896b. The shared

cache

1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the

cores

1074a, 1074b and 1084a, 1084b, respectively. For example, the shared

cache

1896a, 1896b may locally cache data stored in a

memory

1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared

cache

1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.

While shown with only two

processing elements

1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of

processing elements

1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processors(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to processor a first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the

processing elements

1070, 1080 in terms of a spectrum of metrics of merit including architectural, micro architectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the

processing elements

1070, 1080. For at least one embodiment, the

various processing elements

1070, 1080 may reside in the same die package.

The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and

P-P interfaces

1086 and 1088. As shown in FIG. 11, MC's 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. While the

MC

1072 and 1082 is illustrated as integrated into the

processing elements

1070, 1080, for alternative embodiments the MC logic may be discrete logic outside the

processing elements

1070, 1080 rather than integrated therein.

The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 1086, respectively. As shown in FIG. 11, the I/O subsystem 1090 includes

P-P interfaces

1094 and 1098. Furthermore, I/O subsystem 1090 includes an interface 1092 to couple I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090. Alternately, a point-to-point interconnect may couple these components.

In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.

As shown in FIG. 11, various I/O devices 1014 (e.g., biometric scanners, speakers, cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018 which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment. The illustrated code 1030 may implement the one or more aspects of the neural network architecture 100 (FIG. 1), method 320 (FIG. 2), process 300 (FIG. 3) neural network architecture 400 (FIG. 4) similarity determination process 450 (FIG. 5) and/or enhanced example 478 (FIG. 6) already discussed. Further, an audio I/O 1024 may be coupled to second bus 1020 and a battery 1010 may supply power to the computing system 1000.

Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 11, a system may implement a multi-drop bus or another such communication topology. Also, the elements of FIG. 11 may alternatively be partitioned using more or fewer integrated chips than shown in FIG. 11.

Additional Notes and Examples:

Example 1 includes a computing system to train a neural network, the computing system comprising a network interface to communicate with a plurality of nodes that implement the neural network, a processor, and a memory coupled to the processor, the memory including a set of executable program instructions, which when executed by the processor, cause the computing system to generate high-dimensional representations of a plurality of images, determine a first loss associated with the generation of the high-dimensional representations of the plurality of images, and update at least one parameter of the neural network based on the first loss.

Example 2 includes the computing system of Example 1, wherein the executable program instructions, when executed, cause the computing system to generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations, and generate a similarity measurement between the plurality of images based on the low-dimensional representations.

Example 3 includes the computing system of any one of Examples 1 to 2, wherein the executable program instructions, when executed, cause the computing system to transform an image of the plurality of images into two different images, and generate the high-dimensional representations based on the two different images.

Example 4 includes the computing system of Example 2, wherein the executable program instructions, when executed, cause the computing system to hash the high-dimensional representations to generate the low-dimensional representations.

Example 5 includes the computing system of Example 2, wherein the executable program instructions, when executed, cause the computing system to generate a second loss associated with the similarity measurement, and update the at least one parameter of the neural network based on the second loss.

Example 6 includes the computing system of any one of Examples 1 to 5, wherein the neural network is a contrastive learning neural network.

Example 7 includes a semiconductor apparatus to train a neural network, the semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented in one or more of configurable logic or fixed-functionality logic hardware, the logic coupled to the one or more substrates to generate high-dimensional representations of a plurality of images, determine a first loss associated with the generation of the high-dimensional representations of the plurality of images, and update at least one parameter of the neural network based on the first loss.

Example 8 includes the apparatus of Example 7, wherein the logic coupled to the one or more substrates is to generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations, and generate a similarity measurement between the plurality of images based on the low-dimensional representations.

Example 9 includes the apparatus of any one of Examples 7 to 8, wherein the logic coupled to the one or more substrates is to transform an image of the plurality of images into two different images, and generate the high-dimensional representations based on the two different images.

Example 10 includes the apparatus of Example 8, wherein the logic coupled to the one or more substrates is to hash the high-dimensional representations to generate the low-dimensional representations.

Example 11 includes the apparatus of Example 8, wherein the logic coupled to the one or more substrates is to generate a second loss associated with the similarity measurement, and update the at least one parameter of the neural network based on the second loss.

Example 12 includes the apparatus of any one of Examples 7 to 11, wherein the neural network is a contrastive learning neural network.

Example 13 includes the apparatus of any one of Examples 7 to 12, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.

Example 14 includes at least one computer readable storage medium comprising a set of executable program instructions to train a neural network, which when executed by a computing system, cause the computing system to generate high-dimensional representations of a plurality of images, determine a first loss associated with the generation of the high-dimensional representations of the plurality of images, update at least one parameter of the neural network based on the first loss.

Example 15 includes the at least one computer readable storage medium of Example 14, wherein the instructions, when executed, further cause the computing system to generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations, and generate a similarity measurement between the plurality of images based on the low-dimensional representations.

Example 16 includes the at least one computer readable storage medium of any one of Examples 14 to 15, wherein the instructions, when executed, further cause the computing system to transform an image of the plurality of images into two different images, and generate the high-dimensional representations based on the two different images.

Example 17 includes the at least one computer readable storage medium of Example 15, wherein the instructions, when executed, further cause the computing system to hash the high-dimensional representations to generate the low-dimensional representations.

Example 18 includes the at least one computer readable storage medium of Example 15, wherein the instructions, when executed, further cause the computing system to generate a second loss associated with the similarity measurement, and update the at least one parameter of the neural network based on the second loss.

Example 19 includes the at least one computer readable storage medium of any one of Examples 14 to 18, wherein the neural network is a contrastive learning neural network.

Example 20 includes a method to train a neural network, the method comprising generating high-dimensional representations of a plurality of images, determining a first loss associated with the generation of the high-dimensional representations of the plurality of images, and updating at least one parameter of the neural network based on the first loss.

Example 21 includes the method of Example 20, further comprising generating low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations, and generating a similarity measurement between the plurality of images based on the low-dimensional representations.

Example 22 includes the method of any one of Examples 20 to 21, further comprising transforming an image of the plurality of images into two different images, and generating the high-dimensional representations based on the two different images.

Example 23 includes the method of Example 21, further comprising hashing the high-dimensional representations to generate the low-dimensional representations.

Example 24 includes the method of Example 21, further comprising generating a second loss associated with the similarity measurement, and updating the at least one parameter of the neural network based on the second loss.

Example 25 includes the method of any one of Examples 20 to 24, wherein the neural network is a contrastive learning neural network.

Example 26 includes a semiconductor apparatus to train a neural network, the semiconductor apparatus comprising means for generating high-dimensional representations of a plurality of images, means for determining a first loss associated with the generation of the high-dimensional representations of the plurality of images, and means for updating at least one parameter of the neural network based on the first loss.

Example 27 includes the apparatus of Example 26, further comprising means for generating low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations, and means for generating a similarity measurement between the plurality of images based on the low-dimensional representations.

Example 28 includes the apparatus of any one of Examples 26 to 27, further comprising means for transforming an image of the plurality of images into two different images, and means for generating the high-dimensional representations based on the two different images.

Example 29 includes the apparatus of Example 27, further comprising means for hashing the high-dimensional representations to generate the low-dimensional representations.

Example 30 includes the apparatus of Example 27, further comprising means for generating a second loss associated with the similarity measurement, and means for updating the at least one parameter of the neural network based on the second loss.

Example 31 includes the apparatus of any one of Examples 26 to 30, wherein the neural network is to be a contrastive learning neural network.

Example 32 includes an apparatus comprising means for performing the method of any one of Examples 20 to 25.

Thus, technology described herein may provide for an enhanced profiling system that enables determination of drift estimates based on unlabeled data. Embodiments as described herein may determine the drift between a training dataset that is used to train a neural network model, and an input dataset that the neural network processes.

Embodiments are applicable for use with all types of semiconductor integrated circuit ("IC") chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.

Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.

The term "coupled" may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms "first", "second" etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.

As used in this application and in the claims, a list of items joined by the term "one or more of" may mean any combination of the listed terms. For example, the phrases "one or more of A, B or C" may mean A, B, C; A and B; A and C; B and C; or A, B and C.

Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.

Claims

A computing system to train a neural network, the computing system comprising:

a network interface to communicate with a plurality of nodes that implementthe neural network;

a processor; and

a memory coupled to the processor, the memory including a set of executable program instructions, which when executed by the processor, cause the computing system to:

generate high-dimensional representations of a plurality of images,

determine a first loss associated with the generation of the high-dimensional

representations of the plurality of images, and

update at least one parameter of the neural network based on the first loss.
The computing system of claim 1, wherein the executable program instructions, when executed, cause the computing system to:

generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations; and

generate a similarity measurement between the plurality of images based on the low-dimensional representations.
The computing system of claim 2, wherein the executable program instructions, when executed, cause the computing system to:

transform an image of the plurality of images into two different images; and

generate the high-dimensional representations based on the two different images.
The computing system of claim 2, wherein the executable program instructions, when executed, cause the computing system to:

hash the high-dimensional representations to generate the low-dimensional representations.
The computing system of claim 2, wherein the executable program instructions, when executed, cause the computing system to:

generate a second loss associated with the similarity measurement; and update the at least one parameter of the neural network based on the second loss.
The computing system of any one of claims 1 to 5, wherein the neural network is a contrastive learning neural network.
A semiconductor apparatus to train a neural network, the semiconductor apparatus comprising:

one or more substrates; and

logic coupled to the one or more substrates, wherein the logic is implemented in one or more of configurable logic or fixed-functionality logic hardware, the logic coupled to the one or more substrates to:

generate high-dimensional representations of a plurality of images;

determine a first loss associated with the generation of the high-dimensional representations of the plurality of images; and

update at least one parameter of the neural network based on the first loss.
The apparatus of claim 7, wherein the logic coupled to the one or more substrates is to:

generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations; and

generate a similarity measurement between the plurality of images based on the low-dimensional representations.
The apparatus of claim 8, wherein the logic coupled to the one or more substrates is to:

transform an image of the plurality of images into two different images; and

generate the high-dimensional representations based on the two different images.
The apparatus of claim 8, wherein the logic coupled to the one or more substrates is to:

hash the high-dimensional representations to generate the low-dimensional representations.
The apparatus of claim 8, wherein the logic coupled to the one or more substrates is to:

generate a second loss associated with the similarity measurement; and

update the at least one parameter of the neural network based on the second loss.
The apparatus of any one of claims 7 to 11, wherein the neural network is a contrastive learning neural network.
The apparatus of any one of claims 7 to 11, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.
At least one computer readable storage medium comprising a set of executable program instructions to train a neural network, which when executed by a computing system, cause the computing system to:

generate high-dimensional representations of a plurality of images;

determine a first loss associated with the generation of the high-dimensional representations of the plurality of images; and

update at least one parameter of the neural network based on the first loss.
The at least one computer readable storage medium of claim 14, wherein the instructions, when executed, further cause the computing system to:

generate low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations; and

generate a similarity measurement between the plurality of images based on the low-dimensional representations.
The at least one computer readable storage medium of claim 15, wherein the instructions, when executed, further cause the computing system to:

transform an image of the plurality of images into two different images; and

generate the high-dimensional representations based on the two different images.
The at least one computer readable storage medium of claim 15, wherein the instructions, when executed, further cause the computing system to:

hash the high-dimensional representations to generate the low-dimensional representations.
The at least one computer readable storage medium of claim 15, wherein the instructions, when executed, further cause the computing system to:

generate a second loss associated with the similarity measurement; and

update the at least one parameter of the neural network based on the second loss.
The at least one computer readable storage medium of any one of claims 14 to 18, wherein the neural network is a contrastive learning neural network.
A method to train a neural network, the method comprising:

generating high-dimensional representations of a plurality of images;

determining a first loss associated with the generation of the high-dimensional representations of the plurality of images; and

updating at least one parameter of the neural network based on the first loss.
The method of claim 20, further comprising:

generating low-dimensional representations of the plurality of images based on the high-dimensional representations, wherein the high-dimensional representations are continuous data representations, and the low-dimensional representations are binary representations; and

generating a similarity measurement between the plurality of images based on the low-dimensional representations.
The method of claim 21, further comprising:

transforming an image of the plurality of images into two different images; and

generating the high-dimensional representations based on the two different images.
The method of claim 21, further comprising:

hashing the high-dimensional representations to generate the low-dimensional representations.
The method of claim 21, further comprising:

generating a second loss associated with the similarity measurement; and

updating the at least one parameter of the neural network based on the second loss.
An apparatus comprising means for performing the method of any one of claims 20 to 24.