WO2023167530A1

WO2023167530A1 - Method for classifying images using novel classes

Info

Publication number: WO2023167530A1
Application number: PCT/KR2023/002914
Authority: WO
Inventors: Adrian BULAT; Ricardo Guerrero MORENO; Brais Martinez ALONSO; Georgios TZIMIROPOULOS
Original assignee: Samsung Electronics Co., Ltd.
Priority date: 2022-03-04
Filing date: 2023-03-03
Publication date: 2023-09-07

Abstract

Broadly speaking, the present techniques generally relate to a computer-implemented method for using a trained machine learning, ML, model to classify images depicting novel classes, without retraining the ML model to recognise the novel classes.

Description

METHOD FOR CLASSIFYING IMAGES USING NOVEL CLASSES

The present application generally relates to a method for performing image classification using novel classes. In particular, the present application relates to a computer-implemented method for using a trained machine learning, ML, model to classify images depicting novel classes, without retraining the ML model to recognise the novel classes.

Due to the advent of deep learning, object detection has witnessed tremendous progress over the last years. However, the standard setting of training and testing on a closed set of classes has specific important limitations. Firstly, it is unfeasible to annotate all objects of relevance present in-the-wild, thus, current systems are trained only on a small subset. It does not seem straightforward to significantly scale up this figure. Secondly, human perception operates mostly under the open set recognition/detection setting. Humans can detect/track new unseen objects on the fly, typically using a single template, without requiring any "re-training" or "fine-tuning" of their "detection" skills, arguably a consequence of the prior representation learned, an aspect we sought to exploit here too. Thirdly, important applications in robotics, where agents may interact with previously unseen objects, might require their subsequent detection on the fly without any re-training. Few-Shot Object Detection (FSOD) refers to the problem of detecting a novel class not seen during training and, hence, can potentially address many of the aforementioned challenges.

There are still important desiderata that current FSOD systems must address in order to be practical and flexible to use. For example, FSOD systems must be used as is, without requiring any re-training (e.g. fine-tuning) at test time, a crucial component for autonomous exploration. However, many existing state-of-the-art FSOD systems rely on re-training with the few available examples of the unseen classes. While such systems are still useful, the requirement for re-training makes them significantly more difficult to deploy on the fly and in real-time or on devices with limited capabilities for training. Similarly, FSOD systems must be able to handle an arbitrary number of novel objects (and moreover an arbitrary number of examples per novel class) simultaneously during test time, in a single forward pass without requiring batching. This is akin to how closed systems work, which are able to detect multiple objects concurrently. Furthermore, FSOD systems must attain classification accuracy that is comparable to that of closed systems. However, existing FSOD systems are far from achieving such high accuracy, especially for difficult datasets.

Therefore, the present applicant has recognised the need for improvements in image classification methods.

In a first approach of the present techniques, there is provided a computer-implemented method for using a trained machine learning, ML, model to classify images depicting novel classes, the method comprising: receiving an image containing at least one object to be classified; inputting, into the trained ML model, the received image and at least one template depicting a novel class, wherein the at least one template conditions the trained ML model; and outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.

As explained in more detail below with reference to the Figures, the present techniques provide a trained ML model which is able to identify new classes (i.e. objects) within images without needing to be retrained to do so. New/novel classes are classes which the ML model has not encountered during the training process. Advantageously, this means that the trained ML model may be provided to devices for use and can adapt, on-device, to new classes without needing to be retrained. This is particularly useful because it may not be possible to retrain a ML model on resource-constrained devices. It is also useful because, not having to retrain the ML model, means the model is more quickly able to identify new classes. The ML model of the present techniques uses templates/samples of novel classes to condition the trained ML model so that the model is able to quickly identify the new classes within input images.

The trained ML model uses the at least one template depicting a novel class as a visual prompt, to help the ML model to locate and classify an object in the received image.

The trained ML model comprises a conventional neural network, CNN, backbone for feature extraction. The CNN backbone is used to extract features from the received image, and to extract features from the at least one template for the novel class. The CNN may also generate position information for the extracted feature information. Thus, the trained ML model may extract, using a convolutional neural network backbone of the trained ML model, visual features from the received image and the at least one template.

The trained ML model comprises an encoder and decoder. The encoder and decoder may be of a transformer model. As such, the CNN backbone may divide the received image into patches or image tokens, and these patches/ image tokens are processed by the encoder and decoder of the transformer model. The extracted feature information (and position information) may be for each patch/image token. The encoder processes the feature information extracted from the received image by the CNN backbone. The encoder may perform self-attention on the extracted visual information. That is, the encoder may determine the relevance or importance of each image token relative to the other image tokens. Thus, the trained ML model may: perform, using a transformer encoder of the trained ML model, self-attention on the extracted visual features of the received image and generates a self-attention result;

The encoder may comprise a multi-head cross-attention layer which functions to filter and highlight early on, before decoding, image tokens/patches of interest. This advantageously increases few-shot accuracy. That is, the encoder may perform cross-attention using the extracted visual features and the features of the at least one template to determine the relevance or importance of each image token relative to the template(s). Thus, the trained ML model may: perform using the transformer encoder of the trained ML model, cross-attention between the extracted visual features of the received image and the at least one template and generates a cross-attention result.

The decoder of the trained ML model may accept the outputs of the encoder and uses these to predict a bounding box around each object in the received image and a pseudo-class prediction. Thus, the trained ML model may: process, using a transformer decoder of the trained ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the received image and a class for the object in the bounding box.

Inputting at least one template may comprise inputting an image depicting a single object. That is, each template may depict a single object.

The method may further comprise: obtaining, from a user, an image depicting an object to be recognised by the trained ML model; and generating a template of the object by cropping the object from the obtained image. That is, the user may want the trained ML model to identify a specific object, and may therefore provide the model with an image depicting the specific object so that a template of the object can be generated. For example, if the user wants the trained ML model to identify their pet dog in images, the user may provide at least one image depicting their pet dog, so that at least one template of the pet dog can be generated. This advantageously enables on-device personalisation of a trained ML model.

Generating a template may comprise: augmenting the obtained image by applying an image transformation to the obtained image prior to cropping the object. This may help the trained model to focus on important features relating to the object. Any suitable image transformation may be applied, such as colour jittering, random gray scale, Gaussian bur, and so on.

In a second approach of the present techniques, there is provided an apparatus for using a trained machine learning, ML, model to classify images depicting novel classes, the apparatus comprising: at least one processor coupled to memory, and arranged for: receiving an image containing at least one object to be classified; inputting, into the trained ML model, the received image and at least one template depicting a novel class, wherein the at least one template conditions the trained ML model; and outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.

The features described above in relation to the first approach may apply equally to the second approach and therefore, for the sake of conciseness, are not repeated.

The apparatus may be, for example, a constrained-resource device, but which has the minimum hardware capabilities to use a trained neural network/ML model. The apparatus may be: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, a smart consumer device, a smartwatch, a fitness tracker, and a wearable device. It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.

The apparatus may comprise at least one image capture device configured to capture an image to be processed by the trained neural network/ML model. The image capture device may be a camera. Additionally or alternatively the apparatus may comprise at least one interface for receiving an image for analysis.

The image classification process may also function on pre-captured images. Thus, the apparatus may further comprise: storage storing at least one image; and at least one interface for receiving a user query. The at least one processor may be arranged to: receive, via the at least one interface, a user query requesting any image from the storage that contains a specific object; use the trained ML model to identify any image containing the specific object; and output each image containing the specific object to the user via the at least one interface. For example, the user query may be "Hey Bixby, find images on my gallery showing my dog". The image classification process be used to identify any image in the storage that shows the user's dog, and then these images may be output to the user. The user may speak or type their query, and may input an image showing their dog so that the trained ML model can generate a template of the dog and use this to identify the same dog in the images in the gallery. The output(s) may be displayed on a display screen of the apparatus or a response may be output via a speaker (e.g. "We have found two images of your dog"). Any outputs which are incorrect, e.g. show dogs but not the user's dog, can be annotated by the user and then used as negative templates/samples by the trained ML model in the future.

The present techniques may be used by a robot assistant device to, for example, navigate through a new environment or to identify or interact with new objects, without having to be retrained first.

In a third approach of the present techniques, there is provided a computer-implemented method for training a machine learning, ML, model to classify images depicting novel classes, the method comprising: obtaining a set of images, each image depicting at least one object, and a set of templates for known objects and classes, each template depicting a single object; inputting an image from the set of images and at least one template from the set of templates into the ML model; and training the ML model using the image and the at least one template to output a bounding box around the object in the image and a predicted class for the object in the bounding box.

As explained in more detail below with reference to the Figures, the present techniques also provide a new training technique which enables the trained ML model to identify novel objects/classes in images. The present techniques use a pre-training process which is label-free, i.e. does not require labelled data, and which closely mimics the full training process. More specifically, an input image depicting at least one object is obtained, and the at least one object is identified within the input image. The at least one object is then cropped out of the input image to form a template. Each such template represents a single object and can be mapped to a pseudo-class. The goal of the pre-training process is to predict the location of these templates within the original input image. To make the task harder, the templates may be augmented using a set of random image transformations. The ML model is then trained using a regression and a classification loss.

The step of obtaining a set of images and a set of templates may comprise: obtaining a training dataset comprising a plurality of images depicting a plurality of objects; dividing the training dataset into the set of images, and a further set of images; and generating the set of templates using the further set of images. That is, the training dataset may used to form a set of images and a set of templates.

Generating the set of templates using the further set of images may comprise cropping objects from the further set of images, such that each template depicts a single object.

Generating the set of templates may comprise: augmenting the images in the further set of images by applying an image transformation to the images prior to cropping the object.

Generating the set of templates may comprise: assigning, to each template depicting an object of a specific class, a pseudo-class embedding.

Training the ML model may comprise: extracting, using a pre-trained convolutional neural network backbone of the ML model, visual features from the image and the at least one template.

Training the ML model may comprise: performing, using a transformer encoder of the ML model, self-attention on the extracted visual features of the image and generates a self-attention result; and performing, using the transformer encoder of the ML model, cross-attention between the extracted visual features of the image and the at least one template and generates a cross-attention result.

Training the ML model may comprise: processing, using a transformer decoder of the ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the image and a class for the object in the bounding box.

In a fourth approach of the present techniques, there is provided a server for training a machine learning, ML, model to classify images depicting novel classes, the server comprising: a database comprising a set of images, each image depicting at least one object, and a set of templates, each template depicting a single object; and at least one processor coupled to memory, arranged for: inputting an image from the set of images and at least one template from the set of templates into the ML model; and training the ML model using the image and the at least one template to output a bounding box around the object in the image and a predicted class for the object in the bounding box.

The features described above in relation to the third approach may apply equally to the fourth approach and therefore, for the sake of conciseness, are not repeated.

In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, causes the processor to carry out any of the methods described herein.

As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.

Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.

Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.

The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.

It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.

In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.

The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.

As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.

The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.

The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram showing an existing technique for identifying multiple objects within an image.

Figure 2 is a schematic diagram illustrating the present techniques for identifying multiple objects within an image.

Figure 3A is a schematic diagram illustrating the present techniques for identifying N classes within an image, and Figure 3B is a schematic diagram showing an existing technique for binary classification.

Figure 4A is a schematic diagram illustrating the present techniques for identifying a class within an image using positive and negative samples, and Figure 4B is a schematic diagram showing an existing technique for identifying a class within an image using only a positive sample.

Figure 5A is a schematic diagram illustrating the present techniques for identifying N classes within an image using multiple templates for a class, and Figure 5B is a schematic diagram showing an existing technique for identifying N classes.

Figure 6 is a flowchart of example steps to classify images containing novel classes using the trained ML model of the present techniques.

Figure 7 is a block diagram of an apparatus for using a trained machine learning, ML, model to classify images depicting novel classes.

Figure 8A is a schematic diagram illustrating the present techniques for identifying novel classes within an image, and Figure 8B is a schematic diagram showing an existing technique for identifying novel classes.

Figure 9A is a schematic diagram illustrating the present techniques for identifying novel classes within an image, and Figure 9B is a schematic diagram showing an existing technique for identifying novel classes.

Figure 10 is a flowchart of example steps for training a machine learning, ML, model to classify images depicting novel classes.

Figure 11 shows the proposed machine learning, ML, model of the present techniques.

Figure 12 shows results of experiments performed using the present model, and existing techniques.

Figure 13 shows results of experiments performed using the present model, and existing techniques.

Figure 14 is a table showing the results of experiments on the choice of template encoder design.

Figure 15 is a table showing the results of experiments on the pre-training stage.

Figure 16 is a table showing the results of experiments to determine impact on individual components.

Figure 1 is a schematic diagram showing an existing technique for identifying multiple objects within an image. Given an input image 10, the goal is to identify all the classes /objects within the image. In the image 10, there are at least three objects: a child, a dog, and a frisbee. Existing techniques to identify all three objects need to process the same image three times to detect all three objects. That is, as shown in Figure 1, an existing ML model 12 may first identify the dog, then the child, and then the frisbee. The existing ML model 12 may use templates 14 for each object/class, which can help the model to identify the three objects within the image 10. The output of the ML model 12 during each pass may be the coordinates of a bounding box around the identified object. That is, existing techniques require N forward passes to detect N objects. This is a slow process, which is slower the more objects there are to detect within an image.

In contrast to the existing technique shown in Figure 1, the present techniques provide a trained ML model which is able to identify new classes (i.e. objects) within images without needing to be retrained to do so. New/novel classes are classes which the ML model has not encountered during the training process. Advantageously, this means that the trained ML model may be provided to devices for use and can adapt, on-device, to new classes without needing to be retrained. This is particularly useful because it may not be possible to retrain a ML model on resource-constrained devices. It is also useful because, not having to retrain the ML model, means the model is more quickly able to identify new classes. The ML model of the present techniques uses templates/samples of novel classes to condition the trained ML model so that the model is able to quickly identify the new classes within input images.

Figure 2 is a schematic diagram illustrating the present techniques for identifying multiple objects within an image. The goal here is the same as in Figure 1, i.e. to identify all the classes /objects within the image 10. The ML model 20 may use templates 22 for each object/class, which can help the model 20 to identify the three objects within the image 10. The model 20 of the present techniques is able to identify multiple classes/objects, corresponding to the three templates 22, within the image 10 using a single forward pass. That is, the model 20 is able to identify all three objects simultaneously. This means the ML model 20 is advantageously around N times faster than the existing techniques of Figure 1, which require N passes to identify N objects. The output of the ML model of the present techniques may be the coordinates of a bounding box around the identified object, and a class ID for each object. The class ID is described in more detail below. Example bounding boxes 200 are shown in Figure 11.

Thus, the present techniques provide a computer-implemented method for using a trained machine learning, ML, model to classify images depicting novel classes, the method comprising: receiving an image 10 containing at least one object to be classified; inputting, into the trained ML model 20 , the received image 10 and at least one template 22 depicting a novel class, wherein the at least one template 22 conditions the trained ML model 20; and outputting, from the trained ML model 20, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.

Figure 3A is a schematic diagram illustrating the present techniques for identifying N classes within an image. The present ML model supports N classes in a single pass because a pseudo-label approach is used (as explained in more detail below). During training of the ML model, an N-way classification problem defined by the pseudo-labels is minimised. That is, during training of the ML model, templates or samples which are used to train the ML model are assigned pseudo-labels (e.g. 1, 2, 3), rather than real, human-understandable labels (e.g. dog, child, frisbee). In contrast, Figure 3B is a schematic diagram showing an existing technique for binary classification. Existing techniques are only able to determine whether an object in an image does or does not match a template, and therefore are slower at identifying objects in an image.

Figure 4A is a schematic diagram illustrating the present techniques for identifying a class within an image using positive and negative samples. As shown, the goal is to detect a specific object within an image, which in this example is the black dog in an image showing a black dog and a white dog. The present ML model is able to use both positive and negative samples/templates as inputs, which enables the ML model to better discriminate between similar objects (e.g. in this case, dogs of different colours). For example, if the ML model is used to identify particular images within a user's photo gallery, and some results are not correct, these can be marked as incorrect and the model can use these incorrect samples to identify the required particular images. Figure 4B is a schematic diagram showing an existing technique for identifying a class within an image using only a positive sample. Thus, the existing techniques only use positive samples, and therefore may identify both dogs in the image rather than only the black dog.

Figure 5A is a schematic diagram illustrating the present techniques for identifying N classes within an image using multiple templates for a class. As mentioned above with respect to Figure 2, the ML model of the present techniques are able to identify N classes or objects within an image during a single pass. The present model is also able to identify N classes using differing numbers of templates for each class. In this example, in order to identify the three objects/classes in the image (child, dog, frisbee), the ML model is provided with templates for each class. However, the ML model is provided with one template each for frisbee and child, and three templates for dog. Advantageously, this means that multiple templates can be provided for classes/objects which may be more difficult to identify. Figure 5B is a schematic diagram showing an existing technique for identifying N classes. The existing technique is unable to handle a variable number of templates/samples. In contrast to the present techniques, existing techniques require separate models to handle differing numbers of templates/samples for a class, as shown. This is impractical to implement on a resource-constrained device, which may not have the memory to store multiple models for performing the same task.

Figure 6 is a flowchart of example steps to classify images containing novel classes using the trained ML model of the present techniques. The method comprises: receiving an image containing at least one object to be classified (step S100); inputting, into the trained ML model, the received image and at least one template depicting a novel class, wherein the at least one template conditions the trained ML model (step S102); and outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box (step S104).

The trained ML model comprises a conventional neural network, CNN, backbone for feature extraction. The CNN backbone is used to extract features from the received image, and to extract features from the at least one template for the novel class. The CNN may also generate position information for the extracted feature information. Thus, at step S102, the trained ML model may extract, using a convolutional neural network backbone of the trained ML model, visual features from the received image and the at least one template.

The trained ML model comprises an encoder and decoder. The encoder and decoder may be of a transformer model. As such, the CNN backbone may divide the received image into patches or image tokens, and these patches/ image tokens are processed by the encoder and decoder of the transformer model. The extracted feature information (and position information) may be for each patch/image token. The encoder processes the feature information extracted from the received image by the CNN backbone. The encoder may perform self-attention on the extracted visual information. That is, the encoder may determine the relevance or importance of each image token relative to the other image tokens. Thus, at step S102, the trained ML model may: perform, using a transformer encoder of the trained ML model, self-attention on the extracted visual features of the received image and generates a self-attention result;

The encoder may comprise a multi-head cross-attention layer which functions to filter and highlight early on, before decoding, image tokens/patches of interest. This advantageously increases few-shot accuracy. That is, the encoder may perform cross-attention using the extracted visual features and the features of the at least one template to determine the relevance or importance of each image token relative to the template(s). Thus, at step S102, the trained ML model may: perform using the transformer encoder of the trained ML model, cross-attention between the extracted visual features of the received image and the at least one template and generates a cross-attention result.

The decoder of the trained ML model may accept the outputs of the encoder and uses these to predict a bounding box around each object in the received image and a pseudo-class prediction. Thus, at step S104, the trained ML model may: process, using a transformer decoder of the trained ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the received image and a class for the object in the bounding box.

At step S102,iInputting at least one template may comprise inputting an image depicting a single object. That is, each template may depict a single object.

Figure 7 is a block diagram of an apparatus 100 for using a trained machine learning, ML, model to classify images depicting novel classes. The apparatus 100 may be, for example, a constrained-resource device, but which has the minimum hardware capabilities to use a trained neural network/ML model. The apparatus may be: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, a smart consumer device, a smartwatch, a fitness tracker, and a wearable device. It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.

The apparatus comprises a trained machine learning, ML, model 106.

The apparatus comprises at least one processor 102 coupled to memory 104. The processor may be arranged to: receiving an image containing at least one object to be classified; inputting, into the trained ML model 106, the received image and at least one template 116 depicting a novel class, wherein the at least one template conditions the trained ML model; and outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.

The at least one processor 102 may comprise one or more of: a microprocessor, a microcontroller and an integrated circuit. The memory 104 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.

The apparatus 100 may comprise at least one image capture device 112 configured to capture an image. The image capture device 112 may be a camera. Additionally or alternatively the apparatus 100 may comprise at least one interface for receiving an image for analysis.

The image classification process may also function on pre-captured images. Thus, the apparatus may further comprise: storage 108 storing at least one image 110; and at least one interface for receiving a user query. The at least one processor may be arranged to: receive, via the at least one interface, a user query requesting any image 110 from the storage 108 that contains a specific object; use the trained ML model 106 to identify any image containing the specific object; and output each image containing the specific object to the user via the at least one interface. For example, the user query may be "Hey Bixby, find images on my gallery showing my dog". The image classification process be used to identify any image in the storage 108 that shows the user's dog, and then these images may be output to the user. The user may speak or type their query, and may input an image showing their dog so that the trained ML model can generate a template of the dog and use this to identify the same dog in the images in the gallery. The output(s) may be displayed on a display screen 114 of the apparatus or a response may be output via a speaker (e.g. "We have found two images of your dog"). Any outputs which are incorrect, e.g. show dogs but not the user's dog, can be annotated by the user and then used as negative templates/samples by the trained ML model in the future.

As explained in more detail below, the present techniques also provide a new training technique which enables the trained ML model to identify novel objects/classes in images. The present techniques use a pre-training process which is label-free, i.e. does not require labelled data, and which closely mimics the full training process. More specifically, an input image depicting at least one object is obtained, and the at least one object is identified within the input image. The at least one object is then cropped out of the input image to form a template. Each such template represents a single object and can be mapped to a pseudo-class. The goal of the pre-training process is to predict the location of these templates within the original input image. To make the task harder, the templates may be augmented using a set of random image transformations. The ML model is then trained using a regression and a classification loss.

Figure 8A is a schematic diagram illustrating the present techniques for training the ML model. The present training technique uses an unsupervised pre-training strategy, and no manual labelling effort is required for pre-training. Figure 8B is a schematic diagram showing an existing technique for training a model. In contrast to the present techniques, the existing technique requires pre-training on labelled data.

Figure 9A is a schematic diagram illustrating the present techniques for identifying novel classes within an image. The present techniques can directly find a location of a novel object based on the template(s) directly, and without retraining the model. That is, the present ML model can be provided with a template of a novel class (i.e. a class that the ML model has not seen during training), and is able to identify the novel class in an input image. No retraining of the ML model is required for it to be able to identify the novel class. This advantageously means that when the ML model is running on a resource-constrained device, the model is able to adapt to new classes without retraining (as training could be difficult on-device). Figure 9B is a schematic diagram showing an existing technique for identifying novel classes. In contrast to the present techniques, existing techniques require the model to be retrained using the new sample/template. However, retraining on-device can be expensive and problematic. Without access to all the original samples used to train the model, the model may forget the classes it previously new when the training using the new sample takes place. The retraining process also means the ML model cannot be used to identify new classes quickly - every time a sample for a new class is presented, the ML model needs to be retrained/finetuned. In contrast, the present techniques can be used immediately to identify new classes.

Figure 10 is a flowchart of example steps for training a machine learning, ML, model to classify images depicting novel classes, the method comprising: obtaining a set of images, each image depicting at least one object, and a set of templates for known objects and classes, each template depicting a single object (step S200); inputting an image from the set of images and at least one template from the set of templates into the ML model (step S202); and training the ML model using the image and the at least one template to output a bounding box around the object in the image and a predicted class for the object in the bounding box (step S204).

The step S202 of obtaining a set of images and a set of templates may comprise: obtaining a training dataset comprising a plurality of images depicting a plurality of objects; dividing the training dataset into the set of images, and a further set of images; and generating the set of templates using the further set of images. That is, the training dataset may used to form a set of images and a set of templates.

Training, at step S204, the ML model may comprise: extracting, using a pre-trained convolutional neural network backbone of the ML model, visual features from the image and the at least one template.

Training, at step S204, the ML model may comprise: performing, using a transformer encoder of the ML model, self-attention on the extracted visual features of the image and generates a self-attention result; and performing, using the transformer encoder of the ML model, cross-attention between the extracted visual features of the image and the at least one template and generates a cross-attention result.

Training, at step S204, the ML model may comprise: processing, using a transformer decoder of the ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the image and a class for the object in the bounding box.

The model of the present techniques is now described in more detail.

DEtection TRansformer (DETR) approaches: After revolutionizing NLP, Transformer-based architectures have started making significant impact in computer vision problems. In object detection, methods are typically grouped into two-stage (proposal-based) and single-stage (proposal-free) methods. In this field, a recent breakthrough is the DEtection TRansformer (DETR) (Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.), which is a single-stage approach that treats the task as a direct set prediction without requiring hand-crafted components, like non-maximum suppression or anchor generation. Specifically, DETR is trained in an end-to-end manner using a set loss function which performs bipartite matching between the predicted and the ground-truth bounding boxes. Because DETR has slow training convergence, several methods have been proposed to improve it. For example, Conditional DETR learns a conditional spatial query from the decoder embeddings that are used in the decoder for cross-attention with the image features. Deformable DETR proposes deformable attention in which attention is performed only over a small set of key sampling points around a reference point. Unsupervised pre-training of DETR improves its convergence, where randomly cropped patches are summed to the object queries and the model is then trained to detect them in the original image. A follow-up work, DETReg, replaces the random crops with proposals generated by Selective Search.

Few Shot Object Detection (FSOD) methods can be categorised into re-training based and without re-training methods. Re-training based methods assume that during test time, but before deployment, the provided samples of the novel categories can be used to fine-tune the model. This setting is restrictive as it requires training before deployment. Instead, without re-training methods can be directly deployed on the fly for the detection of novel examples.

Re-training based approaches can be divided into meta-learning and fine-tuning approaches. Meta-learning based approaches attempt to transfer knowledge from the base classes to the novel classes through meta-learning. Fine-tuning based methods follow the standard pre-train and fine-tune pipeline. They have been shown to significantly outperform meta-learning approaches.

Without re-training approaches are primarily based on metric learning. A standard approach uses cross-attention between the backbone's features and the query's features to refine the proposal generation, then re-uses the query to re-weight the RoI features channel-wise (in a squeeze-and-excitation manner) for novel class classification.

As mentioned above, there are at least three challenges for FSOD systems: (1) they must be used as is, without requiring any re-training (e.g. fine-tuning) at test time; (2) they must be able to handle an arbitrary number of novel objects (and moreover an arbitrary number of examples per novel class) simultaneously during test time, in a single forward pass without requiring batching; and (3) they must attain classification accuracy that is comparable to that of closed systems.

The present techniques aim to significantly advance the state-of-the-art in all three of these challenges. To this end, and building upon the DETR framework, the present techniques provide a system, called Few-Shot DETR (FS-DETR), capable of detecting multiple novel classes at once, supporting a variable number of examples per class, and importantly, without any extra re-training. In the present techniques, the visual template(s) from the new class(es) are used, during test time, in two ways: (1) in FS-DETR's encoder to filter the backbone's image features via cross-attention, and more importantly, (2) as visual prompts in FS-DETR's decoder, "stamped" with special pseudo-class encodings and prepended to the learnable object queries. The pseudo-class encodings are used as pseudo-classes which a classification head attached to the object queries is trained to predict via a Cross-Entropy loss. Finally, the output of the decoder are the predicted pseudo-classes and regressed bounding boxes.

Contrary to prior work, FS-DETR, akin to soft-prompting, "instructs" the model in the input space regarding object visual information about the searched templates. The model is capable of predicting for each prompt (i.e. visual template) all the locations at which it is present in the image, if any. This is achieved without any additional modules or carefully engineered structures and feature filtering mechanisms. Instead, the present techniques directly append the prompts to the object queries of the decoder.

In summary, some of the advantages of the present techniques include:

A fine-tuning-free Few-Shot DEction TRanformer (FS-DETR) based on prompting which is capable of detecting multiple novel objects at once, and can support an arbitrary number of samples per class in an efficient manner. These features can be enabled by extending DETR based on two key ideas: (1) feed the provided visual templates of novel classes as visual prompts during test time, and (2) "stamp" these prompts with (class agnostic) pseudo-class embeddings, which are then predicted at the output of the decoder along with bounding boxes.

A simple and efficient yet powerful pipeline consisting of unsupervised pre-training followed by prompt-like base class training.

In addition to being more flexible, FS-DETR matches and outperforms state-of-the-art results on the standard FSOD setting on PASCAL VOC and MSCOCO. Specifically, FS-DETR outperforms existing not re-trained methods and most existing re-training based methods on extreme few-shot settings (k = 1,2), while being competitive for more shots.

Given a dataset where each image is annotated with a set of bounding boxes representing the instantiations of C known base classes, the goal of the present techniques is to train a model capable of localizing objects belonging to novel classes, i.e. unseen during training, using up to k examples per novel class. In practice, the available datasets are partitioned into two disjoint sets, one containing

classes for testing, and another with

classes for training ( i.e

and

).

Overview of FS-DETR. The proposed Few-Shot DEtection TRansformer (FS-DETR) is built upon DETR's architecture. (In practice, due to its superior convergence properties, the Conditional DETR mentioned aboved is used as the basis of the implementation, but for simplicity of exposition, the original DETR architecture is used). FS-DETR's architecture consists of: (1) the CNN backbone used to extract visual features from the target image and the templates, (2) a transformer encoder that performs self-attention on the image tokens and cross-attention between the templates and the image tokens, and (3) a transformer decoder that processes object queries and templates to make predictions for pseudo-classes (see also below) and bounding boxes. Contrary to prior works, the present techniques process an arbitrary number of templates (i.e new classes) jointly, in a single forward pass, i.e without requiring batching, significantly improving the efficiency of the process.

Key contributions: DETR re-formulates object detection as a set prediction problem, making object predictions by "tuning" a set of N learnable queries

to the image features through cross-attention. The queries

are used as prompts in DETR for closed-set object detection. To accommodate for open-set FSOD, novel classes' templates are provided as additional visual prompts to the system in order to condition and control the detector's output. To train the system, these prompts are also "stamped" with pseudo-class embeddings, which are then predicted by the decoder along with bounding boxes. The proposed FS-DETR is depicted in Figure 11.

Figure 11 shows the proposed machine learning, ML, model of the present techniques, i.e. FS-DETR. In FS-DETR, the available templates are provided as additional visual prompts to the system in order to condition and control the output. To train and test the system, these prompts are "stamped" with pseudo-class embeddings, which are predicted at the output of the decoder along with bounding boxes. Note, there is no correlation between actual classes and pseudo-classes, e.g. the cat could be of any of the possible classes (e.g. "blue" or "red", or, "1" or "2", etc.) as there is no preferred order. FS-DETR naturally supports k -shot detection, as the model can process multiple examples per class at once. Templates belonging to the same class will share the same pseudo-class embedding.

The FS-DETR architecture and main components are now described.

Template encoding: Let

be the template images of the available classes (sampled from

during training) where m is the number of classes at the current training iteration (m can vary), and k is the number of examples per class ( i.e k-shot detection; k can also vary). A CNN backbone ( e.g ResNet-50) generates template features

using either average or attention pooling.

Pseudo-class embeddings: At each training iteration, the k templates in

belonging to the i-th class (for that iteration) are dynamically and randomly associated with a pseudo-class represented by a pseudo-class embedding

, which are added to the templates as follows:

where

contains the pseudo-class embeddings for all templates at the current iteration. The pseudo-class embeddings are initialised from a normal distribution and learned during training. They are not determined by the ground-truth categories and are class-agnostic. During each inference step, we arbitrarily associate to a template (belonging to some class) the i - th embedding as described by Eq. 1. The goal is to predict the pseudo-class i. Note that the actual class information is not used. As the assigned embedding changes at every iteration, there is no correlation between the actual classes and the learned embeddings. In the proposed FS-DETR, each decoded object query

in

will attempt to predict a pseudo-class using a classifier. Pseudo-class embeddings add a signature to each template allowing the network to track the template within and dissociate it from the rest of the templates belonging to a different class. The pseudo-class embeddings are a key contribution of our approach. The method cannot be trained without the pseudo-class embeddings ( i.e it won't converge). As transformers are permutation invariant, it is not possible to predict the pseudo-class without such embeddings.

Templates as visual prompts: The templates

are provided as visual prompts to the system by prepending them to the sequence of object queries fed to the decoder:

As shown below, the templates will induce pseudo-class related information into the object queries via attention. This can be interpreted as a new form of training-aware soft-prompting.

FS-DETR encoder: Given a target image

, the same CNN backbone used for template feature extraction first generates image features

, which are enriched with positional information through positional encodings

. The features

are then processed by FS-DETR's encoder layers in order to be enriched with global contextual information. The l -th encoding layer processes the output features of the previous layer

using a series of Multi-Head Self-Attention (MHSA), Layer Normalization (LN), and MLP layers, as well as a newly proposed Multi-Head Cross-Attention (MHCA) layer as follows:

The purpose of the MHCA layer above is to filter and highlight early on, before decoding, the image tokens of interest. It was found that such a layer noticeably increases few-shot accuracy. FS-DETR's encoder is implemented by stacking L = 6 blocks, each following Eq. (3)-(5). As image tokens are permutation invariant, a fixed positional encoding was used. For the templates, pseudo-class embeddings serve as positional encodings.

FS-DETR decoder: FS-DETR's decoder accepts as input the concatenated templates and learnable object queries

which are transformed by the decoder's layers through self-attention and cross-attention layers in order to be eventually used for pseudo-class prediction and bounding box regression. The l -th decoding layer processes the output features of the previous layer

as follows:

where

. Notably, different MLPs are used to process the decoder's features

corresponding to the templates

and the object queries

:

FS-DETR's decoder comprises L = 6 layers implemented using Eqs. (6)-(9).

FS-DETR training and loss functions: For each base class that exists in the target image, a template is created for that class by randomly sampling and cropping an object from that category using a different image (containing an object of the same class) from the train set. After applying image augmentation, the cropped object/template is passed through the CNN backbone of FS-DETR. For each target image and template i (depicted in that image), the ground truth is

, where

is the target pseudo-class label (up to m classes in total) and

are the normalised bounding box coordinates. To calculate the loss for training FS-DETR, only the N transformed object queries

from the output of the last decoding layer are used for pseudo-class classification and bounding box regression ( i.e

is not used). To this end, pseudo-class and bounding box prediction heads are used to produce a set of N predictions

consisting of the pseudo-class probabilities

and the bounding box coordinates

. The heads are implemented using a 3-layer MLP and ReLU activations. An additional special pseudo-class

is used to denote tokens without valid object predictions. Note that as the training is done in a class-agnostic way via mapping of the base class templates to pseudo-classes (the actual class information is discarded) the model is capable to generalise to the unseen novel categories. Bipartite matching is used to find an optimal permutation

. Finally, the loss is:

where IoU is the GIoU loss of Rezatofighi et al and

are the loss weights.

Pre-training: Transformers are generally more data hungry than CNNs due to their lack of inductive bias. Therefore, building representations that generalise well to unseen data, and prevent overfitting within the DETR framework, requires larger amounts of data. To this end, images are used from ImageNet-100 (Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.) and to some extent MSCOCO, for unsupervised pre-training where the classes and the bounding boxes are generated on-the-fly using an object proposal system, without using any labels. Note, that unlike all prior works, negative templates are used as prompts, training the network using the proposed loss.

Experiments to test the present techniques are now described.

Datasets: Experiments presented here were all conducted using PASCAL VOC and MSCOCO datasets. Moreover, ImageNet100, consisting of ~ 125K images and 100 categories, is used (without labels) to pre-train the present object detector. PASCAL VOC and MSCOCO are used to train and evaluate few-shot experiments. Following previous works, the proposed method is evaluated on PASCAL VOC 2007+2012 and MSCOCO 2014. Specifically, PASCAL VOC is randomly divided into three different splits, each consisting of 15 base and 5 novel classes; training is done on the PASCAL VOC 2007+2012 train/val sets, and evaluation on the PASCAL VOC 2007 test set. Similarly, MSCOCO is split into base and novel categories, where the 20 overlapping categories with PASCAL VOC are considered novel, while the remainder are the base categories; following recent conventio, 5k samples from the validation set are held out for testing, while the remaining samples from both train and validation sets are used for training.

Evaluation setting: There are currently two widely-used FSOD evaluation protocols. The first focuses exclusively on novel classes while disregarding base class performance, thus not monitoring catastrophic forgetting. The second, more comprehensive protocol, called generalised few-shot object detection (G-FSOD), considers both base and novel classes. The choice of protocol and, hence, results interpretation, bears special importance for re-training based methods, as base class generalizability might be compromised. Without re-training methods, as FS-DETR, adhere to the second protocol (G-FSOD) by default, as base class catastrophic forgetting is not applicable. Results from several runs using different templates are reported, and hence, competing results using a likewise setting when available are also reported.

Baselines: As mentioned above, existing FSOD methods can be broadly categorised into: re-training based, and without re-training. The latter can handle few-shot detection on-the-fly at deployment, while re-training based FSOD methods generally tend to perform better. Re-training based methods can be further subdivided into "meta-learning" and "fine-tuning" approaches.

Results. As explained in more detail below, the results show that FS-DETR outperforms all without re-training based approaches by a large margin, i.e those directly comparable. Furthermore, FS-DETR outperforms the majority of re-training based approaches (some by a large margin). Importantly, on average across all novel sets from PASCAL VOC, it outperforms all for k = 1, while at the same time being more robust across splits, i.e FS-DETR has lower variance across novel sets. Similarly, in MSCOCO with k = 1,2 it outperforms almost all re-trained methods.

Figure 12 shows results of experiments performed using the present model, and existing techniques. Specifically, Figure 12 shows FSOD performance (AP50) on the PASCAL VOC dataset. Results with * are from Han et al. while those with

were produced by the present applicant. The present techniques outperform all without re-training methods. Moreover, the present techniques provide competitive results compared with other re-training based methods for k = 3,5,10, and even outperforms most for k = 1,2, i.e extreme few-shot settings.

The experiments for k-shot detection were conducted for three data splits, where k was set to 1,2,3,4,5,10 and AP50 values are reported. Note that the table in Figure 12 is split into two sections: Methods at the top require an additional few-shot re-training stage, while those at the bottom, including the present method, do not require any re-training. Here, it can be appreciated that the present tehcniques outperform all without re-training methods by a large margin, improving the current state-of-the-art in any shot and all split experiments by up to 17.8 AP50 points, and, in most cases, by at least ~ 10 AP50 points. Moreover, the present techniques can process multiple novel classes in a single forward pass (c.f. Figure 2). Finally, UP-DETR was reimplemented for few-shot detection on PASCAL VOC (since there is no publicly available implementation for few-shot detection or results). The present techniques largely outperform UP-DETR, perhaps unsurprisingly, as the latter was not developed for few-shot detection, but for unsupervised pre-training. Importantly, the proposed method provides competitive results or even outperforms re-training based methods (meta-learned or fine-tuned), especially for extreme low-shot, k = 1,2.

Figure 13 shows results of experiments performed using the present model, and existing techniques. Specifically, the table in Figure 13 shows FSOD performance on the MSCOCO dataset. Results with * are from Han et al. while those with

disregard performance on base classes. It can be seen that the present techniques consistently outperform the state-of-the-art methods in most of the shots and metrics.

The table in Figure 13 shows evaluation results for FS-DETR and all competing state-of-the-art methods on MSCOCO. Similarly to Figure 12, the table in Figure 13 is split into methods requiring re-training at the top and those that do not require re-training at the bottom. There, it can be appreciated that FS-DETR outperforms all comparable state-of-the-art methods by up to 3.1 AP50 points (1-shot) and, in most cases, by at least ~ AP50 points. In the experiments UP-DETR failed to converge on MSCOCO, hence, results are not included. Moreover, and in line with results observed on PASCAL VOC, FS-DETR achieves competitive results to those of re-trained based methods on MSCOCO, a far more challenging dataset. FS-DETR outperforms all re-training based methods (except Meta-DETR), for k = 1,2 while performing comparably for k = 3,5,10.

Further experiments were perforemd to analyse the impact of different design choices.

Figure 14 is a table showing the results of experiments on the choice of template encoder design. Specifically, Figuer 14 shows FSOD performance (AP50) on the PASCAL VOC dataset Novel Set 1 for various template construction configurations.

- result produced using bounding-box jittering for the patch extraction. An important component of the present techniques is the extraction of discriminative prompts from the novel classes' templates. To this end, FS-DETR's input image CNN encoder is reused. However, to focus on the most important components attention-based pooling is used instead of simple global average pooling. In Figure 14, the impact of: (a) resolution, (b) augmentation level, and (c) pooling type is shown. As the results show, increasing the resolution from 128 to 192px yields no additional gains. This suggests that, at least for the datasets in question, fine grained details are not quintessential for the identification of the targeted novel class and higher level concepts suffice. While spatial augmentation generally helps ( i.e for object recognition), it was found that adding noise to the ground truth bounding box of the template at train time leads to lower accuracy. This makes the problem for the object detector too hard, and impedes convergence. Finally, attentive pooling, instead of global average, can further boost performance.

Figure 15 is a table showing the results of experiments on the pre-training stage. The table shows the impact on the FSOD performance (AP50) on the PASCAL VOC dataset (Novel Set 1) for models with and without pre-training. Many FSOD systems use pre-trained backbones on ImageNet for classification. Deviating from this, the present model is trained in an unsupervised manner on ImageNet images and parts of MSCOCO without using the labels. This is especially important for transformer based architectures which were shown to be more prone to over-fitting due to the lack of inductive bias. As the results from Figure 15 show, unsupervised pre-training, can significantly boost the performance, preventing over-fitting toward the base classes and improving overall discriminative capacity. To reduce over-fitting the pre-training loss on ImageNet data is applied during supervised training every 8th iteration.

Figure 16 is a table showing the results of experiments to determine impact on individual components. The table shows the Impact of various components on the few-shot object detection performance (AP50) on the PASCAL VOC dataset (Novel Set 1). The accuracy improvement obtained by two components of FS-DETR was analysed, namely the MHCA layer in FS-DETR's encoder (see Eq. 5), and the type-specific MLPs (TS-MLP) in FS-DETR's decoder (see Eq. 9). As Figure 16 shows, while the present techniques, without both components, provides satisfactory results, unsurprisingly, the addition of TS-MLP further boosts the accuracy. This is expected as the information carried by the object queries and template tokens is semantically different, so ideally they should be transformed using different functions. Finally, the MHCA within the encoder injects template-related information early on to filter or highlight certain image regions, and also helps increase the accuracy.

Thus, the present techniques provide FS-DETR, a novel transformer based few-shot architecture, that is simple yet powerful, while also being very flexible and easy to train. FS-DETR outperforms all previously proposed methods, thus achieving a new state-of-the-art. In addition to the outstanding results and discussions presented above, the proposed method can simultaneously predict arbitrary number of classes, using variable-shots per class, in a single forward pass. These results, in combination with the methods formulation, clearly demonstrate not only its performance improvements but also its high flexibility. Therefore, FS-DETR can uniquely satisfy the above-mentioned FSOD system desiderata (1) and (2), while also making big improvements toward satisfying (3).

Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Claims

A computer-implemented method for using a trained machine learning, ML, model to classify images depicting novel classes, the method comprising:

receiving an image containing at least one object to be classified;

inputting, into the trained ML model, the received image and at least one template depicting a novel class, wherein the at least one template conditions the trained ML model; and

outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.
The method as claimed in claim 1 wherein the trained ML model:

extracts, using a convolutional neural network backbone of the trained ML model, visual features from the received image and the at least one template.
The method as claimed in claim 2 wherein the trained ML model:

performs, using a transformer encoder of the trained ML model, self-attention on the extracted visual features of the received image and generates a self-attention result; and

performs, using the transformer encoder of the trained ML model, cross-attention between the extracted visual features of the received image and the at least one template and generates a cross-attention result.
The method as claimed in claim 3 wherein the trained ML model:

processes, using a transformer decoder of the trained ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the received image and a class for the object in the bounding box.
The method as claimed in claim 1 wherein inputting at least one template comprises inputting an image depicting a single object.
The method as claimed in claim 1 further comprising:

obtaining, from a user, an image depicting an object to be recognised by the trained ML model; and

generating a template of the object by cropping the object from the obtained image.
The method as claimed in claim 6 wherein generating a template comprises:

augmenting the obtained image by applying an image transformation to the obtained image prior to cropping the object.
An apparatus for using a trained machine learning, ML, model to classify images depicting novel classes, the apparatus comprising:

at least one processor coupled to memory, and arranged for:

receiving an image containing at least one object to be classified;

inputting, into the trained ML model, the received image and at least one template depicting a novel class, wherein the at least one template conditions the trained ML model; and

outputting, from the trained ML model, a bounding box around each object to be classified, and a predicted class for the object in the bounding box.
The apparatus as claimed in claim 8 wherein the trained ML model:

extracts, using a convolutional neural network backbone of the trained ML model, visual features from the received image and the at least one template.
The apparatus as claimed in claim 9 wherein the trained ML model:

performs, using a transformer encoder of the trained ML model, self-attention on the extracted visual features of the received image and generates a self-attention result; and

performs, using the transformer encoder of the trained ML model, cross-attention between the extracted visual features of the received image and the at least one template and generates a cross-attention result.
The apparatus as claimed in claim 10 wherein the trained ML model:

processes, using a transformer decoder of the trained ML model, the self-attention result and the cross-attention result to predict a bounding box around each object in the received image and a class for the object in the bounding box.
The apparatus as claimed in any one of claims 8 wherein the at least one template is an image depicting a single object.
The apparatus as claimed in any one of claims 8 further comprising:

obtaining, from a user, an image depicting an object to be recognised by the trained ML model; and

generating a template of the object by cropping the object from the obtained image.
The apparatus as claimed in claim 13 wherein generating a template comprises:

augmenting the obtained image by applying an image transformation to the obtained image prior to cropping the object.
A computer-implemented method for training a machine learning, ML, model to classify images depicting novel classes, the method comprising:

obtaining a set of images, each image depicting at least one object, and a set of templates for known objects and classes, each template depicting a single object;

inputting an image from the set of images and at least one template from the set of templates into the ML model; and

training the ML model using the image and the at least one template to output a bounding box around the object in the image and a predicted class for the object in the bounding box.