WO2018197835A1

WO2018197835A1 - Apparatus and method for open-set object recognition

Info

Publication number: WO2018197835A1
Application number: PCT/GB2018/050971
Authority: WO
Inventors: Xun Xu; Vimal THILAK; Meng Cao; Xuejun Wang
Original assignee: Blippar.Com Limited
Priority date: 2017-04-26
Filing date: 2018-04-12
Publication date: 2018-11-01

Abstract

Data processing apparatus comprising data processing resources comprising a processor and memory, said memory configured to store image data representative of a scene and configured with machine-readable instructions executable by said processor to configure said data processing circuitry to: provide a convolutional neural network trained for closed-set object recognition of a set of objects of an object training class; provide a prediction layer configured to comprise a first likelihood for an object belonging to said training class and a second likelihood for an object not belonging to said training class; receive plural instances of image data each representative of a scene comprising an object belonging to said training class or representative of a scene not comprising an object belonging to said training class and outputting for each of said plural instances of training image data respective output values ϕ from the said convolutional neural network indicative of the image content of said training image data, said value φ comprising a value ϕ responsive to the input scene comprising an object belonging to the training class and a value φ̅ for the input scene not comprising an object belonging to the training class; and generate a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to said training class from a distribution of said respective output values φ and φ̅.

Description

APPARATUS AND METHOD FOR OPEN-SET OBJECT RECOGNITION

Field

The present invention relates to an apparatus and method for open-set recognition of images. In particular, but not exclusively, the present invention relates to an apparatus and method for configuring a prediction layer for the open-set recognition of images and an apparatus and method for the open-set recognition of images utilising the prediction layer.

Background

Deep learning models, specifically deep convolutional neural networks, have been proven to be very effective for visual recognition ([1] Krizhevsky, Sutskever and Hinton 2012; [2] Simonyan and Zisserman 2014; [3] Szegedy et al. 2014; [4] He et al. 2015). However, all these models are designed and trained for closed-set recognition, i.e. the image to be recognized is assumed to comprise an object or feature belonging to a definite set of objects and features determined at the training stage, and the network is configured to pick the correct object or feature out of that set. For the avoidance of doubt, the terms "object" and "feature" may be used interchangeably and the use of one may be considered use of the other unless the context requires otherwise. Objects and features may be referred to as being of a class, e.g. class k or class /^', to differentiate one type of feature or object from another.

As general background, closed-set recognition is particularly suitable for benchmarking tasks where the objects to be recognized are known to be restricted to a pre-defined set. However, in a real world application it is seldom the case that an object to be recognized in an image is a member of a pre-defined set. In a real world application, a situation may occur in which a recognition apparatus is fed with an arbitrary image, for example an image containing an object or objects not seen at the training stage or not containing any identifiable object at all (e.g. a blurry or cluttered image). For a closed-set recognition the apparatus is configured such that it is required to return a recognition result that is a one of the trained set of objects in the closed-set and if the apparatus is provided with such an arbitrary image the result is likely to be unpredictable, i.e. the apparatus will return a result corresponding to recognition of a one of the closed-set of objects without there being such an object in the image, thus rendering the apparatus unreliable in practice. A conventional way to mitigate the so-called "open-set recognition problem" ([5] Girshick 2015) is to create a catch-all background class to capture all images that do not belong to the set of recognizable classes. Since the background can simply be anything, there are a prohibitive number of training images, and it is very hard to determine if a sufficient number of background images have been included in the training set.

A related work ([6] Bendale and Boult 2015) also attempts to solve the open-set recognition problem by adapting existing deep networks for closed-set recognition, with a specifically designed generative learning approach ([7] Bendale and Boult 2015).

Aspects and embodiments in accordance with the present invention were devised with the foregoing in mind.

Summary

Viewed from a first aspect there is provided a data processing apparatus comprising data processing resources comprising a processor and memory, said memory configured to store image data representative of a scene and configured with machine-readable instructions executable by said processor to configure said data processing circuitry to:

provide a convolutional neural network trained for closed-set object recognition of a set of objects of an object training class;

provide a prediction layer configured to comprise a first likelihood for an object belonging to said training class and a second likelihood for an object not belonging to said training class;

receive plural instances of image data each representative of a scene comprising an object belonging to said training class or representative of a scene not comprising an object belonging to said training class and outputting for each of said plural instances of training image data respective output values φ from the said convolutional neural network indicative of the image content of said training image data, said value φ comprising a value φ responsive to the input scene comprising an object belonging to the training class and a value φ for the input scene not comprising an object belonging to the training class; and generate a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to said training class from a distribution of said respective output values φ and φ.

Viewed from a second aspect there is provided a method of operating data processing apparatus comprising a processor and memory to train for object recognition, the method comprising:

configuring said processor of said data processing apparatus with a:

convolutional neural network trained for closed-set object recognition of a set of objects of an object training class; and a

prediction layer configured to comprise a first likelihood for an object belonging to said training class and a second likelihood for an object not belonging to said training class;

receiving plural instances of image data to said convolutional neural network, each instance of image data representative of a scene comprising an object belonging to said training class or representative of a scene not comprising an object belonging to said training class;

outputting from said convolutional neural network respective output values φ from the said convolutional neural network indicative of the image content of said training image data for each of said plural instances of training image data, said value φ comprising a value cp responsive to the input scene comprising an object belonging to the training class and a value φ for the input scene not comprising an object belonging to the training class; and

generating a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to said training class from a distribution of said respective output values φ and φ.

An embodiment may be further configured to provide a convolutional neural network trained for closed-set object recognition of a set of objects of a plurality of object training classes and to provide a prediction layer configured to comprise a first likelihood for an object belonging to each respective training class and a second likelihood for an object not belonging to said each respective training class. Embodiments in accordance with the first and second aspects provide a prediction layer for the discriminative determination of the probability that an object is present in an image or not present. Thus the object recognition model is trained to recognise when none of the trained objects are present in an input image as well as when trained objects are present, thereby providing a prediction layer suitable for open set recognition.

The data processing circuitry is further configured such that said prediction layer comprises respective probability distribution functions based on output values φ for objects belonging to a respective training class and not belonging to a respective training class φ . Thus, probability distribution functions may be used to discriminate between a likelihood of an object being present in an image and the likelihood of an object not being present in an image.

Suitably, respective probability distribution functions are fitted to a distribution curve. For example, respective probability distribution functions may be fitted to a Gaussian distribution. Other distributions may be utilised at the discretion of the designer of the object recognition model to modify, change or manipulate the recognition profile.

In an embodiment in which object recognition is to be carried out following training, image data for object recognition is received; generation of said prediction layer is inhibited; said output value φ indicative of the image content of said image data for recognition is output from said convolutional neural network and input to said prediction layer to determine a probability that the scene represented by said image data comprises an object of a one of a trained class; and an indication that said image data comprises an object of a one of a trained class is output for said probability satisfying a threshold criterion.

Thus, a trained embodiment may be used for object recognition if generation of the prediction layer does not occur, ie it is inhibited.

Viewed from a third aspect there is provided a data processing apparatus for object recognition comprising data processing resources comprising a processor and memory, wherein said memory is:

configured to store image data representative of a scene; configured with a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to a trained class from a distribution of respective output values for input scenes comprising an object belonging to the training class and the input scene not comprising an object belonging to the training class; and

configured with machine-readable instructions executable by said processor to configure said data processing circuitry to:

provide a convolutional neural network trained for closed-set object recognition for the training class for receiving plural instances of image data each representative of a scene;

output from said convolutional neural network an output value φ indicative of the image content of said image data for recognition;

input said output value φ to said prediction layer to determine a probability that the scene represented by said image data comprises an object of a one of each trained class; and

output an indication that said image data comprises an object of each trained class for said probability satisfying a threshold criterion.

Viewed from a fourth aspect there is provided a method of operating data processing apparatus for object recognition comprising a processor and memory, the data processing apparatus configured with a prediction layer indicative of a likelihood of an instance of image data being representative of a scene comprising an object belonging to a trained class from a distribution of respective output values for input scenes comprising an object belonging to the training class and the input scene not comprising an object belonging to the training class, the method comprising:

storing image data representative of a scene in said memory;

configuring said processor of data processing apparatus with a convolutional neural network trained for closed-set object recognition for the training class;

receiving to said convolutional neural network plural instances of image data each representative of a scene;

outputting from said closed-set convolutional neural network an output value φ indicative of the image content of said image data for recognition; inputting said output value φ to said prediction layer to determine a probability that the scene represented by said image data comprises an object of each trained class; and

outputting an indication that said image data comprises an object of each trained class for said probability satisfying a threshold criterion.

An embodiment in accordance with the third and fourth aspects does not require a training phase as it is already configured with prediction layers for the trained objects.

Viewed from a fifth aspect there is provided a method of converting a close-set convolutional neural network for closed set recognition to an open-set recognition model, the method comprising removing a soft-max layer from said close-set convolutional neural network, and replacing said soft-max layer with a new prediction layer comprising, for each of the trained objects, two likelihood probability density functions - one for positive (object) class and one for negative (non-object) class, as well as a prior ratio, wherein said two likelihood probability density functions are learned by fitting the feature values extracted from training images. Such an embodiment modifies an existing close-set model to operate as an open-set model. Thus a designer need not create all of the open-set model but utilise an initial part of the close-set model.

Typically, the method further comprises fitting the feature values extracted from training images for each of the trained classes individually.

Viewed from a sixth aspect there is provided a method of predicting the presence of an object in an input image utilising an open-set recognition model derived in accordance with the method set out in one or both of the immediately preceding two paragraphs wherein the posterior probability of each object's existence in the input image is estimated by the prediction layer in accordance with: where a_k =— is the prior ratio, and such that if probability meets or exceeds a threshold value the object is considered to exist in the image.

Typically, the method further comprises rejecting an input image for all probabilities from the prediction layer falling below the threshold value.

One or more embodiments recall objects the system was trained to recognize and meanwhile refrain from making unreliable predictions when fed with unfamiliar objects or scenes, resulting in lower false positive rate.

The described system also eliminates, or at least reduces, the need for such a background class.

Utilising a soft-max layer as is conventional in prior art systems converts class-wise features into probabilities through coupled normalization. Conversely, in embodiments in accordance with the present invention, each class is processed individually, mapping each feature to a probability score indicating how likely it is that the class is present in the image. Therefore, when the image of an unknown object or scene is presented to the network the probability score of each class is predicted independently If all of the probability scores are sufficiently low the network gracefully rejects the image as an unfamiliar input, instead of making a prediction of one of the trained classes, which it would have to do in closed-set recognition, and which would be an unreliable prediction.

A visual recognition system that is robust in real world scenarios should be able to detect the scenes and objects that fall outside of the trained set and gracefully abstain from making unreliable predictions on those.

In particular, one or more embodiments in accordance with the present invention provide apparatus and or a method that is capable of adapting a pre-trained closed-set recognition model to fulfill such tasks, hence better accommodating real world applications. The proposed system is capable of adapting a convolutional network which uses soft-max as the classification layer (e.g. a convolutional network such as disclosed in [1] Krizhevsky, Sutskever, and Hinton 2012; [2] Simonyan and Zisserman 2014; [3] Szegedy et al. 2014;

[4] He et al. 2015) to accommodate the open-set scenario. One or more embodments reuse the majority of a convolutional network and do not require retraining. Instead, only an add-on learning step is needed operative over the class-wise features that are already available from the training of the existing network.

The system described herein differs from other open-set recognition methodologies, e.g. as disclosed in [6] Bendale, Abhijit, and Terrance Boult "Towards Open Set Deep Networks" arXiv:1511.06233 [cs], November 19, 2015. http://arxiv.org/abs/1511.06233 and [7] Bendale, Abhijit, and Terrance Boult "Towards Open World Recognition" arXiv.1412.5687 [cs], June 2015, 1893-1902. doi:10.1109/CVPR.2015.7298799, by employing a discriminative learning paradigm which effectively exploits information of all training images including the background images. Additionally, one or more embodiments in accordance with the poresent invention may be more flexible than systems using conventional open-set recognition methodologies as the sensitivity of each class can be adjusted individually through the corresponding prior ratio . Finally, the proposed system is conceptually simpler and easier to implement than conventional open-set recognition methodologies.

List of figures

There will now be described one or more embodiments in accordance with the present invention, provided by way of non-limiting example only and with reference to the following drawings:

Figure 1 is a schematic illustration of a filter segment of a feed-forward neural network comprising sparse connectivity and tied weights suitable for receptive field behaviour;

Figure 2 is a schematic illustration of a convolutional neural network for closed-set recognition of an object within an image;

Figure 3 is a schematic illustration of a convolutional neural network in accordance with an embodiment of the present invention; Figure 4 is a schematic illustration of a system configuration incorporating apparatus comprising a convolutional neural network in accordance with an embodiment of the present invention;

Figure 5 is a graphical representation of an illustrative example of the probability functions for two classes of object for a convolutional neural network in accordance with an embodiment of the present invention;

Figure 6 is a process flow control diagram for the training phase of apparatus 200 in accordance with an embodiment of the present invention; and

Figure 7 is a process flow control diagram for the recognition phase apparatus 200 in accordance with an embodiment of the present invention.

Description

A fundamental element of a feed-forward convolutional neural network suitable for recognising one or more objects in an image are filter layers which are tied together in order to provide a receptive field behaviour. An example of an arrangement of filter layer for a neural network 50 is schematically illustrated in figure 1. The concept of receptive field behaviour is derived from neurobiology, for example based on the organisation of neurons in the visual cortex of a cat. Arranging a neural network for receptive field behaviour provides for the learning to recognise an object in one location of an image field to be translated or transferred to the same object but located in a different part of the image field without having to have separate weights between respective neurons in different layers as would be required for a fully connected neural network.

In the neural network 50 illustrated in figure 1 there is an input filter layer 52, a hidden layer 54 and an output layer 56. The input filter layer 52 is configured to receive pixel values from a small two-dimensional region of an image undergoing analysis. The output 56 of the neural network 50 is a single value. Neural network 50 acts as a filter on the inputs put into input filter 52. As is now conventional in image processing, input filter layer 52 of neural network 50 is applied repeatedly across an image. That is to say, inputs to input filter layer 52 are the pixel values of respective regions of the image having a size corresponding to the input filter. In the illustrated example, neural network 50 receives as input pixel values from respective groups of five pixels in an image. In other terms, filter layer 52 may be considered to be "moved over" an image so as to have input to the neural network 50 values from respective pixel regions. Applying the neural network 50 filter function across the output of the image at all possible offsets is the convolution that gives convolutional neural networks their name. Filter layer 52 may be moved one pixel at a time across an image or in groups of pixels comprising the width of the filter layer input, i.e. in the described embodiment of five pixels at a time. The particular mode of moving the filter across the image depends on how the model is configured.

Filter weights between each neural connection of input layer 52 and hidden layer 54 are "shared" or "tied" in that they have the same value for the same neural connection vector. That is to say, neural connection vector 53 has the same weight between layers 52 and 54 as illustrated by the long dashed lines, neural connection vector 57 has the same weight between layers 52 and 54 as illustrated by the short dashed lines and neural connection vector 55 has the same weight between layers 52 and 54 as illustrated by the dashed and dotted lines. Weights between respective neural connection vectors 58, 59 and 60 may be different as illustrated by different dashed or dotted outline. Thus, as neural network 50 is moved across the image the same filter is applied at each image region, thereby introducing position invariance into the filter function of neural network 50. Such position invariance may be considered to be an inherent quality of a convolutional neural network when utilised as an image filter. The limited breadth of the input filter 52 also exploits spatially local correlation thereby generating filters which may produce a strong response to a spatially local input pattern. Consequently, a convolutional neural network may provide a position invariant strong response to a local input pattern.

A general outline of a conventional closed-set recognition apparatus 100 will now be described with reference to figure 2 of the drawings. The apparatus 100 receives an image 102 comprising a feature or object 104. Feature 104 belongs to a class of a set of features for which the apparatus was trained as this is a closed-set recognition apparatus. The number of classes may be designated as "C". Such a conventional closed-set recognition apparatus 100 comprises a deep convolutional neural network (CONVNET) for visual recognition consisting of feature extraction layers 106, 108, 110 and 112 followed by a classification layer 114 that outputs a final prediction of the class of object in the input image 102. The final predictions φΐ - φθ are input to a so-called "soft-max" layer 118 which determines the probability of a class of object being present in image 102 from a log-normalised analysis of the φ values output from layer 114.

The input image typically is in 2-dimensional array of pixel values which in a monochrome image may have a single value of "0" or "1" indicating either a white or black pixel. For "grey-scale" images each pixel may have a value represented by a byte, i.e. an eight digit binary number, having 256 different values of grey-scale running from zero giving a white pixel to 255 giving a black pixel. Colour images typically have three pixels for any image point, the pixels being red, green and blue and generally having an intensity governed by a byte, i.e. 256 different values of intensity. Consequently, an image is represented by a two-dimensional array of binary numbers.

Each extraction layer filters the input image 102 in a manner that enhances, or at least draws out relative to the rest of the image, characteristics of the feature or object contained in the input image and upon which the apparatus 100 is being or has been trained. Conventionally, convolutional encoding is carried out on image 102 in extraction layer 106 which creates a feature map comprising pixel data output 105 from the filter, e.g. neural network 50 of figure 1, convolved across image 102 and in the specific example illustrated in figure 2 data 105 corresponds to pixel data from a region 103 of image 102. Pixel data 107 in feature map 106 may be reduced by so-called "pooling" or "max-pooling" in pooling layer 108 where representative values 109 for respective groups of pixels such as in a 3 x 3 array is evaluated. Values 111 from the pooling layer undergo convolution in layer 110 resulting in filter output 113. Pixel data 115 convolution layer 110 is then pooled 117 in pooling layer 112 resulting in a final feature map of image 102 represented by a reduced dataset array compared to the number of pixels (image data points) representing the original image 102. Figure 2 illustrates for each extraction layer plural filters 106a, 106 b, 106c ... 106m; 108a, 108b, 108c ... 108n; 110a, 110b, 110c ... HOo; and 112a, 112b, 112c ... 112p each producing a feature map, for convenience referred to herein by the same reference as the corresponding layer filters. Each chain of extraction layer filters results in respective final feature map 112a, 112b, 112c ... 112p. Final feature maps 112a...112p are input to a fully connected feed forward layer 114 and eventually result in the phi ("φ") layer .

Apparatus 100 is trained for object recognition and typically employs a so-called soft-max layer 118 as a classification layer, such as described in [1] Krizhevsky, Sutskever, and Hinton 2012; [2] Simonyan and Zisserman 2014; [3] Szegedy et al. 2014; [ 4 ] He et al. 2015, which is designed for closed-set recognition problems, where the input image is classified into one of the pre-defined set of object classes. Training comprises inputting images 102 including one of the object classes 104 in the closed-set. The soft-max layer 118 performs multi-class classificationby mapping the output values from the previous layer 114 (usually a fully-connected layer as described above) into posterior probabilities that sum up to one. The outputs of the previous layer 114 are called phi ( φ) and comprise a number of values <pi, q>2, ·■·, <Pk ,—, ( C (where is the number of trained object classes) which are representative of class-wise features, since each of the values φ corresponds to one trained class and can be considered as the evidence of such a class existing in the input image. Mathematically, the output probability score of the k-t class is:

C

Qk = exp(<pk)/ exp(⁽f>j) . (1)

/=1

As its name suggests, the soft-max layer compares the class-wise features φ and assigns a high probability to the class with the largest feature, i.e. value of φ, in accordance with equation (1). Such an operation is natural for closed-set recognition, since every input image is known to capture an object belonging to one of the trained classes, and the corresponding class-wise feature is expected to be larger than others. However, under an open-set scenario a soft-max layer approach is problematic. When apparatus 100 receives an image comprising an unknown object or scene which class-wise feature would have the largest result is unpredictable, leading to unreliable probability scores.

Embodiments in accordance with the present invention comprise a recognition system which has replaced the soft-max layer with a different prediction layer that maps the class-wise features to probabilities in a different way from a soft-max layer. That is to say, the neural network is trained with the soft-max layer in place and then the soft-max layer is removed or discarded. Figure 3 schematically illustrates the architecture of an embodiment in accordance with an aspect of the present invention comprising apparatus 200 which may be trained to recognise one or more classes of object for which it has been trained from input images which may comprise one or more objects for which the apparatus 200 has not been trained; classify as not having an object class images which do not comprise one of the trained classes of object; or classify as not having an object class images which do not comprise one of the trained classes of object and input image empty of any object, i.e. an image comprising merely background. Apparatus 200 comprises convolution and pooling pairs 206 and 208 together with fully connected layer 210 (210 corresponds to 114 and 118 in Figure 2) and prediction layer 212. Input images 202 and 203 may comprise respective classes of object which in the described example are respectively a triangle 204 and cross 205. Additionally, images 207 which do not comprise objects for which apparatus 200 has been configured to recognise or are empty of any object 209 may also be input to apparatus 200. Convolution/pooling pairs 206 and 208 have a similar function and indeed may be configured in a similar manner to respective convolutional pooling pairs 106/108 and 110/112. Convolution/pooling pairs 206 and 208 may have the same weight and other configuration parameters as 106/108 and 110/112 to the extent apparatus 200 is configured to recognise an open set environment having the same or similar class objects that apparatus 100 is configured to recognise as part of a closed set.

Figure 4 is a schematic illustration of a server-based embodiment in accordance with the present invention. Server 252 comprises processing resources for implementing apparatus 200 and includes an application program interface 254, a processor 256 and an interface 258. Server 252 also includes memory 262 which stores various program instructions executable by processor 256 for implementing object recognition as well as providing a data store 266. For example, memory 262 may include computer program instructions 264 executable by processor 256 for implementing a convolutional network, computer program instructions 268 executable by processor 256 to implement an application for object recognition and computer program instructions executable by processor 256 for implementing a class feature probability mapping function 270 in accordance with an embodiment of the present invention. The various processing resources and memory are coupled together via data and instruction bus 260. A scanner 272, camera 274 or other image source 276 may be coupled directly to the server to interface 258. Optionally or additionally, scanner 272, camera 274 and image source 276 may be coupled through a computer network, such as the Internet, 278 and into server 252 via API 254.

The class feature probability mapping module 270 comprises processor executable instructions for both a training phase mode and also normal recognition mode. Operation of server 252 will now be described in both the training and a normal recognition mode.

In general terms, for each class k (with reference to figure 3 triangle 204 or cross 205), probability mapping module 270 comprises instructions which generate positive and negative likelihood functions from the training data. The positive likelihood p (<Pk \ ^k £ I) is the conditional probability density function (p.d.f.) of the / -th class-wise feature for and object belonging to class k being present in image /. It can be estimated from all training images having a class k object present in them. Similarly, the negative likelihood, pk(⁽pk\ k fc I) can be estimated from all training images not containing objects belonging to class k (i.e. negative images). Such negative images include both those belonging to other classes, e.g. cross 205 when training for triangle 204, and those not containing any trained classes at all (i.e. background images 209) . pk and pk can take any parametric or nonparametric form.

For instance, in the described embodiment referring to one specific implementation pk and Pk are Gaussian functions. The output probability score for an image I including an object belonging to class k is determined by Bayes' theorem :

P — ^^k . where a_k =— is the prior ratio. (2)

As is clear, the greater the value of a the lower the probability score for class k and therefore the class would be recognized more conservatively. Ad d it i o n a l l y, α^· can be optimized as a hyper-parameter. Figure 5 is a schematic illustration of probability distribution functions generated in respect of two respective class of objects having images input to apparatus 200 comprising objects of respective classes and also objects which do not fall within the trained class of objects, comprise merely background information, or are part of the training of a first object but comprise a second object. For the described embodiment, 310 represents the training result for a first class of object being present in an input image and curve 312 represents the training result from images not comprising the first object. Conversely, 316 is the training result for images comprising a second class of object and curve 314 for images not containing the second class of object. As will be readily apparent to a person of ordinary skill in the art, figure 5 is a schematic representation of the distribution of results and is provided graphically for ease of understanding. Such results will be stored in data store 266 of memory 262 of server 252 illustrated in figure 4. The histogram under curve 310 is representative of the manner in which results are collected, collated and stored in the currently described embodiment. The probability distribution functions illustrated in figure 5 in effect provide the prediction layer 212 referred to with reference to figure 3.

Training

Turning now to figure 6, a process flow control diagram 350 schematically illustrating the steps in training apparatus 200 to recognise an object class starts at step 352. Processor executable instructions for implementing an image training and recognition process in accordance with process flow control diagram 350 are stored as application software 268 in memory 262. At step 353 the apparatus is selected to be configured for an image containing an object of the class for which the apparatus is undergoing training (cp value output) or configured for an image not containing object of the class for which the apparatus is undergoing training (φ value output) and at step 354 the image is input to the apparatus 200. At step 356, application software 268 invokes the convolutional neural network application 264 which applies convolutional neural network processing to the input image. As was described with reference to figure 2 and figure 3 the output of the convolutional and pooling layers is input to fully connected layers 210. Application 268 then invokes the class feature probability mapping software 270 which stores, step 360, the output, φ, from the fully connected layers 210 to a positive (φ value output) or negative (φ value output) set. At step 362 application 268 determines if the user has input an indication that the training has finished, 362. If no indication that the user has finished training apparatus 200 has been received then process flow control returns to step 354 where the next image is input to apparatus 200 and the process proceeds through steps 356 through to step 362 where the process again determines whether not an instruction to finish training has been received.

During the iteration through steps 353 to 362, a number of images are input into apparatus 200 and a φ value determined for each image representative of an image containing the particular class of object for which the apparatus is being trained, e.g. in the present example a triangle or a cross. In the described embodiment output value corresponding to an image containing the particular class of object for which the apparatus is being trained is represented by p. In accordance with one or more embodiments images not containing the particular class of object for which the apparatus is currently being trained, are also input. For example, if the apparatus is undergoing training for a triangle then images not containing a triangle, including an image containing a cross, will be input. For the purposes of clarification and distinction between outputs, the output for an image identified as not containing an object of the class for which the apparatus 200 is undergoing training is represented by φ. However, when describing the process in general the symbol φ is used. All possible values for φ are divided into "bins" representative of a small range of values for φ so that a histogram of φ values may be generated as schematically illustrated in figure 5 for probability distribution function 310. The "bins" are configured in data store 266 and the value for each "bin" is increased, typically by 1, for each φ value assigned to the bin at step 360. Only the one set of histograms is illustrated for clarity but of course a set of histograms will be produced for each of the curves 312, 314 and 316 illustrated in figure 5.

In response to application 268 determining that the indication training has finished has been received process flow control proceeds to step 364 at which application 268 invokes class feature probability mapping application 270. In the currently described embodiment, class feature probability mapping application 270 fits the histogram collected for the set of training images input for the training phase to produce a curve such as one of curves 310- 316 illustrated in figure 5. The line fitted data is then stored for respective classes in data store 266, step 366. Process flow control then flows to step 368 where training for the class on the training is finished. Training for another class of objects may then take place or the training phase may be halted if all classes of objects have been trained. It should be noted that training for a particular class includes the inputting of images which do not contain the respective class of objects as part of training the apparatus 200 to recognise when an image does not contain a particular class of object, e.g. curve 312 of the probability distribution functions illustrated in figure 5. Curve 312 is representative of images which do not have the class of objects present in them for which curve 310 is derived (i.e. curve 310 is derived from images which do have that class of object present).

In an optional or alternative embodiment, apparatus 200 may already be configured to recognise objects using a "soft-max" layer as known in the prior art and discussed in the introductory description herein. In such an already configured apparatus, probability distributions for outputs of the convolutional neural network, elements 206, 208 and 210 of apparatus 200 illustrated in figure 3 and implemented by module 264 of server 252, for each class of object will have already been derived and stored in data store 266. For example, images comprising a triangle and images comprising a cross will have been input to train apparatus 200 to recognise images comprising a triangle or a cross. Such probability distributions may have the form of curves 310 and 314 illustrated in figure 5, there being no corresponding curves 312 and 316. For such an optional or alternative embodiment the training process is slightly modified. Referring now to figure 6 the modification is such that step 353 is fixed to select training using images which do not include an object of the class undergoing training. For example, training to recognise a triangle will require the input of images not including a triangle to create an appropriate probability distribution function. Referring now to figure 4, if curve 310 is a probability distribution representative of images which contain a triangle then curve 312 will be the result of the further training phase of inputting images not including a triangle. The process as illustrated in figure 6 continues in the same way as described earlier until the training for a particular class of object is completed. Then the training will continue for the next class of object by inputting images not containing that class of object. In this way, for an illustrative example comprising recognition of just two classes of object probability distribution functions as illustrated in figure 4, i.e. curve 310, 312, 314 and 316, will be generated and the corresponding data stored in memory 266. Thus, the recognition process may be the same for embodiments which utilise existing probability distribution functions as well as embodiments in which the apparatus is trained from a non-trained start point.

Recognition

Turning now to figure 7, a process flow control diagram 400 schematically illustrating the steps in apparatus 200 to recognise an object class in an input image starts at step 402. Processor executable instructions for implementing an image training and recognition process in accordance with process flow control diagram 400 are stored as application software 268 in memory 262. At step 404 an image is input to the apparatus 200 and at step 406 application software 268 invokes the convolutional neural network application 264 which applies convolutional neural network processing to the input image. As was described with reference to figure 2 and figure 3 the output of the convolutional and pooling layers is input to fully connected layers 210, step 408. Application 268 then invokes the class feature probability mapping software 270 and inputs the resulting φ value to the class feature probability mapping software 270, step 410. The class feature probability mapping software 270 feeds the input φ value into the probability distribution functions stored in data store 266 to compute the posterior probability according to Equation (2), step 412. Using Equation (2) above to determine probability is sufficient since once the probability density function is defined the computation utilising equation (2) is the application of the Bayesian theorem.

At step 414 the recognition result is output, which includes a "no recognised object" output, which occurs when every class has a posterior probability lower than a specified threshold. A typical threshold is 0.5 but the threshold used in any particular implementation is a matter of design choice for the designer of the recognition system. A threshold greater than 0.5 may be used for a system in which a high certainty in the recognition of objects is desired and conversely a lower threshold may be used if a lower certainty is required. At step 416 it is determined whether or not the user has requested that another image is recognised and if so process flow control returns to step 404. Otherwise, process flow control proceeds to step 418 at which the recognition procedure finishes.

The term "empty" is used herein in connection with image data to denote image data representative of a scene not containing a specific class of feature relevant to the context. For example, in a training phase for class k features an empty scene or empty image data would be a scene not containing a k class feature or image data representative of such a scene. An empty scene or empty image data may relate to a scene or data comprising a feature other than the specific class of feature relevant to the context, i.e. in the present discussion a non- k class feature for example a / class feature, as well as a scene or image data comprising no features and which may be described as a "background" image.

Although embodiments have been described utilising convolution and pooling pairs, pooling, max pooling or sub- sampling need not be utilised.

It will be appreciated that any of the optional features of any of the embodiments described herein could also be provided with one or more of any of the other embodiments described herein.

As noted in some of the embodiments above, they may be embodied in hardware. The hardware may be referenced as a hardware element. In general, a hardware element may refer to any hardware structures arranged to perform certain operations. In one embodiment, for example, the hardware elements may include any analogue or digital electrical or electronic elements fabricated on a substrate. The fabrication may be performed using silicon-based integrated circuit (IC) techniques, such as complementary metal oxide semiconductor (CMOS), bipolar, and bipolar CMOS (BiCMOS) techniques, for example. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. The embodiments are not limited in this context. Also noted above, some embodiments may be embodied in software. The software may be referenced as a software element. In general, a software element may refer to any software structures arranged to perform certain operations. In one embodiment, for example, the software elements may include program instructions and/or data adapted for execution by a hardware element, such as a processor. Program instructions may include an organized list of commands comprising words, values or symbols arranged in a predetermined syntax, that when executed, may cause a processor to perform a corresponding set of operations.

The software may be written or coded using a programming language. Examples of programming languages may include C, C++, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth. The software may be stored using any type of computer-readable media or machine-readable media. Furthermore, the software may be stored on the media as source code or object code. The software may also be stored on the media as compressed and/or encrypted data. Examples of software may include any software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. The embodiments are not limited in this context.

Some embodiments may be implemented, for example, using any computer-readable media, machine-readable media, or article capable of storing software. The media or article may include any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, such as any of the examples described with reference to a memory. The media or article may comprise memory, removable or non-removable media, erasable or non-erasable media, writeable or re- writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), subscriber identify module, tape, cassette, electrical signal, radio-frequency signal, optical carrier signal, or the like. The instructions may include any suitable type of code, such as source code, object code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, such as C, C++, Java, BASIC, Perl, Matlab, Pascal, Visual BASIC, JAVA, ActiveX, assembly language, machine code, and so forth. The embodiments are not limited in this context.

Unless specifically stated otherwise, it may be appreciated that terms such as "processing," and the routine "computing," "calculating," "determining," or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical quantities (e.g., electronic) within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or viewing devices. The embodiments are not limited in this context.

As used herein any reference to "one embodiment" or "an embodiment" means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase "in one embodiment" or the phrase "in an embodiment" in various places in the specification are not necessarily all referring to the same embodiment.

As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having" or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, "or" refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present). In addition, use of the "a" or "an" are employed to describe elements and components of the invention. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention. For example, although an embodiment has been described utilising a normal Gaussian distribution to which the likelihood estimations p and p are fitted, other distributions may be utilised, for example multiples of a Gaussian distribution may be utilised as well as other generally bell-shaped distributions such as Cauchy, Student's t-distribution , and logistic distributions or any other suitable distribution. Additionally, although an embodiment in accordance with the present invention has been described utilising "binning" of results and fitting such bins to a Gaussian function the ordinarily skilled person will readily understand that binning results is not necessary and that result could be fitted to a Gaussian function, or other function, in a parametric manner.

Although an embodiment in accordance with the present invention has been described with reference to images 202 and 203 which may be considered to only comprise a single shape 204 and 205 respectively, the images the not be so simplistic and may comprise other features and background features as would normally be the case in images obtained in the real world. In practice, real world images having other features and background features in them will be used in training in order to generate a probability distribution function suited for real world applications. In the described embodiment, φ values are assigned to respective "bins", each "bin" covering a range of possible φ values in order to get a meaningful distribution or histogram to which a curve may be fitted. In an optional or additional embodiment, φ values are not assigned to respective "bins" as they are evaluated. Instead, respective φ values are stored and then collected into suitable equal range groups. Class feature probability mapping module 270 then determines how many instances of φ values falls within respective equal range groups to determine a suitable histogram to which a line may be fitted. However, if a very large number of training images are utilised it would not be necessary to assign φ values to "bins" since a sufficient number of values will be collected for a meaningful distribution to which a Gaussian curve, or other suitable distribution, may be fitted. Optionally, if Gaussian or other parametric form of probability density function is to be used, no binning is needed at all, with any number of training images the system may be configured to calculate the parameters from the φ values directly.

The specific embodiment described herein has been disclosed with reference to a training phase. However, if an existing deep learning convolutional neural network for a closed-set recognition system is adapted to replace the "soft-max" layer with the probability distribution function layer in accordance with an embodiment of the present invention the deep learning convolutional neural network need not be retrained for respective classes of objects since it is already configured by virtue of its previous training and use for a closed-set recognition system. Instead, the training merely comprises inputting images so that the probability distribution functions may be derived for use in an open-set system.

The described embodiment utilises not just images comprising objects of the classes to be recognised but also images which do not include an object to be recognised or in the class - wise feature training phase including an object of another class to that of the class for which the apparatus 200 is undergoing training.

The scope of the present disclosure includes any novel feature or combination of features disclosed therein either explicitly or implicitly or any generalisation thereof irrespective of whether or not it relates to the claimed invention or mitigate against any or all of the problems addressed by the present invention. The applicant hereby gives notice that new claims may be formulated to such features during prosecution of this application or of any such further application derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in specific combinations enumerated in the claims. Works Cited

[1] Krizhevsky, Alex, llya Sutskever, and Geoff Hinton. "ImageNet Classification with Deep

Convolutional Neural Networks." In Advances in Neural information Processing

Systems 25, 1106-14, 2012. http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf.

[2] Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for

Large-Scale Image Recognition." arXiv:1409.1556 [cs], September 4, 2014. http://arxiv.org/abs/1409.1556.

[3] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir

Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going

Deeper with Convolutions." a rXiv. -1409.4842 [cs], September 16, 2014. http://arxiv.org/abs/1409.4842.

[4] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for

Image Recognition." arXiv:1512.03385 [cs], December 10, 2015. http://arxiv.org/abs/1512.03385.

[5] Girshick, Ross. "Fast R-CNN." arXiv.1504.08083 [cs], April 30, 2015.

http://arxiv.org/abs/1504.08083.

[6] Bendale, Abhijit, and Terrance Boult. "Towards Open Set Deep Networks." arXiv:1511.06233 [cs], November 19, 2015. http://arxiv.org/abs/1511.06233.

[7] Bendale, Abhijit, and Terrance Boult. "Towards Open World Recognition." arXiv.1412.5687 [cs], June 2015, 1893-1902. doi:10.1109/CVPR.2015.7298799.

Claims

Claims:

1. Data processing apparatus comprising data processing resources comprising a processor and memory, said memory configured to store image data representative of a scene and configured with machine-readable instructions executable by said processor to configure said data processing circuitry to:

receive plural instances of image data each representative of a scene comprising an object belonging to said training class or representative of a scene not comprising an object belonging to said training class and outputting for each of said plural instances of training image data respective output values φ from the said convolutional neural network indicative of the image content of said training image data, said value φ comprising a value φ responsive to the input scene comprising an object belonging to the training class and a value φ for the input scene not comprising an object belonging to the training class; and

generate a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to said training class from a distribution of said respective output values φ and φ.

2. Data processing apparatus according to claim 1, further configured provide a convolutional neural network trained for closed-set object recognition of a set of objects of a plurality of object training classes and to provide a prediction layer configured to comprise a first likelihood for an object belonging to each respective training class and a second likelihood for an object not belonging to said each respective training class.

3. Data processing apparatus according to claim 1 or claim 2, wherein said data processing circuitry is further configured such that said prediction layer comprises respective probability distribution functions based on output values φ for objects belonging to a respective training class and not belonging to a respective training class <p.

4. Data processing apparatus according to claim 3, wherein said data processing circuitry is further configured such that said respective probability distribution functions are fitted to a distribution curve.

5. Data processing apparatus according to claim 4, wherein said data processing circuitry is further configured such that said respective probability distribution functions are fitted to a Gaussian distribution.

6. Data processing apparatus according to any preceding claim, said machine-readable instructions executable by said processor to further configure said data processing apparatus to:

receive image data for object recognition;

inhibit generation of said prediction layer;

output from said convolutional neural network said output value φ indicative of the image content of said image data for recognition;

input said output value φ to said prediction layer to determine a probability that the scene represented by said image data comprises an object of a one of a trained class; and

output an indication that said image data comprises an object of a one of a trained class for said probability satisfying a threshold criterion.

7. Data processing apparatus for object recognition comprising data processing resources comprising a processor and memory, wherein said memory is:

configured to store image data representative of a scene; configured with a prediction layer indicative of a likelihood of an instance of image data being representative of a scene comprising an object belonging to a trained class from a distribution of respective output values for input scenes comprising an object belonging to the training class and the input scene not comprising an object belonging to the training class; and

output an indication that said image data comprises an object of said trained class for said probability satisfying a threshold criterion.

8. Data processing apparatus according to claim 7, further configured provide a convolutional neural network trained for closed-set object recognition of a set of objects of a plurality of object training classes and to provide a prediction layer configured to comprise a first likelihood for an object belonging to each respective training class and a second likelihood for an object not belonging to said each respective training class.

9. A method of operating data processing apparatus comprising a processor and memory to train for object recognition, the method comprising:

configuring said processor of said data processing apparatus with a: convolutional neural network trained for closed-set object recognition of a set of objects of an object training class; and a prediction layer configured to comprise a first likelihood for an object belonging to said training class and a second likelihood for an object not belonging to said training class;

outputting from said convolutional neural network respective output values φ indicative of the image content of said training image data for each of said plural instances of training image data, said value φ comprising a value cp responsive to the input scene comprising an object belonging to the training class and a value φ for the input scene not comprising an object belonging to the training class; and

10. A method according to claim 9, further comprising providing a convolutional neural network trained for closed-set object recognition of a set of objects of a plurality of object training classes and to provide a prediction layer configured to comprise a first likelihood for an object belonging to each respective training class and a second likelihood for an object not belonging to said each respective training class.

11. A method according to claim 9 or claim 10, wherein said prediction layer comprises respective probability distribution functions based on output values φ for objects belonging to a respective training class and not belonging to a respective training class φ.

12. A method according to claim 11, further comprising fitting said respective probability distribution functions to a distribution curve.

13. A method according to claim 12, wherein said respective probability distribution functions are fitted to a Gaussian distribution.

14. A method according to any of claim 9 to claim 13, further comprising:

receiving image data for object recognition at said convolutional neural network; inhibiting generation of said prediction layer at said convolutional neural network;

outputting from said convolutional neural network said output value φ indicative of the image content of said image data for recognition;

inputting said output value φ to said prediction layer to determine a probability that the scene represented by said image data comprises an object of a one of a trained class; and

outputting an indication that said image data comprises an object of a one of a trained class for said probability satisfying a threshold criterion.

15. A method of operating data processing apparatus for object recognition comprising a processor and memory, the data processing apparatus configured with a prediction layer indicative of a probability of an instance of image data being representative of a scene comprising an object belonging to a trained class from a distribution of respective output values for input scenes comprising an object belonging to the training class and the input scene not comprising an object belonging to the training class, the method comprising:

storing image data representative of a scene in said memory;

outputting from said convolutional neural network an output value φ indicative of the image content of said image data for recognition;

inputting said output value φ to said prediction layer to determine a probability that the scene represented by said image data comprises an object of said trained class; and outputting an indication that said image data comprises an object of said trained class for said probability satisfying a threshold criterion.

16. A method according to claim 15, wherein said data processing apparatus is configured with a prediction layer configured to comprise a first likelihood for an object belonging to a respective training class of a plurality of training classes and a second likelihood for an object not belonging to said respective training class.

17. A method of converting a close-set convolutional neural network for closed set recognition to an open-set recognition model, the method comprising removing a soft-max layer from said close-set convolutional neural network, and replacing said soft-max layer with a new prediction layer comprising, for each of the trained objects, two likelihood probability density functions - one for positive (object) class and one for negative (non-object) class, as well as a prior ratio, wherein said two likelihood probability density functions are learned by fitting the feature values extracted from training images.

18. A method according to claim 17, further comprising fitting the feature values

extracted from training images for each of the trained classes individually.

19. A method of predicting the presence of an object in an input image utilising an

open-set recognition model derived in accordance with the method of claim 9 or claim 10, wherein the posterior probability of each object's existence in the input image is estimated by the prediction layer in accordance with:

P = ^P/c __

k Pk+ttkPk '

where a_k =— is the prior ratio, and such that if probability meets or exceeds a threshold value the object is considered to exist in the image.

20. A method according to claim 19, further comprising rejecting an input image for all probabilities from the prediction layer falling below the threshold value.