WO2020225562A1

WO2020225562A1 - Processing captured images

Info

Publication number: WO2020225562A1
Application number: PCT/GB2020/051120
Authority: WO
Inventors: Razwan GHAFOOR; Peter RENNERT; Hichame MORICEAU; Reynald HAVARD
Original assignee: ThirdEye Labs Limited
Priority date: 2019-05-08
Filing date: 2020-05-07
Publication date: 2020-11-12
Also published as: GB2584400A; GB201906457D0

Abstract

An apparatus for processing a captured image depicting a person to classify the behaviour of the person with respect to an object within an environment, the apparatus comprising: a pose-estimation module configured to analyse the captured image using a computational neural network to identify a pose estimate for the person depicted in the image; a mapping module configured to map a location of the person indicated by their pose estimate within the captured image to a corresponding point of a stored map of the environment to determine the location of the person within the environment; a gaze-determination module configured to analyse the identified pose estimate in the captured image to calculate a gaze direction of the person within the environment and determine if the gaze direction intersects the object in the environment in dependence on the determined location of the person within the environment; and a classifying module configured to classify the behaviour of the person with respect to the object in dependence on whether the gaze direction intersects the object.

Description

PROCESSING CAPTURED IMAGES FIELD

This invention relates to processing captured images to classify the behaviour of a person depicted within the captured image. BACKGROUND

Computer vision is a disciplinary field that addresses how computational methods can be employed to gain an understanding of information depicted in images, or videos. To do this, image data (e.g. in the form of single images, image sequences forming a video stream, images captured by a camera etc.) can be processed and analysed to extract information and/or data that can be used to make certain conclusions on what is depicted by those images.

A typical task in computer vision is to identify specific objects in an image and to determine the object’s position and/or orientation relative to some coordinate system. Humans are examples of such objects, with the ability to detect humans within images being desirable for a number of applications, for example in the fields of surveillance and video labelling. As well as identifying people within an image, it may be desirable to additionally classify, or categorise, the behaviour of a person within an image. The ability to computationally analyse an image to categorise the behaviour of a depicted person may have utility in many fields, for example video surveillance, security, retail etc. SUMMARY

According to the present invention there is provided an apparatus and method for processing a captured image depicting a person to classify the behaviour of the person with respect to an object within an environment as set out in the appended claims. BRIEF DESCRIPTION OF FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

Figure 1 shows an example of an image processing system.

Figure 2 shows an example of an image processing unit forming part of the image processing system.

Figure 3 shows a flowchart of steps for performing a method of processing a captured image to classify the behaviour of a depicted person with respect to an object in the environment.

Figure 4 shows an illustration of generating a pose estimate of the person in a captured image.

Figure 5 shows a schematic illustration of mapping a location of the person within the captured image to a corresponding location within a stored map of the environment. Figure 6 shows a schematic illustration of extracting a gaze direction of the person and determining whether the gaze direction intersects an object within the environment. DETAILED DESCRIPTION

The present disclosure is directed to processing techniques for analysing captured images using computational neural networks to classify, or categorise the behaviour of a person depicted within the image. The behaviour of the person may be classified with respect to an object or feature within the environment inhabited by the person. The classified behaviour may for instance specify or indicate an interactivity level with the object. The object may take many different forms, but could be, for example, a display, a display unit, a shelf or an item within the environment. To classify the behaviour of the depicted person, features associated with the depicted person are extracted from the captured image using the pose estimate for that person. The extracted features may include a gaze direction of the person, and a determination as to whether the gaze direction intersects the object. The extracted features may additionally include an interaction level associated with the user indicating the level of interaction between the user and the environment, e.g. indicating whether the user is interacting with the environment or not. The extracted features may

include the distance between the user and object within the environment. These features may be extracted from the captured image from an analysis of a pose estimate derived for the person within the image. Thus, the present disclosure describes techniques for analysing an image to classify the behaviour of a person within the image using computational neural networks. This conveniently enables image labels classifying the behaviour of a depicted person to be assigned to images without having to manually label the images, which can be time consuming and expensive. Examples of how the image can be processed to extract features of the image using computational neural networks will be described below.

Figure 1 shows an image processing system 100 located within an environment 102. The environment could be for example a building, a shop, or a larger environment such as a town or city. The environment contains a plurality of objects 108, 110 and 112. The objects may have a known location within the environment. The objects may be fixtures, or fittings within the environment. The objects may be physical items within the environment. The objects could be, for example, shelves or tables (e.g. within a retail environment), displays, display units or display items etc. More generally the objects may be features within the environment.

The image processing system comprises a plurality of image capturing devices 104i _,2_,3 distributed within the environment. The image processing system is shown as including three image capturing devices for the purposes of illustration only; in general, the image processing system may include one or more image capturing devices 104. The image capturing devices may be cameras. The cameras may capture 2D or 3D images. The captured images may be digital images. Each image capturing device captures images of the environment 102 within its respective field of view. The images captured by each image capturing device may be in the form of a video stream. The image capturing devices are coupled to an image processing apparatus 106 by respective communication links. The communication links could be wired or wireless. The links communicate images captured by the cameras to the apparatus 106. The apparatus 106 could be remote from the cameras 104, or local to the cameras. For example, the apparatus could be implemented as a server (e.g. a c

separated from the cameras by a wireless network (not shown).

The apparatus 106 receives digital images captured by the image capturing devices and processes those images to classify the behaviour of depicted people with respect to the objects 108, 110, 112 within the environment 102.

An example apparatus 106 is shown in more detail in figure 2. The apparatus comprises a pose estimation module 202; a mapping module 204; a gaze- determination module 206; a joint analysis module 208 and a classifier module 210. The pose estimation module 202 is configured to receive an image 212 captured from one of the image capturing devices 104 (e.g. device 104i). The pose estimation module is coupled to the mapping module 204; gaze determination module 206; and joint analysis module 208. The mapping module is also coupled to the gaze determination module 206; joint analysis module 208 and classifier module 210.

The operation of the apparatus 106 to process a captured image to classify the behaviour of a person depicted in that image with respect to an object within the environment 102 will now be described with reference to the flowchart in figure 3. The object for the purposes of the following examples will be taken to be object 108.

The pose estimation module 202 initially receives a captured image 212 depicting a person. At step 302, the pose estimation module 202 analyses the captured image using a computational neural network to identify a pose estimate for the person depicted in the image 212. This is illustrated schematically in figure 4, which shows an example of image 212 depicting a person 402 inhabiting the environment 102. The image is processed by the pose estimation module 202 to generate a pose estimate for person 402, shown at 404.

A pose estimate refers to an estimate of the configuration of a body (in part or in whole). A pose estimate may therefore be referred to equivalently as a body pose. As will be explained in more detail below, a pose estimate may be generated by first analysing the image to identify a set of joint candidates within

estimates can then be generated from the identified joint candidates. In this context, it is noted that a‘joint’ may refer to a point of interest on the body. A joint may for example be a body part. Though a joint may correspond to a point of the body that is anatomically considered a joint (e.g. the knee, elbow, ankle etc.), it need not necessarily do so. For example, a joint could include: foot, ankle, knee, hip, torso, shoulder, elbow, wrist, neck, ears, eyes, nose etc. The joints may be provided as a set of two-dimensional joint locations on the image; i.e. each joint may be provided as an image coordinate.

A pose estimate may be represented in a variety of ways. For example, a pose estimate may represent the body as a series of kinematic links interconnected by joints (e.g. as shown in figure 4). This may be referred to as the kinematic tree representation of the body configuration.

Another way to represent the pose estimate is as a set of body parts, each with its own position and orientation in space. The pose estimates may be generated from the set of joint candidates using knowledge of typical human anatomy. For example, knowledge of the position and orientation of human body parts can be used to corroborate identified candidate joint locations within the image and identified links between them. That is, a set of joint candidates identified from the image and interconnected by a series of links may be compared with knowledge of human anatomy (e.g. known positions and orientations that can be adopted by different body parts). This knowledge can be used to refine or adapt the joint interconnections until a set of joint candidates and interconnections are derived consistent with human anatomy. If an image depicts multiple people, knowledge of the position and orientation of human body parts can be used to group identified joint candidates together for each person. That is, knowledge of the position and orientation of human body parts can be used to identify groups of joint candidates belonging to each person. A set of statistical or physical constraints may be used to enforce a pose estimate with skeletal consistency.

The computational neural network implemented by the pose estimation module 202 is a computational model that calculates and/or approximates one or more functions based on one or more inputs. The basic unit of computation in a i

neuron. A neuron may also be referred to as a node. A neuron receives input(s) from other neurons in the network, or from an external source. Each input may be associated with a weight, and (depending on the type of neuron) the neuron then calculates an output by applying a function to the weighted sum of its inputs. The function applied by the neuron may be non-linear. A neural network may contain multiple neurons arranged in layers. The layers are arranged so that neurons in a given layer do not communicate with each other. The neural network may comprise three distinct types of layers: an input layer, a hidden layer, and an output layer. The input layer is formed of neurons that provide inputs to the network. The input nodes may themselves not perform any computation. The hidden layer is formed of neurons that have no direct connection to the outside system. The neurons of the hidden layer perform computations to provide an output from one or more inputs. A neural network may be formed of one or more hidden layers. The output layer is formed of neurons that perform computations to provide outputs from the network.

The computational neural network implemented by the pose estimation module 202 may be a non-recursive neural network (such as a convolutional neural network (CNN)) or a recursive neural network (such as a long-short term memory (LSTM) network).

The neural network may comprise two stages, or phases. Each stage may be formed of multiple layers. In a first stage, the neural network may operate to generate a set of joint candidates. The first stage of the neural network may generate one or more joint maps that indicate a confidence level for each image element (e.g. pixel, pixel block etc.) that the image element represents a joint. The first stage of the neural network may generate a single joint map indicating the set of candidate joints within the image; i.e. the locations of the joint candidates within the image. The joint maps may be refined over a plurality of iterations. Each iteration may be performed by a respective layer of the neural network. A joint map may be produced by each stage of the neural network, with the joint map being refined following each iteration, or layer of the neural network. The refined joint map indicates the spatial locations of the candidate joints within the image 212. The second phase of the neural network generates the pose

identified joint candidates. Essentially, this phase determines the connections between the candidate joints that defines the pose estimate. One way to do this is to model the body as a set of n body parts, and generate a set of one or more body part maps identifying the regions of the image 212 representing the body parts. A set of n body part maps may be generated (one per body part), or a single body part map may be generated indicating the regions of the image representing each of the n body parts. The neural network may combine the information from the first stage (the set of candidate joints) with the information from the second stage (the set of identified regions of the image representing body parts of the person) to generate the pose estimate.

In summary, at step 302 the pose estimation module 202 generates a pose estimate 404 of the person 402 that estimates the body configuration of that person, which may be in the form of a set of interconnected joints and/or a set of identified body parts.

The mapping module 204 receives the pose estimate 404 identified by the pose estimation module 202. At step 304, the mapping module 204 maps a location of the person 402 determined from their pose estimate within the captured image 212 to a corresponding point of a stored map of the environment to determine the location of the person within the environment 102. That is, the mapping module 204 operates to map a point in the captured image (representing the location of the person within the image) to a corresponding point in a stored map of the environment. The map may be stored within memory 214 of the image processing apparatus 106. The map may be an architectural map of the environment. The map may be a digital, or virtual map. It may be a 2D map (e.g. a plan view of the environment 102), or a 3D map.

The mapping module 204 may map a point in the captured image to a corresponding point in the environment map by performing a coordinate transformation. Images of the environment captured by the image capturing devices 104 and the stored map may be related by a homography. That homography relationship

the mapping unit (for example, it may be precomputed). The homography relationship may be defined by a homography matrix that transforms a coordinate (i.e. point) from the captured image to a coordinate (point) in the stored map.

The transformation from the captured image to the stored map performed by the mapping module 204 is illustrated schematically in figure 5.

The mapping module initially determines a location of the person within the captured image 212 from their pose estimate. The location of the person may for example be given by a specific part, or joint, of the pose estimate. Alternatively, the location may be determined using the pose estimate, for example determining the location from one or more joints of the pose estimate. One approach is to determine the location of the person by: i) determining the location of the person’s feet from their pose estimate; and ii) calculate the average of those foot locations to determine the location of the person within the image. This is the example approach illustrated in figure 5, with the determined location marked by the‘x’ 502.

Once the mapping module 204 has determined a location of the person within the captured image 212 from the pose estimate 404, it maps that location to a corresponding point within the stored environment map (e.g. using the homography relationship between the captured image and stored map to transform the determined location‘x’ to a corresponding point in the stored map). An example of the stored map is illustrated at 504. Map 504 is in this example a 2D, plan view of the environment. The point in the map 504 that corresponds to the location of person 402 within the captured image 212 is denoted x’, and marked by the reference label 506. The points x and x’ are therefore related through the homography relationship. The map 504 stores the locations of the objects 108, 110 and 112. In other words, the position of the objects 108, 110 and 112 within the map 504 is known. Objects 108, 110 and 112 are represented in the map 504 by 508, 510 and 512 respectively. The mapping module 204, having calculated the corresponding

stored map 504, calculates the location of the person 402 within the environment. In some examples, the calculated point x’ is taken as the location of the person within the environment. The location x’ may for example specify coordinates within the map 504, with these coordinates specifying the location of the person within the environment. In other words, the location of the person within the environment may be given by a location (e.g. coordinates) defined in a frame of reference (e.g. coordinate system) local to the environment map. Thus, in summary, the mapping module maps a location of the person within the captured image 212 to a corresponding point within the map 504 to thereby determine the location of the person within the environment.

At step 306, the mapping module determines a distance between the person and the objects within the environment.

The mapping module may determine the distance between the person and objects in the environment from the calculated location x’ and the stored locations of objects 508, 510 and 512. The distance may be calculated between the location x’ and a specified location of the objects. The specified location of the objects may be given by the centroid of the objects. Alternatively, the map 504 may store multiple reference locations for each object (e.g. a reference location for each side, or face of the object). In this case, the mapping module may calculate a distance between the location x’ of the person and each reference location of the objects.

The calculated location x’ of the person within the environment is communicated from the mapping module 204 to the gaze direction module 206. The gaze direction module also receives the pose estimate 404 of the person from the pose estimation module 202.

Referring back to figure 3, at step 308 the gaze direction module 206 analyses the pose estimate 404 to extract a gaze direction of the person within the environment. The gaze direction is the direction in which the person is determined to be looking. The gaze direction module analyses the captured image 212 t<

direction of the person. The gaze-direction module may calculate the gaze direction by analysing the captured image using either a computational neural network, or computer vision techniques. If a neural network is used, that neural network may form part of the same neural network implemented by the pose estimation module (the pose estimation module may for example implement one or more layers of the neural network to extract the pose estimate, and the gaze direction module may implement one or more further layers of the neural network to calculate the gaze direction). Alternatively, the neural network implemented by the gaze direction module may be a different neural network to the neural network implemented by the pose estimation module.

Figure 6 schematically illustrates the processing performed by the gaze direction module according to a first example. The extracted gaze direction within the captured image is denoted D. The gaze direction D may be viewed as a vector within the captured image, and may be referred to as a gaze direction vector.

In this example, the gaze direction module 206 extracts the gaze direction of the person within the captured image from their pose estimate 404. In particular, the module 106 extracts the gaze direction from a set of points (or‘joints’) of the pose estimate representing features of the person’s head. This set of points may collectively define the head pose for the person. The set of points defining the head pose may represent, for example: i) one or both of the person’s eyes; and/or (ii) one or both of the person’s ears; and/or iii) the person’s nose. The set of points defining the head pose may represent features i) and ii); features ii) and iii); features i) and iii); or features i), ii) and iii). The module 206 may extract the gaze direction by projecting from one point of the pose estimate (e.g. the point representing an ear of the person) to another point of the pose estimate (e.g. representing an eye of the person). In the examples shown in figure 6, the gaze direction module projects from joint 602 of the pose estimate (representing the person’s ear) to joint 604 (representing the person’s eye) to determine the gaze direction D within the captured image.

The module 206 may then transform, or map, the gaze direction D in the captured image 212 to the stored map 504. In other words, the gaze direction module 206 projects the gaze direction D within the captured image onto the stc

gaze direction projected onto the stored map 504 is denoted D’ in figure 6.

To transform the gaze direction D to the map 504, the gaze direction module may first project the gaze direction D onto the floor plan of the environment within the captured image. The floor plan may be defined by the pose estimate 404, and in particular by the joints of the pose estimate representing the person’s feet. The floor plan of the environment defined by the pose estimate is illustrated in figure 6 by the dotted line 606. Having projected the gaze direction D onto the floor plan within the captured image 212, the gaze direction module can map the direction onto the map 504 to generate the gaze direction vector D’, for example using the homography relationship between the captured image 212 and the map 504.

Thus, in summary, according to a first example the gaze direction module 206 may extract a gaze direction within the environment by performing the following steps to process the captured image:

(i) extracting a gaze direction D in the captured image from joints of the pose estimate 404;

(ii) projecting the gaze direction D onto the floor/ground plane within the captured image. The floor/ground plane may be defined by the joints of the pose estimate representing the person’s feet.

(iii) mapping the projected gaze direction (i.e. the gaze direction D as projected onto the floor plane) onto the stored map 504, e.g. using the homography relationship between the captured image 212 and the stored map 504.

A second example for calculating the gaze direction of the person within the environment will now be described.

In accordance with the second example, the gaze-direction module computes the centre of the person’s head from multiple points of the pose estimate. The set of points from the pose estimate used to calculate the head centre could be a set of points representing parts of the person’s head (e.g. )the set of points defining the head pose described above). The gaze-direction module then projects the centre of the head to the floor/ground plane within the captured image. The centre of the head may for example be projected to a mid-point between the person’s feet

between joints of the pose estimate representing the person’s feet).

The gaze direction module then computes the orientation of the person’s head on the ground plane within the captured image. The orientation of the person’s head on the ground plane may be computed from a set of points of the pose estimate representing parts of the person’s head (e.g. the set of points defining the person’s head pose). The orientation of the person’s head on the ground plane may be taken as the gaze direction of the person within the captured image.

The gaze direction module can map the gaze direction within the captured image onto the map 504 to generate the gaze direction vector D’, for example using the homography relationship between the captured image 212 and the map 504. Thus, in summary, according to the second example the gaze direction module 206 may extract a gaze direction within the environment by performing the following steps to process the captured image:

(i) computing the centre of the person’s head from a set of points of the pose estimate 204. This set of points may represent features of the person’s head (e.g. the person’s eyes and/or ears and/or nose).

(ii) projecting the centre of the person’s head onto the floor/ground plane within the captured image. The floor/ground plane may be defined by the joints of the pose estimate representing the person’s feet. The centre of the head may be projected to a mid-point between the joints representing the person’s feet.

(iii) computing the orientation of the person’s head (and hence gaze direction of the person) on the floor/ground plane within the captured image. The head orientation may be computed from a set of points of the pose estimate 402 representing parts of the person’s head. It may be computed from the same set of points used to calculate the centre of the head at step (ii). The head orientation/gaze direction may be represented as a vector within the captured image. The vector may intersect the position of the centre of the head as projected onto the ground plane.

(iv) mapping the gaze direction as projected onto the ground plane onto the stored map 504, e.g. using the homography relationship between the captured image 212 and the stored map 504. Thus, according to the first and second examples described above, the gaze- determination module 206 operates to process the captured image 212 to calculate from the pose estimate the gaze direction of the person projected onto the ground plane within the captured image. The module then maps that gaze direction to the stored map 504 to calculate the gaze direction D’ within the map.

One convenient aspect of the above approaches is that, by projecting the gaze direction D onto the floor plane within the captured image, the projected gaze direction intersects the location of the person‘x’, which is also determined by the joints of the pose estimate representing the person’s feet. A consequence of this is that, when mapping the projected gaze direction to the map 504, the mapped gaze direction D’ intersects the corresponding location x’ of the person within the map. This is because the same transformation between the captured image 212 and map 504 can be used for both the person’s location‘x’ and the projected gaze direction.

Having extracted the gaze direction D’ of the person relative to the map 504, at step 310 the gaze direction module determines if the gaze direction intersects an object in the environment. The location of the objects 108, 1 10, and 1 12 are marked, or stored, on the map 504 (as 508, 510, 512 respectively), and thus the gaze direction module can determine if the gaze direction intersects an object from the extracted gaze direction D’ relative to the map 504, and the stored locations of the objects on the map 504.

In this particular example, the gaze direction module 206 determines that the gaze direction D’ intersects object 508, and thus the person’s gaze direction intersects object 108 in the environment 102.

At step 312, the joint analysis module 208 analyses the pose estimate 204 to classify the pose estimate as interactive or non-interactive. The joint analysis unit may analyse the pose estimate 404 using a computational neural network. This neural network may be the same neural network implemented by the pose estimation module 202 and/or mapping module 204. For example, the joint analysis unit may implement one or more layers of the neural network to classify the pose estimate as interactive or non-interactive. Alternatively, the neural network implemented b;

unit may be separate neural network to that implemented by the pose estimation module and/or mapping module.

The joint analysis module 208 may classify the pose estimate as interactive or non- interactive with respect to the environment; that is to say, the joint analysis module may classify the pose estimate as indicating a specified level of interaction between the person and the environment (e.g. an interaction or non-interaction between the person and the environment). To do this, the joint analysis module 208 may extract one or more joint angles from the pose estimate 404, and classify the level of interaction of the pose estimate (e.g. interactive or non-interactive) in dependence on those extracted joint angle(s).

For example, the joint analysis module 208 may determine the angle of the elbow joint from the pose estimate, and/or the angle of the shoulder joint. The angle of the elbow joint and/or the shoulder joint may indicate whether the person’s arm is extended, or outreached. If the person’s arm is determined to be extended, this may indicate that the person is interacting with the environment, for example grasping an object within the environment, such as an item on a shelf.

In contrast, if the joint analysis module 208 determines from the extracted joint angle(s) that the person’s arm is not extended, this may indicate that the person is not interacting with the environment.

Other joint angles may be extracted to classify the level of interaction of the pose estimate. For example, a torso angle (e.g., the angle between the upper body and hips) may be extracted. A torso angle within a specified range may indicate the person is interacting with the environment, for example by bending over to inspect, or grasp an item. A knee angle may also be extracted, with an extracted angle within a specified range indicating the person is interacting with the environment, for example, by crouching to see or grasp a low-lying object.

In some examples, the joint analysis module 208 may extract a set of joint angles from the pose estimate 404, and from those joint angles make one or more classifications with respect to the pose estimate 404 (for example, (i) arm extendec

over or not; (iii) crouching or not). The set of one or more classifications can then be used to classify the level of interaction of the pose estimate with respect to the environment. That is, the level of interaction of the pose estimate may be classified in dependence on one or more pose classifications determined from a set of extracted joint angles. The use of multiple pose classifications to classify the level of interactivity of the pose may increase the robustness of the assessment by reducing the likelihood that a pose is incorrectly classified as interactive (or not-interactive). The features extracted from the capture image by the mapping module 204, gaze determination module 206, and joint analysis module 208 are communicated to the classifier module 210.

Referring back to figure 3, at step 314 the classifier module 210 classifies the behaviour of person 402 with respect to an object in the environment using the features extracted from the captured image.

The classifier module may classify the person’s behaviour into one or more behavioural classes, or categories, in dependence on the extracted information from the captured image. In particular, the classifier module may classify the behaviour of the person with respect to the object in dependence on a set of one or more conditions determined from the extracted information. The behavioural classes may indicate different levels of interactivity, or engagement, with the object. In the context of this example, those behavioural classes may be: (i) passing the object; (ii) browsing the object; or (iii) engaging the object. The classifier module may generate a label associated with the image that indicates the behavioural class of the person depicted within the image.

In more detail, if the classifier module identifies from the gaze determination module 206 that the person is not looking at the object (i.e. it is determined at step 310 that the extracted gaze direction does not intersect the object), it classifies the behaviour of the person 402 as passing the object (the lowest level of interactivity, or engagement). If on the other hand the classifier module identifies from the gaze determination module 206 that the person is looking at the object (i.e., it is determined at step 310 that the extracted gaze does intersect the object), th<

210 classifies the behaviour as either browsing the object or engaging the object. The classifier module 210 may further classify the behaviour of the person using the information from the mapping module 204 and the joint analysis module 208. For instance, if the classifier module determines from the mapping module that the distance between the person and object is less than a specified threshold and identifies from the joint analysis module 208 that the pose estimate 404 of the person is an interactive pose, it classifies the behaviour as engaging the object (the highest level of interactivity, or engagement). If on the other hand the classifier module 210 determines that either: (i) the distance between the person and the object is greater than a specified threshold; or (ii) the pose estimate 404 of the person is a non interactive pose, it classifies the behaviour of the person as browsing the object (an intermediary level of interactivity, or engagement). Though three behavioural classes have been described in this example, it will be appreciated that the classifier module may be configured to classify the behaviour of the person into one of any suitable number of classes (e.g., less than three classes or more than three classes). It will also be appreciated that other types of behavioural classes may be used that indicate a level of activity or engagement with an object in the environment.

The above examples describe how image processing unit 106 can process a captured image using a computational neural network to classify the behaviour of a person depicted within that image. This conveniently enables labels associated with the image to be generated without having to manually label the images, saving time and expense.

Having generated a label for the image 212 indicating the behavioural class of depicted person 402, the classifier module 210 outputs the label.

Image labels generated by the image processing unit 106 may be used for a multitude of different purposes. For example, the image processing apparatus 100 may be implemented within a wider security system. Labels generated by the image processing unit 106 may be communicated to a security management unit. The security management unit may analyse the labels and determin

person is engaging with an object in the surveyed environment. If it is determined that a person is engaging with an object, a security alert may be generated. This may be useful if the objects or not to be touched or interfered with without permission, e.g. as may be the case within a museum, gallery, or shop displaying high-end items.

In another implementation, the image processing apparatus 100 may be implemented within a wider sales management system (e.g. within a store). In this case, labels generated by the image processing unit 106 may be communicated to a sales management unit. The sales management unit may analyse the labels and determine if any indicate a person is engaging with an object in the environment. In response to such a determination, the sales management unit may identify an item being handled by the person. That item may be the object itself, or an item in the vicinity of the object (e.g. if the object were a shelf or display, the item could be an item from that shelf or display). The sales management unit may identify the item by performing additional processing steps on the captured image to implement an image-recognition algorithm. The image-recognition algorithm may be performed by implementing a neural network at the sales management unit. Alternatively, the object may be the item, in which case the sales management unit determines the item from the image label generated by the image processing unit 106. The sales management unit may store data indicating items handled by people depicted in the captured images, and link that data with point- of-sale (POS) data generated from POS terminals within the environment. This may be useful for providing an additional layer of security, for example by ensuring that each item detected as being handled by a person appears within the POS data for that person.

In the examples above, the classifier module 210 classified the behaviour of the depicted person 402 in dependence on: (i) whether the gaze direction intersects the object (determined by the gaze determination module 206; (ii) the distance between the person and the object (determined by the mapping module); and (iii) whether the pose estimate is classified as interactive or non-interactive with respect to the environment (as determined by the joint analysis module 208). More generally, steps 304 to 312 of figure 3 are examples of feature extraction steps in which a set of one or more features relating to the person are extracted from the captured image depicting the person. The feature extraction steps performed by

the image processing unit 106 extract one or more features from the pose estimate of the person. In general, the classifier module 210 may classify the behaviour of the depicted person in dependence on a set of one or more features extracted from the pose estimate of the person. The set of one or more features are associated with the interaction between the person and the environment. Features (i), (ii) and (iii) above are examples of such features.

For example, in an alternative implementation the classifier module 210 may classify the behaviour of the person with respect to an object only in dependence on whether the extracted gaze direction intersects the object. In this case, the classifier module may classify the behaviour into one of two behavioural classes: a first class indicating a relatively low level of interaction with the object (if the person’s gaze does not intersect the object); and a second class indicating a relatively high level of interaction with the object (if the person’s gaze does intersect the object). In this case, the image processing unit 106 may not include the joint analysis module 208.

In another alternative implementation, the classifier module 210 may classify the behaviour of the person only in dependence on whether the extracted gaze direction intersects the object and on whether the pose estimate is classified as interactive or non-interactive (i.e. , it may not depend on the distance between the person and the object). In this case, the mapping module may not calculate the distance between the person and the object. In this case, the classifier module may classify the behaviour into one of three behavioural classes: a first class indicating a relatively low level of interaction with the object (if the person’s gaze does not intersect the object); a second class indicating a relatively high level of interaction with the object (if the person’s gaze does intersect the object and the pose estimate is classified as interactive); and a third class indicating an intermediary level of interaction with the object (if the person’s gaze does intersect the object but the pose estimate is classified as non-interactive). In this implementation, the image processing unit 106 may not include the joint analysis module 208.

In another alternative implementation, the classifier module 210 may classify the behaviour of the person in dependence on whether the extracted gaze direction intersects the object and on whether the distance between the pe

less than a specified threshold. In this implementation, the image processing unit 106 may not include the joint analysis module 208. The classifier module may classify the behaviour into one of three behavioural classes: a first class indicating a relatively low level of interaction with the object (if the person’s gaze does not intersect the object); a second class indicating a relatively high level of interaction with the object (if the person’s gaze does intersect the object and the distance is less than a specified threshold); and a third class indicating an intermediary level of interaction with the object (if the person’s gaze does intersect the object but the distance is greater than a specified threshold). In this implementation, the image processing unit 106 may not include the joint analysis module 208.

Though only a single person was depicted in the captured image in the examples described above, it will be appreciated that in other examples a single captured image may depict multiple people. In these examples, the image processing unit 106 may operate to perform steps 302 to 314 for each person depicted in the image. In other words, the behaviour of each person depicted in the image can be classified with respect to an object in the environment. This may be the same object, or the behaviour of each person may be classified with respect to different objects (e.g. if the people are relatively far apart within the environment. A convenient aspect of the techniques described herein is that they scale relatively simply with the number of people depicted in the image.

The image processing system 100 is also scalable with the size of the environment 102. In some implementations, the image processing system may contain only a single image capturing device, for example covering a single aisle of a store, or region of a store. In other implementations, the image processing system may contain multiple image capturing devices, with the image processing unit 106 being configured to analyse images captured from each image capturing device. The multiple image capturing devices may for example the floor plan of a store, a shopping centre, a street, a park etc.

Each of the steps described above with reference to figure 3 can be performed by implementing a computational neural network. In some examples, a single neural network may implemented by the image processing unit 106, wit

network layers being performed to implemented each step. In other examples, each module within the image processing unit 106 may implement its own computational neural network. Each of the computational neural network may be a non-recursive neural network (such as a convolutional neural network), or a recursive neural network (such as a LSTM network).

The unit 106 is shown as comprising a number of functional blocks. This is schematic only and is not intended to define a strict division between different logic elements of the unit. Each functional block may be provided in any suitable manner. It is to be understood that intermediate values described herein as being formed by a particular module/unit etc. need not be physically generated by the module at any point and may merely represent logical values which conveniently describe the processing performed by the module between its input and output.

The modules/units described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The term“module” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the modules take the form of program code that performs the specified tasks when executed on a processor. The methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the methods. The code may be stored on a non-transitory computer-readable storage medium. Examples of a non-transitory computer-readable storage medium include a random-access memory (RAM), read only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features s

disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. An apparatus for processing a captured image depicting a person to classify the behaviour of the person with respect to an object within an environment, the apparatus comprising:

a pose-estimation module configured to analyse the captured image using a computational neural network to identify a pose estimate for the person depicted in the image;

a mapping module configured to map a location of the person indicated by their pose estimate within the captured image to a corresponding point of a stored map of the environment to determine the location of the person within the environment;

a gaze-determination module configured to analyse the identified pose estimate in the captured image to calculate a gaze direction of the person within the environment and determine if the gaze direction intersects the object in the environment in dependence on the determined location of the person within the environment; and

a classifying module configured to classify the behaviour of the person with respect to the object in dependence on whether the gaze direction intersects the object.

2. An apparatus as claimed in claim 1 , wherein the apparatus comprises a memory storing the map of the environment.

3. An apparatus as claimed in claim 1 or 2, wherein the object’s location is marked on the stored map of the environment.

4. An apparatus as claimed in any preceding claim, wherein the mapping module is configured to:

determine a location of the person’s feet from the identified pose estimate; and map the location of the person’s feet to a corresponding point of the stored map to determine the location of the person within the environment.

5. An apparatus as claimed in any preceding claim, wherein the mapping module is further configured to: determine a distance between the person and the object with

from said corresponding point of the stored map and the object’s location on the stored map; and

wherein the classifying module is configured to classify the behaviour of the person with respect to the object in dependence on (i) whether the gaze direction intersects the object; and (ii) the distance between the person and the object.

6. An apparatus as claimed in any preceding claim, wherein the gaze- determination module is configured to project the extracted gaze direction from the corresponding point of the stored map to determine if the gaze direction intersects the object in the environment.

7. An apparatus as claimed in claim 6, wherein the gaze-determination module is configured to:

calculate a gaze direction of the person within the captured image from joints of the pose estimate; and

transform the gaze direction to the stored map to generate a mapped gaze direction projecting from the corresponding point of the stored map.

8. An apparatus as claimed in claim 7, wherein the gaze-determination module is configured to calculate the gaze direction on a plane defined by joints of the pose estimate representing the person’s feet, and transform the gaze direction to the stored map to generate the mapped gaze direction D’ in the stored map.

9. An apparatus as claimed in any preceding claim, wherein the apparatus further comprises a joint-analysis module configured to:

analyse the identified pose estimate using a computational neural network to extract one or more joint angles for the pose estimate; and

classify the pose estimate as having a specified level of interactivity with respect to the environment in dependence on the extracted joint angle(s).

10. An apparatus as claimed in claim 9, wherein the one or more joint angles comprise an elbow joint of the pose estimate.

11. An apparatus as claimed in claim 9 or 10, wherein the one c

comprise a shoulder joint of the pose estimate.

12. An apparatus as claimed in any of claims 9 to 11 when dependent on claim 5, wherein the classifying module is configured to classify the behaviour of the person with respect to the object in dependence on: (i) whether the gaze direction intersects the object; (ii) the distance between the person and the object; and (iii) the classified level of interactivity of the pose estimate with respect to the environment.

13. An apparatus as claimed in claim 12, wherein the classifying module is configured to: classify the behaviour of the person with respect to the object in a first class indicating a relatively low level of interaction with the object if it is determined the gaze direction does not intersect the object; classify the behaviour of the person with respect to the object in a second class indicating an intermediate level of interaction with the object if it is determined the gaze direction intersects the object and the pose estimate is classified as non-interactive; and classify the behaviour of the person with respect to the object in a third class indicating a relatively high level of interaction with the object if it is determined the gaze direction intersects the object, the pose estimate is classified as interactive, and the distance between the person and object is less than a specified threshold.

14. An apparatus as claimed in any preceding claim, wherein the object is a shelf or display within the environment.

15. An apparatus as claimed in any preceding claim, wherein the first class of behaviour is passing the object, the second class of behaviour is browsing the object, and the third class of behaviour is engaging the object.

16. A method of processing a captured image depicting a person to classify the behaviour of the person with respect to an object within an environment, the method comprising:

analysing the captured image using a computational neural network to identify a pose estimate for the person depicted in the image; mapping a location of the person indicated by their pose i

captured image to a corresponding point of a stored map of the environment to determine the location of the person within the environment;

analysing the identified pose estimate in the captured image to calculate a gaze direction of the person within the environment and determine if the gaze direction intersects the object in the environment in dependence on the determined location of the person within the environment; and

classifying the behaviour of the person with respect to the object in dependence on whether the gaze direction intersects the object.

17. A method as claimed in claim 16, wherein the object’s location is marked on the stored map of the environment.

18. A method as claimed in claim 16 or 17, wherein the mapping step comprises: determining a location of the person’s feet from the identified pose estimate; and

mapping the location of the person’s feet to a corresponding point of the stored map to determine the location of the person within the environment.

19. A method as claimed in any of claims 16 to 18, wherein the method further comprises:

determining a distance between the person and the object within the environment from said corresponding point of the stored map and the object’s location on the stored map; and

wherein the method comprises classifying the behaviour of the person with respect to the object in dependence on (i) whether the gaze direction intersects the object; and (ii) the distance between the person and the object.

20. A method as claimed in any of claims 16 to 19, wherein the step of analysing the determined pose estimate comprises determining if the gaze direction intersects the object in the environment by projecting the gaze direction from the corresponding point of the stored map.

21. A method as claimed in claim 20, wherein the method comp

calculating a gaze direction of the person within the captured image from joints of the pose estimate; and

transforming the gaze direction to the stored map to generate a mapped gaze direction projecting from the corresponding point of the stored map.

22. A method as claimed in claim 21 , the method comprising calculating the gaze direction on a plane defined by joints of the pose estimate representing the person’s feet, and transforming the projected gaze direction to the stored map to generate the mapped gaze direction in the stored map.

23. A method as claimed in any of claims 16 to 23, wherein the method further comprises:

analysing the identified pose estimate using a computational neural network to extract one or more joint angles for the pose estimate; and

classifying the pose estimate as having a specified level of interactivity with respect to the environment in dependence on the extracted joint angle(s).

24. A method as claimed in claim 23, wherein the one or more joint angles comprise an elbow joint of the pose estimate.

25. A method as claimed in claim 23 or 24, wherein the one or more joint angles comprise a shoulder joint of the pose estimate.

26. A method as claimed in any of claims 23 to 25, wherein the method comprises classifying the behaviour of the person with respect to the object in dependence on: (i) whether the gaze direction intersects the object; (ii) the distance between the person and the object; and (iii) the classified level of interactivity of the pose estimate with respect to the environment.

27. A method as claimed claim 26, wherein the method comprises: classifying the behaviour of the person with respect to the object in a first class indicating a relatively low level of interaction with the object if it is determined the gaze direction does not intersect the object; classifying the behaviour of the person with respect to the object in a second class indicating an intermediate level of interaction w

determined the gaze direction intersects the object and the pose estimate is classified as non-interactive; and classifying the behaviour of the person with respect to the object in a third class indicating a relatively high level of interaction with the object if it is determined the gaze direction intersects the object, the pose estimate is classified as interactive, and the distance between the person and object is less than a specified threshold.

28. A method as claimed in any of claims 16 to 27, wherein the object is a shelf or display within the environment.

29. A method as claimed in any of claims 16 to 28, wherein the first class of behaviour is passing the object, the second class of behaviour is browsing the object, and the third class of behaviour is engaging the object.