Age Estimation
Technical Field
This disclosure relates to computer-implemented age-estimation technology.
Background
From time to time people need to prove some aspect of their identity, and often the most compelling way to do this is with a passport or other national photo ID, such as a driving licence or (in jurisdictions which mandate them) an identity card. However, whilst these documents are greatly trusted due to the difficulty involved in making fraudulent copies and their issuance by government institutions, they are also sufficiently valuable that it is preferable not to have to carry them everywhere with us.
An important aspect of this is age verification. Systems for automated age verification are known. For example, in a digital identity service provided under the name “Yoti”, at the time of writing, an image of a user may be captured and transmitted to a back-end service storing user credentials (e.g. passport information) which can then identify the user and verify their age. Facial recognition is used to match a selfie taken by the user with an identity photo on the passport or other ID document.
A user can share selected credential(s), such as an 18+ attribute, with others based on QR codes. The fact that the 18+ attribute is derived from the user’s passport or other trusted identity document, which in turn has been matched to the user biometrically, makes this a highly robust automated age verification mechanism.
Summary
Although highly robust, the system above does require a user device such as a smartphone with network connectivity in order to perform the age verification. It also requires the user to perform the sharing process, which requires some effort and is somewhat time consuming.
The present invention provides automated age estimation technology based on machine learning (ML), which is able to provide an accurate human age estimate in many practical circumstances based on a combination of a user’s facial and voice characteristics. This exploits the realization that both facial and voice characteristics can contain information about the age of the user from which they have been captured.
A first aspect of the present invention provides a computer system for performing human age estimation, the computer system comprising: an image input configured to receive captured image data; an audio input configured to receive captured audio data; execution hardware configured to execute a neural network for processing the captured image and audio data to compute human age estimate data, the neural network being formed of: a facial feature extractor comprising a plurality of neural network processing layers configured to process the image data to compute a facial feature set, which encodes human facial characteristics exhibited in the captured image data; a voice feature extractor comprising a plurality of neural network processing layers configured to process the audio data to compute a voice feature set, which encodes human voice characteristics exhibited in the captured audio data; a feature combiner configured to compute a combined feature set by combining the voice and facial feature sets; and an age estimator comprising at least one neural network processing layer configured to process the combined feature set to compute the human age estimate data.
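By way of illustration only, the four components of this first aspect can be sketched in code. The sketch below is a simplified numpy stand-in, not the claimed implementation: simple linear maps with a tanh non-linearity take the place of the convolutional facial and voice feature extractors, and all dimensions, weights and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    """Normalise logits into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

class AgeEstimationNet:
    """Structural sketch: facial feature extractor, voice feature
    extractor, feature combiner and age estimator wired together.
    Linear maps stand in for the CNN processing layers."""

    def __init__(self, d_image, d_audio, d_feat, n_classes):
        self.W_face = rng.normal(0, 0.1, (d_feat, d_image))
        self.W_voice = rng.normal(0, 0.1, (d_feat, d_audio))
        self.W_age = rng.normal(0, 0.1, (n_classes, 2 * d_feat))

    def forward(self, image_data, audio_data):
        facial_features = np.tanh(self.W_face @ image_data)  # facial feature set
        voice_features = np.tanh(self.W_voice @ audio_data)  # voice feature set
        combined = np.concatenate([facial_features, voice_features])  # combiner
        return softmax(self.W_age @ combined)                # age estimate data

net = AgeEstimationNet(d_image=16, d_audio=12, d_feat=8, n_classes=5)
age_distribution = net.forward(rng.normal(size=16), rng.normal(size=12))
```

The output is a distribution over age classes, which is the form of age estimate data used in the embodiments described below.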
A benefit of this architecture is that it allows the human age estimation system to be trained “end-to-end”, which in turn is expected to improve the performance of the trained system. This training can be based on backpropagation across the whole system (examples of this are described below). As well as the performance improvement, end-to-end trainability makes it significantly easier to incrementally adapt/tune the system to new data (without re-training from scratch) compared to a non-end-to-end case, where the classifier would need to be trained from scratch. This is a significant benefit, as it allows the system to be continuously trained and improved as more data becomes available.
In preferred embodiments of the invention, a confidence value is also determined for the age estimate which provides an indication of the robustness of the age estimate. This is useful in a practical context in which access to an age-restricted function may be regulated based on the age estimate and the confidence value, with access being granted when the age estimate
indicates an associated age-requirement is met and the confidence in the age estimate is sufficiently high.
In embodiments, the neural network processing layers of the facial feature extractor may be convolutional neural network (CNN) processing layers. The neural network processing layers of the voice feature extractor may likewise be CNN processing layers.
The voice feature extractor may comprise an audio processing component configured to process portions of the audio data within respective time windows to extract a feature vector for each of the time windows, and combine the feature vectors for the time windows to form an input to the CNN processing layers from which the voice feature set is computed.
The feature vectors may comprise frequency values, such as cepstral coefficients.
The at least one neural network processing layer of the age estimator may be fully connected.
The human age estimate data may be in the form of a distribution over human age classes.
The age estimator may be configured to compute at least one confidence value for the human age estimate data. For example, where the human age estimate data is in the form of a distribution over human age classes, the confidence value may be a measure of the spread of the distribution (e.g. a variance or standard deviation of the distribution).
Each age class may be associated with a numerical age value and the age estimate data may comprise an estimated age value. The age estimator may be configured to compute the estimated age value as an average of the numerical age values that is weighted according to the distribution over the human age classes. For example, with a probability distribution that assigns a probability to each age class, the numerical age values may be weighted according to the corresponding class probabilities.
The feature combiner may be configured to apply, to the voice feature set, a feature modification function to compute a modified voice feature set, the combined feature set comprising the modified feature set. The feature modification function may be defined by a set of weights learned during training of the neural network. The feature modification function may, for example, have a sigmoid activation.
The computer system may comprise an access controller configured to regulate access to an age-restricted function based on the age estimate data. For example, based on the human age estimate data and the at least one confidence value referred to above.
Another aspect of the invention provides a computer system for performing human age estimation, the computer system comprising: an image input configured to receive captured image data; an audio input configured to receive captured audio data; an anti-spoofing system configured to randomly select one or more words for outputting to a user from which the audio and image data are captured, and generate an anti-spoofing output by comparing the randomly-selected one or more words with at least one of the captured audio data and the captured image data to determine whether voice patterns and/or lip movements exhibited therein match the randomly-selected one or more words; and a human age estimation system configured to process the captured audio and image data, to compute human age estimate data based on voice and facial features exhibited therein.
This provides age-estimation with robust “anti-spoofing”. A sample of speech is required from the user, to be used in conjunction with the video data for age estimation. By imposing an additional requirement that the user must speak one or more specific, randomly selected words, and checking that they have done so (using the audio data, the image data or both), it becomes much harder for the user to spoof the system by presenting pre-recorded audio and video.
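The challenge-and-check logic can be sketched as follows. This is a minimal sketch: the function and variable names are hypothetical, and the `transcript` is assumed to be produced by a separate speech-recognition and/or lip-reading component, which is not shown.

```python
import random

def select_challenge_words(vocabulary, n_words, rng=random):
    """Randomly select the word(s) the user must speak."""
    return rng.sample(vocabulary, n_words)

def normalise(text):
    """Lower-case word sequence for a tolerant comparison."""
    return [w.strip().lower() for w in text.split()]

def anti_spoofing_check(challenge_words, transcript):
    """Compare the randomly selected words against what the audio
    (and/or lip-movement) analysis says was actually spoken; a
    mismatch suggests pre-recorded audio/video is being presented."""
    return normalise(" ".join(challenge_words)) == normalise(transcript)

vocabulary = ["apple", "river", "stone", "cloud", "ember"]
words = select_challenge_words(vocabulary, 2)
# `transcript` would come from the speech-recognition / lip-reading component.
```

Because the words are freshly randomised for each attempt, a pre-recorded clip is very unlikely to contain the right utterance.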
To increase the robustness of the anti-spoofing, face matching may also be applied to the image data against an already-stored biometric template. In combination with the above, this imposes a requirement that the randomly selected word(s) must be spoken in real time by a known authorised user.
In embodiments, the computer system may comprise a facial recognition component configured to apply facial recognition to the image data, to provide a facial verification output indicating whether the image data contains a face of an authorized user.
The computer system may comprise an access controller configured to regulate access to an age-restricted function based on the age estimate data and the anti-spoofing output and/or the facial verification output.
Another aspect of the invention provides a computer-implemented method of performing human age estimation, in which the above functionality is implemented.
Another aspect of the invention provides a computer program product comprising executable instructions stored on a computer-readable storage medium and embodying the above neural network for execution on one or more processors. When executed, the instructions cause the one or more processors to implement any of the above functionality.
Brief Description of Figures
For a better understanding of the present invention, and to show how embodiments of the same may be carried into effect, reference is made by way of example only to the following figures in which:
Figure 1 shows a schematic block diagram of a computer system in which automated age-estimation is implemented;
Figure 2 shows a functional block diagram of an access control system;
Figure 3 shows a functional block diagram of a human age estimation system implemented as a neural network;
Figure 4 shows a schematic representation of a data pipeline within the neural network of Figure 3;
Figure 5 shows a schematic function block diagram of a ML age estimation system during training;
Figure 6 shows a functional block diagram of an age-estimation classifier;
Figure 7 shows an extension of the age estimation system architecture of Figure 3;
Figure 8 illustrates high level principles of data processing operations performed within a convolutional neural network; and
Figure 9 is a functional block diagram of an access control system which incorporates anti-spoofing technology.
Detailed Description
Figure 1 shows a highly schematic block diagram of a computer system in which a computer device 106 is arranged to capture both audio data 102 and image data 104 from a user 100 of the computer device 106. The image data 104 and audio data 102 contain, respectively, facial and speech characteristics of the user 100, which are used as a basis for automated age estimation, as described below. The computer device 106 is shown to comprise at least one image capture device 103 and at least one audio capture device 101. The image and audio data 104, 102 are captured by those devices 103, 101 respectively and supplied to a processor 110 of the computer device 106, and the audio and image capture devices 101, 103 are coupled to the processor 110 for this purpose. Although shown as a single element, the computer device 106 can comprise one or more such processors 110, such as but not limited to CPUs and accelerators (e.g. GPUs), on which instructions are executed to carry out the functionality described herein. The computer device 106 is also shown to comprise a user interface 108 by which information can be outputted to the user 100 and received from the user 100 for processing. The user interface 108 can comprise any combination of input and output devices that is suitable for this purpose, such as but not limited to a display, a touchscreen, a mouse, a trackpad etc. It is also noted that, although
shown as separate elements, one or both of the audio and image capture devices 101, 103 may be considered part of the user interface 108, for example in a context where the computer device 106 is controlled using voice and/or gesture input. The computer device 106 is also shown to comprise a network interface 112 that is also coupled to the processor 110 to allow the computer device 106 to connect to a computer network 114 such as the Internet. This allows the computer device 106 to access remote services. By way of example, a remote computer system (backend) 116 is shown which can be configured to provide one or more such services. The backend 116 is shown to comprise at least one processor 118 (such as a CPU, GPU etc.) on which instructions are executed to implement such services. The processors 110, 118 can comprise any form of execution hardware that is capable of executing the functionality disclosed herein. As described below, aspects of the described technology are implemented using machine learning models, which can be executed on a variety of hardware platforms, and which in some instances may be optimized for that purpose. Examples include an accelerator or co-processor such as, but not limited to, a GPU (graphics processing unit); suitably configured FPGA(s) (field programmable gate array) or even ASIC(s) (application-specific integrated circuit). It will therefore be appreciated that the term execution hardware covers a wide variety of viable hardware platforms in the present context, the breadth of which can only be expected to increase given the speed of development in the relevant technical fields.
With reference to Figure 2, the user 100 is able to access an age-restricted function 204 using the computer device 106 subject to the user 100 successfully completing an automated access control process in order to gain access to the age-restricted function 204. This is based on the results of the age-estimation applied to the audio and video data 102, 104.
Figure 2 shows a schematic function block diagram for an access control system 220 embodied in the computer system of Figure 1. The access control system 220 is shown to comprise a human age estimation system 200 and an access controller 202. The access controller 202 operates to regulate access to the age-restricted function 204 based on outputs of the age estimation system 200. Those outputs are shown to comprise, in this example, human age estimate data 210 and at least one associated confidence value (score) 212. The age estimate data 210 is computed using a combination of the captured audio and image data 102, 104. The mechanism by which the age estimate data 210 is derived is described in detail below. For now suffice it to say that, in the described examples, the age estimate data 210 is in the form
of an age distribution, which is a probability distribution over a set of age classes. Each age class can for example be a human age expressed in years (18, 19, 20 etc.) or an age range (such as 18 to 21). The confidence value 212 denotes a level of confidence in the age estimate data, i.e. how confident the age estimation system 200 is that the age estimate data 210 provides an accurate estimate about the age of the human user 100.
The output 210 of the age estimation system 200 is a probability distribution over all possible ages. An estimated age is computed from this as the probability-weighted average of these ages and the confidence score 212 is computed as the standard deviation of the distribution (the lower the standard deviation, the more confident the system is about the age estimation).
The access controller 202 will only grant the user 100 access to the age-restricted function 204 when the age estimate data 210 indicates the user 100 meets a specified age requirement associated with the age-restricted function 204, and the at least one confidence value 212 indicates a sufficient level of confidence in that estimate. This could for example be based on a threshold applied to the confidence value 212.
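The estimate-plus-confidence gating described above can be made concrete with a short sketch. The class ages, probabilities and thresholds below are illustrative assumptions; the access controller 202 in practice would apply whatever age requirement is associated with the age-restricted function 204.

```python
import math

def estimate_age_and_confidence(distribution, class_ages):
    """Probability-weighted mean age, plus the standard deviation of
    the distribution (lower deviation = higher confidence)."""
    mean = sum(p * a for p, a in zip(distribution, class_ages))
    var = sum(p * (a - mean) ** 2 for p, a in zip(distribution, class_ages))
    return mean, math.sqrt(var)

def grant_access(distribution, class_ages, min_age, max_std):
    """Grant access only if the estimated age meets the requirement
    AND the spread of the distribution is below a threshold."""
    est, std = estimate_age_and_confidence(distribution, class_ages)
    return est >= min_age and std <= max_std

ages = [17, 18, 19, 20]
confident = [0.01, 0.02, 0.90, 0.07]   # sharply peaked -> low std
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat -> high std
```

With these toy numbers, the peaked distribution passes an 18+ check while the flat one is rejected despite having an estimated age above 18, because its standard deviation exceeds the confidence threshold.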
The access control system 220 can be implemented at the hardware level in a variety of ways. For example, in one case its functionality can be implemented entirely locally at the computer device 106 of Figure 1. Alternatively, this functionality may be implemented at the backend 116 and, in that case, decisions made at the backend 116 can be communicated to the computer device 106 via the network 114 as necessary to give effect to those decisions. Alternatively, the functionality can be implemented in a distributed fashion, for example with part of it being implemented at the computer device 106 and part at the backend 116, with communication taking place between those two systems via the network 114 as needed.
The age estimation technology disclosed herein can be applied in a variety of contexts using a variety of devices, systems etc. For example, the computer device 106 may be a personal user device such as a smartphone, tablet, personal computer etc. In that context, the age-estimation technology can for example be applied in order to regulate online purchases of age-restricted goods or services or to regulate access to certain age-restricted content online. Another context is in physical retail outlets with self-service technology. Here the computer device 106 may
for example be a self-checkout terminal or a handheld “self-scanning” device which a user uses to scan items they wish to purchase as they select them.
In such contexts, where a user wishes to purchase age-restricted items, they may be prevented from doing so if they do not successfully pass the automated age-estimation checks that are disclosed herein, or at least prevented from doing so unless they complete further age-verification procedures.
This further check could be a manual check by a supervisor working in the retail outlet. To provide another example of a further age-verification check that may be performed if the user fails to pass the initial age-verification check, reference is made to International patent application published as WO2016/128569 and United States patent applications published as: US2016/0239653; US2016/0239657; US2016/0241531; US2016/0239658; and
US2016/0241532, each of which is incorporated herein by reference in its entirety. These disclose a digital identity system (Yoti/uPass) which allows a user to share verified identity attributes with third parties. The further age verification process can be carried out by sharing an age-related identity attribute (e.g. date of birth or 18+ attribute etc.) with the access controller 202 in accordance therewith. In this context, a trusted digital identity system provides the attribute, which is also anchored to a passport or other authenticated identity document. The digital identity system and the ID document are both trusted sources of identity information, allowing the identity attribute to be verified by verifying its source. Given the range of contexts in which the invention can be applied, it will be appreciated that the age-restricted function 204 broadly represents any functionality or service, access to which or the use of which is at least partially restricted based on age.
Figure 3 shows a schematic function block diagram of the age estimation system 200 in one example implementation. The age estimation system 200 is in the form of a neural network comprising two parallel subnetworks (subnets) 304, 306, a feature combiner 308 and an age classifier 310.
The age estimation system 200 is shown having inputs for receiving the audio and image data 102, 104 for processing within the system 200. The parallel subnets 304, 306 have
convolutional neural network (CNN) architectures for processing the audio and image data 102, 104 respectively, and may be referred to as the audio and image subnets respectively. The audio subnet 304 extracts from the audio data 102 a set of speech (vocal) characteristics (features) and the image subnet 306 extracts from the image data 104 a set of facial characteristics.
The feature combiner 308 receives the set of speech characteristics from the audio subnet 304 and the set of facial characteristics from the image subnet 306 and combines them in order to provide a combined set of user characteristics to the age classifier 310. The age classifier 310 computes the age estimate data 210 from the set of combined characteristics, which as noted is in the form of an age distribution. A confidence estimation component 312 computes the confidence value 212 as the standard deviation of the age distribution 210.
In the present example, the image data 104 is supplied to the image subnet 306 directly as an input “volume” of pixel values. The image processing subnet 306 applies a series of convolutions and non-linear transformations to the input volume in order to extract increasingly high-level features from the image data 104. This is described in further detail below.
The audio data 102 is subject to additional pre-processing in order to form a suitable input volume to the audio subnet 304. This is shown in Figure 3 as pre-processing component 302, which extracts an input feature vector from the audio data 102 for each of a plurality of time windows. The input feature vector for each time window comprises MFCCs (Mel-Frequency Cepstral Coefficients) computed from the audio data 102 within that time window. The input feature vectors across the time windows constitute an array of MFCCs, which is passed as an input volume to the audio subnet 304. Similarly, the audio subnet 304 applies a series of convolutions and non-linear transformations thereto in order to generate the set of speech characteristics to be combined with the facial characteristics. MFCCs encode a power frequency spectrum of the audio signal within each time window, with different frequencies being weighted according to how responsive the human ear is to those frequencies. In order to compute the MFCCs, a Fourier transform (e.g. FFT) is applied to the portion of audio data in each time window, from which the weighted power spectrum can be computed.
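The windowed MFCC pre-processing can be sketched as follows. This is a simplified, illustrative pipeline (framing, FFT power spectrum, mel-weighted triangular filterbank, log compression, DCT); a production system would typically use a tuned audio library, and the window length, hop size and coefficient counts here are assumptions only.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced on the mel scale, which weights
    frequencies according to the responsiveness of the human ear."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for j in range(lo, c):
            fb[i - 1, j] = (j - lo) / max(c - lo, 1)   # rising slope
        for j in range(c, hi):
            fb[i - 1, j] = (hi - j) / max(hi - c, 1)   # falling slope
    return fb

def mfcc_input_volume(signal, sr, win_len=400, hop=160, n_filters=26, n_mfcc=13):
    """One row of MFCCs per time window, stacked into the 2D input
    volume (time x frequency) passed to the audio subnet."""
    fb = mel_filterbank(n_filters, win_len, sr)
    window = np.hanning(win_len)
    rows = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2        # FFT power spectrum
        mel_energy = np.log(fb @ power + 1e-10)        # mel-weighted, log-compressed
        # DCT-II of the log filterbank energies -> cepstral coefficients
        n = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
        rows.append(dct @ mel_energy)
    return np.stack(rows)
```

Each row of the returned array corresponds to one time window, matching the spectrogram-style input volume 402 described with reference to Figure 4.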
A volume in this context refers to an array of values, which can be a 2D or 3D array (in the context of CNN processing, a 2D array is considered a volume having a depth of one).
To further illustrate the architectural principles of the age estimation system 200, Figure 4 shows a high-level overview of an example of a data pipeline within the age estimation system 200. The image data 104 is shown as an input volume to the image subnet 306 and the result of the processing by the image subnet 306 is a series of intermediate data volumes 406 representing increasingly high-level facial features within the image data 104, culminating in the extracted set of facial features which is shown in the form of a facial feature vector 407. This corresponds to the output of the image subnet 306 that is passed to the feature combiner 308.
The image data 104 could be “static” image data captured at a particular time instant or a sequence of successive video frames captured over a time interval (video data).
The audio data 102 is shown as a time-varying signal in the time domain. As indicated above, the audio signal is divided into time windows, and two such time windows are marked and labelled W1 and W2 in Figure 4. To the portion of audio data in each of those time windows W1, W2, an MFCC transformation is applied in order to compute a set of MFCCs for that time window. These are supplied to the audio subnet 304 in an input volume 402, as outlined above. In this example each row of the input volume 402 corresponds to a particular time window and contains the MFCCs extracted for that time window. Hence, the audio input volume 402 is a form of spectrogram, which represents the audio data 102 in a time-frequency plane (time running horizontally and frequency running vertically in this example) in a way that captures variations in the power spectrum of the audio signal over time.
The processing of the input volume 402 by the audio subnet 304 similarly results in a series of intermediate data volumes 404 representing increasingly high-level speech features of the user 100, culminating in the set of speech characteristics that is passed to the feature combiner 308 which is shown in the example of Figure 4 in the form of a speech feature vector 405.
As will be appreciated, although the structures of the intermediate data volumes 404, 406 within the subnets 304, 306 appear similar in Figure 4, this is not necessarily the case in practice. The two subnets 304, 306 are processing different types of data, and may have quite different CNN architectures (different numbers of layers/filters, different filter sizes, different dimensionalities etc.), which may be optimised for the type and format of the data they are processing.
In this particular example the voice and facial characteristics are combined by concatenating the facial and voice feature vectors 407 and 405 to form a concatenated feature vector 408. This concatenated feature vector 408 is the output of the feature combiner 308 and is used by the age classifier 310 to compute the age distribution 210. That is, a single age distribution is computed directly from voice and facial features extracted by the audio and image subnets 304, 306.
The age distribution 210 is encoded in this example as a softmax vector 210, the nth component of which is an estimated probability that the user belongs to age class n given the input audio and image data 102, 104 (A and I respectively), which is expressed in mathematical notation as the condition N = n. That conditional probability may be written P(N = n | A, I).
Concatenation is a relatively simple but nevertheless effective way of combining the facial and voice characteristics that are to be used as a basis for age estimation. However, extensions of the system are described later in which the features are combined in a way that takes into account the expected relevance of certain features to particular age classes.
The processing performed by the image subnet 306, the audio subnet 304 and the age classifier 310 is based on respective model parameters which have been learned from a suitable set of training data. A benefit of the architecture of Figures 3 and 4 is that it allows the whole system 200 to be trained end-to-end as described below with reference to Figure 5. It is expected that end-to-end training can result in a performance improvement for the trained system 200.
Figure 5 shows a schematic function block diagram which demonstrates high-level principles of the end-to-end training in this context. A training set 500 comprises a plurality of examples of the kind of data the age estimation system 200 will need to be able to interpret meaningfully in use. In this case, each example comprises a piece of image data 504 (static or video) captured
from a user whose age is known, together with a sample of audio data 502 captured from the same user. Each of those examples is labelled with an age label 506 identifying the age class to which the user in question is known to belong. In a training phase, for each example in the training set 500, the audio and image data 502, 504 of that example is passed to the age estimation system 200 and processed therein just as described above with reference to Figures 3 and 4. In particular, the audio subnet 304, image subnet 306 and age classifier 310 process their respective inputs according to their respective model parameters, which are labelled P1, P2 and P3 respectively in Figure 5. The end result is an output 510 of the age estimation system 200 which corresponds to the age distribution 210. The aim of the training is to adjust the model parameters P1, P2 and P3 in order to match the outputs 510 across the training examples to the corresponding age labels 506. To achieve this, a loss function 512 is defined which provides a measure of difference between the outputs across the training set 500 and the corresponding age labels 506 in a given iteration of the training process. Back propagation is then used by a back-propagation component 514 to adapt the model parameters P1, P2 and P3 for the next iteration with the objective of minimising the defined loss function 512. This process is repeated until defined stopping criteria are met. The stopping criteria are chosen to optimise the loss function 512 to a sufficient degree, whilst avoiding overfitting of the system 200 to the training set 500. Following successful training, the age estimation system 200 is able to apply the knowledge it has gained in training to new inputs that it has not encountered during training. The principles of back propagation based on loss functions are well known per se, hence further details are only described herein to the extent that they are considered relevant in the present context.
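The end-to-end training principle can be illustrated with a toy numpy sketch. Single linear maps stand in for the subnets and classifier (named P1, P2 and P3 after the figure); the dimensions, learning rate and data are illustrative assumptions. The point is that the cross-entropy gradient at the final output flows back through the combiner into both branches, so all three parameter sets are updated simultaneously.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for the real subnets.
D_img, D_aud, D_feat, N_cls = 8, 6, 4, 3

# P1 (image subnet), P2 (audio subnet), P3 (classifier),
# each reduced here to a single linear map for illustration.
P1 = rng.normal(0, 0.1, (D_feat, D_img))
P2 = rng.normal(0, 0.1, (D_feat, D_aud))
P3 = rng.normal(0, 0.1, (N_cls, 2 * D_feat))

def forward(img, aud):
    f_face = P1 @ img                         # facial feature set
    f_voice = P2 @ aud                        # voice feature set
    h = np.concatenate([f_face, f_voice])     # feature combiner
    z = P3 @ h                                # classifier logits
    p = np.exp(z - z.max()); p /= p.sum()     # softmax output
    return h, p

def train_step(img, aud, label, lr=0.1):
    """One end-to-end gradient step: cross-entropy loss at the
    output, gradients propagated back into P1, P2 and P3 at once."""
    global P1, P2, P3
    h, p = forward(img, aud)
    loss = -np.log(p[label])
    dz = p.copy(); dz[label] -= 1.0           # d(cross-entropy)/d(logits)
    dP3 = np.outer(dz, h)
    dh = P3.T @ dz                            # gradient through the combiner...
    dP1 = np.outer(dh[:D_feat], img)          # ...into the image branch
    dP2 = np.outer(dh[D_feat:], aud)          # ...and the audio branch
    P1 -= lr * dP1; P2 -= lr * dP2; P3 -= lr * dP3
    return loss

img, aud, label = rng.normal(size=D_img), rng.normal(size=D_aud), 1
losses = [train_step(img, aud, label) for _ in range(50)]
```

The loss decreases over the iterations, demonstrating that the whole pipeline is trained jointly rather than the branches being trained in isolation.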
What is novel in the present context is performing end-to-end training over the entire age classification system 200 such that the audio subnet 304, image subnet 306 and age classifier 310 are trained simultaneously based on the final output 510 of the age estimation system 200.
The loss function 512 may for example be a “softmax” (cross-entropy) loss function.
Figure 6 shows further details of the age classifier 310 in one example. For the sake of illustration, four age classes labelled 1 to 4 are considered in this example. However as will be appreciated the principles may be applied with any number of age classes. In the example of Figure 6, the age classifier 310 takes the form of a“fully connected” neural network processing layer comprising four neurons (nodes) which are represented as circles numbered 1 to 4. The
age classifier 310 is shown to comprise a fully connected layer with a softmax activation. Each of the nodes 1 to 4 operates directly on the concatenated feature vector 408 which is represented using mathematical notation as h, and computes a weighted sum of the components of h:
z_n = Σ_i w_{n,i} h_i

The set of weights {w_{n,i}} used by node n (corresponding to age class n) are learned during training so as to weight the corresponding features h_i according to their relevance to the age class in question. The weights across the four neurons constitute the model parameters P3 of the age classifier 310 in the present example. Softmax normalisation is then applied across the outputs of the neurons 1 to 4 in order to compute normalised class probabilities for each of those classes. The processing layer is fully connected in that the weighted sum computed by each node n is defined over the entire concatenated feature vector 408, and based on a set of weights {w_{n,i}} unique to that node n, which emphasise the features most relevant to age class n.
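The per-node weighted sums and softmax normalisation can be expressed in a few lines of code. The feature dimension and weights below are toy values chosen for illustration; only the structure (one weight row per age class, softmax over the four node outputs) reflects the layer described above.

```python
import numpy as np

rng = np.random.default_rng(7)

h = rng.normal(size=10)        # concatenated feature vector (408)
W = rng.normal(size=(4, 10))   # one row of weights w_{n,i} per age class n

z = W @ h                      # weighted sum computed by each node n
p = np.exp(z - z.max())
p /= p.sum()                   # softmax normalisation across the four nodes
```

The resulting vector p contains the normalised class probabilities for the four age classes.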
Although Figure 6 shows only a single fully-connected layer, the age-estimation classifier can have a more complex structure. For example, it may comprise multiple fully-connected layers at which one or more intermediate non-linear processing operations are performed (before the final softmax normalization).
Figure 7 is a schematic functional block diagram showing an extension of the above architecture to incorporate a “control gate” 702 for the audio features 405 as part of the feature combiner 308. The control gate 702 is a trainable ML component having a set of model parameters P4. During the above training, these are tuned based on the loss function simultaneously with the other model parameters P1, P2 and P3, using back propagation applied across the full set of overall weights {P1, P2, P3, P4}. In the present example, this is achieved by the control gate 702 applying a learned feature-wise weighted sigmoid function to each audio feature. Representing the audio feature vector 405 in mathematical notation as a, the control gate 702 computes a modified audio feature vector a′ as:
a′_i = sigmoid(W_i a_i) = 1 / (1 + exp(−W_i a_i))

That is, for each component a_i of the audio feature vector, the control gate 702 weights that component by a corresponding weight W_i and computes the sigmoid of the weighted feature. The model parameters P4 of the control gate 702 comprise the set of weights {W_i} across the set of audio features (i.e. across the components of the audio feature vector 405). The concatenated feature vector 408 in this case is the concatenation of the (unaltered) facial features 407 with the modified audio features computed as above.
Incorporating the control gate 702 gives the age estimation system 200 the freedom to place a greater emphasis on certain voice features (or to suppress certain features) depending on how relevant they are to the age-estimation task. Due to its location within the pipeline of the system, the control gate 702 modifies features uniformly across all age classes based on the learned weights {Wj}, which in some circumstances complements the per-class weighting applied at the age classification layer 310 to provide an overall performance improvement.
Figure 8 shows a schematic block diagram that demonstrates some of the principles of data processing within a CNN. Such data processing is applied in the parallel audio and video subnets 304, 306, to their respective audio and image input volumes 402, 104.
A CNN is formed of processing layers and the inputs to and outputs of the processing layers of a CNN are referred to as volumes (see above). Each volume is effectively formed of a stack of two-dimensional arrays each of which may be referred to as a “feature map”.
By way of example, Figure 8 shows a sequence of five such volumes 802, 804, 806, 808 and 810 that may for example be generated through a series of convolution operations and non-linear transformations, and potentially other operations such as pooling, as is known in the art. For reference, two feature maps within the first volume 802 are labelled 802a and 802b respectively, and two feature maps within the fifth volume 810 are labelled 810a and 810b respectively. Herein, (x, y) coordinates refer to locations within a feature map or image as applicable. The z dimension corresponds to the “depth” of the feature map within the applicable volume. A color image (e.g. RGB) may be represented as an input volume of depth three, corresponding to the three color channels, i.e. the value at (x, y, z) is the value of color channel z at location (x, y). A volume generated at a processing layer within a CNN has a depth corresponding to a number of “filters” applied at that layer, where each filter corresponds to a particular feature the CNN learns to recognize.
A CNN differs from a classical neural network architecture in that it has processing layers that are not fully connected. Rather, processing layers are provided that are only partially connected to other processing layer(s). In particular, each node in a convolution layer is connected to only a localized 3D region of the processing layer(s) from which it receives inputs and over which that node performs a convolution with respect to a filter. The nodes to which that node is connected are said to be within a “receptive field” of that filter. The filter is defined by a set of filter weights and the convolution at each node is a weighted sum (weighted according to the filter weights) of the outputs of the nodes within the receptive field of the filter. The localized partial connections from one layer to the next respect (x, y) positions of values within their respective volumes, such that (x, y) position information is at least to some extent preserved within the CNN as data passes through the network. By way of example, Figure 8 shows receptive fields 812, 814, 816 and 818 at example locations within the volumes 802, 804, 806 and 808 respectively. The values within each receptive field are convolved with the applicable filter in order to generate a value in the relevant location in the next output volume.
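The receptive-field computation described above can be made concrete with a naive single-filter 2D convolution in numpy (a minimal sketch with illustrative shapes, not the system's actual layers):

```python
import numpy as np

def conv2d_single(volume, filt):
    # 'volume' has shape (depth, H, W); 'filt' has shape (depth, kh, kw).
    # Each output value is the weighted sum over one localized 3D
    # receptive field, weighted by the filter weights (valid padding,
    # stride 1). One filter produces one feature map.
    d, H, W = volume.shape
    _, kh, kw = filt.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            receptive_field = volume[:, y:y + kh, x:x + kw]
            out[y, x] = np.sum(receptive_field * filt)
    return out

vol = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)  # depth-2 input volume
f = np.ones((2, 3, 3))                                    # one 3x3 filter over both maps
fm = conv2d_single(vol, f)                                # one feature map, shape (3, 3)
```

Stacking the feature maps produced by several such filters yields the next volume, whose depth equals the number of filters, consistent with the description above.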
The model parameters P1, P2 of the audio and video subnets 304, 306 comprise the filter weights applied within those subnets 304, 306 respectively.
Each feature map is determined by convolving a given filter over an input volume. The depth of each convolution layer is thus equal to the number of filters applied at that layer. The input volume itself can have any depth, including one. For example, a colour image 104 may be provided to the image subnet 306 as an input volume of depth three (i.e. as a stack of three 2D arrays, one for each color channel); the MFCC input volume 402 provided to the audio subnet 304 may be a feature map of depth one, i.e. a single 2D array of MFCCs computed as above.
Using an image as an example, when a convolution is applied to the image directly, each filter operates as a low-level structure detector, in that “activations” (i.e. relatively large output values) occur when certain structure is formed by the pixels within the filter’s receptive field (that is, structure which matches a particular filter). However, when convolution is applied to a volume that is itself the result of convolution earlier in the network, each convolution is performed across a set of feature maps for different features; therefore, activations further into the network occur when particular combinations of lower-level features are present within the receptive field. Thus, with each successive convolution, the network is detecting the presence of increasingly high-level structural features corresponding to particular combinations of features from the previous convolution. Hence, in the early layers the network is effectively performing lower-level structure detection, but gradually moves towards higher-level semantic understanding of structure in the deeper layers. These are, in general terms, the broad principles according to which the image subnet 306 learns to extract relevant facial characteristics from image data.
As noted, the image subnet 306 can take static image data or video data as an input. Age estimation based on video data would require only a change to the image subnet architecture, replacing the 2D convolutional layers of the image subnet 306 with 3D layers that perform a spatiotemporal convolution. The principles of the audio subnet 304 are similar: here, the early processing layers of the audio subnet 304 learn to recognize low-level structure within the time-frequency plane, with increasingly high-level feature interpretation occurring deeper into the subnet 304.
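Extending the earlier 2D sketch, a spatiotemporal (3D) convolution additionally slides the filter along the time axis of a stack of video frames, so each output value mixes information from several consecutive frames. A minimal numpy sketch, with purely illustrative shapes:

```python
import numpy as np

def conv3d_single(video, filt):
    # 'video' has shape (T, H, W) - a stack of T grayscale frames.
    # 'filt' has shape (kt, kh, kw) and slides over time as well as
    # space (valid padding, stride 1), so each output value is a
    # weighted sum over a small spatiotemporal receptive field.
    T, H, W = video.shape
    kt, kh, kw = filt.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t + kt, y:y + kh, x:x + kw] * filt)
    return out

clip = np.ones((4, 5, 5))          # illustrative 4-frame clip
f = np.ones((2, 3, 3))             # filter spanning 2 frames
vol = conv3d_single(clip, f)       # output volume, shape (3, 3, 3)
```

The only structural change relative to the 2D case is the extra temporal loop, which is why swapping 2D layers for 3D layers suffices to handle video input.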
The filter weights are learned during training, which is how the network learns what structure to look for. As is known in the art, convolution can be used in conjunction with other operations. For example, pooling (a form of dimensionality reduction) and non-linear transformations (such as ReLU, softmax, sigmoid etc.) are typical operations that are used in conjunction with convolution within a CNN.
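As a minimal sketch of two such companion operations, the following shows ReLU and non-overlapping 2×2 max pooling in numpy (illustrative only, not the system's actual layer configuration):

```python
import numpy as np

def relu(x):
    # Element-wise non-linearity: negative activations are zeroed.
    return np.maximum(0.0, x)

def max_pool2x2(fmap):
    # Non-overlapping 2x2 max pooling: halves each spatial dimension
    # by keeping only the strongest activation in each 2x2 block.
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.array([[ 1., -2.,  3.,  0.],
               [ 4.,  5., -6.,  7.],
               [ 0.,  1.,  2.,  3.],
               [-1.,  8.,  9., -4.]])
pooled = max_pool2x2(relu(fm))   # shape (2, 2)
```

Pooling discards exact (x, y) positions within each block while retaining the presence of the detected feature, which is one reason deeper layers trade positional precision for semantic content.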
With reference to Figure 9, the present age-estimation technology can be applied in conjunction with anti-spoofing technology, also referred to as liveness detection. That is, with an additional check or checks to ensure that the entity from which the audio and image data 102, 104 is captured is an actual human being, as opposed to a spoofing entity designed to appear as such. A spoofing entity could for example be a printed or electronically-displayed image of a human, a 3D model etc. which is presented to the image capture device 103 whilst simultaneously playing back pre-recorded audio (for example).
Figure 10 shows an example of an anti-spoofing system 900 which operates in conjunction with the age estimation system 200. The anti-spoofing in this example is based on a “random challenge”. A word selection component 902 randomly selects a word or sequence of random words to be spoken by the user 100, which is provided to the UI 108 for outputting to the user 100. The user 100 is instructed via the UI 108 to speak the randomly-selected word(s) whilst the audio and image data 102, 104 is captured. In this case, the image data 104 is video data which captures the user’s lip movements as the words are spoken.
The captured audio and video data 102, 104, or portions thereof, are used as a basis for automated age estimation by the age estimation system 200 as described above.
Additionally, at least the video data 104, or a portion (e.g. a frame or frames) thereof, is provided to the anti-spoofing system 900, where a lip reading algorithm is applied to it by a video processing component 904. The results of the lip reading algorithm are provided to a comparison component 908, which compares them with the randomly-selected word(s) to determine whether or not the user 100 has spoken the randomly-selected word/sequence of words as expected.
This random challenge makes it very difficult for the user 100 to spoof the access control process using pre-recorded audio and video.
As well as receiving the age estimate data 210 from the age estimation system 200, the access controller also receives an anti-spoofing output from the comparison component 908, which conveys the result of the anti-spoofing comparison. The access controller uses these outputs to regulate access to the age-restricted function 204, granting access only when the user is found to meet the associated age requirement and passes the anti-spoofing test.
Alternatively, or in addition, speech recognition may be applied to the audio data 102, or a portion thereof, by a speech recognition component 906 of the anti-spoofing system 900, to identify voice patterns exhibited therein. The results can be used in exactly the same way to verify that the user 100 has spoken the correct word(s). It is noted, however, that this is not essential - lip reading on the image data is sufficient for anti-spoofing, hence the speech recognition component 906 may be omitted from the anti-spoofing system 900.
A simpler anti-spoofing check, which also provides a reasonable level of robustness, is a comparison of the audio and video data 102, 104 to simply ensure that the video data 104 contains lip-movements which match speech contained in the audio data 102 (without necessarily prompting the user 100 to speak specific words). This prevents the age-estimation from being spoofed using unrelated audio and video.
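One possible form of such a lip-movement/speech consistency check is a simple correlation between a per-frame lip-motion signal and a per-frame audio-energy signal. The sketch below is entirely hypothetical - the source does not specify how the signals are extracted or compared, and the threshold value is an assumption:

```python
import numpy as np

def activity_correlation(lip_motion, audio_energy):
    # Hypothetical consistency measure: normalised correlation between
    # a per-frame lip-motion signal and a per-frame audio-energy signal.
    # Unrelated audio and video should correlate poorly.
    lm = (lip_motion - lip_motion.mean()) / (lip_motion.std() + 1e-8)
    ae = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    return float(np.mean(lm * ae))

def passes_liveness(lip_motion, audio_energy, threshold=0.5):
    # 'threshold' is an illustrative value, not from the source.
    return activity_correlation(lip_motion, audio_energy) >= threshold

t = np.linspace(0, 1, 50)
speech = np.abs(np.sin(8 * np.pi * t))          # mouth opening tracks speech bursts
matched = passes_liveness(speech, speech + 0.05)     # consistent pair
unrelated = passes_liveness(speech, np.linspace(1, 0, 50))  # mismatched pair
```

In a real deployment the lip-motion signal would come from a face/landmark tracker and the energy signal from the captured audio, but the comparison principle is the same: genuine capture yields strongly correlated activity, while pre-recorded or unrelated streams do not.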
It will be appreciated that the above embodiments have been described only by way of example. Other embodiments and applications of the present invention will be apparent to the person skilled in the art in view of the teaching presented herein. The present invention is not limited by the described embodiments, but only by the accompanying claims.