WO2024012686A1 - Method and device for age estimation

Info

Publication number
WO2024012686A1
WO2024012686A1 (PCT/EP2022/069772)
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
network model
clothes
age
Prior art date
Application number
PCT/EP2022/069772
Other languages
French (fr)
Inventor
Mohammed-En-Nadhir ZIGHEM
Abdenour HADID
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/069772
Publication of WO2024012686A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178: Human faces, e.g. facial parts, sketches or expressions; estimating age from face image; using age information for improving recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

The present invention generally relates to age estimation. A method and a device are proposed to estimate age from clothing. The method comprises: obtaining a human image; cropping one or more body parts from the image; segmenting a clothes image from the body part(s); removing background pixels from the clothes image; and providing the clothes image as an input to a neural network, which may be based on an 18-layer residual network, to obtain an output. The output indicates an age category of the human. Age estimation based on clothing provides an additional or alternative way of estimating age, especially when no face is available in the image. The age estimation in this disclosure uses machine learning to minimize biases and analyzes the context of the clothes for better prediction. The age estimation can be used in a search engine to filter child abuse content.

Description

METHOD AND DEVICE FOR AGE ESTIMATION
TECHNICAL FIELD
The present disclosure generally relates to computer technology. For instance, the disclosure relates to image processing. For a further instance, the disclosure relates to a method and a device for performing age estimation.
BACKGROUND
Human and/or face recognition is widely used in computer vision. For instance, a human body and/or human face can be detected in an image or a video with the help of machine learning techniques. Based on the detected human body and/or human face, the age of the human can be estimated. Various machine learning-based solutions, e.g., neural networks, have been used for age estimation based on the detected human body and/or human face. The estimated age can be used in various application scenarios, such as providing censored search results, filtering child abuse content, and restricting services to a minor (underage person).
SUMMARY
Conventional age estimation is based on the human face and/or uncovered human body. When there is no human face or uncovered human body detectable, e.g., due to obstruction or occlusion, age estimation does not work, because there is no valid input.
In view of the above, there is a need for an improved age estimation method.
An objective of this disclosure is to facilitate performing age estimation even when there is no human face or uncovered human body detectable in an image of the human. This disclosure aims for performing age estimation based on clothing.
These and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the drawings.
A first aspect of the present disclosure provides a method for age estimation. The method is performed by a device and comprises the following steps: obtaining an image of a human; cropping one or more body parts from the image; segmenting clothes from the cropped one or more body parts, to obtain a clothes image; removing background pixels from the clothes image; providing the clothes image as an input to a neural network model; and obtaining an output of the neural network model, wherein the output of the neural network model indicates an age category of the human.
Optionally, the image of the human may be a photo or a frame of a video. The image (or the video) may be obtained by a camera of the device. Alternatively, the image may be obtained by the device from an outside source, e.g., via a communications link.
Optionally, the one or more body parts may comprise one or more of human torso and limbs. The limbs may comprise one or more of a human arm and leg. The cropped one or more body parts may be completely covered with clothes, or partly covered with clothes.
Optionally, the segmented clothes may be understood as garments and may further comprise ornaments worn by the human.
Optionally, the age category of the human may comprise two or more age groups. For instance, the two or more age groups may comprise minor (below the age of 18) and adult (at the age of 18 or above). Alternatively, the two or more age groups may comprise child, adolescence (teen), adult, and senior adult.
The method of the first aspect enables efficient age estimation based on clothing. In this way, the method facilitates performing age estimation even when there is no human face or uncovered human body detectable in the image of the human.
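As an illustration only, the following is a minimal sketch of the first-aspect method in PyTorch (the framework used for the example model of FIG. 10). The helper callables (crop_body_parts, segment_clothes, remove_background, to_tensor) are placeholders introduced for this sketch, not components defined by the disclosure, which leaves the concrete image processing techniques open.

```python
import torch

def estimate_age_category(image, model, crop_body_parts, segment_clothes,
                          remove_background, to_tensor):
    """Sketch of the first-aspect method; all helpers are placeholders."""
    parts = crop_body_parts(image)          # crop one or more body parts
    clothes = segment_clothes(parts)        # obtain a clothes image
    clothes = remove_background(clothes)    # remove background pixels
    x = to_tensor(clothes).unsqueeze(0)     # 1 x C x H x W input batch
    with torch.no_grad():
        logits = model(x)                   # neural network model
    return int(logits.argmax(dim=1))        # index of the age category
```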
In an implementation form of the first aspect, the neural network model may be based on an 18-layer residual neural network (ResNet-18) model.
Optionally, the neural network model may share a similar architecture to the ResNet-18 model. The architecture of ResNet-18 may provide satisfactory performance with a remarkably low error rate. In an implementation form of the first aspect, the neural network model may comprise three convolutional layers and three feature refinement blocks (FRBs). The step of obtaining the output of the neural network model may comprise: extracting three feature representations of the clothes image from three outputs of the three convolutional layers; and providing the three feature representations as inputs to the three feature refinement blocks, to obtain three refined features from three outputs of the three feature refinement blocks.
Optionally, each FRB may be used to compress and recalibrate extracted features. Each FRB is adapted to learn a weighted vector from different block features during a training phase. The weighted vector serves as an attention vector to recalibrate the features output by the three convolutional layers during an inference phase. In this way, features that are useful for classifying small objects can be selected. Each FRB may be followed by a global max pooling layer at the end, to capture global context information and obtain a compressed feature vector.
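The disclosure does not spell out the internals of an FRB. The following is a minimal sketch consistent with the description above (a learned vector applied as attention to recalibrate features, followed by global max pooling into a compressed vector), assuming a squeeze-and-excitation-style design; the reduction ratio is likewise an assumption.

```python
import torch
import torch.nn as nn

class FeatureRefinementBlock(nn.Module):
    """Assumed SE-style FRB: learn an attention vector, recalibrate the
    input features with it, then compress them by global max pooling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.attention = nn.Sequential(           # learned weighted vector
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)       # global max pooling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.attention(x)                     # attention vector
        refined = x * w                           # recalibrated features
        return self.pool(refined).flatten(1)      # compressed feature vector
```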
In an implementation form of the first aspect, the method may further comprise concatenating the three refined features and an output of the first convolutional layer of the three convolutional layers, to obtain a final feature representation of the clothes image.
Optionally, the three refined features may be elementwise multiplied before being concatenated.
In an implementation form of the first aspect, the neural network model may further comprise a fully-connected (FC) layer. The method may further comprise: providing the final feature representation of the clothes as an input to the fully-connected layer; and obtaining an output of the fully-connected layer as the output of the neural network model.
In an implementation form of the first aspect, the image of the human may not include a face of the human. Optionally, the face of the human may not be completely visible. Alternatively, the image quality of the face of the human may be so poor that age estimation based on the face is not feasible.
A second aspect of the present disclosure provides a device for age estimation. The device is configured to: obtain an image of a human; crop one or more body parts from the image; segment clothes from the cropped one or more body parts, to obtain a clothes image; remove background pixels from the clothes image; provide the clothes image as an input to a neural network model implemented in the device; and obtain an output of the neural network model, wherein the output of the neural network model indicates an age category of the human.
In an implementation form of the second aspect, the neural network model may be based on a ResNet-18 model.
In an implementation form of the second aspect, the neural network model may comprise three convolutional layers and three feature refinement blocks, and for obtaining the output of the neural network model, the device is configured to: extract three feature representations of the clothes image from three outputs of the three convolutional layers; provide the three feature representations as inputs to the three feature refinement blocks, to obtain three refined features from three outputs of the three feature refinement blocks.
In an implementation form of the second aspect, the device may be further configured to concatenate the three refined features and an output of the first convolutional layer of the three convolutional layers, to obtain a final feature representation of the clothes image.
In an implementation form of the second aspect, the neural network model may comprise a fully-connected layer, and the device may be further configured to: provide the final feature representation of the clothes as an input to the fully-connected layer; and obtain an output of the fully-connected layer as the output of the neural network model. In an implementation form of the second aspect, the image of the human may not include a face of the human.
The device of the second aspect is able to perform age estimation based on clothing in an efficient manner. In this way, the device is able to perform age estimation even when there is no human face or uncovered human body detectable in the image of the human.
A third aspect of the present disclosure provides a computer program comprising a program code for performing the method according to the first aspect or any of its implementation forms.
A fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the first aspect or any of its implementation forms to be performed.
A tenth aspect of the present disclosure provides a chipset comprising a memory and a processor, which are configured to store and execute program code to perform the method according to the first aspect or any of its implementation forms.
It has to be noted that all devices, elements, units, and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear to a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which FIG. 1 illustrates an example of a device according to this disclosure;
FIG. 2 shows an example of a process for age estimation according to this invention;
FIG. 3 illustrates a schematic view of a neural network model according to this disclosure;
FIG. 4 shows a diagram of a process according to this disclosure;
FIG. 5 shows an example of an application scenario according to this disclosure;
FIG. 6 illustrates an example of filtering search results according to this disclosure;
FIG. 7 illustrates an example of a further application scenario according to this disclosure;
FIG. 8 illustrates an example of verifying age performed by a vending machine according to this disclosure;
FIG. 9 shows a diagram of a method for age estimation according to this disclosure; and FIG. 10 shows an example of a neural network model according to this disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
This disclosure generally relates to human age estimation based on clothing, which works without the presence of faces or bodies. Given an image or a video containing clothing, the age (or at least the age group) of persons present in the image or video can be roughly estimated. In order to better estimate the age of a person from clothing, it is crucial to analyze the nature of the clothes and the context of the clothing.
FIG. 1 illustrates an example of a device 100 according to this disclosure. The device 100 is configured to obtain an image 110 of a human (or a person). In this disclosure, the image 110 of the human may be simply referred to as a human image 110. Optionally, the human image 110 may be a frame of a video. Optionally, the human image 110 may be understood as a picture containing at least one human. Optionally, the device 100 may comprise a camera (not shown in FIG. 1) configured to capture the human image 110. Alternatively, the device 100 may receive the human image 110 from a further device.
The device 100 is configured to crop one or more body parts from the human image 110. The device 100 is configured to segment clothes from the cropped one or more body parts, to obtain a clothes image 130. Optionally, the one or more body parts may comprise one or more of the human torso and limbs (e.g., arms and legs). The segmented clothes may comprise one or more of a piece of clothing, headwear, and jewelry worn by the person. The device 100 is configured to remove background pixels from the obtained clothes image 130. This is to reduce noise and improve the precision of the age estimation. Optionally, the device 100 may comprise an image processing module 120 adapted to obtain the clothes image 130 based on the human image 110. Any means commonly known in the field of image processing that can be adapted to segment the clothes from the human image 110 may be used to obtain the clothes image 130.
The device 100 is configured to provide the clothes image 130 as an input to a neural network model 140 and obtain an output 150 of the neural network model 140. The neural network model 140 may be simply referred to as a neural network 140. The output 150 of the neural network model 140 indicates an age category of the human. The neural network model 140 may be comprised in the device 100 as a machine learning model adapted to predict (or estimate) the age category of the human based on the obtained clothes image 130. The age category may be understood as an age group and may not necessarily indicate an exact age of the human.
Optionally, the age category may comprise at least two groups. For instance, the age category may comprise majority (e.g., an adult) and minority (e.g., a minor). In most cases, the age of majority may be 18 and can be adjusted according to different areas and nations. Optionally, the age category can be further refined to comprise more detailed groups. For a further example, the age category may comprise a child group (e.g., ages 0-12), teen (e.g., ages 13-17), adult (e.g., ages 18-59), and senior (e.g., older than 60). It is noted that the age ranges given in the above examples are for illustration purposes only and can be adjusted, e.g., according to different purposes, application scenarios, and/or needs.
Optionally, the neural network model 140 may be trained through a training phase, so that the trained neural network model 140 is fit for performing the age estimation based on the clothes image. For training the neural network 140, a number (e.g., thousands) of images of clothes may be randomly collected and labelled according to desired age categories as a training data set. Various persons may be involved in labelling the collected images of clothes so as to remove biases as far as possible. Any common training techniques known in the field of machine learning may be used to train the neural network 140.
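By way of example, a generic supervised training loop for such a model could look as follows. The disclosure does not fix the loss function, optimiser, or hyperparameters; cross-entropy, Adam, and the values below are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, dataset, num_epochs: int = 10):
    """Train on (clothes_image, age_label) pairs; hyperparameters are
    illustrative assumptions, not values from the disclosure."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(num_epochs):
        for clothes_images, age_labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(clothes_images), age_labels)
            loss.backward()
            optimizer.step()
```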
Optionally, the neural network model 140 may comprise a feature extraction model 142 and a classification model 144. The feature extraction model 142 may be adapted to extract features based on the clothes image 130. Optionally, the feature extraction model 142 may comprise three convolutional layers and three FRBs. The three convolutional layers may be used to extract three feature representations of the clothes image. The three feature representations may be provided as inputs to the three FRBs, to obtain three refined features. The three refined features may be processed and fed as an input to the classification model 144. The classification model may be adapted to classify the processed features into a corresponding age category, e.g., via an FC layer through max pooling.
Optionally, the neural network model 140 may be based on a ResNet-18 model. A residual network (ResNet) is a convolutional neural network (CNN) architecture known for very low error rates, close to the human error rate. ResNets can be trained with network depths ranging from a small model with 18 layers to a complex model with 152 layers (e.g., 18, 34, 50, 101, or 152 layers). The number of layers indicates the total number of weight layers in the ResNet. In this disclosure, a ResNet with 18 layers may be used as a basis to build the neural network model.
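In torchvision the ResNet-18 residual stages are named layer1..layer4 (corresponding to conv2_x..conv5_x). A sketch of a truncated ResNet-18 backbone of the kind used in FIG. 2 below follows; the precise correspondence between this cut point and the res_conv3_1 layer described there is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet18

# Truncated ResNet-18 backbone: keep the stem and the first two residual
# stages. Where exactly the res_conv3_1 cut point of FIG. 2 falls in the
# torchvision naming is an assumption of this sketch.
resnet = resnet18(weights=None)
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1,   # conv2_x, 64 output channels
    resnet.layer2,   # conv3_x, 128 output channels at stride 8
)
```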
The device 100 may be used for various application scenarios. For example, the device 100 may be a mobile device as a content distribution terminal. The mobile device may receive delivered content, such as streaming music and video, from a content provider. Conventionally, the device 100 may ask a user to manually input his/her age for verification. However, there is no measure to control whether the manually input age information is authentic. Therefore, some explicit content or content restricted to a certain age group may be inappropriately accessed. The device 100 may be adapted to capture images of the user to verify the age according to this disclosure.
Similarly, for a further example, the device 100 may be a mobile device used as a gaming console. Due to regulations in some regions, games may carry different age ratings or gaming time may be restricted for a certain age group. The device 100 may be adapted to capture an image of the user to verify the age before and/or during a video (or mobile) game according to this disclosure.
FIG. 2 shows an example of a process for age estimation according to this invention. The process can be applied to the device 100 of FIG. 1. Similar elements share the same functions and features. Arrows between elements in FIG. 2 indicate information flows between the elements. As illustrated in FIG. 2, image processing 220 is applied to an input image 210 to obtain a clothes image. The clothes image is then fed as an input to a convolutional neural network model 240. The convolutional neural network 240 may comprise a feature extraction part 242 and a classification part 244 with a dropout layer. The convolutional neural network 240 may be built based on a ResNet-18 model. For example, the convolutional neural network 240 may comprise the part of the ResNet-18 model that is before a res_conv3_1 layer as a backbone. The convolutional neural network 240 may further comprise, after the res_conv3_1 layer of the ResNet-18 model, three different branches and three FRBs, e.g., in the feature extraction part 242. Three feature representations (or features) x1, x2, and x3 of the clothes appearing in the clothes image may be extracted from the three outputs of the three different branches following the res_conv3_1 layer of the ResNet-18 model. It is noted that the res_conv3_1 layer of the ResNet-18 model may refer to a layer of the ResNet-18 model that is adapted to perform 3×1 convolution with a fixed feature map dimension of 256. The three different branches may comprise three independent convolutional layers, each adapted to extract features. The three feature representations x1, x2 and x3 may then be processed by the three FRBs respectively, to obtain three refined features, P1, P2 and P3.
Optionally, each FRB may be adapted to extract features with respect to the nature of the clothes and the context of the clothes for better prediction. This is because the nature and context of the clothes are more robust to region, weather, indoor/outdoor, and cultural differences.
Optionally, each FRB may be used to compress and recalibrate the extracted feature representations. Each FRB may be adapted to learn a weighted vector from different block features during a training phase. The weighted vector may serve as an attention vector to recalibrate features of the outputs of the three different branches during an inference phase. In this way, features that are useful for classifying small objects can be selected.
Optionally, the three refined features may be elementwise multiplied and may be then concatenated with an output of the first convolutional layer, to obtain a final feature representation 243 of the clothes image.
The final feature representation 243 may be provided as an input to the classification part 244 comprising an FC layer, to obtain an output 250 of the FC layer. The output 250 indicates an age category of the human. FIG. 3 illustrates a schematic view of a neural network model according to this disclosure. Arrows between elements in FIG. 3 indicate information flows between the elements. The neural network model in FIG. 3 may be built based on a ResNet-18 model. The neural network model may comprise a part of the ResNet-18 model as a backbone, e.g., the part that is before a res_conv3_1 layer (or simply, conv3_1) of the ResNet-18 model. Then, after the res_conv3_1 layer, the neural network model may further comprise three different branches, each branch adapted to extract one of the features x1, x2 and x3. The extracted features may be processed by the three FRBs in the same way as described with respect to FIG. 2. It is noted that the neural network model shown in FIG. 3 may be applied to FIG. 1 and FIG. 2.
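Combining the backbone and FRB sketches above, the architecture of FIGs. 2 and 3 might be assembled as follows. The 3x3 branch convolutions, the channel sizes, the dropout rate, and the global pooling applied to the first branch output before concatenation are all assumptions of this sketch, as the drawings fix none of them.

```python
import torch
import torch.nn as nn

# FeatureRefinementBlock is the sketch defined earlier in this document.

class ClothingAgeNet(nn.Module):
    """Sketch of FIGs. 2/3: backbone, three branch convolutions, three
    FRBs, elementwise fusion, concatenation and an FC classifier."""
    def __init__(self, backbone: nn.Module, in_ch: int = 128,
                 branch_ch: int = 256, num_classes: int = 4):
        super().__init__()
        self.backbone = backbone
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1),
                          nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
            for _ in range(3)])
        self.frbs = nn.ModuleList(
            [FeatureRefinementBlock(branch_ch) for _ in range(3)])
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                       # classification part 244
            nn.Linear(2 * branch_ch, num_classes))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)                    # shared backbone
        x1, x2, x3 = (b(feat) for b in self.branches)
        p1, p2, p3 = (f(x) for f, x in zip(self.frbs, (x1, x2, x3)))
        fused = p1 * p2 * p3                         # elementwise multiply
        first = self.pool(x1).flatten(1)             # first branch output
        final = torch.cat([fused, first], dim=1)     # final representation 243
        return self.classifier(final)                # age-category logits
```

For instance, combined with the truncated backbone above, `ClothingAgeNet(backbone, num_classes=4)` would classify into the four example age groups (child, teen, adult, senior).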
FIG. 4 shows a diagram of a process 400 according to this disclosure. The process 400 is as follows.
Step 401: Obtaining an input image (or a video).
Step 402: Detecting the human body from the input image.
Step 403: After the human body is detected, checking whether the face of the human is detected. It is noted that the face can be seen as a part of the human body. Therefore, a general human body detection is applied.
Step 404: If the face is detected, predicting age from the face.
Step 405: If no face is detected, predicting age from the body (which includes human body parts other than the face).
Step 406: After step 402, detecting clothing from the human body.
Step 407: Predicting age from clothing.
It is noted that various aspects according to the present disclosure as mentioned in FIGs. 1-3 can be applied to steps 406 and 407. It can be seen that age estimation from clothing according to this disclosure can be applied no matter whether the face is detected or not. Therefore, the age estimation from clothing according to this disclosure may be used as an independent measure to predict age, as an alternative measure to predict age when the face is not detectable, or as an additional (e.g., verification, anti-counterfeiting) measure used in addition to the age prediction from the face and/or body.
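A control-flow sketch of process 400 follows; all detector and predictor callables are placeholders standing in for components described elsewhere in this disclosure.

```python
def process_400(image, detect_body, detect_face, detect_clothing,
                age_from_face, age_from_body, age_from_clothing):
    """Returns a (primary, clothing-based) pair of age predictions;
    all callables are placeholders."""
    body = detect_body(image)                          # step 402
    if body is None:
        return None                                    # no human detected
    face = detect_face(body)                           # step 403
    if face is not None:
        primary = age_from_face(face)                  # step 404
    else:
        primary = age_from_body(body)                  # step 405
    clothing = detect_clothing(body)                   # step 406
    secondary = age_from_clothing(clothing)            # step 407
    return primary, secondary
```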
FIG. 5 shows an example of an application scenario according to this disclosure.
The application scenario illustrates a safe content searching method 500, which can filter child abuse content from search results.
The method 500 comprises the following steps.
Step 501: A user searches for unrestricted content on a search engine.
The unrestricted content may sometimes contain inappropriate content such as adult content, child abuse images and/or videos.
Step 502: The search engine performs age prediction on images of the search results, to determine the age group of persons in the images.
Step 503: Based on the determined age group, the search engine determines whether there is child abuse content in the search results.
Step 504: If there is any child abuse content, hide the corresponding image from the search results.
It is noted that various aspects according to the present disclosure as mentioned in FIGs. 1-4 can be applied to step 503. It can be seen that the present disclosure may be particularly useful for filtering child abuse (e.g., child pornography) content.
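A sketch of method 500 as a filtering loop; the category labels and both callables are placeholders for this illustration, not names from the disclosure.

```python
MINOR_CATEGORIES = {"child", "teen"}   # hypothetical category labels

def filter_search_results(images, predict_age_category, is_abusive_content):
    """Hide images classified as child abuse content (steps 502-504);
    both callables are placeholders."""
    safe_results = []
    for image in images:
        category = predict_age_category(image)              # step 502
        if category in MINOR_CATEGORIES and is_abusive_content(image):
            continue                                        # step 504: hide
        safe_results.append(image)                          # passed step 503
    return safe_results
```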
FIG. 6 illustrates an example of filtering search results according to this disclosure. The filtering is based on the method 500 in FIG. 5. It can be seen that several inappropriate images have been filtered out of the final search results. FIG. 7 illustrates an example of a further application scenario according to this disclosure. FIG. 7 illustrates a vending machine on the left-hand side. On the right-hand side, a flow of method steps performed with respect to the vending machine is shown.
The vending machine is adapted to estimate the age of a buyer to allow or deny a sale. For example, if the vending machine sells alcoholic and/or tobacco products, the vending machine may need to verify the age of the buyer. A conventional verification method based on an identity (ID) card can be easily bypassed. The vending machine in FIG. 7 comprises at least one camera. For example, a front camera and a rear camera are shown in FIG. 7. The vending machine is adapted to obtain one or more images of the buyer from the at least one camera. Then, the vending machine is adapted to estimate the age of the buyer according to various aspects of this disclosure as mentioned with respect to FIGs. 1-4, as an independent, additional, or alternative measure to verify the age of the buyer. An example of a method used by the vending machine to verify the age of the buyer is illustrated on the right-hand side of FIG. 7. An advantage of the vending machine according to this disclosure is that, even if the buyer intentionally covers his/her face to avoid age verification, the vending machine can still estimate his/her age based on the clothing. Optionally, the vending machine may further comprise liveness detection as an anti-spoofing measure.
FIG. 8 illustrates an example of verifying age performed by a vending machine according to this disclosure. The image of a customer is taken and the body of the customer and one or more objects are detected in the image. A cropped image of an object identified with high confidence is created. Then, age prediction is performed, according to this disclosure, based on the image of the body, the cropped object image, and surrounding objects in the image. If, for example, the customer is predicted to be a minor, the order of the customer at the vending machine can be cancelled.
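A sketch of the FIG. 8 verification flow under the same caveats; the detection objects with crop and confidence attributes, the threshold, and the "minor" label are all hypothetical.

```python
def verify_buyer_age(camera_image, detector, predict_age_category,
                     confidence_threshold: float = 0.8):
    """Return True if the sale may proceed, False to cancel the order;
    detector and predictor are placeholders."""
    detections = detector(camera_image)       # body and surrounding objects
    crops = [d.crop for d in detections       # keep high-confidence objects
             if d.confidence >= confidence_threshold]
    category = predict_age_category(camera_image, crops)
    return category != "minor"                # cancel for predicted minors
```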
FIG. 9 shows a diagram of a method 900 for age estimation according to this disclosure. The method 900 is performed by a device and comprises the following steps: step 901 : obtaining an image of a human; step 902: cropping one or more body parts from the image; step 903: segmenting clothes from the cropped one or more body parts, to obtain a clothes image; step 904: removing background pixels from the clothes image; step 905: providing the clothes image as an input to a neural network model; and step 906: obtaining an output of the neural network model, wherein the output of the neural network model indicates an age category of the human.
It is noted that the steps of the method 900 may share the same functions and details from the perspective of FIG. 1-4 described above. Therefore, the corresponding method implementations are not described again at this point.
FIG. 10 shows an example of a neural network model according to this disclosure. The neural network model shown in FIG. 10 is presented based on the PyTorch framework. It is noted that the neural network model in FIG. 10 is merely given as an illustrative example. The neural network model that can be used in this disclosure shall not be limited to FIG. 10. The neural network model in FIG. 10 may be applied to FIGs. 1-9 in this disclosure.
The devices in the present disclosure may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the devices described herein, respectively. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. Optionally, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the devices to perform, conduct or initiate the operations or methods described herein, respectively.
For example, a device according to this disclosure may be an electronic device capable of computing, such as a computer, a server, a tablet, a mobile terminal, a graphics processing unit, a neural processing unit, and the like.
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by persons skilled in the art in practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A method (900) for age estimation, the method comprising: obtaining (901) an image (110) of a human; cropping (902) one or more body parts from the image (110); segmenting (903) clothes from the cropped one or more body parts, to obtain a clothes image (130); removing (904) background pixels from the clothes image (130); providing (905) the clothes image (130) as an input to a neural network model (140); and obtaining (906) an output (150) of the neural network model, wherein the output (150) of the neural network model indicates an age category of the human.
2. The method (900) according to claim 1, wherein the neural network model (140) is based on an 18-layer residual neural network, ResNet-18, model.
3. The method (900) according to claim 1 or 2, wherein the neural network model (140) comprises three convolutional layers and three feature refinement blocks, and the obtaining (906) of the output (150) of the neural network model comprises: extracting three feature representations of the clothes image (130) from three outputs of the three convolutional layers; and providing the three feature representations as inputs to the three feature refinement blocks, to obtain three refined features from three outputs of the three feature refinement blocks.
4. The method (900) according to claim 3, wherein the method further comprises concatenating the three refined features and an output of the first convolutional layer of the three convolutional layers, to obtain a final feature representation of the clothes image (130).
5. The method (900) according to claim 4, wherein the neural network model (140) comprises a fully-connected layer (144), and the method further comprises: providing the final feature representation of the clothes as an input to the fully-connected layer (144); and obtaining an output of the fully-connected layer as the output (150) of the neural network model.
6. The method (900) according to any one of claims 1 to 5, wherein the image (110) of the human does not include a face of the human.
7. A device (100) for age estimation, the device (100) being configured to: obtain an image (110) of a human; crop one or more body parts from the image (110); segment clothes from the cropped one or more body parts, to obtain a clothes image (130); remove background pixels from the clothes image (130); provide the clothes image (130) as an input to a neural network model (140) implemented in the device; and obtain an output (150) of the neural network model, wherein the output (150) of the neural network model indicates an age category of the human.
8. The device (100) according to claim 7, wherein the neural network model (140) is based on an 18-layer residual neural network, ResNet-18, model.
9. The device (100) according to claim 7 or 8, wherein the neural network model (140) comprises three convolutional layers and three feature refinement blocks, and for obtaining the output (150) of the neural network model, the device is configured to: extract three feature representations of the clothes image (130) from three outputs of the three convolutional layers; and provide the three feature representations as inputs to the three feature refinement blocks, to obtain three refined features from three outputs of the three feature refinement blocks.
10. The device (100) according to claim 9, wherein the device is further configured to concatenate the three refined features and an output of the first convolutional layer of the three convolutional layers, to obtain a final feature representation of the clothes image (130).
11. The device (100) according to claim 10, wherein the neural network model (140) comprises a fully-connected layer, and the device is further configured to: provide the final feature representation of the clothes as an input to the fully-connected layer; and obtain an output of the fully-connected layer as the output (150) of the neural network model.
12. The device (100) according to any one of claims 7 to 11, wherein the image (110) of the human does not include a face of the human.
13. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the method according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/069772 WO2024012686A1 (en) 2022-07-14 2022-07-14 Method and device for age estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/069772 WO2024012686A1 (en) 2022-07-14 2022-07-14 Method and device for age estimation

Publications (1)

Publication Number Publication Date
WO2024012686A1 (en)

Family

ID=82850384

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/069772 WO2024012686A1 (en) 2022-07-14 2022-07-14 Method and device for age estimation

Country Status (1)

Country Link
WO (1) WO2024012686A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120314957A1 (en) * 2011-06-13 2012-12-13 Sony Corporation Information processing apparatus, information processing method, and program
CN108010049A (en) * 2017-11-09 2018-05-08 华南理工大学 Split the method in human hand region in stop-motion animation using full convolutional neural networks
US20200160533A1 (en) * 2018-11-15 2020-05-21 Samsung Electronics Co., Ltd. Foreground-background-aware atrous multiscale network for disparity estimation
CN111062752A (en) * 2019-12-13 2020-04-24 浙江新再灵科技股份有限公司 Elevator scene advertisement putting method and system based on audience group

Similar Documents

Publication Publication Date Title
CN109711243B (en) Static three-dimensional face in-vivo detection method based on deep learning
US9104914B1 (en) Object detection with false positive filtering
CN111639616B (en) Heavy identity recognition method based on deep learning
CN110428399B (en) Method, apparatus, device and storage medium for detecting image
CN104715023A (en) Commodity recommendation method and system based on video content
CN108171158B (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN108009466B (en) Pedestrian detection method and device
CN107169458B (en) Data processing method, device and storage medium
CN108416902A (en) Real-time object identification method based on difference identification and device
CN111914775B (en) Living body detection method, living body detection device, electronic equipment and storage medium
CN111079816A (en) Image auditing method and device and server
CN111104925A (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN109815823B (en) Data processing method and related product
CN108108711A (en) Face supervision method, electronic equipment and storage medium
CN111339884A (en) Image recognition method and related equipment and device
CN114677607A (en) Real-time pedestrian counting method and device based on face recognition
CN111178221A (en) Identity recognition method and device
CN111476070A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111091089B (en) Face image processing method and device, electronic equipment and storage medium
CN112800923A (en) Human body image quality detection method and device, electronic equipment and storage medium
CN112651366A (en) Method and device for processing number of people in passenger flow, electronic equipment and storage medium
WO2024012686A1 (en) Method and device for age estimation
CN115908831B (en) Image detection method and device
Mr et al. Developing a novel technique to match composite sketches with images captured by unmanned aerial vehicle
Song et al. Face anti-spoofing detection using least square weight fusion of channel-based feature classifiers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22753618

Country of ref document: EP

Kind code of ref document: A1