CN107909026B - Small-scale convolutional neural network based age and/or gender assessment method and system - Google Patents


Info

Publication number
CN107909026B
Authority
CN
China
Prior art keywords
size
image
input
module
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711118413.XA
Other languages
Chinese (zh)
Other versions
CN107909026A (en)
Inventor
Wang Xing (王星)
Mehdi Seyfi (梅迪·塞伊菲)
Chen Minghua (陈明华)
Wu Qianwei (吴谦伟)
Liang Jie (梁杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aotong Technology Co ltd
Original Assignee
Shenzhen Aotong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/724,256 external-priority patent/US10558908B2/en
Application filed by Shenzhen Aotong Technology Co ltd
Publication of CN107909026A
Application granted
Publication of CN107909026B

Classifications

    • G06V 40/172 Classification, e.g. identification (human faces in image or video data)
    • G06F 18/2148 Generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G06F 18/24 Classification techniques
    • G06N 3/045 Combinations of networks
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 40/178 Estimating age from face image; using age information for improving recognition

Abstract

The various embodiments described herein provide examples of systems that can perform age and gender assessment on facial images whose size exceeds the maximum number of input pixels that a given small-scale hardware Convolutional Neural Network (CNN) module can support. In some embodiments, the age and gender assessment system first segments the high-resolution input face image into a set of appropriately sized image blocks, each with a carefully designed overlap between adjacent image blocks. Each image block is then processed separately by a small-scale CNN module, e.g., the embedded CNN module of the Hi3519 SoC. The outputs corresponding to the set of image blocks are then combined to obtain an output corresponding to the input face image, and the combined output may be further processed by subsequent layers in the age and gender assessment system to generate an age and gender classification for the input face image.

Description

Small-scale convolutional neural network based age and/or gender assessment method and system
Priority claims and related patent applications
This patent application claims priority from U.S. provisional patent application 62/428,497, entitled "CONVOLUTIONAL NEURAL NETWORKS (CNN) BASED ON RESOLUTION-LIMITED SMALL-SCALE CNN MODULES" (inventors: Wang Xing, Wu Qianwei, and Liang Jie; filed November 30, 2016). The contents of this U.S. provisional application are incorporated by reference and made a part of this application.
The present application is also related to pending U.S. patent application 15/441,194, entitled "CONVOLUTIONAL NEURAL NETWORK (CNN) SYSTEM BASED ON RESOLUTION-LIMITED SMALL-SCALE CNN MODULES" (inventors: Wang Xing, Wu Qianwei, and Liang Jie; filed February 23, 2017). This U.S. patent application is incorporated by reference and made a part hereof. The present application is also related to pending U.S. patent application 15/657,109, entitled "FACE DETECTION USING SMALL-SCALE CONVOLUTIONAL NEURAL NETWORK (CNN) MODULES FOR EMBEDDED SYSTEMS" (inventors: Wang Xing, Mehdi Seyfi, Chen Minghua, Wu Qianwei, and Liang Jie; filed in 2017).
Technical Field
The present application relates generally to the field of machine learning and artificial intelligence and, more particularly, to systems, apparatuses, and techniques for assessing gender and age group of a person from an input face image using a small-scale hardware Convolutional Neural Network (CNN) module.
Background
Deep Learning (DL) is a branch of machine learning and artificial neural networks based on a set of algorithms that attempt to model high-level abstractions in data by using artificial neural networks with many processing layers. A typical DL architecture may include many layers of neurons and millions of parameters. These parameters can be trained with massive amounts of data on high-speed computers equipped with GPUs, guided by new training techniques that work well in deep networks, such as rectified linear units (ReLU), dropout, data augmentation, and Stochastic Gradient Descent (SGD).
Among existing DL architectures, the Convolutional Neural Network (CNN) is one of the most popular. Although the idea behind CNNs was conceived over 20 years ago, their true capabilities were only recognized after the recent development of deep learning theory. To date, CNNs have achieved great success in many areas of artificial intelligence and machine learning, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.
In many face recognition applications, face detection is an important process. Many face detection techniques can easily detect near-frontal faces at close range. However, in unconstrained situations it remains very difficult to achieve robust and fast face detection, because such situations typically involve large variations of the face, including variations in pose, occlusion, exaggerated expressions, and extreme illumination changes. Effective face detection techniques that can handle these unconstrained scenarios include: (1) the cascaded Convolutional Neural Network (CNN) architecture (hereinafter "cascaded CNN" or "cascaded CNN architecture") described in "A Convolutional Neural Network Cascade for Face Detection" (H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 2015); and (2) the multitask cascaded CNN architecture (hereinafter "MTCNN" or "MTCNN architecture") described in "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks" (K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, October 2016).
In the cascaded CNN, a coarse-to-fine cascaded CNN architecture is used for face detection. More specifically, instead of using a single deep neural network, the cascaded CNN architecture uses multiple shallow neural networks operating on different resolutions of the input image, so that the CNN can quickly discard background regions at the low-resolution stages and then carefully evaluate a small number of candidate regions at the final high-resolution stage. To improve localization accuracy, a calibration stage is employed after each detection/classification stage to adjust the position of the detected window (or "bounding box"). As a result, the cascaded CNN typically requires six stages, i.e., six simple CNNs: three stages or CNNs are used for binary face detection/classification, while the other three are used for bounding box calibration. Owing to the cascaded design and the simple CNN used at each stage, this face detection architecture is well suited to operating in an embedded environment. Note, however, that each bounding box calibration stage within the cascaded CNN incurs additional computational overhead. Furthermore, the cascaded CNN ignores the inherent correlation between face detection and face alignment.
In the MTCNN, a multitask cascaded CNN integrates the face detection and face alignment operations by using unified cascaded CNNs through multitask learning. In principle, the MTCNN also employs multiple coarse-to-fine CNN stages operating on different resolutions of the input image. However, in the MTCNN, each stage uses a single CNN jointly trained for facial keypoint localization, binary face classification, and bounding box calibration. As a result, the MTCNN requires only three stages. More specifically, the first stage of the MTCNN quickly generates candidate face windows through a shallow CNN. Next, the second stage of the MTCNN screens the candidate windows with a more complex CNN that discards a large number of non-face windows. Finally, the third stage of the MTCNN uses an even more powerful CNN to decide whether each remaining window contains a face; if so, the positions of five facial keypoints are estimated. The performance of the MTCNN is significantly better than that of earlier face detection systems, and the MTCNN architecture is generally better suited to execution on resource-limited embedded systems than the cascaded CNN architecture described above.
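For illustration only, the three-stage flow just described can be sketched in a few lines of Python; the stage networks are passed in as callables, and their method names (propose, is_face, keypoints) are placeholders rather than the actual MTCNN API.

```python
# Illustrative sketch of the three-stage MTCNN flow (not the actual
# implementation); stage1..stage3 are the shallow-to-powerful CNNs and
# their method names are assumed placeholders.
def mtcnn_detect(image, stage1, stage2, stage3):
    # Stage 1: a shallow CNN quickly generates candidate face windows.
    candidates = stage1.propose(image)
    # Stage 2: a more complex CNN discards most non-face windows.
    candidates = [w for w in candidates if stage2.is_face(image, w)]
    # Stage 3: the most powerful CNN makes the final decision and, for
    # accepted windows, estimates the five facial keypoint positions.
    return [(w, stage3.keypoints(image, w))
            for w in candidates if stage3.is_face(image, w)]
```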
Disclosure of Invention
The various embodiments described herein provide examples of an age and gender assessment system that can perform age and gender classification on digital face images (hereinafter also referred to as "face images") whose size may exceed the maximum number of input pixels that a given small-scale hardware Convolutional Neural Network (CNN) module can support. In some embodiments, the age and gender assessment system presented herein first partitions the high-resolution input face image into a set of appropriately sized image blocks (also referred to as "sub-images"), each having a carefully designed overlap with adjacent image blocks. Each image block is then processed separately by a small-scale CNN module, such as the embedded CNN module of the Hi3519 SoC from HiSilicon Semiconductor Co., Ltd., a subsidiary of Huawei Technologies Co., Ltd. The outputs corresponding to the set of image blocks are then combined to obtain an output corresponding to the high-resolution input face image, and the combined output may be further processed by subsequent layers in the age and gender assessment system to generate an age and gender classification for the input face image.
The age and gender assessment system proposed in the present application can be implemented in a low-cost embedded system that includes at least one small-scale hardware CNN module, and can be integrated with a face detection system that may also be implemented on the same low-cost embedded system. In some embodiments, the age and gender assessment system may be coupled to the face detection system to perform age and gender assessment on detected face images generated by the face detection system, wherein the two systems may share at least one small-scale CNN module, such as the one within Hi3519, to perform their designated operations. By applying the sub-image based technique to a high-resolution face image, the proposed system can perform age and gender assessment on a small-scale CNN module without degrading the accuracy of the assessment. The ability to perform age and gender assessment in situ within an embedded system, based on the captured and detected face images and without resorting to a separate device, system, or server, can significantly reduce operational cost. In some embodiments, the age and gender assessment system proposed herein may also be implemented on low-cost embedded systems that do not include a face detection system. In these embodiments, the low-cost embedded system may receive face images directly from one or more external sources and then perform age and gender assessment on the received face images using the age and gender assessment system.
In one aspect, a process for performing age and gender assessment on a face image using a small-scale Convolutional Neural Network (CNN) module having a maximum input size limit is disclosed. The process receives an input face image that is predominantly occupied by a face, and then determines, based on the maximum input size limit, whether the size of the input face image is larger than the maximum input image size that the small-scale CNN module can support. If so, the process further determines whether the size of the input face image satisfies a predefined input image size limit. The predefined input image size limit is satisfied when the input image size equals a given one of a plurality of image sizes for which the input image can be divided into a set of sub-images having a second size, wherein the second size is smaller than the maximum input image size. If the size of the input face image satisfies the predefined input image size limit, the process further comprises the steps of: dividing the input face image into a group of sub-images with the second size; processing the set of sub-images with the small-scale CNN module to generate a feature map array; merging the feature map array into a group of merged feature maps corresponding to the input face image; and processing the merged feature maps using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
In some embodiments, if the size of the input face image does not meet the predefined input image size limit, the process further comprises the steps of: resizing the input face image to a given image size that satisfies the predefined input image size limit; dividing the resized input face image into a set of sub-images having a second size; processing the set of sub-images with the small-scale CNN module to generate a feature map array; merging the feature map array into a group of merged feature maps corresponding to the resized input face image; the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the input face image.
In some embodiments, resizing the input face image to the given image size comprises: if the size of the input face image is larger than the given image size, the input face image is down-sampled to the given image size; if the size of the input face image is smaller than the given image size, the input face image is up-sampled to the given image size.
In some embodiments, if the size of the input face image is less than or equal to the maximum input image size associated with the small-scale CNN module, the process processes the input face image directly with the small-scale CNN module without dividing the input face image into a set of sub-images having a smaller size.
In some embodiments, if the size of the input face image is less than or equal to the maximum input image size of the small-scale CNN module, the process further comprises the steps of: upsampling the size of the input face image to a given image size that satisfies the predefined input image size limit; dividing the resized input face image into a set of sub-images having a second size; processing the set of sub-images with the small-scale CNN module to generate a feature map array; merging the feature map array into a group of merged feature maps corresponding to the resized input face image; the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the input face image.
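To make the above control flow concrete, the following Python sketch combines the branches just described: direct processing for small images, resizing to a size satisfying the predefined limit otherwise, and division into overlapping sub-images. The 1280-pixel limit matches the Hi3519 figure cited later in this description; the 46 × 46 block size and 32-pixel stride are illustrative values consistent with FIG. 7 and the layer constraints discussed below, not numbers mandated by the claims.

```python
import cv2
import numpy as np

MAX_PIXELS = 1280        # assumed maximum input size limit (Hi3519-style)
BLOCK, STRIDE = 46, 32   # illustrative sub-image size and spacing

def is_valid_size(n):
    # A dimension n satisfies the predefined limit iff n = STRIDE*k + BLOCK,
    # so that BLOCK-sized sub-images at the given STRIDE tile it exactly.
    return n >= BLOCK and (n - BLOCK) % STRIDE == 0

def prepare_input(face_img):
    """Return the list of images to feed to the small-scale CNN module."""
    h, w = face_img.shape[:2]
    if h * w <= MAX_PIXELS:
        return [face_img]                     # small enough: process directly
    if not (is_valid_size(h) and is_valid_size(w)):
        # Resize (down- or up-sample) to the nearest valid size.
        h = STRIDE * max(1, round((h - BLOCK) / STRIDE)) + BLOCK
        w = STRIDE * max(1, round((w - BLOCK) / STRIDE)) + BLOCK
        face_img = cv2.resize(face_img, (w, h))
    return [face_img[y:y + BLOCK, x:x + BLOCK]   # overlapping sub-images
            for y in range(0, h - BLOCK + 1, STRIDE)
            for x in range(0, w - BLOCK + 1, STRIDE)]
```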
In some embodiments, the input face image is the output of a face detection CNN module that detects face images from the input video image.
In some embodiments, the small-scale CNN module includes three convolutional layers, where each of the three convolutional layers is followed by a rectified linear unit (ReLU) layer and a pooling layer.
In some embodiments, the last fully-connected layer of the two or more fully-connected layers includes a softmax classifier.
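As an illustration of such a decision module, the sketch below stacks two fully connected layers and ends in a softmax classifier. The 1728-element input (matching the merged-feature-map example later in this description), the 256-unit hidden layer, and the eight age-group classes are all assumed values; a gender head would be an analogous two-class classifier.

```python
import torch
import torch.nn as nn

# Hedged sketch of a two-FC-layer decision module ending in softmax;
# all layer widths and the eight age-group classes are assumptions.
decision_module = nn.Sequential(
    nn.Linear(1728, 256),     # merged feature vector -> hidden FC layer
    nn.ReLU(),
    nn.Linear(256, 8),        # e.g., eight age-group classes
    nn.Softmax(dim=-1),       # the softmax classifier of the last FC layer
)

age_probs = decision_module(torch.zeros(1, 1728))   # probabilities sum to 1
```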
In another aspect, an age and gender assessment system utilizing at least one small-scale CNN module is disclosed. The age and gender assessment system includes: an input module for receiving an input face image that is predominantly occupied by a face; a small-scale CNN module, connected to the output of the input module, for processing the face image using a set of filters, wherein the small-scale CNN module has a maximum input size limit; a merging module connected to the output of the small-scale CNN module; and a decision module comprising two or more fully connected layers and connected to the output of the merging module. In some embodiments, the input module is further configured to determine, based on the maximum input size limit, whether the size of the input face image is larger than the maximum input image size that the small-scale CNN module can support; if so, it determines whether the size of the input face image satisfies a predefined input image size limit. The predefined input image size limit is satisfied when the input image size equals a given one of a plurality of image sizes for which the input image can be divided into a set of sub-images having a second size, the second size being smaller than the maximum input image size. The input module is further configured to divide the input face image into a set of sub-images having the second size if the size of the input face image satisfies the predefined input image size limit. The small-scale CNN module processes the set of sub-images to generate a feature map array. The merging module merges the feature map array into a group of merged feature maps corresponding to the input face image. Finally, the decision module processes the merged feature maps using the two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
In some embodiments, if the input face image size does not meet the predefined input image size limit, the input module resizes the input face image to a given image size that meets the predefined input image size limit and divides the resized input face image into a set of sub-images having a second size; the small-scale CNN module processes the set of sub-images to generate a feature map array; the merging module merges the feature map array into a group of merged feature maps corresponding to the resized input face image; and the decision module processes the combined feature map using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
In some embodiments, the input module resizes the input face image to the given image size by: if the size of the input face image is larger than the given image size, the input face image is down-sampled to the given image size; if the size of the input face image is smaller than the given image size, the input face image is up-sampled to the given image size.
In some embodiments, if the size of the input face image is less than or equal to the maximum input image size of the small-scale CNN module, the small-scale CNN module directly processes the input face image without dividing the input face image into a set of sub-images having a smaller size.
In some embodiments, if the size of the input face image is less than or equal to the maximum input image size of the small-scale CNN module, the input module upsamples the size of the input face image to a given image size that satisfies the predefined input image size limit and divides the resized input face image into a set of sub-images having a second size; the small-scale CNN module processes the set of sub-images to generate a feature map array; the merging module merges the feature map array into a group of merged feature maps corresponding to the resized input face image; and the decision module processes the combined feature map using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
In some embodiments, the input module is coupled to a face detection CNN module that detects a face image from an input video image, and the input face image is an output of the face detection CNN module.
In some embodiments, the small-scale CNN module is a hardware CNN module embedded in a chipset or system on a chip (SoC).
In some embodiments, the merge module merges the feature map arrays by concatenating the feature map arrays into a one-dimensional vector.
In another aspect, an embedded system that can perform face detection and age and gender assessment in situ on captured video images is disclosed. The embedded system includes: a processor; a memory coupled to the processor; an image capture device coupled to the processor and the memory and configured to capture video images; a face detection subsystem connected to the image capture device and configured to detect faces in the captured video images; and an age and gender assessment subsystem coupled to the face detection subsystem and including a small-scale CNN module having a maximum input size limit. In some embodiments, the age and gender assessment subsystem is configured to: receive from the face detection subsystem a detected face image that is predominantly occupied by a face; determine, based on the maximum input size limit, whether the size of the detected face image is larger than the maximum input image size that the small-scale CNN module can support; and, if so, determine whether the size of the detected face image satisfies a predefined input image size limit. The predefined input image size limit is satisfied when the input image size equals a given one of a plurality of image sizes for which the input image can be divided into a set of sub-images having a second size, wherein the second size is smaller than the maximum input image size. The age and gender assessment subsystem is further configured to, if the size of the detected face image satisfies the predefined input image size limit: divide the detected face image into a group of sub-images with the second size; process the set of sub-images with the small-scale CNN module to generate a feature map array; merge the feature map array into a group of merged feature maps corresponding to the detected face image; and process the merged feature maps using two or more fully connected layers to generate an age and/or gender classification for the person in the detected face image.
In some embodiments, the small-scale CNN module is a low-cost hardware CNN module shared by the age and gender assessment subsystem and the face detection subsystem.
In another aspect, a process for performing deep learning image processing utilizing a small-scale CNN module with maximum input size constraints is disclosed. The process first receives an input image. The process then determines whether the size of the input image is larger than the maximum input image size that the small-scale CNN module can support, based on the maximum input size limit. If so, the process further performs: dividing the input image into a set of sub-images having a second size, wherein the second size is smaller than the maximum input image size; processing the set of sub-images with the small-scale CNN module to generate a feature map array; merging the feature map array into a set of merged feature maps corresponding to the input image; the combined feature map is processed using two or more fully connected layers to generate a classification decision for the input image.
In some embodiments, the size of the input image satisfies a predefined input image size limit, wherein the predefined input image size limit is satisfied when the input image size equals a given one of a plurality of image sizes for which the input image can be divided into a set of sub-images having the second size.
In some embodiments, if the size of the input image is less than or equal to the maximum input image size supportable by the small-scale CNN module, the process processes the input image directly with the small-scale CNN module without dividing the input image into a set of sub-images having a smaller size.
In some embodiments, the input image is an input face image that is predominantly covered by a face; the classification decision on the input image includes an age and gender classification for the person in the input face image.
In some embodiments, there is a predefined overlap between each pair of neighboring sub-images in the set of sub-images, while there is neither overlap nor gap between the corresponding pair of neighboring feature maps.
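The arithmetic behind this overlap design can be checked directly. Assuming the 46 × 46 image blocks of FIG. 7, the three convolutional layers noted above, and the 3 × 3 valid convolutions and 2 × 2 stride-2 pooling detailed later in this description, each block yields a 4 × 4 feature map, so blocks placed 32 pixels apart, i.e., with a 14-pixel overlap, produce feature maps that tile with no gap and no overlap:

```python
# Each 3x3 valid convolution removes 2 pixels; each 2x2 pool halves the size.
size = 46
for _ in range(3):
    size = (size - 2) // 2
print(size)          # 4: a 46x46 block yields a 4x4 feature map

cell = 2 ** 3        # after three poolings, each output cell spans 8 pixels
print(4 * cell)      # 32: block stride at which adjacent maps tile exactly
print(46 - 32)       # 14: required overlap between adjacent 46x46 blocks
```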
Drawings
The structure and operation of the present application may be understood by reading the following detailed description and various drawings, in which like reference numerals refer to like parts, and in which:
FIG. 1A shows a block diagram of a small-scale hardware CNN module for processing low-resolution input images;
FIG. 1B shows a more detailed implementation of the hardware CNN module of FIG. 1A;
FIG. 2A shows a block diagram of a conventional full image-based CNN system for processing higher resolution input images;
fig. 2B shows a block diagram of a sub-image based CNN system;
FIG. 3 illustrates a block diagram of an exemplary face detection system based on a small-scale hardware CNN module, according to some embodiments of the present application;
fig. 4 illustrates a block diagram of an exemplary implementation of a first level CNN based on small scale hardware CNN modules, as illustrated in fig. 3, in accordance with some embodiments described herein;
fig. 5 illustrates a block diagram of an exemplary implementation of the small-scale hardware CNN-based second level CNN as shown in fig. 3, according to some embodiments described herein;
fig. 6 illustrates a block diagram of an exemplary implementation of the third level CNN, as shown in fig. 3, in accordance with some embodiments described herein;
FIG. 7 illustrates an exemplary scheme for partitioning an input image into 46 x 46 image blocks, according to some embodiments described herein;
fig. 8 illustrates a block diagram of an exemplary implementation process of a third level CNN based on small-scale hardware CNN modules as illustrated in fig. 3, according to some embodiments of the present application;
FIG. 9 illustrates a block diagram of an exemplary implementation of the final decision module shown in FIG. 3, according to some embodiments of the present application;
FIG. 10 illustrates a flow chart describing an exemplary face detection process utilizing the face detection system executing on the embedded CNN enabled system disclosed herein in accordance with some embodiments of the invention;
fig. 11 shows a flowchart describing an exemplary process for processing a second set of resized image blocks (i.e., step 1014 of fig. 10) using the sub-image based CNN system, in accordance with some embodiments described herein;
fig. 12 illustrates a block diagram of an exemplary age and gender assessment neural network based on small-scale CNN modules, according to some embodiments of the present application;
fig. 13 illustrates a block diagram of an exemplary age and gender assessment system based on a small-scale hardware CNN module and on sub-image technology, according to some embodiments described herein;
FIG. 14 shows a flow diagram of an exemplary process for pre-processing an input face image in the input module, according to some embodiments described herein;
fig. 15 illustrates a block diagram of an exemplary implementation of the small-scale CNN module of fig. 13, in accordance with some embodiments described herein;
FIG. 16 is a block diagram illustrating an exemplary implementation of the merge module and the decision module of FIG. 13 according to some embodiments described herein;
FIG. 17 illustrates a flow chart of an exemplary process for performing an age and gender assessment using the age and gender assessment system presented herein, in accordance with some embodiments described herein; and
FIG. 18 illustrates an exemplary embedded system within which the disclosed sub-image based face detection system and sub-image based age and gender assessment system function according to some embodiments described herein.
Detailed Description
The detailed description below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein, which may be practiced without these specific details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
Throughout the specification, the following terms have the meanings provided herein, unless the context clearly dictates otherwise. The terms "image resolution" and "image size" are used interchangeably to refer to the number of pixels within a given two-dimensional (2D) image.
Various examples of face detection systems, techniques, and architectures based on the use of small-scale low-cost CNN modules configured into a multitask cascaded CNN are described. In one embodiment, the small-scale low-cost CNN module is embedded in a chipset or system on a chip (SoC). Thus, the face detection systems, techniques, and architectures presented herein may be implemented on a chipset or SoC that includes such a small-scale, low-cost CNN module. In a specific example, the face detection systems, techniques, and architectures presented herein may be implemented on the HiSilicon Hi3519 SoC (hereinafter also referred to as "Hi3519" or "Hi3519 SoC"), which was developed for smart cameras by HiSilicon Semiconductor Co., Ltd., a subsidiary of Huawei Technologies Co., Ltd. Notably, the Hi3519 SoC includes an embedded hardware CNN module and a CPU that can perform some simple software CNN functions.
The present patent application also provides examples of age and gender classification that may be performed on a digital face image (hereinafter also referred to as a "face image") whose size may exceed the maximum number of input pixels that a given small-scale hardware Convolutional Neural Network (CNN) module can support. In some embodiments, the age and gender assessment system presented herein first partitions the high-resolution input face image into a set of appropriately sized image blocks (also referred to as "sub-images"), each having a carefully designed overlap with adjacent image blocks. Each image block is then processed separately by a small-scale CNN module, such as the embedded CNN module of Hi3519. The outputs corresponding to the set of image blocks are then combined to obtain an output corresponding to the high-resolution input face image, and the combined output may be further processed by later layers within the age and gender assessment system.
The age and gender assessment system proposed by the present application can be implemented in a low cost embedded system comprising at least one small scale hardware CNN module and can be integrated with the face detection system described above, which can also be implemented on this low cost embedded system. In some embodiments, the age and gender assessment system may be coupled to the face detection system to perform age and gender assessments on detected face images generated by the face detection system, wherein the age and gender assessment system and the face detection system may use at least one small-scale CNN module, such as Hi3519, to perform their designated operations. By applying this sub-image based technique to a high resolution face image, the age and gender assessment system proposed in the present application can perform age and gender assessment on a small scale CNN module without affecting the accuracy of the age and gender assessment. The ability to perform age and gender assessments in the field within an embedded system based on acquired and detected face images without the need for a separate device, system or server to perform the operation can significantly reduce operational costs. In some embodiments, the age and gender assessment presented herein may also be implemented on low-cost embedded systems that do not include a face detection system. In these embodiments, the low cost embedded system may receive facial images directly from one or more external sources and then perform specialized age and gender assessments of the received facial images using the age and gender assessment system.
In many embedded system applications, most existing CNN-based DL architectures and systems are not cost-effective. Meanwhile, some embedded systems with CNN capabilities based on low-cost chipsets have begun to appear. A typical example is the Hi3519 SoC, whose cost is significantly lower than that of the Nvidia™ TK1/TX1 chipsets. The Hi3519 SoC includes an embedded hardware CNN module with many desirable capabilities. For example, the parameters of the embedded CNN module of the Hi3519 SoC are reconfigurable, i.e., the user can modify the network architecture and the parameters, which can be pre-trained for different applications. In addition, the embedded CNN module offers a high processing speed.
Small-scale low-cost CNN modules such as the one in the Hi3519 SoC often have limited capability and many constraints imposed for cost-saving reasons. For example, in the Hi3519 SoC, the maximum number of pixels in the input image of the embedded CNN module is 1280. However, in the coarse-to-fine MTCNN architecture described above, the input image size grows rapidly from stage to stage. For example, in some embodiments of the MTCNN, the input image size of the second stage is 24 × 24 × 3 = 1728, and the input image size of the third stage is 48 × 48 × 3 = 6912. Both of these input image sizes exceed the input size limit of the embedded CNN module within the Hi3519 SoC. To implement the MTCNN on the Hi3519 SoC, the MTCNN would need to be modified to accept a smaller input image size, and the input video would have to be downsampled accordingly. However, doing so may significantly degrade the image quality of the faces in the video, which in turn may seriously impair face detection performance.
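The size check is straightforward to verify; the snippet below simply restates the stage-2 and stage-3 figures quoted above against the 1280-pixel limit, counting channels as the example does:

```python
LIMIT = 1280  # maximum input pixels of the embedded CNN module (Hi3519)
for stage, (h, w, c) in {"stage 2": (24, 24, 3), "stage 3": (48, 48, 3)}.items():
    pixels = h * w * c
    print(stage, pixels, "exceeds limit:", pixels > LIMIT)
# stage 2 1728 exceeds limit: True
# stage 3 6912 exceeds limit: True
```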
The contents of U.S. patent application 15/441,194, which is related to the present application and incorporated herein by reference, provide a solution for implementing the MTCNN on small-scale low-cost CNN modules, such as the one within the Hi3519 SoC. To address the problem of input images larger than the maximum input size of the CNN module, the related patent application provides embodiments of a sub-image based CNN system that first divides a larger input image into a set of smaller sub-images with properly designed overlaps between adjacent sub-images. Each sub-image is then processed by a small-scale hardware CNN module, which may be the embedded CNN module within the Hi3519 SoC. The corresponding outputs of the set of sub-images may then be merged, and the merged result may be further processed by the next stage. The sub-image based CNN system described in the related patent application can be configured to be equivalent to a large-scale CNN system that processes the entire input image without division, so that the output of the sub-image based CNN system can be identical to that of the large-scale CNN. Based on this, some embodiments disclosed in the related patent application apply the sub-image based CNN systems and techniques to one or more stages of the cascaded CNN or the MTCNN, so that a larger input image in a given stage of the cascaded CNN or the MTCNN can be divided into a set of sub-images of a smaller size. Thus, every stage of the cascaded CNN or the MTCNN can employ the same small-scale hardware CNN module subject to the maximum input image size limit.
In some embodiments, to improve real-time face detection performance, the face detection techniques and systems presented herein first detect moving regions in each video frame/image, for example by using the embedded background subtraction module of Hi3519. Next, the face detection techniques and systems use coarse-to-fine multi-stage CNNs to detect most or all faces in the video frame. More specifically, the sub-image based CNN structure may be applied at each stage of the multi-stage CNN that is subject to the input image size limit. For example, some embodiments of the face detection techniques presented herein only need to apply the sub-image based CNN structure to the last stage of the multi-stage CNN structure.
In some embodiments, to improve real-time face detection efficiency, the face detection techniques and systems may also identify facial keypoints (e.g., eyes, nose, and mouth) of each detected face. This information allows the system to track each face, select the best-posed image (also referred to as the "best face") of each person, e.g., the image closest to a frontal view, and then send the best face to a server for further processing, such as face recognition. In some application environments, only the face information in a video frame is transmitted to the server, rather than the entire video frame, thereby reducing the demands on network bandwidth and server computing resources. This reduction is particularly important for systems equipped with a large number of cameras that capture multiple channels of video simultaneously.
In the following discussion, we use the embedded hardware CNN module within the Hi3519 SoC as an example to describe some exemplary embodiments of the face detection CNN systems and techniques proposed in this application. However, it should be understood that the face detection CNN systems and techniques are not limited to a particular chipset or SoC such as the Hi3519 SoC. The face detection CNN systems and techniques utilize small-scale hardware CNN modules in place of the larger, more complex CNN modules in some or all stages of the cascaded CNN or the MTCNN, and may use any small-scale hardware CNN module, or any chipset or SoC that includes an embedded small-scale hardware CNN module. In addition, the face detection systems and techniques may be implemented on a single field programmable gate array (FPGA) module and integrated into an embedded platform.
Description of the sub-image based CNN structure
The sub-image based CNN system described in related U.S. patent application 15/441,194 is built from small-scale low-cost hardware CNN modules. Such a sub-image based CNN system can be implemented in resource-limited systems, such as embedded systems and mobile devices, to enable these systems to perform tasks that would normally require a large-scale, high-complexity, expensive CNN system. The sub-image based CNN system can also be implemented in existing DL systems to replace large-scale, high-complexity CNN modules, thereby significantly reducing system cost. For example, such a sub-image based CNN system allows a low-cost embedded system with CNN capability to be used in applications requiring high-complexity CNNs, such as processing high-resolution input images, which resource-limited embedded systems otherwise could not handle. In some embodiments, the sub-image based CNN system reuses one or more small-scale hardware CNN modules designed to process low-resolution input images, such as the embedded hardware CNN module within the Hi3519 SoC, so that the sub-image based CNN system can be applied to high-resolution input images and to more challenging tasks that would normally require the processing power of expensive, large-scale hardware CNN modules.
The sub-image based CNN system is a hierarchical system that handles complex tasks with a divide-and-conquer approach. In some embodiments of the related patent application, the sub-image based CNN system is constructed with two or more stages, where each stage is implemented either with one or more small-scale, low-cost hardware CNN modules operating on low-resolution inputs or with software operating on low-resolution inputs. Thus, each of the two or more stages has very low complexity. More specifically, to use this sub-image based CNN system, an initial high-resolution input image may be divided into a set of sub-images of the same or substantially the same size, which is significantly smaller than the size of the initial input image, wherein the division may include properly designed overlaps between adjacent sub-images. These sub-images are fed to the first stage of the sub-image based CNN system, which includes at least one small-scale, low-cost hardware CNN module designed to process low-resolution input images, and the outputs of the first stage, corresponding to the set of processed sub-images, are subsequently merged. More specifically, the set of sub-images may be processed by repeatedly invoking the one or more small-scale hardware CNN modules. In this manner, a high-resolution input image can be processed by the one or more small-scale hardware CNN modules through repeated invocations on the set of sub-images.
The outputs of the first stage, corresponding to the set of sub-images, may then be merged. In some embodiments, the sub-image based CNN system imposes constraints on the sizes of the input image and the sub-images to ensure that the merged result is substantially or exactly identical to the output obtained by processing the entire high-resolution input image directly with a large-scale, high-complexity CNN module (i.e., without dividing the input image). The merged result is then processed by the second stage of the sub-image based CNN system, which may also be implemented with one or more small-scale hardware CNN modules or with software. In this way, the disclosed CNN system can perform high-complexity tasks, such as processing high-resolution input images, without large-scale, high-complexity, expensive hardware modules, thereby improving the trade-off between performance and cost. The sub-image based CNN system is therefore highly suitable for resource-limited embedded systems, such as surveillance cameras, machine vision cameras, drones, robots, self-driving vehicles, and mobile phones.
Small-scale low-cost hardware CNN module
Fig. 1A shows a block diagram of a small-scale hardware CNN module 100 for processing a low-resolution input image. In some embodiments, CNN module 100 may be used to extract features from a resolution-limited input image and perform various DL inferences, depending on application requirements. As can be seen in fig. 1A, CNN module 100 includes at least two sub-modules, denoted CNN1 and CNN2. In some embodiments, CNN module 100 limits the size of the input image 102 to at most 1280 pixels, e.g., 32 x 40 pixels. This limitation on the input image size also severely limits the types of applications for which CNN module 100 is suitable.
Fig. 1B shows a more detailed implementation of the hardware CNN module 100. As can be seen in fig. 1B, the first sub-module CNN1 in fig. 1A further includes multiple alternating sets of convolution (CONV) layers, rectified linear unit (ReLU) layers (not shown), and pooling layers connected in series. Further, for each of the CONV layers, such as the CONV(1) layer, sub-module CNN1 uses a set of convolution filters to extract a particular set of features from the input image 102. Each of the CONV layers in sub-module CNN1 is followed by a corresponding ReLU layer (not shown) and a pooling layer, such as the POOL(1) layer, which reduces the size of the filtered images generated by the corresponding CONV layer while preserving some of the extracted features.
As also shown in fig. 1B, the second sub-module CNN2 in fig. 1A further includes a series of alternating fully connected (FC) layers and ReLU layers (not shown). Each of the FC layers in sub-module CNN2, such as the FC(1) layer, is configured to perform matrix multiplication. Each FC layer (except the last one) is followed by a corresponding ReLU layer (not shown). Although not explicitly shown in fig. 1B, each of the ReLU layers in CNN1 and CNN2 provides nonlinearity to the CNN system. Finally, at the output of the last FC layer (e.g., the FC(n) layer), a decision module (also not shown) predicts the output based on the last FC layer, thereby generating the output 104 of CNN module 100. In some embodiments, the first sub-module CNN1 includes 1-8 sets of CONV, ReLU, and pooling layers, while the second sub-module CNN2 includes 3-8 sets of fully connected (FC) and ReLU layers.
In some embodiments, the number of convolution filters in each of the CONV layers is at most 50, and only 3 x 3 filters are allowed. In addition, the convolution stride is fixed at 1, and no zero padding is used. In some embodiments, the pooling layers in CNN1 may use a max-pooling technique to select the maximum value from each 2 x 2 region of the filtered images. In some embodiments, both max-pooling and average-pooling are supported; however, the pooling window size is fixed at 2 x 2 and the stride is fixed at 2. In other words, the width and height of the image are halved after each pooling layer.
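Under these constraints, a CNN1-style convolution stack can be written down directly. The sketch below (in PyTorch) follows the pattern of three CONV-ReLU-POOL sets and shows how a 46 x 46 input shrinks to a 4 x 4 feature map; the filter counts (32/48/48, each at most 50) are illustrative assumptions, not the Hi3519 hardware configuration.

```python
import torch
import torch.nn as nn

# Three CONV-ReLU-POOL sets obeying the stated limits: 3x3 filters only,
# stride 1, no zero padding, at most 50 filters per CONV layer, and 2x2
# pooling with stride 2. Filter counts are illustrative assumptions.
cnn1 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),   # 46 -> 44 -> 22
    nn.Conv2d(32, 48, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # 22 -> 20 -> 10
    nn.Conv2d(48, 48, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # 10 -> 8 -> 4
)

x = torch.zeros(1, 3, 46, 46)    # one 46x46 RGB sub-image
print(cnn1(x).shape)             # torch.Size([1, 48, 4, 4])
```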
Taking the hardware CNN module within the Hi3519 SoC as an example, the maximum input size of the first FC layer is 1024, the number of neurons in each intermediate FC layer is at most 256, and the size of the CNN module output is at most 256. Because of these constraints, the hardware CNN module within the Hi3519 SoC is generally only suitable for simple applications such as handwritten digit recognition and license plate recognition. For more challenging applications such as face recognition, directly applying a small-scale CNN module such as CNN module 100 is unsatisfactory, for at least the following reasons. First, the maximum input resolution of 1280 pixels (such as 40 x 32) is very limiting, because a face image downsampled to this resolution loses too much important facial information. Second, the learning capacity of the small-scale CNN module 100 is also very limited.
Layered CNN architecture and system based on sub-images
Fig. 2A shows a block diagram of a conventional full-image based CNN system 200 for processing a high-resolution input image. As can be seen, the conventional CNN system 200 receives the entire high-resolution input image 202 at the first convolution layer CONV(1) and begins performing feature extraction on the high-resolution input image 202. Thus, the conventional CNN system 200 can directly process the entire high-resolution input image 202 without dividing it. However, the conventional CNN system 200 also requires large-scale, expensive chips capable of handling such high-resolution input images, such as the Nvidia™ chips mentioned earlier.
Fig. 2B shows a block diagram of a sub-image based CNN system 210. In the disclosed CNN system 210, a resolution-limited small-scale CNN module, such as the CNN module 100 described in connection with figs. 1A and 1B or the hardware CNN module within the Hi3519 SoC, may be used as a building block. As mentioned above, such a small-scale CNN module has a limit on the maximum size of the input image, e.g., at most 1280 pixels. To be able to process a high-resolution input image 202 (e.g., an image of more than 1280 pixels) with this small-scale CNN module, the disclosed CNN system 210 includes an input module 212 that divides the high-resolution input image 202 into a set of smaller sub-images 204, where the size of each sub-image 204 is less than or equal to the maximum input image size allowed/supported by the small-scale CNN module used as a building block of CNN system 210. In some embodiments, the input module 212 divides the high-resolution input image 202 with properly designed overlaps between adjacent sub-images 204, as shown in fig. 2B. Note that the set of four sub-images 204 arranged in two rows and two columns with gaps and overlaps, as shown in fig. 2B, is intended only to illustrate the concept and does not represent an actual division.
As shown in fig. 2B, CNN system 210 includes a two-tier processing architecture based on the use and/or reuse of one or both of the two hardware sub-modules CNN1 and CNN2 of small-scale CNN module 100 depicted in fig. 1A and 1B. In addition to the input module 212, the CNN system 210 includes a first processing stage 220, a merging module 222, and a second processing stage 224. More specifically, the first processing stage 220 of the CNN system 210 includes at least one CNN1 processing module, such as the CNN1 module 214. In some embodiments, the CNN1 module 214 is implemented by the hardware sub-module CNN1 depicted in fig. 1A and 1B. In other embodiments, the CNN1 module 214 is implemented by the entire CNN module 100 described in fig. 1A and 1B including two sub-modules CNN1 and CNN 2. It is noted that the multiple instances of the CNN1 module 214 shown within the first processing stage 220 represent the use of the same CNN1 module 214 at different times t1, t2, t3, …, and tn, as noted for each such instance. Thus, "CNN 1214 at t 1", "CNN 1214 at t 2", "CNN 1214 at t 3", …, and "CNN 1214 at tn" shown in fig. 2B correspond to the same CNN1 module 214 at different processing times, and should not be construed as a plurality of CNN1 modules having the same number 214. Although not shown, the first processing stage 220 may include additional CNN1 modules similar to CNN module 214. For example, the first processing stage 220 may include two or more identical CNN1 modules.
The second processing stage 224 of the CNN system 210 includes at least one CNN2 module 216. In some embodiments, the CNN2 module 216 is implemented by the hardware sub-module CNN2 depicted in fig. 1A and 1B. In other embodiments, the CNN2 module 216 is implemented by the entire CNN module 100 described in fig. 1A and 1B including two sub-modules CNN1 and CNN 2. In certain other embodiments, the CNN2 module 216 within the second processing stage 224 may be implemented in software rather than in hardware.
In particular, to process the set of sub-images 204 generated by the input module 212, the same CNN1 module 214 may be used multiple times to sequentially process the set of sub-images 204, one sub-image at a time. That is, each instance of the CNN1 module 214 shown within the first processing stage 220 of CNN system 210 represents one of multiple applications of the same CNN1 module 214 to one of the sub-images 204 at a different processing time. However, since the CNN1 module 214 processes each sub-image 204 very quickly, the total processing time for the set of sub-images 204 is also very short. The output of the multiple applications of the CNN1 module 214 is an array of feature maps 206 corresponding to the set of sub-images 204 after the multi-layer convolution, ReLU, and pooling operations.
It is noted that while the embodiment shown in fig. 2B is based on reusing the same hardware CNN1 module 214 in the first processing stage 220 of the CNN system 210, other embodiments may use additional hardware CNN1 modules similar or identical to the CNN1 module 214 in the first processing stage 220 of the CNN system 210 so that multiple hardware CNN1 modules process the group of sub-images 204 in parallel. The actual number of CNN1 modules used for a given design may be determined based on a tradeoff between hardware cost constraints and speed requirements for the given design. For example, some variations of CNN system 210 may include 3 to 5 CNN1 modules in the first processing stage.
As mentioned above, the CNN1 module 214 may be implemented by the dedicated hardware sub-module CNN1, such as the one described in connection with figs. 1A and 1B, or by the entire CNN module 100 described in connection with figs. 1A and 1B, which includes both the CNN1 and CNN2 sub-modules. In the first case, the CNN1 module 214 within CNN system 210 includes only CONV, ReLU, and pooling layers. In the second case, implementing the CNN1 module 214 in CNN system 210 further requires skipping the FC layers and the corresponding ReLU layers, i.e., skipping sub-module CNN2 within CNN module 100. When skipping the CNN2 sub-module, the CNN1 module 214 generally needs to preserve the spatial location information in its output feature maps, because the outputs of the CNN1 module 214 will be merged and used for further processing. For some embedded hardware CNN modules, such as the one within the Hi3519 SoC, the parameters of the embedded CNN module are reconfigurable. Using this property, sub-module CNN2 can effectively be skipped by forcing the weight matrix of each FC layer within CNN module 100 to an identity matrix, so that the output of each FC layer is simply a reorganization of the two-dimensional feature maps into a one-dimensional vector. In this case, the ReLU layer following each FC layer may still be used as usual. For a CNN2 sub-module with three FC-ReLU combinations, the last two ReLU layers do not change any data, since cascading multiple ReLU layers is equivalent to a single ReLU layer.
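A minimal numpy illustration of this identity-matrix trick, assuming a 4 x 4 x 48 feature-map block: with the FC weights forced to the identity, the layer only reorganizes the maps into a vector, and the trailing ReLU is a no-op because pooled ReLU outputs are already non-negative.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

fmaps = np.random.rand(4, 4, 48)   # non-negative feature maps (post ReLU/pool)
x = fmaps.reshape(-1)              # flatten the 2D maps into a 1D vector
W = np.eye(x.size)                 # FC weight matrix forced to the identity
y = relu(W @ x)                    # FC layer followed by its ReLU

assert np.array_equal(y, x)                    # data passes through unchanged
assert np.array_equal(relu(relu(y)), relu(y))  # stacked ReLUs == one ReLU
```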
Returning to FIG. 2B, after each sub-image 204 in the set of sub-images 204 is sequentially processed by the CNN1 module 214, the output from the CNN1 module 214 containing the array of feature maps 206 becomes the input to the merge module 222, which merge module 222 is configured to merge the array of feature maps 206 to form a complete feature map for the entire input image 202. The merged feature map may then be used as an input to the second processing stage 224 of the CNN system 210. In some embodiments, the output 228 from the second processing stage 224 is the output from the last FC layer of the CNN2 module 216. Ideally, the output 228 is the same as the output 226 of the conventional CNN system 200 in fig. 2A.
In some embodiments, the array of feature maps 206 comprises a set of three-dimensional (3D) matrices (i.e., stacks of two-dimensional feature maps, the third dimension being the number of feature maps). For example, the array of feature maps 206 may be made up of nine 3D matrices, each of size 2 × 2 × 48, where nine is the number of sub-images 204 with indices 0, 1, 2, …, 8 (i.e., 3 rows and 3 columns of sub-images), 2 × 2 is the size of each output feature map for each sub-image after processing by the CNN1 module 214, and 48 is the number of feature maps for each sub-image. In some embodiments, the merge module 222 is configured to merge the array of feature maps 206 by concatenating the 3D output matrices according to their matrix indices, thereby forming a merged 3D feature-map matrix while preserving the spatial relationship of the set of sub-images 204. In the example above, this step generates a 6 × 6 × 48 3D matrix. Next, the merged 3D matrix may be flattened into a one-dimensional (1D) vector; in the example above, this results in a 1D vector with 1728 elements. Finally, the flattened 1D vector is fed to the second processing stage 224.
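The merge step described above can be sketched in a few lines of NumPy. A 3 × 3 grid of sub-image outputs in row-major order is assumed, matching the example of nine sub-images with indices 0 through 8:

```python
import numpy as np

# Nine per-sub-image outputs (3 x 3 grid), each of size 2 x 2 x 48.
maps = [np.random.randn(2, 2, 48).astype(np.float32) for _ in range(9)]

# Reassemble by grid position: concatenate each row of three maps
# horizontally, then stack the three rows vertically.
rows = [np.concatenate(maps[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
merged = np.concatenate(rows, axis=0)      # merged 3D matrix: 6 x 6 x 48
assert merged.shape == (6, 6, 48)

flat = merged.reshape(-1)                  # flattened 1D vector
assert flat.size == 1728                   # fed to the second stage
```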
Fig. 2B shows that the merged feature map 208 generated by the merging module 222 is fed to the second processing stage 224 of the CNN system 210 for further processing. More specifically, the second processing stage 224 of the CNN system 210 includes at least one CNN2 module 216, which further includes a set of FC layers and ReLU layers as described above. As mentioned above, the CNN2 module 216 in the CNN system 210 may be implemented by the dedicated hardware sub-module CNN2 described in connection with fig. 1A and 1B. In these embodiments, the CNN2 module 216 within the CNN system 210 may include only FC layers and ReLU layers. In still other embodiments, the CNN2 module 216 may be implemented by the entire hardware CNN module 100 described in fig. 1A and 1B, including the two sub-modules CNN1 and CNN2. In these embodiments, implementing the CNN2 module 216 in the CNN system 210 further includes skipping the CONV, ReLU, and pooling layers, i.e., skipping the sub-module CNN1 within the CNN module 100. In some systems, such as Hi3519, it may be difficult to skip these layers in order to use the FC and ReLU layers directly. In these cases, the CNN2 module 216, i.e., the FC and ReLU layers, may be implemented in software. Since most of the complex computation of the CNN system 210 is in the CONV layers, implementing the FC and ReLU layers in software typically has little impact on the overall speed of the system. Furthermore, systems such as Hi3519 also provide additional tools to optimize the speed of a software implementation of the CNN2 module 216.
As mentioned above, the CNN2 module 216 within the second processing stage 224 may be implemented in software rather than by a hardware CNN module. Because the complexity of the FC and ReLU layers is generally much lower than that of the convolutional layers, most of the complex computation of the CNN system 210 resides in the convolutional layers implemented by the CNN1 module 214. Based on this observation, the low-complexity operations assigned to the CNN2 module 216 in the CNN system 210 may be implemented in software instead of by the hardware CNN2 sub-module or CNN module mentioned above. Such a software implementation may also provide more flexibility than embodiments based on hardware CNN modules.
Face detection CNN framework provided by the present application
Of the two aforementioned face detection architectures, the MTCNN is simpler than the cascaded CNN because the MTCNN uses three CNN stages whereas the cascaded CNN uses six. In addition, the MTCNN can detect face keypoint locations, which facilitates person tracking and determining the pose of each face. Thus, several examples of face detection CNN systems and techniques described below are based on the three-stage MTCNN structure. It should be noted, however, that these face detection systems and techniques are equally applicable to cascaded CNN architectures.
It has been mentioned above that the embedded CNN module of the Hi3519 system-on-chip cannot be directly used to implement each stage of the originally designed MTCNN without addressing its input image size limitation. In fact, the original design of the MTCNN conflicts with many of the limitations of the embedded CNN module of the Hi3519 system-on-chip. These conflicts include, but are not limited to:
maximum input image size: as mentioned above, in the Hi3519, the maximum value of input image pixels that the Hi3519 can support is 1280. In contrast, the input image size of the second level of the MTCNN as originally designed is 24 × 24 × 3 ═ 1728, and the input image size of the third level is 48 × 48 × 3 ═ 6912. Both of these input sizes exceed the upper limit of the input image size of Hi 3519.
Minimum input image size: the minimum width or height of an input image for Hi3519 is 16 pixels. In contrast, the input image size of the first stage of the originally designed MTCNN is 12 × 12, which is too small for Hi3519.
Number of filters: in the embedded CNN module of Hi3519, the maximum number of filters per convolution (CONV) layer is 50. In contrast, several CONV layers in the originally designed MTCNN have 64 or 128 filters.
CNN architecture: in the embedded CNN module of Hi3519, each CONV layer is followed by a max pooling (MP) layer. However, the MTCNN often has two or three consecutive CONV layers without any MP layer between them.
Pooling window size: in the embedded CNN module of Hi3519, the MP layer is designed to support a pooling window size of 2 × 2 pixels, whereas in MTCNN, a maximum pooling window of 3 × 3 is typically used.
CONV layer filter size: in the embedded CNN module of Hi3519, the CONV layer has a 3 × 3 filter, while in MTCNN, the CONV layer typically employs a 5 × 5 filter and a 2 × 2 filter.
Nonlinear function: the MTCNN employs a parametric rectified linear unit (PReLU) as its nonlinear function, while the embedded CNN module of Hi3519 employs a rectified linear unit (ReLU).
Fully connected (FC) layer: the first stage of the originally designed MTCNN is a fully convolutional network (FCN), which reduces the runtime of the sliding window approach during testing and involves no FC layer. In contrast, Hi3519 requires at least 3 FC layers in one CNN.
The examples of face detection CNN systems and techniques presented herein are designed to address the above-described problems, so that the CNN within each stage of the MTCNN may be implemented by a small-scale, low-cost CNN module, such as the embedded CNN module of Hi3519.
Fig. 3 illustrates a block diagram of an exemplary face detection system 300 based on a small-scale hardware CNN module, according to some embodiments of the present application. In some embodiments, the face detection system 300 is implemented on a CNN-enabled embedded system, including a small-scale, low-cost system-on-a-chip such as the Hi3519 system-on-chip. As shown in fig. 3, the face detection system 300 receives video data 302 as input and generates face detection decisions 316 as output. In some embodiments, the input video image 302 is a video frame captured by a camera. As can be seen, the face detection system 300 includes at least a motion detection module 304, a pyramid and block generation module 306, a first level CNN 308, a second level CNN 310, a third level CNN 312, and a final decision module 314. The face detection system 300 may also include other modules not shown in fig. 3. Each of the modules in the face detection system 300 is described in greater detail below.
As can be seen, the motion detection module 304 first receives the input video image 302. In some embodiments, faces within a given video are assumed to be associated with motion. Thus, to reduce computational complexity, the motion detection module 304 may locate and identify regions within each video frame that are associated with motion, based on a comparison with previously received video frames. It should be noted that these moving regions may contain human or non-human objects, such as a moving automobile; moreover, a moving region containing a moving person may include both the face and the body. When the face detection system 300 is implemented on Hi3519, the motion detection module 304 may be implemented by the embedded motion detection hardware module of Hi3519. The output of the motion detection module 304 includes a set of identified moving regions of different sizes. Each identified moving region, as part of the input video image 302, is sent to the subsequent face detection modules within the face detection system 300 to detect most or all of the faces within that moving region. In this embodiment, the non-moving regions within the input video image 302 are generally not considered for face detection. However, some other embodiments of the face detection system presented herein may not include a motion detection module.
In some embodiments, a face tracking module (not shown) may be used in place of, or in conjunction with, the motion detection module 304. The face tracking module computes motion trajectories of faces that have been detected by the face detection system 300. More specifically, the face tracking module calculates a motion trajectory based on the face positions in previous video frames, predicts the new positions of the detected faces in a new video frame based on the calculated trajectory, and then searches for the faces in the vicinity of the predicted positions. It should be noted that by combining motion detection and face tracking within the face detection system 300, the speed of face detection can be significantly increased.
In some embodiments, a given moving region 318 generated by the motion detection module 304, by the face tracking module, or by a combination of motion detection and face tracking has a minimum size. The minimum size of a moving region may be determined based on one or more design parameters as well as the constraints of the small-scale hardware CNN module employed in the face detection system 300, such as that of Hi3519. In some embodiments, the one or more design parameters include the initial downsampling factor of the pyramid and block generation module 306 and the minimum input image size of the first level CNN 308. For example, if the initial downsampling factor of the pyramid and block generation module 306 is 2:1 and the minimum input image size of the first level CNN 308 is 16 × 16, the minimum size of a detectable face is 32 × 32. If the initial downsampling factor is 3:1 and the minimum input image size of the first level CNN 308 is 16 × 16, the minimum size of a detectable face is 48 × 48. To reduce complexity, the minimum size of a moving region sent to the face detection modules is typically larger than the minimum size of detectable faces. In some embodiments, the maximum size of a moving region generated by the motion detection module 304 may be as large as the entire input video image 302; for example, such a moving region may correspond to an input image that is substantially completely covered by a face.
As can be seen in fig. 3, the detected moving regions generated by the motion detection module 304 (or by the face tracking module, or by a combination of motion detection and face tracking) are processed in a similar manner by the other modules within the face detection system 300, including the pyramid and block generation module 306, the first level CNN 308, the second level CNN 310, the third level CNN 312, and the final decision module 314. Thus, the operations described below for these modules are repeated for each detected moving region 318; this per-region processing loop is indicated by the dashed box surrounding these modules. The following discussion of the face detection system 300 is directed to, and applies equally to, all of the detected moving regions 318.
In the face detection system 300, each detected moving region 318 is received by the pyramid and block generation module 306 as part of the input video image 302. The pyramid and block generation module 306 downsamples the moving region 318 using different downsampling factors, converting the moving region 318 into a "pyramid" of multi-resolution representations, thereby allowing the subsequent face detection modules to detect faces of different sizes within the moving region 318. More specifically, the higher-resolution representations of the moving region 318 in the "pyramid" are used to detect smaller faces in the original input image 302, while the lower-resolution representations are used to detect larger faces in the original input image 302.
In some embodiments, the highest resolution representation of the moving region 318 in the pyramid is determined by the input size of the first level CNN 308 and the desired minimum size of detectable faces. Note that the input size of the first level CNN 308 may be a user-defined parameter, whose minimum value is limited by the minimum input size of the first level CNN 308, which in turn is constrained by the particular device. For example, for the embedded CNN module of Hi3519, the minimum input size is 16 × 16, so the input size of the first level CNN 308 must be at least 16 × 16. In addition, the highest resolution representation also determines the smallest face that the face detection system 300 can detect. More particularly, the smallest detectable face is obtained by multiplying the input size of the first level CNN 308 by the initial downsampling factor employed by the pyramid and block generation module 306. For example, if the input size of the first level CNN 308 is 16 × 16 and the initial downsampling factor is 3, the smallest detectable face is 48 × 48; if the initial downsampling factor is 2, the smallest detectable face is 32 × 32.
It should be noted that the downsampling factors used by the pyramid and block generation module 306 need to be determined in light of the trade-off between face detection accuracy and speed. In particular, the initial downsampling factor may be determined as the ratio of the minimum size of detectable faces to the input size of the first level CNN 308. For example, if the input size of the first level CNN 308 is 16 × 16 and the minimum size of detectable faces is about 48 × 48, the initial downsampling factor should be 3. In some embodiments, the user-specified input size of the first level CNN 308 may be greater than the minimum input size of the first level CNN 308, i.e., greater than 16 × 16.
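The size relationships above reduce to one line of arithmetic; the sketch below simply restates them, with a hypothetical helper name:

```python
def min_detectable_face(cnn_input_size: int, downsample_factor: int) -> int:
    # Smallest detectable face = CNN input size x initial downsampling factor
    return cnn_input_size * downsample_factor

assert min_detectable_face(16, 2) == 32    # 2:1 pyramid -> 32 x 32 faces
assert min_detectable_face(16, 3) == 48    # 3:1 pyramid -> 48 x 48 faces
# Conversely, the initial factor is the ratio of the smallest face of
# interest to the CNN input size: 48 // 16 == 3.
```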
In some embodiments, the lowest resolution representation of the moving region in the pyramid should be equal to or close to, but not smaller than, the minimum input size of the first level CNN 308, i.e., 16 × 16 for Hi3519. For example, the lowest resolution representation of the moving region 318 may be a 24 × 24 image. The other representations of the moving region 318 lie between the lowest and highest resolutions of the pyramid, with adjacent resolution representations typically separated by a factor of 2:1 or 3:1.
For each received moving region 318, the pyramid and block generation module 306 generates a pyramid of multi-resolution representations of that moving region 318. In other words, the pyramid and block generation module 306 generates a set of images of different resolutions for the same portion of the original video image 302. In some embodiments, the first level CNN 308 does not process each pyramid image as a whole; instead, each image is processed as image blocks of the user-specified input size described above. For example, if a 16 × 16 input size is used, each image in the pyramid is further divided into a set of 16 × 16 image blocks.
In some embodiments, the pyramid and block generation module 306 divides each image in the pyramid into a set of image blocks using a sliding window approach. More specifically, each image in the pyramid is scanned by a sliding window of user-specified size, such as 16 × 16, moving in user-specified steps, such as 2 or 4 pixels in both the row and column directions, so that an image block is generated at each sliding window position. Thus, the pyramid and block generation module 306 generates and outputs sets of image blocks 320 of the same size, corresponding to the set of multi-resolution representations of the moving region. Note that a higher-resolution representation of the moving region 318 produces more image blocks than a lower-resolution representation. Next, the sets of image blocks 320 are received by the first level CNN 308. Depending on the hardware configuration, the first level CNN 308 may process the received image blocks sequentially, block by block, or process multiple image blocks in parallel to speed up processing. Some embodiments of the first level CNN 308 are described in more detail below.
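A compact sketch of the pyramid and sliding-window block generation follows. Nearest-neighbour resampling, a square moving region, and the helper names are simplifying assumptions for illustration, not the module's actual implementation:

```python
import numpy as np

def build_pyramid(region, highest, lowest, ratio=2):
    """Downsample a square `region` to sizes from `highest` down to (but
    not below) `lowest`, adjacent levels separated by `ratio`."""
    pyramid, s = [], highest
    while s >= lowest:
        idx = np.linspace(0, region.shape[0] - 1, s).astype(int)
        pyramid.append(region[np.ix_(idx, idx)])   # nearest-neighbour
        s //= ratio
    return pyramid

def sliding_blocks(image, win=16, step=4):
    """Yield win x win image blocks at every sliding-window position."""
    h, w = image.shape
    for r in range(0, h - win + 1, step):
        for c in range(0, w - win + 1, step):
            yield image[r:r + win, c:c + win]

region = np.random.rand(96, 96)            # a detected moving region
for level in build_pyramid(region, highest=96, lowest=16):
    blocks = list(sliding_blocks(level))   # higher resolution -> more blocks
```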
The first level CNN 308 is used to process each received image block corresponding to each sliding window position within each pyramid representation of the moving region 318. Fig. 4 shows a block diagram of an exemplary implementation process 400 of the first level CNN 308 based on small-scale hardware CNN modules, according to some embodiments described herein.
As can be seen in fig. 4, the first level CNN 400 includes two stages of CONV and MP layers (i.e., CONV(1)/MP(1) and CONV(2)/MP(2)), followed by two FC layers (i.e., FC(1) and FC(2)). In some embodiments, each CONV layer and each FC layer (except the last FC layer) is followed by a ReLU layer (not shown in fig. 4). In some embodiments, the input to the first level CNN 400 comprises an input image block 402 with three R/G/B channels (i.e., one of the sets of image blocks 320 shown in fig. 3), each channel of size 16 × 16. In other embodiments, the input to the first level CNN 400 comprises a single-channel grayscale version of the input image block 402. For a given input image block 402, using the grayscale version requires a shorter processing time than using the three R/G/B channels; thus, if the detection performance of the two types of inputs is substantially the same, using a grayscale image has a clear advantage. In the embodiment shown, the CONV(1) layer comprises 10 3 × 3 filters with step size 1, so the output of the CONV(1) layer is of size 14 × 14 × 10. The MP(1) layer uses a 2 × 2 pooling window with step size 2, so its output size is 7 × 7 × 10. The CONV(2) layer includes 16 3 × 3 filters with step size 1, so its output is of size 5 × 5 × 16. The MP(2) layer uses a 2 × 2 pooling window with step size 2, so its output size is 3 × 3 × 16. The outputs of the first and last FC layers are 32 × 1 and 16 × 1 vectors, respectively. In some embodiments, of the final 16 × 1 output vector, the first two outputs are used to generate a face detection confidence index (also referred to as the "face classifier"); the next 4 outputs are the bounding box coordinates of the face in the image block 402, if one is detected (also referred to as the "bounding box regression operator"); and the last 10 outputs represent the positions of 5 face keypoints of the detected face, i.e., the left eye, right eye, nose, and the two mouth corners (also referred to as the "keypoint localization operator"). Thus, the first level CNN 400 outputs a set of candidate face windows/bounding boxes (corresponding to a subset of the image blocks 320 shown in fig. 3).
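The layer sizes quoted above can be checked with a small shape calculator. Note that reproducing the 5 → 3 pooling transition implies the MP layers round up (ceiling mode); that rounding rule is an inference from the stated sizes, not an explicit statement in the text:

```python
import math

def conv3x3(n):        # 3 x 3 filter, step size 1, no padding
    return n - 2

def maxpool2x2(n):     # 2 x 2 window, step size 2, ceiling mode assumed
    return math.ceil((n - 2) / 2) + 1

n = 16                                        # 16 x 16 input image block
n = maxpool2x2(conv3x3(n)); assert n == 7     # CONV(1) -> 14, MP(1) -> 7
n = maxpool2x2(conv3x3(n)); assert n == 3     # CONV(2) -> 5,  MP(2) -> 3
# 3 x 3 x 16 = 144 values then feed FC(1) (32 outputs) and FC(2) (16 outputs).
```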
It is noted that the combination of the number of layers and filters, the input image size, the filter and pooling window sizes, the FC layer output sizes, and other parameters shown in the first level CNN 400 is only one exemplary configuration of the first level CNN 308. The first level CNN 308 may be constructed with other configurations having one or more parameter values different from those shown in fig. 4 without departing from the scope of the technology described herein. In some embodiments, such as the exemplary first level CNN 400 shown in fig. 4, the first level CNN 308 satisfies the constraints of a small-scale hardware CNN module, for example the embedded hardware CNN module within Hi3519, so that this embedded hardware CNN module may implement the first level CNN 308.
In some embodiments, to eliminate more "false alarms," i.e., image blocks that the first level CNN 308 classifies as faces but that do not actually contain a face, a filter may be applied to the face detection confidence index at the detection output. The filter retains only image blocks whose face detection confidence index is greater than a threshold (typically set between 0.5 and 0.7). In some embodiments, this filtering operation is implemented after the last FC layer in the first level CNN 308.
It should be noted that because multi-resolution representations are generated using the pyramid technique and image blocks are generated using the sliding window technique, multiple overlapping but distinct bounding boxes may be generated around each face in the input image. In some embodiments, for each image block classified as a face by the first level CNN 308, a corresponding image region is identified in the original input video image 302. Next, highly overlapping bounding boxes are merged using the non-maximum suppression (NMS) technique, as described in MTCNN. The NMS operation may be performed after the filtering operation on the candidate face windows described above; in some embodiments, this NMS operation is implemented within the first level CNN 308 of the face detection system 300. After the NMS operation, the remaining bounding boxes may be refined through a bounding box regression operation to adjust their locations, as also described in MTCNN; this regression operation may likewise be performed within the first level CNN 308. Thus, after these operations, the first level CNN 308 outputs a set of candidate face bounding boxes, or "candidate face windows."
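A sketch of the confidence filtering followed by greedy NMS is shown below; the greedy IoU formulation is the standard one, and the specific thresholds are assumptions within the 0.5-0.7 range mentioned above:

```python
import numpy as np

def filter_and_nms(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Drop boxes scoring below `score_thresh`, then greedily suppress any
    box whose IoU with a higher-scoring kept box exceeds `iou_thresh`.
    `boxes` holds (x1, y1, x2, y2) rows; `scores` are confidence indices."""
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(scores)[::-1]
    selected = []
    while order.size:
        i = order[0]
        selected.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * \
                (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]
    return boxes[selected], scores[selected]
```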
In some embodiments, for each candidate face window 322 output by the first level CNN 308, a corresponding image block is located and cropped from the original input video image 302, and the cropped image block is then resized to the user-specified input size of the second level CNN 310. Under this coarse-to-fine approach, the user-specified input size of the second level CNN 310 should be larger than the input size of the first level CNN 308. In some embodiments, the input size of the second level CNN 310 is 24 × 24, so the resized image blocks are also of size 24 × 24. In other embodiments, input sizes similar to, but slightly different from, 24 × 24 may be employed without departing from the scope of the described techniques. The process of generating resized image blocks from the candidate face windows 322 may be implemented in hardware, software, or a combination of both; the corresponding processing module, not explicitly shown in the figure, may be located between the first level CNN 308 and the second level CNN 310. Next, the second level CNN 310 receives the resized image blocks 324. Depending on the hardware configuration, the second level CNN 310 may process the received image blocks 324 sequentially, block by block, or process multiple image blocks in parallel to speed up processing. Some embodiments of the second level CNN 310 are described in more detail below.
Fig. 5 shows a block diagram of an exemplary implementation process 500 of the second level CNN 310 based on small-scale hardware CNN modules, according to some embodiments described herein.
As can be seen in fig. 5, the second level CNN 500 includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)), followed by two FC layers (i.e., FC(1) and FC(2)). In some embodiments, each CONV layer and each FC layer (except the last FC layer) is followed by a ReLU layer (not shown in fig. 5). In some embodiments, the second level CNN 500 satisfies the constraints of the embedded hardware CNN module of Hi3519. For example, the input to the second level CNN 500 is a single-channel grayscale image 502 of size 24 × 24 (i.e., one of the resized image blocks 324 in fig. 3), rather than the 24 × 24 × 3 RGB image used in the second stage of the MTCNN. This is because the maximum input size supported by Hi3519 is 1280 pixels, while 24 × 24 × 3 = 1728. Experimental results show, however, that the performance penalty of using grayscale rather than color images is not significant. Thus, the second level CNN 500 can be efficiently implemented with a small-scale hardware CNN module, such as the embedded CNN module within Hi3519.
In the embodiment shown, the CONV(1) layer includes 28 3 × 3 filters with step size 1, so the output of the CONV(1) layer is of size 22 × 22 × 28 (based on the 24 × 24 input). The MP(1) layer uses a 2 × 2 pooling window with step size 2, so its output size is 11 × 11 × 28. The CONV(2) layer includes 32 3 × 3 filters with step size 1, so its output is of size 9 × 9 × 32. The MP(2) layer uses a 2 × 2 pooling window with step size 2, so its output size is 5 × 5 × 32. The CONV(3) layer includes 48 3 × 3 filters with step size 1, so its output is of size 3 × 3 × 48. The MP(3) layer uses a 2 × 2 pooling window with step size 2, so its output size is 2 × 2 × 48. The outputs of the first and last FC layers are 128 × 1 and 16 × 1 vectors, respectively. It should be noted that although each CONV layer uses more filters than the first level CNN 400 and the FC layers are larger than those of the first level CNN 400, the design of the second level CNN 500 still satisfies the constraints of the embedded CNN module of Hi3519.
As can be seen, the output of the last FC layer of the second level CNN 500 is again a 16 × 1 vector, in which the first two outputs are used to generate the face detection confidence index, or face classifier; the next 4 outputs are the bounding box coordinates, or bounding box regression operator, for the face in the image block 502 (if a face is detected); and the last 10 outputs represent the positions of the 5 face keypoints of the detected face, i.e., the left eye, the right eye, the nose, and the two mouth corners, i.e., the keypoint localization operator. However, since the input image resolution of the second level CNN 500 is higher than that of the first level CNN 400, and the CNN 500 has greater processing power than the CNN 400, the face detection accuracy of the CNN 500 is also higher than that of the CNN 400. Thus, the second level CNN 500 outputs a set of candidate face windows/bounding boxes (such as the candidate face windows 326 shown in fig. 3) corresponding to a subset of the input image blocks.
Similar to the first level CNN 308, a confidence index threshold may be applied to the face detection confidence index at the detection output of the second level CNN 310, retaining only input image blocks with face detection confidence indices greater than the threshold. In some embodiments, this filtering operation is implemented after the last FC layer in the second level CNN 310. Similarly, after filtering the candidate bounding boxes, the highly overlapping candidate bounding boxes may be merged using the NMS technique mentioned above; in some embodiments, this NMS operation is also implemented in the second level CNN 310. Typically, only a small subset of the candidate face windows remains after the filtering and NMS operations. After the NMS operation, the locations of the remaining bounding boxes may be refined by the bounding box regression operator, and this refinement may also be implemented in the second level CNN 310.
It is noted that the combination of the number of layers and filters, the input image size, the filter and pooling window sizes, the FC layer output sizes, and other parameters shown in the second level CNN 500 is only one exemplary configuration of the second level CNN 310. The second level CNN 310 may be constructed with other configurations having one or more parameter values different from those shown in fig. 5 without departing from the scope of the technology described herein. For example, the input size of the second level CNN 310 need not be 24 × 24; other similar sizes, e.g., 32 × 32, may also be used. In some embodiments, such as the exemplary second level CNN 500 shown in fig. 5, the second level CNN 310 satisfies the constraints of a small-scale hardware CNN module, for example the embedded hardware CNN module within Hi3519, so that this embedded hardware CNN module may implement the second level CNN 310.
In some embodiments, for each candidate face window 326 output by the second level CNN 310, a corresponding image block is located and cropped from the original input video image 302, and the cropped image block is then resized to the user-specified input size of the third level CNN 312. Under this coarse-to-fine approach, the user-specified input size of the third level CNN 312 should be larger than the input sizes of the first and second level CNNs 308 and 310. In some embodiments, the input size of the third level CNN 312 is 46 × 46, so the resized image blocks are also of size 46 × 46. In other embodiments, input sizes similar to, but slightly different from, 46 × 46 may be employed without departing from the scope of the described techniques. The process of generating resized image blocks from the candidate bounding boxes may be implemented in hardware, software, or a combination of both; the corresponding processing module, not explicitly shown in the figure, may be located between the second level CNN 310 and the third level CNN 312. Next, the third level CNN 312 receives the resized image blocks 328 for final refinement. Depending on the hardware configuration, the third level CNN 312 may process the received image blocks 328 sequentially, block by block, or process multiple image blocks in parallel to speed up processing.
In principle, the third level CNN 312 processes the input image blocks 328 in a manner similar to the first level CNN 308 and the second level CNN 310. For example, fig. 6 illustrates a block diagram of an exemplary implementation process 600 of the third level CNN 312, according to some embodiments described herein.
As can be seen in fig. 6, the third level CNN 600 also includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)), followed by two FC layers (i.e., FC(1) and FC(2)). In the embodiment shown, the CONV(1) layer includes 32 3 × 3 filters with step size 1, so the output of the CONV(1) layer is of size 44 × 44 × 32 (based on the 46 × 46 input). The MP(1) layer uses a 2 × 2 pooling window with step size 2, so its output size is 22 × 22 × 32. The CONV(2) layer includes 50 3 × 3 filters with step size 1, so its output is of size 20 × 20 × 50. The MP(2) layer uses a 2 × 2 pooling window with step size 2, so its output size is 10 × 10 × 50. The CONV(3) layer includes 50 3 × 3 filters with step size 1, so its output is of size 8 × 8 × 50. The MP(3) layer uses a 2 × 2 pooling window with step size 2, so its output size is 4 × 4 × 50. The outputs of the first and last FC layers are 256 × 1 and 16 × 1 vectors, respectively.
It is noted that the size of the input image block 602 (i.e., one of the resized image blocks 328 in fig. 3) is 46 × 46 × 1 = 2116 pixels (i.e., using only a single grayscale channel), so a CNN module implementing the third level CNN 600 discussed above must support an input size of at least 2116 pixels. A CNN module whose maximum input size is less than 2116 cannot be used to implement the third level CNN 600 directly. Thus, in the embodiment shown in fig. 6, the embedded hardware CNN module of Hi3519, which only supports a maximum input size of 1280 pixels, cannot implement this third level CNN 600 directly, even though such a design is beneficial when optimizing the network parameters at the design stage.
To solve the above problem, the sub-image based CNN system and technique described in the related patent application may be employed. More specifically, with the sub-image based CNN system and technique, the input image block 602 may be divided into a set of overlapping sub-images. For example, fig. 7 illustrates an exemplary partitioning scheme for the 46 × 46 image block, in accordance with some embodiments described herein. As can be seen on the left side of fig. 7, the input image block 602 is divided into a set of 4 overlapping sub-images or blocks, each of size 30 × 30, with an offset or step size of 16 pixels between adjacent sub-images. Note that in fig. 7, small artificial offsets are added to the overlapping configuration of the 4 sub-images so that they can be better visualized and distinguished. These offsets serve only to visualize the overlapping sub-images in the figure and should not be understood as actual offsets between the sub-images. In practice, the row coordinates of the 4 sub-images start at 1 and 17, respectively, and the column coordinates likewise start at 1 and 17. The set of 4 overlapping sub-images without artificial offsets is shown as a smaller inset in the upper right corner of the main image.
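The partitioning of fig. 7 can be written out directly. The snippet below uses 0-based indexing, so the text's start coordinates 1 and 17 become 0 and 16:

```python
import numpy as np

block = np.random.rand(46, 46)             # 46 x 46 grayscale image block
SUB, STEP = 30, 16                         # sub-image size and offset

sub_images = [block[r:r + SUB, c:c + SUB]
              for r in (0, STEP) for c in (0, STEP)]
assert all(s.shape == (30, 30) for s in sub_images)
# The second start offset ends at 16 + 30 = 46, exactly covering the block,
# with a 30 - 16 = 14 pixel overlap between adjacent sub-images.
```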
It is noted that these specific values (i.e., the 46 × 46 input image size, the 30 × 30 sub-image size, and the 16-pixel step size in each direction) are based on the design described in related patent application 15/441,194, which is incorporated herein by reference. As described above and demonstrated in the related patent application, these design values ensure that the merged outputs for the 4 sub-images are equivalent to the output of the third level CNN 600 processing the entire input image block without the sub-image based CNN technique.
Fig. 8 illustrates a block diagram of an exemplary implementation process 800 of the third level CNN 312 based on small-scale hardware CNN modules, according to some embodiments of the present application. As can be seen in fig. 8, the third level CNN 800 also includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)) having the same parameters as the corresponding CONV and MP layers of the third level CNN 600. The third level CNN 800 further includes an input module 802, which receives the 46 × 46 input image block 602. The input module 802 is configured to divide the image block 602 into 4 sub-images 804 of size 30 × 30; this sub-image size is smaller than the maximum input image size supported by the embedded hardware CNN within Hi3519. More detailed operation of the input module 802 can be found in related patent application 15/441,194 (e.g., the input module 212 shown in fig. 2B), the contents of which are incorporated herein by reference.
In some embodiments, the three stages of CONV and MP layers of the third level CNN 800 sequentially process the 4 sub-images 804. As can be seen in fig. 8, for a given 30 × 30 sub-image 804 (a portion of the image block 602), the CONV(1) layer includes 32 3 × 3 filters with step size 1, so its output is of size 28 × 28 × 32. The MP(1) layer uses a 2 × 2 pooling window with step size 2, so its output size is 14 × 14 × 32. The CONV(2) layer includes 50 3 × 3 filters with step size 1, so its output is of size 12 × 12 × 50. The MP(2) layer uses a 2 × 2 pooling window with step size 2, so its output size is 6 × 6 × 50. The CONV(3) layer includes 50 3 × 3 filters with step size 1, so its output is of size 4 × 4 × 50. The MP(3) layer uses a 2 × 2 pooling window with step size 2, so its output size is 2 × 2 × 50, i.e., 50 2 × 2 feature maps 806. For the set of 4 sub-images 804, the MP(3) layer thus generates 4 sets of 2 × 2 × 50 feature maps 806.
As shown in fig. 8, the third level CNN 800 further comprises a merge module 808, which is configured to receive and merge the 4 sets of 2 × 2 × 50 feature maps 806 to form the complete feature maps of the full input image block 602, the input of the third level CNN 800. More detailed operation of the merge module 808 can be found in related patent application 15/441,194 (e.g., the merge module 222 shown in fig. 2B), the contents of which are incorporated herein by reference. As described in the related patent application, the output feature maps associated with the set of 4 sub-images 804 have neither overlap nor gaps between adjacent feature maps corresponding to adjacent sub-images. The output feature maps can therefore be merged directly before the first FC layer to generate the same output as the third level CNN 600 in fig. 6. The merged result, i.e., the output of the third level CNN 800, is 50 sets of 4 × 4 feature maps 810, one of which is shown on the right side of fig. 7.
In some embodiments, the embedded hardware CNN of Hi3519 is used to implement the three stages of CONV and MP layers shown in the third level CNN 800. However, the embedded hardware CNN of Hi3519 also requires at least three FC layers. In one embodiment, to accommodate the FC layers required by Hi3519, the third level CNN 800 further includes two virtual FC layers (not explicitly shown in the figure) whose weight matrices are set to identity matrices. Furthermore, in Hi3519 each FC layer is followed by a ReLU layer. However, as disclosed in the related patent application, the ReLU layers do not affect the output of the virtual FC layers, because a cascade of multiple ReLU layers is equivalent to a single ReLU layer.
It should be noted that the input image size of the third level CNN 800 is not necessarily 46 × 46. It may be another size smaller than the maximum input size of the embedded hardware CNN of Hi3519 (in which case the input need not be divided into sub-images), or another larger feasible size; the requirements for such feasible sizes can be found in related patent application 15/441,194, which is incorporated herein by reference. For example, another possible input image size for the third level CNN 800 is 62 × 62. With this image size, the input image block may be divided into 9 overlapping sub-images, each of size 30 × 30, with a step size of 16 pixels between adjacent sub-images in both the horizontal and vertical directions.
Returning to fig. 3, if the third level CNN 312 of the face detection system 300 is implemented by the third level CNN 800, the third level CNN 312 outputs 50 sets of 4 × 4 feature maps 810, which are then input to the final decision module 314. In some embodiments, the final decision module 314 includes multiple FC layers that operate on the received feature maps and generate a final decision for the input video image 302, such as the face detection decision 316 shown in fig. 3.
Fig. 9 illustrates a block diagram of an exemplary implementation process 900 of the final decision module 314, according to some embodiments of the present application. As can be seen in fig. 9, the 50 sets of 4 × 4 feature maps 810 are received and processed by a reorganization module, which is configured to combine and reorganize the set of two-dimensional feature maps into a one-dimensional vector of size 800 × 1. The one-dimensional vector is further processed by two FC layers, FC(1) and FC(2), the latter of which outputs the face detection decision 316 for a given detected moving region 318. In some embodiments, the last FC layer, FC(2), is implemented with a linear classifier such as a softmax classifier. In the illustrated embodiment, the face detection decision 316 may include a face classifier 904, a bounding box regression operator 906, and a face keypoint localization operator 908. As described above, the keypoint localization operator 908 within the face detection decision 316 may include the 5 face keypoints of the detected face, i.e., the left eye, the right eye, the nose, and the two mouth corners. Although the output sizes of the two FC layers within the final decision module 900 are 256 and 16, respectively, in other embodiments the final decision module 314 may use FC layer sizes different from those of the final decision module 900. It should be noted that the final decision module 900 can be implemented in software and executed on the CPU of Hi3519, since its computational complexity is much lower than that of any of the three CNN levels 308, 310, and 312.
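A software sketch of this final decision module follows. The random weights are placeholders, and the 2/4/10 split of the 16 outputs mirrors the face classifier, bounding box regression operator, and keypoint localization operator described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def final_decision(feature_maps, w1, b1, w2, b2):
    """50 feature maps of 4 x 4 -> 800 x 1 vector -> FC(1) -> FC(2)."""
    x = np.asarray(feature_maps).reshape(800)     # reorganization step
    h = np.maximum(w1 @ x + b1, 0.0)              # FC(1) + ReLU, 256 outputs
    out = w2 @ h + b2                             # FC(2), 16 outputs
    return softmax(out[:2]), out[2:6], out[6:16]  # face / bbox / keypoints

w1 = 0.01 * np.random.randn(256, 800); b1 = np.zeros(256)
w2 = 0.01 * np.random.randn(16, 256);  b2 = np.zeros(16)
face_prob, bbox_reg, keypoints = final_decision(np.random.randn(50, 4, 4),
                                                w1, b1, w2, b2)
```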
Fig. 10 presents a flowchart describing an exemplary face detection process 1000 using the face detection system 300 disclosed herein, executed on a CNN-enabled embedded system, in accordance with some embodiments of the present application. In some embodiments, the CNN-enabled embedded system comprises a small-scale, low-cost system-on-chip, such as the Hi3519 system-on-chip. The face detection process 1000 begins when a video image/frame is received at the input of the face detection system disclosed herein (step 1002). In some embodiments, the video image is acquired by a high resolution camera, such as a surveillance camera, a machine vision camera, a camera on an autonomous vehicle, or a mobile phone camera.
Next, in the face detection process 1000, a motion detection operation may be performed on the input video image/frame to locate and identify a set of moving regions within the video frame, i.e., image blocks within the video frame that are associated with motion (step 1004). In some embodiments, the motion detection operation is implemented using an embedded background elimination module within the CNN-enabled embedded system to detect moving regions within the video image/frame. The output of the motion detection operation includes a set of identified moving regions within the video frame. In some embodiments, the motion detection operation may be replaced by or combined with a face tracking operation. It should be noted that by combining motion detection and face tracking in the face detection process 1000, the face detection speed can be significantly increased. In some embodiments, the face detection process 1000 omits this motion detection operation.
Next, in the face detection process 1000, for each detected moving region, a pyramid generation operation may be performed to generate a multi-resolution representation of the detected moving region (step 1006). More specifically, a higher-resolution representation of the detected moving region may be used to detect smaller faces in the original input video image, while a lower-resolution representation may be used to detect larger faces.
Next, in the face detection process 1000, a sliding window operation is performed on each image in the multi-resolution representation to generate a set of image blocks for that image (step 1008). In some embodiments, the size of the sliding window is determined by the first input size of a first CNN processing stage configured with a first complexity.
Next, in the face detection process 1000, the first CNN processing stage is used to process all image blocks corresponding to each sliding window position of each multi-resolution representation of the detected moving region to generate a first set of candidate face windows (step 1010). In some embodiments, each window in the first set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each candidate face window is also associated with 5 face keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. In some embodiments, the first CNN processing stage satisfies the constraints of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that this CNN processing stage may be implemented by the embedded hardware CNN module within Hi3519.
Next, in the face detection process 1000, a second CNN processing stage is used to process a first set of resized image blocks corresponding to the first set of candidate face windows to generate a second set of candidate face windows (step 1012). In some embodiments, the second CNN processing stage has a second complexity that is higher than the first complexity. In some embodiments, the first set of resized image blocks has a size equal to the second input size of the second CNN processing stage, where the second input size is larger than the first input size of the first CNN processing stage. Thus, the second CNN processing stage processes higher-resolution input image blocks with higher face detection accuracy than the first CNN processing stage. In some embodiments, each window in the second set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each candidate face window is also associated with 5 face keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. In some embodiments, the second CNN processing stage satisfies the constraints of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that this CNN processing stage may be implemented by the embedded hardware CNN module within Hi3519.
Next, in the face detection process 1000, a third CNN processing stage is used to process a second set of resized image blocks corresponding to the second set of candidate face windows to generate a third set of candidate face windows (step 1014). In some embodiments, the third CNN processing stage has a third complexity that is higher than the first and second complexities. In some embodiments, the second set of resized image blocks has a size equal to the third input size of the third CNN processing stage, where the third input size is larger than the first and second input sizes of the first and second CNN processing stages. Thus, the third CNN processing stage processes higher-resolution input image blocks with higher face detection accuracy than the first and second CNN processing stages. In some embodiments, each window in the third set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each window in the third set of candidate face windows is also associated with 5 face keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. It should be noted that steps 1006 through 1014 are repeated for each detected moving region within the original input video frame.
In some embodiments, this third CNN processing stage is also ideally implemented using a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519. However, since the input size of the third CNN processing stage may be larger than the maximum input size of the small-scale hardware CNN module, the sub-image based CNN method needs to be adopted.
Fig. 11 presents a flowchart describing an exemplary process 1100 for processing a second set of resized image blocks (i.e., step 1014 of process 1000) using a sub-image based CNN system, in accordance with some embodiments described herein.
Initially, a given resized image block is divided into a set of sub-images of smaller size (step 1102). In some embodiments, the set of sub-images comprises a two-dimensional array of overlapping sub-images. For example, a 46 × 46 image block may be divided into a set of 4 overlapping sub-images, where each sub-image is of size 30 × 30 and there is a 16-pixel offset between adjacent sub-images. Further, the sub-image size is smaller than the maximum input size of the small-scale hardware CNN module, such as the embedded hardware CNN module of Hi3519.
Next, the small-scale hardware CNN module sequentially processes the set of sub-images to generate an array of feature maps (step 1104). In some embodiments, processing each sub-image with the small-scale hardware CNN module comprises applying multiple stages of CONV and MP layers to the sub-image.
Next, the array of feature maps output by the small-scale hardware CNN module is merged into a set of merged feature maps (step 1106). More specifically, the merged feature maps are equivalent to the complete feature maps that a large-scale CNN would generate by processing the entire high-resolution resized image block directly, without partitioning. Next, a second CNN module processes the merged feature maps to predict whether the resized image block contains a face (step 1108). In some embodiments, processing the set of merged feature maps comprises applying multiple FC layers to the merged feature maps.
It should be noted that although the embodiments of the face detection system disclosed above apply the subimage-based CNN technique to the last CNN stage of a cascaded CNN system, in other embodiments, the face detection system may apply the subimage-based CNN technique to more than one stage of the cascaded CNN system, for example, to the last two stages of the cascaded system.
Age and gender assessment based on face images
It should be noted that automatic assessment of the age and gender of people captured in video images is a highly desirable feature in many commercial applications, such as retail, marketing, security surveillance, and social networking. For example, a security camera system installed in a retail store can use this capability to automatically classify the customers captured by the camera according to assessed gender and age group, which helps the retailer better understand customer types, trends, and habits. Therefore, after a face is detected in a captured video image, a CNN may be applied to the detected face image to perform age and gender assessment.
Prior to the present application, a number of methods for age and gender assessment using CNNs have been disclosed. Rothe et al. ("Deep expectation of real and apparent age from a single image without facial landmarks," International Journal of Computer Vision (IJCV), July 2016) processed input images with a resolution of 256 × 256 using a 16-layer VGG neural network with 13 CONV layers and 3 FC layers. Levi et al. ("Age and gender classification using convolutional neural networks," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops) employed color images with a resolution of 227 × 227. More specifically, Levi et al. used a network containing three CONV layers, each followed by a rectified linear operation and an MP layer that takes the maximum value over 3 × 3 regions. The first CONV layer contains 96 filters of 3 × 7 × 7 pixels, the second CONV layer contains 256 filters of 96 × 5 × 5 pixels, and the third CONV layer contains 384 filters of 256 × 3 × 3 pixels; the first two CONV layers are also followed by a local response normalization layer. After the CONV layers, two FC layers containing 512 neurons each are employed, each followed by a ReLU and a dropout layer. Finally, a third FC layer maps the output to the final classes for age and gender classification. Malli et al. (CVPR Workshops, 2016) trained an ensemble of VGG-based deep learning models, whose outputs are then combined to generate the final estimate. Yi et al. ("Age estimation by multi-scale convolutional network," Asian Conference on Computer Vision, 2014) developed a multi-patch, multi-scale solution by fusing features of image blocks of different sizes around certain face keypoints; this approach also jointly estimates gender and age through the same network.
It should be noted that, similar to the problem described above of implementing a cascaded CNN architecture for face detection on an embedded system integrating a small-scale, low-cost CNN module, the above age and gender assessment techniques are generally not applicable to low-cost embedded systems based on small-scale hardware CNN modules, due to their complexity. For example, if one attempted to implement one of the above age and gender assessment techniques on the embedded CNN module of the Hi3519 SoC, the numerous limitations of that module would have to be considered. As described above, the maximum input image size that the Hi3519 CNN module can support is 1280 pixels, a limit imposed by the hardware configuration of Hi3519. Therefore, when grayscale face images are used as input and the face images are assumed to be square, their resolution must be smaller than 36 × 36. When color face images must be used, the maximum supportable resolution is even lower. Such low resolutions can significantly reduce the accuracy of age and gender assessment, especially when the face images already suffer from reduced quality due to other factors such as poor lighting, motion blur, and varying poses.
Other limitations of the embedded CNN module of Hi3519 arising from its specific design include, but are not limited to: (1) the width and height of the input image must be even numbers of pixels; (2) the maximum number of CONV layers or FC layers is 8; (3) the maximum output size of an FC layer is 256; (4) the input size of the first FC layer must not exceed 1024; (5) each CONV layer must be followed by an MP layer; (6) the number of filters in each CONV layer is at most 50; (7) in each MP layer, the embedded CNN supports only a pooling window size of 2 × 2 pixels; and (8) the embedded CNN module supports only 3 × 3 filters in the CONV layers.
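These enumerated constraints lend themselves to a small design-time validator. The following is a hypothetical helper with illustrative field names; it is not part of any Hi3519 SDK:

```python
def check_hi3519_constraints(spec):
    """Return a list of violations of the constraints enumerated above."""
    errors = []
    w, h = spec["input_width"], spec["input_height"]
    if w % 2 or h % 2:
        errors.append("(1) input width and height must be even")
    if spec["num_conv_layers"] > 8 or spec["num_fc_layers"] > 8:
        errors.append("(2) at most 8 CONV or FC layers")
    if any(n > 256 for n in spec["fc_output_sizes"]):
        errors.append("(3) FC output size is capped at 256")
    if spec["first_fc_input"] > 1024:
        errors.append("(4) first FC input must not exceed 1024")
    if any(f > 50 for f in spec["conv_filters"]):
        errors.append("(6) at most 50 filters per CONV layer")
    if w * h * spec.get("channels", 1) > 1280:
        errors.append("total input pixels exceed 1280")
    return errors

# The grayscale second-level design above (24 x 24 input, 28/32/48 filters,
# FC outputs 128 and 16, first FC input 2 x 2 x 48 = 192) passes all checks:
spec = {"input_width": 24, "input_height": 24, "channels": 1,
        "num_conv_layers": 3, "num_fc_layers": 2, "first_fc_input": 192,
        "fc_output_sizes": [128, 16], "conv_filters": [28, 32, 48]}
assert check_hi3519_constraints(spec) == []
```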
Although such low-cost embedded systems have these limitations, it is nevertheless beneficial to combine age and gender assessment capabilities with them, so that detected faces can be processed in combination with the face detection capabilities described above. For example, performing age and gender assessment in the field within a surveillance camera system, based on acquired and detected face images and without a separate server to perform these operations, can significantly reduce operational costs. However, to ensure that the accuracy of the assessment is not compromised, the detected face image should be processed at its original resolution.
Therefore, the age and gender assessment system proposed in the present application can also process detected face images whose size exceeds the maximum number of input pixels supportable by a given small-scale CNN module, using the sub-image based technique described above according to the divide-and-conquer principle. In some embodiments, the age and gender assessment system may utilize the sub-image based technique to first divide the high-resolution input face image into a set of appropriately sized image blocks (also referred to as "sub-images") with a carefully designed overlap between adjacent blocks. Each image block is then processed separately by a small-scale CNN module, such as the embedded CNN module within the Hi3519. The outputs corresponding to the set of image blocks are subsequently merged to obtain the output corresponding to the high-resolution input face image, and the merged output can be processed by subsequent layers in the age and gender assessment system to generate the age and gender classification for the input face image. In some embodiments, this sub-image based age and gender assessment system can be designed to be equivalent to a large-scale CNN that processes the entire high-resolution input face image directly without partitioning, so that the output of the sub-image based CNN system proposed in the present application can be exactly equivalent to the output of such a large-scale CNN.
Fig. 12 illustrates a block diagram of an exemplary age and gender assessment neural network 1200 based on small-scale CNN modules, according to some embodiments of the present application. As can be seen in fig. 12, the neural network 1200 receives an input face image 1202 of size 46 × 46 as input. As described above, the input face image 1202 can be represented either by 3 R/G/B channels, each of size 46 × 46, or by a single grayscale image/channel of size 46 × 46. The input face image 1202 then passes through 3 CONV layers (i.e., CONV(1)–CONV(3)), each followed by a ReLU layer (not shown) and an MP layer (i.e., MP(1)–MP(3)). It should be noted that each CONV layer employs 3 × 3 filters, and each MP layer employs a pooling window of size 2 × 2 pixels. These design parameters satisfy the constraints of small-scale CNN modules such as the one in the Hi3519.
After these CONV/ReLU/MP layers, an FC layer (i.e., FC(1)) and an output layer based on a softmax classifier perform the final classification/assessment of the input face image 1202. Although the final output layer is shown as based on a softmax classifier, other types of classifiers suitable for deep learning applications may also be employed to implement the final output layer of the neural network 1200. Note that the first two outputs of the neural network 1200 (gender: 2) are used for gender assessment, i.e., the likelihoods of male and female. The next 8 outputs (age group: 8) of the neural network 1200 correspond to the likelihoods of 8 age groups, for example the age ranges 0–2 years, 4–6 years, 8–13 years, 15–20 years, 25–32 years, 38–43 years, 48–53 years, and 60 years and above. It should be noted that, in application environments where only gender or only age group needs to be assessed, the output of the neural network 1200 may be simplified to include only the gender outputs or only the age outputs.
In some embodiments, the network parameters of the age and gender assessment neural network 1200 are learned from training face images. In some embodiments, the network parameters that may be optimized by the training process include the number of filters in each CONV layer and the weights and biases of the filters in each CONV layer, among others. Thus, the numbers shown in fig. 12 are merely an exemplary configuration of the trained neural network 1200. More specifically, the CONV(1) layer includes 24 filters of size 3 × 3 with a stride of 1. Thus, for an input image 1202 of size 46 × 46, the output of the CONV(1) layer is of size 44 × 44 × 24. The MP(1) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(1) layer is 22 × 22 × 24. The CONV(2) layer includes 48 filters of size 3 × 3 with a stride of 1; therefore, the output of the CONV(2) layer is of size 20 × 20 × 48. The MP(2) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(2) layer is 10 × 10 × 48. The CONV(3) layer includes 48 filters of size 3 × 3 with a stride of 1; therefore, the output of the CONV(3) layer is of size 8 × 8 × 48. The MP(3) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(3) layer is 4 × 4 × 48. Thus, the input size of the FC(1) layer is 768. The outputs of the FC(1) layer and the softmax classifier are a 256 × 1 vector and a 10 × 1 vector, respectively.
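To make the layer arithmetic above concrete, the exemplary configuration can be written down in a few lines of PyTorch (a minimal sketch of ours, assuming a single grayscale input channel; the class and variable names are illustrative, not from the patent):

```python
import torch
import torch.nn as nn

# Sketch of the exemplary network 1200: three CONV/ReLU/MP stages,
# one FC layer, and a 10-way softmax head (2 gender + 8 age-group outputs).
class AgeGenderNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 24, kernel_size=3, stride=1),   # 46x46 -> 44x44x24
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                   # -> 22x22x24
            nn.Conv2d(24, 48, kernel_size=3, stride=1),  # -> 20x20x48
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                   # -> 10x10x48
            nn.Conv2d(48, 48, kernel_size=3, stride=1),  # -> 8x8x48
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                   # -> 4x4x48 = 768
        )
        self.fc1 = nn.Linear(768, 256)
        self.out = nn.Linear(256, 10)  # 2 gender + 8 age-group logits

    def forward(self, x):
        x = self.features(x).flatten(1)   # (N, 768)
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.out(x), dim=1)

# Shape check: a 46x46 grayscale face yields a 10-way distribution.
probs = AgeGenderNet()(torch.randn(1, 1, 46, 46))
assert probs.shape == (1, 10)
```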
In some embodiments, the network parameters of the trained neural network 1200 described above satisfy the constraints of a given small-scale CNN module, such as the CNN module in the Hi3519 SoC, so that the given small-scale CNN module can be used to implement the neural network 1200. However, the neural network 1200 can also be constructed with other configurations, including a different number of CONV/ReLU/MP layers, differently sized filters, and differently sized FC layers, and remains within the scope of the present disclosure as long as the given configuration satisfies the limitations of the given small-scale CNN module, such as the one in the Hi3519.
Note that the exemplary configuration of the neural network 1200, which receives the 46 × 46 input image 1202 directly at the first CONV layer, can be trained with 46 × 46 training images during the design stage of the proposed small-scale CNN module-based age and gender assessment CNN system. However, for the embedded hardware CNN module in the Hi3519, the size of the input face image 1202 exceeds the maximum number of input pixels the module can support. Thus, the exemplary configuration of the neural network 1200 cannot be implemented directly on the embedded hardware CNN module of the Hi3519, because the size of the input image 1202 exceeds the maximum input image size the CONV(1) layer of the Hi3519 CNN module can accept.
To address this problem, the sub-image based CNN system and technique described in related patent application 15/441,194 may be employed. More specifically, using the sub-image based CNN system and technique, the input face image 1202 may be divided into a set of suitably sized sub-images, or image blocks, with a carefully designed overlap between adjacent blocks. A suitable partitioning scheme for a 46 × 46 input image has been described above in connection with fig. 7. Under this scheme, the input face image 1202 is divided into a set of 4 overlapping sub-images, each of size 30 × 30, with an offset, or stride, of 16 pixels between adjacent sub-images.
It should be noted that although the example shown in fig. 12 uses a 46 × 46 image as the input to the age and gender assessment CNN system, detected face images cut from original video or still images may have many different sizes. To simplify the discussion, the input face image is assumed to be a square image with equal width and height. In general, input face images may also be non-square, and the following discussion based on square images extends to non-square images.
Fig. 13 illustrates a block diagram of an exemplary age and gender assessment system 1300 based on a small-scale hardware CNN module and sub-image techniques, according to some embodiments described herein. In some embodiments, the age and gender assessment system 1300 is implemented on a low-cost embedded system that includes a small-scale CNN module, for example one integrating the Hi3519 SoC. As shown in fig. 13, the age and gender assessment system 1300 receives an input face image 1302 as input and generates a gender and age/age-group classification 1312 as output, i.e., it performs age and gender assessment for the person shown in the input face image 1302. In some embodiments, the input face image 1302 is an output of the face detection system 300 described above, which processes acquired video images. Thus, the input face image 1302 may be cropped from an original video image, such as the input video image 302. However, the input face image 1302 may also be an original input image predominantly occupied by a human face. Herein, the input face image 1302 may have a size different from 46 × 46.
Age and gender assessment system 1300 includes at least an input module 1304, a CNN module 1306, a merge module 1308, and a decision module 1310. The input module 1304 is configured to perform the above-mentioned sub-image division, so that an input face image 1302 whose size exceeds the maximum input size supportable by the small-scale hardware CNN module can be divided into a set of sub-images, each of a size no larger than that maximum input size, allowing the sub-images to be processed individually by the small-scale hardware CNN module. The CNN module 1306 can thus be implemented by the small-scale CNN module to process the smaller sub-images. The merge module 1308 merges the outputs corresponding to the set of sub-images such that the merged result is equivalent to the result of processing the input face image 1302 as a whole without the sub-image based technique. The decision module 1310 maps the output of the merge module 1308 to the final stage of age and gender classification. The age and gender assessment system 1300 may also include other modules not shown in fig. 13. The various components of the age and gender assessment system 1300 are described in greater detail below.
As shown in fig. 13, the input module 1304 receives the input face image 1302. The input module 1304 determines whether the size of the input face image 1302 is less than or equal to the maximum input image size of the CNN module 1306. For example, if the CNN module 1306 is implemented by the small-scale CNN module in the Hi3519, the maximum input image size the CNN module 1306 can support is 1280 pixels, and the input module 1304 compares the size of the input image 1302 against 1280. In some embodiments, if the size of the input face image 1302 is less than or equal to the maximum input image size of the CNN module 1306, the input module 1304 passes the input face image 1302 directly to the CNN module 1306 without performing any sub-image operations. However, in other embodiments, if the size of the input face image 1302 is less than or equal to the maximum input image size of the CNN module 1306, the input module 1304 resizes (by upsampling) the input face image 1302 to a predefined input image size that satisfies the sub-image partitioning constraint, and then partitions the resized input image into a set of sub-images for subsequent sub-image processing. The concept of this predefined input image size is described in more detail below.
If the size of the input face image 1302 is larger than the maximum input image size supportable by the CNN module 1306, the input module 1304 further determines whether the size of the input image 1302 satisfies the partitioning constraint. The related patent application 15/441,194, incorporated herein by reference, provides a partitioning constraint for CNN systems with 3 CONV layers and 3 MP layers. As discussed in the related application, the input face image 1302 should satisfy the partitioning constraint to ensure that, after the input face image 1302 is divided into a set of sub-images, the sub-images are processed, and their feature maps are merged into feature maps of the input face image 1302, each merged feature map has no gaps and no overlapping portions. The partitioning constraint also ensures that each merged feature map obtained from the set of sub-images is equivalent to the feature map that would be generated by processing the input face image 1302 as a whole, without the sub-image based CNN technique. Specifically, for a CNN system with 3 CONV layers and 3 MP layers, the width and height X of the input image should satisfy the condition X = 8n + 14, where n is a positive integer. For example, when n = 3, X = 38; when n = 4, X = 46; when n = 5, X = 54; when n = 6, X = 62. It should be noted, however, that CNN modules 1306 with other configurations, i.e., different numbers of CONV/MP layers, may have different partitioning constraints.
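As an illustrative sanity check (not part of the patent), the condition X = 8n + 14 and the per-stage size reduction of the 3-CONV/3-MP configuration can be expressed as follows; each stage shrinks the map by 2 (3 × 3 CONV, stride 1) and then halves it (2 × 2 MP, stride 2):

```python
def satisfies_partition_constraint(x: int) -> bool:
    # X = 8n + 14 for a positive integer n (3 CONV / 3 MP configuration).
    return x > 14 and (x - 14) % 8 == 0

def feature_size(x: int, stages: int = 3) -> int:
    # Each stage: 3x3 CONV with stride 1 (-2), then 2x2 MP with stride 2 (/2).
    for _ in range(stages):
        x = (x - 2) // 2
    return x

assert satisfies_partition_constraint(46) and feature_size(46) == 4  # 4x4x48 -> FC input 768
assert satisfies_partition_constraint(62) and feature_size(62) == 6
assert feature_size(30) == 2                                         # each 30x30 sub-image -> 2x2x48
```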
In some embodiments, the partitioning constraint of the age and gender assessment system 1300 takes the form of a predefined input image size that satisfies the general partitioning constraint of the configured CNN module 1306. For example, if the CNN module is a 3-CONV/3-MP CNN system, the predefined input image size of the age and gender assessment system 1300 may be chosen to satisfy the condition X = 8n + 14; the predefined size may therefore be 46 × 46 or 62 × 62. In this example, if the size of the input face image 1302 is also 46 × 46 or 62 × 62, the input face image 1302 satisfies the partitioning constraint.
If the input module 1304 determines that the input image 1302 satisfies the predefined constraint, the input module 1304 divides the input face image 1302 into a set of sub-images. For example, if the predefined input size is 46 × 46, the input face image 1302 is divided into 4 sub-images of size 30 × 30, with an offset, or stride, of 16 pixels between adjacent sub-images. Note that each sub-image is smaller than the maximum input image size of the embedded hardware CNN module in the Hi3519. In another example, if the predefined input size is 62 × 62, the input face image 1302 is divided into 9 sub-images of size 30 × 30, with an offset, or stride, of 16 pixels between adjacent sub-images. However, if the input module 1304 determines that the input image 1302 does not satisfy the predefined constraint, the input module 1304 resizes the input image 1302 to the predefined input image size. For example, if the predefined input size is 46 × 46, input images larger than 46 × 46 are downsampled to 46 × 46; conversely, input images smaller than 46 × 46 are upsampled to 46 × 46. After the input image 1302 has been resized to satisfy the constraint, it can be divided into sub-images in the manner described above.
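The tiling itself reduces to enumerating tile offsets. The sketch below (a hypothetical helper of ours) returns the top-left offsets of the overlapping 30 × 30 sub-images for input sizes of the kind just described:

```python
def subimage_offsets(input_size: int, tile: int = 30, stride: int = 16):
    # Top-left (row, col) offsets of overlapping tile x tile sub-images.
    # Assumes input_size = tile + k*stride for some integer k >= 0, which
    # holds for sizes satisfying X = 8n + 14 with even n (e.g., 46, 62).
    starts = range(0, input_size - tile + 1, stride)
    return [(r, c) for r in starts for c in starts]

print(subimage_offsets(46))       # 4 tiles: (0,0), (0,16), (16,0), (16,16)
print(len(subimage_offsets(62)))  # 9 tiles in a 3x3 grid
```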
It is noted that the age and gender assessment system 1300 may have multiple predefined input image sizes, each satisfying the general partitioning constraint of the associated CNN module and each designed for a particular type of application environment with a given typical face image resolution. For example, in an application environment where the captured face images mostly contain high-resolution faces (e.g., a smartphone camera), the corresponding predefined input image size in the age and gender assessment system 1300 should also have a higher resolution. By contrast, in an application environment where the captured face images mostly contain low-resolution faces (e.g., a surveillance camera in a retail store), the corresponding predefined input image size in the age and gender assessment system 1300 should have a lower resolution.
It should be noted that a higher-resolution predefined input image size can yield more accurate age and gender assessments, but also incurs higher computational cost and longer processing time. Conversely, a lower-resolution predefined input size requires shorter processing time, but the resulting age and gender assessments are less accurate. Accordingly, the age and gender assessment system 1300 may be designed with multiple predefined input image sizes suited to different types of application environments. When the age and gender assessment system 1300 is deployed in a given application environment, a specific predefined input image size can then be selected from among them based on the typical face image resolution associated with that environment.
Fig. 14 illustrates a flow diagram of an exemplary process 1400 for pre-processing an input face image in the input module 1304, according to some embodiments described herein. Initially, the input module 1304 receives a new input face image 1302 (step 1402). The input module 1304 then determines whether the size of the input face image 1302 is less than or equal to the maximum input image size of the CNN module 1306 (step 1404). In some embodiments, the maximum input image size is a limit of the small-scale hardware CNN module within the CNN module 1306. If the size of the input face image 1302 is less than or equal to the maximum input image size, the input module 1304 may pass the input face image 1302 directly to the CNN module 1306 without performing any sub-image operation. However, in other embodiments, the input module 1304 may resize (e.g., by upsampling) the input face image 1302 to a predefined input image size that satisfies the sub-image partitioning constraint, and then partition the resized input image into a set of sub-images for subsequent sub-image processing.
If the size of the input face image 1302 is greater than the maximum input image size of the CNN module 1306, the input module 1304 determines whether the size of the input image 1302 satisfies the predefined partitioning constraint (step 1406). Note that, because the input face image 1302 is assumed to be square, only a single image dimension needs to be checked; in general, an input image 1302 satisfies the partitioning constraint only if both its width and height satisfy it. If the input module 1304 determines that the input image 1302 satisfies the predefined constraint, the input module 1304 divides the input face image 1302 into a set of sub-images (step 1408) and passes the set of sub-images to the CNN module 1306. However, if the input module 1304 determines that the input image 1302 does not satisfy the predefined constraint, the input module 1304 resizes the input image 1302 to the predefined input image size (step 1410). For example, if the input image 1302 is larger than the predefined input image size, the input module 1304 downsamples the input image 1302 to the predefined input image size; conversely, if the input image 1302 is smaller than the predefined input image size, the input module 1304 upsamples it to the predefined input image size. The input module 1304 then divides the resized input face image into a set of sub-images (step 1412) and passes them to the CNN module 1306.
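Putting the checks above together, a software sketch of process 1400 might look as follows (illustrative only; it reuses the hypothetical `subimage_offsets` helper from the earlier sketch and assumes Pillow for the down-/up-sampling of step 1410):

```python
import numpy as np
from PIL import Image

MAX_INPUT_PIXELS = 1280   # hardware limit of the small-scale CNN module
PREDEFINED_SIZE = 46      # chosen predefined input size (satisfies X = 8n + 14)

def preprocess(face: np.ndarray):
    """Sketch of process 1400: returns either the image itself (small enough
    to feed the hardware CNN directly) or a list of 30x30 sub-images."""
    h, w = face.shape[:2]
    if h * w <= MAX_INPUT_PIXELS:                      # step 1404
        return face                                    # no sub-image operation
    if h != PREDEFINED_SIZE or w != PREDEFINED_SIZE:   # step 1406
        img = Image.fromarray(face)                    # step 1410: down-/up-sample
        face = np.asarray(img.resize((PREDEFINED_SIZE, PREDEFINED_SIZE)))
    return [face[r:r + 30, c:c + 30]                   # steps 1408/1412
            for (r, c) in subimage_offsets(PREDEFINED_SIZE)]
```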
It should be noted that, because the operations of the input module 1304 have very low computational complexity compared with convolution operations, the input module 1304 can be implemented in software and executed by the CPU on the Hi3519 SoC.
Fig. 15 shows a block diagram of an exemplary implementation 1500 of the small-scale CNN module 1306 in the age and gender assessment system 1300, according to some embodiments described herein. As can be seen from fig. 15, the small-scale CNN module 1500 has the same CONV/MP layer structure and the same corresponding parameters as the portion of the age and gender assessment neural network 1200 before the first FC layer (FC(1)). Taking an input face image 1302 (either an original image or a resized image) of size 46 × 46 as an example, the input module 1304 divides the 46 × 46 image into four sub-images 1504 of size 30 × 30, where 30 × 30 is smaller than the maximum input size of the embedded hardware CNN of the Hi3519. As shown in fig. 15, the small-scale CNN module 1500 then processes each 30 × 30 sub-image, for example processing the set of 4 sub-images 1504 sequentially.
As shown in fig. 15, for a given sub-image 1504 of size 30 × 30, the CONV(1) layer includes 24 filters of size 3 × 3 with a stride of 1; therefore, the output of the CONV(1) layer is of size 28 × 28 × 24. The MP(1) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(1) layer is 14 × 14 × 24. The CONV(2) layer includes 48 filters of size 3 × 3 with a stride of 1; therefore, the output of the CONV(2) layer is of size 12 × 12 × 48. The MP(2) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(2) layer is 6 × 6 × 48. The CONV(3) layer includes 48 filters of size 3 × 3 with a stride of 1; therefore, the output of the CONV(3) layer is of size 4 × 4 × 48. The MP(3) layer uses a pooling window of 2 × 2 with a stride of 2; therefore, the output size of the MP(3) layer is 2 × 2 × 48, i.e., 48 feature maps 1506 with a resolution of 2 × 2. For the set of 4 sub-images 1504, the MP(3) layer thus generates 4 sets of feature maps 1506, where each set includes 48 feature maps of resolution 2 × 2. As described above, because the exemplary 46 × 46 input face image 1302 satisfies the partitioning constraint, the outputs of the four 30 × 30 sub-images have neither overlap nor gaps between adjacent feature maps corresponding to adjacent sub-images, and can be directly stitched together to obtain exactly the output produced before the FC(1) layer in fig. 12, where the input face image is processed without partitioning.
Fig. 16 illustrates a block diagram of an exemplary implementation 1600 of the merge module 1308 and the decision module 1310 in the age and gender assessment system 1300, according to some embodiments described herein. As can be seen in fig. 16, the output of each sub-image 1504 in fig. 15 (i.e., the 48 × 2 × 2 feature maps) is first converted into a 192 × 1 vector 1602, yielding 4 such vectors. The 4 vectors 1602 from the 4 sub-images are concatenated by the concatenation module 1604 to form a one-dimensional (1D) vector 1606 of size 768 × 1. The one-dimensional vector 1606 is further processed by the FC layer 1608 and finally by the softmax classifier 1610 to generate the output age and gender classification 1312 for the given input face image 1302. For example, the age classification may include 8 age groups, namely 0–2 years, 4–6 years, 8–13 years, 15–20 years, 25–32 years, 38–43 years, 48–53 years, and 60 years and above. It should be noted that, because these operations have very low computational complexity compared with convolution operations, the merge module 1308 and the decision module 1310 can be implemented in software and executed by the CPU on the Hi3519. In some embodiments, the operations of the merge module 1308 and the decision module 1310 can be further accelerated with the Arm NEON instructions provided on the Hi3519 SoC. It is noted that while the FC(1) layer 1608 is shown as having a size of 256, other FC(1) sizes may be used with the age and gender assessment system presented herein without departing from the scope of the claimed technology.
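In code, the merge and decision steps amount to a flatten, a concatenation, and two small matrix products. The sketch below (our illustration, reusing the `fc1` and `out` layers of the earlier `AgeGenderNet` sketch) mirrors implementation 1600; the concatenation order is arbitrary as long as it matches the order used during training:

```python
import torch

def merge_and_classify(submaps, fc1, out):
    """Sketch of implementation 1600. `submaps` is a list of four tensors of
    shape (48, 2, 2), one per sub-image; `fc1` and `out` are the FC(1) and
    output layers of the trained network (e.g., from the AgeGenderNet sketch)."""
    vectors = [m.reshape(-1) for m in submaps]   # four 192-element vectors 1602
    merged = torch.cat(vectors)                  # 768-element vector 1606
    logits = out(torch.relu(fc1(merged)))        # FC layer 1608
    probs = torch.softmax(logits, dim=0)         # softmax classifier 1610
    gender, age_groups = probs[:2], probs[2:]    # 2 gender + 8 age-group scores
    return gender, age_groups
```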
Fig. 17 shows a flow diagram of an exemplary process 1700 for performing age and gender assessment using the age and gender assessment system disclosed herein, according to some embodiments described herein. Initially, the CNN system divides the input face image into a set of sub-images of smaller size (step 1702). More specifically, the size of each sub-image is smaller than the maximum input size supportable by the small-scale hardware CNN module within the CNN system. Next, the set of sub-images is processed by the small-scale hardware CNN module to generate an array of feature maps (step 1704). In some embodiments, processing each sub-image with the small-scale hardware CNN module includes applying multiple stages of CONV and MP layers to the sub-image.
Next, the feature map arrays output by the small-scale hardware CNN module are merged into a set of merged feature maps (step 1706). More specifically, the merged feature maps are equivalent to the complete feature maps generated by a large-scale CNN module that processes the entire input face image directly without partitioning. Next, the merged feature maps are processed by a decision module to generate the age and gender classification of the person associated with the input face image (step 1708). In some embodiments, processing the merged feature maps includes applying one or more FC layers, followed by a softmax classifier, to the merged feature maps.
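Tying the earlier sketches together, one end-to-end pass of process 1700 might then look like the following (illustrative; it assumes the hypothetical `AgeGenderNet`, `preprocess`, and `merge_and_classify` sketches defined above, with trained weights already loaded):

```python
import numpy as np
import torch

net = AgeGenderNet()  # trained weights assumed loaded
face = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # a detected face crop

tiles = preprocess(face)                 # resized to 46x46, then 4 tiles (step 1702)
submaps = [net.features(torch.as_tensor(t, dtype=torch.float32)[None, None])[0]
           for t in tiles]               # 4 maps of shape (48, 2, 2) (step 1704)
gender, age_groups = merge_and_classify(submaps, net.fc1, net.out)  # steps 1706-1708
print("P(male), P(female):", gender.tolist())
print("age-group scores:", age_groups.tolist())
```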
Although the age and gender assessment system of figs. 13–16 presented in this application has been described based on a predefined input image size of 46 × 46, other predefined input image sizes may be used as long as the partitioning constraints of the relevant CNN module are satisfied, and the specific size may be chosen for the specific application environment. For example, if a predefined input image size of 62 × 62 is used, each input image or resized input image may be divided into 9 overlapping sub-images of size 30 × 30, with a stride of 16 pixels between adjacent sub-images in both the horizontal and vertical directions.
Furthermore, the age and gender assessment techniques presented herein may be implemented within any embedded system that includes a small-scale hardware CNN module with limited input resolution, and thus are not limited to the Hi3519. It should be noted that the ability to perform in-situ age and gender assessment on a set of detected face images within an embedded system, immediately after in-situ face detection on acquired video or still images and without a separate device, system, or server to perform the age and gender assessment, can significantly reduce operational costs. In some embodiments, the age and gender assessment system may also be implemented on a low-cost embedded system that does not include face detection. In such embodiments, the low-cost embedded system may receive face images directly from one or more external sources and then perform dedicated age and gender assessment operations on the received face images using the age and gender assessment system proposed herein.
It should be understood that the various embodiments of the age and gender assessment systems and techniques described herein are equally applicable to general object classification. In such applications, the input face images are replaced by images of the objects to be classified, the small-scale CNN module is trained on a set of training object images, and the output of the classification system represents the predicted class for each input object image.
FIG. 18 illustrates an exemplary embedded system within which the disclosed face detection system and age and gender assessment system operate, according to some embodiments described herein. Embedded system 1800 may be integrated into or implemented as a surveillance camera, a machine vision camera, an unmanned aerial vehicle, a robot, or an autonomous vehicle. As can be seen in fig. 18, embedded system 1800 may include a bus 1802, a processor 1804, a memory 1806, a storage device 1808, a camera system 1810, a CNN subsystem 1812, an output device interface 1814, and a network interface 1816.
Bus 1802 collectively represents all system, peripheral, and chipset buses to which the various components of embedded system 1800 may be connected. For example, bus 1802 may communicatively connect processor 1804 with memory 1806, storage 1808, camera system 1810, CNN subsystem 1812, output device interface 1814, and network interface 1816.
The processor 1804 retrieves instructions to execute and data to process from the memory 1806 in order to control various components of the embedded system 1800. The processor 1804 may include any type of processor, including but not limited to a microprocessor, a mainframe computer, a digital signal processor, a personal organizer, a device controller, and a computational engine within an appliance, as well as any other processor now known or later developed. Further, the processor 1804 may include one or more cores. The processor 1804 itself may include a cache that stores code and data for execution by the processor 1804.
The memory 1806 may include any type of memory that may store code and data for execution by the processor 1804. This includes, but is not limited to, Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, Read Only Memory (ROM), and any other type of memory now known or later developed.
The storage 1808 may include any type of non-volatile memory device that may be integrated with the embedded system 1800. This includes, but is not limited to, magnetic, optical, and magneto-optical storage devices, as well as storage devices based on flash memory and/or storage with battery backup power.
Bus 1802 is also connected to camera system 1810. The camera system 1810 is configured to capture still images and/or video at a predetermined resolution and to transfer the captured image or video data over the bus 1802 to various components within the embedded system 1800, such as to the memory 1806 for buffering and to the CNN subsystem 1812 for deep-learning face detection. The camera system 1810 can include one or more digital cameras. In some embodiments, the camera system 1810 is a digital camera equipped with a wide-angle lens. The images acquired by the camera system 1810 may have different resolutions, including high resolutions such as 1280 × 720 or 1920 × 1080.
In some embodiments, the CNN subsystem 1812 further includes a face detection subsystem 1818 and an age and gender assessment subsystem 1820. The CNN subsystem 1812 is configured to receive acquired video images, such as high-resolution video images received via the bus 1802, to perform the aforementioned face detection operations on the received video images using the face detection subsystem 1818 to generate face detection results for the acquired video images, and to further perform the aforementioned age and gender assessment operations on the detected face images using the age and gender assessment subsystem 1820 to generate age and gender classifications for the detected face images. In particular, the CNN subsystem 1812 may include one or more small-scale hardware CNN modules. For example, the CNN subsystem 1812 may include one or more Hi3519 systems-on-chip, where each Hi3519 system-on-chip includes an embedded hardware CNN module and a CPU capable of running software CNN functions. In some embodiments, the CNN subsystem 1812 operates according to one of the embodiments of the face detection system 300 and the age and gender assessment system disclosed herein.
Also connected to bus 1802 is an output device interface 1814, which may, for example, display results generated by CNN subsystem 1812. Output devices used with the output device interface 1814 include, for example, printers and display devices, such as a cathode ray tube display (CRT), a light emitting diode display (LED), a liquid crystal display (LCD), an organic light emitting diode display (OLED), a plasma display, or electronic paper.
Finally, as shown in FIG. 18, the bus 1802 also connects the embedded system 1800 to a network (not shown) through a network interface 1816. As such, the embedded system 1800 may be part of a network, such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet, or a network of networks, such as the internet. Any or all of the components of embedded system 1800 may be used with the presently disclosed subject matter.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, units, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some steps or methods may be performed by circuitry that is specific to a given function.
In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or a non-transitory processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in processor-executable instructions that can reside on a non-transitory computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable storage medium may be any storage medium that can be accessed by a computer or a processor. By way of example, and not limitation, such non-transitory computer-readable or processor-readable storage media can comprise RAM, ROM, EEPROM, flash memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable storage medium and/or computer-readable storage medium, which may be incorporated into a computer program product.
Although this patent document contains many specifics, these should not be construed as limitations on the scope of any disclosed or claimed technology, but rather as descriptions of features specific to particular embodiments of particular technologies. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
This patent document describes only a few implementations and examples, and other implementations, enhancements, and variations can be made based on what is described and illustrated in this patent document.

Claims (25)

1. A method for performing age and/or gender assessment on a face image using a small-scale convolutional neural network (CNN) module with a maximum input size limit, the method comprising:
receiving, by a computer, an input face image that is primarily covered by a face;
determining, by the computer, according to the maximum input size limit, whether the size of the input face image is larger than a maximum input image size supportable by the small-scale CNN module; and
if so, determining whether the size of the input face image meets a predefined input image size limit, wherein the predefined input image size limit is a given image size of a plurality of image sizes, each of which allows the input image to be divided into a group of sub-images of a second size, the second size being smaller than the maximum input image size;
if so,
dividing the input face image into a group of sub-images with a second size;
processing the set of sub-images with the small-scale CNN module to generate a feature map array;
merging the feature map array into a group of merged feature maps corresponding to the input face image;
the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the input face images.
2. The method of claim 1, wherein if the size of the input face image does not meet the predefined input image size limit, the method further comprises:
resizing the input face image to a given image size that satisfies the predefined input image size limit;
dividing the resized input face image into a group of sub-images of a second size;
processing the set of sub-images with the small-scale CNN module to generate a feature map array;
merging the feature map array into a set of merged feature maps corresponding to the resized input face image; and
the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the input face images.
3. The method of claim 2, wherein resizing the input face image to a given image size that satisfies the predefined input image size limit comprises:
down-sampling the input face image to the given image size if the size of the input face image is larger than the given image size; and
upsampling the input face image to the given image size if the size of the input face image is smaller than the given image size.
4. The method according to claim 1, wherein if the size of the input face image is less than or equal to the maximum input image size of the small-scale CNN module, the method further comprises: processing the input face image directly with the small-scale CNN module without dividing the input face image into a set of sub-images of smaller size.
5. The method according to claim 1, wherein if the size of the input face image is less than or equal to the maximum input image size of the small-scale CNN module, the method further comprises:
upsampling the input face image to a given image size that satisfies the predefined input image size limit;
dividing the resized input face image into a group of sub-images of a second size;
processing the set of sub-images with the small-scale CNN module to generate a feature map array;
merging the feature map array into a set of merged feature maps corresponding to the resized input face image; and
the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the input face images.
6. The method of claim 1, wherein the input face image is an output of a face detection CNN module that detects face images from an input video image.
7. The method of claim 1, wherein the small-scale CNN module comprises three convolutional layers, wherein each of the three convolutional layers is followed by a rectified linear unit (ReLU) layer and a pooling layer.
8. The method of claim 1, wherein a last fully connected layer of the two or more fully connected layers comprises a softmax classifier.
9. An age and/or gender assessment system utilizing at least one small-scale convolutional neural network (CNN) module, comprising:
an input module for receiving an input face image that is primarily covered by a face;
a small-scale CNN module connected with the output of the input module and used for processing the face image by using a group of filters, wherein the small-scale CNN module has the maximum input size limit;
a merging module connected to the output of the small-scale CNN module; and
a decision module, comprising two or more fully connected layers, connected to the output of the merging module;
wherein the input module is also used for
determining, according to the maximum input size limit, whether the size of the input face image is larger than the maximum input image size supportable by the small-scale CNN module;
if so, determining whether the size of the input face image meets a predefined input image size limit, the predefined input image size limit being a given image size of a plurality of image sizes; wherein the plurality of image sizes satisfy a requirement that the input image is divisible into a set of sub-images having a second size, the second size being smaller than the maximum input image size;
if yes, the input module is further used for dividing the input face image into a group of sub-images with a second size;
wherein the small-scale CNN module is configured to process the set of sub-images to generate a feature map array;
the merging module is used for merging the feature map array into a group of merged feature maps corresponding to the input face image;
wherein the decision module processes the combined feature map using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
10. The age and/or gender assessment system according to claim 9, characterized in that if the size of said input face image does not meet said predefined input image size limit, then
The input module is further configured to:
resizing the input face image to a given image size that satisfies the predefined input image size limit; and
dividing the resized input face image into a group of sub-images of a second size;
the small-scale CNN module is further used for processing the group of sub-images to generate a feature map array;
the merging module is further used for merging the feature map array into a group of merged feature maps corresponding to the input face image with the reset size; and is
The decision module is further configured to process the combined feature map using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
11. The age and/or gender assessment system according to claim 10, wherein said input module resizes said input face image to said given image size by:
down-sampling the input face image to the given image size if the size of the input face image is larger than the given image size;
and if the size of the input face image is smaller than the given image size, upsampling the input face image to the given image size.
12. The age and/or gender assessment system according to claim 9, wherein said small-scale CNN module is further configured to directly process said input face image without dividing said input face image into a set of sub-images having a smaller size if the size of said input face image is less than or equal to the maximum input image size of said small-scale CNN module.
13. The age and/or gender assessment system according to claim 9, characterized in that if the size of said input face image is smaller or equal to the maximum input image size of said small-scale CNN module, then
The input module is further configured to:
upsampling the input face image to a given image size that satisfies the predefined input image size limit; and
dividing the resized input face image into a group of sub-images of a second size;
the small-scale CNN module is further used for processing the group of sub-images to generate a feature map array;
the merging module is further used for merging the feature map array into a group of merged feature maps corresponding to the resized input face image; and
the decision module is further configured to process the combined feature map using two or more fully connected layers to generate an age and/or gender classification for the person in the input face image.
14. The age and/or gender assessment system according to claim 9, wherein said input module is connected to a face detection CNN module, said face detection CNN module detecting a face image from an input video image, and said input face image being the output of the face detection CNN module.
15. The age and/or gender assessment system according to claim 9, wherein said small-scale CNN module comprises three convolutional layers, wherein each of said three convolutional layers is followed by a rectified linear unit (ReLU) layer and a pooling layer.
16. The age and/or gender assessment system according to claim 9, wherein a last fully connected layer of said two or more fully connected layers comprises a softmax classifier.
17. The age and/or gender assessment system according to claim 9, wherein said small scale CNN module is a hardware CNN module embedded in a chipset or system on a chip.
18. The age and/or gender assessment system according to claim 9, wherein said merging module merges said feature map arrays by concatenating said feature map arrays into a one-dimensional vector.
19. An embedded system for performing face detection and age and/or gender assessment in-situ in acquired video images, the embedded system comprising:
a processor;
a memory coupled to the processor;
an image acquisition device, connected to the processor and the memory, for acquiring video images;
a face detection subsystem connected to the image acquisition device and configured to detect a face from the acquired video image; and
an age and gender assessment subsystem coupled to the face detection subsystem and including a small-scale CNN module having a maximum input size limit, wherein the age and gender assessment subsystem is to:
receiving a detected face image from the face detection subsystem that is predominantly covered by a face;
determining, according to the maximum input size limit, whether the size of the detected face image is larger than the maximum input image size supportable by the small-scale CNN module;
if so, determining whether the size of the detected face image meets a predefined input image size limit, wherein the predefined input image size limit is a given image size of a plurality of image sizes at which the input image can be divided into a set of sub-images having a second size, wherein the second size is smaller than the maximum input image size;
if so,
dividing the detected face image into a group of sub-images with a second size;
processing the set of sub-images with the small-scale CNN module to generate a feature map array;
merging the feature map array into a group of merged feature maps corresponding to the detected face image; and
the combined feature map is processed using two or more fully connected layers to generate age and/or gender classifications for the persons in the detected face images.
20. The embedded system of claim 19, wherein the small-scale CNN module is a low-cost hardware CNN module shared by the age and gender assessment subsystem and the face detection subsystem.
21. A method for performing deep learning image processing with a small-scale convolutional neural network (CNN) module having a maximum input size limit, the method comprising:
receiving an input image;
determining, according to the maximum input size limit, whether the size of the input image is larger than the maximum input image size supportable by the small-scale CNN module;
if so, then
Dividing the input image into a set of sub-images having a second size, wherein the second size is smaller than the maximum input image size;
processing the set of sub-images with the small-scale CNN module to generate a feature map array;
merging the feature map array into a set of merged feature maps corresponding to the input image;
the combined feature map is processed using two or more fully connected layers to generate a classification decision for the input image.
22. The method of claim 21, wherein the size of the input image satisfies a predefined input image size limit, wherein the predefined input image size limit is a given one of a plurality of image sizes at which the input image is divisible into a set of sub-images having the second size.
23. The method of claim 21, wherein if the size of the input image is less than or equal to the maximum input image size supportable by the small-scale CNN module, the method further comprises: processing the input image directly with the small-scale CNN module without dividing the input image into a set of sub-images having a smaller size.
24. The method of claim 21, wherein the input image is an input face image that is predominantly covered by a human face, and wherein the classification decision on the input image comprises an age and gender classification for the person in the input face image.
25. The method of claim 21, wherein there is a predefined overlap between each pair of adjacent sub-images in the set of sub-images, and wherein there is no overlap and no gap between a pair of adjacent feature maps corresponding to a pair of adjacent sub-images in the array of feature maps.
CN201711118413.XA 2016-11-30 2017-11-10 Small-scale convolutional neural network based age and/or gender assessment method and system Active CN107909026B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662428497P 2016-11-30 2016-11-30
US62/428,497 2016-11-30
US15/724256 2017-10-03
US15/724,256 US10558908B2 (en) 2016-11-30 2017-10-03 Age and gender estimation using small-scale convolutional neural network (CNN) modules for embedded systems

Publications (2)

Publication Number Publication Date
CN107909026A CN107909026A (en) 2018-04-13
CN107909026B true CN107909026B (en) 2021-08-13

Family

ID=61843965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711118413.XA Active CN107909026B (en) 2016-11-30 2017-11-10 Small-scale convolutional neural network based age and/or gender assessment method and system

Country Status (1)

Country Link
CN (1) CN107909026B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614841B (en) * 2018-04-26 2023-04-18 杭州智诺科技股份有限公司 Rapid face detection method in embedded system
CN108734146A (en) * 2018-05-28 2018-11-02 北京达佳互联信息技术有限公司 Facial image Age estimation method, apparatus, computer equipment and storage medium
CN110782554B (en) * 2018-07-13 2022-12-06 北京佳惠信达科技有限公司 Access control method based on video photography
CN110782568B (en) * 2018-07-13 2022-05-31 深圳市元睿城市智能发展有限公司 Access control system based on video photography
CN109285157A (en) * 2018-07-24 2019-01-29 深圳先进技术研究院 Myocardium of left ventricle dividing method, device and computer readable storage medium
CN109492571B (en) * 2018-11-02 2020-10-09 北京地平线机器人技术研发有限公司 Method and device for identifying human age and electronic equipment
CN109886095A (en) * 2019-01-08 2019-06-14 浙江新再灵科技股份有限公司 A kind of passenger's Attribute Recognition system and method for the light-duty convolutional neural networks of view-based access control model
CN110222593A (en) * 2019-05-18 2019-09-10 四川弘和通讯有限公司 A kind of vehicle real-time detection method based on small-scale neural network
CN110941399A (en) * 2019-12-05 2020-03-31 北京金山云网络技术有限公司 Data processing method and device and electronic equipment
CN112818833B (en) * 2021-01-29 2024-04-12 中能国际建筑投资集团有限公司 Face multitasking detection method, system, device and medium based on deep learning
CN116055778B (en) * 2022-05-30 2023-11-21 荣耀终端有限公司 Video data processing method, electronic device and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7362886B2 (en) * 2003-06-05 2008-04-22 Canon Kabushiki Kaisha Age-based face recognition
US8582807B2 (en) * 2010-03-15 2013-11-12 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN103533331A (en) * 2013-10-18 2014-01-22 华为技术有限公司 Encoding and decoding method and device for image
CN105512684A (en) * 2015-12-09 2016-04-20 江苏大为科技股份有限公司 Vehicle logo automatic identification method based on principal component analysis convolutional neural network
CN105512649A (en) * 2016-01-22 2016-04-20 大连楼兰科技股份有限公司 Method for positioning high-definition video real-time number plate based on color space

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast and Robust Face Detection on a Parallel Optimized Architecture Implemented on FPGA; Nicolas Farrugia et al.; IEEE Transactions on Circuits and Systems for Video Technology; 30 April 2009; Vol. 19, No. 4; pp. 597-602 *
Surveillance-video people-counting algorithm based on convolutional neural networks; Ma Haijun et al.; Journal of Anhui University (Natural Science Edition); May 2016; Vol. 40, No. 3; pp. 22-28 *

Also Published As

Publication number Publication date
CN107909026A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107506707B (en) Face detection using small scale convolutional neural network module in embedded system
CN107909026B (en) Small-scale convolutional neural network based age and/or gender assessment method and system
US10558908B2 (en) Age and gender estimation using small-scale convolutional neural network (CNN) modules for embedded systems
CN107895150B (en) Human face detection and head attitude angle evaluation based on embedded system small-scale convolution neural network module
US10467458B2 (en) Joint face-detection and head-pose-angle-estimation using small-scale convolutional neural network (CNN) modules for embedded systems
US11790631B2 (en) Joint training of neural networks using multi-scale hard example mining
US11315253B2 (en) Computer vision system and method
US20180114071A1 (en) Method for analysing media content
CN107368886B (en) Neural network system based on repeatedly used small-scale convolutional neural network module
JP5936561B2 (en) Object classification based on appearance and context in images
EP3836083A1 (en) Disparity estimation system and method, electronic device and computer program product
Mittal et al. Dilated convolution based RCNN using feature fusion for Low-Altitude aerial objects
Toprak et al. Conditional weighted ensemble of transferred models for camera based onboard pedestrian detection in railway driver support systems
JP7165353B2 (en) Image feature output device, image recognition device, image feature output program, and image recognition program
Kao et al. Moving object segmentation using depth and optical flow in car driving sequences
Moseva et al. Development of a System for Fixing Road Markings in Real Time
Kheder et al. Transfer Learning Based Traffic Light Detection and Recognition Using CNN Inception-V3 Model
Kusakunniran et al. A Thai license plate localization using SVM
Bin et al. Gpu-accelerated tensor decomposition for moving object detection from multimodal imaging
Labeni et al. Objects counting in videos via deep learning and image processing
Mittal et al. A feature pyramid based multi-stage framework for object detection in low-altitude UAV images
WO2022254597A1 (en) Model training apparatus, model training method, and computer readable medium
Lopez-Fuentes et al. Bandwidth limited object recognition in high resolution imagery
Trinh et al. SeaDSC: A video-based unsupervised method for dynamic scene change detection in unmanned surface vehicles
Kumar et al. A DEEP LEARNING APPROACH FOR THE ANALYSIS AND DETECTION OF OBJECT IN VIDEO FRAMES USING YOLO FASTER RCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant