CN112784778B - Method, apparatus, device and medium for generating model and identifying age and sex - Google Patents


Info

Publication number
CN112784778B
CN112784778B (application number CN202110115808.4A)
Authority
CN
China
Prior art keywords
model
age
training sample
training
candidate model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110115808.4A
Other languages
Chinese (zh)
Other versions
CN112784778A (en)
Inventor
朱欤
伍天意
郭国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110115808.4A priority Critical patent/CN112784778B/en
Publication of CN112784778A publication Critical patent/CN112784778A/en
Application granted granted Critical
Publication of CN112784778B publication Critical patent/CN112784778B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition

Abstract

The present disclosure provides a method, apparatus, device, storage medium and program product for generating a model and identifying age and gender, and relates to the field of artificial intelligence, in particular to deep learning and image recognition. The specific implementation scheme is as follows: selecting network structure modules from a preset set of basic network structure modules to construct at least one candidate model; for each candidate model, training it with each training sample set in at least one training sample set whose data size is smaller than a first threshold, to obtain pre-training models of the candidate model for the different training sample sets; scoring each candidate model according to the performance of its pre-training models on the different training sample sets; and retraining the highest-scoring candidate model with a training sample set whose data size is larger than a second threshold, to obtain an age and gender identification model. This embodiment enables a single model to analyze age and gender simultaneously.

Description

Method, apparatus, device and medium for generating model and identifying age and sex
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and image recognition.
Background
The face age and sex recognition technique refers to a technique of estimating the physiological age and sex for a given face photograph. Faces are a very rich source of information and face images can provide attribute information such as identity, age, gender, expression, etc. of a person.
Existing face age estimation and gender identification techniques are mainly based on traditional machine learning frameworks or convolutional neural network frameworks. However, the prior art can only estimate one of age or gender; it cannot analyze the two face attributes simultaneously with one model. Face age estimation and gender identification algorithms based on traditional machine learning are low in accuracy and computationally expensive, cannot run on small devices such as mobile terminals, and cannot adapt to hardware environments with different hardware and different computation requirements.
Disclosure of Invention
The present disclosure provides a method, apparatus, device and storage medium for generating a model and identifying age and gender.
According to a first aspect of the present disclosure, there is provided a method for generating a model, comprising: selecting a network structure module from a preset basic network structure module set to construct at least one candidate model; acquiring at least one training sample set with the data size smaller than a first threshold, wherein each training sample in the training sample set comprises sample face images, age and gender labeling information; for each candidate model in at least one candidate model, training the candidate model by utilizing each training sample set in at least one training sample set to obtain a pre-training model of the candidate model aiming at different training sample sets; scoring each candidate model according to the performance of the pre-training model of each candidate model for different training sample sets; and retraining the candidate model with the highest score by using a training sample set with the data size larger than a second threshold value to obtain an age and gender identification model, wherein the second threshold value is larger than the first threshold value.
According to a second aspect of the present disclosure, there is provided a method for identifying age and gender, comprising: acquiring a face image of a target user to be identified; preprocessing the face image; inputting the preprocessed face image into an age and gender recognition model trained according to the method described in the first aspect, and outputting the age and gender of the target user.
According to a third aspect of the present disclosure, there is provided an apparatus for generating a model, comprising: a construction unit configured to select a network structure module from a preset set of basic network structure modules to construct at least one candidate model; an acquisition unit configured to acquire at least one training sample set having a data size smaller than a first threshold, wherein each training sample in the training sample set includes sample face image, age, and gender labeling information; the pre-training unit is configured to train each candidate model in at least one candidate model by utilizing each training sample set in at least one training sample set to obtain a pre-training model of the candidate model aiming at different training sample sets; a scoring unit configured to score each candidate model according to its performance for the pre-trained model of the different training sample set; and the retraining unit is configured to retrain the candidate model with the highest score by using a training sample set with the data size larger than a second threshold value to obtain an age and gender identification model, wherein the second threshold value is larger than the first threshold value.
According to a fourth aspect of the present disclosure, there is provided an apparatus for identifying age and gender, comprising: an acquisition unit configured to acquire a face image of a target user to be identified; a preprocessing unit configured to preprocess a face image; and a recognition unit configured to input the preprocessed face image into the age and sex recognition model trained by the apparatus according to any one of the first aspects, and output the age and sex of the target user.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of the first aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the first aspects.
According to the method and apparatus for generating a model of the present disclosure, candidate models are built from basic network structure modules, the best-performing candidate model is screened out by pre-training on small-scale training sample sets, and the model is finally retrained on a large-scale training sample set. The resulting model can analyze age and gender from a face image at the same time. This broadens the usage scenarios of face age estimation and gender identification, improves the efficiency of model development, and avoids complicated manual design for different hardware environments. In addition, because age estimation and gender identification are completed simultaneously by a single convolutional network, computation efficiency is improved, and the two tasks can help each other to improve accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating a model according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a method for generating a model according to the present application;
FIG. 4 is a schematic structural view of one embodiment of an apparatus for generating a model according to the present application;
FIG. 5 is a flow chart of one embodiment of a method for identifying age and gender according to the present application;
FIG. 6 is a schematic structural view of one embodiment of an apparatus for identifying age and gender according to the present application;
FIG. 7 is a block diagram of an electronic device for generating a model and for identifying age and gender for implementing an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows an exemplary system architecture 100 to which the methods for generating a model, the apparatus for generating a model, the method for identifying age and gender, or the apparatus for identifying age and gender of embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing a communication link between the terminals 101, 102, the database server 104 and the server 105. The network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user 110 may interact with the server 105 via the network 103 using the terminals 101, 102 to receive or send messages or the like. The terminals 101, 102 may have various client applications installed thereon, such as model training class applications, face detection recognition class applications, shopping class applications, payment class applications, web browsers, instant messaging tools, and the like.
The terminals 101 and 102 may be hardware or software. When the terminals 101, 102 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, laptop and desktop computers, and the like. When the terminals 101, 102 are software, they can be installed in the above-listed electronic devices. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed herein.
When the terminals 101, 102 are hardware, an image acquisition device may also be mounted thereon. The image capturing device may be various devices capable of implementing the function of capturing images, such as a camera, a sensor, and the like. The user 110 may acquire facial images of himself or others using an image acquisition device on the terminal 101, 102.
Database server 104 may be a database server that provides various services. For example, a database server may have stored therein a sample set. The sample set contains a large number of samples. The sample may include a sample face image and age-labeling information and gender-labeling information corresponding to the sample face image. Thus, the user 110 may also select samples from the sample set stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the samples in the sample set sent by the terminals 101, 102 and may send training results (e.g., the generated age and gender identification model) to the terminals 101, 102. In this way, the user can apply the generated age and sex identification model for age and sex identification.
The database server 104 and the server 105 may be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that the method for generating a model or the method for identifying age and gender provided in the embodiments of the present application is generally performed by the server 105. Accordingly, means for generating a model or means for identifying age and gender are also typically provided in the server 105.
It should be noted that the database server 104 may not be provided in the system architecture 100 in cases where the server 105 may implement the relevant functions of the database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating a model according to the present application is shown. The method for generating a model may include the steps of:
Step 201, selecting a network structure module from a preset basic network structure module set to construct at least one candidate model.
In this embodiment, the execution subject of the method for generating a model (e.g., the server 105 shown in fig. 1) may obtain a preset set of basic network structure modules, and then select at least one network structure module from it to construct a candidate model. Different candidate models are constructed through different combinations. A candidate model is a network structure composed of a plurality of sequentially executed basic network structure modules (blocks), where each block comprises several layers of operations, mainly including a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a BN (batch normalization) layer, an activation function layer, a pooling layer, a fully connected layer, and the like. Network structure modules may be selected to construct at least one candidate model via existing Neural Architecture Search (NAS) techniques or the like.
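As an illustrative sketch of the construction step above, candidate models can be assembled by drawing block specifications from a preset set of basic modules. All module names and block sizes here are hypothetical, not taken from the patent:

```python
import random

# Hypothetical names for the preset basic network structure modules
BASIC_MODULES = ["conv3x3", "dw_separable_conv", "conv1x1",
                 "batch_norm", "activation", "pool", "fc"]

def build_candidate(num_blocks, ops_per_block, rng):
    """Assemble one candidate model as a list of blocks,
    each block being a list of layer operations."""
    return [rng.sample(BASIC_MODULES, k=ops_per_block)
            for _ in range(num_blocks)]

rng = random.Random(0)
candidates = [build_candidate(num_blocks=4, ops_per_block=3, rng=rng)
              for _ in range(3)]
```

A real NAS procedure would explore this space systematically rather than sampling uniformly; the sketch only illustrates the block-wise composition of a candidate.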
Step 202, at least one training sample set having a data size smaller than a first threshold is obtained.
In this embodiment, the execution subject may acquire the training sample set in a variety of ways. For example, the executing entity may obtain the existing training sample set stored therein from a database server (e.g., database server 104 shown in fig. 1) via a wired connection or a wireless connection. As another example, a user may collect training samples through a terminal (e.g., terminals 101, 102 shown in fig. 1). In this way, the executing body may receive the training samples collected by the terminal and store these training samples locally, thereby generating a training sample set.
Here, the training sample set may include at least one training sample. Each training sample includes a sample face image together with age and gender labeling information; the face image serves as the model input, and the age and gender labels serve as the desired output for supervised training. The training samples are face photographs from real mobile-terminal scenarios (e.g., mobile phone cameras), labeled with the true age and gender. The sample face images are preprocessed as follows: a face detection and keypoint detection algorithm detects and locates the position and keypoints of the face in the photograph, an affine transformation is applied according to the position, and the face is aligned to a uniform position and size according to the keypoints.
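The alignment step just described, an affine transformation mapping detected keypoints onto canonical template positions, can be sketched in pure NumPy. The keypoint coordinates below are made-up examples, not values from the patent:

```python
import numpy as np

def affine_from_points(src, dst):
    """Solve for the 2x3 affine matrix M with dst ~= M @ [x, y, 1]^T,
    from matched keypoint pairs (exact for 3 non-collinear points)."""
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])             # n x 3
    A, _, _, _ = np.linalg.lstsq(X, dst, rcond=None)  # 3 x 2
    return A.T                                        # 2 x 3

# Made-up detected keypoints (e.g. both eyes and nose tip) and a
# made-up canonical template the face should be aligned to
detected = np.array([[30.0, 40.0], [70.0, 42.0], [50.0, 65.0]])
template = np.array([[38.0, 46.0], [74.0, 46.0], [56.0, 71.0]])
M = affine_from_points(detected, template)
# In practice M would then be passed to an image-warping routine
# (e.g. cv2.warpAffine) to produce the aligned face crop.
```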
The first threshold is set relatively small, for example 2,000,000. If there are training sample sets of 100,000, 500,000, 1,000,000 and 10,000,000 samples, the 100,000, 500,000 and 1,000,000 sets may be used for pre-training. The purpose of limiting the data size with the first threshold is to quickly find a network structure with better performance; at this stage the accuracy of the trained pre-training models need not be high. After a well-performing network structure is found, a high-accuracy model is retrained using large-scale training samples.
Step 203, for each candidate model in the at least one candidate model, training the candidate model by using each training sample set in the at least one training sample set to obtain a pre-training model of the candidate model for different training sample sets.
In this embodiment, using the candidate models generated in step 201, the multiple candidate models are trained on the same dataset with gradient descent, and each model optimizes two loss functions simultaneously: a smooth L1 loss for age estimation and a cross-entropy loss for the two-class gender classification. All candidate models are trained separately on training sets of different data scales, for example 100,000, 500,000 and 1,000,000 samples. If there is no validation set, recognition accuracy measured on the training set is used directly as the performance index. If there is a validation set (e.g., of 500,000 samples), performance is evaluated and recorded on the validation set for each epoch of the model.
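The two losses named above, smooth L1 for the age regression and cross-entropy for the binary gender output, can be written out concretely. This is a minimal NumPy sketch under the stated assumption that gender uses a single sigmoid logit; the combination weight is also an assumption, since the patent does not state how the losses are combined:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def binary_cross_entropy(logits, labels):
    """Cross-entropy for a 2-class (sigmoid) gender output."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -(labels * np.log(p + eps)
             + (1 - labels) * np.log(1 - p + eps)).mean()

def joint_loss(age_pred, age_true, gender_logit, gender_label, weight=1.0):
    """Sum of the two task losses, optimized together during training."""
    return (smooth_l1(age_pred, age_true)
            + weight * binary_cross_entropy(gender_logit, gender_label))
```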
Step 204, scoring each candidate model according to its performance against the pre-trained models of different training sample sets.
In this embodiment, for each candidate model, the average performance of its pre-training models across the different training sample sets may be used as the candidate model's score, or a composite ranking of the pre-training models' performance across the different training sample sets may be used.
For example, if the pre-training models obtained from candidate model A rank first under all three training sample sets, candidate model A scores highest.
Alternatively, the average performance of each model's pre-training models over the three training sample sets can be calculated; candidate model A again scores highest.
Whether to compute the score by averaging or by ranking may be chosen based on the number of candidate models. For example, if the number of candidate models is greater than a predetermined threshold (e.g., 100), the ranking algorithm may be used, selecting the candidate models that rank first most frequently across the different data scales. If the number of candidate models is small, the performance average may be used as the score.
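The two scoring schemes above, averaging versus counting first-place rankings across data scales, might look like the following sketch. The model names and accuracy numbers are invented for illustration:

```python
def score_by_average(perf):
    """perf maps model name -> list of accuracies, one per training sample set."""
    return {m: sum(v) / len(v) for m, v in perf.items()}

def score_by_rank_firsts(perf):
    """Count how often each model ranks first across the sample sets."""
    n_sets = len(next(iter(perf.values())))
    firsts = {m: 0 for m in perf}
    for j in range(n_sets):
        best = max(perf, key=lambda m: perf[m][j])
        firsts[best] += 1
    return firsts

# Invented accuracies for three candidate models on three sample sets
perf = {"A": [0.62, 0.64, 0.66], "B": [0.60, 0.63, 0.61], "C": [0.58, 0.61, 0.65]}
avg = score_by_average(perf)         # A has the highest average
firsts = score_by_rank_firsts(perf)  # A ranks first under every set
```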
Optionally, besides accuracy, the performance indexes may include the computation amount and memory usage in the prediction stage, and multiple performance indexes may be scored comprehensively. If several candidate models tie for the highest accuracy, the one with smaller computation amount and memory usage can be selected.
And 205, retraining the candidate model with the highest score by using a training sample set with the data size larger than a second threshold value to obtain an age and gender identification model.
In this embodiment, the second threshold is greater than the first threshold. A training sample set whose data size is greater than the second threshold is used to train the accurate model. By analyzing model performance under different data scales, the single best-performing model is screened out as the final model and retrained on large-scale data. Specifically, from the performance of the multiple candidate models at the different data scales obtained in the above steps, the single model that performs best across the data scales is selected as the final model structure. Finally, model training is performed on a large-scale dataset (for example, 10,000,000 images) in the same way as above; the result is a trained model, and the next stage of application deployment begins.
The method for generating the model in the embodiment can automatically generate the network structure, and uses the single convolutional neural network to complete the age estimation and sex identification of the face, thereby facilitating the deployment, improving the operation efficiency, ensuring the accuracy and meeting the requirements of different hardware.
In some optional implementations of this embodiment, selecting network structure modules from the preset set of basic network structure modules to construct at least one candidate model includes: selecting network structure modules from the preset set of basic network structure modules to form a basic model; performing mathematical modeling on the model parameters and expressing the structure of the basic model as an exponential function; and randomly sampling the model parameters according to the exponential function to obtain at least one candidate model. The basic model is a network structure composed solely of a plurality of sequentially executed blocks, where each block contains several layers of operations, mainly including a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a BN layer, an activation function layer, and the like.
The model structure is expressed as an exponential function of four parameters by mathematically modeling network parameters such as the depth, the width and the number of groups of the separable convolutions. Given the parameters, the corresponding convolutional neural network structure can be recovered from this function.
The network is made up of a plurality of blocks, each similar in structure (for example, a block structure resembling that of MobileNetV3 may be selected, or a block structure may be randomly generated), while the width of each block increases progressively with its index. Specifically, the unquantized width can first be calculated according to the following formula:

u_i = ω_0 + k · i

where ω_0 is the initial layer width, k is the "slope" used to control the change in network width (k can be understood as a width slope for the whole network), and i is the index of each block, taking values from 0 to b, where b is the total number of blocks of the network, i.e., the depth of the network. The quantized network width can then be obtained as follows:

ω_i = ω_0 · p^round(log_p(u_i / ω_0))

where p is a system parameter. Through the above formulas, a network structure can be determined by four parameters: b, ω_0, p and k. When a network is randomly generated, these four parameters are sampled within certain ranges to obtain a complete network structure; other parameters, such as the number of groups of the separable convolutions, can also be added. Quantization yields a piecewise-constant width function, allowing several consecutive blocks to use the same width (it also reduces the sampling range). Effective candidate models can therefore be obtained, reducing screening cost.
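Under one plausible reading of the width rule above (linear unquantized widths quantized to powers of a base p, in the spirit of RegNet-style design spaces; the exact formulas are not reproduced in this text), the rule can be sketched as:

```python
import math

def block_widths(b, w0, k, p):
    """Per-block widths: u_i = w0 + k*i, then quantized so that the
    width becomes w0 * p**s for the nearest integer exponent s.
    This quantization step is an assumed reading, not a quote."""
    widths = []
    for i in range(b):
        u = w0 + k * i
        s = round(math.log(u / w0, p))
        widths.append(int(round(w0 * p ** s)))
    return widths

widths = block_widths(b=6, w0=16, k=8, p=2)  # illustrative parameter values
```

Note how consecutive blocks end up sharing the same quantized width, giving the piecewise-constant profile the text describes.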
In some optional implementations of this embodiment, before training the candidate model with each of the at least one training sample set, the method further comprises: calculating the computation amount and memory usage of the candidate model in the network prediction stage; and filtering out the candidate model if its computation amount and memory usage do not meet the requirements of the target deployment environment. The computation amount of the candidate model in the prediction stage can be calculated from the number of addition and multiplication operators in the candidate model, and its memory usage from the number of convolution parameters. The target deployment environment (mobile phones, tablets and other terminal devices) requires that the computation amount be smaller than a computation threshold and the memory usage smaller than a memory threshold; terminal devices with different hardware configurations have different computation and memory thresholds. Therefore, for mobile embedded devices and IoT devices, a corresponding convolutional network model structure can be generated automatically according to the computation requirements of different hardware devices, meeting the needs of face age and gender analysis on edge devices. This provides an efficient face analysis technique for deployment environments with weak computing power. For different deployment constraints on computation, memory, time and the like, a network structure can be generated automatically, meeting the requirements of different hardware while ensuring accuracy.
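The filtering step above reduces to a simple budget check performed before any training happens. The thresholds and per-model statistics in this sketch are invented examples:

```python
def passes_deployment_budget(stats, flops_limit, mem_limit):
    """stats: estimated multiply-add count and parameter memory (bytes)
    for a candidate model's prediction stage."""
    return (stats["mult_adds"] <= flops_limit
            and stats["param_bytes"] <= mem_limit)

# Invented statistics for two candidate models
models = {
    "A": {"mult_adds": 80e6, "param_bytes": 2e6},
    "B": {"mult_adds": 600e6, "param_bytes": 9e6},
}
# e.g. an assumed low-end mobile budget: 100M mult-adds, 4 MB of weights
kept = [name for name, s in models.items()
        if passes_deployment_budget(s, flops_limit=100e6, mem_limit=4e6)]
```

Only the models that survive this check proceed to the pre-training stage, which keeps the search cheap for constrained targets.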
In some optional implementations of this embodiment, scoring each candidate model according to the performance of its pre-training models for the different training sample sets includes: acquiring a validation dataset, where each sample in the validation dataset comprises a face image together with age and gender labeling information; evaluating each pre-training model with the validation dataset to obtain its performance; and, for each candidate model, calculating the overall performance of that candidate model's pre-training models as its score. The validation dataset has no intersection with the training sample sets, so the performance of the pre-training models can be evaluated more accurately; the evaluation process is essentially the same as in step 204 and is not repeated here. Using a validation dataset improves the accuracy of the performance evaluation, so that a well-performing network structure can be screened out and a well-performing age and gender identification model can be trained.
In some optional implementations of this embodiment, the loss functions of the age and gender identification model include a smooth L1 loss for age estimation and a cross-entropy loss for the two-class gender classification. The age estimation and gender classification objectives can thus be optimized simultaneously during training; completing both tasks with a single convolutional network improves computation efficiency, and the two tasks can help each other to improve accuracy.
In some alternative implementations of the present embodiment, the set of basic network structure modules includes at least one of: a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a batch normalization layer, an activation function layer, a pooling layer, and a fully connected layer. From these modules, a convolutional network structure matched to the compute requirements of different hardware can be generated automatically for mobile embedded devices and IoT devices, meeting the needs of face age and gender analysis on edge devices.
With further reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present embodiment. In the application scenario of fig. 3, a model-training application may be installed on the terminal 31 used by the user. When the user opens the application and uploads a training sample set or its storage path (e.g., 3 small-scale training sample sets for pre-training, 1 large-scale training sample set for retraining, and 1 verification data set), the terminal may also upload the computation threshold and memory threshold required by the deployment environment. The server 32, which provides background support for the application, may then run the method for generating a model as follows:
First, the server may randomly generate candidate models A, B, C, ..., N, and then calculate the computation amount and memory usage of each. Any candidate model that does not meet the requirements of the target deployment environment is filtered out.
Second, each candidate model is pre-trained with the 3 small-scale training sample sets; taking candidate model A as an example, this yields pre-training models A1, A2 and A3. The performance of each pre-training model is then determined on the verification data set, from which the performance of each candidate model follows.
Finally, the candidate model whose pre-training models perform best is retrained with the large-scale training sample set to obtain the age and gender identification model.
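The three-stage flow of this scenario can be condensed into a sketch like the following, with the expensive steps (training and evaluation) replaced by deterministic stand-ins; all function names are assumptions, not the patent's API.

```python
def pretrain(candidate, train_set):
    return (candidate, train_set)               # stand-in for a trained model

def evaluate(model, val_set):
    fake_accuracy = {"A": 0.82, "B": 0.76}      # stand-in validation scores
    return fake_accuracy[model[0]]

def search(candidates, small_sets, val_set, large_set, fits_deployment):
    # 1. Filter candidates exceeding the deployment compute/memory budget.
    viable = [c for c in candidates if fits_deployment(c)]
    # 2. Pre-train each survivor on every small set; score = mean accuracy.
    scores = {}
    for c in viable:
        perfs = [evaluate(pretrain(c, s), val_set) for s in small_sets]
        scores[c] = sum(perfs) / len(perfs)
    # 3. Retrain the best-scoring candidate on the large set.
    best = max(scores, key=scores.get)
    return pretrain(best, large_set)            # the final age/gender model

model = search(["A", "B", "C"], small_sets=[1, 2, 3], val_set=None,
               large_set="large", fits_deployment=lambda c: c != "C")
```

Here candidate C is filtered by the deployment check and A outscores B on the stand-in accuracies, so A is retrained on the large set.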
With continued reference to FIG. 4, as an implementation of the method illustrated in the above figures, the present application provides one embodiment of an apparatus for generating a model. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 4, the apparatus 400 for generating a model of the present embodiment may include: a construction unit 401, an acquisition unit 402, a pre-training unit 403, a scoring unit 404, and a retraining unit 405. The construction unit 401 is configured to select network structure modules from a preset set of basic network structure modules to construct at least one candidate model; the acquisition unit 402 is configured to acquire at least one training sample set with a data size smaller than a first threshold, wherein each training sample in the training sample set comprises a sample face image with age and gender labeling information; the pre-training unit 403 is configured to train, for each candidate model of the at least one candidate model, the candidate model with each training sample set of the at least one training sample set, obtaining pre-training models of that candidate model for the different training sample sets; the scoring unit 404 is configured to score each candidate model according to the performance of its pre-training models for the different training sample sets; and the retraining unit 405 is configured to retrain the candidate model with the highest score using a training sample set whose data size is greater than a second threshold, the second threshold being greater than the first threshold, to obtain the age and gender identification model.
In some optional implementations of the present embodiment, the construction unit 401 is further configured to: select network structure modules from the preset set of basic network structure modules to form a basic model; perform mathematical modeling on the model parameters, expressing the structure of the basic model as an exponential function; and randomly sample the model parameters according to the exponential function to obtain at least one candidate model.
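The patent does not spell out the exponential-function modeling here; one plausible reading, sketched below purely as an assumption, is that structural parameters such as width are drawn on an exponential (log-uniform) grid, so random sampling produces candidates spanning several scales of model size.

```python
import random

random.seed(7)  # reproducible illustration

def sample_candidate(base_width=8, depth_range=(2, 6), width_exp_range=(0, 4)):
    depth = random.randint(*depth_range)
    # Uniform draw in the exponent => log-uniform (exponential-grid) width:
    # width per stage is base_width * 2**k with k sampled uniformly.
    widths = [base_width * 2 ** random.randint(*width_exp_range)
              for _ in range(depth)]
    return {"depth": depth, "widths": widths}

candidates = [sample_candidate() for _ in range(4)]
```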
In some optional implementations of the present embodiment, the apparatus 400 further comprises a filtering unit (not shown in the drawings) configured to: before training the candidate model with each training sample set in the at least one training sample set, calculate the computation amount and memory usage of the candidate model in the network prediction stage; and filter out the candidate model if either does not meet the requirements of the target deployment environment.
In some optional implementations of the present embodiment, the scoring unit 404 is further configured to: acquire a verification data set, wherein each sample in the verification data set comprises a face image with age and gender labeling information; evaluate each pre-training model on the verification data set to obtain its performance; and, for each candidate model, calculate the overall performance of its pre-training models as the score of that candidate model.
In some alternative implementations of the present embodiment, the loss function of the age and gender identification model comprises a smooth L1 loss for age estimation and a cross-entropy loss for binary gender classification.
In some alternative implementations of the present embodiment, the set of basic network structure modules includes at least one of: a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a batch normalization layer, an activation function layer, a pooling layer, and a fully connected layer.
It will be appreciated that the elements described in the apparatus 400 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 400 and the units contained therein, and are not described in detail herein.
Referring to fig. 5, a flow 500 of one embodiment of a method for identifying age and gender provided herein is shown. The method for identifying age and gender may include the steps of:
step 501, a face image of a target user to be identified is acquired.
In the present embodiment, the execution subject of the method for identifying age and gender (e.g., the server 105 shown in fig. 1) may acquire the face image of the target user in various ways. For example, it may acquire a stored face image from a database server (e.g., the database server 104 shown in fig. 1) over a wired or wireless connection. As another example, it may receive a face image captured by a terminal (e.g., the terminals 101, 102 shown in fig. 1) or other devices.
In the present embodiment, the target user may be any user, for example a user operating a terminal, or another person present within the image acquisition range. The face image may equally be a color image and/or a grayscale image, and the format of the face image is not limited in this application.
Step 502, preprocessing the face image.
In this embodiment, a face photo is obtained, and operations such as detection and alignment are performed on it following the same preprocessing used in the training-sample construction stage.
Step 503, inputting the preprocessed face image into a pre-trained age and gender recognition model, and outputting the age and gender of the target user.
In this embodiment, the execution subject may input the face image acquired in step 501 into a pre-trained age and gender recognition model, perform the model's inference computation, and return the predicted age of the face (a real number) and the gender class (0 or 1).
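The inference step can be sketched as below; the model here is a placeholder callable assumed to return two head outputs, a real-valued age and a gender logit that is thresholded into the 0/1 class the text describes.

```python
def predict_age_gender(model, image):
    # The model is assumed to return (age estimate, gender logit).
    age_raw, gender_logit = model(image)
    age = max(0.0, age_raw)              # clamp negative regressions to 0
    gender = 1 if gender_logit >= 0 else 0
    return age, gender

fake_model = lambda img: (27.4, 1.3)     # placeholder for the trained network
age, gender = predict_age_gender(fake_model, image=None)
```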
In this embodiment, the age and gender identification model may be generated using the method described above in connection with the embodiment of FIG. 2. The specific generation process may be referred to in the description of the embodiment of fig. 2, and will not be described herein.
It should be noted that the method for identifying age and gender according to the present embodiment may be used to test the age and gender identification models generated in the above embodiments, and those models may then be continually optimized according to the test results. The method may also be the practical application of the models generated in the above embodiments: adopting those models to analyze faces improves the performance of face analysis.
With continued reference to fig. 6, as an implementation of the method of fig. 5 described above, the present application provides one embodiment of an apparatus for identifying age and gender. The embodiment of the device corresponds to the embodiment of the method shown in fig. 5, and the device can be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for identifying age and sex of the present embodiment may include: an acquisition unit 601, a preprocessing unit 602, and an identification unit 603. Wherein the acquiring unit 601 is configured to acquire a face image of a target user to be identified. A preprocessing unit 602 configured to preprocess the face image. The recognition unit 603 is configured to input the preprocessed face image into a pre-trained age and gender recognition model, and output the age and gender of the target user.
It will be appreciated that the elements described in the apparatus 600 correspond to the various steps in the method described with reference to fig. 5. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 600 and the units contained therein, and are not described in detail herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as methods for generating models and for identifying age and gender. For example, in some embodiments, the methods for generating models and for identifying age and gender may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method described above for generating a model and for identifying age and gender may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method for generating the model and for identifying age and gender by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a server of a distributed system or a server that incorporates a blockchain; it may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host applying artificial intelligence technology.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method for generating a model, comprising:
selecting a network structure module from a preset basic network structure module set to construct at least one candidate model;
acquiring at least one training sample set with a data size smaller than a first threshold, wherein each training sample in the training sample set comprises a sample face image with age and gender labeling information;
for each candidate model in the at least one candidate model, training the candidate model by utilizing each training sample set in the at least one training sample set to obtain a pre-training model of the candidate model aiming at different training sample sets;
scoring each candidate model according to the performance of the pre-training models of that candidate model for the different training sample sets;
retraining the candidate model with the highest score by using a training sample set with the data scale larger than a second threshold value to obtain an age and gender identification model, wherein the second threshold value is larger than the first threshold value;
wherein the selecting a network structure module from a preset basic network structure module set to construct at least one candidate model includes:
selecting network structure modules from a preset basic network structure module set to form a basic model;
carrying out mathematical modeling according to model parameters, and expressing the structure of a basic model as an exponential function;
and randomly sampling model parameters according to the exponential function to obtain at least one candidate model.
2. The method of claim 1, wherein prior to training the candidate model with each of the at least one training sample set, the method further comprises:
calculating the computation amount and the memory usage of the candidate model in the network prediction stage;
and if the computation amount or the memory usage does not meet the requirements of the target deployment environment, filtering the candidate model.
3. The method of claim 1, wherein scoring each candidate model according to its performance for the pre-trained models of different training sample sets, comprises:
acquiring a verification data set, wherein each sample in the verification data set comprises a face image with age and gender labeling information;
performing performance evaluation on each pre-training model by using the verification data set to obtain the performance of each pre-training model;
for each candidate model, the overall performance of the pre-trained model of that candidate model is calculated as a score for that candidate model.
4. The method of claim 1, wherein the loss function of the age and gender identification model comprises a smooth L1 loss for age estimation and a cross-entropy loss for binary gender classification.
5. The method of any of claims 1-4, wherein the set of basic network structure modules comprises at least one of:
a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a batch normalization layer, an activation function layer, a pooling layer, a fully connected layer.
6. A method for identifying age and gender, comprising:
acquiring a face image of a target user to be identified;
preprocessing the face image;
inputting the preprocessed face image into an age and gender recognition model trained according to the method of any one of claims 1-5, outputting the age and gender of the target user.
7. An apparatus for generating a model, comprising:
a construction unit configured to select a network structure module from a preset set of basic network structure modules to construct at least one candidate model;
an acquisition unit configured to acquire at least one training sample set having a data size smaller than a first threshold, wherein each training sample in the training sample set comprises a sample face image with age and gender labeling information;
the pre-training unit is configured to train each candidate model in the at least one candidate model by utilizing each training sample set in the at least one training sample set to obtain a pre-training model of the candidate model aiming at different training sample sets;
a scoring unit configured to score each candidate model according to its performance for the pre-trained model of the different training sample set;
a retraining unit configured to retrain the candidate model with the highest score by using a training sample set with a data size larger than a second threshold to obtain an age and gender identification model, wherein the second threshold is larger than the first threshold;
Wherein the build unit is further configured to:
selecting network structure modules from a preset basic network structure module set to form a basic model;
carrying out mathematical modeling according to model parameters, and expressing the structure of a basic model as an exponential function;
and randomly sampling model parameters according to the exponential function to obtain at least one candidate model.
8. The apparatus of claim 7, wherein the apparatus further comprises a filtering unit configured to:
before training the candidate model by utilizing each training sample set in the at least one training sample set, calculate the computation amount and the memory usage of the candidate model in the network prediction stage;
and if the computation amount or the memory usage does not meet the requirements of the target deployment environment, filter the candidate model.
9. The apparatus of claim 7, wherein the scoring unit is further configured to:
acquiring a verification data set, wherein each sample in the verification data set comprises a face image with age and gender labeling information;
performing performance evaluation on each pre-training model by using the verification data set to obtain the performance of each pre-training model;
for each candidate model, the overall performance of the pre-trained model of that candidate model is calculated as a score for that candidate model.
10. The apparatus of claim 7, wherein the loss function of the age and gender identification model comprises a smooth L1 loss for age estimation and a cross-entropy loss for binary gender classification.
11. The apparatus of any of claims 7-10, wherein the set of basic network structure modules comprises at least one of:
a conventional convolution layer, a depthwise separable convolution layer, a 1x1 convolution layer, a batch normalization layer, an activation function layer, a pooling layer, a fully connected layer.
12. An apparatus for identifying age and gender, comprising:
an acquisition unit configured to acquire a face image of a target user to be identified;
a preprocessing unit configured to preprocess the face image;
a recognition unit configured to input the preprocessed face image into an age and gender recognition model trained by the apparatus of any one of claims 7 to 11, and output the age and gender of the target user.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.