CN115862105A - Network model training method and image processing method - Google Patents

Network model training method and image processing method

Info

Publication number
CN115862105A
Authority
CN
China
Prior art keywords
value
age
standard deviation
values
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211533659.4A
Other languages
Chinese (zh)
Inventor
邸德宁
朱婷
庄瑞格
郝敬松
朱树磊
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211533659.4A
Publication of CN115862105A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a training method and an image processing method of a network model, which are used for improving the robustness of the model against environmental interference. The network model training method comprises the following steps: respectively inputting N images included in a first training sample set into a network model to be trained, and carrying out face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values; determining a first loss value and a first standard deviation loss value according to the N predicted age values and the age tag value, and determining a first total loss value according to the first loss value and the first standard deviation loss value; determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender tag value, and determining a second total loss value according to the second loss value and the second standard deviation loss value; and adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value to obtain the trained network model.

Description

Network model training method and image processing method
Technical Field
The application relates to the technical field of computer vision, in particular to a training method and an image processing method of a network model.
Background
Face attributes include age, gender and the like, and good results can be obtained by deep-learning inference on face images. When training an attribute recognition model, manually annotated labels are mostly used as supervision signals, for example a person's age. However, the reliability of manual labels is limited: perception differences among annotators, interference factors in the picture and the like can cause the label information to deviate considerably from the actual value. On the other hand, in a non-cooperative scene, the image is disturbed by the natural environment (illumination, occlusion) and by the person (pose angle, occlusion), hereinafter referred to as environmental interference, so that the face is unclear and predictions with large deviations occur. During training, environmental interference in an image may produce a very large loss value, which in turn biases the training direction of the whole model.
Disclosure of Invention
The embodiment of the application provides a training method of a network model and an image processing method, which are used for improving the robustness of the model against environmental interference.
In a first aspect, an embodiment of the present application provides a network model training method, including:
respectively inputting N images included in a first training sample set into a network model to be trained, and carrying out face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values; the N images in the first training sample set have the same age label value and gender label value; determining a first loss value and a first standard deviation loss value according to the N predicted age values and the age tag value, and determining a first total loss value according to the first loss value and the first standard deviation loss value; determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender tag value, and determining a second total loss value according to the second loss value and the second standard deviation loss value; and adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value to obtain the trained network model.
Based on the scheme, when the loss value is calculated, a standard deviation loss value is introduced, and the network parameters are adjusted by taking the weighted sum of the loss value and the standard deviation loss value as the total loss value. This narrows the fluctuation of the predicted values under different environmental interferences and improves the model's resistance to environmental interference.
In one possible implementation, the determining a first standard deviation loss value according to the N predicted age values and the age tag value includes:
determining a first standard deviation from the N predicted age values and the age tag value;
when the first standard deviation is larger than a first set standard deviation threshold value, taking the difference between the first standard deviation and the first set standard deviation threshold value as a first standard deviation loss value; or,
when the first standard deviation is less than or equal to the first standard deviation threshold, taking a first set value as the first standard deviation loss value;
determining a second standard deviation loss value from the N predicted gender values and the gender tag value, comprising:
determining a second standard deviation from the N predicted gender values and the gender tag value;
when the second standard deviation is larger than a second set standard deviation threshold value, taking the difference between the second standard deviation and the second set standard deviation threshold value as a second standard deviation loss value; or,
and when the second standard deviation is less than or equal to the second standard deviation threshold value, taking a second set value as the second standard deviation loss value.
Based on the scheme, the standard deviation loss value exerts targeted control over environmental interference, and the standard deviation threshold prevents the model from overfitting by driving the standard deviation infinitely close to zero.
In one possible implementation, the first training sample set is obtained by:
acquiring M images, and an age value, an age confidence coefficient, a gender value, a gender confidence coefficient and a quality score corresponding to each image in the M images, wherein the M images are images of the same person in different scenes within a set duration;
selecting N images meeting set conditions from the M images, and generating a first training sample set based on the N images; the setting conditions include: the age confidence coefficient of the image is larger than or equal to a first threshold value, the gender confidence coefficient of the image is larger than or equal to a second threshold value, and the quality score of the image is larger than or equal to a third threshold value;
and determining the age label value and the gender label value of the N images in the first training sample set according to the age value and the gender value of the N images.
In one possible implementation, the determining age label values and gender label values of the N images in the first training sample set according to the age values and the gender values of the N images includes:
selecting a plurality of age values within a set age range from the N age values corresponding to the N images;
taking the age value with the highest frequency of occurrence in the plurality of age values as the age label value of the N images in the first training sample set;
and taking the gender value with the highest frequency of occurrence in the N images as the gender label value of the N images in the first training sample set.
Based on the scheme, through multiple quality control, the reliability of the label values in the obtained first training sample set is ensured, and further the training deviation is prevented.
In one possible implementation, the set age range is determined by:
calculating an age mean and a standard deviation of the N age values;
and determining the set age range according to the age mean and the standard deviation.
In a possible implementation manner, the N images include L first images whose quality scores are less than a first quality threshold, and the determining a first loss value according to the N predicted age values and the age tag value includes: for each first image, when the difference value between the predicted age value of the first image output by the network model to be trained and the age tag value of the first image is greater than a first set deviation value, determining a first predicted age value of the first image according to the age tag value of the first image, the predicted age value and the first set deviation value; or, when the difference value between the predicted age value of the first image and the age tag value of the first image is less than or equal to the first set deviation value, taking the predicted age value of the first image as the first predicted age value; and determining the first loss value based on the first predicted age values of the L first images, the predicted age values of the N-L remaining images, and the age tag value.
In one possible implementation, the first standard deviation is determined by:
sorting the N predicted age values determined by the network model to be trained, and taking the median as a first value; for each first image, determining a second value from the first value, the age tag value and a second set deviation value when the difference between the predicted age value of the first image and the first value is greater than the second set deviation value; or, when the difference between the predicted age value of the first image and the first value is less than or equal to the second set deviation value, taking the predicted age value of the first image as the second value; and determining the first standard deviation from the N-L predicted age values, the L second values, and the age label value.
By this scheme, after a low-quality image is input into the network model, when the distance between the predicted value and the tag value is found to be too large, the predicted value can be modified according to the set deviation value and the tag value, and the loss is then calculated from the modified predicted value, which reduces the noise bias introduced by low-quality samples in the first training sample set.
In a second aspect, an embodiment of the present application provides an image processing method, including: acquiring an image to be processed; inputting the image to be processed into a network model, and performing face attribute recognition on the image to be processed through the network model to obtain an age value and a gender value corresponding to a face in the image to be processed, wherein the network model is obtained by training through the network model training methods of the first aspect and different implementation manners of the first aspect.
In a third aspect, an embodiment of the present application provides a network model training apparatus, including:
the input module is used for respectively inputting N images included in the first training sample set into a network model to be trained, and performing face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values; the N images in the first training sample set have the same age label value and gender label value;
a determining module, configured to determine a first loss value and a first standard deviation loss value according to the N predicted age values and the age tag value, and determine a first total loss value according to the first loss value and the first standard deviation loss value;
determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender tag values, and determining a second total loss value according to the second loss value and the second standard deviation loss value;
and the adjusting module is used for adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value so as to obtain the trained network model.
In a fourth aspect, an embodiment of the present application provides an execution apparatus, including:
a memory for storing program instructions;
and a processor, configured to obtain the program instructions stored in the memory, and execute the method according to the first aspect, the second aspect, and different implementation manners of the first aspect according to the obtained program instructions.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect, the second aspect, and different implementations of the first aspect.
For technical effects brought by any one implementation manner of the second aspect to the fourth aspect, reference may be made to the technical effects brought by the first aspect and different implementation manners of the first aspect, and details are not described here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the description below are some embodiments of the present application, and those skilled in the art can obtain other drawings based on the drawings without inventive labor.
Fig. 1A is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 1B is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for training a network model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a network model training apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of an execution device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Face attributes include age, gender and the like, and good results can be obtained by deep-learning inference on face images. When training an attribute recognition model, manually annotated labels are mostly used as supervision signals, for example a person's age. However, the reliability of manual labels is limited: perception differences among annotators, interference factors in the picture and the like can cause the label information to deviate considerably from the actual value. On the other hand, in a non-cooperative scene, the image is disturbed by the natural environment (illumination, occlusion) and by the person (pose angle, occlusion), hereinafter referred to as environmental interference, so that the face is unclear and predictions with large deviations occur. During training, environmental interference in an image may produce a very large loss value, which in turn biases the training direction of the whole model.
In order to solve the above problems, the present application provides a training method and an image processing method for a network model: a standard deviation loss value is introduced during training, and by fusing a basic loss value with the standard deviation loss value, the fluctuation of the predicted values under different environmental interferences is directly narrowed, improving the robustness of the model to environmental interference.
Some brief descriptions are given below to application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The image processing method provided by the embodiment of the application can be implemented by an execution device. In some embodiments, the execution device may be an electronic device, and the electronic device may be implemented by one or more servers; for example, one server 100 is illustrated in fig. 1A. Referring to fig. 1A, a schematic diagram of a possible application scenario provided in the embodiment of the present application is shown, which includes a server 100 and an acquisition device 200. The server 100 may be a physical server or a virtual server, and may be implemented by a single server or by a server cluster composed of a plurality of servers; either form can implement the image processing method provided by the present application. The acquisition device 200 is a device with an image acquisition function, including an electronic police device, an electronic monitoring device, a monitoring camera, a video recorder, a terminal device with a video acquisition function (such as a notebook, a computer, a mobile phone, or a television), and the like. The acquisition device 200 may transmit the acquired image to be processed to the server 100 through the network. Alternatively, the server 100 may be connected to the terminal device 300, receive an image processing task sent by the terminal device 300, and perform image processing on the received to-be-processed image sent by the acquisition device 200. In some scenarios, the server 100 may transmit the image processing result to the terminal device 300. The terminal device 300 may be a television, a mobile phone, a tablet computer, a personal computer, and the like. In some embodiments, after the acquisition device 200 acquires an image, the acquired image may be sent to the server, and the server performs image analysis and portrait clustering on the image to obtain a plurality of images and store them as an image archive. Each image archive comprises face images of the same person in different scenes within a set duration. In some scenarios, the image archive also includes the face attributes, confidences and quality scores of the images. The server 100 may train the network model through a plurality of image archives to implement the network model training method.
By way of example, referring to FIG. 1B, server 100 may include a processor 110, a communication interface 120, and a memory 130. Of course, other components, not shown in FIG. 1B, may also be included in the server 100.
The communication interface 120 is used for communicating with the capture device 200 and the terminal device 300, and is used for receiving an image to be processed sent by the capture device 200, or receiving an image processing task sent by the terminal device 300, or sending an image processing result to the terminal device 300.
In the embodiments of the present application, the processor 110 may be a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The processor 110 is the control center of the server 100: it connects the various parts of the entire server 100 using various interfaces and lines, and performs the various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130 and calling data stored in the memory 130. Optionally, the processor 110 may include one or more processing units. The processor 110 may be a control component such as a processor, a microprocessor, or a controller, and may be, for example, a general-purpose Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created during business processing, and the like. The memory 130, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 130 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 130 may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 130 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
In other embodiments, the execution device may be a terminal device. In some scenarios, the terminal device may receive the image to be processed sent by the acquisition device and perform image processing on it to obtain an age value and a gender value for the image to be processed. The terminal device may include a display device, and the display device may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection display device, and the like, which is not particularly limited in this application.
It should be noted that the structure shown in fig. 1A and 1B is only an example, and the present embodiment is not limited thereto.
Fig. 2 exemplarily shows a process of the network model training method, which may be performed by a network model training apparatus, which may be located in the server 100 shown in fig. 1B, such as the processor 110, or the server 100. The specific process is as follows:
201, respectively inputting the N images included in the first training sample set into a network model to be trained, and performing face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values.
In some embodiments, the first training sample set may be obtained as follows: acquire M images, and an age value, an age confidence coefficient, a gender value, a gender confidence coefficient and a quality score corresponding to each of the M images, wherein the M images are images of the same person in different scenes within a set duration, and the M images may form an image archive. In some scenarios, the image archive may be obtained as follows: acquire city-level snapshot data within a set time period, where the snapshot data spans a short time so as to ensure that the face attributes remain unchanged within the set time period. For example, the set period may be 1 month, 3 months, or the like; preferably, in order to ensure uniformity of the face attributes in the snapshot data, the set period does not exceed half a year. The snapshot data is analyzed by face attribute, face quality and face recognition models to obtain attribute values (age and gender), confidences of the attribute values, quality scores and face recognition features. Further, according to the associated face, head-and-shoulder, human body and gait information captured at snapshot time, analysis algorithms of the corresponding types are used to extract the head-and-shoulder features, human body features, gait features and the like of the detected person targets. Meanwhile, the time information and the snapshot point location information of the snapshot data are recorded.
After the above information is obtained, portrait clustering can be carried out based on the face, head-and-shoulder, human body and gait features together with the spatio-temporal information, so that a large number of low-quality face pictures can be recalled into the same image archive. In this way, a plurality of image archives can be obtained, each containing face pictures of the same person in different scenes within the set time period. Based on the clustering of city-level mass snapshot data, the richness of the image data in the image archives can be ensured, and their scale can far exceed that of manual labeling.
In some embodiments, M images may be included in the image archive, and then N images satisfying a set condition may be selected from the M images, and a first training sample set may be generated based on the N images. Wherein the setting conditions include: the age confidence of the image is greater than or equal to a first threshold, the gender confidence of the image is greater than or equal to a second threshold, and the quality score of the image is greater than or equal to a third threshold.
As one example, confidence and quality filtering may be performed on the images within an image archive. For example, where the age confidence threshold is a first threshold and the gender confidence threshold is a second threshold, images with low confidence may be excluded based on the first and second thresholds. For example, assuming that the image archive contains M original images, the archive may first be filtered by the first threshold for the age attribute: when the age confidence of 3 of the M images is smaller than the first threshold, the M-3 images with age confidence greater than or equal to the first threshold are retained. For the gender attribute, the archive can then be filtered by the second threshold: when the gender confidence of 1 of the M-3 images is smaller than the second threshold, the M-4 images with gender confidence greater than or equal to the second threshold are retained. Further, the images may be filtered by their quality scores, excluding images with quality scores smaller than the third threshold, to obtain the N images with quality scores greater than or equal to the third threshold. For example, when the quality scores of 2 of the M-4 images are less than the third threshold, the remaining M-6 images are taken as the images in the first training sample set.
In some scenarios, an image number threshold may also be set, and an image archive with the number of images greater than the image number threshold after the image archive is subjected to confidence level and quality filtering is used as a training sample set.
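As an illustration of this filtering, a minimal Python sketch follows (the class, field and threshold names are illustrative assumptions, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class ArchiveImage:
    age: int
    age_conf: float
    gender: int          # e.g. 0 = female, 1 = male (encoding is an assumption)
    gender_conf: float
    quality: float

def filter_archive(images, t_age_conf, t_gender_conf, t_quality, min_count):
    """Keep the N images meeting the set conditions; return None when the
    filtered archive falls at or below the image number threshold."""
    kept = [img for img in images
            if img.age_conf >= t_age_conf
            and img.gender_conf >= t_gender_conf
            and img.quality >= t_quality]
    return kept if len(kept) > min_count else None
```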
In some embodiments, the first training sample set is any one of a plurality of training sample sets. After the first training sample set is obtained, the age label values and the gender label values of the N images in the first training sample set may be determined according to the age values and the gender values of the N images. Wherein the N images in the first training sample set have the same age label value and gender label value. In some scenarios, the age label values and the gender label values of the N images in the first training sample set may be determined according to the age values and the gender values of the N images, and may specifically be determined as follows: a plurality of age values within a set age range are selected from N age values corresponding to the N images. And then, the age value with the highest frequency of appearance in the plurality of age values is used as the age label value of the N images in the first training sample set. Similarly, the gender value with the highest frequency of occurrence in the N images may be used as the gender label value of the N images in the first training sample set.
In some scenarios, the set age range may be determined as follows: calculating the age mean and standard deviation of the N age values; and determining the set age range according to the age mean and the standard deviation. As an example, the age mean may be expressed by mean, the standard deviation may be expressed by std, and the set age range may be expressed as mean ± std × 3.
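As an illustration, the label derivation above can be sketched as follows (a minimal sketch assuming plain lists of per-image age and gender values; names are illustrative):

```python
import statistics
from collections import Counter

def derive_labels(ages, genders):
    """Age label: the most frequent age among those inside mean ± 3 * std of
    the archive's N age values; gender label: the most frequent gender value."""
    mean = statistics.mean(ages)
    std = statistics.pstdev(ages)                 # population standard deviation
    in_range = [a for a in ages if mean - 3 * std <= a <= mean + 3 * std]
    age_label = Counter(in_range).most_common(1)[0][0]
    gender_label = Counter(genders).most_common(1)[0][0]
    return age_label, gender_label
```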
202, determining a first loss value and a first standard deviation loss value according to the N predicted age values and the age tag value, and determining a first total loss value according to the first loss value and the first standard deviation loss value.
In some embodiments, N loss values may be determined by an existing loss calculation method, and the first loss value may be determined from them. In some scenarios, the first loss value may be the sum of the N loss values; in other scenarios, it may be their average.
In some embodiments, the first standard deviation loss value is determined according to the N predicted age values and the age tag value as follows: determine a first standard deviation from the N predicted age values and the age tag value; when the first standard deviation is greater than the first set standard deviation threshold, take the difference between the first standard deviation and the first set standard deviation threshold as the first standard deviation loss value; when the first standard deviation is less than or equal to the first set standard deviation threshold, take the first set value as the first standard deviation loss value, where the first set value may be 0. The first standard deviation loss value satisfies the condition described by the following equation:
Loss_Std = max(0, Std − L);
where Loss_Std denotes the first standard deviation loss value, Std denotes the first standard deviation, and L is the first set standard deviation threshold.
In some embodiments, a weighted sum of the first loss value and the first standard deviation loss value may be used as the first total loss value, or an average of the first loss value and the first standard deviation loss value may be used as the first total loss value. In some scenarios, the first total loss value satisfies the condition shown in the following equation:
Loss = Loss_Basic + β · Loss_Std;
where Loss denotes the first total loss value, Loss_Basic denotes the first loss value, and β denotes a hyperparameter.
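As an illustration, the two equations above can be sketched in PyTorch as follows (a minimal sketch; the standard deviation is computed around the shared label value, consistent with the formula for the first standard deviation given later, and all names are illustrative):

```python
import torch

def std_loss(preds: torch.Tensor, label: float, threshold: float) -> torch.Tensor:
    """Loss_Std = max(0, Std - L): zero while the spread of the N
    predictions around the label stays below the set threshold L."""
    std = torch.sqrt(torch.mean((preds - label) ** 2))
    return torch.clamp(std - threshold, min=0.0)

def total_loss(basic: torch.Tensor, preds: torch.Tensor, label: float,
               threshold: float, beta: float) -> torch.Tensor:
    """Loss = Loss_Basic + beta * Loss_Std, with beta weighting the
    standard deviation loss."""
    return basic + beta * std_loss(preds, label, threshold)
```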
203, determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender tag value, and determining a second total loss value according to the second loss value and the second standard deviation loss value.
In some embodiments, the second standard deviation loss value may be determined from the N predicted gender values and the gender tag value as follows: determine a second standard deviation according to the N predicted gender values and the gender tag value; when the second standard deviation is greater than a second set standard deviation threshold, take the difference between the second standard deviation and the second set standard deviation threshold as the second standard deviation loss value; and when the second standard deviation is less than or equal to the second set standard deviation threshold, take the second set value as the second standard deviation loss value, where the second set value may be 0.
In some embodiments, the method of determining the second total loss value is identical to the method of determining the first total loss value, and is not repeated here.
204, adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value to obtain the trained network model.
Based on the scheme, fusing the basic loss value with the standard deviation loss value ensures that the basic performance of the model does not degrade, while the standard deviation loss value narrows the distribution of the predictions under different environmental interferences. The anti-interference capability of the model is improved without constraining the individual predicted values; only the deviation of their distribution is managed, which improves the robustness of the trained network model to environmental interference.
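As an illustration, one training step over a single first training sample set might look as follows. This is a hypothetical sketch: it reuses std_loss from the sketch above, basic_age_loss, basic_gender_loss and cfg are placeholders, and summing the two total loss values is only one plausible way to combine them, which the description leaves open:

```python
def train_step(model, optimizer, images, age_label, gender_label, cfg):
    """One parameter update on one archive: compute both total loss values
    and adjust the network parameters of the model to be trained."""
    age_preds, gender_preds = model(images)          # N predictions each
    loss_age = basic_age_loss(age_preds, age_label) + \
        cfg.beta_age * std_loss(age_preds, age_label, cfg.std_thresh_age)
    loss_gender = basic_gender_loss(gender_preds, gender_label) + \
        cfg.beta_gender * std_loss(gender_preds, gender_label, cfg.std_thresh_gender)
    total = loss_age + loss_gender                   # one way to combine the two totals
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```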
In some embodiments, the N images include L first images whose quality scores are less than a first quality threshold, and the first loss value may be determined from the N predicted age values and the age label value as follows: for each first image, when the difference between the predicted age value of the first image output by the network model to be trained and the age label value of the first image is greater than a first set deviation value, a first predicted age value of the first image is determined according to the age label value of the first image, the predicted age value and the first set deviation value.
Specifically, the first predicted age value of the first image satisfies the condition shown by the following formula: S = Label + H or S = Label − H, where Label denotes the age label value, H denotes the first set deviation value, and S denotes the first predicted age value of the first image; S takes whichever of the two candidates is closer to the predicted age value. As an example, when the predicted age value is 24, the age label value of the first image is 30, and the first set deviation value is 4 years, S is 26 or 34; since the predicted age is 24, the first predicted age value S takes the value 26.
In some embodiments, the predicted age value of the first image is taken as the first predicted age value when a difference between the predicted age value of the first image and the age tag value of the first image is less than or equal to a first set deviation value.
Further, a first loss value may be determined based on the first predicted age values of the L first images, the predicted age values of the N-L images, and the age tag value.
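As an illustration, this clamping rule can be sketched as follows (a minimal Python sketch; names are illustrative):

```python
def first_predicted_age(pred: float, label: float, h: float) -> float:
    """S = Label + H or S = Label - H when |pred - label| > H, choosing the
    candidate nearer the prediction; otherwise S is the prediction itself."""
    if abs(pred - label) <= h:
        return pred
    return label + h if pred > label else label - h

# Example from the text: prediction 24, label 30, H = 4 -> S = 26
assert first_predicted_age(24, 30, 4) == 26
```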
In some embodiments, the first standard deviation may also be determined as follows: sort the N predicted age values determined by the network model to be trained, and take the median as a first value; for each first image, when the difference between the predicted age value of the first image and the first value is greater than a second set deviation value, determine a second value based on the first value, the age tag value, and the second set deviation value. As an example, the first value may be denoted Median and the second set deviation value A; when the difference between the predicted age value of the first image and the first value is greater than the second set deviation value, the second value satisfies the condition shown by the following formula: X = Median + A or X = Median − A, where X denotes the second value and takes whichever candidate is closer to the predicted age value. As an example, when the predicted age value is 30, the median of the N predicted age values (the first value) is 25, and the second set deviation value is 4 years, X is 29 or 21; since the predicted age is 30, the second value X takes the value 29.
In some embodiments, when a difference between the predicted age value of the first image and the first value is less than or equal to a second set deviation value, the predicted age value of the first image is taken as a second value; a first standard deviation is determined from the N-L predicted age values, the L second values, and the age tag value. In some scenarios, the first standard deviation satisfies the condition shown in the following equation:
Std = sqrt( (1/N) · Σ (x_i − x)² ), summed over i = 1 to N;
where Std denotes the first standard deviation and x denotes the age label value; when the i-th image is a first image whose quality score is smaller than the first quality threshold, x_i takes the second value, and when the i-th image has a quality score greater than or equal to the first quality threshold, x_i takes the predicted age value.
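As an illustration, a PyTorch sketch of this computation follows (names are illustrative; the quality tensor marks which of the N images are first images):

```python
import torch

def first_standard_deviation(preds: torch.Tensor, quality: torch.Tensor,
                             label: float, q_thresh: float, a: float) -> torch.Tensor:
    """Clamp low-quality predictions to within A of the median (the second
    value X), then take the root-mean-square deviation around the label."""
    median = preds.median()
    # X = Median + A or Median - A, whichever is nearer the prediction
    clamped = torch.clamp(preds, min=median - a, max=median + a)
    x = torch.where(quality < q_thresh, clamped, preds)
    return torch.sqrt(torch.mean((x - label) ** 2))
```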
In some embodiments, a set deviation value may also be used for the predicted gender value, and may be set to 1. The predicted gender value may be 0 or 1; when the distance between the predicted gender value and the gender tag value is greater than the set deviation value, the second loss value and the second standard deviation may be determined in the same manner as the first loss value and the first standard deviation.
In some embodiments, the test set may be used for testing after the network model has been trained for a certain number of rounds. In some scenarios, different image archives may be used as the test set and the training sample set respectively. For the test set, equal-quantity segmented extraction can be performed over the valid interval of the attribute values. For example, the valid range of the age value may be [0,100]; P archives may then be extracted for each 5-year age interval according to the representative value, with the image counts of the extracted P archives differing from one another as much as possible. Further, unlike in training, where medium- and low-quality images are excluded, valid data (including outlier image data, the same below) and low-confidence data are retained in the test set. The outlier image data is image data outside the set age range.
In some scenarios, the tag value information of the test set data may be obtained, and the difference between the test values and the attribute tag value in each archive is calculated; the calculation follows the form of a standard deviation and is defined as:
Error_std = sqrt( (1/k) · Σ (Value_i − Label)² ), summed over i = 1 to k;
where k denotes the number of images in the image archive, Value_i denotes the test value of the i-th image, and Label denotes the label value.
In some embodiments, for each test, the average deviation of the test values from the tag values on the test set, together with Error_std, may be calculated. A threshold T is preset; if the absolute value of the average deviation exceeds T in a test, training is immediately stopped, the model is rolled back to the basic model, and a training failure is reported. For the Error_std of each test, a threshold E and a round number I are preset; when Error_std exceeds E in more than I test rounds, training is likewise immediately stopped and rolled back. In some scenarios, Error_std may be monitored and its moving average calculated; when the moving average converges to a preset threshold, training is stopped. The model after the last round of training serves as the next basic model.
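As an illustration, the test metric and the stop-and-rollback rule can be sketched as follows (a minimal Python sketch; names are illustrative):

```python
import math

def error_std(values, label):
    """Error_std = sqrt((1/k) * sum((Value_i - Label)^2)) over one archive."""
    k = len(values)
    return math.sqrt(sum((v - label) ** 2 for v in values) / k)

def should_stop(mean_dev, errstd_history, t, e, i):
    """Roll back when |mean deviation| exceeds T, or when Error_std has
    exceeded E in more than I test rounds."""
    return abs(mean_dev) > t or sum(1 for x in errstd_history if x > e) > i
```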
Based on the same technical concept, the embodiment of the present application provides an image processing method, and fig. 3 exemplarily shows a flow of the image processing method, which can be executed by an image processing apparatus, which can be located in the server 100 shown in fig. 1B, such as the processor 110, or the server 100. The image processing apparatus may also be located in the terminal device. The specific process is as follows:
301, acquiring an image to be processed.
In some embodiments, the image to be processed is an image frame acquired by an acquisition device. The acquisition device may be an electronic police device, electronic monitoring equipment, a monitoring camera, a video recorder, a terminal device with a video acquisition function (such as a notebook, a computer, a mobile phone, or a television), and the like.
Illustratively, after the capture device captures the video frames, the server obtains the image to be processed from the capture device.
In some embodiments, the server receives a video file sent by the acquisition device, where the video file includes the image to be processed. The video file may be an encoded video file, so the server can decode the received video file to obtain the image to be processed. Encoding the video effectively reduces the file size and facilitates transmission; this improves the transmission speed of the video and the efficiency of subsequent processing. The encoded stream data may be acquired in any applicable manner, including but not limited to: the Real Time Streaming Protocol (RTSP), the Open Network Video Interface Forum (ONVIF) standard, or a proprietary protocol.
302, inputting the image to be processed into the network model, and performing face attribute recognition on the image to be processed through the network model to obtain an age value and a gender value corresponding to a face in the image to be processed.
Based on the same technical concept, the embodiment of the present application provides a training apparatus 400 for a network model, as shown in fig. 4. The apparatus 400 may perform any step of the above network model training method, and is not described herein again to avoid repetition. The apparatus 400 includes an input module 401, a determination module 402, and an adjustment module 403.
An input module 401, configured to input N images included in a first training sample set into a to-be-trained network model, respectively, and perform face attribute recognition on the N images through the to-be-trained network model to obtain N predicted age values and N predicted gender values; the N images in the first training sample set have the same age label value and gender label value;
a determining module 402, configured to determine a first loss value and a first standard deviation loss value according to the N predicted age values and the age tag value, and determine a first total loss value according to the first loss value and the first standard deviation loss value;
determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender tag values, and determining a second total loss value according to the second loss value and the second standard deviation loss value;
an adjusting module 403, configured to adjust a network parameter of the network model to be trained according to the first total loss value and the second total loss value, so as to obtain a trained network model.
In some embodiments, the determining module 402, when determining the first loss of standard deviation value according to the N predicted age values and the age label value, is specifically configured to: determining a first standard deviation from the N predicted age values and the age tag value; when the first standard deviation is larger than a first set standard deviation threshold value, taking the difference between the first standard deviation and the first set standard deviation threshold value as a first standard deviation loss value; or, when the first standard deviation is less than or equal to the first standard deviation threshold, taking a first set value as the first standard deviation loss value;
the determining module 402 is configured to determine a second standard deviation loss value according to the N predicted gender values and the gender tag value, and specifically configured to: determining a second standard deviation from the N predicted gender values and the gender tag value; when the second standard deviation is larger than a second set standard deviation threshold value, taking the difference between the second standard deviation and the second set standard deviation threshold value as a second standard deviation loss value; or, when the second standard deviation is less than or equal to the second standard deviation threshold, a second set value is taken as the second standard deviation loss value.
In some embodiments, the determining module 402 is further configured to obtain the first training sample set by: acquiring M images, and an age value, an age confidence coefficient, a gender value, a gender confidence coefficient and a quality score corresponding to each image in the M images, wherein the M images are images of the same person in different scenes within a set duration; selecting N images meeting set conditions from the M images, and generating a first training sample set based on the N images; the setting conditions include: the age confidence coefficient of the image is larger than or equal to a first threshold value, the gender confidence coefficient of the image is larger than or equal to a second threshold value, and the quality score of the image is larger than or equal to a third threshold value; and determining the age label value and the gender label value of the N images in the first training sample set according to the age value and the gender value of the N images.
In some embodiments, the determining module 402, when determining the age label values and the gender label values of the N images in the first training sample set according to the age values and the gender values of the N images, is specifically configured to:
selecting a plurality of age values within a set age range from the N age values corresponding to the N images; taking the age value with the highest frequency of occurrence in the plurality of age values as the age label value of the N images in the first training sample set; and taking the gender value with the highest frequency of occurrence in the N images as the gender label value of the N images in the first training sample set.
In some embodiments, the set age range is determined by: calculating an age mean and a standard deviation of the N age values; and determining the set age range according to the age mean and the standard deviation.
In some embodiments, the N images include L first images with quality scores less than a first quality threshold, and the determining module 402, when determining the first loss value according to the N predicted age values and the age label value, is specifically configured to: for each first image, when the difference value between the predicted age value of the first image output by the network model to be trained and the age tag value of the first image is greater than a first set deviation value, determining a first predicted age value of the first image according to the age tag value of the first image, the predicted age value and the first set deviation value; or, when the difference value between the predicted age value of the first image and the age tag value of the first image is less than or equal to a first set deviation value, taking the predicted age value of the first image as a first predicted age value;
determining a first loss value based on a first predicted age value for the L first images, the predicted age values for N-L images, and the age tag value.
In some embodiments, the determining module 402 is further configured to determine the first standard deviation by: sequencing the N predicted age values determined by the network model to be trained, and taking a median as a first value; for each first image, determining a second value from the first value, the age tag value and the second set offset value when the difference between the predicted age value and the first value of the first image is greater than the second set offset value; or, when the difference between the predicted age value of the first image and the first value is less than or equal to the second set deviation value, taking the predicted age value of the first image as a second value; determining the first standard deviation from the N-L predicted age values, the L second values, and the age label value.
Based on the same technical concept, the embodiment of the present application provides an executing apparatus 500, and the apparatus 500 may implement any step of the network model training method and the image processing method discussed above, please refer to fig. 5. The apparatus comprises a memory 501 and a processor 502.
The memory 501 is used for storing program instructions;
The processor 502 is configured to call the program instructions stored in the memory, and execute the network model training method or the image processing method according to the obtained program instructions.
In the embodiments of the present application, the processor 502 may be a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor.
The memory 501, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 501 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 501 may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 501 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium, including: computer program code which, when run on a computer, causes the computer to perform a method of training a network model or a method of image processing as discussed in the foregoing. Because the principle of solving the problem of the computer-readable storage medium is similar to the training method or the image processing method of the network model, the implementation of the computer-readable storage medium can refer to the implementation of the method, and repeated details are not repeated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made to the present application without departing from its scope. Thus, if such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to cover them as well.

Claims (11)

1. A network model training method is characterized by comprising the following steps:
respectively inputting N images included in a first training sample set into a network model to be trained, and carrying out face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values; the N images in the first training sample set have the same age label value and gender label value;
determining a first loss value and a first standard deviation loss value according to the N predicted age values and the age label value, and determining a first total loss value according to the first loss value and the first standard deviation loss value;
determining a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender label value, and determining a second total loss value according to the second loss value and the second standard deviation loss value;
and adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value to obtain the trained network model.
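By way of illustration only, the training step of claim 1 might be realized as the following PyTorch-style sketch; the model interface (an age regression output plus a sigmoid gender probability), the choice of loss functions, the threshold defaults, and the unweighted sum of the loss terms are all assumptions of this sketch, since the claim fixes none of them.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, images, age_label, gender_label,
                  age_std_thresh=3.0, gender_std_thresh=0.2):
    """One update on a batch of N images sharing one age/gender label.

    `model`, the thresholds, and the equal weighting of the loss terms
    are illustrative assumptions, not values fixed by the patent.
    """
    pred_age, pred_gender = model(images)  # each of shape (N,)

    # First loss value: age regression error against the shared label.
    age_loss = F.l1_loss(pred_age, torch.full_like(pred_age, float(age_label)))
    # First standard deviation loss value: penalize only the spread of the
    # N predictions above a set threshold (see claim 2).
    age_std_loss = torch.clamp(pred_age.std() - age_std_thresh, min=0.0)
    first_total = age_loss + age_std_loss

    # Second loss value and second standard deviation loss value for gender;
    # the model is assumed to output a probability in [0, 1].
    gender_loss = F.binary_cross_entropy(
        pred_gender, torch.full_like(pred_gender, float(gender_label)))
    gender_std_loss = torch.clamp(pred_gender.std() - gender_std_thresh, min=0.0)
    second_total = gender_loss + gender_std_loss

    # Adjust the network parameters according to both total loss values.
    optimizer.zero_grad()
    (first_total + second_total).backward()
    optimizer.step()
    return first_total.item(), second_total.item()
```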
2. The method of claim 1, wherein determining the first standard deviation loss value according to the N predicted age values and the age label value comprises:
determining a first standard deviation according to the N predicted age values and the age label value;
when the first standard deviation is greater than a first set standard deviation threshold, taking the difference between the first standard deviation and the first set standard deviation threshold as the first standard deviation loss value; or,
when the first standard deviation is less than or equal to the first set standard deviation threshold, taking a first set value as the first standard deviation loss value;
and wherein determining the second standard deviation loss value according to the N predicted gender values and the gender label value comprises:
determining a second standard deviation according to the N predicted gender values and the gender label value;
when the second standard deviation is greater than a second set standard deviation threshold, taking the difference between the second standard deviation and the second set standard deviation threshold as the second standard deviation loss value; or,
when the second standard deviation is less than or equal to the second set standard deviation threshold, taking a second set value as the second standard deviation loss value.
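A minimal sketch of the thresholded standard-deviation loss of claim 2 follows. The claim derives the standard deviation from the N predictions together with the shared label; measuring the spread around the label rather than the sample mean, and using 0 as the "set value", are assumptions of this sketch.

```python
import math

def std_loss(predictions, label, threshold, set_value=0.0):
    """Hinged standard-deviation loss (claim 2 sketch)."""
    # Spread of the N predictions around the shared label (assumption).
    std = math.sqrt(sum((p - label) ** 2 for p in predictions) / len(predictions))
    # Only spread above the set threshold contributes to the loss;
    # otherwise a set value (assumed to be 0 here) is returned.
    return std - threshold if std > threshold else set_value
```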
3. The method of claim 2, wherein the first set of training samples is obtained by:
acquiring M images, and an age value, an age confidence coefficient, a gender value, a gender confidence coefficient and a quality score corresponding to each of the M images, wherein the M images are images of the same person in different scenes within a set duration;
selecting N images meeting set conditions from the M images, and generating a first training sample set based on the N images; the setting conditions include: the age confidence coefficient of the image is greater than or equal to a first threshold, the gender confidence coefficient of the image is greater than or equal to a second threshold, and the quality score of the image is greater than or equal to a third threshold;
and determining the age label value and the gender label value of the N images in the first training sample set according to the age value and the gender value of the N images.
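The selection of claim 3 could be sketched as follows; the dict-based sample layout and the field names are hypothetical, since the patent does not prescribe any particular data structure.

```python
def build_first_training_set(samples, age_conf_min, gender_conf_min, quality_min):
    """Select, from M candidate images of one person, the N images
    that meet the set conditions of claim 3.

    Each sample is assumed to be a dict with keys 'image', 'age',
    'age_conf', 'gender', 'gender_conf' and 'quality'.
    """
    return [s for s in samples
            if s['age_conf'] >= age_conf_min          # first threshold
            and s['gender_conf'] >= gender_conf_min   # second threshold
            and s['quality'] >= quality_min]          # third threshold
```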
4. The method of claim 3, wherein determining age label values and gender label values for the N images in the first training sample set based on the age values and gender values of the N images comprises:
selecting a plurality of age values within a set age range from the N age values corresponding to the N images;
taking the age value with the highest frequency of occurrence in the plurality of age values as the age label value of the N images in the first training sample set;
and taking the gender value with the highest frequency of occurrence among the N gender values corresponding to the N images as the gender label value of the N images in the first training sample set.
5. The method of claim 4, wherein the set age range is determined by:
calculating an age mean and a standard deviation of the N age values;
and determining the set age range according to the age mean and the standard deviation.
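Claims 4 and 5 together amount to a mode-within-a-range rule, which might look like the sketch below; taking the set age range as the mean plus or minus one standard deviation is an assumption, as claim 5 only says the range is determined from those two statistics.

```python
from collections import Counter
import statistics

def label_values(age_values, gender_values):
    """Derive the shared labels for the N selected images (claims 4-5)."""
    mean = statistics.fmean(age_values)
    std = statistics.pstdev(age_values)
    # Set age range from the mean and standard deviation (claim 5);
    # mean +/- one standard deviation is an assumption of this sketch.
    in_range = [a for a in age_values if mean - std <= a <= mean + std]
    # Modal age within the range, and modal gender over all N values (claim 4).
    age_label = Counter(in_range).most_common(1)[0][0]
    gender_label = Counter(gender_values).most_common(1)[0][0]
    return age_label, gender_label
```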
6. The method of claim 3, wherein the N images include L first images having quality scores less than a first quality threshold, and wherein determining the first loss value according to the N predicted age values and the age label value comprises:
for each first image, when the difference between the predicted age value of the first image output by the network model to be trained and the age label value of the first image is greater than a first set deviation value, determining a first predicted age value of the first image according to the age label value of the first image, the predicted age value, and the first set deviation value; or, when the difference between the predicted age value of the first image and the age label value of the first image is less than or equal to the first set deviation value, taking the predicted age value of the first image as the first predicted age value;
and determining the first loss value based on the first predicted age values of the L first images, the predicted age values of the remaining N-L images, and the age label value.
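One plausible reading of claim 6's adjustment for a low-quality image is to cap the prediction at the label plus or minus the allowed deviation; the claim only says the first predicted age value is determined from those three quantities, so the capping direction below is an assumption.

```python
def clamp_low_quality_age(pred_age, age_label, max_dev):
    """Claim 6 sketch: limit how far a low-quality image's predicted age
    may pull the loss away from the shared label.

    `max_dev` plays the role of the first set deviation value.
    """
    if abs(pred_age - age_label) > max_dev:
        # Shift the label by the allowed deviation, in the direction of
        # the raw prediction (an assumption of this sketch).
        return age_label + max_dev if pred_age > age_label else age_label - max_dev
    return pred_age
```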
7. The method of claim 6, wherein the first standard deviation is determined by:
sorting the N predicted age values determined by the network model to be trained, and taking the median as a first value;
for each first image, when the difference between the predicted age value of the first image and the first value is greater than a second set deviation value, determining a second value according to the first value, the age label value, and the second set deviation value; or, when the difference between the predicted age value of the first image and the first value is less than or equal to the second set deviation value, taking the predicted age value of the first image as the second value;
and determining the first standard deviation according to the N-L predicted age values, the L second values, and the age label value.
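A sketch of claim 7's median-anchored computation follows. Pulling an outlying low-quality prediction to the median plus or minus the deviation, and measuring the final spread around the shared label, are both assumptions; the claim only names the quantities each step is determined from.

```python
import statistics

def first_standard_deviation(pred_ages, low_quality_idx, age_label, max_dev):
    """Claim 7 sketch: clamp low-quality predictions toward the median
    ('first value') before computing the spread around the label.
    """
    median = statistics.median(pred_ages)  # the first value
    adjusted = list(pred_ages)
    for i in low_quality_idx:
        if abs(adjusted[i] - median) > max_dev:
            # Second value from the first value, the label and the set
            # deviation; median +/- max_dev is an assumption here.
            adjusted[i] = median + max_dev if adjusted[i] > median else median - max_dev
    # Spread of the N-L raw and L adjusted predictions around the label.
    n = len(adjusted)
    return (sum((a - age_label) ** 2 for a in adjusted) / n) ** 0.5
```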
8. An image processing method, characterized by comprising:
acquiring an image to be processed;
inputting the image to be processed into a network model, and performing face attribute recognition on the image to be processed through the network model to obtain an age value and a gender value corresponding to a face in the image to be processed, wherein the network model is obtained after being trained through the network model training method according to any one of claims 1 to 6.
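At inference time (claim 8) the trained model is applied directly; the (age, gender) output convention below matches the training sketch above and is an assumption, not the patent's interface.

```python
import torch

@torch.no_grad()
def process_image(model, image_tensor):
    """Claim 8 sketch: face-attribute inference with the trained model."""
    model.eval()
    pred_age, pred_gender = model(image_tensor.unsqueeze(0))  # add batch dim
    return pred_age.item(), pred_gender.item()
```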
9. A network model training apparatus, comprising:
the input module is used for respectively inputting N images included in the first training sample set into a network model to be trained, and performing face attribute recognition on the N images through the network model to be trained to obtain N predicted age values and N predicted gender values; the N images in the first training sample set have the same age label value and gender label value;
a determining module, configured to determine a first loss value and a first standard deviation loss value according to the N predicted age values and the age label value, and determine a first total loss value according to the first loss value and the first standard deviation loss value; and to determine a second loss value and a second standard deviation loss value according to the N predicted gender values and the gender label value, and determine a second total loss value according to the second loss value and the second standard deviation loss value;
and the adjusting module is used for adjusting the network parameters of the network model to be trained according to the first total loss value and the second total loss value so as to obtain the trained network model.
10. An execution device, comprising:
a memory for storing program instructions;
a processor, configured to retrieve the program instructions stored in the memory and to perform the method of any one of claims 1-8 in accordance with the retrieved program instructions.
11. A computer-readable storage medium having stored thereon computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-8.
CN202211533659.4A 2022-12-01 2022-12-01 Network model training method and image processing method Pending CN115862105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211533659.4A CN115862105A (en) 2022-12-01 2022-12-01 Network model training method and image processing method

Publications (1)

Publication Number Publication Date
CN115862105A true CN115862105A (en) 2023-03-28

Family

ID=85669087

Country Status (1)

Country Link
CN (1) CN115862105A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination