CN114005149A - Training method and device for target angle detection model
Publication number: CN114005149A
Application number: CN202010671486.7A (China)
Legal status: Pending
Classifications
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes
- G06T7/11: Region-based segmentation
- G06T7/70: Determining position or orientation of objects or cameras
- G06T2207/30204: Marker
Abstract
The application discloses a training method and a device of a target angle detection model, wherein the method comprises the following steps: inputting a sample image to an initial target angle detection model, and obtaining a prediction result according to the initial target angle detection model; inputting a prediction result to a weight network to obtain a loss weight; obtaining a loss value of the initial target angle detection model according to the loss weight, the prediction result and the labeling information of the sample image; and updating parameters in the weight network and parameters in the initial target angle detection model according to the loss values. By implementing the method and the device, the target detection precision and the target angle detection precision can be improved, and the generalization and the robustness of a target angle detection model are improved.
Description
Technical Field
The application relates to the field of Artificial Intelligence (AI), in particular to a training method and a device of a target angle detection model.
Background
Currently, computer vision technology is one of the most widely used technologies in the field of Artificial Intelligence (AI), and includes image classification, target detection, target angle detection, and the like. The target detection and target angle detection technologies have wide application prospects. For example, the detected target can be a human face, and face detection can be used for estimating the number of people, for security protection of public places (such as airports, customs and the like), and so on; face angle detection can be used for behavior analysis, gaze estimation, human-machine interaction, screening near-frontal faces for face recognition, and the like.
At present, target angle detection is mainly realized by constructing two models (a target detection model and a target angle detection model), that is, the position of a target is detected first and then target angle detection is carried out. The two models are trained separately, the training takes a long time, errors of the former model can influence the latter model, and the target angle detection precision is low.
Disclosure of Invention
The embodiment of the application discloses a training method and a training device for a target angle detection model, which can realize end-to-end target detection and target angle detection, improve the detection precision of the target detection and the target angle, and further increase the generalization and the robustness of the target angle detection model.
In a first aspect, an embodiment of the present application provides a method for training a target angle detection model, where the method includes: inputting a sample image to an initial target angle detection model, and obtaining a prediction result of the initial target angle detection model on the sample image, wherein the sample image comprises annotation information; inputting the prediction result into a weight network to obtain the loss weight output by the weight network; calculating a loss value of the initial target angle detection model according to the loss weight, the prediction result and the labeling information; and finally, updating parameters in the weight network and parameters in the initial target angle detection model according to the loss value.
According to the method, the loss weight of the initial target angle detection model is obtained through the weight network to assist training of the initial target angle detection model, the training time of the initial target angle detection model can be effectively shortened, the loss value of the model can reach a convergence state more quickly, and therefore the training efficiency of the initial target angle detection model is improved.
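For illustration only, the training step described in the first aspect can be sketched in PyTorch-style Python code as follows; the names model, weight_net, compute_loss and optimizer are hypothetical placeholders assumed for this sketch and are not prescribed by the application.

```python
# Minimal, hypothetical sketch of one training iteration (PyTorch-style).
import torch  # assumed deep-learning framework; the application does not prescribe one

def train_step(model, weight_net, compute_loss, optimizer, sample_image, labels):
    # 1. Input the sample image to the initial target angle detection model
    #    and obtain its prediction result.
    prediction = model(sample_image)

    # 2. Input the prediction result to the weight network to obtain the loss weight.
    loss_weight = weight_net(prediction)

    # 3. Obtain the loss value from the loss weight, the prediction result
    #    and the labeling information of the sample image.
    loss = compute_loss(loss_weight, prediction, labels)

    # 4. Update parameters in the weight network and in the detection model.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```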
Based on the first aspect, in a possible embodiment, the prediction result includes position prediction information and angle prediction information of the target in the sample image, and the inputting the prediction result into the weighting network to obtain the loss weight output by the weighting network specifically includes: inputting position prediction information to a first weight network to obtain a first loss weight; and inputting the angle prediction information into a second weight network to obtain a second loss weight.
The first loss weight represents a rate of contribution of a position prediction error of the initial target angle detection model to a loss value of the initial target angle detection model, and the second loss weight represents a rate of contribution of an angle prediction error of the initial target angle detection model to a loss value of the initial target angle detection model.
The weight network in the present application may learn from the loss value of the initial target angle detection model and dynamically adjust the first loss weight and the second loss weight during training.
Based on the first aspect, in a possible embodiment, the annotation information of the sample image includes position annotation information and angle annotation information of the target in the sample image, and the calculating a loss value of the initial target angle detection model according to the loss weight, the prediction result, and the annotation information specifically includes: calculating a first loss value according to the position prediction information and the position marking information; calculating a second loss value according to the angle prediction information and the angle marking information; and adding the product of the first loss weight and the first loss value and the product of the second loss weight and the second loss value to obtain a loss value of the initial target angle detection model.
And applying the first loss weight to the loss value (namely the first loss value) of the target detection branch and applying the second loss weight to the loss value (namely the second loss value) of the target angle detection branch so as to enable the obtained loss value of the initial target angle detection model to reach a convergence state more quickly, thereby improving the training speed of the model.
Based on the first aspect, in a possible embodiment, the initial target angle detection model includes a target feature network, a target detection branch, and a target angle detection branch; the target feature network is used for determining a plurality of target candidate frames in an input sample image, and the target detection branch and the target angle detection branch respectively take the plurality of target candidate frames output by the target feature network as inputs, the target detection branch is used for outputting position prediction information, and the target angle detection branch is used for outputting angle prediction information.
Based on a plurality of target candidate frames output by the target characteristic network, the target detection branch and the target angle detection branch can respectively and simultaneously carry out position detection and angle detection on the target, so that the detection time of an initial target angle detection model on the target position and the target angle is saved, and end-to-end target detection and target angle detection are realized.
Based on the first aspect, in a possible embodiment, updating the parameters in the weight network and the parameters in the initial target angle detection model according to the loss values specifically includes: parameters in the first weight network, the target detection branch and the target characteristic network are updated in sequence according to the loss value; and sequentially updating parameters in the second weight network, the target angle detection branch and the target characteristic network according to the loss value.
On the one hand, the updating of the parameters in the weighting network and the initial target angle detection model can learn the best loss weight to improve the training speed and efficiency of the initial target angle detection model; on the other hand, the updating of the parameters in the initial target angle detection model can effectively improve the detection precision of the target angle detection model on the target detection and the target angle detection.
Based on the first aspect, in a possible embodiment, before inputting the sample image to the initial target angle detection model, the method further comprises: carrying out position marking on a target in the sample image to obtain position marking information of the target; and detecting the angle of the target by adopting N angle models to obtain N angle values for the target corresponding to the position marking information, and obtaining the angle marking information of the target according to the N angle values, wherein N is a positive integer.
And fusing a plurality of angles obtained by adopting a plurality of angle models for the same target to obtain the angle marking information of the target. By adopting the angle marking method provided by the embodiment of the application, the marking precision of the target angle can be effectively improved, so that high-quality angle marking data can be obtained, the obtained angle marking information of the target is used for training an initial target angle detection model, and the detection precision of the model on the target angle can be improved.
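As a sketch of one possible fusion rule (the application does not fix the rule; simple averaging of the N angle values is assumed here for illustration):

```python
import numpy as np

def fuse_angle_labels(angle_values):
    """Fuse the N (pitch, yaw, roll) estimates produced by N angle models for
    the same target into one angle annotation. Averaging is an assumption."""
    angle_values = np.asarray(angle_values, dtype=np.float32)  # shape (N, 3)
    return angle_values.mean(axis=0)

# Example: three hypothetical angle models applied to the same face crop.
fused = fuse_angle_labels([[10.2, -5.1, 1.0],
                           [ 9.8, -4.7, 0.6],
                           [10.5, -5.4, 1.1]])  # -> roughly [10.17, -5.07, 0.9]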
Based on the first aspect, in a possible embodiment, after obtaining the prediction result of the initial target angle detection model on the sample image, the method further includes: and when the angle prediction error between the angle prediction information corresponding to the target candidate frame and the angle marking information corresponding to the target candidate frame is larger than the preset difference, increasing the coefficient of the angle prediction error corresponding to the target candidate frame in the loss function of the target angle detection branch.
In the training process of the initial target angle detection model, online mining of the angle difficult samples is added, so that the large-angle samples which are difficult to predict are selected for further training, the angle detection precision of the model on the large-angle target is improved, and the generalization and the robustness of the target angle detection model are increased.
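A minimal sketch of this online mining of angle-difficult samples follows; the error threshold and the enlarged coefficient are illustrative assumptions, since the embodiment only states that the coefficient is increased when the angle prediction error exceeds a preset difference.

```python
import torch

def angle_loss_with_hard_mining(angle_pred, angle_gt,
                                preset_difference=15.0, hard_coefficient=2.0):
    """Angle loss over face candidate frames; frames whose angle prediction
    error exceeds the preset difference get a larger coefficient.
    angle_pred, angle_gt: tensors of shape (num_candidate_frames, 3)."""
    per_frame_error = (angle_pred - angle_gt).abs().mean(dim=1)
    coefficient = torch.where(per_frame_error > preset_difference,
                              torch.full_like(per_frame_error, hard_coefficient),
                              torch.ones_like(per_frame_error))
    return (coefficient * per_frame_error).mean()
```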
Based on the first aspect, in a possible embodiment, the training method of the target angle detection model adopts a mixed precision mode.
The target angle detection model is trained based on the mixed precision mode, so that the occupation of parameters in a model training process to a memory space can be effectively reduced, and the precision loss caused by improper storage format of data in the model training process is greatly reduced.
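As one possible realization of the mixed precision mode, the sketch below uses PyTorch automatic mixed precision; the helper names are the same hypothetical placeholders used in the earlier training-step sketch.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def amp_train_step(model, weight_net, compute_loss, optimizer, images, labels):
    optimizer.zero_grad()
    with autocast():                      # forward pass runs partly in float16
        prediction = model(images)
        loss_weight = weight_net(prediction)
        loss = compute_loss(loss_weight, prediction, labels)
    scaler.scale(loss).backward()         # scale the loss to avoid gradient underflow
    scaler.step(optimizer)                # unscales gradients, then updates parameters
    scaler.update()
    return loss.item()
```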
In a second aspect, an embodiment of the present application provides an apparatus, including: the image prediction unit is used for inputting a sample image to the initial target angle detection model and obtaining a prediction result of the initial target angle detection model on the sample image, wherein the sample image comprises annotation information; the weight learning unit is used for inputting the prediction result to the weight network and obtaining the loss weight output by the weight network; the loss calculation unit is used for obtaining a loss value of the initial target angle detection model according to the loss weight, the prediction result and the labeling information; and the parameter updating unit is used for updating the parameters in the weight network and the parameters in the initial target angle detection model according to the loss values.
Based on the second aspect, in a possible embodiment, the prediction result specifically includes position prediction information and angle prediction information of the target in the sample image, and the image prediction unit is specifically configured to: inputting position prediction information to a first weight network to obtain a first loss weight; and inputting the angle prediction information into a second weight network to obtain a second loss weight.
Based on the second aspect, in a possible embodiment, the annotation information of the sample image includes position annotation information and angle annotation information of the target in the sample image, and the loss calculation unit is specifically configured to: calculating a first loss value according to the position prediction information and the position marking information; calculating a second loss value according to the angle prediction information and the angle marking information; and adding the product of the first loss weight and the first loss value and the product of the second loss weight and the second loss value to obtain the loss value of the initial target angle detection model.
Based on the second aspect, in a possible embodiment, the initial target angle detection model comprises a target feature network, a target detection branch and a target angle detection branch; the target feature network is used for determining a plurality of target candidate frames in an input sample image, and the target detection branch and the target angle detection branch respectively take the plurality of target candidate frames output by the target feature network as inputs, the target detection branch is used for outputting position prediction information, and the target angle detection branch is used for outputting angle prediction information.
Based on the second aspect, in a possible embodiment, the parameter updating unit is specifically configured to: parameters in the first weight network, the target detection branch and the target characteristic network are updated in sequence according to the loss value; and sequentially updating parameters in the second weight network, the target angle detection branch and the target characteristic network according to the loss value.
Based on the second aspect, in a possible embodiment, the apparatus further includes an image annotation unit, configured to perform position annotation on the target in the sample image to obtain position annotation information of the target; and detecting the angle of the target by adopting N angle models to obtain N angle values for the target corresponding to the position marking information, and obtaining the angle marking information of the target according to the N angle values, wherein N is a positive integer.
Based on the second aspect, in a possible embodiment, after obtaining the prediction result of the initial target angle detection model on the sample image, the loss calculating unit is further configured to: and when the angle prediction error between the angle prediction information corresponding to the target candidate frame and the angle marking information corresponding to the target candidate frame is larger than the preset difference, increasing the coefficient of the angle prediction error corresponding to the target candidate frame in the loss function of the target angle detection branch.
Based on the second aspect, in a possible embodiment, the training method of the target angle detection model adopts a mixed precision mode.
In a third aspect, an embodiment of the present application provides an apparatus, which includes a memory and a processor, where the processor and the memory are connected or coupled together through a bus; wherein the memory is used for storing program instructions; the processor invokes program instructions in the memory to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing program code for execution by an apparatus, the program code including instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program software product comprising program instructions that, when executed by an apparatus, cause the apparatus to perform the method of the first aspect or any possible embodiment of the first aspect. The computer software product may be a software installation package, which, in case it is required to use the method provided by any of the possible designs of the first aspect described above, may be downloaded and executed on a device to implement the method of the first aspect or any of the possible embodiments of the first aspect.
By implementing the embodiment of the application, the target position detection and the target angle detection are simultaneously carried out based on the target candidate frame so as to realize end-to-end target detection and target angle detection and improve the target detection and target angle detection precision; in addition, a weight network is added to assist the training of the initial target angle detection model, and the loss weight of the initial target angle detection model is optimized; and the generalization and the robustness of the target angle detection model are further increased by carrying out online mining on difficult samples in model training.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a face pose angle;
FIG. 2A is a schematic diagram of a model training for face and angle detection;
FIG. 2B is a schematic diagram of an application of a model for face and angle detection;
FIG. 3 is a system architecture diagram of an application provided by an embodiment of the present application;
fig. 4 is a flowchart of a training method of a face angle detection model according to an embodiment of the present application;
fig. 5 is a framework of a face angle detection model according to an embodiment of the present application;
fig. 6 is a schematic processing flow diagram of a face angle detection model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a training framework of a face angle detection model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a branch weight network provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of another branch weight network provided in an embodiment of the present application;
fig. 10 is a flowchart of a training method of a face angle detection model according to this embodiment of the present application;
FIG. 11 is a schematic diagram of a training framework of another face angle detection model according to an embodiment of the present application;
fig. 12A is a schematic view of a large-angle human face according to the present embodiment;
fig. 12B is a schematic diagram of a small-angle human face according to the present embodiment of the present application;
fig. 13 is a flowchart of a method for annotating angles of a human face according to the embodiment of the present application;
FIG. 14 is a schematic structural diagram of an apparatus provided in this embodiment of the present application;
fig. 15 is a functional structure diagram of an apparatus provided in this embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms "first", "second", and the like in the description and in the claims in the embodiments of the present application are used for distinguishing different objects, and are not used for describing a particular order. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
For the sake of understanding, the following description will be made about terms and the like that may be referred to in the embodiments of the present application.
(1) Target angle detection
The target angle detection is to detect the angle information of a target object in an image. For example, if the target is a face in an image, it is called face angle detection. Face angle detection (Face Angle Detection) may also be referred to as head pose estimation (Head Pose Estimation). Face angle detection includes two ways. One is rough estimation, that is, the range to which the face orientation belongs is deduced from a face image, for example: head up, head down, turned left, or turned right. The other is accurate estimation, that is, the specific pose angles of the face are obtained from a face image by calculating three Euler angles, namely yaw, pitch and roll. As shown in fig. 1, yaw is the angle generated by the left-right rotation of the face, pitch is the angle generated by the up-down rotation of the face, and roll is the angle generated by the in-plane rotation of the face.
It should be noted that the value range of each of the three Euler angles (pitch, yaw, roll) of the human face is [-180, +180]. In some possible embodiments, since the range of motion of the human head is limited, the ranges of the three Euler angles of the human face are, respectively: yaw in [-79.8, 75.3], pitch in [-60.4, 69.6], and roll in [-40.9, 36.3].
(2) Multitask learning
Multi-task Learning (MTL) is a paradigm of machine learning which guides the learning of a model by using the associated information contained in a plurality of related tasks, thereby improving the generalization performance of the model on all tasks. Multi-task learning is also a paradigm of inductive learning: implicit information among different tasks can be utilized, so that features learned by different tasks can be shared, and features that cannot be learned by a single task can be learned. Multi-task learning can perform multiple task analyses on the same input data simultaneously.
For example, assume that a multi-task learning model includes M tasks, shared hidden layers are used for learning features together, and a branch network of each sub-task is formed behind the shared hidden layers for learning features dedicated to a certain task. The loss output of each task can be used for jointly guiding the optimization of the network model, so that the finally learned characteristics have good generalization on each subtask.
In the related art, taking face angle detection as an example, face angle detection is mainly realized by constructing two models (a face detection model and an angle detection model). Referring to fig. 2A, fig. 2A is a flowchart of a training method for models for face and angle detection. It can be seen that the face detection model and the angle detection model are two separate models, and the two models are trained separately; the training process is tedious and takes a long time, and a plurality of images each containing only a single face are used as sample images when the two models are trained. Because the training of the face detection model and the angle detection model is separated, the error of the face detection model, which is executed first, affects the angle detection model, which is executed later, so that the accuracy of the angle detection model in detecting the face angle is reduced.
For the training flow shown in fig. 2A, in an actual application process, referring to fig. 2B, it is necessary to first determine the unique labeled frame of each face with a different identity in an image by using the model for face detection, and then perform angle prediction on the face in each labeled frame by using the model for angle detection, so that face detection and angle detection have an obvious order. On the one hand, face angle detection takes a long time and is inefficient; on the other hand, the accuracy of the face angle detection model is affected by the face detection model.
The application provides a structure of a target angle detection model and a training method of the target angle detection model. The detection accuracy of the target angle detection model can be higher. It should be understood that in some embodiments of the present application, the face angle detection is taken as an example of the target angle detection, and actually the solution of the present application can be applied to various targets, for example: human faces, vehicles, animal head portraits, etc.
A system architecture to which the embodiments of the present application apply is described below. Referring to fig. 3, fig. 3 exemplarily shows a system architecture provided by an embodiment of the present application, where the data obtaining device 160 is configured to obtain training data, where the training data in the present application includes a plurality of sample images carrying annotation information, where each sample image has a plurality of targets, and the annotation information includes position annotation information and angle annotation information of the corresponding target in the sample image. Optionally, the annotation information may be obtained by the data obtaining device 160 obtaining position annotation information by performing position annotation on the object in the obtained original image, and then performing angle annotation on the object with the position annotation information to obtain angle annotation information. In some possible embodiments, the data acquiring device 160 and the training device 120 may be one device, where the data acquiring device 160 is configured to acquire a sample image containing at least one target, and the training device 120 is configured to process the sample image to obtain the above-mentioned annotation information, which is not limited in this application.
Optionally, in some possible embodiments, when performing angle labeling on a target in a sample image, position labeling of the target in the sample image needs to be completed first, and angle labeling is performed only for a target whose target labeling frame (corresponding to one piece of position labeling information) passes labeling-frame size and blur checks, so as to screen out target images that are appropriately sized and clear to assist in training the initial target angle detection model, thereby improving the training efficiency of the model.
Optionally, the data obtaining device 160 may store the training data into the database 130, and the training device 120 trains the initial target angle detection model based on the training data maintained in the database 130, so as to finally obtain the target angle detection model 101. The data acquisition device 160 may also send training data directly to the training device 120 to cause the training device 120 to train the initial target angle detection model based on the training data to obtain the target angle detection model 101.
The following describes the training device 120 obtaining the target angle detection model 101 based on the training data, and training the initial target angle detection model to realize the function of face angle recognition. The initial target angle detection model comprises a target feature network, a target detection branch and a target angle detection branch. Inputting a sample image into an initial target angle detection model, determining target candidate frames through target characteristic network detection, wherein each target in the sample image corresponds to at least one target candidate frame, a target detection branch and a target angle detection branch respectively predict a target position and a target angle by taking the target candidate frames as input to obtain a prediction result, inputting the prediction result into a weight network to obtain a loss weight, obtaining a loss value of the initial target angle detection model according to the prediction result, the tagging information and the loss weight due to the fact that the sample image carries tagging information, and finally updating parameters in the weight network and parameters in the initial target angle detection model according to the loss value until the loss value converges or a change amount is within an allowable value, namely completing the training of the initial target angle detection model, thereby obtaining the target angle detection model 101. The specific training process of the initial target angle detection model is described in detail later.
Because the angles of the targets included in the sample images input to the training device 120 differ, for example, when the target is a face, some faces directly face the camera when the image is captured, so the face appears more frontal in the sample image, i.e., the deflection angle is small, face angle prediction is easy, and the difference between the predicted face angle and the corresponding angle labeling information is small; however, the deflection angle of some faces is large, and the difference between the predicted face angle and the corresponding angle labeling information is large; such faces are called large-angle faces, and angle prediction for large-angle faces is difficult. Optionally, in the process of training the initial target angle detection model, the training device 120 may perform online mining of angle-difficult samples, pick out large-angle faces as angle-difficult samples, and make the angle-difficult samples further participate in training, so as to improve the angle detection precision of the model for large-angle faces.
The target angle detection model for face angle detection may also be referred to as a face angle detection model. If the annotation information of the sample image includes the position annotation information and the angle annotation information of the face, the target angle detection model 101 obtained after the training can be used for detecting the face in the given image and the face angles of the face. The application process of the target angle detection model 101 is specifically as follows: the user terminal 140 inputs the image to be processed to the computing module 111 of the execution device 110 through the I/O interface 112, the computing module 111 inputs the image to be processed to the target angle detection model 101 to obtain the position prediction information and the angle prediction information of the face output by the model, and finally outputs the obtained position prediction information and the angle prediction information of the face to the user terminal 140 through the I/O interface 112.
It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily acquired from the data acquisition device 160, and may also be acquired from other devices, for example, the training device 120, and the training device 120 may also label the sample image to obtain labeling information, then store the sample image carrying the labeled sample information into the database 130, and acquire the sample image from the database 130 when needed.
It should be noted that, the training device 120 does not necessarily perform the training of the face angle detection model 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training, and the above description should not be taken as a limitation to the embodiments of the present application. The training device 120 may exist separately from the execution device 110 or may be integrated within the execution device 110.
The target angle detection model 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 3, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, a camera, or the like, and may also be a server or a cloud device. In fig. 3, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the user terminal 140, where the input data may include: the image to be processed is input by the user terminal.
In the process that the execution device 110 processes data input by the user terminal 140, or in the process that the calculation module 111 of the execution device 110 executes related tasks such as calculation, the execution device 110 may call data, codes, and the like in the data storage system 150 for corresponding processing, and may also store data, instructions, and the like obtained by corresponding processing in the data storage system 150.
It should be noted that the training device 120 may generate corresponding target angle detection models 101 for different targets or different tasks based on different training data, and the target angle detection models 101 may be used to achieve the above targets or complete the above tasks, so as to provide the user with a desired result, for example, in this application, the user may be provided with the face position and the face angle detected in a given image of the user.
It should be noted that the target angle detection model provided in the embodiment of the present application may be used to detect the position of a target in an image and the angle of the target, where the target may be an inanimate object such as a building, a vehicle, or a fan, and the target may also be a living object or a part of a living object such as a person's head, a human face, or an animal. In the following description, the target is exemplified by a human face, but the target angle detection model provided in the present application is not limited to position and angle detection of human faces only. In addition, a target angle detection model used for face angle detection may also be referred to as a face angle detection model.
Referring to fig. 4, based on the system architecture described above, a training method of a face angle detection model provided in the embodiment of the present application is described below, it should be noted that the face angle detection model is only an example of a target angle detection model, and the target angle detection model provided in the embodiment of the present application is not limited to be only suitable for detecting a face angle. The method includes, but is not limited to, the steps of:
and S101, inputting the sample image into an initial face angle detection model to obtain a prediction result.
In the embodiment of the application, the sample image is input into the initial face angle detection model, and the prediction result is obtained according to the initial face angle detection model. The prediction result comprises position prediction information and angle prediction information of at least one human face. It should be noted that each input sample image includes at least one face, and the initial face angle detection model may perform position prediction and angle prediction of the face simultaneously on the face in the sample image. Note that one piece of position prediction information represents one face prediction box.
To more clearly illustrate the processing flow of the initial face angle detection model, referring to fig. 5, fig. 5 is a framework of an initial face angle detection model provided in an embodiment of the present application, and as shown in fig. 5, the initial face angle detection model includes three parts, namely a face feature network, a face detection branch, and a face angle detection branch, where the face feature network is used to detect a region suspected to be a face in a sample image to obtain a plurality of face candidate frames, and each face candidate frame represents position candidate information of a face. The face detection branch and the face angle detection branch respectively take face candidate frames output by a face feature network as input, and the face detection branch determines position prediction information of the face based on position candidate information of a plurality of face candidate frames of each face; and the face angle detection branch is used for determining angle prediction information corresponding to the position candidate information based on the position candidate information of each face candidate frame.
Based on a plurality of face candidate frames output by the face feature network, the face detection branch and the face angle detection branch can respectively and simultaneously carry out face position detection and face angle detection, so that the detection time of an initial face angle detection model on the face position and the face angle is saved, and end-to-end face detection and face angle detection are realized.
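For illustration only, the three-part structure described above can be sketched as follows; the layer sizes are arbitrary and the candidate-frame mechanism is simplified to dense per-location outputs, so this is a sketch under stated assumptions rather than the exact structure of the model.

```python
import torch
import torch.nn as nn

class FaceAngleDetectionModel(nn.Module):
    """Hypothetical sketch: a shared face feature network feeding a face
    detection branch and a face angle detection branch in parallel."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.feature_net = nn.Sequential(   # face feature network (backbone)
            nn.Conv2d(3, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.LeakyReLU(0.1),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim), nn.LeakyReLU(0.1),
        )
        self.detection_branch = nn.Conv2d(feat_dim, 4, 1)  # box offsets per location
        self.angle_branch = nn.Conv2d(feat_dim, 3, 1)      # (pitch, yaw, roll) per location

    def forward(self, image):
        feat = self.feature_net(image)
        position_pred = self.detection_branch(feat)  # position prediction information
        angle_pred = self.angle_branch(feat)         # angle prediction information
        return position_pred, angle_pred

# Both branches consume the same shared features, so position and angle are
# predicted in a single end-to-end forward pass.
model = FaceAngleDetectionModel()
pos, ang = model(torch.randn(1, 3, 224, 224))
```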
Exemplarily, referring to fig. 6, fig. 6 is a schematic processing flow diagram of an initial face angle detection model provided in an embodiment of the present application, where sample image a includes two faces. Assume that sample image a yields 6 face candidate frames through the face feature network of the initial face angle detection model; as can be seen from fig. 6, each face corresponds to 3 face candidate frames. The face detection branch and the face angle detection branch of the initial face angle detection model respectively perform face position prediction and face angle prediction based on the six face candidate frames. As can be seen from fig. 6, in the output of the face detection branch, each face in sample image a corresponds to one face prediction frame, and in the output of the face angle detection branch, each face candidate frame has corresponding angle prediction information; that is, the face angle detection branch outputs 6 pieces of angle prediction information: 1 (p1, y1, r1), 2 (p2, y2, r2), 3 (p3, y3, r3), 4 (p4, y4, r4), 5 (p5, y5, r5) and 6 (p6, y6, r6), wherein 1 (p1, y1, r1), 2 (p2, y2, r2) and 3 (p3, y3, r3) correspond to the same face, and 4 (p4, y4, r4), 5 (p5, y5, r5) and 6 (p6, y6, r6) correspond to the same face.
It should be noted that the number of face candidate frames obtained according to the sample image is greater than the number of face labeling frames in the sample image, and a large number of face candidate frames are beneficial to improving the recall rate of face detection. An allocation process exists between the face candidate frame and the face labeling frame, for example, allocation is performed according to the size of the intersection ratio between the face candidate frame and the face labeling frame, and the larger the intersection ratio between the face candidate frame a and the face labeling frame b is, the higher the possibility that the face candidate frame a corresponds to the face labeling frame b is.
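A sketch of the intersection-over-union measure used for this assignment, assuming boxes in (x1, y1, x2, y2) form:

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The larger box_iou(candidate_frame, labeling_frame) is, the more likely the
# face candidate frame corresponds to that face labeling frame.
```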
The face feature network comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer represents input data, the input data is a sample image comprising position marking information and angle marking information of a face, and the output layer outputs a feature map of the sample image. It should be noted that the output may be a single Feature map, or may be a multi-scale Feature map, or a multi-scale Feature map after FPN (Feature Pyramid Networks) fusion, and the present invention is not limited in particular. The convolutional layer may include a plurality of convolution operators, which function as filters to extract specific information from the input data, and the pooling layer functions to reduce the number of training parameters, increase the training speed of the network, and thus is often added periodically after the convolutional layer. In one implementation, the convolutional and pooling layers may be spaced adjacent, i.e., each convolutional layer is followed by a pooling layer; in another implementation, a plurality of convolutional layers may be followed by a pooling layer. Thus, the output of a convolutional layer may be used as input to a subsequent pooling layer, as well as to another convolutional layer to continue the convolution operation. The number of convolutional/pooling layers is not specifically limited in this application. In addition, Batch Normalization (BN) processing and activation function (such as Leaky ReLU) processing can be performed after each convolutional layer, the Batch Normalization processing can accelerate the convergence speed of the model and improve the generalization capability of the model, and the activation function can solve the problem that training cannot be continued due to gradient dispersion in the training process caused by the increase of the number of network layers.
Based on the feature map extracted by the face feature network, the face detection branch and the face angle detection branch further extract the face and angle related features respectively. It should be noted that the face feature network in the initial face angle detection model may use some classical feature extraction networks, such as Resnet, VGG, etc., and may also design a network structure autonomously as needed, which is not specifically limited in this application.
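As a sketch of one such convolution unit (convolution, batch normalization, Leaky ReLU, and an optional pooling layer); the channel numbers and the pooling choice are assumptions, not the patent's exact backbone:

```python
import torch.nn as nn

def conv_block(in_channels, out_channels, kernel_size=3, pool=False):
    """One possible conv -> BN -> LeakyReLU unit for the face feature network."""
    layers = [
        nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(out_channels),  # accelerates convergence, improves generalization
        nn.LeakyReLU(0.1),             # mitigates gradient dispersion in deep stacks
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))  # reduces parameters and speeds up training
    return nn.Sequential(*layers)
```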
The face detection branch comprises an input layer, a convolution layer and a pooling layer and is used for further extracting face related information on the basis of the face feature network. In a specific implementation, the face detection branch predicts an offset value for each face candidate frame, and determines a face prediction frame corresponding to the face by combining the offset value and the face candidate frame. The introduction of the offset value allows the network to predict an equilibrium value, thereby allowing the network to converge more quickly. The related descriptions of the convolutional layer and the pooling layer can refer to the related descriptions in the face feature network, and are not described in detail herein.
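The offset parameterization is not specified in the text; the sketch below assumes a common center/size offset convention purely for illustration.

```python
import math

def decode_face_box(candidate, offsets):
    """Decode a face prediction frame from a face candidate frame and offsets.

    `candidate` is (x1, y1, x2, y2); `offsets` is (dx, dy, dw, dh). The
    center/size parameterization here is an assumed convention, not the exact
    scheme used by the application.
    """
    cx = (candidate[0] + candidate[2]) / 2.0
    cy = (candidate[1] + candidate[3]) / 2.0
    w = candidate[2] - candidate[0]
    h = candidate[3] - candidate[1]

    new_cx = cx + offsets[0] * w           # shift the center
    new_cy = cy + offsets[1] * h
    new_w = w * math.exp(offsets[2])       # rescale width and height
    new_h = h * math.exp(offsets[3])

    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)
```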
The face angle detection branch is used for further extracting face angle related information on the basis of the face feature network. The face angle detection branch predicts angle prediction information for each face candidate frame. In some possible embodiments, normalization processing may be performed on the obtained angle prediction information, so that the three Euler angles in the angle prediction information are more balanced, which is beneficial to improving the convergence speed of network training. The network structure of the face angle detection branch is similar to that of the face detection branch; the differences lie in the number of convolutional layers and pooling layers, the selected convolution operators, and the like.
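One possible normalization of the three Euler angles is sketched below; dividing by 180 degrees is an assumed choice, not mandated by the application.

```python
def normalize_angles(pitch, yaw, roll, scale=180.0):
    """Map Euler angles (in degrees) to roughly [-1, 1] so the three angle
    regression targets are balanced during training."""
    return pitch / scale, yaw / scale, roll / scale

def denormalize_angles(p, y, r, scale=180.0):
    """Map normalized predictions back to degrees."""
    return p * scale, y * scale, r * scale
```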
In the embodiment of the application, the sample image is labeled with both the face frame and the face angle, wherein the face frame labeled in the sample image can also be called a face labeling frame, each face labeling frame represents the position labeling information of one face, and the face angle labeled in the sample image can also be called angle labeling information. The position marking information of the human face is used for calculating the position prediction error of the human face detection branch, and the angle marking information is used for calculating the angle prediction error of the human face angle detection branch. The position labeling information and the angle labeling information of the human face are labeling information of the sample image.
Illustratively, when the face labeling frame is a rectangular frame, the face labeling frame can be represented by two diagonal vertices of the rectangular frame, for example, (x1, y1) is the coordinate of the upper left corner of the rectangular frame in the whole face image, and (x2, y2) is the coordinate of the lower right corner of the rectangular frame.
Illustratively, when the face labeling frame is a rectangular frame, the face labeling frame may also be represented by (X, Y, w, h), where X represents a value of an X-axis coordinate of an upper left corner of the face labeling frame, Y is a Y-axis coordinate value of the upper left corner of the face labeling frame, w is a width of the face labeling frame, and h is a height of the face labeling frame.
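The two representations described above carry the same information; as a small illustrative sketch, they can be converted into one another as follows.

```python
def corners_to_xywh(x1, y1, x2, y2):
    """(x1, y1, x2, y2) diagonal-corner form -> (x, y, w, h) top-left/size form."""
    return x1, y1, x2 - x1, y2 - y1

def xywh_to_corners(x, y, w, h):
    """(x, y, w, h) top-left/size form -> (x1, y1, x2, y2) diagonal-corner form."""
    return x, y, x + w, y + h
```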
S102, inputting the prediction result to a weight network to obtain a loss weight.
In the embodiment of the application, the prediction result is input into a weight network to obtain a loss weight, wherein the loss weight is the loss weight of the initial face angle detection model.
Specifically, the prediction result includes position prediction information and angle prediction information of the face, where the position prediction information of the face is output by a face detection branch, the angle prediction information of the face is output by a face angle detection branch, and the weighting network includes a first weighting network and a second weighting network, and therefore, inputting the prediction result to the weighting network to obtain the loss weight specifically means: inputting position prediction information to a first weight network to obtain a first loss weight; and inputting the angle prediction information into a second weight network to obtain a second loss weight. It should be noted that the first loss weight corresponds to the face detection branch, and the second loss weight corresponds to the face angle detection branch.
The first loss weight represents the importance degree of the loss of the face detection branch, and the second loss weight represents the importance degree of the loss of the face angle detection branch. In the training process of the same batch of sample images, for the face detection branch, the influence of the first loss weight on the position prediction error corresponding to each face in the batch of sample images is the same; for the face angle detection branch, the influence of the second loss weight on the angle prediction error corresponding to each face in the batch of sample images is the same. It should be noted that, in the training process of the model, the weight network dynamically changes the first loss weight and the second loss weight so that the loss of each detection branch in the initial face angle detection model is in the same gradient scale, thereby improving the detection precision of the model on the face and the angle and improving the training speed of the model.
Referring to fig. 7, a weighting network module is added on the basis of the initial face angle detection model shown in fig. 5, and the weighting network includes a first weighting network and a second weighting network. A face detection branch in the initial face angle detection model is connected with the first weighting network, and a face angle detection branch in the initial face angle detection model is connected with the second weighting network. It should be noted that, the first weighting network or the second weighting network is a branch weighting network in the weighting networks.
The input to the first weight network is the prediction output of the face detection branch, which is assumed to be y1; the first weight network outputs a first loss weight w1. The process can be represented by formula (1), where f1 represents the first weight network:

w1 = f1(y1) (1)

The input to the second weight network is the prediction output of the face angle detection branch, which is assumed to be y2; the second weight network outputs a second loss weight w2. The process can be represented by formula (2), where f2 represents the second weight network:

w2 = f2(y2) (2)
each branch weight network in the weight network comprises at least one convolutional layer and a branch from an input of an initial convolutional layer to an output of a last convolutional layer in the at least one convolutional layer, the branch superimposing the input of the initial convolutional layer and the output of the last convolutional layer. In other words, the branches of the branch weight network superimpose the input of the branch weight network and the output of the input after the convolutional layer processing of the branch weight network, so as to prevent the loss of the detection branch where the branch weight network is located from gradient disappearance. The first weight network or the second weight network is a branch weight network of the weight networks. The parameters of the convolution layer in different branch weight networks are different (e.g., the values of the convolution kernel).
Referring to fig. 8, fig. 8 exemplarily provides a structure diagram of a branch weight network, and it can be seen that the branch weight network in fig. 8 includes a convolutional layer and branches of the branch weight network input to outputs of the branch weight network, and outputs of the convolutional layer are overlapped with outputs of the branches. Fig. 8 is an example of a branch weight network, and the number of convolutional layers in the branch weight network is not limited to only 1.
In some possible embodiments, each branch weight network in the weight network further includes a pooling layer, as shown in fig. 9, an output of a detection branch is input into the branch weight network, the branch weight network superimposes the output of the detection branch and an output of the detection branch after being processed by a corresponding convolutional layer, and finally, a result after the superimposition is input into the pooling layer, so as to reduce a spatial size of a parameter in the model training, reduce a calculation amount, and prevent overfitting from occurring in the training process.
For a branch weight network, each convolution layer may include a plurality of convolution kernels, which function as a filter for extracting specific information from input data, the size of the adopted convolution kernels may be 3 × 3, 5 × 5, 7 × 7 or other sizes, the number of convolution kernels per layer may be 64, 128, 256, 512 or other values, and the size of the convolution kernels and the number of the convolution kernels are not particularly limited in the embodiments of the present application.
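For illustration only, one branch weight network following the structure of figs. 8 and 9 can be sketched as below; the channel count and the reduction of the pooled feature to a single scalar loss weight are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class BranchWeightNetwork(nn.Module):
    """Sketch of one branch weight network: a convolutional layer whose output
    is superimposed on the branch input via a skip connection, followed by a
    pooling layer (figs. 8 and 9)."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, branch_output):
        x = self.conv(branch_output) + branch_output  # skip connection prevents vanishing gradients
        x = self.pool(x)                              # pooling reduces the spatial size
        return x.mean()                               # scalar loss weight (w1 or w2)
```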
It should be noted that the initialization strategy of the weight network may be obtained by pre-training or may be set manually. The training strategy may be to train the weight network throughout, or to freeze it after a period of training. The weight values may be shifted by a certain value or normalized as needed to prevent the gradient from vanishing, and so on. Therefore, in order to improve the training accuracy, some feasible training strategies may be added; this is not specifically limited in this application.
S103, obtaining a loss value of the initial face angle detection model according to the loss weight, the prediction result and the labeling information.
In the embodiment of the application, the loss weight of the initial face angle detection model includes a first loss weight corresponding to the face detection branch and a second loss weight corresponding to the face angle detection branch, that is, the loss value of the initial face angle detection model is obtained according to the first loss weight, the second loss weight, the prediction result and the label information.
Specifically, a first loss value (which may also be referred to as a loss value of a face detection branch) is obtained according to position prediction information of a face in the prediction result and position labeling information of the face in the labeling information, a second loss value (which may also be referred to as a loss value of a face angle detection branch) is obtained according to angle prediction information of the face in the prediction result and angle labeling information of the face in the labeling information, and finally, a loss value of the initial face angle detection model is obtained according to the first loss weight and the first loss value, the second loss weight and the second loss value, for example, the product of the first loss weight and the first loss value and the product of the second loss weight and the second loss value are added to obtain a loss value of the initial target angle detection model.
The loss value of the face detection branch indicates a sum of position prediction errors between the position prediction information of each face and the position labeling information corresponding to each face, and the loss value of the face angle detection branch indicates a sum of angle prediction errors between the angle prediction information corresponding to each face candidate frame and the angle labeling information corresponding to each face candidate frame, where the errors may be mean absolute errors, root-mean-square errors or other error expressions, and the present application is not specifically limited.
Illustratively, the loss function of the face detection branch can be expressed as formula (3). It should be noted that formula (3) only gives an exemplary loss function expression of the face detection branch:

L_face = (1/(B·K)) Σ_b Σ_k √([X_YF(b, k) − X_BF(b, k)]²)   (3)

where B represents the number of sample images in each batch input into the model, K represents the number of faces in each sample image, X_YF(b, k) represents the position coordinates of the face prediction frame of the k-th face in the b-th sample image, X_BF(b, k) represents the position coordinates of the face labeling frame of the k-th face in the b-th sample image, and L_face represents the mean value of the root-mean-square error between each face prediction frame and the face labeling frame corresponding to that face prediction frame.
Illustratively, the loss function of the face angle detection branch can be expressed as formula (4). It should be noted that formula (4) only gives an exemplary loss function expression of the face angle detection branch:

L_angle = (1/B) Σ_b (1/N_b) Σ_n (1/M_bn) Σ_i |X_JY(b, n, i) − X_JB(b, n)|   (4)

where B represents the number of sample images in each batch input into the model, N_b represents the number of faces with angle labeling information in the b-th sample image, M_bn represents the number of face candidate frames of the n-th face in the b-th sample image, X_JY(b, n, i) represents the angle prediction information of the i-th face candidate frame of the n-th face in the b-th sample image, X_JB(b, n) represents the angle labeling information corresponding to the face labeling frame of the n-th face in the b-th sample image, and L_angle represents the mean absolute error between the angle prediction information of each face candidate frame and the corresponding angle labeling information.
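For illustration only, the two branch losses described by formulas (3) and (4) can be sketched as follows. The tensor shapes and reduction order are assumptions consistent with the symbol definitions above (B sample images per batch, K faces per image, several angle predictions per face), not an exact reproduction of the formulas.

    import torch

    def face_detection_loss(pred_boxes, gt_boxes):
        # pred_boxes, gt_boxes: [B, K, 4] position coordinates of face prediction /
        # labeling frames; mean of the per-face root-mean-square error.
        rmse_per_face = torch.sqrt(((pred_boxes - gt_boxes) ** 2).mean(dim=-1))
        return rmse_per_face.mean()

    def face_angle_loss(pred_angles, gt_angles):
        # pred_angles: [B, N, M, 3] (yaw, pitch, roll) per face candidate frame;
        # gt_angles:   [B, N, 3] angle labeling information per face;
        # mean absolute error between each candidate-frame prediction and its label.
        return (pred_angles - gt_angles.unsqueeze(2)).abs().mean()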
In addition, the loss value of the initial face angle detection model is calculated according to the first loss weight, the second loss weight, the loss value of the face detection branch and the loss value of the face angle detection branch. After the output of the initial face angle detection model passes through the weight network shown in fig. 7, the loss function of the initial face angle detection model can be expressed as formula (5); that is, the loss function of the face detection branch and the loss function of the face angle detection branch are weighted and summed. It should be noted that formula (5) is only an example, and the loss function of the present application is not limited to formula (5).
L_sum = (1 + w1)·L_face + (1 + w2)·L_angle   (5)
Referring to formula (5), L_sum represents the loss function of the initial face angle detection model, i.e., the loss shown in fig. 7; (1 + w1)·L_face is the final face detection loss shown in fig. 7, and (1 + w2)·L_angle is the final angle detection loss shown in fig. 7, where L_face represents the loss function of the face detection branch, w1 is the first loss weight, and the expression of w1 can refer to the above formula (1); L_angle represents the loss function of the face angle detection branch, w2 is the second loss weight, and the expression of w2 can refer to the above formula (2); the expression of L_face can further refer to the above formula (3), and the expression of L_angle can further refer to the above formula (4).
Optionally, for the angle labeling information in the labeling information, only faces meeting the face size condition and the sharpness condition have corresponding angle labeling information; it can be understood that some faces in the sample image only have face labeling information, while other faces have both face labeling information and angle labeling information. Therefore, the loss value of the face angle detection branch represented by formula (4) is calculated based on the faces having angle labeling information. For the angle labeling of the faces in the sample image, reference may be made to the following description, and for brevity of the description, details are not repeated here.
And S104, updating parameters in the weight network and parameters in the initial face angle detection model according to the loss values.
In the embodiment of the application, after the loss value of the initial face angle detection model is obtained, the parameters in the weight network and the parameters in the initial face angle detection model are updated through backpropagation based on the loss value. Specifically, the loss value of the initial face angle detection model represents a weighted sum of the position prediction error of the face and the angle prediction error of the face, and updating the parameters in the weight network and the parameters in the initial face angle detection model according to the loss value means: on one hand, parameters in the first weight network, the face detection branch and the face feature network are updated backwards in sequence according to the loss value; on the other hand, parameters in the second weight network, the face angle detection branch and the face feature network are updated backwards in sequence according to the loss value, until the loss value of the initial face angle detection model reaches a convergence state.
The updating of the parameters in the weight network can quickly learn the optimal loss weight of the initial face angle detection model so as to improve the training speed and efficiency of the initial face angle detection model; and the updating of the parameters in the initial face angle detection model can effectively improve the detection precision of the face angle detection model on the face detection and the face angle detection.
It should be noted that, when the loss value of the initial face angle detection model converges, or its change is within an allowable amount, the training of the weight network and the initial face angle detection model is completed, so that the parameters in the first weight network and the second weight network of the weight network no longer change, and the parameters in the initial face angle detection model no longer change.
Illustratively, the loss value of the initial face angle detection model is backpropagated based on a gradient descent method to update the parameters of the weight network and the parameters of the initial face angle detection model until the loss value reaches a convergence state, so that the position prediction information and the angle prediction information output by the initial face angle detection model are closer to the real position labeling information and angle labeling information, and the performance of the initial face angle detection model reaches an optimal state.
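A condensed, assumption-laden sketch of one training iteration as described above (forward pass, loss weights from the two weight networks, weighted sum as in formula (5), and a single backward pass that updates both the weight networks and the detection model) might look as follows. The optimizer choice, learning rate, assumed model outputs, and the helper functions face_detection_loss / face_angle_loss from the earlier sketch are illustrative, not mandated by this application.

    # model: initial face angle detection model; w_net1 / w_net2: first / second weight networks
    params = list(model.parameters()) + list(w_net1.parameters()) + list(w_net2.parameters())
    optimizer = torch.optim.SGD(params, lr=1e-3)   # gradient descent; learning rate is an assumption

    for images, gt_boxes, gt_angles in dataloader:
        pred_boxes, pred_angles, feat1, feat2 = model(images)   # assumed model outputs
        w1 = w_net1(feat1)                                       # first loss weight
        w2 = w_net2(feat2)                                       # second loss weight
        l_face = face_detection_loss(pred_boxes, gt_boxes)       # first loss value
        l_angle = face_angle_loss(pred_angles, gt_angles)        # second loss value
        loss = (1 + w1) * l_face + (1 + w2) * l_angle            # weighted sum as in formula (5)
        optimizer.zero_grad()
        loss.backward()                                          # backpropagation of the loss value
        optimizer.step()                                         # update weight networks and model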
It should be noted that updating the parameters in the first weight network refers to updating the parameters of the convolutional layer, or of the convolutional layer and the pooling layer, in the first weight network, thereby adjusting the first loss weight; updating the parameters in the second weight network refers to updating the parameters of the convolutional layer, or of the convolutional layer and the pooling layer, in the second weight network, thereby adjusting the second loss weight; updating the face detection branch and the face feature network, and updating the face angle detection branch and the face feature network, belong to the updating of the initial face angle detection model, that is, the parameters of each layer (convolutional layer, pooling layer and the like) of the corresponding part in the initial face angle detection model are updated.
The weight network learns the loss weight of each detection branch in the initial face angle detection model based on the loss value of the initial face angle detection model, so that the dynamic change of the loss weight in the training of the initial face angle detection model is realized.
It should be noted that the weight network is only used in the training process of the initial face angle detection model; after the training of the initial face angle detection model is finished, the trained initial face angle detection model is called a face angle detection model, and the face angle detection model can be applied directly. Illustratively, the face angle detection model can detect the angle of a face from a face image collected by a camera, and faces with a suitable angle are screened out according to the detected face angle and input into a face recognition model for recognition, so as to improve the accuracy of face recognition. In addition, the face angle detection model can be applied to fields such as attention monitoring of subjects such as long-distance drivers or students, human behavior analysis, and human-computer interaction.
It should be noted that after the training of the initial face angle detection model is finished, the trained initial face angle detection model is also called a face angle detection model. The server (or training device) for model training may send the face angle detection model to the server or application device for model application, and the like. In some possible embodiments, if the device has both functions of training the model and applying the model, the device may directly apply the face angle detection model.
By implementing the embodiment of the application, the multi-task learning model is adopted to predict the face position and the face angle simultaneously based on the face candidate frame, so that the detection efficiency and the detection precision of the face and the face angle are improved. In addition, the weight network is added to assist the training of the initial face angle detection model, the loss weight of the initial face angle detection model is optimized, the training time of the model is effectively reduced, the training efficiency of the model is improved, and the adaptability of the model is improved.
Referring to fig. 10, fig. 10 is a block diagram illustrating another training method for a face angle detection model according to an embodiment of the present disclosure. It should be noted that the face angle detection model is only an example of the target angle detection model, and the target angle detection model provided in the embodiment of the present application is not limited to be only applicable to face angle detection. In addition, the embodiment of fig. 10 may be independent of the embodiment of fig. 4, or may be a supplement to the embodiment of fig. 4. The method includes, but is not limited to, the steps of:
S201, inputting a sample image into an initial face angle detection model to obtain a prediction result. This step may specifically refer to the related description of S101 in the embodiment of fig. 4, and is not described herein again.
S202, carrying out angle difficult sample mining according to angle prediction errors between the angle prediction information in the prediction result and the corresponding angle marking information.
In the embodiment of the application, in order to improve the robustness of the initial face angle detection model in detecting the face angle or face pose angle, so that the trained model is suitable for detecting face angles in various angle ranges, a step of mining angle-difficult samples can be added for the face angle detection branch after the initial face angle detection model outputs the prediction result and before the loss value of the initial face angle detection model is obtained. That is, difficult samples are mined according to the angle prediction error between the angle prediction information and the corresponding angle labeling information, so as to improve the angle detection precision of the model for large-angle faces and improve the generalization and robustness of the face angle detection model.
Referring to fig. 11, a step of mining a difficult sample is added to the block diagram of the model training structure shown in fig. 7, and as can be seen from fig. 11, the mining of the difficult sample includes mining of a face difficult sample and mining of an angle difficult sample. The difficult sample mining is beneficial to improving the robustness and the detection precision of the model detection face position and the face angle.
Specifically, the mining of angle-difficult samples mainly comprises two steps: step one, calculating the angle prediction error between the angle prediction information corresponding to each face candidate frame and the angle labeling information corresponding to that face candidate frame; step two, when the angle prediction error is larger than a preset difference amount, increasing the coefficient of the angle prediction error corresponding to that face candidate frame in the loss function of the face angle detection branch. It should be noted that, when the angle prediction error corresponding to a certain face candidate frame is greater than the preset difference amount, the face corresponding to that face candidate frame is an angle-difficult sample. In addition, the angle prediction error may be represented by a Euclidean distance, a Manhattan distance, or another measurement error, and the present application is not specifically limited.
For example, the manner for determining the angle-difficult sample may be: after the face angle detection branch outputs angle prediction information aiming at a certain face candidate frame output by the face feature network, calculating the Euclidean distance between the angle prediction information and angle marking information corresponding to the face candidate frame, and when the obtained Euclidean distance is larger than or equal to a first distance threshold value, determining that the face corresponding to the face candidate frame is an angle-difficult sample. It should be noted that, in addition to calculating the euclidean distance, a manhattan distance (or called L1 distance), a Smooth L1 distance, or other types of distances between the angle prediction information and the angle labeling information corresponding to the face candidate frame may also be calculated, and the embodiment of the present application is not particularly limited.
For example, if the angle prediction information of a certain face candidate frame is (y11, p11, r11) and the angle labeling information corresponding to the face candidate frame is (y10, p10, r10), the Euclidean distance between the angle prediction information and the angle labeling information is √((y11 − y10)² + (p11 − p10)² + (r11 − r10)²).
For example, the manner of determining angle-difficult samples may also be: first, the Euclidean distance (or Manhattan distance, or another type of distance) between the angle prediction information of each face candidate frame and the angle labeling information corresponding to that face candidate frame is calculated. Taking the Euclidean distance as an example, it is judged whether the Euclidean distance is greater than a second distance threshold; if so, the corresponding face candidate frame is marked. When the number of marked face candidate frames reaches H, the H Euclidean distances corresponding to the H marked face candidate frames are sorted in descending order, and the faces corresponding to the first S Euclidean distances are all determined to be angle-difficult samples, where H is greater than S, and S is a positive integer greater than or equal to 1 and less than H. A larger Euclidean distance indicates a larger deflection angle of the face corresponding to that face candidate frame, and also indicates that it is more difficult for the face angle detection branch to predict the angle of that face.
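The mining strategy just described (marking candidate frames whose distance exceeds a threshold, then keeping the S largest distances) can be illustrated with the following sketch. The threshold values and the use of the Euclidean distance are assumptions consistent with the examples above.

    import torch

    def mark_angle_hard_samples(pred_angles, gt_angles, second_threshold, S):
        # pred_angles, gt_angles: [M, 3] per face candidate frame; returns the indices of
        # the candidate frames treated as angle-difficult samples.
        dist = torch.norm(pred_angles - gt_angles, dim=-1)         # Euclidean distance per frame
        over = torch.nonzero(dist > second_threshold).squeeze(-1)  # marked candidate frames (H of them)
        if over.numel() == 0:
            return over
        order = dist[over].argsort(descending=True)                # descending order of distances
        return over[order[:S]]                                     # top-S are angle-difficult samples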
In order to visually display angle-difficult samples, see fig. 12A and 12B. The face in fig. 12A is an angle-difficult sample (or referred to as a large-angle face), characterized by a large face angle in at least one of the yaw, pitch and roll directions, while the face shown in fig. 12B is an angle-easy sample (or referred to as a small-angle face). It should be noted that the pose angle of a small-angle face is easy for the model to predict accurately, while angle prediction for a large-angle face is difficult.
In the embodiment of the application, after a face corresponding to a certain face candidate frame is determined to be an angle-difficult sample, the face candidate frame is marked, and coefficients of angle prediction errors corresponding to the angle-difficult samples (namely the marked face candidate frame) are increased when a loss function of a face angle detection branch is calculated.
It should be noted that the coefficient of the angle prediction error corresponding to the sample with difficult angle may be a preset fixed value or a value that can be adjusted by a program, and the embodiment of the present application is not particularly limited.
Illustratively, before angle-difficult sample mining is added, the coefficients of the angle prediction errors corresponding to the face candidate frames in the loss function of the face angle detection branch are the same regardless of whether the corresponding face is an angle-difficult sample, for example, all 1; after angle-difficult sample mining is added, when the loss function of the face angle detection branch is calculated, the coefficients of the angle prediction errors corresponding to the face candidate frames marked as angle-difficult samples are all increased to 1.3, while the coefficients of the angle prediction errors corresponding to the remaining face candidate frames not marked as angle-difficult samples remain 1.
In some possible embodiments, after the face candidate frames corresponding to the angle-difficult samples are marked, when the loss function of the face angle detection branch is calculated, the coefficient of the angle prediction error corresponding to each angle-difficult sample may also be adjusted according to the difficulty degree of that angle-difficult sample. Exemplarily, before angle-difficult sample mining is performed, the coefficients of the angle prediction errors corresponding to the face candidate frames are all 1 when the loss function of the face angle detection branch is calculated; after angle-difficult sample mining is performed, if S face candidate frames are determined to be angle-difficult samples, the S face candidate frames may be further divided into a plurality of levels according to their corresponding Euclidean distances (the larger the Euclidean distance, the more difficult the corresponding angle-difficult sample), for example three levels arranged from high to low difficulty: a first level, a second level, and a third level. The coefficient of the angle prediction error corresponding to a face candidate frame belonging to the first level is adjusted to 1.5, the coefficient corresponding to the second level is adjusted to 1.3, the coefficient corresponding to the third level is adjusted to 1.2, and the coefficients of the angle prediction errors corresponding to other face candidate frames not marked as angle-difficult samples remain 1; the loss of the face angle detection branch is calculated according to the above rule.
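As a sketch of the tiered coefficient adjustment in this example, one might weight each candidate frame's angle prediction error as follows. The level boundaries are assumptions chosen for illustration; the coefficients 1.5 / 1.3 / 1.2 mirror the example values above and are not fixed by this application.

    def angle_error_coefficient(distance, level_bounds=(2.0, 1.0, 0.5)):
        # distance: Euclidean distance between the angle prediction information and the
        # angle labeling information for one face candidate frame;
        # level_bounds: assumed cut-offs separating the three difficulty levels.
        if distance >= level_bounds[0]:
            return 1.5   # first level: most difficult angle samples
        if distance >= level_bounds[1]:
            return 1.3   # second level
        if distance >= level_bounds[2]:
            return 1.2   # third level
        return 1.0       # not an angle-difficult sample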
It should be noted that, in some possible embodiments, after determining the samples with difficult angles, the coefficients of the angle prediction errors corresponding to the samples with difficult angles in the loss function of the face angle detection branch may not be adjusted, but the sample images where the samples with difficult angles are located are input into the initial face angle detection model again to be retrained one or more times, so as to improve the detection accuracy of the model for large-angle faces.
It should be noted that, referring to fig. 11, a step of mining face-difficult samples may also be added to improve the accuracy of the model in detecting the position of the face and to increase the generalization and robustness of the face angle detection model in detecting faces. The face-difficult sample mining process is similar to angle-difficult sample mining. In one specific implementation, whether a face is a face-difficult sample can be judged through the position prediction error (for example, a Euclidean distance or a Manhattan distance) between the position prediction information of the face and the position labeling information of the face; in another specific implementation, when it is determined that the position prediction errors (e.g., Euclidean distances or Manhattan distances) corresponding to H face prediction frames are greater than a third distance threshold, the H position prediction errors are sorted in descending order, and the faces corresponding to the first S larger position prediction errors are all taken as face-difficult samples. After the face-difficult samples are determined, when the loss function of the face detection branch is calculated, the coefficients of the position prediction errors corresponding to the face-difficult samples can be increased, so as to shorten the training time of the model and improve the face detection precision of the model. For the specific operations of face-difficult sample mining, reference may be made to the above description of angle-difficult sample mining, and for brevity of the description, details are not repeated here.
It should be noted that the mining process of the difficult samples is online, that is, the difficult samples are mined (including at least one of the mining of the face difficult samples and the mining of the angle difficult samples) simultaneously in the training process of the initial face angle detection model, which is beneficial to improving the applicability of the initial face angle detection model and enables the model to have better robustness when detecting the face and the angle.
S203, inputting the prediction result to a weight network to obtain the loss weight. This step may specifically refer to the related description of S102 in the embodiment of fig. 4, and is not repeated here.
And S204, obtaining a loss value of the initial face angle detection model according to the loss weight, the prediction result and the labeling information. This step may specifically refer to the related description of S103 in the embodiment of fig. 4, and is not described herein again.
It should be noted that, because the angle-difficult sample mining and the face-difficult sample mining are performed in S202, when calculating the loss value of the initial face angle detection model, the loss value of the face detection branch and the loss value of the face angle detection branch need to be adjusted correspondingly, specifically, the coefficient of the position prediction error corresponding to the face-difficult sample in the loss function of the face detection branch needs to be adjusted, and the coefficient of the angle prediction error corresponding to the angle-difficult sample in the loss function of the face angle detection branch needs to be adjusted.
Taking the adjustment of the loss function of the face angle detection branch as an example, referring to the above formula (4) in S103, the coefficients of the angle prediction errors corresponding to the face candidate frames are all 1; after S202, the coefficients of the angle prediction errors corresponding to the angle-difficult samples in formula (4) are adjusted. Assuming that, after S202, the 2nd face candidate frame of the 3rd face in the 1st sample image is determined to be an angle-difficult sample, and the coefficient of the angle prediction error corresponding to the angle-difficult sample is adjusted to 1.3, then the product of the angle prediction error corresponding to the 2nd face candidate frame of the 3rd face in the 1st sample image and the adjusted coefficient is 1.3·|X_JY(1, 3, 2) − X_JB(1, 3)|, where X_JY(1, 3, 2) represents the angle prediction information of the 2nd face candidate frame of the 3rd face in the 1st sample image, X_JB(1, 3) represents the angle labeling information of the 3rd face in the 1st sample image, and 1.3 is the coefficient of the angle prediction error corresponding to the angle-difficult sample.
Similarly, after face-difficult samples are determined through face-difficult sample mining, the coefficients of the position prediction errors corresponding to the face-difficult samples need to be adjusted, and when the loss value of the face detection branch is calculated, the adjusted coefficients are used for the position prediction errors corresponding to the face-difficult samples. Taking formula (3) in S103 as an example, the coefficients of the position prediction errors corresponding to the face prediction frames are all 1; after S202, the coefficients of the position prediction errors corresponding to the face-difficult samples in formula (3) are adjusted. Assuming that, after S202, the face prediction frame of the 3rd face in the 1st sample image is determined to be a face-difficult sample, and the coefficient of the position prediction error corresponding to the face-difficult sample is adjusted to 1.2, then the product of the position prediction error corresponding to the face prediction frame of the 3rd face in the 1st sample image and the adjusted coefficient is 1.2·[X_YF(1, 3) − X_BF(1, 3)]², where X_YF(1, 3) represents the position coordinates of the face prediction frame of the 3rd face in the 1st sample image, X_BF(1, 3) represents the position coordinates of the face labeling frame of the 3rd face in the 1st sample image, and 1.2 is the coefficient of the position prediction error corresponding to the face-difficult sample. To sum up, the loss value of the face detection branch after face-difficult sample mining and the loss value of the face angle detection branch after angle-difficult sample mining are calculated in the above manner, and finally the loss value of the initial face angle detection model after difficult sample mining is obtained according to the updated loss value of the face detection branch, the updated loss value of the face angle detection branch, and the first loss weight and the second loss weight obtained in S203.
The loss function of the face detection branch after face difficult sample mining is not assumed to beSee equation (6), where M1Representing a face difficulty sample mining process, wherein LfaceThe loss function of the face detection branch before face difficulty sample mining is not performed is shown, and can be seen in formula (3):
Assume that the loss function of the face angle detection branch after angle-difficult sample mining is L′_angle, see formula (7):

L′_angle = M2(L_angle)   (7)

where M2 represents the angle-difficult sample mining process, and L_angle represents the loss function of the face angle detection branch before angle-difficult sample mining, which can be seen in formula (4).
Combining the first loss weight w1 and the second loss weight w2 obtained from the weight network, the loss function L′_sum of the initial face angle detection model after difficult sample mining is updated from formula (5) to formula (8):

L′_sum = (1 + w1)·L′_face + (1 + w2)·L′_angle   (8)
and S205, updating parameters in the weight network and parameters in the initial face angle detection model according to the loss values. This step may specifically refer to the related description of S104 in the embodiment in fig. 4, and is not described here again.
It should be noted that the weight network and the difficult sample mining are only used in the training process of the initial face angle detection model, and after the training of the initial face angle detection model is finished, the trained initial face angle detection model is also called a face angle detection model. The face angle detection model can be applied directly. Illustratively, the face angle detection model can detect the angle of a face from a face image collected by a camera, and faces with a suitable angle are screened out according to the detected face angle and input into a face recognition model for recognition, so as to improve the accuracy of face recognition. In addition, the face angle detection model can be applied to fields such as attention monitoring of subjects such as long-distance drivers or students, human behavior analysis, and human-computer interaction.
It should be noted that after the training of the initial face angle detection model is finished, the trained initial face angle detection model is also called a face angle detection model. The server (or training device) for model training may send the trained face angle detection model to a server or application device for model application, and the like. In some possible embodiments, if the device has both functions of training the model and applying the model, the device may directly apply the trained face angle detection model.
By implementing the embodiment of the application, the multi-task learning model is adopted to simultaneously detect the position of the face and the angle of the face based on the face candidate frame, so that the efficiency and the precision of the face and angle detection are improved. In addition, a weight network is added to assist the training of the initial face angle detection model, and the loss weight of the initial face angle detection model is optimized; and the robustness of the model for detecting the face and the angle is increased by carrying out online excavation on difficult samples in model training, so that the finally obtained face angle detection model is suitable for detecting the face angles in various angle ranges.
In addition, in the training process of the face angle detection model in the embodiment of fig. 4 or the embodiment of fig. 10, the model may be trained in a mixed-precision manner (i.e., a combination of FP16 precision and FP32 precision). For example, FP16 precision is applied to the face feature network, the forward passes of the face detection branch and the face angle detection branch, and the gradient calculation (which involves a large number of parameter multiplications), while FP32 precision is applied to scenarios where a large number of parameters need to be accumulated (e.g., convolutional layers), for example Batch Normalization, to prevent data overflow. In short, FP16 precision is used for multiplication operations and FP32 precision is used for addition operations, so as to avoid accumulated errors. Training the initial face angle detection model in a mixed-precision manner can effectively compress the size of the model's data, thereby reducing the memory occupied by training the model, and greatly reducing the precision loss caused by an improper storage format of data during training.
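A minimal sketch of mixed-precision training, assuming a PyTorch automatic-mixed-precision setup (torch.cuda.amp) and continuing the illustrative training-step sketch above (model, w_net1, w_net2, optimizer, dataloader and the two loss helpers are the same assumed names), is given below. The patent does not prescribe a particular framework; the autocast/GradScaler mechanism shown here is one common way to combine FP16 multiplications with FP32 accumulations.

    scaler = torch.cuda.amp.GradScaler()
    for images, gt_boxes, gt_angles in dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # FP16 for the forward pass and multiplications
            pred_boxes, pred_angles, f1, f2 = model(images)
            loss = (1 + w_net1(f1)) * face_detection_loss(pred_boxes, gt_boxes) \
                 + (1 + w_net2(f2)) * face_angle_loss(pred_angles, gt_angles)
        scaler.scale(loss).backward()            # gradients scaled to avoid FP16 underflow
        scaler.step(optimizer)                   # parameter accumulation performed in FP32
        scaler.update()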
Referring to fig. 13, fig. 13 is a flowchart of an angle labeling method for a face in a sample image according to an embodiment of the present application. It should be noted that, before the training of the initial face angle detection model in the embodiment of fig. 4 or the embodiment of fig. 10 is performed, the angle labeling work of the face in the sample image is completed, that is, the sample image with the face angle labeling completed may be input into the initial face angle detection model in fig. 4 or the initial face angle detection model in fig. 10 for training. In addition, the embodiment of fig. 13 is complementary to the embodiment of fig. 4 and 10. The method includes, but is not limited to, the steps of:
and S111, acquiring the image with the face labeling frame.
In the embodiment of the application, images containing faces with different identities are prepared in advance, the images contain a plurality of different faces, each face in the images corresponds to a unique face labeling frame, each face labeling frame corresponds to position labeling information, the position labeling information represents the pixel position of the face labeling frame in the images, and the size of the face labeling frame can be known according to the position labeling information. Illustratively, this face labeling frame is the smallest bounding rectangle of the corresponding face, i.e. the smallest rectangle frame can maximally contain all facial features of the face (e.g. eyes, nose, mouth, chin, forehead, etc.). In some possible embodiments, the face labeling box may also be a minimum bounding rectangle of the head of the corresponding face, that is, the region marked by the face labeling box (or referred to as the corresponding region) includes not only facial features of the face, but also ears, hair, and the like.
It should be noted that the deflection degrees of different faces in the image may be different; for example, some faces are frontal faces (facing the shooting lens), and some faces have deflection angles, that is, the camera captures a face turned to the side while talking with another person, a face looking up with the head raised, a face looking down to check clothing, and the like, so that the pose angles of the faces in the image are enriched and the diversity of face angles is increased.
It should be noted that the face labeling box may be manually labeled by a third-party labeling tool (e.g., imglab of dlib, labeltol of python), or may be directly labeled with an existing data set (e.g., WIDER FACE data set). In some possible embodiments, the face labeling frame may also use a model with a better face detection effect in the related art to detect a face in the image, and the face detection frame output by the model is used as the face labeling frame.
And S112, judging whether the shortest side of the face labeling frame is larger than a first threshold value.
In the embodiment of the application, after an image with a face labeling frame is obtained, face frame size detection is sequentially performed on each face labeling frame in the image, and it should be noted that, in general, the face labeling frame is mostly rectangular, and in this case, the face frame size detection refers to determining whether the shortest side of the face labeling frame is greater than a first threshold value, so as to filter out a face labeling frame with a smaller size in the image. Therefore, when the shortest side of the face labeling frame is greater than the first threshold, S113 is executed; and when the shortest side of the face labeling frame is less than or equal to the first threshold, executing S111 to obtain a next face labeling frame in the image.
For example, the face labeling frame may also be a circular frame, an oval frame, or a square frame, and the like, and the present application is not limited specifically. Under the condition that the face labeling frame is a circular frame, judging whether the radius of the face frame is larger than a first threshold value; under the condition that the face labeling frame is a square frame, judging whether the side length of the face frame is larger than a first threshold value; and under the condition that the face labeling frame is an oval frame, judging whether the minor axis of the face labeling frame is larger than a first threshold value.
It should be noted that the image includes a plurality of different faces. On the one hand, the faces may present different deflection angles, that is, different pose angles; on the other hand, because people are located at different positions in the image, their distances to the camera lens differ, and according to the imaging principle of the camera, the face of a person near the lens appears larger in the image than the face of a person far from the lens, which also means that the face labeling frame corresponding to a face near the lens is larger than the face labeling frame corresponding to a face far from the lens. Therefore, faces far from the lens, which are insufficiently sharp and contain little face information, are removed by comparing the shortest side of the face labeling frame with the first threshold.
S113, performing blur detection on the face image corresponding to the face labeling frame to obtain a detection result, and judging whether the detection result meets a preset condition.
In the embodiment of the application, after face labeling frames meeting a certain size are screened out through face frame size detection, blur detection is performed on the face corresponding to the face labeling frame to obtain a detection result, and whether the detection result meets a preset condition is judged so as to select clear face images. Illustratively, the detection result is compared with a second threshold; if the detection result is greater than the second threshold, S114 is executed; if the detection result is less than or equal to the second threshold, S111 is executed to acquire the next face labeling frame in the image, and then the above steps are performed again. Simply put, the detection result of the blur detection can indirectly reflect whether the face area marked by the face labeling frame is clear.
Specifically, a face area marked by the face labeling frame is extracted according to the position labeling information of the face labeling frame, and image graying processing is performed on the face area to convert a color image of the face into a gray image of the face, wherein the gray image reflects the level distribution and the characteristics of the chromaticity and the brightness of the area where the face is located. Then, contour detection or edge detection can be performed on the face region by using a gradient function (for example, a laplacian operator, a Brenner operator, a Tenengrad operator, etc.), so as to obtain a detection result of the face region, where the detection result is used to indicate a variance value after convolution of the face region corresponding to the face labeling frame and the gradient function, and if the detection result is greater than a second threshold, it indicates that the face image corresponding to the face labeling frame meets a preset condition; and if the detection result is less than or equal to the second threshold, the face image corresponding to the face labeling frame does not meet the preset condition.
In some possible embodiments, after the grayscale image of the face region corresponding to the face labeling frame is obtained, the average gray level of all pixels in the grayscale image is taken as a reference, the gray-level difference of each pixel from this reference is computed, the squares of these differences are summed, and the sum is normalized by the total number of pixels to obtain the detection result; the detection result represents the average degree of gray-level change in the image, and the larger the detection result, the clearer the image, while the smaller the detection result, the more blurred the image. In addition, the degree of blurring of an image can be measured based on the high-frequency components of the image in a transform domain (e.g., Fourier transform): the more high-frequency components, the sharper the image, and the fewer high-frequency components, the more blurred the image.
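As a minimal sketch of the gradient-based blur detection described above, assuming OpenCV: the face region is grayed, convolved with the Laplacian operator, and the variance of the response is compared with the second threshold. The threshold value and the function name are illustrative assumptions.

    import cv2

    def face_region_is_sharp(image, box, second_threshold=100.0):
        # image: BGR sample image; box: (x1, y1, x2, y2) of the face labeling frame.
        # The threshold value is an assumption for illustration.
        x1, y1, x2, y2 = box
        face = image[y1:y2, x1:x2]
        gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)      # image graying processing
        score = cv2.Laplacian(gray, cv2.CV_64F).var()      # variance after Laplacian convolution
        return score > second_threshold                    # detection result vs. second threshold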
S114, calculating the face angle of the face image by adopting a plurality of angle models, and obtaining the angle marking information of the face image according to the obtained face angles.
In the embodiment of the application, after face frame size detection and blur detection are performed, face images with a face labeling frame of suitable size and a clear corresponding face area are obtained; N angle models are applied to each such face image to calculate the face angle, yielding N face angles, and finally the angle labeling information corresponding to the face image is obtained from the N face angles, so as to improve the labeling precision of the target angle and obtain high-quality angle labeling data. It should be noted that the N angle models may be models in the related art that perform well in detecting the face angle in an image containing only a single face.
For example, the angle model may calculate the face angle by matching a standard 3D face model with face feature points. Specifically, feature points of the face image are detected first, a Constrained Local Model (CLM) is used to locate the identified face feature points, then the face feature points are matched with the standard 3D face model, and a Perspective-n-Point (PnP) method is used to solve the mapping relationship between the two-dimensional image of the face and the three-dimensional object (i.e., between the camera coordinate system and the image coordinate system), so as to obtain the face angle corresponding to the face labeling frame.
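A hedged sketch of the landmark-plus-PnP approach, assuming OpenCV's solvePnP and a small set of generic 3D model points, is given below. The specific landmark detector (CLM or otherwise), the 3D template values, and the yaw/pitch/roll convention are assumptions for illustration only.

    import cv2
    import numpy as np

    def face_angles_from_landmarks(image_points, camera_matrix):
        # image_points: 6x2 detected 2D face feature points (nose tip, chin, eye corners,
        # mouth corners), in the same order as the generic 3D template below.
        model_points = np.array([
            [0.0, 0.0, 0.0],          # nose tip
            [0.0, -330.0, -65.0],     # chin
            [-225.0, 170.0, -135.0],  # left eye corner
            [225.0, 170.0, -135.0],   # right eye corner
            [-150.0, -150.0, -125.0], # left mouth corner
            [150.0, -150.0, -125.0]], dtype=np.float64)   # template values are illustrative
        ok, rvec, _ = cv2.solvePnP(model_points, image_points.astype(np.float64),
                                   camera_matrix, None)   # no lens distortion assumed
        R, _ = cv2.Rodrigues(rvec)
        # recover yaw / pitch / roll from the rotation matrix (one common convention)
        yaw = np.degrees(np.arctan2(-R[2, 0], np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)))
        pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
        roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
        return yaw, pitch, roll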
Illustratively, the angle model may be a deep convolutional network model (ConvNet), where the feature extraction network includes three convolutional layers, three pooling layers, two fully connected layers, and three conjugate layers; a face image corresponding to a single face frame is input into the ConvNet model, and the three angle values of yaw, pitch and roll of the face can be directly output after processing by each layer in the model.
Illustratively, the angle model may be a detector array method, i.e. a plurality of different face detectors are trained through a sample template to achieve detection of faces with different angles, for example, three SVM detectors are trained through a support vector machine to detect rotation angles of faces in yaw, pitch and roll directions, respectively, so as to obtain pose angles of the faces.
It should be noted that the angle model may also be used for detecting the angle of a single face image by using other methods based on an appearance template, a nonlinear regression method, an elastic model, and the like, which is not specifically limited in this application.
In the embodiment of the present application, after N face angles of the face are obtained by using the N angle models on the same face image, each face angle being (yaw_i, pitch_i, roll_i) with i being a positive integer greater than or equal to 1 and less than or equal to N, linear (or nonlinear) fusion processing, for example averaging, may be performed on the N face angles in the yaw, pitch and roll directions respectively, to obtain the angle labeling information of the face.
In some possible embodiments, after obtaining N face angles of the same face, weight distribution may be further performed on the accuracy of face angle detection in a certain direction according to the N angle models, so as to perform weighted averaging on the N face angles to obtain the pose angle of the face. In addition, other linear or nonlinear fusion methods may also be used to process the measured N face angles of the same face to obtain the angle labeling information of the face, which is not specifically limited in the present application.
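A short sketch of the fusion step, assuming simple and weighted averaging over the N angle values as described above; the per-model weights would come from an assessment of each angle model's accuracy and are shown here only as an optional placeholder parameter.

    import numpy as np

    def fuse_face_angles(angles, weights=None):
        # angles: [N, 3] array of (yaw_i, pitch_i, roll_i) from the N angle models;
        # weights: optional [N] array of per-model weights; plain averaging if omitted.
        angles = np.asarray(angles, dtype=np.float64)
        if weights is None:
            return angles.mean(axis=0)                      # linear fusion: simple average
        w = np.asarray(weights, dtype=np.float64)
        return (angles * w[:, None]).sum(axis=0) / w.sum()  # weighted average per direction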
Face images of suitable size with clear faces are screened out through face frame size detection and blur detection to serve as samples for angle labeling, and the multiple angles obtained by applying multiple angle models to the same face are then fused to obtain the angle labeling information of that face, thereby constructing sample images containing angle labeling information for multiple faces, simplifying the face angle labeling process, and improving the efficiency and accuracy of face angle labeling. Using the obtained angle labeling information to train the initial face angle detection model can improve the detection precision of the model for face angles.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an apparatus provided in an embodiment of the present application, where the apparatus 10 at least includes a processor 110, a memory 111, a receiver 112, and a transmitter 113, and the receiver 112 and the transmitter 113 may also be replaced with a communication interface for providing information input and/or output for the processor 110. Optionally, the memory 111, the receiver 112, the transmitter 113 and the processor 110 are connected or coupled by a bus. The apparatus 10 may be the training device 120 of fig. 3.
The receiver 112 is used to receive the sample images sent by the database of fig. 3 for training. If the training of the model is separate from the application of the model, the transmitter 113 may be used to transmit the trained initial face angle detection model to the application device. The receiver 112 and transmitter 113 may include an antenna and chipset for communicating with in-vehicle devices, sensors, or other physical devices, either directly or over an air interface. The transmitter 113 and the transceiver 112 constitute a communication module that may be configured to receive and transmit information in accordance with one or more other types of wireless communication (e.g., protocols), such as bluetooth, IEEE 802.11 communication protocols, cellular technologies, Worldwide Interoperability for Microwave Access (WiMAX) or LTE (Long Term Evolution), ZigBee protocols, Dedicated Short Range Communications (DSRC), and RFID (Radio Frequency Identification) Communications, among others.
The specific implementation of the operations performed by the processor 110 may refer to the training process of the model in fig. 4 or fig. 10 and the labeling process of the sample image in fig. 13. Processor 110 may be composed of one or more general-purpose processors, such as a Central Processing Unit (CPU), or a combination of a CPU and hardware chips. The hardware chip may be an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), Generic Array Logic (GAL), or any combination thereof.
The Memory 111 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory 111 may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, HDD), or a Solid-State Drive (SSD); the memory 111 may also comprise a combination of the above categories. The memory 111 may store programs and data, wherein the stored programs include: the method comprises an initial face angle detection model, a loss weight learning algorithm, a difficult sample mining program and the like, wherein stored data comprise: sample images, angle prediction information, position labeling information, angle labeling information, position candidate information, and the like. The memory 111 may be separate or integrated within the processor 110.
In the embodiment of the present application, the apparatus 10 is used to implement the method described in the embodiment of fig. 4 or the embodiment of fig. 10.
Referring to fig. 15, fig. 15 is a functional structure diagram of an apparatus provided in an embodiment of the present application, and the apparatus 41 includes an image prediction unit 410, a weight learning unit 411, a loss calculation unit 412, and a parameter update unit 413. The means 41 may be implemented by means of hardware, software or a combination of hardware and software.
The image prediction unit 410 is configured to input a sample image to an initial target angle detection model, and obtain a prediction result according to the initial target angle detection model, where the sample image includes annotation information; a weight learning unit 411, configured to input the prediction result to a weight network to obtain a loss weight; a loss calculating unit 412, configured to obtain a loss value of the initial target angle detection model according to the loss weight, the prediction result, and the label information; and a parameter updating unit 413 for updating the parameters in the weight network and the parameters in the initial target angle detection model according to the loss values.
In some possible embodiments, the apparatus 41 further includes an image annotation unit 414, where the image annotation unit 414 is configured to perform position annotation on the face in the sample image to obtain position annotation information of the face; detecting the angles of the face corresponding to the position marking information by adopting N angle models to obtain N angle values; and obtaining angle marking information of the face according to the N angle values, wherein N is a positive integer.
The functional modules of the apparatus 41 may be used to implement the method described in the embodiment of fig. 4. In the embodiment of fig. 4, the image prediction unit 410 may be configured to perform S101, the weight learning unit 411 may be configured to perform S102, the loss calculation unit 412 may be configured to perform S103, and the parameter update unit 413 may be configured to perform S104. The functional modules of the apparatus 41 may also be configured to execute the method in the embodiment of fig. 10, and the image labeling unit 414 of the apparatus 41 may be configured to execute the method in the embodiment of fig. 13, which is not described herein again for brevity of description.
In the embodiments described above, the descriptions of the respective embodiments have respective emphasis, and reference may be made to related descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
It should be noted that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium. The storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, disk storage, tape storage, or any other computer-readable medium that can be used to carry or store data.
The technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a device (which may be a personal computer, a server, a network device, a robot, a single-chip microcomputer, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
Claims (18)
1. A training method of a target angle detection model is characterized by comprising the following steps:
inputting a sample image to an initial target angle detection model, and obtaining a prediction result according to the initial target angle detection model, wherein the sample image contains annotation information;
inputting the prediction result to a weight network to obtain a loss weight;
obtaining a loss value of the initial target angle detection model according to the loss weight, the prediction result and the labeling information;
and updating parameters in the weight network and parameters in the initial target angle detection model according to the loss values.
2. The method of claim 1, wherein the prediction result comprises position prediction information and angle prediction information of a target in the sample image, and wherein the inputting the prediction result into a weight network to obtain a loss weight comprises:
inputting the position prediction information to a first weight network to obtain a first loss weight;
and inputting the angle prediction information to a second weight network to obtain a second loss weight.
3. The method of claim 2, wherein obtaining the loss value of the initial target angle detection model according to the loss weight, the prediction result and the annotation information comprises:
calculating a first loss value according to the position prediction information and position marking information in the marking information;
calculating a second loss value according to the angle prediction information and angle marking information in the marking information;
and adding the product of the first loss weight and the first loss value and the product of the second loss weight and the second loss value to obtain a loss value of the initial target angle detection model.
4. The method according to claim 2 or 3, wherein the initial target angle detection model comprises a target feature network, a target detection branch and a target angle detection branch; the target feature network is configured to determine a plurality of target candidate frames in the input sample image, and the target detection branch and the target angle detection branch respectively take the plurality of target candidate frames output by the target feature network as inputs, the target detection branch is configured to output the position prediction information, and the target angle detection branch is configured to output the angle prediction information.
5. The method of claim 4, wherein said updating parameters in said weight network and parameters in said initial target angle detection model according to said loss values comprises:
parameters in the first weight network, the target detection branch and the target feature network are updated in sequence according to the loss values;
and sequentially updating parameters in the second weight network, the target angle detection branch and the target characteristic network according to the loss value.
6. The method according to claim 4 or 5, wherein the annotation information comprises position annotation information and angle annotation information of the target in the sample image; before inputting the sample image to the initial target angle detection model, the method further comprises:
carrying out position marking on a target in the sample image to obtain position marking information of the target;
and detecting the angle of the target by adopting N angle models for the target corresponding to the position marking information to obtain N angle values, and obtaining the angle marking information of the target according to the N angle values, wherein N is a positive integer.
7. The method of claim 6,
after obtaining the prediction result according to the initial target angle detection model, the method further includes:
and when the angle prediction error between the angle prediction information corresponding to the target candidate frame and the angle marking information corresponding to the target candidate frame is larger than a preset difference amount, increasing the coefficient of the angle prediction error corresponding to the target candidate frame in the loss function of the target angle detection branch.
8. The method according to any one of claims 1-7, wherein the training method of the target angle detection model employs a mixed precision mode.
9. An apparatus for training a target angle detection model, the apparatus comprising:
an image prediction unit, configured to input a sample image to an initial target angle detection model and obtain a prediction result from the initial target angle detection model, wherein the sample image contains annotation information;
a weight learning unit, configured to input the prediction result to a weight network to obtain a loss weight;
a loss calculation unit, configured to obtain a loss value of the initial target angle detection model according to the loss weight, the prediction result and the annotation information;
and a parameter updating unit, configured to update the parameters in the weight network and the parameters in the initial target angle detection model according to the loss value.
10. The apparatus according to claim 9, wherein the prediction result comprises position prediction information and angle prediction information of a target in the sample image, and the weight learning unit is specifically configured to:
inputting the position prediction information to a first weight network to obtain a first loss weight;
and inputting the angle prediction information to a second weight network to obtain a second loss weight.
11. The apparatus according to claim 10, wherein the loss calculation unit is specifically configured to:
calculating a first loss value according to the position prediction information and the position annotation information in the annotation information;
calculating a second loss value according to the angle prediction information and the angle annotation information in the annotation information;
and adding the product of the first loss weight and the first loss value to the product of the second loss weight and the second loss value to obtain the loss value of the initial target angle detection model.
12. The apparatus according to claim 10 or 11, wherein the initial target angle detection model comprises a target feature network, a target detection branch and a target angle detection branch; the target feature network is configured to determine a plurality of target candidate frames in the input sample image, and the target detection branch and the target angle detection branch each take the plurality of target candidate frames output by the target feature network as input, the target detection branch being configured to output the position prediction information and the target angle detection branch being configured to output the angle prediction information.
13. The apparatus according to claim 12, wherein the parameter updating unit is specifically configured to:
sequentially updating parameters in the first weight network, the target detection branch and the target feature network according to the loss value;
and sequentially updating parameters in the second weight network, the target angle detection branch and the target feature network according to the loss value.
14. The apparatus according to claim 12 or 13, further comprising an image annotation unit for:
annotating the position of a target in the sample image to obtain the position annotation information of the target;
and detecting, with N angle models, the angle of the target corresponding to the position annotation information to obtain N angle values, and obtaining the angle annotation information of the target according to the N angle values, wherein N is a positive integer.
15. The apparatus of claim 14, wherein the loss calculation unit is further configured to:
and when the angle prediction error between the angle prediction information corresponding to a target candidate frame and the angle annotation information corresponding to the target candidate frame is larger than a preset difference threshold, increase the coefficient of the angle prediction error of that target candidate frame in the loss function of the target angle detection branch.
16. The apparatus according to any one of claims 9-15, wherein the training method of the target angle detection model employs a mixed precision mode.
17. An apparatus for training a target angle detection model, the apparatus comprising a memory configured to store program instructions and a processor, the processor being configured to invoke the program instructions in the memory to perform the method of any one of claims 1-8.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010671486.7A CN114005149A (en) | 2020-07-13 | 2020-07-13 | Training method and device for target angle detection model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010671486.7A CN114005149A (en) | 2020-07-13 | 2020-07-13 | Training method and device for target angle detection model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114005149A (en) | 2022-02-01 |
Family
ID=79920103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010671486.7A Pending CN114005149A (en) | 2020-07-13 | 2020-07-13 | Training method and device for target angle detection model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114005149A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7559791B2 (en) | 2022-03-14 | 2024-10-02 | トヨタ自動車株式会社 | Learning device, control method thereof, and control program |
CN114329475A (en) * | 2022-03-15 | 2022-04-12 | 北京华云安信息技术有限公司 | Training method, device and equipment for malicious code detection model |
CN114596363A (en) * | 2022-05-10 | 2022-06-07 | 北京鉴智科技有限公司 | Three-dimensional point cloud labeling method and device and terminal |
CN114596363B (en) * | 2022-05-10 | 2022-07-22 | 北京鉴智科技有限公司 | Three-dimensional point cloud marking method and device and terminal |
CN115841571A (en) * | 2023-02-23 | 2023-03-24 | 青岛创新奇智科技集团股份有限公司 | Article display image direction identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Title |
---|---|
US10943126B2 (en) | Method and apparatus for processing video stream | |
CN111328396B (en) | Pose estimation and model retrieval for objects in images | |
US11205086B2 (en) | Determining associations between objects and persons using machine learning models | |
CN111819568B (en) | Face rotation image generation method and device | |
CN114005149A (en) | Training method and device for target angle detection model | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
JP6398979B2 (en) | Video processing apparatus, video processing method, and video processing program | |
CN111931764B (en) | Target detection method, target detection frame and related equipment | |
CN109934065B (en) | Method and device for gesture recognition | |
CN115655262B (en) | Deep learning perception-based multi-level semantic map construction method and device | |
CN107545263B (en) | Object detection method and device | |
CN111274916A (en) | Face recognition method and face recognition device | |
CN103514432A (en) | Method, device and computer program product for extracting facial features | |
CN110222641B (en) | Method and apparatus for recognizing image | |
CN111062328B (en) | Image processing method and device and intelligent robot | |
CN112927234A (en) | Point cloud semantic segmentation method and device, electronic equipment and readable storage medium | |
CN113807399A (en) | Neural network training method, neural network detection method and neural network detection device | |
CN108229494B (en) | Network training method, processing method, device, storage medium and electronic equipment | |
CN111062263A (en) | Method, device, computer device and storage medium for hand pose estimation | |
CN111444764A (en) | Gesture recognition method based on depth residual error network | |
CN109508636A (en) | Vehicle attribute recognition methods, device, storage medium and electronic equipment | |
CN111985458A (en) | Method for detecting multiple targets, electronic equipment and storage medium | |
CN116958584B (en) | Key point detection method, regression model training method and device and electronic equipment | |
CN111414915A (en) | Character recognition method and related equipment | |
CN112487844A (en) | Gesture recognition method, electronic device, computer-readable storage medium, and chip |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |