CN111325107A - Detection model training method and device, electronic equipment and readable storage medium

Info

Publication number: CN111325107A (application CN202010074476.5A); granted as CN111325107B
Language: Chinese (zh)
Inventor: 奉万森
Applicant and current assignee: Guangzhou Huya Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G06V40/161 - Human faces: detection; localisation; normalisation
    • G06V40/168 - Human faces: feature extraction; face representation
    • G06F18/214 - Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 - Neural networks: combinations of networks
    • Y02T10/40 - Engine management systems

Abstract

The embodiment of the application provides a detection model training method and device, an electronic device, and a readable storage medium. A plurality of sample images in an obtained sample image set are divided into a plurality of image subsets, the number of sample images in the image subsets is balanced, and each sample image is optimized according to a preset transformation strategy to obtain a target image set. A pre-constructed neural network model is then trained with the target image set to obtain a detection model. Because this training scheme both balances the sample distribution of the sample image set and optimizes each individual sample image, the obtained target image set is improved in both respects, which improves the detection accuracy of the detection model obtained by training.

Description

Detection model training method and device, electronic equipment and readable storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a detection model training method and device, electronic equipment and a readable storage medium.
Background
In face image processing, accurately determining the key points of the face is essential. At present, key point detection is generally performed with an artificial intelligence algorithm: a model is first trained with a sample image set to obtain a detection model, and the detection model is then used to recognize the image to be processed and determine the facial key points. However, current approaches often neglect the processing of the sample image set, whose samples are typically gathered at random; the set therefore suffers from uneven sample distribution and unoptimized samples, and the detection accuracy of the resulting detection model is poor.
Disclosure of Invention
The present application provides a detection model training method, apparatus, electronic device and readable storage medium, which can optimize a sample image set, thereby improving the detection accuracy of a trained detection model.
The embodiment of the application can be realized as follows:
in a first aspect, an embodiment provides a detection model training method, where the method includes:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets;
balancing the number of sample images in the plurality of image subsets;
carrying out optimization processing on each sample image according to a preset transformation strategy to obtain a target image set;
and training a pre-constructed neural network model by using the target image set to obtain a detection model.
In an alternative embodiment, the step of balancing the number of sample images in the plurality of image subsets includes:
obtaining the rotation angle of the face image in each sample image on the horizontal plane;
dividing the sample images of the target preset range to which the rotation angle belongs in the sample image set into a first image set, and dividing other sample images in the sample image set into a second image set;
and increasing the sample images in the second image set to enable the number of the sample images in the second image set to be a preset multiple of the number of the sample images in the first image set.
In an alternative embodiment, the second image set includes a first subset and a second subset, the rotation angle of the face image of the sample image in the first subset on the horizontal plane belongs to a first preset range, and the rotation angle of the face image of the sample image in the second subset on the horizontal plane belongs to a second preset range;
the step of increasing the number of sample images in the second image set to be a preset multiple of the number of sample images in the first image set includes:
increasing the sample images in the first subset to make the number of the sample images in the first subset a first preset multiple of the number of the sample images in the first image set;
and increasing the sample images in the second subset to enable the number of the sample images in the second subset to be a second preset multiple of the number of the sample images in the first image set.
In an optional embodiment, the step of performing optimization processing on each sample image according to a preset transformation strategy to obtain a target image set includes:
adjusting the size of each sample image to a preset size;
for each sample image after size adjustment, rotating a face image in the sample image by a preset angle in the vertical direction;
and performing random occlusion processing on each sample image by using an occlusion pixel block to obtain a target image set.
In an optional embodiment, the step of performing random occlusion processing on each sample image by using an occlusion pixel block includes:
generating an occlusion pixel block according to the obtained setting parameters;
for each sample image, determining a superposition area of the occlusion pixel block in the sample image according to a generated random number;
and superimposing the occlusion pixel block onto the superposition area in the sample image.
In an optional embodiment, the pre-constructed neural network model includes an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer, each sample image includes a plurality of labeled key points, and the step of training the pre-constructed neural network model by using the target image set to obtain the detection model includes:
inputting each sample image in the sample image set into the input layer for preprocessing to obtain a preprocessed image;
for each network layer, performing convolution processing and feature extraction processing on the input image through the network layer, and outputting a feature image;
fusing the feature images output by each network layer in the fusion layer to obtain a fusion feature map, inputting the fusion feature map into the output layer for key point classification, and obtaining a prediction key point of the sample image;
calculating a loss function value between the predicted key points and the labeled key points of the sample image, performing back-propagation training according to the loss function value, updating the network parameters of the neural network model, and continuing training until a preset termination condition is met, thereby obtaining the detection model.
In an optional embodiment, each of the network layers includes a first network module and a second network module, and the step of performing convolution processing and feature extraction processing on the input image through the network layer for each of the network layers and outputting a feature image includes:
for each network layer, performing convolution processing and feature extraction processing on the input image through a first network module of the network layer to obtain a first feature map;
performing convolution processing and feature extraction processing on the input image by using a second network module of the network layer to obtain a second feature map;
and fusing the first characteristic diagram and the second characteristic diagram, and outputting the characteristic image of the network layer.
In an optional implementation manner, the step of performing convolution processing and feature extraction processing on the input image by using the first network module of the network layer to obtain the first feature map includes:
carrying out convolution processing and feature extraction processing on the image input by the first network module by using a first convolution processing strategy, and carrying out convolution processing and feature extraction processing on the image input by the first network module by using a second convolution processing strategy;
performing fusion processing on the output image obtained by processing the first convolution processing strategy and the output image obtained by processing the second convolution processing strategy;
and carrying out channel random mixing processing on the image subjected to the fusion processing, and outputting a first characteristic diagram.
In an optional implementation manner, the step of performing convolution processing and feature extraction processing on the input image by using the second network module of the network layer to obtain the second feature map includes:
carrying out channel separation processing on the input image by utilizing a second network module of the network layer to obtain a plurality of single-channel images;
performing convolution processing and feature extraction processing on each single-channel image;
performing fusion processing on each single-channel image and the image of the single-channel image after convolution processing and feature extraction processing;
and carrying out channel random mixing processing on the plurality of single-channel images subjected to the fusion processing, and outputting a second characteristic diagram.
In a second aspect, an embodiment provides a detection model training apparatus, including:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a sample image set, the sample image set comprises a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets;
a balance processing module, configured to balance the number of sample images in the plurality of image subsets;
the optimization processing module is used for carrying out optimization processing on each sample image according to a preset transformation strategy to obtain a target image set;
and the training module is used for training a pre-constructed neural network model by using the target image set to obtain a detection model.
In a third aspect, embodiments provide an electronic device, including one or more storage media and one or more processors in communication with the storage media, where the one or more storage media store machine-executable instructions executable by the processors, and when the electronic device runs, the processors execute the machine-executable instructions to perform the detection model training method described in any one of the foregoing embodiments.
In a fourth aspect, embodiments provide a computer-readable storage medium storing machine-executable instructions, which when executed, implement the detection model training method according to any one of the foregoing embodiments.
The beneficial effects of the embodiment of the application include, for example:
the embodiment of the application provides a detection model training method, a detection model training device, electronic equipment and a readable storage medium. And training the pre-constructed neural network model by using the target image set to obtain a detection model. According to the training scheme, the sample distribution balance processing is carried out on the sample image set, and the optimization processing is carried out on each sample image, so that the obtained target image set is optimized on the sample distribution and a single sample image, and the detection accuracy of the detection model obtained through training is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic view of an application scenario of a detection model training method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a detection model training method according to an embodiment of the present application;
FIG. 3 is a flowchart of the substeps of step S220 in FIG. 2;
FIG. 4 is a flowchart of the substeps of step S230 in FIG. 2;
FIG. 5 is a flowchart of the substeps of step S240 in FIG. 2;
fig. 6 is a schematic network structure diagram of a neural network model provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a first network module in a neural network model provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a second network module in a neural network model provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 10 is a functional block diagram of a detection model training apparatus according to an embodiment of the present application.
Reference numerals: 100-live broadcast providing terminal; 200-live broadcast server; 300-live broadcast receiving terminal; 110-storage medium; 120-processor; 130-detection model training apparatus; 131-acquisition module; 132-balance processing module; 133-optimization processing module; 134-training module; 140-communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is noted that the terms "first", "second", and the like are used merely for distinguishing between descriptions and are not intended to indicate or imply relative importance. It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
The detection model training method provided by the embodiments of the present application can be applied in various scenarios, such as image processing applications, live broadcast applications, and access control applications, in which a face image must be processed to track and locate its key points so that subsequent processing can be performed based on the located key points. For example, in an image processing application, once the key points in a face image are located, they can be optimized, for instance by enlarging the eye region or reducing the face contour. In a live broadcast application, key points can be located on the anchor's face image so that the face image can be processed accordingly. The live broadcast scenario is used as the running example in the description below.
Referring to fig. 1, a schematic view of a possible application scenario of the detection model training method according to the embodiment of the present application is shown, where the scenario includes a live broadcast providing terminal 100, a live broadcast server 200, and a live broadcast receiving terminal 300. The live broadcast server 200 is in communication connection with the live broadcast providing terminal 100 and the live broadcast receiving terminal 300, respectively, and is configured to provide live broadcast services for the live broadcast providing terminal 100 and the live broadcast receiving terminal 300. For example, the live broadcast providing terminal 100 may transmit a live video stream to the live broadcast server 200, and the viewer may access the live broadcast server 200 through the live broadcast receiving terminal 300 to view the live video. The live video stream pushed by the live server 200 may be a video stream currently live in a live platform or a complete video stream formed after the live broadcast is completed. It is understood that the scenario shown in fig. 1 is only one possible example, and in other possible embodiments, the scenario may include only a part of the components shown in fig. 1 or may also include other components.
In this embodiment, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may be, but are not limited to, a smart phone, a personal digital assistant, a tablet computer, a personal computer, a notebook computer, a virtual reality terminal device, an augmented reality terminal device, and the like. Internet products for providing internet live broadcast services may be installed on the live broadcast providing terminal 100 and the live broadcast receiving terminal 300, for example applications (apps), web pages, or applets related to internet live broadcast services and used on a computer or smart phone.
In this embodiment, a video capture device for capturing the anchor video frame may be further included in the scene, and the video capture device may be, but is not limited to, a camera, a lens of a digital camera, a monitoring camera, a webcam, or the like. The video capture device may be directly installed or integrated in the live broadcast providing terminal 100. For example, the video capture device may be a camera configured on the live broadcast providing terminal 100, and other modules or components in the live broadcast providing terminal 100 may receive videos and images transmitted from the video capture device via the internal bus. Alternatively, the video capture device may be independent of the live broadcast providing terminal 100, and the two may communicate with each other in a wired or wireless manner.
Fig. 2 is a flowchart illustrating a detection model training method provided in an embodiment of the present application; the method may be executed by the live broadcast providing terminal 100, the live broadcast receiving terminal 300, or the live broadcast server 200 shown in fig. 1. It should be understood that, in other embodiments, the order of some steps in the detection model training method of this embodiment may be interchanged according to actual needs, or some steps may be omitted. The detailed steps of the detection model training method are described below.
Step S210, a sample image set is obtained, where the sample image set includes a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets.
Step S220, performing a balancing process on the number of sample images in the plurality of image subsets.
And step S230, carrying out optimization processing on each sample image according to a preset transformation strategy to obtain a target image set.
And S240, training a pre-constructed neural network model by using the target image set to obtain a detection model.
In this embodiment, the sample images in the obtained sample image set are pre-acquired images, and each sample image includes a plurality of labeled key points; that is, the key points in each sample image have been annotated. The key points may include the eyebrows, eyes, nose, mouth, face contour, and the like in the face image.
In this embodiment, the plurality of sample images may be divided into a plurality of image subsets in various ways: for example, according to the gender of the person in the face image, according to the rotation angle of the face image, according to the area ratio of the face image within the sample image, or according to other information of the sample images; this embodiment is not limited in this respect.
The image subsets obtained by such a division generally contain different numbers of sample images. Some sample image types appear frequently in practical applications yet are under-represented in the sample image set; in that case, the number of sample images in the corresponding subset can be increased by balancing the numbers of sample images across the subsets. Other subsets contain sample images whose key points are difficult to detect and locate in actual recognition, for example because the face images are occluded or rotated. Increasing the number of sample images in such a subset lets the model learn more of that subset's features, strengthening the model's subsequent detection and localization of images that share those features.
In this embodiment, the balancing of the number of sample images across the image subsets follows one main idea: the model should learn more about image types that occur frequently in practical applications, and more about image types that are hard to recognize. The main approach is therefore to increase the number of images in subsets whose samples belong to frequently occurring image types, or in subsets whose samples belong to image types that are hard to detect.
Of course, the balancing may also be driven by user needs, for example when the user needs emphasis on recognizing a particular class of images, such as images captured with the side of the face toward the camera, or face images containing only half of the face region. When the image subsets are divided, images of such a type can be grouped into the same subset and the number of images in that subset expanded, so that the model learns more of the image features of that particular type.
Each sample image in the balanced sample image set is then optimized with a preset transformation strategy to obtain the target image set, and the neural network model is finally trained with the target image set to obtain the detection model. The preset transformation strategy may include, for example, angle transformation, size transformation, and transformation of the extent of the detected face region. The aim is for the sample images to simulate facial features under different conditions, so that the final model is suitable for detecting images with those features under different conditions and is more robust.
In this embodiment, sample distribution balancing is performed on the sample image set and optimization is performed on each sample image, so that the obtained target image set is improved both in its sample distribution and in its individual sample images, which in turn improves the detection accuracy of the detection model obtained by training.
In this embodiment, rotation of the face during key point detection is likely to displace the key points and make detection inaccurate. When the sample distribution of the sample image set is balanced, the balancing can therefore be based on the rotation of the face images. Referring to fig. 3, in the present embodiment the number of sample images in the image subsets can be balanced in the following manner:
step S221, obtaining a rotation angle of the face image in each sample image on a horizontal plane.
Step S222, dividing the sample images whose rotation angle falls within a target preset range into a first image set, and dividing the other sample images in the sample image set into a second image set.
Step S223, increasing the sample images in the second image set to make the number of sample images in the second image set a preset multiple of the number of sample images in the first image set.
When a camera captures a face image, the key points are easy to detect and recognize if the face directly faces the camera. If the side of the face is toward the camera, or the face is rotated by some angle away from it, some key points may not be recognizable in the captured face image, or the key points will be displaced within it. The rotation angle of the face image therefore has a large influence on key point detection.
For each sample image, the rotation angle of the face image in the horizontal plane can be obtained. The rotation angle may be determined as follows: the distances between several key points on the face image, for example the distance between the two eyes or between the two corners of the mouth, are obtained in advance at different rotation angles, and an association between the different rotation angles and the different distances is established. When the rotation angle of the face image in a sample image is to be confirmed, it can then be determined from the distances between the key points in that face image.
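To make the lookup concrete, here is a minimal Python sketch of this idea; the key point indexing convention, the reference ratios, and the angle bins are illustrative assumptions rather than values given by this application:

```python
import numpy as np

# Hypothetical table built offline: horizontal rotation angle (degrees)
# -> expected inter-ocular distance as a fraction of the face-box width.
YAW_TO_EYE_RATIO = {0: 0.42, 30: 0.36, 60: 0.24, 90: 0.10}

def estimate_yaw(landmarks: np.ndarray, face_width: float) -> int:
    """Estimate the horizontal rotation angle from labeled key points.

    landmarks: (N, 2) array; indices 0 and 1 are assumed to be the
    left and right eye centers (an illustrative convention).
    """
    eye_dist = np.linalg.norm(landmarks[0] - landmarks[1])
    ratio = eye_dist / face_width
    # Pick the angle whose pre-measured ratio is closest to the observed one.
    return min(YAW_TO_EYE_RATIO, key=lambda a: abs(YAW_TO_EYE_RATIO[a] - ratio))
```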
Since key points are easily recognized in a face image captured while the face directly faces the camera or is rotated only slightly, in this embodiment the sample images whose face rotation angle falls within the target preset range may be divided into the first image set, and the remaining sample images into the second image set. The target preset range may be 0 to 30 degrees, although other ranges are possible; this embodiment is not limited in this respect.
The face images in the sample images divided into the second image set have large rotation angles and are difficult to detect in the actual recognition process. In this embodiment, the number of such samples is therefore increased so that the model can learn more of their feature information and detect and recognize such images more accurately. The number of sample images in the second image set may be increased to, without limitation, two times, three times, or another multiple of the number of sample images in the first image set.
In this embodiment, the difficulty of identifying key points in a face image increases with the rotation angle and becomes particularly severe when the rotation angle is very large. The sample images in the second image set may therefore be divided further: the second image set may include a first subset and a second subset. The rotation angle of the face images of the sample images in the first subset belongs to a first preset range, which may be 30 to 60 degrees, and the rotation angle of the face images in the second subset belongs to a second preset range, which may be 60 to 90 degrees.
When the first and second subsets are expanded, the number of sample images in the first subset may be increased to a first preset multiple, for example twice, of the number of sample images in the first image set, and the number of sample images in the second subset to a second preset multiple, for example three times, of that number.
In this way, the samples in the different subsets are expanded at a finer granularity according to their rotation-angle ranges, enlarging the number of sample images on which key point detection is difficult so that the model can learn more of their features.
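A minimal Python sketch of this binning-and-expansion step, assuming the ranges and multiples named above (0-30, 30-60, and 60-90 degrees; two and three times the first set's size) and assuming, since the application does not fix how new samples are produced, that a subset is expanded by duplicating randomly chosen members:

```python
import random

def balance_by_rotation(samples, rotation_of):
    """samples: list of sample records; rotation_of: function returning
    the face's horizontal rotation angle in degrees for a record."""
    first_set, subset_a, subset_b = [], [], []
    for s in samples:
        angle = abs(rotation_of(s))
        if angle <= 30:
            first_set.append(s)      # target preset range: easy samples
        elif angle <= 60:
            subset_a.append(s)       # first preset range of the second set
        else:
            subset_b.append(s)       # second preset range of the second set

    def expand(subset, target):
        # Duplicate random members until the subset reaches the target size.
        while subset and len(subset) < target:
            subset.append(random.choice(subset))
        return subset

    n = len(first_set)
    return first_set + expand(subset_a, 2 * n) + expand(subset_b, 3 * n)
```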
After the sample distribution balancing process, referring to fig. 4, the optimization process can be performed on each sample image in the following manner:
in step S231, the size of each sample image is adjusted to a preset size.
Step S232, for each sample image after size adjustment, rotating the face image in the sample image by a preset angle in the vertical direction.
Step S233, performing random occlusion processing on each sample image by using the occlusion pixel block to obtain a target image set.
In this embodiment, images input into a model generally have a standard size requirement, whereas the obtained sample images may vary in size, which hinders the model's feature learning. Accordingly, the size of each sample image may be adjusted to a preset size. For example, when the length and width of a sample image differ, the boundary of the shorter side may be expanded, taking the longer side as reference, so that the length and width match. On this basis, the length and width of the sample image can be reduced or enlarged together so that its size meets the preset size, for example 160 x 160 x 3.
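A minimal OpenCV sketch of this pad-then-scale resize; zero padding on the bottom and right is an assumed convention, and the target side of 160 follows the example size above:

```python
import cv2
import numpy as np

def resize_to_preset(img: np.ndarray, size: int = 160) -> np.ndarray:
    """Pad the short side to match the long side, then scale to size x size."""
    h, w = img.shape[:2]
    side = max(h, w)
    # Expand the short-side boundary with zeros so the image becomes square.
    padded = cv2.copyMakeBorder(img, 0, side - h, 0, side - w,
                                cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(padded, (size, size))
```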
Besides rotation in the horizontal plane, the face image in a sample image may also be rotated in the vertical direction, that is, the central vertical line of the face forms an angle with the vertical. This situation often occurs in face images to be recognized in real scenes. When a face image is rotated by some angle in the vertical direction, its key points are displaced and their relative positions change, making detection and recognition difficult.
Therefore, in the model training stage, rotating the face image in each sample image by a preset angle in the vertical direction, for example plus or minus 30 degrees, converts it into an image with a different rotation angle in the vertical direction. The model can then learn more features of images rotated in the vertical direction, which improves detection accuracy when images to be detected with these characteristics are later recognized by the model.
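A minimal OpenCV sketch of this in-plane rotation; rotating about the image center and drawing the angle uniformly from plus or minus 30 degrees are assumptions consistent with the example above. In a full pipeline the labeled key points would be transformed with the same matrix so that the annotations stay aligned:

```python
import random
import cv2
import numpy as np

def rotate_in_plane(img: np.ndarray, max_deg: float = 30.0):
    """Rotate the image by a random in-plane angle in [-max_deg, +max_deg]."""
    h, w = img.shape[:2]
    angle = random.uniform(-max_deg, max_deg)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h)), m  # m can also transform key points
```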
In addition, in practice the image to be detected may be partially occluded. For example, when the anchor is broadcasting live, a microphone or similar object in front of the anchor may block part of the face, or the anchor's waving hand may cover a face region. Both situations increase the difficulty of detecting and identifying the key points.
Therefore, in the embodiment, in the training stage of the model, the occlusion pixel blocks may be used to perform random occlusion processing on each sample image so as to simulate a face occlusion situation that may occur in an actual application scene.
In this embodiment, when a sample image is occluded, the occlusion pixel block may be generated from the obtained setting parameters, which include the color of the block, for example a black or white pixel block, and may further include its size, shape, and so on. To simulate the different occlusions that may arise in an actual application scene, the region where the occlusion pixel block is superimposed on the sample image is then determined from a generated random number, and the block is superimposed onto that region of the sample image.
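A minimal Python sketch of the random occlusion; a square block is assumed, with its color and side taken from the setting parameters and its position drawn from the random number:

```python
import random
import numpy as np

def random_occlude(img: np.ndarray, block_size: int = 32,
                   color: int = 0) -> np.ndarray:
    """Overlay a block_size x block_size patch of the given color
    at a randomly chosen location inside the image."""
    out = img.copy()
    h, w = out.shape[:2]
    y = random.randint(0, max(0, h - block_size))
    x = random.randint(0, max(0, w - block_size))
    out[y:y + block_size, x:x + block_size] = color
    return out
```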
After the obtained sample image set is subjected to the sample distribution balance processing and the optimization processing of the sample image, a target image set can be obtained, and then the constructed neural network model is trained by using the target image set to obtain a detection model.
In this embodiment, the constructed neural network model includes an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer. Referring to fig. 5, training of the neural network model can be achieved by:
step S241, inputting each sample image in the sample image set to the input layer for preprocessing, to obtain a preprocessed image.
Step S242 is performed to perform convolution processing and feature extraction processing on the input image by the network layer for each network layer, and output a feature image.
And step S243, carrying out fusion processing on the feature images output by each network layer in the fusion layer to obtain a fusion feature map, inputting the fusion feature map into the output layer to carry out key point classification, and obtaining the predicted key points of the sample image.
Step S244, calculating a loss function value between the predicted key points and the labeled key points of the sample image, performing back-propagation training according to the loss function value, updating the network parameters of the neural network model, and continuing training until a preset termination condition is met, thereby obtaining the detection model.
Referring to fig. 6, which schematically shows the network structure of the neural network model, the plurality of network layers may comprise Stages 1-5. The preprocessed image output by the input layer undergoes convolution and feature extraction in Stage1, which outputs a feature image; Stages 2-5 each perform convolution and feature extraction on the feature image output by the preceding network layer and output a feature image of their own. The fusion layer fuses the feature images output by the network layers; the figure schematically shows the fusion of the outputs of Stage3, Stage4, and Stage5. Key point classification is performed on the fused image in the output layer to obtain the predicted key points of the sample image.
In this embodiment, the neural network model may be supervised with a plurality of loss functions, such as Loss1-Loss4. This ensures that the features of each network layer account well for features at different receptive fields, so that after the subsequent fusion the resulting fused features are more effective.
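A minimal PyTorch sketch of one training step under this multi-loss supervision; the backbone stages are passed in as black boxes, and the 68-point landmark count, equal loss weighting, mean-squared-error loss, and global-average-pooling fusion are illustrative assumptions rather than details fixed by this application:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLandmarkNet(nn.Module):
    """Input -> Stage1..Stage5 -> fusion of the last three stages -> key points."""
    def __init__(self, stages, fused_channels, num_points=68):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(fused_channels, num_points * 2)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Fuse the outputs of Stage3-5 by pooling each to a vector and
        # concatenating along the channel dimension.
        fused = torch.cat([self.pool(f).flatten(1) for f in feats[-3:]], dim=1)
        return self.head(fused), feats

def train_step(model, aux_heads, images, gt_points, optimizer):
    """One back-propagation step with a main loss plus auxiliary losses
    (Loss1-Loss4 style) computed on intermediate stage features."""
    preds, feats = model(images)
    loss = F.mse_loss(preds, gt_points)
    for head, f in zip(aux_heads, feats):
        aux_pred = head(F.adaptive_avg_pool2d(f, 1).flatten(1))
        loss = loss + F.mse_loss(aux_pred, gt_points)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```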
In this embodiment, each network layer includes two network modules, a first network module and a second network module, where the processing policies for the images in the first network module and the second network module are different. Within each network layer, the output characteristic image of the network layer can be obtained by fusing the output characteristics of the two network modules.
Optionally, for each network layer, the first network module of the network layer may perform convolution processing and feature extraction processing on the input image to obtain a first feature map. And performing convolution processing and feature extraction processing on the input image by using a second network module of the network layer to obtain a second feature map. And finally, fusing the obtained first characteristic diagram and the second characteristic diagram, and outputting the characteristic image of the network layer.
Optionally, in this embodiment, the first network module may process the image with two different processing strategies and fuse the results to output its first feature map. Specifically, the image input to the first network module may undergo convolution and feature extraction under a first convolution processing strategy and, in parallel, under a second convolution processing strategy. The output image of the first strategy and the output image of the second strategy are then fused, the fused image undergoes channel random mixing, and the first feature map is output.
Referring to fig. 7, the first convolution processing strategy corresponds to the left processing flow in fig. 7: the image input to the first network module is first convolved with a 3 × 3 depthwise kernel (DWConv3 × 3), where the stride may be set to 2, and then convolved with a 1 × 1 kernel (Conv1 × 1) and subjected to excitation processing. The second convolution processing strategy corresponds to the right processing flow in fig. 7: the input image may first be convolved with a 1 × 1 kernel (Conv1 × 1) and excited, then convolved with a 3 × 3 depthwise kernel (DWConv3 × 3) whose stride may be set to 2, and finally convolved with a 1 × 1 kernel (Conv1 × 1) and excited.
Finally, the images output by the two branches are fused in the Concat layer, the fused image undergoes channel random mixing in the Channel Shuffle layer, and the first feature map is output.
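A minimal PyTorch sketch of the first network module as just described; it mirrors the stride-2 (downsampling) unit of ShuffleNet V2, and the branch channel sizes, batch normalization, and ReLU excitation are assumptions consistent with that design rather than details given by this application:

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups (the Channel Shuffle layer,
    i.e. the 'channel random mixing' processing)."""
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w)
    return x.transpose(1, 2).reshape(b, c, h, w)

class FirstNetworkModule(nn.Module):
    """Two stride-2 branches, Concat fusion, then channel shuffle."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 2
        # Left branch: DWConv3x3 (stride 2) -> Conv1x1 + excitation.
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        # Right branch: Conv1x1 + excitation -> DWConv3x3 (stride 2)
        # -> Conv1x1 + excitation.
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1,
                      groups=branch_ch, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        out = torch.cat([self.left(x), self.right(x)], dim=1)  # Concat layer
        return channel_shuffle(out, groups=2)
```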
For the second network module, the input image is first channel-separated to obtain a plurality of single-channel images. Each single-channel image undergoes convolution and feature extraction, each single-channel image is then fused with its convolved and feature-extracted counterpart, and finally the fused single-channel images undergo channel random mixing and the second feature map is output.
Referring to fig. 8, in the second network module the input image is first channel-separated in the Channel Split layer. In the right processing flow, a 1 × 1 convolution (Conv1 × 1) with excitation is applied, followed by a 3 × 3 depthwise convolution (DWConv3 × 3), and finally another 1 × 1 convolution (Conv1 × 1) with excitation. The image produced by this flow is then fused with its corresponding separated image in the Concat layer, and finally the Channel Shuffle layer applies channel random mixing to the fused channels and outputs the second feature map.
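A minimal PyTorch sketch of the second network module following the flow of fig. 8; interpreting the Channel Split as a two-way split (as in the ShuffleNet V2 basic unit) is an assumption on top of the text's per-channel description, as are the batch normalization and ReLU excitation. It reuses channel_shuffle from the previous sketch:

```python
import torch
import torch.nn as nn

class SecondNetworkModule(nn.Module):
    """Channel Split -> right conv flow -> Concat -> Channel Shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Right flow: Conv1x1 + excitation -> DWConv3x3 -> Conv1x1 + excitation.
        self.right_flow = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        left, right = x.chunk(2, dim=1)                     # Channel Split layer
        out = torch.cat([left, self.right_flow(right)], 1)  # Concat layer
        return channel_shuffle(out, groups=2)  # defined in the previous sketch
```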
By providing a plurality of network layers and fusing their output images, the neural network model of this embodiment can obtain features with different characteristics from different network layers, which improves the feature learning effect. Within each network layer, dimension reduction and channel mixing improve the fusion characteristics of the finally obtained features.
Furthermore, supervision with a plurality of loss functions ensures that the features in each network layer account well for the features at different receptive fields, so that the fused overall features have good characteristics.
Referring to fig. 9, a schematic diagram of exemplary components of an electronic device according to an embodiment of the present application is shown; the electronic device may be the live broadcast providing terminal 100, the live broadcast receiving terminal 300, or the live broadcast server 200 shown in fig. 1. The electronic device may include a storage medium 110, a processor 120, a detection model training apparatus 130, and a communication interface 140. In this embodiment, the storage medium 110 and the processor 120 are both located in the electronic device and are separately disposed. However, it should be understood that the storage medium 110 may also be separate from the electronic device and accessed by the processor 120 through a bus interface. Alternatively, the storage medium 110 may be integrated into the processor 120, for example as a cache and/or general purpose registers.
The detection model training apparatus 130 may be understood as the electronic device or the processor 120 of the electronic device, or may be understood as a software functional module that is independent of the electronic device or the processor 120 and implements the detection model training method under the control of the electronic device.
As shown in fig. 10, the detection model training apparatus 130 may include an obtaining module 131, a balance processing module 132, an optimization processing module 133, and a training module 134. The functions of the functional modules of the detection model training apparatus 130 are described in detail below.
The obtaining module 131 is configured to obtain a sample image set, where the sample image set includes a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets. It is understood that the obtaining module 131 may be configured to perform the step S210, and for a detailed implementation of the obtaining module 131, reference may be made to the content related to the step S210.
A balance processing module 132, configured to balance the number of sample images in the plurality of image subsets. It is understood that the balancing processing module 132 may be configured to perform the step S220, and for the detailed implementation of the balancing processing module 132, reference may be made to the content related to the step S220.
And the optimization processing module 133 is configured to perform optimization processing on each sample image according to a preset transformation strategy to obtain a target image set. It is understood that the optimization module 133 can be used to execute the step S230, and the detailed implementation of the optimization module 133 can refer to the content related to the step S230.
And the training module 134 is configured to train a pre-constructed neural network model by using the target image set to obtain a detection model. It is understood that the training module 134 can be used to perform the step S240, and the detailed implementation of the training module 134 can refer to the content related to the step S240.
Further, an embodiment of the present application also provides a computer-readable storage medium, where machine-executable instructions are stored in the computer-readable storage medium, and when the machine-executable instructions are executed, the detection model training method provided in the foregoing embodiment is implemented.
In summary, the embodiments of the present application provide a detection model training method, an apparatus, an electronic device, and a readable storage medium. A plurality of sample images in an obtained sample image set are divided into a plurality of image subsets, the number of sample images in those subsets is balanced to resolve the uneven sample distribution of the obtained set, and each sample image is optimized according to a preset transformation strategy to obtain a target image set. A pre-constructed neural network model is then trained with the target image set to obtain a detection model. Because this training scheme both balances the sample distribution of the sample image set and optimizes each individual sample image, the obtained target image set is improved in both respects, which improves the detection accuracy of the detection model obtained by training.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A detection model training method, the method comprising:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets;
balancing the number of sample images in the plurality of image subsets;
carrying out optimization processing on each sample image according to a preset transformation strategy to obtain a target image set;
and training a pre-constructed neural network model by using the target image set to obtain a detection model.
2. The detection model training method of claim 1, wherein the step of balancing the number of sample images in the plurality of image subsets comprises:
obtaining the rotation angle of the face image in each sample image on the horizontal plane;
dividing the sample images of the target preset range to which the rotation angle belongs in the sample image set into a first image set, and dividing other sample images in the sample image set into a second image set;
and increasing the sample images in the second image set to enable the number of the sample images in the second image set to be a preset multiple of the number of the sample images in the first image set.
3. The detection model training method according to claim 2, wherein the second image set includes a first subset and a second subset, the rotation angle of the face image of the sample image in the first subset in the horizontal plane belongs to a first preset range, and the rotation angle of the face image of the sample image in the second subset in the horizontal plane belongs to a second preset range;
the step of increasing the number of sample images in the second image set to be a preset multiple of the number of sample images in the first image set includes:
increasing the sample images in the first subset to make the number of the sample images in the first subset a first preset multiple of the number of the sample images in the first image set;
and increasing the sample images in the second subset to enable the number of the sample images in the second subset to be a second preset multiple of the number of the sample images in the first image set.
4. The detection model training method according to claim 1, wherein the step of performing optimization processing on each sample image according to a preset transformation strategy to obtain a target image set comprises:
adjusting the size of each sample image to a preset size;
for each sample image after size adjustment, rotating a face image in the sample image by a preset angle in the vertical direction;
and performing random occlusion processing on each sample image by using an occlusion pixel block to obtain a target image set.
5. The detection model training method according to claim 4, wherein the step of performing random occlusion processing on each sample image by using occlusion pixel blocks comprises:
generating an occlusion pixel block according to the obtained setting parameters;
for each sample image, determining a superposition area of the occlusion pixel block in the sample image according to a generated random number;
and superimposing the occlusion pixel block onto the superposition area in the sample image.
6. The detection model training method according to claim 1, wherein the pre-constructed neural network model comprises an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer; each sample image carries a plurality of labeled key points; and the step of training the pre-constructed neural network model by using the target image set to obtain the detection model comprises:
inputting each sample image in the target image set into the input layer for preprocessing to obtain a preprocessed image;
for each network layer, performing convolution processing and feature extraction processing on the input image through the network layer, and outputting a feature image;
fusing, in the fusion layer, the feature images output by the network layers to obtain a fused feature map, and inputting the fused feature map into the output layer for key point classification to obtain the predicted key points of the sample image;
and calculating a loss function value between the predicted key points and the labeled key points of the sample image, performing back-propagation training according to the loss function value, updating the network parameters of the neural network model, and continuing training until a preset termination condition is met, thereby obtaining the detection model.
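
Illustrative sketch (not part of the claims): a minimal PyTorch rendering of claim 6's topology and training step. Three stacked convolutions stand in for the network layers, channel concatenation for the fusion layer, and an MSE regression loss for the key-point loss; the patent's actual layer designs (claims 7-9) and its "key point classification" output are not reproduced here, and all sizes are hypothetical.

    import torch
    import torch.nn as nn

    class KeypointNet(nn.Module):
        def __init__(self, num_points=68):
            super().__init__()
            self.stem = nn.Conv2d(3, 16, 3, padding=1)               # input layer
            self.layers = nn.ModuleList(
                nn.Conv2d(16, 16, 3, padding=1) for _ in range(3))   # network layers
            self.head = nn.Sequential(                               # output layer
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16 * 3, num_points * 2))

        def forward(self, x):
            x = torch.relu(self.stem(x))
            feats = []
            for layer in self.layers:            # collect every layer's feature image
                x = torch.relu(layer(x))
                feats.append(x)
            fused = torch.cat(feats, dim=1)      # fusion layer (concatenation)
            return self.head(fused)

    model = KeypointNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    images = torch.randn(4, 3, 112, 112)         # stand-in target-image batch
    labels = torch.randn(4, 68 * 2)              # stand-in labeled key points
    loss = nn.functional.mse_loss(model(images), labels)
    opt.zero_grad()
    loss.backward()                              # back-propagation training
    opt.step()                                   # update network parameters
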
7. The detection model training method according to claim 6, wherein each network layer comprises a first network module and a second network module, and the step of performing convolution processing and feature extraction processing on the input image through the network layer and outputting a feature image comprises:
for each network layer, performing convolution processing and feature extraction processing on the input image through the first network module of the network layer to obtain a first feature map;
performing convolution processing and feature extraction processing on the input image through the second network module of the network layer to obtain a second feature map;
and fusing the first feature map and the second feature map, and outputting the feature image of the network layer.
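
Illustrative sketch (not part of the claims): claim 7's dual-module layer, with element-wise addition as one possible fusion; the kernel sizes are hypothetical.

    import torch
    import torch.nn as nn

    class DualModuleLayer(nn.Module):
        def __init__(self, ch=16):
            super().__init__()
            self.module1 = nn.Conv2d(ch, ch, 3, padding=1)   # first network module
            self.module2 = nn.Conv2d(ch, ch, 5, padding=2)   # second network module

        def forward(self, x):
            first = torch.relu(self.module1(x))    # first feature map
            second = torch.relu(self.module2(x))   # second feature map
            return first + second                  # fused feature image of the layer
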
8. The detection model training method according to claim 7, wherein the step of performing convolution processing and feature extraction processing on the input image through the first network module of the network layer to obtain the first feature map comprises:
performing convolution processing and feature extraction processing on the image input to the first network module using a first convolution processing strategy, and performing convolution processing and feature extraction processing on the same image using a second convolution processing strategy;
performing fusion processing on the output image obtained with the first convolution processing strategy and the output image obtained with the second convolution processing strategy;
and performing random channel mixing processing on the fused image, and outputting the first feature map.
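
Illustrative sketch (not part of the claims): claim 8's first network module, reading the two "convolution processing strategies" as convolutions of different kernel size and the random channel mixing as a ShuffleNet-style channel shuffle, which is a deterministic stand-in for the claimed operation.

    import torch
    import torch.nn as nn

    def channel_shuffle(x, groups):
        # Interleave channel groups (ShuffleNet-style mixing).
        n, c, h, w = x.shape
        return (x.view(n, groups, c // groups, h, w)
                 .transpose(1, 2).reshape(n, c, h, w))

    class FirstModule(nn.Module):
        def __init__(self, ch=16):
            super().__init__()
            self.strategy1 = nn.Conv2d(ch, ch, 3, padding=1)  # first convolution strategy
            self.strategy2 = nn.Conv2d(ch, ch, 1)             # second convolution strategy

        def forward(self, x):
            fused = torch.relu(self.strategy1(x)) + torch.relu(self.strategy2(x))
            return channel_shuffle(fused, groups=4)           # channel mixing
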
9. The detection model training method according to claim 7, wherein the step of performing convolution processing and feature extraction processing on the input image through the second network module of the network layer to obtain a second feature map comprises:
performing channel separation processing on the input image through the second network module of the network layer to obtain a plurality of single-channel images;
performing convolution processing and feature extraction processing on each single-channel image;
performing fusion processing on each single-channel image and its counterpart after the convolution processing and feature extraction processing;
and performing random channel mixing processing on the plurality of fused single-channel images, and outputting the second feature map.
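
Illustrative sketch (not part of the claims): claim 9's second network module, reading the channel separation plus per-channel convolution as a depthwise convolution (groups equal to the channel count) with a per-channel residual fusion; the torch imports and channel_shuffle helper from the previous sketch are reused.

    class SecondModule(nn.Module):
        def __init__(self, ch=16):
            super().__init__()
            # groups=ch convolves every channel independently, i.e. channel
            # separation followed by per-channel convolution in one call.
            self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)

        def forward(self, x):
            fused = x + torch.relu(self.depthwise(x))  # fuse each channel with its convolved version
            return channel_shuffle(fused, groups=4)    # mix the fused channels
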
10. A detection model training apparatus, comprising:
an acquisition module, configured to acquire a sample image set, wherein the sample image set comprises a plurality of sample images, and the plurality of sample images are divided into a plurality of image subsets;
a balance processing module, configured to balance the number of sample images in the plurality of image subsets;
an optimization processing module, configured to perform optimization processing on each sample image according to a preset transformation strategy to obtain a target image set;
and a training module, configured to train a pre-constructed neural network model by using the target image set to obtain a detection model.
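
Illustrative sketch (not part of the claims): the four modules of the claimed apparatus wired into one pipeline; all callables are hypothetical placeholders.

    def run_training_pipeline(acquire, balance, transform, train):
        sample_set = acquire()                          # acquisition module
        balanced = balance(sample_set)                  # balance processing module
        target_set = [transform(s) for s in balanced]   # optimization processing module
        return train(target_set)                        # training module -> detection model
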
11. An electronic device, comprising one or more storage media and one or more processors in communication with the storage media, wherein the one or more storage media store machine-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the detection model training method of any one of claims 1-9.
12. A computer-readable storage medium having stored thereon machine-executable instructions which, when executed, implement the detection model training method of any one of claims 1-9.
CN202010074476.5A 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium Active CN111325107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010074476.5A CN111325107B (en) 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111325107A 2020-06-23
CN111325107B 2023-05-23

Family

ID=71172111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010074476.5A Active CN111325107B (en) 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111325107B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871100A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 The training method and device of faceform, face authentication method and device
CN107871101A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108898579A (en) * 2018-05-30 2018-11-27 腾讯科技(深圳)有限公司 A kind of image definition recognition methods, device and storage medium
CN110288082A (en) * 2019-06-05 2019-09-27 北京字节跳动网络技术有限公司 Convolutional neural networks model training method, device and computer readable storage medium
CN110598638A (en) * 2019-09-12 2019-12-20 Oppo广东移动通信有限公司 Model training method, face gender prediction method, device and storage medium
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860431A (en) * 2020-07-30 2020-10-30 浙江大华技术股份有限公司 Method and device for identifying object in image, storage medium and electronic device
CN111860431B (en) * 2020-07-30 2023-12-12 浙江大华技术股份有限公司 Method and device for identifying object in image, storage medium and electronic device
CN112348765A (en) * 2020-10-23 2021-02-09 深圳市优必选科技股份有限公司 Data enhancement method and device, computer readable storage medium and terminal equipment
CN113011298A (en) * 2021-03-09 2021-06-22 北京百度网讯科技有限公司 Truncated object sample generation method, target detection method, road side equipment and cloud control platform
CN113011298B (en) * 2021-03-09 2023-12-22 阿波罗智联(北京)科技有限公司 Truncated object sample generation, target detection method, road side equipment and cloud control platform
CN114067370A (en) * 2022-01-17 2022-02-18 北京新氧科技有限公司 Neck shielding detection method and device, electronic equipment and storage medium
CN114067370B (en) * 2022-01-17 2022-06-21 北京新氧科技有限公司 Neck shielding detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111325107B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN109325933B (en) Method and device for recognizing copied image
Matern et al. Exploiting visual artifacts to expose deepfakes and face manipulations
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US10832069B2 (en) Living body detection method, electronic device and computer readable medium
CN112052839B (en) Image data processing method, apparatus, device and medium
TWI766201B (en) Methods and devices for biological testing and storage medium thereof
CN111325107A (en) Detection model training method and device, electronic equipment and readable storage medium
WO2022156640A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
JP2020523665A (en) Biological detection method and device, electronic device, and storage medium
WO2022156626A1 (en) Image sight correction method and apparatus, electronic device, computer-readable storage medium, and computer program product
EP4345777A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
CN107959798B (en) Video data real-time processing method and device and computing equipment
CN111680675B (en) Face living body detection method, system, device, computer equipment and storage medium
WO2021051547A1 (en) Violent behavior detection method and system
CN110738116B (en) Living body detection method and device and electronic equipment
CN114092678A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115187497A (en) Smoking detection method, system, device and medium
CN114913470A (en) Event detection method and device
US20220028109A1 (en) Image processing method and apparatus
JP2023512359A (en) Associated object detection method and apparatus
CN117561547A (en) Scene determination method, device and computer readable storage medium
CN115482285A (en) Image alignment method, device, equipment and storage medium
Sacht et al. Face and straight line detection in equirectangular images
CN113095347A (en) Deep learning-based mark recognition method and training method, system and electronic equipment thereof
JP6132996B1 (en) Image processing apparatus, image processing method, and image processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant