CN111325107B - Detection model training method, device, electronic equipment and readable storage medium - Google Patents

Detection model training method, device, electronic equipment and readable storage medium

Info

Publication number
CN111325107B
CN111325107B (application CN202010074476.5A; also published as CN111325107A)
Authority
CN
China
Prior art keywords
image
sample
processing
images
image set
Prior art date
Legal status
Active
Application number
CN202010074476.5A
Other languages
Chinese (zh)
Other versions
CN111325107A (en)
Inventor
奉万森 (Feng Wansen)
Current Assignee
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202010074476.5A priority Critical patent/CN111325107B/en
Publication of CN111325107A publication Critical patent/CN111325107A/en
Application granted granted Critical
Publication of CN111325107B publication Critical patent/CN111325107B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

An embodiment of the application provides a detection model training method and apparatus, an electronic device, and a readable storage medium. A plurality of sample images in an acquired sample image set are divided into a plurality of image subsets, and the number of sample images in those subsets is balanced, which mitigates the uneven sample distribution of the acquired set; each sample image is then optimized according to a preset transformation strategy to obtain a target image set. A pre-constructed neural network model is trained with the target image set to obtain a detection model. Because this training scheme balances the sample distribution of the sample image set and also optimizes each individual sample image, the obtained target image set is improved both in its sample distribution and at the level of single sample images, which in turn improves the detection accuracy of the trained detection model.

Description

Detection model training method, device, electronic equipment and readable storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a detection model training method, a detection model training device, an electronic device, and a readable storage medium.
Background
Determining facial key points is an important step in face image processing. At present, key point detection is generally performed with artificial-intelligence algorithms: a model is first trained with a sample image set to obtain a detection model, and the detection model is then used to recognize and detect the image to be processed so as to locate the facial key points. The samples in the sample image set largely determine the recognition accuracy of the resulting detection model. In current detection approaches, however, the processing of the sample image set is often neglected: the samples are usually obtained at random, the sample distribution is uneven, and individual samples are not optimal, so the detection accuracy of the resulting detection model is poor.
Disclosure of Invention
The purpose of the application includes, for example, providing a detection model training method, a detection model training device, an electronic device, and a readable storage medium that can optimize a sample image set and thereby improve the detection accuracy of the trained detection model.
Embodiments of the present application may be implemented as follows:
in a first aspect, an embodiment provides a method for training a detection model, the method comprising:
obtaining a sample image set, the sample image set comprising a plurality of sample images, the plurality of sample images divided into a plurality of image subsets;
performing balance processing on the number of sample images in the plurality of image subsets;
optimizing each sample image according to a preset transformation strategy to obtain a target image set;
training a pre-constructed neural network model by using the target image set to obtain a detection model.
In an alternative embodiment, the step of balancing the number of sample images in the plurality of image subsets includes:
acquiring the rotation angle of the face image in each sample image on a horizontal plane;
dividing the sample images in the sample image set whose rotation angle belongs to a target preset range into a first image set, and dividing the other sample images in the sample image set into a second image set;
and increasing the sample images in the second image set so that the number of the sample images in the second image set is a preset multiple of the number of the sample images in the first image set.
In an optional embodiment, the second image set includes a first subset and a second subset, where a rotation angle of a face image of the sample image in the first subset on a horizontal plane belongs to a first preset range, and a rotation angle of a face image of the sample image in the second subset on a horizontal plane belongs to a second preset range;
the step of increasing the number of sample images in the second image set so that the number of sample images in the second image set is a preset multiple of the number of sample images in the first image set includes:
increasing the sample images in the first subset so that the number of sample images in the first subset is a first preset multiple of the number of sample images in the first image set;
and increasing the sample images in the second subset so that the number of the sample images in the second subset is a second preset multiple of the number of the sample images in the first image set.
In an optional implementation manner, the step of optimizing each sample image according to a preset transformation policy to obtain a target image set includes:
adjusting the size of each sample image to a preset size;
for each sample image with the adjusted size, rotating the face image in the sample image by a preset angle in the vertical direction;
and performing random occlusion processing on each sample image by using occlusion pixel blocks to obtain a target image set.
In an alternative embodiment, the step of performing random occlusion processing on each sample image by using an occlusion pixel block includes:
generating an occlusion pixel block according to the obtained setting parameters;
for each sample image, determining a superposition area of the occlusion pixel block in the sample image according to a generated random number;
and superimposing the occlusion pixel block onto the superposition area in the sample image.
In an alternative embodiment, the pre-constructed neural network model includes an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer, each sample image includes a plurality of labeled key points, and the step of training the pre-constructed neural network model by using the target image set to obtain a detection model includes:
inputting each sample image in the sample image set into the input layer for preprocessing to obtain a preprocessed image;
for each network layer, carrying out convolution processing and feature extraction processing on the input image through the network layer, and outputting a feature image;
performing fusion processing in the fusion layer on the feature images output by each network layer to obtain a fused feature image, and inputting the fused feature image to the output layer for key point classification to obtain the predicted key points of the sample image;
and calculating the loss function value between the predicted key points and the labeled key points of the sample image, performing back-propagation training according to the loss function value to update the network parameters of the neural network model, and continuing training until a preset termination condition is met, so as to obtain the detection model.
In an alternative embodiment, each network layer includes a first network module and a second network module, and the step of outputting the feature image includes, for each network layer, performing convolution processing and feature extraction processing on the input image through the network layer:
for each network layer, performing convolution processing and feature extraction processing on an input image through a first network module of the network layer to obtain a first feature map;
performing convolution processing and feature extraction processing on the input image by using a second network module of the network layer to obtain a second feature map;
and carrying out fusion processing on the first characteristic image and the second characteristic image, and outputting the characteristic image of the network layer.
In an optional embodiment, the step of performing convolution processing and feature extraction processing on the input image by the first network module of the network layer to obtain a first feature map includes:
performing convolution processing and feature extraction processing on the image input to the first network module by using a first convolution processing strategy, and performing convolution processing and feature extraction processing on the image input to the first network module by using a second convolution processing strategy;
performing fusion processing on the output image obtained by the processing of the first convolution processing strategy and the output image obtained by the processing of the second convolution processing strategy;
and carrying out channel random mixing processing on the fused image, and outputting a first feature map.
In an optional implementation manner, the step of performing convolution processing and feature extraction processing on the input image by using the second network module of the network layer to obtain a second feature map includes:
carrying out channel separation processing on the input image by utilizing a second network module of the network layer to obtain a plurality of single-channel images;
carrying out convolution processing and feature extraction processing on each single-channel image;
the single-channel images and the images after the single-channel images are subjected to convolution processing and feature extraction processing are subjected to fusion processing;
and carrying out channel random mixing processing on the plurality of single-channel images after the fusion processing, and outputting a second characteristic diagram.
In a second aspect, an embodiment provides a detection model training apparatus, the apparatus comprising:
an acquisition module for acquiring a sample image set comprising a plurality of sample images divided into a plurality of image subsets;
a balancing processing module, configured to perform balancing processing on the number of sample images in the plurality of image subsets;
the optimization processing module is used for carrying out optimization processing on each sample image according to a preset transformation strategy to obtain a target image set;
and the training module is used for training the pre-constructed neural network model by utilizing the target image set to obtain a detection model.
In a third aspect, an embodiment provides an electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions that, when the electronic device runs, are executed by the processors to perform the detection model training method of any of the preceding embodiments.
In a fourth aspect, embodiments provide a computer readable storage medium storing machine executable instructions that when executed implement the detection model training method of any of the preceding embodiments.
The beneficial effects of the embodiment of the application include, for example:
the embodiments of the application provide a detection model training method and apparatus, an electronic device, and a readable storage medium. A plurality of sample images in an acquired sample image set are divided into a plurality of image subsets, and the number of sample images in those subsets is balanced, which mitigates the uneven sample distribution of the acquired set; each sample image is then optimized according to a preset transformation strategy to obtain a target image set. A pre-constructed neural network model is trained with the target image set to obtain a detection model. Because this training scheme balances the sample distribution of the sample image set and also optimizes each individual sample image, the obtained target image set is improved both in its sample distribution and at the level of single sample images, which in turn improves the detection accuracy of the trained detection model.
Drawings
In order to illustrate the technical solutions of the embodiments of the application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the application and should therefore not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of the detection model training method provided in an embodiment of the present application;
Fig. 2 is a flowchart of the detection model training method provided in an embodiment of the present application;
Fig. 3 is a flowchart of the sub-steps of step S220 in Fig. 2;
Fig. 4 is a flowchart of the sub-steps of step S230 in Fig. 2;
Fig. 5 is a flowchart of the sub-steps of step S240 in Fig. 2;
Fig. 6 is a schematic diagram of the network structure of the neural network model provided in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the first network module in the neural network model provided in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of the second network module in the neural network model provided in an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
Fig. 10 is a functional block diagram of the detection model training device provided in an embodiment of the present application.
Icon: 100-live broadcast providing terminal; 200-live broadcast server; 300-live broadcast receiving terminal; 110-storage medium; 120-processor; 130-detection model training device; 131-acquisition module; 132-balance processing module; 133-optimization processing module; 134-training module; 140-communication interface.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present application, it should be noted that, if the terms "first," "second," and the like are used merely to distinguish between descriptions, they are not to be construed as indicating or implying relative importance. It should be noted that, without conflict, features in embodiments of the present application may be combined with each other.
The detection model training method provided by the embodiments of the application can be applied in various scenarios, such as image processing applications, live streaming applications, and access control applications, in which a face image needs to be processed so that the key points of the face image can be tracked and located, and subsequent processing can be performed based on the located key points. For example, in an image processing application, the key points located in a face image may be used for beautification, such as enlarging the eye area or slimming the face contour. In a live streaming application, key points may be located on the face image of the anchor so that the face image can be processed based on them. In the following description, a live streaming application is used as the example application scenario.
Referring to Fig. 1, a schematic diagram of a possible application scenario of the detection model training method provided in the embodiments of the application is shown; the scenario includes a live broadcast providing terminal 100, a live broadcast server 200, and a live broadcast receiving terminal 300. The live broadcast server 200 is communicatively connected to the live broadcast providing terminal 100 and the live broadcast receiving terminal 300, respectively, and provides live broadcast services for both. For example, the live broadcast providing terminal 100 may transmit a live video stream to the live broadcast server 200, and a viewer may access the live broadcast server 200 through the live broadcast receiving terminal 300 to watch the live video. The live video stream pushed by the live broadcast server 200 may be a video stream currently being broadcast on the live platform, or a complete video stream formed after a live broadcast ends. It will be appreciated that the scenario shown in Fig. 1 is only one possible example; in other possible embodiments, the scenario may include only some of the components shown in Fig. 1 or may include other components as well.
In this embodiment, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may be, but are not limited to, a smart phone, a personal digital assistant, a tablet computer, a personal computer, a notebook computer, a virtual reality terminal device, an augmented reality terminal device, and the like. Among them, the live broadcast providing terminal 100 and the live broadcast receiving terminal 300 may have installed therein an internet product for providing an internet live broadcast service, for example, the internet product may be an application APP, a Web page, an applet, etc. related to the internet live broadcast service used in a computer or a smart phone.
In this embodiment, a video capturing device for capturing a video frame of a host may be further included in the scene, and the video capturing device may be, but is not limited to, a camera, a lens of a digital camera, a monitoring camera, or a network camera. The video capture device may be directly installed or integrated with the live providing terminal 100. For example, the video capture device may be a camera configured on the live broadcast providing terminal 100, and other modules or components in the live broadcast providing terminal 100 may receive video, images transmitted from the video capture device via an internal bus. Alternatively, the video capture device may be independent of the live broadcast providing terminal 100, and the two may communicate through a wired or wireless manner.
Fig. 2 shows a flowchart of the detection model training method provided in an embodiment of the present application; the method may be performed by the live broadcast providing terminal 100, the live broadcast receiving terminal 300, or the live broadcast server 200 shown in Fig. 1. It should be understood that, in other embodiments, the order of some steps of the detection model training method may be exchanged according to actual needs, or some steps may be omitted. The detailed steps of the detection model training method are described below.
In step S210, a sample image set is acquired, the sample image set containing a plurality of sample images, the plurality of sample images being divided into a plurality of image subsets.
Step S220, performing a balancing process on the number of sample images in the plurality of image subsets.
Step S230, performing optimization processing on each sample image according to a preset transformation strategy, so as to obtain a target image set.
And step S240, training a pre-constructed neural network model by using the target image set to obtain a detection model.
In this embodiment, the sample images in the obtained sample image set are pre-collected images, and each sample image includes a plurality of labeled key points; that is, the key points in each sample image have been labeled. The key points may include the eyebrows, eyes, nose, mouth, facial contour, and the like in the face image.
In this embodiment, the plurality of sample images may be divided into a plurality of image subsets according to, for example, the sex of the person in the face image, the rotation angle of the face image, the area proportion of the face image within the sample image, or other information of the sample images.
For the image subsets obtained by the division, the number of sample images each subset contains is generally uneven. Some subsets correspond to image types that appear frequently in practical applications yet contain relatively few sample images; in this case, the balancing processing may increase the number of sample images in those subsets. Other subsets contain sample images whose key points are difficult to detect and locate in actual detection and recognition, for example because the face in the image is occluded or rotated. In this case too, the number of sample images in such subsets may be increased, so that the model can learn more of their features and the subsequent detection and localization of images with the same characteristics is strengthened.
The balancing processing performed on the number of sample images of the plurality of image subsets in this embodiment is thus mainly based on the idea of letting the model learn more features of image types that occur frequently in practical applications, or more features of image types that are difficult to recognize. Concretely, the number of images is increased in subsets whose sample images belong to image types common in practice, or in subsets whose sample images belong to image types that are difficult to detect and recognize.
Of course, the balancing processing may also be driven by user requirements. For example, when the user needs focused recognition and detection of a specific image type, such as images captured with the face sideways to the camera, or face images containing only half of the face area, images of that type can be divided into the same image subset when the subsets are formed, and the number of images in that subset can be expanded, so that the model learns more image features of that specific type.
Each sample image in the balanced sample image set is then optimized with a preset transformation strategy to obtain a target image set, and the neural network model is finally trained with the target image set to obtain the detection model. The preset transformation strategy may include, for example, angle transformation, size transformation, and transformation of the detected face region range. The aim is to let the sample images simulate facial characteristics under different conditions, so that the final model is applicable to detecting images with different characteristics and its robustness is improved.
In this embodiment, by performing sample distribution balancing processing on the sample image set and performing optimization processing on each sample image, the obtained target image set is optimized on sample distribution and a single sample image, so as to improve the detection accuracy of the detection model obtained by training.
In this embodiment, it is considered that when face key point detection is performed, rotation of the face easily shifts the key points, which makes detection inaccurate. Therefore, when sample distribution balancing processing is performed on the sample image set, the balancing can focus on the rotation of the face images. Referring to Fig. 3, in this embodiment the number of sample images in the plurality of image subsets may be balanced as follows:
step S221, obtaining a rotation angle of the face image in each sample image on a horizontal plane.
Step S222, dividing the sample images in the sample image set whose rotation angle belongs to a target preset range into a first image set, and dividing the other sample images in the sample image set into a second image set.
Step S223, increasing the sample images in the second image set so that the number of sample images in the second image set is a preset multiple of the number of sample images in the first image set.
When an image capturing device captures a face image, the key points in the captured image are easier to detect and recognize if the face directly faces the device; if the face is turned sideways to the device, or rotated towards it by some angle, some key points may not be recognizable in the captured face image, or the key points may be shifted within it. The rotation angle of the face image therefore has a large influence on key point detection.
For each sample image, the rotation angle of the face image on the horizontal plane can be obtained. The angle can be estimated from distances between key points: the distances between several key points on a face image (for example, the distance between the eyes, or between the two corners of the mouth) are measured in advance at different rotation angles, and an association between the different rotation angles and the different distances is established. When the rotation angle of the face image in a sample image is to be confirmed, it can then be determined from the distances between the key points in the face image.
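As a concrete illustration of such an association, the following Python sketch estimates the horizontal rotation (yaw) angle of a face by interpolating a pre-measured lookup table of key-point distance ratios. It is only a minimal sketch: the ratio definition, the choice of key points, and the table values are illustrative assumptions, not values taken from this application.

```python
import numpy as np

# Hypothetical association table, measured in advance: ratio of the
# eye-to-eye distance to the eye-to-mouth distance at known yaw angles.
# The numeric values are placeholders, not taken from this application.
YAW_RATIO_TABLE = [
    (0.0, 1.10),   # frontal face
    (30.0, 0.85),
    (60.0, 0.50),
    (90.0, 0.15),  # full profile
]

def estimate_yaw(left_eye, right_eye, mouth_left, mouth_right):
    """Estimate the rotation angle of a face on the horizontal plane from
    distances between labeled key points, by interpolating the table that
    associates distance ratios with known rotation angles."""
    left_eye, right_eye = np.asarray(left_eye), np.asarray(right_eye)
    mouth_left, mouth_right = np.asarray(mouth_left), np.asarray(mouth_right)
    eye_dist = np.linalg.norm(right_eye - left_eye)
    face_height = np.linalg.norm((left_eye + right_eye) / 2
                                 - (mouth_left + mouth_right) / 2)
    ratio = eye_dist / max(face_height, 1e-6)
    angles = np.array([a for a, _ in YAW_RATIO_TABLE])
    ratios = np.array([r for _, r in YAW_RATIO_TABLE])
    # np.interp needs ascending x-coordinates; ratios descend with angle.
    return float(np.interp(ratio, ratios[::-1], angles[::-1]))
```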
In this embodiment, since the key points of a face captured directly facing the image capturing device, or rotated only by a small angle, are easy to recognize, the sample images whose face rotation angle belongs to a target preset range may be divided into a first image set, and the remaining sample images into a second image set. The target preset range may be 0 to 30 degrees, although other numerical ranges are possible; this embodiment does not limit it.
The face images in the sample images divided into the second image set have larger rotation angles and are difficult to detect in the actual detection and recognition process. Therefore, in this embodiment, increasing the number of such samples lets the model learn more of the feature information of these images, so that they can subsequently be detected and recognized more accurately. The number of sample images in the second image set may be increased to, for example, two or three times the number of sample images in the first image set; this is not limited here.
In this embodiment, considering that the difficulty in identifying key points in the face image increases with the increase of the rotation angle, it is particularly difficult to detect and identify when the rotation angle is very large. Thus, in this embodiment, the sample images in the second image set may be further divided, and the second image set may include the first subset and the second subset. The rotation angle of the face image of the sample image in the first subset on the horizontal plane belongs to a first preset range, and the first preset range can be 30 degrees to 60 degrees. And the rotation angle of the face image of the sample image in the second subset on the horizontal plane belongs to a second preset range, which may be 60 degrees to 90 degrees.
When expanding the number of sample images in the first subset and the second subset, the number of sample images in the first subset may be increased to a first preset multiple, e.g. twice, the number of sample images in the first image set. Also, the number of sample images in the second subset may be increased by a second preset multiple, e.g. three times, the number of sample images in the first image set.
In this way, sample expansion is performed for different subsets according to their rotation-angle ranges, and expanding the number of sample images whose key points are difficult to detect lets the model learn more of the features of such images.
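As an illustration of this balancing, the following Python sketch splits samples by a pre-computed yaw angle and oversamples the rotated subsets. The `yaw` field, the duplication-based expansion, and the default multiples are assumptions chosen to mirror the examples above (0-30, 30-60 and 60-90 degrees; two and three times).

```python
import random

def balance_by_rotation(samples, mid_multiple=2, large_multiple=3):
    """Split samples into subsets by the yaw angle of the face image
    (0-30, 30-60 and 60-90 degrees) and oversample the two rotated
    subsets to preset multiples of the near-frontal first image set."""
    first, mid, large = [], [], []
    for s in samples:
        angle = abs(s["yaw"])       # assumed pre-computed per sample
        if angle < 30:
            first.append(s)
        elif angle < 60:
            mid.append(s)
        else:
            large.append(s)

    def grow(subset, target):
        out = list(subset)
        while subset and len(out) < target:
            # Duplication stands in for whatever augmentation actually
            # produces the additional samples in practice.
            out.append(random.choice(subset))
        return out

    mid = grow(mid, mid_multiple * len(first))
    large = grow(large, large_multiple * len(first))
    return first + mid + large
```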
After the sample distribution balancing processing described above, and referring to Fig. 4, the optimization processing may be performed on each sample image in the following manner:
step S231, adjusting the size of each sample image to a preset size.
Step S232, for each of the sample images with the adjusted size, rotating the face image in the sample image by a preset angle in the vertical direction.
And step S233, performing random occlusion processing on each sample image by using occlusion pixel blocks to obtain a target image set.
In this embodiment, the image input into the model generally must meet a standard size requirement, while the obtained sample images may differ in size, which is unfavorable for the feature learning of the model. Accordingly, the size of each sample image can be adjusted to a preset size. For example, when the length and width of a sample image differ, the boundary of the short side may first be expanded, using the long side as reference, so that the length and width become equal. On this basis, the length and width of the sample image may be reduced or enlarged together so that its size meets the preset size, for example a 160×160×3 image.
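A minimal sketch of this size adjustment is given below, assuming OpenCV and centered zero padding of the short side; the application does not specify the fill value or the padding position.

```python
import cv2
import numpy as np

def to_preset_size(image, size=160):
    """Expand the short-side boundary (using the long side as reference)
    so the image becomes square, then scale it to size x size x 3.
    Centered zero padding is an assumption made for illustration."""
    h, w = image.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side, 3), dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return cv2.resize(canvas, (size, size))
```

In a full pipeline, the labeled key points would also be shifted by (left, top) and scaled by size/side so that the annotations remain consistent with the resized image.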
In addition to rotation in the horizontal plane, the face in a sample image may also be rotated in the vertical direction, i.e. the central vertical line of the face forms an included angle with the vertical direction. Faces to be recognized in actual scenes often appear this way, and rotating a face by some angle in the vertical direction shifts the key points and changes their relative positions, making detection and recognition difficult.
Therefore, in the training stage of the model, rotating the face image in each sample image by a preset angle in the vertical direction, for example plus or minus 30 degrees, converts it into images with different rotation angles in that direction. The model can then learn more features of images with such rotation angles, which improves detection accuracy when images to be detected with these characteristics are detected and recognized based on the model.
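A sketch of this rotation step follows, assuming the key points are stored as an N x 2 array; applying the same affine transform to the key points keeps the labels aligned with the rotated image, which the description implies but does not spell out.

```python
import cv2
import numpy as np

def rotate_sample(image, keypoints, angle_deg):
    """Rotate a face image by a preset angle about its center and apply
    the same affine transform to the labeled key points (an N x 2 array),
    so the annotations stay aligned with the rotated image."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h))
    homogeneous = np.hstack([keypoints, np.ones((len(keypoints), 1))])
    return rotated, (matrix @ homogeneous.T).T
```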
In addition, in actual situations the image to be detected may be partially occluded: for example, when an anchor is live streaming, a microphone in front of the anchor may block part of the anchor's face, or the anchor's hand may pass across the face area while gesturing. Both situations increase the difficulty of detecting and recognizing the key points.
Therefore, in the training stage of the model, occlusion pixel blocks may be used to perform random occlusion processing on each sample image, simulating the face occlusions that may occur in real application scenarios.
In this embodiment, when a sample image is occluded, the occlusion pixel block may be generated according to the obtained setting parameters, which include the color of the occlusion pixel block (for example, a black or a white pixel block) and may further include its size, shape, and so on. After the occlusion pixel block is obtained, in order to simulate the different occlusions that may occur in a real application scenario, the superposition area of the occlusion pixel block within the sample image may be determined according to a generated random number, and the occlusion pixel block is then superimposed onto that superposition area.
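This occlusion step could look like the following sketch, in which the block size and color play the role of the setting parameters and the superposition area is drawn from generated random numbers; the default values are illustrative only.

```python
import numpy as np

def random_occlusion(image, block_size=(30, 30), color=(0, 0, 0), rng=None):
    """Generate an occlusion pixel block from setting parameters (size and
    color; defaults are illustrative) and superimpose it onto an area of
    the sample image chosen by generated random numbers."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    bh, bw = block_size
    y = int(rng.integers(0, max(h - bh, 1)))   # random superposition area
    x = int(rng.integers(0, max(w - bw, 1)))
    out = image.copy()
    out[y:y + bh, x:x + bw] = color
    return out
```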
After the obtained sample image set is subjected to the sample distribution balancing treatment and the sample image optimization treatment, a target image set can be obtained, and the target image set is used for training the constructed neural network model to obtain a detection model.
In this embodiment, the constructed neural network model includes an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer. Referring to Fig. 5, training of the neural network model may be accomplished as follows:
step S241, inputting each sample image in the sample image set to the input layer for preprocessing, so as to obtain a preprocessed image.
Step S242, for each of the network layers, performing convolution processing and feature extraction processing on the input image by the network layer, and outputting a feature image.
And step S243, carrying out fusion processing on the feature images output by each network layer in the fusion layer to obtain a fusion feature image, and inputting the fusion feature image to the output layer to carry out key point classification to obtain the predicted key points of the sample image.
And step S244, calculating the loss function value between the predicted key points and the labeled key points of the sample image, performing back-propagation training according to the loss function value to update the network parameters of the neural network model, and continuing training until a preset termination condition is met, so as to obtain the detection model.
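The following PyTorch-style sketch shows one way a single training step of S241-S244 could be wired together under multi-loss supervision (see Loss1-Loss4 below). The assumption that the model returns per-stage predictions alongside the fused one, and the loss weights themselves, are illustrative rather than mandated by this application.

```python
import torch

def train_step(model, optimizer, images, labeled_keypoints, loss_fn, weights):
    """One training step under multi-loss supervision: the model is assumed
    to return the fused prediction plus the per-stage predictions; the
    weighted sum of the losses is backpropagated and the network
    parameters are updated."""
    optimizer.zero_grad()
    predictions = model(images)   # e.g. [fused, stage3, stage4, stage5]
    loss = sum(w * loss_fn(p, labeled_keypoints)
               for w, p in zip(weights, predictions))
    loss.backward()               # back-propagation training
    optimizer.step()              # update the network parameters
    return loss.item()
```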
Referring to Fig. 6, which schematically shows the network structure of the neural network model, the plurality of network layers may comprise Stage1-Stage5. Stage1 performs convolution processing and feature extraction processing on the preprocessed image output by the input layer and outputs a feature image, and Stage2-Stage5 each perform convolution processing and feature extraction processing on the feature image output by the preceding network layer and output a feature image in turn. The fusion layer fuses the feature images output by the network layers; the figure schematically shows the fusion of the output feature images of Stage3, Stage4 and Stage5. The fused image is then subjected to key point classification in the output layer to obtain the predicted key points of the sample image.
In this embodiment, the neural network model may be supervised with a plurality of loss functions, for example Loss1-Loss4. This ensures that the features of each network layer remain well responsive to the features under different receptive fields, so that after the subsequent fusion the finally obtained fused features have a better representational effect.
In this embodiment, each network layer includes two network modules, a first network module and a second network module, where the processing policies of the first network module and the second network module for the image are different. Inside each network layer, the output characteristic image of the network layer can be obtained by fusing the output characteristics of the two network modules.
Optionally, for each network layer, convolution processing and feature extraction processing may be performed on the input image by the first network module of the network layer to obtain a first feature map. And performing convolution processing and feature extraction processing on the input image by using a second network module of the network layer to obtain a second feature map. And finally, fusing the obtained first feature image and the second feature image, and outputting the feature image of the network layer.
Optionally, in this embodiment, the first network module may process the image with two different processing strategies and fuse the results to output its first feature map. Specifically, a first convolution processing strategy may be used to perform convolution processing and feature extraction processing on the image input to the first network module, and a second convolution processing strategy may be used to do the same. The output image obtained under the first convolution processing strategy and the output image obtained under the second convolution processing strategy are then fused, the fused image undergoes channel random mixing processing, and the first feature map is output.
Referring to Fig. 7 in combination, the first convolution processing strategy corresponds to the left processing flow in Fig. 7: for the image input to the first network module, a 3×3 depthwise convolution (DWConv 3×3) is applied first, with the stride of the convolution set to 2, and a 1×1 convolution (Conv 1×1) is then applied, followed by excitation processing. The second convolution processing strategy corresponds to the right processing flow in Fig. 7: a 1×1 convolution (Conv 1×1) is applied first, followed by excitation processing; a 3×3 depthwise convolution (DWConv 3×3), with the stride set to 2, is applied next; finally, another 1×1 convolution (Conv 1×1) is applied, followed by excitation processing.
Finally, the images output by the two flows are fused in a Concat layer, the fused image undergoes channel random mixing in a Channel Shuffle layer, and the first feature map is output.
For the second network module, the input image first undergoes channel separation processing to obtain a plurality of single-channel images. Convolution processing and feature extraction processing are performed on each single-channel image, and each single-channel image is then fused with its convolved, feature-extracted counterpart. Finally, the fused single-channel images undergo channel random mixing, and the second feature map is output.
Referring to Fig. 8 in combination, in the second network module the input image is first split in a Channel Split layer. The right processing flow then applies a 1×1 convolution (Conv 1×1) followed by excitation processing, next a 3×3 depthwise convolution (DWConv 3×3), and finally another 1×1 convolution (Conv 1×1) followed by excitation processing. The image processed by this flow is fused with the unprocessed single-channel image in a Concat layer, and the fused single-channel images finally undergo channel random mixing in a Channel Shuffle layer, outputting the second feature map.
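For reference, a PyTorch sketch of both network modules follows. Note one interpretive liberty: the text above speaks of separating the input into single-channel images, while the sketch uses the common two-way channel split of Fig. 8-style (ShuffleNet-like) units; the channel counts, batch normalization, and ReLU excitation are likewise assumptions for illustration.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Channel random mixing (the Channel Shuffle layer)."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).reshape(n, c, h, w))

class FirstModule(nn.Module):
    """Two-branch unit of Fig. 7: DWConv3x3(stride 2)+Conv1x1 on the left,
    Conv1x1+DWConv3x3(stride 2)+Conv1x1 on the right, then Concat and
    Channel Shuffle."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.left = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.Conv2d(c_in, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.ReLU(inplace=True),
        )
        self.right = nn.Sequential(
            nn.Conv2d(c_in, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.ReLU(inplace=True),
            nn.Conv2d(c_half, c_half, 3, stride=2, padding=1, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half),
            nn.Conv2d(c_half, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return channel_shuffle(torch.cat([self.left(x), self.right(x)], dim=1))

class SecondModule(nn.Module):
    """Channel-split unit of Fig. 8: split the channels, pass one part
    through Conv1x1 + DWConv3x3 + Conv1x1, fuse it with the untouched
    part in a Concat, then apply Channel Shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        kept, processed = x.chunk(2, dim=1)   # Channel Split layer
        return channel_shuffle(torch.cat([kept, self.branch(processed)], dim=1))
```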
In the neural network model provided by this embodiment, setting a plurality of network layers and fusing the output images of those layers lets different network layers contribute features with different characteristics, improving the feature-learning effect. In addition, within each network layer, dimension-reduction processing and channel mixing processing improve the fusion quality of the finally obtained features.
Furthermore, supervision with a plurality of loss functions ensures that the features in each network layer remain well responsive to the features under different receptive fields, so that the fused overall features have good characteristics.
Referring to fig. 9, a schematic diagram of exemplary components of an electronic device according to an embodiment of the present application is provided, where the electronic device may be the live broadcast providing terminal 100 or the live broadcast receiving terminal 300 or the live broadcast server 200 shown in fig. 1. The electronic device may include a storage medium 110, a processor 120, a detection model training apparatus 130, and a communication interface 140. In this embodiment, the storage medium 110 and the processor 120 are both located in the electronic device and are separately disposed. However, it should be understood that the storage medium 110 may also be separate from the electronic device and accessible to the processor 120 through a bus interface. Alternatively, the storage medium 110 may be integrated into the processor 120, for example, as a cache and/or general purpose registers.
The detection model training device 130 may be understood as the above electronic device or its processor 120, or as a software functional module that is independent of the electronic device or the processor 120 and implements the detection model training method under the control of the electronic device.
As shown in Fig. 10, the detection model training device 130 may include an acquisition module 131, a balance processing module 132, an optimization processing module 133, and a training module 134. The functions of these modules are described in detail below.
An acquisition module 131, configured to acquire a sample image set, where the sample image set includes a plurality of sample images divided into a plurality of image subsets. The acquisition module 131 may be used to perform step S210 described above; for implementation details, refer to the description of step S210.
A balance processing module 132, configured to perform balance processing on the number of sample images in the plurality of image subsets. It may be used to perform step S220 described above; for implementation details, refer to the description of step S220.
An optimization processing module 133, configured to optimize each sample image according to a preset transformation strategy to obtain a target image set. It may be used to perform step S230 described above; for implementation details, refer to the description of step S230.
A training module 134, configured to train the pre-constructed neural network model by using the target image set to obtain a detection model. It may be used to perform step S240 described above; for implementation details, refer to the description of step S240.
Further, an embodiment of the application provides a computer-readable storage medium storing machine-executable instructions which, when executed, implement the detection model training method provided in the above embodiments.
In summary, the embodiments of the present application provide a detection model training method and apparatus, an electronic device, and a readable storage medium. A plurality of sample images in an acquired sample image set are divided into a plurality of image subsets, and the number of sample images in those subsets is balanced, which mitigates the uneven sample distribution of the acquired set; each sample image is then optimized according to a preset transformation strategy to obtain a target image set. A pre-constructed neural network model is trained with the target image set to obtain a detection model. Because this training scheme balances the sample distribution of the sample image set and also optimizes each individual sample image, the obtained target image set is improved both in its sample distribution and at the level of single sample images, which in turn improves the detection accuracy of the trained detection model.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, and for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A detection model training method, the method comprising:
obtaining a sample image set, the sample image set comprising a plurality of sample images, the plurality of sample images divided into a plurality of image subsets;
performing balance processing on the number of sample images in the plurality of image subsets;
optimizing each sample image according to a preset transformation strategy to obtain a target image set, wherein the target image set is obtained by performing random occlusion processing on each sample image by using occlusion pixel blocks, the occlusion pixel blocks being generated according to the obtained setting parameters;
training a pre-constructed neural network model by using the target image set to obtain a detection model.
2. The method of claim 1, wherein the step of balancing the number of sample images in the plurality of image subsets comprises:
acquiring the rotation angle of the face image in each sample image on a horizontal plane;
dividing the sample images in the sample image set whose rotation angle belongs to a target preset range into a first image set, and dividing the other sample images in the sample image set into a second image set;
and increasing the sample images in the second image set so that the number of the sample images in the second image set is a preset multiple of the number of the sample images in the first image set.
3. The detection model training method of claim 2, wherein the second image set includes a first subset and a second subset, wherein a rotation angle of face images of sample images in the first subset on a horizontal plane belongs to a first preset range, and a rotation angle of face images of sample images in the second subset on a horizontal plane belongs to a second preset range;
the step of increasing the number of sample images in the second image set so that the number of sample images in the second image set is a preset multiple of the number of sample images in the first image set includes:
increasing the sample images in the first subset so that the number of sample images in the first subset is a first preset multiple of the number of sample images in the first image set;
and increasing the sample images in the second subset so that the number of the sample images in the second subset is a second preset multiple of the number of the sample images in the first image set.
4. The method for training a detection model according to claim 1, wherein the step of optimizing each of the sample images according to a preset transformation strategy to obtain a target image set comprises:
adjusting the size of each sample image to a preset size;
for each sample image with the adjusted size, rotating the face image in the sample image by a preset angle in the vertical direction;
and performing random occlusion processing on each sample image by using occlusion pixel blocks to obtain a target image set.
5. The detection model training method of claim 4, wherein the step of performing random occlusion processing on each of the sample images by using an occlusion pixel block comprises:
generating an occlusion pixel block according to the obtained setting parameters;
for each sample image, determining a superposition area of the occlusion pixel block in the sample image according to a generated random number;
and superimposing the occlusion pixel block onto the superposition area in the sample image.
6. The detection model training method of claim 1, wherein the pre-constructed neural network model comprises an input layer, a fusion layer, an output layer, and a plurality of network layers connected between the input layer and the fusion layer; each sample image carries a plurality of labeled key points; and the step of training the pre-constructed neural network model with the target image set to obtain the detection model comprises:
inputting each sample image in the target image set into the input layer for preprocessing to obtain a preprocessed image;
for each network layer, performing convolution processing and feature extraction processing on the input image through the network layer and outputting a feature image;
fusing, in the fusion layer, the feature images output by the network layers to obtain a fused feature image, and inputting the fused feature image to the output layer for key point classification to obtain predicted key points of the sample image; and
calculating a loss function value from the predicted key points and the labeled key points of the sample image, performing back propagation training according to the loss function value to update the network parameters of the neural network model, and continuing training until a preset termination condition is met, thereby obtaining the detection model.
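One plausible PyTorch reading of the claim 6 topology and training step. The layer widths, the concatenation-based fusion, the MSE loss, and the 68-key-point count are all assumptions, since the claim only names the layer roles:

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Input layer, several network layers, a fusion layer over all of
    their outputs, and an output layer, as recited in claim 6. All
    sizes here are illustrative."""
    def __init__(self, num_keypoints=68, width=16, depth=3):
        super().__init__()
        self.input_layer = nn.Conv2d(3, width, 3, padding=1)
        self.network_layers = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(depth))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.output_layer = nn.Linear(width * depth, num_keypoints * 2)

    def forward(self, x):
        x = self.input_layer(x)               # preprocessing
        feature_images = []
        for layer in self.network_layers:     # each layer emits a feature image
            x = torch.relu(layer(x))
            feature_images.append(x)
        fused = torch.cat(feature_images, 1)  # fusion layer
        return self.output_layer(self.pool(fused).flatten(1))

def train_step(model, images, labeled_keypoints, optimizer):
    # Loss between predicted and labeled key points, back propagation,
    # and a parameter update, per the last step of claim 6.
    predicted = model(images)
    loss = nn.functional.mse_loss(predicted, labeled_keypoints)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```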
7. The detection model training method of claim 6, wherein each network layer comprises a first network module and a second network module, and the step of performing convolution processing and feature extraction processing on the input image through the network layer and outputting a feature image comprises:
for each network layer, performing convolution processing and feature extraction processing on the input image through the first network module of the network layer to obtain a first feature map;
performing convolution processing and feature extraction processing on the input image through the second network module of the network layer to obtain a second feature map; and
fusing the first feature map and the second feature map, and outputting the feature image of the network layer.
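Claim 7 splits every network layer into two parallel modules whose feature maps are fused. A minimal sketch; the element-wise addition used for the fusion is an assumption, since the claim only says "fusing":

```python
import torch.nn as nn

class TwoBranchLayer(nn.Module):
    """One network layer of claim 7: a first and a second network module
    run on the same input, and their feature maps are fused."""
    def __init__(self, first_module: nn.Module, second_module: nn.Module):
        super().__init__()
        self.first_module = first_module
        self.second_module = second_module

    def forward(self, x):
        first_map = self.first_module(x)    # first feature map
        second_map = self.second_module(x)  # second feature map
        return first_map + second_map       # fused feature image of the layer
```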
8. The detection model training method of claim 7, wherein the step of performing convolution processing and feature extraction processing on the input image through the first network module of the network layer to obtain the first feature map comprises:
performing convolution processing and feature extraction processing on the image input to the first network module using a first convolution processing strategy, and performing convolution processing and feature extraction processing on the same image using a second convolution processing strategy;
fusing the output image obtained with the first convolution processing strategy and the output image obtained with the second convolution processing strategy; and
performing channel random mixing processing on the fused image, and outputting the first feature map.
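A sketch of the first network module of claim 8: two convolution strategies over the same input, fusion by channel concatenation, and a channel mixing step. The kernel sizes and the concatenation are assumptions, and "channel random mixing" is implemented literally as a random channel permutation (a fixed ShuffleNet-style shuffle would be the deterministic alternative; the claim does not say which is meant):

```python
import torch
import torch.nn as nn

def channel_random_mix(x):
    # Literal reading of "channel random mixing processing": permute
    # the channel order at random each time the module runs.
    idx = torch.randperm(x.size(1), device=x.device)
    return x[:, idx]

class FirstModule(nn.Module):
    """First network module of claim 8. The 3x3 and 5x5 kernels stand
    in for the unspecified first and second convolution strategies;
    assumes an even channel count."""
    def __init__(self, channels):
        super().__init__()
        self.strategy_one = nn.Conv2d(channels, channels // 2, 3, padding=1)
        self.strategy_two = nn.Conv2d(channels, channels // 2, 5, padding=2)

    def forward(self, x):
        # Fuse the two strategies' outputs, then mix the channels.
        fused = torch.cat([self.strategy_one(x), self.strategy_two(x)], dim=1)
        return channel_random_mix(fused)
```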
9. The detection model training method of claim 7, wherein the step of performing convolution processing and feature extraction processing on the input image through the second network module of the network layer to obtain the second feature map comprises:
performing channel separation processing on the input image through the second network module of the network layer to obtain a plurality of single-channel images;
performing convolution processing and feature extraction processing on each single-channel image;
fusing each single-channel image with its counterpart after the convolution processing and feature extraction processing; and
performing channel random mixing processing on the fused single-channel images, and outputting the second feature map.
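A sketch of the second network module of claim 9. A grouped convolution with groups equal to the channel count is one standard way to realize "channel separation plus per-channel convolution"; the residual-style fusion of each single-channel image with its processed counterpart is an assumption:

```python
import torch
import torch.nn as nn

class SecondModule(nn.Module):
    """Second network module of claim 9: channel separation, per-channel
    convolution, fusion with the originals, then channel mixing."""
    def __init__(self, channels):
        super().__init__()
        # groups=channels applies one independent 3x3 filter per channel,
        # i.e. the channel separation and per-channel convolution steps.
        self.per_channel = nn.Conv2d(channels, channels, 3,
                                     padding=1, groups=channels)

    def forward(self, x):
        processed = self.per_channel(x)
        fused = x + processed                 # fuse originals with processed maps
        idx = torch.randperm(fused.size(1), device=fused.device)
        return fused[:, idx]                  # channel random mixing
```

Under these assumptions, a full claim 7 layer could then be assembled as `TwoBranchLayer(FirstModule(c), SecondModule(c))`, since both modules preserve the channel count.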
10. A detection model training device, comprising:
an acquisition module, configured to acquire a sample image set comprising a plurality of sample images divided into a plurality of image subsets;
a balancing module, configured to balance the number of sample images in the plurality of image subsets;
an optimization module, configured to perform optimization processing on each sample image according to a preset transformation strategy to obtain a target image set, wherein the target image set is obtained by performing random occlusion processing on each sample image with an occlusion pixel block, the occlusion pixel block being generated according to obtained setting parameters; and
a training module, configured to train a pre-constructed neural network model with the target image set to obtain a detection model.
11. An electronic device, comprising one or more storage media and one or more processors in communication with the storage media, the storage media storing machine-executable instructions which, when the electronic device is operating, are executed by the processors to perform the detection model training method of any one of claims 1-9.
12. A computer-readable storage medium storing machine-executable instructions which, when executed by a processor, implement the detection model training method of any one of claims 1-9.
CN202010074476.5A 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium Active CN111325107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010074476.5A CN111325107B (en) 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111325107A (en) 2020-06-23
CN111325107B (en) 2023-05-23 (grant)

Family

ID=71172111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010074476.5A Active CN111325107B (en) 2020-01-22 2020-01-22 Detection model training method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111325107B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860431B (en) * 2020-07-30 2023-12-12 浙江大华技术股份有限公司 Method and device for identifying object in image, storage medium and electronic device
CN112348765A (en) * 2020-10-23 2021-02-09 深圳市优必选科技股份有限公司 Data enhancement method and device, computer readable storage medium and terminal equipment
CN113011298B (en) * 2021-03-09 2023-12-22 阿波罗智联(北京)科技有限公司 Truncated object sample generation, target detection method, road side equipment and cloud control platform
CN114067370B (en) * 2022-01-17 2022-06-21 北京新氧科技有限公司 Neck shielding detection method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871100B (en) * 2016-09-23 2021-07-06 北京眼神科技有限公司 Training method and device of face model, and face authentication method and device
CN107871101A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108898579B (en) * 2018-05-30 2020-12-01 腾讯科技(深圳)有限公司 Image definition recognition method and device and storage medium
CN110288082B (en) * 2019-06-05 2022-04-05 北京字节跳动网络技术有限公司 Convolutional neural network model training method and device and computer readable storage medium
CN110598638A (en) * 2019-09-12 2019-12-20 Oppo广东移动通信有限公司 Model training method, face gender prediction method, device and storage medium
CN110674714B (en) * 2019-09-13 2022-06-14 东南大学 Human face and human face key point joint detection method based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant