WO2022141092A1 - Model generation method and apparatus, image processing method and apparatus, and readable storage medium - Google Patents


Info

Publication number
WO2022141092A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
sample
processing
label
Application number
PCT/CN2020/141003
Other languages
French (fr)
Chinese (zh)
Inventor
张雪
Original Assignee
深圳市大疆创新科技有限公司
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/141003
Publication of WO2022141092A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis

Abstract

A model generation method and apparatus, an image processing method and apparatus, and a readable storage medium. The model generation method comprises: acquiring training data, wherein the training data comprises sample images and labels of the sample images, and the labels comprise labels generated using at least two image information acquisition methods; and training an initial model according to the sample images and the labels of the sample images, so as to generate an image processing model, wherein the image processing model is used for extracting image information, and the initial model comprises at least two processing branches.

Description

Model generation method, image processing method, apparatus, and readable storage medium
Technical Field
The present invention belongs to the field of network technology, and in particular relates to a model generation method, an image processing method, an apparatus, and a readable storage medium.
Background Art
At present, images are an excellent way to obtain information, and more and more scenarios produce images. In order to extract the image information contained in an image, it is often necessary to generate an image processing model for extracting that information.
In existing approaches, an image processing model is often generated directly from training data labeled by a single method. However, the image processing model finally generated in this way has weak generalization ability, and the image information it extracts in use is of low accuracy.
Summary of the Invention
The present invention provides a model generation method, an image processing method, an apparatus, and a readable storage medium, so as to solve the problems that an image processing model has weak generalization ability and that the image information extracted when it is used is of low accuracy.
To solve the above technical problems, the present invention is implemented as follows:
In a first aspect, an embodiment of the present invention provides a model generation method, which includes:
acquiring training data, where the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods; and
training an initial model according to the sample images and the labels of the sample images to generate an image processing model, where the image processing model is used to extract image information and the initial model includes at least two processing branches.
In a second aspect, an embodiment of the present invention provides an image processing method applied to a processing device, the method including:
taking an image to be processed as the input of a preset image processing model to obtain the output of the image processing model; and
obtaining image information of the image to be processed according to the output of the image processing model;
where the image processing model is generated by the above model generation method.
In a third aspect, an embodiment of the present invention provides a model generation apparatus, the apparatus including a memory and a processor;
the memory is configured to store program code;
the processor is configured to call the program code and, when the program code is executed, to perform the following operations:
acquiring training data, where the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods; and
training an initial model according to the sample images and the labels of the sample images to generate an image processing model, where the image processing model is used to extract image information and the initial model includes at least two processing branches.
In a fourth aspect, an embodiment of the present invention provides an image processing apparatus, the apparatus including a memory and a processor;
the memory is configured to store program code;
the processor is configured to call the program code and, when the program code is executed, to perform the following operations:
taking an image to be processed as the input of a preset image processing model to obtain the output of the image processing model; and
obtaining image information of the image to be processed according to the output of the image processing model;
where the image processing model is generated by the above model generation apparatus.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the above methods is implemented.
In the embodiments of the present invention, training data can be acquired, where the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods. Then, an initial model is trained according to the sample images and their labels to generate an image processing model, where the image processing model is used to extract image information and the initial model includes at least two processing branches. Labeling with multiple image information acquisition methods avoids the shortage of samples caused by the limitations of any single labeling method, so training on data labeled in multiple ways ensures the diversity and sufficiency of the training data. This in turn improves, to a certain extent, the generalization ability of the finally generated image processing model, and thereby improves the accuracy of the image information subsequently extracted with that model.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of a model generation method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model structure provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion layer provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a depthwise separable convolution operation provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a processing layer provided by an embodiment of the present invention;
FIG. 6 is a flowchart of the steps of an image processing method provided by an embodiment of the present invention;
FIG. 7 is a block diagram of a model generation apparatus provided by an embodiment of the present invention;
FIG. 8 is a block diagram of an image processing apparatus provided by an embodiment of the present invention;
FIG. 9 is a block diagram of a computing and processing device provided by an embodiment of the present invention;
FIG. 10 is a block diagram of a portable or fixed storage unit provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of the steps of a model generation method provided by an embodiment of the present invention. As shown in FIG. 1, the method may include the following steps.
Step 101: acquire training data; the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods.
In this embodiment of the present invention, the sample images may be obtained from user input, or acquired autonomously from a network, for example, downloaded directly from an open-source database. Further, the label of a sample image may be used to characterize the image information of the sample image. For example, when the image information is the angle information of a face in the image, the label of the sample image may characterize the angle information of the face in the sample image. Specifically, the label of the sample image may be the angle information itself, or data used to calculate the angle information, for example, key point information used to calculate the angle information. Further, the specific types and the number of image information acquisition methods used to obtain the labels may be set according to actual requirements, which is not limited in this embodiment of the present invention.
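The application does not fix a particular algorithm for turning face key points into angle information. As one hedged illustration only (OpenCV's PnP solver, the 3D reference coordinates, and the camera intrinsics below are assumptions, not part of the disclosure), pitch, yaw, and roll can be estimated from a handful of 2D landmarks as follows:

```python
# Hypothetical sketch: derive pitch/yaw/roll from six 2D facial key points.
# The 3D reference coordinates and the focal-length guess are illustrative assumptions.
import cv2
import numpy as np

# Generic 3D reference positions (nose tip, chin, eye corners, mouth corners), in arbitrary units.
MODEL_3D = np.array([
    (0.0, 0.0, 0.0),          # nose tip
    (0.0, -330.0, -65.0),     # chin
    (-225.0, 170.0, -135.0),  # left eye outer corner
    (225.0, 170.0, -135.0),   # right eye outer corner
    (-150.0, -150.0, -125.0), # left mouth corner
    (150.0, -150.0, -125.0),  # right mouth corner
], dtype=np.float64)

def angles_from_keypoints(points_2d, image_size):
    """Estimate (pitch, yaw, roll) in degrees from six 2D landmarks."""
    h, w = image_size
    focal = w  # crude focal-length assumption
    camera = np.array([[focal, 0, w / 2],
                       [0, focal, h / 2],
                       [0, 0, 1]], dtype=np.float64)
    dist = np.zeros(4)  # assume no lens distortion
    ok, rvec, _ = cv2.solvePnP(MODEL_3D, np.asarray(points_2d, dtype=np.float64),
                               camera, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    rot, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles (axis naming follows a common convention).
    sy = np.sqrt(rot[0, 0] ** 2 + rot[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw = np.degrees(np.arctan2(-rot[2, 0], sy))
    roll = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
    return pitch, yaw, roll
```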
Step 102: train an initial model according to the sample images and the labels of the sample images to generate an image processing model; the image processing model is used to extract image information, and the initial model includes at least two processing branches.
The specific architecture of the initial model may be designed in advance according to actual requirements. In this embodiment of the present invention, by designing an initial model that includes at least two processing branches and training on the basis of this multi-branch model, the model can extract more information from the multiple branches during training, which improves the training effect to a certain extent.
Further, in one implementation, training the initial model may proceed as follows. A sample image is taken as the input of the initial model, the output of the initial model is taken as a predicted value, and a ground-truth value is determined from the label of the sample image, for example, by taking the label as the ground-truth value. Then, the current loss value of the initial model is calculated from the predicted value and the ground-truth value. If the loss value does not meet a preset requirement, the initial model has not yet converged; accordingly, the model parameters of the initial model are adjusted and training continues on the adjusted initial model until the loss value meets the preset requirement. Finally, when the loss value of a certain round of the initial model meets the preset requirement, the current initial model is taken as the final image processing model.
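A minimal sketch of this loop, assuming a PyTorch model and an L1 loss as the "loss value"; the optimizer, threshold, and epoch cap are placeholders the application does not specify:

```python
# Minimal sketch of the training loop described above (PyTorch assumed;
# model, loss choice and stopping threshold are illustrative placeholders).
import torch

def train_until_converged(model, data_loader, loss_threshold=1e-3, lr=1e-2, max_epochs=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()          # predicted value vs. ground-truth label
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in data_loader:
            optimizer.zero_grad()
            predictions = model(images)    # output of the initial model = predicted value
            loss = criterion(predictions, labels)
            loss.backward()                # adjust model parameters
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(data_loader) < loss_threshold:
            break                          # preset requirement met: stop training
    return model                           # current model becomes the image processing model
```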
In summary, with the model generation method provided by the embodiments of the present invention, training data can be acquired, where the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods. Then, an initial model is trained according to the sample images and their labels to generate an image processing model, where the image processing model is used to extract image information and the initial model includes at least two processing branches. Labeling with multiple image information acquisition methods avoids the shortage of samples caused by the limitations of a single labeling method, so training on data labeled in multiple ways ensures the diversity and sufficiency of the training data, which to a certain extent improves the generalization ability of the finally generated image processing model and thereby improves the accuracy of the image information subsequently extracted with it.
Optionally, the image information in this embodiment of the present invention may include angle information of a face in the image, and the angle information may characterize the pose angles of the face in the image. For example, the angle information may include a pitch angle, a yaw angle, and a roll angle. Further, the labels may include a first label generated by a first image information acquisition method and a second label generated by a second image information acquisition method. The first image information acquisition method may include obtaining the angle information from face key points in the image, and the second image information acquisition method may include obtaining the angle information by regression detection on the color channel values of the pixels in the image, where the color channel values of a pixel may be its red-green-blue (RGB) channel values.
Because face key points can usually be determined efficiently, the first image information acquisition method can usually provide a large number of training samples. However, as the angle of the face in the image becomes larger, the accuracy of key point determination decreases, so the label error grows and the accuracy of the finally generated model drops. In contrast, obtaining angle information by regression detection on the color channel values of pixels is less affected by large face angles; that is, even when the angle of the face in the image is large, the label can still accurately characterize the angle information, so the label error is small and the labeling quality is high. Therefore, in this embodiment of the present invention, labeling is performed with both the first and the second image information acquisition methods to generate two kinds of training data corresponding to the first label and the second label respectively. To a certain extent, this ensures both the sufficiency and the accuracy of the training data, and hence the training effect, so that with limited training data the finally generated image processing model can extract image information relatively accurately.
Optionally, in one implementation of the embodiment of the present invention, the operation of acquiring training data may include the following steps.
Step 1011: obtain a first preset model and a second preset model; the first preset model is used to obtain angle information from face key points in an image, and the second preset model is used to obtain angle information from the color channel values of the pixels in an image.
In this step, the first preset model and the second preset model may be pre-trained models. When pre-training the first preset model, the images in a training set and their annotated face key points may be used, so that the first preset model learns the ability to determine face key points; further, the first preset model can calculate the angle information of a face from the determined face key points according to a preset acquisition algorithm. When pre-training the second preset model, the images in a training set and their annotated face angle information may be used, so that the second preset model learns the ability to perform regression detection on the pixels of an image to determine angle information. Further, when a model for obtaining angle information from the color channel values of pixels is trained, samples are usually annotated manually; such annotated data tends to be scarce and is affected by personal subjective perception, so the labeling quality, and in turn the training effect, tends to be poor. In this embodiment of the present invention, training data corresponding to the two methods is obtained with the two preset models, and training is performed on both kinds of data together, which to a certain extent avoids the poor training effect of training based on a single labeling method.
Correspondingly, when obtaining the first preset model and the second preset model, the pre-trained first and second preset models may be loaded directly, which improves acquisition efficiency to a certain extent, for example, by directly loading open-source first and second preset models.
Step 1012: process a first sample image with the first preset model to obtain the first label, and process a second sample image with the second preset model to obtain the second label.
In this step, the first sample images and the second sample images may each comprise multiple images; the image set formed by the first sample images and the image set formed by the second sample images may have no images in common, or may partially overlap, which is not limited in this embodiment of the present invention. When obtaining the first label, a first sample image may be taken as the input of the first preset model, the first preset model tags the first sample image, and the output of the first preset model is taken as the first label. It should be noted that, in this embodiment of the present invention, images that already have face key point data may also be obtained directly from an open-source database as the first sample images, and the first labels may be generated from the face key point data of these images. Further, in a practical application scenario, the first preset model may also be used to determine the face key points in the first sample images, and the face key points may serve as the training labels; in the later training, the finally generated image processing model can thus learn how to accurately determine face key points, and correspondingly, when extracting angle information, the image processing model can calculate the corresponding angle information from the determined face key points and a preset algorithm.
When obtaining the second label, a second sample image may be taken as the input of the second preset model, the second preset model tags the second sample image, and the output of the second preset model is taken as the second label. Of course, manual annotation may also be used in this embodiment of the present invention, which is not limited here. It should be noted that labeling sample images with a preset model may introduce some error, so a small portion of the resulting training data may be noisy. A small amount of noisy data can provide richer and more diverse information for the subsequent training process, which in turn can improve the generalization ability of the model to a certain extent.
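A rough sketch of this automatic labeling step; the callable preset models and the tuple-based data layout are illustrative assumptions rather than the claimed implementation:

```python
# Illustrative sketch of the automatic labeling of step 1012 (all model objects and
# the data layout are assumptions; the application only specifies the idea).
def build_training_data(first_preset_model, second_preset_model,
                        first_sample_images, second_sample_images):
    first_set = []   # (image, first label) pairs, labels derived from face key points
    for image in first_sample_images:
        first_label = first_preset_model(image)    # e.g. key points, or angles computed from them
        first_set.append((image, first_label))

    second_set = []  # (image, second label) pairs, labels regressed from pixel RGB values
    for image in second_sample_images:
        second_label = second_preset_model(image)  # angle information from color channel values
        second_set.append((image, second_label))

    return first_set, second_set
```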
In this embodiment of the present invention, the first preset model and the second preset model are obtained first, and the first sample images and the second sample images are then labeled with them respectively to obtain the first labels and the second labels. Compared with manual annotation, this improves labeling efficiency and reduces labeling cost to a certain extent. Meanwhile, dividing the sample images into first sample images and second sample images and labeling the two sets in different ways facilitates subsequent cross-training, so that during training the initial model can learn from the characteristics of both kinds of training data, which ensures the training effect of the model to a certain extent. It should be noted that the image information in this embodiment of the present invention may also be other information, for example, the age information corresponding to a face in an image. Correspondingly, when acquiring the training data, the specific age information corresponding to the face in a sample image may be used as the label of that sample image.
Further, to ensure the accuracy of the training data, after labels have been set for the sample images, the sample images may be taken as initial images and the labels set for them as initial labels, and a target processing model may be generated from the initial images and the initial labels. Specifically, the initial images and the initial labels may be used as training data to train and obtain the target processing model. For example, an initial image is taken as the input of a preset original model, the output of the preset original model is taken as a predicted value, and a ground-truth value is determined from the initial label of the initial image, for example, by taking the initial label as the ground-truth value. Then, the current loss value of the preset original model is calculated from the predicted value and the ground-truth value; if the loss value does not meet a preset requirement, the preset original model has not yet converged, so its model parameters are adjusted and training continues on the adjusted preset original model until the loss value meets the preset requirement. Finally, when the loss value of a certain round of the preset original model meets the preset requirement, the current preset original model is taken as the final target processing model. The initial images can then be screened according to the target processing model, the initial images, and the initial labels. Screening the initial images in this way automatically removes dirty data from the training data to a certain extent, thereby improving the accuracy of the training data. Meanwhile, compared with manual screening, which is time-consuming and costly, the automatic screening in this embodiment of the present invention reduces the screening cost and time to a certain extent, which is beneficial to the iterative update of the model.
Optionally, during screening, for any initial image, the initial image may be taken as the input of the target processing model to obtain the output of the target processing model; the output is taken as the predicted label of the initial image, the similarity between the predicted label and the initial label is calculated, and initial images whose similarity is smaller than a preset similarity threshold are removed.
After an initial image is input to the target processing model, the target processing model can process the initial image and produce an output. Taking model_origin as the target processing model and dataset_origin as the set of initial images as an example, model_origin can be used to determine the predicted label of each image in dataset_origin. Further, the similarity between the predicted label and the initial label characterizes how close the two are: the greater the similarity, the more accurate and credible the initial label can be considered; the smaller the similarity, the less credible the initial label.
Correspondingly, screening can be performed according to the relationship between the similarity and a preset similarity threshold. If the similarity is smaller than the preset similarity threshold, the initial label of that initial image can be considered unreliable, so the initial image can be removed; only the initial images whose similarity is not smaller than the preset similarity threshold, and whose labels are therefore more credible, are retained as sample images. The preset similarity threshold may be set according to actual requirements, which is not limited in this embodiment of the present invention. Further, the set dataset_clean formed by the retained initial images may subsequently be used as training data to train and obtain an image processing model, which may be denoted model_clean. In this embodiment of the present invention, by calculating the similarity between the predicted label and the initial label and screening the initial images according to the similarity, the accuracy of the screening operation can be ensured to a certain extent.
Optionally, when calculating the similarity between the predicted label and the initial label, the absolute value of the difference between the predicted label and the initial label may be calculated, and the similarity may be determined from this absolute value; the similarity is negatively correlated with the absolute value.
The larger the absolute value, the larger the gap between the predicted label and the initial label; the smaller the absolute value, the smaller the gap. Therefore, the similarity can be set to be negatively correlated with the absolute value. For example, with predict_label denoting the predicted label and label denoting the initial label, the absolute value of their difference can be expressed as abs(predict_label-label), where abs(*) takes the absolute value of the input "*". Further, -abs(predict_label-label) can be used as the similarity. In this embodiment of the present invention, calculating the absolute value of the difference between the two and determining the similarity from it makes the similarity easy to compute, which improves the computation efficiency to a certain extent. Of course, other similarity algorithms may also be used for the calculation, which is not limited in this embodiment of the present invention.
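A sketch of the screening step built directly around similarity = -abs(predict_label - label); model_origin and dataset_origin follow the names used above, while the threshold value and the scalar labels are placeholders:

```python
# Sketch of the screening step using similarity = -abs(predict_label - label).
# Scalar labels are assumed; the threshold value is an arbitrary placeholder.
def clean_dataset(model_origin, dataset_origin, similarity_threshold=-5.0):
    dataset_clean, rejected = [], []
    for image, label in dataset_origin:
        predict_label = model_origin(image)        # predicted label of the initial image
        similarity = -abs(predict_label - label)   # negatively correlated with the gap
        if similarity >= similarity_threshold:
            dataset_clean.append((image, label))   # credible label: keep as a sample image
        else:
            rejected.append((image, label))        # candidate dirty data: remove
    return dataset_clean, rejected                 # rejected images can be manually re-checked
```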
It should be noted that, in this embodiment of the present invention, initial images that were removed due to erroneous screening can also be recovered from the removed initial images by manual inspection and added back to the training data. This eliminates, to a certain extent, the reduction of training data caused by erroneous screening. Meanwhile, compared with relying entirely on manual screening of the training data, in this embodiment of the present invention automatic screening is performed first and the removed initial images are then manually recovered in a second pass; this reduces the amount of manual processing and increases the screening speed while improving the screening accuracy.
Optionally, the above operation of training an initial model according to the sample images and the labels of the sample images to generate an image processing model may include the following steps.
Step 1021: resize the sample images into sample images of multiple preset sizes.
In this embodiment of the present invention, the preset sizes may be set according to actual requirements; for example, the preset sizes may include 64*64, 48*48, 40*40, and so on. Further, in one implementation, the sample images may be divided into N groups, where N is the number of preset sizes. Then a different preset size can be assigned to each group, and finally the sample images in each group are resized to the preset size corresponding to that group, thereby obtaining sample images of multiple preset sizes. Of course, other adjustment approaches may also be used, which is not limited in this embodiment of the present invention.
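A sketch of this resizing step, assuming OpenCV image arrays and a simple round-robin split into N groups (one option among those the text allows):

```python
# Sketch of producing training sets at multiple preset input sizes.
# OpenCV resizing and a round-robin split into N roughly equal groups are assumptions.
import cv2

PRESET_SIZES = [(64, 64), (48, 48), (40, 40)]

def resize_into_groups(sample_images):
    n = len(PRESET_SIZES)
    groups = {size: [] for size in PRESET_SIZES}
    for i, image in enumerate(sample_images):
        size = PRESET_SIZES[i % n]                  # each group gets a different preset size
        groups[size].append(cv2.resize(image, size))
    return groups                                   # one training set per preset size
```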
Step 1022: for the sample images of each preset size, train the initial model according to the sample images and the labels of the sample images, so as to obtain an image processing model for each preset size.
In this step, the size of the sample images characterizes the input size corresponding to the model. The initial model is trained separately with the sample images of each preset size, so that image processing models corresponding to different input sizes are obtained; the image processing models corresponding to different input sizes are the image processing models under the different preset sizes. Further, the larger the corresponding input size, the higher the computation cost of the model tends to be. For example, when the model computation is measured by the number of multiply-accumulate (MACC) operations at run time, three versions of the image processing model with input sizes of 40*40, 48*48 and 64*64 may have MACC counts of about 3.8 M, 4.8 M and 8.7 M respectively (M denoting mega). In this embodiment of the present invention, generating image processing models corresponding to different input sizes provides users with image processing models of different computation costs, so that users can conveniently choose according to actual requirements. For example, in a scenario where the image processing model is applied in a beautification system, assuming the beautification system requires the model computation to be within 10 M, any one of the three model versions may be selected; alternatively, the image processing model with the smallest computation may be selected directly, so as to minimize computation and increase the running speed, making the beautification system smoother.
In the implementation of the present invention, the sample images are resized into sample images of multiple preset sizes, and for the sample images of each preset size the initial model is trained according to the sample images and their labels to obtain an image processing model for that preset size. That is, image processing models corresponding to different input sizes are generated, providing users with multiple choices, so that in subsequent applications a user can select an image processing model suited to the capability of the device according to actual requirements, which improves the flexibility of operation.
Optionally, in this embodiment of the present invention, for the sample images of each preset size, the operation of training according to the sample images of that preset size and the labels of the sample images may be performed once, so as to generate the image processing model for that preset size.
Specifically, the sample images may include first sample images and second sample images, and the first labels of the first sample images and the second labels of the second sample images may be obtained in different ways. The above step of training the initial model according to the sample images and the labels of the sample images may include the following steps.
Step 10221: divide the first sample images into multiple first sample groups, and divide the second sample images into multiple second sample groups.
For example, the first sample images may be divided equally into multiple image groups to obtain the multiple first sample groups, and the second sample images may be divided equally into multiple image groups to obtain the multiple second sample groups. Of course, the grouping may also be random, which is not limited in this embodiment of the present invention.
Step 10222: cross-train the initial model according to the first sample images in the first sample groups and the first labels of the first sample images, and the second sample images in the second sample groups and the second labels of the second sample images.
In this embodiment of the present invention, the first sample images and the second sample images have different labels, that is, the training data is separated. By respectively dividing the first sample images and the second sample images into multiple first sample groups and multiple second sample groups and cross-training jointly on the first sample groups and the second sample groups, the initial model can learn from the two kinds of training samples in a relatively balanced way during training, so that the final image processing model can be obtained by combining the two image information acquisition methods, which improves the final training effect to a certain extent. Of course, other training methods may also be used. For example, the first sample images and the second sample images may be the same images; that is, a first label and a second label are set for the same sample image at the same time. Correspondingly, in this case, the sample images can be used directly for training, which is not limited in this embodiment of the present invention.
Optionally, when dividing the first sample groups and the second sample groups, the number of first sample images contained in a first sample group may be set to be equal to the number of second sample images contained in a second sample group. Further, the above step of cross-training the initial model according to the first sample images in the first sample groups and their first labels and the second sample images in the second sample groups and their second labels may include the following steps.
Step 10222a: train the initial model according to the first sample images in one first sample group and the first labels of the first sample images, so as to update the model parameters of the initial model.
For example, an unused first sample group may be selected from the first sample groups, the first sample images in the selected first sample group are taken as the input of the initial model, and a loss value is determined based on the output of the initial model and the first labels. If the loss value does not meet a preset requirement, the model parameters of the initial model can be updated; for example, a preset stochastic gradient descent method may be used to adjust the model parameters so as to implement the update.
Step 10222b: after updating the model parameters of the initial model, train the initial model according to the second sample images in one second sample group and the second labels of the second sample images to update the model parameters of the initial model, and after updating the model parameters, re-perform the step of training the initial model according to the first sample images in one first sample group and the first labels of the first sample images.
In this step, after the model parameters are updated, training can continue with a second sample group, so as to achieve cross-training between the first sample groups and the second sample groups. For example, an unused second sample group may be selected from the second sample groups, the second sample images in the selected second sample group are taken as the input of the initial model, and a loss value is determined based on the output of the initial model and the second labels. If the loss value does not meet the preset requirement, the model parameters of the initial model can be updated, for example, with a preset stochastic gradient descent method. Correspondingly, after the parameters have been updated based on the second sample group, the above process of training and updating based on a first sample group can be repeated, so as to achieve cyclic, alternating training. Finally, training can end when the loss value of the initial model meets the preset requirement.
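A sketch of the alternating cross-training loop described in steps 10222a and 10222b, assuming PyTorch, stochastic gradient descent, and a simplified loss-threshold stopping rule (the group batching and loss choice are assumptions):

```python
# Sketch of the alternating cross-training loop (PyTorch and SGD assumed; the
# convergence test on the loss value is simplified to a fixed threshold).
import itertools
import torch

def cross_train(initial_model, first_groups, second_groups, lr=1e-2, loss_threshold=1e-3):
    optimizer = torch.optim.SGD(initial_model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()

    def update(images, labels):
        optimizer.zero_grad()
        loss = criterion(initial_model(images), labels)
        loss.backward()
        optimizer.step()                      # stochastic gradient descent update
        return loss.item()

    # Alternate: one first sample group, then one second sample group, and repeat.
    for first_batch, second_batch in zip(itertools.cycle(first_groups),
                                         itertools.cycle(second_groups)):
        loss_a = update(*first_batch)         # step 10222a
        loss_b = update(*second_batch)        # step 10222b
        if max(loss_a, loss_b) < loss_threshold:
            break                             # preset requirement met: stop training
    return initial_model
```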
In this embodiment of the present invention, based on first sample groups and second sample groups that contain the same number of images, each time one first sample group has been used to train and update the initial model, one second sample group is used next to train and update it, so that cross-training is achieved by cyclic alternation. Because a first sample group and a second sample group contain the same number of images, the balance of the training data in each round of cross-training is improved to a certain extent, which ensures the cross-training effect. Meanwhile, continuously alternating between the first sample groups and the second sample groups to update the model parameters adds rich learnable information to the model and allows the model to be optimized by the two kinds of training samples in a balanced way during updating, which to a certain extent improves the generalization ability of the model and speeds up its convergence.
It should be noted that the initial model and the image processing model mentioned in the embodiments of the present invention have the same structure, but the model parameters in each layer may differ; the process of training the initial model to generate the image processing model is exactly the process of adjusting the model parameters.
In one implementation, the at least two processing branches included in the initial model may specifically include a first processing branch and a second processing branch, and both the first processing branch and the second processing branch may include a convolution layer, an activation function layer, and a pooling layer. Further, the initial model may also include a fusion layer for fusing the outputs of the first processing branch and the second processing branch, and a processing layer for processing the output of the fusion layer. The convolution layer can be used to extract features from the model input through convolution operations. The activation function layer adds non-linear factors to the model; by providing activation function layers, the model is prevented from being a purely linear combination, which improves the expressive power of the model to a certain extent. The specific type of activation function used in the activation function layers may be set according to actual requirements. For example, the activation function layer in the first processing branch may be a rectified linear unit (ReLU) activation function layer, and the activation function layer in the second processing branch may be a hyperbolic tangent (Tanh) activation function layer. Alternatively, both may be set to ReLU activation function layers. Compared with using both ReLU and Tanh activation function layers, setting all activation function layers to ReLU can, while maintaining model accuracy, avoid the problems caused by using too many Tanh activation functions in the model structure, namely gradient saturation, reduced training efficiency, and an actual running speed that falls short of the theoretical speed; it thereby ensures that the actual running time of the model is proportional to its theoretical computation, avoiding the situation where the theoretical computation is low but the actual running time is long. Further, the pooling layers remove redundant information while retaining the main features, reducing the size of the data. The pooling layer in the first processing branch may be an average pooling layer, and the pooling layer in the second processing branch may be a maximum pooling layer.
Of course, the initial model may also include other layers. For example, FIG. 2 is a schematic diagram of a model structure provided by an embodiment of the present invention. As shown in FIG. 2, Input (3*40*40) denotes an input image with 3 color channels and a size of 40*40, Stream1 denotes the first branch, and Stream2 denotes the second branch. SeparableConxBnRelu denotes a depthwise separable convolution layer, a batch normalization (BN) layer and a ReLU activation function layer; Avgpool denotes an average pooling layer and Maxpool denotes a maximum pooling layer; SeparableConxBnTanh denotes a depthwise separable convolution layer, a BN layer and a Tanh activation function layer. Further, with Fusioni denoting the i-th fusion layer, the fusion layers included in the initial model may be Fusion1, Fusion2 and Fusion3; with Stagei_output denoting the i-th processing layer, the processing layers included in the initial model may be Stage1_output, Stage2_output and Stage3_output. Further, in (16,3,1), (24,3,1), (48,3,1) and (96,3,1) in FIG. 2, the "1" denotes the stride of the convolution operation, the "3" denotes the size of the convolution kernels used, and "16", "24", "48" and "96" denote the numbers of convolution kernels. The (2,2) in FIG. 2 denotes the size of the region processed by each pooling operation. Because the size of the feature map decreases as the pooling layers process it, in this embodiment of the present invention the numbers of convolution kernels in the successive depthwise separable convolution layers are set to increase progressively; in this way, while ensuring that the computation does not become too large, each convolution operation can still extract sufficient feature information.
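The following PyTorch sketch shows one stage of the two-branch structure of FIG. 2; the exact layer stacking, padding, and channel counts are assumptions inferred from the description above, not the claimed network:

```python
# Illustrative PyTorch sketch of one stage of the two-branch structure of FIG. 2
# (channel counts follow the first (16,3,1) block; layer details are assumptions).
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise separable convolution followed by BN and an activation."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1, activation=nn.ReLU):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel, stride, padding=kernel // 2, groups=in_ch),
            nn.Conv2d(in_ch, out_ch, 1),   # pointwise convolution
            nn.BatchNorm2d(out_ch),
            activation(),
        )

    def forward(self, x):
        return self.block(x)

class TwoBranchStage(nn.Module):
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        # First branch: SeparableConxBnRelu followed by average pooling.
        self.stream1 = nn.Sequential(SeparableConv(in_ch, out_ch, activation=nn.ReLU),
                                     nn.AvgPool2d(2, 2))
        # Second branch: SeparableConxBnTanh followed by max pooling.
        self.stream2 = nn.Sequential(SeparableConv(in_ch, out_ch, activation=nn.Tanh),
                                     nn.MaxPool2d(2, 2))

    def forward(self, x):
        return self.stream1(x), self.stream2(x)   # fed to the corresponding fusion layer
```

Consistent with the FIG. 3 description below, a fusion layer would then pass the two branch outputs through similar separable-convolution and pooling blocks and multiply the results element-wise.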
本发明实施例中,通过设计至少两个处理分支以及融合层,这样使得后续使用该模型时,模型可以基于包括的多个处理分支提取更多的信息进行融合,进而一定程度上可以提高处理精度。In the embodiment of the present invention, by designing at least two processing branches and a fusion layer, when the model is used subsequently, the model can extract more information for fusion based on the multiple processing branches included, thereby improving the processing accuracy to a certain extent. .
可选的,本发明实施例的融合层中可以包括卷积层,融合层中的卷积层以及第一处理分支及第二处理分中的卷积层用于进行深度可分卷积操作。即,初始模型中的卷积层可以均为深度可分卷积层。示例的,图3是本发明实施例提供的一种融合层的示意图,如图3所示,Stream1_stagei表示第一分支输入至第i个融合层的输入数据,Stream2_stagei分别表示第二分支输入至第i个融合层的输入数据。SeparableConxBnRelu表示深度可分卷积层、BN层、Relu激活函数层。Avgpool表示平均池化层。SeparableConxBnTanh表示深度可分卷积层、BN层、Tanh激活函数层,Maxpool表示最大池化层。“Elements multiply”表示用于进行元素相乘的层。Optionally, the fusion layer in this embodiment of the present invention may include a convolution layer, and the convolution layer in the fusion layer and the convolution layers in the first processing branch and the second processing branch are used to perform depthwise separable convolution operations. That is, the convolutional layers in the initial model can all be depthwise separable convolutional layers. By way of example, FIG. 3 is a schematic diagram of a fusion layer provided by an embodiment of the present invention. As shown in FIG. 3 , Stream1_stagei represents the input data input from the first branch to the ith fusion layer, and Stream2_stagei represents the input data from the second branch to the ith fusion layer. Input data for i fusion layers. SeparableConxBnRelu represents the depthwise separable convolution layer, BN layer, and Relu activation function layer. Avgpool represents the average pooling layer. SeparableConxBnTanh represents the depth separable convolution layer, BN layer, Tanh activation function layer, and Maxpool represents the maximum pooling layer. "Elements multiply" means the layer used for element multiplication.
进一步地,深度可分卷积层可以用于执行深度可分卷积操作,深度可分卷积操作可以包括空间/深度卷积(Depthwise Convolution)以及通道卷积(Pointwise Convolution)两部分。具体执行时,可以先对特征图的通道分别进行depthwise convolution,并对输出进行拼接,然后,使用单位卷积核进行pointwise convolution。示例的,图4是本发明实施例提供的一种深度可分卷积操作的示意图,如图4所示,可以先执行Depthwise_Conv(3,1),然后执行Pointwise_Conv(1,1),以实现3×3的卷积运算。其中,Depthwise_Conv表示空间卷积,Pointwise_Conv表示通道卷积。具体的执行过程可以为先通过一个3×1的空间卷积,最后再通过一个1×1的通道卷积。Further, the depthwise separable convolution layer can be used to perform a depthwise separable convolution operation, and the depthwise separable convolution operation may include two parts: spatial/depthwise convolution (Depthwise Convolution) and channel convolution (Pointwise Convolution). In specific implementation, depthwise convolution can be performed on the channels of the feature map respectively, and the output is spliced, and then the unit convolution kernel is used for pointwise convolution. Exemplarily, FIG. 4 is a schematic diagram of a depthwise separable convolution operation provided by an embodiment of the present invention. As shown in FIG. 4 , Depthwise_Conv(3,1) may be executed first, and then Pointwise_Conv(1,1) may be executed to achieve 3×3 convolution operation. Among them, Depthwise_Conv represents spatial convolution, and Pointwise_Conv represents channel convolution. The specific execution process can be first through a 3×1 spatial convolution, and finally through a 1×1 channel convolution.
相较于直接使用标准卷积操作的方式,本发明实施例中,通过将标准卷积拆分为两部分,可以实现拆分空间维度和通道维度的相关性,这样,可以减少卷积计算所需要的参数个数,进而一定程度上可以降低模型计算量,提高模型计算效率以及计算速度。Compared with directly using standard convolution operations, in the embodiment of the present invention, splitting the standard convolution into two parts decouples the spatial dimension from the channel dimension. In this way, the number of parameters required for the convolution computation can be reduced, which in turn reduces the amount of model computation to a certain extent and improves the computation efficiency and speed of the model.
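The parameter saving can be made concrete with a small worked example; the channel counts below are illustrative and not taken from the patent.

# Parameter count for a 3x3 convolution from 16 to 24 channels (bias omitted).
in_ch, out_ch, k = 16, 24, 3

standard = in_ch * out_ch * k * k            # 16 * 24 * 9 = 3456
separable = in_ch * k * k + in_ch * out_ch   # 144 (depthwise) + 384 (pointwise) = 528

print(standard, separable, separable / standard)  # 3456 528 ~0.15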
可选的,本发明实施例中的处理层可以包括第一处理层以及第二处理层,其中,第一处理层可以包括全连接层(Fully Connected layers,FC)以及激活函数层,第二处理层可以包括深度可分卷积层。其中,FC可以用于对之前层提取到的特征重新通过权值矩阵组装成完整的特征图,以及在模型中起到分类器的作用。相较于直接使用标准卷积操作的方式,本发明实施例中,通过在处理层中设置深度可分卷积层,可以减少卷积计算所需要的参数个数,进而一定程度上可以降低计算量,提高计算效率。Optionally, the processing layer in this embodiment of the present invention may include a first processing layer and a second processing layer, where the first processing layer may include a fully connected (FC) layer and an activation function layer, and the second processing layer may include a depthwise separable convolutional layer. The FC layer may be used to reassemble the features extracted by the preceding layers into a complete feature map through a weight matrix, and it acts as a classifier in the model. Compared with directly using standard convolution operations, in the embodiment of the present invention, setting a depthwise separable convolutional layer in the processing layer can reduce the number of parameters required for the convolution computation, thereby reducing the amount of computation to a certain extent and improving computation efficiency.
示例的,图5是本发明实施例提供的一种处理层的示意图,如图5所示,第i个处理层的输入可以为对应的第i个融合层的输出"Fusioni_output"。处理层的输入可以对应输入至第一处理层01以及第二处理层02。其中,SeparableConV(10,3,1)表示以1作为步长,基于10个大小为3×3的卷积核进行深度可分卷积。进一步地,基于第一处理层中包括的各个层的输出即可进行角度计算,得到一种方式下的角度信息,例如,得到根据图像中人脸关键点获取的第一角度信息。具体的,可以基于角度计算层中预设的角度计算方式实现角度计算,或者是直接将第一处理层中3个模块的输出作为角度信息。进一步地,第二处理层的处理结果可以表征另一种方式下提取到的角度信息,例如,得到根据图像中像素点的颜色通道值获取的第二角度信息。进一步地,由于Tanh激活函数的输出范围为[-1,1],而Relu激活函数的输出范围为[0,+∞),因此,本发明实施例中在第一处理层中设置部分Tanh激活函数,一定程度上可以确保最终得到的角度信息的范围中存在正值以及负值,进而可以扩大角度信息的范围。As an example, FIG. 5 is a schematic diagram of a processing layer provided by an embodiment of the present invention. As shown in FIG. 5, the input of the i-th processing layer may be the output "Fusioni_output" of the corresponding i-th fusion layer. The input of the processing layer may be fed correspondingly into the first processing layer 01 and the second processing layer 02. Here, SeparableConV(10, 3, 1) indicates a depthwise separable convolution with stride 1 based on 10 convolution kernels of size 3×3. Further, angle calculation can be performed based on the outputs of the layers included in the first processing layer to obtain angle information in one way, for example, the first angle information obtained according to the face key points in the image. Specifically, the angle calculation may be implemented based on a preset angle calculation method in an angle calculation layer, or the outputs of the three modules in the first processing layer may be used directly as the angle information. Further, the processing result of the second processing layer can represent the angle information extracted in another way, for example, the second angle information obtained according to the color channel values of the pixels in the image. Further, since the output range of the Tanh activation function is [-1, 1] while the output range of the Relu activation function is [0, +∞), setting some Tanh activation functions in the first processing layer in the embodiment of the present invention can ensure, to a certain extent, that the range of the finally obtained angle information contains both positive and negative values, thereby expanding the range of the angle information.
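For illustration, the processing layer of FIG. 5 can be sketched as a module with two heads: a fully connected head for the first angle information and a SeparableConV(10, 3, 1)-style head for the second angle information. The hidden width, the pooling before the second head's output and the three-value angle output are assumptions; only the overall two-head structure follows the description above.

import torch
import torch.nn as nn

class ProcessingLayer(nn.Module):
    # Assumed sketch of Stagei_output: a fully connected head (first processing
    # layer) and a depthwise separable convolutional head (second processing
    # layer), both fed with Fusioni_output.
    def __init__(self, channels, spatial, n_angles=3):
        super().__init__()
        flat = channels * spatial * spatial
        # First processing layer: FC + activation; Tanh keeps part of the
        # output in [-1, 1] so both signs of the angle can be represented.
        self.head1 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 64), nn.ReLU(inplace=True),
            nn.Linear(64, n_angles), nn.Tanh(),
        )
        # Second processing layer: SeparableConV(10, 3, 1)-style head.
        self.head2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1, groups=channels, bias=False),
            nn.Conv2d(channels, 10, 1, bias=False),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(10, n_angles),
        )

    def forward(self, fusion_out):
        first_angle = self.head1(fusion_out)   # key-point-style angle information
        second_angle = self.head2(fusion_out)  # colour-channel regression angle information
        return first_angle, second_angle

p = ProcessingLayer(channels=96, spatial=5)
first, second = p(torch.randn(1, 96, 5, 5))
print(first.shape, second.shape)  # torch.Size([1, 3]) torch.Size([1, 3])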
需要说明的是,在本发明实施例的另一种实现方式中,初始模型还可以设置为包括单个处理分支,其中,单个处理分支中可以包括卷积层、激活函数层、并行的最大池化层以及平均池化层、拼接层、融合层以及用于对所述融合层的输出进行处理的处理层。其中,融合层用于对最大池化层及平均池化层的输出进行融合。It should be noted that, in another implementation manner of the embodiment of the present invention, the initial model may also be set to include a single processing branch, where a single processing branch may include a convolution layer, an activation function layer, a parallel maximum pooling layer and an average pooling layer, a concatenation layer, a fusion layer, and a processing layer for processing the output of the fusion layer. Among them, the fusion layer is used to fuse the outputs of the max pooling layer and the average pooling layer.
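A minimal sketch of this single-branch variant, assuming PyTorch and illustrative channel counts, is given below; the fusion is realised here as a 1×1 convolution over the concatenated max-pooled and average-pooled features, which is one possible reading of the description above rather than the definitive implementation.

import torch
import torch.nn as nn

class SingleBranch(nn.Module):
    # Assumed sketch: convolution and activation, parallel max/average pooling,
    # concatenation, a fusion convolution, and a small processing head.
    def __init__(self, in_ch=3, mid_ch=16, n_out=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.maxpool = nn.MaxPool2d(2, 2)
        self.avgpool = nn.AvgPool2d(2, 2)
        # fusion layer: fuses the max-pooled and average-pooled feature maps
        self.fusion = nn.Conv2d(2 * mid_ch, mid_ch, 1, bias=False)
        self.process = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(mid_ch, n_out))

    def forward(self, x):
        feat = self.conv(x)
        merged = torch.cat([self.maxpool(feat), self.avgpool(feat)], dim=1)  # concatenation layer
        return self.process(self.fusion(merged))

print(SingleBranch()(torch.randn(1, 3, 40, 40)).shape)  # torch.Size([1, 3])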
图6是本发明实施例提供的一种图像处理方法的步骤流程图,该方法可以应用于处理设备,如图6所示,所述方法可以包括:FIG. 6 is a flowchart of steps of an image processing method provided by an embodiment of the present invention. The method can be applied to a processing device. As shown in FIG. 6 , the method can include:
步骤201、将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出。Step 201: Use the image to be processed as the input of a preset image processing model to obtain the output of the image processing model.
本发明实施例中,处理设备可以为手机、云台相机等具备拍摄能力、处理能力的设备。处理设备上可以部署有预设的图像处理模型。待处理图像可以是根据需要提取图像信息的图像。示例的,待处理图像可以是通过处理设备拍摄得到的图像,或者是拍摄到的视频中的图像。In the embodiment of the present invention, the processing device may be a device with shooting and processing capabilities, such as a mobile phone or a pan-tilt camera. A preset image processing model may be deployed on the processing device. The image to be processed may be an image from which image information needs to be extracted. For example, the image to be processed may be an image captured by the processing device, or an image in a captured video.
步骤202、根据所述图像处理模型的输出,获取所述待处理图像的图像信息;其中,所述图像处理模型是根据上述模型生成方法生成的。Step 202: Acquire image information of the image to be processed according to the output of the image processing model; wherein the image processing model is generated according to the above model generation method.
由于预设的图像处理模型是通过获取以多种图像信息获取方式标注的训练数据进行训练得到的,这样,以多种图像信息获取方式标注标签时可以避免单一标注方式的局限性造成的样本不足的问题,进而可以确保训练数据的多样性以及充足性,一定程度上可以提高最终生成的图像处理模型的泛化能力,从而提高使用该图像处理模型对待处理图像进行提取时,提取到的图像信息的准确性。Since the preset image processing model is obtained by training on training data labeled using multiple image information acquisition methods, labeling with multiple methods avoids the problem of insufficient samples caused by the limitations of any single labeling method. This ensures the diversity and sufficiency of the training data and, to a certain extent, improves the generalization ability of the finally generated image processing model, thereby improving the accuracy of the image information extracted when the image processing model is used on the image to be processed.
可选的,图像信息可以包括图像中人脸的角度信息,图像处理模型的输出可以包括根据图像中人脸关键点获取的第一角度信息,以及根据图像中像素点的颜色通道值获取的第二角度信息。相应地,在根据图像处理模型的输出,获取待处理图像的图像信息时,可以是将第二角度信息确定为待处理图像的角度信息。即,应用时仅使用根据图像中像素点的颜色通道值获取的第二角度信息。Optionally, the image information may include angle information of a face in the image, and the output of the image processing model may include first angle information obtained according to the face key points in the image and second angle information obtained according to the color channel values of the pixels in the image. Correspondingly, when the image information of the image to be processed is obtained according to the output of the image processing model, the second angle information may be determined as the angle information of the image to be processed. That is, only the second angle information obtained according to the color channel values of the pixels in the image is used in application.
由于根据图像中像素点的颜色通道值获取的第二角度信息的准确性往往较高,因此,通过采用根据图像中像素点的颜色通道值获取的第二角度信息作为待处理图像的角度信息,一定程度上可以确保待处理图像的角度信息的准确性。当然,实际应用场景中,也可以是显示第一角度信息以及第二角度信息,将用户选择的角度信息作为待处理图像的角度信息。或者是,结合第一角度信息以及第二角度信息计算待处理图像的角度信息,例如,将第一角度信息以及第二角度信息的均值作为待处理图像的角度信息,本发明实施例对此不作限定。Since the accuracy of the second angle information obtained according to the color channel values of the pixels in the image is often higher, using this second angle information as the angle information of the image to be processed can, to a certain extent, ensure the accuracy of the angle information of the image to be processed. Of course, in an actual application scenario, the first angle information and the second angle information may both be displayed, and the angle information selected by the user may be used as the angle information of the image to be processed. Alternatively, the angle information of the image to be processed may be calculated by combining the first angle information and the second angle information, for example, by taking the average of the first angle information and the second angle information as the angle information of the image to be processed; this is not limited in the embodiment of the present invention.
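The choice described above can be illustrated with a few lines; the numeric values and tensor shapes are placeholders, not outputs from the patent's model.

import torch

first_angle = torch.tensor([[10.0, -5.0, 2.0]])   # placeholder output of the key-point head
second_angle = torch.tensor([[12.0, -4.0, 1.5]])  # placeholder output of the regression head

angle = second_angle                           # default: use the second angle information
angle_avg = (first_angle + second_angle) / 2   # alternative mentioned above: average both
print(angle, angle_avg)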
可选的,预设的图像处理模型可以包括对应不同预设尺寸的图像处理模型。相应地,将待处理图像作为预设的图像处理模型的输入,以获取图像处理模型的输出时:可以先确定与所述处理设备的处理性能相匹配的预设尺寸;其中,所述处理性能越高,所述相匹配的预设尺寸越大。然后,可以将所述待处理图像作为目标图像处理模型的输入,以获取所述目标图像处理模型的输出;所述目标图像处理模型为对应所述相匹配的预设尺寸的图像处理模型。Optionally, the preset image processing models may include image processing models corresponding to different preset sizes. Correspondingly, when the image to be processed is used as the input of the preset image processing model to obtain the output of the image processing model, a preset size matching the processing performance of the processing device may first be determined, where the higher the processing performance, the larger the matching preset size. Then, the image to be processed may be used as the input of a target image processing model to obtain the output of the target image processing model, the target image processing model being the image processing model corresponding to the matching preset size.
其中,处理性能可能基于处理设备的硬件配置确定,如果处理设备的硬件配置越高,那么可以确定处理设备的处理性能越高。相应地,可以根据预设的处理性能与预设尺寸对应关系,确定该处理设备的处理性能对应的预设尺寸,进而得到相匹配的预设尺寸。然后将该相匹配的预设尺寸的图像处理模型作为目标处理模型。示例的,假设相匹配的预设尺寸为40*40,那么目标处理模型可以为对应40*40的图像处理模型。The processing performance may be determined based on the hardware configuration of the processing device. If the hardware configuration of the processing device is higher, it may be determined that the processing performance of the processing device is higher. Correspondingly, the preset size corresponding to the processing performance of the processing device may be determined according to the corresponding relationship between the preset processing performance and the preset size, so as to obtain a matching preset size. Then, the image processing model with the matching preset size is used as the target processing model. For example, assuming that the matching preset size is 40*40, the target processing model may be an image processing model corresponding to 40*40.
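A sketch of this selection step is shown below; the performance tiers, the size values other than 40, and the mapping itself are assumptions for illustration only.

# Assumed mapping from processing-performance tier to matching preset size.
SIZE_BY_PERFORMANCE = {"low": 40, "medium": 64, "high": 112}

def pick_preset_size(performance_tier: str) -> int:
    # Higher processing performance -> larger matching preset size.
    return SIZE_BY_PERFORMANCE[performance_tier]

# models_by_size would hold one trained image processing model per preset size,
# e.g. {40: model_40, 64: model_64, 112: model_112}; it is a placeholder here.
size = pick_preset_size("low")
print(size)  # 40 -> the 40*40 image processing model would be the target model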
本发明实施例中,通过在模型生成阶段生成不同的预设尺寸的图像处理模型,在应用时,根据处理设备的实际处理能力选择相适配的图像处理模型进行图像处理,这样,一定程度上可以避免处理设备没有足够能力运行图像处理模型,进而导致设备卡顿的问题。In the embodiment of the present invention, image processing models of different preset sizes are generated in the model generation stage, and during application, a suitable image processing model is selected for image processing according to the actual processing capability of the processing device. In this way, the problem of the processing device lacking sufficient capability to run the image processing model and consequently stuttering can be avoided to a certain extent.
图7是本发明实施例提供的一种模型生成装置的框图,该装置可以包括:存储器301和处理器302。FIG. 7 is a block diagram of an apparatus for generating a model provided by an embodiment of the present invention. The apparatus may include: a memory 301 and a processor 302 .
所述存储器301,用于存储程序代码。The memory 301 is used to store program codes.
所述处理器302,调用所述程序代码,当所述程序代码被执行时,用于执行以下操作:The processor 302 calls the program code, and when the program code is executed, is configured to perform the following operations:
获取训练数据;所述训练数据中包括样本图像以及所述样本图像的标签,所述标签包括以至少两种图像信息获取方式生成的标签;Acquiring training data; the training data includes a sample image and a label of the sample image, and the label includes a label generated in at least two ways of acquiring image information;
根据所述样本图像以及所述样本图像的标签,对初始模型进行训练,以生成图像处理模型;所述图像处理模型用于提取图像信息,所述初始模型包括至少两个处理分支。According to the sample image and the label of the sample image, an initial model is trained to generate an image processing model; the image processing model is used to extract image information, and the initial model includes at least two processing branches.
可选的,所述图像信息包括图像中人脸的角度信息,所述标签包括以第一图像信息获取方式生成的第一标签以及以第二图像信息获取方式生成的第二标签;Optionally, the image information includes angle information of the face in the image, and the label includes a first label generated in a manner of acquiring first image information and a second tag generated in a manner of acquiring second image information;
其中,所述第一图像信息获取方式包括根据图像中人脸关键点获取角度信息的方式;所述第二图像信息获取方式包括根据图像中像素点的颜色通道值进行回归检测以获取角度信息的方式。Wherein, the first image information acquisition method includes a method of acquiring angle information according to face key points in the image; the second image information acquisition method includes a method of performing regression detection according to the color channel values of the pixels in the image to acquire angle information.
可选的,所述获取训练数据,包括:Optionally, the acquiring training data includes:
获取第一预设模型以及第二预设模型;所述第一预设模型用于根据图像中人脸关键点获取角度信息,所述第二预设模型用于根据图像中像素点的颜色通道值获取角度信息;Acquiring a first preset model and a second preset model; the first preset model is used to acquire angle information according to face key points in an image, and the second preset model is used to acquire angle information according to the color channel values of the pixels in an image;
根据所述第一预设模型对第一样本图像进行处理,以获取所述第一标签,以及根据所述第二预设模型对第二样本图像进行处理,以获取所述第二标签。The first sample image is processed according to the first preset model to obtain the first label, and the second sample image is processed according to the second preset model to obtain the second label.
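The label-generation step just described can be sketched as follows; the two preset models, the sample image tensors and the tagging of each label's origin are placeholders, and PyTorch is an assumed framework.

import torch

@torch.no_grad()
def build_training_data(first_preset_model, second_preset_model,
                        first_sample_images, second_sample_images):
    # first_preset_model: angle information derived from face key points.
    # second_preset_model: angle regression from pixel colour-channel values.
    # Both are assumed to be already trained; the images are (C, H, W) tensors.
    data = []
    for img in first_sample_images:
        first_label = first_preset_model(img.unsqueeze(0)).squeeze(0)
        data.append((img, first_label, "keypoint"))     # first label
    for img in second_sample_images:
        second_label = second_preset_model(img.unsqueeze(0)).squeeze(0)
        data.append((img, second_label, "regression"))  # second label
    return data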
可选的,所述根据所述样本图像以及所述样本图像的标签,对初始模型进行训练,以生成图像处理模型,包括:Optionally, the initial model is trained according to the sample image and the label of the sample image to generate an image processing model, including:
将所述样本图像调整为多个预设尺寸下的样本图像;adjusting the sample image to a sample image under multiple preset sizes;
对于各个所述预设尺寸下的样本图像,根据所述样本图像以及所述样本图像的标签,对所述初始模型进行训练,以获取各个所述预设尺寸下的图像处理模型。For each sample image in the preset size, the initial model is trained according to the sample image and the label of the sample image, so as to obtain the image processing model in each preset size.
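A sketch of training one image processing model per preset size is given below; the preset sizes other than 40, the model constructor and the training routine are placeholders rather than details from the patent.

import torch.nn.functional as F

# Assumed list of preset sizes; one image processing model is trained per size.
PRESET_SIZES = [40, 64, 112]

def train_models_per_size(build_initial_model, train_one_model, samples, labels):
    # `build_initial_model`, `train_one_model`, `samples` and `labels` are
    # placeholders for the caller's own model constructor, training routine
    # and data; samples are assumed to be (C, H, W) tensors.
    models_by_size = {}
    for size in PRESET_SIZES:
        resized = [F.interpolate(img.unsqueeze(0), size=(size, size),
                                 mode="bilinear", align_corners=False).squeeze(0)
                   for img in samples]
        model = build_initial_model(size)
        models_by_size[size] = train_one_model(model, resized, labels)
    return models_by_size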
可选的,所述样本图像包括第一样本图像及第二样本图像,所述第一样本图像的第一标签的获取方式与所述第二样本图像的第二标签的获取方式不同;Optionally, the sample image includes a first sample image and a second sample image, and the method for acquiring the first label of the first sample image is different from the method for acquiring the second label of the second sample image;
所述根据所述样本图像以及所述样本图像的标签,对所述初始模型进行训练,包括:The training of the initial model according to the sample image and the label of the sample image includes:
将所述第一样本图像划分为多个第一样本组,以及,将所述第二样本图像划分为多个第二样本组;dividing the first sample image into a plurality of first sample groups, and dividing the second sample image into a plurality of second sample groups;
根据所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,以及所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行交叉训练。Cross-training the initial model according to the first sample images in the first sample groups and the first labels of the first sample images, and the second sample images in the second sample groups and the second labels of the second sample images.
可选的,所述第一样本组中包含的第一样本图像的数量与所述第二样本组中包含的第二样本图像的数量相同;Optionally, the number of first sample images included in the first sample group is the same as the number of second sample images included in the second sample group;
所述根据所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,以及所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行交叉训练,包括:The cross-training of the initial model according to the first sample images in the first sample groups and the first labels of the first sample images, and the second sample images in the second sample groups and the second labels of the second sample images, includes:
根据一个所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,对所述初始模型进行训练,以更新所述初始模型的模型参数;training the initial model according to a first sample image in one of the first sample groups and a first label of the first sample image to update model parameters of the initial model;
在更新所述初始模型的模型参数之后,根据所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行训练,以更新所述初始模型的模型参数,并在更新所述模型参数之后重新执行所述根据一个所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,对所述初始模型进行训练的步骤。After updating the model parameters of the initial model, training the initial model according to the second sample images in one of the second sample groups and the second labels of the second sample images to update the model parameters of the initial model, and after updating the model parameters, re-executing the step of training the initial model according to the first sample images in one of the first sample groups and the first labels of the first sample images.
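The alternating update just described can be sketched as a training loop; the optimiser, the loss function and the assumption that the model exposes one output per labelling method are illustrative choices, not details from the patent.

import torch
import torch.nn as nn

def cross_train(model, first_groups, second_groups, lr=1e-3):
    # first_groups / second_groups: iterables of (images, labels) batches whose
    # labels come from the key-point method and the regression method respectively.
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    for (imgs1, labels1), (imgs2, labels2) in zip(first_groups, second_groups):
        # Step 1: one first sample group (first labels) updates the parameters.
        optimiser.zero_grad()
        first_out, _ = model(imgs1)
        criterion(first_out, labels1).backward()
        optimiser.step()
        # Step 2: one second sample group (second labels) updates the parameters,
        # then the loop returns to step 1 with the next first sample group.
        optimiser.zero_grad()
        _, second_out = model(imgs2)
        criterion(second_out, labels2).backward()
        optimiser.step()
    return model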
可选的,所述至少两个处理分支包括第一处理分支及第二处理分支,所述第一处理分支及所述第二处理分支均包括卷积层、激活函数层以及池化层;Optionally, the at least two processing branches include a first processing branch and a second processing branch, and both the first processing branch and the second processing branch include a convolution layer, an activation function layer, and a pooling layer;
所述初始模型还包括用于对所述第一处理分支及所述第二处理分支的输出进行融合的融合层以及用于对所述融合层的输出进行处理的处理层。The initial model also includes a fusion layer for fusing the outputs of the first processing branch and the second processing branch and a processing layer for processing the output of the fusion layer.
可选的,所述融合层中包括卷积层,所述融合层中的卷积层以及所述第一处理分支及所述第二处理分支中的卷积层用于进行深度可分卷积操作。Optionally, the fusion layer includes a convolutional layer, and the convolutional layer in the fusion layer and the convolutional layers in the first processing branch and the second processing branch are used to perform depthwise separable convolution operations.
可选的,所述处理层包括第一处理层以及第二处理层;Optionally, the processing layer includes a first processing layer and a second processing layer;
所述第一处理层包括全连接层以及激活函数层,所述第二处理层包括深度可分卷积层。The first processing layer includes a fully connected layer and an activation function layer, and the second processing layer includes a depthwise separable convolutional layer.
综上所述,本发明实施例提供的模型生成装置,可以获取训练数据,其中,训练数据中包括样本图像以及样本图像的标签,标签包括以至少两种图像信息获取方式生成的标签。然后,根据样本图像以及样本图像的标签,对初始模型进行训练,以生成图像处理模型,其中,图像处理模型用于提取图像信息,初始模型包括至少两个处理分支。由于以多种图像信息获取方式标注标签时可以避免单一标注方式的局限性造成的样本不足的问题,这样,通过获取以多种图像信息获取方式标注的训练数据进行训练,可以确保训练数据的多样性以及充足性,进而一定程度上可以提高最终生成的图像处理模型的泛化能力,从而提高后续使用该图像处理模型提取的图像信息的准确性。To sum up, the model generating apparatus provided by the embodiments of the present invention can acquire training data, wherein the training data includes sample images and labels of the sample images, and the labels include labels generated by at least two image information acquisition methods. Then, according to the sample images and the labels of the sample images, the initial model is trained to generate an image processing model, wherein the image processing model is used to extract image information, and the initial model includes at least two processing branches. Since the problem of insufficient samples caused by the limitation of a single labeling method can be avoided when labels are marked with multiple image information acquisition methods, so, by acquiring training data marked with multiple image information acquisition methods for training, the diversity of training data can be ensured. To a certain extent, the generalization ability of the final generated image processing model can be improved, thereby improving the accuracy of the image information extracted by the image processing model subsequently.
图8是本发明实施例提供的一种图像处理装置的框图,该装置可以包括:存储器401和处理器402。FIG. 8 is a block diagram of an image processing apparatus provided by an embodiment of the present invention. The apparatus may include: a memory 401 and a processor 402 .
所述存储器401,用于存储程序代码。The memory 401 is used to store program codes.
所述处理器402,调用所述程序代码,当所述程序代码被执行时,用于执行以下操作:The processor 402 calls the program code, and when the program code is executed, is configured to perform the following operations:
将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出;Using the image to be processed as the input of the preset image processing model to obtain the output of the image processing model;
根据所述图像处理模型的输出,获取所述待处理图像的图像信息;obtaining image information of the to-be-processed image according to the output of the image processing model;
其中,所述图像处理模型是根据上述模型生成方法生成的。Wherein, the image processing model is generated according to the above model generation method.
可选的,所述图像信息包括图像中人脸的角度信息;所述图像处理模型的输出包括根据图像中人脸关键点获取的第一角度信息,以及根据图像中像素点的颜色通道值获取的第二角度信息;Optionally, the image information includes angle information of a face in the image; the output of the image processing model includes first angle information obtained according to face key points in the image, and second angle information obtained according to the color channel values of the pixels in the image;
所述根据所述图像处理模型的输出,获取所述待处理图像的图像信息,包括:The obtaining image information of the to-be-processed image according to the output of the image processing model includes:
将所述第二角度信息确定为所述待处理图像的角度信息。The second angle information is determined as the angle information of the image to be processed.
可选的,所述图像处理模型包括对应不同预设尺寸的图像处理模型;Optionally, the image processing model includes image processing models corresponding to different preset sizes;
所述将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出,包括:Taking the image to be processed as the input of the preset image processing model to obtain the output of the image processing model includes:
确定与所述处理设备的处理性能相匹配的预设尺寸;其中,所述处理性能越高,所述相匹配的预设尺寸越大;determining a preset size matching the processing performance of the processing device; wherein, the higher the processing performance, the larger the matching preset size;
将所述待处理图像作为目标图像处理模型的输入,以获取所述目标图像处理模型的输出;所述目标图像处理模型为对应所述相匹配的预设尺寸的图像处理模型。The to-be-processed image is used as the input of the target image processing model to obtain the output of the target image processing model; the target image processing model is the image processing model corresponding to the matching preset size.
综上所述,本发明实施例提供的图像处理装置,由于使用的预设的图像处理模型是通过获取以多种图像信息获取方式标注的训练数据进行训练得到的,这样,以多种图像信息获取方式标注标签时可以避免单一标注方式的局限性造成的样本不足的问题,进而可以确保训练数据的多样性以及充足性,一定程度上可以提高最终生成的图像处理模型的泛化能力,从而提高使用该图像处理模型对待处理图像提取时,提取到的图像信息的准确性。To sum up, in the image processing apparatus provided by the embodiments of the present invention, the preset image processing model used is obtained by training on training data labeled using multiple image information acquisition methods. Labeling with multiple methods avoids the problem of insufficient samples caused by the limitations of any single labeling method, which ensures the diversity and sufficiency of the training data, improves the generalization ability of the finally generated image processing model to a certain extent, and thereby improves the accuracy of the image information extracted when the image processing model is used on the image to be processed.
进一步地,本发明实施例还提供一种计算机可读存储介质,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现上述方法中的各个步骤,且能达到相同的技术效果,为避免重复,这里不再赘述。Further, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, each step of the above method is implemented and the same technical effect can be achieved; to avoid repetition, the details are not repeated here.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in One place, or it can be distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
本发明的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器来实现根据本发明实施例的计算处理设备中的一些或者全部部件的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如,计算机程序和计算机程序产品)。这样的实现本发明的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。Various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor may be used in practice to implement some or all of the functions of some or all of the components in the computing processing device according to the embodiments of the present invention. The present invention can also be implemented as apparatus or apparatus programs (eg, computer programs and computer program products) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from Internet sites, or provided on carrier signals, or in any other form.
例如,图9为本发明实施例提供的一种计算处理设备的框图,如图9所示,图9示出了可以实现根据本发明的方法的计算处理设备。该计算处理设备传统上包括处理器710和以存储器720形式的计算机程序产品或者计算机可读介质。存储器720可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。存储器720具有用于执行上述方法中的任何方法步骤的程序代码的存储空间730。例如,用于程序代码的存储空间730可以包括分别用于实现上面的方法中的各种步骤的各个程序代码。这些程序代码可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。这些计算机程序产品包括诸如硬盘,紧致盘(CD)、存储卡或者软盘之类的程序代码载体。这样的计算机程序产品通常为如参考图10所述 的便携式或者固定存储单元。该存储单元可以具有与图9的计算处理设备中的存储器720类似布置的存储段、存储空间等。程序代码可以例如以适当形式进行压缩。通常,存储单元包括计算机可读代码,即可以由例如诸如710之类的处理器读取的代码,这些代码当由计算处理设备运行时,导致该计算处理设备执行上面所描述的方法中的各个步骤。For example, FIG. 9 is a block diagram of a computing processing device provided by an embodiment of the present invention. As shown in FIG. 9 , FIG. 9 shows a computing processing device that can implement the method according to the present invention. The computing processing device traditionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720 . The memory 720 may be electronic memory such as flash memory, EEPROM (electrically erasable programmable read only memory), EPROM, hard disk, or ROM. The memory 720 has storage space 730 for program code for performing any of the method steps in the above-described methods. For example, the storage space 730 for program codes may include various program codes for implementing various steps in the above methods, respectively. These program codes can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to Figure 10 . The storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 720 in the computing processing device of FIG. 9 . The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer readable code, ie code readable by a processor such as 710 for example, which when executed by a computing processing device, causes the computing processing device to perform each of the methods described above. step.
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts between the various embodiments may be referred to each other.
本文中所称的“一个实施例”、“实施例”或者“一个或者多个实施例”意味着,结合实施例描述的特定特征、结构或者特性包括在本发明的至少一个实施例中。此外,请注意,这里“在一个实施例中”的词语例子不一定全指同一个实施例。Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Also, please note that instances of the phrase "in one embodiment" herein are not necessarily all referring to the same embodiment.
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本发明的实施例可以在没有这些具体细节的情况下被实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third, etc. do not denote any order. These words can be interpreted as names.
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still be The technical solutions described in the foregoing embodiments are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (15)

  1. 一种模型生成方法,其特征在于,所述方法包括:A model generation method, characterized in that the method comprises:
    获取训练数据;所述训练数据中包括样本图像以及所述样本图像的标签,所述标签包括以至少两种图像信息获取方式生成的标签;Acquiring training data; the training data includes a sample image and a label of the sample image, and the label includes a label generated in at least two ways of acquiring image information;
    根据所述样本图像以及所述样本图像的标签,对初始模型进行训练,以生成图像处理模型;所述图像处理模型用于提取图像信息,所述初始模型包括至少两个处理分支。According to the sample image and the label of the sample image, an initial model is trained to generate an image processing model; the image processing model is used to extract image information, and the initial model includes at least two processing branches.
  2. 根据权利要求1所述方法,其特征在于,所述图像信息包括图像中人脸的角度信息,所述标签包括以第一图像信息获取方式生成的第一标签以及以第二图像信息获取方式生成的第二标签;The method according to claim 1, wherein the image information includes angle information of a face in the image, and the labels include a first label generated using a first image information acquisition method and a second label generated using a second image information acquisition method;
    其中,所述第一图像信息获取方式包括根据图像中人脸关键点获取角度信息的方式;所述第二图像信息获取方式包括根据图像中像素点的颜色通道值进行回归检测以获取角度信息的方式。Wherein, the first image information acquisition method includes a method of acquiring angle information according to face key points in the image; the second image information acquisition method includes a method of performing regression detection according to the color channel values of the pixels in the image to acquire angle information.
  3. 根据权利要求2所述的方法,其特征在于,所述获取训练数据,包括:The method according to claim 2, wherein the acquiring training data comprises:
    获取第一预设模型以及第二预设模型;所述第一预设模型用于根据图像中人脸关键点获取角度信息,所述第二预设模型用于根据图像中像素点的颜色通道值获取角度信息;Acquiring a first preset model and a second preset model; the first preset model is used to acquire angle information according to face key points in an image, and the second preset model is used to acquire angle information according to the color channel values of the pixels in an image;
    根据所述第一预设模型对第一样本图像进行处理,以获取所述第一标签,以及根据所述第二预设模型对第二样本图像进行处理,以获取所述第二标签。The first sample image is processed according to the first preset model to obtain the first label, and the second sample image is processed according to the second preset model to obtain the second label.
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述根据所述样本图像以及所述样本图像的标签,对初始模型进行训练,以生成图像处理模型,包括:The method according to any one of claims 1 to 3, wherein the training an initial model according to the sample image and the label of the sample image to generate an image processing model, comprising:
    将所述样本图像调整为多个预设尺寸下的样本图像;adjusting the sample image to a sample image under multiple preset sizes;
    对于各个所述预设尺寸下的样本图像,根据所述样本图像以及所述样本图像的标签,对所述初始模型进行训练,以获取各个所述预设尺寸下的图像处理模型。For each sample image in the preset size, the initial model is trained according to the sample image and the label of the sample image, so as to obtain the image processing model in each preset size.
  5. 根据权利要求4所述的方法,其特征在于,所述样本图像包括第一样本图像及第二样本图像,所述第一样本图像的第一标签的获取方式与所述第二样本图像的第二标签的获取方式不同;The method according to claim 4, wherein the sample images include a first sample image and a second sample image, and the method of acquiring the first label of the first sample image is different from the method of acquiring the second label of the second sample image;
    所述根据所述样本图像以及所述样本图像的标签,对所述初始模型进行训练,包括:The training of the initial model according to the sample image and the label of the sample image includes:
    将所述第一样本图像划分为多个第一样本组,以及,将所述第二样本图像划分为多个第二样本组;dividing the first sample image into a plurality of first sample groups, and dividing the second sample image into a plurality of second sample groups;
    根据所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,以及所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行交叉训练。Cross-training the initial model according to the first sample images in the first sample groups and the first labels of the first sample images, and the second sample images in the second sample groups and the second labels of the second sample images.
  6. 根据权利要求5所述的方法,其特征在于,所述第一样本组中包含的第一样本图像的数量与所述第二样本组中包含的第二样本图像的数量相同;The method according to claim 5, wherein the number of the first sample images included in the first sample group is the same as the number of the second sample images included in the second sample group;
    所述根据所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,以及所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行交叉训练,包括:The cross-training of the initial model according to the first sample images in the first sample groups and the first labels of the first sample images, and the second sample images in the second sample groups and the second labels of the second sample images, includes:
    根据一个所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,对所述初始模型进行训练,以更新所述初始模型的模型参数;training the initial model according to a first sample image in one of the first sample groups and a first label of the first sample image to update model parameters of the initial model;
    在更新所述初始模型的模型参数之后,根据所述第二样本组中的第二样本图像及所述第二样本图像的第二标签,对所述初始模型进行训练,以更新所述初始模型的模型参数,并在更新所述模型参数之后重新执行所述根据一个所述第一样本组中的第一样本图像及所述第一样本图像的第一标签,对所述初始模型进行训练的步骤。After updating the model parameters of the initial model, training the initial model according to the second sample images in one of the second sample groups and the second labels of the second sample images to update the model parameters of the initial model, and after updating the model parameters, re-executing the step of training the initial model according to the first sample images in one of the first sample groups and the first labels of the first sample images.
  7. 根据权利要求2所述的方法,其特征在于,所述至少两个处理分支包括第一处理分支及第二处理分支,所述第一处理分支及所述第二处理分支均包括卷积层、激活函数层以及池化层;The method of claim 2, wherein the at least two processing branches comprise a first processing branch and a second processing branch, and both the first processing branch and the second processing branch comprise convolutional layers, Activation function layer and pooling layer;
    所述初始模型还包括用于对所述第一处理分支及所述第二处理分支的输出进行融合的融合层以及用于对所述融合层的输出进行处理的处理层。The initial model also includes a fusion layer for fusing the outputs of the first processing branch and the second processing branch and a processing layer for processing the output of the fusion layer.
  8. 根据权利要求7所述的方法,其特征在于,所述融合层中包括卷积层,所述融合层中的卷积层以及所述第一处理分支及所述第二处理分支中的卷积层用于进行深度可分卷积操作。The method according to claim 7, wherein the fusion layer includes a convolutional layer, and the convolutional layer in the fusion layer and the convolutional layers in the first processing branch and the second processing branch are used to perform depthwise separable convolution operations.
  9. 根据权利要求7或8所述的方法,其特征在于,所述处理层包括第一处理层以及第二处理层;The method according to claim 7 or 8, wherein the processing layer comprises a first processing layer and a second processing layer;
    所述第一处理层包括全连接层以及激活函数层,所述第二处理层包括深度可分卷积层。The first processing layer includes a fully connected layer and an activation function layer, and the second processing layer includes a depthwise separable convolutional layer.
  10. 一种图像处理方法,其特征在于,应用于处理设备,所述方法包括:An image processing method, characterized in that, applied to a processing device, the method comprising:
    将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出;Using the image to be processed as the input of the preset image processing model to obtain the output of the image processing model;
    根据所述图像处理模型的输出,获取所述待处理图像的图像信息;obtaining image information of the to-be-processed image according to the output of the image processing model;
    其中,所述图像处理模型是根据上述权利要求1至9任一所述方法生成的。Wherein, the image processing model is generated according to the method of any one of the above claims 1 to 9.
  11. 根据权利要求10所述的方法,其特征在于,所述图像信息包括图像中人脸的角度信息;所述图像处理模型的输出包括以根据图像中人脸关键点获取的第一角度信息,以及根据图像中像素点的颜色通道值获取的第二角度信息;The method according to claim 10, wherein the image information includes angle information of the face in the image; the output of the image processing model includes the first angle information obtained according to the key points of the face in the image, and The second angle information obtained according to the color channel value of the pixel in the image;
    所述根据所述图像处理模型的输出,获取所述待处理图像的图像信息,包括:The obtaining image information of the to-be-processed image according to the output of the image processing model includes:
    将所述第二角度信息确定为所述待处理图像的角度信息。The second angle information is determined as the angle information of the image to be processed.
  12. 根据权利要求10或11所述的方法,其特征在于,所述图像处理模型包括对应不同预设尺寸的图像处理模型;The method according to claim 10 or 11, wherein the image processing model comprises image processing models corresponding to different preset sizes;
    所述将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出,包括:Taking the image to be processed as the input of the preset image processing model to obtain the output of the image processing model includes:
    确定与所述处理设备的处理性能相匹配的预设尺寸;其中,所述处理性能越高,所述相匹配的预设尺寸越大;determining a preset size matching the processing performance of the processing device; wherein, the higher the processing performance, the larger the matching preset size;
    将所述待处理图像作为目标图像处理模型的输入,以获取所述目标图像处理模型的输出;所述目标图像处理模型为对应所述相匹配的预设尺寸的图像处理模型。The to-be-processed image is used as the input of the target image processing model to obtain the output of the target image processing model; the target image processing model is the image processing model corresponding to the matching preset size.
  13. 一种模型生成装置,其特征在于,所述装置包括存储器和处理器;A model generation device, characterized in that the device includes a memory and a processor;
    所述存储器,用于存储程序代码;the memory for storing program codes;
    所述处理器,调用所述程序代码,当所述程序代码被执行时,用于执行 以下操作:The processor calls the program code, and when the program code is executed, is configured to perform the following operations:
    获取训练数据;所述训练数据中包括样本图像以及所述样本图像的标签,所述标签包括以至少两种图像信息获取方式生成的标签;Acquiring training data; the training data includes a sample image and a label of the sample image, and the label includes a label generated in at least two ways of acquiring image information;
    根据所述样本图像以及所述样本图像的标签,对初始模型进行训练,以生成图像处理模型;所述图像处理模型用于提取图像信息,所述初始模型包括至少两个处理分支。According to the sample image and the label of the sample image, an initial model is trained to generate an image processing model; the image processing model is used to extract image information, and the initial model includes at least two processing branches.
  14. 一种图像处理装置,其特征在于,所述装置应用于处理设备,所述装置包括存储器和处理器;An image processing apparatus, characterized in that, the apparatus is applied to processing equipment, and the apparatus includes a memory and a processor;
    所述存储器,用于存储程序代码;the memory for storing program codes;
    所述处理器,调用所述程序代码,当所述程序代码被执行时,用于执行以下操作:The processor calls the program code, and when the program code is executed, is configured to perform the following operations:
    将待处理图像作为预设的图像处理模型的输入,以获取所述图像处理模型的输出;Using the image to be processed as the input of the preset image processing model to obtain the output of the image processing model;
    根据所述图像处理模型的输出,获取所述待处理图像的图像信息;obtaining image information of the to-be-processed image according to the output of the image processing model;
    其中,所述图像处理模型是根据上述权利要求13所述装置生成的。Wherein, the image processing model is generated according to the device of claim 13 above.
  15. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储计算机程序,所述计算机程序被处理器执行时实现入权利要求1至权利要求12中任一所述的方法。A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method of any one of claims 1 to 12 is implemented.
PCT/CN2020/141003 2020-12-29 2020-12-29 Model generation method and apparatus, image processing method and apparatus, and readable storage medium WO2022141092A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141003 WO2022141092A1 (en) 2020-12-29 2020-12-29 Model generation method and apparatus, image processing method and apparatus, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022141092A1 true WO2022141092A1 (en) 2022-07-07

Family

ID=82258707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141003 WO2022141092A1 (en) 2020-12-29 2020-12-29 Model generation method and apparatus, image processing method and apparatus, and readable storage medium

Country Status (1)

Country Link
WO (1) WO2022141092A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657534A (en) * 2018-10-30 2019-04-19 百度在线网络技术(北京)有限公司 The method, apparatus and electronic equipment analyzed human body in image
CN110210544A (en) * 2019-05-24 2019-09-06 上海联影智能医疗科技有限公司 Image classification method, computer equipment and storage medium
US20200026967A1 (en) * 2018-07-23 2020-01-23 International Business Machines Corporation Sparse mri data collection and classification using machine learning
CN111340209A (en) * 2020-02-18 2020-06-26 北京推想科技有限公司 Network model training method, image segmentation method and focus positioning method

Similar Documents

Publication Publication Date Title
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN111209970B (en) Video classification method, device, storage medium and server
CN110267119B (en) Video precision and chroma evaluation method and related equipment
WO2019100724A1 (en) Method and device for training multi-label classification model
WO2022042123A1 (en) Image recognition model generation method and apparatus, computer device and storage medium
US20120155759A1 (en) Establishing clusters of user preferences for image enhancement
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN110765882B (en) Video tag determination method, device, server and storage medium
WO2018005565A1 (en) Automated selection of subjectively best images from burst captured image sequences
CN111738243A (en) Method, device and equipment for selecting face image and storage medium
CN113128478B (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN113869282B (en) Face recognition method, hyper-resolution model training method and related equipment
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN111340213B (en) Neural network training method, electronic device, and storage medium
US11881052B2 (en) Face search method and apparatus
CN112132279A (en) Convolutional neural network model compression method, device, equipment and storage medium
WO2022141094A1 (en) Model generation method and apparatus, image processing method and apparatus, and readable storage medium
CN115115552B (en) Image correction model training method, image correction device and computer equipment
WO2022141092A1 (en) Model generation method and apparatus, image processing method and apparatus, and readable storage medium
TWI803243B (en) Method for expanding images, computer device and storage medium
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN114566160A (en) Voice processing method and device, computer equipment and storage medium
CN115661618A (en) Training method of image quality evaluation model, image quality evaluation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967440

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967440

Country of ref document: EP

Kind code of ref document: A1