CN116109531A - Image processing method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN116109531A (application number CN202111327454.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- human body
- loss
- pixel
- terminal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20076—Probabilistic image processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
The application provides an image processing method, an image processing apparatus, a computer device and a storage medium, and belongs to the technical field of artificial intelligence. The method includes the following steps: mapping a first image into three intermediate images, the first image containing a human body, the three intermediate images having different scales and representing image features of the first image; fusing the three intermediate images to obtain a second image; and mapping the second image into a target image in which different parts of the human body are labeled. Because the first image is mapped into three images of different scales and then fused, the scheme reduces structural complexity, computation and inference time compared with fusing intermediate images of every scale in a complex neural network, and can therefore be deployed on a mobile terminal.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to an image processing method, an image processing device, a computer device, and a storage medium.
Background
Human body analysis is a technique of dividing a human body in an image or video into a plurality of semantically consistent regions, for example dividing the human body into a head, hands, legs, and the like. At present, human body analysis technology generally uses a deep neural network to predict which pixels in an image belong to the same semantic region, thereby segmenting the human body in the image and obtaining a relatively accurate human body analysis result.
However, the neural network used in this scheme has a complex structure, a large amount of computation and a long inference time, which makes it difficult to deploy on a mobile terminal.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing apparatus, a computer device and a storage medium, which reduce structural complexity, computation and inference time compared with fusing intermediate images of every scale in a complex neural network, so that deployment on a mobile terminal becomes possible. The technical scheme is as follows:
in one aspect, there is provided an image processing method, the method including:
mapping a first image into three intermediate images, the first image comprising a human body, the three intermediate images having different scales and being used for representing image features of the first image;
Fusing the three intermediate images to obtain a second image;
and mapping the second image into a target image, wherein the target image is marked with different parts of the human body.
In another aspect, there is provided an image processing apparatus including:
the first mapping module is used for mapping a first image into three intermediate images, wherein the first image comprises a human body, and the three intermediate images are different in scale;
the image fusion module is used for fusing the three intermediate images to obtain a second image;
and the second mapping module is used for mapping the second image into a target image, and the target image is marked with different parts of the human body.
In some embodiments, the first mapping module is configured to convolve the first image to obtain a first intermediate image; convolving the first intermediate image to obtain a second intermediate image; and carrying out channel feature reinforcement and semantic feature reinforcement on the second intermediate image to obtain a third intermediate image, wherein the channel feature reinforcement is used for reinforcing the importance of different channel features, and the semantic feature reinforcement is used for reinforcing global semantic information.
In some embodiments, the image fusion module is configured to fuse a result of convolving the third intermediate image with the second intermediate image to obtain a first fused image; and fusing the result of convolution of the first fused image and the first intermediate image to obtain a second fused image, and taking the second fused image as the second image.
In some embodiments, the second mapping module is configured to increase the resolution of the second image to obtain a third image, where the resolution of the third image is not higher than the resolution of the first image; and convolving the third image to obtain the target image.
In some embodiments, the steps performed by the image processing apparatus are implemented based on a human body analysis model, which is used to analyze the input image for a human body and output images marked with different parts of the human body.
In some embodiments, the apparatus further comprises:
the preprocessing module is used for preprocessing a first marked image of a sample image to obtain an encoded image of the sample image, wherein the first marked image is used for indicating different parts of a sample human body in the sample image, and the encoded image is used for indicating a prediction result of a previous frame image of the sample image;
The splicing module is used for splicing the sample image and the coded image to obtain an input image;
and the training module is used for training the human body analysis model iterated in the ith round based on the input image by taking the first marked image as supervision information, wherein i is a positive integer.
In some embodiments, the preprocessing module is configured to perform image transformation on the first labeling image of the sample image to obtain a second labeling image; and encoding the second marked image to obtain the encoded image.
In some embodiments, the preprocessing module is configured to perform at least one of rigid transformation and non-rigid transformation on the first labeling image to obtain the second labeling image.
In some embodiments, the preprocessing module is configured to map pixels in the second labeling image to a target vector space according to the pixel class to obtain the encoded image.
In some embodiments, the stitching module is configured to stitch the sample image and the encoded image in a channel dimension to obtain the input image.
In some embodiments, the training module is configured to parse the input image based on the body parsing model of the ith iteration to obtain a predicted image, where the predicted image is used to indicate different parts of the sample body obtained by prediction; determining, based on the first annotation image and the prediction image, a first loss indicating a difference between the first annotation image and the prediction image, a second loss indicating a difference between the first annotation image and the prediction image after pixel weighting, and a third loss indicating a difference between the first annotation image and the prediction image after pixel addition of dependency information indicating information contained in pixels surrounding the pixel; based on the first loss, the second loss, and the third loss, model parameters of the human body analytical model of the ith round of iteration are adjusted.
In some embodiments, the training module is configured to determine, based on a number of pixels corresponding to each pixel class in the predicted image, a class weight for each pixel class, where the class weight is inversely related to the number of pixels; and determining a weighted cross entropy loss based on the class weight of each pixel class, and taking the weighted cross entropy loss as the second loss.
In some embodiments, the training module is configured to determine a labeling probability distribution based on the first labeling image; determining a predictive probability distribution based on the predicted image; determining a labeling probability density function, a prediction probability density function and a joint probability density function based on the labeling probability distribution and the prediction probability distribution; and determining a cross entropy loss based on the labeling probability density function, the prediction probability density function and the joint probability density function, and taking the cross entropy loss as the third loss.
In another aspect, a computer device is provided, the computer device including a processor and a memory for storing at least one segment of a computer program loaded and executed by the processor to implement the image processing method in the embodiments of the present application.
In another aspect, a computer readable storage medium having stored therein at least one segment of a computer program loaded and executed by a processor to implement an image processing method as in embodiments of the present application is provided.
In another aspect, a computer program product or a computer program is provided, the computer program product or computer program comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a computer device, the computer program code being executed by the processor such that the computer device performs the image processing method in various alternative implementations of the above aspects.
A scheme of image processing is provided in which the first image is mapped into three images of different scales and then fused. Compared with fusing intermediate images of every scale in a complex neural network, this reduces structural complexity, computation and inference time, so that deployment can be carried out on a mobile terminal.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an implementation environment of an image processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of an image processing method provided according to an embodiment of the present application;
FIG. 3 is a flow chart of another image processing method provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic illustration of an image transformation provided in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram of an image stitching provided in accordance with an embodiment of the present application;
FIG. 6 is a schematic diagram of a compression activation module provided in accordance with an embodiment of the present application;
FIG. 7 is a schematic diagram of a pyramid pooling module provided in accordance with embodiments of the present application;
FIG. 8 is a schematic diagram of a model structure provided according to an embodiment of the present application;
fig. 9 is a schematic diagram comparing the results of video human body analysis according to an embodiment of the present application;
fig. 10 is a block diagram of an image processing apparatus provided according to an embodiment of the present application;
fig. 11 is a block diagram of another image processing apparatus provided according to an embodiment of the present application;
fig. 12 is a block diagram of a terminal according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function and effect. It should be understood that there is no logical or chronological dependency among "first," "second," and "nth," and that they do not limit the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality of" means two or more.
It will be appreciated that in embodiments of the present application, related data such as user information, images, etc. is referred to, and that when the examples herein are applied to a particular product or technology, user approval or consent is required, and that the collection, use, and processing of related data is required to comply with relevant laws and regulations and standards of the relevant country and region.
Hereinafter, terms related to the present application are explained.
Human body analysis refers to dividing a human body captured in an image/video into a plurality of semantically consistent regions, such as head, hands, legs, etc.
- mIOU (Mean Intersection over Union) is a standard metric for semantic segmentation, representing the intersection over union of two sets; in the semantic segmentation problem, the two sets are the labeled value (ground truth) and the predicted value (predicted segmentation).
Pixel Acc (Pixel Accuracy) refers to the ratio of the number of correctly classified pixels to the number of all pixels.
FFM (Feature Fusion Module) is used for fusing features of different scales. In much deep learning work (e.g., object detection, image segmentation), fusing features of different scales is an important means of improving performance. Lower-level features have higher resolution and contain more location and detail information, but because they pass through fewer convolutions they have weaker semantics and more noise. Higher-level features have stronger semantic information, but very low resolution and poor perception of detail.
The image processing method provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. In the following, taking a computer device as an example of a terminal, an implementation environment of an image processing method provided in an embodiment of the present application is described, and fig. 1 is a schematic diagram of an implementation environment of an image processing method provided in an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In some embodiments, terminal 101 is a smart phone, tablet, notebook, desktop, smart box, smart watch, or the like, but is not limited thereto. The terminal 101 installs and runs application programs supporting image processing, such as an album program, a shooting program, and a social program. Those skilled in the art will appreciate that the number of terminals 101 may be greater or lesser. Such as one terminal, or tens or hundreds, or more. The number of terminals and the device type are not limited in the embodiment of the present application.
In some embodiments, the server 102 is a stand-alone physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms. The server 102 is used to provide background services for applications that support image processing. In some embodiments, the server 102 takes on primary computing work and the terminal 101 takes on secondary computing work; alternatively, the server 102 takes on secondary computing work and the terminal 101 takes on primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 102 and the terminal 101.
In this embodiment of the present application, the terminal 101 may obtain a human body analysis model from a server, where the human body analysis model is used to analyze an input image of a human body and output images marked with different parts of the human body. Then, based on the human body analysis model deployed at the terminal, the first image is input into a human body analysis model, the human body analysis model maps the first image into three intermediate images with different scales, and then, based on the human body analysis model, the three intermediate images are fused to obtain a second image. And then mapping the second image into a target image based on the human body analysis model, and finally displaying the target image output by the human body analysis model by the terminal.
Fig. 2 is a flowchart of an image processing method according to an embodiment of the present application, and as shown in fig. 2, an example of execution by a terminal is described in the embodiment of the present application. The method comprises the following steps:
201. The terminal maps the first image into three intermediate images, the first image comprising a human body, the three intermediate images having different scales and being used for representing image features of the first image.
In the embodiment of the present application, the terminal is the terminal 101 in fig. 1. The first image is an image to be processed. The terminal can convolve the first image based on a plurality of convolution layers, and map the first image into three intermediate images with different scales. In other words, the three intermediate images are identical in source, but the convolution processing is different, resulting in the three intermediate images being different in scale. The three intermediate images are feature images of the first image, and can represent image features of the first image.
202. And the terminal fuses the three intermediate images to obtain a second image.
In the embodiment of the application, the terminal can sequentially fuse the three intermediate images to obtain the fused second image, and the second image contains the characteristics of the intermediate images with the three different scales and can be used for improving the analysis performance of the human body.
203. The terminal maps the second image into a target image, and the target image is marked with different parts of the human body.
In the embodiment of the application, the terminal can perform convolution processing on the second image based on the plurality of convolution layers, and map the second image into the target image, so as to analyze the human body of the human body in the first image.
A scheme of image processing is provided in which the first image is mapped into three images of different scales and then fused. Compared with fusing intermediate images of every scale in a complex neural network, this reduces structural complexity, computation and inference time, so that deployment can be carried out on a mobile terminal.
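As a concrete illustration of this three-step flow, the following is a minimal PyTorch-style sketch of the inference pipeline; the module names, channel counts and strides are illustrative placeholders and do not reproduce the exact layers described later with reference to fig. 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HumanParsingSkeleton(nn.Module):
    """Illustrative skeleton of steps 201-203: encode the input to three scales,
    fuse them coarse-to-fine, then map the fused features to a per-pixel part map.
    Channel counts and strides are placeholders, not the layers of fig. 8."""
    def __init__(self, in_ch=3, num_classes=15):
        super().__init__()
        self.stage1 = nn.Conv2d(in_ch, 16, 3, stride=4, padding=1)  # first intermediate image
        self.stage2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)     # second intermediate image
        self.stage3 = nn.Conv2d(32, 64, 3, stride=4, padding=1)     # third intermediate image
        self.fuse23 = nn.Conv2d(64 + 32, 32, 1)
        self.fuse12 = nn.Conv2d(32 + 16, 32, 1)
        self.head = nn.Conv2d(32, num_classes, 3, padding=1)

    def forward(self, x):
        f1 = self.stage1(x)                                        # step 201: three scales
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        up3 = F.interpolate(f3, size=f2.shape[2:], mode='bilinear', align_corners=False)
        fused = self.fuse23(torch.cat([up3, f2], dim=1))           # step 202: fuse with f2
        up23 = F.interpolate(fused, size=f1.shape[2:], mode='bilinear', align_corners=False)
        second_image = self.fuse12(torch.cat([up23, f1], dim=1))   # fused "second image"
        logits = self.head(second_image)                           # step 203: target image
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear', align_corners=False)

# a 3-channel 256x256 first image (channel count assumed for this sketch)
out = HumanParsingSkeleton()(torch.randn(1, 3, 256, 256))          # -> (1, 15, 256, 256)
```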
Fig. 2 schematically illustrates a main flow of an image processing scheme provided in an embodiment of the present application, and the image processing scheme is further described below based on an application scenario. In this application scenario, the image processing scheme is implemented based on a human body resolution model, and the following embodiments will describe a process of training the human body resolution model and using the human body resolution model. Fig. 3 is a flowchart of another image processing method provided according to an embodiment of the present application, and referring to fig. 3, in the embodiment of the present application, an example of execution by a terminal, which is a mobile terminal, will be described. The image processing method includes the steps of:
301. The terminal preprocesses a first labeling image of the sample image to obtain a coded image of the sample image, wherein the first labeling image is used for indicating different parts of a sample human body in the sample image, and the coded image is used for indicating a prediction result of a previous frame image of the sample image.
In the embodiment of the present application, the sample image includes a sample human body, and the sample image may be an image in a public sample data set, or may be an image uploaded by a user acquired after the user is fully authorized. In the first labeling image of the sample image, different parts of the sample human body, such as hair, face, trunk, arms and the like, are labeled with different colors. The terminal can preprocess the first labeling image, so that the encoded image obtained by preprocessing can indicate the prediction result of the previous frame image of the sample image. It should be noted that, the encoded image can simulate the prediction result of the previous frame image, and is not the prediction result obtained by analyzing the human body of the previous frame image.
In some embodiments, the terminal preprocesses the first annotation image by means of image transformation and encoding. In this case, the process in which the terminal preprocesses the first annotation image of the sample image to obtain the encoded image of the sample image includes: the terminal performs image transformation on the first annotation image of the sample image to obtain a second annotation image, and then encodes the second annotation image to obtain the encoded image. Transforming the first annotation image into the second annotation image simulates the change caused by human body motion; encoding the second annotation image then allows the encoded image to serve as the prediction result of the previous frame of the sample image for the model to learn from, so that the model learns the correlation between prediction results of adjacent image frames and the stability of the model output is improved.
In some embodiments, the terminal can perform a rigid transformation on the first annotation image to obtain the second annotation image, wherein the rigid transformation includes translation, rotation, scaling, and the like. Alternatively, the terminal can perform a non-rigid transformation on the first annotation image to obtain the second annotation image, the non-rigid transformation including grid distortion, optical distortion, elastic transformation, and the like. Alternatively, the terminal can perform rigid transformation and non-rigid transformation on the first label image to obtain the second label image, wherein the order of the rigid transformation and the non-rigid transformation is not limited. By performing at least one of rigid transformation and non-rigid transformation to the first labeling image with different degrees, image change caused by human body movement can be simulated, so that the model learns the relevance between prediction results of adjacent image frames, and the stability of model output is improved.
For example, fig. 4 is a schematic diagram of an image transformation provided according to an embodiment of the present application. Referring to fig. 4, the background of the first labeling image is black, and the hair, face, torso skin (neck portion), coat, arm, and hand of the human body are labeled with different colors other than black. After the first labeling image is subjected to rigid transformation and non-rigid transformation, a second labeling image is obtained; the colors labeling each part in the second labeling image are the same as in the first labeling image, while the size and shape of the parts differ.
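As an illustration of how such a transformed labeling image might be produced, the following sketch applies a random rigid transform followed by a grid-distortion-style non-rigid warp to a label map using OpenCV; the function name, parameter ranges and distortion magnitudes are assumptions made for illustration rather than values given in this application.

```python
import cv2
import numpy as np

def simulate_previous_frame_mask(label_map, max_angle=10.0, max_shift=0.05, grid_sigma=4.0):
    """Apply a small random rigid transform (rotation + scale + translation) followed by a
    smooth random distortion to a label map, as a stand-in for the previous frame's result.
    Nearest-neighbour interpolation keeps the integer pixel classes intact.
    Parameter ranges and the displacement magnitude are illustrative assumptions."""
    h, w = label_map.shape[:2]

    # rigid part: rotation about the centre with a mild random scale and shift
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(0.95, 1.05)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    m[:, 2] += np.random.uniform(-max_shift, max_shift, size=2) * (w, h)
    rigid = cv2.warpAffine(label_map, m, (w, h), flags=cv2.INTER_NEAREST,
                           borderMode=cv2.BORDER_CONSTANT, borderValue=0)

    # non-rigid part: a smooth random displacement field applied with remap
    dx = cv2.GaussianBlur(np.random.uniform(-1, 1, (h, w)).astype(np.float32), (0, 0), grid_sigma) * 8
    dy = cv2.GaussianBlur(np.random.uniform(-1, 1, (h, w)).astype(np.float32), (0, 0), grid_sigma) * 8
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    return cv2.remap(rigid, xs + dx, ys + dy, interpolation=cv2.INTER_NEAREST,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=0)

prev_mask = simulate_previous_frame_mask(np.random.randint(0, 15, (256, 256), dtype=np.uint8))
```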
In some embodiments, the terminal can map the pixels in the second labeling image to the target vector space according to the pixel class to which the pixels belong, so as to obtain the encoded image. Wherein the target vector space is a one-dimensional space or a multi-dimensional space. By encoding the second labeling image, the encoding image can simulate the prediction result of the previous frame image, further simulate the image change caused by human body motion, further enable the model to learn the relevance between the prediction results of the adjacent image frames, and improve the stability of model output.
For example, the pixels in the labeling image have 15 pixel categories, whose values are represented by the 15 integer values 0-14: {0: background; 1: cap; 2: hair; 3: face; 4: sunglasses; 5: torso skin; 6: coat; 7: one-piece dress; 8: trousers; 9: short skirt; 10: arm; 11: hand; 12: leg; 13: foot; 14: sock}. The terminal encodes each pixel according to the pixel category to which it belongs.
In some embodiments, the terminal adopts a normalization strategy, and maps the pixels in the second labeling image to between 0 and 1 according to the pixel category to which the pixels belong based on the formula (1). Wherein, formula (1) is as follows:
E_i = y_i / (C - 1)    (1)

wherein E_i represents the encoded value of the i-th pixel, y_i represents the value of the pixel class to which the i-th pixel belongs, and C represents the total number of pixel classes.
In some embodiments, the terminal maps the pixels in the second labeling image to high-dimensional vectors according to the belonging pixel class based on the formula (2), and obtains a multi-channel coding result. Wherein, formula (2) is as follows:
f(y_i) = [sin(2^0 * pi * y_i), cos(2^0 * pi * y_i), ..., sin(2^(L-1) * pi * y_i), cos(2^(L-1) * pi * y_i)]    (2)

wherein y_i represents the value of the pixel class to which the i-th pixel belongs, f(y_i) represents the high-dimensional vector obtained by encoding the i-th pixel, and L represents half the number of dimensions of f(y_i).
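The following sketch shows both encodings, formula (1) and formula (2), applied to a label map; the function names and the choice L = 4 are illustrative assumptions.

```python
import numpy as np

NUM_CLASSES = 15  # pixel classes 0-14 listed above

def encode_normalized(label_map):
    """Formula (1): map each pixel's class index into [0, 1]."""
    return label_map.astype(np.float32) / (NUM_CLASSES - 1)

def encode_positional(label_map, L=4):
    """Formula (2): map each pixel's class index to a 2L-dimensional vector of
    interleaved sin/cos terms. L = 4 is an illustrative choice, not a value
    given in this application."""
    y = label_map.astype(np.float32)[..., None]            # (H, W, 1)
    freqs = (2.0 ** np.arange(L)) * np.pi                  # 2^0*pi ... 2^(L-1)*pi
    return np.stack([np.sin(y * freqs), np.cos(y * freqs)],
                    axis=-1).reshape(*label_map.shape, 2 * L)

mask = np.random.randint(0, NUM_CLASSES, (256, 256))
print(encode_normalized(mask).shape, encode_positional(mask).shape)  # (256, 256) (256, 256, 8)
```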
302. And the terminal splices the sample image and the coded image to obtain an input image.
In the embodiment of the application, the terminal can splice the sample image and the coded image in the channel dimension to obtain the input image. By splicing the sample image and the coding image, the model can learn the relevance between the prediction results of the adjacent image frames, and the stability of model output is improved. Referring to fig. 5, fig. 5 is a schematic diagram of image stitching according to an embodiment of the present application.
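A minimal sketch of this channel-dimension stitching, assuming a 3-channel sample image and a 1-channel encoded image (the normalized encoding of formula (1)), which yields the 4-channel input size used in fig. 8:

```python
import torch

# sample image: 3-channel RGB frame; encoded image: 1-channel encoding of the
# transformed annotation (shapes are illustrative)
sample = torch.rand(1, 3, 256, 256)
encoded = torch.rand(1, 1, 256, 256)

# stitching in the channel dimension yields the 4-channel input (256 x 256 x 4 in fig. 8)
model_input = torch.cat([sample, encoded], dim=1)
print(model_input.shape)  # torch.Size([1, 4, 256, 256])
```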
303. The terminal takes the first labeling image as supervision information, and trains a human body analysis model of the ith iteration based on the input image, wherein i is a positive integer.
In the embodiment of the application, the terminal can perform multiple rounds of iterative training by adopting a supervised learning mode and taking the first labeling image as the supervised learning mode to obtain the human body analysis model. Taking the process of the ith iteration as an example for explanation, when the ith iteration is the first iteration, the human body analytic model of the ith iteration is an initial model; when the ith iteration is a non-first iteration, the human body analytical model of the ith iteration is the human body analytical model with the model parameters adjusted after the i-1 th iteration is completed.
In some embodiments, the step of training the body analysis model of the ith iteration based on the input image by using the first labeling image as the supervision information includes steps 3031 to 3033.
3031. And the terminal performs human body analysis on the input image based on the human body analysis model of the ith iteration to obtain a predicted image.
The terminal inputs an input image into a human body analysis model of the ith iteration, then carries out human body analysis on the input image based on the human body analysis model, and outputs a predicted image, wherein the predicted image is used for indicating different parts of a predicted sample human body.
3032. The terminal determines a first loss, a second loss, and a third loss based on the first annotation image and the prediction image.
In an embodiment of the present application, the first loss is a cross entropy loss, which is used to indicate a difference between the first labeling image and the predicted image. The first loss is calculated as shown in the following equation (3):
L_ce = -sum over c = 1..C of ( y_c * log(p_c) )    (3)

wherein L_ce represents the first loss, C represents the total number of pixel classes, y_c represents the value for pixel class c, and p_c represents the probability that the predicted pixel belongs to y_c.
In an embodiment of the present application, the second loss is a weighted cross entropy loss, which is used to indicate a difference between the first labeling image and the predicted image after pixel weighting. The process in which the terminal determines the second loss based on the first labeling image and the predicted image includes: the terminal determines a class weight for each pixel class based on the number of pixels corresponding to that pixel class in the predicted image, the class weight being inversely related to the number of pixels; the terminal then determines a weighted cross entropy loss based on the class weights of the respective pixel classes and takes the weighted cross entropy loss as the second loss. The second loss is calculated by formula (4), which is obtained by modifying formula (3): formula (4) is weighted inversely to the proportion of the number of pixels, that is, the class weight of a pixel class with a larger number of pixels is lower, so that the model pays attention to pixel classes occupying small areas. The second loss is calculated as shown in the following equations (4) and (5):
L_w = -sum over c = 1..C of ( w_c * y_c * log(p_c) )    (4)

w_c = (N - N_c) / N    (5)

wherein L_w represents the second loss, C represents the total number of pixel classes, w_c represents the class weight, y_c represents the value for pixel class c, p_c represents the probability that the predicted pixel belongs to y_c, N represents the number of all pixels in the image, and N_c represents the number of pixels corresponding to pixel class c.
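A sketch of the weighted cross entropy of formulas (4) and (5), with class weights computed from the pixel counts of the predicted classes; the function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, target, num_classes=15):
    """Second loss: cross entropy with per-class weights w_c = (N - N_c) / N (formula (5)),
    where N_c is the number of pixels predicted as class c, so rarer classes weigh more.
    logits: (B, C, H, W) raw scores; target: (B, H, W) integer class labels."""
    with torch.no_grad():
        pred = logits.argmax(dim=1)                                   # predicted class per pixel
        counts = torch.bincount(pred.flatten(), minlength=num_classes).float()
        weights = (counts.sum() - counts) / counts.sum()              # w_c = (N - N_c) / N
    return F.cross_entropy(logits, target, weight=weights)

loss = weighted_cross_entropy(torch.randn(2, 15, 64, 64), torch.randint(0, 15, (2, 64, 64)))
```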
In this embodiment of the present application, the third loss is a mutual information loss, which is used to indicate a difference between the first labeling image and the predicted image after dependency information is added to the pixels, the dependency information indicating information contained in the pixels surrounding a pixel. The process in which the terminal determines the third loss based on the first labeling image and the predicted image includes: the terminal determines a labeling probability distribution based on the first labeling image and a prediction probability distribution based on the predicted image; the terminal then determines a labeling probability density function, a prediction probability density function and a joint probability density function based on the labeling probability distribution and the prediction probability distribution; finally, the terminal determines a cross entropy loss based on the labeling probability density function, the prediction probability density function and the joint probability density function, and takes this loss as the third loss. The third loss is calculated based on a mutual information loss function. Unlike the pixel-by-pixel calculation of the cross entropy loss function, the mutual information loss function derives from a regional mutual information strategy, whose basis is the following: if the pixel class of a pixel is, for example, coat, the pixel classes of the pixels surrounding that pixel are also likely to be coat. Based on the regional mutual information strategy, the terminal can represent a pixel by the pixel together with its surrounding pixels and encode it, so that an image is represented as a distribution of high-dimensional points; by calculating the distance between the labeling probability distribution and the prediction probability distribution, the prediction result output by the model attains better high-order consistency. The third loss is calculated as shown in the following equation (6):
L_mu = -integral of f(y, p) * log( f(y, p) / ( f(y) * f(p) ) ) dy dp    (6)

wherein L_mu represents the third loss, Y represents the labeling probability distribution, y represents a labeling value, P represents the prediction probability distribution, p represents a prediction value, f(y, p) represents the joint probability density function, f(y) represents the labeling probability density function, and f(p) represents the prediction probability density function.
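The regional mutual information computation described above encodes each pixel together with its neighbourhood; the sketch below is a deliberately simplified stand-in that estimates mutual information between the annotation classes and the predicted class probabilities from a soft joint histogram, and should not be read as the exact regional formulation of this application.

```python
import torch
import torch.nn.functional as F

def mutual_information_loss(logits, target, num_classes=15, eps=1e-8):
    """Simplified stand-in for the third loss: estimate I(Y; P) from a soft joint
    histogram of annotation classes and predicted class probabilities, and minimise
    its negative. The neighbourhood encoding of the regional strategy is omitted."""
    probs = F.softmax(logits, dim=1).permute(0, 2, 3, 1).reshape(-1, num_classes)  # (N, C)
    onehot = F.one_hot(target.reshape(-1), num_classes).float()                    # (N, C)
    joint = onehot.t() @ probs / probs.shape[0]         # f(y, p): (C, C), sums to 1
    f_y = joint.sum(dim=1, keepdim=True)                # f(y): marginal of the annotation
    f_p = joint.sum(dim=0, keepdim=True)                # f(p): marginal of the prediction
    mi = (joint * torch.log((joint + eps) / (f_y @ f_p + eps))).sum()
    return -mi                                          # training maximises mutual information

loss = mutual_information_loss(torch.randn(2, 15, 64, 64), torch.randint(0, 15, (2, 64, 64)))
```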
3033. The terminal adjusts model parameters of the human body analytical model of the ith iteration based on the first loss, the second loss and the third loss.
In the embodiment of the present application, the terminal may use the sum of the first loss, the second loss, and the third loss as the training loss, and adjust the model parameters of the human body analytical model of the ith iteration based on the training loss. The model parameters are adjusted based on multiple losses during model training, so that the problem of unbalanced pixel types can be solved, and the accuracy of human body analysis is improved.
The training loss is calculated as shown in the following formula (7):
L = L_ce + L_w + L_mu    (7)

wherein L represents the training loss, L_ce represents the first loss, L_w represents the second loss, and L_mu represents the third loss. Formula (7) is the objective function of model training.
304. The terminal maps a first image into three intermediate images based on a human body analysis model, wherein the first image comprises a human body, and the three intermediate images are different in scale and are used for representing image characteristics of the first image.
In this embodiment of the present application, the human body analysis model is the model obtained through the training performed by the terminal. The first image is an input image of the human body analysis model: the terminal inputs the first image into the human body analysis model, and the human body analysis model maps the first image into three intermediate images of different scales based on a plurality of convolution layers. Alternatively, the human body analysis model can be a model trained by a server, with the terminal obtaining the trained model from the server.
In some embodiments, the step of the terminal mapping the first image into three intermediate images comprises: firstly, a terminal convolves a first image to obtain a first intermediate image; and then the terminal convolves the first intermediate image to obtain a second intermediate image. And finally, carrying out channel feature reinforcement and semantic feature reinforcement on the second intermediate image by the terminal to obtain a third intermediate image, wherein the channel feature reinforcement is used for reinforcing the importance of different channel features, and the semantic feature reinforcement is used for reinforcing global semantic information. By carrying out channel feature reinforcement and semantic feature reinforcement, the performance of the model can be improved under the condition that the processing speed of the model is not influenced.
The terminal can strengthen the importance of different channel features based on a compression-activation (Squeeze-and-Excitation, SE) module in a convolution layer. Fig. 6 is a schematic diagram of a compression-activation module according to an embodiment of the present application; referring to fig. 6, after the input X0 is processed by the compression-activation module, X1 is obtained.
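A sketch of a compression-activation (SE) module of the kind shown in fig. 6; the reduction ratio of 4 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Compression-activation (SE) module: globally pool each channel, pass the pooled
    vector through two fully connected layers, and rescale the channels by the resulting
    weights, strengthening the importance of informative channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # X0 -> X1 in fig. 6
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: (B, C) channel descriptor
        return x * w.view(b, c, 1, 1)            # excite: re-weight channel importance

x1 = SqueezeExcite(32)(torch.randn(1, 32, 16, 16))
```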
The terminal is capable of enhancing global semantic information based on a pyramid pooling module (Pyramid Pooling Module, PPM) in the convolutional layer. The pyramid pooling module can be arranged after the output of any network layer. Fig. 7 is a schematic diagram of the pyramid pooling module provided according to an embodiment of the present application. Referring to fig. 7, pooling is first performed on the input feature map to obtain four pooling results of different sizes; a 1×1 convolution is then applied to each pooling result to reduce its feature channels to 1/4 of the original number; bilinear interpolation up-sampling is then performed on each of the resulting feature maps to obtain 4 feature maps of the same size as the original feature map; finally, the original feature map and the 4 feature maps are spliced to obtain the output of the pyramid pooling module.
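A sketch of a pyramid pooling module following the description above; the four bin sizes (1, 2, 3, 6) are a common PPM configuration and are assumed here, as the exact pooling sizes are not specified in this passage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pyramid pooling module: pool the input to four sizes, reduce each result to 1/4 of
    the input channels with a 1x1 convolution, upsample back with bilinear interpolation,
    and concatenate with the original feature map."""
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        self.convs = nn.ModuleList([nn.Conv2d(channels, channels // 4, 1) for _ in bins])

    def forward(self, x):
        outs = [x]
        for bin_size, conv in zip(self.bins, self.convs):
            pooled = F.adaptive_avg_pool2d(x, bin_size)
            outs.append(F.interpolate(conv(pooled), size=x.shape[2:],
                                      mode='bilinear', align_corners=False))
        return torch.cat(outs, dim=1)            # channels: C + 4 * (C // 4) = 2C

y = PyramidPooling(128)(torch.randn(1, 128, 8, 8))   # -> (1, 256, 8, 8)
```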
For example, fig. 8 is a schematic diagram of a model structure according to an embodiment of the present application. Referring to fig. 8, the terminal convolves the first image based on the convolution layers A1 and A2 to obtain the first intermediate image. The convolution layer A1 includes a standard convolution (Standard Convolution, Conv2D) module; the input size of the convolution layer A1 is 256×256×4, the number of channels is 8, the convolution step size is 2, and the kernel width of the convolution kernel is 3. The convolution layer A2 includes a depthwise separable convolution (Depthwise Separable Convolution, DSConv) module; the input size of the convolution layer A2 is 128×128×8, the number of channels is 16, the convolution step size is 2, and the kernel width of the convolution kernel is 3. Then, the terminal convolves the first intermediate image based on the convolution layer A3 to obtain the second intermediate image. The convolution layer A3 includes a depthwise separable convolution module; the input size of the convolution layer A3 is 64×64×16, the number of channels is 32, the convolution step size is 2, and the kernel width of the convolution kernel is 3. Next, the terminal sequentially performs channel feature reinforcement and semantic feature reinforcement on the second intermediate image based on the convolution layer B and the convolution layer C to obtain the third intermediate image. The convolution layer B comprises three serially connected convolution modules, the first convolution module being composed of an inverted residual module (inverted residual bottleneck block, bottleneck) and a compression-activation module. The original design of the bottleneck is: Conv2D + DWConv (Depthwise Convolution) + Conv2D; the first convolution module adds SE on the basis of the bottleneck, so the structure of bottleneck + SE is: Conv2D + DWConv + SE + Conv2D. The input size of the first convolution module in convolution layer B is 32×32×32, the number of channels is 32, the convolution step size is 2, and the kernel width of the convolution kernel is 3. The second and third convolution modules in convolution layer B have the same structure as the first convolution module, which is not repeated here. The second convolution module in convolution layer B has an input size of 16×16×32, a number of channels of 64, a convolution step size of 2, and a kernel width of 3. The third convolution module in convolution layer B has an input size of 8×8×64, a number of channels of 128, a convolution step size of 1, and a kernel width of 3. The convolution layer C includes a pyramid pooling module and a standard convolution module; the input size of the convolution layer C is 8×8×128 and the number of channels is 64.
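Putting these pieces together, the following sketch builds an encoder path in the spirit of layers A1-A3, B and C of fig. 8, using the channel counts and strides listed above; it reuses the SqueezeExcite and PyramidPooling classes from the earlier sketches, and the expansion ratio and activation placement inside the bottleneck are assumptions.

```python
import torch
import torch.nn as nn

def dsconv(cin, cout, stride):
    """Depthwise separable convolution: per-channel 3x3 depthwise conv, then 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin),
        nn.Conv2d(cin, cout, 1),
        nn.ReLU(inplace=True),
    )

class BottleneckSE(nn.Module):
    """Inverted residual block with an SE stage: Conv2D + DWConv + SE + Conv2D.
    The expansion ratio of 4 is an assumption."""
    def __init__(self, cin, cout, stride, expand=4):
        super().__init__()
        mid = cin * expand
        self.block = nn.Sequential(
            nn.Conv2d(cin, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid),
            SqueezeExcite(mid),                        # from the sketch after fig. 6
            nn.Conv2d(mid, cout, 1),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    """Encoder producing the three intermediate feature maps of different scales."""
    def __init__(self):
        super().__init__()
        self.a1 = nn.Sequential(nn.Conv2d(4, 8, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.a2 = dsconv(8, 16, stride=2)              # -> first intermediate image, 64x64x16
        self.a3 = dsconv(16, 32, stride=2)             # -> second intermediate image, 32x32x32
        self.b = nn.Sequential(
            BottleneckSE(32, 32, stride=2),            # 16x16x32
            BottleneckSE(32, 64, stride=2),            # 8x8x64
            BottleneckSE(64, 128, stride=1),           # 8x8x128
        )
        self.c = nn.Sequential(PyramidPooling(128), nn.Conv2d(256, 64, 3, padding=1))

    def forward(self, x):
        f1 = self.a2(self.a1(x))
        f2 = self.a3(f1)
        f3 = self.c(self.b(f2))                        # -> third intermediate image, 8x8x64
        return f1, f2, f3

f1, f2, f3 = Encoder()(torch.randn(1, 4, 256, 256))
```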
305. And the terminal fuses the three intermediate images based on the human body analysis model to obtain a second image.
In the embodiment of the application, the terminal can fuse the result of convolution of the third intermediate image and the second intermediate image to obtain a first fused image, then fuse the result of convolution of the first fused image and the first intermediate image to obtain a second fused image, and the second fused image is used as the second image. By fusing the three images with different scales in sequence, compared with a mode of fusing the intermediate images with each scale in a complex neural network, the structure complexity is reduced, and the calculated amount and the reasoning time are reduced.
For example, still referring to fig. 8, the terminal convolves the second intermediate image based on the convolution layer D1, and then fuses the result output by the convolution layer D1 with the third intermediate image based on the fusion layer F1, to obtain a first fused image. The convolution layer D1 includes a deep convolution module, where an input size of the convolution layer D1 is 32×32×64, a channel number is 32, a convolution step length is 1, and a kernel width of a convolution kernel is 1. The fusion layer F1 comprises a feature fusion module (Feature Fusion Module, FFM), the input size of the fusion layer F1 being 32 x 32, the number of channels is 32, the convolution step length is 1, and the kernel width of the convolution kernel is 1. The terminal convolves the first intermediate image based on the convolution layer D2, and then the terminal fuses the result output by the convolution layer D2 and the result output by the fusion layer F1 based on the fusion layer F2 to obtain a second fusion image; or the terminal convolves the result output by the fusion layer F1 based on another convolution layer, and fuses the result output by the convolution layer D2 and the result output by the other convolution layer based on the fusion layer F2 to obtain a second fusion image. The convolution layer D2 includes a standard convolution module, where the input size of the convolution layer D2 is 64×64×32, the number of channels is 32, the convolution step length is 1, and the kernel width of the convolution kernel is 3. The fusion layer F2 includes a feature fusion module, where the input size of the fusion layer F2 is 64×64×32, the channel number is 32, the convolution step size is 1, and the kernel width of the convolution kernel is 1.
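The internal structure of the feature fusion module is not detailed in this passage, so the sketch below uses a simple upsample-concatenate-convolve fusion as an illustrative stand-in for the two fusion steps (F1, then F2) described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Stand-in for a feature fusion step: upsample the coarse, semantically strong map to
    the size of the fine, high-resolution one, concatenate along the channel dimension and
    mix with a 1x1 convolution. The real FFM internals are not specified here."""
    def __init__(self, fine_ch, coarse_ch, out_ch):
        super().__init__()
        self.mix = nn.Sequential(nn.Conv2d(fine_ch + coarse_ch, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, fine, coarse):
        coarse = F.interpolate(coarse, size=fine.shape[2:], mode='bilinear', align_corners=False)
        return self.mix(torch.cat([fine, coarse], dim=1))

# fusing three intermediate images with the channel sizes of the Encoder sketch above
f1, f2, f3 = torch.randn(1, 16, 64, 64), torch.randn(1, 32, 32, 32), torch.randn(1, 64, 8, 8)
fused_23 = FuseBlock(32, 64, 32)(f2, f3)              # first fused image
second_image = FuseBlock(16, 32, 32)(f1, fused_23)    # second fused image (the "second image")
```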
306. The terminal maps the second image into a target image based on a human body analysis model, and the target image is marked with different parts of the human body.
In the embodiment of the application, the terminal can perform convolution processing on the second image based on the convolution layer, so that the resolution of the second image is increased to reduce edge jitter.
In some embodiments, the terminal can adjust the resolution of the second image to obtain a third image, the third image having a resolution not higher than the resolution of the first image, and then convolve the third image to obtain the target image.
For example, still referring to fig. 8, the terminal performs convolution processing on the second image based on the convolution layer G1 to increase the resolution of the second image to obtain a third image, the resolution of which is one half of the resolution of the first image, and then performs convolution processing on the third image based on the convolution layer G2 to obtain the target image. The convolution layer G1 includes a depth-separable convolution module, where an input size of the convolution layer G1 is 64×64×32, the number of channels is 16, a convolution step length is 1, and a kernel width of a convolution kernel is 3. The convolution layer G2 includes a depth-separable convolution module, the input size of the convolution layer G2 is 128×128×32, the number of channels is 15, the convolution step size is 1, and the kernel width of the convolution kernel is 3.
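A sketch of an output head corresponding to layers G1 and G2: the fused feature map is upsampled, refined with depthwise separable convolutions and projected to 15 channels (one per pixel class), then upsampled to the input resolution; the exact channel counts are only loosely based on the figures above and are partly assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParsingHead(nn.Module):
    """Output head: raise the resolution of the fused map, refine with depthwise separable
    convolutions, project to num_classes channels and restore the input resolution."""
    def __init__(self, in_ch=32, num_classes=15):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(inplace=True),
        )
        self.classify = nn.Sequential(
            nn.Conv2d(16, 16, 3, padding=1, groups=16),
            nn.Conv2d(16, num_classes, 1),
        )

    def forward(self, fused, out_size):
        x = F.interpolate(fused, scale_factor=2, mode='bilinear', align_corners=False)  # half input res
        x = self.classify(self.refine(x))
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)    # target logits

target_logits = ParsingHead()(torch.randn(1, 32, 64, 64), (256, 256))   # -> (1, 15, 256, 256)
```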
According to this scheme, the first image is mapped into three images of different scales and then fused. Compared with fusing intermediate images of every scale in a complex neural network, the method reduces structural complexity, computation and inference time, and can therefore be deployed on a mobile terminal. In addition, a previous-frame masking strategy is introduced during model training: the labeling image is subjected to image transformation and encoding so as to simulate the prediction result of the previous frame image, which significantly improves the accuracy and stability of the prediction results when the human body analysis model performs human body analysis on video. Furthermore, the model parameters are adjusted based on multiple losses during model training, which alleviates the problem of unbalanced pixel classes and improves the accuracy of human body analysis.
It should be noted that, in order to verify the effect of the human body analysis model obtained by training in the embodiment of the present application, quantitative and qualitative evaluations were also performed on the model. The comparison model was DFANet (Deep Feature Aggregation for Real-Time Semantic Segmentation, a lightweight network), and the comparison indices were time consumption and two quantitative evaluation indices: mean intersection-over-union and pixel accuracy. The test terminal was a low-end mobile phone equipped with a Qualcomm Snapdragon 660, and the test data were 5000 test images. The test results are shown in Table 1.
TABLE 1
| Model | Time consumption | Mean IoU / Pixel accuracy |
| DFANet | 55 ms | 57.2 / 90.5 |
| Scheme of the application | 45 ms | 61.3 / 91.5 |
As shown in Table 1, the time consumption of the present scheme is 10 ms lower than that of DFANet, and both the mean intersection-over-union and the pixel accuracy are improved.
In addition, for human body analysis in video scenes, videos were randomly selected for testing; the results show that after the previous-frame masking strategy is used, the prediction results output by the model are more accurate and more stable. Fig. 9 is a schematic diagram comparing results of video human body analysis according to an embodiment of the present application. Referring to fig. 9, (1) and (3) in fig. 9 are prediction results obtained without the previous-frame masking strategy, and (2) and (4) in fig. 9 are prediction results obtained with the previous-frame masking strategy.
In addition, the scheme provided by the application performs model training with three losses. In order to compare different loss function combinations, a quantitative evaluation of the loss functions was also carried out on the test data, with mean intersection-over-union and pixel accuracy as the evaluation indices. The loss function combinations were: the first loss alone, and the first loss + second loss + third loss. The meaning and calculation of the first loss, the second loss and the third loss are given in step 303 above and are not repeated here. The evaluation results are shown in Table 2.
TABLE 2
As shown in table 2, adding the second loss and the third loss to the first loss can improve the performance of the model.
Fig. 10 is a block diagram of an image processing apparatus provided according to an embodiment of the present application. The apparatus is for performing the steps in the above image processing method, referring to fig. 10, the apparatus includes: a first mapping module 1001, an image fusion module 1002, and a second mapping module 1003.
A first mapping module 1001, configured to map a first image into three intermediate images, where the first image includes a human body, and the three intermediate images have different scales;
the image fusion module 1002 is configured to fuse the three intermediate images to obtain a second image;
and a second mapping module 1003, configured to map the second image into a target image, where the target image is labeled with different parts of the human body.
In some embodiments, the first mapping module 1001 is configured to convolve the first image to obtain a first intermediate image; convolving the first intermediate image to obtain a second intermediate image; and carrying out channel feature enhancement and semantic feature enhancement on the second intermediate image to obtain a third intermediate image, wherein the channel feature enhancement is used for enhancing the importance of different channel features, and the semantic feature enhancement is used for enhancing global semantic information.
In some embodiments, the image fusion module 1002 is configured to fuse the result of convolving the third intermediate image with the second intermediate image to obtain a first fused image; and fusing the result of convolution of the first fused image and the first intermediate image to obtain a second fused image, and taking the second fused image as the second image.
In some embodiments, the second mapping module 1003 is configured to increase the resolution of the second image to obtain a third image, where the resolution of the third image is not higher than the resolution of the first image; and convolve the third image to obtain the target image.
In some embodiments, the steps performed by the image processing apparatus are implemented based on a human body analysis model for performing human body analysis on the input image and outputting images labeled with different parts of the human body.
In some embodiments, fig. 11 is a block diagram of another image processing apparatus provided according to an embodiment of the present application, and referring to fig. 11, the image processing apparatus further includes:
the preprocessing module 1004 is configured to preprocess a first labeling image of a sample image to obtain an encoded image of the sample image, where the first labeling image is used to indicate different parts of a sample human body in the sample image, and the encoded image is used to indicate a prediction result of a previous frame image of the sample image;
A stitching module 1005, configured to stitch the sample image and the encoded image to obtain an input image;
the training module 1006 is configured to train the human body analytical model of the ith iteration based on the input image by using the first labeling image as the supervision information, where i is a positive integer.
In some embodiments, the preprocessing module 1004 is configured to perform image transformation on the first labeling image of the sample image to obtain a second labeling image; and encoding the second marked image to obtain the encoded image.
In some embodiments, the preprocessing module 1004 is configured to perform at least one of rigid transformation and non-rigid transformation on the first labeling image to obtain the second labeling image.
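As one possible illustration of the rigid branch of this transformation, the label map could be rotated and translated with OpenCV; nearest-neighbour interpolation keeps the part labels discrete. The angle and shift ranges below are assumptions, not values from the patent.

```python
import cv2
import numpy as np

def random_rigid_transform(label_map, max_angle=10.0, max_shift=0.05):
    # Randomly rotates and translates the annotation; labels stay integer-valued.
    h, w = label_map.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    tx = np.random.uniform(-max_shift, max_shift) * w
    ty = np.random.uniform(-max_shift, max_shift) * h
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    m[:, 2] += (tx, ty)
    return cv2.warpAffine(label_map, m, (w, h), flags=cv2.INTER_NEAREST)
```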
In some embodiments, the preprocessing module 1004 is configured to map the pixels in the second labeling image to the target vector space according to the pixel class to obtain the encoded image.
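One natural reading of "mapping pixels to a target vector space according to their class" is a one-hot encoding with one channel per part class; the sketch below assumes that reading.

```python
import torch
import torch.nn.functional as F

def encode_labels(label_map, num_classes):
    # Each pixel's class index becomes a one-hot vector: output is num_classes x H x W.
    t = torch.as_tensor(label_map, dtype=torch.long)   # H x W class indices
    onehot = F.one_hot(t, num_classes)                 # H x W x num_classes
    return onehot.permute(2, 0, 1).float()             # num_classes x H x W
```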
In some embodiments, the stitching module 1005 is configured to stitch the sample image and the encoded image in a channel dimension to obtain the input image.
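Concatenation along the channel dimension can be shown in a couple of lines; the image size and number of encoded channels below are illustrative only.

```python
import torch

sample_image = torch.rand(3, 256, 256)     # RGB sample image (illustrative size)
encoded_image = torch.rand(20, 256, 256)   # encoded previous-frame prediction, 20 classes assumed
input_image = torch.cat([sample_image, encoded_image], dim=0)   # (3 + 20) x 256 x 256
# with a leading batch dimension, the concatenation axis would be dim=1 instead
```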
In some embodiments, the training module 1006 is configured to perform human body analysis on the input image based on the human body analysis model of the i-th iteration to obtain a predicted image, where the predicted image indicates the different parts of the sample human body obtained by prediction; determine, based on the first labeling image and the predicted image, a first loss, a second loss and a third loss, where the first loss indicates a difference between the first labeling image and the predicted image, the second loss indicates a difference between the first labeling image and the predicted image after pixel weighting, and the third loss indicates a difference between the first labeling image and the predicted image after dependency information is added to the pixels, the dependency information indicating information contained in the pixels surrounding a pixel; and adjust model parameters of the human body analysis model of the i-th iteration based on the first loss, the second loss and the third loss.
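Putting these pieces together, one training update under these definitions might look like the sketch below. The equal weighting of the three losses is an assumption, and `weighted_ce` and `dependency_loss` are illustrative helpers sketched below after the corresponding descriptions.

```python
import torch.nn.functional as F

def training_step(model, optimizer, input_image, annotation, num_classes):
    # One illustrative iteration: forward pass, sum the three losses, back-propagate.
    # annotation: N x H x W long tensor of part-class indices.
    pred = model(input_image)                                  # N x num_classes x H x W logits
    loss1 = F.cross_entropy(pred, annotation)                  # first loss: plain cross entropy
    loss2 = weighted_ce(pred, annotation, num_classes)         # second loss: pixel-weighted
    loss3 = dependency_loss(pred, annotation, num_classes)     # third loss: dependency-aware
    loss = loss1 + loss2 + loss3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```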
In some embodiments, the training module 1006 is configured to determine a class weight of each pixel class based on the number of pixels corresponding to the pixel class in the predicted image, where the class weight is inversely related to the number of pixels; and determine a weighted cross-entropy loss based on the class weights of the respective pixel classes, the weighted cross-entropy loss being taken as the second loss.
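A hedged sketch of the second loss follows: the class weight is some decreasing function of the predicted pixel count per class; since the exact form is not specified here, a smoothed inverse count is assumed.

```python
import torch
import torch.nn.functional as F

def weighted_ce(pred, annotation, num_classes):
    # Rarer classes in the prediction receive larger weights in the cross entropy.
    pred_labels = pred.argmax(dim=1)                                         # N x H x W
    counts = torch.bincount(pred_labels.flatten(), minlength=num_classes).float()
    weights = 1.0 / (counts + 1.0)                                           # inversely related to pixel count
    weights = weights * num_classes / weights.sum()                          # keep weights on a sensible scale
    return F.cross_entropy(pred, annotation, weight=weights)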
In some embodiments, the training module 1006 is configured to determine a labeling probability distribution based on the first labeling image; determine a prediction probability distribution based on the predicted image; determine a labeling probability density function, a prediction probability density function and a joint probability density function based on the labeling probability distribution and the prediction probability distribution; and determine a cross-entropy loss based on the labeling probability density function, the prediction probability density function and the joint probability density function, the cross-entropy loss being taken as the third loss.
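The third loss is described only at the level of probability distributions, so the sketch below is one plausible instantiation: it forms a joint distribution over (annotated class, predicted class) from co-occurring pixels and measures a cross entropy against the product of the marginals, which couples each pixel's prediction to the overall label statistics. Treat it strictly as an assumption about the intended computation.

```python
import torch
import torch.nn.functional as F

def dependency_loss(pred, annotation, num_classes, eps=1e-8):
    # Distribution-level cross entropy between the joint (label, prediction) distribution
    # and the product of its marginals, pooled over all pixels.
    prob = F.softmax(pred, dim=1)                                            # N x C x H x W
    onehot = F.one_hot(annotation, num_classes).permute(0, 3, 1, 2).float()  # N x C x H x W
    p_label = onehot.mean(dim=(0, 2, 3))                                     # marginal of annotations
    p_pred = prob.mean(dim=(0, 2, 3))                                        # marginal of predictions
    joint = torch.einsum('nchw,nkhw->ck', onehot, prob)                      # co-occurrence counts
    joint = joint / joint.sum()                                              # joint probability table
    indep = p_label[:, None] * p_pred[None, :]                               # independent baseline
    return -(joint * torch.log(indep + eps)).sum()                           # cross entropy H(joint, indep)
```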
With the above apparatus, the first image is mapped into three images of different scales which are then fused. Compared with fusing intermediate images of every scale within a complex neural network, this reduces the structural complexity, the amount of calculation and the inference time, so that the apparatus can be deployed on a mobile terminal.
It should be noted that: in the image processing apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the image processing apparatus and the image processing method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
In the embodiments of the present application, the computer device can be configured as a terminal or a server. When the computer device is configured as a terminal, the technical solution provided in the embodiments of the present application is implemented by the terminal as the execution body; when the computer device is configured as a server, the technical solution is implemented by the server as the execution body; alternatively, the technical solution is implemented through interaction between the terminal and the server, which is not limited in this application.
Fig. 12 is a block diagram of a terminal 1200 according to an embodiment of the present application when the computer device is configured as a terminal. The terminal 1200 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1200 may also be referred to as a user device, a portable terminal, a laptop terminal, a desktop terminal, or the like.
In general, the terminal 1200 includes: a processor 1201 and a memory 1202.
In some embodiments, the terminal 1200 may further optionally include: a peripheral interface 1203, and at least one peripheral. The processor 1201, the memory 1202, and the peripheral interface 1203 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1203 via buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1204, a display 1205, a camera assembly 1206, audio circuitry 1207, a positioning assembly 1208, and a power supply 1209.
The peripheral interface 1203 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, the memory 1202, and the peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202, and the peripheral interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1204 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1204 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. In some embodiments, the radio frequency circuit 1204 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1204 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1204 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1205 is a touch display, the display 1205 also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 1201 as a control signal for processing. In this case, the display 1205 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1205, disposed on the front panel of the terminal 1200; in other embodiments, there may be at least two displays 1205, respectively disposed on different surfaces of the terminal 1200 or in a folded design; in still other embodiments, the display 1205 may be a flexible display disposed on a curved or folded surface of the terminal 1200. The display 1205 may even be arranged in an irregular, non-rectangular pattern, i.e., an irregularly shaped screen. The display 1205 can be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1206 is used to capture images or video. In some embodiments, the camera assembly 1206 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1206 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuitry 1207 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1201 for processing, or inputting the electric signals to the radio frequency circuit 1204 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 1200. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuitry 1207 may also include a headphone jack.
The positioning component 1208 is used to determine the current geographic location of the terminal 1200 to enable navigation or LBS (Location Based Service). The positioning component 1208 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 1209 is used to power the various components in the terminal 1200. The power source 1209 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1209 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1200 also includes one or more sensors 1210. The one or more sensors 1210 include, but are not limited to: acceleration sensor 1211, gyroscope sensor 1212, pressure sensor 1213, fingerprint sensor 1214, optical sensor 1215, and proximity sensor 1216.
The acceleration sensor 1211 may detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1200. For example, the acceleration sensor 1211 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1201 may control the display 1205 to display a user interface in either a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 1211. The acceleration sensor 1211 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1212 may detect a body direction and a rotation angle of the terminal 1200, and the gyro sensor 1212 may collect a 3D motion of the user on the terminal 1200 in cooperation with the acceleration sensor 1211. The processor 1201 may implement the following functions based on the data collected by the gyro sensor 1212: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1213 may be disposed at a side frame of the terminal 1200 and/or at a lower layer of the display 1205. When the pressure sensor 1213 is provided at a side frame of the terminal 1200, a grip signal of the terminal 1200 by a user may be detected, and the processor 1201 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 1213. When the pressure sensor 1213 is disposed at the lower layer of the display 1205, the processor 1201 controls the operability control on the UI interface according to the pressure operation of the user on the display 1205. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1214 is used to collect a fingerprint of the user, and the processor 1201 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 1214, or the fingerprint sensor 1214 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1201 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1214 may be provided on the front, back, or side of the terminal 1200. When a physical key or a vendor logo is provided on the terminal 1200, the fingerprint sensor 1214 may be integrated with the physical key or the vendor logo.
The optical sensor 1215 is used to collect the ambient light intensity. In one embodiment, processor 1201 may control the display brightness of display 1205 based on the intensity of ambient light collected by optical sensor 1215. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 1205 is turned up; when the ambient light intensity is low, the display brightness of the display screen 1205 is turned down. In another embodiment, processor 1201 may also dynamically adjust the shooting parameters of camera assembly 1206 based on the intensity of ambient light collected by optical sensor 1215.
A proximity sensor 1216, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1200. The proximity sensor 1216 is used to collect the distance between the user and the front of the terminal 1200. In one embodiment, when the proximity sensor 1216 detects that the distance between the user and the front face of the terminal 1200 gradually decreases, the processor 1201 controls the display 1205 to switch from the bright screen state to the off screen state; when the proximity sensor 1216 detects that the distance between the user and the front surface of the terminal 1200 gradually increases, the processor 1201 controls the display 1205 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
When the computer device is configured as a server, Fig. 13 is a schematic structural diagram of a server provided according to an embodiment of the present application. The server 1300 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 1301 and one or more memories 1302, where at least one computer program is stored in the memories 1302 and is loaded and executed by the processor 1301 to implement the image processing methods provided in the foregoing method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
The present application also provides a computer-readable storage medium having stored therein at least one computer program, which is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the image processing method of the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In some embodiments, the computer program related to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Embodiments of the present application also provide a computer program product or computer program comprising computer program code stored in a computer readable storage medium. The computer program code is read from a computer readable storage medium by a processor of a computer device, and executed by the processor, causes the computer device to perform the image processing methods provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application; any modifications, equivalent replacements, improvements and the like made within the spirit and scope of the present application shall fall within its protection scope.
Claims (17)
1. An image processing method, the method comprising:
mapping a first image into three intermediate images, the first image comprising a human body, the three intermediate images having different scales and being used for representing image features of the first image;
fusing the three intermediate images to obtain a second image;
and mapping the second image into a target image, wherein the target image is marked with different parts of the human body.
2. The method of claim 1, wherein mapping the first image into three intermediate images comprises:
convolving the first image to obtain a first intermediate image;
convolving the first intermediate image to obtain a second intermediate image;
and carrying out channel feature reinforcement and semantic feature reinforcement on the second intermediate image to obtain a third intermediate image, wherein the channel feature reinforcement is used for reinforcing the importance of different channel features, and the semantic feature reinforcement is used for reinforcing global semantic information.
3. The method of claim 2, wherein fusing the three intermediate images to obtain a second image comprises:
fusing the result of convolution of the third intermediate image and the second intermediate image to obtain a first fused image;
and fusing the result of convolution of the first fused image and the first intermediate image to obtain a second fused image, and taking the second fused image as the second image.
4. The method of claim 1, wherein the mapping the second image to a target image comprises:
increasing the resolution of the second image to obtain a third image, wherein the resolution of the third image is not higher than the resolution of the first image;
and convolving the third image to obtain the target image.
5. The method according to any one of claims 1-4, wherein the image processing method is implemented based on a human body analysis model, and the human body analysis model is used for performing human body analysis on an input image and outputting an image labeled with different parts of the human body.
6. The method of claim 5, wherein the method further comprises:
Preprocessing a first marked image of a sample image to obtain an encoded image of the sample image, wherein the first marked image is used for indicating different parts of a sample human body in the sample image, and the encoded image is used for indicating a prediction result of a previous frame image of the sample image;
splicing the sample image and the coded image to obtain an input image;
and training a human body analysis model of the ith iteration round based on the input image by taking the first labeling image as supervision information, wherein i is a positive integer.
7. The method of claim 6, wherein preprocessing the first annotation image of the sample image to obtain the encoded image of the sample image comprises:
performing image transformation on the first marked image of the sample image to obtain a second marked image;
and encoding the second marked image to obtain the encoded image.
8. The method of claim 7, wherein performing an image transformation on the first annotation image of the sample image to obtain a second annotation image comprises:
and performing at least one of rigid transformation and non-rigid transformation on the first marked image to obtain the second marked image.
9. The method of claim 7, wherein encoding the second annotation image results in the encoded image, comprising:
and mapping the pixels in the second labeling image to a target vector space according to the pixel category to which the pixels belong, so as to obtain the coding image.
10. The method of claim 6, wherein the stitching the sample image with the encoded image results in an input image, comprising:
and splicing the sample image and the coding image in the channel dimension to obtain the input image.
11. The method of claim 6, wherein training the human body analysis model of the i-th iteration based on the input image with the first labeling image as the supervision information comprises:
performing human body analysis on the input image based on the human body analysis model of the ith iteration to obtain a predicted image, wherein the predicted image is used for indicating different parts of the sample human body obtained through prediction;
determining, based on the first labeling image and the predicted image, a first loss, a second loss and a third loss, wherein the first loss indicates a difference between the first labeling image and the predicted image, the second loss indicates a difference between the first labeling image and the predicted image after pixel weighting, and the third loss indicates a difference between the first labeling image and the predicted image after dependency information is added to the pixels, the dependency information indicating information contained in the pixels surrounding a pixel;
Based on the first loss, the second loss, and the third loss, model parameters of the human body analytical model of the ith round of iteration are adjusted.
12. The method of claim 11, wherein determining the second loss based on the first labeling image and the predicted image comprises:
determining class weights of all pixel classes based on the pixel quantity corresponding to the pixel classes in the predicted image, wherein the class weights are inversely related to the pixel quantity;
and determining a weighted cross entropy loss based on the class weight of each pixel class, and taking the weighted cross entropy loss as the second loss.
13. The method of claim 11, wherein determining the third loss based on the first labeling image and the predicted image comprises:
determining a labeling probability distribution based on the first labeling image;
determining a predictive probability distribution based on the predicted image;
determining a labeling probability density function, a prediction probability density function and a joint probability density function based on the labeling probability distribution and the prediction probability distribution;
and determining a cross entropy loss based on the labeling probability density function, the prediction probability density function and the joint probability density function, and taking the cross entropy loss as the third loss.
14. An image processing apparatus, characterized in that the apparatus comprises:
the first mapping module is used for mapping a first image into three intermediate images, wherein the first image comprises a human body, and the three intermediate images are different in scale and are used for representing image characteristics of the first image;
the image fusion module is used for fusing the three intermediate images to obtain a second image;
and the second mapping module is used for mapping the second image into a target image, and the target image is marked with different parts of the human body.
15. A computer device, characterized in that it comprises a processor and a memory for storing at least one computer program, which is loaded by the processor and which performs the image processing method according to any of claims 1 to 13.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium is adapted to store at least one computer program for executing the image processing method according to any one of claims 1 to 13.
17. A computer program product comprising a computer program which, when executed by a processor, implements the image processing method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111327454.6A CN116109531A (en) | 2021-11-10 | 2021-11-10 | Image processing method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111327454.6A CN116109531A (en) | 2021-11-10 | 2021-11-10 | Image processing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116109531A true CN116109531A (en) | 2023-05-12 |
Family
ID=86253154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111327454.6A Pending CN116109531A (en) | 2021-11-10 | 2021-11-10 | Image processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116109531A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740866A (en) * | 2023-08-11 | 2023-09-12 | 上海银行股份有限公司 | Banknote loading and clearing system and method for self-service machine |
CN116740866B (en) * | 2023-08-11 | 2023-10-27 | 上海银行股份有限公司 | Banknote loading and clearing system and method for self-service machine |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020224479A1 (en) | Method and apparatus for acquiring positions of target, and computer device and storage medium | |
CN111091166B (en) | Image processing model training method, image processing device, and storage medium | |
CN111489378B (en) | Video frame feature extraction method and device, computer equipment and storage medium | |
CN111325726A (en) | Model training method, image processing method, device, equipment and storage medium | |
CN110555839A (en) | Defect detection and identification method and device, computer equipment and storage medium | |
CN112598686B (en) | Image segmentation method and device, computer equipment and storage medium | |
CN114332530A (en) | Image classification method and device, computer equipment and storage medium | |
CN112749613B (en) | Video data processing method, device, computer equipment and storage medium | |
CN111932463B (en) | Image processing method, device, equipment and storage medium | |
CN110162604B (en) | Statement generation method, device, equipment and storage medium | |
CN111368116B (en) | Image classification method and device, computer equipment and storage medium | |
CN111860485A (en) | Training method of image recognition model, and image recognition method, device and equipment | |
CN113610750A (en) | Object identification method and device, computer equipment and storage medium | |
CN114820633A (en) | Semantic segmentation method, training device and training equipment of semantic segmentation model | |
CN114283050A (en) | Image processing method, device, equipment and storage medium | |
CN112115900B (en) | Image processing method, device, equipment and storage medium | |
CN113705302A (en) | Training method and device for image generation model, computer equipment and storage medium | |
CN114677350B (en) | Connection point extraction method, device, computer equipment and storage medium | |
CN113516665A (en) | Training method of image segmentation model, image segmentation method, device and equipment | |
CN114359225A (en) | Image detection method, image detection device, computer equipment and storage medium | |
CN113763931B (en) | Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium | |
CN114511864A (en) | Text information extraction method, target model acquisition method, device and equipment | |
CN114328815A (en) | Text mapping model processing method and device, computer equipment and storage medium | |
CN113642359B (en) | Face image generation method and device, electronic equipment and storage medium | |
CN116109531A (en) | Image processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code |
Ref country code: HK; Ref legal event code: DE; Ref document number: 40086104; Country of ref document: HK |