CN114202606A - Image processing method, electronic device, storage medium, and computer program product

Info

Publication number
CN114202606A
Authority
CN
China
Prior art keywords: posture, target, image, pose, images
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111223313.XA
Other languages
Chinese (zh)
Inventor
林祖增
黄哲威
续明凯
胡晨
周舒畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Application filed by Beijing Kuangshi Technology Co Ltd and Beijing Megvii Technology Co Ltd
Priority to CN202111223313.XA
Publication of CN114202606A
Legal status: Pending

Classifications

    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253 - Fusion techniques of extracted features
    • G06T 3/04
    • G06T 7/11 - Region-based segmentation
    • G06T 7/269 - Analysis of motion using gradient-based methods
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20221 - Image fusion; Image merging
    • G06T 2207/30196 - Human being; Person

Abstract

The present disclosure relates to an image processing method, an electronic device, a storage medium, and a computer program product. The image processing method provided by the disclosure includes: acquiring a plurality of different known pose images of a target object and an image including a target pose; respectively extracting the known 3D pose features of the known poses in the plurality of different known pose images and the target 3D pose feature of the target pose; and fusing the known 3D pose features corresponding to the plurality of different known pose images with the target 3D pose feature to obtain a target pose image of the target object. Because the target pose image is obtained from the known 3D pose features corresponding to the known pose images and the target 3D pose feature corresponding to the target pose, the amount of computation in generating the target pose image is reduced, the method is easy to deploy on mobile devices, and computational efficiency is improved.

Description

Image processing method, electronic device, storage medium, and computer program product
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image processing method, an electronic device, a storage medium, and a computer program product.
Background
In fields such as film and television production, animation and game production, virtual reality, and interactive digital display boards, characters need to change into different poses as the scene changes in order to create character effects and enhance the sense of immersion. For example, in an animated game where two sides fight in a scene, either side needs to change its pose. The traditional way of generating such poses is for the creator to hand-draw pose images of the animated character and then play the hand-drawn images frame by frame in sequence, which consumes a great deal of the creator's effort and demands considerable drawing skill.
With the development of artificial intelligence, a target object can be migrated to a target pose according to known pose images of the target object and the target pose, yielding a target pose image of the target object. Technologies such as image processing and machine learning make the generation of target pose images easier to operate, which has attracted amateur users, and the demand for generating target pose images on mobile devices has grown rapidly. In the related art, however, the methods for generating target pose images involve a large amount of computation and occupy substantial computing resources and storage space, while mobile devices are generally equipped only with image processors of low computing power and limited storage. It is therefore difficult to directly deploy and run the related-art methods for obtaining target pose images on such low-resource mobile devices, and those methods are not suitable for mobile devices.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an image processing method, an electronic device, a storage medium, and a computer program product.
According to a first aspect of embodiments of the present disclosure, there is provided an image processing method, including:
acquiring a plurality of images with different known postures of a target object and an image comprising a target posture, wherein the target posture is a posture to be transferred to the target object; respectively extracting known 3D posture features of known postures in the multiple different known posture images and target 3D posture features of the target postures; and fusing the known 3D posture features corresponding to the plurality of different known posture images with the target 3D posture features to obtain a target posture image of the target object.
In one embodiment, the extracting the known 3D pose features of the known poses in the plurality of different known pose images and the target 3D pose feature of the target pose respectively includes:
respectively inputting the plurality of different known posture images and the image comprising the target posture into a 3D posture feature extraction network to obtain known 3D posture features corresponding to the known postures and the target 3D posture features of the target posture; the 3D posture feature extraction network is obtained based on human body 3D model training with standard postures.
In one embodiment, the 3D pose feature extraction network is trained as follows:
acquiring a human body 3D model with a standard posture, and determining 3D coordinates of all surface elements of the whole body of the human body 3D model in a world coordinate system; adjusting the postures of the human body 3D model to obtain a plurality of sample images with different postures; for each sample image in the multiple sample images, projecting the 3D coordinates of each surface element in the sample image into a pixel coordinate system to obtain pixel coordinates corresponding to the 3D coordinates of the sample image, and creating a corresponding relation between the sample image and 3D posture features of the sample image, wherein the 3D posture features of the sample image are the 3D coordinates and the pixel coordinates which have a corresponding relation with the sample image; training to obtain a 3D posture feature extraction network based on the sample image and the 3D posture feature with the corresponding relation, wherein the input of the 3D posture feature extraction network is the sample image, and the output is the 3D posture feature with the corresponding relation.
In one embodiment, training to obtain a 3D pose feature extraction network based on a sample image and a 3D pose feature having a corresponding relationship includes:
initializing an image segmentation network; taking the sample image as the input of the image segmentation network and the 3D pose features as its output, training the image segmentation network to obtain an image segmentation network that extracts 3D pose features from the sample image; and using the trained image segmentation network as the 3D pose feature extraction network.
In one embodiment, training to obtain a 3D pose feature extraction network based on a sample image and a 3D pose feature having a corresponding relationship includes:
training a human body feature extraction network for identifying human body features; taking the human body feature extraction network as an encoder of an image segmentation network to obtain a network model to be trained; and taking the sample image as the input of the network model to be trained, taking the 3D posture characteristic as the output of the network model to be trained, training the network model to be trained, and obtaining a 3D posture characteristic extraction network for extracting the 3D posture characteristic based on the sample image.
In one embodiment, the fusing the known 3D pose features corresponding to the multiple different known pose images with the target 3D pose feature to obtain a target pose image of the target object includes:
estimating a reverse optical flow field between the target 3D attitude feature and each known 3D attitude feature respectively; aiming at each known attitude image of the target object, respectively transforming based on the reverse optical flow field corresponding to the known attitude image to obtain a plurality of initial target attitude images; and fusing the plurality of initial target attitude images to obtain a target attitude image of the target object.
In one embodiment, fusing the multiple initial target pose images to obtain a target pose image of the target object includes:
estimating the credibility between the target 3D posture characteristic and each known 3D posture characteristic, and determining the credibility of each known 3D posture characteristic as the credibility of a plurality of initial target posture images; and fusing the initial target attitude images based on the credibility of the initial target attitude images to obtain a target attitude image of the target object.
In one embodiment, the fusing the multiple initial target pose images to obtain the target pose image of the target object based on the credibility of the multiple initial target pose images includes:
and respectively carrying out soft maximum processing on the credibility of the initial target posture images, and fusing the initial target posture images based on the credibility after the soft maximum processing to obtain the target posture image of the target object.
According to a second aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the image processing method as described in any one of the embodiments of the first aspect.
According to a third aspect of the embodiments of the present disclosure, there is provided a storage medium having instructions stored therein, which when executed by a processor of a mobile device, enable the mobile device to perform the image processing method described in any one of the embodiments of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, enables the processor to perform the image processing method of any one of the first aspects.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects: the target pose image is obtained from the known 3D pose features corresponding to the known pose images and the target 3D pose feature corresponding to the target pose, which reduces the amount of computation in generating the target pose image, makes the method easy to apply on mobile devices, and improves computational efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating an image processing method according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating extraction of 3D pose features according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating training a 3D pose feature extraction network according to an example embodiment.
FIG. 4 is a flowchart illustrating training of an image segmentation network according to an example embodiment.
FIG. 5 is a training flow diagram illustrating a network model to be trained in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating determination of a target pose image according to an exemplary embodiment.
FIG. 7 is a flowchart illustrating an example of multi-initial target pose image fusion, according to an example embodiment.
FIG. 8 is a schematic diagram illustrating an optical flow pose fusion network, according to an example embodiment.
FIG. 9 is a flow diagram illustrating a further example of multi-initial target pose image fusion according to an exemplary embodiment.
FIG. 10 is a schematic diagram of an image processing model shown in accordance with an exemplary embodiment.
Fig. 11 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment.
FIG. 12 is an electronic device shown in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has been actively developed. Artificial Intelligence (AI) is an emerging scientific technology for studying and developing theories, methods, techniques and application systems for simulating and extending human Intelligence. The artificial intelligence subject is a comprehensive subject and relates to various technical categories such as chips, big data, cloud computing, internet of things, distributed storage, deep learning, machine learning and neural networks. Computer vision is used as an important branch of artificial intelligence, particularly a machine is used for identifying the world, and the computer vision technology generally comprises the technologies of face identification, living body detection, fingerprint identification and anti-counterfeiting verification, biological feature identification, face detection, pedestrian detection, target detection, pedestrian identification, image processing, image identification, image semantic understanding, image retrieval, character identification, video processing, video content identification, behavior identification, three-dimensional reconstruction, virtual reality, augmented reality, synchronous positioning and map construction (SLAM), computational photography, robot navigation and positioning and the like. With the research and progress of artificial intelligence technology, the technology is applied to various fields, such as security, city management, traffic management, building management, park management, face passage, face attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone images, cloud services, smart homes, wearable equipment, unmanned driving, automatic driving, smart medical treatment, face payment, face unlocking, fingerprint unlocking, testimony verification, smart screens, smart televisions, cameras, mobile internet, live webcasts, beauty treatment, medical beauty treatment, intelligent temperature measurement and the like.
The embodiments of the present disclosure are applied to the field of image processing, and in particular relate to an image processing method for determining a target pose image from known pose images of a target object. In the related art, methods for migrating a target pose to a target object to generate a target pose image fall into three categories. In the first category, the relationship between the human body surface texture and the spatial pose of the human body is decoupled by a Generative Adversarial Network (GAN) to generate the target pose image; however, such methods not only suffer from the difficulty of building a GAN, but also require massive training data to learn the GAN parameters, so the efficiency of generating the target pose image is low. In the second category, an optical flow field from the human pose in the input image to the human pose in the output image is modeled to obtain the pixel-movement relationship between the pixels of the human pose in the input image and those in the output image, and the target pose image is then obtained according to this relationship. However, such methods are limited by the completeness of the human surface texture provided by a single input image and have limited applicability to large-scale poses such as turning around. In the third category, the target pose image is obtained by a neural radiance field or implicit-function method. Such methods need to perform gradient-descent computation separately for each target object to fit a function representing the three-dimensional space, and the target pose image is obtained from that function. Although this improves the accuracy of the output image under large-scale pose migration, it is computationally expensive and lacks generality, because the fitting must be performed by a gradient-descent algorithm for each target object. Therefore, all three methods for obtaining the target pose image suffer from the technical problems of heavy computation and unsuitability for mobile devices.
To solve this technical problem, the embodiments of the present disclosure provide an image processing method for obtaining a target pose image. The method acquires a plurality of known pose images of a target object and an image including a target pose, extracts from each known pose image the known 3D pose features corresponding to that image, and extracts from the image including the target pose the target 3D pose feature corresponding to the target pose. The known 3D pose features are then fused with the target 3D pose feature to obtain the target pose image of the target object. In other words, the target pose image is obtained from the known 3D pose features of the known pose images and the target 3D pose feature of the target pose; compared with the related art, in which the target pose image is obtained directly from the known pose images and the image including the target pose, determining the target pose image of the target object from the 3D pose features of the images reduces both the complexity of the computation and the amount of computation.
It should be noted that the execution subject for obtaining the target posture image by using the image processing method of the embodiment of the present application may be a hardware device having a data information processing capability and/or necessary software for driving the hardware device to operate. Suitable for, but not limited to, mobile devices. The mobile device includes, but is not limited to, a mobile phone, a computer, a smart home appliance, a vehicle-mounted terminal, and the like. Other alternative execution entities may include workstations, servers, computers, and other devices.
The following embodiments of the present disclosure will explain a target posture image obtained by an image processing method with reference to the drawings.
FIG. 1 is a flow diagram illustrating an image processing method according to an exemplary embodiment. As shown in fig. 1, the image processing method includes the following steps.
In step S11, a plurality of different known pose images of the target object, and an image including the target pose are acquired.
The target pose is the pose to be migrated to the target object. The target object may include, but is not limited to, an animation character, a movie character, or any character in real life. The known posture image refers to a posture image of the target object that can be acquired, and in the present disclosure, the posture image of the target object that can be acquired is simply referred to as the known posture image. After the known attitude image is acquired, it is also necessary to determine the attitude images of other objects except the target object, and take the attitude in the attitude images of the other objects as the target attitude to be migrated to the target object. For example, the target object is characterized by object a and the other objects are characterized by object B. Acquiring a plurality of known posture images of the object A and a known posture image of the object B, and taking the known posture in the known posture image of the object B as the target posture of the object A. By the image processing method disclosed by the disclosure, the known posture of the object B is migrated to the object A as the target posture of the object A, and an image of the object A in the target posture is obtained. The present disclosure refers to the image of object a in the target pose as the target pose image of object a.
In the present disclosure, a plurality of different known pose images means that the pose in any one of the known pose images differs from the poses in the remaining known pose images. Acquiring multiple images of the target object with different known poses makes it possible to capture the features of the target object more accurately; optionally, these features may include body-shape features, clothing features, and the like. The number of known pose images acquired must at least be sufficient to cover the body angles appearing in the target pose, and may therefore be determined according to the target pose, that is, by how many known pose images in combination can at least cover the target pose. For example, if the target pose of object A is a standing pose turned 45° to the left, then, considering the orientation of the target pose, two or three known pose images of object A standing are enough to migrate this target pose to object A.
In the embodiment of the present disclosure, one or more target poses to be migrated to the target object may be determined, and when there are multiple target poses to be migrated to the target object, the image processing method provided in the present disclosure is respectively performed on each target pose to be migrated to the target object, so as to obtain a target pose image corresponding to each target pose to be migrated to the target object.
In step S12, known 3D pose features of known poses in a plurality of different known pose images and a target 3D pose feature of a target pose are extracted, respectively.
In the embodiment of the present disclosure, the plurality of different known pose images are respectively input into a pre-trained 3D pose feature extraction network to extract the 3D pose features of the known pose in each known pose image. If the target pose to be migrated to the target object is given in the form of an image, the image including the target pose is also input into the pre-trained 3D pose feature extraction network to extract the target 3D pose feature of the target pose; if the target pose is already characterized as a 3D pose feature, the target 3D pose feature is obtained directly. In the present disclosure, the 3D coordinates of each surface element of an image in the world coordinate system are projected into the image coordinate system to obtain pixel coordinates, and the 3D coordinates and pixel coordinates that correspond to each other are used as the 3D pose features of the image. In other words, the known 3D pose features extracted from a known pose image are two-dimensional AXYZ four-channel features, one per pixel of the human surface elements in the known pose: the A channel indicates whether the pixel belongs to the human body, and the XYZ channels give the 3D coordinates, in the standard standing pose, of the human surface element corresponding to the pixel.
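As a plain illustration of the feature layout just described, the following sketch interprets a single AXYZ feature map; the tensor shape and helper name are assumptions made for illustration, not code from the disclosure.

```python
# Illustrative sketch only; shapes and names are assumptions, not the patent's code.
import torch

def describe_axyz_feature(feature: torch.Tensor) -> None:
    """feature: (4, H, W) tensor with channels [A, X, Y, Z].

    A   -- 1 where the pixel belongs to a human surface element, otherwise 0
    XYZ -- 3D coordinates of that surface element in the standard standing pose
    """
    a, xyz = feature[0], feature[1:]
    body = a > 0.5                           # pixels covered by the human body
    if body.any():
        coords = xyz[:, body]                # (3, N) standard-pose coordinates
        print(f"{int(body.sum())} body pixels; "
              f"XYZ range [{coords.min().item():.2f}, {coords.max().item():.2f}]")
    else:
        print("no body pixels")
```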
In step S13, known 3D pose features corresponding to the plurality of different known pose images are fused with the target 3D pose feature to obtain a target pose image of the target object.
In the embodiment of the disclosure, the known 3D pose feature of each known pose in the multiple different known pose images, the target 3D pose feature of the target pose, and the multiple different known pose images are input into a pre-trained optical flow pose fusion network for fusion to obtain the target pose image of the target object. In a pre-trained optical flow attitude fusion network, input target 3D attitude features, a plurality of known 3D attitude features and a plurality of different known attitude images are subjected to two-stage processing. In the first stage, the reverse optical flow field and the reliability of the target 3D attitude feature to each known 3D attitude feature are estimated, so that the reverse optical flow field and the reliability corresponding to a plurality of known 3D attitude features can be obtained. In the second stage, each known attitude image determines the reverse optical flow field and the credibility corresponding to the known attitude image according to the known 3D attitude feature corresponding to the known attitude image, so that a plurality of reverse optical flow fields and credibility corresponding to a plurality of known attitude images one by one can be obtained. And carrying out transformation operation according to the reverse optical flow field corresponding to each known attitude image to obtain an initial target attitude image corresponding to the known attitude image, thereby obtaining a plurality of initial target attitude images. And then fusing the plurality of initial target attitude images according to the credibility to obtain a target attitude image of the target object.
In summary, according to the image processing method for the target posture image in the embodiment of the disclosure, 3D posture features can be extracted from the image, the body surface texture information from the multiple images can be fused according to the 3D posture features, and high-fidelity human posture migration can be realized with low computational complexity, so as to obtain the target posture image.
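The overall flow of steps S11 to S13 can be summarized in the following minimal sketch, in which feature_net stands for the trained 3D pose feature extraction network and fusion_net for the optical flow pose fusion network; both names and signatures are placeholders rather than interfaces defined by the disclosure.

```python
# Hedged end-to-end sketch of the method; all names are placeholders for illustration.
def generate_target_pose_image(known_pose_images, target_pose_input, feature_net, fusion_net):
    # Step S12: extract known 3D pose features and the target 3D pose feature
    known_feats = [feature_net(img) for img in known_pose_images]
    target_feat = feature_net(target_pose_input)
    # Step S13: fuse the known 3D pose features with the target 3D pose feature
    return fusion_net(target_feat, known_feats, known_pose_images)
```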
In the embodiment of the disclosure, after a plurality of different images with known poses and an image including a target pose are acquired, known 3D pose features of the known poses in the plurality of images with different known poses and a target 3D pose feature of the target pose are respectively extracted. The steps in the extraction process can be seen in fig. 2. The following embodiment will describe the extraction of the known 3D pose features of the known poses in the plurality of different known pose images and the extraction of the target 3D pose features of the target pose with reference to fig. 2.
Fig. 2 is a flowchart illustrating extracting 3D pose features, according to an exemplary embodiment, as shown in fig. 2, the extracting 3D pose features includes the following steps.
In step S21, the plurality of different known posture images are input to the 3D posture feature extraction network, respectively, to obtain known 3D posture features corresponding to the known postures.
And inputting the first known posture image into a 3D posture feature extraction network to obtain a known 3D posture feature corresponding to the known posture in the first known posture image. And analogizing in sequence to obtain known 3D posture features corresponding to the known postures in the plurality of different known posture images.
In step S22, the image including the target pose is input to the 3D pose feature extraction network, and the target 3D pose feature of the target pose is obtained.
In the embodiment of the disclosure, the 3D pose feature extraction network is obtained based on human body 3D model training with standard poses. Before training the 3D pose feature extraction network, a training sample set for training the 3D pose feature extraction network is determined based on a human body 3D model with a standard pose. And training the 3D posture characteristic extraction network according to the training sample set. Training process referring to fig. 3, the following embodiment describes a training process for training a 3D pose feature extraction network in conjunction with fig. 3.
Fig. 3 is a flowchart illustrating training of a 3D pose feature extraction network according to an exemplary embodiment, where the 3D pose feature extraction network is trained as shown in fig. 3 according to the following steps.
In step S31, a human 3D model with a standard pose is acquired, and 3D coordinates of surface bins of the whole body of the human 3D model in a world coordinate system are determined.
A virtual human 3D model is generated in modeling software and adjusted to a standard pose. The standard pose can be set according to actual needs; for example, it can be a T-pose, or a spread pose resembling the Chinese character "大" (arms raised and legs apart), and so on. In the T-pose, the two hands of the human body in the 3D model are raised horizontally, the feet are apart, and the body stands upright, so that the standing posture resembles the letter T. The standard pose of the present disclosure may be any pose that fully exposes all surface elements of the whole body of the human 3D model without occlusion. The human 3D model may include a three-dimensional CAD (Computer Aided Design) model.
Further, after obtaining the human body 3D model, a world coordinate system may be established according to a set origin. The origin of the world coordinate system includes, but is not limited to, the navel in the human 3D model, the vertex of the head in the human 3D model, and the like. And after a world coordinate system is established, determining the 3D coordinates of the 3D coordinate points corresponding to all surface elements of the whole body of the human body 3D model in the world coordinate system.
In step S32, the poses of the 3D human body model are adjusted, and a plurality of sample images having different poses are obtained.
And adjusting the human body 3D model to enable the adjusted human body 3D model to swing any 3D posture. Such as a 3D pose of lifting the right foot, a set of rotated 3D poses, and the like. And acquiring a sample image corresponding to each 3D gesture by using the camera. And obtaining a plurality of sample images through different 3D postures put out by the adjusted human body 3D model.
It should be noted that the 3D coordinates of the 3D coordinate points corresponding to each surface element in the world coordinate system are fixed, and do not change with the observation angle of the camera or change with the posture of the human body after the 3D model is adjusted.
In step S33, for each of the plurality of sample images, the 3D coordinates of each surface bin in the sample image are projected into a pixel coordinate system, so as to obtain pixel coordinates corresponding to the 3D coordinates of the sample image, and create a corresponding relationship between the sample image and the 3D pose features of the sample image.
Each sample image has a pixel coordinate system, and the 3D coordinates of each surface element in the sample image are projected into the pixel coordinate system corresponding to the sample image to obtain the pixel coordinates corresponding to the 3D coordinates of each surface element. A correspondence between the 3D coordinates of the surface elements and the pixel coordinates is then created; that is, the 3D coordinates of each surface element in the standard pose are filled in at the pixel coordinates obtained by projecting that element, ensuring that the 3D coordinates of each surface element remain fixed. Because a human-body ground truth is needed for supervision when training the 3D pose feature extraction network, an alpha channel (referred to as the A channel) is added to the 3D coordinates at each pixel position to indicate whether that pixel position is a human surface element. Taking one sample image as an example, according to the correspondence between the 3D coordinates of the human surface elements and the pixel coordinates in that image: if a pixel coordinate corresponds to a 3D coordinate, the 3D coordinate filled in at that pixel is the coordinate of the corresponding human surface element and the A channel is set to 1, indicating that the pixel is a human surface element; if a pixel coordinate does not correspond to any 3D coordinate, the XYZ values filled in at that pixel are set to 0 and the A channel is set to 0, indicating that the pixel is not a human surface element.
Further, for each sample image, projecting the 3D coordinates of each surface element in the sample image into a pixel coordinate system to obtain pixel coordinates corresponding to the 3D coordinates of the sample image. And creating a corresponding relation between the sample image and the 3D coordinates and the pixel coordinates, and taking the pixel coordinates of the sample image and the 3D coordinates corresponding to the pixel coordinates of the sample image as the 3D posture characteristics of the sample image. And the 3D coordinates corresponding to the pixel coordinates of the sample image comprise an A channel. And establishing a corresponding relation between each sample image and the corresponding 3D posture characteristic of the sample image. And taking all the sample images and the 3D posture features which have corresponding relations with the sample images as a training sample set for training the 3D posture feature extraction network.
It should be noted that, in order to restore the human surface texture more completely, after the 3D coordinates of each surface element are projected into the pixel coordinate system, each pixel coordinate may be made to correspond to the 3D coordinates of one surface element; alternatively, the projected 3D coordinates may be retained at a fixed ratio, e.g. 3:1, meaning that three pixel coordinates correspond to the 3D coordinates of one surface element. Since the number of surface elements selected in this way is far larger than the number of key points selected on a human 3D model in the related art, the 3D coordinates and pixel coordinates corresponding to a sample image are also called dense 3D pose features.
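The label construction described above can be sketched as follows; the projection helper, the pixel-selection policy, and the array names are assumptions for illustration only.

```python
# Hedged sketch of building the AXYZ training label for one sample image.
# `camera` (world 3D -> pixel coordinates) is an assumed projection function.
import numpy as np

def build_axyz_label(surface_points_std, surface_points_posed, camera, hw):
    """surface_points_std:   (N, 3) coords of surface elements in the standard pose
       surface_points_posed: (N, 3) coords of the same elements in the sample pose
       camera: function mapping world 3D points to (u, v) pixel coordinates
       hw: (H, W) size of the rendered sample image
    """
    H, W = hw
    label = np.zeros((4, H, W), dtype=np.float32)           # channels [A, X, Y, Z]
    uv = camera(surface_points_posed).round().astype(int)   # (N, 2) pixel coordinates
    keep = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    for (u, v), xyz in zip(uv[keep], surface_points_std[keep]):
        label[0, v, u] = 1.0        # A channel: this pixel is a human surface element
        label[1:, v, u] = xyz       # XYZ: fixed standard-pose coordinates of the element
    return label
```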
In the above steps S31 to S33, a training sample set for training the 3D pose feature extraction network is obtained by adjusting the human body 3D model. Besides the manner of obtaining the training sample set through steps S31 to S33 provided by the present disclosure, the training sample set of the training 3D pose feature extraction network may be obtained by taking human body photographs in different poses and performing manual labeling.
In step S34, a 3D posture feature extraction network is trained based on the sample image and the 3D posture feature having the corresponding relationship, and the input of the 3D posture feature extraction network is the sample image and the output is the 3D posture feature having the corresponding relationship.
When training the 3D pose feature extraction network, a sample image is input into the network to obtain the 3D pose features that the network extracts from that image. The difference between the extracted 3D pose features (including the value of the A channel) and the 3D pose features actually corresponding to the sample image is then computed, and the network parameters of the 3D pose feature extraction network are adjusted until this difference is smaller than a preset threshold. The network with the last-adjusted parameters is then used as the trained 3D pose feature extraction network.
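A minimal training-step sketch consistent with the supervision described above is given below; the choice of losses (binary cross-entropy on the A channel, body-masked L1 on the XYZ channels) and the framework are assumptions, not details fixed by the disclosure.

```python
# Hedged sketch of one training step; loss composition is an assumption.
import torch
import torch.nn.functional as F

def train_step(net, optimizer, sample_image, axyz_label):
    pred = net(sample_image)                     # (B, 4, H, W) predicted AXYZ features
    a_loss = F.binary_cross_entropy_with_logits(pred[:, :1], axyz_label[:, :1])
    body = axyz_label[:, :1]                     # supervise XYZ only on body pixels
    xyz_loss = (F.l1_loss(pred[:, 1:], axyz_label[:, 1:], reduction="none") * body).mean()
    loss = a_loss + xyz_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```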
In the embodiment of the present disclosure, an image segmentation network is used as the network structure of the 3D pose feature extraction network, and the image segmentation network is trained based on the sample images and the 3D pose features that correspond to them. The trained image segmentation network is then used as the 3D pose feature extraction network. The training process of the image segmentation network is shown in Fig. 4.
FIG. 4 is a flowchart illustrating training of an image segmentation network, such as that shown in FIG. 4, according to an exemplary embodiment.
In step S41, the image segmentation network is initialized.
In step S42, the image segmentation network is trained with the sample image as its input and the 3D pose features as its output, obtaining an image segmentation network that extracts 3D pose features from the sample image.
When an image segmentation network (unet) is trained, a sample image is used as the input of the image segmentation network, 3D posture characteristics are used as the output of the image segmentation network, and network parameters representing the mapping relation between the sample image and the 3D posture characteristics are obtained through training. And taking the image segmentation network provided with the network parameters representing the mapping relation between the sample image and the 3D posture characteristics as an image segmentation network for extracting the 3D posture characteristics based on the sample image.
In step S43, the trained image segmentation network is used as the 3D pose feature extraction network.
And utilizing the trained image segmentation network as a 3D attitude feature extraction network for extracting the 3D attitude features in the sample image. And extracting the known 3D attitude feature in the known attitude image and the target 3D attitude feature of the target attitude based on the 3D attitude feature extraction network, so as to obtain the target attitude image directly according to the known 3D attitude feature and the target 3D attitude feature, and reduce the calculation amount of obtaining the target attitude image by using the known attitude image and the image comprising the target attitude.
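The role of the segmentation network as a feature extractor can be sketched as a thin wrapper; the class name is a placeholder and no particular segmentation library is implied.

```python
# Sketch: an off-the-shelf image segmentation network (e.g. a U-Net) used as the
# 3D pose feature extraction network -- 3 image channels in, 4 AXYZ channels out.
import torch.nn as nn

class PoseFeatureExtractor(nn.Module):
    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet            # any encoder-decoder segmentation network with 4 output channels

    def forward(self, image):       # image: (B, 3, H, W)
        return self.unet(image)     # (B, 4, H, W) dense AXYZ 3D pose features
```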
In the embodiment of the disclosure, besides the trained image segmentation network is used as the 3D posture feature extraction network, the human body feature extraction network can be used as an encoder of the image segmentation network to obtain a network model to be trained. And training the network model to be trained based on the sample images and the 3D posture features with the corresponding relation. And taking the trained network model to be trained as a 3D attitude feature extraction network. The training process of the network model to be trained is shown in fig. 5. The following embodiment will describe a training process of a network model to be trained with reference to fig. 5.
FIG. 5 is a flowchart illustrating a training process of a network model to be trained according to an exemplary embodiment, wherein the network model to be trained is trained as shown in FIG. 5.
In step S51, a human body feature extraction network for recognizing human body features is trained.
The method comprises the steps of obtaining human body images, movie and animation character images, hand-drawn human body images and the like, and using the obtained human body images, movie and animation character images and hand-drawn human body images as a sample set of a training human body feature extraction network. And taking the sample set for training the human body feature extraction network as input, taking the human body features as output, and training the human body feature extraction network. Wherein the human body characteristics include head, torso, arms, feet, etc. The human body feature extraction network in the embodiment of the present disclosure may adopt a residual neural network (Resnet).
In step S52, the human body feature extraction network is used as an encoder of the image segmentation network to obtain a network model to be trained.
The backbone network of the human body feature extraction network is plugged into the image segmentation network as its encoder, and the image segmentation network equipped with this encoder is used as the network model to be trained. That is, the backbone of the human body feature extraction network serves as the down-sampling (encoder) path of the image segmentation network. Because this backbone has already learned to recognize human body parts, it speeds up the training of the whole network model to be trained.
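A rough sketch of this encoder replacement is shown below; a plain ResNet is used as a stand-in for the trained human body feature extraction backbone, and the wiring to the decoder (skip connections omitted) is an assumption.

```python
# Hedged sketch of splicing a backbone into the segmentation network as its encoder.
import torch.nn as nn
import torchvision

def build_model_to_train(decoder: nn.Module) -> nn.Module:
    backbone = torchvision.models.resnet18()                     # stand-in for the human-feature backbone
    encoder = nn.Sequential(*list(backbone.children())[:-2])     # keep conv stages, drop pool/fc
    return nn.Sequential(encoder, decoder)                       # decoder up-samples to (B, 4, H, W); skips omitted
```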
In step S53, the sample image is used as an input of the network model to be trained, and the 3D pose feature is used as an output of the network model to be trained, so as to train the network model to be trained, thereby obtaining a 3D pose feature extraction network for extracting the 3D pose feature based on the sample image.
The process of training the to-be-trained network model is similar to the process of training the image segmentation network, and is not repeated here.
In the embodiment of the present disclosure, the network structure of the 3D pose feature extraction network may be an image segmentation network, or an image segmentation network with a dedicated encoder. In either case it is a feed-forward network: extracting the 3D pose features of an input image does not rely on a gradient-descent algorithm, so the amount of computation is small and the network is easy to run on mobile devices. Furthermore, the training sample images of the 3D pose feature extraction network are obtained by adjusting the human 3D model on the basis of the 3D coordinates of the surface elements of the model in the standard pose; therefore the 3D pose features extracted by the network change neither with the observation angle of the camera nor with the pose of the human body, which reduces the amount of computation. Finally, the target pose image of the target object is determined from the known 3D pose features of the known pose images and the target 3D pose feature of the target pose; compared with determining the target pose image directly from the known pose images and the image including the target pose, this reduces the amount of computation enough to run on mobile devices with limited computing resources.
In the embodiment of the disclosure, known 3D posture features of known postures in a plurality of different known posture images and target 3D posture features of target postures are extracted and obtained, and then the known 3D posture features corresponding to the plurality of different known posture images and the target 3D posture features are subjected to fusion operation to obtain the target posture image of the target object. The process of determining the target pose image is described with reference to fig. 6. The following embodiment will explain the determination process of the target pose with reference to fig. 6.
Fig. 6 is a flowchart illustrating a target posture image determining process according to an exemplary embodiment, where as shown in fig. 6, known 3D posture features corresponding to a plurality of different known posture images are fused with the target 3D posture feature to obtain a target posture image of a target object, and the method includes the following steps.
In step S61, inverse optical flow fields between the target 3D pose features and the known 3D pose features are estimated.
And respectively inputting the target 3D attitude feature and each known 3D attitude feature into an optical flow estimation sub-network (Flownet) for estimation to obtain a reverse optical flow field between the target 3D attitude feature and each known 3D attitude feature. And assuming that N (N is a positive integer) known attitude images are provided, and respectively extracting N groups of known 3D attitude characteristics corresponding to the N known attitude images according to a 3D attitude characteristic extraction network. A first inverse optical-flow field between the target 3D pose feature and the first set of known 3D pose features is estimated from the optical-flow estimation sub-network. Then, a second inverse optical-flow field between the target 3D pose feature and a second set of known 3D pose features is estimated based on the optical-flow estimation sub-network. By analogy, the reverse optical flow field between the target 3D attitude feature and the N groups of known 3D attitude features is estimated, and then the reverse optical flow field between the target attitude feature and each group of known 3D attitude features is determined.
In the present disclosure, the backward (reverse) optical flow field is defined relative to the forward optical flow field: the optical flow field from a known 3D pose feature to the target 3D pose feature is called the forward optical flow field, and the optical flow field from the target 3D pose feature to a known 3D pose feature is called the backward optical flow field.
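The first-stage estimation described above can be sketched as follows; flow_net stands for the FlowNet-like optical flow estimation sub-network, and its input and output formats are assumptions.

```python
# Hedged sketch: for every known 3D pose feature, predict the backward optical flow
# field (target -> known) and a per-pixel confidence map.
import torch

def estimate_backward_flows(flow_net, target_feat, known_feats):
    """target_feat: (1, 4, H, W); known_feats: list of N tensors (1, 4, H, W)."""
    flows, confidences = [], []
    for known in known_feats:
        pair = torch.cat([target_feat, known], dim=1)   # (1, 8, H, W) feature pair
        flow, conf = flow_net(pair)                     # assumed outputs: (1, 2, H, W), (1, 1, H, W)
        flows.append(flow)
        confidences.append(conf)
    return flows, confidences
```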
In step S62, for each known posture image of the target object, transformation is performed based on the inverse optical flow field corresponding to the known posture image, so as to obtain a plurality of initial target posture images.
Taking the first set of known 3D pose features as an example, the first known pose image corresponding to the first set of known 3D pose features and the first backward optical flow field corresponding to that set are determined, and a pixel transformation operation (warp) is applied to the first known pose image according to the first backward optical flow field to obtain the first initial target pose image. Similarly, the second known pose image is transformed according to the second backward optical flow field to obtain the second initial target pose image. In this way, each of the N known pose images is transformed according to its corresponding backward optical flow field, yielding N initial target pose images.
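A possible sketch of the warp operation, assuming the backward optical flow field is expressed in pixel units and PyTorch's grid_sample is used for the re-sampling, is:

```python
# Hedged sketch of the warp: a known pose image is re-sampled along its backward
# optical flow field to form an initial target pose image.
import torch
import torch.nn.functional as F

def warp(image, backward_flow):
    """image: (1, 3, H, W) float; backward_flow: (1, 2, H, W) as (dx, dy) per target pixel."""
    _, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().unsqueeze(0)   # (1, H, W, 2) target pixel grid
    grid = grid + backward_flow.permute(0, 2, 3, 1)             # where to sample in the known image
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1               # normalize x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1               # normalize y to [-1, 1]
    return F.grid_sample(image, grid, align_corners=True)       # initial target pose image
```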
In step S63, the plurality of initial target posture images are fused to obtain a target posture image of the target object.
And a step of fusing the plurality of initial target posture images to obtain a target posture image of the target object, referring to fig. 7. Fig. 7 is a flowchart illustrating an example of fusing multiple initial target pose images according to an exemplary embodiment, where as shown in fig. 7, the fusing of the multiple initial target pose images to obtain a target pose image of a target object includes the following steps.
In step S71, the reliability between the target 3D pose feature and each of the known 3D pose features is estimated, and the reliability of each of the known 3D pose features is determined as the reliability of the plurality of initial target pose images.
In the sub-network of optical flow estimation, the confidence level of each pixel position in the target 3D pose feature to the corresponding pixel position in each known 3D pose feature is estimated. And establishing a corresponding relation between the known 3D posture characteristics and the reliability, and establishing a corresponding relation among the known posture images, the known 3D posture characteristics and the reliability according to the corresponding relation between the known 3D posture characteristics and the known posture images. And establishing a relation between the initial target posture image and the reliability according to the condition that each known posture image corresponds to one initial target posture image.
In step S72, the multiple initial target posture images are fused based on the credibility of the multiple initial target posture images, and a target posture image of the target object is obtained.
In the embodiment of the present disclosure, the reliability of each initial target posture image is regarded as the weight of the initial target posture image. And carrying out weighted summation on each pixel of the multiple initial target attitude images according to the weight corresponding to each pixel in each initial target attitude image to obtain a target attitude image of which the target object meets the target attitude.
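A minimal sketch of this confidence-weighted fusion is given below (the per-pixel confidences are used directly as weights here; the softmax variant appears in a later embodiment); shapes and names are assumptions.

```python
# Hedged sketch: fuse the N warped candidates with their confidences as per-pixel weights.
import torch

def fuse_by_confidence(initial_images, confidences, eps=1e-6):
    """initial_images: (N, 3, H, W) warped candidates; confidences: (N, 1, H, W)."""
    weights = confidences / (confidences.sum(dim=0, keepdim=True) + eps)
    return (weights * initial_images).sum(dim=0)      # (3, H, W) target pose image
```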
In one embodiment, an optical flow pose fusion network may be utilized to obtain a target pose image of a target object based on known 3D pose features corresponding to a plurality of different known pose images and the target 3D pose features. And taking the target 3D attitude feature, the plurality of different known attitude images and the known 3D attitude features corresponding to the plurality of different known attitude images as the input of the optical flow attitude fusion network, taking the target attitude image of the target object as the output of the optical flow attitude fusion network, and training the optical flow attitude fusion network. FIG. 8 is a schematic diagram illustrating an optical-flow pose fusion network, as shown in FIG. 8, including an optical-flow estimation sub-network and an image fusion sub-network, according to an example embodiment.
The target 3D pose feature and each known 3D pose feature are taken as the input of the optical flow estimation sub-network, which estimates the backward optical flow field between the target 3D pose feature and each known 3D pose feature, as well as the confidence between them. A correspondence among each known pose image, its known 3D pose feature, its backward optical flow field, and its confidence is then established. The plurality of known pose images, together with their corresponding backward optical flow fields and confidences, are taken as the input of the image fusion sub-network, which outputs the target pose image of the target object through warp operations. Because the optical flow estimation sub-network in the optical flow pose fusion network only estimates the difference between the target 3D pose feature and the known 3D pose features to obtain the backward optical flow fields and confidences, its workload, and the computing power needed to run it, are lower than in the related art, where the optical flow field is estimated directly between the known pose image and the image including the target pose.
In the present embodiment, the inverse optical flow fields and the credibilities output by the optical flow estimation sub-network are directly input to the image fusion sub-network. In the image fusion sub-network, a plurality of initial target pose images are generated according to the inverse optical flow fields, and these initial target pose images are then fused according to the credibilities to obtain the target pose image. The network parameters of the optical flow estimation sub-network are adjusted in the reverse direction (by back-propagation), guided by whether the target pose image matches the target object and/or whether it matches the target pose. This embodiment therefore enables the optical flow estimation sub-network to learn without optical flow field ground truth as supervision, and the image fusion sub-network to learn without credibility ground truth as supervision, which sidesteps the difficulty of obtaining ground truth for the optical flow field and for the credibility.
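The disclosure does not spell out the warp operation itself. Below is a minimal backward-warping sketch in PyTorch, under the assumption that the inverse optical flow field stores, for every target pixel, a (dx, dy) offset pointing back into the known pose image; the function name and tensor layout are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(known_image, inverse_flow):
    """Warp a known pose image toward the target pose with an inverse optical flow field.

    known_image:  (N, C, H, W) tensor - a known pose image of the target object.
    inverse_flow: (N, 2, H, W) tensor - per-target-pixel (dx, dy) offsets into the
                  known pose image (flow convention is an assumption of this sketch).
    Returns an (N, C, H, W) initial target pose image.
    """
    n, _, h, w = known_image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=known_image.device),
        torch.arange(w, device=known_image.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    # Add the inverse flow, then normalise to [-1, 1] as grid_sample expects.
    coords = grid + inverse_flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(known_image, sample_grid, align_corners=True)
```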
In the embodiment of the present disclosure, the credibility estimated by the optical flow estimation sub-network can be directly adopted to fuse the plurality of initial target pose images into the target pose image. Alternatively, the credibility estimated by the optical flow estimation sub-network may first be processed by a pixel-wise soft maximum (Softmax), which suppresses pixel sources with low credibility; the plurality of initial target pose images are then fused according to the soft-maximum-processed credibility to obtain the target pose image.
Fig. 9 is a flowchart illustrating a further example of fusing multiple initial target pose images according to an exemplary embodiment, where as shown in fig. 9, fusing multiple initial target pose images to obtain a target pose image of a target object based on the credibility of the multiple initial target pose images includes the following steps.
In step S81, soft maximum processing is performed on the reliability of each of the plurality of initial target posture images, and the reliability after the soft maximum processing is determined.
For example, assume that the credibilities of the plurality of initial target pose images form an array z = (z_1, z_2, …, z_K), where z_i denotes the i-th element of z, j indexes the elements of z, and K is the number of elements in z. The softmax value of the credibility is then

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}

where σ denotes the softmax value; that is, the softmax value of an element is the ratio of the exponential of that element to the sum of the exponentials of all elements.
The credibility of each initial target pose image is processed by softmax according to the above formula, and the softmax-processed credibility is thereby determined.
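A minimal NumPy sketch of applying this softmax independently at every pixel position might look as follows; the (K, H, W) layout of the credibility maps is an assumption of this sketch.

```python
import numpy as np

def pixelwise_softmax(credibilities):
    """Apply the softmax formula above independently at every pixel.

    credibilities: (K, H, W) array - one credibility map per initial target pose
                   image; the softmax is taken over the K candidates at each pixel.
    """
    # Subtract the per-pixel maximum first for numerical stability.
    shifted = credibilities - credibilities.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)
```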
In step S82, a plurality of initial target posture images are fused based on the confidence level after the soft maximum processing, and a target posture image of the target object is obtained.
The pixels of the multiple initial target pose images are weighted and summed according to the soft-maximum-processed credibility corresponding to each initial target pose image, yielding the target pose image of the target object. The soft maximum processing thus suppresses the pixels with low credibility in the initial target pose images.
In one embodiment, the image processing method provided by the present disclosure may be implemented by the image processing model shown in fig. 10. FIG. 10 is a schematic diagram of an image processing model according to an exemplary embodiment. As shown in fig. 10, the image processing model includes a 3D pose feature extraction network and an optical flow pose fusion network; as shown in fig. 8, the optical flow pose fusion network includes an optical flow estimation sub-network and an image fusion sub-network. The plurality of known pose images are input into the 3D pose feature extraction network, which extracts the known 3D pose feature of each known pose image; the image including the target pose is input into the same 3D pose feature extraction network to obtain the target 3D pose feature of the target pose. The target 3D pose feature, the plurality of known pose images and the known 3D pose features extracted from them are taken as the input of the optical flow pose fusion network, which outputs the target pose image of the target object through its fusion processing. In this embodiment, the 3D pose feature extraction network and the optical flow pose fusion network of the image processing model can be run separately. When a video is synthesized from the target pose images generated by the image processing model, the 3D pose feature extraction network only needs to extract the known 3D pose features of the known poses of the target object and the target 3D pose features once; the optical flow pose fusion network is then used repeatedly to fuse the known 3D pose features corresponding to the plurality of different known pose images with the target 3D pose feature to obtain the target pose image of each frame. The known 3D pose features of the known poses in the plurality of different known pose images and the target 3D pose features of the target poses therefore do not need to be repeatedly extracted, which reduces the amount of computation. When a video is synthesized with a related-art method of generating a target pose image from known pose images, the whole method has to be executed in full for every frame because no such separation is performed, so its amount of computation is far larger than that of the scheme provided by this embodiment.
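For orientation only, the decoupling described above might be organised as in the following sketch, where the two networks are treated as opaque callables; all names and the calling convention are assumptions rather than the disclosed implementation.

```python
def synthesize_video(pose_feature_net, flow_pose_fusion_net, known_images, driving_pose_images):
    """Hypothetical driver: extract 3D pose features once, then fuse per frame."""
    # 3D pose feature extraction is done once, up front, for all inputs.
    known_feats = [pose_feature_net(img) for img in known_images]
    target_feats = [pose_feature_net(img) for img in driving_pose_images]
    # Only the optical flow pose fusion network runs repeatedly afterwards,
    # reusing the known pose images and their pre-extracted features.
    return [
        flow_pose_fusion_net(t_feat, known_images, known_feats)
        for t_feat in target_feats
    ]
```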
Based on the same concept, the embodiment of the present disclosure further provides an image processing apparatus.
It can be understood that, in order to realize the above functions, the image processing apparatus provided by the embodiments of the present disclosure includes corresponding hardware structures and/or software modules for performing each function. In combination with the exemplary units and algorithm steps described in the embodiments disclosed herein, the embodiments of the present disclosure can be implemented in hardware, or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Referring to fig. 11, fig. 11 is a block diagram illustrating an image processing apparatus 100 according to an exemplary embodiment. The image processing apparatus 100 includes an acquisition unit 101, an extraction unit 102, and a fusion unit 103.
The acquiring unit 101 is configured to acquire a plurality of different known posture images of a target object and an image including a target posture, where the target posture is a posture to be migrated to the target object; the extraction unit 102 is configured to extract known 3D pose features of known poses in a plurality of different known pose images and target 3D pose features of a target pose respectively; the fusion unit 103 is configured to fuse the known 3D posture features corresponding to the multiple different known posture images with the target 3D posture feature to obtain a target posture image of the target object.
In one embodiment, the extracting unit 102 is configured to: respectively inputting a plurality of images with different known postures and images including target postures into a 3D posture feature extraction network to obtain known 3D posture features corresponding to the known postures and target 3D posture features of the target postures; the 3D posture feature extraction network is obtained based on human body 3D model training with standard postures.
In an embodiment, the image processing apparatus further includes a training unit 104, which is configured to:
acquiring a human body 3D model with a standard posture, and determining 3D coordinates of all surface elements of the whole body of the human body 3D model in a world coordinate system; adjusting the postures of the human body 3D model to obtain a plurality of sample images with different postures; for each sample image in a plurality of sample images, respectively projecting the 3D coordinates of each surface element in the sample image into a pixel coordinate system to obtain pixel coordinates corresponding to the 3D coordinates of the sample image, and creating a corresponding relation between the sample image and the 3D posture characteristics of the sample image, wherein the 3D posture characteristics of the sample image are the 3D coordinates and the pixel coordinates which have the corresponding relation with the sample image; training to obtain a 3D posture feature extraction network based on the sample image and the 3D posture feature with the corresponding relation, wherein the input of the 3D posture feature extraction network is the sample image, and the output is the 3D posture feature with the corresponding relation.
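As one possible reading of the projection step (the disclosure does not fix a camera or projection model), a standard pinhole projection of the surface-element coordinates from the world coordinate system into the pixel coordinate system could be sketched as follows; the matrix shapes and the pinhole assumption are illustrative only.

```python
import numpy as np

def project_surfels(points_world, rotation, translation, intrinsics):
    """Project surface-element 3D coordinates (world frame) into pixel coordinates.

    points_world: (N, 3) surface-element coordinates in the world coordinate system.
    rotation:     (3, 3) world-to-camera rotation matrix (assumption).
    translation:  (3,)   world-to-camera translation vector (assumption).
    intrinsics:   (3, 3) camera intrinsic matrix of a pinhole model (assumption).
    Returns an (N, 2) array of pixel coordinates, one per surface element.
    """
    cam = points_world @ rotation.T + translation  # world frame -> camera frame
    uvw = cam @ intrinsics.T                       # camera frame -> homogeneous pixel coords
    return uvw[:, :2] / uvw[:, 2:3]                # perspective divide
```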
In one embodiment, the training unit 104 is further configured to: initializing an image segmentation network; taking the sample image as the input of the image segmentation network and the 3D posture feature as the output of the image segmentation network, training the image segmentation network, and obtaining an image segmentation network that extracts the 3D posture feature based on the sample image; and taking the trained image segmentation network as the 3D posture feature extraction network.
In one embodiment, the training unit 104 is further configured to: training a human body feature extraction network for identifying human body features; taking the human body feature extraction network as an encoder of an image segmentation network to obtain a network model to be trained; and taking the sample image as the input of the network model to be trained, taking the 3D posture characteristic as the output of the network model to be trained, training the network model to be trained, and obtaining the 3D posture characteristic extraction network based on the sample image extraction 3D posture characteristic.
In one embodiment, the fusion unit 103 is configured to: estimating a reverse optical flow field between the target 3D attitude characteristic and each known 3D attitude characteristic respectively; aiming at each known attitude image of the target object, respectively carrying out transformation based on a reverse optical flow field corresponding to the known attitude image to obtain a plurality of initial target attitude images; and fusing the plurality of initial target attitude images to obtain a target attitude image of the target object.
In one embodiment, the fusion unit 103 is further configured to: estimating the credibility between the target 3D posture characteristics and each known 3D posture characteristic, and determining the credibility of each known 3D posture characteristic as the credibility of a plurality of initial target posture images; and fusing the initial target attitude images based on the credibility of the initial target attitude images to obtain a target attitude image of the target object.
In one embodiment, the fusion unit 103 is further configured to: and respectively carrying out soft maximum processing on the credibility of the initial target posture images, and fusing the initial target posture images based on the credibility after the soft maximum processing to obtain the target posture images of the target object.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
As shown in fig. 12, one embodiment of the present disclosure provides an electronic device 200. The electronic device 200 includes a memory 201, a processor 202, and an Input/Output (I/O) interface 203. The memory 201 is used for storing instructions, and the processor 202 is used for calling the instructions stored in the memory 201 to execute the image processing method of the embodiments of the present disclosure. The processor 202 is connected to the memory 201 and the I/O interface 203, for example, via a bus system and/or other connection mechanism (not shown). The memory 201 may be used to store programs and data, including the program of the image processing method involved in the embodiments of the present disclosure, and the processor 202 executes various functional applications and data processing of the electronic device 200 by running the programs stored in the memory 201.
In the embodiment of the present disclosure, the processor 202 may be implemented in at least one hardware form among a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) and a Programmable Logic Array (PLA), and the processor 202 may be a Central Processing Unit (CPU) or another processing unit with data processing capability and/or instruction execution capability, or a combination thereof.
Memory 201 in the disclosed embodiments may comprise one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile Memory may include, for example, a Random Access Memory (RAM), a cache Memory (cache), and/or the like. The nonvolatile Memory may include, for example, a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like.
In the embodiment of the present disclosure, the I/O interface 203 may be used to receive input instructions (e.g., numeric or character information) and to generate key signal inputs related to user settings and function control of the electronic device 200, and may also output various information (e.g., images or sounds) to the outside. The I/O interface 203 in the disclosed embodiments may include one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a mouse, a joystick, a trackball, a microphone, a speaker, a touch panel, and the like.
Another embodiment of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, enables the processor to perform the image processing method described above.
It is to be understood that although operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
The methods and apparatus related to embodiments of the present disclosure can be implemented using standard programming techniques, with rule-based logic or other logic used to accomplish the various method steps. It should also be noted that the words "means" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving inputs.
Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code, which is executable by a computer processor for performing any or all of the described steps, operations, or procedures.
The foregoing description of the implementations of the disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosure. The embodiments were chosen and described in order to explain the principles of the disclosure and its practical application to enable one skilled in the art to utilize the disclosure in various embodiments and with various modifications as are suited to the particular use contemplated.

Claims (11)

1. An image processing method, comprising:
acquiring a plurality of images with different known postures of a target object and an image comprising a target posture, wherein the target posture is a posture to be transferred to the target object;
respectively extracting known 3D posture features of known postures in the multiple different known posture images and target 3D posture features of the target postures;
and fusing the known 3D posture features corresponding to the plurality of different known posture images with the target 3D posture features to obtain a target posture image of the target object.
2. The image processing method according to claim 1, wherein extracting the known 3D pose features of the known poses in the plurality of different known pose images and the target 3D pose feature of the target pose respectively comprises:
respectively inputting the plurality of different known posture images and the image comprising the target posture into a 3D posture feature extraction network to obtain known 3D posture features corresponding to the known postures and the target 3D posture features of the target posture;
the 3D posture feature extraction network is obtained based on human body 3D model training with standard postures.
3. The image processing method of claim 2, wherein the 3D pose feature extraction network is trained by:
acquiring a human body 3D model with a standard posture, and determining 3D coordinates of all surface elements of the whole body of the human body 3D model in a world coordinate system;
adjusting the postures of the human body 3D model to obtain a plurality of sample images with different postures;
for each sample image in the multiple sample images, projecting the 3D coordinates of each surface element in the sample image into a pixel coordinate system to obtain pixel coordinates corresponding to the 3D coordinates of the sample image, and creating a corresponding relation between the sample image and 3D posture features of the sample image, wherein the 3D posture features of the sample image are the 3D coordinates and the pixel coordinates which have a corresponding relation with the sample image;
training to obtain a 3D posture feature extraction network based on the sample image and the 3D posture feature with the corresponding relation, wherein the input of the 3D posture feature extraction network is the sample image, and the output is the 3D posture feature with the corresponding relation.
4. The image processing method according to any one of claims 1 to 3, wherein training to obtain a 3D pose feature extraction network based on the sample image and the 3D pose feature having the corresponding relationship comprises:
initializing an image segmentation network;
taking the sample image as the input of the image segmentation network, taking the 3D posture characteristic as the output of the image segmentation network, training the image segmentation network, and obtaining the image segmentation network for extracting the 3D posture characteristic based on the sample image;
and taking the trained image segmentation network as a 3D posture feature extraction network.
5. The image processing method according to any one of claims 1 to 3, wherein training to obtain a 3D pose feature extraction network based on the sample image and the 3D pose feature having the corresponding relationship comprises:
training a human body feature extraction network for identifying human body features;
taking the human body feature extraction network as an encoder of an image segmentation network to obtain a network model to be trained;
and taking the sample image as the input of the network model to be trained, taking the 3D posture characteristic as the output of the network model to be trained, training the network model to be trained, and obtaining a 3D posture characteristic extraction network for extracting the 3D posture characteristic based on the sample image.
6. The image processing method according to claim 1, wherein fusing the known 3D pose features corresponding to the plurality of different known pose images with the target 3D pose feature to obtain a target pose image of the target object, comprises:
estimating a reverse optical flow field between the target 3D attitude feature and each known 3D attitude feature respectively;
aiming at each known attitude image of the target object, respectively transforming based on the reverse optical flow field corresponding to the known attitude image to obtain a plurality of initial target attitude images;
and fusing the plurality of initial target attitude images to obtain a target attitude image of the target object.
7. The image processing method according to claim 6, wherein fusing the plurality of initial target pose images to obtain a target pose image of the target object comprises:
estimating the credibility between the target 3D posture characteristic and each known 3D posture characteristic, and determining the credibility of each known 3D posture characteristic as the credibility of a plurality of initial target posture images;
and fusing the initial target attitude images based on the credibility of the initial target attitude images to obtain a target attitude image of the target object.
8. The image processing method according to claim 7, wherein the fusing the plurality of initial target pose images based on the credibility of the plurality of initial target pose images to obtain the target pose image of the target object comprises:
and respectively carrying out soft maximum processing on the credibility of the initial target posture images, and fusing the initial target posture images based on the credibility after the soft maximum processing to obtain the target posture image of the target object.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the image processing method of any one of claims 1 to 8.
10. A storage medium having stored therein instructions that, when executed by a processor of a mobile device, enable the mobile device to perform the image processing method of any one of claims 1 to 8.
11. A computer program product, comprising a computer program which, when executed by a processor, enables the processor to carry out the image processing method of any one of claims 1 to 8.
CN202111223313.XA 2021-10-20 2021-10-20 Image processing method, electronic device, storage medium, and computer program product Pending CN114202606A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111223313.XA CN114202606A (en) 2021-10-20 2021-10-20 Image processing method, electronic device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111223313.XA CN114202606A (en) 2021-10-20 2021-10-20 Image processing method, electronic device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN114202606A true CN114202606A (en) 2022-03-18

Family

ID=80646238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111223313.XA Pending CN114202606A (en) 2021-10-20 2021-10-20 Image processing method, electronic device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN114202606A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147629A (en) * 2022-07-01 2022-10-04 兰州理工大学 Glacier three-dimensional modeling and motion displacement extraction method
CN115147629B (en) * 2022-07-01 2023-09-22 兰州理工大学 Glacier three-dimensional modeling and motion displacement extraction method

Similar Documents

Publication Publication Date Title
CN109377544B (en) Human face three-dimensional image generation method and device and readable medium
CN111797753B (en) Training of image driving model, image generation method, device, equipment and medium
JP7178396B2 (en) Method and computer system for generating data for estimating 3D pose of object included in input image
CN108961369B (en) Method and device for generating 3D animation
US11559887B2 (en) Optimizing policy controllers for robotic agents using image embeddings
CN112614213B (en) Facial expression determining method, expression parameter determining model, medium and equipment
KR20220025023A (en) Animation processing method and apparatus, computer storage medium, and electronic device
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
Ranjan et al. Learning human optical flow
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN112037310A (en) Game character action recognition generation method based on neural network
US20230401799A1 (en) Augmented reality method and related device
GB2606785A (en) Adaptive convolutions in neural networks
Zhou et al. Image2GIF: Generating cinemagraphs using recurrent deep q-networks
CN115018979A (en) Image reconstruction method, apparatus, electronic device, storage medium, and program product
CN111640172A (en) Attitude migration method based on generation of countermeasure network
CN114202606A (en) Image processing method, electronic device, storage medium, and computer program product
CN112906520A (en) Gesture coding-based action recognition method and device
CN114049678B (en) Facial motion capturing method and system based on deep learning
CN115482557A (en) Human body image generation method, system, device and storage medium
CN114758205A (en) Multi-view feature fusion method and system for 3D human body posture estimation
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
Jian et al. Realistic face animation generation from videos
CN113034675A (en) Scene model construction method, intelligent terminal and computer readable storage medium
Taylor et al. Interacting with real objects in virtual worlds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination