CN114333051A - Image processing method, virtual image processing method, image processing system and equipment

Info

Publication number
CN114333051A
Authority
CN
China
Prior art keywords
image, information, type, action, point information
Prior art date
Legal status
Pending
Application number
CN202111545040.0A
Other languages
Chinese (zh)
Inventor
王云峰
陈志文
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111545040.0A
Publication of CN114333051A

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application provide an image processing method, an avatar processing method, an image processing system, and a device. According to the technical solution of the embodiments, after the action of a first object in a first image is identified and the feature point information of a second object in a second image is determined, image processing corresponding to the type to which the second object belongs is performed on the second image according to the action information of the first object and the feature point information of the second object, so as to obtain a target image in which the second object has the corresponding action. Because different types of objects correspond to different image processing modes, second objects of various types can be driven to perform corresponding actions based on the actions of the first object, which gives the scheme a wide application range. In addition, the process of determining the feature point information of the second object in the second image may be performed in advance, that is, offline, and the result is simply retrieved during online processing. The whole scheme can run on the client, places low demands on the client's hardware performance, and is highly efficient.

Description

Image processing method, virtual image processing method, image processing system and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method, an avatar processing method, an image processing system, and an image processing apparatus.
Background
With the development of the mobile internet and virtual reality technology, digital avatars (also referred to as virtual images) are being applied to more and more scenes, such as live video streaming and games. When applied to these scenes, the digital avatar is driven so that it can perform actions similar to those of a real person and interact better with the user.
Existing schemes drive the digital avatar through physical motion capture. However, physical motion capture is expensive and requires professional actors to participate.
Disclosure of Invention
In order to solve or improve the problems in the prior art, embodiments of the present application provide an image processing method, an avatar processing method, an image processing system, and an electronic device.
In one embodiment of the present application, an image processing method is provided. The method comprises the following steps:
acquiring a first image;
identifying the action of a first object in the first image to obtain action information;
determining a second image, a second object in the second image, a type of the second object and characteristic point information of the second object;
and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with a corresponding action.
In yet another embodiment of the present application, an image processing method is provided. The method comprises the following steps:
determining a first image in response to an operation of a user;
identifying the action of a first object in the first image to obtain action information;
acquiring a second image, pre-calibrated characteristic point information of a second object in the second image and a type of the second object;
according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with corresponding action;
and displaying the target image.
In yet another embodiment of the present application, an image processing method is provided. The method comprises the following steps:
responding to the operation of a user, and acquiring a first video;
identifying the action of a first object in the image frame of the first video to obtain the action information of the image frame;
determining an image of a second object, a type of the second object and feature point information of the second object;
according to the action information of the image frame of the first video and the characteristic point information of the second object, carrying out image processing corresponding to the type of the second object on the image to obtain a second object image frame corresponding to the image frame;
and according to the sequence of the image frames in the first video, playing second object image frames corresponding to the continuous image frames respectively so as to show a second video with corresponding continuous actions of the second object.
In yet another embodiment of the present application, an avatar processing method is provided. The avatar processing method comprises the following steps:
acquiring a user image;
identifying the user action in the user image to obtain action information;
determining an avatar, feature point information of the avatar and a type of the avatar;
and driving the avatar to act by using a driving algorithm corresponding to the type of the avatar according to the action information and the feature point information of the avatar, so as to obtain a target image of the avatar with the corresponding action.
In yet another embodiment of the present application, an image processing system is provided. The image processing system includes:
the server is used for pre-calibrating the characteristic point information of the second object in the second image;
the client is used for acquiring a first image; identifying the action of a first object in the first image to obtain action information; acquiring a second image and pre-calibrated characteristic point information of a second object in the second image from the server; determining a type to which the second object belongs; and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with a corresponding action.
In yet another embodiment of the present application, an electronic device is also provided. The electronic equipment comprises a memory and a processor, wherein the memory is used for storing programs; the processor is coupled to the memory and configured to execute the program stored in the memory to implement the steps in the above-mentioned embodiments of the image processing method.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program can implement the steps in the above-mentioned image processing method embodiments when executed by a computer.
The embodiment of the application also provides a computer program product. The computer program product comprises a computer program which, when executed by a computer, causes the computer to implement the steps in the image processing method embodiments described above.
In the technical solutions provided in the embodiments of the present application, after the motion of the first object in the first image is identified and the feature point information of the second object in the second image is determined, image processing corresponding to the type to which the second object belongs is performed on the second image according to the motion information of the first object and the feature point information of the second object, so as to obtain a simulated image in which the second object simulates the motion of the first object. Because different types of objects correspond to different image processing modes, second objects of various types can be driven to perform corresponding actions, which gives the scheme a wide application range. For example, some avatars (i.e. second objects) have limbs with which to perform the corresponding actions, while others have no limbs; the scheme provided by this embodiment can complete the driving process and realize motion simulation (or imitation) in both cases. In addition, in this embodiment, the process of determining the feature point information of the second object in the second image may be performed in advance, that is, offline. Therefore, based on the first image, the second image, and the predetermined feature point information of the second object in the second image, the second object in the second image can be processed online in real time according to the first image or each image frame in a video, to generate a target image or video in which the second object performs the corresponding action. The whole scheme can run on the client, places low demands on the client's hardware performance, and offers high processing efficiency, good real-time performance, and good user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an image processing system according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a calibration of a second image and a generation process of a simulation image, both implemented by a client according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a network architecture of an attitude estimation model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a TPose object provided in an embodiment of the present application;
FIG. 6 is a schematic view of a joint point of a human body according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a calibration algorithm provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of an image processing method according to another embodiment of the present application;
fig. 9 is a schematic flowchart of an image processing method according to another embodiment of the present application;
fig. 10 is a flowchart illustrating an avatar processing method according to another embodiment of the present application;
fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image processing apparatus according to another embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In some of the flows described in the specification, claims, and drawings of the present application, a number of operations appear in a particular order; these operations may be performed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, e.g., 101, 102, etc., are used merely to distinguish the operations from one another and do not by themselves imply any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. Note that the descriptions "first", "second", etc. in this document are used to distinguish different messages, devices, modules, and so on; they do not represent a sequence, nor do they require that the "first" and "second" be of different types. Furthermore, the embodiments described below are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present application.
To enable a virtual IP character to perform actions similar to those of a real person and to interact with users in real time, a technology for driving the avatar by a real person is required. At present, the mainstream avatar driving technology adopts physical motion capture, which is expensive and requires professional actors to participate.
With the continuous progress of computer vision technology based on deep learning, the avatar can also be driven by an algorithm. Compared with a physical motion capture scheme, the cost of avatar driving is thereby greatly reduced. Avatar driving is currently divided mainly into two-dimensional avatar driving and three-dimensional avatar driving. Because a three-dimensional scheme requires substantial capability and resources for avatar design, avatar driving, and avatar display, it suffers from a long development cycle and high driving difficulty. A two-dimensional scheme only needs a single two-dimensional image of the avatar, and a human-body key point estimation algorithm can drive the avatar from an ordinary picture or video, so it has the advantages of a robust algorithm, a simple pipeline, and convenient display.
According to the above background, the embodiments of the present application classify the avatars according to common driving requirements, and adopt different processing schemes for different types of avatars, so that the application range of the technical means provided by the embodiments of the present application is wider.
In addition, before the technical solutions provided by the embodiments of the present application are introduced, some terms appearing hereinafter are briefly described.
Calibration: and operating the two-dimensional image to obtain the positions of the key points of the bones corresponding to the two-dimensional image. For the two-dimensional image, the two-dimensional image can be manually finished off line, and only needs to be calibrated once; and the method can also be completed by adopting an automatic calibration mode.
Deep learning: the method is based on a machine learning technology, and carries out intelligent analysis on tasks such as vision, hearing and text by utilizing a deep convolutional neural network. For example, in the present embodiment, for the driving of the avatar, a convolutional neural network structure may be adopted.
Image driving: the machine learning method or the hardware method is used for driving the virtual image to move, and the virtual image can perform actions similar to or identical to those of a reference object (such as a real person in a video or an image).
The human skeleton key point detection is used as the basis of a human body action identification technology, namely, the positions of all joint points of a human body in an input image or an image frame in a video are found out through the characteristic extraction of a detection network, and then the action information of a scene where the human body is located is identified.
Fig. 1 shows a schematic structural diagram of an image processing system according to an embodiment of the present application. As shown in fig. 1, the image processing system includes a server 11 and a client 12. The server 11 is configured to pre-calibrate feature point information of a second object in a second image. A client 12, configured to obtain a first image; identifying the action of a first object in the first image to obtain action information; acquiring a second image and pre-calibrated characteristic point information of a second object in the second image from the server 11; determining a type to which the second object belongs; and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with a corresponding action.
The client 12 may capture, through a camera, an image of a motion, that is, the first image described above, which contains a first object (e.g., a person, an animal, or a robot) performing the motion. The client 12 may identify the action of the first object using an action recognition model. Alternatively, the user on the client side may retrieve the first image from the network side or from a local gallery. The first image may be a photograph or one frame of a video.
Further, the client 12 is configured with a posture estimation model and different types of corresponding driving algorithms. As shown in fig. 1, a plurality of driving algorithms may be deployed on the client side, and one driving algorithm may correspond to one type, or two or more driving algorithms may correspond to one type, which is not limited in this embodiment. Correspondingly, when the client 12 identifies the motion of the first object in the first image and obtains the motion information, it is specifically configured to: performing image data weight reduction processing on the first image to reduce the image data amount of the first image; inputting the processed image data of the first image into a posture estimation model, and executing the posture estimation model to output a key point heat map reflecting a posture and key point relation information; calculating the position of a key point of a first object in the first image based on the key point heat map; optimizing the positions of the key points according to the key point relation information; wherein the action information comprises the optimized key point position.
When the client 12 performs image processing corresponding to the type to which the second object belongs on the second image according to the action information and the feature point information of the second object to obtain a target image of the second object having a corresponding action, the method is specifically configured to:
calling a driving algorithm corresponding to the type of the second object; and taking the action information and the characteristic point information of the second object as the parameters of the driving algorithm, and executing the driving algorithm to output a target image of the second object with a corresponding action.
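For illustration only, the following is a minimal sketch of how a client might dispatch to a type-specific driving algorithm, with the action information and feature point information passed as parameters; `ObjectType`, `drive_limbed`, `drive_limbless`, and `process_second_image` are hypothetical names, not identifiers from the patent.

```python
from enum import Enum

class ObjectType(Enum):
    LIMBED = "first type (with limbs)"        # driven via joint points + sampling points
    LIMBLESS = "second type (without limbs)"  # driven via anchor points + rendered limbs

def drive_limbed(second_image, action_info, feature_points):
    """Placeholder for the driving algorithm of objects with limbs."""
    ...

def drive_limbless(second_image, action_info, feature_points):
    """Placeholder for the driving algorithm of objects without limbs."""
    ...

# One driving algorithm per type; two or more algorithms per type is also possible.
DRIVERS = {
    ObjectType.LIMBED: drive_limbed,
    ObjectType.LIMBLESS: drive_limbless,
}

def process_second_image(second_image, second_type, action_info, feature_points):
    # Call the driving algorithm corresponding to the type of the second object,
    # taking the action information and feature point information as parameters.
    driver = DRIVERS[second_type]
    return driver(second_image, action_info, feature_points)
```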
Further, when the server 11 calibrates the feature point information of the second object in the second image, the method is specifically configured to:
performing image recognition on the second image to recognize the type of the second object;
when the type of the second object is a first type with limbs, determining joint points of the second object and acquiring joint point information; wherein the feature point information comprises the joint point information;
and when the type of the second object is a second type without limbs, determining a first anchor point and acquiring first anchor point information, wherein the characteristic point information comprises the first anchor point information, and the first anchor point is used for positioning the position of limbs constructed on the second object.
In another implementation solution, the calibration of the second image and the generation process of the simulated image can both be implemented by the client. As shown in fig. 2, the user may create a second image corresponding to the avatar offline through the client 12 and trigger a calibration for the second image. Specifically, the client 12 is configured to perform image recognition on the second image to identify a type to which the second object belongs; when the type of the second object is a first type with limbs, determining joint points of the second object and acquiring joint point information; wherein the feature point information comprises the joint point information; and when the type of the second object is a second type without limbs, determining a first anchor point and acquiring first anchor point information, wherein the characteristic point information comprises the first anchor point information, and the first anchor point is used for positioning the position of limbs constructed on the second object. The client 12 is further configured to store the calibrated feature point information corresponding to the second image.
For example, when the user wants to use the avatar of the second image to interact with the network-side user through the instant messaging application, or the user wants to make an avatar video, the user can take a first image or video through the camera of the client or obtain the first image or video from the gallery of the client. At this point, as shown in fig. 2, the client enters the online portion, i.e., the client 12 is configured to: acquiring a first image; identifying the action of a first object in the first image to obtain action information; acquiring a second image and pre-calibrated characteristic point information of a second object in the second image; determining a type to which the second object belongs; and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with a corresponding action.
The actions of the second object mentioned in this context merit a brief explanation. The action of the second object may be exactly the same as the action of the first object, or it may imitate the action of the first object while also carrying characteristics of the second object itself. For example, suppose the first object in the first image is a user. When the second object in the second image is processed according to the action made by the user in the first image, the image attribute of the second object (such as a cute, cool, or elegant style) can be taken into account in addition to the action information of the first object, the feature point information of the second object, and the type of the second object. The action made by the second object then has both the action characteristics of the first object and the characteristics conveyed by its own image attribute.
The client in the above embodiment may be, but is not limited to: desktop computers, notebook computers, mobile phones, tablet computers, intelligent wearable devices, and the like. The server may include, but is not limited to: a single server, a cluster of servers, a virtual server or cloud deployed on a server, and so forth. For more specific functions of the server and the client in the foregoing system embodiments, reference may be made to the following method embodiments.
Fig. 3 is a flowchart illustrating an image processing method according to an embodiment of the present application. The execution subject of the method provided by this embodiment may be a client in the system described above. The client may be a smart phone, a desktop computer, a notebook computer, a tablet computer, a smart wearable device, or the like, which is not limited in this embodiment. As shown in fig. 3, the method includes:
101. acquiring a first image;
102. identifying the action of a first object in the first image to obtain action information;
103. determining a second image, a second object in the second image, a type of the second object and characteristic point information of the second object;
104. and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with corresponding action.
In the above 101, the first image may be a photo taken by a user through a camera of the client, or a photo found from a gallery. Alternatively, the first image is an image of one frame in a video taken by the user, and so on. This embodiment is not particularly limited thereto.
In the above 102, in implementation, the motion of the first object in the first image may be recognized by using a pose estimation model, for example the pose estimation model with the structure shown in fig. 4. The structure of the pose estimation model in this embodiment is not limited to the structure shown in fig. 4; other network structures may also be used. In the example shown in fig. 4, the pose estimation model includes: a first network module 1, a second network module 2, a third network module 3, a plurality of fourth network modules 4, a fifth network module 5, and a sixth network module 6. The first network module 1 comprises a convolution layer with a 7 × 7 convolution kernel, an activation layer, and a pooling layer; the second network module 2 comprises a convolution layer with a 3 × 3 convolution kernel, an activation layer, and a pooling layer. The third network module 3 comprises a convolution layer with a 3 × 3 convolution kernel and an activation layer. There are four fourth network modules (which may also be referred to as residual modules), each of which may include a convolution layer and an activation layer. The fifth network module may include a convolution layer with a 3 × 3 convolution kernel and an activation layer. The sixth network module, which may also be referred to as the head module, may include four convolution layers + activation layers.
The data processing of the pose estimation model shown in fig. 4 can be described briefly as follows: the input data (which may be the first image or an image obtained by preprocessing the first image) is down-sampled 2-fold by the convolution layer with the 7 × 7 convolution kernel in the first network module 1, and then down-sampled to 1/4 resolution by the activation layer and the pooling layer. The output data of the first network module 1 is then down-sampled to 1/8 resolution through the convolution layer with the 3 × 3 convolution kernel, the activation layer, and the pooling layer in the second network module 2. The output data of the second network module 2 is input into the third network module 3, where feature transformation is carried out through the convolution layer with the 3 × 3 convolution kernel and the activation layer. The output data of the third network module 3 is input into the fourth network modules 4, and higher-level features are extracted through the four fourth network modules 4. The output data of the last fourth network module undergoes feature transformation through the convolution layer with the 3 × 3 convolution kernel and the activation layer of the fifth network module 5. The output data of the fifth network module 5 is input into the sixth network module 6, processed by the four convolution layers + activation layers there, and the model outputs heatmaps (key point heat maps) and pafs (key point relation information).
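Purely as an illustration of the module layout described above, a PyTorch-style sketch follows; the channel widths, strides, ReLU activations, residual skip connections, and the separate 1 × 1 output convolutions for heatmaps and pafs are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Fourth network module: convolution + activation with an assumed skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv(x))

class PoseEstimator(nn.Module):
    def __init__(self, num_keypoints=14, num_paf_channels=26, ch=64):
        super().__init__()
        # Module 1: 7x7 conv (stride 2) + activation + pooling -> 1/4 resolution
        self.m1 = nn.Sequential(nn.Conv2d(3, ch, 7, stride=2, padding=3),
                                nn.ReLU(inplace=True), nn.MaxPool2d(2))
        # Module 2: 3x3 conv + activation + pooling -> 1/8 resolution
        self.m2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                nn.ReLU(inplace=True), nn.MaxPool2d(2))
        # Module 3: 3x3 conv + activation (feature transformation)
        self.m3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        # Module 4: four residual blocks extracting higher-level features
        self.m4 = nn.Sequential(*[ResidualBlock(ch) for _ in range(4)])
        # Module 5: 3x3 conv + activation (feature transformation)
        self.m5 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        # Module 6 (head): four conv + activation layers, then heatmap / paf outputs
        self.m6 = nn.Sequential(*[nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                                nn.ReLU(inplace=True)) for _ in range(4)])
        self.heatmap_out = nn.Conv2d(ch, num_keypoints, 1)
        self.paf_out = nn.Conv2d(ch, num_paf_channels, 1)

    def forward(self, x):
        x = self.m6(self.m5(self.m4(self.m3(self.m2(self.m1(x))))))
        return self.heatmap_out(x), self.paf_out(x)
```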
Based on the output heatmaps and pafs (key point relation information, i.e. a key point connection vector field), the motion information of the first object in the first image can be determined. After the heatmaps and pafs are obtained, a post-processing step on the network output is required: for the heatmaps, a two-dimensional maximum estimation can be computed independently for each key point (such as a human skeleton point) to obtain an estimated position of each key point. Meanwhile, in order to eliminate the influence of different human bodies, the pafs are evaluated to obtain the key point pairs with the maximum correlation (i.e. the most strongly associated skeleton point pairs), and a matching algorithm is used to optimize the key point positions estimated from the heatmaps, yielding the final key point positions. The key point positions embody the action of the first object in the first image; therefore, the action information in this embodiment may include the key point information.
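A minimal sketch of the per-keypoint two-dimensional maximum estimation described above; the paf-based pairing and matching step is omitted, and the function name and score threshold are illustrative assumptions.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps, input_size, score_thresh=0.1):
    """Estimate one position per keypoint by taking the 2D maximum of each heat map.

    heatmaps: array of shape (num_keypoints, H, W) output by the pose estimation model.
    input_size: (height, width) of the model input, used to map back to image coordinates.
    Returns a list of (x, y, score); the paf-based pairing/refinement step is omitted here.
    """
    num_kpts, h, w = heatmaps.shape
    scale_y, scale_x = input_size[0] / h, input_size[1] / w
    keypoints = []
    for k in range(num_kpts):
        idx = np.argmax(heatmaps[k])
        y, x = np.unravel_index(idx, (h, w))
        score = float(heatmaps[k, y, x])
        if score < score_thresh:
            keypoints.append(None)          # keypoint not visible / not confident
        else:
            keypoints.append((x * scale_x, y * scale_y, score))
    return keypoints
```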
In the above 103, the second image may be made by the user through a corresponding APP, or may be obtained by the user from the network side, and so on; this embodiment does not limit this. To determine the second object in the second image, an image recognition technology can identify an object in the second image that can make actions or that has a roughly humanoid structure, namely the second object. Objects that can make actions may include, but are not limited to, humans, animals, and robots; an object with a humanoid structure can be understood simply as a structure comprising a head and a torso, or a structure comprising a head, a torso, and limbs.
In an implementation scheme, in step 103 of this embodiment, "determining the type of the second object in the second image and the feature point information of the second object" may be implemented by the following steps:
1031. performing image recognition on the second image to recognize the type of the second object;
1032. when the type of the second object is a first type with limbs, determining joint points of the second object and acquiring joint point information; wherein the feature point information comprises the joint point information;
1033. and when the type of the second object is a second type without limbs, determining a first anchor point and acquiring first anchor point information, wherein the characteristic point information comprises the first anchor point information, and the first anchor point is used for positioning the position of limbs constructed on the second object.
Further, when the type of the second object is the first type with limbs, in addition to acquiring the joint point information, the pixel points of the second object also need to be sampled. That is, the method provided by this embodiment may further include the following steps:
when the type of the second object is a first type with limbs, sampling pixel points of the second object to obtain sampling point information;
that is, the feature point information includes sampling point information in addition to the joint point information.
Further, the method provided by this embodiment may further include the following steps:
105. when the type of the second object is a second type without limbs, judging whether the head of the second object is separated from the torso;
106. when the head and the body of the second object are separated, determining a second anchor point and acquiring second anchor point information;
wherein the feature point information further includes the second anchor point information, the second anchor point being used to locate a connection position of the head and the torso of the second subject.
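The calibrated feature point information produced by steps 1031 to 1033 and 105 to 106 can be pictured, purely as an illustration, with a data structure like the following; the class and field names are assumptions rather than terms from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates in the second image

@dataclass
class CalibrationResult:
    object_type: str                                            # "limbed" (first type) or "limbless" (second type)
    joint_points: List[Point] = field(default_factory=list)     # first type: 14 joint positions
    sampling_points: List[Point] = field(default_factory=list)  # first type: sampled body pixel points
    limb_anchors: List[Point] = field(default_factory=list)     # second type: left/right arm, left/right leg anchors
    head_anchor: Optional[Point] = None                         # second type, head separated: head-torso connection
```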
Steps 1031 to 1033 and steps 105 to 106 may be performed offline, and their flow may be as shown in FIG. 7. For example, the user may create a two-dimensional TPose image of the avatar using design tools, as shown in fig. 5. The so-called TPose is a standard pose commonly used in two-dimensional and three-dimensional driving. Specifically, in TPose, the figure stands with both arms stretched out horizontally, the body and legs upright, and the head facing forward. Assuming that the two-dimensional TPose image shown in fig. 5 is the second image in this embodiment, the second image can be calibrated using a calibration algorithm. Calibration refers to the process of locating the joint points of the second object. In short, by comparing the correspondence between the avatar and a real person, the joint points are marked at the corresponding pixel positions in the second image. Fig. 6 shows a schematic diagram of the human body joint points (or skeleton points). Only after calibration can the second object be driven by a real person. Specifically, the calibration algorithm falls into two categories according to whether the second object in the second image has limbs.
When the second object belongs to the first type with limbs, the position coordinates of 14 human-body joint points can be obtained using the calibration algorithm, such as the 14 joint point position coordinates shown in fig. 6. At the same time, sampling points are selected for the four limbs, the head, and the main trunk area of the body. The sampling points are pixel points within the second object; their purpose is to allow the second object to be deformed according to the sampling points during driving. In particular, the sampling points may be selected using an algorithm that uniformly samples points within the body, as sketched below.
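A minimal sketch of the uniform in-body sampling mentioned above, assuming a binary mask of the relevant body part of the second object is available; the function name and grid step are illustrative assumptions.

```python
import numpy as np

def uniform_body_samples(body_mask, step=8):
    """Pick sampling points on a regular grid, keeping only points inside the body region.

    body_mask: H x W boolean array marking pixels that belong to one part of the
    second object (e.g. a limb, the head, or the trunk) in the TPose image.
    """
    h, w = body_mask.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1)
    inside = body_mask[points[:, 1], points[:, 0]]
    return points[inside]   # (N, 2) array of (x, y) sampling points
```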
When the second object belongs to the second type without limbs, limbs are added for display by the rendering module during the final driving process. Therefore, during calibration, only the connection points between the limbs and the body, referred to as anchor points in the embodiments of the present application, need to be located. In practice, four anchor points need to be located: left arm, right arm, left leg, and right leg. The anchor points are chosen at positions suitable for the arms and legs of the avatar, so that the limbs added at rendering time appear natural. Accordingly, when calibrating the anchor points, positions suitable for constructing limbs can be determined on the body from the body contour size and a preset body proportion.
In the above step 104, as shown in fig. 1 and fig. 2, after the motion information of the first object and the feature point information of the second object are determined, a driving algorithm corresponding to a type to which the second object belongs may be obtained from a plurality of driving algorithms, and the second object in the second image is subjected to image processing to obtain a target image in which the second object has a corresponding motion.
In the technical solution provided in this embodiment, after the motion of the first object in the first image is recognized and the feature point information of the second object in the second image is determined, image processing corresponding to the type to which the second object belongs is performed on the second image according to the motion information of the first object and the feature point information of the second object, so as to obtain a simulation image in which the second object simulates the motion of the first object. Because different types of objects correspond to different image processing modes, second objects of various types can be driven to perform simulated actions, giving the scheme a wide application range. For example, some avatars (i.e., second objects) have limbs with which to perform the simulated action, while others have no limbs; the scheme provided by this embodiment can drive the action simulation in both cases. In addition, in this embodiment, the process of determining the feature point information of the second object in the second image may be performed in advance, that is, offline. Therefore, based on the first image, the second image, and the predetermined feature point information of the second object in the second image, the second object in the second image can be processed online in real time according to the first image or each image frame in a video to generate a simulated image or video. The whole scheme can run on the client, the requirement on the client's hardware performance is not high, the generation of the simulated image is efficient, the real-time performance is good, and the user experience is good.
Steps 101, 102 and 104 in this embodiment may be processes performed online. For example, after the user triggers a function of driving a second object simulation action of the second image through the client, the user can shoot a first image or video through the client camera or obtain the first image or video from the local library, and the client can perform image processing on the second object in the second image according to the first image or video to generate a simulation image or video of the second object simulating the first object action in the first image or video. Because the second object in the second image is calibrated in advance, the client only needs to perform corresponding processing on the first image in real time, such as lightweight processing, attitude estimation model output post-processing, inter-frame smoothing processing and the like in the following.
In order to reduce the amount of data processing, the image data of the first image may be subjected to weight reduction processing when the motion of the first object in the first image is recognized, so that the amount of data input to the posture estimation model can be reduced. That is, in a possible technical solution, the step 102 "identify the motion of the first object in the first image to obtain motion information", specifically includes:
1021. performing image data weight reduction processing on the first image to reduce the image data amount of the first image;
1022. inputting the processed image data of the first image into a posture estimation model, and executing the posture estimation model to output a key point heat map reflecting a posture and key point relation information;
1023. calculating the position of a key point of a first object in the first image based on the key point heat map;
1024. optimizing the positions of the key points according to the key point relation information;
wherein the action information comprises the optimized key point position.
In specific implementation, the first image may be an image acquired by the camera of the user's client or a frame of a video, or a locally stored image or a frame of a locally stored video. The image data weight reduction processing performed on the first image may include scaling the first image and normalizing the RGB values of the image pixels. Most client devices (e.g., smartphones) have a 16:9 picture aspect ratio. In order to ensure real-time performance on the client device while preserving the image processing effect, the input resolution of the first image should not be too large; for example, the first image may be scaled to 256 pixels high and 144 pixels wide. The RGB values of the pixels in the scaled image are then normalized, e.g., by subtracting 128 from all three channel values and dividing by 256.
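A minimal sketch of this preprocessing, assuming OpenCV and a BGR input image; the function name and the channel-first output layout are assumptions rather than details from the patent.

```python
import cv2
import numpy as np

def preprocess_first_image(image_bgr, target_hw=(256, 144)):
    """Lightweight preprocessing of the first image before pose estimation.

    Scales the image to 256 (height) x 144 (width) and normalizes RGB values
    by subtracting 128 and dividing by 256, as described above.
    """
    h, w = target_hw
    resized = cv2.resize(image_bgr, (w, h), interpolation=cv2.INTER_LINEAR)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).astype(np.float32)
    normalized = (rgb - 128.0) / 256.0
    # HWC -> CHW with a batch dimension, the usual layout for a convolutional model input.
    return np.transpose(normalized, (2, 0, 1))[np.newaxis, ...]
```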
Further, it is found through specific practice that, if the first image in the embodiment is a frame image in a video, the problem of key point jitter may occur. To solve the jitter problem of the estimated keypoint position, the embodiment needs to add inter-frame smoothing processing. That is, the method described in this embodiment may further include the following steps:
107. acquiring a historical frame image before the first image in the video;
108. and smoothing the action information according to the positions of the key points in the historical frame image.
The historical frame image may be the image of the frame immediately before the first image, or there may be more than one historical frame image, such as the images of the two frames before the first image. In specific implementation, the action information (i.e. the key point information mentioned above) is smoothed according to the key point positions in the historical frame images by superimposing a OneEuro smoothing algorithm and a Kalman filtering smoothing algorithm. Practical tests show that combining the OneEuro smoothing algorithm with the Kalman filtering smoothing algorithm solves the key point jitter problem well, and the effect is stable in practical use. Of course, this embodiment is not limited to the OneEuro smoothing algorithm and the Kalman filtering smoothing algorithm; other smoothing algorithms may also be used.
In addition, the contents of the OneEuro smoothing algorithm and the Kalman filtering smoothing algorithm can be referred to related documents, which are not limited herein nor described in detail.
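For illustration only, a compact OneEuro filter applied to a single keypoint coordinate is sketched below; the parameter values are generic defaults rather than values from the patent, and the Kalman filtering stage that the embodiment superimposes is omitted.

```python
import math

class OneEuroFilter:
    """Minimal OneEuro filter for smoothing one keypoint coordinate across frames."""

    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.007, d_cutoff=1.0):
        self.freq, self.min_cutoff, self.beta, self.d_cutoff = freq, min_cutoff, beta, d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, freq):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Smooth the derivative, then adapt the cutoff to the speed of the signal:
        # slow motion is smoothed strongly (less jitter), fast motion less (less lag).
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```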
After the above processing is completed, the final driving processing can be entered. That is, in an achievable technical solution, the step 104 "performing image processing corresponding to the type to which the second object belongs on the second image according to the motion information and the feature point information of the second object to obtain a target image of the second object with a corresponding motion" in the present embodiment may specifically include:
1041. when the type of the second object is a first type with limbs, calculating feature point information of the second object after transformation according to the action information and the feature point information of the second object; adjusting pixel points of a second object in the second image according to the calculated characteristic point information to obtain a target image of the second object with corresponding action;
1042. when the type of the second object is a second type without limbs, determining limb construction positions on the second image according to the characteristic point information of the second object; based on the action information, pixel points of a second object in the second image are adjusted, and a limb of the second object is constructed at the limb construction position on the second image, so that a target image of the second object with a corresponding action is obtained.
The above embodiments list the type with limbs and the type without limbs, but other types may exist, and the specific types can be configured according to design requirements. One type may correspond to one driving algorithm, or one driving algorithm may correspond to multiple types, which is not limited in this embodiment.
Likewise, the second object type is exemplified to include a first type having limbs and a second type having no limbs. According to the types, the client is deployed with driving algorithms respectively corresponding to the two types in the embodiment.
For the first type with limbs, the 14 joint points (i.e. skeleton points) can be divided into six parts: the four limbs, the head, and the trunk, and each part is operated on separately in this embodiment. For each part, the new position of the corresponding joint point of the second object, together with its offset and rotation angle relative to the feature point information calibrated in TPose, is calculated with reference to the motion information of the first image. After the rotation angle and offset are obtained, the same operation is applied to the sampling points of that part of the second object in the second image, ensuring that the sampling points of each part rotate and shift in the same way as the joint points. Then, a rotation matrix is constructed for the sampling points, and each part is rotated through the warpAffine function to obtain the new positions of the pixel points of each part of the second object.
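A rough sketch of rotating one body part with warpAffine, assuming each part is driven by the angle between its TPose bone and the driven bone; the decomposition into parts, the handling of offsets, the compositing of parts back into the full image, and the rotation sign convention are simplifications, and the function and parameter names are illustrative.

```python
import math
import cv2
import numpy as np

def rotate_part(part_image, tpose_parent, tpose_child, new_parent, new_child):
    """Rotate one body part of the second object so its bone matches the driven pose.

    tpose_parent/tpose_child: the bone endpoints calibrated on the TPose image.
    new_parent/new_child: the corresponding endpoints derived from the action information.
    Returns the rotated part image; translation and compositing are omitted.
    """
    def angle(p, c):
        return math.degrees(math.atan2(c[1] - p[1], c[0] - p[0]))

    # Rotation needed to move the TPose bone direction onto the driven bone direction.
    delta = angle(new_parent, new_child) - angle(tpose_parent, tpose_child)
    # Rotate the part around its parent joint; the same transform would be applied to
    # the sampling points of this part so they stay consistent with the joint points.
    # (The sign of the angle depends on the image coordinate convention.)
    center = (float(tpose_parent[0]), float(tpose_parent[1]))
    rot = cv2.getRotationMatrix2D(center, -delta, 1.0)
    h, w = part_image.shape[:2]
    return cv2.warpAffine(part_image, rot, (w, h))
```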
For the second type without limbs, the feature point information (anchor points) of the second object may be divided into the four limbs and the trunk. For each limb, with reference to the action information of the first image, a third-order (cubic) Bezier curve is constructed at the position represented by the anchor point information, producing a series of discrete points that represent the position of the limb; these discrete points are then rendered by the engine in the rendering stage. For the trunk, the global motion amount of the anchor point on the second object is converted into a motion amount of the second object in the second image.
Similarly, for the second type without limbs, the driving algorithms can be further classified into driving algorithms corresponding to the head-body separation type and driving algorithms corresponding to the head-body connection type.
For the head-body separation type, with reference to the motion information of the first image, a Bezier curve for each limb is first constructed at the position represented by the first anchor point information; the motion distance of the head relative to the trunk is then calculated at the position represented by the second anchor point information. Finally, for the trunk, the global motion amount of the second object at the positions corresponding to the first anchor point information and the second anchor point information is converted into a motion amount of the second object in the second image.
For the head-body connection type, the feature point information (anchor points) of the second object may likewise be divided into the four limbs and the trunk. For each limb, with reference to the action information of the first image, a third-order (cubic) Bezier curve is constructed at the position represented by the anchor point information, producing a series of discrete points that represent the position of the limb; these discrete points are rendered by the engine in the rendering stage. For the trunk, the global motion amount of the anchor point on the second object is converted into a motion amount of the second object in the second image.
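As a sketch of the limb construction used for the second type, the following samples discrete points on a cubic Bezier curve; how the four control points are derived from the anchor point and the action information is not specified here, and the function name is an illustrative assumption.

```python
import numpy as np

def cubic_bezier_points(p0, p1, p2, p3, n=32):
    """Sample n discrete points on a third-order (cubic) Bezier curve.

    p0 is the anchor point on the body, p3 the limb end position derived from the
    action information, and p1/p2 are intermediate control points; the rendering
    engine then draws the limb through the returned points.
    """
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = (np.asarray(p, dtype=np.float32) for p in (p0, p1, p2, p3))
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
           + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3
```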
After the driving of the second object in the second image is completed by using the driving algorithm, a visualization module is needed to display the second object after the driving calculation. Namely, the method provided by the present embodiment further includes:
rendering a simulation image of the second object simulating the action of the first object in the first image based on the calculation result output by the driving algorithm.
In specific implementation, the driven second object may be rendered and displayed by using an open source tool such as OpenCV or OpenGL.
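A minimal OpenCV-based sketch of this visualization step, assuming the rendered limbs are supplied as lists of discrete Bezier points; an OpenGL pipeline could be used instead, and the function and parameter names are illustrative.

```python
import cv2
import numpy as np

def render_and_show(target_image_bgr, limb_point_lists, window_name="driven avatar"):
    """Draw added limbs (lists of discrete Bezier points) and display the driven result."""
    canvas = target_image_bgr.copy()
    for pts in limb_point_lists:
        pts = np.asarray(pts, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(canvas, [pts], False, (60, 60, 60), 8)  # draw each limb as a thick polyline
    cv2.imshow(window_name, canvas)
    cv2.waitKey(1)   # near-real-time refresh when called once per processed frame
```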
In summary, according to the technical solution provided in the embodiments of the present application, before image processing is performed, the type of the second object in the second image is determined, and the second image is processed with the image processing method corresponding to that type, so as to obtain a simulation image in which the second object simulates the action of the first object in the first image. The scheme provided by the embodiments of the present application overcomes the limitation in the prior art that only specific avatars can be driven, and therefore has a wide application range. In addition, in the embodiments of the present application, the calibration of the second object in the second image can be performed in an offline stage, and the calibrated result can be stored for later use. In the online stage, only the calibrated second image and the feature point information of the second object need to be retrieved; by identifying the action of the first object in the first image, the simulated image can be generated quickly according to the driving algorithm corresponding to the type of the second object. The whole process has good real-time performance, the simulated image effect is good, and the requirement of real-time operation on the client side is met.
Fig. 8 is a flowchart illustrating an image processing method according to an embodiment of the present application. As shown, the execution subject of the method may be a client in an image processing system, the method comprising:
201. determining a first image in response to an operation of a user;
202. identifying the action of a first object in the first image to obtain action information;
203. acquiring a second image, pre-calibrated characteristic point information of a second object in the second image and a type of the second object;
204. according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with corresponding action;
205. displaying the target image.
In 201, the first image may be an image or a frame of image in a video captured by a user through a camera, or an image or a frame of image in a video selected by the user from a gallery, which is not specifically limited in this embodiment.
For the contents of the above steps 202 to 204, reference may be made to the related description above, which is not repeated herein.
Further, the step 203 of obtaining the second image, the pre-calibrated feature point information of the second object in the second image, and the type of the second object may specifically include:
2031. responding to an image selection operation of a user, and determining the selected second image and a type of the second object in the second image;
2032. sending a request to a server for the second image;
2033. and receiving the characteristic point information of the second object in the second image which is fed back by the server and is pre-calibrated.
Fig. 9 is a flowchart illustrating an image processing method according to still another embodiment of the present application. As shown in fig. 9, the method includes:
301. responding to the operation of a user, and acquiring a first video;
302. identifying the action of a first object in the image frame of the first video to obtain the action information of the image frame;
303. determining an image of a second object, a type of the second object and feature point information of the second object;
304. according to the action information of the image frame of the first video and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the image of the second object to obtain a second object image frame corresponding to the image frame;
305. and according to the sequence of the image frames in the first video, playing second object image frames corresponding to the continuous image frames respectively so as to show a second video with corresponding continuous actions of the second object.
Similarly, the more detailed contents of each step 301 to 305 in this embodiment can be referred to the related description above, and are not repeated herein.
The following embodiments apply the technical solution provided by the present application to an avatar-driving scene. For example, a user may want to create a personal avatar or emoticon sticker while using an application (e.g., an instant messaging application or a social application) on a client. The user can take a photo of a favorite action, such as a dance move or a funny pose, and then use the technical solution provided by the embodiments of the present application to process the image of a favorite avatar, obtaining a target image of the avatar performing the corresponding action. The user can then use the target image as a new profile picture or emoticon sticker. As another example, if the user wants to make an avatar animation, the user can first shoot a video of dancing, martial arts, or the like; then process the image of the favorite avatar with the technical solution provided by the embodiments of the present application to obtain a plurality of avatar image frames; and play those avatar image frames to obtain an avatar video in which the avatar performs dance or martial arts movements similar or identical to those in the video.
Fig. 10 is a flowchart illustrating an avatar processing method according to an embodiment of the present application. As shown, the method comprises:
401. acquiring a user image;
402. identifying the user action in the user image to obtain action information;
403. determining an avatar, feature point information of the avatar and a type of the avatar;
404. and driving the virtual image to act by using a driving algorithm corresponding to the type of the virtual image according to the action information and the characteristic point information of the virtual image so as to obtain a target image of the virtual image with corresponding action.
Similarly, the more detailed contents of each step 401 to 404 in this embodiment can be referred to the related description above, and are not repeated herein.
The embodiments of the present application provide an innovative driving algorithm that can drive an object in an image in real time on the client. On the basis of estimating the joint points of the first object in the first image with a deep convolutional neural network, any avatar can be made to simulate the action of the first object in the first image by using the calibration algorithm and the driving algorithm, which offers a robust algorithm, a simple pipeline, and convenient display. The solution has broad application prospects and is suitable for many application scenarios, such as social applications, instant messaging applications, and e-commerce applications.
Fig. 11 shows a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus includes: the device comprises an acquisition module 21, a recognition module 22, a determination module 23 and a processing module 24. The acquiring module 21 is configured to acquire a first image. The identification module 22 is configured to identify a motion of the first object in the first image, and obtain motion information. The determining module 23 is configured to determine a second image, a second object in the second image, a type of the second object, and feature point information of the second object. The processing module 24 is configured to perform image processing corresponding to the type to which the second object belongs on the second image according to the motion information and the feature point information of the second object, so as to obtain a target image of the second object with a corresponding motion.
Further, when determining the type of the second object in the second image and the feature point information of the second object, the determining module 23 is specifically configured to:
performing image recognition on the second image to recognize the type of the second object;
when the type of the second object is a first type with limbs, determining joint points of the second object and acquiring joint point information; wherein the feature point information comprises the joint point information;
and when the type of the second object is a second type without limbs, determining a first anchor point and acquiring first anchor point information, wherein the characteristic point information comprises the first anchor point information, and the first anchor point is used for positioning the position of limbs constructed on the second object.
Further, the apparatus provided in this embodiment may further include a sampling module. The sampling module is used for sampling pixel points of the second object to obtain sampling point information when the type of the second object is a first type with limbs; correspondingly, the characteristic point information also comprises sampling point information.
Further, the apparatus provided in this embodiment may further include a judging module. The judging module is used for judging, when the type of the second object is a second type without limbs, whether the head of the second object is separated from the body. The determining module determines a second anchor point and acquires second anchor point information when the judging module determines that the head of the second object is separated from the body. Correspondingly, the feature point information further includes the second anchor point information, and the second anchor point is used for locating a connection position of the head and the torso of the second object.
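To make the calibration result concrete, the sketch below models the feature point information as a small data structure. The field names and the "limbed"/"limbless" type labels are assumptions made for this illustration only; the application itself does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class FeaturePointInfo:
    """Pre-calibrated feature point information for a second object.

    First type (with limbs): joint_points and sample_points are filled.
    Second type (without limbs): first_anchor locates where limbs are
    constructed; second_anchor is present only when the head is drawn
    separately from the body and locates the head-torso connection.
    """
    object_type: str                                 # "limbed" or "limbless" (assumed labels)
    joint_points: List[Point] = field(default_factory=list)
    sample_points: List[Point] = field(default_factory=list)
    first_anchor: Optional[Point] = None
    second_anchor: Optional[Point] = None

# Example: a limbless cartoon whose head is drawn separately from the body
sticker = FeaturePointInfo(
    object_type="limbless",
    first_anchor=(120.0, 240.0),   # where limbs will be constructed
    second_anchor=(120.0, 150.0),  # head-torso connection position
)
```

Because this information can be calibrated offline (e.g., on the server), the client only needs to load such a record at driving time.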
Further, when the identification module 22 identifies the motion of the first object in the first image and obtains the motion information, it is specifically configured to:
performing image data weight reduction processing on the first image to reduce the image data amount of the first image;
inputting the processed image data of the first image into a posture estimation model, and executing the posture estimation model to output a key point heat map reflecting a posture and key point relation information;
calculating the position of a key point of a first object in the first image based on the key point heat map;
optimizing the positions of the key points according to the key point relation information;
wherein the action information comprises the optimized key point position.
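As an illustration of how key point positions can be read off a key point heat map, the following sketch takes the per-key-point argmax and rescales it back to the original image size. The (K, H, W) heat-map layout and the simple argmax decoding are assumptions for this example, not the specific posture estimation model referred to above.

```python
import numpy as np

def decode_keypoints(heatmaps: np.ndarray, image_size: tuple) -> np.ndarray:
    """Decode key point positions from a stack of heat maps.

    heatmaps:   array of shape (K, H, W), one heat map per key point.
    image_size: (width, height) of the original first image.
    Returns an array of shape (K, 3): x, y in image coordinates plus confidence.
    """
    num_kp, hm_h, hm_w = heatmaps.shape
    img_w, img_h = image_size
    keypoints = np.zeros((num_kp, 3), dtype=np.float32)
    for k in range(num_kp):
        flat_idx = int(np.argmax(heatmaps[k]))
        y, x = divmod(flat_idx, hm_w)            # row, column of the peak
        keypoints[k, 0] = x * img_w / hm_w       # rescale to image coordinates
        keypoints[k, 1] = y * img_h / hm_h
        keypoints[k, 2] = heatmaps[k, y, x]      # peak value as confidence
    return keypoints
```

The subsequent optimization step using the key point relation information (e.g., enforcing plausible limb connections) would then refine these raw argmax positions.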
Further, the first image is a frame image in a video. Correspondingly, the obtaining module 21 in the apparatus of this embodiment may be further configured to obtain a historical frame image before the first image in the video. The processing module can also be used for smoothing the action information according to the positions of key points in the historical frame images.
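One common way to smooth the action information against historical frames is an exponential moving average over the key point positions; this particular filter is only an assumed example of the smoothing mentioned above, not a filter specified by the application.

```python
import numpy as np

def smooth_keypoints(history: list, current: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Smooth current key point positions using the preceding frames.

    history: list of (K, 2) arrays from earlier frames, oldest first.
    current: (K, 2) array of key point positions for the current frame.
    alpha:   weight of the newer frame; smaller values smooth more.
    """
    if not history:
        return current
    smoothed = history[0].astype(np.float32)
    for frame_kp in history[1:] + [current]:
        smoothed = (1.0 - alpha) * smoothed + alpha * frame_kp
    return smoothed
```

Such temporal filtering suppresses jitter in the estimated key points, so the driven second object moves continuously from frame to frame.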
Further, when the processing module 24 performs image processing corresponding to the type to which the second object belongs on the second image according to the motion information and the feature point information of the second object to obtain a target image of the second object having a corresponding motion, specifically, the processing module is configured to:
when the type of the second object is a first type with limbs, calculating feature point information of the second object after transformation according to the action information and the feature point information of the second object; adjusting pixel points of a second object in the second image according to the calculated characteristic point information to obtain a target image of the second object with corresponding action;
when the type of the second object is a second type without limbs, determining limb construction positions on the second image according to the characteristic point information of the second object; adjusting pixel points of the second object in the second image based on the action information, and constructing limbs of the second object at the limb construction positions on the second image, so as to obtain a target image of the second object with a corresponding action. A rough sketch of both branches is given below.
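The sketch below illustrates, under simplifying assumptions, what the two branches could look like with OpenCV: for the first type, a single affine transform is estimated from the calibrated joint points to their transformed positions and applied to the whole image; for the second type, stick limbs are drawn at the first anchor point. Both the limb-segment encoding of the action information and the use of a single global transform are invented for this example; a real driver would deform the object piecewise.

```python
import cv2
import numpy as np

def drive_limbed(avatar_img: np.ndarray, joints: np.ndarray,
                 new_joints: np.ndarray) -> np.ndarray:
    """Rough first-type driving: map calibrated joint points to their
    transformed positions with one affine transform (illustrative only)."""
    matrix, _ = cv2.estimateAffinePartial2D(joints.astype(np.float32),
                                            new_joints.astype(np.float32))
    if matrix is None:                       # estimation can fail with too few points
        return avatar_img.copy()
    h, w = avatar_img.shape[:2]
    return cv2.warpAffine(avatar_img, matrix, (w, h))

def drive_limbless(avatar_img: np.ndarray, first_anchor, limb_segments) -> np.ndarray:
    """Rough second-type driving: construct limbs at the first anchor point.

    limb_segments: list of ((dx1, dy1), (dx2, dy2)) offsets from the anchor,
    assumed here to be derived from the action information.
    """
    out = avatar_img.copy()
    ax, ay = int(first_anchor[0]), int(first_anchor[1])
    for (dx1, dy1), (dx2, dy2) in limb_segments:
        p1 = (ax + int(dx1), ay + int(dy1))
        p2 = (ax + int(dx2), ay + int(dy2))
        cv2.line(out, p1, p2, color=(40, 40, 40), thickness=6, lineType=cv2.LINE_AA)
    return out
```

The key point of the dispatch is that a second object with limbs only needs its existing pixels moved, whereas a second object without limbs additionally needs limbs constructed at positions located by the anchor points.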
Here, it should be noted that: the image processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, and is not described herein again.
An image processing apparatus according to another embodiment of the present application, as shown in fig. 12, includes: a determination module 31, an identification module 32, an acquisition module 33, a processing module 34 and a display module 35. The determining module 31 is configured to determine the first image in response to an operation of a user. The recognition module 32 is configured to recognize a motion of the first object in the first image, and obtain motion information. The obtaining module 33 is configured to obtain a second image, pre-calibrated feature point information of a second object in the second image, and a type of the second object. The processing module 34 is configured to perform image processing corresponding to the type of the second object on the second image according to the motion information and the feature point information of the second object, so as to obtain a target image of the second object with a corresponding motion. The display module 35 is configured to display the target image.
Further, when the obtaining module obtains a second image, the pre-calibrated feature point information of a second object in the second image, and the type of the second object, the obtaining module is specifically configured to:
responding to an image selection operation of a user, and determining the selected second image and a type of the second object in the second image;
sending a request to a server for the second image;
and receiving the characteristic point information of the second object in the second image which is fed back by the server and is pre-calibrated.
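For illustration only, the client's request for the pre-calibrated feature point information could look roughly as follows; the endpoint URL, request fields and response format are invented for this sketch and are not defined by the application.

```python
import requests

def fetch_feature_points(server_url: str, image_id: str, timeout: float = 5.0) -> dict:
    """Request pre-calibrated feature point information for the selected
    second image from the server (endpoint and payload are assumptions)."""
    resp = requests.post(
        f"{server_url}/feature-points",      # hypothetical endpoint
        json={"image_id": image_id},
        timeout=timeout,
    )
    resp.raise_for_status()
    # Assumed response shape: {"type": "...", "feature_points": {...}}
    return resp.json()
```

Because the calibration is done in advance on the server, this request only retrieves a small record at runtime, which keeps the online processing on the client lightweight.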
Here, it should be noted that: the image processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, and is not described herein again.
Another embodiment of the present application provides an image processing apparatus having a structure similar to that shown in fig. 12. The image processing apparatus includes: an acquisition module, an identification module, a determination module, a processing module and a display module. The acquisition module is used for acquiring a first video in response to an operation of a user. The identification module is used for identifying the action of a first object in an image frame of the first video to obtain the action information of the image frame. The determining module is used for determining an image of a second object, a type of the second object and feature point information of the second object. The processing module is used for performing, on the image of the second object, image processing corresponding to the type of the second object according to the action information of the image frame of the first video and the feature point information of the second object, so as to obtain a second object image frame corresponding to the image frame. The display module is used for playing, according to the sequence of the image frames in the first video, the second object image frames respectively corresponding to the consecutive image frames, so as to display a second video in which the second object performs the corresponding continuous actions. A minimal per-frame loop illustrating this flow is given below.
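The per-frame flow can be sketched as a simple loop over the first video. Reading and writing are done with OpenCV here, and the pose-estimation and driving callables are passed in because their concrete form is not fixed by this description; the driven frame is assumed to have the same size as the avatar image.

```python
import cv2

def drive_video(first_video_path: str, avatar_img, feature_points, object_type,
                estimate_pose, drive, out_path: str) -> None:
    """Produce the second video: one driven second-object frame per input
    frame, written out in the original frame order."""
    cap = cv2.VideoCapture(first_video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    h, w = avatar_img.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            action_info = estimate_pose(frame)                       # action of the first object
            avatar_frame = drive(avatar_img, feature_points,
                                 object_type, action_info)           # second object image frame
            writer.write(avatar_frame)                               # keep the original frame order
    finally:
        cap.release()
        writer.release()
```

Playing the written frames in order then shows the second video in which the second object performs continuous actions corresponding to those in the first video.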
Here, it should be noted that: the image processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, and is not described herein again.
An embodiment of the present application further provides an avatar processing apparatus, which has a structure similar to the structure shown in fig. 11. The avatar processing apparatus includes: an acquisition module, an identification module, a determination module and a processing module. The acquisition module is used for acquiring a user image. The identification module is used for identifying the user action in the user image to obtain action information. The determining module is used for determining the avatar, the feature point information of the avatar and the type of the avatar. The processing module is used for driving the avatar to act by using a driving algorithm corresponding to the type of the avatar according to the action information and the feature point information of the avatar, so as to obtain a target image of the avatar with the corresponding action.
Here, it should be noted that: the virtual image processing apparatus provided in the above embodiments may implement the technical solutions described in the above method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the above method embodiments, and is not described herein again.
Fig. 13 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device includes a processor 42 and a memory 41. The memory 41 is configured to store one or more computer instructions; the processor 42 is coupled to the memory 41 and is configured to execute the one or more computer instructions (e.g., computer instructions implementing data storage logic) so as to implement the steps or functions described in the above embodiments of the image processing method.
Here, it should be noted that: in addition to implementing the steps or functions provided in the above embodiments of the image processing method, the processor may also implement other functions, for which reference may be made to the detailed contents in the foregoing embodiments; details are not described here again. The memory 41 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
Further, as shown in fig. 13, the electronic device further includes: a communication component 43, a power component 45, a display 44 and an audio component 46. Fig. 13 schematically shows only some of the components, which does not mean that the electronic device includes only the components shown in fig. 13.
Yet another embodiment of the present application provides a computer program product (not shown in any figure of the drawings). The computer program product comprises computer programs or instructions which, when executed by a processor, cause the processor to carry out the steps in the above-described method embodiments.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the method steps or functions provided by the foregoing embodiments when executed by a computer.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. An image processing method, comprising:
acquiring a first image;
identifying the action of a first object in the first image to obtain action information;
determining a second image, a second object in the second image, a type of the second object and characteristic point information of the second object;
and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with corresponding action.
2. The method of claim 1, wherein determining the type of the second object in the second image and the feature point information of the second object comprises:
performing image recognition on the second image to recognize the type of the second object;
when the type of the second object is a first type with limbs, determining joint points of the second object and acquiring joint point information; wherein the feature point information comprises the joint point information;
and when the type of the second object is a second type without limbs, determining a first anchor point and acquiring first anchor point information, wherein the characteristic point information comprises the first anchor point information, and the first anchor point is used for positioning the position of limbs constructed on the second object.
3. The method of claim 2, further comprising:
when the type of the second object is a first type with limbs, sampling pixel points of the second object to obtain sampling point information;
wherein the characteristic point information further includes sampling point information.
4. The method of claim 2, further comprising:
when the type of the second object is a second type without limbs, judging whether the head of the second object is separated from the body or not;
when the head and the body of the second object are separated, determining a second anchor point and acquiring second anchor point information;
wherein the feature point information further includes the second anchor point information, the second anchor point being used to locate a connection position of the head and the torso of the second subject.
5. The method of any one of claims 1 to 4, wherein identifying the motion of the first object in the first image, obtaining motion information, comprises:
performing image data weight reduction processing on the first image to reduce the image data amount of the first image;
inputting the processed image data of the first image into a posture estimation model, and executing the posture estimation model to output a key point heat map reflecting a posture and key point relation information;
calculating the position of a key point of a first object in the first image based on the key point heat map;
optimizing the positions of the key points according to the key point relation information;
wherein the action information comprises the optimized key point position.
6. The method of claim 5, wherein the first image is a frame of image in a video, and wherein the method further comprises:
acquiring a historical frame image before the first image in the video;
and smoothing the action information according to the positions of the key points in the historical frame image.
7. The method according to any one of claims 1 to 4, wherein performing image processing corresponding to a type to which the second object belongs on the second image according to the motion information and feature point information of the second object to obtain a target image of the second object with a corresponding motion comprises:
when the type of the second object is a first type with limbs, calculating feature point information of the second object after transformation according to the action information and the feature point information of the second object; adjusting pixel points of a second object in the second image according to the calculated characteristic point information to obtain a target image of the second object with corresponding action;
when the type of the second object is a second type without limbs, determining limb construction positions on the second image according to the characteristic point information of the second object; adjusting pixel points of the second object in the second image based on the action information, and constructing a limb of the second object at the limb construction position on the second image, so as to obtain a target image of the second object with a corresponding action.
8. An image processing method, comprising:
determining a first image in response to an operation of a user;
identifying the action of a first object in the first image to obtain action information;
acquiring a second image, pre-calibrated characteristic point information of a second object in the second image and a type of the second object;
according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with corresponding action;
and displaying the target image.
9. The method of claim 8, wherein obtaining a second image, pre-calibrated feature point information of a second object in the second image, and a type of the second object comprises:
responding to an image selection operation of a user, and determining the selected second image and a type of the second object in the second image;
sending a request to a server for the second image;
and receiving the characteristic point information of the second object in the second image which is fed back by the server and is pre-calibrated.
10. An image processing method, comprising:
responding to the operation of a user, and acquiring a first video;
identifying the action of a first object in the image frame of the first video to obtain the action information of the image frame;
determining an image of a second object, a type of the second object and feature point information of the second object;
according to the action information of the image frame of the first video and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the image of the second object to obtain a second object image frame corresponding to the image frame;
and according to the sequence of the image frames in the first video, playing second object image frames corresponding to the continuous image frames respectively so as to show a second video with corresponding continuous actions of the second object.
11. An avatar processing method, comprising:
acquiring a user image;
identifying the user action in the user image to obtain action information;
determining an avatar, feature point information of the avatar and a type of the avatar;
and driving the virtual image to act by using a driving algorithm corresponding to the type of the virtual image according to the action information and the characteristic point information of the virtual image so as to obtain a target image of the virtual image with corresponding action.
12. An image processing system, comprising:
the server is used for pre-calibrating the characteristic point information of the second object in the second image;
the client is used for acquiring a first image; identifying the action of a first object in the first image to obtain action information; acquiring a second image and pre-calibrated characteristic point information of a second object in the second image from the server; determining a type to which the second object belongs; and according to the action information and the characteristic point information of the second object, performing image processing corresponding to the type of the second object on the second image to obtain a target image of the second object with a corresponding action.
13. The system of claim 12, wherein
the client is provided with a posture estimation model and driving algorithms corresponding to different types;
when the client identifies the motion of the first object in the first image and obtains motion information, the client is specifically configured to: performing image data weight reduction processing on the first image to reduce the image data amount of the first image; inputting the processed image data of the first image into a posture estimation model, and executing the posture estimation model to output a key point heat map reflecting a posture and key point relation information; calculating the position of a key point of a first object in the first image based on the key point heat map; optimizing the positions of the key points according to the key point relation information; wherein the action information comprises the optimized key point position;
the client, when performing image processing corresponding to the type to which the second object belongs on the second image according to the action information and the feature point information of the second object to obtain a target image of the second object having a corresponding action, is specifically configured to:
calling a driving algorithm corresponding to the type of the second object;
and taking the action information and the characteristic point information of the second object as the parameters of the driving algorithm, and executing the driving algorithm to output a target image of the second object with a corresponding action.
14. An electronic device comprising a processor and a memory, wherein,
the memory to store one or more computer instructions;
the processor, coupled to the memory, configured to execute the one or more computer instructions to implement the steps of the method of any one of claims 1 to 7, or to implement the steps of the method of any one of claims 8 or 9, or to implement the steps of the method of claim 10 or 11.
CN202111545040.0A 2021-12-16 2021-12-16 Image processing method, virtual image processing method, image processing system and equipment Pending CN114333051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111545040.0A CN114333051A (en) 2021-12-16 2021-12-16 Image processing method, virtual image processing method, image processing system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111545040.0A CN114333051A (en) 2021-12-16 2021-12-16 Image processing method, virtual image processing method, image processing system and equipment

Publications (1)

Publication Number Publication Date
CN114333051A true CN114333051A (en) 2022-04-12

Family

ID=81052734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111545040.0A Pending CN114333051A (en) 2021-12-16 2021-12-16 Image processing method, virtual image processing method, image processing system and equipment

Country Status (1)

Country Link
CN (1) CN114333051A (en)

Similar Documents

Publication Publication Date Title
CN108875633B (en) Expression detection and expression driving method, device and system and storage medium
CN109636831B (en) Method for estimating three-dimensional human body posture and hand information
US10679046B1 (en) Machine learning systems and methods of estimating body shape from images
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
CN108416832B (en) Media information display method, device and storage medium
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
CN109144252B (en) Object determination method, device, equipment and storage medium
CN109035415B (en) Virtual model processing method, device, equipment and computer readable storage medium
WO2022051460A1 (en) 3d asset generation from 2d images
CN115244495A (en) Real-time styling for virtual environment motion
CN112308977B (en) Video processing method, video processing device, and storage medium
CN114202615A (en) Facial expression reconstruction method, device, equipment and storage medium
US20200118333A1 (en) Automated costume augmentation using shape estimation
CN115346262A (en) Method, device and equipment for determining expression driving parameters and storage medium
US20240169670A1 (en) Three-dimensional mesh generator based on two-dimensional image
JP7499346B2 (en) Joint rotation estimation based on inverse kinematics
CN111292234A (en) Panoramic image generation method and device
CN114333051A (en) Image processing method, virtual image processing method, image processing system and equipment
CN117916773A (en) Method and system for simultaneous pose reconstruction and parameterization of 3D mannequins in mobile devices
CN116266408A (en) Body type estimating method, body type estimating device, storage medium and electronic equipment
CN114005156A (en) Face replacement method, face replacement system, terminal equipment and computer storage medium
US20240020901A1 (en) Method and application for animating computer generated images
US20240135581A1 (en) Three dimensional hand pose estimator
CN111694423B (en) Positioning, grabbing, data processing and display method and device for augmented reality
CN117504296A (en) Action generating method, action displaying method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071587

Country of ref document: HK