CN115423752B - Image processing method, electronic equipment and readable storage medium - Google Patents

Image processing method, electronic equipment and readable storage medium

Info

Publication number
CN115423752B
CN115423752B (application CN202210927878.4A)
Authority
CN
China
Prior art keywords
image
target
gesture
person
electronic device
Prior art date
Legal status
Active
Application number
CN202210927878.4A
Other languages
Chinese (zh)
Other versions
CN115423752A (en)
Inventor
姚洋
高崇军
史廓
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202210927878.4A
Publication of CN115423752A
Application granted
Publication of CN115423752B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • G06T7/001Industrial image inspection using an image reference approach
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides an image processing method, an electronic device, and a readable storage medium, and relates to the technical field of image processing. The method can solve the problem that a blank region appears at the photographer's hand and the photographer's self-portrait pose is unnatural, and includes the following steps: the electronic device displays a first target image, where the first target image includes an image of a first person holding a first object in a first target pose; the electronic device selects a first reference image from M reference images, where the first reference image includes an image of the corresponding person in a second target pose, and the second target pose is different from the first target pose; the electronic device performs gesture migration on the first target image by using the first reference image to obtain a second target image, where the second target image includes an image of the first person in the second target pose and does not include an image of the first person holding the first object.

Description

Image processing method, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an electronic device, and a readable storage medium.
Background
Smartphones have developed to the point where photographing and video recording have become some of their most important features. As the photographing capability of smartphones becomes more and more powerful, more and more people use a smartphone instead of a camera to take photos. To achieve a wider shooting angle, the smartphone is generally fixed to a telescopic selfie stick, and multi-angle selfies are taken by freely adjusting the extension of the telescopic rod. However, when a selfie stick is used for selfies, part of the selfie stick may be captured, that is, the selfie stick may appear in the photo or video that is taken, which affects the user experience.
In existing schemes, the selfie stick can be removed from the image to improve the photographer's shooting experience. However, after the selfie stick is removed, a blank region appears at the photographer's hand, so that the photographer's self-portrait posture looks unnatural. Therefore, a solution is needed for the problem that a blank region appears at the hand and the photographer's self-portrait posture is unnatural.
Disclosure of Invention
The embodiments of the application provide an image processing method, an electronic device, and a readable storage medium, which are used to solve the problem that a blank region appears at the photographer's hand and the photographer's self-portrait posture is unnatural.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, an image processing method is provided and applied to an electronic device, where the electronic device stores M reference images, where the M reference images include at least one image of multiple poses of a person, the M reference images are not images photographed by a corresponding person holding a first object, and M is an integer greater than 1; the method comprises the following steps: the electronic device displays a first target image; wherein the first target image comprises an image of a first person holding the first object in the first target pose; the electronic device selects a first reference image from M reference images; the first reference image comprises an image of the corresponding person in a second target posture, and the second target posture is different from the first target posture; the electronic equipment adopts the first reference image to carry out gesture migration on the first target image so as to obtain a second target image; wherein the second target image includes an image of the first person in the second target pose, the second target image not including an image of the first person holding the first object.
Based on the first aspect, in the embodiment of the application, the electronic device stores M reference images, and these reference images are not images photographed while the corresponding person holds the first object. When the electronic device displays the first target image, because the first target image includes an image of the first person holding the first object in the first target pose, the electronic device can select the first reference image from the M reference images and use the first reference image to perform gesture migration on the first target image, so as to obtain the second target image. Because the first reference image includes an image of the corresponding person in a second target posture that differs from the first target posture, the second target image includes an image of the first person in the second target posture and does not include an image of the first person holding the first object. In other words, the processed second target image no longer contains the first person holding the first object, and the first person appears in a pose different from the first target pose, thereby solving the problem that a blank region appears at the hand and the photographer's self-portrait posture is unnatural.
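For illustration only, the overall flow described above can be sketched in Python; select_reference() and pose_transfer() are hypothetical placeholders for the concrete steps detailed in the later implementation manners, not the patented implementation itself.

```python
# Illustrative sketch of the claimed flow; helper names are hypothetical placeholders.
from typing import Any, List

def process_first_target_image(first_target_image: Any,
                               reference_images: List[Any],
                               second_target_pose: Any) -> Any:
    """Return a second target image: the first person in the second target pose,
    with no image of the first person holding the first object."""
    assert len(reference_images) > 1           # M is an integer greater than 1
    # Select a first reference image showing the second target pose.
    first_reference_image = select_reference(reference_images, second_target_pose)
    # Use the reference image to drive gesture migration on the first target image.
    second_target_image = pose_transfer(first_target_image, first_reference_image)
    return second_target_image
```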
In one implementation of the first aspect, the first object includes a selfie stick.
In this embodiment, since the first object includes a selfie stick, the problem that the selfie stick appears in the self-portrait image when the photographer holds the selfie stick for a selfie can be solved; in addition, the problem that, after only the selfie stick is removed, a blank region appears at the photographer's hand and the self-portrait posture is unnatural can also be solved.
In an implementation manner of the first aspect, the second target gesture is a default gesture preset in the electronic device; alternatively, the electronic device selecting the first reference image from the M reference images includes: the electronic device displays a first interface, where the first interface includes a plurality of gesture selection items, and each of the plurality of gesture selection items corresponds to one target gesture; in response to an operation on a first gesture selection item among the plurality of gesture selection items, the electronic device selects, from the M reference images, a reference image in the target gesture corresponding to the first gesture selection item as the first reference image, where the target gesture corresponding to the first gesture selection item is the second target gesture.
In this implementation manner, the electronic device can display a plurality of gesture selection items, and because each of the plurality of gesture selection items corresponds to one target gesture, the user can select one target gesture from the plurality of gesture selection items, and the electronic device processes the first target image according to the target gesture selected by the user to generate the second target image. In this way, the second target gesture of the first person in the second target image generated by the electronic device is the target gesture selected by the user, which solves the problem that a blank region appears at the user's hand and the self-portrait gesture is unnatural, and improves the user experience.
In an implementation manner of the first aspect, the M reference images include N reference images, where the N reference images are reference images in the target gesture corresponding to the first gesture selection item, N is an integer greater than 1, and N is less than or equal to M; selecting, from the M reference images, a reference image in the target pose corresponding to the first pose selection item as the first reference image includes: the electronic device selects, from the N reference images, the reference image whose target gesture has the maximum similarity to the first target gesture as the first reference image.
In this implementation manner, the reference images in the target gesture corresponding to the first gesture selection item include N reference images; on this basis, the electronic device can select, from the N reference images, the reference image whose target gesture has the maximum similarity to the first target gesture as the first reference image, thereby improving the image effect.
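As a non-authoritative illustration of the "maximum similarity" selection, the sketch below approximates pose similarity with the cosine similarity of normalized keypoint vectors; estimate_keypoints() is a hypothetical pose-estimation helper, and the actual similarity measure used by the electronic device is not specified here.

```python
import numpy as np

def normalize(kpts):
    # Center the keypoints on their mean and scale to unit norm so that images
    # taken at different distances remain comparable.
    kpts = kpts - kpts.mean(axis=0, keepdims=True)
    return kpts / (np.linalg.norm(kpts) + 1e-8)

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def most_similar_reference(target_image, candidate_references):
    # estimate_keypoints() is a hypothetical helper returning an array of
    # shape (K, 2) with K body keypoints.
    target_kpts = normalize(estimate_keypoints(target_image))
    best, best_score = None, -1.0
    for ref in candidate_references:
        score = cosine_similarity(target_kpts, normalize(estimate_keypoints(ref)))
        if score > best_score:
            best, best_score = ref, score
    return best
```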
In an implementation manner of the first aspect, the first gesture selection item is used to customize the target gesture; selecting, from the M reference images, a reference image in the target pose corresponding to the first pose selection item as the first reference image includes: the electronic device displays a second interface, where the second interface is used to instruct the user to draw a target gesture; the electronic device receives the target gesture drawn by the user on the second interface, and selects, from the M reference images, the reference image whose target gesture has the maximum similarity to the drawn target gesture as the first reference image.
In this implementation manner, when the first gesture selection item selected by the user is used to customize the target gesture, the electronic device may further display the second interface to make it convenient for the user to draw a target gesture; after the electronic device receives the target gesture drawn by the user on the second interface, the electronic device selects, from the M reference images, the reference image whose target gesture has the maximum similarity to the drawn target gesture as the first reference image, which further improves the user experience. A selection of this kind is sketched below.
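By way of illustration only, the custom-gesture branch can reuse the same similarity-based selection, this time comparing each reference image against the pose drawn on the second interface; keypoints_from_drawing() is a hypothetical helper, and the other functions are reused from the sketch above.

```python
def reference_for_drawn_pose(drawn_pose, reference_images):
    # keypoints_from_drawing() is a hypothetical helper that converts the user's
    # drawing into the same (K, 2) keypoint representation used above.
    drawn_kpts = normalize(keypoints_from_drawing(drawn_pose))
    # Reuse the maximum-similarity selection from the earlier sketch.
    return max(reference_images,
               key=lambda ref: cosine_similarity(drawn_kpts,
                                                 normalize(estimate_keypoints(ref))))
```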
In one implementation of the first aspect, the first target image is acquired by the electronic device in response to a photographing instruction; wherein the electronic device selects a first reference image from the M reference images, comprising: in response to the photographing instruction, the electronic device determines that the first target image includes an image of a first person holding the selfie stick in the first target posture, and selects the first reference image from the M reference images.
In this implementation manner, because the first target image is acquired by the electronic device in response to a photographing instruction, after the electronic device finishes shooting, the electronic device selects the first reference image from the M reference images when it determines that the first target image includes an image of the first person holding the selfie stick in the first target posture; the electronic device can then use the first reference image to perform gesture migration on the first target image to generate the second target image. That is, the electronic device can process the first target image after shooting is completed, which improves the user experience.
In one implementation of the first aspect, the electronic device displays a first target image, including: the electronic equipment displays a first target image in a gallery application or displays the first target image in an instant messaging application; wherein the electronic device selects a first reference image from the M reference images, comprising: in response to a user's preset editing operation on the first target image, the electronic device selects a first reference image from the M reference images.
In the implementation manner, the electronic device can also process the first target image stored in the gallery application and the first target image received by the instant messaging application, so that the user experience is further improved.
In an implementation manner of the first aspect, the electronic device performs pose migration on the first target image by using the first reference image to obtain a second target image, including: the electronic equipment performs object segmentation on a first target image to remove an image of the first object in the first target image and identify a first image of the first person in the first target image; the electronic equipment performs portrait segmentation on the first reference image to obtain a second image of a first person in the first reference image; the electronic equipment carries out first gesture estimation on a first image of a first person and a second image of the first person to obtain a first gesture UV image of the first person, a first image UV image of the first person, a second gesture UV image of the first person and a second image UV image of the first person; the electronic equipment adopts a second gesture UV image of the first person and a second image UV image of the first person to carry out gesture migration on the first gesture UV image of the first person and the first image UV image of the first person, so as to obtain a third gesture UV image of the first person and a third image UV image of the first person; the electronic equipment carries out second gesture estimation on the third gesture UV image of the first person and the third image UV image of the first person to obtain a third image of the first person; and the electronic equipment performs fusion processing on the third image of the first person and the background image in the first target image to obtain a second target image.
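The chain of steps in this implementation manner can be summarized in the following sketch. It assumes, without asserting, that each step (object segmentation, portrait segmentation, DensePose-style UV estimation, UV-space migration, UV-to-image conversion, and fusion) is available as a separate module; every function name is a hypothetical placeholder.

```python
# Illustrative pose-migration pipeline; all helpers are hypothetical placeholders.

def pose_migration(first_target_image, first_reference_image):
    # Object segmentation: remove the first object (e.g. the selfie stick) and
    # identify the first image of the first person in the first target image.
    background, first_person_img = segment_object_and_person(first_target_image)

    # Portrait segmentation of the reference image gives the second image of
    # the first person (the person shown in the second target pose).
    second_person_img = segment_person(first_reference_image)

    # First pose estimation: DensePose-style conversion of both person images
    # into pose UV maps and image (texture) UV maps.
    pose_uv_1, image_uv_1 = estimate_pose_uv(first_person_img)
    pose_uv_2, image_uv_2 = estimate_pose_uv(second_person_img)

    # Pose migration in UV space: drive the first person's UV maps with the
    # reference maps to obtain the third pose UV map and third image UV map.
    pose_uv_3, image_uv_3 = migrate_pose(pose_uv_1, image_uv_1,
                                         pose_uv_2, image_uv_2)

    # Second pose estimation (UV-to-image): map the migrated UV maps back to
    # image space, yielding the third image of the first person.
    third_person_img = uv_to_image(pose_uv_3, image_uv_3)

    # Fuse the re-posed person with the background of the first target image.
    return fuse(third_person_img, background)
```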
In a second aspect, an electronic device is provided, which has the functions of implementing the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In a third aspect, an electronic device is provided, where M reference images are stored, where the M reference images include at least one image of multiple poses of a person, and the M reference images are not images photographed by a corresponding person holding a first object, and M is an integer greater than 1; the electronic device includes a display screen, a memory, and one or more processors; the display screen, the memory and the processor are coupled; the memory is for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the electronic device to perform the steps of: the electronic device displays a first target image; wherein the first target image comprises an image of a first person holding the first object in the first target pose; the electronic device selects a first reference image from M reference images; the first reference image comprises an image of the corresponding person in a second target posture, and the second target posture is different from the first target posture; the electronic equipment adopts the first reference image to carry out gesture migration on the first target image so as to obtain a second target image; wherein the second target image includes an image of the first person in the second target pose, the second target image not including an image of the first person holding the first object.
In one implementation of the third aspect, the first object comprises a selfie stick.
In an implementation manner of the third aspect, the second target gesture is a default gesture preset in the electronic device; alternatively, the computer instructions, when executed by the processor, cause the electronic device to specifically perform the following steps: the electronic device displays a first interface, where the first interface includes a plurality of gesture selection items, and each of the plurality of gesture selection items corresponds to one target gesture; in response to an operation on a first gesture selection item among the plurality of gesture selection items, the electronic device selects, from the M reference images, a reference image in the target gesture corresponding to the first gesture selection item as the first reference image, where the target gesture corresponding to the first gesture selection item is the second target gesture.
In an implementation manner of the third aspect, the M reference images include N reference images, where the N reference images are reference images in the target gesture corresponding to the first gesture selection item, N is an integer greater than 1, and N is less than or equal to M; the computer instructions, when executed by the processor, cause the electronic device to specifically perform the following step: the electronic device selects, from the N reference images, the reference image whose target gesture has the maximum similarity to the first target gesture as the first reference image.
In an implementation manner of the third aspect, the first gesture selection item is used to customize the target gesture; the computer instructions, when executed by the processor, cause the electronic device to specifically perform the following steps: the electronic device displays a second interface, where the second interface is used to instruct the user to draw a target gesture; the electronic device receives the target gesture drawn by the user on the second interface, and selects, from the M reference images, the reference image whose target gesture has the maximum similarity to the drawn target gesture as the first reference image.
In one implementation of the third aspect, the first target image is acquired by the electronic device in response to a photographing instruction; when executed by a processor, the computer instructions cause the electronic device to specifically perform the steps of: in response to the photographing instruction, the electronic device determines that the first target image includes an image of a first person holding the selfie stick in the first target posture, and selects the first reference image from the M reference images.
In one implementation of the third aspect, the computer instructions, when executed by the processor, cause the electronic device to specifically perform the steps of: the electronic equipment displays a first target image in a gallery application or displays the first target image in an instant messaging application; in response to a user's preset editing operation on the first target image, the electronic device selects a first reference image from the M reference images.
In one implementation of the third aspect, the computer instructions, when executed by the processor, cause the electronic device to specifically perform the steps of: the electronic equipment performs object segmentation on a first target image to remove an image of the first object in the first target image and identify a first image of the first person in the first target image; the electronic equipment performs portrait segmentation on the first reference image to obtain a second image of a first person in the first reference image; the electronic equipment carries out first gesture estimation on a first image of a first person and a second image of the first person to obtain a first gesture UV image of the first person, a first image UV image of the first person, a second gesture UV image of the first person and a second image UV image of the first person; the electronic equipment adopts a second gesture UV image of the first person and a second image UV image of the first person to carry out gesture migration on the first gesture UV image of the first person and the first image UV image of the first person, so as to obtain a third gesture UV image of the first person and a third image UV image of the first person; the electronic equipment carries out second gesture estimation on the third gesture UV image of the first person and the third image UV image of the first person to obtain a third image of the first person; and the electronic equipment performs fusion processing on the third image of the first person and the background image in the first target image to obtain a second target image.
In a fourth aspect, there is provided a computer readable storage medium having stored therein computer instructions which, when run on a computer, cause the computer to perform the image processing method of any of the first aspects above.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the image processing method of any of the first aspects above.
For the technical effects of any of the design manners of the second aspect to the fifth aspect, reference may be made to the technical effects of the corresponding design manners of the first aspect, and details are not repeated here.
Drawings
Fig. 1 is a schematic diagram of UV texture color values according to an embodiment of the present application;
Fig. 2 is a first schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 4 is a first interface schematic diagram of an image processing method according to an embodiment of the present application;
Fig. 5 is a second interface schematic diagram of an image processing method according to an embodiment of the present application;
Fig. 6 is a third interface schematic diagram of an image processing method according to an embodiment of the present application;
Fig. 7 is a second schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 8 is a third schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a semantic segmentation model according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a synthesis network according to an embodiment of the present application;
Fig. 12 is a fourth schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 13 is a fourth interface schematic diagram of an image processing method according to an embodiment of the present application;
Fig. 14 is a fifth schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a chip system according to an embodiment of the present application.
Detailed Description
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present embodiment, unless otherwise specified, the meaning of "plurality" is two or more.
In order to facilitate understanding of the schemes provided in the embodiments of the present application, some terms related to the embodiments of the present application will be explained below.
Pose estimation: pose estimation here is performed on an input image to be processed (such as an RGB image) based on a dense pose estimation (DensePose) model. In practice, pose estimation of the image to be processed usually means predicting the key points (or labeled points) of the human figure, that is, first predicting the position coordinates of each key point of the human body, and then determining the spatial position relationship among the key points, so as to obtain the predicted skeleton of the human body. A key point may be a joint of the human body or another point, which is not specifically limited in the embodiment of the present application.
In general, pose estimation of the input image to be processed by the DensePose model includes two tasks: a classification task and a regression task. The classification task classifies and labels the human body region to which each key point in the input image belongs. Illustratively, the body regions are divided into 24 categories in total, including, for example, head, left hand, right hand, arm, and shoulder; on this basis, the human body region of each key point in the image to be processed is labeled according to the classification results. In some embodiments, the classification task may also classify and label the background in the image to be processed, so that the classification task divides the image to be processed into 25 classes.
The regression task may map, based on a skinned multi-person linear (SMPL) human body three-dimensional model, the three-dimensional coordinates (i.e., the X, Y, Z coordinates) corresponding to each pixel in the image to be processed into UV coordinates to obtain a UV texture map. SMPL is a vertex-based three-dimensional human body model that can accurately represent different shapes and postures of the human body; the SMPL model includes 6890 vertices and 24 human body region classifications. The SMPL model involves two important coordinate systems: one for the position (X, Y, Z) coordinates of the vertices, and the other for the UV coordinates. U refers to the coordinate of the picture in the horizontal direction of the display, and V refers to the coordinate of the picture in the vertical direction of the display; in other words, U represents the U-th pixel in the horizontal direction (i.e., along the picture width), and V represents the V-th pixel in the vertical direction (i.e., along the picture height). In general, U and V take values in the range [0, 1].
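As an assumed, simplified data layout (not taken from the DensePose or SMPL publications), the two task outputs for an H x W image can be represented as follows: a per-pixel body-part index for the classification task and per-pixel U and V values for the regression task.

```python
import numpy as np

H, W = 256, 192  # example resolution, chosen arbitrarily for illustration

# Classification output: 0 for background, 1..24 for the human body region classes.
part_index = np.zeros((H, W), dtype=np.uint8)

# Regression output: U and V coordinates in [0, 1] for every pixel.
uv_coords = np.zeros((2, H, W), dtype=np.float32)

# A pixel assigned to a body part is located on that part's patch of the SMPL
# surface by its (U, V) pair, which is how the UV texture map is assembled.
y, x = 128, 96
print(int(part_index[y, x]), uv_coords[:, y, x])
```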
It should be noted that, based on the SMPL model, the three-dimensional coordinates corresponding to each pixel in the image to be processed can be mapped into UV coordinates, so as to obtain a UV texture map; correspondingly, the UV coordinates can also be mapped back into three-dimensional coordinates, resulting in a processed image. The image to be processed and the processed image are both 3D images (also referred to as I images); that is, in the embodiment of the application, conversion from an I image to a UV image (i.e., I2UV) can be achieved, and conversion from a UV image to an I image (i.e., UV2I) can also be achieved.
The following relationship between three-dimensional coordinates and UV coordinates is taken as an example to illustrate the conversion of I2UV and UV2I.
v: a1, b1, c1, 0;
v: a2, b2, c2, 0;
v: a3, b3, c3, 0;
v: a4, b4, c4, 0;
vt: A1, B1;
vt: A2, B2;
vt: A3, B3;
vt: A4, B4.
Here, v represents three-dimensional coordinates and vt represents UV coordinates. Taking v: a1, b1, c1, 0 as an example, a1, b1, and c1 correspond to the X-axis, Y-axis, and Z-axis of the three-dimensional coordinates, respectively, and the 0 indicates the human body region classification corresponding to the key point; for example, 0 may denote the head. Correspondingly, 1 may denote the neck, 2 may denote the shoulder, and so on, which will not be enumerated here. Taking vt: A1, B1 as an example, A1 represents the coordinate of the picture in the horizontal direction of the display (i.e., along the picture width), and B1 represents the coordinate of the picture in the vertical direction of the display (i.e., along the picture height). The other v and vt entries can be understood by analogy and are not described again here.
In addition, the vertex positions of the SMPL model may be expressed as f = v/vt, where f represents the position of a vertex in the SMPL model and v/vt represents the position value of that vertex. For example, f1 represents the position of the first of the 6890 vertices, and v1/vt1 represents the position value of the first vertex; similarly, f2 represents the position of the second vertex, and v2/vt2 represents its position value; f3 represents the position of the third vertex, and v3/vt3 represents its position value. Combining the relationship between f, v, and vt of each vertex of the SMPL model with the relationship between the three-dimensional coordinates and the UV coordinates gives, for example: f: 4/1, 2/2, 1/3.
Combining the three-dimensional coordinates and the UV coordinates shown above, wherein 4 in 4/1 represents the fourth v coordinate and 1 represents the first vt coordinate; similarly, 2 in 2/2 represents a second v coordinate, and 2 represents a second vt coordinate; 1 in 1/3 represents the first v coordinate and 3 represents the third vt coordinate.
Combining the above embodiments, the UV texture image can be obtained by looking up, in the UV coordinates, the texture color value of each vertex of the SMPL model, and rendering the obtained texture color values onto the vertex positions. Illustratively, fig. 1 shows a schematic diagram of the correspondence between UV coordinates and texture color values; as can be seen from fig. 1, the texture color values include four colors, RGBY, where R represents red, G represents green, B represents blue, and Y represents yellow.
Illustratively, taking the above f: 4/1 as an example, if the UV coordinates corresponding to this vertex position are (0.8, 0.4), then the texture color corresponding to this vertex position is blue. By analogy, the texture color corresponding to each vertex position can be obtained, and each texture color is rendered onto its vertex position, which completes the conversion from the I image to the UV image. Correspondingly, according to the relationship between v and vt, mapping the coordinate corresponding to vt back to the v coordinate yields the conversion from the UV image to the I image. The conversion from the UV image to the I image can refer to the above conversion from the I image to the UV image, and will not be described in detail here.
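A small sketch of the f = v/vt bookkeeping described above is given below; the numeric vertex values are illustrative placeholders, and texture_color_at() merely stands in for the RGBY chart of fig. 1 (for which the text states that (0.8, 0.4) maps to blue).

```python
# v entries: (X, Y, Z, body-part class); vt entries: (U, V) with U, V in [0, 1].
# The values below are illustrative placeholders, not real SMPL data.
v  = [(0.1, 0.2, 0.3, 0), (0.4, 0.5, 0.6, 0), (0.7, 0.8, 0.9, 0), (0.2, 0.3, 0.4, 0)]
vt = [(0.8, 0.4), (0.1, 0.9), (0.5, 0.5), (0.3, 0.7)]

# f entries pair a v index with a vt index (1-based, as in the text: "4/1" means
# the fourth v coordinate with the first vt coordinate).
f = [(4, 1), (2, 2), (1, 3)]

def texture_color_at(u, v_coord):
    # Placeholder: in the real pipeline this would read the RGBY chart of fig. 1.
    return "blue" if (u, v_coord) == (0.8, 0.4) else "unknown"

for v_idx, vt_idx in f:
    x, y, z, part = v[v_idx - 1]
    u, vv = vt[vt_idx - 1]
    print(f"vertex {v_idx}: XYZ=({x}, {y}, {z}), part={part}, "
          f"UV=({u}, {vv}) -> {texture_color_at(u, vv)}")
```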
In summary, in the embodiment of the present application, after the image to be processed is input into the DensePose model for pose estimation, because the DensePose model includes a classification task and a regression task, the output of the DensePose model may include a pose estimation UV map and a UV texture image. The pose estimation UV map is the image obtained by converting each key point of the human pose in the image to be processed into UV coordinates; the UV texture image is the image obtained by converting the image to be processed (i.e., the RGB image) into UV coordinates.
It can be appreciated that a selfie stick is a length-adjustable rod: one end is a handle that the photographer holds to control photographing, and the other end is a fixing part on which the electronic device is mounted. In actual operation, the photographer can capture self-portrait images from different angles by adjusting the length of the selfie stick. The longer the selfie stick, the wider the background of the captured self-portrait image.
However, the longer the selfie stick, the longer the portion of the stick that appears in the self-portrait image, and the larger the affected area of the image. In addition, if only the selfie stick is removed, a blank region appears at the photographer's hand, so that the photographer's self-portrait posture looks unnatural and the shooting effect is affected.
The embodiment of the application provides an image processing method, which is applied to an electronic device and can simultaneously solve the problem that a blank region appears at the photographer's hand and the self-portrait posture is unnatural, thereby improving the shooting effect. Specifically, starting from a self-portrait image containing a selfie stick, the electronic device first locates the selfie stick in the self-portrait image and then removes it; afterwards, the electronic device performs image restoration on the image from which the selfie stick has been removed, so as to obtain a self-portrait image with a good shooting effect.
As shown in fig. 2, for example, (1) in fig. 2 is a self-portrait image containing a selfie stick that is captured by the electronic device; the electronic device first locates the selfie stick in the self-portrait image and then removes it, obtaining the image shown in (2) in fig. 2 with the selfie stick removed; subsequently, the electronic device performs image restoration on the image shown in (2) in fig. 2 to obtain the restored self-portrait image shown in (3) in fig. 2.
The image processing method provided in the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
The electronic device in the embodiment of the application may be an electronic device with a shooting function. For example, the electronic device may be a mobile phone, an action camera (such as a GoPro), a digital camera, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, a vehicle-mounted device, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR)/virtual reality (VR) device, etc.; the specific form of the electronic device is not particularly limited in the embodiments of the present application.
As shown in fig. 3, a schematic structure of the electronic device 100 is shown. Wherein the electronic device 100 may include: processor 110, external memory interface 120, internal memory 121, universal serial bus (universal serial bus, USB) interface 130, charge management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headset interface 170D, sensor module 180, positioning module 181, keys 190, motor 191, indicator 192, camera 193, display 194, and subscriber identity module (subscriber identification module, SIM) card interface 195, etc.
It is to be understood that the structure illustrated in the present embodiment does not constitute a specific limitation on the electronic apparatus 100. In other embodiments, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and command center of the electronic device 100. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
It should be understood that the connection relationship between the modules illustrated in this embodiment is only illustrative, and does not limit the structure of the electronic device. In other embodiments, the electronic device may also use different interfacing manners in the foregoing embodiments, or a combination of multiple interfacing manners.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), an active-matrix organic light emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, the electronic device may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device selects a frequency bin, the digital signal processor is used to fourier transform the frequency bin energy, and so on.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device may play or record video in a variety of encoding formats, such as: dynamic picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent cognition of electronic devices can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110. The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. Microphone 170C, also referred to as a "microphone" or "microphone", is used to convert sound signals into electrical signals.
The earphone interface 170D is used to connect a wired earphone. The headset interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, audio, video, etc. files are stored in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the electronic device by executing the instructions stored in the internal memory 121. For example, in the embodiment of the present application, the processor 110 may execute the instructions stored in the internal memory 121, and the internal memory 121 may include a storage program area and a storage data area.
The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device (e.g., audio data, phonebook, etc.), and so forth. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195, or removed from the SIM card interface 195 to enable contact and separation with the electronic device. The electronic device may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support Nano SIM cards, micro SIM cards, and the like.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of hardware and software.
The methods in the following embodiments may be implemented in the electronic device 100 having the above-described hardware structure. In the following embodiments, the electronic device 100 is taken as an example of a mobile phone, and the technical solutions provided in the embodiments of the present application are specifically described.
The image processing method provided by the embodiment of the application can be applied to various self-portrait scenarios. For example, a photographer can fix a mobile phone at one end of a selfie stick and hold the other end by hand to take single-person selfies, multi-person group photos, group photos with scenery, high-altitude shots, skiing shots, surfing shots, parachuting shots, and the like. The technical solution provided by the embodiment of the application is described below by taking a scenario in which a photographer holds a selfie stick for a single-person selfie as an example.
For example, a camera application (or another application with a photographing function) may be installed in the mobile phone. Taking the camera application installed on the mobile phone as an example, the mobile phone can run the camera application upon receiving an operation input by the first person. The operation may be, for example, one of a touch operation, a key operation, a gesture operation, a voice operation, and the like; the touch operation may be, for example, a tap operation or a slide operation.
The first person may be a photographer or a subject. When the first person performs self-shooting through a front camera of the mobile phone, the first person is a photographer and a shot person at the same time; when the first person shoots through the rear camera, the first person is a photographer. In the following embodiments, a first person is illustrated as a photographer when the first person performs self-photographing through a front camera of a mobile phone.
In some embodiments, as shown in fig. 4 (1), the handset displays an interface 220 as shown in fig. 4 (2) in response to the photographer operating an icon 210 of the camera application in the handset home screen interface. The interface 220 is a preview interface when the mobile phone photographs, and the preview interface includes a preview image. Illustratively, the interface 220 further includes a portrait mode, a video mode, a movie mode, and a professional mode, and the photographer may select any of the above modes to photograph.
In some embodiments, as also shown in fig. 4 (2), the interface 220 further includes an artificial intelligence (artificial intelligence, AI) control 230, and the handset turns on the AI function in response to a photographer's operation of the AI control 230. Wherein the AI function is used to identify whether a selfie stick is included in the preview image. When the handset recognizes that the selfie stick is included in the preview image, the handset displays an interface 240 (or first interface) as shown in fig. 5 (1). The interface 240 includes a plurality of gesture selections; wherein each of the plurality of gesture selections corresponds to a target gesture of the first persona.
Illustratively, as also shown in interface 240 in fig. 5 (1), the plurality of gesture selections include a default gesture selection, a gesture 1 (pose 1) selection, a gesture 2 (pose 2) selection, ..., a custom gesture selection, and the like. Given the limited display area in interface 240, the plurality of gesture selections further include a "more" option, which can provide additional gesture selections to the photographer. The default gesture is a target gesture preset by the mobile phone; after the mobile phone turns on the AI function, if the mobile phone recognizes that the preview image includes a selfie stick and the photographer does not select another gesture selection, the mobile phone can use the default gesture to perform gesture migration on a target image (such as the first target image) generated by the mobile phone.
In addition, the custom gesture is a target gesture customized by the photographer; for example, the photographer can manually draw a target gesture. Gesture 1 and gesture 2 are target gestures that the photographer can select according to actual needs; for example, gesture 1 may be "the first person with one hand on the waist and the other hand making a 'Yeah' sign", and gesture 2 may be "the first person with both hands on the waist".
Taking the first gesture selection item as pose 1 as an example, in response to the selection operation on pose 1 in the interface 240, the mobile phone displays an interface 250 as shown in (2) in fig. 5, where pose 1 in the interface 250 is selected; for example, the pose 1 box is darkened. Then, in response to the operation of the photographing key, the mobile phone displays an interface 260 as shown in (3) in fig. 5, where the interface 260 includes the first target image; the first target image is generated from the image frames captured by the mobile phone through the front camera. Then, the mobile phone performs gesture migration on the first target image according to the target gesture corresponding to pose 1 selected by the photographer, and displays an interface 270 as shown in (4) in fig. 5, where the interface 270 includes the second target image. The interface 270 further includes a "save" control and a "cancel" control: in response to the operation on the "save" control, the mobile phone saves the second target image in the gallery application of the mobile phone; in response to the operation on the "cancel" control, the mobile phone saves the first target image in the gallery application of the mobile phone.
Taking the case where the first gesture selection item is the custom gesture selection item as an example, as shown in (1) in fig. 6, in response to the operation on the custom gesture selection item, the mobile phone displays an interface 280 (or second interface) as shown in (2) in fig. 6, where the interface 280 is used to instruct the photographer to draw a target gesture. Then, the mobile phone performs gesture migration on the first target image according to the target gesture drawn by the photographer to obtain the second target image. For an illustration of the second target image, reference may be made to the second target image shown in (4) in fig. 5, and details are not repeated here.
In the embodiment of the application, the second target image is generated after the mobile phone performs gesture migration on the first target image according to the target gesture corresponding to the gesture selection item. Comparing the first target image and the second target image shown in (3) in fig. 5 and (4) in fig. 5, it can be seen that the target pose (or first target pose) of the photographer in the first target image is different from the target pose (or second target pose) of the photographer in the second target image. For example, in the first target image one hand of the photographer holds the selfie stick and the other hand is on the waist; in the second target image one hand of the photographer makes a "Yeah" sign and the other hand is on the waist. Therefore, when the photographer shoots while holding the selfie stick, the mobile phone performs gesture migration on the first target image using the preset target gesture, so that the selfie stick in the first target image can be removed and a second target image with an other-shot effect is generated. In this way, the problems that, after the selfie stick is removed, an empty area appears at the photographer's hand and the photographer's selfie pose looks unnatural can both be solved.
In some embodiments, in conjunction with the interface diagrams shown in fig. 4 and fig. 5, as shown in fig. 7, the mobile phone enters a photographing mode in response to an operation of the photographer running the camera application, and displays a preview image on the preview interface. Then, the mobile phone starts the AI function and begins to identify whether the preview image includes a selfie stick; when the mobile phone recognizes that the preview image includes the selfie stick, the mobile phone prompts the user, through first prompt information, whether the selfie stick needs to be removed. For example, when the mobile phone receives an operation of removing the selfie stick from the user, the mobile phone displays a plurality of gesture selection items on the preview interface in response to the operation, such as a default gesture selection item, a pose 1 selection item, a pose 2 selection item, and a custom gesture selection item. The first prompt information may be, for example, voice prompt information or text prompt information, which is not limited in the embodiment of the present application.
After a user selects a first gesture selection item in a plurality of gesture selection items, the mobile phone responds to the operation of a shooting key to generate a first target image, and gesture migration is carried out on the first target image by adopting a target gesture corresponding to the first gesture selection item so as to obtain a second target image.
With reference to fig. 8, a specific implementation process of performing gesture migration on the first target image by using a target gesture corresponding to the first gesture selection item to obtain the second target image is illustrated.
For example, M reference images are stored in the mobile phone in advance, where the M reference images include images of at least one person in multiple poses, and M is an integer greater than 1. For example, on the basis of the interface 240 shown in (1) in fig. 5, the mobile phone responds to the operation of the photographer on a first gesture selection item (such as pose 1) among the plurality of gesture selection items, and selects, from the M reference images, a reference image in the target gesture corresponding to the first gesture selection item as the first reference image; subsequently, the mobile phone can adopt the first reference image to carry out gesture migration on the first target image so as to obtain a second target image.
In some embodiments, the target gesture corresponding to each of the plurality of gesture selection items is associated with N reference images among the M reference images; that is, the target gesture of one gesture selection item corresponds to N reference images, where N is an integer greater than 1 and less than or equal to M. For example, the mobile phone responds to the operation on the first gesture selection item among the plurality of gesture selection items and determines the N reference images under the target gesture corresponding to the first gesture selection item; then, the mobile phone selects a first reference image from the N reference images. For example, the mobile phone may select, from the N reference images, the reference image whose target pose has the maximum similarity with the first target pose as the first reference image. It should be understood that the first target pose is the target pose of the first person holding the selfie stick in the first target image.
The mobile phone may calculate the similarity between the target pose corresponding to each of the N reference images and the first target pose, and select the reference image with the largest similarity as the first reference image. It should be noted that, for an illustration of how the mobile phone calculates the similarity between the target pose corresponding to each of the N reference images and the first target pose, reference may be made to the following embodiments, which are not described herein.
As described in connection with the above embodiments, the plurality of gesture selections include a default gesture selection, a custom gesture selection, and the like. On the basis, when the first gesture selection item is a default gesture selection item, the target gesture corresponding to the first gesture selection item is a default gesture; when the first gesture selection item is a user-defined gesture selection item, the target gesture corresponding to the first gesture selection item is a user-defined target gesture.
In some embodiments, after the mobile phone receives the operation of the shooting key, the mobile phone generates a first target image from an image frame acquired by the front camera; then, the mobile phone performs object segmentation on the first target image to remove the selfie stick in the first target image to obtain a third target image, and identifies a first image of a first person in the third target image.
For example, as shown in fig. 8, the mobile phone may use an object segmentation algorithm to remove the selfie stick in the first target image to obtain a third target image, and identify the first image of the first person in the third target image. In the embodiment of the present application, the object segmentation algorithm may be, for example, a semantic segmentation algorithm (or referred to as a semantic segmentation model). The semantic segmentation algorithm comprises an encoder and a decoder. The encoder is a data compression algorithm that can be used to extract features of the input image; the decoder is the inverse of the encoder and reconstructs the output by decoding the deep feature space.
Illustratively, as shown in FIG. 9, the semantic segmentation algorithm is capable of separating a selfie stick 310 and a person image 320 (or first image of a first person) in the first target image. It should be understood that, if the selfie stick and the person image exist in the first target image, after processing by the semantic segmentation algorithm the mobile phone may output the corresponding mask areas (i.e., class 1 for the selfie stick and class 2 for the person image); if the first target image contains neither the selfie stick nor the person image, the mobile phone outputs class 0 after processing by the semantic segmentation algorithm, where class 0 indicates that no target object is detected this time, i.e., neither the selfie stick nor the person image is detected.
Subsequently, after the mobile phone separates the self-timer stick 310 and the person image 320 in the first target image by using the semantic segmentation algorithm, the mobile phone may remove the self-timer stick 310 to obtain a first image of the first person in the first target image, i.e. obtain the person image 320.
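For ease of understanding, a minimal Python (PyTorch-style) sketch of this segmentation-and-removal step is given below. The model handle `seg_model`, the function name, and the exact tensor layout are illustrative assumptions and not the exact implementation of the embodiment; only the class convention (0 = no target object, 1 = selfie stick, 2 = person image) follows the description above.

```python
import numpy as np
import torch

# Assumed class-index convention, following the description above:
# 0 = background / no target object, 1 = selfie stick, 2 = person image
BACKGROUND, SELFIE_STICK, PERSON = 0, 1, 2

def segment_and_remove_stick(seg_model, first_target_image: np.ndarray):
    """Run a (hypothetical) semantic segmentation model on an HxWx3 RGB image,
    clear the selfie-stick pixels, and return the person image and its mask."""
    x = torch.from_numpy(first_target_image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = seg_model(x)                    # [1, 3, H, W] per-pixel class scores
    labels = logits.argmax(dim=1)[0].numpy()     # [H, W] class map

    stick_mask = labels == SELFIE_STICK
    person_mask = labels == PERSON

    # Third target image: original image with the selfie-stick region cleared
    # (in practice the cleared region would be filled in or re-synthesized)
    third_target_image = first_target_image.copy()
    third_target_image[stick_mask] = 0

    # First image of the first person: person pixels only
    person_image = np.zeros_like(first_target_image)
    person_image[person_mask] = first_target_image[person_mask]
    return third_target_image, person_image, person_mask
```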
In some embodiments, the semantic segmentation model may first be trained in an end-to-end manner on a portrait segmentation dataset and then fine-tuned (finetune) on a selfie-stick dataset, thereby obtaining a network model that can perform both selfie-stick segmentation and portrait segmentation. In the embodiment of the application, the semantic segmentation model can be trained using the Adam optimization function with a learning rate (Lr) of 0.001 for 100 iterations on an NVIDIA RTX A5000 graphics card.
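A possible two-stage training sketch matching this description (end-to-end training on the portrait segmentation dataset, then fine-tuning on the selfie-stick dataset, Adam optimizer, Lr = 0.001) is shown below; the data-loader names, the batch format and the per-pixel label encoding are assumptions made for illustration.

```python
import torch
from torch import nn, optim

def train_segmentation(model: nn.Module, portrait_loader, stick_loader,
                       device="cuda", iters_per_stage=100):
    """Two-stage sketch: stage 1 trains on the portrait-segmentation data,
    stage 2 fine-tunes on the selfie-stick data."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()            # per-pixel cross entropy against the GT map
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    for stage_loader in (portrait_loader, stick_loader):
        it = 0
        for images, gt_maps in stage_loader:     # images: [B,3,H,W], gt_maps: [B,H,W] class ids
            logits = model(images.to(device))    # [B, num_classes, H, W]
            loss = criterion(logits, gt_maps.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= iters_per_stage:
                break
    return model
```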
For example, the selfie-stick dataset may be acquired with a mobile electronic device (e.g., a mobile phone) using different types of selfie sticks. For example, selfie sticks of different colors (e.g., black, white, blue, etc.) and of lengths of 1 meter, 2 meters and 3 meters may be selected. One possible collection procedure for the selfie-stick dataset is as follows. Step 1: hold a 1-meter selfie stick and extend it from 0 to 1 meter, acquiring images at every unit length of n (e.g., 10 cm); at each unit length, images are acquired with the selfie stick held in the left hand only, in the right hand only, and in both hands, so that m (e.g., 33) images may be acquired. Step 2: hold a 2-meter selfie stick and acquire 2m images in the manner of step 1. Step 3: hold a 3-meter selfie stick and acquire 3m images in the manner of step 1. Step 4: manually perform semantic segmentation on all the images acquired in steps 1 to 3, and annotate the positions and category attributes of the portrait and the selfie stick. The category attributes may include, for example, background, portrait, and selfie stick.
Further, the portrait segmentation data may include two parts: one part consists of the portrait positions and category attributes manually segmented when constructing the selfie-stick dataset; the other part is based on an open-source dataset containing portraits. For the open-source portrait dataset, reference may be made to the related art, and it will not be described in detail herein.
Illustratively, as can be seen in conjunction with FIG. 9, the semantic segmentation model includes a downsampling encoder and an upsampling decoder. The input data of the semantic segmentation model are the selfie-stick dataset and the portrait segmentation dataset constructed as described above. In some embodiments, assume the input is an original image with a resolution of 5000×4000; to ensure that the image is lossless, the receptive field can be enlarged by 32 pixels through a mirroring operation, so that the resolution of the original image is enlarged to 5032×4032. As shown in fig. 10, the encoder performs five downsampling operations on the original image to obtain pool1, pool2, pool3, pool4 and pool5, respectively, where the numbers of convolution kernels (output channels) at the five stages are 64, 128, 256, 384 and 256, respectively; the convolution kernel is 3×3 and the activation function is ReLU. As also shown in fig. 10, after pool5 is obtained from the five downsampling operations, pool5 is upsampled by a factor of 2 to obtain upsampled1. Then, the 2× upsampled1 is added to the fourth downsampling result pool4 to obtain upsampled2, and upsampled2 is upsampled by a factor of 2 to obtain upsampled3. Upsampled3 is then added to the third downsampling result pool3 to obtain upsampled4. Finally, upsampled4 is followed by a normalized exponential function (softmax) to obtain the segmentation result for the pixel points in the original image.
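The following sketch illustrates this encoder/decoder structure (five downsampling stages with 64/128/256/384/256 channels, skip additions with pool4 and pool3, softmax applied externally). The 1×1 score convolutions and the final upsampling back to the input resolution are assumptions added only so that the additions and the output size are well defined; they are not necessarily part of the embodiment.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SegEncoderDecoder(nn.Module):
    """Minimal FCN-style sketch of the segmentation network described above.
    Input spatial size is assumed divisible by 32."""
    def __init__(self, num_classes=3, channels=(64, 128, 256, 384, 256)):
        super().__init__()
        blocks, in_c = [], 3
        for out_c in channels:                       # five 3x3-conv + ReLU + pool stages
            blocks.append(nn.Sequential(
                nn.Conv2d(in_c, out_c, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            in_c = out_c
        self.stages = nn.ModuleList(blocks)
        # 1x1 convs project pool3/pool4/pool5 to class scores so the additions line up (assumed)
        self.score5 = nn.Conv2d(channels[4], num_classes, 1)
        self.score4 = nn.Conv2d(channels[3], num_classes, 1)
        self.score3 = nn.Conv2d(channels[2], num_classes, 1)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                          # pool1 .. pool5
        pool3, pool4, pool5 = feats[2], feats[3], feats[4]
        up1 = F.interpolate(self.score5(pool5), scale_factor=2)   # upsampled1
        up2 = up1 + self.score4(pool4)                             # upsampled2
        up3 = F.interpolate(up2, scale_factor=2)                   # upsampled3
        up4 = up3 + self.score3(pool3)                             # upsampled4
        logits = F.interpolate(up4, scale_factor=8)                # back to input size (assumed)
        return logits                                 # softmax / cross entropy applied outside
```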
Typically, when the softmax function is used as the activation function of the output node, cross entropy is used as the loss function. In the embodiment of the application, the cross entropy can be calculated from the segmentation result and a ground-truth (GT) map; the GT map gives the classification result of each pixel point, in units of pixel points, and serves as the supervision information.
It should be noted that the object segmentation algorithm used in the embodiments of the present application is not limited to the semantic segmentation model described in the above embodiments; other object segmentation algorithms may also be used, for example, feature-encoding-based VGGNet and ResNet, region-selection-based convolutional neural networks (region-based convolutional neural network, R-CNN), as well as conventional region-growing algorithms and edge-detection-based segmentation algorithms, which are not limited in the embodiments of the present application.
For example, as shown in fig. 8, the mobile phone may perform portrait segmentation on the first reference image by using a portrait segmentation algorithm to obtain a second image of the first person in the first reference image.
In the embodiment of the present application, the portrait segmentation algorithm may be, for example, a semantic segmentation algorithm, which may be obtained based on the foregoing training of the semantic segmentation model. It should be understood that, when a semantic segmentation algorithm capable of performing portrait segmentation is obtained based on training of a semantic segmentation model, the input data is the portrait segmentation data described above. The training method can refer to the above embodiments, and will not be described in detail herein.
In some embodiments, since the target gesture corresponding to the first gesture selection item is associated with N reference images among the M reference images, that is, the first gesture selection item corresponds to N reference images, the mobile phone may perform portrait segmentation on each of the N reference images by using the portrait segmentation algorithm to obtain a second image of the first person in each reference image.
Still referring to fig. 8, the mobile phone performs first pose estimation on the first image of the first person in the first target image to obtain a first pose UV map of the first person and a first image UV map of the first person. It should be understood that a pose UV map refers to a UV map obtained by UV-processing each key point of the first person, and an image UV map refers to a UV map obtained by UV-processing the first image (i.e., the RGB image) of the first person.
Correspondingly, for each reference image, the mobile phone carries out first gesture estimation on a second image of the first person in each reference image to obtain a second gesture UV image of the first person and a second image UV image of the first person in each reference image.
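A hedged sketch of this first pose estimation step is given below; the DensePose-style interface `pose_model.predict_uv`, the UV-grid size, and the resampling used to build the image UV map are illustrative assumptions, not the exact method of the embodiment.

```python
import numpy as np

def first_pose_estimation(pose_model, person_image: np.ndarray, person_mask: np.ndarray):
    """Estimate a pose UV map and an image UV map for the segmented person.
    `pose_model.predict_uv` is a hypothetical API returning per-pixel (u, v) in [0, 1]
    for pixels inside person_mask."""
    u, v = pose_model.predict_uv(person_image, person_mask)       # each an HxW float array

    pose_uv_map = np.stack([u, v], axis=-1)                       # HxWx2 pose UV map

    # Resample the person's RGB values onto a fixed UV grid to form the image UV map
    size = 256
    image_uv_map = np.zeros((size, size, 3), dtype=person_image.dtype)
    ys, xs = np.nonzero(person_mask)
    ui = np.clip((u[ys, xs] * (size - 1)).astype(int), 0, size - 1)
    vi = np.clip((v[ys, xs] * (size - 1)).astype(int), 0, size - 1)
    image_uv_map[vi, ui] = person_image[ys, xs]
    return pose_uv_map, image_uv_map
```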
Subsequently, the mobile phone can respectively calculate the similarity between the second gesture UV image and the first gesture UV image in each reference image, so that the reference image with the maximum similarity between the target gesture and the first target gesture is selected from N reference images as the first reference image.
In some embodiments, the similarity between the second pose UV map and the first pose UV map may be represented by the distance (Euclidean distance) between each key point in the second pose UV map and the first pose UV map. Illustratively, the smaller the distance between each key point in the second pose UV map and the first pose UV map, the higher the similarity between the second pose UV map and the first pose UV map; conversely, the greater the distance, the lower the similarity.
Illustratively, the distance between the second pose UV map and the first pose UV map in a reference image satisfies the following formula:

d(x, y) = sqrt( Σ_{i=1}^{Z} (X_i − Y_i)^2 )

where d(x, y) represents the distance between the second pose UV map and the first pose UV map; X_i represents the i-th dense point of the first pose UV map; Y_i represents the i-th dense point of the second pose UV map; and Z represents the number of dense points.
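A small sketch of this selection step is given below; the array layout of the dense points (one aligned (Z, 2) array per pose UV map) is an assumption made for illustration.

```python
import numpy as np

def euclidean_pose_distance(first_pose_uv: np.ndarray, second_pose_uv: np.ndarray) -> float:
    """d(x, y) over the Z dense points, as in the formula above.
    Both inputs are assumed to be aligned (Z, 2) arrays of dense-point coordinates."""
    diff = first_pose_uv - second_pose_uv            # X_i - Y_i
    return float(np.sqrt(np.sum(diff ** 2)))

def select_first_reference(first_pose_uv, reference_pose_uvs):
    """Pick the reference image whose pose UV map is closest to (most similar to)
    the first target pose; reference_pose_uvs is a list of (Z, 2) arrays."""
    distances = [euclidean_pose_distance(first_pose_uv, ref) for ref in reference_pose_uvs]
    return int(np.argmin(distances))                 # smallest distance = largest similarity
```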
In some embodiments, as still shown in fig. 8, the mobile phone uses the second pose UV map of the first person and the second image UV map of the first person in the first reference image to perform pose migration on the first pose UV map of the first person and the first image UV map of the first person in the first target image, so as to obtain the third pose UV map of the first person and the third image UV map of the first person.
The mobile phone may input the second pose UV map of the first person and the second image UV map of the first person in the first reference image, and the first pose UV map of the first person and the first image UV map of the first person in the first target image into the pose migration model, to obtain the third pose UV map of the first person and the third image UV map of the first person. The gesture migration model is obtained based on synthetic network training. The synthetic network may be, for example, VGGNet, resNet, convolutional neural network (convolutional neural network, CNN) or recurrent neural network (recurrent neural network, RNN), which embodiments of the present application are not limited to.
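A minimal inference sketch of this step is shown below; the channel-stacking input convention and the assumption that the pose migration model returns the third pose UV map and the third image UV map as a pair are illustrative and not prescribed by the embodiment.

```python
import torch

def pose_migration(migration_model,
                   first_pose_uv, first_image_uv,     # from the first target image
                   second_pose_uv, second_image_uv):  # from the first reference image
    """Stack the four UV maps as input channels (an assumed convention) and let the
    migration model produce the third pose UV map and third image UV map."""
    inputs = torch.cat([first_pose_uv, first_image_uv,
                        second_pose_uv, second_image_uv], dim=1)   # [1, C, H, W]
    with torch.no_grad():
        third_pose_uv, third_image_uv = migration_model(inputs)
    return third_pose_uv, third_image_uv
```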
In some embodiments, the synthesis network is trained to obtain the gesture migration model based on input data and output data. The input data are the person pose UV map and the person image UV map in the selfie image, and the output data are the person pose UV map and the person image UV map in the other-shot image. In the embodiment of the application, when the synthesis network is trained to obtain the gesture migration model, training may use the Adam optimization function with a learning rate (Lr) of 0.001, the learning rate being reduced by a factor of 10 every 20 rounds (epochs), for 60 epochs in total, on an NVIDIA RTX A5000 graphics card. One epoch represents one complete pass of the data through the synthesis network and back.
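A training-schedule sketch matching the above description (Adam, Lr = 0.001, learning rate divided by 10 every 20 epochs, 60 epochs) is given below; the data-loader contents and the loss function handle are assumptions.

```python
import torch
from torch import optim

def train_pose_migration(model, pair_loader, loss_fn, device="cuda", epochs=60):
    """pair_loader is assumed to yield (selfie UV maps, other-shot UV maps) pairs."""
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    for _ in range(epochs):
        for src_uv, tgt_uv in pair_loader:              # input: selfie-image UV maps
            pred_uv = model(src_uv.to(device))           # output: predicted other-shot UV maps
            loss = loss_fn(pred_uv, tgt_uv.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                 # one epoch = one full pass of the data
    return model
```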
Here, the selfie image is an image captured while the photographer holds the selfie stick, and the selfie image includes part of the selfie stick; the other-shot image is an image captured while the photographer is not holding the selfie stick, and the other-shot image does not include the selfie stick. In the embodiment of the application, the pose of the person in the selfie image is one hand holding the selfie stick and the other hand on the waist; the pose of the person in the other-shot image is one hand making a 'Yeah' sign and the other hand on the waist.
For example, the gesture migration dataset (including the selfie images and the other-shot images) may be obtained by capturing images of different persons in different scenes with different types of selfie sticks, based on a mobile electronic device (e.g., a mobile phone). A selfie image and an other-shot image form a data pair; in each data pair, the scene, the person, and the type and length of the selfie stick are the same, while the differences are that the selfie image includes the selfie stick, the other-shot image does not include the selfie stick, and the person's poses are different.
For example, a tripod and a selfie stick may be combined so that a selfie image and an other-shot image can be captured at the same angle and the same position. One possible collection procedure for the gesture migration dataset is as follows. Step 1: in the same scene, hold the selfie stick (switching among the left hand, the right hand and both hands) and change the gesture of the other hand (e.g., hand behind the back, making a 'Yeah' sign, showing five fingers, etc.) to acquire selfie images; then rotate the selfie stick to one side, keep the person's position unchanged, and change the state of both hands (e.g., both hands at the sides of the body, both hands on the waist, one hand on the waist and the other making a 'Yeah' sign, etc.) to acquire other-shot images. Step 2: in the same scene as step 1, change the length of the selfie stick and acquire selfie images and other-shot images in the manner of step 1. Step 3: change the scene and the type of the selfie stick, and acquire selfie images and other-shot images at different selfie-stick lengths in different scenes in the manner of steps 1 and 2. Step 4: change the person and acquire selfie images and other-shot images for different persons, different scenes and different selfie-stick lengths in the manner of steps 1 to 3. Combining steps 1 to 4 completes the collection of the whole gesture migration dataset.
Illustratively, as shown in FIG. 11, the synthesis network includes a downsampling encoder and an upsampling decoder. The input data of the synthesis network are the selfie data in the gesture migration dataset constructed above, and the output data are the other-shot data in the gesture migration dataset. In some embodiments, the encoder includes six convolution layers, with a 2×2 convolution kernel, 2×2 max pooling (maxpool), and a ReLU activation function; the numbers of channels of the encoder's convolution layers are 64, 128, 256, 512 and 1024, respectively, and the third convolution layer of the encoder is skip-connected (skip connection) to the fifth convolution layer. The input of the decoder is the output of the encoder; the decoder includes five convolution layers, with a 2×2 convolution kernel and a ReLU activation function, and the numbers of channels of the decoder's convolution layers are 512, 256, 128, 64 and 2, respectively. The output of the decoder is the person pose UV map and the person image UV map of the other-shot image.
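A rough sketch of such a synthesis network is given below. Several details are assumptions made only so that the sketch runs: 3×3 convolutions with padding are used in place of the 2×2 kernels, the skip connection is realized by a 1×1 projection plus pooling, the sixth encoder width, the input channel count, bilinear upsampling between decoder stages, and the final resize to the input resolution are all illustrative choices rather than the embodiment's exact design.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SynthesisNet(nn.Module):
    """Sketch of a downsampling encoder / upsampling decoder with a skip connection
    from the third to the fifth encoder stage."""
    def __init__(self, in_channels=5, out_channels=2):
        # in_channels=5 assumes a stacked pose UV map (2) + image UV map (3) as input
        super().__init__()
        enc_widths = [64, 128, 256, 512, 1024, 1024]     # sixth width is an assumption
        self.enc = nn.ModuleList()
        c = in_channels
        for w in enc_widths:
            self.enc.append(nn.Sequential(
                nn.Conv2d(c, w, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2)))
            c = w
        # projects the stage-3 features (256 ch) to the stage-5 width (1024 ch) for the skip
        self.skip_proj = nn.Conv2d(enc_widths[2], enc_widths[4], 1)
        dec_widths = [512, 256, 128, 64, out_channels]
        self.dec = nn.ModuleList()
        for w in dec_widths:
            self.dec.append(nn.Sequential(
                nn.Conv2d(c, w, 3, padding=1), nn.ReLU(inplace=True)))
            c = w

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for stage in self.enc[:5]:
            x = stage(x)
            feats.append(x)                              # outputs of encoder stages 1..5
        # skip connection: stage-3 features merged into the stage-5 output
        skip = self.skip_proj(F.adaptive_avg_pool2d(feats[2], feats[4].shape[-2:]))
        x = self.enc[5](feats[4] + skip)                 # sixth encoder stage
        for stage in self.dec:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = stage(x)
        # resize back to the input resolution (assumed final step)
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
```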
When the synthesis network is VGGNet, the synthesis network extracts the features output by the third, fourth and fifth layers as a globally unique identifier (globally unique identifier, GUID) of the captured image and inputs them into the sixth layer of the encoder.
In some embodiments, a loss function may be configured in the synthesis network for evaluating the degree of inconsistency between the predicted value computed by the synthesis network for a single sample and the true value. The loss function is a non-negative real-valued function; during model training, the smaller the error, the smaller the value of the loss function and the faster the model converges. The value of the loss function directly affects the prediction performance of the model: the smaller the value of the loss function, the better the prediction performance of the model.
Illustratively, the loss functions configured in the synthesis network comprise an identity loss function Lidt, a reconstruction loss function L1, and a perceptual loss function Lp; their specific expressions are given in the corresponding formula figures and are not reproduced here. In these formulas, Psrc represents the person pose UV map in the selfie image, Ptar represents the person pose UV map in the other-shot image, Isrc represents the person image UV map in the selfie image, and Itar represents the person image UV map in the other-shot image; the perceptual loss uses a combined feature formed from the outputs of the fourth, eighth and 27th convolution layers.
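Since the exact loss expressions are given only as formula figures, the sketch below shows commonly used forms of an identity loss, an L1 reconstruction loss and a perceptual loss over the variables named above; the specific forms, the feature extractor and the weights are assumptions and are not the formulas of the embodiment.

```python
import torch.nn.functional as F

def reconstruction_loss(pred_itar, itar):
    # L1: pixel-wise reconstruction loss against the other-shot image UV map Itar (assumed form)
    return F.l1_loss(pred_itar, itar)

def perceptual_loss(feature_extractor, pred_itar, itar):
    # Lp: distance between combined feature maps of the prediction and of Itar; the
    # feature_extractor (e.g., a few intermediate conv-layer outputs) is an assumption
    loss = 0.0
    for f_pred, f_real in zip(feature_extractor(pred_itar), feature_extractor(itar)):
        loss = loss + F.l1_loss(f_pred, f_real)
    return loss

def identity_loss(pred_identity, isrc):
    # Lidt: when the target pose equals the source pose Psrc, the prediction should
    # reproduce the source image UV map Isrc (assumed form)
    return F.l1_loss(pred_identity, isrc)

def total_loss(pred_itar, itar, pred_identity, isrc, feature_extractor,
               w_l1=1.0, w_p=1.0, w_idt=1.0):
    # Weighted sum of the three terms; the weights are assumptions
    return (w_l1 * reconstruction_loss(pred_itar, itar)
            + w_p * perceptual_loss(feature_extractor, pred_itar, itar)
            + w_idt * identity_loss(pred_identity, isrc))
```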
In this embodiment of the present application, the electronic device 100 (such as a mobile phone) described in the embodiment of the present application may be used to train the semantic segmentation model and the gesture migration model, or other devices, servers, etc. with a model training function may also be used to train the semantic segmentation model and the gesture migration model, which is not limited in this embodiment of the present application.
In combination with the above embodiment, as shown in fig. 8, the mobile phone performs the second pose estimation on the third pose UV map of the first person and the third image UV map of the first person, to obtain the third image of the first person. Illustratively, the second pose estimation is configured to perform a reverse calculation based on the pose UV map and the image UV map, and output a third image of the first person.
In order to ensure a better match between the image output after gesture migration and the first target image, in some embodiments the mobile phone may further fuse the third image of the first person with the background image in the first target image to obtain the second target image. It should be appreciated that the selfie stick is not included in the second target image and that the second target pose of the first person in the second target image is different from the first target pose of the first person in the first target image. For example, the first target pose is the first person holding the selfie stick with one hand and placing the other hand on the waist; the second target pose is the first person making a 'Yeah' sign with one hand and placing the other hand on the waist. In this way, by adopting the image processing method of the embodiment of the application, the problems of an empty hand area and an unnatural selfie pose after the selfie stick is removed can be solved at the same time.
Further, in order to increase the processing speed of the algorithm and reduce the power consumption of the device, in some embodiments, in conjunction with fig. 8, after the mobile phone performs object segmentation on the first target image, the mobile phone separates out the head image of the first person in the first target image, so that the first image of the first person identified by the mobile phone in the first target image does not include the head image. Correspondingly, when the mobile phone performs the first pose estimation on the first image of the first person, the resulting first pose UV map of the first person and first image UV map of the first person in the first target image do not include the head image; subsequently, the first pose UV map and the first image UV map of the first person that the mobile phone inputs into the gesture migration model do not include the head image. Thus, the third image of the first person finally obtained by the mobile phone is an image that does not include the head of the first person.
On this basis, as shown in fig. 12, the mobile phone may perform fusion processing on the third image of the first person, the background image in the first target image, and the head image of the first person, thereby obtaining the second target image.
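A minimal compositing sketch of this fusion step is given below; the mask-based pasting and the assumption that the removed selfie-stick region is filled separately (e.g., by inpainting) are illustrative simplifications rather than the embodiment's exact fusion method.

```python
import numpy as np

def fuse_second_target_image(third_person_image, person_mask,
                             first_target_image, stick_mask,
                             head_image=None, head_mask=None):
    """Paste the pose-migrated person (and, if it was separated, the original head)
    back onto the background of the first target image."""
    result = first_target_image.copy()
    result[stick_mask] = 0                       # selfie-stick pixels removed (to be filled in)
    result[person_mask] = third_person_image[person_mask]
    if head_image is not None and head_mask is not None:
        result[head_mask] = head_image[head_mask]
    return result
```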
It should be noted that the above embodiment is described by taking as an example the case where, after the mobile phone captures the first target image in the photographing mode of the camera application, it performs gesture migration on the first target image by using the first reference image to obtain the second target image. In the embodiment of the application, the mobile phone can also perform gesture migration on a first target image already stored in the mobile phone by adopting the first reference image to obtain the second target image.
For example, the mobile phone may perform gesture migration on an image stored in a gallery application or an image received in an instant messaging application; or, the mobile phone may further perform gesture migration on images acquired in other scenes, which is not limited in the embodiment of the present application.
Taking the case where the mobile phone performs gesture migration on an image stored in the gallery application as an example, as shown in (1) in fig. 13, the mobile phone displays an interface 410, where the interface 410 is an interface of the gallery application of the mobile phone displaying a first target image, and the interface 410 further includes a preset editing control. In response to an operation of the editing control in the interface 410, the mobile phone displays an interface 420 (or second interface) as shown in fig. 13 (2), which includes a plurality of gesture selection items, such as a default gesture selection item, a pose 1 selection item, a pose 2 selection item, and a custom gesture selection item. For the illustration of the gesture selection items, reference may be made to the above embodiments, and details are not repeated here.
Based on this, in response to the selection operation on the first gesture selection item (such as pose 1), the mobile phone performs gesture migration on the first target image by using the target gesture corresponding to pose 1, and displays an interface 430 as shown in (3) in fig. 13, where the interface 430 includes the second target image. In some embodiments, the interface 430 further includes a "save" control and a "cancel" control; in response to the operation of the "save" control, the mobile phone saves the second target image in the gallery application of the mobile phone, and in response to the operation of the "cancel" control, the mobile phone saves the first target image in the gallery application of the mobile phone.
In some embodiments, in conjunction with the interface schematic diagram shown in fig. 13, as shown in fig. 14, the mobile phone initiates a selfie-stick removal function in response to an operation on the preset editing control; then, when the mobile phone recognizes that the first target image includes the selfie stick, the mobile phone displays a plurality of gesture selection items. After the user selects a first gesture selection item among the plurality of gesture selection items, the mobile phone adopts the target gesture corresponding to the first gesture selection item to carry out gesture migration on the first target image so as to obtain a second target image.
Further, when the mobile phone recognizes that the first target image does not include the selfie stick, the mobile phone can prompt the user to select the image with the selfie stick through the second prompt information. The second prompt information may be a voice prompt information or a text prompt information.
It should be noted that, the specific implementation process of performing gesture migration on the first target image by using the target gesture corresponding to the first gesture selection item to obtain the second target image may be described with reference to fig. 8 and the above embodiments, which are not described in detail herein.
In addition, the above embodiment is illustrated with the mobile phone displaying a plurality of gesture selection items and the user selecting the target gesture corresponding to the first gesture selection item from the plurality of gesture selection items to perform gesture migration on the first target image. Correspondingly, in actual implementation, the mobile phone may also receive semantic information input by the user to perform gesture migration on the first target image, for example, semantic information such as "remove the selfie stick and put both hands on the waist". In this case, the specific implementation may refer to the illustration in the foregoing embodiments, which is not described herein in detail.
Further, in the embodiment of the present application, the first object is taken as an example of the selfie stick, and of course, the first object may also be a book, a newspaper, a water cup, etc., and these objects may also affect the image effect of the selfie image. On the basis, the technical solutions in other embodiments of the present application can be explained and illustrated by the descriptions of the various embodiments of the present application, the technical features described in the various embodiments can also be applied in other embodiments, and a new solution can be formed by combining the technical features of other embodiments, and the present application is only illustrative of several embodiments and not meant to be limiting.
An electronic device may include a display screen, a memory, and one or more processors; the display screen is used for displaying images acquired by the cameras or images generated by the processor; the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the functions or steps performed by the handset in the above-described embodiments. The structure of the electronic device may refer to the structure of the electronic device 100 shown in fig. 3.
Embodiments of the present application also provide a chip system, as shown in fig. 15, the chip system 1800 includes at least one processor 1801 and at least one interface circuit 1802. The processor 1801 may be the processor 110 shown in fig. 3 in the above embodiment. Interface circuit 1802 may be, for example, an interface circuit between processor 110 and an external memory; or as an interface circuit between the processor 110 and the internal memory 121.
The processor 1801 and interface circuit 1802 described above may be interconnected by wires. For example, interface circuit 1802 may be used to receive signals from other devices (e.g., a memory of an electronic apparatus). For another example, interface circuit 1802 may be used to send signals to other devices (e.g., processor 1801). The interface circuit 1802 may, for example, read instructions stored in a memory and send the instructions to the processor 1801. The instructions, when executed by the processor 1801, may cause the electronic device to perform the steps performed by the handset in the above embodiments. Of course, the chip system may also include other discrete devices, which are not specifically limited in this embodiment of the present application.
The embodiment of the application also provides a computer readable storage medium, which comprises computer instructions, when the computer instructions run on the electronic device, the electronic device is caused to execute the functions or steps executed by the mobile phone in the embodiment of the method.
The present application also provides a computer program product, which when run on a computer, causes the computer to perform the functions or steps performed by the mobile phone in the above-mentioned method embodiments.
It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. The image processing method is characterized by being applied to electronic equipment, wherein M reference images are stored in the electronic equipment, the M reference images comprise at least one image of a plurality of gestures of a person, the M reference images are not images shot by a corresponding person holding a first object, and M is an integer larger than 1; the method comprises the following steps:
the electronic equipment displays a first target image; wherein the first target image includes an image of a first person holding the first object in a first target pose;
the electronic device selects a first reference image from the M reference images; wherein the first reference image includes an image of a corresponding person in a second target pose, the second target pose being different from the first target pose;
the electronic equipment adopts the first reference image to carry out gesture migration on the first target image so as to obtain a second target image; the second target image comprises an image of the first person in the second target posture, and the second target image does not comprise an image of the first person holding the first object.
2. The method of claim 1, wherein the first object comprises a selfie stick.
3. The method according to claim 1 or 2, wherein the second target gesture is a default gesture preset in the electronic device; or,
the electronic device selecting a first reference image from the M reference images, comprising:
the electronic equipment displays a first interface; the first interface comprises a plurality of gesture options, each gesture option in the plurality of gesture options corresponding to a target gesture;
the electronic device responds to the operation of a first gesture selection item in the gesture selection items, and a reference image in a target gesture corresponding to the first gesture selection item is selected from the M reference images to serve as the first reference image;
wherein the first gesture selection item corresponds to the target gesture as the second target gesture.
4. A method according to claim 3, wherein the M reference pictures comprise: n reference images, wherein the N reference images are the reference images under the corresponding target gestures of the first gesture selection item, N is an integer greater than 1, and N is less than or equal to M;
The selecting the reference image under the target gesture corresponding to the first gesture selection item from the M reference images as the first reference image includes:
and the electronic equipment selects a reference image with the maximum similarity between the target gesture and the first target gesture from the N reference images as the first reference image.
5. A method according to claim 3, wherein the first gesture selection item is for customizing a target gesture;
the selecting the reference image under the target gesture corresponding to the first gesture selection item from the M reference images as the first reference image includes:
the electronic equipment displays a second interface; the second interface is used for indicating a user to draw a target gesture;
and the electronic equipment receives the target gesture drawn by the user on the second interface, and selects a reference image with the maximum similarity between the target gesture and the drawn target gesture from the M reference images as the first reference image.
6. The method of any one of claims 1, 2, 4, 5, wherein the first target image is acquired by the electronic device in response to a photographing instruction;
Wherein the electronic device selects a first reference image from the M reference images, comprising:
and responding to the shooting instruction, and if the electronic equipment determines that the first target image comprises an image of the first person holding a selfie stick in the first target gesture, selecting the first reference image from the M reference images.
7. The method of any of claims 1, 2, 4, 5, wherein the electronic device displaying the first target image comprises:
the electronic equipment displays the first target image in a gallery application or displays the first target image in an instant messaging application;
wherein the electronic device selects a first reference image from the M reference images, comprising:
and responding to the preset editing operation of the user on the first target image, and selecting the first reference image from the M reference images by the electronic equipment.
8. The method of any of claims 1, 2, 4, 5, wherein the electronic device performing pose migration on the first target image using the first reference image to obtain a second target image, comprising:
The electronic equipment performs object segmentation on the first target image to remove the image of the first object in the first target image and identify the first image of the first person in the first target image;
the electronic equipment performs image segmentation on the first reference image to obtain a second image of the first person in the first reference image;
the electronic equipment carries out first gesture estimation on a first image of the first person and a second image of the first person to obtain a first gesture UV image of the first person, a first image UV image of the first person, a second gesture UV image of the first person and a second image UV image of the first person;
the electronic equipment adopts a second gesture UV image of the first person and a second image UV image of the first person to carry out gesture migration on the first gesture UV image of the first person and the first image UV image of the first person, so as to obtain a third gesture UV image of the first person and a third image UV image of the first person;
the electronic equipment carries out second pose estimation on a third pose UV image of the first person and a third image UV image of the first person to obtain a third image of the first person;
And the electronic equipment performs fusion processing on the third image of the first person and the background image in the first target image to obtain the second target image.
9. An electronic device, comprising: a display screen, a memory, and one or more processors; the display screen, the memory, and the processor are coupled;
the display screen is used for displaying the image generated by the processor; the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-8.
10. A computer-readable storage medium comprising computer instructions; the computer instructions, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-8.