CN113298858A - Method, device, terminal and storage medium for generating action of virtual image - Google Patents

Method, device, terminal and storage medium for generating action of virtual image

Info

Publication number
CN113298858A
CN113298858A CN202110556095.5A
Authority
CN
China
Prior art keywords
virtual image
avatar
image
texture information
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110556095.5A
Other languages
Chinese (zh)
Inventor
韩欣彤
黄志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202110556095.5A priority Critical patent/CN113298858A/en
Publication of CN113298858A publication Critical patent/CN113298858A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • G06T7/41Analysis of texture based on statistical description of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Abstract

The application discloses a method, a device, a terminal and a storage medium for generating actions of an avatar. The action generation method includes: acquiring at least one picture containing a target avatar; extracting texture information of the target avatar from each picture; fusing the texture information acquired from the different pictures to obtain complete texture information of the target avatar; acquiring an image including a human motion and obtaining joint point information of the human motion from the image; and performing fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion to form an action image of the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, and enriches the diversity of avatars.

Description

Method, device, terminal and storage medium for generating action of virtual image
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for generating an action of an avatar.
Background
With the development of network technology and live broadcast industry, virtual anchor is becoming more and more popular as a new live broadcast form.
In the prior art, driving a human body posture requires an artist to manually build a 3D human model: for example, a person is filmed with a mobile phone or two cameras in one or more full circles, or photographed from multiple angles, to obtain more information points, and these information points are then fused with key points of an avatar to generate a virtual anchor.
However, this approach makes designing a new avatar costly, requires training on a large number of pictures of a specific target person, and can only drive that specific person, which greatly limits the variety of avatars.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a method, an apparatus, a terminal and a storage medium for generating an action of an avatar, which can solve the problems of high design cost and limited diversity of avatars.
In order to solve the above technical problem, a first technical solution adopted by the present application is to provide a method for generating an action of an avatar, the method including: acquiring at least one picture containing a target virtual image; respectively extracting texture information of a target virtual image in each picture; fusing texture information acquired from different pictures to obtain complete texture information of a target virtual image; acquiring an image comprising human body movement, and acquiring joint point information of the human body movement from the image; and performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image.
Wherein, after the step of acquiring an image including a human motion and acquiring joint point information of the human motion from the image, the method further comprises: forming a first map of the human motion based on the joint point information; and the step of performing fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion to form an action image of the target avatar comprises: rendering the complete texture information of the target avatar into the first map to obtain the action image of the target avatar.
The joint point information comprises the position relation among all joint points; the method comprises the following steps of performing fusion rendering on the complete texture information of the target virtual image and the joint point information of human body action to form an action image of the target virtual image, wherein the steps comprise: adjusting the position relationship of each joint point of the target virtual image in the complete texture information to enable the position relationship to correspond to the position relationship between each joint point in the human body action; and rendering each joint point of the adjusted target virtual image based on the complete texture information to obtain an action image of the target virtual image.
Wherein the step of acquiring an image including a human motion and acquiring joint point information of the human motion from the image further comprises: acquiring a distribution of the joint points of the human body based on the image including the human motion, and extracting a feature of each joint point based on the distribution so as to acquire the positional relationship between the joint points from the features of the joint points.
The step of forming a first map of the human body action based on the joint point information specifically comprises the following steps: and acquiring body shape contour information of the virtual image based on the picture containing the target virtual image, and fusing the body shape contour information and the joint point information through a neural network to form a first map.
Wherein the avatar in the action image includes a mask to blend with different backgrounds.
The method for acquiring the image including the human body motion and acquiring the joint point information of the human body motion from the image comprises the following steps: acquiring a video comprising human body actions, and respectively acquiring each frame of picture of the video; sequentially extracting joint point information of human body actions in each frame of picture; the step of fusing and rendering the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image further comprises the following steps: and respectively performing fusion rendering on the complete texture information of the target virtual image and the joint point information of each frame of picture to obtain multi-frame rendered action images, and synthesizing the multi-frame rendered action images into a video according to a time sequence.
The step of respectively extracting the texture information of the target virtual image in each picture specifically comprises the following steps: extracting the texture of the virtual image from the picture, extracting texture features from the texture, and averaging the values of the texture features; the method comprises the following steps of fusing texture information acquired from different pictures to obtain complete texture information of a target virtual image, and specifically comprises the following steps: and fusing the texture features according to the average value to obtain the fused complete texture information.
Wherein, in the step of fusing the texture information obtained from different pictures to obtain the complete texture information of the target virtual image, the method further comprises the following steps: responding to the lack of texture information of part of the virtual image in different pictures, and completing the lack of texture information in the fusion process.
The method comprises the following steps of fusing texture information acquired from different pictures to obtain complete texture information of a target virtual image, and specifically comprises the following steps: and fusing the texture information acquired from different pictures through the fusion model to obtain the complete texture information of the target virtual image.
Before the step of fusing texture information acquired from different pictures through the fusion model to obtain complete texture information of the target virtual image, the method further comprises the following steps of: inputting training data in a first training set into a first preset deep learning model for training to obtain a first model; inputting verification data in the first verification set into the first model for fusion, and adjusting parameters of the first model based on a fusion result to obtain an adjusted first model; and inputting the test data in the first test set into the adjusted first model for fusion, and evaluating the scoring result of the adjusted first model based on the fusion result to obtain the fusion model.
In order to solve the above technical problem, a second technical solution adopted by the present application is to provide an avatar motion generating apparatus, including: the first acquisition module is used for acquiring at least one picture containing a target virtual image; the texture extraction module is used for respectively extracting the texture information of the target virtual image in each picture; the texture fusion module is used for fusing the texture information acquired from different pictures to obtain the complete texture information of the target virtual image; the second acquisition module is used for acquiring an image comprising human body actions and acquiring joint point information of the human body actions from the image; and the fusion rendering module is used for performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image.
In order to solve the above technical problem, a third technical solution adopted by the present application is to provide an action generating terminal of an avatar, the terminal including: a memory for storing program data which, when executed, implement the steps in the avatar motion generation method as described in any one of the above; a processor for executing program instructions stored in the memory to implement the steps in the avatar's motion generation method as described in any of the above.
In order to solve the above technical problem, a fourth technical solution adopted by the present application is to provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the action generating method of the avatar according to any one of the above items.
The beneficial effects of this application are as follows. Different from the prior art, the present application provides a method, an apparatus, a terminal and a storage medium for generating an action of an avatar: a small number of pictures containing a target avatar are obtained, texture information of the target avatar is extracted from each picture, the texture information is fused to obtain complete texture information of the target avatar, joint point information of a human motion is obtained from an image, and the complete texture information of the target avatar and the joint point information of the human motion are fused and rendered to form an action image containing the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, enriches the diversity of avatars, and further improves the interest of live broadcasts.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of an embodiment of an avatar action generation method according to the present application;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of an avatar movement generation method according to the present application;
FIG. 3 is a flowchart of an application scenario of the action generation method of the avatar of the present application;
FIG. 4 is a diagram of an application scenario of the action generation method of the avatar in FIG. 3;
FIG. 5 is a schematic diagram of an embodiment of an avatar motion generating apparatus according to the present application;
FIG. 6 is a schematic structural diagram of an embodiment of an avatar action generating terminal according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; the term "plural" generally means at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should be understood that the terms "comprises," "comprising," or any other variation thereof, as used herein, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the prior art, human body posture driving requires an art designer to manually establish a human body 3D model, for example, a mobile phone or two cameras are used to shoot a person around one circle or multiple times, or multiple angles of shooting are performed on the person to obtain more information points, and the information points are fused with key points in a virtual image to generate a virtual anchor. However, this approach results in a high cost for designing a new avatar on the one hand, and requires training with a large number of pictures of a specific target person, and only drives the specific target person, greatly limiting the variety of avatars.
Based on the above situation, the application provides a method, a device, a terminal and a storage medium for generating an avatar action, which are characterized in that after a small number of pictures containing a target avatar are obtained and texture information of the target avatar in each picture is extracted, the texture information is fused to obtain complete texture information of the target avatar, joint point information including human body action is obtained from the pictures, and the complete texture information of the target avatar and the joint point information of the human body action are fused and rendered to form an action image including the target avatar.
The present application will be described in detail below with reference to the drawings and embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of the avatar action generation method according to the present application. As shown in fig. 1, the action generation method of this embodiment includes:
S11: at least one picture containing the target avatar is obtained.
In the present embodiment, the picture including the target avatar refers to a picture including the face and part or all of the limbs of the target avatar.
In a preferred embodiment, the picture containing the target avatar is a picture including a whole-body image of the target avatar.
The target avatar may be a real person, a cartoon character, an anthropomorphic cartoon animal, or the like, which is not limited in the present application.
S12: and respectively extracting the texture information of the target virtual image in each picture.
In the present embodiment, the texture information includes information of the face, the limbs, the trunk, and the clothes of the target avatar.
In this embodiment, the texture information of the target avatar is extracted from each picture through an existing texture extraction network. The texture information extracted from any single picture covers only some positions of the target avatar and is therefore incomplete.
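As an illustration of this extraction step, the sketch below assumes a DensePose-style predictor that outputs, per pixel, a body-part index and (u, v) texture coordinates; the predictor, the channel layout and the texture resolution are assumptions for illustration, not the patent's concrete network.

import numpy as np

def extract_partial_texture(image: np.ndarray, iuv: np.ndarray, tex_size: int = 256):
    """image: (H, W, 3) picture of the avatar; iuv: (H, W, 3) with channels (part_id, u, v), u/v in [0, 1]."""
    texture = np.zeros((tex_size, tex_size, 3), dtype=image.dtype)
    visible = np.zeros((tex_size, tex_size), dtype=bool)  # which texels were actually observed
    body = iuv[..., 0] > 0                                 # pixels that belong to the avatar
    us = np.clip((iuv[..., 1][body] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    vs = np.clip((iuv[..., 2][body] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    texture[vs, us] = image[body]                          # scatter picture colors into texture space
    visible[vs, us] = True
    return texture, visible                                # incomplete texture plus its visibility mask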
S13: and fusing the texture information acquired from different pictures to obtain the complete texture information of the target virtual image.
In this embodiment, when there is more than one image including the target avatar, texture information at different positions on the target avatar may be extracted from different images, and the texture information obtained from different images is fused by the fusion model to obtain complete texture information of the target avatar.
The fusion model comprises an encoding network and a decoding network. When the fusion model receives the texture information of the target avatar extracted from each picture, the encoding network extracts the corresponding texture features from the several pieces of incomplete texture information and the extracted texture features are averaged; after the decoding network obtains the average of the texture features, it fuses the texture features according to this average to obtain the fused complete texture information.
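A minimal PyTorch sketch of this encode-average-decode fusion is given below; the layer sizes, the 256x256 texture resolution and the overall architecture are illustrative assumptions rather than the patent's actual design.

import torch
import torch.nn as nn

class TextureFusionModel(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Encoding network: maps one partial texture map (3 x 256 x 256) to a feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Decoding network: maps the averaged feature back to a complete texture map.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 128 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (128, 32, 32)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, partial_textures: torch.Tensor) -> torch.Tensor:
        # partial_textures: (num_pictures, 3, 256, 256) partial textures of one avatar.
        feats = self.encoder(partial_textures)        # (num_pictures, feat_dim)
        fused = feats.mean(dim=0, keepdim=True)       # average the extracted texture features
        return self.decoder(fused)                    # (1, 3, 256, 256) complete texture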
Further, in order to obtain the complete texture information of the target avatar, in response to that different pictures lack the texture information of part of the avatar, the missing texture information is complemented in the fusion process.
Specifically, this embodiment first trains a first preset deep learning model with a set amount of texture information from a data set to construct the fusion model. After this pre-training, the fusion model has a certain associative ability and can automatically complete the missing texture information from the partially extracted textures.
The process of constructing the fusion model is as follows: inputting training data in a first training set into a first preset deep learning model for training to obtain a first model; inputting verification data in the first verification set into the first model for fusion, and adjusting parameters of the first model based on a fusion result to obtain an adjusted first model; and inputting the test data in the first test set into the adjusted first model for fusion, and evaluating the scoring result of the adjusted first model based on the fusion result to construct a final fusion model.
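The three phases above (training, validation with parameter adjustment, and test-set scoring) might be organized roughly as in the following sketch, which reuses the TextureFusionModel from the earlier example; the L1 loss, the Adam optimizer and the way the validation result adjusts the model are assumptions for illustration.

import torch
import torch.nn.functional as F

def build_fusion_model(train_set, val_set, test_set, epochs: int = 10, lr: float = 1e-4):
    """Each set is a list of (partial_textures, full_texture) pairs with shapes
    (num_pictures, 3, 256, 256) and (1, 3, 256, 256)."""
    model = TextureFusionModel()
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    # 1) Train the first preset deep learning model on the first training set.
    for _ in range(epochs):
        for partials, full_texture in train_set:
            opt.zero_grad()
            F.l1_loss(model(partials), full_texture).backward()
            opt.step()

    # 2) Fuse the validation data and adjust parameters based on the fusion result
    #    (here: continue training with a smaller learning rate).
    for g in opt.param_groups:
        g["lr"] = lr * 0.1
    for partials, full_texture in val_set:
        opt.zero_grad()
        F.l1_loss(model(partials), full_texture).backward()
        opt.step()

    # 3) Fuse the test data and evaluate a score for the adjusted model.
    with torch.no_grad():
        score = sum(float(F.l1_loss(model(p), t)) for p, t in test_set) / max(len(test_set), 1)
    return model, score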
The following describes how the embodiment fuses texture information extracted from different pictures to obtain complete texture information of a target avatar by combining with a specific scene.
In a specific implementation scenario, if two pictures containing the target avatar are obtained, one showing the front of the target avatar and the other showing its back, the facial expression, the front limb form and the front clothing features of the target avatar can be extracted from the front picture, while the shape of the back of the head, the back limb form and the back clothing features can be extracted from the back picture. Fusing the texture information obtained from the two pictures restores the target avatar from multiple angles, so that relatively complete texture information of the target avatar can be obtained.
Because the obtained pictures may still have invisible textures, for example, the two pictures do not include texture information of the side of the target avatar, in order to obtain the complete texture information of the target avatar, the fusion model is required to complement the missing texture information in the fusion process, so as to obtain the complete texture information of the target avatar.
In this embodiment, the complete texture information further includes coordinates for two-dimensionally distributing the three-dimensional texture of the target avatar onto the two-dimensional canvas, so as to correspond to the spatial coordinates in the first map of the target avatar in the subsequent fusion rendering process.
S14: acquiring an image including the human body motion, and acquiring the joint point information of the human body motion from the image.
In the present embodiment, the joint point information includes the positional relationship between the joint points.
Specifically, a distribution of joint points of the human body is acquired based on an image including the motion of the human body, and a feature of each joint point is extracted based on the distribution of joint points to acquire a positional relationship between the joint points based on the feature of each joint point.
In this embodiment, the acquired image including the human motion is input into a pose detection model; the image is processed by a neural network, which outputs the joint point distribution of the human motion. The feature of each joint point is then extracted from this distribution, and the coordinates of the joint points are located based on the extracted features, so as to acquire the positional relationship between the joint points.
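The sketch below illustrates how joint coordinates and their pairwise positional relationships could be read out of a pose model's joint-point heatmaps; the heatmap representation and the specific pose detector are assumptions, since the patent does not name one.

import numpy as np

def joints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """heatmaps: (num_joints, H, W) joint-point distribution maps -> (num_joints, 2) pixel coordinates (x, y)."""
    num_joints, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(num_joints, -1).argmax(axis=1)   # peak of each joint's distribution
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1).astype(np.float32)

def joint_relations(coords: np.ndarray) -> np.ndarray:
    """Pairwise offsets between joint points: entry [i, j] is the vector from joint i to joint j."""
    return coords[None, :, :] - coords[:, None, :]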
The image including the human body motion may be obtained from a continuous dynamic picture or a video, which is not limited in the present application.
Because this embodiment acquires the joint point information directly from the picture, no motion capture system is needed, and no sensors need to be attached to the human body to obtain skeleton point information when the motion image is captured, which reduces the cost of motion acquisition.
S15: and performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image.
In this embodiment, a first map of the human body motion is formed based on the joint point information, and then the complete texture information of the target avatar is rendered into the first map to obtain a motion image of the target avatar.
Wherein the first map further comprises the body shape of the target avatar.
In this embodiment, the figure contour information of the avatar is obtained based on the picture containing the target avatar, and the figure contour information and the joint point information are fused by a generative neural network to form the first map.
Specifically, this embodiment first trains a second preset deep learning model with a set amount of body-shape contour data from a data set to construct the generative neural network. After pre-training, the generative neural network is able to fuse body-shape contour information with the joint point distribution: it can adjust the positional relationship of the joint points of the target avatar in the complete texture information so that it corresponds to the positional relationship between the joint points in the human motion, and then fuse the body-shape contour information and the joint point information to form the first map.
The process of constructing the generative neural network is as follows: inputting training data in a second training set into the second preset deep learning model for training to obtain a second model; inputting verification data in a second verification set into the second model for fusion, and adjusting parameters of the second model based on the fusion result to obtain an adjusted second model; and inputting test data in a second test set into the adjusted second model for fusion, and evaluating the scoring result of the adjusted second model based on the fusion result to construct the final generative neural network. A sketch of such a generator is given below.
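A minimal sketch of a generator that fuses a body-shape contour mask with joint-point heatmaps into the first map; representing the inputs as stacked image channels and the output as three channels (a body mask plus (u, v) texture coordinates per pixel) is an assumption chosen for illustration, not the patent's exact design.

import torch
import torch.nn as nn

class FirstMapGenerator(nn.Module):
    def __init__(self, num_joints: int = 18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + num_joints, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            # 3 output channels: a body mask plus (u, v) texture coordinates per pixel.
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, contour_mask: torch.Tensor, joint_heatmaps: torch.Tensor) -> torch.Tensor:
        # contour_mask: (B, 1, H, W); joint_heatmaps: (B, num_joints, H, W)
        return self.net(torch.cat([contour_mask, joint_heatmaps], dim=1))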
In the present embodiment, the first map is a UV map.
Specifically, a UV map is a flat 2D representation of the surface of a 3D (three-dimensional) model, used to hold its texture; U and V denote the horizontal and vertical axes of the 2D space, since X, Y and Z are already used for the 3D space. A UV map never contains 3D geometry itself, because it is always a 2D (two-dimensional) image. UV mapping is the process of unwrapping a 3D mesh into this 2D space so that a 2D texture can be wrapped back around the 3D model.
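As a small illustration of what the (u, v) coordinates are for, the sketch below samples the 2D texture image at given UV coordinates; nearest-neighbour sampling and the [0, 1] coordinate range are assumptions.

import numpy as np

def sample_texture(texture: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """texture: (H, W, 3) texture image; uv: (N, 2) coordinates in [0, 1] -> (N, 3) sampled colors."""
    h, w = texture.shape[:2]
    xs = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    ys = np.clip((uv[:, 1] * (h - 1)).round().astype(int), 0, h - 1)
    return texture[ys, xs]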
In this embodiment, by acquiring the figure contour information of the avatar and fusing it with the acquired joint point information through the generative neural network, a complete first map of the avatar's motion image can be generated.
Further, when the action image of the target avatar is constructed through the first map and the complete texture information, the position relationship of each joint point of the target avatar in the complete texture information needs to be adjusted to correspond to the position relationship between each joint point in the human body action, and then each joint point of the adjusted target avatar is rendered based on the complete texture information to obtain the action image of the target avatar.
Specifically, the adjusted complete texture information further includes specific coordinate information of each joint point of the target avatar, the first map also includes specific coordinate information of each joint point of the target avatar, and the rendering refers to copying the coordinate information of each joint point in the adjusted complete texture information onto the coordinate information of each corresponding joint point in the first map, so as to obtain the motion image of the target avatar.
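The rendering described above could be realized roughly as in the following sketch: for every pixel of the first map that belongs to the body, the (u, v) coordinate is looked up and the corresponding color is copied from the complete texture. The channel layout of the first map is an assumption carried over from the earlier sketches.

import numpy as np

def render_action_image(first_map: np.ndarray, full_texture: np.ndarray) -> np.ndarray:
    """first_map: (H, W, 3) with channels (body_mask, u, v); full_texture: (Ht, Wt, 3) complete texture."""
    h, w = first_map.shape[:2]
    th, tw = full_texture.shape[:2]
    out = np.zeros((h, w, 3), dtype=full_texture.dtype)
    body = first_map[..., 0] > 0.5                                     # pixels covered by the avatar
    us = np.clip((first_map[..., 1][body] * (tw - 1)).round().astype(int), 0, tw - 1)
    vs = np.clip((first_map[..., 2][body] * (th - 1)).round().astype(int), 0, th - 1)
    out[body] = full_texture[vs, us]                                   # copy texture colors onto the pose
    return out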
Wherein the avatar in the action image includes a mask to blend with different backgrounds.
Specifically, the background picture or the background video can be fused with the virtual image through the mask, so that the generated virtual image has any background, and the diversity of the virtual image and the interestingness of the virtual anchor are improved.
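A simple alpha-compositing sketch of fusing the masked avatar with an arbitrary background picture or video frame; the single-channel soft mask is an assumption about how the mask is represented.

import numpy as np

def composite(avatar: np.ndarray, mask: np.ndarray, background: np.ndarray) -> np.ndarray:
    """avatar, background: (H, W, 3) images; mask: (H, W) values in [0, 1], 1 where the avatar is."""
    alpha = mask[..., None].astype(np.float32)
    return (alpha * avatar + (1.0 - alpha) * background).astype(avatar.dtype)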
Different from the prior art, this embodiment obtains a small number of pictures containing the target avatar, extracts the texture information of the target avatar from each picture, fuses the texture information to obtain the complete texture information of the target avatar, obtains the joint point information of a human motion from an image, and performs fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion to form an action image containing the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, enriches the diversity of avatars, and further improves the interest of live broadcasts.
Referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of the avatar action generation method according to the present application. As shown in fig. 2, the action generation method of this embodiment includes:
S21: at least one picture containing the target avatar is obtained.
In the present embodiment, the picture including the target avatar refers to a picture including the face and part or all of the limbs of the target avatar.
S22: and respectively extracting the texture information of the target virtual image in each picture.
In the present embodiment, the texture information includes information of the face, the limbs, the trunk, and the clothes of the target avatar.
In this embodiment, the texture of the avatar is first extracted from the picture through the coding network, the texture features are extracted from the texture, and the values of the texture features are averaged to obtain specific texture information of each texture.
In other embodiments, in order to improve the variety of the avatar, the texture can be finely adjusted on the obtained picture containing the target avatar, and the adjustment can be performed in advance when the picture is obtained, so that the real-time performance of the subsequent avatar driving is not affected.
S23: and fusing the texture information acquired from different pictures to obtain the complete texture information of the target virtual image.
In this embodiment, when there is more than one image including the target avatar, texture information at different positions on the target avatar may be extracted from different images, and the texture information obtained from different images is fused by the fusion model to obtain complete texture information of the target avatar.
Further, in order to obtain the complete texture information of the target avatar, in response to that different pictures lack the texture information of part of the avatar, the missing texture information is complemented in the fusion process.
S24: acquiring a video comprising human body actions, and respectively acquiring each frame of picture of the video; and sequentially extracting the joint point information of the human body action in each frame of picture.
In the embodiment, the human body action is collected through the camera device to obtain the video including the human body action, one frame of picture including the human body action is obtained based on the video, and the joint point information of the human body is extracted from the picture including the human body action.
The camera equipment includes a monocular camera and can capture pictures from one side of the human body. Because this embodiment can collect the human motion from a single side with only a monocular camera, there is no need to shoot around the body or to set up multiple cameras, which further reduces the cost of motion acquisition.
S25: and respectively performing fusion rendering on the complete texture information of the target virtual image and the joint point information of each frame of picture to obtain multi-frame rendered action images, and synthesizing the multi-frame rendered action images into a video according to a time sequence.
In this embodiment, the frames of the video are ordered in time and the human motion is continuous; combining the rendered action images into a video in the same time order ensures that the motion sequence of the avatar is consistent with the human motion.
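A minimal sketch of synthesizing the per-frame rendered action images back into a video in time order, using OpenCV as an assumed toolchain; the codec, frame rate and output path are placeholders.

import cv2

def frames_to_video(frames, out_path: str = "avatar_action.mp4", fps: float = 25.0) -> None:
    """frames: list of (H, W, 3) BGR action images already sorted in time order."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()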
Different from the prior art, this embodiment obtains a small number of pictures containing the target avatar, extracts the texture information of the target avatar from each picture, fuses the texture information to obtain the complete texture information of the target avatar, obtains the joint point information of the human motion from a video, and performs fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion to form an action image containing the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, enriches the diversity of avatars, and further improves the interest of live broadcasts.
Referring to fig. 3 and 4, fig. 3 is a flowchart of an application scenario of the avatar action generation method of the present application, and fig. 4 is a schematic diagram of that application scenario. Before going live, the anchor can select a favorite avatar to broadcast with; specifically, at least one whole-body picture of the avatar is uploaded to the avatar action generation terminal. After acquiring at least one picture containing the target avatar, the terminal extracts the texture information of the target avatar from each picture and fuses the texture information obtained from the different pictures to obtain the complete texture information of the target avatar. The anchor performs different actions in front of a monocular camera, and the real-person anchor video is collected through the monocular camera. After the relevant device acquires the video, it processes the video, obtains each frame of the video, and sequentially extracts the joint point information of the human motion in each frame; meanwhile, the terminal acquires the body-shape contour information of the target avatar based on the picture containing the target avatar, and fuses the body-shape contour information with the joint point information through a built-in generative neural network to form a first map for each frame. The terminal then performs fusion rendering on the complete texture information of the target avatar and the first map of each frame to obtain multiple rendered action images, and finally synthesizes the rendered frames into a video in time order, obtaining a virtual anchor with the target avatar's appearance and the real anchor's motion. Furthermore, the avatar in the action image also includes a mask and can be fused with different backgrounds, so that the generated virtual anchor can have any background, which improves the diversity and interest of the virtual anchor.
Correspondingly, the application provides an action generating device of the virtual image.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of an avatar motion generating device according to the present application. As shown in fig. 5, in the present embodiment, the avatar motion generating apparatus 50 includes a first obtaining module 51, a texture extracting module 52, a texture fusion module 53, a second obtaining module 54, and a fusion rendering module 55.
The first obtaining module 51 is used for obtaining at least one picture containing the target avatar.
The texture extraction module 52 is used to extract the texture information of the target avatar in each picture.
The process of extracting please refer to the related text description in step S12, which is not described herein again.
The texture fusion module 53 is configured to fuse the texture information obtained from different pictures to obtain complete texture information of the target avatar.
The process of fusion please refer to the related text description in step S13, which is not described herein again.
The second obtaining module 54 is configured to obtain an image including the motion of the human body, and obtain joint point information of the motion of the human body from the image.
Please refer to the relevant text descriptions in step S14 and step S24 for the process of obtaining, which is not described herein again.
The fusion rendering module 55 is configured to perform fusion rendering on the complete texture information of the target avatar and the joint point information of the human body motion to form a motion image of the target avatar.
The process of fusion rendering please refer to the related text descriptions in step S15 and step S25, which are not described herein again.
Different from the prior art, the apparatus obtains a small number of pictures containing the target avatar through the first obtaining module, extracts the texture information of the target avatar from each picture through the texture extraction module, fuses the texture information through the texture fusion module to obtain the complete texture information of the target avatar, obtains the joint point information of the human motion from a video through the second obtaining module, and finally performs fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion through the fusion rendering module to form an action image containing the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, enriches the diversity of avatars, and further improves the interest of live broadcasts.
Correspondingly, the application provides an action generating terminal of the virtual image.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of an action generating terminal of an avatar according to the present application. As shown in fig. 6, in the present embodiment, the action generating terminal 60 of the avatar includes a memory 61 and a processor 62 coupled to each other.
In this embodiment, the memory 61 is used for storing program data which, when executed, implement the steps of the avatar action generation method of any of the above embodiments; the processor 62 is configured to execute the program instructions stored in the memory 61 to implement the steps of any of the above embodiments, or the steps correspondingly executed by the avatar action generating apparatus in any of the above embodiments. Besides the processor 62 and the memory 61, the avatar action generating terminal 60 may further include a touch screen, a communication circuit, and the like, as required, which is not limited herein.
Specifically, the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the avatar motion generation method embodiments described above. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The Processor 62 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 62 may be commonly implemented by a plurality of integrated circuit chips.
Different from the prior art, the present application provides an avatar action generating terminal, which obtains a small number of pictures containing the target avatar, extracts the texture information of the target avatar from each picture, fuses the texture information to obtain the complete texture information of the target avatar, obtains the joint point information of the human motion from a video, and performs fusion rendering on the complete texture information of the target avatar and the joint point information of the human motion to form an action image containing the target avatar. In this way, only a single picture (or a small number of pictures) of the target avatar needs to be provided to drive the target avatar to make a preset posture and action, which avoids the cost of manual modeling, lowers the threshold for anchor live streaming and short-video production, enriches the diversity of avatars, and further improves the interest of live broadcasts.
Accordingly, the present application provides a computer-readable storage medium.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.
The computer-readable storage medium 70 comprises a computer program 701 stored on the computer-readable storage medium 70, and the computer program 701, when executed by the processor, implements the steps of any of the above method embodiments or the steps correspondingly executed by the related apparatus in the above method embodiments.
In particular, the integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium 70. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a computer-readable storage medium 70 and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned computer-readable storage medium 70 includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (14)

1. An avatar motion generating method, comprising:
acquiring at least one picture containing the target avatar;
respectively extracting texture information of the target virtual image in each picture;
fusing the texture information acquired from different pictures to obtain complete texture information of the target virtual image;
acquiring an image comprising human body movement, and acquiring joint point information of the human body movement from the image;
and performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image.
2. The avatar's motion generating method of claim 1, wherein said step of acquiring an image including a human motion and acquiring joint point information of the human motion from the image, further comprises:
forming a first map of the human body action based on the joint point information;
the step of performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image comprises the following steps:
and rendering the complete texture information of the target virtual image into the first map to obtain an action image of the target virtual image.
3. The action generating method of an avatar according to claim 1 or 2, wherein said joint point information includes a positional relationship between the joint points;
the step of performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image comprises the following steps:
adjusting the position relationship of each joint point of the target virtual image in the complete texture information to enable the position relationship to correspond to the position relationship between each joint point in the human body action;
rendering each joint point of the adjusted target virtual image based on the complete texture information to obtain an action image of the target virtual image.
4. The avatar's motion generating method of claim 3, wherein said step of acquiring an image including a human motion and acquiring joint point information of the human motion from the image, further comprises:
acquiring the distribution of the joint points of the human body based on the image including the motion of the human body, and extracting the feature of each joint point based on the distribution of the joint points to acquire the position relationship between the joint points based on the feature of each joint point.
5. The avatar motion generating method according to claim 3, wherein said step of forming a first map of said body motion based on said joint point information comprises:
and acquiring body shape contour information of the virtual image based on the picture containing the target virtual image, and fusing the body shape contour information and the joint point information through a neural network to form the first map.
6. The action generating method of an avatar according to claim 1, wherein the avatar in the action image includes a mask to be fused with different backgrounds.
7. The avatar's motion generating method of claim 1, wherein said step of acquiring an image including a human motion and acquiring joint point information of the human motion from the image comprises:
acquiring a video comprising the human body action, and respectively acquiring each frame of picture of the video;
sequentially extracting the joint point information of the human body action in each frame of picture;
the step of performing fusion rendering on the complete texture information of the target avatar and the joint point information of the human body action to form the action image of the target avatar further comprises:
and respectively fusing and rendering the complete texture information of the target virtual image and the joint point information of each frame of picture to obtain multi-frame rendered action images, and synthesizing the multi-frame rendered action images into a video according to a time sequence.
8. The method for generating motion of an avatar according to claim 1, wherein said step of extracting texture information of said target avatar in each of said pictures respectively comprises:
extracting the texture of the virtual image from the picture, extracting texture features from the texture, and averaging the values of the texture features;
the step of fusing the texture information acquired from different pictures to obtain the complete texture information of the target virtual image specifically comprises:
and fusing the texture features according to the average value to obtain the fused complete texture information.
9. The method for generating motion of an avatar according to claim 8, wherein the step of fusing the texture information obtained from different pictures to obtain the complete texture information of the target avatar further comprises:
responding to the fact that texture information of the virtual image is absent from the different pictures, and completing the texture information of the absent part in the fusion process.
10. The method for generating actions of an avatar according to any of claims 8 or 9, wherein said step of fusing said texture information obtained from different pictures to obtain complete texture information of said target avatar comprises:
and fusing the texture information acquired from the different pictures through a fusion model to obtain the complete texture information of the target virtual image.
11. The method for generating motion of an avatar according to claim 10, wherein before the step of fusing the texture information obtained from the different pictures by a fusion model to obtain the complete texture information of the target avatar, the method further comprises:
inputting training data in a first training set into a first preset deep learning model for training to obtain a first model;
inputting verification data in a first verification set into the first model for fusion, and adjusting parameters of the first model based on a fusion result to obtain an adjusted first model;
and inputting the test data in the first test set into the adjusted first model for fusion, and evaluating the scoring result of the adjusted first model based on the fusion result to obtain the fusion model.
12. An avatar motion generating apparatus, comprising:
the first acquisition module is used for acquiring at least one picture containing the target virtual image;
the texture extraction module is used for respectively extracting the texture information of the target virtual image in each picture;
the texture fusion module is used for fusing the texture information acquired from different pictures to obtain complete texture information of the target virtual image;
the second acquisition module is used for acquiring an image comprising human body actions and acquiring joint point information of the human body actions from the image;
and the fusion rendering module is used for performing fusion rendering on the complete texture information of the target virtual image and the joint point information of the human body action to form an action image of the target virtual image.
13. An action generating terminal of an avatar, comprising:
a memory for storing program data which when executed implement the steps in the avatar's motion generation method of any of claims 1-11;
a processor for executing the program instructions stored by the memory to implement the steps in the avatar's motion generation method of any of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps in the action generating method of an avatar according to any of claims 1-11.
CN202110556095.5A 2021-05-21 2021-05-21 Method, device, terminal and storage medium for generating action of virtual image Pending CN113298858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556095.5A CN113298858A (en) 2021-05-21 2021-05-21 Method, device, terminal and storage medium for generating action of virtual image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556095.5A CN113298858A (en) 2021-05-21 2021-05-21 Method, device, terminal and storage medium for generating action of virtual image

Publications (1)

Publication Number Publication Date
CN113298858A true CN113298858A (en) 2021-08-24

Family

ID=77323475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556095.5A Pending CN113298858A (en) 2021-05-21 2021-05-21 Method, device, terminal and storage medium for generating action of virtual image

Country Status (1)

Country Link
CN (1) CN113298858A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113747239A (en) * 2021-09-08 2021-12-03 北京快来文化传播集团有限公司 Video editing method and device
CN113900516A (en) * 2021-09-27 2022-01-07 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device, electronic equipment and storage medium
WO2023067531A1 (en) * 2021-10-22 2023-04-27 Soul Machines Limited Virtual avatar animation
WO2023092950A1 (en) * 2021-11-23 2023-06-01 上海商汤智能科技有限公司 Material processing method and apparatus for virtual scenario, and electronic device, storage medium and computer program product
CN114911381A (en) * 2022-04-15 2022-08-16 青岛海尔科技有限公司 Interactive feedback method and device, storage medium and electronic device
CN114979683A (en) * 2022-04-21 2022-08-30 澳克多普有限公司 Application method and system of multi-platform intelligent anchor
CN114821813A (en) * 2022-06-24 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 Virtual object motion control method, modeling method and device
CN114821675A (en) * 2022-06-29 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 Object processing method and system and processor
CN114821675B (en) * 2022-06-29 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 Object processing method and system and processor
CN115471618A (en) * 2022-10-27 2022-12-13 科大讯飞股份有限公司 Redirection method, redirection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113298858A (en) Method, device, terminal and storage medium for generating action of virtual image
CN112150638B (en) Virtual object image synthesis method, device, electronic equipment and storage medium
US10685454B2 (en) Apparatus and method for generating synthetic training data for motion recognition
US9030486B2 (en) System and method for low bandwidth image transmission
US20210279971A1 (en) Method, storage medium and apparatus for converting 2d picture set to 3d model
CN111460874A (en) Image processing method and apparatus, image device, and storage medium
CN109671141B (en) Image rendering method and device, storage medium and electronic device
AU2021480445A1 (en) Avatar display device, avatar generation device, and program
KR20130016318A (en) A method of real-time cropping of a real entity recorded in a video sequence
CN111754415A (en) Face image processing method and device, image equipment and storage medium
CN111861872A (en) Image face changing method, video face changing method, device, equipment and storage medium
CN113507627B (en) Video generation method and device, electronic equipment and storage medium
CN114219878A (en) Animation generation method and device for virtual character, storage medium and terminal
TWI750710B (en) Image processing method and apparatus, image processing device and storage medium
CN114821675B (en) Object processing method and system and processor
CN111880709A (en) Display method and device, computer equipment and storage medium
CN107862718A (en) 4D holographic video method for catching
CN107656611A (en) Somatic sensation television game implementation method and device, terminal device
EP4032255A1 (en) A method for capturing and displaying a video stream
CN115100707A (en) Model training method, video information generation method, device and storage medium
KR20190071341A (en) Device and method for making body model
JP6799468B2 (en) Image processing equipment, image processing methods and computer programs
CN111105489A (en) Data synthesis method and apparatus, storage medium, and electronic apparatus
CN116630508A (en) 3D model processing method and device and electronic equipment
CN115239857B (en) Image generation method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination