WO2023071801A1 - Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product - Google Patents

Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product

Info

Publication number: WO2023071801A1
Authority: WO (WIPO (PCT))
Prior art keywords: target, motion, video, human body, frames
Application number: PCT/CN2022/124879
Other languages: English (en), French (fr)
Inventors: 许嘉晨, 汪旻, 刘文韬, 钱晨, 马利庄
Original Assignee: 上海商汤智能科技有限公司
Priority date: 2021-10-29 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 上海商汤智能科技有限公司
Publication of WO2023071801A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00: Animation
    • G06T 13/20: 3D [Three Dimensional] animation
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Definitions

  • The embodiments of the present disclosure are based on, and claim priority to, Chinese patent application No. 202111275624.0, filed on October 29, 2021 and entitled "Animation generation method, device, computer equipment and storage medium"; the entire content of that Chinese patent application is hereby incorporated into this disclosure by reference.
  • The present disclosure relates to, but is not limited to, the technical field of image processing, and in particular to an animation generation method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
  • Existing video-based 3D human motion reconstruction and prediction frameworks are usually built on 3D human pose reconstruction: given a video, the human pose is reconstructed separately for each video frame. Because the individually reconstructed poses lack temporal continuity, existing methods tend to produce per-frame output poses without smooth transitions, making the reconstruction prone to noise.
  • Embodiments of the present disclosure at least provide an animation generation method and device, computer equipment, storage media, computer programs, and computer program products.
  • An embodiment of the present disclosure provides an animation generation method, including:
  • acquiring a target video containing a target object, and extracting human body features respectively corresponding to at least two video frames in the target video;
  • determining, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
  • generating, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
  • An embodiment of the present disclosure also provides an animation generation device, including:
  • a human body feature extraction module, configured to extract human body features respectively corresponding to at least two video frames in the target video;
  • a motion feature determination module, configured to determine, based on the human body features respectively corresponding to the at least two video frames in the target video and the neural network trained on the prior space, the target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
  • an animation generation module, configured to generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
  • An embodiment of the present disclosure also provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the animation generation method in the above aspect or any possible implementation thereof.
  • An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the animation generation method in the above aspect or any possible implementation thereof are executed.
  • An embodiment of the present disclosure provides a computer program, the computer program including computer-readable code; when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • An embodiment of the present disclosure provides a computer program product, the computer program product including a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • The present disclosure provides an animation generation method and device, computer equipment, a storage medium, a computer program, and a computer program product.
  • The neural network trained on the prior space has the ability to accurately output any prior motion feature in the prior space. Because each prior motion feature stored in the prior space fuses at least two frames of sample motion data used to describe a certain motion, and those sample frames carry timing information, a prior motion feature can characterize the continuous action of that motion. The target prior motion feature can therefore represent the continuous motion of the target object in the target video, and using it enables the virtual object to restore a continuous three-dimensional motion animation of the target motion made by the target object, reducing the noise disturbance of human movement in the three-dimensional motion animation.
  • FIG. 1 shows a flowchart of an animation generation method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic flow diagram of a neural network outputting a priori motion characteristics of a target provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic flow diagram of an animation generation process provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of an animation generation device provided by an embodiment of the present disclosure;
  • FIG. 5 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
  • "At least two" or "several" mentioned herein means two or more.
  • "And/or" describes an association relationship between associated objects and indicates three possible relationships; for example, "A and/or B" may indicate: A exists alone, A and B exist simultaneously, or B exists alone.
  • The character "/" generally indicates that the associated objects are in an "or" relationship.
  • As noted above, the existing video-based 3D human motion reconstruction and prediction frameworks are all based on 3D human pose reconstruction: given a video, the human pose is reconstructed separately for each video frame. Due to the lack of temporal continuity between the individually reconstructed poses, smooth transitions between the output poses of each frame are easily lost, making the result prone to noise.
  • To this end, the present disclosure provides an animation generation method based on a neural network trained on the prior space, which has the ability to accurately output any prior motion feature in the prior space. Because each prior motion feature stored in the prior space fuses at least two frames of sample motion data used to describe a certain motion, and those sample frames carry timing information, a prior motion feature can represent the continuous action of that motion. The target prior motion feature can therefore characterize the continuous motion of the target object in the target video; using it enables the virtual object to restore a continuous three-dimensional motion animation of the target motion made by the target object, reducing the noise interference of human movement in the three-dimensional motion animation.
  • the prior space is a high-dimensional space.
  • the prior space includes at least two prior motion features.
  • the prior motion features can represent a reasonable and smooth human motion of a specified length, that is, a series of continuous human motions.
  • a series of human actions includes a cycle of standing (at least two frames), raising the left leg (at least two frames), and raising the right leg (at least two frames).
  • each action includes at least one frame of motion data, and a section of human body motion of a specified length includes motion data of a fixed number of frames.
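  • To make this structure concrete, the following minimal sketch (an illustration under assumed names and sizes, not the patent's implementation) represents each prior motion feature as a single latent vector that stands for a fixed number of frames of motion data:

```python
import torch

# Assumed sizes for illustration only; the patent does not fix them.
LATENT_DIM = 256    # one prior motion feature = one latent vector
CLIP_FRAMES = 64    # each feature describes a fixed number of motion frames
POSE_DIM = 72       # per-frame motion data, e.g. joint rotations of a body model

# The prior space as a collection of latent codes, each representing a
# reasonable, smooth human motion of the specified length.
prior_space = torch.randn(10_000, LATENT_DIM)

one_motion_feature = prior_space[0]            # shape: (LATENT_DIM,)
decoded_clip_shape = (CLIP_FRAMES, POSE_DIM)   # what a decoder would emit
```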
  • an animation generation method disclosed in the embodiment of the present disclosure is first introduced in detail.
  • the animation generation method provided by the embodiment of the present disclosure is generally executed by a computer device with certain computing capabilities.
  • the animation generation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • The animation generation method provided by the embodiments of the present disclosure is described below taking the execution subject as a terminal device or other processing device as an example.
  • The terminal device may be a user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • FIG. 1 is a flow chart of an animation generation method provided by an embodiment of the present disclosure
  • the method includes steps S101 to S104, wherein:
  • S101: Acquire a target video containing a target object. The target object may include a person in a real scene, or a human body model in a virtual scene.
  • the target video may be a video of human body movement captured by the user using a shooting device; wherein, the human body movement may include one or more of walking, running, rope skipping, standing, lying down, and other movements.
  • Human motion may include a series of continuous actions made by the target object, including stationary poses; this motion is referred to below as the target motion.
  • The target video may be a human motion video shot by the user, or a continuous segment cut from such a video, and it consists of at least two video frames (also called video images). Since each prior motion feature stored in the prior space represents a human motion of a specified length, expressed as motion data with a fixed number of frames, the number of video frames in the target video should be less than or equal to that fixed number of frames so that the target object can be restored in every video frame. The target video may therefore be acquired as a video with the fixed number of frames, or by cutting a clip with the fixed number of frames out of a longer recording.
  • S102: Extract human body features respectively corresponding to at least two video frames in the target video.
  • The human body features corresponding to a video frame are features of the target object extracted from that video frame.
  • At least two video frames in the target video may include each video frame in the target video. Alternatively, it may also include a segment of at least two consecutive video frames extracted from the target video, that is, a part of the video frames in the target video.
  • In some embodiments, a temporal feature extraction network may be used to extract features of the target object from the corresponding video frames in the order of their timing information. Since each video frame in the target video carries its own timing information, each extracted human body feature corresponds to the timing information of its video frame. The temporal feature extraction network includes, for example, a Gated Recurrent Unit (GRU) or a Long Short-Term Memory network (Long Short-Term Memory, LSTM). The extracted human body features containing timing information can be composed into a temporal feature set.
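  • As an illustration of such a temporal feature extractor, the sketch below runs a GRU over per-frame 2048-dimensional features, matching the feature size used in the examples later in this text; the architecture and sizes are assumptions, since the patent only names GRU/LSTM as examples:

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    """Hedged sketch: per-frame image features are assumed 2048-dimensional."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        # The GRU consumes frames in timing order, so each output carries
        # the timing information of its video frame.
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)  # back to 2048-d

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, 2048), ordered by timing information
        hidden, _ = self.gru(frame_feats)
        return self.proj(hidden)  # (batch, num_frames, 2048) human body features

# Usage: 16 frames of a target video -> 16 human body features.
feats = TemporalFeatureExtractor()(torch.randn(1, 16, 2048))
```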
  • S103: Based on the human body features respectively corresponding to at least two video frames in the target video and the neural network trained on the prior space, determine the target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions.
  • The neural network is trained using the prior space. It can be obtained by supervised training with the prior motion features stored in the prior space, which gives it the ability to accurately output any prior motion feature in the prior space; that is, the target prior motion feature output by the trained neural network coincides with some prior motion feature stored in the prior space.
  • In some embodiments, the human body features corresponding to the at least two video frames can be input into the trained neural network, and regression processing (such as linear regression or least-squares regression) is performed to obtain the target prior motion feature. For example, with a 2048-dimensional human body feature vector extracted per video frame (the human body features include human body feature vectors), the 2048-dimensional vectors corresponding to 16 video frames are all input into the trained neural network; the convolutional layers of the network perform feature extraction, the fully connected layer fuses the features, and the fused feature code is taken as the target prior motion feature, completing the regression of the target prior motion feature.
  • Since the prior motion feature fuses at least two frames of sample motion data used to describe a certain motion, and those sample frames each carry their own timing information, the motion represented by the prior motion feature is continuous; consequently, the regressed target prior motion feature can represent the continuous motion of the target object in the target video.
  • In another optional implementation, the human body features corresponding to a selected subset of consecutive video frames are input into the trained neural network, and the target prior motion feature is regressed.
  • In yet another optional implementation, at least two subsequent consecutive video frames are predicted, and the human body features corresponding to at least two video frames in the target video, together with the predicted human body features of those consecutive frames, are input into the trained neural network to regress the target prior motion feature. Alternatively, the human body features corresponding to a subset of consecutive video frames in the target video, together with the predicted human body features, are input together and the target prior motion feature is regressed. For the regression process, refer to the description of regressing the target prior motion feature in the first optional implementation.
  • S104: Based on the target prior motion feature, generate a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  • The virtual object is a three-dimensional object displayed in a virtual scene, such as on a computer display screen, and is used to imitate the target motion made by the target object.
  • The virtual object includes, for example, a virtual cartoon character, or a virtual character modeled to restore the target object.
  • In some embodiments, a decoder can be used to decode the encoded target prior motion feature to obtain at least two frames of motion data representing the target motion made by the target object; the at least two frames of motion data are then used for three-dimensional human motion modeling to generate a 3D motion animation in which a virtual object makes the target motion made by the target object, the virtual object in the 3D motion animation being able to perform the same continuous actions as the target object in the target video.
  • the decoder may be any pre-trained common decoder, which is not limited in this embodiment of the present disclosure.
  • a default set of human body shape parameters can be used for 3D human motion reconstruction during the modeling process.
  • Since the first frame of motion data corresponding to the prior motion feature defaults to a set orientation, the 3D human motion reconstruction can be performed according to that default orientation without changing the orientation of the virtual object.
  • 3D scene information may also be added to the 3D motion animation according to the actual application, which is not limited in the embodiments of the present disclosure.
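  • A minimal sketch of such a decoding step is given below; the convolutional decoder architecture and sizes are assumptions consistent with the description, not the patent's exact design:

```python
import torch
import torch.nn as nn

# Assumed sizes, matching the earlier sketch: a single latent motion feature
# decodes to a fixed-length clip of per-frame motion data.
LATENT_DIM, CLIP_FRAMES, POSE_DIM = 256, 64, 72

class MotionDecoder(nn.Module):
    """Decodes one target prior motion feature into motion-data frames."""
    def __init__(self):
        super().__init__()
        self.expand = nn.Linear(LATENT_DIM, CLIP_FRAMES * 128)
        self.conv = nn.Sequential(                  # temporal convolutions
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, POSE_DIM, kernel_size=3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, LATENT_DIM) -> motion data: (batch, CLIP_FRAMES, POSE_DIM)
        x = self.expand(z).view(-1, 128, CLIP_FRAMES)
        return self.conv(x).permute(0, 2, 1)

# The decoded frames then drive 3D human motion modeling of the virtual object.
motion = MotionDecoder()(torch.randn(1, LATENT_DIM))
```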
  • In summary, S101 to S104 rely on the neural network trained on the prior space, which can accurately output any prior motion feature in the prior space. Because each prior motion feature stored in the prior space fuses at least two frames of sample motion data describing a certain motion, and those sample frames carry timing information, a prior motion feature can represent the continuous action of that motion. The target prior motion feature can therefore characterize the continuous motion of the target object in the target video, and using it allows the virtual object to restore a continuous three-dimensional motion animation of the target motion made by the target object, reducing the noise interference of human movement in the three-dimensional motion animation.
  • In some embodiments, the temporal feature set corresponding to the target video can be determined based on the human body features corresponding to at least some video frames in the target video; the target prior motion feature is then determined based on the temporal feature set and the neural network.
  • The temporal feature set may include: the human body features corresponding to at least two video frames in the target video; or the human body features corresponding to some consecutive video frames; or the human body features corresponding to at least some video frames plus the predicted human body features corresponding to at least two video frames after the target video; or the human body features corresponding to some consecutive video frames plus the predicted human body features corresponding to at least two video frames after the target video.
  • Since the temporal feature set corresponding to the target video includes the human body features of multiple video frames, it reflects the overall motion information of the target object in the target video more comprehensively. Processing the temporal feature set with the neural network therefore yields a more accurate target prior motion feature reflecting the continuous motion of the target object in the target video.
  • Different animation generation tasks lead to different human body features in the temporal feature set.
  • Animation generation tasks may include, for example, a human motion reconstruction task, which reconstructs the target motion made by the target object in the target video from the content of the target video (covering only actions the target object has already made) and generates the 3D motion animation; or a human motion prediction task, which determines the target motion from the content of the target video together with predicted upcoming actions of the target object (covering both the actions the target object has already made and the predicted actions).
  • For the human motion reconstruction task, the temporal feature set is determined as follows. In some embodiments, the human body features corresponding to the extracted at least two video frames can be concatenated according to the first timing information corresponding to the video frames to determine the temporal feature set corresponding to the target video.
  • The first timing information is the playing time of a video frame in the continuously played target video, and each video frame has its own first timing information. For example, with 3 video frames, the first timing information of the human body feature extracted from the first video frame is first, that of the second video frame is second, and that of the third video frame is third; the corresponding human body features are concatenated in the order first, second, third.
  • The concatenation can be end-to-end: according to the first timing information of the video frames, later features are spliced after earlier ones. For example, the human body feature extracted from the first video frame (a 2048-dimensional human body feature vector) comes first, the feature extracted from the second video frame is spliced after it, and the feature extracted from the third video frame after that. With a 2048-dimensional human body feature vector extracted per video frame, concatenating the features of three video frames according to the first timing information yields a 3 × 2048-dimensional human body feature vector, i.e., the temporal feature set.
  • Because the human body features of the multiple video frames are concatenated according to the first timing information, the resulting temporal feature set contains that timing information; it therefore comprehensively reflects the overall motion information of the target object in the target video, and that motion information is temporally continuous.
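  • Concretely, the concatenation amounts to splicing per-frame feature vectors in timing order; a short sketch:

```python
import torch

# Human body features of three frames, in first-timing-information order.
f1, f2, f3 = (torch.randn(2048) for _ in range(3))

# End-to-end splicing: later features follow earlier ones.
temporal_feature_set = torch.cat([f1, f2, f3])   # shape: (3 * 2048,)
# Equivalently kept as a (3, 2048) stack before flattening:
stacked = torch.stack([f1, f2, f3])              # shape: (3, 2048)
```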
  • Alternatively, the human body features corresponding to the extracted subset of consecutive video frames may be concatenated according to the first timing information corresponding to the video frames to determine the temporal feature set corresponding to the target video; for the execution process, refer to the description above.
  • For the human motion prediction task, the temporal feature set is determined as follows. In some embodiments, based on the human body features respectively corresponding to at least two video frames in the target video, the predicted human body features corresponding to the N video frames after the target video are predicted, where N is a preset positive integer. The human body features corresponding to the at least two video frames in the target video and the predicted human body features corresponding to the N video frames are concatenated according to the second timing information corresponding to the video frames to determine the temporal feature set corresponding to the target video.
  • The second timing information covers the playing time of the video frames in the continuously played target video plus the playing time of the N predicted video frames, and each of these frames has its own second timing information. For example, with 3 video frames in the target video plus 1 predicted video frame, the second timing information of the human body feature extracted from the first video frame is first, that of the second video frame is second, that of the third video frame is third, and that of the fourth (predicted) video frame is fourth; the corresponding human body features are concatenated in the order first, second, third, fourth.
  • The concatenation can again be end-to-end: according to the second timing information of the video frames, later features are spliced after earlier ones. For example, the human body feature extracted from the first video frame (a 2048-dimensional human body feature vector) comes first, the feature extracted from the second video frame is spliced after it, the feature extracted from the third video frame after that, and the feature of the fourth (predicted) video frame last. With a 2048-dimensional human body feature vector extracted per frame, concatenating the features of four video frames according to the second timing information yields a 4 × 2048-dimensional human body feature vector, i.e., the temporal feature set.
  • In this implementation, predicted human body features related to the actions of the target motion made by the target object in the target video can be predicted, and the human body features extracted before prediction, together with the predicted human body features obtained after prediction, are concatenated according to the second timing information of the video frames. The resulting temporal feature set contains the second timing information, so it comprehensively reflects the overall motion information of the target object over a period of time (the three video frames of the target video and the one predicted video frame after it), and that motion information is temporally continuous.
  • The predicted human body features corresponding to the N video frames after the target video can be obtained as follows. One possible implementation: the human body features of the selected consecutive video frames are input into an autoregressive recurrent network to predict the predicted human body feature of the first video frame after the target video. After that, this predicted feature, together with features selected from the human body features of the consecutive video frames, is input into the autoregressive recurrent network again; looping in this way predicts the predicted human body features of the preset number of video frames, i.e., the N video frames, where N is the preset number of video frames to predict and is a positive integer.
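  • The following sketch illustrates one such autoregressive loop (the architecture is an assumption; the patent does not specify it): a recurrent cell repeatedly consumes the most recent features and emits the next predicted feature.

```python
import torch
import torch.nn as nn

class AutoregressivePredictor(nn.Module):
    """Predicts the next 2048-d human body feature from the previous ones."""
    def __init__(self, feat_dim: int = 2048, hidden: int = 1024):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, frames_so_far, 2048) -> next feature: (batch, 2048)
        out, _ = self.gru(history)
        return self.head(out[:, -1])

predictor = AutoregressivePredictor()
feats = torch.randn(1, 3, 2048)   # human body features of 3 target-video frames
N = 1                             # preset number of video frames to predict
for _ in range(N):
    nxt = predictor(feats)                            # regress next feature
    feats = torch.cat([feats, nxt.unsqueeze(1)], 1)   # feed back and loop
# feats now holds 3 + N features, ready to concatenate by second timing info.
```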
  • Since a prior motion feature stored in the prior space represents a human motion of a specified length, expressed as motion data with a fixed number of frames, the number of video frames in the target video plus the N predicted video frames should be less than or equal to that fixed number of motion-data frames. The value of N is preset; any value satisfying this constraint is acceptable, and the embodiments of the present disclosure do not limit it.
  • For example, suppose N = 2 (the N video frames being the fourth and fifth video frames). When determining the predicted human body features, the human body features corresponding to the three video frames of the target video can be input into the autoregressive recurrent network, which outputs a predicted human body feature related to them; the predicted feature of the fourth video frame is continuous with the feature of the third video frame. After that, the human body features corresponding to the second and third video frames of the target video, together with the predicted feature of the fourth video frame, are input into the autoregressive recurrent network, which outputs a predicted feature of the fifth video frame that is continuous with those of the preceding frames.
  • Another possible implementation is to input the human body features corresponding to at least two video frames in the target video into the autoregressive neural network and regress a predicted human body feature; after that, the human body features of the at least two video frames together with each predicted human body feature are input into the autoregressive neural network to regress the next predicted human body feature, and the cycle repeats until N predicted human body features are obtained.
  • In some embodiments, for the (M+1)-th video frame among the frames following the target video, its predicted human body feature is predicted based on the human body features corresponding to at least two video frames in the target video and the predicted human body features of the preceding M video frames; M is a positive integer less than N.
  • In implementation, the human body features corresponding to at least two video frames in the target video and the predicted human body features corresponding to the M video frames can be input into the autoregressive neural network to regress the predicted human body feature of the (M+1)-th video frame. Alternatively, the human body features corresponding to some consecutive video frames in the target video and the predicted human body features of the M video frames are input into the autoregressive neural network to regress the predicted human body feature of the (M+1)-th video frame.
  • In another implementation, the human body features corresponding to at least two video frames in the target video are input into the autoregressive neural network to regress the predicted human body features of the M video frames after the target video; the predicted features of those M video frames are then input into the autoregressive neural network to regress the predicted human body feature of the (M+1)-th video frame after the target video.
  • In this way, the prediction uses the predicted human body features of the previous M video frames together with the human body features of at least two video frames in the target video, and can accurately predict a feature related to them, namely the predicted human body feature of the (M+1)-th video frame. Because the predicted human body features are highly correlated with the human body features, each predicted feature is continuous with the features, extracted or predicted, of the preceding video frames.
  • For the training process of the neural network, refer to the following steps:
  • Step 1: Obtain a sample video, and extract sample human body features corresponding to at least two video frames in the sample video.
  • For the process of extracting the sample human body features corresponding to at least two video frames in the sample video, refer to the extraction of human body features in S102.
  • Step 2: Based on the neural network to be trained and the sample human body features, determine the sample prior motion feature corresponding to the sample video.
  • In some embodiments, the extracted sample human body features are concatenated according to the timing information of the video frames in the sample video to determine a sample human body feature set. The set is then input into the neural network to be trained: at least two convolutional layers perform feature extraction on the sample human body feature set, extracting at least two sample extracted features; the extracted features are then input to the fully connected layer, which outputs the sample prior motion feature.
  • Step 3: Detect whether the sample prior motion feature is a prior motion feature in the prior space, and adjust the network parameters of the neural network to be trained based on the detection result.
  • In some embodiments, the prior motion features stored in the prior space can be used for supervised training: the loss between the sample prior motion feature and the prior motion features in the prior space is calculated, a loss function is constructed from that loss, and the network parameters of the neural network to be trained are adjusted with the loss function until every sample prior motion feature output by the network is consistent with some prior motion feature in the prior space, i.e., the mapping space of the neural network coincides with the prior space; the training of the neural network is then complete.
  • In this way, the sample prior motion features are continuously optimized toward the prior motion features stored in the prior space, and the network parameters of the neural network to be trained are continuously adjusted, yielding a neural network with the ability to accurately output any prior motion feature in the prior space.
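  • A hedged sketch of one supervised training step (the loss choice, network, and data pairing are assumptions, not the patent's specification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: 16 frames x 2048-d features in, a 256-d prior feature out.
net = nn.Sequential(nn.Flatten(), nn.Linear(16 * 2048, 256))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

# One supervised step: pull the sample prior motion feature toward the
# matching prior motion feature stored in the prior space.
sample_feature_set = torch.randn(8, 16, 2048)   # a batch of sample videos
target_prior = torch.randn(8, 256)              # matching stored prior features

pred = net(sample_feature_set)                  # sample prior motion features
loss = F.mse_loss(pred, target_prior)           # distance to the prior feature
optimizer.zero_grad()
loss.backward()
optimizer.step()
```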
  • A possible implementation: after the temporal feature set is input into the neural network, the network performs the following operations: feature extraction is performed on the temporal feature set by at least two convolutional layers of the neural network to obtain at least two extracted features; the extracted features are input to the fully connected layer to obtain the target prior motion feature.
  • the neural network may be a convolutional neural network (Convolutional Neural Networks, CNN for short) or the like.
  • For the processing flow, refer to FIG. 2, a schematic flow chart of the neural network outputting the target prior motion feature. It includes a neural network 21, two convolutional layers 22 and 23, and a fully connected layer 24. Convolutional layer 22 performs feature extraction on the temporal feature set to obtain extracted feature A; convolutional layer 23 then performs feature extraction on extracted feature A to obtain extracted feature B; finally, extracted feature A and extracted feature B are input to the fully connected layer 24 to obtain the target prior motion feature.
  • Using at least two convolutional layers for feature extraction on the temporal feature set yields at least two deeper extracted features representing the human motion; fully connecting these deeper features yields a more accurate target prior motion feature reflecting the continuous motion of the target object in the target video.
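  • A sketch of this FIG. 2-style architecture under the same assumed sizes (details beyond the two convolutional layers and the fully connected layer are assumptions):

```python
import torch
import torch.nn as nn

class PriorFeatureRegressor(nn.Module):
    """Two conv layers produce features A and B; an FC layer fuses them."""
    def __init__(self, feat_dim: int = 2048, latent_dim: int = 256):
        super().__init__()
        self.conv22 = nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1)
        self.conv23 = nn.Conv1d(512, 512, kernel_size=3, padding=1)
        self.fc24 = nn.LazyLinear(latent_dim)   # fully connected fusion

    def forward(self, temporal_feature_set: torch.Tensor) -> torch.Tensor:
        # temporal_feature_set: (batch, num_frames, 2048)
        x = temporal_feature_set.permute(0, 2, 1)    # (batch, 2048, frames)
        feat_a = torch.relu(self.conv22(x))          # extracted feature A
        feat_b = torch.relu(self.conv23(feat_a))     # extracted feature B
        fused = torch.cat([feat_a.flatten(1), feat_b.flatten(1)], dim=1)
        return self.fc24(fused)                      # target prior motion feature

z = PriorFeatureRegressor()(torch.randn(1, 16, 2048))   # -> (1, 256)
```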
  • In some embodiments, the neural network may determine at least two prior motion features for a given temporal feature set. For example, for a temporal feature set that includes a hand-crossing action, the neural network may regress both the prior motion feature corresponding to the left hand in front of the right hand and the prior motion feature corresponding to the right hand in front of the left hand.
  • In some embodiments, the prior motion features in the prior space obey a normal distribution: the closer a feature is to the origin of the distribution, the higher its probability of occurring; equivalently, the smaller the eigenvalue of a prior motion feature, the higher its probability of occurring. Therefore, based on the eigenvalues of the determined at least two prior motion features, one target prior motion feature can be determined. In the related art, generating 3D human motion data from 2D video frames suffers from the ambiguity of multiple candidate target prior motion features; by constraining the selection so that the probability of the chosen prior motion feature is as high as possible, the occurrence of multiple candidates is reduced, i.e., the multiple-solution ambiguity is reduced, and the target prior motion feature with the higher probability can be chosen for output.
  • In some embodiments, an eigenvalue corresponding to each of the at least two prior motion features may be calculated, and the prior motion feature with the smallest eigenvalue is taken as the target prior motion feature.
  • The calculation of the eigenvalue proceeds as follows: a prior motion feature is known to be an R-dimensional feature vector, where R is a positive integer; the value K of each dimension is determined, and the eigenvalue Y is determined according to formula (1).
  • In this way, the target prior motion feature closest to the origin of the normal distribution is screened out from the determined at least two prior motion features, and it can be used to restore a relatively accurate series of actions of the target motion made by the target object.
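  • The body of formula (1) is not reproduced above; given that features closer to the origin of the normal distribution should have smaller eigenvalues, a plausible form, offered here only as an assumption and not as the patent's formula, is the squared norm of the feature vector:

```latex
% Assumed reconstruction of formula (1): K_r is the value of the r-th of the
% R dimensions; a smaller Y means the feature lies closer to the origin of
% the normal distribution, i.e. has a higher probability of occurrence.
Y = \sum_{r=1}^{R} K_r^{2} \tag{1}
```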
  • In some embodiments, at least two frames of motion data used to describe the motion of the virtual object can be determined based on the target prior motion feature and the pre-trained decoder corresponding to the prior space; after that, based on the at least two frames of motion data of the virtual object's motion, a 3D motion animation in which the virtual object makes the target motion made by the target object is generated, where the frame count of the 3D motion animation is the same as the frame count of the target video.
  • At least two frames of sample motion data can be used to train the encoder and decoder of an autoencoder. Since the prior motion features are obtained by the encoder's encoding, the decoder trained jointly with it can decode the prior motion features.
  • In some embodiments, the target prior motion feature is input into the pre-trained decoder corresponding to the prior space and convolved through its convolutional layers, i.e., the encoded target prior motion feature is decoded; the final output is at least two frames of motion data expressing the motion of the virtual object. These frames are then used for three-dimensional human motion modeling to generate a 3D motion animation in which the virtual object makes the target motion made by the target object; in the 3D motion animation, the virtual object performs the same continuous actions as the target object in the target video.
  • In this way, the trained decoder corresponding to the prior space can accurately decode at least two frames of motion data describing the motion of the virtual object. Since those frames are decoded from the target prior motion feature, which reflects the continuous motion of the target object in the target video, the motion data likewise reflect that continuous motion, and a continuous 3D motion animation in which the virtual object restores the target motion made by the target object can be generated.
  • In some embodiments, target motion data are screened out of the decoded motion data, and the 3D motion animation in which the virtual object makes the target motion made by the target object is generated based on the target motion data.
  • The first frame count of the target video is less than or equal to the second frame count of the motion data. When the first frame count is smaller, target motion data with the first frame count are filtered out of the motion data with the second frame count and used to generate a 3D motion animation with the first frame count; that is, the frame count of the 3D motion animation equals the frame count of the target video.
  • When screening the target motion data of the first frame count out of the motion data of the second frame count, the motion data carry timing information, so the motion data of the first frame count can be selected sequentially starting from the first frame and used as the target motion data. Depending on the actual application scenario, other selection rules may also be applied to the motion data of the second frame count, which is not limited in the embodiments of the present disclosure.
  • In this way, the frame count of the obtained 3D motion animation can match that of the target video. Because the decoder outputs a fixed number of motion-data frames, when the first frame count of the target video is less than the second frame count of the motion data, the target motion data must be screened out of the decoded motion data so that a 3D motion animation with the same frame count as the target video can be restored.
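  • Since the decoded clip length is fixed, the screening step can be as simple as taking the leading frames in timing order, as in this sketch:

```python
import torch

CLIP_FRAMES, POSE_DIM = 64, 72                 # assumed decoder output sizes
motion = torch.randn(CLIP_FRAMES, POSE_DIM)    # decoded, timing-ordered frames

first_frame_count = 48                         # frame count of the target video
assert first_frame_count <= CLIP_FRAMES        # first count <= second count
target_motion_data = motion[:first_frame_count]  # select sequentially from frame 1
```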
  • In some embodiments, the human body shape information corresponding to the target object can be modeled to obtain a virtual object consistent with the shape of the target object.
  • To do so, a first fully connected process may be performed on the human body features of a target video frame in the target video to determine the human body shape information corresponding to the target object; afterwards, the 3D motion animation in which the virtual object makes the target motion made by the target object is generated based on the target prior motion feature and the human body shape information.
  • the target video frame may be any video frame randomly selected from the target video; or a specified first video frame.
  • In implementation, the human body features corresponding to the target video frame are input into the fully connected layer, and at least one fully connected layer regresses the corresponding human body shape information from the human body features; the shape information is then used as the shape of the virtual object throughout the three-dimensional human motion modeling process, so that the shape of the virtual object remains unchanged.
  • The 3D motion animation thus shows a virtual object with the same shape as the target object making the target motion of the target object.
  • Using the human body shape information to constrain the shape of the virtual object in the three-dimensional motion animation keeps the shape of the virtual object unchanged in every frame of the animation, improving the display effect of the virtual object in the 3D motion animation and thereby the user's visual viewing experience.
  • Since the first frame of motion data corresponding to the prior motion feature defaults to a certain set orientation, such as facing the positive direction of the display screen, it is likely to be inconsistent with the orientation of the target object in the target video. Therefore, in order for the modeled virtual object to restore the orientation of the target object in the target video, the orientation adjustment information corresponding to the target object must be determined from the human body features of the first video frame. In some embodiments, a second fully connected process is performed on the human body features of the first video frame of the target video to determine the orientation adjustment information corresponding to the target object; the 3D motion animation in which the virtual object makes the target motion made by the target object is then generated based on the target prior motion feature and the orientation adjustment information.
  • In implementation, the human body features corresponding to the first video frame can be input into the fully connected layer, and at least one fully connected layer regresses the corresponding orientation adjustment information from the human body features. The orientation adjustment information may include rotating by a certain adjustment angle relative to the set direction; that information is then used to adjust the orientation of the virtual object during modeling, for example rotating clockwise by the adjustment angle, so that the orientation of the virtual object in each frame of the 3D motion animation matches the orientation of the target object in the at least two video frames of the target video, which can improve the user's visual experience of watching the 3D motion animation.
  • The orientation adjustment information may also directly include an orientation; during modeling, the orientation of the virtual object model in each frame is adjusted to the orientation specified in the orientation adjustment information.
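  • A sketch of these two fully connected regressions (the output sizes are assumptions, e.g. ten SMPL-style shape parameters and a three-value rotation): the first regresses body shape from the feature of a chosen target video frame, the second regresses orientation adjustment from the feature of the first video frame.

```python
import torch
import torch.nn as nn

FEAT_DIM = 2048
shape_head = nn.Linear(FEAT_DIM, 10)      # first FC process: body shape info
orient_head = nn.Linear(FEAT_DIM, 3)      # second FC process: orientation adj.

frame_feats = torch.randn(16, FEAT_DIM)   # per-frame human body features

body_shape = shape_head(frame_feats[0])   # from any chosen target video frame
orientation = orient_head(frame_feats[0]) # from the first video frame
# body_shape stays fixed across all animation frames; orientation rotates the
# default-facing first frame of motion data to match the target object.
```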
  • Refer to FIG. 3, a schematic diagram of the flow of the animation generation process.
  • It includes: the target video 31; the temporal feature extraction network 32; the temporal feature set 33; the neural network 34; the target prior motion feature 35; the pre-trained decoder 36 corresponding to the prior space; at least two frames of motion data 37; the fully connected layer 38 corresponding to the first fully connected process; the fully connected layer 39 corresponding to the second fully connected process; other videos 40, which differ from the target video 31 in shooting angle but contain the same target object at the same moments; the temporal convolution 41 (a Temporal Encoder) used to extract temporal features from the other videos 40; the human body shape information 42; and the orientation adjustment information 43. The temporal feature set 33 comprises the set determined from the target video 31 as well as the sets determined from the other videos 40. When the shape and orientation of the virtual object are not adjusted, the 3D human animation 44 is generated; when they are adjusted, the 3D human animation 45 is generated.
  • Some implementations of the animation generation process include: acquiring the target video 31 containing the target object, and using the temporal feature extraction network 32 to extract the human body features corresponding to each video frame in the target video 31. For the human motion reconstruction task, the extracted human body features are concatenated according to the first timing information corresponding to the video frames to obtain the temporal feature set 33 corresponding to the target video; for the human motion prediction task, the extracted human body features and the predicted human body features are concatenated according to the second timing information to obtain the temporal feature set 33 corresponding to the target video. Afterwards, the neural network 34 is used to regress the target prior motion feature 35.
  • The target prior motion feature is then input into the decoder 36 for decoding, obtaining at least two frames of motion data 37.
  • Each prior motion feature in the prior space corresponds to a reasonable, smooth human motion of the specified length. Therefore, to perform high-precision 3D human motion reconstruction and prediction, the prior space can be used as the target space for 3D human motion reconstruction and prediction. In implementation, the corresponding human body features are first extracted from the input target video 31; the target prior motion feature 35 in the prior space is then regressed from those features; finally, the human motion corresponding to the target prior motion feature 35 is the result of 3D human motion reconstruction or prediction.
  • In the related art, human motion is generated frame by frame, and a segment of human motion is composed of multiple consecutive single-frame motions. With the embodiments of the present disclosure, the corresponding human motion can be generated in one pass from the target prior motion feature 35, improving the continuity and smoothness between human motions.
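  • Putting the pieces of FIG. 3 together, an end-to-end pass might look like the following sketch, which reuses the hypothetical modules defined in the earlier snippets:

```python
import torch

# Hypothetical composition of the sketches defined above.
extractor = TemporalFeatureExtractor()   # like network 32: per-frame features
regressor = PriorFeatureRegressor()      # like network 34: regress prior feature
decoder = MotionDecoder()                # like decoder 36: decode motion data

frames_2048d = torch.randn(1, 16, 2048)  # backbone features of 16 video frames
feats = extractor(frames_2048d)          # temporal feature set 33
z = regressor(feats)                     # target prior motion feature 35
motion_37 = decoder(z)                   # at least two frames of motion data 37
# motion_37 then drives 3D human motion modeling, optionally constrained by the
# body shape information 42 and orientation adjustment information 43.
```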
  • The fully connected layer 38 may include a first fully connected layer 301 and a second fully connected layer 302, and the fully connected layer 39 may include a third fully connected layer 303 and a fourth fully connected layer 304. When the regressed shape and orientation are applied to the virtual object, the three-dimensional human animation 45 is generated.
  • Since the human body shape of the target object can be observed from at least two angles, the corresponding human body features can be extracted from the input videos, and the human body shape attributes (such as the human body shape information 42 and the orientation adjustment information 43) are regressed from those features; the human body shape of the target object can be described through the human body shape information 42 and the orientation adjustment information 43, ensuring the consistency of the human body shape.
  • In the related art, a human shape attribute is generated for each video frame, resulting in lower consistency of the human shape attributes. Here, the human body shape information 42 and the orientation adjustment information 43 serve as the human shape attributes of every frame in the target video or other videos corresponding to the complete movement of the target object, which improves the consistency of the human shape attributes within the target video or the other videos.
  • The description of each step above does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of each step should be determined by its function and possible internal logic.
  • Based on the same inventive concept, an animation generation device corresponding to the animation generation method is also provided in the embodiments of the present disclosure. Since the problem-solving principle of the device is similar to that of the animation generation method described above, the implementation of the device can refer to the implementation of the method.
  • the device includes: a video acquisition module 401, a human body feature extraction module 402, a motion feature determination module 403, and an animation generation module 404; wherein,
  • a video acquisition module 401 configured to acquire a target video containing a target object
  • the human body feature extraction module 402 is configured to extract human body features corresponding to at least two video frames in the target video;
  • the motion feature determination module 403 is configured to determine, based on the human body features respectively corresponding to at least two video frames in the target video and the neural network trained on the prior space, the target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
  • the animation generation module 404 is configured to generate, based on the target prior motion feature, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  • In a possible implementation, the motion feature determination module 403 is configured to determine the temporal feature set corresponding to the target video based on the human body features respectively corresponding to at least two video frames in the target video, and to determine the target prior motion feature based on the temporal feature set and the neural network.
  • In a possible implementation, the motion feature determination module 403 is configured to concatenate the extracted human body features corresponding to the at least two video frames according to the first timing information corresponding to the video frames, and determine the temporal feature set corresponding to the target video.
  • In a possible implementation, the motion feature determination module 403 is configured to predict, based on the human body features respectively corresponding to at least two video frames in the target video, the predicted human body features corresponding to the N video frames after the target video, where N is a preset positive integer; and to concatenate the human body features corresponding to the at least two video frames in the target video and the predicted human body features corresponding to the N video frames according to the second timing information corresponding to the video frames, to determine the temporal feature set corresponding to the target video.
  • the motion feature determination module 403 is configured to, for M video frames after the target video, predict the predicted human body feature of the (M+1)-th video frame after the target video based on the human body features respectively corresponding to at least two video frames in the target video and the predicted human body features respectively corresponding to the M video frames; M is a positive integer smaller than N.
  • the device further includes a neural network training module 405, configured to acquire a sample video and extract sample human body features respectively corresponding to at least two video frames in the sample video; determine, based on the neural network to be trained and the sample human body features, a sample prior motion feature corresponding to the sample video; and detect whether the sample prior motion feature is a prior motion feature in the prior space, and adjust network parameters of the neural network to be trained based on the detection result.
  • the motion feature determination module 403, when determining the target prior motion feature based on the time-series feature set and the neural network, is configured to: input the time-series feature set into the neural network and perform the following operations: performing feature extraction on the time-series feature set based on at least two convolutional layers of the neural network to obtain at least two extracted features; and inputting the at least two extracted features into a fully connected layer to obtain the target prior motion feature.
  • the motion feature determination module 403 is configured to, in a case where it is determined based on the neural network that there are at least two prior motion features corresponding to the time-series feature set, determine the target prior motion feature based on feature values of the determined at least two prior motion features.
  • the animation generation module 404 is configured to determine, based on the target prior motion feature and the pre-trained decoder corresponding to the prior space, at least two frames of motion data used to describe the motion of the virtual object; and to generate, based on the at least two frames of motion data of the virtual object's motion, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object; wherein the number of frames of the three-dimensional motion animation is the same as the number of frames of the target video.
  • the first frame count of the target video does not exceed the second frame count of the motion data; the animation generation module 404 is configured to, in a case where the first frame count is smaller than the second frame count, filter target motion data out of the motion data of the second frame count based on the first frame count, and generate, based on the target motion data, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  • the device further includes a human body shape determination module 406, configured to perform first full-connection processing on the human body feature of a target video frame in the target video to determine human body shape information corresponding to the target object;
  • the animation generation module 404 is configured to generate, based on the target prior motion feature and the human body shape information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  • the device further includes a human body orientation determination module 407, configured to perform second full-connection processing on the human body feature of the first video frame in the target video to determine orientation adjustment information corresponding to the target object;
  • the animation generation module 404 is configured to generate, based on the target prior motion feature and the orientation adjustment information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  • referring to FIG. 5, which is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure, the computer device includes:
  • a processor 51, a memory 52 and a bus 53. The memory 52 stores machine-readable instructions executable by the processor 51, and the processor 51 is configured to execute the machine-readable instructions stored in the memory 52. When the machine-readable instructions are executed, the processor 51 performs the following steps: S101: acquire a target video containing a target object; S102: extract human body features respectively corresponding to at least two video frames in the target video; S103: determine, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video, wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions; S104: generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
  • the memory 52 includes an internal memory 521 and an external memory 522; the internal memory 521, also called main memory, is used to temporarily store operation data of the processor 51 and data exchanged with the external memory 522 such as a hard disk; the processor 51 exchanges data with the external memory 522 through the internal memory 521. When the computer device runs, the processor 51 communicates with the memory 52 through the bus 53, so that the processor 51 executes the execution instructions mentioned in the above method embodiments.
  • An embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the above animation generation method is realized.
  • the computer-readable storage medium may store only the computer program corresponding to the animation generation method.
  • a computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card with instructions stored thereon or a raised structure in a groove, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • An embodiment of the present disclosure also proposes a computer program; the computer program includes computer-readable code, and when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes some or all of the steps of the above method.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over at least two network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in the embodiments of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • if the functions are implemented in the form of software function modules and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
  • based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An animation generation method, including: acquiring a target video containing a target object (S101); extracting human body features respectively corresponding to at least two video frames in the target video (S102); determining, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching a target motion made by the target object in the target video, wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions (S103); and generating, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object (S104).

Description

Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product
CROSS-REFERENCE TO RELATED APPLICATION
The embodiments of the present disclosure are based on, and claim priority to, the Chinese patent application with application number 202111275624.0, filed on October 29, 2021 and entitled "Animation generation method, apparatus, computer device and storage medium", the entire content of which is hereby incorporated into the present disclosure by reference.
TECHNICAL FIELD
The present disclosure relates to, but is not limited to, the technical field of image processing, and in particular relates to an animation generation method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
BACKGROUND
Existing video-based frameworks for three-dimensional human motion reconstruction and prediction are usually built on three-dimensional human pose reconstruction. For example, given a video, the human pose is reconstructed separately for each video frame. Because the individually reconstructed human poses lack temporal continuity, existing approaches tend to produce output poses that lack smooth transitions between frames, so the reconstruction is prone to noise.
SUMMARY
The embodiments of the present disclosure provide at least an animation generation method and apparatus, a computer device, a storage medium, a computer program, and a computer program product.
An embodiment of the present disclosure provides an animation generation method, including:
acquiring a target video containing a target object;
extracting human body features respectively corresponding to at least two video frames in the target video;
determining, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching a target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
generating, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
An embodiment of the present disclosure also provides an animation generation apparatus, including:
a video acquisition module, configured to acquire a target video containing a target object;
a human body feature extraction module, configured to extract human body features respectively corresponding to at least two video frames in the target video;
a motion feature determination module, configured to determine, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
an animation generation module, configured to generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
An embodiment of the present disclosure also provides a computer device, including a processor, a memory and a bus; the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the above aspect, or of any possible animation generation method of the aspect, are executed.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the above aspect, or of any possible animation generation method of the aspect, are executed.
An embodiment of the present disclosure provides a computer program; the computer program includes computer-readable code, and when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
An embodiment of the present disclosure provides a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
The present disclosure provides an animation generation method and apparatus, a computer device, a storage medium, a computer program, and a computer program product. A neural network trained on a prior space has the ability to accurately output any prior motion feature in the prior space. Moreover, because a prior motion feature stored in the prior space fuses at least two frames of sample motion data used to describe a certain motion, and the at least two frames of sample motion data include timing information, a prior motion feature stored in the prior space can characterize the continuous actions of a certain motion. Therefore, the target prior motion feature can characterize the continuous motion of the target object in the target video; using the target prior motion feature, a virtual object can be made to reproduce, as a continuous three-dimensional motion animation, the target motion made by the target object, reducing noise interference in the human motion of the three-dimensional motion animation.
For descriptions of the effects of the above animation generation apparatus, computer device, storage medium, computer program and computer program product, reference may be made to the description of the above animation generation method.
To make the above objects, features and advantages of the present disclosure more apparent and understandable, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only certain embodiments of the present disclosure and therefore should not be regarded as limiting the scope; a person of ordinary skill in the art may obtain other related drawings from these drawings without creative effort.
FIG. 1 shows a flowchart of an animation generation method provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic flowchart of a neural network outputting a target prior motion feature provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic flow diagram of an animation generation process provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an animation generation apparatus provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
In addition, the terms "first", "second", and the like in the specification and claims of the embodiments of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described here.
"At least two" or "several" mentioned herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following associated objects.
Research has found that existing video-based frameworks for three-dimensional human motion reconstruction and prediction are built on three-dimensional human pose reconstruction: for example, given a video, the human pose is reconstructed separately for each video frame. Because the individually reconstructed human poses lack temporal continuity, the output poses tend to lack smooth transitions between frames, so noise easily appears in the results.
Based on the above research, the present disclosure provides an animation generation method. A neural network trained on a prior space has the ability to accurately output any prior motion feature in the prior space. Moreover, because a prior motion feature stored in the prior space fuses at least two frames of sample motion data used to describe a certain motion, and the at least two frames of sample motion data include timing information, a prior motion feature stored in the prior space can characterize the continuous actions of a certain motion. Therefore, the target prior motion feature can characterize the continuous motion of the target object in the target video; using the target prior motion feature, a virtual object can be made to reproduce, as a continuous three-dimensional motion animation, the target motion made by the target object, reducing noise interference in the human motion of the three-dimensional motion animation.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
Special terms involved in the embodiments of the present disclosure are explained in detail below:
Prior space: sample data are encoded by way of reconstruction using the structure of a variational auto-encoder neural network, and the resulting encoding space is the prior space. The prior space is a high-dimensional space and includes at least two prior motion features. A prior motion feature can represent a reasonably smooth human motion of a specified length, that is, a series of continuous human actions; for example, when the human motion is running, the series of human actions includes a cycle of standing (at least two frames), lifting the left leg (at least two frames), and lifting the right leg (at least two frames). Each action includes at least one frame of motion data, and a human motion of a specified length includes a fixed number of frames of motion data.
To facilitate understanding of the present embodiments, an animation generation method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the animation generation method provided by the embodiments of the present disclosure is generally a computer device with certain computing capability. In some possible implementations, the animation generation method may be implemented by a processor calling computer-readable instructions stored in a memory.
The animation generation method provided by the embodiments of the present disclosure is described below by taking a terminal device or other processing device as the execution subject. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the animation generation method may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a flowchart of an animation generation method provided by an embodiment of the present disclosure, the method includes steps S101 to S104, wherein:
S101: Acquire a target video containing a target object.
In this step, the target object may include a person in a real scene, a human body model in a virtual scene, or the like. The target video may be a video of human motion captured by a user with a capture device; the human motion may include one or more of walking, running, rope skipping, standing, lying down and other motions. The human motion may include a series of continuous actions made by the target object, including motions such as stationary actions, that is, the target motion described below.
Here, the target video may be a human motion video captured by the user, or may be a continuous segment clipped from a captured human motion video. The target video may consist of at least two video frames; a video frame may also be called a video image, so at least two video frames are at least two video images. Because a prior motion feature stored in the prior space represents a human motion of a specified length, and a human motion of a specified length is also represented as a fixed number of frames of motion data, in order to restore the motion of the target object in each video frame of the target video, the number of video frames in the target video may be smaller than or equal to the number of frames of the fixed-frame motion data. Therefore, acquiring the target video may be implemented as acquiring a video with a fixed number of frames, or clipping a fixed number of frames from a captured video, and so on.
S102: Extract human body features respectively corresponding to at least two video frames in the target video.
In this step, the human body feature corresponding to a video frame includes the feature containing the target object extracted from that video frame.
Exemplarily, the at least two video frames in the target video may include every video frame in the target video, or may include a continuous segment of at least two video frames extracted from the target video, that is, some of the video frames in the target video.
As for the human body feature extraction process, in some embodiments a temporal feature extraction network may be used to extract the features of the target object from the corresponding video frames in the order of the timing information of at least some of the video frames. Because each video frame in the target video contains its corresponding timing information, each extracted human body feature corresponds to the timing information of its video frame. Temporal feature extraction networks include, for example, the gated recurrent unit (GRU), the long short-term memory network (LSTM), and the like.
Here, the extracted human body features containing timing information may be combined into a time-series feature set.
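For illustration only, a minimal sketch of such temporal feature extraction is given below (Python/PyTorch); it assumes per-frame 2048-dimensional backbone features and a single GRU layer, and all names and sizes are assumptions of the sketch rather than the disclosed implementation:

    import torch
    import torch.nn as nn

    class TemporalFeatureExtractor(nn.Module):
        # One GRU pass over the per-frame features preserves each frame's
        # timing information in its corresponding output feature.
        def __init__(self, feat_dim=2048):
            super().__init__()
            self.gru = nn.GRU(input_size=feat_dim, hidden_size=feat_dim,
                              num_layers=1, batch_first=True)

        def forward(self, frame_feats):          # (batch, T, 2048)
            out, _ = self.gru(frame_feats)
            return out                           # one feature per video frame

    human_feats = TemporalFeatureExtractor()(torch.randn(1, 16, 2048))
    print(human_feats.shape)                     # torch.Size([1, 16, 2048])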
S103: Determine, based on the human body features respectively corresponding to at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions.
In this step, the neural network is a neural network trained using the prior space. In some embodiments, it may be obtained by supervised training using the prior motion features stored in the prior space, so that it acquires the ability to accurately output any prior motion feature in the prior space; that is, a target prior motion feature output by the trained neural network is identical to some prior motion feature stored in the prior space.
In some embodiments, the human body features respectively corresponding to the at least two video frames may all be input into the trained neural network for regression processing (for example, linear regression processing or least-squares regression processing) to regress the target prior motion feature. Exemplarily, given a 16-frame target video and a 2048-dimensional human body feature vector extracted from each video frame (the human body feature includes a human body feature vector), the 2048-dimensional human body feature vectors respectively corresponding to the 16 video frames are all input into the trained neural network; the convolutional layers of the neural network are used for feature extraction, the fully connected layer is used for feature fusion, and the fused feature encoding serves as the target prior motion feature, completing the regression of the target prior motion feature. Here, because a prior motion feature fuses at least two frames of sample motion data used to describe a certain motion, and the at least two frames of sample motion data respectively contain their corresponding timing information, the motion characterized by a prior motion feature is continuous, so it can be determined that the regressed target prior motion feature can characterize the continuous motion of the target object in the target video.
In some embodiments, the human body features respectively corresponding to a selected subset of continuous video frames are all input into the trained neural network to regress the target prior motion feature. In some embodiments, on the basis of at least two video frames in the target video, at least two subsequent continuous video frames are further predicted, and the human body features respectively corresponding to the at least two video frames in the target video, together with the predicted human body features respectively corresponding to the predicted continuous video frames, are all input into the trained neural network to regress the target prior motion feature. In some embodiments, on the basis of a subset of continuous video frames in the target video, at least two subsequent continuous video frames are further predicted, and the human body features respectively corresponding to the subset of continuous video frames, together with the predicted human body features respectively corresponding to the predicted continuous video frames, are all input into the trained neural network to regress the target prior motion feature. Here, for the regression process, reference may be made to the description of the regression of the target prior motion feature in the first optional implementation.
S104: Generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
In this step, the virtual object is a three-dimensional object displayed in a virtual scene such as a computer display screen, used to imitate the target motion made by the target object. Virtual objects include, for example, a virtual cartoon character, a virtual character modeled to restore the target object, and the like.
The target motion includes the set of a series of continuous actions made by the target object in the target video. Taking walking as an example, the series of continuous actions includes a cycle of three kinds of actions: at least two frames of standing, at least two frames of lifting the left leg, and at least two frames of lifting the right leg.
In some embodiments, a decoder may be used to decode the encoded target prior motion feature to obtain at least two frames of motion data capable of characterizing the target motion made by the target object; then, three-dimensional human motion modeling is performed using the at least two frames of motion data to generate a three-dimensional motion animation in which the virtual object performs the target motion made by the target object. The virtual object in the three-dimensional motion animation can perform the same series of continuous actions as the target object performs in the target video. The decoder may be any commonly used pre-trained decoder, which is not limited in the embodiments of the present disclosure.
Here, because a prior motion feature has no parameter constraining the human body shape, a default human body shape parameter may be used for three-dimensional human motion reconstruction during modeling. In addition, because the first frame of motion data corresponding to a prior motion feature defaults to a certain set orientation, three-dimensional human motion reconstruction may be performed in the default orientation when the orientation of the virtual object is not changed.
Here, besides the moving virtual object, the three-dimensional motion animation may also incorporate three-dimensional scene information according to the actual application, which is not limited in the embodiments of the present disclosure.
In the above S101 to S104, the neural network trained on the prior space has the ability to accurately output any prior motion feature in the prior space. Moreover, because a prior motion feature stored in the prior space fuses at least two frames of sample motion data used to describe a certain motion, and the at least two frames of sample motion data include timing information, a prior motion feature stored in the prior space can characterize the continuous actions of a certain motion. Therefore, the target prior motion feature can characterize the continuous motion of the target object in the target video; using the target prior motion feature, the virtual object can reproduce, as a continuous three-dimensional motion animation, the target motion made by the target object, reducing noise interference in the human motion of the three-dimensional motion animation.
For S103, determining the target prior motion feature: in some embodiments, a time-series feature set corresponding to the target video may be determined based on the human body features respectively corresponding to at least some of the video frames in the target video, and the target prior motion feature may be determined based on the time-series feature set and the neural network.
Here, the time-series feature set may include the human body features respectively corresponding to at least two video frames in the target video; or may include the human body features respectively corresponding to a subset of continuous video frames; or may include the human body features respectively corresponding to at least some of the video frames together with the predicted human body features respectively corresponding to at least two video frames after the target video; or may include the human body features respectively corresponding to a subset of continuous video frames together with the predicted human body features respectively corresponding to at least two video frames after the target video. Because the time-series feature set corresponding to the target video includes the human body features corresponding to multiple video frames, it can comprehensively reflect the overall motion information of the target object in the target video; therefore, processing the time-series feature set with the neural network can determine a relatively accurate target prior motion feature reflecting the continuous motion of the target object in the target video.
Here, different animation generation tasks correspond to different human body features contained in the time-series feature set. Animation generation tasks may include, for example, a human motion reconstruction task, which includes reconstructing, according to the content of the target video, the target motion made by the target object in the target video (containing only the actions the target object has already made in the target video) and generating a three-dimensional motion animation; or a human motion prediction task, which includes determining the target motion made by the target object in the target video (containing the actions the target object has already made in the target video as well as the predicted actions) according to the content of the target video and the predicted actions the target object is about to make.
For the human motion reconstruction task, determining the time-series feature set: in some embodiments, the extracted human body features respectively corresponding to at least two video frames may be concatenated according to first timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
Here, the first timing information includes the playback times of the video frames in the continuously played target video, and each video frame corresponds to its own first timing information. For example, with three video frames, the first timing information corresponding to the human body feature extracted from the first video frame is first, that of the second video frame is second, and that of the third video frame is third; afterwards, the corresponding human body features are concatenated in first, second and third order. Here, the concatenation may be end-to-end; that is, according to the first timing information corresponding to the video frames, each later human body feature is appended after the human body feature of the previous timing: the human body feature extracted from the first video frame (a 2048-dimensional human body feature vector) comes first, the feature of the second video frame (a 2048-dimensional human body feature vector) is appended after it, and the feature of the third video frame (a 2048-dimensional human body feature vector) is appended after that. That is, with a 2048-dimensional human body feature vector per frame, the human body features corresponding to three video frames are concatenated according to the first timing information into a 3×2048-dimensional human body feature vector, namely the time-series feature set. Because the resulting time-series feature set contains the first timing information, it can comprehensively reflect the overall motion information of the target object in the target video, and this overall motion information is temporally continuous.
Alternatively, the extracted human body features respectively corresponding to a subset of continuous video frames may be concatenated according to the first timing information corresponding to the video frames to determine the time-series feature set corresponding to the target video. For some implementations of the execution process, reference may be made to the above description.
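A sketch of the end-to-end concatenation by first timing information, with three hypothetical 2048-dimensional frame features, might read:

    import torch

    # Hypothetical per-frame features keyed by first timing information
    # (playback order); the keys are deliberately unsorted here.
    feats = {2: torch.randn(2048), 0: torch.randn(2048), 1: torch.randn(2048)}

    # End-to-end concatenation in timing order: 3 x 2048 -> (6144,)
    time_series_set = torch.cat([feats[t] for t in sorted(feats)], dim=0)
    print(time_series_set.shape)                 # torch.Size([6144])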
For the human motion prediction task, determining the time-series feature set: in some embodiments, predicted human body features respectively corresponding to N video frames after the target video are predicted based on the human body features respectively corresponding to at least two video frames in the target video, where N is a preset positive integer; the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the N video frames are concatenated according to second timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
Here, the second timing information includes the playback times of the video frames in the continuously played target video plus the playback times of the N video frames, and each of these video frames corresponds to its own second timing information. For example, with three video frames of the target video plus one to-be-predicted video frame after the target video, the second timing information corresponding to the human body features extracted from the first, second, third and fourth video frames is first, second, third and fourth respectively; afterwards, the corresponding human body features are concatenated in that order. Here, the concatenation may be end-to-end; that is, according to the second timing information corresponding to the video frames, each later human body feature is appended after the human body feature of the previous timing: the 2048-dimensional feature of the first video frame comes first, the feature of the second video frame is appended after it, the feature of the third video frame after that, and the feature of the fourth video frame after that. That is, with a 2048-dimensional human body feature vector per frame, the human body features corresponding to four video frames are concatenated according to the second timing information into a 4×2048-dimensional human body feature vector, namely the time-series feature set. Here, based on the human body features respectively corresponding to multiple video frames in the target video, predicted human body features related to the actions of the target motion made by the target object can be predicted, and the human body features extracted before prediction and the predicted human body features obtained after prediction are concatenated according to the second timing information; because the resulting time-series feature set contains the second timing information, it can comprehensively reflect the overall motion information of the target object over a period of time (including the three video frames of the target video and the one to-be-predicted video frame after the target video), and this overall motion information is temporally continuous.
As for predicting the predicted human body features respectively corresponding to the N video frames after the target video, one possible implementation is as follows: first, the human body features respectively corresponding to a subset of continuous video frames may be selected from the human body features respectively corresponding to the at least two video frames in the target video, and the selected features are input into an autoregressive recurrent network to predict the predicted human body feature corresponding to the first video frame after the target video; then, this predicted human body feature, together with a further selected subset of the continuous-frame features, is input into the autoregressive recurrent network; by such cyclic prediction, the predicted human body features respectively corresponding to a preset number of video frames can be obtained, that is, the predicted human body features respectively corresponding to N video frames, where N is the preset number of video frames to be predicted and takes a positive integer value.
Here, because a prior motion feature stored in the prior space represents a human motion of a specified length, and a human motion of a specified length is also represented as a fixed number of frames of motion data, in order to restore the motion of the target object in the at least two video frames of the target video as well as the predicted motion of the target object corresponding to the N video frames, the number of video frames in the target video plus the number of the N video frames may be smaller than or equal to the number of frames of the fixed-frame motion data. Therefore, when presetting the value of N, it suffices to ensure that the number of video frames in the target video plus N is smaller than or equal to the number of frames of the fixed-frame motion data; the embodiments of the present disclosure do not limit the value.
Exemplarily, suppose the target video includes three video frames (determined according to the second timing information as the first, second and third frames) and N = 2 is preset (the N video frames determined according to the second timing information include the fourth and fifth video frames). When determining the predicted human body features, exemplarily, the human body features respectively corresponding to the three video frames of the target video may be input into the autoregressive recurrent network, which outputs a predicted human body feature related to the features of the above three frames, and the predicted human body feature corresponding to the fourth video frame is continuous with the human body feature corresponding to the third video frame; then, the human body features respectively corresponding to the second and third video frames, together with the predicted human body feature corresponding to the fourth video frame, are input into the autoregressive recurrent network, which outputs a predicted human body feature related to these features, and the predicted human body feature corresponding to the fifth video frame is continuous with the predicted human body feature corresponding to the fourth video frame.
Another possible implementation is to input the human body features respectively corresponding to the at least two video frames in the target video into an autoregressive neural network to regress a predicted human body feature; then, the human body features respectively corresponding to the at least two video frames, together with each predicted human body feature obtained so far, are input into the autoregressive neural network to regress the next predicted human body feature, and so on cyclically until N predicted human body features are obtained.
In one embodiment, for the (M+1)-th video frame among the video frames after the target video, its corresponding predicted human body feature is predicted. In some embodiments, the predicted human body feature of the (M+1)-th video frame after the target video is predicted based on the human body features respectively corresponding to at least some of the video frames in the target video and the predicted human body features respectively corresponding to the M video frames, where M is a positive integer smaller than N.
Exemplarily, the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the M video frames may be input into the autoregressive neural network to regress the predicted human body feature corresponding to the (M+1)-th video frame. Alternatively, the human body features respectively corresponding to a subset of continuous video frames in the target video and the predicted human body features respectively corresponding to the M video frames are input into the autoregressive neural network to regress the predicted human body feature corresponding to the (M+1)-th video frame.
Exemplarily, first, the human body features respectively corresponding to the at least two video frames in the target video are input into the autoregressive neural network to regress the predicted human body features corresponding to the M video frames after the target video; then, the predicted human body features corresponding to the M video frames are input into the autoregressive neural network to regress the predicted human body feature corresponding to the (M+1)-th video frame after the target video.
Here, the predicted human body features corresponding to the preceding M video frames, together with the human body features respectively corresponding to at least two video frames in the target video, can be fully utilized to accurately predict a feature related to the above features, namely the predicted human body feature of the (M+1)-th video frame. Because the correlation between the predicted human body feature and the human body features is high, the predicted human body feature is continuous with the human body feature or predicted human body feature corresponding to the preceding video frame.
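For illustration, the cyclic prediction could be sketched as follows; the recurrent structure and all dimensions are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class AutoregressivePredictor(nn.Module):
        # Given the features of the frames observed so far, regress the
        # feature of the next frame.
        def __init__(self, feat_dim=2048):
            super().__init__()
            self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.head = nn.Linear(feat_dim, feat_dim)

        def forward(self, feats):            # feats: (batch, T, feat_dim)
            out, _ = self.gru(feats)
            return self.head(out[:, -1])     # feature of frame T+1

    def predict_future(model, observed, n_future):
        # observed: (batch, T, feat_dim) human features of the target video;
        # each new prediction is appended and fed back in, so frame M+1 is
        # predicted from the observed frames plus the M predictions so far.
        feats = observed
        for _ in range(n_future):
            nxt = model(feats)
            feats = torch.cat([feats, nxt.unsqueeze(1)], dim=1)
        return feats[:, observed.shape[1]:]  # (batch, N, feat_dim)

    model = AutoregressivePredictor()
    future = predict_future(model, torch.randn(1, 3, 2048), n_future=2)
    print(future.shape)                      # torch.Size([1, 2, 2048])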
For the neural network trained on the prior space in S103, the training process may refer to the following steps:
Step 1: Acquire a sample video, and extract sample human body features respectively corresponding to at least two video frames in the sample video.
Here, for the feature extraction process of the sample human body features respectively corresponding to the at least two video frames in the sample video, reference may be made to the human body feature extraction process in S102.
Step 2: Determine, based on the neural network to be trained and the sample human body features, a sample prior motion feature corresponding to the sample video.
Here, the process of determining the sample prior motion feature may refer to the process of determining the target prior motion feature in S103. In some embodiments, the extracted at least two sample human body features are concatenated according to the timing information of the video frames in the sample video to determine a sample human body feature set; then, the sample human body feature set is input into the neural network to be trained, and at least two convolutional layers each perform feature extraction on the sample human body feature set to obtain at least two sample extracted features; afterwards, the sample extracted features are input into a fully connected layer, which outputs the sample prior motion feature.
Step 3: Detect whether the sample prior motion feature is a prior motion feature in the prior space, and adjust network parameters of the neural network to be trained based on the detection result.
Here, supervised training may be performed using the prior motion features stored in the prior space; that is, the loss between the sample prior motion feature and the prior motion features in the prior space may be computed, a loss function is constructed from the loss, and the loss function is used to adjust the network parameters of the neural network to be trained, until every sample prior motion feature output by the neural network is consistent with some prior motion feature in the prior space, that is, the mapping space of the neural network coincides completely with the prior space, at which point the training of the neural network can be determined to be complete.
Here, the sample prior motion features are continually optimized toward the prior motion features stored in the prior space; during the optimization, the network parameters of the neural network to be trained are continually adjusted, yielding a neural network with the ability to accurately output any prior motion feature in the prior space.
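For illustration, one training step under the above supervision could be sketched as below, where a toy tensor `prior_bank` stands in for the prior motion features stored in the prior space and the nearest-feature loss is an assumption of the sketch:

    import torch
    import torch.nn as nn

    def training_step(net, optimizer, sample_feature_set, prior_bank):
        # Regress a sample prior motion feature, then pull it onto the
        # closest prior motion feature stored in the prior space.
        pred = net(sample_feature_set)                  # (batch, D)
        with torch.no_grad():
            dists = torch.cdist(pred, prior_bank)       # (batch, K)
            target = prior_bank[dists.argmin(dim=1)]    # nearest prior feature
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 2048, 256))  # toy network
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    bank = torch.randn(32, 256)                         # toy prior space
    print(training_step(net, opt, torch.randn(4, 3, 2048), bank))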
For determining the target prior motion feature in S103, one possible implementation is as follows: after the time-series feature set is input into the neural network, the neural network performs the following operations: performing feature extraction on the time-series feature set based on at least two convolutional layers of the neural network to obtain at least two extracted features; and inputting the extracted features into a fully connected layer to obtain the target prior motion feature.
Here, the neural network may be a convolutional neural network (CNN) or the like. The processing may refer to FIG. 2, a schematic flowchart of the neural network outputting the target prior motion feature, which includes the neural network 21, two convolutional layers 22 and 23, and a fully connected layer 24. Exemplarily, the convolutional layer 22 may be used to perform feature extraction on the time-series feature set to obtain an extracted feature A; then, the convolutional layer 23 performs feature extraction on the extracted feature A to obtain an extracted feature B; finally, the extracted feature A and the extracted feature B are input into the fully connected layer 24 to obtain the target prior motion feature.
In some embodiments, feature extraction may also be performed on the time-series feature set based on at least two convolutional layers to obtain the extracted feature corresponding to each convolutional layer; then, according to set requirements, at least two target extracted features are selected from the at least two extracted features, and the target extracted features are input into the fully connected layer 24 to obtain the target prior motion feature.
Here, performing feature extraction on the time-series feature set with at least two convolutional layers yields at least two extracted features that characterize the human motion and have considerable depth; performing full-connection processing over these higher-depth extracted features yields a relatively accurate target prior motion feature reflecting the continuous motion of the target object in the target video.
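A sketch of the FIG. 2 structure follows, with two Conv1d layers producing extracted features A and B that are jointly fed to a fully connected layer; the channel counts and the mean-pooling before the fully connected layer are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class PriorFeatureRegressor(nn.Module):
        def __init__(self, feat_dim=2048, prior_dim=256):
            super().__init__()
            self.conv1 = nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1)
            self.conv2 = nn.Conv1d(512, 512, kernel_size=3, padding=1)
            self.fc = nn.Linear(512 + 512, prior_dim)

        def forward(self, x):                 # x: (batch, T, feat_dim)
            x = x.transpose(1, 2)             # Conv1d expects (batch, C, T)
            a = torch.relu(self.conv1(x))     # extracted feature A
            b = torch.relu(self.conv2(a))     # extracted feature B
            pooled = torch.cat([a.mean(-1), b.mean(-1)], dim=1)
            return self.fc(pooled)            # target prior motion feature

    z = PriorFeatureRegressor()(torch.randn(1, 16, 2048))
    print(z.shape)                            # torch.Size([1, 256])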
In some embodiments, while the neural network processes the time-series feature set, the neural network may determine at least two prior motion features. For example, for a time-series feature set that includes an arms-crossed action, during convolution the neural network may regress the prior motion feature corresponding to the action with the left hand in front and the right hand behind, as well as the prior motion feature corresponding to the action with the left hand behind and the right hand in front. In this case, because the prior motion features in the prior space follow a normal distribution, a prior motion feature closer to the origin of the normal distribution has a higher probability of occurring; that is, the smaller the feature value of a prior motion feature, the higher its probability. Therefore, the target prior motion feature may be determined based on the feature values of the determined at least two prior motion features. In the related art, generating three-dimensional human motion data from two-dimensional video frame data suffers from the ambiguity of multiple target prior motion features. In the embodiments of the present disclosure, the probability of the prior motion feature is made as large as possible through constraints, thereby reducing cases of multiple target prior motion features, that is, reducing the ambiguity under multiple solutions, and the target prior motion feature with the larger probability can be selected as the output.
In some embodiments, the feature value corresponding to each of the at least two prior motion features may be computed, and the prior motion feature corresponding to the smallest feature value is taken as the target prior motion feature.
The computation of the feature value includes: given that a prior motion feature is an R-dimensional feature vector, determine the value K_r of each dimension r, and determine the feature value Y according to formula (1), where R is a positive integer:

Y = √( Σ_{r=1}^{R} K_r² )    (1)

Here, the target prior motion feature closest to the origin of the normal distribution is filtered out of the determined at least two prior motion features; using this target prior motion feature, the series of actions in the target motion made by the target object can be restored relatively accurately.
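A sketch of this selection, reading formula (1) as the L2 norm, i.e. the distance from the origin of the normal distribution (an assumption consistent with the surrounding description), is:

    import torch

    # Candidate prior motion features regressed for an ambiguous action;
    # each row is an R-dimensional feature vector (R = 256 is hypothetical).
    candidates = torch.randn(2, 256)

    # Feature value per formula (1): Y = sqrt(sum over dimensions of K_r^2).
    feature_values = torch.sqrt((candidates ** 2).sum(dim=1))

    # The smaller the feature value, the higher the probability under the
    # normal-distributed prior, so keep the candidate closest to the origin.
    target = candidates[feature_values.argmin()]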
For S104, generating the three-dimensional motion animation: in some embodiments, at least two frames of motion data used to describe the motion of the virtual object may be determined based on the target prior motion feature and a pre-trained decoder corresponding to the prior space; then, based on the at least two frames of motion data of the virtual object's motion, a three-dimensional motion animation is generated in which the virtual object performs the target motion made by the target object, wherein the number of frames of the three-dimensional motion animation is the same as the number of frames of the target video.
Here, the encoder and decoder of an auto-encoder may be trained with at least two frames of sample motion data; the prior motion features are obtained by the encoder's encoding, so the decoder obtained from the same training can decode the prior motion features.
Exemplarily, the target prior motion feature is input into the pre-trained decoder corresponding to the prior space, and convolution processing is performed through convolutional layers, that is, the encoded target prior motion feature is decoded, finally outputting at least two frames of motion data capable of expressing the motion of the virtual object; then, three-dimensional human motion modeling is performed with the at least two frames of motion data to generate a three-dimensional motion animation in which the virtual object performs the target motion made by the target object; the virtual object in the three-dimensional motion animation can perform the same series of continuous actions as the target object performs in the target video.
Here, because the prior motion features in the prior space are obtained by the encoder's encoding, the trained decoder corresponding to the prior space can relatively accurately decode the at least two frames of motion data used to describe the motion of the virtual object. Because the at least two frames of motion data are obtained by decoding the target prior motion feature, which reflects the continuous motion of the target object in the target video, the at least two frames of motion data can reflect the continuous motion of the target object in the target video, thereby generating a continuous three-dimensional motion animation in which the virtual object reproduces the target motion made by the target object.
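A minimal decoder sketch follows; the 72-dimensional per-frame pose (an axis-angle convention common to parametric body models) and the layer shapes are assumptions:

    import torch
    import torch.nn as nn

    class MotionDecoder(nn.Module):
        # Decode a prior motion feature by convolution into a fixed number
        # of frames of motion data.
        def __init__(self, prior_dim=256, n_frames=16, pose_dim=72):
            super().__init__()
            self.expand = nn.Linear(prior_dim, n_frames * 64)
            self.conv = nn.Conv1d(64, pose_dim, kernel_size=3, padding=1)
            self.n_frames = n_frames

        def forward(self, z):                       # z: (batch, prior_dim)
            x = self.expand(z).view(z.shape[0], 64, self.n_frames)
            return self.conv(x).transpose(1, 2)     # (batch, n_frames, pose)

    motion = MotionDecoder()(torch.randn(1, 256))
    print(motion.shape)                             # torch.Size([1, 16, 72])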
In some embodiments, because the first frame count of the target video does not exceed the second frame count of the motion data, in a case where the first frame count is smaller than the second frame count, target motion data are filtered out of the motion data of the second frame count based on the first frame count, and a three-dimensional motion animation in which the virtual object performs the target motion made by the target object is generated based on the target motion data.
In some embodiments, the first frame count of the target video is smaller than or equal to the second frame count of the motion data; in a case where the first frame count is smaller than the second frame count, target motion data of the first frame count are filtered out of the motion data of the second frame count, and a three-dimensional motion animation of the first frame count is generated using the target motion data of the first frame count; that is, the number of frames of the three-dimensional motion animation is the same as the number of frames of the target video.
In one possible implementation, when filtering the target motion data of the first frame count out of the motion data of the second frame count, because the motion data include timing information, a continuous run of the first frame count of motion data may be selected sequentially starting from the first frame of motion data and taken as the target motion data. Alternatively, the selection may follow other selection rules over the motion data of the second frame count according to the actual application scenario, which is not limited in the embodiments of the present disclosure.
Here, to restore the three-dimensional motion animation of the target motion made by the target object in the target video, the number of frames of the resulting three-dimensional motion animation may be the same as the number of frames of the target video. Because the number of frames of the at least two frames of motion data decoded by the decoder is fixed, and the first frame count of the target video is smaller than the second frame count of the motion data, the target motion data need to be filtered out of the second-frame-count motion data so that a three-dimensional motion animation with the same number of frames as the target video is restored.
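The frame-count filtering reduces to taking the first consecutive run of frames, as the brief sketch below illustrates (frame counts and the 72-dimensional pose are hypothetical):

    import torch

    second_frame_count = 16                    # fixed frames the decoder emits
    first_frame_count = 12                     # frames in the target video
    motion_data = torch.randn(second_frame_count, 72)   # decoded motion data

    # The motion data carry timing information, so take the first
    # `first_frame_count` consecutive frames as the target motion data.
    target_motion_data = motion_data[:first_frame_count]
    print(target_motion_data.shape)            # torch.Size([12, 72])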
In some embodiments, because the prior motion features have no parameter constraining the human body shape, in order for the modeled virtual object to restore the shape of the target object, the human body shape information corresponding to the target object needs to be analyzed from the target video; with the determined human body shape information corresponding to the target object, a virtual object consistent with the target object's shape can be modeled.
In some embodiments, first full-connection processing may be performed on the human body feature of a target video frame in the target video to determine the human body shape information corresponding to the target object; then, based on the target prior motion feature and the human body shape information, a three-dimensional motion animation is generated in which the virtual object performs the target motion made by the target object.
Here, the target video frame may be any video frame randomly filtered out of the target video, or a designated first video frame, or the like.
Here, for determining the human body shape information corresponding to the target object, in some embodiments the human body feature corresponding to the target video frame is input into fully connected layers, at least one fully connected layer is used to regress the corresponding human body shape information from the human body feature, and the human body shape information is then used as the shape information of the virtual object during three-dimensional human motion modeling, so that the virtual object's shape remains unchanged. In this case, the three-dimensional motion animation shows a virtual object with the same shape as the target object performing the target motion made by the target object. (An illustrative sketch covering this full-connection processing, together with the orientation regression, follows the next subsection.)
In some embodiments, the human body shape information is used to constrain the shape of the virtual object in the three-dimensional motion animation so that the virtual object's shape remains unchanged in every frame of the animation, improving the display effect of the virtual object in the three-dimensional motion animation and thus the user's visual viewing experience.
In some embodiments, because the first frame of motion data corresponding to a prior motion feature defaults to a certain set orientation, for example facing the front of the display screen, it is likely to be inconsistent with the orientation information of the target object in the target video; therefore, in order for the modeled virtual object to restore the orientation information of the target object in the target video, orientation adjustment information corresponding to the target object needs to be determined from the human body feature of the first video frame. In some embodiments, second full-connection processing is performed on the human body feature of the first video frame in the target video to determine the orientation adjustment information corresponding to the target object; based on the target prior motion feature and the orientation adjustment information, a three-dimensional motion animation is generated in which the virtual object performs the target motion made by the target object.
Here, for determining the orientation adjustment information corresponding to the target object, in some embodiments the human body feature corresponding to the first video frame may be input into fully connected layers, and at least one fully connected layer is used to regress the corresponding orientation adjustment information from the human body feature, which may include rotating by a certain adjustment angle in a set direction; then, the orientation adjustment information is used to adjust the orientation of the virtual object during modeling, for example adjusting by an adjustment angle clockwise, so that the orientation of the virtual object in each frame of the three-dimensional motion animation is the same as the orientation of the target object in the at least two video frames of the target video, which can improve the user's visual experience of watching the three-dimensional motion animation.
Alternatively, the orientation adjustment information may also include an orientation, and during modeling the orientation of each frame of the virtual object's model is adjusted to the orientation determined in the orientation adjustment information.
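The two full-connection processes can be sketched together as below; the output sizes (10 shape parameters, a 3-dimensional rotation) are borrowed from common parametric body models and, like the hidden sizes, are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class ShapeAndOrientationHeads(nn.Module):
        def __init__(self, feat_dim=2048):
            super().__init__()
            # First full-connection processing: body shape from the feature
            # of an arbitrary target video frame.
            self.shape_fc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                          nn.Linear(256, 10))
            # Second full-connection processing: orientation adjustment from
            # the feature of the first video frame.
            self.orient_fc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                           nn.Linear(256, 3))

        def forward(self, target_frame_feat, first_frame_feat):
            body_shape = self.shape_fc(target_frame_feat)   # shape info
            orientation = self.orient_fc(first_frame_feat)  # adjustment info
            return body_shape, orientation

    heads = ShapeAndOrientationHeads()
    shape, orient = heads(torch.randn(1, 2048), torch.randn(1, 2048))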
Referring to FIG. 3, which is a schematic flow diagram of the animation generation process, it includes: a target video 31; a temporal feature extraction network 32; a time-series feature set 33; a neural network 34; a target prior motion feature 35; a pre-trained decoder 36 corresponding to the prior space; at least two frames of motion data 37 output by the decoder 36; fully connected layers 38 corresponding to the first full-connection processing; fully connected layers 39 corresponding to the second full-connection processing; another video 40, which has a shooting angle different from that of the target video 31 and contains the same target object at the same moments as the target video 31; a temporal convolution 41 for extracting temporal features from the other video 40, for example a temporal encoder; here, the time-series feature set 33 includes the time-series feature set determined based on the target video 31, the time-series feature set determined based on the other video 40, and so on; a three-dimensional human animation 44 generated when the virtual object's shape is not changed and its orientation is not adjusted; and a three-dimensional human animation 45 generated when the virtual object's shape and orientation are adjusted.
Exemplarily, some implementations of the animation generation process include: acquiring a target video 31 containing the target object, and extracting the human body features respectively corresponding to each video frame in the target video 31 using the temporal feature extraction network 32; for the human motion reconstruction task, concatenating each extracted human body feature according to the first timing information corresponding to the video frames to obtain the time-series feature set 33 corresponding to the target video; for the human motion prediction task, concatenating each extracted human body feature and each predicted human body feature according to the second timing information to obtain the time-series feature set 33 corresponding to the target video. Then, the neural network 34 regresses the target prior motion feature 35. Then, the target prior motion feature is input into the decoder 36 for decoding to obtain at least two frames of motion data 37. In this way, by training the prior space in advance, each prior motion feature in the prior space can correspond to a reasonably smooth human motion of a selected length. Therefore, to achieve high-precision three-dimensional human motion reconstruction and prediction, the prior space can be taken as the target space for three-dimensional human motion reconstruction and prediction. The corresponding human body features can first be extracted from the input target video 31, then the target prior motion feature 35 in the prior space is regressed from the human body features, and finally the human motion corresponding to the target prior motion feature 35 is the result of three-dimensional human motion reconstruction or prediction. In contrast to the related art, where human motion is generated frame by frame and a segment of human motion is composed of multiple consecutive human motions, in the embodiments of the present disclosure the corresponding human motion can be generated in one pass based on the target prior motion feature 35, improving the continuity and smoothness of the human motion.
In a case where it is determined that the virtual object's shape and orientation are not adjusted, the three-dimensional human animation 44 is generated using the at least two frames of motion data 37.
In a case where it is determined that the target object's shape and orientation are to be restored, another video 40 containing the target object is acquired, and the human body features respectively corresponding to each video frame in the other video 40 are extracted using the temporal convolution 41, from which a time-series feature set 33 can also be obtained; using the human body feature corresponding to the first video frame in the other video 40, the human body shape information 42 is determined through the fully connected layers 38, and the orientation adjustment information 43 is determined through the fully connected layers 39. The fully connected layers 38 may include a first fully connected layer 301 and a second fully connected layer 302, and the fully connected layers 39 may include a third fully connected layer 303 and a fourth fully connected layer 304. Then, using the at least two frames of motion data 37, the human body shape information 42 and the orientation adjustment information 43, the three-dimensional human animation 45 is generated. Here, based on the input videos (that is, the target video 31 and the other video 40), the human body shape of the target object can be observed from at least two angles; therefore, the corresponding human body features can be extracted from the input videos, and the human body shape attributes (for example, information such as the human body shape information 42 and the orientation adjustment information 43) are regressed from these human body features; the human body shape of the target object's complete motion can be described through information such as the human body shape information 42 and the orientation adjustment information 43, thereby ensuring the consistency of the human body shape. In contrast to the related art, where a human body shape attribute is generated for every video frame, resulting in low consistency of the shape attribute, in the embodiments of the present disclosure information such as the human body shape information 42 and the orientation adjustment information 43 serves as the human body shape attribute for any frame of the target video or the other video corresponding to the target object's complete motion, improving the consistency of the human body shape attribute in the target video or the other video.
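Tying the stages together, a compressed sketch of the FIG. 3 pipeline is given below; every module is a stand-in with assumed sizes, meant only to show how the numbered components connect:

    import torch
    import torch.nn as nn

    T, FEAT, PRIOR, POSE = 16, 2048, 256, 72

    temporal = nn.GRU(FEAT, FEAT, batch_first=True)                # network 32
    regressor = nn.Sequential(nn.Flatten(), nn.Linear(T * FEAT, PRIOR))  # 34
    decoder = nn.Sequential(nn.Linear(PRIOR, T * POSE))            # decoder 36
    shape_fc = nn.Linear(FEAT, 10)                                 # layers 38
    orient_fc = nn.Linear(FEAT, 3)                                 # layers 39

    frames = torch.randn(1, T, FEAT)           # per-frame backbone features
    feats, _ = temporal(frames)                # time-series feature set 33
    z = regressor(feats)                       # target prior motion feature 35
    motion = decoder(z).view(1, T, POSE)       # motion data 37
    shape = shape_fc(feats[:, 0])              # human body shape info 42
    orient = orient_fc(feats[:, 0])            # orientation adjustment 43
    # `motion`, `shape` and `orient` would then drive the 3D human motion
    # modeling that renders animation 44 (motion only) or 45 (with shape
    # and orientation applied).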
Those skilled in the art can understand that, in the above methods of some implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.
Based on the same technical concept, an animation generation apparatus corresponding to the animation generation method is also provided in the embodiments of the present disclosure. Since the problem-solving principle of the apparatus in the embodiments of the present disclosure is similar to that of the above animation generation method of the embodiments of the present disclosure, for the implementation of the apparatus, reference may be made to the implementation of the method.
Referring to FIG. 4, which is a schematic diagram of an animation generation apparatus provided by an embodiment of the present disclosure, the apparatus includes: a video acquisition module 401, a human body feature extraction module 402, a motion feature determination module 403 and an animation generation module 404; wherein:
the video acquisition module 401 is configured to acquire a target video containing a target object;
the human body feature extraction module 402 is configured to extract human body features respectively corresponding to at least two video frames in the target video;
the motion feature determination module 403 is configured to determine, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
the animation generation module 404 is configured to generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
In some embodiments, the motion feature determination module 403 is configured to determine a time-series feature set corresponding to the target video based on the human body features respectively corresponding to at least two video frames in the target video, and to determine the target prior motion feature based on the time-series feature set and the neural network.
In some embodiments, the motion feature determination module 403 is configured to concatenate the extracted human body features respectively corresponding to at least two video frames according to the first timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
In some embodiments, the motion feature determination module 403 is configured to predict, based on the human body features respectively corresponding to at least two video frames in the target video, predicted human body features respectively corresponding to N video frames after the target video, where N is a preset positive integer; and to concatenate the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the N video frames according to the second timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
In some embodiments, the motion feature determination module 403 is configured to, for M video frames after the target video, predict the predicted human body feature of the (M+1)-th video frame after the target video based on the human body features respectively corresponding to at least two video frames in the target video and the predicted human body features respectively corresponding to the M video frames; M is a positive integer smaller than N.
In some embodiments, the apparatus further includes a neural network training module 405, configured to acquire a sample video and extract sample human body features respectively corresponding to at least two video frames in the sample video; determine, based on the neural network to be trained and the sample human body features, a sample prior motion feature corresponding to the sample video; and detect whether the sample prior motion feature is a prior motion feature in the prior space, and adjust network parameters of the neural network to be trained based on the detection result.
In some embodiments, the motion feature determination module 403, when determining the target prior motion feature based on the time-series feature set and the neural network, is configured to: input the time-series feature set into the neural network and perform the following operations: performing feature extraction on the time-series feature set based on at least two convolutional layers of the neural network to obtain at least two extracted features; and inputting the at least two extracted features into a fully connected layer to obtain the target prior motion feature.
In some embodiments, the motion feature determination module 403 is configured to, in a case where it is determined based on the neural network that there are at least two prior motion features corresponding to the time-series feature set, determine the target prior motion feature based on feature values of the determined at least two prior motion features.
In some embodiments, the animation generation module 404 is configured to determine, based on the target prior motion feature and the pre-trained decoder corresponding to the prior space, at least two frames of motion data used to describe the motion of the virtual object; and to generate, based on the at least two frames of motion data of the virtual object's motion, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object, wherein the number of frames of the three-dimensional motion animation is the same as the number of frames of the target video.
In some embodiments, the first frame count of the target video does not exceed the second frame count of the motion data; the animation generation module 404 is configured to, in a case where the first frame count is smaller than the second frame count, filter target motion data out of the motion data of the second frame count based on the first frame count, and generate, based on the target motion data, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
In some embodiments, the apparatus further includes a human body shape determination module 406, configured to perform first full-connection processing on the human body feature of a target video frame in the target video to determine human body shape information corresponding to the target object; the animation generation module 404 is configured to generate, based on the target prior motion feature and the human body shape information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
In some embodiments, the apparatus further includes a human body orientation determination module 407, configured to perform second full-connection processing on the human body feature of the first video frame in the target video to determine orientation adjustment information corresponding to the target object; the animation generation module 404 is configured to generate, based on the target prior motion feature and the orientation adjustment information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
For descriptions of the processing flows of the modules in the apparatus and the interaction flows between the modules, reference may be made to the relevant descriptions in the above method embodiments, which are not detailed here.
Based on the same technical concept, an embodiment of the present disclosure also provides a computer device. Referring to FIG. 5, which is a schematic structural diagram of the computer device provided by an embodiment of the present disclosure, it includes:
a processor 51, a memory 52 and a bus 53. The memory 52 stores machine-readable instructions executable by the processor 51, and the processor 51 is configured to execute the machine-readable instructions stored in the memory 52. When the machine-readable instructions are executed by the processor 51, the processor 51 performs the following steps: S101: acquire a target video containing a target object; S102: extract human body features respectively corresponding to at least two video frames in the target video; S103: determine, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching the target motion made by the target object in the target video, wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions; S104: generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
The memory 52 includes an internal memory 521 and an external memory 522; the internal memory 521, also called main memory, is used to temporarily store operation data of the processor 51 and data exchanged with the external memory 522 such as a hard disk; the processor 51 exchanges data with the external memory 522 through the internal memory 521. When the computer device runs, the processor 51 communicates with the memory 52 through the bus 53, so that the processor 51 executes the execution instructions mentioned in the above method embodiments.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the above animation generation method is implemented. The computer-readable storage medium may store only the computer program corresponding to the animation generation method.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile storage medium or a non-volatile storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card with instructions stored thereon or a raised structure in a groove, and any suitable combination of the above. The computer-readable storage medium used here is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
An embodiment of the present disclosure also proposes a computer program; the computer program includes computer-readable code, and when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes some or all of the steps of the above method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for some working processes of the apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the modules is merely a logical functional division, and there may be other divisions in actual implementation; for another example, at least two modules or components may be combined, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, apparatuses or modules, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over at least two network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
If the functions are implemented in the form of software function modules and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
Finally, it should be noted that the above embodiments are merely preferred implementations of the present disclosure, used to illustrate rather than limit the technical solutions of the present disclosure, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may, within the technical scope disclosed by the present disclosure, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features therein; and such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (20)

  1. An animation generation method, the method comprising:
    acquiring a target video containing a target object;
    extracting human body features respectively corresponding to at least two video frames in the target video;
    determining, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching a target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
    generating, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
  2. The method according to claim 1, wherein determining, based on the human body features respectively corresponding to the at least two video frames in the target video and the neural network trained on the prior space, the target prior motion feature matching the target motion made by the target object in the target video comprises:
    determining, based on the human body features respectively corresponding to the at least two video frames in the target video, a time-series feature set corresponding to the target video;
    determining the target prior motion feature based on the time-series feature set and the neural network.
  3. The method according to claim 2, wherein determining the time-series feature set corresponding to the target video based on the human body features respectively corresponding to the at least two video frames in the target video comprises:
    concatenating the extracted human body features respectively corresponding to at least two of the video frames according to first timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
  4. The method according to claim 2, wherein determining the time-series feature set corresponding to the target video based on the human body features respectively corresponding to the at least two video frames in the target video comprises:
    predicting, based on the human body features respectively corresponding to the at least two video frames in the target video, predicted human body features respectively corresponding to N video frames after the target video; wherein N is a preset positive integer;
    concatenating the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the N video frames according to second timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
  5. The method according to claim 4, wherein predicting, based on the human body features respectively corresponding to the at least two video frames in the target video, the predicted human body features respectively corresponding to the N video frames after the target video comprises:
    for M video frames after the target video, predicting the predicted human body feature of the (M+1)-th video frame after the target video based on the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the M video frames; M being a positive integer smaller than N.
  6. The method according to any one of claims 1 to 5, wherein the method further comprises training the neural network in the following manner:
    acquiring a sample video, and extracting sample human body features respectively corresponding to at least two video frames in the sample video;
    determining, based on the neural network to be trained and the sample human body features, a sample prior motion feature corresponding to the sample video;
    detecting whether the sample prior motion feature is a prior motion feature in the prior space, and adjusting network parameters of the neural network to be trained based on the detection result.
  7. The method according to any one of claims 2 to 5, wherein determining the target prior motion feature based on the time-series feature set and the neural network comprises:
    inputting the time-series feature set into the neural network, and performing the following operations:
    performing feature extraction on the time-series feature set based on at least two convolutional layers of the neural network to obtain at least two extracted features;
    inputting the at least two extracted features into a fully connected layer to obtain the target prior motion feature.
  8. The method according to any one of claims 2 to 5, wherein determining the target prior motion feature based on the time-series feature set and the neural network comprises:
    in a case where it is determined based on the neural network that there are at least two prior motion features corresponding to the time-series feature set, determining the target prior motion feature based on feature values of the determined at least two prior motion features.
  9. The method according to claim 1, wherein generating, based on the target prior motion feature, the three-dimensional motion animation in which the virtual object performs the target motion made by the target object comprises:
    determining, based on the target prior motion feature and a pre-trained decoder corresponding to the prior space, at least two frames of motion data used to describe the motion of the virtual object;
    generating, based on the at least two frames of motion data of the virtual object's motion, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object; wherein the number of frames of the three-dimensional motion animation is the same as the number of frames of the target video.
  10. The method according to claim 9, wherein a first frame count of the target video is smaller than or equal to a second frame count of the motion data;
    generating, based on the at least two frames of motion data of the virtual object's motion, the three-dimensional motion animation in which the virtual object performs the target motion made by the target object comprises:
    in a case where the first frame count is smaller than the second frame count, filtering target motion data out of the motion data of the second frame count based on the first frame count, and generating, based on the target motion data, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  11. The method according to any one of claims 1 to 10, wherein the method further comprises:
    performing first full-connection processing on the human body feature of a target video frame in the target video to determine human body shape information corresponding to the target object;
    generating, based on the target prior motion feature, the three-dimensional motion animation in which the virtual object performs the target motion made by the target object comprises:
    generating, based on the target prior motion feature and the human body shape information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  12. The method according to any one of claims 1 to 11, wherein the method further comprises:
    performing second full-connection processing on the human body feature of the first video frame in the target video to determine orientation adjustment information corresponding to the target object;
    generating, based on the target prior motion feature, the three-dimensional motion animation in which the virtual object performs the target motion made by the target object comprises:
    generating, based on the target prior motion feature and the orientation adjustment information, a three-dimensional motion animation in which the virtual object performs the target motion made by the target object.
  13. An animation generation apparatus, the apparatus comprising:
    a video acquisition module, configured to acquire a target video containing a target object;
    a human body feature extraction module, configured to extract human body features respectively corresponding to at least two video frames in the target video;
    a motion feature determination module, configured to determine, based on the human body features respectively corresponding to the at least two video frames in the target video and a neural network trained on a prior space, a target prior motion feature matching a target motion made by the target object in the target video; wherein the target prior motion feature corresponding to the target motion fuses at least two frames of sample motion data used to describe the target motion, and the prior space includes prior motion features respectively corresponding to at least two motions;
    an animation generation module, configured to generate, based on the target prior motion feature, a three-dimensional motion animation in which a virtual object performs the target motion made by the target object.
  14. The apparatus according to claim 13, wherein the motion feature determination module is configured to:
    determine, based on the human body features respectively corresponding to the at least two video frames in the target video, a time-series feature set corresponding to the target video;
    determine the target prior motion feature based on the time-series feature set and the neural network.
  15. The apparatus according to claim 14, wherein the motion feature determination module is configured to:
    concatenate the extracted human body features respectively corresponding to at least two of the video frames according to first timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
  16. The apparatus according to claim 14, wherein the motion feature determination module is configured to:
    predict, based on the human body features respectively corresponding to the at least two video frames in the target video, predicted human body features respectively corresponding to N video frames after the target video; wherein N is a preset positive integer;
    concatenate the human body features respectively corresponding to the at least two video frames in the target video and the predicted human body features respectively corresponding to the N video frames according to second timing information corresponding to the video frames, to determine the time-series feature set corresponding to the target video.
  17. A computer device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the animation generation method according to any one of claims 1 to 12 are executed.
  18. A computer-readable storage medium, having a computer program stored thereon, wherein when the computer program is run by a processor, the steps of the animation generation method according to any one of claims 1 to 12 are executed.
  19. A computer program, comprising computer-readable code, wherein when the computer-readable code runs on a device, a processor in the device executes the method according to any one of claims 1 to 12.
  20. A computer program product, configured to store computer-readable instructions, wherein when executed, the computer-readable instructions cause a computer to execute the method according to any one of claims 1 to 12.
PCT/CN2022/124879 2021-10-29 2022-10-12 Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product WO2023071801A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111275624.0 2021-10-29
CN202111275624.0A CN113920232A (zh) 2021-10-29 2021-10-29 Animation generation method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2023071801A1 true WO2023071801A1 (zh) 2023-05-04

Family

ID=79244012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124879 WO2023071801A1 (zh) 2021-10-29 2022-10-12 Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product

Country Status (2)

Country Link
CN (1) CN113920232A (zh)
WO (1) WO2023071801A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920232A (zh) * 2021-10-29 2022-01-11 上海商汤智能科技有限公司 动画生成方法、装置、计算机设备和存储介质
WO2024000480A1 (zh) * 2022-06-30 2024-01-04 中国科学院深圳先进技术研究院 3d虚拟对象的动画生成方法、装置、终端设备及介质
CN115797606B (zh) * 2023-02-07 2023-04-21 合肥孪生宇宙科技有限公司 基于深度学习的3d虚拟数字人交互动作生成方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040051783A1 (en) * 2002-08-23 2004-03-18 Ramalingam Chellappa Method of three-dimensional object reconstruction from a video sequence using a generic model
CN113192109A (zh) * 2021-06-01 2021-07-30 北京海天瑞声科技股份有限公司 Method and apparatus for recognizing the motion state of an object in consecutive frames
CN113556600A (zh) * 2021-07-13 2021-10-26 广州虎牙科技有限公司 Drive control method and apparatus based on timing information, electronic device and readable storage medium
CN113920232A (zh) * 2021-10-29 2022-01-11 上海商汤智能科技有限公司 Animation generation method, apparatus, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN, Xinchen; RASTOGI, Akash; VILLEGAS, Ruben; SUNKAVALLI, Kalyan; SHECHTMAN, Eli; HADAP, Sunil; YUMER, Ersin; LEE, Honglak. "MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics". In: Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol. 11209, chap. 17, Springer, Berlin, Heidelberg, 6 October 2018, pages 276-293. XP047488909. DOI: 10.1007/978-3-030-01228-1_17 *

Also Published As

Publication number Publication date
CN113920232A (zh) 2022-01-11

Similar Documents

Publication Publication Date Title
WO2023071801A1 (zh) 2023-05-04 Animation generation method and apparatus, computer device, storage medium, computer program, and computer program product
US11290640B2 (en) Electronic device and controlling method of electronic device
WO2022205760A1 (zh) 2022-10-06 Three-dimensional human body reconstruction method and apparatus, device, and storage medium
CN112967212A (zh) 2021-06-15 Method, apparatus, device and storage medium for synthesizing a virtual character
US20230123820A1 (en) Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings
WO2021159781A1 (zh) 2021-08-19 Image processing method, apparatus, device, and storage medium
CN114339409B (zh) 2023-07-25 Video processing method and apparatus, computer device, and storage medium
EP4300431A1 (en) Action processing method and apparatus for virtual object, and storage medium
CN111753801A (zh) 2020-10-09 Human pose tracking and animation generation method and apparatus
WO2022227765A1 (zh) 2022-11-03 Method for generating an image inpainting model, device, medium, and program product
CN115661336A (zh) 2023-01-31 Three-dimensional reconstruction method and related apparatus
CN117274491A (zh) 2023-12-22 Training method, apparatus, device and medium for a three-dimensional reconstruction model
CN115439927A (zh) 2022-12-06 Robot-based gait monitoring method, apparatus, device, and storage medium
US11836836B2 (en) Methods and apparatuses for generating model and generating 3D animation, devices and storage mediums
CN113269066B (zh) 2022-09-30 Talking video generation method and apparatus, and electronic device
Huang et al. Object-occluded human shape and pose estimation with probabilistic latent consistency
CN116757923B (zh) 2023-12-08 Image generation method and apparatus, electronic device, and storage medium
CN112634413B (zh) 2023-05-12 Method, apparatus, device and storage medium for generating a model and generating 3D animation
CN116704084B (zh) 2023-11-03 Training method for a facial animation generation network, and facial animation generation method and apparatus
US20200092444A1 (en) Playback method, playback device and computer-readable storage medium
CN116485961A (zh) 2023-07-25 Sign language animation generation method, device, and medium
US11893056B2 (en) Using interpolation to generate a video from static images
CN116863042A (zh) 2023-10-10 Motion generation method for a virtual object and training method for a motion generation model
CN115509345A (zh) 2022-12-23 Display processing method for a virtual reality scene and virtual reality device
Sun et al. GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885680

Country of ref document: EP

Kind code of ref document: A1