CN114866807A - Avatar video generation method and device, electronic equipment and readable storage medium - Google Patents

Avatar video generation method and device, electronic equipment and readable storage medium

Info

Publication number
CN114866807A
CN114866807A (application CN202210512471.5A)
Authority
CN
China
Prior art keywords
video
lip
voice
sound
explanation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210512471.5A
Other languages
Chinese (zh)
Inventor
朱超 (Zhu Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210512471.5A
Publication of CN114866807A
Legal status: Pending (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an avatar video generation method, which comprises the following steps: performing sound sampling on a real person image recorded video to obtain video sound data and extracting sound characteristics; extracting video key frames from the real person image recorded video and identifying lip movement characteristics in the video key frames; screening a limb action video containing a plurality of reference limb actions from an action video reference library; converting an explanation text to be converted into explanation voice, and converting the explanation voice into virtual voice data with the sound characteristics; and taking the virtual voice data as output voice data of a digital image, constructing a lip video of the digital image from the lip movement characteristics, and fusing the limb action video to obtain a virtual image explanation video. In addition, the invention also relates to block chain technology, and the target file can be stored in a node of the block chain. The invention further provides an avatar video generation apparatus, an electronic device and a readable storage medium. The invention can improve the efficiency of avatar video generation.

Description

Avatar video generation method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a virtual image video generation method and device, electronic equipment and a readable storage medium.
Background
With the progress and development of education, online education has become a focus of the education industry. Online education generally records video courseware in which the content is explained, so that students can attend classes quickly, at any time and in any place. When making video courseware, a teacher usually has to explain the courseware content once while the explanation process is recorded, thereby forming a video course narrated by the teacher. This mode of producing instructor-explained video courses suffers from high recording cost and a long recording period, so course production is inefficient and cannot respond to course production demands in time. Therefore, it is desirable to provide a more efficient avatar video generation method.
Disclosure of Invention
The invention provides a method and a device for generating an avatar video, electronic equipment and a readable storage medium, and mainly aims to improve the efficiency of generating the avatar video.
In order to achieve the above object, the present invention provides an avatar video generating method, comprising:
acquiring a real image recorded video, carrying out sound sampling processing on the real image recorded video to obtain video sound data, and extracting sound characteristics in the video sound data based on a preset equal-amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
Optionally, the performing sound sampling processing on the video recorded by the real person image to obtain video sound data includes:
establishing wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
receiving sound data sampled by the mobile terminal by utilizing a wireless communication connection;
recognizing a voice scene to which the voice data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
Optionally, the recognizing the voice scene to which the sound data belongs includes:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
Optionally, the extracting the video key frame in the real-person image recorded video includes:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and taking the corresponding video frame in the real human image recorded video as a video key frame.
Optionally, the identifying lip motion features in the video keyframes includes:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and performing one-way connection on the lip key points to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
Optionally, the obtaining an included angle variation feature corresponding to the lip edge profile based on the plurality of lip key points includes:
constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
and obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
Optionally, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
and packaging the sound characteristics into a soft sound source parameter controller.
In order to solve the above problems, the present invention also provides an avatar video generating apparatus, comprising:
the sound feature extraction module is used for acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
the lip movement feature identification module is used for extracting video key frames in the real person image recorded video and identifying lip movement features in the video key frames;
the video screening module is used for setting a plurality of reference limb actions and screening a limb action video containing a plurality of reference limb actions from a preset action video reference library;
the explanation video generation module is used for acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, converting the explanation voice into virtual voice data with the sound characteristics by using a soft sound source parameter controller, taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb action video to obtain a virtual image explanation video.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the avatar video generation method described above.
In order to solve the above problem, the present invention also provides a readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the avatar video generating method described above.
In the embodiment of the invention, different characteristics representing the real-person image are obtained by extracting the sound characteristics and the lip movement characteristics from the real-person image recorded video. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology, and the explanation voice is converted by a soft sound source parameter controller into virtual voice data carrying the sound characteristics. The virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb action video is fused to obtain the virtual image explanation video, which improves the efficiency of virtual image explanation video generation. Therefore, the avatar video generation method, the avatar video generation device, the electronic device and the readable storage medium provided by the invention can solve the problem of low avatar video generation efficiency.
Drawings
Fig. 1 is a schematic flow chart of a method for generating an avatar video according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 2;
FIG. 4 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 5 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 6 is a flow chart illustrating a detailed implementation of one of the steps in FIG. 5;
fig. 7 is a functional block diagram of an avatar video generating apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing the avatar video generating method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a virtual image video generation method. The execution subject of the avatar video generation method includes, but is not limited to, at least one of electronic devices, such as a server, a terminal, and the like, which can be configured to execute the method provided by the embodiments of the present application. In other words, the avatar video generating method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 1 is a schematic flow chart of an avatar video generation method according to an embodiment of the present invention. In the present embodiment, the avatar video generation method includes the following steps S1 to S5:
s1, acquiring a real image recorded video, carrying out sound sampling processing on the real image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset equal-amplitude difference frequency method.
In the embodiment of the invention, the real-person image recorded video refers to the video recorded while an instructor explains the courseware content when making video courseware; this video accurately reflects the instructor's body movements, lip movements and voice during the explanation.
Specifically, referring to fig. 2, the sound sampling processing performed on the real person image recorded video to obtain the video sound data includes the following steps S11 to S14:
s11, establishing wireless communication connection between a recording end corresponding to the real image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
s12, receiving the sound data sampled by the mobile terminal by using wireless communication connection;
s13, recognizing the voice scene to which the voice data belongs, and selecting a noise reduction model corresponding to the voice scene;
and S14, carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
In detail, the noise reduction processing removes the speech of persons other than the speaker from the sound data. Because the videos are recorded in different environments that introduce a great deal of ambient noise, the sound data must be denoised to obtain the video sound data.
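For illustration only, the following is a minimal sketch of the scene-dependent noise reduction described above. It is not part of the original disclosure: the scene labels, the per-scene parameters and the use of the open-source noisereduce package (spectral gating) stand in for whatever noise reduction models the embodiment actually employs.

```python
# Illustrative sketch (not from the patent): select a noise-reduction setting
# according to the recognised voice scene, then denoise the sampled audio.
import numpy as np
import noisereduce as nr

SCENE_PROFILES = {              # hypothetical per-scene noise-reduction strength
    "office": {"prop_decrease": 0.8},
    "roadside": {"prop_decrease": 1.0},
    "park": {"prop_decrease": 0.9},
}

def denoise_for_scene(audio: np.ndarray, sr: int, scene: str) -> np.ndarray:
    """Apply scene-specific spectral-gating noise reduction to the sampled sound data."""
    params = SCENE_PROFILES.get(scene, {"prop_decrease": 0.9})
    return nr.reduce_noise(y=audio, sr=sr, **params)
```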
Further, referring to fig. 3, the recognizing the voice scene to which the sound data belongs includes the following steps S101 to S104:
s101, collecting noise sample sets in each scene, and extracting corresponding audio features from each noise sample;
s102, carrying out cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
s103, segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and S104, recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
In detail, the noise sample set includes noise audio data in each voice scene, for example noise audio data in a park, at a roadside, or in an office. In the embodiment of the present invention, the noise sample set may further include a feature label corresponding to each noise sample, where the feature label marks each noise sample so that the corresponding audio feature can be extracted. The audio features may include the zero-crossing rate, mel-frequency cepstral coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiment of the present application, the audio features are preferably mel-frequency cepstral coefficients.
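The following sketch illustrates, under assumptions, the construction of the scene recognition model described above: MFCC features are extracted from the noise samples, cluster analysis assigns scene classes, and a classifier is trained and tested on the split sets. The choice of librosa, scikit-learn, KMeans and SVC, and all parameter values, are illustrative rather than the patent's actual procedure.

```python
# Hedged sketch: cluster noise samples into voice scenes, then train/test a
# scene recognition model on the resulting classified voice set.
import librosa
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_feature(path: str, sr: int = 16000) -> np.ndarray:
    """Mean MFCC vector of one noise sample (the preferred audio feature)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

def build_scene_recognizer(noise_paths: list[str], n_scenes: int = 3) -> SVC:
    feats = np.stack([mfcc_feature(p) for p in noise_paths])
    # Cluster analysis groups the noise samples into voice-scene classes.
    scene_labels = KMeans(n_clusters=n_scenes, n_init=10).fit_predict(feats)
    x_train, x_test, y_train, y_test = train_test_split(feats, scene_labels, test_size=0.2)
    model = SVC().fit(x_train, y_train)                       # scene recognition model
    print("held-out accuracy:", model.score(x_test, y_test))  # test/adjust step
    return model
```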
Specifically, the sound features in the video sound data are extracted based on a preset constant amplitude difference frequency method, which borrows the idea of FM frequency modulation: the video sound data is decomposed into pronunciation parameters, and these pronunciation parameters are taken as the sound features.
In detail, the sound characteristics reflect the timbre of the video sound data in the real person image recorded video and serve as the basis for the subsequently synthesized speech, so that the virtual image carries the timbre characteristics of the real person image in the real-person recorded video.
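The constant amplitude difference frequency method is not further specified in the disclosure; the sketch below shows only one possible reading of splitting the sound data into FM-style pronunciation parameters (a pitch track plus a coarse harmonic envelope), and every detail in it is an assumption.

```python
# Assumption-laden sketch of extracting FM-style "pronunciation parameters"
# from the video sound data; not the patent's actual algorithm.
import librosa
import numpy as np

def pronunciation_parameters(audio: np.ndarray, sr: int) -> dict:
    f0, _, _ = librosa.pyin(audio, fmin=60, fmax=400, sr=sr)   # carrier-like pitch track
    spec = np.abs(librosa.stft(audio))
    harmonics = spec[:40, :].mean(axis=1)                      # coarse low-frequency envelope
    return {"f0": np.nan_to_num(f0), "harmonic_envelope": harmonics}
```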
S2, extracting video key frames in the real image recorded video, and identifying lip movement characteristics in the video key frames.
In the embodiment of the present invention, referring to fig. 4, the extracting the video key frames in the real-person image recorded video includes the following steps S21-S25:
s21, inputting the real image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
s22, inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
s23, inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
s24, performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and S25, taking the corresponding video frame in the real human image recorded video as a video key frame.
In detail, the pre-trained convolutional network serves as the encoder for extracting video key frames, and the convolutional network may adopt an existing network structure such as ResNet (a residual network), VGG, or GoogLeNet.
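A hedged PyTorch sketch of the key-frame extraction pipeline outlined above (convolutional encoder, cross attention, channel attention, decoder) follows. The layer sizes, the use of standard multi-head attention for the cross attention module, and the per-frame scoring decoder are assumptions, since the patent does not give the exact architecture.

```python
# Illustrative key-frame selector: encoder -> cross attention -> channel attention -> decoder.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (batch, frames, channels)
        weights = self.fc(x.mean(dim=1))         # squeeze over the frame axis
        return x * weights.unsqueeze(1)

class KeyFrameSelector(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for a pre-trained CNN encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.channel_attn = ChannelAttention(dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frames):                   # frames: (batch, T, 3, H, W)
        b, t = frames.shape[:2]
        first = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # first feature vector
        second, _ = self.cross_attn(first, first, first)             # aggregated second vector
        third = self.channel_attn(second + first)                    # fused with lower-level features
        return self.decoder(third).squeeze(-1)   # per-frame key-frame score
```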
Further, referring to fig. 5, the identifying lip motion features in the video keyframes includes the following steps S201 to S204:
s201, obtaining a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and performing one-way connection on the lip key points to obtain a lip edge profile;
s202, performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
s203, solving an included angle change characteristic corresponding to the lip edge profile based on the plurality of lip key points;
s204, combining the curvature change characteristic and the included angle change characteristic to obtain the lip movement characteristic in the video key frame.
Specifically, referring to fig. 6, the obtaining of the included angle variation feature corresponding to the lip edge profile based on the plurality of lip key points includes the following steps S211 to S212:
s211, constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
s212, obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
In detail, the first triangular body and the second triangular body are two preset triangular areas.
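The sketch below illustrates, under assumed key-point indices, how the curvature change and included angle features might be computed: a second-degree curve fit of the lip edge profile supplies the curvature, and triangles built on the lip key points supply the included angles.

```python
# Illustrative lip-movement features; the key-point indexing and the
# second-degree polynomial fit are assumptions made for this example.
import numpy as np

def curvature_feature(contour: np.ndarray) -> float:
    """contour: (N, 2) ordered lip-edge points; returns a simple curvature measure."""
    coeffs = np.polyfit(contour[:, 0], contour[:, 1], deg=2)   # curve fitting
    return abs(2.0 * coeffs[0])                                # |y''| of the fitted parabola

def included_angle(a: np.ndarray, apex: np.ndarray, b: np.ndarray) -> float:
    """Angle (degrees) at `apex` of the triangle formed with points a and b."""
    u, v = a - apex, b - apex
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def lip_motion_feature(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (K, 2) lip key points; indices 0/6 as corners, 3/9 as mid points (assumed)."""
    left_angle = included_angle(keypoints[3], keypoints[0], keypoints[9])   # first triangle
    top_angle = included_angle(keypoints[0], keypoints[3], keypoints[6])    # second triangle
    return np.array([curvature_feature(keypoints), left_angle, top_angle])
```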
S3, setting a plurality of reference limb motions, and screening out a limb motion video containing the reference limb motions from a preset motion video reference library.
In the embodiment of the invention, the plurality of reference limb actions are set so that the actions appearing in the subsequent video are restricted to the field of this solution. For example, this solution is mainly used to design a virtual image explanation video corresponding to a video course explained by an instructor, so the reference limb actions may be turning, raising a hand, pointing at the blackboard, or calling the roll, all of which are actions commonly performed by an instructor during explanation.
The action video reference library includes a plurality of different actions and the limb action videos corresponding to each action; for example, for the turning action there may be one action video of turning to the left for a preset duration and another action video of turning to the right for a different preset duration. One or more limb action videos containing the plurality of reference limb actions can be screened out of the action video reference library.
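As a minimal illustration of the screening step, the sketch below filters a hypothetical action video reference library by the configured reference limb actions; the library structure, clip names and action labels are invented for this example.

```python
# Keep only the library clips whose annotated actions are among the reference limb actions.
REFERENCE_ACTIONS = {"turn", "raise_hand", "point_blackboard", "roll_call"}

ACTION_LIBRARY = [  # hypothetical entries: (clip path, set of annotated actions)
    ("clips/turn_left_3s.mp4", {"turn"}),
    ("clips/raise_hand.mp4", {"raise_hand"}),
    ("clips/wave.mp4", {"wave"}),
]

def screen_limb_action_videos(library=ACTION_LIBRARY, references=REFERENCE_ACTIONS):
    """Return the clips containing at least one reference limb action."""
    return [path for path, actions in library if actions & references]
```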
S4, obtaining an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller.
In the embodiment of the invention, the explanation text to be converted is the text that needs to be narrated as voice-over in the course explained by the instructor. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology; here the voice conversion technology is a TTS (Text To Speech) technology, a speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output.
Specifically, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
and packaging the sound characteristics into a soft sound source parameter controller.
In detail, the explanation voice is converted into virtual voice data with the sound characteristics by the soft sound source parameter controller, which can perform operations such as adjustment, audition and tuning in combination with the speech recorded in the sound library, so that the virtual voice data carries the sound characteristics.
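The sketch below illustrates this conversion chain under assumptions: an off-the-shelf TTS engine (pyttsx3, used here only as a stand-in for the voice conversion technology) produces the explanation voice, and a placeholder apply_sound_features step represents the soft sound source parameter controller, whose internals the disclosure does not specify.

```python
# Hedged sketch of text -> explanation voice -> virtual voice data with sound features.
import numpy as np
import pyttsx3
import soundfile as sf

def text_to_explanation_speech(text: str, wav_path: str = "explanation.wav") -> str:
    engine = pyttsx3.init()
    engine.save_to_file(text, wav_path)   # TTS: explanation text -> explanation voice
    engine.runAndWait()
    return wav_path

def apply_sound_features(wav_path: str, sound_features: dict) -> np.ndarray:
    """Placeholder for the soft sound source parameter controller (illustrative only)."""
    audio, sr = sf.read(wav_path)
    # e.g. scale loudness toward a reference level taken from the extracted features
    target_rms = sound_features.get("rms", 0.1)
    return audio * (target_rms / (np.sqrt(np.mean(audio ** 2)) + 1e-9))
```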
S5, the virtual voice data is used as output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain a virtual image explanation video.
In the embodiment of the invention, since only about 2,000 Chinese characters are commonly used in daily life and their pronunciations can be further combined, common speech can be simulated from the two kinds of characteristics, namely the sound characteristics and the lip movement characteristics. The virtual voice data is taken as the output voice data of the preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb action video is fused to obtain the virtual image explanation video.
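As an illustration of the fusion step, the sketch below concatenates the constructed lip video with the screened limb action clips and attaches the virtual voice data as the output audio track, assuming the moviepy 1.x API; the file names and the simple concatenation strategy are assumptions made for this sketch.

```python
# Illustrative fusion of lip video, limb action clips and virtual voice data.
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

def fuse_avatar_video(lip_video: str, limb_videos: list[str],
                      virtual_voice: str, out_path: str = "avatar_explanation.mp4") -> str:
    clips = [VideoFileClip(lip_video)] + [VideoFileClip(p) for p in limb_videos]
    video = concatenate_videoclips(clips, method="compose")
    video = video.set_audio(AudioFileClip(virtual_voice))   # virtual voice as output speech
    video.write_videofile(out_path, codec="libx264", audio_codec="aac")
    return out_path
```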
Preferably, the virtual image explanation video uses a digital human to give the explanation instead of recording the instructor's lecture, which greatly lowers the threshold for making video classes and improves course production efficiency. Once the digital human figure has been generated, it can be reused in scenes such as course hosting, intelligent customer service and news broadcasting, so the trained digital human figure is used repeatedly.
In the embodiment of the invention, different characteristics representing the real-person image are obtained by extracting the sound characteristics and the lip movement characteristics from the real-person image recorded video. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology, and the explanation voice is converted by a soft sound source parameter controller into virtual voice data carrying the sound characteristics. The virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb action video is fused to obtain the virtual image explanation video, which improves the efficiency of virtual image explanation video generation. Therefore, the avatar video generation method provided by the invention can solve the problem of low avatar video generation efficiency.
Fig. 7 is a functional block diagram of an avatar video generating apparatus according to an embodiment of the present invention.
The avatar video generating apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the avatar video generating apparatus 100 may include a sound feature extraction module 101, a lip movement feature recognition module 102, a video screening module 103, and an explanation video generating module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the sound feature extraction module 101 is configured to obtain a real-person image recorded video, perform sound sampling processing on the real-person image recorded video to obtain video sound data, and extract sound features in the video sound data based on a preset constant-amplitude difference frequency method;
the lip motion feature recognition module 102 is configured to extract a video key frame in the real-person image recorded video, and recognize a lip motion feature in the video key frame;
the video screening module 103 is configured to set a plurality of reference limb motions and screen a limb motion video including the plurality of reference limb motions from a preset motion video reference library;
the explanation video generation module 104 is configured to acquire an explanation text to be converted, convert the explanation text to be converted into corresponding explanation voice through a voice conversion technology, convert the explanation voice into virtual voice data with the sound characteristics by using a soft sound source parameter controller, take the virtual voice data as output voice data of a preset digital image, construct a lip video of the preset digital image according to the lip movement characteristics, and fuse the limb action video to obtain a virtual image explanation video.
In detail, the avatar video generating apparatus 100 has the following modules:
s1, acquiring a real image recorded video, carrying out sound sampling processing on the real image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset equal-amplitude difference frequency method.
In the embodiment of the invention, the real-person image recorded video refers to the video recorded while an instructor explains the courseware content when making video courseware; this video accurately reflects the instructor's body movements, lip movements and voice during the explanation.
Specifically, the performing of sound sampling processing on the real person image recorded video to obtain the video sound data includes:
establishing wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
receiving sound data sampled by the mobile terminal by utilizing a wireless communication connection;
recognizing a voice scene to which the voice data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
In detail, the noise reduction processing removes the speech of persons other than the speaker from the sound data. Because the videos are recorded in different environments that introduce a great deal of ambient noise, the sound data must be denoised to obtain the video sound data.
Further, the recognizing the voice scene to which the sound data belongs includes:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
In detail, the noise sample set includes noise audio data in each voice scene, for example noise audio data in a park, at a roadside, or in an office. In the embodiment of the present invention, the noise sample set may further include a feature label corresponding to each noise sample, where the feature label marks each noise sample so that the corresponding audio feature can be extracted. The audio features may include the zero-crossing rate, mel-frequency cepstral coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiment of the present application, the audio features are preferably mel-frequency cepstral coefficients.
Specifically, the sound features in the video sound data are extracted based on a preset constant amplitude difference frequency method, which borrows the idea of FM frequency modulation: the video sound data is decomposed into pronunciation parameters, and these pronunciation parameters are taken as the sound features.
In detail, the sound characteristics reflect the timbre of the video sound data in the real person image recorded video and serve as the basis for the subsequently synthesized speech, so that the virtual image carries the timbre characteristics of the real person image in the real-person recorded video.
S2, extracting video key frames in the real image recorded video, and identifying lip movement characteristics in the video key frames.
In the embodiment of the present invention, the extracting the video key frames in the real-person image recorded video includes:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and taking the corresponding video frame in the real human image recorded video as a video key frame.
In detail, the pre-trained convolutional network serves as the encoder for extracting video key frames, and the convolutional network may adopt an existing network structure such as ResNet (a residual network), VGG, or GoogLeNet.
Further, the identifying lip movement features in the video keyframes includes:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and performing one-way connection on the lip key points to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
Specifically, the obtaining of the included angle change feature corresponding to the lip edge profile based on the plurality of lip key points includes:
constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
and obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
In detail, the first triangular body and the second triangular body are two preset triangular areas.
S3, setting a plurality of reference limb motions, and screening out a limb motion video containing the reference limb motions from a preset motion video reference library.
In the embodiment of the invention, the plurality of reference limb actions are set so that the actions appearing in the subsequent video are restricted to the field of this solution. For example, this solution is mainly used to design a virtual image explanation video corresponding to a video course explained by an instructor, so the reference limb actions may be turning, raising a hand, pointing at the blackboard, or calling the roll, all of which are actions commonly performed by an instructor during explanation.
The action video reference library includes a plurality of different actions and the limb action videos corresponding to each action; for example, for the turning action there may be one action video of turning to the left for a preset duration and another action video of turning to the right for a different preset duration. One or more limb action videos containing the plurality of reference limb actions can be screened out of the action video reference library.
S4, obtaining an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller.
In the embodiment of the invention, the explanation text to be converted is the text that needs to be narrated as voice-over in the course explained by the instructor. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology; here the voice conversion technology is a TTS (Text To Speech) technology, a speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output.
Specifically, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
and packaging the sound characteristics into a soft sound source parameter controller.
In detail, the explanation voice is converted into virtual voice data with the sound characteristics by the soft sound source parameter controller, which can perform operations such as adjustment, audition and tuning in combination with the speech recorded in the sound library, so that the virtual voice data carries the sound characteristics.
S5, the virtual voice data are used as output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain a virtual image explanation video.
In the embodiment of the invention, since only about 2,000 Chinese characters are commonly used in daily life and their pronunciations can be further combined, common speech can be simulated from the two kinds of characteristics, namely the sound characteristics and the lip movement characteristics. The virtual voice data is taken as the output voice data of the preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb action video is fused to obtain the virtual image explanation video.
Preferably, the virtual image explanation video uses a digital human to give the explanation instead of recording the instructor's lecture, which greatly lowers the threshold for making video classes and improves course production efficiency. Once the digital human figure has been generated, it can be reused in scenes such as course hosting, intelligent customer service and news broadcasting, so the trained digital human figure is used repeatedly.
In the embodiment of the invention, different characteristics representing the real-person image are obtained by extracting the sound characteristics and the lip movement characteristics from the real-person image recorded video. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology, and the explanation voice is converted by a soft sound source parameter controller into virtual voice data carrying the sound characteristics. The virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb action video is fused to obtain the virtual image explanation video, which improves the efficiency of virtual image explanation video generation. Therefore, the avatar video generation method provided by the invention can solve the problem of low avatar video generation efficiency.
Fig. 8 is a schematic structural diagram of an electronic device implementing an avatar video generation method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as an avatar video generation program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., executing avatar video generation programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various kinds of data, such as codes of an avatar video generation program, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 8 shows only an electronic device with certain components; those skilled in the art will understand that the structure shown in fig. 8 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The avatar video generation program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1 may be stored in a readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. The readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The invention also provides a readable storage medium storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
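Purely as an illustration of the hash-linked block structure described in this paragraph, a toy sketch might look as follows; the field names and the use of SHA-256 are assumptions, not part of this application.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    index: int
    transactions: List[dict]   # information of a batch of network transactions
    prev_hash: str             # cryptographic link to the previous block

    def hash(self) -> str:
        payload = json.dumps(
            {"index": self.index, "transactions": self.transactions, "prev": self.prev_hash},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_block(chain: List[Block], transactions: List[dict]) -> Block:
    """Create the next block, linked to the hash of the current last block."""
    prev = chain[-1].hash() if chain else "0" * 64
    block = Block(index=len(chain), transactions=transactions, prev_hash=prev)
    chain.append(block)
    return block
```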
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) refers to theories, methods, techniques and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A method for avatar video generation, the method comprising:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the sound features by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
2. The avatar video generating method of claim 1, wherein said carrying out sound sampling processing on said real person image recorded video to obtain video sound data comprises:
establishing a wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminals are used for sound sampling;
receiving sound data sampled by the mobile terminals by utilizing the wireless communication connection;
recognizing a voice scene to which the sound data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
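As one way the noise-reduction step of claim 2 could be realized, the sketch below selects a noise profile for the recognized voice scene and applies overlap-add spectral subtraction; the algorithm choice and all function names are assumptions rather than the claimed noise reduction model.

```python
import numpy as np

def reduce_noise(samples: np.ndarray, noise_profile: np.ndarray,
                 frame: int = 512, hop: int = 256) -> np.ndarray:
    """Overlap-add spectral subtraction using a per-scene noise profile.

    `noise_profile` is assumed to be at least `frame` samples long.
    """
    samples = samples.astype(float)
    window = np.hanning(frame)
    noise_mag = np.abs(np.fft.rfft(noise_profile[:frame].astype(float) * window))
    out = np.zeros(len(samples))
    for start in range(0, len(samples) - frame, hop):
        chunk = samples[start:start + frame] * window
        spec = np.fft.rfft(chunk)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)            # subtract noise magnitude
        cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
        out[start:start + frame] += cleaned * window               # overlap-add
    return out

def denoise_for_scene(samples: np.ndarray, scene: str, noise_profiles: dict) -> np.ndarray:
    """Pick the noise profile registered for the recognized scene, then denoise."""
    return reduce_noise(samples, noise_profiles[scene])
```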
3. The avatar video generating method of claim 2, wherein said recognizing a voice scene to which said sound data belongs comprises:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing a scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
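Claim 3 can be read as a three-stage recipe: cluster noise samples into scene classes, split the labelled set, then train and test a classifier. The sketch below follows that recipe with a crude log-spectrum feature, k-means clustering and an SVM; all of these concrete choices are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def audio_features(samples: np.ndarray, frame: int = 1024) -> np.ndarray:
    """Average log-magnitude spectrum over frames as a simple audio feature."""
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, frame)]
    return np.mean([np.log1p(np.abs(np.fft.rfft(f))) for f in frames], axis=0)

def build_scene_model(noise_clips: list, n_scenes: int = 4):
    """noise_clips: 1-D float arrays, one per collected noise sample."""
    X = np.stack([audio_features(clip) for clip in noise_clips])
    scene_labels = KMeans(n_clusters=n_scenes, n_init=10).fit_predict(X)   # cluster analysis
    X_train, X_test, y_train, y_test = train_test_split(X, scene_labels, test_size=0.2)
    model = SVC().fit(X_train, y_train)                                    # training voice set
    accuracy = model.score(X_test, y_test)                                 # testing voice set
    return model, accuracy
```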
4. The avatar video generation method of claim 1, wherein said extracting video key frames in said real person image recorded video comprises:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real person image recorded video based on the final reconstruction features;
and taking the corresponding video frames in the real person image recorded video as video key frames.
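Claim 4 names four stages (convolutional features, cross attention, channel attention, decoder reconstruction) without fixing their internals. The PyTorch sketch below is one plausible arrangement; the layer sizes, the way the lower-layer features are combined, and the frame-scoring rule are all assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over (N, C, H, W) features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.fc(x.mean(dim=(2, 3)))        # per-channel weights
        return x * weights[:, :, None, None]

class KeyFrameScorer(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.cross_attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.channel_attn = ChannelAttention(channels)
        self.decoder = nn.Conv2d(channels, channels, 3, padding=1)
        self.score = nn.Linear(channels, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) video frames; returns one key-frame score per frame.
        feats = self.encoder(frames)                              # first feature vector
        t, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)                 # (T, H*W, C)
        aggregated, _ = self.cross_attn(tokens, tokens, tokens)   # second feature vector
        aggregated = aggregated.transpose(1, 2).reshape(t, c, h, w)
        fused = self.channel_attn(aggregated + feats)             # third feature vector (the sum is an assumption)
        reconstructed = self.decoder(fused)                       # final reconstruction features
        return self.score(reconstructed.mean(dim=(2, 3))).squeeze(-1)

# Usage: scores = KeyFrameScorer()(frames); key_frame = frames[scores.argmax()]
```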
5. The avatar video generation method of claim 1, wherein said identifying lip movement features in said video keyframes comprises:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and performing one-way connection on the lip key points to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
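One way to turn claims 5 and 6 into numbers is sketched below: a quadratic curve is fitted to the lip-edge contour of each key frame, its curvature is taken, a mouth-corner included angle is measured, and the frame-to-frame differences of both form the lip motion feature. The quadratic fit, the chosen angle, and the input layout are assumptions.

```python
import numpy as np

def contour_curvature(points: np.ndarray) -> float:
    """Fit y = ax^2 + bx + c to (x, y) contour points and return the curvature at the mean x."""
    a, b, _ = np.polyfit(points[:, 0], points[:, 1], 2)
    x0 = points[:, 0].mean()
    return abs(2 * a) / (1 + (2 * a * x0 + b) ** 2) ** 1.5

def corner_angle(corner: np.ndarray, upper: np.ndarray, lower: np.ndarray) -> float:
    """Included angle (degrees) at a mouth corner, formed with upper- and lower-lip points."""
    v1, v2 = upper - corner, lower - corner
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def lip_motion_features(keyframe_landmarks: list) -> np.ndarray:
    """keyframe_landmarks: per-frame dicts with 'contour' (Nx2), 'corner', 'upper', 'lower' points."""
    curvatures = [contour_curvature(f["contour"]) for f in keyframe_landmarks]
    angles = [corner_angle(f["corner"], f["upper"], f["lower"]) for f in keyframe_landmarks]
    # Frame-to-frame changes of curvature and included angle, combined per claim 5.
    return np.stack([np.diff(curvatures), np.diff(angles)], axis=1)
```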
6. The avatar video generating method of claim 5, wherein said obtaining an included angle change characteristic corresponding to said lip edge profile based on said lip key points comprises:
constructing, according to connecting lines between the lip key points, a first triangle covering the left side or the right side of the lip and a second triangle covering the upper part or the lower part of the lip;
and obtaining the included angle change characteristic by using the angle values of preset angles in the first triangle and the second triangle.
7. The avatar video generating method of claim 1, wherein before said converting the explanation voice into virtual voice data having the sound features by using a soft sound source parameter controller, the method further comprises:
packaging the sound features into the soft sound source parameter controller.
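Claim 7 only requires that the extracted sound features be packaged into the soft sound source parameter controller before conversion. The sketch below packages two illustrative parameters and applies them with a naive resampling pitch change; both parameters and the resampling approach are assumptions, not the patent's controller.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SoftSoundSourceController:
    gain: float          # loudness taken from the recorded speaker's sound features
    pitch_ratio: float   # target / source fundamental-frequency ratio

    def apply(self, speech: np.ndarray) -> np.ndarray:
        # Naive pitch change by resampling (also stretches duration).
        n_out = int(len(speech) / self.pitch_ratio)
        shifted = np.interp(np.linspace(0, len(speech) - 1, n_out),
                            np.arange(len(speech)), speech.astype(float))
        return shifted * self.gain

def package_sound_features(features: dict) -> SoftSoundSourceController:
    """Wrap extracted sound features into the controller used for the explanation voice."""
    return SoftSoundSourceController(gain=features.get("gain", 1.0),
                                     pitch_ratio=features.get("pitch_ratio", 1.0))
```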
8. An avatar video generating apparatus, the apparatus comprising:
the sound feature extraction module is used for acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
the lip motion characteristic identification module is used for extracting video key frames in the real person image recorded video and identifying lip motion characteristics in the video key frames;
the video screening module is used for setting a plurality of reference limb actions and screening a limb action video containing the reference limb actions from a preset action video reference library;
the explanation video generation module is used for acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, converting the explanation voice into virtual voice data with the sound features by using a soft sound source parameter controller, taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb action video to obtain a virtual image explanation video.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the avatar video generation method of any of claims 1-7.
10. A readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the avatar video generation method of any of claims 1-7.
CN202210512471.5A 2022-05-12 2022-05-12 Avatar video generation method and device, electronic equipment and readable storage medium Pending CN114866807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210512471.5A CN114866807A (en) 2022-05-12 2022-05-12 Avatar video generation method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210512471.5A CN114866807A (en) 2022-05-12 2022-05-12 Avatar video generation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114866807A true CN114866807A (en) 2022-08-05

Family

ID=82637471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210512471.5A Pending CN114866807A (en) 2022-05-12 2022-05-12 Avatar video generation method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114866807A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584648A (en) * 2018-11-08 2019-04-05 北京葡萄智学科技有限公司 Data creation method and device
WO2021159391A1 (en) * 2020-02-13 2021-08-19 南昌欧菲光电技术有限公司 Camera module, photographing apparatus, and electronic device
US20220084502A1 (en) * 2020-09-14 2022-03-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for determining shape of lips of virtual character, device and computer storage medium
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN113469292A (en) * 2021-09-02 2021-10-01 北京世纪好未来教育科技有限公司 Training method, synthesizing method, device, medium and equipment for video synthesizing model
CN114202605A (en) * 2021-12-07 2022-03-18 北京百度网讯科技有限公司 3D video generation method, model training method, device, equipment and medium
CN114338994A (en) * 2021-12-30 2022-04-12 Oppo广东移动通信有限公司 Optical anti-shake method, optical anti-shake apparatus, electronic device, and computer-readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050354A (en) * 2022-08-10 2022-09-13 北京百度网讯科技有限公司 Digital human driving method and device
CN116248811A (en) * 2022-12-09 2023-06-09 北京生数科技有限公司 Video processing method, device and storage medium
CN116248811B (en) * 2022-12-09 2023-12-05 北京生数科技有限公司 Video processing method, device and storage medium
CN115953515A (en) * 2023-03-14 2023-04-11 深圳崇德动漫股份有限公司 Animation image generation method, device, equipment and medium based on real person data
CN115953515B (en) * 2023-03-14 2023-06-27 深圳崇德动漫股份有限公司 Cartoon image generation method, device, equipment and medium based on real person data
CN116702707A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN116702707B (en) * 2023-08-03 2023-10-03 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN117714763A (en) * 2024-02-05 2024-03-15 深圳市鸿普森科技股份有限公司 Virtual object speaking video generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114866807A (en) Avatar video generation method and device, electronic equipment and readable storage medium
EP3889912A1 (en) Method and apparatus for generating video
CN111681681A (en) Voice emotion recognition method and device, electronic equipment and storage medium
CN110246512A (en) Sound separation method, device and computer readable storage medium
CN107844481B (en) Text recognition error detection method and device
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
WO2023050650A1 (en) Animation video generation method and apparatus, and device and storage medium
CN107092664A (en) A kind of content means of interpretation and device
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN113903363A (en) Violation detection method, device, equipment and medium based on artificial intelligence
CN114780701B (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN114140814A (en) Emotion recognition capability training method and device and electronic equipment
CN113591472A (en) Lyric generation method, lyric generation model training method and device and electronic equipment
CN112542172A (en) Communication auxiliary method, device, equipment and medium based on online conference
CN116939288A (en) Video generation method and device and computer equipment
CN109766089B (en) Code generation method and device based on dynamic diagram, electronic equipment and storage medium
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN117119123A (en) Method and system for generating digital human video based on video material
CN116129860A (en) Automatic meta-space virtual human book broadcasting method based on AI artificial intelligence technology
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN115206342A (en) Data processing method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination