CN114866807A - Avatar video generation method and device, electronic equipment and readable storage medium - Google Patents
- Publication number
- CN114866807A (application CN202210512471.5A / CN202210512471A)
- Authority
- CN
- China
- Prior art keywords
- video
- lip
- voice
- sound
- explanation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g. 3D video
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Quality & Reliability (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention relates to artificial intelligence technology and discloses an avatar video generation method, which comprises the following steps: performing sound sampling on a real person image recorded video to obtain video sound data, and extracting sound characteristics; extracting video key frames from the real person image recorded video, and identifying lip movement characteristics in the video key frames; screening a limb action video containing a plurality of reference limb actions from an action video reference library; converting an explanation text to be converted into explanation voice, and converting the explanation voice into virtual voice data carrying the sound characteristics; and taking the virtual voice data as the output voice data of a digital image, constructing a lip video of the digital image from the lip movement characteristics, and fusing the limb action video to obtain an avatar explanation video. In addition, the invention also relates to blockchain technology: the target file can be stored in a node of a blockchain. The invention further provides an avatar video generation device, electronic equipment and a readable storage medium. The invention can improve the efficiency of avatar video generation.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a virtual image video generation method and device, electronic equipment and a readable storage medium.
Background
With the progress and development of education, online education has become a focal point of the education industry. Online education generally lets students attend classes quickly, anytime and anywhere, through recorded video courseware. When making video courseware, a teacher usually has to explain the courseware content once while the explanation process is recorded, forming a video course presented by that teacher. Producing instructor-explained video courses this way suffers from high recording cost and a long recording cycle, so course production is inefficient and cannot respond to production demand in time. Therefore, it is desirable to provide a more efficient avatar video generation method.
Disclosure of Invention
The invention provides a method and a device for generating an avatar video, electronic equipment and a readable storage medium, and mainly aims to improve the efficiency of generating the avatar video.
In order to achieve the above object, the present invention provides an avatar video generating method, comprising:
acquiring a real person image recorded video, performing sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound characteristics from the video sound data based on a preset equal-amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
Optionally, the performing sound sampling processing on the video recorded by the real person image to obtain video sound data includes:
establishing wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
receiving sound data sampled by the mobile terminal by utilizing a wireless communication connection;
recognizing a voice scene to which the voice data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
Optionally, the recognizing the voice scene to which the sound data belongs includes:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing a scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
Optionally, the extracting the video key frame in the real-person image recorded video includes:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and taking the corresponding video frame in the real human image recorded video as a video key frame.
Optionally, the identifying lip motion features in the video keyframes includes:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and connecting the lip key points in a single direction to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
Optionally, the obtaining an included angle variation feature corresponding to the lip edge profile based on the plurality of lip key points includes:
constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
and obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
Optionally, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
and packaging the sound characteristics into a soft sound source parameter controller.
In order to solve the above problems, the present invention also provides an avatar video generating apparatus, comprising:
the sound feature extraction module is used for acquiring a real person image recorded video, performing sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features from the video sound data based on a preset equal-amplitude difference frequency method;
the lip movement feature identification module is used for extracting video key frames in the real person image recorded video and identifying lip movement features in the video key frames;
the video screening module is used for setting a plurality of reference limb actions and screening a limb action video containing a plurality of reference limb actions from a preset action video reference library;
the explanation video generation module is used for acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, converting the explanation voice into virtual voice data carrying the sound characteristics by using a soft sound source parameter controller, taking the virtual voice data as the output voice data of a preset digital image, constructing a lip video of the preset digital image from the lip movement characteristics, and fusing the limb action video to obtain an avatar explanation video.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the avatar video generation method described above.
In order to solve the above problem, the present invention also provides a readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the avatar video generating method described above.
In the embodiment of the invention, sound characteristics and lip movement characteristics are extracted from the real person image recorded video to obtain distinct characteristics that represent the real person's image. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology, and the explanation voice is converted by a soft sound source parameter controller into virtual voice data carrying the sound characteristics. The virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb action video is fused to obtain the avatar explanation video, improving the efficiency of avatar explanation video generation. Therefore, the avatar video generation method and device, the electronic device and the readable storage medium provided by the invention can solve the problem of low avatar video generation efficiency.
Drawings
Fig. 1 is a schematic flow chart of a method for generating an avatar video according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 2;
FIG. 4 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 5 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 6 is a flow chart illustrating a detailed implementation of one of the steps in FIG. 5;
fig. 7 is a functional block diagram of an avatar video generating apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing the avatar video generating method according to an embodiment of the present invention.
The implementation, functional features and advantages of the invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an avatar video generation method. The execution subject of the avatar video generation method includes, but is not limited to, at least one electronic device, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the avatar video generation method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of an avatar video generation method according to an embodiment of the present invention. In the present embodiment, the avatar video generation method includes the following steps S1-S5:
s1, acquiring a real image recorded video, carrying out sound sampling processing on the real image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset equal-amplitude difference frequency method.
In the embodiment of the invention, the real person image recorded video refers to a video in which an instructor explains courseware content while video courseware is recorded; such a video accurately captures the instructor's limb actions, lip actions and voice during the explanation.
Specifically, referring to fig. 2, the sound sampling processing of the real person image recorded video to obtain video sound data includes the following steps S11-S14:
s11, establishing wireless communication connection between a recording end corresponding to the real image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
s12, receiving the sound data sampled by the mobile terminal by using wireless communication connection;
s13, recognizing the voice scene to which the voice data belongs, and selecting a noise reduction model corresponding to the voice scene;
and S14, carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
In detail, the noise reduction processing removes sounds other than the speaker's voice from the sound data; since videos recorded in different environments pick up substantial ambient noise, the sound data must be denoised to obtain the video sound data.
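The scene-dependent denoising step above can be sketched as follows. This is a minimal stand-in: the patent's noise reduction models are replaced by simple moving-average smoothers of different window sizes, and the scene names and registry are illustrative assumptions, not the patent's actual models.

```python
import random

def moving_average(samples, window):
    """Smooth a 1-D sample list with a centered moving average."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

# Hypothetical registry: one denoiser per recognized voice scene.
SCENE_DENOISERS = {
    "office": lambda s: moving_average(s, 3),
    "roadside": lambda s: moving_average(s, 7),  # heavier smoothing for street noise
    "park": lambda s: moving_average(s, 5),
}

def denoise(samples, scene):
    """Select the noise reduction model matching the voice scene and apply it."""
    return SCENE_DENOISERS[scene](samples)

# Constant signal plus uniform noise, denoised with the "roadside" model.
random.seed(0)
noisy = [1.0 + random.uniform(-0.3, 0.3) for _ in range(200)]
clean = denoise(noisy, "roadside")
```

In a real pipeline each entry would be a trained model (e.g. a spectral-subtraction or neural denoiser) rather than a smoother, but the selection-by-scene structure would be the same.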
Further, referring to fig. 3, the recognizing the voice scene to which the sound data belongs includes the following steps S101 to S104:
s101, collecting noise sample sets in each scene, and extracting corresponding audio features from each noise sample;
s102, carrying out cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
s103, segmenting the classified voice set into a training voice set and a testing voice set, constructing a scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and S104, recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
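The cluster-then-recognize idea in steps S101-S104 can be sketched with a nearest-centroid classifier. The 2-D "audio features" and scene labels below are toy assumptions for illustration; the patent's actual model built from clustered noise samples is not specified.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def recognize_scene(feature, scene_samples):
    """Return the scene whose sample centroid is nearest to the feature vector."""
    best_scene, best_dist = None, float("inf")
    for scene, samples in scene_samples.items():
        d = math.dist(feature, centroid(samples))
        if d < best_dist:
            best_scene, best_dist = scene, d
    return best_scene

# Toy per-scene noise samples (e.g. summary statistics of audio features).
samples = {
    "office": [(0.1, 0.2), (0.2, 0.1)],
    "roadside": [(0.9, 0.8), (0.8, 0.9)],
}
scene = recognize_scene((0.85, 0.9), samples)
```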
In detail, the noise sample set includes noise audio data for each voice scene, for example, noise audio data in a park, at a roadside, or in an office. In the embodiment of the present invention, the noise sample set may further include a feature label corresponding to each noise sample, which is used to label each noise sample so that the corresponding audio feature can be extracted. The audio features may include the zero-crossing rate, Mel-frequency cepstrum coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiment of the present application, the audio features are preferably Mel-frequency cepstrum coefficients.
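Two of the audio features named above are simple enough to compute directly. The sketch below is stdlib-only, so a naive DFT stands in for the FFT a real implementation would take from numpy or librosa; the 1 kHz test tone is an illustrative input.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency, via a naive DFT over positive bins."""
    n = len(frame)
    mags, freqs = [], []
    for k in range(n // 2):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
        freqs.append(k * sample_rate / n)
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0

# A pure 1000 Hz tone: ZCR ~ 2 crossings per period, centroid at 1000 Hz.
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(256)]
zcr = zero_crossing_rate(tone)
centroid = spectral_centroid(tone, sr)
```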
Specifically, the sound characteristics in the video sound data are extracted based on a preset equal-amplitude difference frequency method, which borrows the idea of FM frequency modulation: the video sound data is decomposed into pronunciation parameters, and these pronunciation parameters are taken as the sound characteristics.
In detail, the sound characteristics reflect the timbre of the video sound data in the real person image recorded video and serve as the basis for subsequently synthesized speech, so that the avatar carries the timbre of the real person in the recorded video.
S2, extracting video key frames in the real image recorded video, and identifying lip movement characteristics in the video key frames.
In the embodiment of the present invention, referring to fig. 4, the extracting the video key frames in the real-person image recorded video includes the following steps S21-S25:
s21, inputting the real image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
s22, inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
s23, inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
s24, performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and S25, taking the corresponding video frame in the real human image recorded video as a video key frame.
In detail, the pre-trained convolutional network serves as an encoder for extracting video key frames; the convolutional network may adopt an existing network structure such as a ResNet (residual network), a VGG network, or a GoogLeNet network.
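The attention-based encoder-decoder above is beyond a short sketch, but the key-frame idea itself can be conveyed with a much simpler frame-difference heuristic. Note this deliberately swaps the patent's convolutional network for a plain pixel-difference test; frames here are flat grayscale pixel lists.

```python
def frame_diff(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_key_frames(frames, threshold):
    """Keep a frame whenever it differs enough from the last kept key frame."""
    keys = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if frame_diff(frames[i], frames[keys[-1]]) > threshold:
            keys.append(i)
    return keys

# Synthetic clip: static scene, abrupt change at frame 3, static again.
frames = [[0] * 16] * 3 + [[255] * 16] * 3
keys = select_key_frames(frames, threshold=50)
```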
Further, referring to fig. 5, the identifying lip motion features in the video keyframes includes the following steps S201 to S204:
s201, obtaining a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and connecting the lip key points in a single direction to obtain a lip edge profile;
s202, performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
s203, solving an included angle change characteristic corresponding to the lip edge profile based on the plurality of lip key points;
s204, combining the curvature change characteristic and the included angle change characteristic to obtain the lip movement characteristic in the video key frame.
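The curvature part of steps S201-S204 can be illustrated with the standard three-point (circumradius) curvature estimate often applied to fitted contours. This is a sketch of the general technique, not the patent's exact curve-fitting procedure.

```python
import math

def three_point_curvature(p0, p1, p2):
    """Curvature of the circle through three points: k = 4 * Area / (a * b * c)."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    area2 = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))  # twice the area
    a = math.dist(p0, p1)
    b = math.dist(p1, p2)
    c = math.dist(p0, p2)
    return 2 * area2 / (a * b * c) if a * b * c else 0.0

# Points on a circle of radius 5 -> curvature 1/5; collinear points -> 0.
k_circle = three_point_curvature((5, 0), (0, 5), (-5, 0))
k_line = three_point_curvature((0, 0), (1, 0), (2, 0))
```

Tracking such curvature values along the lip profile across consecutive key frames yields the curvature change characteristic.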
Specifically, referring to fig. 6, the obtaining of the included angle variation feature corresponding to the lip edge profile based on the plurality of lip key points includes the following steps S211 to S212:
s211, constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
s212, obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
In detail, the first triangular body and the second triangular body are two preset triangular areas.
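The included-angle feature of each triangular region reduces to vertex angles computed from key-point coordinates. The coordinates below are illustrative mouth-corner positions, not values from the patent.

```python
import math

def angle_at(vertex, p, q):
    """Interior angle (degrees) at `vertex` formed by rays toward p and q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Mouth-corner triangle: the angle at the corner widens as the lips open.
closed = angle_at((0, 0), (4, 1), (4, -1))
opened = angle_at((0, 0), (4, 3), (4, -3))
delta = opened - closed  # included-angle change between frames
```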
S3, setting a plurality of reference limb motions, and screening out a limb motion video containing the reference limb motions from a preset motion video reference library.
In the embodiment of the invention, a plurality of reference limb actions are set so that the actions in the subsequent video conform to the application field of the scheme. For example, the scheme is mainly used to design avatar explanation videos corresponding to video courses explained by an instructor, so the reference limb actions are turning, hand-raising, pointing at the blackboard, or roll-calling, all of which are common actions during an instructor's explanation.
The action video reference library includes a plurality of different actions and the limb action videos corresponding to those actions; for example, a turning action may have one video of turning left for a preset time and another of turning right for another preset time. One or more limb action videos containing the reference limb actions can be screened out from the action video reference library.
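Screening the reference library then amounts to keeping clips whose action tags intersect the reference set. The clip names and tags below are made up for illustration; the patent does not specify the library's data format.

```python
REFERENCE_ACTIONS = {"turn", "raise_hand", "point_blackboard", "roll_call"}

# Hypothetical action video reference library: clip name plus action tags.
LIBRARY = [
    {"clip": "turn_left_3s.mp4", "actions": {"turn"}},
    {"clip": "wave_hello.mp4", "actions": {"wave"}},
    {"clip": "point_and_call.mp4", "actions": {"point_blackboard", "roll_call"}},
]

def screen(library, reference_actions):
    """Keep limb-action clips containing at least one reference limb action."""
    return [e["clip"] for e in library if e["actions"] & reference_actions]

selected = screen(LIBRARY, REFERENCE_ACTIONS)
```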
S4, obtaining an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller.
In the embodiment of the invention, the explanation text to be converted is the text that requires voice-over in the course explained by the instructor. The explanation text to be converted is converted into corresponding explanation voice through a voice conversion technology, namely TTS (Text To Speech), a speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output.
Specifically, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
and packaging the sound characteristics into a soft sound source parameter controller.
In detail, the explanation voice is converted into virtual voice data having the sound characteristics by the soft sound source parameter controller, which can perform operations such as adjusting, auditioning and tuning in combination with speech recorded in a sound library, so that the virtual voice data carries the sound characteristics.
S5, the virtual voice data is used as output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain a virtual image explanation video.
In the embodiment of the invention, since only about 2,000 characters are in common everyday use, and the pronunciations of these characters can be further combined, common speech can be simulated from the two kinds of features extracted above, namely the sound characteristics and the lip movement characteristics. The virtual voice data is used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain the avatar explanation video.
Preferably, the avatar explanation video replaces the recorded instructor lecture with a digital human explanation, which greatly lowers the threshold for producing video lessons and improves the efficiency of course production. Once generated, the digital human figure can be used in scenes such as anchor-led courses, intelligent customer service and news broadcasting, and the trained digital human figure can be reused repeatedly.
In the embodiment of the invention, the sound characteristics and the lip movement characteristics in the real-person image recorded video are extracted to obtain different characteristics that can represent the real-person image; the explanation text to be converted is converted into the corresponding explanation speech through a speech conversion technology, and the explanation speech is converted into virtual voice data having the sound characteristics by using a soft sound source parameter controller; the virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb movement video is fused to obtain the avatar explanation video. The efficiency of avatar explanation video generation is thereby improved. Therefore, the avatar video generation method provided by the invention can solve the problem of low avatar video generation efficiency.
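The S1 to S5 flow summarized above can be sketched as a small pipeline. Every function below is a hypothetical placeholder for the corresponding step; the feature dictionaries, clip tags and file names are invented for illustration and are not part of the disclosed implementation:

```python
# Hypothetical sketch of the S1-S5 pipeline; each helper is a placeholder
# standing in for the corresponding step, not the patent's implementation.

def extract_sound_features(recorded_video):
    # S1: sample sound, denoise, and derive pronunciation parameters.
    return {"timbre": "instructor", "pitch": 1.0}

def extract_lip_features(recorded_video):
    # S2: pick key frames and measure lip curvature / angle changes.
    return [{"frame": 0, "curvature": 0.3, "angle": 42.0}]

def screen_motion_videos(reference_actions, library):
    # S3: keep only library clips whose action is a reference action.
    return [clip for clip in library if clip["action"] in reference_actions]

def synthesize_voice(text, sound_features):
    # S4: TTS followed by re-timbring with the sound features.
    return {"text": text, "timbre": sound_features["timbre"]}

def fuse(voice, lip_features, motion_videos):
    # S5: combine voice, lip video and limb motion into one video.
    return {"voice": voice, "lips": lip_features, "motions": motion_videos}

library = [{"action": "turn", "file": "turn_left.mp4"},
           {"action": "wave", "file": "wave.mp4"}]
video = fuse(
    synthesize_voice("Welcome to the course.", extract_sound_features("rec.mp4")),
    extract_lip_features("rec.mp4"),
    screen_motion_videos({"turn"}, library),
)
```

A real system would replace each placeholder with the corresponding module of the apparatus 100 described below.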
Fig. 7 is a functional block diagram of an avatar video generating apparatus according to an embodiment of the present invention.
The avatar video generating apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the avatar video generating apparatus 100 may include a sound feature extraction module 101, a lip movement feature recognition module 102, a video screening module 103, and an explanation video generation module 104. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments stored in a memory of the electronic device that can be executed by a processor of the electronic device and that perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the sound feature extraction module 101 is configured to obtain a real-person image recorded video, perform sound sampling processing on the real-person image recorded video to obtain video sound data, and extract sound features in the video sound data based on a preset constant-amplitude difference frequency method;
the lip motion feature recognition module 102 is configured to extract a video key frame in the real-person image recorded video, and recognize a lip motion feature in the video key frame;
the video screening module 103 is configured to set a plurality of reference limb motions and screen a limb motion video including the plurality of reference limb motions from a preset motion video reference library;
the explanation video generation module 104 is configured to obtain an explanation text to be converted, convert the explanation text to be converted into the corresponding explanation speech through a speech conversion technology, convert the explanation speech into virtual voice data having the sound features by using a soft sound source parameter controller, take the virtual voice data as the output voice data of a preset digital image, construct the lip video of the preset digital image from the lip movement features, and fuse the limb action video to obtain the avatar explanation video.
In detail, the avatar video generating apparatus 100 has the following modules:
S1, acquiring a real-person image recorded video, carrying out sound sampling processing on the real-person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant-amplitude difference frequency method.
In the embodiment of the invention, the real-person image recorded video refers to a video of an instructor explaining courseware content when recording video courseware; such a video accurately reflects the instructor's limb actions, lip actions and voice during the explanation.
Specifically, the carrying out sound sampling processing on the real-person image recorded video to obtain video sound data includes:
establishing wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
receiving sound data sampled by the mobile terminal by utilizing a wireless communication connection;
recognizing the voice scene to which the sound data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
In detail, the noise reduction processing removes sounds other than the speaker's voice from the sound data; since the videos are recorded in different environments that introduce considerable ambient noise, the sound data must undergo noise reduction processing to obtain the video sound data.
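The scene-aware noise reduction described above can be illustrated with a minimal sketch. The scene classifier and the per-scene noise models here are invented stand-ins; a real implementation would classify audio features and apply trained denoising models:

```python
# Illustrative sketch of scene-aware noise reduction. The models and the
# classification rule are hypothetical stand-ins for trained components.

NOISE_MODELS = {
    # Each "model" is a simple amplitude gate tuned per scene.
    "office": lambda samples: [s for s in samples if abs(s) > 0.01],
    "roadside": lambda samples: [s for s in samples if abs(s) > 0.05],
}

def recognize_scene(samples):
    # Placeholder classifier: a real system would use MFCC features;
    # here a high average amplitude is treated as roadside noise.
    avg = sum(abs(s) for s in samples) / len(samples)
    return "roadside" if avg > 0.1 else "office"

def denoise(samples):
    scene = recognize_scene(samples)
    return NOISE_MODELS[scene](samples)

noisy = [0.6, -0.4, 0.02, 0.5, -0.03]
clean = denoise(noisy)
```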
Further, the recognizing the voice scene to which the sound data belongs includes:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing a scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
In detail, the noise sample set includes noise audio data in each voice scene, for example, noise audio data in a park, at a roadside, or in an office. In the embodiment of the present invention, the noise sample set may further include a feature label corresponding to each noise sample, where the feature label is used to label each noise sample so that the corresponding audio feature can be extracted. The audio features may include the zero-crossing rate, Mel-frequency cepstral coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux and the like; in the embodiment of the present application, the audio features are preferably Mel-frequency cepstral coefficients (MFCCs).
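A minimal sketch of this training-and-testing flow, with a single scalar standing in for the MFCC feature vector and a nearest-centroid rule standing in for the clustering and model training (all values are invented):

```python
# Minimal sketch of the scene-recognition flow: fit per-scene centroids
# on a training set, then evaluate predictions on a testing set. A scalar
# stands in for the MFCC feature vector.

train_set = [("park", 0.20), ("park", 0.25), ("roadside", 0.80), ("roadside", 0.90)]
test_set = [("park", 0.22), ("roadside", 0.85)]

def fit(samples):
    by_scene = {}
    for scene, feat in samples:
        by_scene.setdefault(scene, []).append(feat)
    # Each scene is represented by the centroid of its features.
    return {scene: sum(feats) / len(feats) for scene, feats in by_scene.items()}

def predict(model, feat):
    # Assign the scene whose centroid is nearest to the feature.
    return min(model, key=lambda scene: abs(model[scene] - feat))

model = fit(train_set)
accuracy = sum(predict(model, f) == s for s, f in test_set) / len(test_set)
```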
Specifically, the sound features in the video sound data are extracted based on a preset constant-amplitude difference frequency method, where the constant-amplitude difference frequency method draws on the idea of FM (frequency modulation): the video sound data is decomposed into pronunciation parameters, and the pronunciation parameters are used as the sound features.
In detail, the sound features reflect the timbre of the video sound data in the real-person image recorded video and serve as the basis for subsequently synthesized speech, so that the avatar carries the timbre of the real person in the recorded video.
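The constant-amplitude difference frequency method is only outlined in the text, so the following sketch is an assumption-laden stand-in: it splits the signal into frames and describes each frame by an amplitude and a zero-crossing frequency estimate, treating each pair as one "pronunciation parameter":

```python
import math

# Stand-in for the (underspecified) constant-amplitude difference
# frequency method: frame the signal and describe each frame by a peak
# amplitude and a zero-crossing-based frequency estimate.

def frame_parameters(samples, frame_size, sample_rate):
    params = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        amplitude = max(abs(s) for s in frame)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        # Roughly two zero crossings per cycle.
        frequency = crossings * sample_rate / (2 * len(frame))
        params.append((amplitude, frequency))
    return params

rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
features = frame_parameters(tone, 400, rate)
```

For a pure 440 Hz tone, each frame's estimated frequency lands near 440 Hz and the amplitude near 1.0.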
S2, extracting video key frames in the real image recorded video, and identifying lip movement characteristics in the video key frames.
In the embodiment of the present invention, the extracting the video key frames in the real-person image recorded video includes:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector, together with the output feature vector of a lower layer of the convolutional network, into a channel attention module to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and taking the corresponding video frame in the real human image recorded video as a video key frame.
In detail, the pre-trained convolutional network is used as the encoder for extracting video key frames, and the convolutional network may adopt a network structure from the prior art, such as a ResNet (residual network), a VGG network or a GoogLeNet.
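The cross-attention and channel-attention pipeline above requires a deep-learning framework; as a deliberately simpler stand-in that conveys the idea of key-frame selection, the sketch below keeps a frame whenever it differs sufficiently from the last kept frame:

```python
# Lightweight stand-in for attention-based key-frame extraction: keep a
# frame when its mean absolute difference from the last kept frame
# exceeds a threshold. This is not the patent's encoder-decoder pipeline.

def key_frames(frames, threshold):
    # frames: list of equal-length feature vectors (e.g. flattened pixels).
    keys = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[keys[-1]], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            keys.append(i)
    return keys

frames = [
    [0, 0, 0],
    [0, 0, 1],   # small change relative to frame 0
    [9, 9, 9],   # large change: becomes a key frame
    [9, 9, 8],   # small change relative to frame 2
]
selected = key_frames(frames, threshold=1.0)
```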
Further, the identifying lip movement features in the video keyframes includes:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and connecting the lip key points sequentially in one direction to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
Specifically, the obtaining of the included angle change feature corresponding to the lip edge profile based on the plurality of lip key points includes:
constructing a first triangular body covering the left side or the right side of the lip and a second triangular body covering the upper part or the lower part of the lip according to the lip key point connecting line;
and obtaining the change characteristic of the included angle by using the angle values of the preset angles in the first triangular body and the second triangular body.
In detail, the first triangular body and the second triangular body are two preset triangular areas.
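The two geometric features can be computed directly from key-point coordinates. The sketch below derives the included angle at a lip corner (a vertex of such a triangular region) and a three-point curvature estimate along the lip edge; the coordinates are made-up examples:

```python
import math

# Geometric lip features from key points: an included angle at a triangle
# vertex, and the curvature of the circle through three edge points.
# The coordinates below are invented examples.

def included_angle(a, b, c):
    # Angle at vertex b of triangle (a, b, c), in degrees.
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))

def curvature(a, b, c):
    # Curvature = 4 * area / (product of side lengths); area2 is twice
    # the triangle area, from the cross product.
    area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    return 2 * area2 / (math.dist(a, b) * math.dist(b, c) * math.dist(a, c))

corner, top, bottom = (0.0, 0.0), (2.0, 1.0), (2.0, -1.0)
mouth_angle = included_angle(top, corner, bottom)
edge_curv = curvature((0.0, 0.0), (1.0, 1.0), (2.0, 0.0))
```

Tracking how these values change between key frames yields the curvature change and included angle change features described above.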
S3, setting a plurality of reference limb motions, and screening out a limb motion video containing the reference limb motions from a preset motion video reference library.
In the embodiment of the invention, a plurality of reference limb actions are set so that the actions in the subsequent video are restricted to those appropriate to the field of this scheme. For example, this scheme is mainly designed for an avatar explanation video corresponding to a video course explained by an instructor, so the plurality of reference limb actions are turning actions, hand-raising actions, blackboard-pointing actions or roll-call actions, all of which are common actions during an instructor's explanation.
The motion video reference library includes a plurality of different motions and the body motion videos corresponding to those motions; for example, for the turning motion the library may contain one motion video of turning to the left for a preset duration and another of turning to the right for a different preset duration. One or more body motion videos containing the plurality of reference body motions can then be screened out from the motion video reference library.
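Step S3 reduces to filtering the reference library by action tag. A minimal sketch, with an invented library structure:

```python
# Sketch of step S3: model the action video reference library as a list
# of clips tagged with their action, and keep only clips whose tag is
# one of the configured reference limb actions. Tags and file names are
# illustrative assumptions.

REFERENCE_ACTIONS = {"turn", "raise_hand", "point_blackboard", "roll_call"}

library = [
    {"action": "turn", "file": "turn_left_3s.mp4"},
    {"action": "turn", "file": "turn_right_5s.mp4"},
    {"action": "dance", "file": "dance.mp4"},
    {"action": "raise_hand", "file": "raise_hand.mp4"},
]

def screen(library, reference_actions):
    return [clip for clip in library if clip["action"] in reference_actions]

selected = screen(library, REFERENCE_ACTIONS)
```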
S4, obtaining an explanation text to be converted, converting the explanation text to be converted into the corresponding explanation speech through a speech conversion technology, and converting the explanation speech into virtual voice data having the sound characteristics by using a soft sound source parameter controller.
In the embodiment of the invention, the explanation text to be converted is the text that requires voice-over in the course explained by the instructor. The explanation text to be converted is converted into the corresponding explanation speech through a speech conversion technology, where the speech conversion technology is TTS (Text To Speech), one of the speech synthesis applications, which converts files stored in a computer, such as help files or web pages, into natural speech output.
Specifically, before the converting the explained speech into the virtual speech data having the sound characteristics by using the soft sound source parameter controller, the method further includes:
packaging the sound characteristics into the soft sound source parameter controller.
In detail, the explanation speech is converted into virtual speech data having the sound characteristics by using the soft sound source parameter controller, and the soft sound source parameter controller can perform operations such as adjustment, audition and tuning in combination with the speech recorded in the sound library, so that the virtual speech data carries the sound characteristics.
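A hedged sketch of the soft sound source parameter controller: the sound characteristics are packaged into a controller object, which then reshapes synthesized speech so that it carries those characteristics. The parameter names and the scaling rule are illustrative assumptions:

```python
# Hypothetical soft sound source parameter controller: packaging the
# sound characteristics as controller state, then applying them to
# synthesized samples. Parameter names are invented for illustration.

class SoftSoundSourceController:
    def __init__(self, sound_features):
        # Packaging step: the extracted features become controller state.
        self.gain = sound_features["gain"]
        self.pitch_ratio = sound_features["pitch_ratio"]

    def apply(self, samples):
        # Amplitude is scaled directly; the pitch ratio would drive a
        # resampler in a real system and is kept as metadata here.
        shaped = [s * self.gain for s in samples]
        return {"samples": shaped, "pitch_ratio": self.pitch_ratio}

features = {"gain": 0.5, "pitch_ratio": 1.1}
controller = SoftSoundSourceController(features)
virtual_voice = controller.apply([0.2, -0.4, 0.8])
```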
S5, the virtual voice data is used as output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain a virtual image explanation video.
In the embodiment of the invention, since only about 2,000 characters are in common everyday use, and the pronunciations of these characters can be further combined, common speech can be simulated from the two kinds of features extracted above, namely the sound characteristics and the lip movement characteristics. The virtual voice data is used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed according to the lip movement characteristics, and the limb movement video is fused to obtain the avatar explanation video.
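Step S5 can be pictured as merging three tracks on a shared timeline. All structures in this sketch are hypothetical simplifications of the fusion step:

```python
# Illustrative fusion of step S5: merge the virtual voice, the lip video
# built from lip movement features, and the screened limb motion clips
# onto one timeline. Frame rate and structures are assumptions.

def build_lip_video(lip_features, fps=25):
    # One lip frame per feature sample, timestamped at the given fps.
    return [{"t": i / fps, "lip": f} for i, f in enumerate(lip_features)]

def fuse(voice, lip_video, motion_clips, fps=25):
    duration = len(lip_video) / fps
    return {
        "duration": duration,
        "audio": voice,
        "lip_track": lip_video,
        "motion_track": motion_clips,
    }

voice = {"samples": [0.1, -0.2], "timbre": "instructor"}
lips = build_lip_video([{"angle": 50.0}, {"angle": 55.0}, {"angle": 48.0}])
explanation_video = fuse(voice, lips, ["turn_left_3s.mp4"])
```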
Preferably, the avatar explanation video replaces the recorded instructor lecture with a digital human explanation, which greatly lowers the threshold for producing video lessons and improves the efficiency of course production. Once generated, the digital human figure can be used in scenes such as anchor-led courses, intelligent customer service and news broadcasting, and the trained digital human figure can be reused repeatedly.
In the embodiment of the invention, the sound characteristics and the lip movement characteristics in the real-person image recorded video are extracted to obtain different characteristics that can represent the real-person image; the explanation text to be converted is converted into the corresponding explanation speech through a speech conversion technology, and the explanation speech is converted into virtual voice data having the sound characteristics by using a soft sound source parameter controller; the virtual voice data is then used as the output voice data of a preset digital image, the lip video of the preset digital image is constructed from the lip movement characteristics, and the limb movement video is fused to obtain the avatar explanation video. The efficiency of avatar explanation video generation is thereby improved. Therefore, the avatar video generation method provided by the invention can solve the problem of low avatar video generation efficiency.
Fig. 8 is a schematic structural diagram of an electronic device implementing an avatar video generation method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as an avatar video generation program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, and may include one or more central processing units (CPUs), microprocessors, digital signal processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the various components of the whole electronic device by means of various interfaces and lines, and executes the functions of the electronic device and processes its data by running or executing programs or modules stored in the memory 11 (for example, the avatar video generation program) and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various kinds of data, such as codes of an avatar video generation program, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 8 shows only an electronic device with certain components; it will be understood by those skilled in the art that the structure shown in Fig. 8 does not limit the electronic device 1, which may comprise fewer or more components than those shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The avatar video generation program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1 may be stored in a readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. The readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The invention also provides a readable storage medium storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A method for avatar video generation, the method comprising:
acquiring a real person image recorded video, carrying out sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features in the video sound data based on a preset constant amplitude difference frequency method;
extracting video key frames in the real person image recorded video, and identifying lip movement characteristics in the video key frames;
setting a plurality of reference limb actions, and screening a limb action video containing the reference limb actions from a preset action video reference library;
acquiring an explanation text to be converted, converting the explanation text to be converted into corresponding explanation voice through a voice conversion technology, and converting the explanation voice into virtual voice data with the voice characteristics by using a soft sound source parameter controller;
and taking the virtual voice data as output voice data of a preset digital image, constructing a lip video of the preset digital image according to the lip movement characteristics, and fusing the limb movement video to obtain a virtual image explanation video.
2. The avatar video generating method of claim 1, wherein said performing sound sampling processing on said real-person image recorded video to obtain video sound data comprises:
establishing wireless communication connection between a recording end corresponding to the real person image recorded video and each mobile terminal, wherein the mobile terminal is used for sound sampling;
receiving sound data sampled by the mobile terminal by utilizing a wireless communication connection;
recognizing a voice scene to which the sound data belongs, and selecting a noise reduction model corresponding to the voice scene;
and carrying out noise reduction processing on the sound data by using the noise reduction model to obtain the video sound data.
3. The avatar video generating method of claim 2, wherein said recognizing a voice scene to which said sound data belongs comprises:
collecting a noise sample set under each scene, and extracting corresponding audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model;
and recognizing the voice scene to which the sound data belongs by using the standard scene recognition model.
4. The avatar video generation method of claim 1, wherein said extracting video key frames from said real-person image recorded video comprises:
inputting the real person image recorded video into a pre-trained convolution network for feature extraction to obtain a first feature vector;
inputting the first feature vector into a cross attention module for aggregation to obtain a second feature vector;
inputting the second feature vector and the output feature vector of the lower network of the convolutional network into a channel attention module together to obtain a third feature vector;
performing feature reconstruction on the third feature vector by using a decoder to obtain final reconstruction features, and acquiring corresponding video frames in the real image recorded video based on the final reconstruction features;
and taking the corresponding video frame in the real human image recorded video as a video key frame.
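One building block of the claim-4 pipeline, the channel attention module, can be sketched without a deep learning framework. This is a squeeze-and-excite style stand-in with softmax weights over channel means; the claimed module would use learned parameters, which are omitted here.

```python
import math

def channel_attention(feature_maps):
    """Weight each channel of a feature map by a softmax over its mean
    activation (hypothetical stand-in for the claimed channel attention).
    feature_maps: list of channels, each a flat list of activations."""
    means = [sum(ch) / len(ch) for ch in feature_maps]
    m = max(means)                      # subtract max for numeric stability
    exps = [math.exp(v - m) for v in means]
    z = sum(exps)
    weights = [e / z for e in exps]
    # scale every activation in a channel by that channel's weight
    return [[w * x for x in ch] for w, ch in zip(weights, feature_maps)]
```

Channels with larger mean activation receive larger weights, so informative channels dominate the reweighted feature vector passed on to the decoder.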
5. The avatar video generation method of claim 1, wherein said identifying lip movement features in said video keyframes comprises:
acquiring a plurality of preset lip key points, identifying the positions of the lip key points in the video key frame, and connecting the lip key points in a single direction to obtain a lip edge profile;
performing curve fitting on the lip edge profile, and extracting curvature change characteristics in the lip edge profile after curve fitting;
obtaining an included angle change characteristic corresponding to the lip edge profile based on the lip key points;
and combining the curvature change characteristic and the included angle change characteristic to obtain the lip motion characteristic in the video key frame.
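The curvature-change part of claim 5 can be illustrated numerically. Note a swapped-in technique: the claim fits a curve to the lip edge profile, while this sketch uses the simpler discrete three-point (Menger) curvature over consecutive key points; the point coordinates are invented for the example.

```python
import math

def menger_curvature(p, q, r):
    """Curvature of the circle through three consecutive lip-edge points:
    k = 4 * area(p, q, r) / (|pq| * |qr| * |rp|). Zero for collinear points."""
    def d(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    # twice the triangle area, via the cross product magnitude
    area2 = abs((q[0] - p[0]) * (r[1] - p[1])
                - (r[0] - p[0]) * (q[1] - p[1]))
    denom = d(p, q) * d(q, r) * d(r, p)
    return 0.0 if denom == 0 else 2 * area2 / denom

def curvature_profile(points):
    """Curvature at each interior point of an ordered lip edge profile."""
    return [menger_curvature(points[i - 1], points[i], points[i + 1])
            for i in range(1, len(points) - 1)]
```

Three points on a unit circle give curvature 1 (the reciprocal of the radius), and collinear points give 0, so the profile captures how sharply the lip contour bends at each key point.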
6. The avatar video generating method of claim 5, wherein said obtaining an included angle change characteristic corresponding to said lip edge profile based on said lip key points comprises:
constructing, from lines connecting the lip key points, a first triangle covering the left or right side of the lip and a second triangle covering the upper or lower part of the lip;
and obtaining the included angle change characteristic by using the angle values of preset angles in the first triangle and the second triangle.
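The included angle of claim 6 is just the angle at a triangle vertex formed by lip key points, and the change characteristic is its difference across key frames. A minimal sketch, with all coordinates hypothetical:

```python
import math

def vertex_angle(a, b, c):
    """Angle at vertex b (in degrees) of the triangle a-b-c, e.g. at a
    mouth corner of a triangle built from lip key points."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n = math.hypot(*v1) * math.hypot(*v2)
    # clamp to [-1, 1] to guard acos against floating-point drift
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / n))))

def angle_change(tri_prev, tri_cur):
    """Included angle change characteristic between two key frames,
    each given as a (a, b, c) triangle of lip key points."""
    return vertex_angle(*tri_cur) - vertex_angle(*tri_prev)
```

As the mouth opens or closes between key frames, the angle at the corner vertex grows or shrinks, and the signed difference serves as the change feature.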
7. The avatar video generating method of claim 1, wherein before said converting said explanation voice into virtual voice data having said sound features by using a soft sound source parameter controller, said method further comprises:
and packaging the sound features into the soft sound source parameter controller.
8. An avatar video generating apparatus, the apparatus comprising:
the sound feature extraction module is used for acquiring a real person image recorded video, performing sound sampling processing on the real person image recorded video to obtain video sound data, and extracting sound features from the video sound data based on a preset constant amplitude difference frequency method;
the lip motion feature identification module is used for extracting video key frames from the real person image recorded video and identifying lip motion features in the video key frames;
the video screening module is used for setting a plurality of reference limb actions and screening, from a preset action video reference library, a limb action video containing the plurality of reference limb actions;
the explanation video generation module is used for acquiring an explanation text to be converted, converting the explanation text to be converted into a corresponding explanation voice through a voice conversion technology, converting the explanation voice into virtual voice data having the sound features by using a soft sound source parameter controller, taking the virtual voice data as the output voice data of a preset digital avatar, constructing a lip motion video of the preset digital avatar from the lip motion features, and fusing the limb action video to obtain the avatar explanation video.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the avatar video generation method of any of claims 1-7.
10. A readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the avatar video generation method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210512471.5A CN114866807A (en) | 2022-05-12 | 2022-05-12 | Avatar video generation method and device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114866807A true CN114866807A (en) | 2022-08-05 |
Family
ID=82637471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210512471.5A Pending CN114866807A (en) | 2022-05-12 | 2022-05-12 | Avatar video generation method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114866807A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109584648A (en) * | 2018-11-08 | 2019-04-05 | 北京葡萄智学科技有限公司 | Data creation method and device |
CN112465935A (en) * | 2020-11-19 | 2021-03-09 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium |
WO2021159391A1 (en) * | 2020-02-13 | 2021-08-19 | 南昌欧菲光电技术有限公司 | Camera module, photographing apparatus, and electronic device |
CN113469292A (en) * | 2021-09-02 | 2021-10-01 | 北京世纪好未来教育科技有限公司 | Training method, synthesizing method, device, medium and equipment for video synthesizing model |
US20220084502A1 (en) * | 2020-09-14 | 2022-03-17 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for determining shape of lips of virtual character, device and computer storage medium |
CN114202605A (en) * | 2021-12-07 | 2022-03-18 | 北京百度网讯科技有限公司 | 3D video generation method, model training method, device, equipment and medium |
CN114338994A (en) * | 2021-12-30 | 2022-04-12 | Oppo广东移动通信有限公司 | Optical anti-shake method, optical anti-shake apparatus, electronic device, and computer-readable storage medium |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115050354A (en) * | 2022-08-10 | 2022-09-13 | 北京百度网讯科技有限公司 | Digital human driving method and device |
CN116248811A (en) * | 2022-12-09 | 2023-06-09 | 北京生数科技有限公司 | Video processing method, device and storage medium |
CN116248811B (en) * | 2022-12-09 | 2023-12-05 | 北京生数科技有限公司 | Video processing method, device and storage medium |
CN115953515A (en) * | 2023-03-14 | 2023-04-11 | 深圳崇德动漫股份有限公司 | Animation image generation method, device, equipment and medium based on real person data |
CN115953515B (en) * | 2023-03-14 | 2023-06-27 | 深圳崇德动漫股份有限公司 | Cartoon image generation method, device, equipment and medium based on real person data |
CN116702707A (en) * | 2023-08-03 | 2023-09-05 | 腾讯科技(深圳)有限公司 | Action generation method, device and equipment based on action generation model |
CN116702707B (en) * | 2023-08-03 | 2023-10-03 | 腾讯科技(深圳)有限公司 | Action generation method, device and equipment based on action generation model |
CN117714763A (en) * | 2024-02-05 | 2024-03-15 | 深圳市鸿普森科技股份有限公司 | Virtual object speaking video generation method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114866807A (en) | Avatar video generation method and device, electronic equipment and readable storage medium | |
EP3889912A1 (en) | Method and apparatus for generating video | |
CN111681681A (en) | Voice emotion recognition method and device, electronic equipment and storage medium | |
CN110246512A (en) | Sound separation method, device and computer readable storage medium | |
CN107844481B (en) | Text recognition error detection method and device | |
CN111626049B (en) | Title correction method and device for multimedia information, electronic equipment and storage medium | |
WO2023050650A1 (en) | Animation video generation method and apparatus, and device and storage medium | |
CN107092664A (en) | A kind of content means of interpretation and device | |
WO2023273628A1 (en) | Video loop recognition method and apparatus, computer device, and storage medium | |
CN112466273A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN115002491A (en) | Network live broadcast method, device, equipment and storage medium based on intelligent machine | |
CN113903363A (en) | Violation detection method, device, equipment and medium based on artificial intelligence | |
CN114780701B (en) | Automatic question-answer matching method, device, computer equipment and storage medium | |
CN114140814A (en) | Emotion recognition capability training method and device and electronic equipment | |
CN113591472A (en) | Lyric generation method, lyric generation model training method and device and electronic equipment | |
CN112542172A (en) | Communication auxiliary method, device, equipment and medium based on online conference | |
CN116939288A (en) | Video generation method and device and computer equipment | |
CN109766089B (en) | Code generation method and device based on dynamic diagram, electronic equipment and storage medium | |
CN113555003B (en) | Speech synthesis method, device, electronic equipment and storage medium | |
CN117119123A (en) | Method and system for generating digital human video based on video material | |
CN116129860A (en) | Automatic meta-space virtual human book broadcasting method based on AI artificial intelligence technology | |
CN116561294A (en) | Sign language video generation method and device, computer equipment and storage medium | |
CN114842880A (en) | Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium | |
CN113762056A (en) | Singing video recognition method, device, equipment and storage medium | |
CN115206342A (en) | Data processing method and device, computer equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||