CN112752118B - Video generation method, device, equipment and storage medium - Google Patents

Video generation method, device, equipment and storage medium Download PDF

Info

Publication number
CN112752118B
Authority
CN
China
Prior art keywords
sub
voice
image
video
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011587839.1A
Other languages
Chinese (zh)
Other versions
CN112752118A (en)
Inventor
杜绪晗
焦少慧
苏再卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202011587839.1A
Publication of CN112752118A
Application granted
Publication of CN112752118B
Status: Active

Classifications

    • H04N 21/2335 Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G06T 3/02
    • H04N 21/233 Processing of audio elementary streams
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display

Abstract

Embodiments of the present disclosure disclose a video generation method, device, equipment and storage medium. The method includes: extracting the voice feature of each voice frame in voice data and the image feature of the video frame corresponding to each voice frame; performing affine transformation on the video frames according to the voice features and the image features; and generating a target video from the affine-transformed video frames. By performing affine transformation on the video frames according to the voice features and the image features and then generating the target video from the transformed frames, the method aligns the voice with the mouth shapes in the video, reduces cost, and improves the accuracy of the mouth-shape-to-voice alignment.

Description

Video generation method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the technical field of image processing, and in particular to a video generation method, device, equipment and storage medium.
Background
In both dubbing and animation, speech must be matched with the mouth shape of a person in an image. In the related art, existing approaches to aligning voice with mouth shape are either costly or produce mouth shapes that are not accurately aligned with the voice.
Disclosure of Invention
Embodiments of the present disclosure provide a video generation method, device, equipment and storage medium, so as to align voice with mouth shape in video, reduce cost, and improve the accuracy of the mouth-shape-to-voice alignment.
In a first aspect, an embodiment of the present disclosure provides a video generating method, including:
extracting the voice feature of each voice frame in voice data and the image feature of the video frame corresponding to each voice frame;
performing affine transformation on the video frames according to the voice features and the image features; and
generating a target video from the affine-transformed video frames.
In a second aspect, an embodiment of the present disclosure further provides a video generating apparatus, including:
a feature extraction module, configured to extract the voice feature of each voice frame in voice data and the image feature of the video frame corresponding to each voice frame;
an affine transformation module, configured to perform affine transformation on the video frames according to the voice features and the image features; and
a target video generation module, configured to generate a target video from the affine-transformed video frames.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processing devices;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the video generation method as described in embodiments of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements a video generation method according to the embodiments of the present disclosure.
Embodiments of the present disclosure disclose a video generation method, device, equipment and storage medium. The method includes: extracting the voice feature of each voice frame in voice data and the image feature of the video frame corresponding to each voice frame; performing affine transformation on the video frames according to the voice features and the image features; and generating a target video from the affine-transformed video frames. By performing affine transformation on the video frames according to the voice features and the image features and then generating the target video from the transformed frames, the method aligns the voice with the mouth shapes in the video, reduces cost, and improves the accuracy of the mouth-shape-to-voice alignment.
Drawings
FIG. 1 is a flow chart of a video generation method in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an affine transformation sub-network in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an affine transformation of a video frame by a set neural network in an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of a video generating apparatus in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flowchart of a video generation method according to a first embodiment of the present disclosure. The method is applicable to generating video based on voice and may be performed by a video generating apparatus, which may be implemented in hardware and/or software and is typically integrated in a device having a video generation function, such as a server, a mobile terminal, or a server cluster. As shown in fig. 1, the method specifically includes the following steps:
Step 110: extract the voice feature of each voice frame in the voice data and the image feature of the video frame corresponding to each voice frame.
The voice data may be recorded voice or voice converted from text. The video to which the video frames belong may be a recorded video, a video downloaded from the network, or a video synthesized from the same face picture. A voice feature may be represented by a vector, for example a D-dimensional vector (D). An image feature may be represented by a tensor, for example (C, H, W), where C denotes the number of channels, H denotes the image height, and W denotes the image width.
In this embodiment, the voice data is first divided into frames to obtain a plurality of voice frames, and feature extraction is then performed on each voice frame to obtain the voice feature of each voice frame. Specifically, any existing speech feature extraction algorithm may be used to extract features from the voice frames, which is not described in detail here.
In this embodiment, the voice frames and the video frames are in one-to-one correspondence, and the image feature of the video frame corresponding to each voice frame needs to be extracted. Likewise, any existing image feature extraction algorithm may be used, which is not described in detail here.
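As a purely illustrative sketch of this step (the patent does not prescribe specific extractors), the following code derives one MFCC vector per video frame as the D-dimensional voice feature and uses a toy CNN to produce a (C, H, W) image feature map; the sampling rate, n_mfcc, layer sizes and the use of librosa/PyTorch are all assumptions.

```python
# Illustrative sketch only: MFCC speech features and a toy CNN image encoder.
# The patent leaves the concrete extractors open; everything below is an assumption.
import librosa
import torch
import torch.nn as nn

def speech_features(wav_path, fps=25, n_mfcc=40):
    """Return one D-dimensional MFCC vector per video frame (D = n_mfcc)."""
    audio, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                                   # one voice frame per video frame
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return torch.from_numpy(mfcc.T).float()           # shape (num_frames, D)

class ImageEncoder(nn.Module):
    """Toy CNN mapping an RGB video frame to a (C, H, W) feature map."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):                           # frame: (3, H0, W0), values in [0, 1]
        return self.net(frame.unsqueeze(0)).squeeze(0)  # (C, H0/4, W0/4)
```

With fps = 25 and a 16 kHz sample rate, the hop is 640 samples, so each video frame is paired with exactly one 40-dimensional voice feature, preserving the one-to-one correspondence described above.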
Step 120: perform affine transformation on the video frame according to the voice feature and the image feature.
Affine transformation can be understood as an operation of translating, scaling, rotating, etc. a two-dimensional image so that the mouth shape in the transformed video frame matches the voice frame. In this embodiment, performing affine transformation on a video frame according to the voice feature and the image feature can be understood as follows: affine transformation coefficients are first determined from the voice feature and the image feature, and the image feature is then multiplied by the affine transformation coefficients to realize the affine transformation of the video frame. Affine transformation of a video frame can also be understood as performing affine transformation on each channel of the video frame separately, i.e. the affine transformation coefficients include the affine transformation coefficients of each channel of the video frame.
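As an illustration of the per-channel transformation just described, the sketch below applies a separate 2x3 affine matrix to each of the C channels of the image feature using PyTorch's affine_grid/grid_sample; the coefficient shape (C, 2, 3) follows the sub-affine transformation coefficients described later, while the use of grid sampling rather than a literal element-wise multiplication is an assumption about how the coefficients are applied.

```python
# Illustrative sketch: apply one 2x3 affine matrix per channel of a (C, H, W) feature map.
import torch
import torch.nn.functional as F

def channelwise_affine(features, theta):
    """features: (C, H, W); theta: (C, 2, 3) -> affine-warped features of shape (C, H, W)."""
    C, H, W = features.shape
    # Treat each channel as its own batch element with a single channel.
    grid = F.affine_grid(theta, size=(C, 1, H, W), align_corners=False)   # (C, H, W, 2)
    warped = F.grid_sample(features.unsqueeze(1), grid, align_corners=False)
    return warped.squeeze(1)
```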
Specifically, performing affine transformation on the video frame according to the voice feature and the image feature may be: inputting the voice feature and the image feature into a set neural network to obtain the affine-transformed video frame.
The set neural network includes at least one sub-network and at least one affine transformation unit. The sub-network includes a global average pooling layer, a feature splicing layer, at least two fully connected layers and a dimension transformation layer; the output of the sub-network is a set of sub-affine transformation coefficients, and the affine transformation unit is configured to perform affine transformation on the image feature according to the sub-affine transformation coefficients.
Fig. 2 is a schematic diagram of an affine transformation sub-network in this embodiment. As shown in fig. 2, the structure of the sub-network is shown inside the dashed box. The image feature (C, H, W) is input into the global average pooling (GAP) layer for pooling; the pooled image feature (C) is input into the feature splicing layer (concat) and spliced with the voice feature (D) that is also input into the splicing layer; the spliced feature (C+D) is input into at least two fully connected layers (MLP) to obtain a feature of size C×6, which is input into the dimension transformation layer (reshape) to obtain the sub-affine transformation coefficients (C, 2, 3). The affine transformation unit performs affine transformation on the image feature input into the sub-network according to the sub-affine transformation coefficients to obtain the affine-transformed image feature. If the sub-network is the last sub-network, the video frame is determined from the affine-transformed image feature; if the sub-network is not the last sub-network, the affine-transformed image feature is input into the next sub-network.
The fully connected layers may include, for example, two or three layers.
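A minimal sketch of one such sub-network is given below, assuming exactly two fully connected layers and an illustrative hidden width of 256; the actual layer sizes are not specified in the patent.

```python
# Illustrative sketch of the Fig. 2 sub-network: GAP -> splice with voice feature
# -> fully connected layers -> reshape to (C, 2, 3) sub-affine transformation coefficients.
import torch
import torch.nn as nn

class AffineSubNetwork(nn.Module):
    def __init__(self, channels, speech_dim, hidden=256):
        super().__init__()
        self.channels = channels
        self.mlp = nn.Sequential(                     # "at least two" fully connected layers
            nn.Linear(channels + speech_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, channels * 6),
        )

    def forward(self, image_feat, speech_feat):
        # image_feat: (C, H, W); speech_feat: (D,)
        pooled = image_feat.mean(dim=(1, 2))               # global average pooling -> (C,)
        spliced = torch.cat([pooled, speech_feat], dim=0)  # feature splicing -> (C + D,)
        theta = self.mlp(spliced)                          # -> (C * 6,)
        return theta.view(self.channels, 2, 3)             # dimension transformation -> (C, 2, 3)
```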
Fig. 3 is a schematic diagram of affine transformation of a video frame by the set neural network according to an embodiment of the present disclosure. As shown in fig. 3, the set neural network includes at least two sub-networks: the input of the 1st sub-network is the voice feature and the image feature, and the input of the N-th sub-network is the voice feature and the image feature transformed according to the sub-affine transformation coefficients output by the (N-1)-th sub-network, where N is greater than or equal to 2.
For example, assuming that the set neural network includes three sub-networks, the voice feature and the image feature are first input into the first sub-network to obtain the first sub-affine transformation coefficients. The first sub-affine transformation coefficients and the image feature are processed by an affine transformation unit to obtain the first intermediate image feature. The voice feature and the first intermediate image feature are then input into the second sub-network to obtain the second sub-affine transformation coefficients, which are processed with the first intermediate image feature by an affine transformation unit to obtain the second intermediate image feature. The voice feature and the second intermediate image feature are then input into the third sub-network to obtain the third sub-affine transformation coefficients, which are processed with the second intermediate image feature by an affine transformation unit to obtain the final transformed image feature. Finally, the affine-transformed video frame is generated from the transformed image feature.
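The cascade in this example could be sketched as follows, reusing the hypothetical AffineSubNetwork and channelwise_affine helpers from the sketches above; three sub-networks are used only to mirror the example, and this module is not the patent's actual implementation.

```python
# Illustrative sketch of the cascaded set neural network of Fig. 3.
import torch.nn as nn

class SetNeuralNetwork(nn.Module):
    def __init__(self, channels, speech_dim, num_subnets=3):
        super().__init__()
        self.subnets = nn.ModuleList(
            AffineSubNetwork(channels, speech_dim) for _ in range(num_subnets)
        )

    def forward(self, image_feat, speech_feat):
        # The voice feature is fed to every sub-network; the image feature is
        # progressively warped by each sub-network's coefficients.
        for subnet in self.subnets:
            theta = subnet(image_feat, speech_feat)             # (C, 2, 3)
            image_feat = channelwise_affine(image_feat, theta)  # affine transformation unit
        return image_feat                                       # final transformed image feature
```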
Step 130: generate a target video from the affine-transformed video frames.
In this embodiment, after the affine-transformed video frames are obtained, they are merged and rendered to obtain the target video.
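Purely as an illustration of this merge-and-render step, the sketch below writes the transformed frames out at the original frame rate and muxes the driving audio back in with ffmpeg; the file names, frame rate and the use of imageio and ffmpeg are assumptions, not part of the patent.

```python
# Illustrative sketch: merge rendered frames into a video and mux in the driving audio.
import subprocess
import imageio.v2 as imageio

def write_target_video(frames, audio_path, out_path="target.mp4", fps=25):
    silent_path = "frames_only.mp4"
    with imageio.get_writer(silent_path, fps=fps) as writer:
        for frame in frames:                  # each frame: H x W x 3 uint8 array
            writer.append_data(frame)
    # Attach the original speech track to the rendered frames.
    subprocess.run(
        ["ffmpeg", "-y", "-i", silent_path, "-i", audio_path,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )
```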
The technical solution of this embodiment can be used in educational scenarios: a single portrait template is driven by different voices to generate teaching videos, such as reciting poems or reading English articles. It can also be used for short-video production: a portrait video is recorded once and then matched with different voices. It can further be used for language translation: speech in different languages can be used to generate videos with the same picture but different speech.
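To illustrate these scenarios end to end, the sketch below drives a single portrait frame with several different audio tracks, reusing the hypothetical helpers above (speech_features, ImageEncoder, SetNeuralNetwork, channelwise_affine, write_target_video) and adding a hypothetical ImageDecoder, since the patent does not detail how the transformed feature map is converted back into an RGB frame.

```python
# Illustrative end-to-end sketch: one portrait template, several driving voices.
# The ImageDecoder is a hypothetical stand-in; the patent only states that the video
# frame is determined from the affine-transformed image feature.
import numpy as np
import torch
import torch.nn as nn

class ImageDecoder(nn.Module):
    """Hypothetical decoder mapping a (C, H, W) feature map back to an RGB frame."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):                            # feat: (C, H, W)
        return self.net(feat.unsqueeze(0)).squeeze(0)   # (3, H0, W0), values in [0, 1]

def drive_portrait(portrait_frame, audio_paths, encoder, decoder, setnet):
    """portrait_frame: (3, H0, W0) tensor in [0, 1]; one output video per audio track."""
    with torch.no_grad():
        feat = encoder(portrait_frame)                          # (C, H, W), reused for all voices
        for i, audio_path in enumerate(audio_paths):
            frames = []
            for speech_feat in speech_features(audio_path):     # one (D,) vector per frame
                warped = setnet(feat, speech_feat)              # cascaded per-channel affine
                rgb = decoder(warped).permute(1, 2, 0).numpy()  # (H0, W0, 3)
                frames.append((rgb * 255).astype(np.uint8))
            write_target_video(frames, audio_path, out_path=f"target_{i}.mp4")
```

Passing recordings in different languages as audio_paths corresponds to the language-translation use case: the same picture, different speech.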
Embodiments of the present disclosure disclose a video generation method that includes: extracting the voice feature of each voice frame in voice data and the image feature of the video frame corresponding to each voice frame; performing affine transformation on the video frames according to the voice features and the image features; and generating a target video from the affine-transformed video frames. By performing affine transformation on the video frames according to the voice features and the image features and then generating the target video from the transformed frames, the method aligns the voice with the mouth shapes in the video, reduces cost, and improves the accuracy of the mouth-shape-to-voice alignment.
Fig. 4 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus includes:
the feature extraction module 210 is configured to extract the voice feature of each voice frame in the voice data and the image feature of the video frame corresponding to each voice frame;
the affine transformation module 220 is configured to perform affine transformation on the video frames according to the voice features and the image features; and
the target video generation module 230 is configured to generate a target video from the affine-transformed video frames.
Optionally, the affine transformation module 220 is further configured to:
inputting the voice characteristics and the image characteristics into a set neural network to obtain a video frame after affine transformation; wherein the set neural network comprises at least one sub-network and at least one affine transformation unit.
Optionally, the sub-network includes a global average pooling layer, a feature splicing layer, at least two fully connected layers and a dimension transformation layer; the output of the sub-network is a set of sub-affine transformation coefficients, and the affine transformation unit is configured to perform affine transformation on the image features according to the sub-affine transformation coefficients.
Optionally, if the set neural network includes at least two sub-networks, the input of the 1st sub-network is the voice feature and the image feature, and the input of the N-th sub-network is the voice feature and the image feature transformed according to the sub-affine transformation coefficients output by the (N-1)-th sub-network, where N is greater than or equal to 2.
Optionally, the affine transformation module 220 is further configured to:
for each sub-network, inputting the image features into the global average pooling layer for pooling; inputting the pooled image features into the feature splicing layer and splicing them with the voice features input into the splicing layer; and inputting the spliced features into the at least two fully connected layers for feature extraction and then into the dimension transformation layer to obtain the sub-affine transformation coefficients;
the affine transformation unit carries out affine transformation on the image features input into the sub-network according to the sub-affine transformation coefficients to obtain affine transformed image features;
if the sub-network is the last sub-network, determining a video frame according to the affine transformed image characteristics;
if the sub-network is not the last sub-network, the affine transformed image features are input to the next sub-network.
Optionally, the affine transformation module 220 is further configured to:
affine transformation is carried out on each channel of the video frame according to the voice characteristics and the image characteristics.
Optionally, the video corresponding to the video frame includes a video synthesized by the same face picture.
The apparatus can execute the methods provided by any embodiment of the present disclosure and has the corresponding functional modules and beneficial effects for executing those methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided by the foregoing embodiments of the present disclosure.
Referring now to fig. 5, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), etc., as well as fixed terminals such as digital TVs, desktop computers, etc., or various forms of servers such as stand-alone servers or server clusters. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic device 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 309, or installed from the storage device 308, or installed from the ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting the voice characteristics of each voice frame in the voice data and the image characteristics of the video frame corresponding to each voice frame; affine transformation is carried out on the video frames according to the voice characteristics and the image characteristics; and generating a target video according to the affine transformed video frame.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, the embodiments of the present disclosure disclose a video generation method, including:
extracting the voice characteristics of each voice frame in the voice data and the image characteristics of the video frame corresponding to each voice frame;
affine transformation is carried out on the video frames according to the voice characteristics and the image characteristics;
and generating a target video according to the affine transformed video frame.
Further, affine transforming the video frame according to the speech features and the image features, comprising:
inputting the voice characteristics and the image characteristics into a set neural network to obtain a video frame after affine transformation; wherein the set neural network includes at least one sub-network and at least one affine transformation unit.
Further, the sub-network comprises a global average pooling layer, a feature stitching layer, at least two full-connection layers and a dimension transformation layer, the output of the sub-network is a sub-affine transformation coefficient, and the affine transformation unit is used for carrying out affine transformation on the image features according to the sub-affine transformation coefficient.
Further, if the set neural network includes at least two sub-networks, the input of the 1st sub-network is the voice feature and the image feature, and the input of the N-th sub-network is the voice feature and the image feature transformed according to the sub-affine transformation coefficients output by the (N-1)-th sub-network; wherein N is greater than or equal to 2.
Further, inputting the voice feature and the image feature into a set neural network to obtain an affine transformed video frame, including:
for each sub-network, inputting the image features into the global average pooling layer for pooling; inputting the pooled image features into the feature splicing layer and splicing them with the voice features input into the splicing layer; and inputting the spliced features into the at least two fully connected layers for feature extraction and then into the dimension transformation layer to obtain the sub-affine transformation coefficients;
the affine transformation unit carries out affine transformation on the image features input into the sub-network according to the sub-affine transformation coefficients to obtain affine transformed image features;
if the sub-network is the last sub-network, determining a video frame according to the affine transformed image characteristics;
and if the sub-network is not the last sub-network, inputting the affine transformed image characteristics into the next sub-network.
Further, affine transforming the video frame according to the speech features and the image features, comprising:
and carrying out affine transformation on each channel of the video frame according to the voice characteristic and the image characteristic.
Further, the video corresponding to the video frame comprises a video synthesized by the same face picture.
Note that the above is only a preferred embodiment of the present disclosure and the technical principle applied. Those skilled in the art will appreciate that the present disclosure is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the disclosure. Therefore, while the present disclosure has been described in connection with the above embodiments, the present disclosure is not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the present disclosure, the scope of which is determined by the scope of the appended claims.

Claims (8)

1. A video generation method, comprising:
extracting voice characteristics of each voice frame and image characteristics of video frames corresponding to each voice frame in voice data, wherein the voice characteristics comprise dimensions, the image characteristics comprise channel number, image height and image width, and the voice frames are in one-to-one correspondence with the video frames;
affine transformation is carried out on the video frames according to the voice characteristics and the image characteristics;
generating a target video according to the affine transformed video frame;
affine transforming the video frame according to the speech features and the image features, comprising:
inputting the voice feature and the image feature into a set neural network to obtain an affine transformed video frame, wherein the set neural network comprises at least one sub-network and at least one affine transformation unit;
the sub-network comprises a global average value pooling layer, a characteristic splicing layer, at least two full-connection layers and a dimension transformation layer, the output of the sub-network is a sub-affine transformation coefficient, and the affine transformation unit is used for carrying out affine transformation on the image characteristics according to the sub-affine transformation coefficient;
for each sub-network, inputting the image characteristics into the global average pooling layer for pooling treatment; inputting the pooled image features into the feature splicing layer, and inputting the voice features of the feature splicing layer for feature splicing; and inputting the spliced features into the at least two fully connected layers for feature extraction, and inputting the features into the dimension transformation layer to obtain sub affine transformation coefficients.
2. The method according to claim 1, wherein if the set neural network includes at least two sub-networks, the input of the 1st sub-network is the speech feature and the image feature, and the input of the N-th sub-network is the speech feature and the image feature transformed according to the sub-affine transformation coefficient output by the (N-1)-th sub-network; wherein N is greater than or equal to 2.
3. The method according to claim 1 or 2, wherein inputting the speech feature and the image feature into a set neural network, obtaining affine transformed video frames, comprises:
the affine transformation unit carries out affine transformation on the image features input into the sub-network according to the sub-affine transformation coefficients to obtain affine transformed image features;
if the sub-network is the last sub-network, determining a video frame according to the affine transformed image characteristics;
and if the sub-network is not the last sub-network, inputting the affine transformed image characteristics into the next sub-network.
4. The method of claim 1, wherein affine transforming the video frame based on the speech features and the image features comprises:
and carrying out affine transformation on each channel of the video frame according to the voice characteristic and the image characteristic.
5. The method of claim 1, wherein the video corresponding to the video frame comprises a video synthesized from the same face picture.
6. A video generating apparatus, comprising:
the device comprises a feature extraction module, a video frame extraction module and a video frame extraction module, wherein the feature extraction module is used for extracting voice features of voice frames in voice data and image features of video frames corresponding to the voice frames, the voice features comprise dimensions, the image features comprise channel numbers, image heights and image widths, and the voice frames correspond to the video frames one by one;
affine transformation module, is used for carrying on affine transformation to the said video frame according to said speech feature and said image feature;
the target video generation module is used for generating a target video according to the affine transformed video frame;
the affine transformation module is specifically configured to:
inputting the voice characteristics and the image characteristics into a set neural network to obtain an affine transformed video frame, wherein the set neural network comprises at least one sub-network and at least one affine transformation unit;
the sub-network comprises a global average value pooling layer, a characteristic splicing layer, at least two full-connection layers and a dimension transformation layer, the output of the sub-network is a sub-affine transformation coefficient, and the affine transformation unit is used for carrying out affine transformation on the image characteristics according to the sub-affine transformation coefficient;
for each sub-network, inputting the image characteristics into the global average pooling layer for pooling treatment; inputting the pooled image features into the feature splicing layer, and performing feature splicing on the voice features input into the feature splicing layer; and inputting the spliced features into the at least two fully connected layers for feature extraction, and inputting the features into the dimension transformation layer to obtain sub affine transformation coefficients.
7. An electronic device, the electronic device comprising:
one or more processing devices;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the video generation method of any of claims 1-5.
8. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, implements the video generation method according to any one of claims 1-5.
CN202011587839.1A 2020-12-29 2020-12-29 Video generation method, device, equipment and storage medium Active CN112752118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011587839.1A CN112752118B (en) 2020-12-29 2020-12-29 Video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011587839.1A CN112752118B (en) 2020-12-29 2020-12-29 Video generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112752118A (en) 2021-05-04
CN112752118B (en) 2023-06-27

Family

ID=75646486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011587839.1A Active CN112752118B (en) 2020-12-29 2020-12-29 Video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112752118B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks
CN113935418A (en) * 2021-10-15 2022-01-14 北京字节跳动网络技术有限公司 Video generation method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0674315A1 (en) * 1994-03-18 1995-09-27 AT&T Corp. Audio visual dubbing system and method
CN111277912A (en) * 2020-02-17 2020-06-12 百度在线网络技术(北京)有限公司 Image processing method and device and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297792A (en) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 The recognition methods of a kind of voice mouth shape cartoon and device
US10347241B1 (en) * 2018-03-23 2019-07-09 Microsoft Technology Licensing, Llc Speaker-invariant training via adversarial learning
CN108962216B (en) * 2018-06-12 2021-02-02 北京市商汤科技开发有限公司 Method, device, equipment and storage medium for processing speaking video
CN108847234B (en) * 2018-06-28 2020-10-30 广州华多网络科技有限公司 Lip language synthesis method and device, electronic equipment and storage medium
CN109214366B (en) * 2018-10-24 2021-05-04 北京旷视科技有限公司 Local target re-identification method, device and system
CN109767460A (en) * 2018-12-27 2019-05-17 上海商汤智能科技有限公司 Image processing method, device, electronic equipment and computer readable storage medium
CN110189394B (en) * 2019-05-14 2020-12-29 北京字节跳动网络技术有限公司 Mouth shape generation method and device and electronic equipment
CN110347867B (en) * 2019-07-16 2022-04-19 北京百度网讯科技有限公司 Method and device for generating lip motion video
CN111145322B (en) * 2019-12-26 2024-01-19 上海浦东发展银行股份有限公司 Method, apparatus, and computer-readable storage medium for driving avatar

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0674315A1 (en) * 1994-03-18 1995-09-27 AT&T Corp. Audio visual dubbing system and method
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111277912A (en) * 2020-02-17 2020-06-12 百度在线网络技术(北京)有限公司 Image processing method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周东生; 张强; 魏小鹏. 人脸动画中语音可视化算法研究进展 [Research progress of speech visualization algorithms in facial animation]. 计算机工程与应用 [Computer Engineering and Applications], 2007, No. 9. *

Also Published As

Publication number Publication date
CN112752118A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN110298413B (en) Image feature extraction method and device, storage medium and electronic equipment
CN112752118B (en) Video generation method, device, equipment and storage medium
CN111459364B (en) Icon updating method and device and electronic equipment
CN111325704A (en) Image restoration method and device, electronic equipment and computer-readable storage medium
CN112418249A (en) Mask image generation method and device, electronic equipment and computer readable medium
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN114550728B (en) Method, device and electronic equipment for marking speaker
CN112307393A (en) Information issuing method and device and electronic equipment
WO2022228067A1 (en) Speech processing method and apparatus, and electronic device
CN113905177B (en) Video generation method, device, equipment and storage medium
CN113255812B (en) Video frame detection method and device and electronic equipment
CN112434064B (en) Data processing method, device, medium and electronic equipment
CN116437093A (en) Video frame repair method, apparatus, device, storage medium, and program product
CN112017685B (en) Speech generation method, device, equipment and computer readable medium
CN113709573B (en) Method, device, equipment and storage medium for configuring video special effects
CN111596823B (en) Page display method and device and electronic equipment
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN114419298A (en) Virtual object generation method, device, equipment and storage medium
CN110209851B (en) Model training method and device, electronic equipment and storage medium
CN113705386A (en) Video classification method and device, readable medium and electronic equipment
CN112233207A (en) Image processing method, device, equipment and computer readable medium
CN111738958B (en) Picture restoration method and device, electronic equipment and computer readable medium
WO2021018176A1 (en) Text special effect processing method and apparatus
CN111898338B (en) Text generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant