CN116248811B - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium

Info

Publication number
CN116248811B
CN116248811B
Authority
CN
China
Prior art keywords
video
audio
audio file
target object
continuous reading
Prior art date
Legal status
Active
Application number
CN202211580044.7A
Other languages
Chinese (zh)
Other versions
CN116248811A (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Beijing Shengshu Technology Co ltd
Original Assignee
Beijing Shengshu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shengshu Technology Co ltd
Priority to CN202211580044.7A
Publication of CN116248811A
Application granted
Publication of CN116248811B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/2224Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of this application relate to the technical field of artificial intelligence and provide a video processing method, device and storage medium. The method comprises: acquiring a first root video, in which the lips of a target object remain fully closed, and a second root video, in which the lips remain open at a first amplitude, both recorded for the target object in a first scene; acquiring an audio file used to drive generation of an avatar; driving the first root video with the non-read-through portion of the audio file to obtain a first video to be processed, and driving the second root video with the read-through portion of the audio file to obtain a second video to be processed; and synthesizing the first video to be processed and the second video to be processed to obtain the avatar of the target object. This scheme eliminates lip jitter and audio desynchronization and improves the visual effect of the avatar.

Description

Video processing method, device and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, further relates to the technical field of computer vision, and in particular relates to a video processing method, a video processing device and a storage medium.
Background
Currently, avatar synthesis can be applied in different scenarios. For example, in online education, a virtual teacher can provide teaching services, which not only greatly reduces the burden on teachers and lowers teaching costs, but also offers a better teaching experience than a simple recorded-and-broadcast class. In addition, avatars can be applied to a wider range of occasions, such as artificial intelligence (Artificial Intelligence, AI) news broadcasting, games, animation and other real business scenarios, where they have great commercial value. At present, a root video of a natural person in a silent state, that is, with the lips closed, is usually used as the driving video to synthesize the avatar of that person. However, keeping the lips closed for a long time makes the person tire easily, and slight variations are hard to avoid, so the synthesized avatar is prone to lip jitter and lip-audio desynchronization, and its visual effect is poor.
Disclosure of Invention
The embodiments of this application provide a video processing method, a video processing device and a storage medium, which can solve the problems of lip jitter and audio desynchronization and improve the visual effect of an avatar.
In a first aspect, an embodiment of the present application provides a video processing method, including:
acquiring a first root video and a second root video recorded in a first scene for a target object, wherein the lips of the target object in the first root video remain fully closed, and the lips of the target object in the second root video remain open at a first amplitude;
acquiring an audio file used to drive generation of an avatar;
driving the first root video with a first portion of the audio file to obtain a first video to be processed, and driving the second root video with a second portion of the audio file to obtain a second video to be processed; wherein the first portion of the audio file comprises a non-read-through portion of the audio file, and the second portion of the audio file comprises a read-through portion of the audio file;
and synthesizing the first video to be processed and the second video to be processed to obtain the avatar of the target object.
In some embodiments, before the first portion of the audio file is used to drive the first root video, the method further comprises:
determining the non-read-through portion and the read-through portion according to the text content of the audio file used to drive generation of the avatar;
and taking the audio segment corresponding to the non-read-through portion as the first portion of the audio file, and taking the audio segment corresponding to the read-through portion as the second portion of the audio file.
In some embodiments, the determining of the non-read-through portion and the read-through portion according to the text content of the audio file comprises:
determining the pronunciation interval time between adjacent words in the text content of the audio file, taking the audio segments corresponding to the set of all words whose pronunciation interval time is greater than a first preset threshold as the non-read-through portion, and taking the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold as the read-through portion.
In some embodiments, the determining of the non-read-through portion and the read-through portion according to the text content of the audio file comprises:
determining the integrity of the audio signal of each word in the text content of the audio file, taking the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold as the non-read-through portion, and taking the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold as the read-through portion.
In some embodiments, the determining of the non-read-through portion and the read-through portion according to the text content of the audio file comprises:
determining the pronunciation interval time between adjacent words in the text content of the audio file and the integrity of the audio signal of each word, taking the audio segments corresponding to the set of all words whose pronunciation interval time is greater than a first preset threshold and whose audio-signal integrity is greater than a second preset threshold as the non-read-through portion, and taking the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold as the read-through portion.
In some embodiments, the synthesizing of the first video to be processed and the second video to be processed comprises:
performing video stitching on the first video to be processed and the second video to be processed, and performing super-resolution processing on the stitched video through a deep learning algorithm to obtain the avatar of the target object.
In some embodiments, before the acquiring of the first root video and the second root video recorded in the first scene for the target object, the method further comprises:
comparing the lip position of the target object in the first root video with a first preset lip position, and if they match, determining that the first root video meets the requirement;
and comparing the lip position of the target object in the second root video with a second preset lip position, and if they match, determining that the second root video meets the requirement.
In a second aspect, an embodiment of the present application provides a video processing apparatus having a function of implementing a video processing method corresponding to the above first aspect. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware.
In some embodiments, the video processing apparatus includes:
the acquisition module is configured to acquire a first root video and a second root video recorded in a first scene for a target object, and an audio file used to drive generation of an avatar; wherein the lips of the target object in the first root video remain fully closed, and the lips of the target object in the second root video remain open at a first amplitude;
the processing module is configured to drive the first root video with a first portion of the audio file to obtain a first video to be processed, and drive the second root video with a second portion of the audio file to obtain a second video to be processed; wherein the first portion of the audio file comprises a non-read-through portion of the audio file, and the second portion of the audio file comprises a read-through portion of the audio file;
the processing module is further configured to synthesize the first video to be processed and the second video to be processed to obtain the avatar of the target object.
In some embodiments, the processing module is further configured to:
determine the non-read-through portion and the read-through portion according to the text content of the audio file used to drive generation of the avatar;
and take the audio segment corresponding to the non-read-through portion as the first portion of the audio file, and take the audio segment corresponding to the read-through portion as the second portion of the audio file.
In some embodiments, the processing module is specifically configured to:
determine the pronunciation interval time between adjacent words in the text content of the audio file, take the audio segments corresponding to the set of all words whose pronunciation interval time is greater than a first preset threshold as the non-read-through portion, and take the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold as the read-through portion.
In some embodiments, the processing module is specifically configured to:
determine the integrity of the audio signal of each word in the text content of the audio file, take the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold as the non-read-through portion, and take the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold as the read-through portion.
In some embodiments, the processing module is specifically configured to:
determine the pronunciation interval time between adjacent words in the text content of the audio file and the integrity of the audio signal of each word, take the audio segments corresponding to the set of all words whose pronunciation interval time is greater than a first preset threshold and whose audio-signal integrity is greater than a second preset threshold as the non-read-through portion, and take the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold as the read-through portion.
In some embodiments, the processing module is specifically configured to:
perform video stitching on the first video to be processed and the second video to be processed, and perform super-resolution processing on the stitched video through a deep learning algorithm to obtain the avatar of the target object.
In some embodiments, the processing module is further configured to:
compare the lip position of the target object in the first root video with a first preset lip position, and if they match, determine that the first root video meets the requirement;
and compare the lip position of the target object in the second root video with a second preset lip position, and if they match, determine that the second root video meets the requirement.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor and memory; wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the steps of any of the video processing methods provided in the first aspect or any of the embodiments of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having a function of implementing a video processing method corresponding to the above first aspect. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above, which may be software and/or hardware. In particular, the computer readable storage medium stores a plurality of instructions adapted to be loaded by a processor for performing the steps of the first aspect of the embodiment of the present application or any of the video processing methods provided in any implementation manner of the first aspect.
Compared with the prior art, in the scheme provided by the embodiments of this application, the lips of the target object in the first root video remain fully closed, so when the non-read-through portion of the audio file is used to drive the first root video, the lip shape of the avatar in the resulting first video to be processed is synchronized and continuous with the uttered speech in the non-read-through case. Similarly, the read-through portion of the audio file is used to drive the second root video, in which the lips remain open at the first amplitude, so the lip shape of the avatar in the resulting second video to be processed is synchronized and continuous with the uttered speech in the read-through case. Therefore, after the avatar of the target object is synthesized from the first video to be processed and the second video to be processed, even when read-through occurs, the root video with slightly open lips has been inserted, and the generated avatar does not alternate between closing and opening the mouth for every word while speaking a read-through passage; instead, it completes the passage with the mouth open. This avoids lip jitter and lip-audio desynchronization while the avatar is speaking and improves its visual effect.
Drawings
Fig. 1 is a schematic diagram of a server according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the process of driving video with audio according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a video processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an electronic device implementing a video processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a mobile phone implementing a video processing method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a server implementing a video processing method according to an embodiment of the present application;
Fig. 8 is a schematic view of an avatar generated in an embodiment of the present application;
Fig. 9 is a schematic diagram of audio-video correspondence in an avatar generated in an embodiment of the present application;
Fig. 10 is another schematic diagram of audio-video correspondence in an avatar generated in an embodiment of the present application;
Fig. 11 is another schematic diagram of audio-video correspondence in an avatar generated in an embodiment of the present application.
Detailed Description
The terms "first", "second" and the like in the description, the claims and the drawings of the embodiments of this application are used to distinguish similar objects (for example, the first region and the second region in the embodiments of this application denote different regions of the initial face image) and do not necessarily describe a specific order or precedence. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus that comprises a list of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article or apparatus. The division of modules in the embodiments of this application is only a logical division; other divisions are possible in actual implementation, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented. The coupling, direct coupling or communication connection between modules shown or discussed may be implemented through interfaces, and indirect coupling or communication connection between modules may be electrical or in other similar forms, none of which is limited in the embodiments of this application. The modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed over a plurality of circuit modules; some or all of them may be selected according to actual needs to achieve the purposes of the embodiments of this application.
The solution provided by the embodiments of this application relates to technologies such as artificial intelligence (Artificial Intelligence, AI), natural language processing (Natural Language Processing, NLP) and machine learning (Machine Learning, ML), which are described by the following embodiments:
the AI is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
AI technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
NLP is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques and the like.
Digital human technology needs to synchronize different mouth shapes with different audio information in order to generate a realistic digital human video. In particular, a link between the audio signal and the digital human's mouth shape needs to be established. For example, audio features (e.g., phonemes, energy, etc.) may be mapped to video features (e.g., mouth-shape features). Artificial intelligence (AI) can automatically learn the mapping between audio features and video features; for example, the mapping relationship between audio features and video features may be built with machine learning techniques, as sketched below.
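As an illustration only, the sketch below shows one way such an audio-to-mouth-shape mapping could be learned with a small feed-forward network. The feature dimensions, the network shape and the placeholder training data are assumptions made for the example; the patent does not specify a particular model.

```python
# A minimal sketch of learning an audio-feature -> mouth-shape mapping with a small
# feed-forward network. Feature dimensions and the network shape are illustrative
# assumptions, not taken from the patent.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, audio_dim: int = 80, mouth_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, mouth_dim),
        )

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, audio_dim) per-frame acoustic features (e.g., phoneme posteriors, energy)
        return self.net(audio_feat)  # (batch, mouth_dim) predicted mouth-shape features

model = AudioToMouth()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One hypothetical training step on random placeholder data.
audio_batch, mouth_batch = torch.randn(16, 80), torch.randn(16, 40)
loss = loss_fn(model(audio_batch), mouth_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```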
In order to improve the realism of the target person in the digital human video, for example to better restore the face of a lecturing teacher, the digital human video may be generated from a background video that contains the target person. The length of the audio in the digital human video may be determined by the recording duration or by the length of a specific text, and it may be relatively long, such as 40 minutes, 1 hour or more. To ensure that the background video is not shorter than the audio when synthesizing the digital human video, the target person has to hold a specific posture continuously while the background video is recorded. Recording the background video in this way places a great physical and mental burden on the target person. In addition, shooting the background video places high demands on the environment, for example the background of the video must change as little as possible, and renting a suitable shooting location is costly.
To reduce the difficulty and cost of shooting the background video, shorter video clips, such as 10 seconds, 30 seconds, 1 minute, 3 minutes or 10 minutes, can be shot, and the required background video is then generated by splicing these clips. However, the poses of the person in different video clips may differ; in particular, the pose of the subject at the end of the current clip to be spliced may differ from the pose at the start of the next clip, which makes splicing inconvenient. In addition, the pose of the target person in the background video inevitably changes (for example, slight shaking), so when the spliced clips are played, the display effect at the splice is poor, with image jitter or image jumps likely to appear.
The embodiments of this application provide a video processing method, a video processing device and a storage medium, which can be used on a server or a terminal device. In the method, the non-read-through portion of the audio file used to drive generation of the avatar drives a first root video, in which the lips remain fully closed, to obtain a first video to be processed; the read-through portion of the audio file drives a second root video, in which the lips remain open at a first amplitude, to obtain a second video to be processed; the first video to be processed and the second video to be processed are then synthesized to obtain the avatar of the target object. In this way, when read-through occurs, because the root video with slightly open lips has been inserted, the generated avatar does not alternate between closing and opening the mouth for every word while speaking a read-through passage, but finishes the passage with the mouth open. Lip jitter and lip-audio desynchronization are thus avoided while the avatar is speaking, and the visual effect of the avatar is improved.
The scheme of the embodiment of the application can be realized based on cloud technology, artificial intelligence technology and the like, and particularly relates to the technical fields of cloud computing, cloud storage, databases and the like in the cloud technology, and the technical fields are respectively described below.
Fig. 1 is a schematic diagram of a server according to an embodiment of the present application. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.
Referring to fig. 1, a system architecture 100 according to the present embodiment may include a plurality of servers 101, 102, 103. Different servers 101, 102, 103 may each provide different kinds of services. For example, the server 101 may provide a text recognition service, the server 102 may provide a speech synthesis service, and the server 103 may provide an image processing service.
For example, the server 101 may transmit text recognized from an image to the server 102 to synthesize an audio clip corresponding to the text. The server 103 may perform image processing on received video slices; for example, the server 103 may receive at least two video slices and obtain a target slice from the at least two video slices. In addition, the server 103 may generate frame-interpolated video slices for motion video slices, so as to reduce image jumps at the splice points of the video slices. The server 103 may also drive the target slice with the received audio clips to obtain the driven target slice, among other functions. The server 103 may further send the driven target slices, the generated mouth images, the driven video frames and the like to a terminal device so that this information can be presented on the terminal device; for example, the terminal device may display the driven video to implement video teaching and the like. The server 103 may be, for example, a background management server, a server cluster or a cloud server.
It should be specifically noted that, the server (for example, a business server and a search engine) related to the embodiment of the present application may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The image processing device according to the embodiment of the present application may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a personal digital assistant, and the like. The image processing device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
The cloud server can implement cloud computing. Cloud technology refers to a delivery and usage mode of IT infrastructure in which the required resources are obtained over a network in an on-demand, easily scalable manner; cloud computing in the broad sense refers to a delivery and usage mode of services in which the required services are obtained over a network in an on-demand, easily scalable manner. Such services may be IT, software or internet related, or other services. Cloud computing is a product of the fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.
For example, a cloud server may provide an artificial intelligence cloud service, also known as AI as a Service (AIaaS). An AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed marketplace: all developers can access one or more of the artificial intelligence services provided by the platform through an API, and some advanced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
The following is an exemplary description of the technical solution of the embodiment of the present application with reference to fig. 2 to 7.
As shown in fig. 2, fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the application, where the method flow includes:
201. Acquire a first root video and a second root video recorded in a first scene for the target object.
In this embodiment, the target object may be a person of any age or gender, for example a child, an adult or an elderly person, male or female.
The first scene may be any indoor or outdoor scene, for example a scene of online lecturing, online live streaming or online sales. The background or the person's pose in the scene may be switched.
A root video may be a video recorded live on the user's mobile terminal. The lips of the target object in the first root video remain fully closed; for example, during recording of the first root video, the natural person is required to keep the lips closed, and slight floating within the tolerance allowed by the computation is acceptable. The lips of the target object in the second root video remain open at a first amplitude; for recording of the second root video, the natural person is required to keep the lips slightly open, the opening amplitude can be adjusted according to specific requirements, and both the opening amplitude and its handling can be further unified in later video processing.
In this embodiment, an electronic device with processing capability may acquire the first root video and the second root video recorded in the first scene for the target object. The recording duration of the first root video and the second root video can be determined as required.
In some embodiments, keeping the lips of the target object fully closed comprises: keeping the lips closed while the pitch angle and yaw angle of the face do not exceed 20 degrees. Keeping the lips of the target object fully closed ensures that, when the audio later drives this base root video to synthesize the avatar, the lip shape can be adjusted over a larger range, the mouth shape matching the audio is more accurate, and large deformations are unlikely. For example, during recording, the face keeps a silent, natural state and the mouth can stay naturally closed, so that the mouth shows no obvious change throughout the recording, which improves the visual effect of the subsequent lip driving. The eyes may glance slowly within a range of about 20 degrees to the left and right; the person does not need to speak or walk, and the facial expression can remain in a normal state, that is, a natural state without emotion. During recording, the person may nod or shake the head slightly, but the offset should be kept below 20 degrees as far as possible.
Keeping the lips of the target object open at the first amplitude comprises: keeping the lips slightly open while the pitch angle and yaw angle of the face do not exceed 20 degrees. Similarly, keeping the lips of the target object open at the first amplitude in the root-video recording stage ensures that, when the audio later drives this base root video to synthesize the avatar, the lip shape can be adjusted over a larger range, the mouth shape matching the audio is more accurate, and large deformations are unlikely. A sketch of how these recording constraints might be checked is given below.
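The following is a minimal sketch of checking the recording constraints above (lips closed or slightly open, head pitch and yaw within 20 degrees). The way lip openness is measured (a normalized mouth-opening ratio) and all numeric tolerances other than the 20-degree limit are illustrative assumptions, not values taken from the patent.

```python
# A minimal sketch of checking the recording constraints: head pitch and yaw within
# 20 degrees, lips fully closed (first root video) or slightly open (second root video).
# The mouth_open_ratio metric and its thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class FrameMeasurement:
    pitch_deg: float         # estimated head pitch angle for the frame
    yaw_deg: float           # estimated head yaw angle for the frame
    mouth_open_ratio: float  # mouth-opening height / face height, hypothetical metric

def frame_ok(m: FrameMeasurement, expect_closed: bool,
             max_angle_deg: float = 20.0,
             closed_max: float = 0.02, open_min: float = 0.03, open_max: float = 0.10) -> bool:
    """Return True if the frame satisfies the recording requirement."""
    if abs(m.pitch_deg) > max_angle_deg or abs(m.yaw_deg) > max_angle_deg:
        return False
    if expect_closed:
        return m.mouth_open_ratio <= closed_max          # first root video: lips fully closed
    return open_min <= m.mouth_open_ratio <= open_max    # second root video: slightly open

# Example: validate one frame of the second root video.
print(frame_ok(FrameMeasurement(5.0, -3.0, 0.05), expect_closed=False))  # True
```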
202. Acquire the audio file used to drive generation of the avatar.
In this embodiment, the audio file is the user's broadcast audio recorded with professional noise-reduction equipment and is used to drive the lip shape in the video; it is referred to as the "driving audio". The audio file used to drive generation of the avatar may be acquired by an electronic device with processing capability. Because this audio file is used for subsequent avatar generation, the text content corresponding to the audio file needs to match the root video to be driven; specifically, the position corresponding to each word in the text content is also a position in time, and this time position matches the corresponding time position in the root video. As shown in fig. 8, the text content corresponding to the 6-8 s position is "Hello, I am ...", the text content corresponding to the 15-20 s position is "The first aspect ...", and the text content corresponding to the 25-28 s position is "The second aspect ...". It should be noted that the target object may be a teacher, in which case the audio file used to drive generation of the avatar may be the audio of the teacher's online lesson; the target object may also be a person doing live selling, skill demonstration or performance on a platform, in which case the audio file may be the audio of that person's speech.
203. Drive the first root video with the first portion of the audio file used to drive generation of the avatar to obtain a first video to be processed, and drive the second root video with the second portion of that audio file to obtain a second video to be processed.
In this embodiment, the electronic device with processing capability may drive the first root video with the first portion of the audio file used to generate the avatar to obtain the first video to be processed, and drive the second root video with the second portion of that audio file to obtain the second video to be processed.
The specific way in which the audio file drives a root video can be understood with reference to fig. 3. As shown in fig. 3, the target slice may be generated from slices of the root video (root_video). The audio clip used to drive the target slice may be the audio file used to drive generation of the avatar (driving_audio), and the audio clip may include a plurality of audio frames.
To facilitate understanding of the technical solution of this application, the correspondence between audio frames and video frames and the length of an audio frame are described here by way of example.
For example, the play duration of one audio frame is the reciprocal of the image frame rate. If the image frame rate is 50 fps, 50 frames are transmitted per second and each video frame needs a play time of 20 ms, so one 20 ms piece of audio may correspond to one video frame. Accordingly, the preset duration is set to the reciprocal of the frame rate, so that the audio output per segment corresponds to the picture, that is, the audio and the picture are aligned in time.
However, in some scenarios, the frame rate of the audio frames in the audio slices and the frame rate of the video frames in the video slices are different.
For example, the frequency range of normal human hearing is approximately 20 Hz to 20 kHz. The sampling frequency refers to the number of samples of the sound-wave amplitude taken per second when an analog sound waveform is digitized. For example, to reduce the distortion of the sound, the sampling frequency may be greater than 16 kHz. Typical audio sampling frequencies are 8 kHz, 11.025 kHz, 16 kHz, 22.05 kHz, 37.8 kHz, 44.1 kHz, 48 kHz and so on. For example, one audio frame may be formed from 200 sample points.
A sampling rate of 16 kHz means 16000 sample points per second, and the play duration of an audio frame = number of sample points per frame / sampling frequency (as for an Advanced Audio Coding (AAC) frame). Then, for an audio frame rate of 80 fps, the play duration of the current audio frame = 200 × 1000 / 16000 = 12.5 milliseconds (ms). A video frame rate of about 25 fps already satisfies the video playing effect; 25 pictures are transmitted per second, so each picture needs 1000 / 25 = 40 ms. The two play durations are therefore different, as the sketch below reproduces.
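A small sketch that reproduces the arithmetic above (16 kHz sampling, 200 samples per audio frame, 25 fps video) and derives how many audio frames cover one video frame:

```python
# Reproducing the example arithmetic: per-frame play durations for audio and video,
# and how many audio frames cover one video frame (rounded up when fractional).
import math

sample_rate_hz = 16_000
samples_per_audio_frame = 200
video_fps = 25

audio_frame_ms = samples_per_audio_frame * 1000 / sample_rate_hz   # 12.5 ms
audio_fps = sample_rate_hz / samples_per_audio_frame               # 80 fps
video_frame_ms = 1000 / video_fps                                  # 40 ms
n_audio_per_video = math.ceil(audio_fps / video_fps)               # ceil(80 / 25) = 4

print(audio_frame_ms, video_frame_ms, n_audio_per_video)           # 12.5 40.0 4
```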
In order to facilitate the generation of digital person information including audio and video of equal play time length, the correspondence between video frames and audio frames may be determined as follows:
In some embodiments, each of the at least two video slices has a frame rate of a first frame rate f1 and the audio slices has a frame rate of a second frame rate f2, the second frame rate f2 being greater than the first frame rate f1.
Accordingly, one video frame of the video slice corresponds to N audio frames of the audio slice, where N = f2/f1 rounded up, or alternatively N = f2/f1 rounded down.
If the first frame rate f1 and the second frame rate f2 are in an integer multiple relationship, the relationship between the audio frame and the video frame is determined according to the integer multiple relationship. If the first frame rate f1 and the second frame rate f2 are not in an integer multiple relationship, the correspondence between the audio frame and the video frame may be determined by rounding.
In some embodiments, before driving the target slice with the audio slice, the method may further include: if f2/f1 is a fraction greater than 1 (that is, f2 is not an integer multiple of f1), it is determined that there is an overlap between the audio frame at the end play time of the first play period and the audio frame at the start play time of the second play period.
Accordingly, driving the target tile with the audio tile may include the following operations.
First, a first correspondence is determined. The first correspondence includes: the (N×(i+1)−1)-th audio frame of the audio slice corresponds to the (i+1)-th video frame of the target slice, where the overlapping portion of the (N×(i+1)−1)-th audio frame also corresponds to the (i+1)-th video frame of the target slice.
Then, based on the first correspondence, the video frame corresponding to each audio frame is driven with that audio frame to obtain the driven target video frames, that is, the avatar of the target object.
In this way, because of the correspondence between audio frames and video frames in the first correspondence, the audio output per segment corresponds to the picture, that is, the audio frames and video frames are aligned in time. For example, fig. 9 shows a first video frame corresponding to a first audio frame, fig. 10 shows a second video frame corresponding to a second audio frame, and fig. 11 shows a third video frame corresponding to a third audio frame. It can be understood that in practical applications there may be more audio frames and corresponding video frames; the figures are only illustrative and do not limit their number.
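A minimal sketch of such an audio-to-video frame mapping follows. It assumes N = ⌈f2/f1⌉ and groups every N consecutive audio frames onto one video frame (0-indexed); this grouping is an assumption consistent with, but more detailed than, the first correspondence described above.

```python
# A minimal sketch of mapping audio frames to video frames when the audio frame
# rate f2 exceeds the video frame rate f1. Every N = ceil(f2/f1) consecutive
# audio frames are grouped onto one video frame; this grouping is an assumption.
import math

def audio_to_video_index(audio_idx: int, f1: float, f2: float) -> int:
    """Index of the video frame driven by the given audio frame (0-indexed)."""
    n = math.ceil(f2 / f1)
    return audio_idx // n

def video_to_audio_range(video_idx: int, f1: float, f2: float) -> range:
    """Audio frames i*N .. (i+1)*N - 1 that correspond to video frame i."""
    n = math.ceil(f2 / f1)
    return range(video_idx * n, (video_idx + 1) * n)

# Example with the rates used earlier: 80 fps audio, 25 fps video -> N = 4.
print(list(video_to_audio_range(0, 25, 80)))  # [0, 1, 2, 3]
print(audio_to_video_index(7, 25, 80))        # video frame 1
```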
204. Synthesize the first video to be processed and the second video to be processed to obtain the avatar of the target object.
In this embodiment, the electronic device with processing capability may synthesize the first video to be processed and the second video to be processed to obtain the avatar of the target object. The first video to be processed is obtained by driving the first root video with the non-read-through portion of the audio file used to drive generation of the avatar, and the second video to be processed is obtained by driving the second root video with the read-through portion of that audio file; by splicing and combining the two videos, the complete avatar of the target object corresponding to the audio file is obtained. In this way, the generated complete avatar does not alternate between closing and opening the mouth for every word when speaking passages that mix read-through and non-read-through words; instead, it finishes a read-through passage with the mouth open and a non-read-through passage by closing and opening the mouth word by word, so the lip shape of the avatar and the uttered speech are synchronized and continuous both in the read-through case and in the non-read-through case.
In the embodiments of this application, the lips of the target object in the first root video remain fully closed, so the non-read-through portion of the audio file used to drive generation of the avatar drives the first root video, and the lip shape of the avatar in the resulting first video to be processed is synchronized and continuous with the uttered speech in the non-read-through case. Likewise, the read-through portion of that audio file drives the second root video, in which the lips remain open at the first amplitude, so the lip shape of the avatar in the resulting second video to be processed is synchronized and continuous with the uttered speech in the read-through case. Therefore, when the first video to be processed and the second video to be processed are synthesized into the avatar of the target object, even if read-through occurs, the root video with slightly open lips has been inserted, and the generated avatar does not keep alternating between a closed and an open mouth for every word; instead, it keeps the mouth open throughout a read-through passage. Lip jitter and lip-audio desynchronization are thus avoided while the avatar is speaking, and the visual effect of the avatar is improved.
In some embodiments, the non-read-through portion and the read-through portion may be determined from the text content of the audio file. That is, before the first portion of the audio file is used to drive the first root video, the non-read-through portion and the read-through portion may be determined from the text content of the audio file used to drive generation of the avatar; the audio segment corresponding to the non-read-through portion is taken as the first portion of the audio file, and the audio segment corresponding to the read-through portion is taken as the second portion. The words in the text content define which words need to be read through and which do not, which gives a more precise direction for driving the subsequent root videos. For example, the text may contain words that are read through, such as "la" and "haha", and words that are not, such as "we" and "a person"; by recognizing the text content, the read-through portion and the non-read-through portion can be distinguished. Because the read-through portion is spoken with the lips open and with only small floating, splicing in the root video with slightly open lips solves the problem of unnatural jitter and stiffness, while the non-read-through portion is spoken with the lips closing naturally, so the mouth-shape effect of reading each word can be distinguished. Distinguishing the read-through portion from the non-read-through portion thus guides the selection of the root video to be driven, and effectively ensures that the non-read-through portion of the audio file is used when driving the first root video and the read-through portion is used when driving the second root video, so that the synthesized avatar does not exhibit lip jitter or lip-audio desynchronization while speaking.
In some embodiments, the non-read-through portion and the read-through portion may be defined by the pronunciation interval time. Specifically, the pronunciation interval time between adjacent words in the text content of the audio file used to drive generation of the avatar is determined; the audio segments corresponding to the set of all words whose pronunciation interval time is greater than a first preset threshold are taken as the non-read-through portion, and the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold are taken as the read-through portion. Classifying the read-through and non-read-through portions by the pronunciation interval between adjacent words matches the user's speaking habits, so the two portions can be divided accurately. For example, the pronunciation interval between words that are read through is shorter than the pause between words that are not, and the interval can be measured to millisecond precision; by setting a threshold on the pronunciation interval, the read-through portion can be distinguished from the non-read-through portion, as sketched below.
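The sketch below assumes word-level timestamps are available for the driving audio (for example, from forced alignment with its text); the 150 ms threshold and the rule that a word counts as read-through if the gap to either neighbour is within the threshold are illustrative assumptions, not values prescribed by the patent.

```python
# A minimal sketch of splitting driving audio into read-through and non-read-through
# segments by the pronunciation interval between adjacent words.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start_s: float  # word start time in the driving audio
    end_s: float    # word end time in the driving audio

def classify_words(words: List[Word], threshold_s: float = 0.15) -> List[str]:
    """Label a word 'read_through' if the gap to either neighbour is <= threshold."""
    labels = []
    for i, w in enumerate(words):
        gap_prev = w.start_s - words[i - 1].end_s if i > 0 else float("inf")
        gap_next = words[i + 1].start_s - w.end_s if i + 1 < len(words) else float("inf")
        is_read_through = min(gap_prev, gap_next) <= threshold_s
        labels.append("read_through" if is_read_through else "non_read_through")
    return labels

def group_segments(words: List[Word], labels: List[str]) -> List[Tuple[str, float, float]]:
    """Merge consecutive words with the same label into (label, start_s, end_s) segments."""
    segments = []
    seg_start = words[0].start_s
    for i in range(1, len(words)):
        if labels[i] != labels[i - 1]:
            segments.append((labels[i - 1], seg_start, words[i - 1].end_s))
            seg_start = words[i].start_s
    segments.append((labels[-1], seg_start, words[-1].end_s))
    return segments

words = [Word("hello", 0.0, 0.30), Word("I", 0.33, 0.42), Word("am", 0.44, 0.60),
         Word("first", 1.20, 1.55)]
print(group_segments(words, classify_words(words)))
# [('read_through', 0.0, 0.6), ('non_read_through', 1.2, 1.55)]
```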
In some embodiments, the non-read-through portion and the read-through portion may also be defined by the integrity of the audio signal. Specifically, the integrity of the audio signal of each word in the text content of the audio file used to drive generation of the avatar is determined; the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold are taken as the non-read-through portion, and the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold are taken as the read-through portion. For example, if a word is read alone, its sound is relatively direct and long, but if it is read through, its sound is weaker and less direct. Classifying the read-through and non-read-through portions by the integrity of the audio signal of each word allows the read-through case to be recognized by a quantitative technical index, so the two portions can be distinguished accurately.
In some embodiments, the non-read-through portion and the read-through portion may also be defined by combining the pronunciation interval time with the integrity of the audio signal. Specifically, the pronunciation interval time between adjacent words in the text content of the audio file used to drive generation of the avatar and the integrity of the audio signal of each word are determined; the audio segments corresponding to the set of all words whose pronunciation interval time is greater than the first preset threshold and whose audio-signal integrity is greater than the second preset threshold are taken as the non-read-through portion, and the audio segments corresponding to the set of all words whose pronunciation interval time is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold are taken as the read-through portion. Classifying the read-through and non-read-through portions with the double index of pronunciation interval time and audio-signal integrity makes the classification criteria stricter and the classification more accurate. A sketch of this combined criterion follows.
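In the sketch, how the per-word "integrity" score is computed is left open; it is assumed to be supplied by an upstream analysis step, and both thresholds are illustrative values rather than ones prescribed by the patent.

```python
# A minimal sketch of the combined criterion: a word belongs to the non-read-through
# portion only if both its pronunciation interval and its audio-signal "integrity"
# exceed their thresholds, and to the read-through portion only if both are at or
# below them.
from typing import Optional

def combined_label(gap_s: float, integrity: float,
                   gap_threshold_s: float = 0.15,
                   integrity_threshold: float = 0.6) -> Optional[str]:
    if gap_s > gap_threshold_s and integrity > integrity_threshold:
        return "non_read_through"
    if gap_s <= gap_threshold_s and integrity <= integrity_threshold:
        return "read_through"
    return None  # the two indices disagree; handling of this case is not specified

gaps = [0.40, 0.05, 0.03, 0.50]       # per-word pronunciation intervals (illustrative)
integrities = [0.9, 0.3, 0.4, 0.8]    # per-word integrity scores (illustrative)
print([combined_label(g, c) for g, c in zip(gaps, integrities)])
# ['non_read_through', 'read_through', 'read_through', 'non_read_through']
```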
In some embodiments, the first video to be processed and the second video to be processed may be combined with video stitching techniques. Specifically, the first video to be processed and the second video to be processed are video-stitched, and super-resolution processing is performed on the stitched video through a deep learning algorithm to obtain the avatar of the target object. Video stitching can be implemented with a general-purpose algorithm and is not described further here. Super-resolution reconstructs a low-resolution image into a corresponding high-resolution image and can be based on image interpolation or on deep learning, as sketched below.
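The following is a minimal sketch of stitching the two processed videos frame by frame and upscaling the result with OpenCV. The cv2.resize call with bicubic interpolation only stands in for the super-resolution step (the patent uses a deep-learning model there), the file names and 2x scale factor are illustrative, and in practice the two videos would be interleaved segment by segment according to the audio timeline rather than simply played back to back.

```python
# A minimal sketch: concatenate the processed videos and upscale each frame.
# Bicubic resizing is a placeholder for the deep-learning super-resolution step.
import cv2

def stitch_and_upscale(paths, out_path="avatar.mp4", scale=2):
    caps = [cv2.VideoCapture(p) for p in paths]
    fps = caps[0].get(cv2.CAP_PROP_FPS)
    w = int(caps[0].get(cv2.CAP_PROP_FRAME_WIDTH)) * scale
    h = int(caps[0].get(cv2.CAP_PROP_FRAME_HEIGHT)) * scale
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for cap in caps:  # play the slices back to back (interleaving omitted for brevity)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(cv2.resize(frame, (w, h), interpolation=cv2.INTER_CUBIC))
        cap.release()
    writer.release()

stitch_and_upscale(["first_to_process.mp4", "second_to_process.mp4"])  # hypothetical file names
```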
An interpolation algorithm enlarges an image by inserting new pixels around the original pixels and assigning values to them, thereby restoring the image content and raising the image resolution. Common image interpolation is either linear or nonlinear. Common linear interpolation includes nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation. Common nonlinear interpolation includes interpolation based on edge information, interpolation based on wavelet coefficients and interpolation based on deep learning. Interpolation based on edge information interpolates non-edge pixels with a non-directional linear method and edge pixels with a directional method, which protects the edges and makes them smoother. Interpolation based on wavelet coefficients separates the high-frequency information of the image from the low-frequency information and processes the high-frequency information independently; if the high-frequency details of the image can be obtained accurately, the recovered high frequency is superimposed on the original low frequency according to reconstruction theory, and the high-resolution image is obtained by an inverse discrete wavelet transform. Interpolation based on deep learning can restore a low-resolution image to clear textures and performs better than traditional algorithms, especially at high upsampling rates, where traditional algorithms cannot reconstruct the corresponding high-definition image well; the advantage of deep-learning-based algorithms is obvious there and a better restoration effect is obtained. In general, super-resolution by default refers to interpolation based on deep learning. Deep-learning super-resolution mainly restores the image using prior knowledge of high-resolution images and the high-frequency information present in aliased form and, for video, also uses complementary information between adjacent frames. This knowledge is learned in advance by training a deep neural network, and the trained deep neural network is the super-resolution model.
A real low-resolution image is input into the trained super-resolution model, and the high-frequency details of the image are reconstructed using the prior knowledge acquired by the model, giving a better image restoration effect. The deep-learning-based algorithm may include the following steps: initialize the model weights with random numbers at the start of training; input the low-definition images in the training data into the model and perform a forward pass to obtain the reconstructed high-definition images; compare the difference between the reconstructed high-definition images and the real high-definition images, measured with a loss function; and minimize the loss function with gradient descent so that the reconstructed images are as close as possible to the real ones, updating the model weights by gradient back-propagation.
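The sketch below walks through those training steps with PyTorch. The small SRCNN-style network and the 2x upscaling are illustrative assumptions; any super-resolution model could be trained with the same loop.

```python
# A minimal sketch of the training steps listed above for a super-resolution network.
import torch
import torch.nn as nn

class TinySR(nn.Module):
    def __init__(self, scale: int = 2):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=scale, mode="bicubic", align_corners=False)
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 3, 5, padding=2),
        )

    def forward(self, lr):
        return self.body(self.upsample(lr))   # reconstructed high-definition image

model = TinySR()                               # weights initialized randomly
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()                          # measures the reconstruction error

# One training step on a random (placeholder) low-/high-resolution pair.
lr_batch = torch.randn(4, 3, 64, 64)
hr_batch = torch.randn(4, 3, 128, 128)
sr_batch = model(lr_batch)                     # forward pass
loss = loss_fn(sr_batch, hr_batch)             # compare with the real high-definition image
optimizer.zero_grad()
loss.backward()                                # gradient back-propagation
optimizer.step()                               # gradient-descent update of the weights
```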
In some embodiments, the acquired root videos, with lips fully closed and slightly open respectively, may also be quality controlled: the lip position of the target object in the first root video is compared with a first preset lip position, and if they match, the first root video is determined to meet the requirement; the lip position of the target object in the second root video is compared with a second preset lip position, and if they match, the second root video is determined to meet the requirement.
In this way, the quality of the recorded root videos is controlled by comparison against preset standard lip positions, which safeguards the effect of subsequently driving the root videos to generate the avatar and avoids pronunciation that is out of sync with the lips or lip shapes that are off-standard.
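A hedged sketch of such a lip-position check is given below; the landmark arrays are assumed to come from some face-landmark detector (the application does not name one), and the pixel tolerance is an illustrative assumption.

```python
# Hedged sketch of the root-video quality check: compare detected lip landmarks
# against a preset reference lip position; pass only if every sampled frame matches.
import numpy as np

def lips_match(frame_landmarks: np.ndarray,
               preset_landmarks: np.ndarray,
               tolerance: float = 5.0) -> bool:
    """True if the mean per-landmark distance is within tolerance (pixels)."""
    diff = np.linalg.norm(frame_landmarks - preset_landmarks, axis=1)
    return float(diff.mean()) <= tolerance

def root_video_ok(lip_landmark_sequence, preset_landmarks, tolerance=5.0) -> bool:
    # lip_landmark_sequence: one landmark array per sampled frame of the root video.
    return all(lips_match(lms, preset_landmarks, tolerance)
               for lms in lip_landmark_sequence)
```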
The video processing method in the embodiment of the present application has been described above; a video processing apparatus and an electronic device for executing the video processing method are described below.
Referring to fig. 4, which shows a schematic structural diagram of a video processing apparatus 40, the video processing apparatus 40 in the embodiment of the present application can implement the steps of the video processing method performed by the video processing apparatus 40 in the embodiment corresponding to fig. 2. The functions implemented by the video processing apparatus 40 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware. The video processing apparatus 40 includes:
an obtaining module 401, configured to obtain a first root video and a second root video recorded in a first scene for a target object, and an audio file for driving generation of an avatar; wherein the lip shape of the target object in the first root video keeps a completely closed state, and the lip shape of the target object in the second root video keeps an open state with a first amplitude;
a processing module 402, configured to drive the first root video with the first portion of the audio file for driving generation of the avatar to obtain a first video to be processed, and to drive the second root video with the second portion of the audio file to obtain a second video to be processed; wherein the first portion of the audio file comprises a non-read-through portion of the audio file and the second portion of the audio file comprises a read-through portion of the audio file;
the processing module 402 is further configured to synthesize the first video to be processed and the second video to be processed to obtain an avatar of the target object.
In some embodiments, the processing module 402 is further configured to:
determining the non-read-through portion and the read-through portion according to the text content of the audio file for driving generation of the avatar;
and taking the audio segment corresponding to the non-continuous reading part as a first part of the audio file, and taking the audio segment corresponding to the continuous reading part as a second part of the audio file.
In some embodiments, the processing module 402 is specifically configured to:
determining the pronunciation interval between adjacent words in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold as the continuous reading part.
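One possible reading of this step is sketched below under assumptions: word-level start/end times are assumed to be available (for example from forced alignment or the TTS engine), and the 0.1 s threshold and data layout are illustrative only.

```python
# Hedged sketch: split the driving audio by pronunciation interval between adjacent words.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float     # seconds

def split_by_interval(words: List[Word], threshold: float = 0.1
                      ) -> Tuple[List[Tuple[float, float]], List[Tuple[float, float]]]:
    """Return (non_read_through_spans, read_through_spans) as (start, end) pairs."""
    non_read_through, read_through = [], []
    for prev, cur in zip(words, words[1:]):
        gap = cur.start - prev.end               # pronunciation interval between adjacent words
        span = (prev.start, cur.end)
        if gap > threshold:
            non_read_through.append(span)        # clearly separated words
        else:
            read_through.append(span)            # words run together (continuous reading)
    return non_read_through, read_through
```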
In some embodiments, the processing module 402 is specifically configured to:
determining the integrity of the audio signal of each word in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
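The application does not define how the integrity of a word's audio signal is measured; the sketch below adopts one hedged interpretation, the ratio of the word's aligned duration to an assumed full-pronunciation duration, purely for illustration.

```python
# Hedged sketch: bucket words by an assumed "integrity" score (aligned duration
# divided by an expected full-pronunciation duration from a hypothetical lexicon).
from typing import Dict, List, Tuple

def split_by_integrity(words: List[Tuple[str, float, float]],   # (text, start, end) in seconds
                       expected_durations: Dict[str, float],    # assumed lexicon of full durations
                       threshold: float = 0.8
                       ) -> Tuple[List[Tuple[float, float]], List[Tuple[float, float]]]:
    non_read_through, read_through = [], []
    for text, start, end in words:
        expected = expected_durations.get(text, end - start)
        integrity = (end - start) / expected if expected > 0 else 1.0
        if integrity > threshold:
            non_read_through.append((start, end))   # word pronounced essentially in full
        else:
            read_through.append((start, end))       # word truncated or merged with its neighbor
    return non_read_through, read_through
```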
In some embodiments, the processing module 402 is specifically configured to:
determining the pronunciation interval between adjacent words and the integrity of the audio signal of each word in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold and whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
In some embodiments, the processing module 402 is specifically configured to:
performing video stitching on the first video to be processed and the second video to be processed, and performing super-resolution processing on the stitched video through a deep learning algorithm, so as to obtain the avatar of the target object.
In some embodiments, the processing module 402 is further configured to:
comparing the lip position of the target object in the first root video with a first preset lip position, and if they match, determining that the first root video meets the requirement;
and comparing the lip position of the target object in the second root video with a second preset lip position, and if they match, determining that the second root video meets the requirement.
In the scheme provided by the embodiment of the application, the lip shape of the target object in the first root video keeps a completely closed state, so driving the first root video with the non-continuous reading part of the audio file makes the lip shape of the avatar in the resulting first video to be processed synchronous and continuous with the emitted voice under non-continuous reading. Likewise, driving the second root video, in which the lip shape keeps an open state of the first amplitude, with the continuous reading part of the audio file makes the lip shape of the avatar in the resulting second video to be processed synchronous and continuous with the emitted voice under continuous reading. Therefore, after the first video to be processed and the second video to be processed are synthesized into the avatar of the target object, the lips remain slightly open wherever continuous reading occurs instead of flipping back and forth between the closed and open states within a run of continuously read words; the phenomenon of lip shake and of lips out of sync with the audio is thus avoided, and the visual effect of the avatar is improved.
The video processing apparatus 40 for performing the video processing method in the embodiment of the present application has been described above from the viewpoint of a modularized functional entity; it is described below from the viewpoint of hardware processing. It should be noted that, in the embodiment of the present application shown in fig. 4, the physical device corresponding to the obtaining module 401 may be an input/output unit, a transceiver, a radio frequency circuit, a communication module, an output interface, etc., and the physical device corresponding to the processing module 402 may be a processor. The video processing apparatus 40 shown in fig. 4 may have the electronic device structure shown in fig. 5; in that case, the processor and the input/output unit in fig. 5 can realize functions the same as or similar to those of the processing module 402 provided in the foregoing apparatus embodiment of the video processing apparatus 40, and the memory in fig. 5 stores the computer program that the processor calls when executing the video processing method described above.
The embodiment of the present application further provides another video processing apparatus, as shown in fig. 6. For convenience of explanation, only the portions related to the embodiment of the present application are shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present application. The video processing device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer and the like; the mobile phone is taken as an example:
Fig. 6 is a block diagram showing part of the structure of a mobile phone related to the video processing apparatus provided by an embodiment of the present application. Referring to fig. 6, the mobile phone includes: a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless-fidelity (Wi-Fi) module 670, a processor 680, and a power supply 690. Those skilled in the art will appreciate that the handset configuration shown in fig. 6 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 6:
the RF circuit 610 may be configured to receive and transmit signals during messaging or a call; in particular, it receives downlink information from a base station and hands it to the processor 680 for processing, and sends uplink data to the base station. Generally, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, the short message service (SMS), and the like.
The memory 620 may be used to store software programs and modules, and the processor 680 may perform various functional applications and data processing of the cellular phone by executing the software programs and modules stored in the memory 620. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 630 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 631 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates and sends them to the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented as a resistive, capacitive, infrared, or surface-acoustic-wave type. Besides the touch panel 631, the input unit 630 may include other input devices 632, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a switch key), a trackball, a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by the user or provided to the user and various menus of the mobile phone. The display unit 640 may include a display panel 641; optionally, the display panel 641 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 631 may cover the display panel 641; when the touch panel 631 detects a touch operation on or near it, it transfers the operation to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 6 the touch panel 631 and the display panel 641 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement those functions.
The handset may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 641 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 641 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used for applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, magnetometer posture calibration) and vibration-recognition functions (such as a pedometer or tap detection). Other sensors such as a gyroscope, barometer, hygrometer, thermometer and infrared sensor that may also be configured in the handset are not described in detail herein.
The audio circuit 660, a speaker 661 and a microphone 662 may provide an audio interface between the user and the handset. The audio circuit 660 may transmit the electrical signal converted from received audio data to the speaker 661, which converts it into a sound signal for output; conversely, the microphone 662 converts collected sound signals into electrical signals, which the audio circuit 660 receives and converts into audio data; the audio data is output to the processor 680 for processing and then sent, for example, to another mobile phone via the RF circuit 610, or output to the memory 620 for further processing.
Wi-Fi is a short-distance wireless transmission technology. Through the Wi-Fi module 670 the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media and the like, providing wireless broadband Internet access. Although fig. 6 shows the Wi-Fi module 670, it is understood that it is not an essential part of the mobile phone and can be omitted as needed without changing the essence of the application.
Processor 680 is a control center of the handset, connects various parts of the entire handset using various interfaces and lines, and performs various functions and processes of the handset by running or executing software programs and/or modules stored in memory 620, and invoking data stored in memory 620, thereby performing overall monitoring of the handset. Optionally, processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 680.
The handset further includes a power supply 690 (e.g., a battery) for powering the various components, which may be logically connected to the processor 680 through a power management system so as to perform functions such as managing charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In the embodiment of the present application, the processor 680 included in the mobile phone also has the function of controlling execution of the above method performed by the video processing device 40 shown in fig. 4. The steps performed by the video processing device 40 in the above embodiment may be based on the structure of the mobile phone shown in fig. 6. For example, the processor 680 performs the following operations by calling instructions in the memory 620:
acquiring, through the input unit 630, a first root video and a second root video recorded in a first scene for a target object, and an audio file for driving generation of an avatar; wherein the lip shape of the target object in the first root video keeps a completely closed state, and the lip shape of the target object in the second root video keeps an open state with a first amplitude;
driving, by the processor 680, the first root video with the first portion of the audio file for driving generation of the avatar to obtain a first video to be processed, and driving the second root video with the second portion of the audio file to obtain a second video to be processed; wherein the first portion of the audio file comprises the non-read-through portion of the audio file and the second portion of the audio file comprises the read-through portion of the audio file;
The first video to be processed and the second video to be processed are synthesized by the processor 680 to obtain the avatar of the target object.
The embodiment of the present application further provides another video processing apparatus for implementing the video processing method. As shown in fig. 7, which is a schematic diagram of a server structure provided in the embodiment of the present application, the server 1020 may vary considerably in configuration or performance and may include one or more central processing units (CPU) 1022 (for example, one or more processors), a memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transitory or persistent. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 to execute, on the server 1020, the series of instruction operations in the storage medium 1030.
The server 1020 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the server in the above embodiments may be based on the structure of the server 1020 shown in fig. 7. The steps performed by the video processing apparatus 40 shown in fig. 4 in the above-described embodiment, for example, may be based on the server structure shown in fig. 7. For example, the processor 1022 may perform the following operations by invoking instructions in the memory 1032:
acquiring, through the input/output interface 1058, a first root video and a second root video recorded in a first scene for a target object, and an audio file for driving generation of an avatar; wherein the lip shape of the target object in the first root video keeps a completely closed state, and the lip shape of the target object in the second root video keeps an open state with a first amplitude;
driving, by the processor 1022, the first root video with the first portion of the audio file for driving generation of the avatar to obtain a first video to be processed, and driving the second root video with the second portion of the audio file to obtain a second video to be processed; wherein the first portion of the audio file comprises the non-read-through portion of the audio file and the second portion of the audio file comprises the read-through portion of the audio file;
The processor 1022 synthesizes the first video to be processed and the second video to be processed to obtain the avatar of the target object.
Embodiments of the present application also provide a computer-readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the video processing method of the above embodiments.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and modules described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program is loaded and executed on a computer, the flow or functions according to the embodiments of the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), or the like.
The technical solutions provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the embodiments of the present application, and the above description of the embodiments is only intended to help understand the methods and core ideas of the embodiments. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the ideas of the embodiments of the present application. In summary, the content of this specification should not be construed as limiting the embodiments of the present application.

Claims (16)

1. A video processing method, comprising:
acquiring a first root video and a second root video recorded in a first scene for a target object, wherein the lip shape of the target object in the first root video keeps a completely closed state, and the lip shape of the target object in the second root video keeps an open state with a first amplitude;
acquiring an audio file for driving generation of a virtual image;
driving the first root video with the first part of the audio file for driving generation of the virtual image to obtain a first video to be processed, and driving the second root video with the second part of the audio file for driving generation of the virtual image to obtain a second video to be processed; wherein the first part of the audio file is a non-read-through part of the audio file, and the second part of the audio file is a read-through part of the audio file;
And synthesizing the first video to be processed and the second video to be processed to obtain the virtual image of the target object.
2. The video processing method of claim 1, wherein prior to driving the first root video with the first part of the audio file for driving generation of the virtual image, the method further comprises:
determining the non-read-through part and the read-through part according to the text content of the audio file for driving generation of the virtual image;
and taking the audio segment corresponding to the non-continuous reading part as a first part of the audio file, and taking the audio segment corresponding to the continuous reading part as a second part of the audio file.
3. The video processing method according to claim 2, wherein the determining the non-read-through part and the read-through part according to the text content of the audio file for driving generation of the virtual image comprises:
determining the pronunciation interval between adjacent words in the text content of the audio file for driving generation of the virtual image, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold as the continuous reading part.
4. The video processing method according to claim 2, wherein the determining the non-read-through part and the read-through part according to the text content of the audio file for driving generation of the virtual image comprises:
determining the integrity of the audio signal of each word in the text content of the audio file for driving generation of the virtual image, taking the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
5. The video processing method according to claim 2, wherein the determining the non-read-through part and the read-through part according to the text content of the audio file for driving generation of the virtual image comprises:
determining the pronunciation interval between adjacent words and the integrity of the audio signal of each word in the text content of the audio file for driving generation of the virtual image, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold and whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
6. The video processing method according to any one of claims 1 to 5, characterized in that the synthesizing the first video to be processed and the second video to be processed includes:
performing video stitching on the first video to be processed and the second video to be processed, and performing super-resolution processing on the stitched video through a deep learning algorithm, so as to obtain the virtual image of the target object.
7. The video processing method according to any one of claims 1 to 5, wherein before the acquiring the first root video and the second root video recorded in the first scene for the target object, the method further comprises:
comparing the lip position of the target object in the first root video with a first preset lip position, and if they match, determining that the first root video meets the requirement;
and comparing the lip position of the target object in the second root video with a second preset lip position, and if they match, determining that the second root video meets the requirement.
8. A video processing apparatus, comprising:
an acquisition module, configured to acquire a first root video and a second root video recorded in a first scene for a target object, and an audio file for driving generation of an avatar; wherein the lip shape of the target object in the first root video keeps a completely closed state, and the lip shape of the target object in the second root video keeps an open state with a first amplitude;
a processing module, configured to drive the first root video with the first part of the audio file for driving generation of the avatar to obtain a first video to be processed, and to drive the second root video with the second part of the audio file for driving generation of the avatar to obtain a second video to be processed; wherein the first part of the audio file is a non-read-through part of the audio file, and the second part of the audio file is a read-through part of the audio file;
the processing module is further configured to synthesize the first video to be processed and the second video to be processed, so as to obtain an avatar of the target object.
9. The video processing device of claim 8, wherein the processing module is further configured to:
determining the non-read-through portion and the read-through portion according to the text content of the audio file for driving generation of the avatar;
and taking the audio segment corresponding to the non-continuous reading part as a first part of the audio file, and taking the audio segment corresponding to the continuous reading part as a second part of the audio file.
10. The video processing device according to claim 9, wherein the processing module is specifically configured to:
determining the pronunciation interval between adjacent words in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold as the continuous reading part.
11. The video processing device according to claim 9, wherein the processing module is specifically configured to:
determining the integrity of the audio signal of each word in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
12. The video processing device according to claim 9, wherein the processing module is specifically configured to:
determining the pronunciation interval between adjacent words and the integrity of the audio signal of each word in the text content of the audio file for driving generation of the avatar, taking the audio segments corresponding to the set of all words whose pronunciation interval is greater than a first preset threshold and whose audio-signal integrity is greater than a second preset threshold as the non-continuous reading part, and taking the audio segments corresponding to the set of all words whose pronunciation interval is less than or equal to the first preset threshold and whose audio-signal integrity is less than or equal to the second preset threshold as the continuous reading part.
13. The video processing apparatus according to any one of claims 8-12, wherein the processing module is specifically configured to:
performing video stitching on the first video to be processed and the second video to be processed, and performing super-resolution processing on the stitched video through a deep learning algorithm, so as to obtain the virtual image of the target object.
14. The video processing apparatus according to any one of claims 8-12, wherein the processing module is further configured to:
comparing the lip position of the target object in the first root video with a first preset lip position, and if they match, determining that the first root video meets the requirement;
and comparing the lip position of the target object in the second root video with a second preset lip position, and if they match, determining that the second root video meets the requirement.
15. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the steps in the video processing method of any of claims 1-7.
16. A computer readable storage medium having stored thereon executable code which when executed by a processor of an electronic device causes the processor to perform the video processing method of any of claims 1-7.
CN202211580044.7A 2022-12-09 2022-12-09 Video processing method, device and storage medium Active CN116248811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211580044.7A CN116248811B (en) 2022-12-09 2022-12-09 Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211580044.7A CN116248811B (en) 2022-12-09 2022-12-09 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116248811A CN116248811A (en) 2023-06-09
CN116248811B true CN116248811B (en) 2023-12-05

Family

ID=86630317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211580044.7A Active CN116248811B (en) 2022-12-09 2022-12-09 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116248811B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843805B (en) * 2023-06-19 2024-03-19 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019168834A1 (en) * 2018-02-28 2019-09-06 Apple Inc. Voice effects based on facial expressions
JP2020161121A (en) * 2019-03-27 2020-10-01 ダイコク電機株式会社 Video output system
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN114866807A (en) * 2022-05-12 2022-08-05 平安科技(深圳)有限公司 Avatar video generation method and device, electronic equipment and readable storage medium
CN114900733A (en) * 2022-04-28 2022-08-12 北京瑞莱智慧科技有限公司 Video generation method, related device and storage medium
CN115423908A (en) * 2022-08-19 2022-12-02 深圳市达旦数生科技有限公司 Virtual face generation method, device, equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100279822A1 (en) * 2008-11-01 2010-11-04 Ford John Hajime Systems and methods for optimizing one or more audio tracks to a video stream

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019168834A1 (en) * 2018-02-28 2019-09-06 Apple Inc. Voice effects based on facial expressions
CN112512649A (en) * 2018-07-11 2021-03-16 苹果公司 Techniques for providing audio and video effects
JP2020161121A (en) * 2019-03-27 2020-10-01 ダイコク電機株式会社 Video output system
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN114900733A (en) * 2022-04-28 2022-08-12 北京瑞莱智慧科技有限公司 Video generation method, related device and storage medium
CN114866807A (en) * 2022-05-12 2022-08-05 平安科技(深圳)有限公司 Avatar video generation method and device, electronic equipment and readable storage medium
CN115423908A (en) * 2022-08-19 2022-12-02 深圳市达旦数生科技有限公司 Virtual face generation method, device, equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Application Advantages of a Virtual Teacher Model Based on Affective Computing; 朱珂; 张思妍; 刘濛雨; Modern Educational Technology (06); full text *
朱珂; 张思妍; 刘濛雨. Design and Application Advantages of a Virtual Teacher Model Based on Affective Computing. Modern Educational Technology. 2020, (06), full text. *

Also Published As

Publication number Publication date
CN116248811A (en) 2023-06-09


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant