CN109819313B - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium

Info

Publication number
CN109819313B
CN109819313B
Authority
CN
China
Prior art keywords
video
face
image
target
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910023976.3A
Other languages
Chinese (zh)
Other versions
CN109819313A (en)
Inventor
田元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority claimed from CN201910023976.3A
Publication of CN109819313A
Application granted
Publication of CN109819313B
Legal status: Active

Landscapes

  • Television Signal Processing For Recording (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The embodiments of the present application disclose a video processing method, apparatus, and storage medium. The video processing method includes the following steps: acquiring dubbing audio data input by a user; obtaining multiple frames of video images from a video file; determining an initial video image containing a target face from the multiple frames of video images, and fusing the target face in the initial video image with a selected face image to obtain a target video image; and synthesizing the dubbing audio data with at least the target video image to obtain an audio and video synthetic file. This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.

Description

Video processing method, device and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a video processing method, apparatus, and storage medium.
Background
With the development of the internet and mobile communication networks, and with the rapid growth in the processing and storage capabilities of terminals, a great number of application programs have spread into wide use, video applications in particular.
Video generally refers to the set of techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images as electrical signals. When the image sequence changes at more than a certain number of frames per second, the human eye can no longer distinguish the individual still images and instead perceives a smooth, continuous visual effect; such a continuous image sequence is called a video. Advances in networking technology also allow recorded video segments to be streamed over the internet and received and played by computers. In the related art, users may also be allowed to edit, recombine, and convert the format of video material.
Disclosure of Invention
The embodiments of the present application provide a video processing method, a video processing apparatus, and a storage medium, which can increase how deeply a user is involved in video production and how strongly the video is personalized.
The embodiment of the application provides a video processing method, which comprises the following steps:
acquiring dubbing audio data input by a user;
obtaining a plurality of frames of video images from a video file;
determining an initial video image containing a target face from the multi-frame video image, and fusing the target face in the initial video image with the selected face image to obtain a target video image;
and synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthetic file.
Correspondingly, an embodiment of the present application further provides a video processing apparatus, including:
the audio acquisition unit is used for acquiring dubbing audio data input by a user;
the image acquisition unit is used for acquiring a plurality of frames of video images from the video file;
the processing unit is used for determining an initial video image containing a target face from the multi-frame video images and fusing the target face in the initial video image with the selected face image to obtain a target video image;
and the synthesis unit is used for synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthesis file.
Accordingly, the present application further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform the steps in the video processing method as described above.
In the process of playing a video file, dubbing audio data input by the user is first acquired, and multiple frames of video images are obtained from the video file. An initial video image containing a target face is then determined from the multiple frames of video images, and the target face in the initial video image is fused with the selected face image to obtain a target video image. Finally, the dubbing audio data and at least the target video image are synthesized to obtain an audio and video synthetic file. This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an architectural diagram of a video processing method according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application.
Fig. 3 is another schematic flow chart of a video processing method according to an embodiment of the present application.
Fig. 4 is a schematic view of an application scenario of the video processing method according to the embodiment of the present application.
Fig. 5 is a schematic block diagram of another architecture of a video processing method according to an embodiment of the present application.
Fig. 6a to 6e are schematic interface interaction diagrams of a video processing method according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of another video processing method according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The embodiment of the application provides a video processing method, a video processing device and a storage medium.
The video processing apparatus may be integrated in a terminal that has a storage unit, a microprocessor, and computing capability, such as a tablet PC (personal computer) or a mobile phone. For example, taking a mobile phone in which the video processing apparatus is integrated, and referring to fig. 1: while a video file is playing, the mobile phone acquires dubbing audio data input by the user and captures the played video at a predetermined frame rate time interval, decimating the video into individual image frames. A target face to be processed is then determined from the multiple frames of video images and fused with the face image selected by the user to obtain a processed target video image (i.e., a face fusion image). The processed target video images are then encoded to obtain a video code stream, and the audio data is encoded to obtain an audio code stream. Finally, the video code stream and the audio code stream are synthesized and output to obtain an audio and video synthetic file.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
An embodiment of the present application provides a video processing method, including: acquiring dubbing audio data input by a user; obtaining a plurality of frames of video images from a video file; determining an initial video image containing a target face from the multi-frame video image, and fusing the target face in the initial video image with the selected face image to obtain a target video image; and synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthetic file.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video processing method according to an embodiment of the present disclosure. The specific flow of the video processing method can be as follows:
101. dubbing audio data input by a user is acquired.
Specifically, the dubbing audio data input by the user may be acquired while the video file is playing, or it may have been recorded by the user in advance. For example, the dubbing audio data may be voice information recorded by the user in real time through the microphone or receiver of the terminal while the video file is playing. The video file may be a fully muted video file (i.e., the video file carries no audio data), a partially muted video file (i.e., only part of the audio data is retained in the video file), or an unmuted video file.
In this embodiment, the dubbing audio data may include user audio data, original audio data, and background audio data. The user audio data is the user's own voice recorded for a specific movie or television character, or a voice-over recorded by the user for the video content; the original audio data is the original sound of the video characters that are not the designated dubbing target; the background audio data is the movie or television background sound. For example, if the video file includes a character A and a character B, then when the video file is played, the dubbing audio data may retain the original sound of character A while the dubbing user records the lines of the specific character B that needs dubbing; in addition, the dubbing audio data may further include background sound, such as background music and background sound effects.
102. Obtaining a plurality of frames of video images from the video file.
In the embodiment of the application, the video file comprises at least one movie and television character with a face image. By performing frame extraction processing on the video file, a plurality of frames of video images can be obtained from the video file.
103. Determining an initial video image containing a target face from the multi-frame video images, and fusing the target face in the initial video image with the selected face image to obtain a target video image.
The target face in the initial video image may be the face of the specific movie or television character that the user wants to dub, and the selected face image may be a face in a photo chosen by the user from an album or a face captured directly with the camera.
Face image fusion means replacing or covering the target face with the selected face image, or producing a deformed face based on the features of both the target face and the selected face image. In a specific implementation, the target face in the initial video image may first be detected to obtain its integrity information, orientation information, expression information, and so on; for example, whether the target face is occluded, whether it faces the lens frontally or from the side, and whether it is laughing or crying. After this information is acquired, the selected face image is processed accordingly in combination with it. For example, when the target face is occluded, the selected face image is given matching occlusion processing; when the target face is a side face, a corresponding side-face view is derived from the selected face image; and when the target face is crying, the selected face image is processed into a crying expression. In this way the selected face image blends more naturally into the video image, yielding a more natural target video image.
The selected face image may be a face image captured with the mobile phone's camera or a local face image already stored on the phone. In practice, the selected face image may be the face image of the dubbing user, while the target face contained in the initial video image is the face of one or more movie and television characters appearing in the captured video frames. In practice, the dubbing user's face image and the target face in the initial video image are fused to obtain a face fusion image that carries both the dubbing user's facial features and the target face's features. The face fusion image then replaces the target face in the initial video image, yielding the processed target video image.
In some embodiments, before obtaining the plurality of frames of video images from the video file, the following process may be further included:
analyzing the video file, and extracting at least one face image from the video file;
and receiving a sample selection instruction of a user, and selecting a human face image from the at least one human face image based on the sample selection instruction.
Specifically, the terminal can analyze the video file, recognize the face images of all video characters (only those characters that have a face) appearing in the video material, and match each character to a person's identity. The recognized characters are then presented on a display interface for the user to choose from, and the user finally selects one or more characters whose faces are to be replaced.
It is understood that some of the captured video frames may not contain the target face at all. To improve the efficiency of face fusion, the video images containing the target face can therefore be screened out of the multiple frames first, so that the face fusion operation is subsequently performed only on the screened images. The step of "determining an initial video image containing a target face from the multiple frames of video images" may then include the following steps:
capturing a human face in each frame of a plurality of frames of video images;
judging whether the video image contains a target face matched with the sample face image;
and if so, taking the video image as an initial video image.
Specifically, a face image of a movie role selected by a user is matched with a face in each captured frame of video image, so that a video image needing face replacement is screened out from a plurality of frames of video images.
In a specific implementation, a deep residual network (ResNet) can be built to detect face positions in a single video frame, so as to find all the faces in that frame. Face key point detection is then performed, and the person's identity is matched on the basis of the face key point positions. For example, determining the target face image from the multiple frames of video images may include the following steps:
(11) constructing a 29-layer ResNet network;
(12) extracting face features based on a histogram of oriented gradients;
(13) training the network on 300 million pictures to complete network training;
(14) computing the face key point features of the movie and television characters detected in the multiple frames of video images;
(15) retrieving the detected face images from a database;
(16) returning the matched face identity.
Generally, the more layers the ResNet network has, the higher the recognition accuracy. The number of layers of the ResNet network in this embodiment can be set according to the actual situation and is not limited to the 29 layers mentioned above.
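For illustration only, the detection-and-retrieval flow of steps (11) to (16) can be sketched with the open-source dlib library, whose bundled face recognition network is itself a ResNet of similar depth; the model file names, the 0.6 distance threshold, and the in-memory "database" are assumptions of the sketch rather than part of the disclosed method.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()                        # find all face positions
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

# "Database" of known character faces: name -> 128-dimensional descriptor.
known_faces = {}  # e.g. {"character_a": np.array([...]), ...}

def identify_faces(frame_rgb, threshold=0.6):
    """Detect faces in one video frame and return (rectangle, matched identity) pairs."""
    matches = []
    for rect in detector(frame_rgb, 1):
        shape = predictor(frame_rgb, rect)                          # face key points
        descriptor = np.array(facerec.compute_face_descriptor(frame_rgb, shape))
        # Retrieve the closest known face; None means no character matched.
        best_name, best_dist = None, threshold
        for name, known in known_faces.items():
            dist = np.linalg.norm(descriptor - known)
            if dist < best_dist:
                best_name, best_dist = name, dist
        matches.append((rect, best_name))
    return matches
```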
104. And synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthetic file.
Specifically, the synthesizing process refers to superimposing and encoding the dubbing audio data and at least the target video image, so as to obtain the synthesized audio and video synthetic file.
In some embodiments, the playing duration of the resulting audio and video synthetic file may be equal to the duration of the original video file; that is, the synthetic file contains both the target video images that include a face and the other images that do not, and encoding the dubbing audio data together with these images yields a synthetic file with a richer storyline.
In other embodiments, the playing duration of the resulting audio and video synthetic file may be shorter than the original video file; that is, the synthetic file may contain only the target video images that include a face, so that superimposing and encoding the dubbing audio data with the target video images produces a dubbed-clip effect for the specific video character.
In the video processing method provided by this embodiment, dubbing audio data input by the user is acquired; multiple frames of video images are obtained from a video file; an initial video image containing a target face is determined from the multiple frames of video images, and the target face in the initial video image is fused with the selected face image to obtain a target video image; and the dubbing audio data is synthesized with at least the target video image to obtain an audio and video synthetic file. This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.
On the basis of the above-described embodiments, some steps will be described in further detail below.
Referring to fig. 3, in practical applications, the obtained dubbing audio and the processed video image need to be re-encoded to be outputted as a synthesized audio/video file. Meanwhile, in conjunction with fig. 1, in some embodiments, the step "performing a synthesizing process on dubbing audio data and at least a target video image" may include the following processes:
1041. updating the multi-frame video image based on the target video image;
1042. coding the updated multi-frame video image to obtain a video code stream;
1043. coding dubbing audio data to obtain an audio code stream;
1044. and synthesizing the video code stream and the audio code stream and outputting.
In this embodiment, the dubbing audio data, the target video image after face replacement, and the video image that is not screened out from the multi-frame video image are output together as a dubbing video file, so as to obtain a complete dubbing video.
The processed video images can be encoded in various ways, in any format supported by the product system. For example, the processed video images may be encoded into video formats such as .mpg, .mpeg, .mp4, .rmvb, .wmv, .asf, .avi, .asx, etc., to form a video code stream, so that the processed multiple frames of video images are packaged into a video file.
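As a hedged illustration of this step, the frames could be packaged into a video stream with OpenCV's VideoWriter; the .mp4 container, "mp4v" codec, and 20 frames/second rate are example choices for the sketch, not requirements of the method.

```python
import cv2

def encode_video(frames_bgr, out_path="dubbed_video.mp4", fps=20):
    """Encode the updated (face-fused) frames into a single video file."""
    height, width = frames_bgr[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (width, height))
    for frame in frames_bgr:
        writer.write(frame)        # frames already carry the fused faces
    writer.release()
    return out_path
```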
In practical application, the playing time of the video code stream can be controlled based on different coding modes. Preferably, the playing time can be controlled within 15 seconds.
Similarly, the dubbing audio data may be encoded in various ways, as long as the format is supported by the product system. For example, the dubbing audio data input by the user can be encoded into audio formats such as .act, .mp3, .wma, .wav, etc., to form an audio code stream, which is then packaged into an audio file matching the video file.
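As one minimal example of producing such an audio stream, the recorded dubbing samples could be written out with Python's built-in wave module; the .wav container, mono channel layout, and 16-bit sample width are assumptions of the sketch.

```python
import wave

def encode_audio(pcm_bytes, out_path="dubbing.wav", sample_rate=44100):
    """Package raw PCM dubbing samples into a .wav audio file."""
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)            # mono dubbing track
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # e.g. the rate computed for synchronization
        wav.writeframes(pcm_bytes)
    return out_path
```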
In some embodiments, time points of frames or sampling points of the video code stream and the audio code stream can be respectively calculated, and the encoded video code stream and the encoded audio code stream are synchronously played and output through the audio and video synthesis system, so that an audio and video synthesis file is obtained. That is, in some embodiments, the step "get multiple frames of video images from video file" may include the following process:
and capturing video images from the video file according to a preset frame rate time interval to obtain a plurality of frames of video images.
The predetermined frame rate time interval may be set by a product manufacturer or a person skilled in the art when the video is decimated into an image. For example, the frame rate may be 20 frames/second, 50 frames/second, etc., with corresponding frame rate time intervals of 50 milliseconds, 20 milliseconds, etc.
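A minimal sketch of capturing frames at such a predetermined interval with OpenCV is shown below; the 50-millisecond default corresponds to the 20 frames/second example above and is an assumption of the sketch.

```python
import cv2

def extract_frames(video_path, interval_ms=50):
    """Capture one frame every interval_ms milliseconds from the video file."""
    cap = cv2.VideoCapture(video_path)
    frames, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t)   # seek to the next capture instant
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += interval_ms
    cap.release()
    return frames
```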
Then, the step "encode dubbing audio data to obtain an audio code stream" may include the following steps:
acquiring the number of video image frames captured in a target time period, wherein the target time period is the time from the starting time to the ending time of dubbing audio data input;
determining the total playing time length of the video code stream;
calculating the target playing time length of the dubbing audio data according to the frame number and the total playing time length;
and determining a sampling frequency based on the target playing time length and the time length corresponding to the target time period, and coding the dubbing audio data based on the sampling frequency to obtain an audio code stream.
It should be noted that in this embodiment the dubbing user may input several segments of audio data while the video file is playing, and the span from the start time to the end time corresponds to one segment of audio input by the dubbing user.
Specifically, from the number of frames captured in the target time period and the total playing duration of the encoded video code stream, the time it takes to play back that many encoded frames can be calculated. If the audio code stream and the video code stream are to be played synchronously, the playback time of those frames must equal the target playing duration of the dubbing audio data; the calculated playback time of the captured frames is therefore taken as the target playing duration of the dubbing audio data.
Once the target playing duration of the dubbing audio data and the duration of the target time period are known, the sampling frequency is determined from the ratio between them, and the dubbing audio data is sampled and encoded at that frequency. This compresses (or stretches) the dubbing audio so that audio and video play back synchronously, avoiding lip-sync mismatch with the movie and television characters.
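The synchronization arithmetic can be made concrete with the following sketch; the 44.1 kHz base rate and the example numbers are illustrative assumptions.

```python
def audio_resample_rate(frames_in_period, encode_fps, record_seconds, base_rate=44100):
    """Sampling rate at which the dubbing audio should be encoded so that it
    plays back over exactly the span occupied by the frames captured while
    it was being recorded."""
    target_play_seconds = frames_in_period / encode_fps      # target playing duration
    ratio = record_seconds / target_play_seconds             # proportion of the two durations
    return base_rate * ratio

# Example: 100 frames were captured while the user dubbed for 5.2 seconds and the
# video is encoded at 20 frames/second, so the frames play for 5.0 seconds; the
# audio is therefore encoded at 44100 * (5.2 / 5.0) = 45864 Hz to fit that span.
print(audio_resample_rate(100, 20, 5.2))
```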
In some embodiments, the step "synthesizing and outputting a video bitstream and the audio bitstream" may include the following processes:
determining the playing start time point and the playing end time point, in the video code stream, that correspond to the video images captured in the target time period;
and configuring the playing start time point and the playing end time point as the playing start time point and the playing end time point of the audio code stream, and synthesizing and outputting the video code stream and the audio code stream.
Specifically, when data is output, the video code stream and the audio code stream are synchronized at the playing start time point and the playing end time point, so that the audio and the video are synchronously played.
For example, in a video material, the playing start time point and the playing end time point corresponding to the video image captured in the target time period are 00:00:05 and 00:00:10, respectively, and then the playing start time point and the playing end time point of the audio code stream are also set to be 00:00:05 and 00:00:10, respectively.
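One way to impose such start and end points when muxing is sketched below with the ffmpeg command-line tool (assumed to be installed); the 5-second default offset mirrors the 00:00:05 example above and is not part of the disclosed method.

```python
import subprocess

def mux(video_path, audio_path, start_offset_s=5, out_path="composite.mp4"):
    """Mux the video code stream and the dubbing audio code stream, delaying the
    audio so that it starts at the configured playing start time point."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,                   # encoded video code stream
        "-itsoffset", str(start_offset_s),  # shift the audio start, e.g. to 00:00:05
        "-i", audio_path,                   # encoded dubbing audio code stream
        "-map", "0:v", "-map", "1:a",
        "-c:v", "copy", "-c:a", "aac",
        out_path,
    ], check=True)
    return out_path
```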
In some embodiments, the step of "fusing the target face in the initial video image with the selected face image" may include the following processes:
detecting and positioning face key points of a target face in an initial video image and face key points in a selected face image;
aligning the selected face image with the target face through affine transformation;
and updating the facial features of the target face based on the aligned face image.
Specifically, the face key points may be detected with a cascaded regression tree machine learning algorithm, such as the Gradient Boosting Decision Tree (GBDT) algorithm. Taking GBDT as an example, the algorithm model is built as follows:
(21) constructing an initial regression shape from the real shapes of the N training images;
(22) splitting the tree structure using pixel differences as features, so that every picture falls into a leaf node;
(23) computing the difference between every picture shape in each leaf node and the current tree shape, averaging the differences, and storing the average in the leaf node;
(24) updating the tree shape with the values stored in the leaves;
(25) building enough subtrees until the GBDT tree shape represents the real shape.
After the algorithm model has been built, it can be used to detect the key points of the target face in the initial video image and the key points in the selected face image. Then, based on the detected key point positions of the target face and the selected face image, an affine transformation matrix from the preset face image to the target face image is obtained through Procrustes analysis, computed with the least squares method. The selected face is then translated, rotated, scaled, and otherwise transformed by this affine matrix, so that it is aligned in position with the target face in the initial video image and the facial feature points of the two are brought as close together as possible. For example, referring to the preset face image a in fig. 4, the transformed image d is obtained after affine transformation.
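A minimal alignment sketch is given below: a similarity transform (translation, rotation, uniform scale) is estimated from corresponding landmarks by least squares and used to warp the selected face into the target frame. The landmark arrays are assumed to be N x 2 point sets from any detector, such as the GBDT predictor described above.

```python
import cv2
import numpy as np

def align_selected_face(selected_img, selected_landmarks, target_landmarks, target_shape):
    """Warp the selected face so its landmarks line up with the target face."""
    # Least-squares (Procrustes-style) estimate of the 2x3 affine matrix mapping
    # the selected face landmarks onto the target face landmarks.
    matrix, _ = cv2.estimateAffinePartial2D(
        np.asarray(selected_landmarks, dtype=np.float32),
        np.asarray(target_landmarks, dtype=np.float32),
    )
    h, w = target_shape[:2]
    # Translate, rotate, and scale the selected face into the target's coordinate frame.
    return cv2.warpAffine(selected_img, matrix, (w, h),
                          flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)
```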
In some embodiments, the step "updating the facial features of the target face based on the aligned face image" may include the following steps:
dividing the face region of the target face based on its face key points to obtain the facial feature region of the target face;
processing the facial feature area according to a preset algorithm to obtain a facial feature template of the facial feature area;
and fusing the aligned face image with the target face by using the face feature template to obtain a face fusion image.
Specifically, the geometric features of the human face can be used to extract face feature points that are invariant to size, rotation, and displacement; for example, the positions of key feature points of parts such as the eyes, nose, and lips can be extracted. For instance, nine feature points of the human face may be selected, whose distribution is angle-invariant: the two eyeball center points, the four eye corner points, the midpoint between the two nostrils, and the two mouth corner points.
For example, in this embodiment a face triangle contour template (i.e., an eye-mouth-nose template; see c in fig. 4) may be derived from the face feature points and used as the facial feature template. This contour template delineates the details of the input images, and the two inputs, namely the preset face image and the target face image, are then superimposed to complete the image fusion.
Referring to fig. 4, a is the preset face image, b is the target face image, c is the face mask generated from the facial feature region in the target face image b, d is the image obtained by affine transformation of the preset face image a, and the fused image e after face fusion is finally output.
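A hedged fusion sketch follows: a mask is built from the target face's feature points and used to superimpose the aligned face onto the target frame. OpenCV's seamlessClone is one possible blending choice here, not the only one, and the convex-hull mask stands in for the eye-mouth-nose template.

```python
import cv2
import numpy as np

def fuse_faces(target_frame, aligned_face, target_landmarks):
    """Blend the aligned face into the target frame inside the facial feature region."""
    hull = cv2.convexHull(np.asarray(target_landmarks, dtype=np.int32))
    mask = np.zeros(target_frame.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)              # facial feature template / mask
    x, y, w, h = cv2.boundingRect(hull)
    center = (x + w // 2, y + h // 2)                # center of the replaced region
    return cv2.seamlessClone(aligned_face, target_frame, mask, center, cv2.NORMAL_CLONE)
```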
However, when extracting face features, conventional edge detection operators cannot reliably extract features such as the eye or lip regions, because local edge information cannot be effectively organized; features of the face can therefore be extracted with an algorithm such as the SUSAN operator. The principle of the SUSAN operator is as follows: a circular area centered on the current pixel is taken as a mask, and for each point in the face image, the degree to which the pixel values of all points within that area agree with the pixel value of the current point is observed.
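The SUSAN principle can be illustrated with the didactic sketch below: for every pixel, the number of pixels inside a small circular mask whose brightness is close to that of the center (the USAN area) is counted, and a small area signals an edge or corner. The radius and brightness threshold are assumptions, and the wrap-around border handling is accepted for brevity.

```python
import numpy as np

def susan_response(gray, radius=3, t=27):
    """Return a response map that is large where the USAN area is small."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    offsets = [(dy, dx) for dy, dx in zip(ys.ravel(), xs.ravel())
               if dy * dy + dx * dx <= radius * radius]
    g = gray.astype(np.float32)
    usan = np.zeros_like(g)
    for dy, dx in offsets:
        shifted = np.roll(np.roll(g, dy, axis=0), dx, axis=1)   # borders wrap around
        usan += (np.abs(shifted - g) < t).astype(np.float32)    # similar-brightness count
    return len(offsets) - usan
```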
In some embodiments, after the face template is used to fuse the aligned face image with the target face to obtain a face fusion image, the following process may be further included:
calculating a pixel difference value of facial features between the target face and the selected face image;
generating a color adjustment parameter according to the pixel difference value;
and adjusting the face fusion image based on the color adjustment parameters.
The color adjustment parameter may specifically be a difference value between RGB values of the pixel points.
Specifically, the selected face image and the target face in the initial video image may have a large difference in skin color, so that after the face image is fused, the fused boundary sawtooth effect of the replacement region and the original face region is obvious. Therefore, it is necessary to reduce the edge aliasing effect by adjusting the pixel difference between the fused region and the original region to enhance the face fusion degree.
For example, in some embodiments, pixel difference values may be reduced by a blurring effect. The concrete implementation is as follows:
(31) calculating the pixel difference value of the facial features in the target face and the selected face image;
(32) calculating a blurring effect through the pixel difference value;
(33) and reducing the pixel difference between the target face and the selected face image through Gaussian blur.
With this operation, the skin color of the fused region is shifted closer to the skin color of the target face.
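Steps (31) to (33) can be sketched as the following color correction, in which the low-frequency (Gaussian-blurred) color of the fused region is pushed towards that of the original target face, softening the boundary between the replaced region and the surrounding skin; the kernel size is an assumption of the sketch.

```python
import cv2
import numpy as np

def match_skin_tone(fused_face, target_face, kernel=(51, 51)):
    """Shift the fused region's skin color towards the target face's skin color."""
    fused = fused_face.astype(np.float32)
    target = target_face.astype(np.float32)
    # Pixel difference of the low-frequency color between the two images.
    diff = cv2.GaussianBlur(target, kernel, 0) - cv2.GaussianBlur(fused, kernel, 0)
    return np.clip(fused + diff, 0, 255).astype(np.uint8)
```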
Referring to fig. 5, figs. 6a to 6e, and fig. 7: fig. 5 is a schematic diagram of another architecture of a video processing method according to an embodiment of the present disclosure; figs. 6a to 6e are schematic interface interaction diagrams of a video processing method provided by an embodiment of the application; and fig. 7 is a schematic structural diagram of another video processing method according to an embodiment of the present application.
First, the user can log in with an account registered in the dubbing application through the account login interface to enter the dubbing main interface. As shown in fig. 6a, when the user opens the dubbing main interface, popular materials and other materials may be displayed on the current interface, and the user may tap a material's display control to select the current video material for video preview or to enter the dubbing stage directly. The main interface may also include a search bar; entering keywords there searches the video material library for matching video materials, speeding up retrieval.
Referring to fig. 6b, when a selected video material is dubbed, face recognition can be performed on the video material, roles in the video can be recognized, and video roles with faces can be analyzed from the video material. As in fig. 6b, three video characters are parsed from the selected video material and character images are displayed. In the embodiment of the application, a visible or invisible selection control can be set in the character image, and the video character needing face replacement can be selected through the selection control. For example, the selection icon in the upper right corner of the character image in fig. 6b, through which the first video character is selected.
In addition, an image adding interface can be provided on the current interface, through which a replacement face image can be added. In practice, replacement face material can be added from the local photo library through this interface. In a specific implementation, the requirements on the replacement face material are a frontal face (no raised head, lowered head, or sideways turn) and an unoccluded face with unoccluded facial features. If the added image does not meet these requirements, the next step cannot proceed, and a prompt message can be generated asking the user to add an image again.
Once the image selection is complete, the back end fuses the locally added replacement face image with the face of the video character selected from the video material through a cloud algorithm to obtain the face fusion image.
In some embodiments, in order to facilitate accurate dubbing by a dubbing user, a video file can be played in the dubbing process of the user, and text information corresponding to the lines of each movie and television role can be displayed in a video playing interface, so that the lines of the dubbing user can be prompted, and word forgetting can be avoided. That is, the video processing method may further include the following steps:
obtaining a sample text;
the sample text is displayed while dubbing audio data input by the user is acquired.
The sample text may be text information edited in advance, and may be displayed in any text format such as font, size, color, and the like. For example, referring to the area of the "subtitle" in fig. 6c, the sample text may be arranged at the dotted line. In addition, information such as playing progress, playing duration and the like can be displayed through the progress bar in the video material playing process.
Further, a progress prompt for lines may also be made to remind the user to prepare to enter dubbing. For example, a subtitle currently being played is marked by a color change.
In some embodiments, a text editing interface can be further provided in the dubbing interface, and a user can edit and adjust the existing sample text through the text editing interface so as to meet the text customization requirements of some users.
With continued reference to fig. 5, in some embodiments, to set the mood for the dubbing user, background music of a style matching the video content may be added to the video file and played while the video file plays, helping the dubbing user get into the video scenario as quickly as possible. That is, the video processing method may further include the following steps:
acquiring sample background audio data;
the sample background audio data is played while dubbing audio data input by the user is acquired.
The background audio data may be pure music played by a physical musical instrument (such as a piano, a violin, etc.) or an electronic musical instrument, or mixed music with human voice and musical instruments. With continued reference to fig. 6c, a music selection interface (e.g., a note icon control shown in fig. 6 c) may be provided in the dubbing interface, through which a music style with various styles may be selected from a background music library. The background music library may be audio data stored in the cloud or local audio data of the terminal.
In practical applications, a recording control interface (e.g., a microphone icon control in fig. 6 c) may be set on the dubbing interface, and the recording control interface may drive the calling terminal microphone to receive the dubbing voice from the user, and may implement the functions of starting recording, pausing recording, continuing recording, and the like.
In the embodiment of the application, whether operations such as face replacement, text display or background music addition are needed or not can be selectively selected.
Referring to fig. 6d, after the dubbing is completed, the user may autonomously adjust the recorded audio/video synthetic file through a preview interface, a subtitle display interface, a voice setting interface, a background music setting interface, and the like set on the current interface. After the adjustment is finished, the new face-fused video, music and dubbing can be previewed, and the audio and video synthetic file can be stored through the set storage interface.
Finally, referring to FIG. 6e, the dubbed video production may be viewed through the user's personal home page. In practical application, the interface can be provided with a sharing interface, related social applications or platforms can be authorized, and recorded dubbing video files can be shared to other social platforms.
This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.
In order to better implement the video processing method provided by the embodiments of the present application, an apparatus (processing apparatus for short) based on the video processing method is also provided in the embodiments of the present application. The terms are the same as those in the video processing method, and details of implementation can be referred to the description in the method embodiment.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The processing apparatus may include an audio acquisition unit 301, an image acquisition unit 302, a processing unit 303, and a synthesis unit 304, which may specifically be as follows:
an audio acquisition unit 301 configured to acquire dubbing audio data input by a user;
an image obtaining unit 302, configured to obtain multiple frames of video images from a video file;
the processing unit 303 is configured to determine an initial video image including a target face from the multiple frames of video images, and fuse the target face in the initial video image with the selected face image to obtain a target video image;
and the synthesis unit 304 is configured to synthesize the dubbing audio data and at least the target video image to obtain an audio/video synthesis file.
In some embodiments, referring to fig. 9, the video processing apparatus 300 may further include:
an extracting unit 305, configured to parse the video file to extract at least one facial image from the video file before obtaining multiple frames of video images from the video file;
the selection unit 306 is used for receiving a sample selection instruction of a user and selecting a human face image from at least one human face image based on the sample selection instruction;
the processing unit 303 may be specifically configured to:
capturing a face image in each frame of a plurality of frames of video images;
judging whether the video image contains a target face matched with the sample face image;
and if so, taking the video image as an initial video image.
In some embodiments, the processing unit 303 may be specifically configured to:
detecting and positioning face key points of a target face in an initial video image and face key points in a selected face image;
aligning the selected face image with the target face through affine transformation;
and updating the facial features of the target face based on the aligned face image.
In some embodiments, the synthesis unit 304 may include:
an updating subunit, configured to update the multiple frames of video images based on the target video image;
the video coding subunit is used for coding the updated multi-frame video image to obtain a video code stream;
the audio coding subunit is used for coding the dubbing audio data to obtain an audio code stream;
and the synthesizing subunit is used for synthesizing and outputting the video code stream and the audio code stream.
In some embodiments, the image acquisition unit 302 may be specifically configured to:
capturing video images from a video file according to a preset frame rate time interval to obtain a plurality of frames of video images;
the audio coding subunit may be configured to:
acquiring the number of video image frames captured in a target time period, wherein the target time period is the time from the starting time to the ending time of dubbing audio data input;
determining the total playing time length of the video code stream;
calculating the target playing time length of dubbing audio data according to the frame number and the total playing time length;
and determining a sampling frequency based on the target playing time length and the time length corresponding to the target time period, and encoding the dubbing audio data based on the sampling frequency.
In the video processing apparatus provided by the embodiments of the application, the audio acquisition unit 301 acquires dubbing audio data input by the user; the image acquisition unit 302 obtains multiple frames of video images from a video file; the processing unit 303 determines an initial video image containing a target face from the multiple frames of video images and fuses the target face in the initial video image with the selected face image to obtain a target video image; and the synthesis unit 304 synthesizes the dubbing audio data with at least the target video image to obtain an audio and video synthetic file. This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.
An embodiment of the present application also provides a terminal, as shown in fig. 10, which may include components such as a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the terminal structure shown in fig. 10 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 601 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages by one or more processors 608; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and data processing by operating the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 access to the memory 602.
The input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 603 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 608, and can receive and execute commands sent by the processor 608. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 603 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 604 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 604 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 10 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
Audio circuitry 606, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 606 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electric signal, which is received by the audio circuit 606 and converted into audio data, which is then processed by the audio data output processor 608, and then transmitted to, for example, another terminal via the RF circuit 601, or the audio data is output to the memory 602 for further processing. The audio circuit 606 may also include an earbud jack to provide communication of peripheral headphones with the terminal.
WiFi belongs to short-distance wireless transmission technology, and the terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 607, and provides wireless broadband internet access for the user. Although fig. 10 shows the WiFi module 607, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 608 is a control center of the terminal, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the handset. Optionally, processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The terminal also includes a power supply 609 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 608 via a power management system that may be used to manage charging, discharging, and power consumption. The power supply 609 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 608 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 608 runs the application programs stored in the memory 602, thereby implementing various functions:
acquiring dubbing audio data input by a user; obtaining a plurality of frames of video images from a video file; determining an initial video image containing a target face from the multi-frame video image, and fusing the target face in the initial video image with the selected face image to obtain a target video image; and synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthetic file.
In the process of playing the video file, dubbing audio data input by the user is acquired; multiple frames of video images are obtained from the video file; an initial video image containing a target face is determined from the multiple frames of video images, and the target face in the initial video image is fused with the selected face image to obtain a target video image; and the dubbing audio data is synthesized with at least the target video image to obtain an audio and video synthetic file. This scheme organically integrates elements such as the user's dubbing and the user's portrait into video production, increasing how deeply the user is involved in producing the video and how strongly the video is personalized.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the video processing methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring dubbing audio data input by a user; obtaining a plurality of frames of video images from a video file; determining an initial video image containing a target face from the multi-frame video image, and fusing the target face in the initial video image with the selected face image to obtain a target video image; and synthesizing the dubbing audio data and at least the target video image to obtain an audio and video synthetic file.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
Since the instructions stored in the storage medium can execute the steps in any video processing method provided in the embodiments of the present application, beneficial effects that can be achieved by any video processing method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The foregoing has described in detail the video processing method, apparatus, and storage medium provided in the embodiments of the present application. Specific examples have been used herein to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are intended only to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may, based on the ideas of the present application, make changes to both the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A video processing method, comprising:
acquiring dubbing audio data input by a user;
obtaining a plurality of frames of video images from a video file;
identifying all candidate persons having human faces from the multiple frames of video images, extracting the identified candidate persons to a display interface, and determining, by a user through the display interface, one or more target persons whose faces need to be replaced;
determining all initial video images containing the target face from the multiple frames of video images; detecting the target face in all the initial video images, and acquiring integrity information, orientation information and/or expression information of the target face in each initial video image; correspondingly processing a selected face image according to the integrity information, orientation information and/or expression information of the target face in each initial video image to obtain a processed face image corresponding to each initial video image; and fusing the target face in each initial video image with the corresponding processed face image to obtain a target video image, wherein the selected face image is a face in a photo selected from an album or a face captured directly by a photographing means, the target face is the face of a target person, and the fusion processing comprises: replacing or covering the target face in the initial video image with the processed face image, or performing deformation processing on the target face in the initial video image based on features of the processed face image;
and synthesizing the dubbing audio data and at least the target video image to obtain a composite audio and video file.
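The following sketch illustrates, assuming OpenCV and NumPy, the three fusion modes recited in claim 1 (replace, cover, deform) together with a crude orientation adjustment of the selected face image; the helper names, the rectangle-based face box, and the 50/50 blend used for "deformation" are simplifications chosen here for illustration, not the claimed implementation.

    # Illustrative only: the three fusion modes named in claim 1, with a crude
    # orientation adjustment of the selected face image. OpenCV/NumPy are assumed.
    import cv2
    import numpy as np

    def preprocess_selected_face(face_img, orientation_deg):
        """Rotate the selected face image so it roughly matches the target face orientation."""
        h, w = face_img.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), orientation_deg, 1.0)
        return cv2.warpAffine(face_img, m, (w, h))

    def fuse(frame, box, processed_face, mode="replace"):
        """box = (x, y, w, h) of the target face inside the frame."""
        x, y, w, h = box
        patch = cv2.resize(processed_face, (w, h))
        out = frame.copy()
        if mode in ("replace", "cover"):
            out[y:y + h, x:x + w] = patch                          # replace or cover the target face
        elif mode == "deform":
            blended = cv2.addWeighted(out[y:y + h, x:x + w], 0.5, patch, 0.5, 0)
            out[y:y + h, x:x + w] = blended                        # deform toward the selected face features
        return out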
2. The video processing method of claim 1, further comprising, before obtaining the plurality of frames of video images from the video file:
analyzing the video file, and extracting at least one face image from the video file;
receiving a sample selection instruction from a user, and selecting a sample face image from the at least one face image based on the sample selection instruction;
the determining an initial video image containing a target face from the multiple frames of video images includes:
capturing a face in each frame of the plurality of frames of video images;
judging whether the video image contains a target face matched with the sample face image;
and if so, taking the video image as an initial video image.
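The matching step of claim 2 can be illustrated with the open-source face_recognition package; the package choice and the 0.6 distance threshold are assumptions of this sketch, since the claim does not specify a matching algorithm.

    # Sketch of matching each frame against the sample face image, assuming the
    # face_recognition package; 0.6 is its conventional default distance threshold.
    import face_recognition

    def is_initial_image(frame_rgb, sample_encoding, threshold=0.6):
        """Return True if the frame contains a face matching the sample face image.

        frame_rgb must be an RGB image (convert from OpenCV's BGR if needed)."""
        encodings = face_recognition.face_encodings(frame_rgb)     # capture faces in the frame
        if not encodings:
            return False
        distances = face_recognition.face_distance(encodings, sample_encoding)
        return bool(min(distances) < threshold)                    # a face matches the sample face image

    # sample_encoding would be built once from the selected sample face image, e.g.:
    # sample_encoding = face_recognition.face_encodings(
    #     face_recognition.load_image_file("sample.jpg"))[0]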
3. The video processing method according to claim 1, wherein said fusing the target face in each initial video image with the corresponding processed face image comprises:
detecting and positioning the face key points of the target face in the initial video image and the corresponding face key points in the processed face image;
aligning the processed face image with the target face through affine transformation;
and updating the facial features of the target face based on the aligned face image.
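The alignment of claim 3 can be illustrated with dlib's 68-point landmark detector and an OpenCV similarity transform; both library choices, and the landmark model file, are assumptions of this sketch rather than requirements of the claim.

    # Sketch of claim 3: detect key points on both faces, then align the processed
    # face image to the target face with an affine (similarity) transform.
    # Assumes dlib's 68-point landmark model file is available on disk.
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def landmarks(img_gray):
        rect = detector(img_gray)[0]                     # assumes at least one face is found
        shape = predictor(img_gray, rect)
        return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

    def align(processed_face, target_frame):
        src = landmarks(cv2.cvtColor(processed_face, cv2.COLOR_BGR2GRAY))
        dst = landmarks(cv2.cvtColor(target_frame, cv2.COLOR_BGR2GRAY))
        m, _ = cv2.estimateAffinePartial2D(src, dst)     # affine transform between the key points
        h, w = target_frame.shape[:2]
        return cv2.warpAffine(processed_face, m, (w, h)) # processed face aligned onto the target face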
4. The video processing method according to claim 3, wherein the updating the facial features of the target face based on the aligned face image comprises:
dividing a face region based on the face key points in the target face to obtain a face feature region of the target face;
processing the facial feature region according to a preset algorithm to obtain a facial feature template of the facial feature region;
and fusing the aligned face image with the target face by using the face feature template to obtain a face fusion image.
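One way to illustrate the "face feature template" fusion of claim 4 is a convex-hull mask over the face key points followed by Poisson blending (OpenCV's seamlessClone); the claim's "preset algorithm" is not specified, so this particular technique is an assumption of the sketch.

    # Sketch of claim 4: build a mask (standing in for the face feature template)
    # from the face key points, then fuse the aligned face into the target frame.
    import cv2
    import numpy as np

    def fuse_with_template(aligned_face, target_frame, target_landmarks):
        mask = np.zeros(target_frame.shape[:2], dtype=np.uint8)
        hull = cv2.convexHull(target_landmarks.astype(np.int32))    # facial feature region
        cv2.fillConvexPoly(mask, hull, 255)                         # template of that region
        x, y, w, h = cv2.boundingRect(hull)
        center = (x + w // 2, y + h // 2)
        # Poisson blending of the aligned face into the target frame within the mask.
        return cv2.seamlessClone(aligned_face, target_frame, mask, center, cv2.NORMAL_CLONE)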
5. The video processing method according to claim 4, wherein after the face feature template is used to fuse the aligned face image with the target face to obtain a face fusion image, the method further comprises:
calculating the pixel difference value of the facial features between the target face and the processed face image;
generating a color adjustment parameter according to the pixel difference value;
and adjusting the face fusion image based on the color adjustment parameters.
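A naive illustration of claim 5, assuming the pixel difference is summarized as a per-channel mean offset over the face regions; the averaging and clipping choices are assumptions of this sketch.

    # Sketch of claim 5: compute the mean per-channel pixel difference between the
    # target face and the processed face, then shift the fused image accordingly.
    import numpy as np

    def color_adjust(fused_img, target_face_region, processed_face_region):
        diff = (target_face_region.astype(np.float32).mean(axis=(0, 1))
                - processed_face_region.astype(np.float32).mean(axis=(0, 1)))  # pixel difference value
        adjusted = fused_img.astype(np.float32) + diff                         # color adjustment parameter
        return np.clip(adjusted, 0, 255).astype(np.uint8)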
6. The video processing method according to claim 1, wherein said synthesizing the dubbing audio data and at least the target video image comprises:
updating the multi-frame video image based on the target video image;
coding the updated multi-frame video image to obtain a video code stream;
coding the dubbing audio data to obtain an audio code stream;
and synthesizing and outputting the video code stream and the audio code stream.
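For illustration, the encoding and synthesis of claim 6 can be delegated to OpenCV's VideoWriter and the ffmpeg command line; the H.264/AAC codecs, the intermediate video file, and the fixed frame rate are assumptions of this sketch, not requirements of the claim.

    # Sketch of claim 6: encode the updated frames into a video stream, encode the
    # dubbing audio, and mux both into one output file. Assumes ffmpeg is installed.
    import subprocess
    import cv2

    def write_video(frames, path, fps=25.0):
        h, w = frames[0].shape[:2]
        writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for f in frames:
            writer.write(f)                      # updated multi-frame video images -> video code stream
        writer.release()

    def mux(video_path, dubbing_wav, out_path):
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path, "-i", dubbing_wav,
            "-c:v", "libx264", "-c:a", "aac",    # re-encode video and audio code streams
            "-shortest", out_path,               # synthesize and output
        ], check=True)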
7. The video processing method according to claim 6, wherein said obtaining the plurality of frames of video images from the video file comprises:
capturing video images from the video file at a time interval corresponding to a preset frame rate to obtain the plurality of frames of video images;
the encoding the dubbing audio data to obtain an audio code stream includes:
acquiring the number of video image frames captured in a target time period, wherein the target time period is the period from the start time to the end time of the dubbing audio data input;
determining the total playing time length of the video code stream;
calculating the target playing time length of the dubbing audio data according to the frame number and the total playing time length;
and determining a sampling frequency based on the target playing time length and the time length corresponding to the target time period, and coding the dubbing audio data based on the sampling frequency to obtain an audio code stream.
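The timing arithmetic of claim 7 amounts to choosing a sampling frequency that stretches or compresses the dubbing audio so it spans exactly the playing time of the frames captured while the user was dubbing; a worked sketch follows, with all variable names and the 44.1 kHz base rate chosen here purely for illustration.

    # Sketch of the timing arithmetic in claim 7. All names are illustrative.
    def dubbing_sample_rate(frames_in_period, total_frames, total_play_seconds,
                            dubbing_seconds, original_rate=44100):
        # Target playing time of the dubbing audio, derived from the frame count
        # captured during dubbing and the total playing time of the video stream.
        target_play_seconds = total_play_seconds * frames_in_period / total_frames
        # Choose the rate so the recorded dubbing (dubbing_seconds long at
        # original_rate) occupies exactly target_play_seconds when decoded.
        return original_rate * dubbing_seconds / target_play_seconds

    # Example: 150 of 750 frames fall in the dubbing period of a 30 s video, and the
    # user actually spoke for 5.5 s, so the audio must fit into 6 s of playback:
    # dubbing_sample_rate(150, 750, 30.0, 5.5)  ->  44100 * 5.5 / 6.0 = 40425 Hz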
8. The video processing method according to claim 7, wherein said synthesizing and outputting the video bitstream and the audio bitstream comprises:
determining a playing start time point and a playing end time point, in the video code stream, corresponding to the video images captured in the target time period;
and configuring the playing starting time point and the playing ending time point as the playing starting time point and the playing ending time point of the audio code stream, and synthesizing and outputting the video code stream and the audio code stream.
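Claim 8's alignment step can be illustrated as computing the playback window of the frames captured during dubbing and reusing it for the audio stream; the frame-interval arithmetic below is an assumption of this sketch.

    # Sketch of claim 8: derive the playing start/end time points of the frames
    # captured in the target time period and apply them to the audio code stream.
    def playback_window(first_captured_index, frames_in_period, preset_frame_interval):
        """Indices are positions in the captured frame sequence; times are in seconds."""
        start = first_captured_index * preset_frame_interval
        end = (first_captured_index + frames_in_period) * preset_frame_interval
        return start, end   # reused as the audio stream's play start / end time points

    # e.g. frames 100..249 at a 0.04 s interval (25 fps):
    # playback_window(100, 150, 0.04) == (4.0, 10.0)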
9. The video processing method according to any one of claims 1 to 8, further comprising:
acquiring sample background audio data and/or sample text;
and playing the video file while the user inputs the dubbing audio data, and simultaneously playing the sample background audio data and/or displaying the sample text.
10. A video processing apparatus, comprising:
the audio acquisition unit is used for acquiring dubbing audio data input by a user;
the image acquisition unit is used for acquiring a plurality of frames of video images from the video file;
identifying all candidate persons having human faces from the multiple frames of video images, extracting the identified candidate persons to a display interface, and determining, by a user through the display interface, one or more target persons whose faces need to be replaced;
a processing unit, configured to: determine all initial video images containing a target face from the multiple frames of video images; detect the target face in all the initial video images, and acquire integrity information, orientation information and/or expression information of the target face in each initial video image; correspondingly process a selected face image according to the integrity information, orientation information and/or expression information of the target face in each initial video image to obtain a processed face image corresponding to each initial video image; and fuse the target face in each initial video image with the corresponding processed face image to obtain a target video image, wherein the selected face image is a face in a photo selected from an album or a face captured directly by a photographing means, the target face is the face of a target person, and the fusion processing comprises: replacing or covering the target face in the initial video image with the processed face image, or performing deformation processing on the target face in the initial video image based on features of the processed face image;
and a synthesis unit, configured to synthesize the dubbing audio data and at least the target video image to obtain a composite audio and video file.
11. The video processing apparatus of claim 10, wherein the apparatus further comprises:
the extraction unit is used for analyzing the video file before playing the video file and extracting at least one face image from the video file;
the selection unit is used for receiving a sample selection instruction from a user and selecting a sample face image from the at least one face image based on the sample selection instruction;
the processing unit is configured to:
capturing a face in each frame of the plurality of frames of video images;
judging whether the video image contains a target face matched with the sample face image;
and if so, taking the video image as an initial video image.
12. The video processing apparatus of claim 10, wherein the processing unit is configured to:
detecting and positioning the face key points of the target face in the initial video image and the corresponding face key points in the processed face image;
aligning the processed face image with the target face through affine transformation;
and updating the facial features of the target face based on the aligned face image.
13. The video processing apparatus according to claim 10, wherein the synthesizing unit includes:
an updating subunit, configured to update the multiple frames of video images based on the target video image;
the video coding subunit is used for coding the updated multi-frame video image to obtain a video code stream;
the audio coding subunit is used for coding the dubbing audio data to obtain an audio code stream;
and the synthesizing subunit is used for synthesizing and outputting the video code stream and the audio code stream.
14. The video processing apparatus of claim 13, wherein the image acquisition unit is configured to:
capturing video images from a video file according to a preset frame rate time interval to obtain a plurality of frames of video images;
the audio coding subunit is configured to:
acquiring the number of video image frames captured in a target time period, wherein the target time period is the time from the starting time to the ending time of dubbing audio data input;
determining the total playing time length of the video code stream;
calculating the target playing time length of the dubbing audio data according to the frame number and the total playing time length;
and determining a sampling frequency based on the target playing time length and the time length corresponding to the target time period, and encoding the dubbing audio data based on the sampling frequency.
15. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the video processing method according to any one of claims 1 to 9.
CN201910023976.3A 2019-01-10 2019-01-10 Video processing method, device and storage medium Active CN109819313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910023976.3A CN109819313B (en) 2019-01-10 2019-01-10 Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910023976.3A CN109819313B (en) 2019-01-10 2019-01-10 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN109819313A CN109819313A (en) 2019-05-28
CN109819313B true CN109819313B (en) 2021-01-08

Family

ID=66603283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910023976.3A Active CN109819313B (en) 2019-01-10 2019-01-10 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109819313B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110266973B (en) * 2019-07-19 2020-08-25 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer-readable storage medium and computer equipment
CN110363175A (en) * 2019-07-23 2019-10-22 厦门美图之家科技有限公司 Image processing method, device and electronic equipment
CN110543826A (en) * 2019-08-06 2019-12-06 尚尚珍宝(北京)网络科技有限公司 Image processing method and device for virtual wearing of wearable product
CN110807584A (en) * 2019-10-30 2020-02-18 维沃移动通信有限公司 Object replacement method and electronic equipment
CN110856014B (en) * 2019-11-05 2023-03-07 北京奇艺世纪科技有限公司 Moving image generation method, moving image generation device, electronic device, and storage medium
CN110868554B (en) * 2019-11-18 2022-03-08 广州方硅信息技术有限公司 Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN110968736B (en) * 2019-12-04 2021-02-02 深圳追一科技有限公司 Video generation method and device, electronic equipment and storage medium
CN111212245B (en) * 2020-01-15 2022-03-25 北京猿力未来科技有限公司 Method and device for synthesizing video
CN111353069A (en) * 2020-02-04 2020-06-30 清华珠三角研究院 Character scene video generation method, system, device and storage medium
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111741231B (en) 2020-07-23 2022-02-22 北京字节跳动网络技术有限公司 Video dubbing method, device, equipment and storage medium
CN112040310A (en) * 2020-09-03 2020-12-04 广州优谷信息技术有限公司 Audio and video synthesis method and device, mobile terminal and storage medium
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN112766215A (en) * 2021-01-29 2021-05-07 北京字跳网络技术有限公司 Face fusion method and device, electronic equipment and storage medium
CN112929746B (en) * 2021-02-07 2023-06-16 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN113238698A (en) * 2021-05-11 2021-08-10 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113395569B (en) * 2021-05-29 2022-12-09 北京优幕科技有限责任公司 Video generation method and device
CN114286171B (en) * 2021-08-19 2023-04-07 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113727187B (en) * 2021-08-31 2022-10-11 平安科技(深圳)有限公司 Animation video processing method and device based on skeleton migration and related equipment
CN113923515A (en) * 2021-09-29 2022-01-11 马上消费金融股份有限公司 Video production method and device, electronic equipment and storage medium
CN113965802A (en) * 2021-10-22 2022-01-21 深圳市兆驰股份有限公司 Immersive video interaction method, device, equipment and storage medium
CN114040248A (en) * 2021-11-23 2022-02-11 维沃移动通信有限公司 Video processing method and device and electronic equipment
CN114398517A (en) * 2021-12-31 2022-04-26 北京达佳互联信息技术有限公司 Video data acquisition method and device
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN117082188B (en) * 2023-10-12 2024-01-30 广东工业大学 Consistency video generation method and related device based on Pruk analysis

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504774A (en) * 2009-03-06 2009-08-12 暨南大学 Animation design engine based on virtual reality
WO2014001095A1 (en) * 2012-06-26 2014-01-03 Thomson Licensing Method for audiovisual content dubbing
CN104092957A (en) * 2014-07-16 2014-10-08 浙江航天长峰科技发展有限公司 Method for generating screen video integrating image with voice
US9153031B2 (en) * 2011-06-22 2015-10-06 Microsoft Technology Licensing, Llc Modifying video regions using mobile device input
CN107330408A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN107832741A (en) * 2017-11-28 2018-03-23 北京小米移动软件有限公司 The method, apparatus and computer-readable recording medium of facial modeling
CN108765528A (en) * 2018-04-10 2018-11-06 南京江大搏达信息科技有限公司 Game character face 3D animation synthesizing methods based on data-driven
CN109063658A (en) * 2018-08-08 2018-12-21 吴培希 A method for changing the faces of persons in video on multiple mobile terminals using deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5274359B2 (en) * 2009-04-27 2013-08-28 三菱電機株式会社 3D video and audio recording method, 3D video and audio playback method, 3D video and audio recording device, 3D video and audio playback device, 3D video and audio recording medium
CN108259788A (en) * 2018-01-29 2018-07-06 努比亚技术有限公司 Video editing method, terminal and computer readable storage medium
CN108965740B (en) * 2018-07-11 2020-10-30 深圳超多维科技有限公司 Real-time video face changing method, device, equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504774A (en) * 2009-03-06 2009-08-12 暨南大学 Animation design engine based on virtual reality
US9153031B2 (en) * 2011-06-22 2015-10-06 Microsoft Technology Licensing, Llc Modifying video regions using mobile device input
WO2014001095A1 (en) * 2012-06-26 2014-01-03 Thomson Licensing Method for audiovisual content dubbing
CN104092957A (en) * 2014-07-16 2014-10-08 浙江航天长峰科技发展有限公司 Method for generating screen video integrating image with voice
CN107330408A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN107832741A (en) * 2017-11-28 2018-03-23 北京小米移动软件有限公司 The method, apparatus and computer-readable recording medium of facial modeling
CN108765528A (en) * 2018-04-10 2018-11-06 南京江大搏达信息科技有限公司 Game character face 3D animation synthesizing methods based on data-driven
CN109063658A (en) * 2018-08-08 2018-12-21 吴培希 A method for changing the faces of persons in video on multiple mobile terminals using deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research and Implementation of Automatic Face Replacement in Images"; Zhong Qianli; China Master's Theses Full-text Database; 2016-06-15; full text *

Also Published As

Publication number Publication date
CN109819313A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN109819313B (en) Video processing method, device and storage medium
CN111243632B (en) Multimedia resource generation method, device, equipment and storage medium
US10200634B2 (en) Video generation method, apparatus and terminal
US11653069B2 (en) Subtitle splitter
CN108337532A (en) Perform mask method, video broadcasting method, the apparatus and system of segment
CN108510597A (en) Edit methods, device and the non-transitorycomputer readable storage medium of virtual scene
CN111246300B (en) Method, device and equipment for generating clip template and storage medium
CN104133956B (en) Handle the method and device of picture
CN111010610B (en) Video screenshot method and electronic equipment
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN108712603B (en) Image processing method and mobile terminal
CN105302315A (en) Image processing method and device
CN109660728B (en) Photographing method and device
CN108270794B (en) Content distribution method, device and readable medium
CN110572716B (en) Multimedia data playing method, device and storage medium
KR20160024002A (en) Method for providing visual sound image and electronic device implementing the same
WO2021179991A1 (en) Audio processing method and electronic device
CN112445395A (en) Music fragment selection method, device, equipment and storage medium
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN112118397B (en) Video synthesis method, related device, equipment and storage medium
CN111491124B (en) Video processing method and device and electronic equipment
CN113806570A (en) Image generation method and generation device, electronic device and storage medium
CN112449098B (en) Shooting method, device, terminal and storage medium
CN110784762B (en) Video data processing method, device, equipment and storage medium
CN115633223A (en) Video processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221115

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518,101

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
