CN110933330A - Video dubbing method and device, computer equipment and computer-readable storage medium - Google Patents

Video dubbing method and device, computer equipment and computer-readable storage medium

Info

Publication number
CN110933330A
CN110933330A (application CN201911248806.1A)
Authority
CN
China
Prior art keywords
voice data
dubbing
video
text information
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911248806.1A
Other languages
Chinese (zh)
Inventor
吴晗
李文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201911248806.1A
Publication of CN110933330A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265: Mixing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/04: Synchronising

Abstract

The application discloses a video dubbing method and device, computer equipment and a computer-readable storage medium, and belongs to the technical field of computers. According to the method and the device, a dubbing interface is displayed, and a plurality of video frames of a target video are displayed on the dubbing interface. Voice data corresponding to text information is generated based on the text information collected on the dubbing interface and a selected tone type, and the audio features of the voice data are determined based on the tone type. The voice data is added to the target video based on a target video frame selected from the plurality of video frames, and the starting playing time of the voice data is the same as the playing time of the target video frame. In the video dubbing process, the text information provided by a user can be converted into dubbing with a specific tone and added to the video without manual dubbing, which improves dubbing efficiency and can further improve video production efficiency.

Description

Video dubbing method and device, computer equipment and computer-readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video dubbing method and apparatus, a computer device, and a computer-readable storage medium.
Background
Video dubbing is an important link in the video production process. When a video is recorded, the sound collected at the recording site often contains noise that affects the video effect. Therefore, in the current video production process, after video recording is completed, a section of voice usually needs to be recorded manually in a professional recording studio or another quiet environment, and the voice and the video are then synthesized. This manual dubbing process is time-consuming, so video dubbing efficiency is low.
Disclosure of Invention
The embodiment of the application provides a video dubbing method and device, computer equipment and a computer readable storage medium, which can solve the problem of low video dubbing efficiency in the related art. The technical scheme is as follows:
in one aspect, a video dubbing method is provided, and the method includes:
displaying a dubbing interface on which a plurality of video frames of a target video are displayed;
generating voice data corresponding to the text information based on the text information collected on the dubbing interface and the selected tone type, wherein the audio features of the voice data are determined based on the tone type;
and adding the voice data into the target video based on the selected target video frame in the plurality of video frames, wherein the starting playing time of the voice data is the same as the playing time of the target video frame.
In one possible implementation manner, the generating voice data corresponding to the text information based on the text information and the tone color type collected at the dubbing interface includes:
acquiring a tone characteristic corresponding to the tone type;
acquiring a phoneme sequence corresponding to the text information;
and generating the voice data based on the phoneme sequence and the tone characteristics.
In one possible implementation, the generating the speech data based on the phoneme sequence and the timbre features includes:
performing emotion recognition on the text information to obtain emotion characteristics corresponding to the text information, wherein the emotion characteristics are used for indicating the emotion information corresponding to the text information;
and generating the voice data based on the phoneme sequence, the timbre features and the emotion features, wherein the audio features of the voice data change based on changes in the emotion information.
In one possible implementation, the generating the speech data based on the phoneme sequence and the timbre features includes:
acquiring the audio characteristics of background music in the target video;
and generating the voice data based on the phoneme sequence, the tone color characteristic and the audio characteristic of the background music, wherein the audio characteristic of the voice data changes based on the change of the audio characteristic of the background music.
In one possible implementation manner, after generating the voice data corresponding to the text information based on the text information collected at the dubbing interface and the tone color type, the method further includes:
receiving an editing instruction of the voice data, wherein the editing instruction carries first volume information and second volume information;
and adjusting the playing volume of the voice data based on the first volume information, and adjusting the playing volume of the target video based on the second volume information.
In one possible implementation, after the displaying the dubbing interface, the method further includes:
when a text adding instruction is received, generating a target image corresponding to the text information;
the target image is added to the target position of the target video frame.
In one aspect, there is provided a video dubbing apparatus, the apparatus comprising:
the display module is used for displaying a dubbing interface, and a plurality of video frames of the target video are displayed on the dubbing interface;
the voice generating module is used for generating voice data corresponding to the text information based on the text information collected on the dubbing interface and the selected tone type, and the audio characteristic of the voice data is determined based on the tone type;
and the voice adding module is used for adding the voice data into the target video based on the selected target video frame in the video frames, and the starting playing time of the voice data is the same as the playing time of the target video frame.
In one possible implementation, the speech generation module is to:
acquiring a tone characteristic corresponding to the tone type;
acquiring a phoneme sequence corresponding to the text information;
and generating the voice data based on the phoneme sequence and the tone characteristics.
In one possible implementation, the speech generation module is to:
performing emotion recognition on the text information to obtain emotion characteristics corresponding to the text information, wherein the emotion characteristics are used for indicating the emotion information corresponding to the text information;
and generating the voice data based on the phoneme sequence, the timbre features and the emotion features, wherein the audio features of the voice data change based on changes in the emotion information.
In one possible implementation, the speech generation module is to:
acquiring the audio characteristics of background music in the target video;
and generating the voice data based on the phoneme sequence, the tone color characteristic and the audio characteristic of the background music, wherein the audio characteristic of the voice data changes based on the change of the audio characteristic of the background music.
In one possible implementation, the apparatus further includes:
the receiving module is used for receiving an editing instruction of the voice data, wherein the editing instruction carries first volume information and second volume information;
and the volume adjusting module is used for adjusting the playing volume of the voice data based on the first volume information and adjusting the playing volume of the target video based on the second volume information.
In one possible implementation, the apparatus further includes:
the image generation module is used for generating a target image corresponding to the text information when a text adding instruction is received;
and the image adding module is used for adding the target image to the target position of the target video frame.
In one aspect, a computer device is provided and includes a processor and a memory having at least one program code stored therein, the at least one program code being loaded and executed by the processor to perform operations performed by the video dubbing method.
In one aspect, a computer-readable storage medium having at least one program code stored therein is provided, the at least one program code being loaded into and executed by a processor to implement the operations performed by the video dubbing method.
According to the technical scheme provided by the embodiment of the application, a dubbing interface is displayed, and a plurality of video frames of a target video are displayed on the dubbing interface. Voice data corresponding to text information is generated based on the text information collected on the dubbing interface and a selected tone type, and the audio features of the voice data are determined based on the tone type. The voice data is added to the target video based on a target video frame selected from the plurality of video frames, and the starting playing time of the voice data is the same as the playing time of the target video frame. In the video dubbing process, the text information provided by the user can be converted into dubbing with a specific timbre and added to the video without manual dubbing, so that the dubbing efficiency is improved, and the video production efficiency can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation environment of a video dubbing method according to an embodiment of the present application;
fig. 2 is a flowchart of a video dubbing method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a specific form of a dubbing interface provided in an embodiment of the present application;
FIG. 4 is a diagram illustrating a specific form of a text entry box according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a display manner of voice data preview information according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a display manner of a tone type option in a dubbing interface according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video dubbing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The key technologies of speech technology (Speech Technology) include Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes. The scheme provided by the embodiments of the present application relates to the speech synthesis technology in speech technology, and is specifically described by the following embodiments.
Fig. 1 is a schematic diagram of an implementation environment of a video dubbing method provided in an embodiment of the present application, and referring to fig. 1, the implementation environment may include a terminal 110 and a speech synthesis platform 140.
The terminal 110 is connected to the speech synthesis platform 140 through a wireless network or a wired network. The terminal 110 may be at least one of a smartphone, a desktop computer, a tablet computer, an MP3 player, an MP4 player and a laptop computer. An application program supporting speech synthesis is installed and runs on the terminal 110. The application may be a video application, an audio application or the like. Illustratively, the terminal 110 is a terminal used by a user, and the application running on the terminal 110 is logged in with a user account.
The speech synthesis platform 140 includes at least one of a server, a plurality of servers, a cloud computing platform and a virtualization center. The speech synthesis platform 140 is used to provide background services for applications that support speech synthesis. Optionally, the speech synthesis platform 140 undertakes the primary speech synthesis work and the terminal 110 undertakes the secondary speech synthesis work; or, the speech synthesis platform 140 undertakes the secondary speech synthesis work and the terminal 110 undertakes the primary speech synthesis work; or, the speech synthesis platform 140 or the terminal 110 may undertake the speech synthesis work alone.
Optionally, the speech synthesis platform 140 includes an access server, a speech synthesis server and a database. The access server is used to provide an access service for the terminal 110. The speech synthesis server is used to provide background services related to speech synthesis. There may be one or more speech synthesis servers. When there are multiple speech synthesis servers, at least two speech synthesis servers provide different services, and/or at least two speech synthesis servers provide the same service, for example, in a load-balancing manner, which is not limited in the embodiments of the present application. The speech synthesis server may be provided with a speech synthesis model, an emotion recognition model and the like.
The terminal 110 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110.
Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds, or more, and in this case, the speech synthesis system further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present application.
Fig. 2 is a flowchart of a video dubbing method provided in an embodiment of the present application. The method may be applied to the terminal or the server, and both the terminal and the server may be regarded as a computer device. Therefore, the embodiment of the present application is described with the computer device as the execution subject. Referring to fig. 2, the embodiment may specifically include the following steps:
201. the computer device displays a dubbing interface having a plurality of video frames of a target video displayed thereon.
The target video may be any video stored in a computer device, or may also be a video recorded in real time by a computer device having a video recording function, and the embodiment of the present application does not limit which video is specifically adopted.
In an embodiment of the application, the computer device may display the dubbing interface based on video editing instructions. In one possible implementation manner, the video editing instruction may be triggered by a triggering operation of a user on a video editing control corresponding to the target video, where the triggering operation may be a click operation, a long-press operation, or the like. Of course, the video editing instruction may also be triggered in other manners, for example, by a long press operation of the user on the target video display area in the video list, and the like, which is not specifically limited in this embodiment of the application.
Fig. 3 is a schematic diagram of a specific form of a dubbing interface according to an embodiment of the present application. Referring to fig. 3, the dubbing interface may include a video frame display area 301, an editing area 302 and a preview area 303. The video frame display area 301 may display a plurality of video frames of the target video. In a possible implementation manner, the user may adjust the video frames displayed in the video frame display area through gesture operations such as sliding. When the computer device detects that a certain video frame is selected by the user, a preview image of that video frame may be displayed in the preview area 303, and the editing area 302 may display operation controls such as "add text" and "text to speech".
In one possible implementation, when the computer device detects a trigger operation of the user on the "text to speech" control, a text input box may be displayed in a first target area of the dubbing interface, and the first target area may be any area in the dubbing interface. Referring to fig. 4, fig. 4 is a schematic diagram of a specific form of a text input box according to an embodiment of the present application. Taking the first target area 401 being the lower area of the dubbing interface 402 as an example, the first target area 401 may further include a "confirm conversion" control. In this embodiment, after the computer device detects the trigger operation of the user on the "confirm conversion" control, the following step 202 may be performed.
In a possible implementation manner, when the computer device detects that the user triggers the "add text" control, a text input box may be displayed in the first target area of the dubbing interface, and the first target area may further include a "confirm addition" control. When a text adding instruction is received, the computer device generates a target image corresponding to the collected text information and adds the target image to a target position of the target video frame, where the target video frame is a video frame selected by the user from the plurality of video frames. The target image may be a sticker including the text information, and the embodiment of the present application does not limit the specific form of the target image. The target position may be set by the user. For example, when the computer device detects a trigger operation of the user on the "confirm addition" control, the target image may be generated and displayed at a default position of the target video frame; the default position may be set by a developer, and the user may view the display effect of the target image in the preview area of the dubbing interface and adjust the display position of the target image through a drag operation. In a possible implementation manner, the user may also modify the display style of the target image, which is not specifically limited in the embodiments of the present application.
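A minimal illustrative Python sketch of this idea, assuming the Pillow library and using placeholder padding, coordinates and the default font (none of which are specified by this application), may look as follows:

from PIL import Image, ImageDraw

def make_text_sticker(text: str, padding: int = 10) -> Image.Image:
    """Render the text on a semi-transparent background and return it as an image."""
    probe = ImageDraw.Draw(Image.new("RGBA", (1, 1)))
    _, _, right, bottom = probe.textbbox((0, 0), text)   # size of the rendered text
    sticker = Image.new("RGBA", (right + 2 * padding, bottom + 2 * padding), (0, 0, 0, 160))
    ImageDraw.Draw(sticker).text((padding, padding), text, fill=(255, 255, 255, 255))
    return sticker

def paste_on_frame(frame: Image.Image, sticker: Image.Image, position=(50, 50)) -> Image.Image:
    """Overlay the sticker at the target position of the target video frame."""
    out = frame.convert("RGBA")
    out.alpha_composite(sticker, dest=position)   # in-place composite at the target position
    return out

if __name__ == "__main__":
    frame = Image.new("RGBA", (640, 360), (30, 30, 30, 255))     # placeholder frame
    result = paste_on_frame(frame, make_text_sticker("hello"), position=(50, 300))
    result.convert("RGB").save("frame_with_text.png")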
202. The computer equipment generates voice data corresponding to the text information based on the collected text information and the selected tone type.
In this embodiment, after detecting the trigger operation of the "confirm conversion" control by the user, the computer device may acquire text information input by the user and a tone type corresponding to the text information, and an audio feature of the voice data may be determined based on the tone type. The selected tone type may be a tone type set by default in the application program, or may be a tone type selected by the user, which is not specifically limited in the embodiment of the present application. In a possible implementation manner, the process of generating the voice data by the computer device may specifically include the following steps:
step one, computer equipment obtains the tone color characteristics corresponding to the tone color type.
In the embodiment of the present application, one tone color type may correspond to one tone color feature, and the tone color feature may be used to indicate a frequency feature, a waveform feature, and the like corresponding to the tone color type. In a possible implementation manner, the tone color feature may be represented by feature parameters such as vectors and matrices, and the dimension and the specific numerical value of each feature parameter may be set by a developer, which is not limited in the embodiment of the present application.
And step two, the computer equipment acquires the phoneme sequence corresponding to the text information.
In a possible implementation manner, the computer device may preprocess the acquired text information to remove invalid characters, eliminate ambiguity, and the like. Taking the preprocessing of a Chinese text as an example, the process may be completed based on modules such as text regularization, word segmentation, part-of-speech prediction and polyphone disambiguation. The text regularization module may be used to convert non-Chinese characters such as Arabic numerals and symbols in the text information into corresponding Chinese characters, and the word segmentation module may be used to split the text information into a plurality of phrases. In one possible implementation, the computer device may perform word segmentation based on a matching result between the text information and a dictionary. The part-of-speech prediction module may be used to label the parts of speech of the plurality of phrases, and the polyphone disambiguation module may be used to determine the pronunciation of each polyphone in the text information. In one possible implementation, the computer device may determine the pronunciation of a polyphone based on the phrase in which the polyphone is located, the part of speech of the phrase, and the context information of the text information.
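A minimal illustrative Python sketch of this preprocessing flow, assuming a toy numeral table and word dictionary (placeholders, not values from this application), may look as follows; a real system would rely on full lexicons and trained models for part-of-speech prediction and polyphone disambiguation:

CN_DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

# Toy word dictionary used for forward maximum matching.
WORD_DICT = {"今天", "天气", "真好", "度"}

def regularize(text: str) -> str:
    """Replace Arabic digits with Chinese numerals and drop punctuation/whitespace."""
    out = []
    for ch in text:
        if ch in CN_DIGITS:
            out.append(CN_DIGITS[ch])
        elif "\u4e00" <= ch <= "\u9fff":   # keep Chinese characters only
            out.append(ch)
    return "".join(out)

def segment(text: str, max_len: int = 4) -> list:
    """Naive forward maximum matching against the word dictionary."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in WORD_DICT or j == i + 1:   # fall back to single characters
                words.append(text[i:j])
                i = j
                break
    return words

if __name__ == "__main__":
    raw = "今天天气真好, 2度!"
    print(segment(regularize(raw)))   # ['今天', '天气', '真好', '二', '度']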
In a possible implementation manner, the computer device may match the preprocessed text information against a phoneme dictionary to obtain the phoneme information corresponding to each phrase in the text information, and determine the phoneme sequence corresponding to the text information based on the phoneme information corresponding to each phrase and the arrangement order of the phrases. In an embodiment of the present application, the computer device may further label the duration and frequency variation information of each phoneme in the phoneme sequence. The phoneme dictionary may record the correspondence between phrases and phonemes.
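A minimal illustrative Python sketch of this dictionary lookup, assuming the text has already been segmented and using placeholder pinyin-style phoneme entries and a placeholder default duration label, may look as follows:

PHONEME_DICT = {
    "今天": ["j", "in1", "t", "ian1"],
    "天气": ["t", "ian1", "q", "i4"],
    "真好": ["zh", "en1", "h", "ao3"],
}

def to_phoneme_sequence(words, default_duration=0.12):
    """Look up each phrase in the phoneme dictionary and attach a duration label."""
    sequence = []
    for word in words:
        for phoneme in PHONEME_DICT.get(word, ["<unk>"]):   # unknown phrases get a placeholder
            sequence.append({"phoneme": phoneme, "duration_s": default_duration})
    return sequence

if __name__ == "__main__":
    for item in to_phoneme_sequence(["今天", "天气", "真好"]):
        print(item)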
And step three, the computer equipment generates the voice data based on the phoneme sequence and the tone characteristics.
In one possible implementation, the computer device may generate the speech data based on a speech synthesis model. Specifically, the computer device may adjust each set of parameters in the speech synthesis model based on the timbre features, input the phoneme sequence carrying the duration and frequency variation information into the speech synthesis model, determine the sound waveform corresponding to the text information based on the phoneme sequence and the timbre features, and generate the speech data. The speech synthesis model may be a WaveNet model, a Deep Speech 2 model, a Tacotron (end-to-end speech synthesis) model or the like, and the embodiment of the present application does not limit which speech synthesis model is specifically applied.
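A minimal illustrative Python sketch of how the phoneme durations and the timbre features could enter the synthesis step may look as follows; the sine-wave "synthesizer" is only a stand-in for a trained model such as WaveNet or Tacotron, and the mapping from the timbre embedding to pitch is an assumption for illustration:

import numpy as np

SAMPLE_RATE = 16000

def synthesize(phoneme_sequence, timbre_embedding: np.ndarray) -> np.ndarray:
    """Concatenate one sine segment per phoneme; pitch is shifted by the timbre embedding.

    This is only a placeholder for a neural synthesizer: it shows where the
    phoneme durations and the timbre features enter the computation.
    """
    base_pitch = 120.0 + 60.0 * float(timbre_embedding.mean())   # timbre shifts the pitch
    segments = []
    for item in phoneme_sequence:
        n = int(item["duration_s"] * SAMPLE_RATE)
        t = np.arange(n) / SAMPLE_RATE
        segments.append(0.3 * np.sin(2 * np.pi * base_pitch * t))
    return np.concatenate(segments) if segments else np.zeros(0)

if __name__ == "__main__":
    seq = [{"phoneme": "j", "duration_s": 0.12}, {"phoneme": "in1", "duration_s": 0.12}]
    wave = synthesize(seq, timbre_embedding=np.full(8, 0.5))
    print(wave.shape)   # (3840,) for two 0.12 s phonemes at 16 kHz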
It should be noted that the above description of the speech synthesis process is only an exemplary illustration of a speech synthesis method, and the embodiment of the present application does not limit which speech synthesis method is specifically adopted.
In a possible implementation manner, the computer device may further perform emotion recognition on the text information to obtain an emotion feature corresponding to the text information, where the emotion feature is used to indicate the emotion information corresponding to the text information, and generate the voice data based on the phoneme sequence, the timbre feature and the emotion feature, so that the audio features of the voice data change based on changes in the emotion information. Specifically, when processing the text information, the computer device may screen the plurality of phrases split from the text information. In one possible implementation, the computer device may perform the screening based on the part of speech of each phrase. For example, the adjectives in the text information may be screened out, and the emotion information corresponding to the text information may be determined based on the appearance position, appearance frequency and emotional tendency of each adjective. The computer device may adjust each set of parameters in the speech synthesis model based on the emotion changes in the text information, so that the audio features of the voice data output by the speech synthesis model change based on the emotion information. It should be noted that the above description of text emotion recognition is only an exemplary illustration of a text emotion recognition method, and the embodiment of the present application does not limit which text emotion recognition method is specifically adopted.
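A minimal illustrative Python sketch of a lexicon-based version of this adjective screening, assuming a toy sentiment lexicon and an assumed mapping from the emotion score to prosody adjustments, may look as follows:

SENTIMENT_LEXICON = {"好": 1.0, "开心": 1.0, "糟糕": -1.0, "难过": -0.8}

def emotion_score(words) -> float:
    """Average the sentiment of the words found in the lexicon (-1 negative .. +1 positive)."""
    scores = [SENTIMENT_LEXICON[w] for w in words if w in SENTIMENT_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def emotion_feature(words) -> dict:
    """Map the score to prosody adjustments that a synthesizer could consume."""
    score = emotion_score(words)
    return {"score": score,
            "pitch_scale": 1.0 + 0.2 * score,   # more positive text: slightly higher pitch
            "rate_scale": 1.0 + 0.1 * score}    # more positive text: slightly faster speech

if __name__ == "__main__":
    print(emotion_feature(["今天", "天气", "好"]))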
In one possible implementation, the computer device may obtain the audio features of the background music in the target video and generate the voice data based on the phoneme sequence, the timbre feature and the audio features of the background music, so that the audio features of the voice data change based on changes in the audio features of the background music. The background music may be a piece of audio added by the user to the target video, or may be audio separated from the target video by the computer device, which is not limited in this embodiment of the application. Specifically, first, the computer device may preprocess the background music to remove invalid information such as silence and noise in the background music; then, the background music is segmented into a plurality of audio segments with no overlapping part between the segments, and the computer device may perform feature extraction on the audio segments to obtain the audio features corresponding to the segments; finally, the audio features and the audio feature change information of the background music are obtained based on the audio features of the segments, and the computer device may adjust each set of parameters in the speech synthesis model based on the audio features of the background music and the audio feature change information, so that the voice data output by the speech synthesis model changes based on changes in the audio features of the background music. It should be noted that the above description of the audio feature extraction of the background music is only an exemplary description of an audio feature extraction method, and the embodiment of the present application does not limit which audio feature extraction method is specifically adopted.
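A minimal illustrative Python sketch of segmenting a signal into non-overlapping clips and extracting a simple per-segment feature (RMS energy, as a stand-in for richer audio features) may look as follows; the segment length and the synthetic signal are assumptions for illustration:

import numpy as np

def segment_features(audio: np.ndarray, sample_rate: int, segment_s: float = 0.5):
    """Return the RMS energy of each non-overlapping segment of the signal."""
    seg_len = int(segment_s * sample_rate)
    features = []
    for start in range(0, len(audio) - seg_len + 1, seg_len):
        seg = audio[start:start + seg_len]
        features.append(float(np.sqrt(np.mean(seg ** 2))))
    return features

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 2) / sr                                            # 2 s of synthetic "music"
    music = np.sin(2 * np.pi * 220 * t) * np.linspace(0.2, 1.0, t.size)   # a simple crescendo
    print(segment_features(music, sr))                                    # energy rises segment by segment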
In this embodiment, after the computer device generates the voice data, the computer device may display preview information of the voice data in a second target area of the dubbing interface, and the second target area may be any area in the dubbing interface. Referring to fig. 5, fig. 5 is a schematic view illustrating a display manner of voice data preview information according to an embodiment of the present application. Taking the second target area being the lower area of the video frame display area as an example, the computer device may acquire the display position of the target video frame selected by the user and display the preview information of the voice data in the lower area 501 of that display position. For example, a portion of the text information corresponding to the voice data may be displayed, and the embodiment of the present application does not limit the specific content of the preview information. In one possible implementation, the user may switch the video frames displayed in the video frame display area through a sliding operation, and when the user selects any other video frame, the computer device may modify the adding position of the voice data and display the preview information of the voice data in the lower area of that video frame.
203. And the computer equipment adjusts the playing volume of the voice data based on the voice data editing instruction.
In this embodiment of the application, the user may edit the generated voice data. The computer device may receive an editing instruction for the voice data, where the editing instruction carries first volume information and second volume information, adjust the playing volume of the voice data based on the first volume information, and adjust the playing volume of the target video based on the second volume information. In one possible implementation manner, the computer device may display a voice editing page, where the voice editing page is used to provide a playing volume setting function. The voice editing page may include a playing volume setting area and a confirmation control, and the user may set, on the voice editing page, the first volume information corresponding to the voice data and the second volume information corresponding to the target video. Of course, the playing volume of the background music may also be set, which is not limited in the embodiment of the present application. When the computer device detects the trigger operation of the user on the confirmation control in the voice editing page, that is, when the user triggers the editing instruction for the voice data, the computer device may set the playing volume of the voice data, the playing volume of the target video and the playing volume of the background music based on at least one piece of volume information in the editing instruction. The volume information may be a specific value of the volume, or a volume ratio with respect to the maximum volume; for example, the volume of the voice data may be set to 80%, the volume of the background music may be set to 50%, and the volume of the target video may be set to 10%.
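A minimal illustrative Python sketch of applying these volume settings as linear gains when mixing the tracks may look as follows; the array inputs and the clipping step are assumptions, while the 80%/50%/10% values follow the example above:

import numpy as np

def mix_tracks(voice, music, video_audio, voice_vol=0.8, music_vol=0.5, video_vol=0.1):
    """Scale each track by its volume setting and sum them into one output track."""
    n = max(len(voice), len(music), len(video_audio))
    out = np.zeros(n)
    for track, gain in ((voice, voice_vol), (music, music_vol), (video_audio, video_vol)):
        out[:len(track)] += gain * np.asarray(track, dtype=float)
    return np.clip(out, -1.0, 1.0)   # keep the mix within full scale

if __name__ == "__main__":
    voice = np.full(4, 0.5)        # short dub clip
    music = np.full(6, 0.5)        # background music
    video_audio = np.full(6, 0.5)  # original video audio
    print(mix_tracks(voice, music, video_audio))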
204. And the computer equipment adjusts the tone type corresponding to the voice data based on the tone modification instruction.
In the embodiment of the application, after the computer device generates the voice data, the user can modify the tone of the voice data. The computer device can determine a target tone type selected by the user based on a tone modification instruction of the user and adjust the audio features of the voice data based on the tone feature corresponding to the target tone type. In one possible implementation, a plurality of tone type options may be displayed on the dubbing interface. Fig. 6 is a schematic diagram of a display manner of tone type options in a dubbing interface according to an embodiment of the present application. As shown in fig. 6, the dubbing interface may display tone type options such as "loli voice", "broadcast voice" and "yujie (mature female) voice". When the computer device detects that the user selects any one of the tone type options, the computer device may obtain a tone identifier corresponding to the tone type option, determine the tone feature corresponding to the tone type based on the tone identifier, and adjust the audio features of the voice data based on the tone feature.
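A minimal illustrative Python sketch of the lookup chain from a selected tone type option to a tone identifier and then to a tone feature vector may look as follows; the option names and embedding values are placeholder assumptions:

import numpy as np

TIMBRE_OPTIONS = {"loli": "timbre_01", "broadcast": "timbre_02", "yujie": "timbre_03"}
TIMBRE_FEATURES = {
    "timbre_01": np.full(8, 0.9),
    "timbre_02": np.full(8, 0.5),
    "timbre_03": np.full(8, 0.2),
}

def timbre_feature_for(option_name: str) -> np.ndarray:
    """Resolve the selected option to a tone identifier, then to a tone feature vector."""
    timbre_id = TIMBRE_OPTIONS[option_name]
    return TIMBRE_FEATURES[timbre_id]

if __name__ == "__main__":
    feature = timbre_feature_for("broadcast")
    print(feature)   # this vector would be fed back into the synthesizer to re-generate the voice data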
In this embodiment, the description uses an execution sequence in which the playing volume of the voice data is adjusted first and the tone type corresponding to the voice data is adjusted afterwards. In some possible embodiments, the step of adjusting the tone type corresponding to the voice data may also be executed before the step of adjusting the playing volume of the voice data, or the two steps may be executed at the same time, which is not specifically limited in this embodiment of the present application.
205. The computer device adds the voice data to a target video based on a selected target video frame of the plurality of video frames.
In one possible implementation, the step of adding the voice data may be performed when the computer device detects a trigger operation of the user on the confirmation adding control in the dubbing interface. For example, the computer device may add a timestamp to the voice data, where the timestamp may be used to indicate the playing time of the voice data, and the computer device may package and encapsulate the voice data and the target video to complete the addition of the voice data. The embodiment of the present application does not limit which method for synthesizing voice and video is specifically adopted.
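A minimal illustrative Python sketch of one possible way to attach the generated voice at the target frame's playing time, assuming the ffmpeg command line is available (the embodiment does not require ffmpeg or any particular muxing tool) and using placeholder file names and offset, may look as follows:

import subprocess

def add_dub(video_path: str, dub_path: str, out_path: str, start_s: float) -> None:
    """Delay the dub track to the target timestamp and mix it with the video's audio."""
    delay_ms = int(start_s * 1000)
    filter_graph = (
        f"[1:a]adelay={delay_ms}|{delay_ms}[dub];"      # shift the dub to the target playing time
        "[0:a][dub]amix=inputs=2:duration=first[mix]"   # mix with the original audio (assumes the video has audio)
    )
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", dub_path,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mix]", "-c:v", "copy", out_path,
    ], check=True)

if __name__ == "__main__":
    add_dub("target_video.mp4", "voice_data.wav", "dubbed_video.mp4", start_s=5.0)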
In this embodiment, after the computer device completes the addition of the voice data, the starting playing time of the voice data is the same as the playing time of the target video frame.
According to the technical scheme provided by the embodiment of the application, a dubbing interface is displayed, and a plurality of video frames of a target video are displayed on the dubbing interface. Voice data corresponding to text information is generated based on the text information collected on the dubbing interface and a selected tone type, and the audio features of the voice data are determined based on the tone type. The voice data is added to the target video based on a target video frame selected from the plurality of video frames, and the starting playing time of the voice data is the same as the playing time of the target video frame. In the video dubbing process, the text information provided by the user can be converted into dubbing with a specific timbre and added to the video without manual dubbing, so that the dubbing efficiency is improved, and the video production efficiency can be improved.
In the embodiments of the application, a user can obtain voice data corresponding to the input text through a simple text input operation and complete the dubbing of a video, which improves video dubbing efficiency. The user can also adjust the tone color, the volume and the like of the voice data, which enhances the interest of video production, and the dubbing can be superimposed on background music to produce diversified and personalized dubbing effects.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 7 is a schematic structural diagram of a video dubbing apparatus according to an embodiment of the present application, and referring to fig. 7, the apparatus includes:
a display module 701, configured to display a dubbing interface, where multiple video frames of a target video are displayed on the dubbing interface;
a voice generating module 702, configured to generate voice data corresponding to the text information based on the text information collected in the dubbing interface and the selected tone type, where an audio feature of the voice data is determined based on the tone type;
the voice adding module 703 is configured to add the voice data to the target video based on the selected target video frame in the plurality of video frames, where an initial playing time of the voice data is the same as a playing time of the target video frame.
In one possible implementation, the speech generation module 702 is configured to:
acquiring a tone characteristic corresponding to the tone type;
acquiring a phoneme sequence corresponding to the text information;
and generating the voice data based on the phoneme sequence and the tone characteristics.
In one possible implementation, the speech generation module 702 is configured to:
performing emotion recognition on the text information to obtain emotion characteristics corresponding to the text information, wherein the emotion characteristics are used for indicating the emotion information corresponding to the text information;
and generating the voice data based on the phoneme sequence, the timbre features and the emotion features, wherein the audio features of the voice data change based on changes in the emotion information.
In one possible implementation, the speech generation module 702 is configured to:
acquiring the audio characteristics of background music in the target video;
and generating the voice data based on the phoneme sequence, the tone color characteristic and the audio characteristic of the background music, wherein the audio characteristic of the voice data changes based on the change of the audio characteristic of the background music.
In one possible implementation, the apparatus further includes:
the receiving module is used for receiving an editing instruction of the voice data, wherein the editing instruction carries first volume information and second volume information;
and the volume adjusting module is used for adjusting the playing volume of the voice data based on the first volume information and adjusting the playing volume of the target video based on the second volume information.
In one possible implementation, the apparatus further includes:
the image generation module is used for generating a target image corresponding to the text information when a text adding instruction is received;
and the image adding module is used for adding the target image to the target position of the target video frame.
According to the device provided by the embodiment of the application, a dubbing interface is displayed, and a plurality of video frames of a target video are displayed on the dubbing interface. Voice data corresponding to text information is generated based on the text information collected on the dubbing interface and a selected tone type, and the audio features of the voice data are determined based on the tone type. The voice data is added to the target video based on a target video frame selected from the plurality of video frames, and the starting playing time of the voice data is the same as the playing time of the target video frame. By applying the video dubbing device, the text information provided by the user can be converted into dubbing with a specific timbre and added to the video without manual dubbing, so that the dubbing efficiency is improved, and the video production efficiency can be further improved.
It should be noted that, when the video dubbing apparatus provided in the above embodiment performs video dubbing, the division of the above functional modules is only taken as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the video dubbing apparatus and the video dubbing method provided in the above embodiments belong to the same concept, and the specific implementation process thereof is described in detail in the method embodiments and is not described herein again.
The computer device provided by the above technical solution may be implemented as a terminal or a server. For example, fig. 8 is a schematic structural diagram of a terminal provided in an embodiment of the present application. The terminal 800 may be a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal and desktop terminal.
In general, the terminal 800 includes: one or more processors 801 and one or more memories 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one program code for execution by processor 801 to implement the video dubbing method provided by the method embodiments herein.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a display screen 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, providing the front panel of the terminal 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the terminal 800 or in a folded design; in some embodiments, display 805 may be a flexible display disposed on a curved surface or a folded surface of terminal 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 can be made of LCD (liquid crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia or the Galileo system of the European Union.
Power supply 809 is used to provide power to various components in terminal 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power source 809 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side frames of terminal 800 and/or underneath display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the holding signal of the user to the terminal 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used to collect a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings and the like. The fingerprint sensor 814 may be disposed on the front, back or side of the terminal 800. When a physical button or a vendor logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, processor 801 may control the display brightness of display 805 based on the ambient light intensity collected by optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the display screen 805 is increased; when the ambient light intensity is low, the display brightness of the display 805 is reduced. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the display 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the display 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 900 may include one or more processors (CPUs) 901 and one or more memories 902, where at least one program code is stored in the one or more memories 902, and is loaded and executed by the one or more processors 901 to implement the methods provided by the foregoing method embodiments. Certainly, the server 900 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the server 900 may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory, comprising at least one program code executable by a processor to perform the video dubbing method in the above embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or implemented by at least one program code associated with hardware, where the program code is stored in a computer readable storage medium, such as a read only memory, a magnetic or optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for dubbing video, the method comprising:
displaying a dubbing interface on which a plurality of video frames of a target video are displayed;
generating voice data corresponding to the text information based on the text information collected on the dubbing interface and the selected tone type, wherein the audio features of the voice data are determined based on the tone type;
and adding the voice data into the target video based on the selected target video frame in the plurality of video frames, wherein the starting playing time of the voice data is the same as the playing time of the target video frame.
2. The method of claim 1, wherein generating the voice data corresponding to the text information based on the collected text information and the selected tone type comprises:
acquiring tone features corresponding to the tone type;
acquiring a phoneme sequence corresponding to the text information;
and generating the voice data based on the phoneme sequence and the tone features.
3. The method of claim 2, wherein generating the voice data based on the phoneme sequence and the tone features comprises:
performing emotion recognition on the text information to obtain emotion features corresponding to the text information, wherein the emotion features are used for indicating the emotion information corresponding to the text information;
and generating the voice data based on the phoneme sequence, the tone features and the emotion features, wherein the audio features of the voice data change based on changes in the emotion information.
4. The method of claim 2, wherein generating the voice data based on the phoneme sequence and the tone features comprises:
acquiring the audio features of the background music in the target video;
and generating the voice data based on the phoneme sequence, the tone features and the audio features of the background music, wherein the audio features of the voice data change based on changes in the audio features of the background music.
5. The method of claim 1, wherein after generating the voice data corresponding to the text information based on the collected text information and the selected tone type, the method further comprises:
receiving an editing instruction of the voice data, wherein the editing instruction carries first volume information and second volume information;
and adjusting the playing volume of the voice data based on the first volume information, and adjusting the playing volume of the target video based on the second volume information.
6. The method of claim 1, wherein after displaying the dubbing interface, the method further comprises:
when a text adding instruction is received, generating a target image corresponding to the text information;
adding the target image to a target position of the target video frame.
7. A video dubbing apparatus, the apparatus comprising:
the display module is used for displaying a dubbing interface, wherein a plurality of video frames of a target video are displayed on the dubbing interface;
the voice generation module is used for generating voice data corresponding to text information based on the text information collected on the dubbing interface and the selected tone type, wherein the audio features of the voice data are determined based on the tone type;
and the voice adding module is used for adding the voice data into the target video based on the selected target video frame in the plurality of video frames, wherein the starting playing time of the voice data is the same as the playing time of the target video frame.
8. The apparatus of claim 7, wherein the voice generation module is configured to:
acquire tone features corresponding to the tone type;
acquire a phoneme sequence corresponding to the text information;
and generate the voice data based on the phoneme sequence and the tone features.
9. A computer device comprising a processor and a memory, wherein the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to perform the operations performed by the video dubbing method of any one of claims 1 to 6.
10. A computer-readable storage medium having at least one program code stored therein, wherein the at least one program code is loaded and executed by a processor to perform the operations performed by the video dubbing method of any one of claims 1 to 6.
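To make the claimed flow concrete, the following sketch strings the main steps of claims 1 to 4 together: text collected on the dubbing interface is converted to a phoneme sequence, combined with tone features for the selected tone type and with emotion features recognized from the text, and the resulting voice data is attached to the target video starting at the play time of the selected target video frame. Every function name, data class, preset value, and the synthesis step itself is an illustrative placeholder, not the implementation disclosed above.

```python
# Illustrative sketch of the dubbing flow in claims 1-4 (all helpers are hypothetical).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VoiceData:
    samples: List[float]              # synthesized waveform (placeholder)
    audio_features: Dict[str, float]  # e.g. pitch/speed derived from the tone type and emotion

@dataclass
class DubbedTrack:
    voice: VoiceData
    start_time_s: float               # equals the play time of the target video frame

def text_to_phonemes(text: str) -> List[str]:
    """Placeholder grapheme-to-phoneme step (claim 2)."""
    return list(text)                 # stand-in: one pseudo-phoneme per character

def recognize_emotion(text: str) -> Dict[str, float]:
    """Placeholder emotion recognition (claim 3)."""
    return {"happy": 1.0} if "!" in text else {"neutral": 1.0}

def synthesize(phonemes: List[str], tone_features: Dict[str, float],
               emotion: Dict[str, float]) -> VoiceData:
    """Placeholder synthesis combining the phoneme sequence, tone features and emotion features."""
    samples = [0.0] * (len(phonemes) * 160)   # dummy waveform, 160 samples per pseudo-phoneme
    features = dict(tone_features, **{f"emotion_{k}": v for k, v in emotion.items()})
    return VoiceData(samples=samples, audio_features=features)

def dub(text: str, tone_type: str, target_frame_time_s: float) -> DubbedTrack:
    presets = {"female_sweet": {"pitch": 1.2, "speed": 1.0},     # hypothetical tone presets
               "male_deep": {"pitch": 0.8, "speed": 0.95}}
    voice = synthesize(text_to_phonemes(text), presets[tone_type], recognize_emotion(text))
    return DubbedTrack(voice=voice, start_time_s=target_frame_time_s)

track = dub("Welcome to the show!", "female_sweet", target_frame_time_s=12.4)
print(track.start_time_s, track.voice.audio_features)
```

The same skeleton extends to claim 4 by passing audio features extracted from the background music into the synthesis step alongside the emotion features.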
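Claims 5 and 6 add two editing operations: adjusting the play volume of the voice data and of the target video from an editing instruction, and rendering the text information as an image placed at a target position of the target video frame. A rough sketch of the volume part of claim 5, again with hypothetical names and a naive sample-by-sample mix, might look like this:

```python
# Rough sketch of claim 5: apply separate gains to the voice data and the video audio, then mix.
from typing import List

def apply_volume(samples: List[float], volume: float) -> List[float]:
    """Scale a waveform by a volume factor in [0.0, 1.0]."""
    return [s * volume for s in samples]

def mix(voice: List[float], video_audio: List[float],
        voice_volume: float, video_volume: float) -> List[float]:
    """Mix the voice data into the video's audio with per-track volumes."""
    voice = apply_volume(voice, voice_volume)               # first volume information
    video_audio = apply_volume(video_audio, video_volume)   # second volume information
    length = max(len(voice), len(video_audio))
    voice += [0.0] * (length - len(voice))                  # pad the shorter track with silence
    video_audio += [0.0] * (length - len(video_audio))
    return [v + a for v, a in zip(voice, video_audio)]

mixed = mix([0.5, 0.5, 0.5], [0.2, 0.2, 0.2, 0.2], voice_volume=0.8, video_volume=0.3)
print(mixed)
```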
CN201911248806.1A 2019-12-09 2019-12-09 Video dubbing method and device, computer equipment and computer-readable storage medium Pending CN110933330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911248806.1A CN110933330A (en) 2019-12-09 2019-12-09 Video dubbing method and device, computer equipment and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN110933330A true CN110933330A (en) 2020-03-27

Family

ID=69857584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911248806.1A Pending CN110933330A (en) 2019-12-09 2019-12-09 Video dubbing method and device, computer equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110933330A (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030163316A1 (en) * 2000-04-21 2003-08-28 Addison Edwin R. Text to speech
CN101189657A (en) * 2005-05-31 2008-05-28 皇家飞利浦电子股份有限公司 A method and a device for performing an automatic dubbing on a multimedia signal
CN103650002A (en) * 2011-05-06 2014-03-19 西尔股份有限公司 Video generation based on text
CN102201233A (en) * 2011-05-20 2011-09-28 北京捷通华声语音技术有限公司 Mixed and matched speech synthesis method and system thereof
US20150046164A1 (en) * 2013-08-07 2015-02-12 Samsung Electronics Co., Ltd. Method, apparatus, and recording medium for text-to-speech conversion
CN103440862A (en) * 2013-08-16 2013-12-11 北京奇艺世纪科技有限公司 Method, device and equipment for synthesizing voice and music
US20160329043A1 (en) * 2014-01-21 2016-11-10 Lg Electronics Inc. Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
CN105611404A (en) * 2015-12-31 2016-05-25 北京东方云图科技有限公司 Method and device for automatically adjusting audio volume according to video application scenes
CN108510975A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 System and method for real-time neural text-to-speech
WO2018227169A1 (en) * 2017-06-08 2018-12-13 Newvoicemedia Us Inc. Optimal human-machine conversations using emotion-enhanced natural speech
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN108744521A (en) * 2018-06-28 2018-11-06 网易(杭州)网络有限公司 The method and device of game speech production, electronic equipment, storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109215655A (en) * 2018-10-30 2019-01-15 维沃移动通信有限公司 The method and mobile terminal of text are added in video
CN109819314A (en) * 2019-03-05 2019-05-28 广州酷狗计算机科技有限公司 Audio/video processing method, device, terminal and storage medium

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475675B (en) * 2020-04-07 2023-03-24 深圳市超高清科技有限公司 Video processing system
CN111475675A (en) * 2020-04-07 2020-07-31 深圳市超高清科技有限公司 Video processing system
CN111611208A (en) * 2020-05-27 2020-09-01 北京太极华保科技股份有限公司 File storage and query method and device and storage medium
CN112188266A (en) * 2020-09-24 2021-01-05 北京达佳互联信息技术有限公司 Video generation method and device and electronic equipment
CN115037975B (en) * 2021-02-24 2024-03-01 花瓣云科技有限公司 Method for dubbing video, related equipment and computer readable storage medium
CN115037975A (en) * 2021-02-24 2022-09-09 花瓣云科技有限公司 Method for dubbing video, related equipment and computer-readable storage medium
CN114363691A (en) * 2021-04-22 2022-04-15 南京亿铭科技有限公司 Speech subtitle synthesis method, apparatus, computer device, and storage medium
WO2022228553A1 (en) * 2021-04-30 2022-11-03 北京字跳网络技术有限公司 Video processing method and apparatus, electronic device, and storage medium
CN113421577A (en) * 2021-05-10 2021-09-21 北京达佳互联信息技术有限公司 Video dubbing method and device, electronic equipment and storage medium
CN113411655A (en) * 2021-05-18 2021-09-17 北京达佳互联信息技术有限公司 Method and device for generating video on demand, electronic equipment and storage medium
CN113691854A (en) * 2021-07-20 2021-11-23 阿里巴巴达摩院(杭州)科技有限公司 Video creation method and device, electronic equipment and computer program product
CN113556578A (en) * 2021-08-03 2021-10-26 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN113556578B (en) * 2021-08-03 2023-10-20 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN113535116A (en) * 2021-08-05 2021-10-22 广州酷狗计算机科技有限公司 Audio file playing method and device, terminal and storage medium
CN113823300A (en) * 2021-09-18 2021-12-21 京东方科技集团股份有限公司 Voice processing method and device, storage medium and electronic equipment
CN113823300B (en) * 2021-09-18 2024-03-22 京东方科技集团股份有限公司 Voice processing method and device, storage medium and electronic equipment
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device
CN113873294A (en) * 2021-10-19 2021-12-31 深圳追一科技有限公司 Video processing method and device, computer storage medium and electronic equipment
CN114501098A (en) * 2022-01-06 2022-05-13 北京达佳互联信息技术有限公司 Subtitle information editing method and device and storage medium
CN114501098B (en) * 2022-01-06 2023-09-26 北京达佳互联信息技术有限公司 Subtitle information editing method, device and storage medium

Similar Documents

Publication Publication Date Title
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN111031386B (en) Video dubbing method and device based on voice synthesis, computer equipment and medium
CN109147757B (en) Singing voice synthesis method and device
CN109346111B (en) Data processing method, device, terminal and storage medium
CN109192218B (en) Method and apparatus for audio processing
CN110688082B (en) Method, device, equipment and storage medium for determining adjustment proportion information of volume
CN112116904B (en) Voice conversion method, device, equipment and storage medium
CN111524501A (en) Voice playing method and device, computer equipment and computer readable storage medium
CN111061405B (en) Method, device and equipment for recording song audio and storage medium
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN110992927A (en) Audio generation method and device, computer readable storage medium and computing device
CN109065068B (en) Audio processing method, device and storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN111276122A (en) Audio generation method and device and storage medium
CN113420177A (en) Audio data processing method and device, computer equipment and storage medium
CN110600034B (en) Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN111223475B (en) Voice data generation method and device, electronic equipment and storage medium
CN110798327A (en) Message processing method, device and storage medium
CN108763521B (en) Method and device for storing lyric phonetic notation
CN111428079A (en) Text content processing method and device, computer equipment and storage medium
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200327