WO2022228179A1 - Video processing method and apparatus, electronic device, and storage medium

Info

Publication number: WO2022228179A1
Application number: PCT/CN2022/087381
Authority: WO
Inventors: 杜育璋; 刘坚; 李磊; 王明轩
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2021-04-29
Filing date: 2022-04-18
Publication date: 2022-11-03
Also published as: CN113207044A

Abstract

Provided are a video processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring first subtitles in an original video; translating the first subtitles to obtain second subtitles; determining a target speech rate for dubbing; and according to the target speech rate for dubbing, generating a dubbing audio corresponding to the second subtitles.

Description

Video processing method, apparatus, electronic device and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 202110472124.X, filed on April 29, 2021 and entitled "Video processing method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the field of information technology, and in particular, to a video processing method, apparatus, electronic device, and storage medium.

Background

With the development of information technology, terminals have become indispensable devices in people's lives. For example, users can watch videos on a terminal.

Some videos are in a foreign language, and users may not understand the audio content of such videos. The existing approach is to display subtitles that the user can read in the video, but in some cases the speed at which the user reads the subtitles may not match the speed at which the subtitles are displayed, which degrades the user experience.

Summary

In order to solve the above technical problems or at least partially solve the above technical problems, embodiments of the present disclosure provide a video processing method, apparatus, electronic device, and storage medium.

An embodiment of the present disclosure provides a video processing method, the method including:

acquiring a first subtitle in an original video;

translating the first subtitle to obtain a second subtitle;

determining a target speech rate for dubbing; and

generating dubbing audio corresponding to the second subtitle according to the target speech rate for dubbing.

Embodiments of the present disclosure also provide a video processing apparatus, including:

an acquisition module for acquiring the first subtitle in the original video;

a translation module for translating the first subtitle to obtain a second subtitle;

a determination module for determining the target speech rate for dubbing; and

a dubbing module, configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate for dubbing.

Embodiments of the present disclosure also provide an electronic device, the electronic device comprising:

one or more processors;

a storage device for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method described above.

Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the video processing method described above.

Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions that, when executed by a processor, implement the video processing method described above.

Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least the following advantages. The technical solution acquires the first subtitle in the original video, translates the first subtitle to obtain the second subtitle, determines the target speech rate for dubbing, and generates dubbing audio corresponding to the second subtitle according to the target speech rate for dubbing. In this way, dubbing audio that the video viewer can understand is generated, which makes the video content easier to understand and improves the user experience.

In addition, in the video processing method provided by the embodiments of the present disclosure, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted by determining the display duration of the target subtitle and the target speech rate for dubbing, so that the duration of the dubbing audio is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that sentences expressing the same meaning in different languages may differ in length, causing a mismatch between the dubbing duration and the subtitle display duration, and thereby improves the user experience.

Description of drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent when taken in conjunction with the accompanying drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIG. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of another video processing method provided by an embodiment of the present disclosure;

FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure;

FIG. 5 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure;

FIG. 6 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure;

FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are only for exemplary purposes, and are not intended to limit the protection scope of the present disclosure.

It should be understood that the various steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules or units.

It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are only for illustrative purposes, and are not intended to limit the scope of these messages or information.

FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure. This embodiment is applicable to dubbing a video on a client. The method can be executed by a video processing apparatus, which can be implemented in software and/or hardware and configured in an electronic device such as a terminal, including but not limited to a smartphone, a PDA, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one computer, or a smart home device. Alternatively, this embodiment is applicable to dubbing a video on a server, in which case the method can be executed by a video processing apparatus implemented in software and/or hardware and configured in an electronic device such as a server.

As shown in Figure 1, the method may specifically include:

S1. Obtain the first subtitle in the original video.

The first subtitle is in the same language as that spoken by the characters in the original video. Exemplarily, if the characters in the video speak English, the first subtitle is in English.

There are various implementation methods of this step, which are not limited in this application. Exemplarily, if the original video includes the first subtitle, the first subtitle can be directly extracted.

Or, if the original video does not include the first subtitle, speech recognition is performed on any piece of audio in the original video to obtain the first subtitle. In this case, the first subtitle is a subtitle corresponding to any audio segment in the original video.

Here, any piece of audio refers to audio information corresponding to any sentence spoken by any character in the video.

Specifically, a video includes an audio stream and a video stream. The video stream includes multiple image frames, which are played in chronological order to form the moving picture of the video. The characters in the video speak during some time periods and are silent during others. The audio stream is composed of multiple pieces of audio, and each piece of audio corresponds to a sentence spoken by a character in the video. A piece of audio corresponds to multiple image frames.

Exemplarily, FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure. Referring to Figure 2, it is assumed that the character in the video speaks during the t0 time period and the t3 time period, and does not speak during the rest of the time period. Then the image frame in the time period t0 corresponds to a continuous speech, and the continuous speech corresponding to the time period t0 constitutes a piece of audio. The image frame in the time period t3 corresponds to another continuous speech, and the continuous speech corresponding to the time period t3 constitutes a piece of audio.

Optionally, when this step is performed, speech recognition is performed on any piece of audio in the original video through speech recognition technology to obtain the first subtitle. For example, each continuous sentence of Chinese speech can be recognized as a continuous sentence of Chinese subtitles according to the speech pauses in the audio.

S2. Translate the first subtitle to obtain the second subtitle.

The second subtitle is in a different language than the first subtitle. Exemplarily, the first subtitle is in Chinese, and the second subtitle is in English. The second subtitle is in a language understood by the viewer of the video. In actual setting, the second subtitle can be set according to the needs of the video viewer.

S3. Determine the target speech rate of the dubbing.

Those skilled in the art can understand that, in order to realize the purpose of dubbing the original video, the second subtitle needs to be read at a certain speech rate. In this step, the target speech rate of dubbing refers to the speech rate of dubbing used for reading the second subtitle.

Optionally, data for a plurality of dubbing timbres is pre-stored in a database, and each dubbing timbre has a corresponding default speech rate. When this step is performed, a dubbing timbre is selected, and the default speech rate corresponding to the selected timbre, or a speech rate obtained by adjusting that default speech rate, is used as the target speech rate for dubbing.

It should be noted that, when this step is performed, it is necessary to ensure that the video viewer can hear the second subtitle clearly when it is read at the determined target speech rate.

S4. Generate dubbing audio corresponding to the second subtitle according to the target speech rate of dubbing.

The technical solution provided by the embodiments of the present disclosure acquires the first subtitle in the original video, translates the first subtitle to obtain the second subtitle, determines the target speech rate for dubbing, and generates dubbing audio corresponding to the second subtitle according to the target speech rate. In this way, dubbing audio that the video viewer can understand is generated, which makes the video content easier to understand and improves the user experience.

Optionally, the dubbing audio obtained by the above technical solution of the present application may be played alone, or the audio in the original video may be replaced with the dubbing audio to obtain a dubbed video. Alternatively, the original video may be played synchronously with the dubbing audio, so that the video has a dubbing effect when played.
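The four steps S1 to S4 can be read as a simple pipeline. The following is only a minimal sketch: the names `dub_segment` and `DubbingResult`, and the four callables passed in, are hypothetical placeholders for whatever speech recognition, translation, rate selection, and speech synthesis components an implementation actually uses. The disclosure does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingResult:
    first_subtitle: str    # text in the language spoken in the original video
    second_subtitle: str   # translated text
    target_rate: float     # target speech rate (unit is an assumption, e.g. words/s)
    dubbing_audio: bytes   # synthesized audio for the second subtitle


def dub_segment(
    audio_segment: bytes,
    recognize: Callable[[bytes], str],          # hypothetical speech recognizer
    translate: Callable[[str], str],            # hypothetical translator
    choose_rate: Callable[[str], float],        # hypothetical rate selection
    synthesize: Callable[[str, float], bytes],  # hypothetical text-to-speech
) -> DubbingResult:
    first = recognize(audio_segment)   # S1: obtain the first subtitle
    second = translate(first)          # S2: translate to get the second subtitle
    rate = choose_rate(second)         # S3: determine the target speech rate
    audio = synthesize(second, rate)   # S4: generate the dubbing audio
    return DubbingResult(first, second, rate, audio)
```

Passing the components in as callables keeps the sketch independent of any particular recognition, translation, or synthesis library.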

FIG. 3 is a flowchart of a video processing method provided by an embodiment of the present disclosure. FIG. 3 is a specific example of FIG. 1 . Referring to Figure 3, the method includes:

S110. Perform speech recognition on any piece of audio in the original video to obtain a first subtitle.

S120. Translate the first subtitle to obtain a second subtitle.

S130. Determine the display duration of the target subtitle and the target speech rate of the dubbing, where the target subtitle includes the first subtitle and/or the second subtitle.

Those skilled in the art can understand that, to dub the original video, dubbing audio must be generated based on the second subtitle, and the duration of the dubbing audio must match the playback duration of the image frames corresponding to the dubbing audio (which can also be understood as the display duration of the first subtitle and/or the second subtitle, that is, the display duration of the target subtitle here). If the dubbing audio is longer than the playback duration of the corresponding image frames, the image frames finish playing before the dubbing audio ends. If the dubbing audio is shorter than the playback duration of the corresponding image frames, the dubbing audio ends while the image frames are still being displayed. Either situation puts the picture out of sync with the audio and degrades the user experience.

To make the dubbing audio duration equal to the playback duration of the image frame corresponding to the dubbing audio, two parameters must be specified first, one is the dubbing audio duration, and the other is the playback duration of the image frame corresponding to the dubbing audio.

Corresponding to the first parameter, the duration of the dubbing audio mainly depends on two quantities, one is the length of the second subtitle, and the other is the speech rate of the dubbing. Since the second subtitle has been obtained in S120, the length of the second subtitle is constant in this step, and the duration of the dubbing audio at this time mainly depends on the speech rate of the dubbing.
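The dependence of the dubbing duration on subtitle length and speech rate can be illustrated as follows. This is a sketch under stated assumptions: subtitle length is measured in whitespace-separated words and speech rate in words per second, and the function names are hypothetical. The disclosure also allows other length measures, such as syllables.

```python
def subtitle_length(subtitle: str) -> int:
    """Length of a subtitle, measured here as a word count (an assumption)."""
    return len(subtitle.split())


def estimate_dubbing_duration(subtitle: str, words_per_second: float) -> float:
    """Estimated dubbing duration in seconds: length divided by speech rate."""
    if words_per_second <= 0:
        raise ValueError("speech rate must be positive")
    return subtitle_length(subtitle) / words_per_second
```

With the subtitle fixed, the only remaining lever on the duration is the speech rate, which is why S130 focuses on choosing it.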

Therefore, the essence of this step is to determine an appropriate display duration of the target subtitle and an appropriate target speech rate for dubbing, so that the duration of the dubbing audio corresponding to the second subtitle is consistent with the display duration of the target subtitle within the allowable error range. In some embodiments, the order in which the display duration of the target subtitle and the target speech rate for dubbing are determined is not limited. In addition, determining the display duration of the target subtitle and determining the target speech rate for dubbing may be two independent processes, or may be interrelated processes.

S140. Generate dubbing audio corresponding to the second subtitle according to the target speech rate of dubbing.

The essence of this step is to read the text in the second subtitle at the target speech rate determined in S130, thereby producing the dubbing audio corresponding to the second subtitle.

S150. Replace any piece of audio in the original video with dubbed audio to obtain a target video, and display the target subtitle in a picture corresponding to the display duration of the target subtitle in the target video.

Exemplarily, if a piece of audio in the original video is Chinese audio and the dubbing audio is English audio, the target video is obtained by replacing the Chinese audio in the original video with the English audio, and Chinese subtitles and/or English subtitles are added to the picture of the video frames.

The essence of the above technical solution of the present application is that, by determining the display duration of the target subtitle and the target speech rate for dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted so that the duration of the dubbing audio is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that sentences expressing the same meaning in different languages may differ in length, causing a mismatch between the dubbing duration and the subtitle display duration, and thereby improves the user experience.

On the basis of the above technical solution, S150 can optionally be replaced by: generating an audio file according to the dubbing audio corresponding to each piece of audio in the original video, where the audio file includes the dubbing audio corresponding to each piece of audio in the original video and time information for each dubbing audio. When the original video is played, each dubbing audio is called and played in sequence according to the current playback progress of the original video and the time information of each dubbing audio. Further, the time information of each dubbing audio includes the start time and/or the end time of the dubbing audio.

Exemplarily, if multiple audio files are generated based on an original video, each audio file includes the start time of its dubbing audio. Assume that the start time of dubbing audio A is the 12th second, counted from the playback of the first image frame of the video. When playback of the video reaches the 12th second, dubbing audio A is called and played. Optionally, in this approach, the original audio of the original video is removed. Further, when removing the original audio, only the speech of the characters in the original video is removed, and the background sound of the original video is retained.
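Calling each dubbing audio according to the playback progress and its start time could be sketched as below. The `DubClip` record and the `clips_due` function are hypothetical names introduced for illustration, and polling the playback position between two successive player ticks is an assumed design, not something stated in the disclosure.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class DubClip:
    start: float  # start time in seconds from the first image frame
    name: str     # identifier of the dubbing audio (hypothetical)


def clips_due(clips: List[DubClip], previous_time: float, current_time: float) -> List[DubClip]:
    """Return the clips whose start time falls in (previous_time, current_time]."""
    ordered = sorted(clips, key=lambda c: c.start)
    starts = [c.start for c in ordered]
    lo = bisect_right(starts, previous_time)
    hi = bisect_right(starts, current_time)
    return ordered[lo:hi]


# Usage: dubbing audio A starts at the 12th second of the video.
schedule = [DubClip(3.0, "intro"), DubClip(12.0, "A"), DubClip(20.5, "B")]
due = clips_due(schedule, previous_time=11.9, current_time=12.0)
```

Each clip is returned exactly once as long as successive calls use contiguous time windows.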

Further, a play button or icon corresponding to the audio file may be displayed in the user interface for playing the target video. When the button or icon is off, the audio in the target video is still the audio in the original video, that is, the audio in the original video is not replaced with the corresponding dubbing audio. When the button or icon is on, the audio in the original video is replaced with the corresponding dubbing audio, that is, the audio in the target video becomes the dubbing audio. Alternatively, when the button or icon is in an on state, the terminal can play the dubbing audio alone.

On the basis of the above technical solutions, it is further analyzed how to determine the display duration of the target subtitles and the target speech rate of the dubbing when S130 is executed. It can be found that in practice, there are mainly two situations:

Case 1: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle. The duration of the dubbed audio and the duration of the audio in the original video are originally the same within the allowable error range. In this case, it can be directly determined that the display duration of the target subtitle is the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.

Wherein, optionally, a plurality of dubbing timbre data are pre-stored in the database, and different dubbing timbre data have corresponding default speech rate data. The default speech rate of dubbing is the default speech rate corresponding to the selected dubbing timbre, which is obtained based on the dubbing timbre data stored in the database.

Case 2: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle. The duration of the dubbed audio and the duration of the audio in the original video are originally quite different, and cannot be considered to be consistent within the allowable range of errors. In this case, the display duration of the target subtitle can be determined based on the duration of the audio in the original video; at the same time, the target speech rate of the dubbing can be determined based on the default speech rate of the dubbing.

FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the above case one. Referring to Figure 4, the method includes:

S1311. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.

S1312. If the duration of any piece of audio is greater than or equal to the default duration, and the first difference between the duration of that piece of audio and the default duration is less than or equal to the first threshold, then the display duration of the target subtitle is the time corresponding to that piece of audio, and the target speech rate for dubbing is the default speech rate.

The "first difference between the duration of any piece of audio and the default duration" may be the absolute value of the difference between the duration of any piece of audio and the default duration, or may be the ratio of the duration of any piece of audio to the default duration.

The essence of this setting is that, within the allowable range of errors, if the duration of any audio segment is consistent with the default duration of its corresponding dubbing, the display duration of the target subtitle is directly determined to be the time corresponding to this segment of audio, and the target speech rate of the dubbing is the default speech rate.
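The check in S1311 and S1312 can be sketched as a small predicate. The function name, parameter names, and the `use_ratio` switch are hypothetical; the two ways of computing the first difference (absolute difference or ratio) follow the text above, and the threshold values are left to the caller.

```python
def case_one_applies(
    audio_duration: float,
    default_duration: float,
    first_threshold: float,
    use_ratio: bool = False,
) -> bool:
    """True if the default speech rate and the original timing can be kept as-is."""
    if audio_duration < default_duration:
        return False  # the dubbing would overrun the original audio (case two)
    if use_ratio:
        first_difference = audio_duration / default_duration
    else:
        first_difference = abs(audio_duration - default_duration)
    return first_difference <= first_threshold
```

For example, a 2.0 s original segment with a 1.9 s default dubbing duration passes an absolute threshold of 0.2 s, while a 1.5 s default dubbing duration does not.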

FIG. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 5, the method includes:

S1321. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.

S1322. If the duration of any piece of audio is less than the default duration, determine whether the length of the second subtitle is within a preset range.

S1323. If the length of the second subtitle is within the preset range, increase the display duration of the target subtitle and/or increase the target speech rate for dubbing, so that a second difference between (a) the product of the display duration of the target subtitle and the target speech rate for dubbing and (b) the length of the second subtitle is less than or equal to a second threshold.

The display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio after adjustment.

Those skilled in the art can understand that, if the display duration of the target subtitle is increased when S1323 is executed, it cannot be increased without limit. If the increased display duration is too long (exceeding a certain limit), the image frames corresponding to this piece of audio switch too slowly while the image frames corresponding to other pieces of audio switch at normal speed, which makes the video as a whole feel uneven and affects the user experience. Therefore, the maximum display duration of the target subtitle can be limited by setting a duration adjustment parameter.

Specifically, if the initial display duration of a group of subtitles (that is, the duration of the audio corresponding to the group of subtitles in the original video) is T1, the duration adjustment parameter is x1, and the display duration of the subtitles after adjustment is T2, then T2 = T1/x1, where x1 is a number less than 1. Setting the minimum value of the duration adjustment parameter x1 limits the maximum display duration of the subtitles after adjustment.

Optionally, there are multiple implementation methods for "increasing the display duration of the target subtitles". Exemplarily, the implementation method for increasing the display duration of the target subtitles includes: reducing the display speed of a picture corresponding to any piece of audio.

Further, if the method of reducing the display speed of a picture corresponding to any piece of audio is adopted, the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.

Specifically, assuming that the display speed of the picture corresponding to a certain audio segment in the original video is V1, the display speed adjustment parameter is x1, and the display speed of the picture corresponding to the audio segment after adjustment is V2, then V2=V1·x1. By setting the minimum value of the display speed adjustment parameter x1, the minimum value of the display speed of the picture corresponding to the audio segment after adjustment is limited, thereby limiting the maximum value of the display duration of the target subtitle after adjustment.

Exemplarily, continuing to refer to FIG. 2, assume that 20 image frames correspond to the time period t0 and that t0 = 2 s, so the original display speed V1 of the image frames corresponding to the time period t0 is 20 frames/2 seconds. If the display of the image frames is slowed down, the slowed display speed is denoted V2, with V2 = V1·x1, where x1 is a number less than 1. If the minimum value of x1 is set to 0.9, the slowest display speed of the image frames is 18 frames/2 seconds. In that case the display speed is reduced by 10%; since the total number of image frames corresponding to the time period t0 is still 20, the display duration of the 20 image frames becomes t0/x1. Changing the value of x1 from 1 to 0.9 is therefore equivalent to increasing the display duration of the image frames, that is, increasing the display duration of the group of subtitles.
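The arithmetic of this example can be checked with a short sketch. The function name `slow_down` and its parameters are hypothetical, and the default lower bound of 0.9 for x1 is taken from the example rather than being a required value.

```python
from typing import Tuple


def slow_down(duration_s: float, frame_count: int, x1: float, x1_min: float = 0.9) -> Tuple[float, float]:
    """Return (V2, T2): the slowed display speed in frames/s and the new display duration in s."""
    if not (x1_min <= x1 <= 1.0):
        raise ValueError("x1 must lie between x1_min and 1")
    v1 = frame_count / duration_s  # original display speed V1
    v2 = v1 * x1                   # V2 = V1 * x1
    t2 = frame_count / v2          # same frames at lower speed, equal to T1 / x1
    return v2, t2


# 20 frames over t0 = 2 s, slowed with x1 = 0.9
v2, t2 = slow_down(2.0, 20, 0.9)
```

Here V1 is 10 frames per second, so V2 is 9 frames per second (18 frames/2 seconds) and the 20 frames take 2/0.9, roughly 2.22 seconds.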

Similarly, when S1323 is executed, if the target speech rate of dubbing is selected to be increased, if the increased target speech rate is too fast (over a certain limit), the user may not hear clearly, which will affect the user experience. Therefore, the maximum value of the target speech rate can be limited by setting the speech rate adjustment parameter.

Specifically, it is assumed that the default speech rate corresponding to the dubbing tone selected for a certain piece of audio in the original video is V3, the speech rate adjustment parameter is x2, and the target speech rate corresponding to the dubbing tone after adjustment is V4, then V4=V3 · x2. By setting the maximum value of the speech rate adjustment parameter x2, the maximum value of the target speech rate of the dubbing tone after adjustment is limited.

Exemplarily, the default speech rate of the dubbing timbre is denoted V3, and V3 is fixed. The target speech rate of the dubbing timbre is denoted V4, and the initial value of V4 is V3. The target speech rate of the dubbing timbre can be adjusted. For example, increasing the target speech rate corresponds to V4 = V3·x2, where x2 is a number greater than 1. If the maximum value of x2 is set to 1.1, the target speech rate can be at most 1.1 times the default speech rate, which corresponds to increasing the target speech rate of the dubbing timbre by 10%.

The product L1 of the display duration T2 of the target subtitle and the target speech rate V4 for dubbing can be expressed as L1 = T2·V4 = (T1/x1)·(V3·x2). L1 is the length of text that can be read at the target speech rate within the display duration of the target subtitle. Here, the length of text can be understood as the number of words in the text, the number of syllables in the text, or the like.
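The product L1 can be computed directly from the four quantities. This is a minimal sketch in which the function name `readable_length` is hypothetical and the units (seconds for T1, words per second for V3) are assumptions.

```python
def readable_length(t1: float, x1: float, v3: float, x2: float) -> float:
    """L1 = T2 * V4 = (T1 / x1) * (V3 * x2)."""
    t2 = t1 / x1  # adjusted display duration of the target subtitle
    v4 = v3 * x2  # adjusted target speech rate
    return t2 * v4


# T1 = 2 s, V3 = 3 words/s, with the extreme settings x1 = 0.9 and x2 = 1.1
l1 = readable_length(2.0, 0.9, 3.0, 1.1)
```

With these values the readable length grows from 6 words (no adjustment) to about 7.33 words, which is the headroom gained by slowing the picture by 10% and speeding up the speech by 10%.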

The product of the display duration of the target subtitle and the target speech rate of dubbing is equal to the length of the second subtitle, which means that the length of readable text at the target speech rate of dubbing within the display duration of the target subtitle is exactly equal to the length of the second subtitle. In other words, the time required to read the second subtitle at the target speech rate of dubbing is exactly equal to the display duration of the target subtitle.

Therefore, in S1323, "making the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle be less than or equal to the second threshold" means that within the allowable range of the error, based on the The duration of the dubbing audio generated by the second subtitle is consistent with the display duration of the target subtitle.

The second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the difference between the length of the second subtitle, or the product of the display duration of the target subtitle and the target speech rate of the dubbing and the first subtitle. The ratio of the lengths of the two subtitles.

As mentioned above, to ensure that the video has a good audio-visual effect, the maximum display duration of the target subtitle and the maximum target speech rate need to be limited, which requires the length of the second subtitle to be within a certain range (that is, the "preset range" in S1322).

If the length of the second subtitle is within the preset range, then under the two conditions of "the display duration of the target subtitle is less than or equal to the maximum display duration of the target subtitle" and "the target speech rate of dubbing is less than or equal to the maximum target speech rate", increasing the display duration of the target subtitle and/or increasing the target speech rate of dubbing can make the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle less than or equal to the second threshold.

If the length of the second subtitle is not within the preset range, then under the same two conditions, no matter how the display duration of the target subtitle and/or the target speech rate of dubbing is increased, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle cannot be made less than or equal to the second threshold.

Optionally, if it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.

Optionally, the preset range is related to the default speech rate and the duration of any piece of audio.

Further, the preset range includes an upper limit value (ie, a maximum value), and the upper limit value is related to the default speech rate, the duration of any piece of audio, the minimum value of the duration adjustment parameter, and the maximum value of the speech rate adjustment parameter.

Exemplarily, assuming that the upper limit of the preset range is n1, the default speech rate is V3, the duration of any piece of audio is T1, the minimum value of the duration adjustment parameter is 0.9, and the maximum value of the speech rate adjustment parameter is 1.1, then n1=(V3·1.1)(T1/0.9).
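As a minimal sketch of the upper limit n1=(V3·1.1)(T1/0.9) and of the check that triggers retranslation, the following may be used. The function names `upper_limit` and `needs_retranslation` are hypothetical, and the retranslation itself is outside the scope of the sketch.

```python
def upper_limit(v3, t1, x1_min=0.9, x2_max=1.1):
    """n1 = (V3 * x2_max) * (T1 / x1_min): longest text that can still be dubbed."""
    return (v3 * x2_max) * (t1 / x1_min)


def needs_retranslation(n, v3, t1, x1_min=0.9, x2_max=1.1):
    """Return True if the second subtitle (length n) exceeds the upper limit n1."""
    return n > upper_limit(v3, t1, x1_min, x2_max)
```

For example, with V3=5 and T1=1.8, the upper limit is about 11, so a second subtitle of length 12 would be retranslated, while one of length 10 would not.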

Based on the above technical solutions, when S1323 is executed, there are various specific implementation methods for "increasing the display duration of the target subtitles, and/or increasing the target speech rate of dubbing". Three typical methods are given below.

Method one:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing; if the target speech rate of the dubbing has reached the maximum value and the second difference is still greater than the second threshold, then, on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.

For example, the display duration of a certain group of subtitles is first kept unchanged, that is, T1 is a fixed value and the duration adjustment parameter x1=1. Priority is given to adjusting the speech rate adjustment parameter x2, whose value gradually increases from 1 to 1.1, for example by taking values at intervals in order from 1 to 1.1. When x2 takes a certain value, if (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range, the adjustment of x2 is stopped, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.

If the value of x2 has reached the maximum value of 1.1 but (V3·x2)(T1/x1) still does not equal the length of the text in this group of subtitles within the allowable error range, x2 is fixed at 1.1 and x1 is adjusted. The value of x1 starts from 1 and gradually decreases to 0.9, for example by taking values at intervals in order from 1 to 0.9, until (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
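Method one can be sketched as a two-phase stepwise search. The function name `adjust_rate_first`, the number of steps, and the use of the absolute-value form of the second difference are assumptions for illustration; the interval at which values are taken is not specified above.

```python
def adjust_rate_first(n, t1, v3, threshold, x1_min=0.9, x2_max=1.1, steps=10):
    """Raise x2 first; only if that is not enough, lower x1 (lengthen the display).

    Returns (x1, x2), or None if no sampled combination meets the threshold.
    """
    def diff(x1, x2):
        return abs((t1 / x1) * (v3 * x2) - n)

    # Phase 1: keep x1 = 1 and step x2 from 1 up to x2_max.
    for i in range(steps + 1):
        x2 = 1.0 + (x2_max - 1.0) * i / steps
        if diff(1.0, x2) <= threshold:
            return 1.0, x2

    # Phase 2: fix x2 at its maximum and step x1 from 1 down to x1_min.
    for i in range(1, steps + 1):
        x1 = 1.0 - (1.0 - x1_min) * i / steps
        if diff(x1, x2_max) <= threshold:
            return x1, x2_max

    return None
```

A return value of None corresponds to the situation in which the length of the second subtitle is outside the preset range.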

Method two:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle; if the display duration of the target subtitle has reached the maximum value and the second difference is still greater than the second threshold, then, on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.

For example, the target speech rate of the dubbing tone is first kept unchanged, that is, V3 is a fixed value and the speech rate adjustment parameter x2=1. Priority is given to adjusting the duration adjustment parameter x1, whose value gradually decreases from 1 to 0.9, for example by taking values at intervals in order from 1 to 0.9. When x1 takes a certain value, if (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range, the adjustment of x1 is stopped, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.

If the value of x1 has reached the minimum value of 0.9 but (V3·x2)(T1/x1) still does not equal the length of the text in this group of subtitles within the allowable error range, the value of x2 is further adjusted. The value of x2 gradually increases from 1 to 1.1, for example by taking values at intervals in order from 1 to 1.1, until (V3·x2)(T1/x1) equals the length of the English subtitle in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
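Method two can be sketched by listing the candidate (x1, x2) pairs in the order in which they are tried. The function name `adjust_duration_first` and the step count are hypothetical, and the sketch assumes that x1 stays at its minimum value while x2 is being raised in the second phase.

```python
def adjust_duration_first(n, t1, v3, threshold, x1_min=0.9, x2_max=1.1, steps=10):
    """Lower x1 first (lengthen the display); only then raise x2.

    Returns the first (x1, x2) whose absolute second difference is within
    the threshold, or None if no sampled pair qualifies.
    """
    # Phase 1: x2 = 1, x1 steps from 1 down to x1_min.
    phase1 = [(1.0 - (1.0 - x1_min) * i / steps, 1.0) for i in range(steps + 1)]
    # Phase 2 (assumption): x1 held at x1_min, x2 steps from 1 up to x2_max.
    phase2 = [(x1_min, 1.0 + (x2_max - 1.0) * i / steps) for i in range(1, steps + 1)]

    for x1, x2 in phase1 + phase2:
        if abs((t1 / x1) * (v3 * x2) - n) <= threshold:
            return x1, x2
    return None
```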

Method three:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle. At the same time, on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.

For example, the values of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 are adjusted at the same time: the value of x1 gradually decreases from 1 to 0.9, and the value of x2 gradually increases from 1 to 1.1, until (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.

Further, for method three, multiple combinations of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 may in practice each satisfy (V3·x2)(T1/x1) = the length of the text in this group of subtitles within the allowable error range. In this case, additional filtering conditions can be added, such as requiring that x1+x2 be the smallest, that 2x1+x2 be the smallest, or that x1²+x2² be the smallest, to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
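Method three with an additional filtering condition can be sketched as a grid search over both parameters. The function name `adjust_both`, the grid resolution, and the default cost x1+x2 are assumptions for illustration; any of the filtering conditions listed above could be passed as `cost`.

```python
from itertools import product


def adjust_both(n, t1, v3, threshold, cost=lambda x1, x2: x1 + x2,
                x1_min=0.9, x2_max=1.1, steps=10):
    """Search x1 in [x1_min, 1] and x2 in [1, x2_max] at the same time.

    Among all sampled pairs whose absolute second difference is within the
    threshold, return the one with the smallest cost, or None if none exists.
    """
    x1_values = [1.0 - (1.0 - x1_min) * i / steps for i in range(steps + 1)]
    x2_values = [1.0 + (x2_max - 1.0) * i / steps for i in range(steps + 1)]
    feasible = [
        (x1, x2)
        for x1, x2 in product(x1_values, x2_values)
        if abs((t1 / x1) * (v3 * x2) - n) <= threshold
    ]
    return min(feasible, key=lambda pair: cost(*pair), default=None)
```

Passing, for example, `cost=lambda x1, x2: x1**2 + x2**2` selects the combination with the smallest x1²+x2² instead.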

FIG. 6 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 6, the method includes:

S1331. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.

S1332. If the duration of any piece of audio is greater than the default duration, and the first difference between the duration of any piece of audio and the default duration is greater than the first threshold, determine whether the length of the second subtitle is within a preset range.

S1333. If the length of the second subtitle is within the preset range, reduce the display duration of the target subtitle, and/or reduce the target speech rate of the dubbing, so that the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to the second threshold.
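The decision made in S1331 and S1332, together with the other two cases described in this disclosure, can be sketched as a small classifier. The function name `classify_segment` and the returned labels are hypothetical, and the default duration is assumed to be the length of the second subtitle divided by the default speech rate.

```python
def classify_segment(n, v3, t1, first_threshold):
    """Decide which adjustment, if any, an audio segment needs.

    n: length of the second subtitle; v3: default speech rate;
    t1: duration of the original audio segment.
    """
    default_duration = n / v3  # S1331: time needed at the default speech rate
    if t1 < default_duration:
        return "increase"      # dubbing would overrun: lengthen display / speed up
    if t1 - default_duration <= first_threshold:
        return "keep"          # close enough: keep T1 and the default speech rate
    return "decrease"          # S1332: dubbing too short: shorten display / slow down
```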

Those skilled in the art can understand that, when S1333 is executed and the display duration of the target subtitle is selected to be reduced, if the reduced display duration is too short (exceeds a certain limit), the image frames corresponding to this audio segment are switched too fast while the image frames corresponding to other audio segments are switched at normal speed, which makes the video as a whole disharmonious and affects the user experience. Therefore, the minimum display duration of the target subtitle can be limited by setting a duration adjustment parameter.

Specifically, if the initial display duration of a group of subtitles (that is, the duration of the audio corresponding to the group of subtitles in the original video) is T1, the duration adjustment parameter is x1, and the display duration of the subtitles after adjustment is T2, then T2=T1/x1, where x1 is a number greater than 1. By setting the maximum value of the duration adjustment parameter x1, the minimum value of the display duration of the subtitles after adjustment is limited.

Optionally, there are multiple implementation methods for "reducing the display duration of the target subtitles". Exemplarily, the implementation method for reducing the display duration of the target subtitles includes: increasing the display speed of a picture corresponding to any piece of audio.

Further, if the method of increasing the display speed of the picture corresponding to any piece of audio is adopted, the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.

Specifically, assuming that the display speed of the picture corresponding to a certain audio segment in the original video is V1, the display speed adjustment parameter is x1, and the display speed of the picture corresponding to the audio segment after adjustment is V2, then V2=V1·x1. By setting the maximum value of the display speed adjustment parameter x1, the maximum value of the display speed of the picture corresponding to the audio segment after adjustment is limited, thereby limiting the minimum value of the display duration of the target subtitle after adjustment.

Exemplarily, continuing to refer to FIG. 2, it is assumed that there are 20 image frames corresponding to the time period t0, and t0=2 s, indicating that the original display speed V1 of the image frames corresponding to the time period t0 is 20 frames/2 seconds. If the display speed of the image frames is increased, the increased display speed is recorded as V2, where V2=V1·x1 and x1 is a number greater than 1. If the maximum value of x1 is set to 1.1, the display speed of the image frames can be adjusted to at most 22 frames/2 seconds. Because the total number of frames remains 20, the display duration of the 20 image frames becomes t0/x1. Changing the value of x1 from 1 to 1.1 is therefore equivalent to reducing the display duration of the image frames, that is, reducing the display duration of this group of subtitles.
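The arithmetic of this example can be checked with a short sketch. The function name `adjusted_display` is hypothetical, and the display speed is expressed in frames per second rather than frames per 2 seconds.

```python
def adjusted_display(frame_count, t0, x1):
    """Return (V2, new display duration) after scaling the display speed by x1.

    V1 = frame_count / t0, V2 = V1 * x1, new duration = frame_count / V2 = t0 / x1.
    """
    v1 = frame_count / t0      # original display speed, frames per second
    v2 = v1 * x1               # adjusted display speed
    return v2, frame_count / v2
```

With 20 frames, t0=2 s and x1=1.1, the display speed rises from 10 to about 11 frames per second (22 frames/2 seconds), and the 20 frames are shown in t0/x1, that is, about 1.82 s.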

Similarly, when S1333 is executed and the target speech rate of the dubbing is selected to be reduced, if the reduced target speech rate is too slow (exceeds a certain limit), the speech rate of this audio segment is too slow while the speech rates of other audio segments are normal, which makes the video as a whole disharmonious and affects the user experience. Therefore, the minimum value of the target speech rate can be limited by setting the speech rate adjustment parameter.

Specifically, it is assumed that the default speech rate corresponding to the dubbing tone selected for a certain segment of audio in the original video is V3, the speech rate adjustment parameter is x2, and the speech rate corresponding to the dubbing tone after adjustment is V4; then V4=V3·x2. By setting the minimum value of the speech rate adjustment parameter x2, the minimum value of the target speech rate of the dubbing tone after adjustment is limited.

Exemplarily, the default speech rate of the dubbing tone is denoted as V3, and V3 is fixed. The target speech rate of the dubbing tone is denoted as V4, and the initial value of V4 is V3. The target speech rate of the dubbing tone can be adjusted. For example, reducing the target speech rate of the dubbing tone is equivalent to setting V4=V3·x2, where x2 is a number less than 1. If the minimum value of x2 is set to 0.9, the minimum target speech rate is 0.9 times the default speech rate, which is equivalent to slowing down the target speech rate of the dubbing tone by at most 10%.

Therefore, in S1333, "making the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle be less than or equal to the second threshold" means that, within the allowable error range, the duration of the dubbing audio generated based on the second subtitle is consistent with the display duration of the target subtitle.

As mentioned above, to ensure that the video has a good audio-visual effect, the minimum display duration of the target subtitle and the minimum target speech rate need to be limited, which requires the length of the second subtitle to be within a certain range (that is, the "preset range" in S1332).

If the length of the second subtitle is within the preset range, then under the two conditions of "the display duration of the target subtitle is greater than or equal to the minimum display duration of the target subtitle" and "the target speech rate of dubbing is greater than or equal to the minimum target speech rate", reducing the display duration of the target subtitle and/or reducing the target speech rate of dubbing can make the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle less than or equal to the second threshold.

If the length of the second subtitle is not within the preset range, then under the same two conditions, no matter how the display duration of the target subtitle and/or the target speech rate of dubbing is reduced, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle cannot be made less than or equal to the second threshold.

Further, the preset range includes a lower limit value (ie, a minimum value), and the lower limit value is related to the default speech rate, the duration of any piece of audio, the maximum value of the duration adjustment parameter, and the minimum value of the speech rate adjustment parameter.

Exemplarily, assuming that the lower limit of the preset range is n2, the default speech rate is V3, the duration of any piece of audio is T1, the maximum value of the duration adjustment parameter is 1.1, and the minimum value of the speech rate adjustment parameter is 0.9, then n2=(V3·0.9)(T1/1.1).
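The lower limit value (V3·0.9)(T1/1.1) and the corresponding range check can be sketched as follows. The function names `lower_limit` and `can_be_shortened` are hypothetical.

```python
def lower_limit(v3, t1, x1_max=1.1, x2_min=0.9):
    """(V3 * x2_min) * (T1 / x1_max): shortest text that can still fill the segment."""
    return (v3 * x2_min) * (t1 / x1_max)


def can_be_shortened(n, v3, t1, x1_max=1.1, x2_min=0.9):
    """Return True if a second subtitle of length n is at or above the lower limit."""
    return n >= lower_limit(v3, t1, x1_max, x2_min)
```

For example, with V3=5 and T1=2.2, the lower limit is about 9, so a second subtitle of length 8 cannot be matched by reducing the display duration and the speech rate within their limits.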

On the basis of the above technical solution, when S1333 is executed, there are various specific implementation methods for "reducing the display duration of the target subtitles, and/or reducing the target speech rate of dubbing". Three typical methods are given below.

Method one:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing; if the target speech rate of the dubbing has reached the minimum value and the second difference is still greater than the second threshold, then, on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.

For example, the display duration of a certain group of subtitles is first kept unchanged, that is, T1 is a fixed value and the duration adjustment parameter x1=1. Priority is given to adjusting the speech rate adjustment parameter x2, whose value gradually decreases from 1 to 0.9, for example by taking values at intervals in order from 1 to 0.9. When x2 takes a certain value, if (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range, the adjustment of x2 is stopped, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.

If the value of x2 has reached the minimum value of 0.9 but (V3·x2)(T1/x1) still does not equal the length of the text in this group of subtitles within the allowable error range, x2 is fixed at 0.9 and x1 is adjusted. The value of x1 starts from 1 and gradually increases to 1.1, for example by taking values at intervals in order from 1 to 1.1, until (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
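Because (V3·x2)(T1/x1) is linear in x2 and in 1/x1, the parameters for this rate-first reduction can also be solved for directly instead of being searched step by step. The following sketch uses that direct solution, which is a substitution for the stepwise procedure described above and not the procedure itself; the function name `solve_rate_first_decrease` is hypothetical.

```python
def solve_rate_first_decrease(n, t1, v3, x1_max=1.1, x2_min=0.9):
    """Directly solve (V3 * x2) * (T1 / x1) = n, preferring to change x2 only.

    Returns (x1, x2), or None if n cannot be matched within the limits
    or if no reduction is needed at all (x2 would exceed 1).
    """
    base = t1 * v3            # readable length with x1 = x2 = 1
    x2 = n / base             # x2 giving an exact match while x1 = 1
    if x2 > 1.0:
        return None           # text is too long for a "reduce" case
    if x2 >= x2_min:
        return 1.0, x2        # slowing the speech rate alone is enough
    x2 = x2_min               # speech rate pinned at its minimum
    x1 = base * x2 / n        # shorten the display duration: T2 = T1 / x1
    return (x1, x2) if x1 <= x1_max else None
```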

Method two:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle; if the display duration of the target subtitle has reached the minimum value and the second difference is still greater than the second threshold, then, on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.

For example, the target speech rate of the dubbing tone is first kept unchanged, that is, V3 is a fixed value and the speech rate adjustment parameter x2=1. Priority is given to adjusting the duration adjustment parameter x1, whose value gradually increases from 1 to 1.1, for example by taking values at intervals in order from 1 to 1.1. When x1 takes a certain value, if (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range, the adjustment of x1 is stopped, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.

If the value of x1 has reached the maximum value of 1.1 but (V3·x2)(T1/x1) still does not equal the length of the text in this group of subtitles within the allowable error range, the value of x2 is further adjusted. The value of x2 gradually decreases from 1 to 0.9, for example by taking values at intervals in order from 1 to 0.9, until (V3·x2)(T1/x1) equals the length of the English subtitle in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.

Method three:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle. At the same time, on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.

For example, the values of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 are adjusted at the same time: the value of x1 gradually increases from 1 to 1.1, and the value of x2 gradually decreases from 1 to 0.9, until (V3·x2)(T1/x1) equals the length of the text in this group of subtitles within the allowable error range. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.

FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure. In practice, the original video may include multiple pieces of audio, and the multiple pieces of audio may be the voices of multiple target objects. A target object can be understood as a person in the video. In view of this situation, on the basis of the above technical solutions, optionally, referring to FIG. 7, the method further includes:

S210. For each target object in the plurality of target objects, select a dubbing tone corresponding to the target object.

There are various methods for implementing this step. Exemplarily, multiple dubbing timbre data are stored in the database in advance, and different dubbing timbre data correspond to different character attribute data. Here, the character attribute data includes the age, gender, tone, occupation, and the like of the character. When this step is performed, the character attribute data of the target object is identified based on the original video, and the dubbing timbre corresponding to the target object is determined based on the character attribute data of the target object.

Optionally, in the same video, the dubbing timbres corresponding to the same target object are the same, and the dubbing timbres corresponding to different target objects are different.

S220. Generate multiple dubbing audios corresponding to the multiple audio segments according to the dubbing timbres corresponding to each target object respectively.

S230. Replace multiple audio segments in the original video with multiple dubbing audios to obtain a target video.

The above technical solution selects, for each target object in the plurality of target objects, the dubbing timbre corresponding to that target object, and generates the dubbing audios accordingly, so that the user can conveniently distinguish the different characters by their voices after dubbing, which can improve the user experience.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described action sequences, because certain steps may be performed in other orders or concurrently in accordance with the present disclosure. In addition, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
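The timbre selection of S210 can be sketched as attribute matching against a stored timbre table. The function name `assign_timbres`, the attribute keys, and the greedy matching strategy are assumptions for illustration; how attributes are actually recognized from the video and how timbres are matched are not limited by this sketch.

```python
def assign_timbres(target_objects, timbres):
    """Give every target object one distinct dubbing timbre.

    target_objects: {object name: {attribute: value}} recognized from the video
    timbres: list of {"id": ..., attribute: value, ...} stored in advance
    Returns {object name: timbre id}.
    """
    assigned = {}
    used = set()
    for name, attrs in target_objects.items():
        candidates = [t for t in timbres if t["id"] not in used]
        if not candidates:
            raise ValueError("not enough distinct timbres for all target objects")
        # Pick the unused timbre that matches the most character attributes.
        best = max(
            candidates,
            key=lambda t: sum(t.get(key) == value for key, value in attrs.items()),
        )
        assigned[name] = best["id"]
        used.add(best["id"])
    return assigned
```

Because the mapping is computed once per video, the same target object keeps the same timbre throughout, and different target objects receive different timbres.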

FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus provided by the embodiment of the present disclosure may be configured in a client or may be configured in a server, and the video processing apparatus specifically includes:

an obtaining module 310, configured to obtain the first subtitle in the original video;

a translation module 320, configured to translate the first subtitle to obtain a second subtitle;

a determination module 330, configured to determine the target speech rate of the dubbing;

a dubbing module 340, configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
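The cooperation of modules 310 to 340 can be sketched as a small pipeline in which each module is supplied as a callable. The class name `VideoProcessor` and the callable signatures are hypothetical; subtitle extraction, translation, and speech synthesis are assumed to be provided elsewhere.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VideoProcessor:
    get_subtitle: Callable[[str], str]      # obtaining module 310
    translate: Callable[[str], str]         # translation module 320
    determine_rate: Callable[[str], float]  # determination module 330
    dub: Callable[[str, float], bytes]      # dubbing module 340

    def process(self, video_path: str) -> bytes:
        first_subtitle = self.get_subtitle(video_path)
        second_subtitle = self.translate(first_subtitle)
        target_rate = self.determine_rate(second_subtitle)
        # Generate the dubbing audio for the second subtitle at the target rate.
        return self.dub(second_subtitle, target_rate)
```

Injecting the modules as callables keeps the sketch independent of whether the apparatus is configured in a client or in a server.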

Further, the first subtitle is a subtitle corresponding to any segment of audio in the original video;

The determining module 330 is further configured to determine the display duration of target subtitles, where the target subtitles include the first subtitle and/or the second subtitle;

The device also includes a replacement module 350, which is configured to replace the audio in the original video with the dubbing audio to obtain a target video, where the target subtitle is displayed on the picture corresponding to the display duration of the target subtitle in the target video.

Further, the determination module is used to:

Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing;

If the duration of any piece of audio is greater than or equal to the default duration, and the first difference between the duration of any piece of audio and the default duration is less than or equal to the first threshold, the display duration of the target subtitle is the duration of any piece of audio, and the target speech rate of the dubbing is the default speech rate.

Further, the device also includes a first adjustment module. The first adjustment module is used to:

If the duration of any piece of audio is less than the default duration, determine whether the length of the second subtitle is within a preset range;

If the length of the second subtitle is within the preset range, increase the display duration of the target subtitle, and/or increase the target speech rate of the dubbing, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to a second threshold.

Further, the first adjustment module is used for:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing;

If the target speech rate of the dubbing has reached the maximum value, and the second difference is greater than the second threshold, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.

Further, the first adjustment module is used for:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle;

If the display duration of the target subtitle has reached the maximum value, and the second difference is greater than the second threshold, gradually increase the target speech rate of the dubbing, on the basis that the target speech rate of the dubbing is the default speech rate, until the second difference is less than or equal to the second threshold.

Further, the first adjustment module is used for:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing, until the second difference is less than or equal to a second threshold.

Further, the first adjustment module increases the display duration of the target subtitle by reducing the display speed of the picture corresponding to any piece of audio.

Further, the device also includes a second adjustment module. The second adjustment module is used for:

If the duration of any piece of audio is greater than the default duration, and the first difference between the duration of any piece of audio and the default duration is greater than a first threshold, determine whether the length of the second subtitle is within the preset range;

If the length of the second subtitle is within the preset range, reduce the display duration of the target subtitle, and/or reduce the target speech rate of the dubbing, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to a second threshold.

Further, the second adjustment module is used for:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing;

If the target speech rate of the dubbing has reached the minimum value, and the second difference is greater than the second threshold, gradually reduce the display duration of the target subtitle, on the basis that the display duration of the target subtitle is the duration of any piece of audio, until the second difference is less than or equal to the second threshold.

Further, the second adjustment module is used for:

On the basis that the display duration of the target subtitle is the duration of the audio of any segment, gradually reduce the display duration of the target subtitle;

If the display duration of the target subtitle has reached the minimum value and the second difference is greater than the second threshold, gradually reduce the target speech rate of the dubbing, on the basis that the target speech rate of the dubbing is the default speech rate, until the second difference is less than or equal to the second threshold.

Further, the second adjustment module is used for:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing, until the second difference is less than or equal to a second threshold.

Further, the second adjustment module reduces the display duration of the target subtitle by increasing the display speed of the picture corresponding to any piece of audio.

Further, the translation module is also used to:

If it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.

Further, the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio.

Further, the preset range is related to the default speech rate and the duration of any piece of audio.

Further, the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects;

The device further includes a selection module; the selection module is configured to, for each target object in the plurality of target objects, select a dubbing timbre corresponding to the target object;

A dubbing module, configured to generate a plurality of dubbing audios corresponding to the multi-segment audios according to the dubbing timbres corresponding to each target object respectively;

A replacement module, configured to replace the multiple audio segments in the original video with the multiple dubbed audios to obtain a target video.

The video processing apparatus provided by the embodiments of the present disclosure can execute the steps performed by the client or the server in the video processing method provided by the method embodiments of the present disclosure, and the execution steps and beneficial effects are not repeated here.

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring specifically to FIG. 9 below, it shows a schematic structural diagram of an electronic device 1000 suitable for implementing an embodiment of the present disclosure. The electronic device 1000 in the embodiment of the present disclosure may include, but is not limited to, such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal ( Mobile terminals such as in-vehicle navigation terminals), wearable electronic devices, etc., and stationary terminals such as digital TVs, desktop computers, smart home devices, and the like. The electronic device shown in FIG. 9 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 9, the electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003, so as to implement the video processing method of the embodiments described in the present disclosure. The RAM 1003 also stores various programs and information necessary for the operation of the electronic device 1000. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

Typically, the following devices can be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 1007 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 1008 including, for example, a magnetic tape and a hard disk; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While FIG. 9 shows the electronic device 1000 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for performing the method shown in the flowchart, thereby implementing the above video processing method. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1008, or installed from the ROM 1002. When the computer program is executed by the processing device 1001, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. By contrast, in the present disclosure, a computer-readable signal medium may include an information signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated information signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected through digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to:

acquire the first subtitle in the original video;

translate the first subtitle to obtain the second subtitle;

determine the target speech rate for dubbing; and

generate, according to the target speech rate for dubbing, the dubbing audio corresponding to the second subtitle.

Optionally, when the above one or more programs are executed by the electronic device, the electronic device may also perform other steps described in the above embodiments.
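The program steps, ending with the generation of dubbing audio at the target speech rate as in the method embodiments, can be sketched as a small pipeline. This is only a minimal sketch: `translate`, `choose_rate`, and `synthesize` are hypothetical callables standing in for whatever translation, rate-selection, and speech-synthesis components an implementation actually uses.

```python
from typing import Any, Callable, Tuple


def process_subtitle(first_subtitle: str,
                     translate: Callable[[str], str],
                     choose_rate: Callable[[str], float],
                     synthesize: Callable[[str, float], Any]) -> Tuple[str, float, Any]:
    """Run the translate, rate-selection, and dubbing steps for one subtitle."""
    # Translate the first subtitle to obtain the second subtitle.
    second_subtitle = translate(first_subtitle)
    # Determine the target speech rate for dubbing.
    target_rate = choose_rate(second_subtitle)
    # Generate the dubbing audio for the second subtitle at that rate.
    dubbing_audio = synthesize(second_subtitle, target_rate)
    return second_subtitle, target_rate, dubbing_audio
```

Passing the components in as callables keeps the sketch independent of any particular translation or text-to-speech service.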

Computer program code for performing operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented in software or in hardware. Under certain circumstances, the name of a unit does not constitute a limitation on the unit itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, comprising:

one or more processors;

memory for storing one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method provided in any embodiment of the present disclosure.

According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, it implements the video processing method described in any embodiment of the present disclosure.

The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-mentioned technical features, and should also cover other technical solutions formed by any combination of the above-mentioned technical features or their equivalent features without departing from the above-mentioned disclosed concept, for example, a technical solution formed by replacing the above-mentioned features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Additionally, although operations are depicted in a particular order, this should not be construed as requiring that the operations be performed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation-specific details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or logical acts of method, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

A video processing method, the method comprising:

Get the first subtitle in the original video;

The first subtitle is translated to obtain the second subtitle;

Determine the target speech rate for dubbing;

The dubbing audio corresponding to the second subtitle is generated according to the target speech rate of the dubbing.
The method according to claim 1, wherein the first subtitle is a subtitle corresponding to any piece of audio in the original video;

The method also includes:

determining the display duration of the target subtitle, where the target subtitle includes the first subtitle and/or the second subtitle;

After generating the dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing, the method further includes:

Replace any piece of audio in the original video with the dubbed audio to obtain a target video, and display the target subtitle in a picture corresponding to the display duration of the target subtitle in the target video.
The method according to claim 2, wherein determining the display duration of the target subtitle and the target speech rate of the dubbing comprises:

Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing;

If the duration of any piece of audio is greater than or equal to the default duration, and the first difference between the duration of any piece of audio and the default duration is less than or equal to the first threshold, the display duration of the target subtitles is the time corresponding to any piece of audio, and the target speech rate of the dubbing is the default speech rate.
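The comparison between the audio duration and the default dubbing duration can be illustrated with a short sketch. It assumes the speech rate is measured in subtitle characters per second, so that the default duration is the subtitle length divided by the default rate; that unit, the `Adjustment` enum, and the function names are assumptions made for this example only. The two cases other than "keep the default" correspond to the increase and decrease adjustments described elsewhere in this document, which additionally check whether the subtitle length is within the preset range.

```python
from enum import Enum


class Adjustment(Enum):
    KEEP_DEFAULT = "keep_default"   # audio fits: use default rate, show for audio duration
    SPEED_UP = "speed_up"           # audio shorter than the default dubbing duration
    SLOW_DOWN = "slow_down"         # audio much longer than the default dubbing duration


def default_duration(subtitle_length: int, default_rate: float) -> float:
    """Default dubbing duration, assuming rate is in characters per second."""
    return subtitle_length / default_rate


def classify(audio_duration: float, subtitle_length: int,
             default_rate: float, first_threshold: float) -> Adjustment:
    """Decide which branch applies for one piece of audio."""
    expected = default_duration(subtitle_length, default_rate)
    if audio_duration < expected:
        return Adjustment.SPEED_UP
    if audio_duration - expected <= first_threshold:
        return Adjustment.KEEP_DEFAULT
    return Adjustment.SLOW_DOWN
```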
The method of claim 3, wherein the method further comprises:

If the duration of any piece of audio is less than the default duration, determining whether the length of the second subtitle is within a preset range;

If the length of the second subtitle is within the preset range, the display duration of the target subtitle is increased, and/or the target speech rate of the dubbing is increased, so that a second difference between the display duration of the target subtitle and the product of the target speech rate of the dubbing and the length of the second subtitle is less than or equal to a second threshold.
The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing;

If the target speech rate of the dubbing has reached the maximum value, and the second difference is greater than the second threshold, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
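A minimal sketch of this rate-first strategy follows. It assumes the speech rate is in characters per second, so the dubbing duration is the subtitle length divided by the rate, and it takes the second difference to be the dubbing duration minus the display duration. The claim wording speaks of a product of rate and length, which would instead fit a rate expressed as time per character, so the unit used here is an assumption, as are the step sizes and the function name.

```python
from typing import Tuple


def speed_up_rate_first(audio_duration: float, subtitle_length: int,
                        default_rate: float, max_rate: float,
                        second_threshold: float,
                        rate_step: float = 0.1,
                        display_step: float = 0.1) -> Tuple[float, float]:
    """Raise the speech rate first, then lengthen the display duration."""
    rate = default_rate          # start from the default speech rate
    display = audio_duration     # start from the duration of the piece of audio

    def second_difference() -> float:
        # Assumed form: dubbing duration minus display duration.
        return subtitle_length / rate - display

    # Step 1: gradually increase the rate until it fits or hits the maximum.
    while second_difference() > second_threshold and rate < max_rate:
        rate = min(rate + rate_step, max_rate)

    # Step 2: if the rate alone is not enough, gradually extend the display.
    while second_difference() > second_threshold:
        display += display_step

    return rate, display
```

Swapping the two loops gives the display-first variant, and advancing both quantities inside a single loop gives the simultaneous variant.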
The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle;

If the display duration of the target subtitle has reached the maximum value, and the second difference is greater than the second threshold, gradually increase the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
The method according to any one of claims 5-7, wherein increasing the display duration of the target subtitles comprises:

Decrease the display speed of the picture corresponding to any piece of audio.
The method of claim 3, wherein the method further comprises:

If the duration of any piece of audio is greater than the default duration, and the first difference between the duration of any piece of audio and the default duration is greater than a first threshold, determine whether the length of the second subtitle is within the preset range;

If the length of the second subtitle is within the preset range, the display duration of the target subtitle is reduced, and/or the target speech rate of the dubbing is reduced, so that a second difference between the display duration of the target subtitle and the product of the target speech rate of the dubbing and the length of the second subtitle is less than or equal to a second threshold.
The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:

On the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing;

If the target speech rate of the dubbing has reached the minimum value, and the second difference is greater than the second threshold, on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:

On the basis that the display duration of the target subtitles is the duration of any piece of audio, gradually reduce the display duration of the target subtitles;

If the display duration of the target subtitle has reached the minimum value, and the second difference is greater than the second threshold, gradually reduce the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:

On the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
The method according to any one of claims 10-12, wherein reducing the display duration of the target subtitles comprises:

The display speed of the picture corresponding to any piece of audio is increased.
The method of claim 4 or 9, wherein the method further comprises:

If it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
The method according to claim 4 or 9, wherein the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio.
The method according to claim 4 or 9, wherein the preset range is related to the default speech rate and the duration of any piece of audio.
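One possible reading of the preset range and of the retranslation that depends on it is sketched below. The document only states that the range is related to the default speech rate and the audio duration; the specific formula here (the ideal length, taken as duration times rate in characters per second, widened by a tolerance), the `tolerance` value, the attempt limit, and the `translate` and `retranslate` callables are all assumptions for illustration.

```python
from typing import Callable, Tuple


def preset_length_range(audio_duration: float, default_rate: float,
                        tolerance: float = 0.2) -> Tuple[float, float]:
    """Assumed range: ideal character count widened by a relative tolerance."""
    ideal = audio_duration * default_rate
    return ideal * (1 - tolerance), ideal * (1 + tolerance)


def fit_translation(first_subtitle: str,
                    translate: Callable[[str], str],
                    retranslate: Callable[[str, float, float], str],
                    audio_duration: float, default_rate: float,
                    max_attempts: int = 3) -> str:
    """Retranslate until the second subtitle's length falls in the range."""
    low, high = preset_length_range(audio_duration, default_rate)
    second_subtitle = translate(first_subtitle)
    for _ in range(max_attempts):
        if low <= len(second_subtitle) <= high:
            break
        # Ask for a new translation constrained to the preset range.
        second_subtitle = retranslate(first_subtitle, low, high)
    return second_subtitle
```

The attempt limit is a safeguard added for the sketch, so that a translator that cannot meet the constraint does not loop indefinitely.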
The method according to claim 1, wherein the original video includes multiple pieces of audio, and the multiple pieces of audio are voices of multiple target objects;

The method also includes:

For each target object in the plurality of target objects, select the dubbing timbre corresponding to the target object;

According to the dubbing timbre corresponding to each target object respectively, generate a plurality of dubbing audios corresponding to the multi-segment audios;

The multiple pieces of audio in the original video are replaced with the multiple dubbed audios to obtain a target video.
A video processing device, comprising:

an acquisition module for acquiring the first subtitle in the original video;

a translation module for translating the first subtitle to obtain a second subtitle;

A determination module for determining the target speech rate of dubbing;

A dubbing module, configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
An electronic device comprising:

one or more processors;

a storage device for storing one or more programs;

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-17.
A computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the method of any one of claims 1-17.