CN113207044A - Video processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113207044A
CN113207044A (application number CN202110472124.XA)
Authority
CN
China
Prior art keywords
target
dubbing
subtitle
duration
speech rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110472124.XA
Other languages
Chinese (zh)
Inventor
杜育璋
刘坚
李磊
王明轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110472124.XA priority Critical patent/CN113207044A/en
Publication of CN113207044A publication Critical patent/CN113207044A/en
Priority to PCT/CN2022/087381 priority patent/WO2022228179A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/4355Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

Embodiments of the present disclosure disclose a video processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a first subtitle in an original video; translating the first subtitle to obtain a second subtitle; determining a target speech rate for dubbing; and generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing. The technical solution provided by the embodiments of the present disclosure can generate dubbing audio that a video viewer understands, reducing the difficulty of understanding the video content for the user.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of information technology, terminals have become indispensable devices in people's lives. For example, a user may view a video through a terminal.
Some videos are in a language the user does not understand, so the user cannot follow the audio content of the video. In the existing technology, subtitles the user can read are displayed in the video, but in some cases the speed at which the user reads the subtitles does not match the speed at which the subtitles are displayed, which degrades the user experience.
Disclosure of Invention
To solve, or at least partially solve, the above technical problem, embodiments of the present disclosure provide a video processing method and apparatus, an electronic device, and a storage medium.
The embodiment of the present disclosure provides a video processing method, which includes:
acquiring a first subtitle in an original video;
translating the first subtitle to obtain a second subtitle;
determining a target speech rate for dubbing;
and generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
An embodiment of the present disclosure further provides a video processing apparatus, including:
an acquisition module, configured to acquire a first subtitle in an original video;
a translation module, configured to translate the first subtitle to obtain a second subtitle;
a determining module, configured to determine a target speech rate for dubbing;
and a dubbing module, configured to generate the dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
An embodiment of the present disclosure further provides an electronic device, which includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the video processing method as described above.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the video processing method as described above.
Embodiments of the present disclosure also provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the video processing method as described above.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least the following advantages: a first subtitle in an original video is acquired; the first subtitle is translated to obtain a second subtitle; a target speech rate for dubbing is determined; and dubbing audio corresponding to the second subtitle is generated according to the target speech rate. In this way, dubbing audio that a video viewer can understand is generated, the difficulty of understanding the video content is reduced for the user, and the user experience is improved.
In addition, in the video processing method provided by the embodiments of the present disclosure, the display duration of the target subtitle and the target speech rate for dubbing are determined, and the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted so that the duration of the dubbing audio is consistent with the display duration of the target subtitle within an allowable error range. This solves the problem that the dubbing duration does not match the subtitle display duration when sentences with the same meaning have different lengths in different languages, and improves the user experience.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure;
fig. 3 is a flowchart of another video processing method provided by the embodiment of the present disclosure;
fig. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure;
fig. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure;
fig. 6 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure;
fig. 7 is a flowchart of another video processing method provided by the embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure. The present embodiment is applicable to the case where a client dubs a video. The method may be executed by a video processing apparatus, which may be implemented in software and/or hardware and configured in an electronic device such as a terminal, including but not limited to a smartphone, a palmtop computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one machine, a smart home device, and the like. Alternatively, the embodiment is applicable to the case where a server dubs the video; the method may then be executed by a video processing apparatus implemented in software and/or hardware and configured in an electronic device such as a server.
As shown in fig. 1, the method may specifically include:
and S1, acquiring the first subtitle in the original video.
The first caption is in the same language as the original video character, and illustratively, if the character in the video speaks in english, the first caption is in english.
There are various ways to implement this step, which are not limited in this application. Illustratively, if the original video includes the first subtitle, the first subtitle may be directly extracted.
Alternatively, if the original video does not include the first subtitle, speech recognition may be performed on any segment of audio in the original video to obtain the first subtitle. In this case, the first subtitle is the subtitle corresponding to that segment of audio.
Here, a segment of audio refers to the audio information corresponding to one sentence spoken by any character in the video.
Specifically, a video includes an audio stream and a video stream. The video stream includes a plurality of image frames, which are played in chronological order to form the moving picture of the video. The characters in the video speak during some periods of time and are silent during others. The audio stream is composed of multiple audio segments, each corresponding to one sentence spoken by a character in the video. A segment of audio corresponds to a plurality of image frames.
Illustratively, fig. 2 is a schematic diagram of image frames in a video provided by an embodiment of the present disclosure. Referring to fig. 2, assume that the characters in the video speak during the t0 and t3 time periods and are silent during the remaining time periods. The image frames in the t0 time period correspond to one run of continuous speech, which constitutes one audio segment; the image frames in the t3 time period correspond to another run of continuous speech, which constitutes another audio segment.
Optionally, when this step is executed, speech recognition is performed on any segment of audio in the original video through a speech recognition technique to obtain the first subtitle. For example, based on the speech pauses in the audio, each run of continuous Chinese speech may be recognized as one continuous Chinese subtitle.
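The pause-based segmentation described above can be sketched as follows. This is a minimal illustration, not part of the patent: the per-frame energy values, thresholds, and function name are all hypothetical, and a real system would feed each resulting segment to a speech recognizer.

```python
# Illustrative sketch: split an audio track into per-sentence segments
# by silence gaps, so each segment maps to one subtitle line.
# Energies and thresholds below are assumed example values.

def split_on_silence(frame_energies, silence_thresh=0.1, min_gap_frames=3):
    """Return (start, end) frame-index pairs for contiguous speech runs
    separated by at least `min_gap_frames` silent frames."""
    segments = []
    start = None   # start frame of the current speech run, if any
    gap = 0        # consecutive silent frames seen since last speech
    for i, energy in enumerate(frame_energies):
        if energy > silence_thresh:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                # Close the run at the first silent frame of the gap.
                segments.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        segments.append((start, len(frame_energies) - gap))
    return segments
```

For example, `split_on_silence([0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8])` yields two speech runs, matching the two spoken periods t0 and t3 in fig. 2.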
S2, translating the first subtitle to obtain a second subtitle.
The second subtitle is in a different language from the first subtitle. Illustratively, the first subtitle is in Chinese and the second subtitle is in English. The second subtitle uses a language the video viewer understands, and in practice it can be set according to the viewer's needs.
S3, determining a target speech rate for the dubbing.
Those skilled in the art will appreciate that, to dub the original video, the second subtitle needs to be read at a certain speech rate. In this step, the target speech rate for dubbing is the speech rate at which the second subtitle is read.
Optionally, a database stores a plurality of dubbing timbre data in advance, and each dubbing timbre has corresponding default speech rate data. When this step is executed, a dubbing timbre is selected, and the default speech rate corresponding to the selected timbre, or an adjusted version of that default speech rate, is taken as the target speech rate for the dubbing.
It should be noted that, when executing this step, it is necessary to ensure that the second subtitle, read at the determined target speech rate, remains intelligible to the video viewer.
S4, generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
In the technical solution provided by this embodiment, a first subtitle in an original video is acquired; the first subtitle is translated to obtain a second subtitle; a target speech rate for dubbing is determined; and dubbing audio corresponding to the second subtitle is generated according to the target speech rate. In this way, dubbing audio that a video viewer can understand is generated, the difficulty of understanding the video content is reduced for the user, and the user experience is improved.
Optionally, the dubbing audio obtained by the technical solution of this application may be played alone, or the audio in the original video may be replaced with the dubbing audio to obtain a dubbed video. The original video and the dubbing audio may also be played synchronously, achieving a dubbing effect when the video is played.
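Steps S1–S4 can be sketched end to end as follows. The helper callables (`translate`, `synthesize`) are hypothetical stand-ins for a machine-translation system and a text-to-speech engine, stubbed here with toy implementations; the speech-rate unit (words per second) is an assumption for illustration.

```python
# Minimal sketch of the S2-S4 pipeline: translate the first subtitle,
# then synthesize dubbing audio for the second subtitle at the target
# speech rate. Real recognition/translation/TTS systems are replaced
# by stubs.

def dub_segment(first_subtitle, translate, synthesize, target_rate):
    """Return (second_subtitle, dubbing_audio) for one audio segment."""
    second_subtitle = translate(first_subtitle)                   # S2
    dubbing_audio = synthesize(second_subtitle, rate=target_rate)  # S4
    return second_subtitle, dubbing_audio

# Toy stand-ins: "translation" is a dictionary lookup, and "synthesis"
# returns a (text, rate) pair instead of real audio samples.
lexicon = {"你好": "hello"}
sub, audio = dub_segment(
    "你好",
    translate=lambda s: lexicon[s],
    synthesize=lambda text, rate: (text, rate),
    target_rate=4.0,  # assumed: 4 words per second
)
```

Swapping in a real recognizer, translator, and TTS engine leaves the control flow unchanged, which is the point of the four-step decomposition.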
Fig. 3 is a flowchart of a video processing method according to an embodiment of the present disclosure. Fig. 3 is a specific example of fig. 1. Referring to fig. 3, the method includes:
S110, performing speech recognition on any segment of audio in the original video to obtain a first subtitle.
S120, translating the first subtitle to obtain a second subtitle.
S130, determining the display duration of the target subtitle and the target speech rate of the dubbing, where the target subtitle includes the first subtitle and/or the second subtitle.
Those skilled in the art can understand that, to dub the original video, the dubbing audio needs to be generated based on the second subtitle, and the duration of the dubbing audio should be comparable to the playing duration of the image frames corresponding to the dubbing audio (which may also be understood as the display duration of the first subtitle and/or the second subtitle, i.e., the display duration of the target subtitle here). If the duration of the dubbing audio is longer than the playing duration of the corresponding image frames, the image frames finish playing before the dubbing audio ends. If the duration of the dubbing audio is shorter, the dubbing audio ends while the image frames are still displayed. Both situations cause the picture and the audio to be out of sync, affecting the user experience.
To make the duration of the dubbing audio comparable to the playing duration of the corresponding image frames, two parameters must first be defined: the duration of the dubbing audio, and the playing duration of the image frames corresponding to the dubbing audio.
As for the first parameter, the duration of the dubbing audio depends mainly on two quantities: the length of the second subtitle and the speech rate of the dubbing. Since the second subtitle has already been obtained in S120, its length is fixed in this step, and the duration of the dubbing audio therefore depends mainly on the speech rate of the dubbing.
Therefore, the essence of this step is to determine an appropriate display duration for the target subtitle and an appropriate target speech rate for the dubbing, so that the duration of the dubbing audio corresponding to the second subtitle and the display duration of the target subtitle are consistent within the error tolerance. In some embodiments, the order between determining the display duration of the target subtitle and determining the target speech rate of the dubbing is not limited. In addition, the two determinations may be mutually independent processes or mutually associated processes.
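The matching condition in S130 can be written down directly: the dubbing duration is subtitle length divided by speech rate, and it should equal the display duration within a tolerance. The units (subtitle length in characters, rate in characters per second) and the tolerance value are assumptions for illustration.

```python
# Sketch of the S130 matching condition: dubbing duration =
# subtitle length / speech rate, compared against the target
# subtitle's display duration.

def required_rate(subtitle_len, display_duration):
    """Speech rate making dubbing duration equal the display duration."""
    return subtitle_len / display_duration

def durations_match(subtitle_len, rate, display_duration, tol=0.2):
    """True if dubbing duration is within `tol` seconds of the
    display duration (the 'error tolerance' of the text)."""
    dubbing_duration = subtitle_len / rate
    return abs(dubbing_duration - display_duration) <= tol
```

For a 20-character subtitle shown for 4 seconds, a rate of 5 characters/second matches exactly; at 2 characters/second the dubbing would run 10 seconds and fail the check.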
S140, generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
The essence of this step is to read the text information in the second subtitle at the target speech rate determined in S130, obtaining the dubbing audio corresponding to the second subtitle.
S150, replacing the audio segment in the original video with the dubbing audio to obtain a target video, and displaying the target subtitle in the pictures corresponding to the display duration of the target subtitle in the target video.
Illustratively, if a segment of audio in the original video is Chinese audio and the dubbing audio is English audio, the Chinese audio in the original video is replaced with the English audio to obtain the target video, and Chinese subtitles and/or English subtitles are added to the pictures of the corresponding video frames.
The essence of the technical solution of this application is to determine the display duration of the target subtitle and the target speech rate of the dubbing, and to adjust the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle, so that the two durations are consistent within the allowable error range. This solves the problem that the dubbing duration does not match the subtitle display duration when sentences with the same meaning have different lengths in different languages, and improves the user experience.
On the basis of the above technical solution, optionally, S150 may also be replaced with: and generating an audio file according to the dubbing audio corresponding to each section of audio in the original video, wherein the audio file comprises the dubbing audio corresponding to each section of audio in the original video and the time information of each dubbing audio. And when the original video is played, calling and playing the dubbing audio in sequence according to the current playing time progress of the original video and the time information of the dubbing audio. Further, the time information of each dubbed audio includes a start time and/or an end time of the dubbed audio.
For example, if a plurality of audio files are generated based on a certain original video, each audio file includes the start time of the dubbing audio corresponding to that file. Assume that the start time of dubbing audio a is 12 seconds after the first image frame of the video starts playing. When video playback reaches the 12-second mark, dubbing audio a is called and played. Optionally, in this manner, the original audio of the original video is eliminated. Furthermore, when the original audio is eliminated, only the speech of the characters in the original video is removed, and the background sound of the original video is retained.
Further, a play button or icon corresponding to the audio file may be displayed in the user interface for playing the target video. When the button or icon is in the off state, the audio in the target video is also the audio in the original video, i.e., the audio in the original video is not replaced with the corresponding dubbing audio. When the button or icon is in the on state, the audio in the original video is replaced by the corresponding dubbing audio, namely the audio in the target video is changed into the dubbing audio. Alternatively, the terminal may play the dubbed audio alone while the button or icon is in the on state.
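The playback scheduling described above, where each dubbing audio carries its own time information, can be sketched as follows. The clip field names and example times are assumptions for illustration, not from the patent.

```python
# Illustrative sketch: look up which dubbing clips should be sounding
# at the current playback position, using each clip's start time and
# duration (the "time information" of the text).

def clips_due(clips, position):
    """Return clips whose start time has been reached but whose end
    time has not, for playback position `position` (in seconds)."""
    return [c for c in clips
            if c["start"] <= position < c["start"] + c["duration"]]

# Example schedule: clip "a" starts at the 12-second mark, as in the
# dubbing-audio-a example above; clip "b" is a second hypothetical clip.
clips = [
    {"id": "a", "start": 12.0, "duration": 3.0},
    {"id": "b", "start": 20.0, "duration": 2.5},
]
```

A player would call `clips_due` (or an equivalent indexed lookup) each time the playback progress advances, triggering clip "a" at 12 s and clip "b" at 20 s while leaving silent spans untouched.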
On the basis of the above technical solution, further analysis of how to determine the display duration of the target subtitle and the target speech rate of the dubbing in S130 shows that, in practice, two cases mainly arise:
Case one: for a segment of audio in the original video, the duration of the dubbing audio is obtained based on the default speech rate of the dubbing and the second subtitle, and this duration is consistent with the duration of the audio segment in the original video within the allowable error range. In this case, the display duration of the target subtitle can be directly determined as the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
Optionally, a database stores a plurality of dubbing timbre data in advance, and each dubbing timbre has corresponding default speech rate data. The default speech rate of the dubbing is the default speech rate corresponding to the selected dubbing timbre, obtained from the dubbing timbre data stored in the database.
Case two: for a segment of audio in the original video, the duration of the dubbing audio is obtained based on the default speech rate of the dubbing and the second subtitle, but this duration differs substantially from the duration of the audio segment in the original video and cannot be considered consistent within the allowable error range. In this case, the display duration of the target subtitle is determined based on the duration of the audio segment in the original video, and at the same time the target speech rate of the dubbing is determined based on the default speech rate of the dubbing.
Fig. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to case one above. Referring to fig. 4, the method includes:
s1311, determining default duration of dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
S1312, if the duration of the audio segment is greater than or equal to the default duration, and a first difference between the duration of the audio segment and the default duration is less than or equal to a first threshold, the display duration of the target subtitle is the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
The "first difference between the duration of the audio segment and the default duration" may be the absolute value of the difference between the two durations, or the ratio of the two durations.
The essence of this setting is that, within the allowable error range, if the duration of an audio segment is consistent with the default duration of the dubbing corresponding to that segment, the display duration of the target subtitle is directly determined as the time corresponding to the segment, and the target speech rate of the dubbing is the default speech rate.
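The case-one test of S1311–S1312 can be sketched as follows, taking the first difference as a plain difference of durations. The threshold value and units are assumed examples, not from the patent.

```python
# Sketch of S1311-S1312: compute the default dubbing duration from the
# second subtitle's length and the default speech rate, then keep the
# default rate only if the audio segment is at least that long and the
# first difference stays within the first threshold.

def use_default_rate(subtitle_len, default_rate, audio_duration,
                     first_threshold=0.5):
    """True if case one applies and the default speech rate can be kept."""
    default_duration = subtitle_len / default_rate        # S1311
    first_difference = audio_duration - default_duration
    return (audio_duration >= default_duration            # S1312, part 1
            and first_difference <= first_threshold)      # S1312, part 2
```

With a 20-character subtitle and a default rate of 5 characters/second (default duration 4 s), a 4.2 s segment passes; a 6 s segment fails the threshold check and a 3 s segment fails the greater-than-or-equal check, both falling through to case two.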
Fig. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to case two above. Referring to fig. 5, the method includes:
s1321, determining the default duration of dubbing according to the length of the second subtitle and the default speech speed of the dubbing.
S1322, if the duration of any segment of audio is less than the default duration, determining whether the length of the second subtitle is within the preset range.
S1323, if the length of the second subtitle is within the preset range, the display duration of the target subtitle is increased, and/or the target speech rate of dubbing is increased, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle is smaller than or equal to a second threshold value.
The display duration of the target subtitle is the display duration of the pictures corresponding to the adjusted audio segment.
It will be appreciated by those skilled in the art that, if increasing the display duration of the target subtitle is selected in S1323, the increased display duration cannot be unlimited. If the display duration of the target subtitle becomes too long (exceeds a certain limit), the switching speed of the image frames corresponding to that audio segment becomes too low while the switching speed of the image frames corresponding to other audio segments remains normal, making the video as a whole inconsistent and affecting the user experience. Therefore, the maximum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
Specifically, if the initial display duration of a group of subtitles (i.e., the duration of the corresponding audio in the original video) is T1, the duration adjustment parameter is x1, and the adjusted display duration of the subtitles is T2, then T2 = T1/x1, where x1 is a number less than 1. Setting a minimum value for the duration adjustment parameter x1 limits the maximum value of the adjusted display duration.
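The formula T2 = T1/x1 with a minimum value for x1 can be sketched as follows; the minimum of 0.9 follows the worked example later in the text, and the function name is illustrative.

```python
# Sketch of the duration adjustment T2 = T1 / x1. Clamping x1 at its
# minimum value caps how far the display duration can be extended.

def extend_display_duration(t1, x1, x1_min=0.9):
    """Extend display duration T1 by factor 1/x1 (x1 < 1), with x1
    clamped at `x1_min` so T2 never exceeds T1 / x1_min."""
    x1 = max(x1, x1_min)
    return t1 / x1
```

With T1 = 2 s and x1 at its minimum of 0.9, T2 is about 2.22 s; requesting a smaller x1 (e.g. 0.5) is clamped and yields the same capped duration.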
Optionally, there are various implementation methods for "increasing the display duration of the target subtitle", and for example, the implementation method for increasing the display duration of the target subtitle includes: and reducing the display speed of the picture corresponding to any section of audio.
Further, if the method of reducing the display speed of the picture corresponding to any one of the audio segments is adopted, the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x 1.
Specifically, if the display speed of the pictures corresponding to a certain segment of audio in the original video is V1, the display speed adjustment parameter is x1, and the adjusted display speed is V2, then V2 = V1 · x1. Setting a minimum value for the display speed adjustment parameter x1 limits the minimum adjusted display speed of the pictures corresponding to the segment of audio, and thus limits the maximum adjusted display duration of the target subtitle.
For example, with reference to fig. 2, assume that the t0 time period corresponds to 20 image frames and t0 = 2 s, so the original display speed V1 of the image frames corresponding to the t0 time period is 20 frames / 2 seconds. If the display speed of the image frames is slowed down, the slowed display speed is denoted V2, and V2 = V1 · x1, where x1 is a number smaller than 1. If the minimum value of x1 is set to 0.9, the display speed of the image frames can be slowed to at most 18 frames / 2 seconds, i.e., 10% slower. Since the total number of image frames corresponding to the t0 time period is fixed at 20, the display duration of the 20 frames becomes t0/x1: changing x1 from 1 to 0.9 increases the display duration of the image frames, i.e., the display duration of the group of subtitles.
Similarly, when S1323 is executed, if increasing the target speech rate of the dubbing is chosen, and the target speech rate becomes too fast (exceeds a certain limit), the user may not hear the dubbing clearly, which degrades the user experience. Therefore, the maximum value of the target speech rate can be limited by setting a speech rate adjustment parameter.
Specifically, assume that the default speech rate corresponding to the dubbing timbre selected for a certain segment of audio in the original video is V3, the speech rate adjustment parameter is x2, and the adjusted target speech rate of the dubbing timbre is V4; then V4 = V3 · x2. The maximum value of the adjusted target speech rate of the dubbing timbre is limited by setting a maximum value for the speech rate adjustment parameter x2.
Illustratively, the default speech rate of the dubbing timbre is denoted V3, and V3 is fixed. The target speech rate of the dubbing timbre is denoted V4, with initial value V3. The target speech rate of the dubbing timbre may be adjusted; for example, increasing it corresponds to V4 = V3 · x2, where x2 is a number greater than 1. If the maximum value of x2 is set to 1.1, the target speech rate can be at most 1.1 times the default speech rate, which is equivalent to a target speech rate at most 10% faster than the default for the dubbing timbre.
The product L1 of the display duration T2 of the target subtitle and the target speech rate V4 of the dubbing may be expressed as L1 = T2 · V4 = (T1/x1) · (V3 · x2). That is, L1 is the length of text that can be read out within the display duration of the target subtitle at the dubbing target speech rate. Here, the length of text can be understood as the number of words, the number of syllables, or the like.
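As a sketch under assumed units (V3 in words per second, T1 in seconds; the function name is illustrative), the product L1 = (T1/x1) · (V3 · x2) can be computed as:

```python
def readable_length(t1: float, v3: float, x1: float = 1.0, x2: float = 1.0) -> float:
    """L1 = T2 * V4 = (T1 / x1) * (V3 * x2): the length of text readable
    within the adjusted display duration at the adjusted speech rate."""
    return (t1 / x1) * (v3 * x2)
```

For example, 2 seconds at 4 words per second gives 8 words unadjusted; with x1 = 0.9 and x2 = 1.1 the readable length grows to about 9.78 words.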
The product of the display duration of the target subtitle and the target speech rate of the dubbing is equal to the length of the second subtitle, meaning that the length of the text readable within the display duration of the target subtitle at the target speech rate of the dubbing is exactly equal to the length of the second subtitle. In other words, the time required to read the second subtitle at the dubbed target speech rate is exactly equal to the display duration of the target subtitle.
Therefore, in S1323, "making the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle smaller than or equal to the second threshold" means that the duration of the dubbing audio generated based on the second subtitle coincides with the display duration of the target subtitle within the error-allowable range.
The second difference may be an absolute value of a difference between a product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle, or may be a ratio of the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle.
As described above, since it is necessary to ensure that the video retains a good audio-visual effect, the maximum display duration of the target subtitle and the maximum target speech rate are limited, which requires the length of the second subtitle to fall within a certain range (i.e., the "preset range" in S1322).
If the length of the second subtitle is just in the preset range, under the two conditions that the display duration of the target subtitle is less than or equal to the maximum display duration of the target subtitle and the target speech rate of dubbing is less than or equal to the maximum target speech rate, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle can be less than or equal to a second threshold value by increasing the display duration of the target subtitle and/or increasing the target speech rate of dubbing.
If the length of the second subtitle is not within the preset range, under the two conditions that the display duration of the target subtitle is less than or equal to the maximum display duration of the target subtitle and the target speech rate of dubbing is less than or equal to the maximum target speech rate, no matter how the display duration of the target subtitle is increased and/or the target speech rate of dubbing is increased, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle cannot be less than or equal to the second threshold.
Optionally, if it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
Optionally, the preset range is associated with a default speech rate and a duration of any one piece of audio.
Further, the preset range includes an upper limit value (i.e. a maximum value) related to the default speech rate, the duration of any one piece of audio, the minimum value of the duration adjustment parameter, and the maximum value of the speech rate adjustment parameter.
For example, assuming that the upper limit of the preset range is n1, the default speech rate is V3, the duration of the segment of audio is T1, the minimum value of the duration adjustment parameter is 0.9, and the maximum value of the speech rate adjustment parameter is 1.1, then n1 = (V3 · 1.1) · (T1/0.9).
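The upper limit n1 above can be sketched as follows (the function and parameter names are illustrative assumptions):

```python
def preset_range_upper(v3: float, t1: float, x1_min: float = 0.9, x2_max: float = 1.1) -> float:
    """Upper limit n1 of the preset range: the longest second subtitle
    that can still fit, n1 = (V3 * x2_max) * (T1 / x1_min)."""
    return (v3 * x2_max) * (t1 / x1_min)
```

For V3 = 4 and T1 = 2 this gives 4.4 · (2/0.9) ≈ 9.78.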
Based on the above technical solution, there are various specific implementation methods for "increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing" in executing S1323. Three typical methods are given below.
The method comprises the following steps:
gradually increasing the dubbing target speech rate on the basis that the dubbing target speech rate is the default speech rate; if the target speech rate of the dubbing has reached the maximum value and the second difference is greater than the second threshold, gradually increasing the display duration of the target subtitle on the basis that the display duration of the target subtitle is the duration of any segment of audio until the second difference is less than or equal to the second threshold.
For example, the display duration of a certain group of subtitles is held constant, i.e., T1 is unchanged and the duration adjustment parameter x1 = 1. The speech rate adjustment parameter x2 is adjusted first, gradually increasing from 1 to 1.1 (for example, stepping through values in order from 1 to 1.1). When x2 reaches a value such that (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error, adjustment of x2 stops, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.
If x2 has reached its maximum value of 1.1 but (V3 · x2) · (T1/x1) still cannot be made equal to the length of the characters in the group of subtitles within the allowable error, x2 is fixed at 1.1 and x1 is adjusted, gradually decreasing from 1 to 0.9 (for example, stepping through values in order from 1 to 0.9), until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error; the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
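A minimal sketch of method one, assuming word-count units, a fixed step size of 0.01, and a tolerance argument for the allowable error (the function name and these details are assumptions, not part of the disclosure):

```python
def method_one(target_len, t1, v3, tol, step=0.01, x1_min=0.9, x2_max=1.1):
    """Raise the speech-rate parameter x2 first; only when it reaches
    x2_max, lower the duration parameter x1, until
    |(V3 * x2) * (T1 / x1) - target_len| <= tol.  Returns (x1, x2),
    or None if the target length is outside the preset range."""
    x1, x2 = 1.0, 1.0
    while x2 <= x2_max:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x2 = round(x2 + step, 10)   # round() keeps the grid on clean decimals
    x2 = x2_max
    while x1 >= x1_min:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x1 = round(x1 - step, 10)
    return None
```

For instance, with V3 = 4 words/s and T1 = 2 s, a 8.4-word subtitle is matched by x2 alone, while a 9.5-word subtitle also forces x1 down.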
The second method comprises the following steps:
gradually increasing the display duration of the target subtitle on the basis that the display duration of the target subtitle is the duration of the segment of audio; if the display duration of the target subtitle has reached the maximum value and the second difference is greater than the second threshold, gradually increasing the dubbed target speech rate, on the basis that the dubbed target speech rate is the default speech rate, until the second difference is less than or equal to the second threshold.
For example, the target speech rate of the dubbing timbre is held constant, i.e., V3 is unchanged and the speech rate adjustment parameter x2 = 1. The duration adjustment parameter x1 is adjusted first, gradually decreasing from 1 to 0.9 (for example, stepping through values in order from 1 to 0.9). When x1 reaches a value such that (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error, adjustment of x1 stops, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.
If x1 has reached its minimum value of 0.9 but (V3 · x2) · (T1/x1) still cannot be made equal to the length of the characters in the group of subtitles within the allowable error, x2 is further adjusted, gradually increasing from 1 to 1.1 (for example, stepping through values in order from 1 to 1.1), until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
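Method two can be sketched the same way with the adjustment order reversed (again a hypothetical implementation with an assumed 0.01 step and tolerance argument):

```python
def method_two(target_len, t1, v3, tol, step=0.01, x1_min=0.9, x2_max=1.1):
    """Lower the duration parameter x1 first; only when it reaches
    x1_min, raise the speech-rate parameter x2, until
    |(V3 * x2) * (T1 / x1) - target_len| <= tol."""
    x1, x2 = 1.0, 1.0
    while x1 >= x1_min:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x1 = round(x1 - step, 10)
    x1 = x1_min
    while x2 <= x2_max:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x2 = round(x2 + step, 10)
    return None
```

Note that for the same inputs, method one and method two can return different valid parameter pairs, since they prefer different adjustment axes.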
The third method comprises the following steps:
and gradually increasing the display duration of the target caption on the basis that the display duration of the target caption is the duration of any section of audio, and gradually increasing the target speech rate of dubbing on the basis that the target speech rate of dubbing is the default speech rate until the second difference is less than or equal to a second threshold value.
For example, the values of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 are adjusted simultaneously, x1 gradually decreasing from 1 to 0.9 and x2 gradually increasing from 1 to 1.1, until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
Further, for method three, in practice there may be multiple combinations of the duration adjustment parameter x1 and the speech rate adjustment parameter x2, each of which satisfies, within the allowable error, (V3 · x2) · (T1/x1) = the length of the characters in the group of subtitles. For this case, additional screening conditions may be introduced, such as minimizing x1 + x2, minimizing 2·x1 + x2, or minimizing x1² + x2², to obtain an optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
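One way to realize this screening for method three, assuming a simple grid search and the x1² + x2² criterion (the function name, step size, and tolerance handling are illustrative assumptions):

```python
def best_combination(target_len, t1, v3, tol, step=0.01, x1_min=0.9, x2_max=1.1):
    """Collect every (x1, x2) pair on a step grid that satisfies
    |(V3 * x2) * (T1 / x1) - target_len| <= tol, then keep the pair
    minimizing x1**2 + x2**2 (one of the example screening criteria)."""
    candidates = []
    x1 = 1.0
    while x1 >= x1_min - 1e-9:
        x2 = 1.0
        while x2 <= x2_max + 1e-9:
            if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
                candidates.append((x1, x2))
            x2 = round(x2 + step, 10)
        x1 = round(x1 - step, 10)
    return min(candidates, key=lambda p: p[0] ** 2 + p[1] ** 2, default=None)
```

The grid here spans only the case-one directions (x1 in [0.9, 1], x2 in [1, 1.1]); a different criterion from the list above would only change the `key` function.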
Fig. 6 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to case two above. Referring to fig. 6, the method includes:
and S1331, determining default duration of dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
And S1332, if the duration of any one section of audio is greater than the default duration, and the first difference between the duration of any one section of audio and the default duration is greater than a first threshold, determining whether the length of the second subtitle is within a preset range.
And S1333, if the length of the second caption is within the preset range, reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing, so that the second difference between the product of the display duration of the target caption and the target speech rate of the dubbing and the length of the second caption is less than or equal to a second threshold value.
The display duration of the target subtitle is the display duration of the pictures corresponding to the adjusted segment of audio.
Those skilled in the art can understand that if reducing the display duration of the target subtitle is chosen in S1333, the reduced display duration cannot be made arbitrarily short. If the display duration becomes too short (exceeds a certain limit), the switching speed of the image frames corresponding to that segment of audio becomes too fast while the switching speed of the image frames corresponding to the other audio segments remains normal, which makes the video incoherent as a whole and degrades the user experience. Therefore, the minimum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
Specifically, if the initial display duration of a group of subtitles (i.e., the duration of the audio corresponding to the group of subtitles in the original video) is T1, the duration adjustment parameter is x1, and the adjusted display duration of the subtitles is T2, then T2 = T1/x1, where x1 is a number greater than 1. The minimum value of the adjusted display duration of the subtitles is limited by setting a maximum value for the duration adjustment parameter x1.
Optionally, there are various ways to implement "reducing the display duration of the target subtitle"; for example, one implementation is to increase the display speed of the pictures corresponding to the segment of audio.
Further, if the method of increasing the display speed of the pictures corresponding to the segment of audio is adopted, the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.
Specifically, if the display speed of the pictures corresponding to a certain segment of audio in the original video is V1, the display speed adjustment parameter is x1, and the adjusted display speed is V2, then V2 = V1 · x1. Setting a maximum value for the display speed adjustment parameter x1 limits the maximum adjusted display speed of the pictures corresponding to the segment of audio, and thus limits the minimum adjusted display duration of the target subtitle.
For example, with reference to fig. 2, assume that the t0 time period corresponds to 20 image frames and t0 = 2 s, so the original display speed V1 of the image frames corresponding to the t0 time period is 20 frames / 2 seconds. If the display speed of the image frames is sped up, the adjusted display speed is denoted V2, and V2 = V1 · x1, where x1 is a number greater than 1. If the maximum value of x1 is set to 1.1, the display speed of the image frames can be sped up to at most 22 frames / 2 seconds, i.e., 10% faster. Since the total number of image frames corresponding to the t0 time period is fixed at 20, the display duration of the 20 frames becomes t0/x1: changing x1 from 1 to 1.1 reduces the display duration of the image frames, i.e., the display duration of the group of subtitles.
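The numbers in this example can be checked with a small sketch (the function name is an illustrative assumption):

```python
def sped_up(n_frames: int, t0: float, x1: float):
    """Return (V2, adjusted duration): V2 = V1 * x1 and T2 = t0 / x1."""
    v1 = n_frames / t0          # original display speed, frames per second
    return v1 * x1, t0 / x1

v2, t2 = sped_up(20, 2.0, 1.1)  # 22 frames per 2 s; duration shrinks to ~1.82 s
```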
Similarly, when S1333 is executed, if reducing the target speech rate of the dubbing is chosen, and the target speech rate becomes too slow (exceeds a certain limit), the speech rate of that audio segment will be noticeably slower than the speech rates of the other audio segments, which makes the video incoherent as a whole and degrades the user experience. Therefore, the minimum value of the target speech rate can be limited by setting a speech rate adjustment parameter.
Specifically, assume that the default speech rate corresponding to the dubbing timbre selected for a certain segment of audio in the original video is V3, the speech rate adjustment parameter is x2, and the adjusted target speech rate of the dubbing timbre is V4; then V4 = V3 · x2. The minimum value of the adjusted target speech rate of the dubbing timbre is limited by setting a minimum value for the speech rate adjustment parameter x2.
Illustratively, the default speech rate of the dubbing timbre is denoted V3, and V3 is fixed. The target speech rate of the dubbing timbre is denoted V4, with initial value V3. The target speech rate of the dubbing timbre may be adjusted; for example, lowering it corresponds to V4 = V3 · x2, where x2 is a number less than 1. If the minimum value of x2 is set to 0.9, the target speech rate can be at minimum 0.9 times the default speech rate, which is equivalent to a target speech rate at most 10% slower than the default for the dubbing timbre.
The product L1 of the display duration T2 of the target subtitle and the target speech rate V4 of the dubbing may be expressed as L1 = T2 · V4 = (T1/x1) · (V3 · x2). That is, L1 is the length of text that can be read out within the display duration of the target subtitle at the dubbing target speech rate. Here, the length of text can be understood as the number of words, the number of syllables, or the like.
The product of the display duration of the target subtitle and the target speech rate of the dubbing is equal to the length of the second subtitle, meaning that the length of the text readable within the display duration of the target subtitle at the target speech rate of the dubbing is exactly equal to the length of the second subtitle. In other words, the time required to read the second subtitle at the dubbed target speech rate is exactly equal to the display duration of the target subtitle.
Therefore, in S1333, "making the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle less than or equal to the second threshold" means that the duration of dubbing audio generated based on the second subtitle coincides with the display duration of the target subtitle within the error-allowable range.
The second difference may be an absolute value of a difference between a product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle, or may be a ratio of the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle.
As described above, since it is necessary to ensure that the video retains a good audio-visual effect, the minimum display duration of the target subtitle and the minimum target speech rate must be limited, which requires the length of the second subtitle to fall within a certain range (i.e., the "preset range" in S1332).
If the length of the second subtitle is just in the preset range, under the two conditions that the display duration of the target subtitle is greater than or equal to the minimum display duration of the target subtitle and the target speech rate of dubbing is greater than or equal to the minimum target speech rate, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle can be smaller than or equal to a second threshold value by reducing the display duration of the target subtitle and/or reducing the target speech rate of dubbing.
If the length of the second subtitle is not within the preset range, under the two conditions that the display duration of the target subtitle is greater than or equal to the minimum display duration of the target subtitle and the target speech rate of dubbing is greater than or equal to the minimum target speech rate, no matter how the display duration of the target subtitle is reduced and/or the target speech rate of dubbing is reduced, the second difference between the product of the display duration of the target subtitle and the target speech rate of dubbing and the length of the second subtitle cannot be smaller than or equal to the second threshold.
Optionally, if it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
Optionally, the preset range is associated with a default speech rate and a duration of any one piece of audio.
Further, the preset range includes a lower limit (i.e. a minimum) related to the default speech rate, the duration of any one of the audio pieces, the maximum of the duration adjustment parameter, and the minimum of the speech rate adjustment parameter.
For example, assuming that the lower limit of the preset range is n1, the default speech rate is V3, the duration of the segment of audio is T1, the maximum value of the duration adjustment parameter is 1.1, and the minimum value of the speech rate adjustment parameter is 0.9, then n1 = (V3 · 0.9) · (T1/1.1).
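Mirroring the upper limit of case one, the lower limit here can be sketched as (the function and parameter names are illustrative assumptions):

```python
def preset_range_lower(v3: float, t1: float, x1_max: float = 1.1, x2_min: float = 0.9) -> float:
    """Lower limit of the preset range in case two: the shortest second
    subtitle that can still fill the slot, (V3 * x2_min) * (T1 / x1_max)."""
    return (v3 * x2_min) * (t1 / x1_max)
```

For V3 = 4 and T1 = 2 this gives 3.6 · (2/1.1) ≈ 6.55.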
On the basis of the above technical solution, there are various specific implementation methods for "reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing" when executing S1333. Three typical methods are given below.
The method comprises the following steps:
gradually decreasing the dubbing target speech rate on the basis that the dubbing target speech rate is the default speech rate; if the target speech rate of the dubbing has reached the minimum value and the second difference is greater than the second threshold, gradually reducing the display duration of the target subtitle, on the basis that the display duration of the target subtitle is the duration of the segment of audio, until the second difference is less than or equal to the second threshold.
For example, the display duration of a certain group of subtitles is held constant, i.e., T1 is unchanged and the duration adjustment parameter x1 = 1. The speech rate adjustment parameter x2 is adjusted first, gradually decreasing from 1 to 0.9 (for example, stepping through values in order from 1 to 0.9). When x2 reaches a value such that (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error, adjustment of x2 stops, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.
If x2 has reached its minimum value of 0.9 but (V3 · x2) · (T1/x1) still cannot be made equal to the length of the characters in the group of subtitles within the allowable error, x2 is fixed at 0.9 and x1 is adjusted, gradually increasing from 1 to 1.1 (for example, stepping through values in order from 1 to 1.1), until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error; the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
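Case-two method one mirrors its case-one counterpart with the adjustment directions reversed; a hypothetical sketch (function name, step, and tolerance are assumptions):

```python
def method_one_case_two(target_len, t1, v3, tol, step=0.01, x1_max=1.1, x2_min=0.9):
    """Lower x2 toward x2_min first; only when it reaches x2_min, raise
    x1 toward x1_max, until the second difference
    |(V3 * x2) * (T1 / x1) - target_len| <= tol."""
    x1, x2 = 1.0, 1.0
    while x2 >= x2_min:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x2 = round(x2 - step, 10)
    x2 = x2_min
    while x1 <= x1_max:
        if abs((v3 * x2) * (t1 / x1) - target_len) <= tol:
            return (x1, x2)
        x1 = round(x1 + step, 10)
    return None
```

With V3 = 4 words/s and T1 = 2 s, a 7.6-word subtitle is matched by lowering x2 alone, while a 6.7-word subtitle also forces x1 up.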
The second method comprises the following steps:
on the basis that the display duration of the target caption is the duration of any section of audio, gradually reducing the display duration of the target caption; if the display duration of the target caption has reached the minimum value and the second difference is greater than the second threshold, the dubbed target speech rate is gradually decreased on the basis that the dubbed target speech rate is the default speech rate until the second difference is less than or equal to the second threshold.
For example, the target speech rate of the dubbing timbre is held constant, i.e., V3 is unchanged and the speech rate adjustment parameter x2 = 1. The duration adjustment parameter x1 is adjusted first, gradually increasing from 1 to 1.1 (for example, stepping through values in order from 1 to 1.1). When x1 reaches a value such that (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error, adjustment of x1 stops, and the current duration adjustment parameter x1 and speech rate adjustment parameter x2 are output.
If x1 has reached its maximum value of 1.1 but (V3 · x2) · (T1/x1) still cannot be made equal to the length of the characters in the group of subtitles within the allowable error, x2 is further adjusted, gradually decreasing from 1 to 0.9 (for example, stepping through values in order from 1 to 0.9), until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
The third method comprises the following steps:
and gradually reducing the display duration of the target caption on the basis that the display duration of the target caption is the duration of any section of audio, and simultaneously gradually reducing the target speech rate of dubbing on the basis that the target speech rate of dubbing is the default speech rate until the second difference is less than or equal to a second threshold value.
For example, the values of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 are adjusted simultaneously, x1 gradually increasing from 1 to 1.1 and x2 gradually decreasing from 1 to 0.9, until (V3 · x2) · (T1/x1) equals the length of the characters in the group of subtitles within the allowable error. The current duration adjustment parameter x1 and speech rate adjustment parameter x2 are then output.
Further, for method three, in practice there may be multiple combinations of the duration adjustment parameter x1 and the speech rate adjustment parameter x2, each of which satisfies, within the allowable error, (V3 · x2) · (T1/x1) = the length of the characters in the group of subtitles. For this case, additional screening conditions may be introduced, such as minimizing x1 + x2, minimizing 2·x1 + x2, or minimizing x1² + x2², to obtain an optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
Fig. 7 is a flowchart of another video processing method according to an embodiment of the disclosure. In practice, it may occur that the original video includes multiple pieces of audio, which are voices of multiple target objects. Wherein the target object can be understood as a person in the video. For this case, on the basis of the above technical solutions, optionally, referring to fig. 7, the method further includes:
S210, for each target object among the plurality of target objects, selecting the dubbing timbre corresponding to the target object.
This step can be implemented in various ways. Illustratively, a plurality of dubbing timbre data are stored in a database in advance, with different dubbing timbre data corresponding to different person attribute data. Here, person attribute data includes the person's age, gender, tone of voice, occupation, and the like. When this step is executed, the person attribute data of the target object is identified based on the original video, and the dubbing timbre corresponding to the target object is determined based on that person attribute data.
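A toy sketch of the attribute-to-timbre lookup described above; the table contents, attribute fields, and fallback timbre are invented for illustration (the actual person-attribute recognition is outside this sketch):

```python
# Hypothetical lookup table: keys are (gender, age group) attribute
# pairs, values are dubbing timbre identifiers.
TIMBRE_DB = {
    ("female", "child"): "timbre_f_child",
    ("female", "adult"): "timbre_f_adult",
    ("male", "adult"): "timbre_m_adult",
    ("male", "senior"): "timbre_m_senior",
}

def select_timbre(gender: str, age_group: str) -> str:
    """Return the dubbing timbre for the recognized person attributes,
    falling back to a default timbre for unknown combinations."""
    return TIMBRE_DB.get((gender, age_group), "timbre_default")
```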
Optionally, in the same video, the dubbing timbres corresponding to the same target object are the same, and the dubbing timbres corresponding to different target objects are different.
And S220, generating a plurality of dubbing audios corresponding to the plurality of sections of audios according to the dubbing timbres corresponding to each target object respectively.
And S230, replacing the multiple sections of audio in the original video with multiple dubbing audio to obtain the target video.
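Steps S210–S230 can be sketched as a per-speaker timbre assignment (a hypothetical outline; the actual dubbing-audio generation of S220 and audio replacement of S230 are elided):

```python
def assign_timbres(segments, timbres):
    """Pair each audio segment with its speaker's dubbing timbre:
    the same target object always gets the same timbre, and distinct
    target objects get distinct timbres drawn from the pool."""
    mapping = {}            # target object -> dubbing timbre
    pool = iter(timbres)
    plan = []
    for speaker, subtitle in segments:
        if speaker not in mapping:
            mapping[speaker] = next(pool)
        plan.append((speaker, mapping[speaker], subtitle))
    return plan

# Illustrative input: (speaker, subtitle text) per audio segment.
plan = assign_timbres(
    [("A", "hello"), ("B", "hi"), ("A", "bye")],
    ["timbre_f_adult", "timbre_m_adult"],
)
```

Each entry of `plan` then feeds a speech synthesizer at the target speech rate determined above.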
According to the above technical solution, a dubbing timbre corresponding to each target object among the plurality of target objects is selected, and a plurality of dubbing audios corresponding to the plurality of audio segments are generated according to the dubbing timbre corresponding to each target object. This establishes a correspondence between characters and timbres, so that after dubbing the user can distinguish different characters by sound, which improves the user experience.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus provided by the embodiment of the present disclosure may be configured in a client, or may be configured in a server, and the video processing apparatus specifically includes:
an obtaining module 310, configured to obtain a first subtitle in an original video;
the translation module 320 is configured to translate the first subtitles to obtain second subtitles;
a determining module 330, configured to determine a target speech rate of the dubbing;
and the dubbing module 340 is configured to generate a dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
Further, the first subtitle is a subtitle corresponding to any piece of audio in the original video,
the determining module 330 is further configured to determine a display duration of a target subtitle, where the target subtitle includes the first subtitle and/or the second subtitle;
the apparatus further includes a replacing module 350, configured to replace the audio of any segment in the original video with the dubbing audio to obtain a target video, and display the target subtitle in a picture corresponding to a display duration of the target subtitle in the target video.
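The obtaining, translation, determining, dubbing, and replacing modules form a pipeline; a minimal sketch follows. All function names are illustrative, and the translator and synthesizer are stubs standing in for a real machine-translation engine and TTS engine.

```python
# Minimal sketch of the module pipeline:
# obtain -> translate -> determine speech rate -> dub -> replace.

def obtain_first_subtitle(segment):
    return segment["first_subtitle"]

def translate(first_subtitle):
    return f"[en] {first_subtitle}"          # stubbed machine translation

def determine_target_speech_rate():
    return 4.0                               # default rate, characters per second

def generate_dubbing(second_subtitle, rate):
    # dubbing duration follows from subtitle length and speech rate
    return {"text": second_subtitle, "duration": len(second_subtitle) / rate}

def replace_audio(segment, dubbing):
    target = dict(segment)
    target["audio"] = dubbing                # swap the original audio for the dub
    return target

segment = {"first_subtitle": "你好，世界", "audio": None, "duration": 2.0}
second = translate(obtain_first_subtitle(segment))
dubbing = generate_dubbing(second, determine_target_speech_rate())
target = replace_audio(segment, dubbing)
```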
Further, the determining module 330 is configured to:
determining a default duration of the dubbing according to the length of the second subtitle and a default speech rate of the dubbing;
if the duration of the audio segment is greater than or equal to the default duration, and a first difference between the duration of the audio segment and the default duration is less than or equal to a first threshold, the display duration of the target subtitle is the duration of the audio segment, and the target speech rate of the dubbing is the default speech rate.
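The first-threshold decision can be sketched as follows: the default dubbing duration is the second subtitle's length divided by the default speech rate, and the defaults are kept only when the segment is at least that long with limited slack. The rates and thresholds below are illustrative, not values from the patent.

```python
# Sketch of the first-threshold check for keeping the default speech rate
# and the segment's own duration as the subtitle display duration.
def defaults_acceptable(segment_duration, subtitle_length,
                        default_rate, first_threshold):
    default_duration = subtitle_length / default_rate
    first_difference = segment_duration - default_duration
    # segment must be long enough, but not longer than allowed by the threshold
    return 0 <= first_difference <= first_threshold
```

For example, a 20-character subtitle at 4 characters/second needs 5 s: a 5.5 s segment passes, while a 4 s segment (too short) and a 7 s segment (too much slack) both trigger the adjustment paths described below.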
Further, the device also comprises a first adjusting module. The first adjusting module is used for:
if the duration of any one section of audio is smaller than the default duration, determining whether the length of the second caption is within a preset range;
if the length of the second caption is within the preset range, increasing the display duration of the target caption and/or increasing the target speech rate of the dubbing so that a second difference between the product of the display duration of the target caption and the target speech rate of the dubbing and the length of the second caption is less than or equal to a second threshold value.
Further, the first adjusting module is configured to:
gradually increasing the target speech rate of the dubbing on the basis that the target speech rate of the dubbing is the default speech rate;
if the target speech rate of the dubbing has reached the maximum value and the second difference is greater than a second threshold, gradually increasing the display duration of the target subtitle on the basis that the display duration of the target subtitle is the duration of any one of the audio segments until the second difference is less than or equal to the second threshold.
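This "speech rate first" strategy for a too-short segment can be sketched as an iterative loop: raise the speech rate toward a maximum, and only then stretch the subtitle display duration until the second difference is within the second threshold. The step sizes, maximum rate, and threshold below are illustrative assumptions.

```python
# Sketch of the rate-first adjustment: increase the dubbing speech rate up to
# max_rate; if the second difference (|display_duration * speech_rate -
# subtitle_length|) still exceeds the threshold, lengthen the display duration.
def adjust_rate_then_duration(subtitle_length, segment_duration,
                              default_rate, max_rate, second_threshold,
                              rate_step=0.1, time_step=0.1):
    rate, duration = default_rate, segment_duration
    second_difference = lambda: abs(duration * rate - subtitle_length)
    while second_difference() > second_threshold and rate < max_rate:
        rate = min(rate + rate_step, max_rate)
    while second_difference() > second_threshold:
        duration += time_step
    return rate, duration
```

For a 30-character subtitle over a 5 s segment with a default rate of 4.0 and a maximum of 5.0, the rate climbs to its maximum first, and the display duration then grows until the mismatch falls within the threshold.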
Further, the first adjusting module is configured to:
on the basis that the display duration of the target subtitle is the duration of any one section of audio, gradually increasing the display duration of the target subtitle;
if the display duration of the target subtitle reaches the maximum value and the second difference is greater than a second threshold, gradually increasing the target speech rate of the dubbing until the second difference is less than or equal to the second threshold on the basis that the target speech rate of the dubbing is the default speech rate.
Further, the first adjusting module is configured to:
on the basis that the display duration of the target subtitle is the duration of any one of the audio segments, gradually increasing the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually increasing the target speech rate of the dubbing until the second difference is less than or equal to a second threshold.
Further, the first adjusting module increases the display duration of the target subtitle by reducing the display speed of the picture corresponding to the audio segment.
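Slowing the picture to extend the subtitle display window reduces to computing a playback speed factor, as in this small sketch (the formula assumes a uniform slow-down across the whole segment):

```python
# Sketch: a speed factor below 1.0 slows the picture, so the same frames
# span a longer on-screen time and can cover a longer dubbing; a factor
# above 1.0 speeds it up, shortening the display duration.
def playback_speed_factor(original_duration, required_duration):
    return original_duration / required_duration

# A 4 s picture that must cover 5 s of dubbing plays at 0.8x speed.
factor = playback_speed_factor(4.0, 5.0)
```

The symmetric case of the second adjusting module (speeding the picture up to shorten the display duration) uses the same formula with a factor greater than 1.0.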
Further, the device also comprises a second adjusting module. The second adjusting module is used for:
if the duration of any one section of audio is greater than the default duration, and a first difference between the duration of any one section of audio and the default duration is greater than a first threshold, determining whether the length of the second caption is within a preset range;
if the length of the second caption is within the preset range, reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing so that a second difference between the product of the display duration of the target caption and the target speech rate of the dubbing and the length of the second caption is less than or equal to a second threshold value.
Further, the second adjusting module is configured to:
gradually decreasing the target speech rate of the dubbing on the basis that the target speech rate of the dubbing is the default speech rate;
if the target speech rate of the dubbing has reached the minimum value and the second difference is greater than a second threshold, gradually decreasing the display duration of the target caption on the basis that the display duration of the target caption is the duration of any one of the audio segments until the second difference is less than or equal to the second threshold.
Further, the second adjusting module is configured to:
on the basis that the display duration of the target subtitle is the duration of any one section of audio, gradually reducing the display duration of the target subtitle;
if the display duration of the target caption has reached the minimum value and the second difference is greater than a second threshold, gradually decreasing the target speech rate of the dubbing until the second difference is less than or equal to the second threshold on the basis that the target speech rate of the dubbing is the default speech rate.
Further, the second adjusting module is configured to:
gradually reducing the display duration of the target subtitle, starting from the duration of the audio segment, and gradually reducing the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
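The simultaneous-decrease variant for a too-long segment can be sketched as one loop that shrinks both quantities together. Step sizes are illustrative and chosen small enough that the loop cannot jump over the threshold window.

```python
# Sketch of the simultaneous decrease: shrink the subtitle display duration
# and the dubbing speech rate in lockstep until the second difference
# (|display_duration * speech_rate - subtitle_length|) is within threshold.
def adjust_both_down(subtitle_length, segment_duration, default_rate,
                     second_threshold, time_step=0.05, rate_step=0.05):
    rate, duration = default_rate, segment_duration
    while abs(duration * rate - subtitle_length) > second_threshold:
        duration -= time_step
        rate -= rate_step
    return rate, duration
```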
Further, the second adjusting module reduces the display duration of the target caption by increasing the display speed of the picture corresponding to any one of the audio segments.
Further, the translation module is further configured to:
if it is determined that the length of the second subtitle is not within the preset range, re-translate the first subtitle according to the preset range, so that the length of the re-translated second subtitle falls within the preset range.
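The re-translation step can be sketched as a length-constrained retry. `translate_with_hint` is entirely hypothetical: a real machine-translation engine would accept a target-length hint, whereas here it returns canned renderings of different lengths so the control flow is visible.

```python
# Hypothetical sketch of re-translation under a length constraint.
def translate_with_hint(first_subtitle, style):
    renderings = {
        "default": "this is a fairly long translated subtitle line",
        "terse": "a short translated line",
        "verbose": "this is an even longer, more detailed translated subtitle line",
    }
    return renderings[style]

def translate_within_range(first_subtitle, min_length, max_length):
    second = translate_with_hint(first_subtitle, "default")
    if len(second) > max_length:
        # too long for the segment: request a terser rendering
        second = translate_with_hint(first_subtitle, "terse")
    elif len(second) < min_length:
        # too short: request a more verbose rendering
        second = translate_with_hint(first_subtitle, "verbose")
    return second
```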
Further, the display duration of the target subtitle is the display duration of the picture corresponding to any one of the audio segments.
Further, the preset range is related to the default speech rate and the duration of any one piece of audio.
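One way this dependence could look is sketched below: a translated subtitle is "fit-able" if dubbing it at a rate within some factor of the default can roughly fill the segment. The 0.8/1.25 factors are assumptions for illustration, not values from the patent.

```python
# Sketch: derive the preset subtitle-length range from the default speech
# rate (characters per second) and the audio segment's duration (seconds).
def preset_length_range(default_rate, segment_duration,
                        min_factor=0.8, max_factor=1.25):
    lower = int(default_rate * min_factor * segment_duration)
    upper = int(default_rate * max_factor * segment_duration)
    return lower, upper

# 4 characters/second over a 5 s segment -> subtitle lengths of 16 to 25 chars.
bounds = preset_length_range(4.0, 5.0)
```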
Further, the original video comprises a plurality of pieces of audio, and the plurality of pieces of audio are voices of a plurality of target objects;
the apparatus also includes a selection module; the selection module is configured to select, for each target object among the plurality of target objects, a dubbing timbre corresponding to that target object;
the dubbing module is configured to generate a plurality of dubbing audios corresponding to the multiple audio segments according to the dubbing timbre corresponding to each target object;
and the replacing module is used for replacing the plurality of sections of audio in the original video with the plurality of dubbing audio to obtain the target video.
The video processing apparatus provided in the embodiment of the present disclosure may perform the steps performed by the client or the server in the video processing method provided in the embodiments of the present disclosure; the steps and their beneficial effects are not repeated here.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, showing an electronic device 1000 suitable for implementing embodiments of the present disclosure. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), and a wearable electronic device, as well as fixed terminals such as a digital TV, a desktop computer, and a smart home device. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003, so as to implement the video processing method of the embodiments described in the present disclosure. The RAM 1003 also stores various programs and information necessary for the operation of the electronic device 1000. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: an input device 1006 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; an output device 1007 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage device 1008 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While fig. 9 illustrates an electronic device 1000 having various devices, it should be understood that not all illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flowchart, thereby implementing the video processing method described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 1009, or installed from the storage device 1008, or installed from the ROM 1002. When the computer program is executed by the processing device 1001, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, by contrast, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer readable program code is carried. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: an electrical wire, an optical cable, RF (radio frequency), or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital information communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a first subtitle in an original video;
translating the first caption to obtain a second caption;
determining a target speech rate of dubbing;
and generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
Optionally, when the one or more programs are executed by the electronic device, the electronic device may further perform other steps described in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the video processing methods provided by the present disclosure.
According to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a video processing method as any one of the video processing methods provided by the present disclosure.
Embodiments of the present disclosure also provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the video processing method as described above.
The foregoing description is only a description of the preferred embodiments of the present disclosure and the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features with similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. A method of video processing, the method comprising:
acquiring a first subtitle in an original video;
translating the first caption to obtain a second caption;
determining a target speech rate of dubbing;
and generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
2. The method of claim 1, wherein the first subtitle is a subtitle corresponding to any piece of audio in the original video;
the method further comprises the following steps:
determining the display duration of a target subtitle, wherein the target subtitle comprises the first subtitle and/or the second subtitle;
after generating the dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing, the method further includes:
and replacing any section of audio in the original video with the dubbing audio to obtain a target video, and displaying the target subtitle in a picture corresponding to the display duration of the target subtitle in the target video.
3. The method of claim 2, wherein determining the display duration of the target caption and the target speech rate for dubbing comprises:
determining a default duration of the dubbing according to the length of the second subtitle and a default speech rate of the dubbing;
if the duration of any one of the audio segments is greater than or equal to the default duration, and the first difference between the duration of any one of the audio segments and the default duration is less than or equal to a first threshold, the display duration of the target subtitle is the time corresponding to any one of the audio segments, and the target speech rate of the dubbing is the default speech rate.
4. The method of claim 3, further comprising:
if the duration of any one section of audio is smaller than the default duration, determining whether the length of the second caption is within a preset range;
if the length of the second caption is within the preset range, increasing the display duration of the target caption and/or increasing the target speech rate of the dubbing so that a second difference between the product of the display duration of the target caption and the target speech rate of the dubbing and the length of the second caption is less than or equal to a second threshold value.
5. The method of claim 4, wherein increasing the display duration of the target caption and/or increasing the target pace of the dubbing comprises:
gradually increasing the target speech rate of the dubbing on the basis that the target speech rate of the dubbing is the default speech rate;
if the target speech rate of the dubbing has reached the maximum value and the second difference is greater than a second threshold, gradually increasing the display duration of the target subtitle on the basis that the display duration of the target subtitle is the duration of any one of the audio segments until the second difference is less than or equal to the second threshold.
6. The method of claim 4, wherein increasing the display duration of the target caption and/or increasing the target pace of the dubbing comprises:
on the basis that the display duration of the target subtitle is the duration of any one section of audio, gradually increasing the display duration of the target subtitle;
if the display duration of the target subtitle reaches the maximum value and the second difference is greater than a second threshold, gradually increasing the target speech rate of the dubbing until the second difference is less than or equal to the second threshold on the basis that the target speech rate of the dubbing is the default speech rate.
7. The method of claim 4, wherein increasing the display duration of the target caption and/or increasing the target pace of the dubbing comprises:
on the basis that the display duration of the target subtitle is the duration of any one of the audio segments, gradually increasing the display duration of the target subtitle, and on the basis that the target speech rate of the dubbing is the default speech rate, gradually increasing the target speech rate of the dubbing until the second difference is less than or equal to a second threshold.
8. The method of any one of claims 5-7, wherein increasing the display duration of the target subtitle comprises:
reducing the display speed of the picture corresponding to the audio segment.
9. The method of claim 3, further comprising:
if the duration of any one section of audio is greater than the default duration, and a first difference between the duration of any one section of audio and the default duration is greater than a first threshold, determining whether the length of the second caption is within a preset range;
if the length of the second caption is within the preset range, reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing so that a second difference between the product of the display duration of the target caption and the target speech rate of the dubbing and the length of the second caption is less than or equal to a second threshold value.
10. The method of claim 9, wherein reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing comprises:
gradually decreasing the target speech rate of the dubbing on the basis that the target speech rate of the dubbing is the default speech rate;
if the target speech rate of the dubbing has reached the minimum value and the second difference is greater than a second threshold, gradually decreasing the display duration of the target caption on the basis that the display duration of the target caption is the duration of any one of the audio segments until the second difference is less than or equal to the second threshold.
11. The method of claim 9, wherein reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing comprises:
on the basis that the display duration of the target subtitle is the duration of any one section of audio, gradually reducing the display duration of the target subtitle;
if the display duration of the target caption has reached the minimum value and the second difference is greater than a second threshold, gradually decreasing the target speech rate of the dubbing until the second difference is less than or equal to the second threshold on the basis that the target speech rate of the dubbing is the default speech rate.
12. The method of claim 9, wherein reducing the display duration of the target caption and/or reducing the target speech rate of the dubbing comprises:
and gradually reducing the display duration of the target subtitle on the basis that the display duration of the target subtitle is the duration of any one section of audio, and gradually reducing the target speech rate of the dubbing on the basis that the target speech rate of the dubbing is the default speech rate until the second difference is less than or equal to a second threshold value.
13. The method of any one of claims 10-12, wherein reducing the display duration of the target subtitle comprises:
increasing the display speed of the picture corresponding to the audio segment.
14. The method according to claim 4 or 9, characterized in that the method further comprises:
and if the length of the second caption is determined not to be within the preset range, re-translating the first caption according to the preset range, so that the length of the second caption obtained after re-translation is within the preset range.
15. The method according to claim 4 or 9, wherein the display duration of the target caption is the display duration of the picture corresponding to any one of the audio pieces.
16. The method according to claim 4 or 9, wherein the preset range is related to the default speech rate and the duration of any one of the audio pieces.
17. The method of claim 1, wherein the original video comprises a plurality of pieces of audio, the plurality of pieces of audio being speech of a plurality of target objects;
the method further comprises the following steps:
for each target object in the plurality of target objects, selecting a dubbing tone corresponding to the target object;
generating a plurality of dubbing audios corresponding to the plurality of sections of audios according to the dubbing timbres corresponding to each target object respectively;
and replacing the multiple sections of audio in the original video with the multiple dubbing audios to obtain a target video.
18. A video processing apparatus, comprising:
the acquisition module is used for acquiring a first subtitle in an original video;
the translation module is used for translating the first caption to obtain a second caption;
the determining module is used for determining the target speech rate of dubbing;
and the dubbing module is used for generating the dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
19. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-17.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-17.
CN202110472124.XA 2021-04-29 2021-04-29 Video processing method and device, electronic equipment and storage medium Pending CN113207044A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110472124.XA CN113207044A (en) 2021-04-29 2021-04-29 Video processing method and device, electronic equipment and storage medium
PCT/CN2022/087381 WO2022228179A1 (en) 2021-04-29 2022-04-18 Video processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110472124.XA CN113207044A (en) 2021-04-29 2021-04-29 Video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113207044A true CN113207044A (en) 2021-08-03

Family

ID=77029350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110472124.XA Pending CN113207044A (en) 2021-04-29 2021-04-29 Video processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113207044A (en)
WO (1) WO2022228179A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114025236A (en) * 2021-11-16 2022-02-08 上海大晓智能科技有限公司 Video content understanding method and device, electronic equipment and storage medium
WO2022228179A1 (en) * 2021-04-29 2022-11-03 北京有竹居网络技术有限公司 Video processing method and apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006129247A1 (en) * 2005-05-31 2006-12-07 Koninklijke Philips Electronics N. V. A method and a device for performing an automatic dubbing on a multimedia signal
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109218629A (en) * 2018-09-14 2019-01-15 三星电子(中国)研发中心 Video generation method, storage medium and device
US20200169591A1 (en) * 2019-02-01 2020-05-28 Ben Avi Ingel Systems and methods for artificial dubbing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014141054A1 (en) * 2013-03-11 2014-09-18 Video Dubber Ltd. Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
US11582527B2 (en) * 2018-02-26 2023-02-14 Google Llc Automated voice translation dubbing for prerecorded video
CN111683266A (en) * 2020-05-06 2020-09-18 厦门盈趣科技股份有限公司 Method and terminal for configuring subtitles through simultaneous translation of videos
CN112562721B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Video translation method, system, device and storage medium
CN113207044A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022228179A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
CN110677711A (en) Video dubbing method and device, electronic equipment and computer readable medium
WO2022228179A1 (en) Video processing method and apparatus, electronic device, and storage medium
CN113259740A (en) Multimedia processing method, device, equipment and medium
CN113613068A (en) Video processing method and device, electronic equipment and storage medium
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN110418183B (en) Audio and video synchronization method and device, electronic equipment and readable medium
US10595067B2 (en) Video providing apparatus, video providing method, and computer program
CN110958481A (en) Video page display method and device, electronic equipment and computer readable medium
CN112601101A (en) Subtitle display method and device, electronic equipment and storage medium
CN113507637A (en) Media file processing method, device, equipment, readable storage medium and product
CN112423107A (en) Lyric video display method and device, electronic equipment and computer readable medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
JP7255026B2 (en) Video recording method, apparatus, electronic equipment and storage medium
CN113992926B (en) Interface display method, device, electronic equipment and storage medium
JP2024505988A (en) Scene description playback control
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113207025B (en) Video processing method and device, electronic equipment and storage medium
CN112954453A (en) Video dubbing method and apparatus, storage medium, and electronic device
CN114554238B (en) Live broadcast voice simultaneous transmission method, device, medium and electronic equipment
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN111935541B (en) Video correction method and device, readable medium and electronic equipment
CN113885741A (en) Multimedia processing method, device, equipment and medium
CN114430491A (en) Live broadcast-based data processing method and device
CN114125358A (en) Cloud conference subtitle display method, system, device, electronic equipment and storage medium
CN113905177A (en) Video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210803