WO2022228179A1 - Video processing method and apparatus, electronic device, and storage medium - Google Patents

Video processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2022228179A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
subtitle
dubbing
speech rate
duration
Prior art date
Application number
PCT/CN2022/087381
Other languages
French (fr)
Chinese (zh)
Inventor
杜育璋
刘坚
李磊
王明轩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022228179A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/4355 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • the present disclosure relates to the field of information technology, and in particular, to a video processing method, apparatus, electronic device, and storage medium.
  • terminals have become an indispensable device in people's lives. For example, users can watch videos through the terminal.
  • Some current videos may be videos in other languages, and users may not understand the audio content in the videos.
  • the existing technology is to display subtitles that the user can read in the video, but in some cases, the speed at which the user browses the subtitles may not match the display speed of the subtitles, thereby reducing the user experience.
  • embodiments of the present disclosure provide a video processing method, apparatus, electronic device, and storage medium.
  • An embodiment of the present disclosure provides a video processing method, and the method includes:
  • the first subtitle is translated to obtain the second subtitle
  • the dubbing audio corresponding to the second subtitle is generated according to the target speech rate of the dubbing.
  • Embodiments of the present disclosure also provide a video processing apparatus, including:
  • an acquisition module for acquiring the first subtitle in the original video
  • a translation module for translating the first subtitle to obtain a second subtitle
  • a dubbing module configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
  • Embodiments of the present disclosure also provide an electronic device, the electronic device comprising:
  • one or more processors
  • a storage device for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method as described above.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the above-mentioned video processing method.
  • Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a processor, implement the video processing method as described above.
  • The technical solution provided by the embodiments of the present disclosure has at least the following advantages: by acquiring the first subtitle in the original video, translating the first subtitle to obtain the second subtitle, determining the target speech rate of the dubbing, and generating dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing, it generates dubbing audio that video viewers can understand, which reduces the difficulty of understanding the video content and improves the user experience.
  • In addition, by determining the display duration of the target subtitle and the target speech rate of the dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted so that the duration of the dubbing audio is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that sentences expressing the same meaning may differ in length across languages, which causes a mismatch between the dubbing duration and the subtitle display duration, and thereby improves the user experience.
  • FIG. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure.
  • FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • This embodiment is applicable to the case of dubbing a video in a client.
  • The method can be executed by a video processing apparatus, which can be implemented in software and/or hardware and can be configured in an electronic device such as a terminal, specifically including but not limited to smart phones, PDAs, tablet computers, wearable devices with display screens, desktop computers, notebook computers, all-in-one computers, smart home devices, etc.
  • Alternatively, this embodiment may be applicable to the case of dubbing the video in a server: the method may be executed by a video processing apparatus, which may be implemented in software and/or hardware and may be configured in an electronic device such as a server.
  • the method may specifically include:
  • The first subtitle is in the same language as that spoken by the character in the original video. Exemplarily, if the character in the video speaks English, the first subtitle is in English.
  • the first subtitle can be directly extracted.
  • the original video does not include the first subtitle
  • speech recognition is performed on any piece of audio in the original video to obtain the first subtitle.
  • the first subtitle is a subtitle corresponding to any audio segment in the original video.
  • any piece of audio refers to audio information corresponding to any sentence spoken by any character in the video.
  • video includes audio stream and video stream.
  • the video stream includes multiple image frames. Multiple image frames are played in chronological order to form a dynamic image of the video. The characters in this video speak during some periods of time and do not speak during other periods of time.
  • the audio stream is composed of multiple pieces of audio, and each piece of audio corresponds to a sentence spoken by a character in the video. A piece of audio corresponds to multiple image frames.
  • FIG. 2 is a schematic diagram of an image frame in a video according to an embodiment of the present disclosure.
  • the character in the video speaks during the t0 time period and the t3 time period, and does not speak during the rest of the time period.
  • the image frame in the time period t0 corresponds to a continuous speech
  • the continuous speech corresponding to the time period t0 constitutes a piece of audio.
  • the image frame in the time period t3 corresponds to another continuous speech, and the continuous speech corresponding to the time period t3 constitutes a piece of audio.
  • each continuous sentence of Chinese speech can be recognized as a continuous sentence of Chinese subtitles according to the speech pauses in the audio.
  • the second subtitle is in a different language than the first subtitle.
  • the first subtitle is in Chinese
  • the second subtitle is in English.
  • the second subtitle is in a language understood by the viewer of the video. In actual setting, the second subtitle can be set according to the needs of the video viewer.
  • the target speech rate of dubbing refers to the speech rate of dubbing used for reading the second subtitle.
  • a plurality of dubbing timbre data are pre-stored in the database, and different dubbing timbre data have corresponding default speech rate data.
  • the dubbing tone data is selected, and the default speech rate corresponding to the selected dubbing tone and/or the speech rate after adjusting the default speech rate is used as the dubbing target speech rate.
  • The technical solutions provided by the embodiments of the present disclosure acquire the first subtitle in the original video, translate the first subtitle to obtain the second subtitle, determine the target speech rate of the dubbing, and generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing. This produces dubbing audio that video viewers can understand, which reduces the difficulty of understanding the video content and improves the user experience.
  • the dubbed audio obtained by the above technical solution of the present application may be played alone, or the audio in the original video may be replaced with the dubbed audio to obtain a dubbed video. You can also play the original video synchronously with the dubbed audio, so as to achieve the goal of having a dubbing effect when the video is played.
  • FIG. 3 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • FIG. 3 is a specific example of FIG. 1 . Referring to Figure 3, the method includes:
  • S130 Determine the display duration of the target subtitle and the target speech rate of the dubbing, where the target subtitle includes the first subtitle and/or the second subtitle.
  • When generating dubbing audio based on the second subtitle, the duration of the dubbing audio should be equivalent to the playback duration of the image frames corresponding to the dubbing audio (which can also be understood as the display duration of the first subtitle and/or the second subtitle, that is, the display duration of the target subtitle here). If the duration of the dubbing audio is greater than the playback duration of the corresponding image frames, the image frames finish playing before the dubbing audio has ended.
  • If the duration of the dubbing audio is less than the playback duration of the corresponding image frames, the dubbing audio ends while the image frames are still displayed. Both situations cause the image to be out of sync with the audio, which affects the user experience.
  • To make the dubbing audio duration equal to the playback duration of the image frames corresponding to the dubbing audio, two parameters must be determined first: the dubbing audio duration, and the playback duration of the image frames corresponding to the dubbing audio.
  • the duration of the dubbing audio mainly depends on two quantities, one is the length of the second subtitle, and the other is the speech rate of the dubbing. Since the second subtitle has been obtained in S120, the length of the second subtitle is constant in this step, and the duration of the dubbing audio at this time mainly depends on the speech rate of the dubbing.
  • the essence of this step is to determine the appropriate display duration of the target subtitle and the target speech rate of the dubbing, so that the duration of the dubbed audio corresponding to the second subtitle is consistent with the display duration of the target subtitle within the allowable error range .
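  • The relationship described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the speech rate is taken to be in subtitle-length units (for example, words) per second, and the dubbing duration is modelled simply as length divided by rate.

```python
def dubbing_duration(subtitle_length: float, speech_rate: float) -> float:
    """Seconds needed to read `subtitle_length` units at `speech_rate` units per second.

    Assumption: duration = length / rate (not stated as a formula in the source).
    """
    if speech_rate <= 0:
        raise ValueError("speech_rate must be positive")
    return subtitle_length / speech_rate


def is_consistent(dub_seconds: float, display_seconds: float,
                  tolerance_seconds: float) -> bool:
    """True if the two durations agree within the allowable error range."""
    return abs(dub_seconds - display_seconds) <= tolerance_seconds
```

  • With the second subtitle fixed, only the speech rate and the display duration remain free, which is why this step adjusts those two quantities.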
  • the sequence between determining the display duration of the target subtitles and determining the target speech rate for dubbing is not limited.
  • determining the display duration of the target subtitles and determining the target speech rate of dubbing may be two independent processes, or may be interrelated processes.
  • The essence of this step is to read the text of the second subtitle at the target speech rate of the dubbing determined in S130, thereby generating the dubbing audio corresponding to the second subtitle.
  • the target video is obtained by replacing the Chinese audio in the original video with the English audio. And add Chinese subtitles and/or English subtitles to the picture of the video frame.
  • The essence of the above technical solutions of the present application is that, by determining the display duration of the target subtitle and the target speech rate of the dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted so that the dubbing audio duration is consistent with the display duration of the target subtitle within the allowable error range. This solves the problem that sentences expressing the same meaning may differ in length across languages, which causes a mismatch between the dubbing duration and the subtitle display duration, and thereby improves the user experience.
  • S150 can also be replaced by: generating an audio file according to the dubbing audio corresponding to each segment of audio in the original video, where the audio file includes the dubbing audio corresponding to each segment of audio in the original video and the time information of each dubbing audio.
  • each dubbing audio is called and played in sequence according to the current playing time progress of the original video and the time information of each dubbing audio.
  • the time information of each dubbed audio includes the start time and/or the end time of the dubbed audio.
  • each audio file includes the start time of the dubbing audio corresponding to the audio file.
  • the start time of the dubbing audio A is the 12th second from the playback of the first image frame of the video.
  • the dubbing audio A is called and played.
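  • As a rough illustration of calling each dubbing audio according to the playback progress, the sketch below looks up which clip should play at a given time. The `DubClip` type, its field names, and `clip_to_play` are hypothetical; the source only states that each dubbing audio carries a start time and/or an end time, and this sketch assumes both are present.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DubClip:
    clip_id: str
    start: float  # seconds since the first image frame of the video
    end: float    # seconds since the first image frame of the video


def clip_to_play(clips: Sequence[DubClip], playback_time: float) -> Optional[DubClip]:
    """Return the dubbing clip whose time window contains `playback_time`, if any."""
    for clip in clips:
        if clip.start <= playback_time < clip.end:
            return clip
    return None
```

  • For a clip "A" that starts at the 12th second, `clip_to_play` returns that clip once the playback progress reaches 12 seconds and returns `None` during periods in which no character speaks.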
  • the original audio of the original video is eliminated.
  • only the audio of the characters speaking in the original video is eliminated, and the background sound of the original video is retained.
  • a play button or icon corresponding to the audio file may be displayed in the user interface for playing the target video.
  • the audio in the target video is still the audio in the original video, that is, the audio in the original video is not replaced with the corresponding dubbing audio.
  • the button or icon is on, the audio in the original video is replaced with the corresponding dubbing audio, that is, the audio in the target video becomes the dubbing audio.
  • the terminal can play the dubbing audio alone.
  • Case 1: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle.
  • the duration of the dubbed audio and the duration of the audio in the original video are originally the same within the allowable error range. In this case, it can be directly determined that the display duration of the target subtitle is the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
  • a plurality of dubbing timbre data are pre-stored in the database, and different dubbing timbre data have corresponding default speech rate data.
  • the default speech rate of dubbing is the default speech rate corresponding to the selected dubbing timbre, which is obtained based on the dubbing timbre data stored in the database.
  • Case 2: For a piece of audio in the original video, the duration of the dubbed audio is obtained based on the default speech rate of the dubbing and the second subtitle.
  • the duration of the dubbed audio and the duration of the audio in the original video are originally quite different, and cannot be considered to be consistent within the allowable range of errors.
  • the display duration of the target subtitle can be determined based on the duration of the audio in the original video; at the same time, the target speech rate of the dubbing can be determined based on the default speech rate of the dubbing.
  • FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the above case one. Referring to Figure 4, the method includes:
  • S1311. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • the display duration of the target subtitle is the time corresponding to any piece of audio
  • the target speech rate for dubbing is the default speech rate.
  • the "first difference between the duration of any piece of audio and the default duration” may be the absolute value of the difference between the duration of any piece of audio and the default duration, or may be the ratio of the duration of any piece of audio to the default duration.
  • the essence of this setting is that, within the allowable range of errors, if the duration of any audio segment is consistent with the default duration of its corresponding dubbing, the display duration of the target subtitle is directly determined to be the time corresponding to this segment of audio, and the target speech rate of the dubbing is the default speech rate.
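  • A minimal sketch of this check follows, assuming the absolute-difference form of the first difference and a caller-supplied threshold. The function names and the `first_threshold` parameter are hypothetical, and the default duration is modelled as subtitle length divided by default speech rate.

```python
from typing import Optional, Tuple


def default_duration(subtitle_length: float, default_rate: float) -> float:
    """Default dubbing duration: time to read the second subtitle at the default rate."""
    return subtitle_length / default_rate


def case_one(audio_duration: float, subtitle_length: float, default_rate: float,
             first_threshold: float) -> Optional[Tuple[float, float]]:
    """Return (display_duration, target_rate) when the defaults already fit, else None.

    The first difference is taken here as |audio duration - default duration|;
    the source also allows a ratio form, which is not implemented in this sketch.
    """
    diff = abs(audio_duration - default_duration(subtitle_length, default_rate))
    if diff <= first_threshold:
        return audio_duration, default_rate
    return None
```

  • A `None` result corresponds to the situation in which the durations differ too much and further adjustment is needed.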
  • FIG. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 5, the method includes:
  • S1321. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio after adjustment.
  • The display duration of the target subtitle cannot be increased without limit. If the increased display duration is too long (beyond a certain limit), the image frames corresponding to this audio segment switch too slowly while the image frames corresponding to other audio segments switch at normal speed, which makes the video as a whole feel inconsistent and affects the user experience. Therefore, the maximum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
  • Exemplarily, let T1 be the duration of a piece of audio in the original video, x1 the duration adjustment parameter, and T2 the display duration of the subtitle after adjustment; then T2 = T1/x1, where x1 is a number less than 1.
  • the implementation method for increasing the display duration of the target subtitles includes: reducing the display speed of a picture corresponding to any piece of audio.
  • the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.
  • Exemplarily, let V1 be the display speed of the picture corresponding to the audio segment before adjustment, x1 the display speed adjustment parameter, and V2 the display speed of the picture corresponding to the audio segment after adjustment; then V2 = V1 × x1.
  • For example, if the display speed of the image frames is reduced by 10% and the total number of image frames corresponding to the t0 time period is 20, the display duration of the 20 image frames becomes t0/x1. Changing the value of x1 from 1 to 0.9 is equivalent to increasing the display duration of the image frames, that is, increasing the display duration of the group of subtitles.
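  • The effect of the adjustment parameter x1 can be checked with a small worked example. The function name `adjusted_display` is hypothetical, and the display speed is assumed to be expressed in frames per second; the relations used are the display duration t0/x1 given above and a display speed scaled by x1.

```python
from typing import Tuple


def adjusted_display(t1: float, v1: float, x1: float) -> Tuple[float, float]:
    """Return (T2, V2) for adjustment parameter x1, with T2 = T1 / x1 and V2 = V1 * x1."""
    if x1 <= 0:
        raise ValueError("x1 must be positive")
    return t1 / x1, v1 * x1


# Slowing playback to 90% of its original speed stretches a 1.8 s segment to about 2.0 s.
t2, v2 = adjusted_display(1.8, 30.0, 0.9)
```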
  • If the target speech rate of the dubbing is increased and the increased rate is too fast (beyond a certain limit), the user may not hear it clearly, which affects the user experience. Therefore, the maximum value of the target speech rate can be limited by setting a speech rate adjustment parameter.
  • Exemplarily, the default speech rate corresponding to the dubbing tone selected for a certain piece of audio in the original video is V3, and V3 is fixed; the speech rate adjustment parameter is x2; the target speech rate of the dubbing is V4, and the initial value of V4 is V3.
  • The length of a subtitle can be understood as the number of words in the subtitle, the number of syllables in the subtitle, or the like.
  • the product of the display duration of the target subtitle and the target speech rate of dubbing is equal to the length of the second subtitle, which means that the length of readable text at the target speech rate of dubbing within the display duration of the target subtitle is exactly equal to the length of the second subtitle.
  • the time required to read the second subtitle at the target speech rate of dubbing is exactly equal to the display duration of the target subtitle.
  • The second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other; or it may be the ratio of that product to the length of the second subtitle.
  • Because the maximum display duration of the target subtitle and the maximum target speech rate both need to be limited, the length of the second subtitle must lie within a certain range (that is, the "preset range" in S1322).
  • If the length of the second subtitle is within the preset range, the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle can be made less than or equal to the second threshold.
  • If the length of the second subtitle is outside the preset range, that second difference cannot be made less than or equal to the second threshold.
  • the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • The preset range includes an upper limit value (i.e., a maximum value), and the upper limit value is related to the default speech rate, the duration of any piece of audio, the minimum value of the duration adjustment parameter, and the maximum value of the speech rate adjustment parameter.
  • Exemplarily, when the minimum value of the duration adjustment parameter is 0.9 and the maximum value of the speech rate adjustment parameter is 1.1, the upper limit value (i.e., the maximum value) is n1 = (V3 × 1.1) × (T1 / 0.9).
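  • The upper limit can be computed directly from the quantities named above. In this sketch the function names are hypothetical, and the limits 0.9 and 1.1 are the example values from the text, exposed as defaults rather than fixed constants.

```python
def max_subtitle_length(default_rate: float, audio_duration: float,
                        x1_min: float = 0.9, x2_max: float = 1.1) -> float:
    """Upper limit n1 = (V3 * x2_max) * (T1 / x1_min) on the second subtitle's length."""
    return (default_rate * x2_max) * (audio_duration / x1_min)


def needs_retranslation(subtitle_length: float, default_rate: float,
                        audio_duration: float) -> bool:
    """True if the second subtitle is too long to fit even at the maximum adjustments."""
    return subtitle_length > max_subtitle_length(default_rate, audio_duration)
```

  • For a default rate of 4 units per second and a 3-second audio segment, the upper limit is about 14.67 units, so a 15-unit subtitle would be retranslated while a 14-unit subtitle would not.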
  • Starting with the target speech rate of the dubbing equal to the default speech rate, gradually increase the target speech rate of the dubbing; if the target speech rate has reached its maximum value and the second difference is still greater than the second threshold, then, starting with the display duration of the target subtitle equal to the duration of the piece of audio, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the speech rate adjustment parameter x2, and the value of x2 gradually increases from 1 to 1.1. For example, in the order from 1 to 1.1, the values are taken at intervals.
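  • The speech-rate-first procedure can be sketched as a two-phase search. This is an illustration, not the disclosed implementation: the function names, the number of steps, and the use of the absolute-difference form of the second difference are assumptions, as are the relations T2 = T1 / x1 for the display duration and V3 × x2 for the target speech rate.

```python
from typing import Optional, Tuple


def second_difference(display_duration: float, target_rate: float,
                      subtitle_length: float) -> float:
    """Absolute-difference form of the second difference."""
    return abs(display_duration * target_rate - subtitle_length)


def rate_first_search(t1: float, v3: float, length: float, threshold: float,
                      x2_max: float = 1.1, x1_min: float = 0.9,
                      steps: int = 10) -> Optional[Tuple[float, float]]:
    """Return (x1, x2) meeting the threshold, adjusting the speech rate first.

    Phase 1 raises x2 from 1 to x2_max with the display duration unchanged.
    Phase 2 keeps x2 at x2_max and lowers x1 from 1 to x1_min, which lengthens
    the display duration. Returns None if no tested combination fits.
    """
    for i in range(steps + 1):
        x2 = 1 + (x2_max - 1) * i / steps
        if second_difference(t1, v3 * x2, length) <= threshold:
            return 1.0, x2
    for i in range(1, steps + 1):
        x1 = 1 - (1 - x1_min) * i / steps
        if second_difference(t1 / x1, v3 * x2_max, length) <= threshold:
            return x1, x2_max
    return None
```

  • A `None` result corresponds to a second subtitle whose length is outside the preset range, in which case the text above calls for retranslation.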
  • Starting with the display duration of the target subtitle equal to the duration of the piece of audio, gradually increase the display duration of the target subtitle; if the display duration has reached its maximum value and the second difference is still greater than the second threshold, then, starting with the target speech rate of the dubbing equal to the default speech rate, gradually increase the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the duration adjustment parameter x1, and the value of x1 gradually decreases from 1 to 0.9. For example, in the order from 1 to 0.9, the values are taken at intervals.
  • the display duration of the target subtitles is the duration of any piece of audio
  • the target speech rate of dubbing is the default speech rate
  • Other filtering conditions can also be added, such as requiring that x1 + x2 be the smallest, that 2·x1 + x2 be the smallest, or that x1² + x2² be the smallest, to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
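  • One way to apply such a filtering condition is a small grid search over candidate (x1, x2) pairs. The sketch below is an assumption-laden illustration: the function name, the grid resolution, the default objective x1 + x2, and the relations T2 = T1 / x1 and target rate V3 × x2 are not prescribed by the text.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def best_combination(t1: float, v3: float, length: float, threshold: float,
                     objective: Callable[[float, float], float] = lambda x1, x2: x1 + x2,
                     x1_min: float = 0.9, x2_max: float = 1.1,
                     steps: int = 10) -> Optional[Tuple[float, float]]:
    """Pick the feasible (x1, x2) pair that minimises `objective`, or None if none fits."""
    x1_values = [1 - (1 - x1_min) * i / steps for i in range(steps + 1)]
    x2_values = [1 + (x2_max - 1) * i / steps for i in range(steps + 1)]
    feasible = [
        (x1, x2)
        for x1, x2 in product(x1_values, x2_values)
        # Keep pairs whose second difference (absolute form) meets the threshold.
        if abs((t1 / x1) * (v3 * x2) - length) <= threshold
    ]
    return min(feasible, key=lambda pair: objective(*pair), default=None)
```

  • Passing a different `objective`, for example `lambda x1, x2: x1**2 + x2**2`, selects among the same feasible pairs using another of the listed conditions.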
  • FIG. 6 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to the second case above. Referring to Figure 6, the method includes:
  • S1331. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
  • the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio after adjustment.
  • the minimum display duration of the target subtitle can be limited by setting a duration adjustment parameter.
  • Exemplarily, let T1 be the duration of a piece of audio in the original video, x1 the duration adjustment parameter, and T2 the display duration of the subtitle after adjustment; then T2 = T1/x1, where x1 is a number greater than 1.
  • the implementation method for reducing the display duration of the target subtitles includes: increasing the display speed of a picture corresponding to any piece of audio.
  • the duration adjustment parameter x1 can be regarded as the display speed adjustment parameter x1.
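  • Exemplarily, let V1 be the display speed of the picture corresponding to the audio segment before adjustment, x1 the display speed adjustment parameter, and V2 the display speed of the picture corresponding to the audio segment after adjustment; then V2 = V1 × x1.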
  • If the target speech rate of the dubbing is reduced and the reduced rate is too slow (beyond a certain limit), the speech rate of this segment of audio will be too slow while the speech rates of the other audio segments are normal, which makes the video as a whole feel inconsistent and affects the user experience. Therefore, the minimum value of the target speech rate can be limited by setting a speech rate adjustment parameter.
  • Exemplarily, the default speech rate corresponding to the dubbing tone selected for a certain segment of audio in the original video is V3, and V3 is fixed; the speech rate adjustment parameter is x2; the target speech rate of the dubbing is V4, and the initial value of V4 is V3.
  • The length of a subtitle can be understood as the number of words in the subtitle, the number of syllables in the subtitle, or the like.
  • the product of the display duration of the target subtitle and the target speech rate of dubbing is equal to the length of the second subtitle, which means that the length of readable text at the target speech rate of dubbing within the display duration of the target subtitle is exactly equal to the length of the second subtitle.
  • the time required to read the second subtitle at the target speech rate of dubbing is exactly equal to the display duration of the target subtitle.
  • the second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other; or it may be the ratio of that product to the length of the second subtitle.
  • the second difference between the product of the display duration of the target subtitles and the target speech rate of dubbing and the length of the second subtitles can be less than or equal to the second threshold.
  • the second difference between the product of the display duration of the target subtitles and the target speech rate of dubbing and the length of the second subtitles cannot be less than or equal to the second threshold.
  • the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • the preset range includes a lower limit value (i.e., a minimum value), and the lower limit value is related to the default speech rate, the duration of any piece of audio, the maximum value of the duration adjustment parameter, and the minimum value of the speech rate adjustment parameter.
  • n1 = (V3 × 0.9) × (T1 / 1.1).
  • the target speech rate of dubbing is the default speech rate
  • gradually reduce the target speech rate of the dubbing; if the target speech rate of the dubbing has reached the minimum value and the second difference is still greater than the second threshold, gradually reduce the display duration of the target subtitle, starting from the duration of the piece of audio, until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the speech rate adjustment parameter x2, and the value of x2 gradually decreases from 1 to 0.9, for example by taking values at fixed intervals in order from 1 to 0.9.
  • the display duration of the target subtitles is the duration of any piece of audio
  • gradually reduce the display duration of the target subtitle; if the display duration of the target subtitle has reached the minimum value and the second difference is still greater than the second threshold, gradually reduce the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
  • Priority is given to adjusting the duration adjustment parameter x1, and the value of x1 gradually increases from 1 to 1.1, for example by taking values at fixed intervals in order from 1 to 1.1.
  • the display duration of the target subtitles is the duration of any piece of audio
  • the target speech rate of dubbing is the default speech rate
  • other filtering conditions can also be added, such as requiring that x1 + x2 be the smallest, that 2·x1 + x2 be the smallest, or that x1² + x2² be the smallest, to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
  • FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure.
  • the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects.
  • the target object can be understood as a person in the video.
  • the method further includes:
  • dubbing timbre data are stored in the database in advance, and different dubbing timbre data correspond to different character attribute data.
  • the person attribute data includes the age, gender, tone, occupation, and the like of the person.
  • the dubbing timbres corresponding to the same target object are the same, and the dubbing timbres corresponding to different target objects are different.
  • the above technical solution selects, for each target object in the plurality of target objects, the dubbing timbre corresponding to that target object, so that the user can distinguish different characters by voice after dubbing, which improves the user experience.
  • FIG. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure.
  • the video processing apparatus provided by the embodiment of the present disclosure may be configured in a client or may be configured in a server, and the video processing apparatus specifically includes:
  • an obtaining module 310 configured to obtain the first subtitle in the original video
  • a translation module 320 configured to translate the first subtitle to obtain a second subtitle
  • a determination module 330 configured to determine the target speech rate of the dubbing
  • the dubbing module 340 is configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
  • the first subtitle is a subtitle corresponding to any segment of audio in the original video
  • the determining module 330 is further configured to determine the display duration of target subtitles, where the target subtitles include the first subtitle and/or the second subtitle;
  • the device also includes a replacement module 350, which is used to replace the audio in the original video with the dubbing audio to obtain a target video, wherein the target subtitle is displayed on the picture of the target video for the display duration of the target subtitle.
  • the determination module is used to:
  • the display duration of the target subtitle is the time corresponding to any piece of audio, and the target speech rate of the dubbing is the default speech rate.
  • the device also includes a first adjustment module.
  • the first adjustment module is used to:
  • the display duration of the target subtitle is increased, and/or the target speech rate of the dubbing is increased, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, and the length of the second subtitle, is less than or equal to a second threshold.
  • the first adjustment module is used for:
  • on the basis that the target speech rate of the dubbing is the default speech rate, gradually increase the target speech rate of the dubbing;
  • if the target speech rate of the dubbing has reached the maximum value and the second difference is still greater than the second threshold, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  • the first adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually increase the display duration of the target subtitle;
  • on the basis that the target speech rate of the dubbing is the default speech rate, the target speech rate of the dubbing is gradually increased until the second difference is less than or equal to the second threshold.
  • the first adjustment module is used for:
  • the display duration of the target subtitles is the duration of any piece of audio
  • the target speech rate of the dubbing is the default speech rate
  • the target speech rate of the dubbing is gradually increased until the second difference is less than or equal to a second threshold.
  • the first adjustment module increases the display duration of the target subtitle by reducing the display speed of the picture corresponding to any piece of audio.
  • the device also includes a second adjustment module.
  • the second adjustment module is used for:
  • if the duration of any piece of audio is greater than the default duration, and the first difference between the duration of that piece of audio and the default duration is greater than a first threshold, determine whether the length of the second subtitle is within the preset range;
  • the display duration of the target subtitle is reduced, and/or the target speech rate of the dubbing is reduced, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, and the length of the second subtitle, is less than or equal to a second threshold.
  • the second adjustment module is used for:
  • on the basis that the target speech rate of the dubbing is the default speech rate, gradually reduce the target speech rate of the dubbing;
  • if the target speech rate of the dubbing has reached the minimum value and the second difference is still greater than the second threshold, then, on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  • the second adjustment module is used for:
  • on the basis that the display duration of the target subtitle is the duration of any piece of audio, gradually reduce the display duration of the target subtitle;
  • on the basis that the target speech rate of the dubbing is the default speech rate, the target speech rate of the dubbing is gradually reduced until the second difference is less than or equal to the second threshold.
  • the second adjustment module is used for:
  • the display duration of the target subtitles is the duration of any piece of audio
  • the target speech rate of the dubbing is the default speech rate
  • the target speech rate of the dubbing is gradually decreased until the second difference is less than or equal to a second threshold.
  • the second adjustment module reduces the display duration of the target subtitle by increasing the display speed of the picture corresponding to any piece of audio.
  • the translation module is also used to:
  • retranslate the first subtitle according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
  • the display duration of the target subtitle is the display duration of the picture corresponding to any piece of audio.
  • the preset range is related to the default speech rate and the duration of any piece of audio.
  • the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects;
  • the device further includes a selection module; the selection module is configured to, for each target object in the plurality of target objects, select a dubbing timbre corresponding to the target object;
  • a dubbing module configured to generate a plurality of dubbing audios corresponding to the multi-segment audios according to the dubbing timbres corresponding to each target object respectively;
  • a replacement module configured to replace the multiple audio segments in the original video with the multiple dubbed audios to obtain a target video.
  • the video processing apparatus provided by the embodiments of the present disclosure can execute the steps performed by the client or the server in the video processing method provided by the method embodiments of the present disclosure, and the execution steps and beneficial effects are not repeated here.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring specifically to FIG. 9 below, it shows a schematic structural diagram of an electronic device 1000 suitable for implementing an embodiment of the present disclosure.
  • the electronic device 1000 in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., an in-vehicle navigation terminal), and a wearable electronic device, as well as stationary terminals such as a digital TV, a desktop computer, and a smart home device.
  • the electronic device shown in FIG. 9 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001, which may execute various appropriate actions and processes according to a program stored in a read only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003, to implement the video processing method of the embodiments described in the present disclosure.
  • in the RAM 1003, various programs and information necessary for the operation of the electronic device 1000 are also stored.
  • the processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to the bus 1004 .
  • the following devices can be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1007 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1008 including, for example, a magnetic tape, hard disk, etc.; and a communication device 1009.
  • Communication means 1009 may allow electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While FIG. 9 shows the electronic device 1000 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowchart, thereby implementing the above video processing method.
  • the computer program may be downloaded and installed from the network via the communication device 1009, or from the storage device 1008, or from the ROM 1002.
  • when the computer program is executed by the processing apparatus 1001, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include an information signal in baseband or propagated as part of a carrier wave with computer-readable program code embodied thereon. Such propagated information signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the client and server can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to perform the following:
  • the first subtitle is translated to obtain the second subtitle
  • the dubbing audio corresponding to the second subtitle is generated according to the target speech rate of the dubbing.
  • the electronic device may also perform other steps described in the above embodiments.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, under certain circumstances, constitute a limitation of the unit itself.
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the present disclosure provides an electronic device, comprising:
  • one or more processors;
  • a memory for storing one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method provided in any embodiment of the present disclosure.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, it implements the video processing method described in any embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a processor, implement the video processing method as described above.
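
The speech-rate-first adjustment described in the list above (lower the dubbing speech rate via x2 first, then shorten the subtitle display duration via x1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the function names, the number of steps, and the choice of the absolute difference as the "second difference" are hypothetical; the bounds 0.9 and 1.1 are the example values given for x2 and x1.

```python
def second_difference(display_duration, speech_rate, subtitle_length):
    """Absolute gap between the text readable within the display duration
    and the length of the second subtitle (one of the two forms mentioned)."""
    return abs(display_duration * speech_rate - subtitle_length)


def adjust_rate_first(t1, v3, length, threshold, steps=10, x1_max=1.1, x2_min=0.9):
    """Lower the speech rate first, then shorten the display duration.

    t1: duration of the audio segment (T1), in seconds
    v3: default speech rate (V3), in length units per second
    length: length of the second (translated) subtitle
    Returns (x1, x2), or None if no tested combination meets the threshold.
    """
    x1 = 1.0
    # Phase 1: x2 decreases from 1 to x2_min at fixed intervals (V4 = V3 * x2).
    for i in range(steps + 1):
        x2 = 1.0 - (1.0 - x2_min) * i / steps
        if second_difference(t1 / x1, v3 * x2, length) <= threshold:
            return x1, x2
    # Phase 2: the speech rate is at its minimum; x1 increases from 1 to x1_max
    # (T2 = T1 / x1).
    x2 = x2_min
    for i in range(1, steps + 1):
        x1 = 1.0 + (x1_max - 1.0) * i / steps
        if second_difference(t1 / x1, v3 * x2, length) <= threshold:
            return x1, x2
    return None  # e.g. the subtitle is outside the preset range; retranslate


def lower_limit(t1, v3, x1_max=1.1, x2_min=0.9):
    """Lower limit of the preset range: n1 = (V3 * 0.9) * (T1 / 1.1)."""
    return (v3 * x2_min) * (t1 / x1_max)
```

With a 10-second segment and a default rate of 4 units per second, a subtitle of length 38 is matched by slowing the speech alone, while a subtitle of length 34 also needs a shorter display duration; a subtitle shorter than `lower_limit(10, 4)` cannot be matched within these bounds.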

Abstract

Provided are a video processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring first subtitles in an original video; translating the first subtitles to obtain second subtitles; determining a target speech rate for dubbing; and according to the target speech rate for dubbing, generating a dubbing audio corresponding to the second subtitles.

Description

Video processing method, apparatus, electronic device and storage medium
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202110472124.X, filed on April 29, 2021 and entitled "Video processing method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of information technology, and in particular, to a video processing method, apparatus, electronic device, and storage medium.
Background
With the development of information technology, terminals have become indispensable devices in people's lives. For example, users can watch videos on a terminal.
Some videos are in other languages, and users may not understand their audio content. Existing technology displays subtitles that the user can read in the video, but in some cases the speed at which the user reads the subtitles may not match the speed at which they are displayed, which degrades the user experience.
Technical solutions
In order to solve the above technical problems, or at least partially solve them, embodiments of the present disclosure provide a video processing method, apparatus, electronic device, and storage medium.
An embodiment of the present disclosure provides a video processing method, the method including:
acquiring a first subtitle in an original video;
translating the first subtitle to obtain a second subtitle;
determining a target speech rate for dubbing;
generating dubbing audio corresponding to the second subtitle according to the target speech rate for dubbing.
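
The four steps above can be sketched as a small pipeline. This is a hedged illustration only: `Subtitle`, `dub_segment`, and the `translate` and `synthesize` callables are hypothetical placeholders standing in for a translation service and a text-to-speech engine, and the target speech rate is simply set to an assumed default rate.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Subtitle:
    text: str
    start: float  # seconds
    end: float    # seconds


def dub_segment(
    first: Subtitle,
    translate: Callable[[str], str],
    synthesize: Callable[[str, float], bytes],
    default_rate: float,
) -> Tuple[Subtitle, float, bytes]:
    """Process one already-acquired first subtitle (step 1 happens upstream)."""
    # Step 2: translate the first subtitle to obtain the second subtitle.
    second = Subtitle(translate(first.text), first.start, first.end)
    # Step 3: determine the target speech rate (here: the default rate).
    target_rate = default_rate
    # Step 4: generate the dubbing audio at the target speech rate.
    audio = synthesize(second.text, target_rate)
    return second, target_rate, audio
```

Passing the translator and synthesizer in as callables keeps the sketch independent of any particular service; in practice step 3 would apply the speech-rate and display-duration adjustments described elsewhere in this disclosure.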
Embodiments of the present disclosure also provide a video processing apparatus, including:
an acquisition module for acquiring a first subtitle in an original video;
a translation module for translating the first subtitle to obtain a second subtitle;
a determination module for determining a target speech rate for dubbing;
a dubbing module for generating dubbing audio corresponding to the second subtitle according to the target speech rate for dubbing.
Embodiments of the present disclosure also provide an electronic device, the electronic device including:
one or more processors;
a storage device for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the video processing method described above.
Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, it implements the video processing method described above.
Embodiments of the present disclosure also provide a computer program product, where the computer program product includes a computer program or instructions, and when the computer program or instructions are executed by a processor, they implement the video processing method described above.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least the following advantages: by acquiring the first subtitle in the original video, translating the first subtitle to obtain the second subtitle, determining the target speech rate for dubbing, and generating the dubbing audio corresponding to the second subtitle according to that target speech rate, the solution generates dubbing audio that the viewer can understand, which makes the video content easier to follow and improves the user experience.
In addition, the video processing method provided by the embodiments of the present disclosure determines the display duration of the target subtitle and the target speech rate for dubbing, and thereby adjusts the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle, so that the duration of the dubbing audio matches the display duration of the target subtitle within the allowed error. This solves the problem that the dubbing duration does not match the subtitle display duration because sentences expressing the same meaning may differ in length across languages, and improves the user experience.
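
One way to bring the dubbing duration and the subtitle display duration into agreement is to search jointly over the two adjustment parameters and keep the feasible pair that scores best under a chosen criterion (the disclosure mentions criteria such as the smallest x1 + x2). The following is a minimal sketch under assumptions: the grid resolution, the function names, the absolute-difference mismatch measure, and the bounds 1.1 and 0.9 (example values from this disclosure) are illustrative choices, not the disclosed implementation.

```python
def candidates(steps=10, x1_max=1.1, x2_min=0.9):
    """Yield (x1, x2) pairs on a regular grid over [1, x1_max] x [x2_min, 1]."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            x1 = 1.0 + (x1_max - 1.0) * i / steps
            x2 = 1.0 - (1.0 - x2_min) * j / steps
            yield x1, x2


def best_combination(t1, v3, length, threshold, score=lambda x1, x2: x1 + x2):
    """Return the feasible (x1, x2) pair with the lowest score, or None.

    A pair is feasible when the text readable at rate v3 * x2 within the
    display duration t1 / x1 differs from the subtitle length by at most
    the threshold.
    """
    feasible = [
        (x1, x2)
        for x1, x2 in candidates()
        if abs((t1 / x1) * (v3 * x2) - length) <= threshold
    ]
    return min(feasible, key=lambda p: score(*p)) if feasible else None
```

A different criterion can be supplied through `score`, for example `lambda x1, x2: x1**2 + x2**2`.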
Description of drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an image frame in a video provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for implementing S130 provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure;
FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed ways
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and its variants are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive. Those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
FIG. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure. This embodiment is applicable to dubbing a video on a client. The method may be executed by a video processing apparatus, which may be implemented in software and/or hardware and may be configured in an electronic device such as a terminal, including but not limited to a smartphone, a handheld computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one computer, a smart home device, and the like. Alternatively, this embodiment is applicable to dubbing a video on a server, in which case the method may be executed by a video processing apparatus that may be implemented in software and/or hardware and configured in an electronic device such as a server.
As shown in FIG. 1, the method may specifically include:
S1. Obtain a first subtitle in an original video.
The first subtitle is in the same language as that spoken by the characters in the original video. Exemplarily, if the characters in the video speak English, the first subtitle is in English.
This step can be implemented in various ways, which are not limited in this application. Exemplarily, if the original video includes the first subtitle, the first subtitle can be extracted directly.
Alternatively, if the original video does not include the first subtitle, speech recognition is performed on any audio segment in the original video to obtain the first subtitle. In this case, the first subtitle is the subtitle corresponding to that audio segment in the original video.
Here, an audio segment refers to the audio information corresponding to any one sentence spoken by any character in the video.
Specifically, a video includes an audio stream and a video stream. The video stream includes multiple image frames, which are played in chronological order to form the moving picture of the video. The characters in the video speak during some time periods and do not speak during others. The audio stream consists of multiple audio segments, each of which corresponds to one sentence spoken by one character in the video. One audio segment corresponds to multiple image frames.
Exemplarily, FIG. 2 is a schematic diagram of image frames in a video according to an embodiment of the present disclosure. Referring to FIG. 2, assume that a character in the video speaks during time period t0 and time period t3 and does not speak during the remaining time periods. The image frames in time period t0 then correspond to one continuous utterance, and that continuous utterance constitutes one audio segment. The image frames in time period t3 correspond to another continuous utterance, which constitutes another audio segment.
Optionally, when this step is performed, speech recognition technology is used to recognize any audio segment in the original video to obtain the first subtitle. For example, based on the speech pauses in the audio, each continuous sentence of Chinese speech can be recognized as one continuous Chinese subtitle.
S2. Translate the first subtitle to obtain a second subtitle.
The second subtitle is in a different language from the first subtitle. Exemplarily, the first subtitle is in Chinese and the second subtitle is in English. The second subtitle uses a language that the video viewer can understand. In practice, the language of the second subtitle can be set according to the viewer's needs.
S3. Determine a target speech rate of the dubbing.
Those skilled in the art will understand that, in order to dub the original video, the second subtitle needs to be read aloud at a certain speech rate. In this step, the target speech rate of the dubbing refers to the speech rate at which the second subtitle is read.
Optionally, multiple pieces of dubbing timbre data are pre-stored in a database, and each piece of dubbing timbre data has corresponding default speech rate data. When this step is performed, dubbing timbre data is selected, and the default speech rate corresponding to the selected dubbing timbre, and/or a speech rate obtained by adjusting that default speech rate, is used as the target speech rate of the dubbing.
It should be noted that, when this step is performed, it is necessary to ensure that the video viewer can hear the second subtitle clearly when it is read at the determined target speech rate.
S4. Generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
In the technical solution provided by the embodiments of the present disclosure, the first subtitle in the original video is obtained; the first subtitle is translated to obtain the second subtitle; the target speech rate of the dubbing is determined; and the dubbing audio corresponding to the second subtitle is generated according to the target speech rate. This produces dubbing audio that the video viewer can understand, which makes the video content easier to understand and improves the user experience.
Optionally, the dubbing audio obtained by the above technical solution may be played on its own, or the audio in the original video may be replaced with the dubbing audio to obtain a dubbed video. The original video may also be played synchronously with the dubbing audio, so that the dubbing is heard while the video plays.
FIG. 3 is a flowchart of a video processing method according to an embodiment of the present disclosure. FIG. 3 is a specific example of FIG. 1. Referring to FIG. 3, the method includes:
S110. Perform speech recognition on any audio segment in the original video to obtain a first subtitle.
S120. Translate the first subtitle to obtain a second subtitle.
S130. Determine a display duration of a target subtitle and a target speech rate of the dubbing, where the target subtitle includes the first subtitle and/or the second subtitle.
Those skilled in the art will understand that, in order to dub the original video, dubbing audio must subsequently be generated based on the second subtitle, and the duration of the dubbing audio must be comparable to the playback duration of the image frames corresponding to that dubbing audio (which can also be understood as the display duration of the first subtitle and/or the second subtitle, that is, the display duration of the target subtitle here). If the dubbing audio lasts longer than the playback duration of its corresponding image frames, the image frames finish playing before the dubbing audio ends. If the dubbing audio is shorter than the playback duration of its corresponding image frames, the dubbing audio ends while the image frames are still being displayed. Either situation causes the picture and the audio to fall out of sync, which degrades the user experience.
To make the duration of the dubbing audio comparable to the playback duration of its corresponding image frames, two quantities must first be determined: the duration of the dubbing audio, and the playback duration of the image frames corresponding to the dubbing audio.
For the first quantity, the duration of the dubbing audio depends mainly on two factors: the length of the second subtitle and the speech rate of the dubbing. Since the second subtitle has already been obtained in S120, its length is fixed in this step, so the duration of the dubbing audio depends mainly on the speech rate of the dubbing.
Therefore, the essence of this step is to determine a suitable display duration of the target subtitle and a suitable target speech rate of the dubbing, so that the duration of the dubbing audio corresponding to the second subtitle is consistent with the display duration of the target subtitle within an allowable error range. In some embodiments, the order in which the display duration of the target subtitle and the target speech rate of the dubbing are determined is not limited. In addition, determining the display duration of the target subtitle and determining the target speech rate of the dubbing may be two independent processes or two interrelated processes.
S140. Generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
The essence of this step is to read the text of the second subtitle at the target speech rate determined in S130, thereby obtaining the dubbing audio corresponding to the second subtitle.
S150. Replace the audio segment in the original video with the dubbing audio to obtain a target video, and display the target subtitle in the pictures of the target video that correspond to the display duration of the target subtitle.
Exemplarily, if the audio segment in the original video is Chinese audio and the dubbing audio is English audio, the Chinese audio in the original video is replaced with the English audio to obtain the target video, and Chinese subtitles and/or English subtitles are added to the pictures of the video frames.
The essence of the above technical solution is that, by determining the display duration of the target subtitle and the target speech rate of the dubbing, the display duration of the target subtitle and/or the duration of the dubbing audio corresponding to the second subtitle is adjusted so that the two are consistent within an allowable error range. This addresses the problem that sentences expressing the same meaning may differ in length across languages, which causes the dubbing duration to mismatch the subtitle display duration, and thereby improves the user experience.
On the basis of the above technical solution, S150 may optionally be replaced with the following: an audio file is generated from the dubbing audio corresponding to each audio segment in the original video. The audio file includes the dubbing audio corresponding to each audio segment in the original video and the time information of each piece of dubbing audio. When the original video is played, the pieces of dubbing audio are called and played in sequence according to the current playback time of the original video and the time information of each piece of dubbing audio. Further, the time information of each piece of dubbing audio includes its start time and/or end time.
Exemplarily, if multiple audio files are generated based on an original video, each audio file includes the start time of the dubbing audio corresponding to that audio file. Assume that the start time of dubbing audio A is the 12th second counted from the playback of the first image frame of the video. When the video is played and reaches the 12th second, dubbing audio A is called and played. Optionally, when this approach is used, the original audio of the original video is removed. Further, when the original audio is removed, only the speech of the characters in the original video is removed, and the background sound of the original video is retained.
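The start-time lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `clips_due`, the polling model (the player reports a previous and a current playback time), and the sample start times are all hypothetical and are not taken from the disclosure.

```python
from bisect import bisect_right


def clips_due(start_times, previous_time, current_time):
    """Return the indices of dubbing clips whose start time lies in
    (previous_time, current_time].

    start_times must be sorted in ascending order (seconds from the
    first image frame). The half-open interval keeps a clip from being
    triggered twice across consecutive polls.
    """
    first = bisect_right(start_times, previous_time)
    last = bisect_right(start_times, current_time)
    return list(range(first, last))


# Hypothetical start times; the clip at index 1 plays the role of
# "dubbing audio A", which starts at the 12th second.
start_times = [3.0, 12.0, 20.5]
due = clips_due(start_times, previous_time=11.9, current_time=12.0)
print(due)
```

A real player would likely react to its own time-update events instead of polling; the sketch only shows the mapping from playback time to the dubbing audio that should be called.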
Further, a play button or icon corresponding to the audio file may be displayed in the user interface in which the target video is played. When the button or icon is in the off state, the audio in the target video is still the audio of the original video; that is, the audio of the original video is not replaced with the corresponding dubbing audio. When the button or icon is in the on state, the audio of the original video is replaced with the corresponding dubbing audio; that is, the audio in the target video becomes the dubbing audio. Alternatively, when the button or icon is in the on state, the terminal may play the dubbing audio on its own.
On the basis of the above technical solution, further analysis of how to determine the display duration of the target subtitle and the target speech rate of the dubbing in S130 shows that, in practice, there are mainly two cases:
Case one: for an audio segment in the original video, the duration of the dubbing audio is obtained based on the default speech rate of the dubbing and the second subtitle. The duration of the dubbing audio and the duration of the audio segment in the original video are already consistent within the allowable error range. In this case, the display duration of the target subtitle can be directly determined to be the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
Optionally, multiple pieces of dubbing timbre data are pre-stored in a database, and each piece of dubbing timbre data has corresponding default speech rate data. The default speech rate of the dubbing is the default speech rate corresponding to the selected dubbing timbre and is obtained from the dubbing timbre data stored in the database.
Case two: for an audio segment in the original video, the duration of the dubbing audio is obtained based on the default speech rate of the dubbing and the second subtitle. The duration of the dubbing audio and the duration of the audio segment in the original video differ considerably and cannot be regarded as consistent within the allowable error range. In this case, the display duration of the target subtitle can be determined based on the duration of the audio segment in the original video, and the target speech rate of the dubbing can be determined based on the default speech rate of the dubbing.
FIG. 4 is a flowchart of a method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to case one above. Referring to FIG. 4, the method includes:
S1311. Determine a default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
S1312. If the duration of the audio segment is greater than or equal to the default duration, and a first difference between the duration of the audio segment and the default duration is less than or equal to a first threshold, the display duration of the target subtitle is the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
Here, the "first difference between the duration of the audio segment and the default duration" may be the absolute value of the difference between the duration of the audio segment and the default duration, or may be the ratio of the duration of the audio segment to the default duration.
The essence of this arrangement is that, if the duration of the audio segment is consistent with the default duration of its corresponding dubbing within the allowable error range, the display duration of the target subtitle is directly determined to be the time corresponding to the audio segment, and the target speech rate of the dubbing is the default speech rate.
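The check in S1311 and S1312 can be illustrated with a short sketch. It assumes that the subtitle length is counted in syllables, that the speech rate is in syllables per second, and that the first difference is taken as an absolute difference in seconds; the function names and sample values are hypothetical.

```python
def default_duration(subtitle_length, default_rate):
    """S1311: time needed to read the second subtitle at the default rate."""
    return subtitle_length / default_rate


def is_case_one(audio_duration, subtitle_length, default_rate, first_threshold):
    """S1312: True when the original audio segment is at least as long as
    the default dubbing duration and the two differ by no more than
    first_threshold (absolute-difference form of the first difference)."""
    dub = default_duration(subtitle_length, default_rate)
    return audio_duration >= dub and abs(audio_duration - dub) <= first_threshold


# Hypothetical values: a 9-syllable subtitle read at 5 syllables/s takes
# 1.8 s, which fits a 2.0 s audio segment within a 0.3 s tolerance.
print(is_case_one(2.0, 9, 5.0, 0.3))
```

When the function returns True, the display duration stays equal to the duration of the audio segment and the dubbing is generated at the default speech rate, as described above.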
FIG. 5 is a flowchart of another method for implementing S130 according to an embodiment of the present disclosure. This method is applicable to case two above. Referring to FIG. 5, the method includes:
S1321. Determine a default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
S1322. If the duration of the audio segment is less than the default duration, determine whether the length of the second subtitle is within a preset range.
S1323. If the length of the second subtitle is within the preset range, increase the display duration of the target subtitle and/or increase the target speech rate of the dubbing, so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing, on the one hand, and the length of the second subtitle, on the other hand, is less than or equal to a second threshold.
The display duration of the target subtitle is the adjusted display duration of the pictures corresponding to the audio segment.
Those skilled in the art will understand that, when S1323 is performed and the display duration of the target subtitle is increased, the increased display duration cannot be unbounded. If the increased display duration is too long (beyond a certain limit), the image frames corresponding to this audio segment switch too slowly while the image frames corresponding to other audio segments switch at normal speed, which makes the video as a whole feel inconsistent and degrades the user experience. Therefore, a duration adjustment parameter can be set to limit the maximum display duration of the target subtitle.
Specifically, if the initial display duration of a group of subtitles (that is, the duration of the audio corresponding to that group of subtitles in the original video) is T1, the duration adjustment parameter is x1, and the adjusted display duration of the subtitles is T2, then T2 = T1/x1, where x1 is a number less than 1. Setting a minimum value for the duration adjustment parameter x1 limits the maximum adjusted display duration of the subtitles.
Optionally, there are various ways to "increase the display duration of the target subtitle". Exemplarily, one way is to reduce the display speed of the pictures corresponding to the audio segment.
Further, if the display speed of the pictures corresponding to the audio segment is reduced, the duration adjustment parameter x1 can be regarded as a display speed adjustment parameter x1.
Specifically, assume that the display speed of the pictures corresponding to an audio segment in the original video is V1, the display speed adjustment parameter is x1, and the adjusted display speed of the pictures corresponding to that audio segment is V2. Then V2 = V1·x1. Setting a minimum value for the display speed adjustment parameter x1 limits the minimum adjusted display speed of the pictures corresponding to the audio segment, and thus limits the maximum adjusted display duration of the target subtitle.
Exemplarily, referring again to FIG. 2, assume that time period t0 corresponds to 20 image frames and t0 = 2 s, so the original display speed V1 of the image frames in time period t0 is 20 frames per 2 seconds. If the display speed of the image frames is slowed down, the slowed display speed is denoted V2, with V2 = V1·x1, where x1 is a number less than 1. If the minimum value of x1 is set to 0.9, the display speed of the image frames can be slowed to at most 18 frames per 2 seconds, which corresponds to slowing the display speed by 10%. Since the total number of image frames in time period t0 is fixed at 20, the display duration of these 20 image frames becomes t0/x1. Changing the value of x1 from 1 to 0.9 therefore increases the display duration of the image frames, that is, the display duration of this group of subtitles.
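The arithmetic in this example can be checked with a few lines of code. The sketch below only restates V2 = V1·x1 and T2 = T1/x1 for the values used above (20 frames, t0 = 2 s, x1 = 0.9); the function name `slowed_playback` is hypothetical.

```python
def slowed_playback(frame_count, original_duration, x1):
    """Return (V2, T2): the slowed display speed in frames per second and
    the resulting display duration in seconds for a fixed number of frames."""
    v1 = frame_count / original_duration  # original speed: 20 frames / 2 s
    v2 = v1 * x1                          # V2 = V1 * x1
    t2 = frame_count / v2                 # same frames at lower speed: T1 / x1
    return v2, t2


v2, t2 = slowed_playback(frame_count=20, original_duration=2.0, x1=0.9)
# 9 frames per second is the same speed as 18 frames per 2 seconds.
print(f"V2 = {v2:.1f} frames/s, T2 = {t2:.3f} s")
```

With x1 at its minimum of 0.9, the 2 s segment is stretched to about 2.22 s, which is the largest display duration permitted under this setting.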
Similarly, when S1323 is performed and the target speech rate of the dubbing is increased, an increased target speech rate that is too fast (beyond a certain limit) prevents the user from hearing clearly and degrades the user experience. Therefore, a speech rate adjustment parameter can be set to limit the maximum target speech rate.
Specifically, assume that the default speech rate corresponding to the dubbing timbre selected for an audio segment in the original video is V3, the speech rate adjustment parameter is x2, and the adjusted target speech rate corresponding to that dubbing timbre is V4. Then V4 = V3·x2. Setting a maximum value for the speech rate adjustment parameter x2 limits the maximum adjusted target speech rate of the dubbing timbre.
Exemplarily, denote the default speech rate of the dubbing timbre as V3, which is fixed. Denote the target speech rate of the dubbing timbre as V4, whose initial value is V3. The target speech rate of the dubbing timbre can be adjusted. For example, increasing the target speech rate of the dubbing timbre corresponds to V4 = V3·x2, where x2 is a number greater than 1. If the maximum value of x2 is set to 1.1, the target speech rate can be at most 1.1 times the default speech rate, which corresponds to speeding up the target speech rate of the dubbing timbre by 10%.
The product L1 of the display duration T2 of the target subtitle and the target speech rate V4 of the dubbing can be expressed as L1 = T2·V4 = (T1/x1)(V3·x2). That is, L1 is the length of text that can be read within the display duration of the target subtitle at the target speech rate of the dubbing. Here, the length of text can be understood as the number of words in the text, the number of syllables in the text, or the like.
If the product of the display duration of the target subtitle and the target speech rate of the dubbing equals the length of the second subtitle, the length of text that can be read within the display duration of the target subtitle at the target speech rate is exactly equal to the length of the second subtitle. In other words, the time required to read the second subtitle at the target speech rate of the dubbing is exactly equal to the display duration of the target subtitle.
Therefore, in S1323, "so that a second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to a second threshold" means that the duration of the dubbing audio generated from the second subtitle is consistent with the display duration of the target subtitle within the allowable error range.
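As a rough illustration of the condition in S1323, the sketch below computes L1 = (T1/x1)(V3·x2) and compares it with the length of the second subtitle. It assumes that lengths are counted in syllables and that the second difference is taken as an absolute difference; the function names and sample values are hypothetical.

```python
def readable_length(t1, x1, v3, x2):
    """L1 = T2 * V4 = (T1 / x1) * (V3 * x2): the amount of text that can be
    read within the adjusted display duration at the adjusted speech rate."""
    return (t1 / x1) * (v3 * x2)


def second_difference(t1, x1, v3, x2, subtitle_length):
    """Absolute-difference form of the second difference in S1323."""
    return abs(readable_length(t1, x1, v3, x2) - subtitle_length)


# Hypothetical values: a 2 s segment, a default rate of 5 syllables/s, and
# both adjustment parameters at their limits (x1 = 0.9, x2 = 1.1).
l1 = readable_length(t1=2.0, x1=0.9, v3=5.0, x2=1.1)
diff = second_difference(2.0, 0.9, 5.0, 1.1, subtitle_length=12)
print(f"L1 = {l1:.3f}, second difference = {diff:.3f}")
```

If the printed difference is no greater than the chosen second threshold, the dubbing audio and the target subtitle are treated as consistent in duration.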
其中,第二差异可以为目标字幕的显示时长与配音的目标语速的乘积与第二字幕的长度之差的绝对值,也可以为目标字幕的显示时长与配音的目标语速的乘积与第二字幕的长度的比值。The second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the difference between the length of the second subtitle, or the product of the display duration of the target subtitle and the target speech rate of the dubbing and the first subtitle. The ratio of the lengths of the two subtitles.
As described above, to ensure that the video has a good audio-visual effect, the maximum display duration of the target subtitle and the maximum target speech rate need to be limited. As a result, the length of the second subtitle must fall within a certain range (that is, the "preset range" in S1322).
If the length of the second subtitle is within the preset range, then, subject to the two conditions "the display duration of the target subtitle is less than or equal to the maximum display duration of the target subtitle" and "the target speech rate of the dubbing is less than or equal to the maximum target speech rate", increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing can make the second difference, between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle, less than or equal to the second threshold.
If the length of the second subtitle is not within the preset range, then, subject to the same two conditions, no matter how the display duration of the target subtitle and/or the target speech rate of the dubbing is increased, the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle cannot be made less than or equal to the second threshold.
Optionally, if it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
Optionally, the preset range is related to the default speech rate and the duration of the audio segment.
Further, the preset range includes an upper limit (that is, a maximum value). The upper limit is related to the default speech rate, the duration of the audio segment, the minimum value of the duration adjustment parameter, and the maximum value of the speech rate adjustment parameter.
For example, assume that the upper limit of the preset range is n1, the default speech rate is V3, the duration of the audio segment is T1, the minimum value of the duration adjustment parameter is 0.9, and the maximum value of the speech rate adjustment parameter is 1.1. Then n1 = (V3·1.1)(T1/0.9).
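The upper limit above can be computed directly. The following is a minimal Python sketch, not the applicant's implementation: the function name and argument names are hypothetical, and the unit of length (words or syllables) is assumed to match the unit of the speech rate.

```python
def preset_range_upper_limit(default_rate, audio_duration,
                             min_duration_param=0.9, max_rate_param=1.1):
    """Return n1 = (V3 * x2_max) * (T1 / x1_min).

    default_rate   -- V3, e.g. words per second (assumed unit)
    audio_duration -- T1, in seconds
    """
    max_rate = default_rate * max_rate_param             # fastest allowed speech rate
    max_duration = audio_duration / min_duration_param   # longest allowed display duration
    return max_rate * max_duration


# Illustrative values only: V3 = 3 words per second, T1 = 2 seconds.
n1 = preset_range_upper_limit(3.0, 2.0)
print(round(n1, 3))
```

A second subtitle longer than this value cannot be fitted by any permitted combination of the two adjustment parameters, which is the situation in which retranslation is suggested.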
On the basis of the above technical solutions, when S1323 is executed, "increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing" can be implemented in several ways. Three typical methods are given below.
Method 1:
Starting from the default speech rate, gradually increase the target speech rate of the dubbing. If the target speech rate of the dubbing has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the duration of the audio segment, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
For example, keep the display duration of a group of subtitles unchanged, that is, T1 is fixed and the duration adjustment parameter x1 = 1. Adjust the speech rate adjustment parameter x2 first: x2 increases gradually from 1 toward 1.1, for example by taking values at intervals in order from 1 to 1.1. When, for some value of x2, (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error, stop adjusting x2 and output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
If x2 has reached its maximum value of 1.1 and (V3·x2)(T1/x1) still does not equal the length of the text in the group of subtitles within the allowable error, fix x2 = 1.1 and adjust x1: x1 decreases gradually from 1 toward 0.9, for example by taking values at intervals in order from 1 to 0.9, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
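Method 1 can be sketched as two successive one-dimensional scans. The Python below is an illustration under stated assumptions: the step size of 0.01, the tolerance (standing in for the second threshold, taken here as an absolute difference), and all names are hypothetical and are not specified in the text.

```python
def method_one(length, default_rate, audio_duration,
               tolerance=0.05, step=0.01, x1_min=0.9, x2_max=1.1):
    """Raise the speech rate first, then lengthen the display duration.

    Returns (x1, x2), or None if no permitted pair fits the subtitle.
    """
    def fits(x1, x2):
        # (V3 * x2) * (T1 / x1) compared with the subtitle length
        capacity = (default_rate * x2) * (audio_duration / x1)
        return abs(capacity - length) <= tolerance

    # Phase 1: x1 fixed at 1, x2 scanned from 1 up to x2_max.
    x1 = 1.0
    for i in range(round((x2_max - 1.0) / step) + 1):
        x2 = round(1.0 + i * step, 4)
        if fits(x1, x2):
            return x1, x2

    # Phase 2: x2 fixed at its maximum, x1 scanned from 1 down to x1_min.
    x2 = x2_max
    for i in range(1, round((1.0 - x1_min) / step) + 1):
        x1 = round(1.0 - i * step, 4)
        if fits(x1, x2):
            return x1, x2
    return None


print(method_one(6.3, 3.0, 2.0))   # solved by speech rate alone
print(method_one(7.0, 3.0, 2.0))   # needs a longer display duration as well
print(method_one(8.0, 3.0, 2.0))   # outside the preset range
```

Returning None corresponds to a second subtitle whose length lies outside the preset range, where the text suggests retranslating the first subtitle.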
Method 2:
Starting from the duration of the audio segment, gradually increase the display duration of the target subtitle. If the display duration of the target subtitle has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, gradually increase the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
For example, keep the target speech rate of the dubbing timbre unchanged, that is, V3 is fixed and the speech rate adjustment parameter x2 = 1. Adjust the duration adjustment parameter x1 first: x1 decreases gradually from 1 toward 0.9, for example by taking values at intervals in order from 1 to 0.9. When, for some value of x1, (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error, stop adjusting x1 and output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
If x1 has reached its minimum value of 0.9 and (V3·x2)(T1/x1) still does not equal the length of the text in the group of subtitles within the allowable error, further adjust x2: x2 increases gradually from 1 toward 1.1, for example by taking values at intervals in order from 1 to 1.1, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
Method 3:
Starting from the duration of the audio segment, gradually increase the display duration of the target subtitle, and at the same time, starting from the default speech rate, gradually increase the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
For example, adjust the duration adjustment parameter x1 and the speech rate adjustment parameter x2 at the same time: x1 decreases gradually from 1 toward 0.9 and x2 increases gradually from 1 toward 1.1, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
Further, for Method 3, several combinations of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 may in practice each satisfy (V3·x2)(T1/x1) = the length of the text in the group of subtitles within the allowable error. In this case, an additional selection criterion can be applied, such as minimum x1 + x2, minimum 2·x1 + x2, or minimum x1² + x2², to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
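One way to read Method 3 together with the extra selection criterion is as a search over a grid of (x1, x2) pairs. The sketch below takes that reading, which is an assumption: the text does not say how the two parameters are stepped together, and the step size, the tolerance and all names are hypothetical.

```python
from itertools import product


def method_three(length, default_rate, audio_duration,
                 tolerance=0.05, step=0.01, x1_min=0.9, x2_max=1.1,
                 cost=lambda x1, x2: x1 ** 2 + x2 ** 2):
    """Search x1 in [x1_min, 1] and x2 in [1, x2_max] jointly.

    Among all pairs that fit within the tolerance, return the one with the
    smallest cost (default: x1^2 + x2^2), or None if no pair fits.
    """
    n1 = round((1.0 - x1_min) / step)
    n2 = round((x2_max - 1.0) / step)
    x1_values = [round(1.0 - i * step, 4) for i in range(n1 + 1)]
    x2_values = [round(1.0 + j * step, 4) for j in range(n2 + 1)]

    candidates = [
        (x1, x2)
        for x1, x2 in product(x1_values, x2_values)
        if abs((default_rate * x2) * (audio_duration / x1) - length) <= tolerance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: cost(*pair))


best = method_three(6.8, 3.0, 2.0)
print(best)
```

Passing a different `cost`, for example `lambda x1, x2: x1 + x2` or `lambda x1, x2: 2 * x1 + x2`, selects by the other criteria mentioned above.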
FIG. 6 is a flowchart of another method for implementing S130 provided by an embodiment of the present disclosure. This method applies to case two above. Referring to FIG. 6, the method includes:
S1331. Determine the default duration of the dubbing according to the length of the second subtitle and the default speech rate of the dubbing.
S1332. If the duration of the audio segment is greater than the default duration, and the first difference between the duration of the audio segment and the default duration is greater than the first threshold, determine whether the length of the second subtitle is within the preset range.
S1333. If the length of the second subtitle is within the preset range, reduce the display duration of the target subtitle and/or reduce the target speech rate of the dubbing, so that the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to the second threshold.
The display duration of the target subtitle is the display duration, after adjustment, of the picture corresponding to the audio segment.
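The decision flow of S1331 to S1333 can be sketched as follows. This is an illustration only: the first difference is assumed to be a plain subtraction, the returned labels are hypothetical, and the "retranslate" branch reflects the optional step described for subtitles outside the preset range.

```python
def plan_for_case_two(length, default_rate, audio_duration,
                      first_threshold, lower_limit, upper_limit):
    """Decide what to do when the dubbing would be shorter than the audio."""
    # S1331: default duration of the dubbing at the default speech rate
    default_duration = length / default_rate

    # S1332: case two applies only if the audio is longer by more than the threshold
    first_difference = audio_duration - default_duration
    if not (audio_duration > default_duration and first_difference > first_threshold):
        return "not_case_two"

    if not (lower_limit <= length <= upper_limit):
        return "retranslate"

    # S1333: shorten the display duration and/or slow the speech rate
    return "reduce_duration_and_or_rate"


# Illustrative values: V3 = 3 words/s, T1 = 2 s, limits from the 0.9 / 1.1 parameters.
lower = (3.0 * 0.9) * (2.0 / 1.1)
upper = (3.0 * 1.1) * (2.0 / 0.9)
for words in (4.0, 5.0, 5.9):
    print(words, plan_for_case_two(words, 3.0, 2.0, 0.2, lower, upper))
```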
Those skilled in the art will understand that, when S1333 is executed by reducing the display duration of the target subtitle, a display duration that becomes too short (beyond a certain limit) means that the image frames corresponding to this audio segment switch too quickly while the image frames corresponding to other audio segments switch at normal speed. This makes the video as a whole inconsistent and degrades the user experience. Therefore, a duration adjustment parameter can be set to limit the minimum display duration of the target subtitle.
Specifically, if the initial display duration of a group of subtitles (that is, the duration of the audio corresponding to the group of subtitles in the original video) is T1, the duration adjustment parameter is x1, and the display duration of the subtitles after adjustment is T2, then T2 = T1/x1, where x1 is greater than 1. Setting a maximum value for the duration adjustment parameter x1 limits the minimum display duration of the subtitles after adjustment.
Optionally, "reducing the display duration of the target subtitle" can be implemented in several ways. For example, reducing the display duration of the target subtitle includes increasing the display speed of the picture corresponding to the audio segment.
Further, if the display speed of the picture corresponding to the audio segment is increased, the duration adjustment parameter x1 can be regarded as a display speed adjustment parameter x1.
Specifically, assume that the display speed of the picture corresponding to a segment of audio in the original video is V1, the display speed adjustment parameter is x1, and the display speed of the picture corresponding to that segment after adjustment is V2. Then V2 = V1·x1. Setting a maximum value for the display speed adjustment parameter x1 limits the maximum display speed of the picture corresponding to the segment after adjustment, and thereby limits the minimum display duration of the target subtitle after adjustment.
For example, continuing to refer to FIG. 2, assume that the time period t0 corresponds to 20 image frames and t0 = 2 s, so the original display speed V1 of the image frames in t0 is 20 frames per 2 seconds. If the display speed is increased, the new display speed is denoted V2, where V2 = V1·x1 and x1 is greater than 1. If the maximum value of x1 is set to 1.1, the display speed can be raised to at most 22 frames per 2 seconds, that is, 10% faster. Because the total number of image frames in t0 is fixed at 20, the display duration of these 20 image frames becomes t0/x1. Changing x1 from 1 to 1.1 therefore shortens the display duration of the image frames, that is, the display duration of the group of subtitles.
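The arithmetic of this example can be checked with a few lines of Python. The function name is hypothetical, and the display speed is expressed in frames per second rather than frames per 2 seconds.

```python
def sped_up_display(frame_count, original_duration, x1):
    """Return (V2, new display duration) after applying the speed parameter x1."""
    original_rate = frame_count / original_duration   # V1, frames per second
    new_rate = original_rate * x1                     # V2 = V1 * x1
    new_duration = frame_count / new_rate             # equals original_duration / x1
    return new_rate, new_duration


rate, duration = sped_up_display(20, 2.0, 1.1)
print(f"{rate:.1f} frames/s, {duration:.3f} s")   # 11.0 frames/s, 1.818 s
```

With x1 = 1.1 the 20 frames play at 11 frames per second (22 frames per 2 seconds) and occupy about 1.818 s instead of 2 s.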
Similarly, when S1333 is executed by reducing the target speech rate of the dubbing, a target speech rate that becomes too slow (beyond a certain limit) makes the speech in this audio segment too slow while the speech rate of other audio segments is normal. This makes the video as a whole inconsistent and degrades the user experience. Therefore, a speech rate adjustment parameter can be set to limit the minimum value of the target speech rate.
Specifically, assume that the default speech rate of the dubbing timbre selected for a segment of audio in the original video is V3, the speech rate adjustment parameter is x2, and the speech rate of the dubbing timbre after adjustment is V4. Then V4 = V3·x2. Setting a minimum value for the speech rate adjustment parameter x2 limits the minimum target speech rate of the dubbing timbre after adjustment.
For example, denote the default speech rate of the dubbing timbre as V3, which is fixed, and denote the target speech rate of the dubbing timbre as V4, whose initial value is V3. The target speech rate of the dubbing timbre can be adjusted. If it is reduced, then V4 = V3·x2, where x2 is less than 1. If the minimum value of x2 is set to 0.9, the target speech rate can be at least 0.9 times the default speech rate, which corresponds to slowing the target speech rate of the dubbing timbre by 10%.
The product L1 of the display duration T2 of the target subtitle and the target speech rate V4 of the dubbing can be expressed as L1 = T2·V4 = (T1/x1)(V3·x2). That is, L1 is the length of text that can be read aloud at the target speech rate of the dubbing within the display duration of the target subtitle. Here, the length of text can be understood as the number of words, the number of syllables, or the like.
When the product of the display duration of the target subtitle and the target speech rate of the dubbing equals the length of the second subtitle, the length of text that can be read aloud at the target speech rate within the display duration of the target subtitle is exactly the length of the second subtitle. In other words, the time needed to read the second subtitle aloud at the target speech rate of the dubbing is exactly the display duration of the target subtitle.
Therefore, in S1333, "so that the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle is less than or equal to the second threshold" means that, within the allowable error, the duration of the dubbing audio generated from the second subtitle is consistent with the display duration of the target subtitle.
The second difference may be the absolute value of the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle, or it may be the ratio of that product to the length of the second subtitle.
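The two forms of the second difference can be written out as below. This is a minimal sketch with hypothetical names; how the second threshold is applied to the ratio form is not specified in the text, so the function only returns the raw ratio.

```python
def second_difference(display_duration, target_rate, length, mode="absolute"):
    """Compare L1 = T2 * V4 with the length of the second subtitle."""
    capacity = display_duration * target_rate   # L1 = T2 * V4
    if mode == "absolute":
        return abs(capacity - length)
    if mode == "ratio":
        return capacity / length
    raise ValueError(f"unknown mode: {mode}")


# Illustrative values: T2 = 2 s, V4 = 3 words/s, subtitle length 6.3 words.
print(round(second_difference(2.0, 3.0, 6.3), 3))
print(round(second_difference(2.0, 3.0, 6.3, mode="ratio"), 3))
```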
As described above, to ensure that the video has a good audio-visual effect, the minimum display duration of the target subtitle and the minimum target speech rate need to be limited. As a result, the length of the second subtitle must fall within a certain range (that is, the "preset range" in S1332).
If the length of the second subtitle is within the preset range, then, subject to the two conditions "the display duration of the target subtitle is greater than or equal to the minimum display duration of the target subtitle" and "the target speech rate of the dubbing is greater than or equal to the minimum target speech rate", reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing can make the second difference, between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle, less than or equal to the second threshold.
If the length of the second subtitle is not within the preset range, then, subject to the same two conditions, no matter how the display duration of the target subtitle and/or the target speech rate of the dubbing is reduced, the second difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing and the length of the second subtitle cannot be made less than or equal to the second threshold.
Optionally, if it is determined that the length of the second subtitle is not within the preset range, the first subtitle is retranslated according to the preset range, so that the length of the second subtitle obtained after retranslation is within the preset range.
Optionally, the preset range is related to the default speech rate and the duration of the audio segment.
Further, the preset range includes a lower limit (that is, a minimum value). The lower limit is related to the default speech rate, the duration of the audio segment, the maximum value of the duration adjustment parameter, and the minimum value of the speech rate adjustment parameter.
For example, assume that the lower limit of the preset range is n1, the default speech rate is V3, the duration of the audio segment is T1, the maximum value of the duration adjustment parameter is 1.1, and the minimum value of the speech rate adjustment parameter is 0.9. Then n1 = (V3·0.9)(T1/1.1).
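The lower limit, and the check that triggers the optional retranslation in case two, can be sketched as follows. The names are hypothetical, and the default parameter values are the illustrative 1.1 and 0.9 used above.

```python
def preset_range_lower_limit(default_rate, audio_duration,
                             max_duration_param=1.1, min_rate_param=0.9):
    """Return (V3 * x2_min) * (T1 / x1_max), the shortest fittable subtitle."""
    min_rate = default_rate * min_rate_param             # slowest allowed speech rate
    min_duration = audio_duration / max_duration_param   # shortest allowed display duration
    return min_rate * min_duration


def too_short_to_fit(length, default_rate, audio_duration):
    """True if no permitted slowdown or shortening can match the subtitle."""
    return length < preset_range_lower_limit(default_rate, audio_duration)


# Illustrative values: V3 = 3 words/s, T1 = 2 s.
print(round(preset_range_lower_limit(3.0, 2.0), 3))
print(too_short_to_fit(4.0, 3.0, 2.0), too_short_to_fit(5.0, 3.0, 2.0))
```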
On the basis of the above technical solutions, when S1333 is executed, "reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing" can be implemented in several ways. Three typical methods are given below.
Method 1:
Starting from the default speech rate, gradually reduce the target speech rate of the dubbing. If the target speech rate of the dubbing has reached its minimum value and the second difference is still greater than the second threshold, then, starting from the duration of the audio segment, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
For example, keep the display duration of a group of subtitles unchanged, that is, T1 is fixed and the duration adjustment parameter x1 = 1. Adjust the speech rate adjustment parameter x2 first: x2 decreases gradually from 1 toward 0.9, for example by taking values at intervals in order from 1 to 0.9. When, for some value of x2, (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error, stop adjusting x2 and output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
If x2 has reached its minimum value of 0.9 and (V3·x2)(T1/x1) still does not equal the length of the text in the group of subtitles within the allowable error, fix x2 = 0.9 and adjust x1: x1 increases gradually from 1 toward 1.1, for example by taking values at intervals in order from 1 to 1.1, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
Method 2:
Starting from the duration of the audio segment, gradually reduce the display duration of the target subtitle. If the display duration of the target subtitle has reached its minimum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, gradually reduce the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
For example, keep the target speech rate of the dubbing timbre unchanged, that is, V3 is fixed and the speech rate adjustment parameter x2 = 1. Adjust the duration adjustment parameter x1 first: x1 increases gradually from 1 toward 1.1, for example by taking values at intervals in order from 1 to 1.1. When, for some value of x1, (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error, stop adjusting x1 and output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
If x1 has reached its maximum value of 1.1 and (V3·x2)(T1/x1) still does not equal the length of the text in the group of subtitles within the allowable error, further adjust x2: x2 decreases gradually from 1 toward 0.9, for example by taking values at intervals in order from 1 to 0.9, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
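For the reduction case, Method 2 scans the duration parameter first and the speech rate second. The Python below is a sketch under assumptions: x1 is held at its maximum during the second scan (by analogy with Method 1, which fixes the exhausted parameter), and the step size, tolerance and names are hypothetical.

```python
def method_two_reduce(length, default_rate, audio_duration,
                      tolerance=0.05, step=0.01, x1_max=1.1, x2_min=0.9):
    """Shorten the display duration first, then slow the speech rate.

    Returns (x1, x2), or None if no permitted pair fits the subtitle.
    """
    def fits(x1, x2):
        capacity = (default_rate * x2) * (audio_duration / x1)
        return abs(capacity - length) <= tolerance

    # Phase 1: x2 fixed at 1, x1 scanned from 1 up to x1_max.
    x2 = 1.0
    for i in range(round((x1_max - 1.0) / step) + 1):
        x1 = round(1.0 + i * step, 4)
        if fits(x1, x2):
            return x1, x2

    # Phase 2: x1 held at its maximum (assumption), x2 scanned from 1 down to x2_min.
    x1 = x1_max
    for j in range(1, round((1.0 - x2_min) / step) + 1):
        x2 = round(1.0 - j * step, 4)
        if fits(x1, x2):
            return x1, x2
    return None


print(method_two_reduce(5.7, 3.0, 2.0))   # solved by display duration alone
print(method_two_reduce(5.2, 3.0, 2.0))   # needs a slower speech rate as well
print(method_two_reduce(4.5, 3.0, 2.0))   # below the lower limit
```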
Method 3:
Starting from the duration of the audio segment, gradually reduce the display duration of the target subtitle, and at the same time, starting from the default speech rate, gradually reduce the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
For example, adjust the duration adjustment parameter x1 and the speech rate adjustment parameter x2 at the same time: x1 increases gradually from 1 toward 1.1 and x2 decreases gradually from 1 toward 0.9, until (V3·x2)(T1/x1) equals the length of the text in the group of subtitles within the allowable error. Then output the current duration adjustment parameter x1 and speech rate adjustment parameter x2.
Further, for Method 3, several combinations of the duration adjustment parameter x1 and the speech rate adjustment parameter x2 may in practice each satisfy (V3·x2)(T1/x1) = the length of the text in the group of subtitles within the allowable error. In this case, an additional selection criterion can be applied, such as minimum x1 + x2, minimum 2·x1 + x2, or minimum x1² + x2², to obtain the optimal combination of the duration adjustment parameter x1 and the speech rate adjustment parameter x2.
FIG. 7 is a flowchart of another video processing method provided by an embodiment of the present disclosure. In practice, the original video may include multiple audio segments that are the speech of multiple target objects, where a target object can be understood as a person in the video. For this situation, on the basis of the above technical solutions, optionally, referring to FIG. 7, the method further includes:
S210. For each target object among the multiple target objects, select a dubbing timbre corresponding to the target object.
This step can be implemented in several ways. For example, multiple pieces of dubbing timbre data are stored in a database in advance, and different dubbing timbre data correspond to different person attribute data. Here, the person attribute data includes the age, gender, tone of voice, occupation, and the like of a person. When this step is executed, the person attribute data of the target object is identified based on the original video, and the dubbing timbre corresponding to the target object is determined based on the person attribute data of the target object.
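The lookup described for S210 can be sketched with an in-memory table. Everything concrete here is hypothetical: the attribute fields follow the text, but the timbre names, the sample records, and the rule of picking the timbre with the most matching attributes are assumptions made for illustration, and attribute recognition from the video is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonAttributes:
    age_group: str
    gender: str
    tone: str
    occupation: str


# Hypothetical pre-stored database: timbre name -> attributes it is suited to.
TIMBRE_DATABASE = {
    "timbre_a": PersonAttributes("adult", "female", "gentle", "teacher"),
    "timbre_b": PersonAttributes("adult", "male", "firm", "doctor"),
    "timbre_c": PersonAttributes("child", "male", "lively", "student"),
}

FIELDS = ("age_group", "gender", "tone", "occupation")


def select_timbre(target, database=TIMBRE_DATABASE):
    """Return the timbre whose stored attributes match the target best."""
    def score(name):
        stored = database[name]
        return sum(getattr(stored, f) == getattr(target, f) for f in FIELDS)
    return max(database, key=score)


speaker = PersonAttributes("adult", "male", "gentle", "doctor")
print(select_timbre(speaker))
```

Ensuring that different target objects in the same video receive different timbres would need an extra step, such as removing an assigned timbre from the candidate set.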
Optionally, within the same video, the same target object corresponds to the same dubbing timbre, and different target objects correspond to different dubbing timbres.
S220. Generate, according to the dubbing timbre corresponding to each target object, multiple pieces of dubbing audio corresponding to the multiple audio segments.
S230. Replace the multiple audio segments in the original video with the multiple pieces of dubbing audio to obtain a target video.
In the above technical solution, a dubbing timbre is selected for each target object among the multiple target objects, and multiple pieces of dubbing audio corresponding to the multiple audio segments are generated according to the dubbing timbre of each target object. Persons are thereby matched to timbres, which makes it easier for the user to tell the dubbed characters apart by voice and improves the user experience.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. Those skilled in the art should understand, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present disclosure.
FIG. 8 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present disclosure. The video processing apparatus provided by the embodiment of the present disclosure may be configured in a client or in a server, and specifically includes:
an obtaining module 310, configured to obtain a first subtitle in an original video;
a translation module 320, configured to translate the first subtitle to obtain a second subtitle;
a determining module 330, configured to determine a target speech rate of dubbing; and
a dubbing module 340, configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
Further, the first subtitle is a subtitle corresponding to any segment of audio in the original video;
the determining module 330 is further configured to determine a display duration of a target subtitle, where the target subtitle includes the first subtitle and/or the second subtitle; and
the apparatus further includes a replacement module 350, configured to replace the segment of audio in the original video with the dubbing audio to obtain a target video, and to display the target subtitle in the picture of the target video that corresponds to the display duration of the target subtitle.
Further, the determining module is configured to:
determine a default duration of the dubbing according to the length of the second subtitle and a default speech rate of the dubbing; and
if the duration of the segment of audio is greater than or equal to the default duration, and a first difference between the duration of the segment of audio and the default duration is less than or equal to a first threshold, set the display duration of the target subtitle to the time corresponding to the segment of audio and set the target speech rate of the dubbing to the default speech rate.
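The case handled by the determining module can be illustrated with a small sketch. It assumes that the default duration is the subtitle length divided by the default speech rate, which is consistent with the product of duration and rate used elsewhere in the description but is not stated explicitly. The names `DubbingPlan`, `default_duration`, and `plan_without_adjustment` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DubbingPlan:
    display_duration: float  # seconds the target subtitle stays on screen
    speech_rate: float       # subtitle units (e.g. characters) spoken per second


def default_duration(subtitle_length: int, default_rate: float) -> float:
    # Assumption: duration = length / rate.
    return subtitle_length / default_rate


def plan_without_adjustment(audio_duration: float, subtitle_length: int,
                            default_rate: float,
                            first_threshold: float) -> Optional[DubbingPlan]:
    """Return a plan when no adjustment is needed, otherwise None."""
    first_difference = audio_duration - default_duration(subtitle_length, default_rate)
    if 0 <= first_difference <= first_threshold:
        # The original audio slot is kept and the dubbing uses the default rate.
        return DubbingPlan(display_duration=audio_duration, speech_rate=default_rate)
    return None
```

A return value of `None` corresponds to the situations that the description delegates to the adjustment modules.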
Further, the apparatus also includes a first adjustment module, configured to:
if the duration of the segment of audio is less than the default duration, determine whether the length of the second subtitle is within a preset range; and
if the length of the second subtitle is within the preset range, increase the display duration of the target subtitle and/or increase the target speech rate of the dubbing, so that a second difference, namely the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing on the one hand and the length of the second subtitle on the other, is less than or equal to a second threshold.
Further, the first adjustment module is configured to:
starting from the default speech rate, gradually increase the target speech rate of the dubbing; and
if the target speech rate of the dubbing has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the duration of the segment of audio, gradually increase the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
Further, the first adjustment module is configured to:
starting from the duration of the segment of audio, gradually increase the display duration of the target subtitle; and
if the display duration of the target subtitle has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, gradually increase the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
Further, the first adjustment module is configured to:
starting from the duration of the segment of audio, gradually increase the display duration of the target subtitle while, starting from the default speech rate, gradually increasing the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
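The three orderings described for the first adjustment module (speech rate first, display duration first, or both together) can be sketched in one function. The step sizes, the `order` labels, and the function names are assumptions made for the example. The loop also tests a one-sided shortfall rather than the absolute second difference, an added guard so that a coarse step cannot jump over the tolerance band and loop forever.

```python
def shortfall(display_duration: float, speech_rate: float, subtitle_length: int) -> float:
    """Subtitle units that do not fit into display_duration at speech_rate."""
    return subtitle_length - display_duration * speech_rate


def lengthen(audio_duration: float, subtitle_length: int, default_rate: float,
             max_rate: float, max_duration: float, threshold: float,
             order: str = "rate_first",
             rate_step: float = 0.5, duration_step: float = 0.5):
    """Return (display_duration, speech_rate) after gradual adjustment."""
    if order not in ("rate_first", "duration_first", "together"):
        raise ValueError(f"unknown order: {order}")
    rate, duration = default_rate, audio_duration
    while shortfall(duration, rate, subtitle_length) > threshold:
        raise_rate = rate < max_rate
        extend = duration < max_duration
        if order == "rate_first":
            extend = extend and not raise_rate      # extend only once the rate is maxed out
        elif order == "duration_first":
            raise_rate = raise_rate and not extend  # speed up only once the duration is maxed out
        if not (raise_rate or extend):
            break  # both limits reached; the subtitle would need retranslating
        if raise_rate:
            rate = min(rate + rate_step, max_rate)
        if extend:
            duration = min(duration + duration_step, max_duration)
    return duration, rate
```

For a 4-second segment and a 30-unit subtitle with a default rate of 5 and a maximum rate of 6, the rate-first order raises the rate to 6 and then extends the display duration to 5 seconds, whereas the duration-first order extends the duration to 6 seconds and leaves the rate at 5.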
Further, the first adjustment module increases the display duration of the target subtitle by reducing the display speed of the pictures corresponding to the segment of audio.
Further, the apparatus also includes a second adjustment module, configured to:
if the duration of the segment of audio is greater than the default duration, and the first difference between the duration of the segment of audio and the default duration is greater than the first threshold, determine whether the length of the second subtitle is within the preset range; and
if the length of the second subtitle is within the preset range, reduce the display duration of the target subtitle and/or reduce the target speech rate of the dubbing, so that the second difference, namely the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing on the one hand and the length of the second subtitle on the other, is less than or equal to the second threshold.
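The conditions that route a segment to the determining module, the first adjustment module, or the second adjustment module can be summarized in a short classifier. The enum and function names are hypothetical, and the additional check that the subtitle length lies within the preset range is deliberately left out of this sketch.

```python
from enum import Enum


class Adjustment(Enum):
    KEEP = "keep"          # default rate, original audio duration
    LENGTHEN = "lengthen"  # first adjustment module: speed up and/or extend
    SHORTEN = "shorten"    # second adjustment module: slow down and/or shorten


def classify(audio_duration: float, default_duration: float,
             first_threshold: float) -> Adjustment:
    if audio_duration < default_duration:
        return Adjustment.LENGTHEN
    if audio_duration - default_duration <= first_threshold:
        return Adjustment.KEEP
    return Adjustment.SHORTEN
```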
Further, the second adjustment module is configured to:
starting from the default speech rate, gradually reduce the target speech rate of the dubbing; and
if the target speech rate of the dubbing has reached its minimum value and the second difference is still greater than the second threshold, then, starting from the duration of the segment of audio, gradually reduce the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
Further, the second adjustment module is configured to:
starting from the duration of the segment of audio, gradually reduce the display duration of the target subtitle; and
if the display duration of the target subtitle has reached its minimum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, gradually reduce the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
Further, the second adjustment module is configured to:
starting from the duration of the segment of audio, gradually reduce the display duration of the target subtitle while, starting from the default speech rate, gradually reducing the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
Further, the second adjustment module reduces the display duration of the target subtitle by increasing the display speed of the pictures corresponding to the segment of audio.
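Both adjustment modules change the display duration by changing how fast the corresponding pictures are shown. One way to express this, assuming a uniform speed change over the whole segment (the description does not specify how the speed is computed), is a playback speed factor. The function name is hypothetical.

```python
def playback_speed_factor(original_duration: float, target_duration: float) -> float:
    """Factor below 1 slows the pictures down; factor above 1 speeds them up."""
    if original_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    return original_duration / target_duration
```

Stretching a 4-second segment to 5 seconds gives a factor of 0.8, and compressing a 6-second segment to 4 seconds gives a factor of 1.5.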
Further, the translation module is also configured to:
if it is determined that the length of the second subtitle is not within the preset range, retranslate the first subtitle according to the preset range, so that the length of the second subtitle obtained by retranslation is within the preset range.
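The retranslation step can be sketched as a bounded retry loop. The `translate` callable, its optional length-range argument, the retry limit, and the use of character count as the subtitle length are all assumptions made for the example. The disclosure does not say how the translation engine is constrained.

```python
from typing import Callable, Optional, Tuple

LengthRange = Tuple[int, int]


def translate_within_range(first_subtitle: str,
                           translate: Callable[[str, Optional[LengthRange]], str],
                           length_range: LengthRange,
                           max_retries: int = 3) -> str:
    """Translate once freely, then retranslate with the range as a constraint."""
    low, high = length_range
    second_subtitle = translate(first_subtitle, None)  # unconstrained first pass
    retries = 0
    while not (low <= len(second_subtitle) <= high):
        if retries == max_retries:
            raise ValueError("no translation within the preset range")
        second_subtitle = translate(first_subtitle, length_range)
        retries += 1
    return second_subtitle
```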
Further, the display duration of the target subtitle is the display duration of the pictures corresponding to the segment of audio.
Further, the preset range is related to the default speech rate and the duration of the segment of audio.
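The description states only that the preset range is related to the default speech rate and the audio duration. One purely illustrative choice, not taken from the disclosure, is a symmetric tolerance band around the number of subtitle units that fit the segment at the default rate. The function name and the `tolerance` parameter are hypothetical.

```python
import math
from typing import Tuple


def preset_length_range(audio_duration: float, default_rate: float,
                        tolerance: float = 0.25) -> Tuple[int, int]:
    """Assumed form: nominal length = duration * rate, widened by +/- tolerance."""
    nominal = audio_duration * default_rate
    return math.ceil(nominal * (1 - tolerance)), math.floor(nominal * (1 + tolerance))
```

With a 4-second segment and a default rate of 5 units per second, the nominal length is 20 and the assumed range is 15 to 25 units.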
Further, the original video includes multiple segments of audio, which are the speech of multiple target objects.
The apparatus further includes a selection module, configured to select, for each of the multiple target objects, a dubbing timbre corresponding to that target object.
The dubbing module is configured to generate, according to the dubbing timbre corresponding to each target object, multiple pieces of dubbing audio corresponding to the multiple segments of audio.
The replacement module is configured to replace the multiple segments of audio in the original video with the multiple pieces of dubbing audio to obtain a target video.
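Per-speaker timbre selection can be sketched as a mapping from each target object to one voice, reused for every segment that object speaks. The speaker labels, the timbre identifiers, the round-robin assignment, and the `synthesize` callable are assumptions made for the example. The disclosure does not specify how a timbre is chosen.

```python
from typing import Callable, Dict, List, Tuple


def assign_timbres(speakers: List[str], available: List[str]) -> Dict[str, str]:
    """Give each distinct speaker one timbre, cycling if timbres run out."""
    distinct = list(dict.fromkeys(speakers))  # keeps first-appearance order
    return {s: available[i % len(available)] for i, s in enumerate(distinct)}


def dub_segments(segments: List[Tuple[str, str]], timbres: Dict[str, str],
                 synthesize: Callable[[str, str], bytes]) -> List[bytes]:
    """segments holds (speaker, translated subtitle) pairs in playback order."""
    return [synthesize(text, timbres[speaker]) for speaker, text in segments]
```

Keeping the mapping fixed for the whole video means a given speaker sounds the same in every dubbed segment.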
The video processing apparatus provided by the embodiments of the present disclosure can perform the steps performed by the client or the server in the video processing method provided by the method embodiments of the present disclosure. The specific steps and beneficial effects are not repeated here.
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring now to FIG. 9, it shows a schematic structural diagram of an electronic device 1000 suitable for implementing embodiments of the present disclosure. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (for example, in-vehicle navigation terminals), and wearable electronic devices, as well as fixed terminals such as digital TVs, desktop computers, and smart home devices. The electronic device shown in FIG. 9 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 9, the electronic device 1000 may include a processing device (for example, a central processing unit or a graphics processor) 1001, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003, so as to implement the video processing method of the embodiments described in the present disclosure. The RAM 1003 also stores various programs and information required for the operation of the electronic device 1000. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 1007 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 1008 including, for example, a magnetic tape and a hard disk; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. Although FIG. 9 shows the electronic device 1000 with various devices, it should be understood that it is not required to implement or include all of the devices shown. More or fewer devices may alternatively be implemented or included.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart, thereby implementing the video processing method described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 1009, installed from the storage device 1008, or installed from the ROM 1002. When the computer program is executed by the processing device 1001, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, by contrast, a computer-readable signal medium may include an information signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated information signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to an electrical wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital information communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
obtain a first subtitle in an original video;
translate the first subtitle to obtain a second subtitle;
determine a target speech rate of dubbing; and
generate, according to the target speech rate of the dubbing, dubbing audio corresponding to the second subtitle.
Optionally, when the one or more programs are executed by the electronic device, the electronic device may also perform the other steps described in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, the present disclosure provides an electronic device, including:
one or more processors; and
a memory for storing one or more programs,
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the video processing methods provided by the present disclosure.
According to one or more embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the video processing methods provided by the present disclosure.
Embodiments of the present disclosure also provide a computer program product that includes a computer program or instructions which, when executed by a processor, implement the video processing method described above.
The above description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (20)

  1. A video processing method, the method comprising:
    obtaining a first subtitle in an original video;
    translating the first subtitle to obtain a second subtitle;
    determining a target speech rate of dubbing; and
    generating, according to the target speech rate of the dubbing, dubbing audio corresponding to the second subtitle.
  2. The method according to claim 1, wherein the first subtitle is a subtitle corresponding to any segment of audio in the original video;
    the method further comprises:
    determining a display duration of a target subtitle, wherein the target subtitle includes the first subtitle and/or the second subtitle; and
    after generating, according to the target speech rate of the dubbing, the dubbing audio corresponding to the second subtitle, the method further comprises:
    replacing the segment of audio in the original video with the dubbing audio to obtain a target video, and displaying the target subtitle in the pictures of the target video that correspond to the display duration of the target subtitle.
  3. The method according to claim 2, wherein determining the display duration of the target subtitle and the target speech rate of the dubbing comprises:
    determining a default duration of the dubbing according to the length of the second subtitle and a default speech rate of the dubbing; and
    if the duration of the segment of audio is greater than or equal to the default duration, and a first difference between the duration of the segment of audio and the default duration is less than or equal to a first threshold, the display duration of the target subtitle is the time corresponding to the segment of audio, and the target speech rate of the dubbing is the default speech rate.
  4. The method according to claim 3, wherein the method further comprises:
    if the duration of the segment of audio is less than the default duration, determining whether the length of the second subtitle is within a preset range; and
    if the length of the second subtitle is within the preset range, increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing, so that a second difference, namely the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing on the one hand and the length of the second subtitle on the other, is less than or equal to a second threshold.
  5. The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:
    starting from the default speech rate, gradually increasing the target speech rate of the dubbing; and
    if the target speech rate of the dubbing has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the duration of the segment of audio, gradually increasing the display duration of the target subtitle until the second difference is less than or equal to the second threshold.
  6. The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:
    starting from the duration of the segment of audio, gradually increasing the display duration of the target subtitle; and
    if the display duration of the target subtitle has reached its maximum value and the second difference is still greater than the second threshold, then, starting from the default speech rate, gradually increasing the target speech rate of the dubbing until the second difference is less than or equal to the second threshold.
  7. The method according to claim 4, wherein increasing the display duration of the target subtitle and/or increasing the target speech rate of the dubbing comprises:
    starting from the duration of the segment of audio, gradually increasing the display duration of the target subtitle while, starting from the default speech rate, gradually increasing the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
  8. The method according to any one of claims 5 to 7, wherein increasing the display duration of the target subtitle comprises:
    reducing the display speed of the pictures corresponding to the segment of audio.
  9. The method according to claim 3, wherein the method further comprises:
    if the duration of the segment of audio is greater than the default duration, and the first difference between the duration of the segment of audio and the default duration is greater than the first threshold, determining whether the length of the second subtitle is within a preset range; and
    if the length of the second subtitle is within the preset range, reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing, so that a second difference, namely the difference between the product of the display duration of the target subtitle and the target speech rate of the dubbing on the one hand and the length of the second subtitle on the other, is less than or equal to a second threshold.
  10. The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:
    starting from the target speech rate of the dubbing being the default speech rate, gradually reducing the target speech rate of the dubbing;
    if the target speech rate of the dubbing has reached a minimum value and the second difference is greater than the second threshold, gradually reducing the display duration of the target subtitle, starting from the duration of the piece of audio, until the second difference is less than or equal to the second threshold.
  11. The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:
    starting from the display duration of the target subtitle being the duration of the piece of audio, gradually reducing the display duration of the target subtitle;
    if the display duration of the target subtitle has reached a minimum value and the second difference is greater than the second threshold, gradually reducing the target speech rate of the dubbing, starting from the default speech rate, until the second difference is less than or equal to the second threshold.
  12. The method according to claim 9, wherein reducing the display duration of the target subtitle and/or reducing the target speech rate of the dubbing comprises:
    starting from the display duration of the target subtitle being the duration of the piece of audio, gradually reducing the display duration of the target subtitle and, at the same time, starting from the target speech rate of the dubbing being the default speech rate, gradually reducing the target speech rate of the dubbing, until the second difference is less than or equal to the second threshold.
  13. The method according to any one of claims 10-12, wherein reducing the display duration of the target subtitle comprises:
    increasing the display speed of the picture corresponding to the piece of audio.
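For the opposite case, where the translated subtitle is short relative to the audio, the simultaneous variant (display duration and speech rate reduced together) might look like the Python sketch below. The function name, step sizes, lower bounds and units are assumptions made for illustration; the claims do not specify them.

```python
def shorten_jointly(subtitle_len, audio_duration, default_rate,
                    min_duration, min_rate, threshold,
                    duration_step=0.05, rate_step=0.05):
    """Reduce display duration and speech rate together until the text fits.

    Assumed units: characters, seconds, characters per second.
    """
    duration, rate = audio_duration, default_rate

    def surplus():
        # Unused speaking capacity: (duration x rate) minus subtitle length.
        return duration * rate - subtitle_len

    # Lower both quantities on every iteration; each one stops at its own floor.
    while surplus() > threshold and (duration > min_duration or rate > min_rate):
        duration = max(duration - duration_step, min_duration)
        rate = max(rate - rate_step, min_rate)

    return duration, rate, abs(surplus()) <= threshold


duration_b, rate_b, fits_b = shorten_jointly(
    subtitle_len=20, audio_duration=2.0, default_rate=15.0,
    min_duration=1.6, min_rate=12.0, threshold=1.0)
print(f"duration={duration_b:.2f}s rate={rate_b:.2f} fits={fits_b}")
```

Adjusting both quantities at once spreads the change between picture speed and voice speed, so neither has to move as far as it would if adjusted alone.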
  14. The method according to claim 4 or 9, wherein the method further comprises:
    if it is determined that the length of the second subtitle is not within the preset range, retranslating the first subtitle according to the preset range, so that the length of the second subtitle obtained by the retranslation is within the preset range.
  15. The method according to claim 4 or 9, wherein the display duration of the target subtitle is the display duration of the picture corresponding to the piece of audio.
  16. The method according to claim 4 or 9, wherein the preset range is related to the default speech rate and the duration of the piece of audio.
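The claims state only that the preset range is related to the default speech rate and the audio duration, and that an out-of-range translation is retranslated. One possible reading is sketched below in Python. The tolerance-based range formula, the `retranslate` callback and its signature, and the retry limit are all assumptions introduced for illustration and are not taken from the application.

```python
from typing import Callable, Tuple


def preset_range(default_rate: float, audio_duration: float,
                 tolerance: float = 0.3) -> Tuple[float, float]:
    """Assumed formula: the ideal length is rate x duration, widened by a tolerance."""
    ideal = default_rate * audio_duration
    return ideal * (1 - tolerance), ideal * (1 + tolerance)


def ensure_length(first_subtitle: str, second_subtitle: str,
                  bounds: Tuple[float, float],
                  retranslate: Callable[[str, float, float], str],
                  max_attempts: int = 3) -> Tuple[str, bool]:
    """Retranslate the first subtitle until the result falls inside the range."""
    low, high = bounds
    attempts = 0
    while not (low <= len(second_subtitle) <= high) and attempts < max_attempts:
        # Hypothetical length-constrained translator supplied by the caller.
        second_subtitle = retranslate(first_subtitle, low, high)
        attempts += 1
    return second_subtitle, low <= len(second_subtitle) <= high


def toy_retranslate(source: str, low: float, high: float) -> str:
    # Stand-in for a real translation model; returns a fixed shorter rendering.
    return "Thanks a lot, see you tomorrow."


bounds = preset_range(default_rate=15.0, audio_duration=2.0)
text, in_range = ensure_length(
    "非常感谢，明天见。",
    "Thank you very much indeed, and I will see you again tomorrow morning.",
    bounds, toy_retranslate)
print(bounds, text, in_range)
```

The returned flag lets a caller fall back to some other handling if no translation within the range is found after the assumed number of attempts.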
  17. The method according to claim 1, wherein the original video includes multiple pieces of audio, and the multiple pieces of audio are the voices of multiple target objects;
    the method further comprises:
    for each target object among the multiple target objects, selecting a dubbing timbre corresponding to the target object;
    generating, according to the dubbing timbre corresponding to each target object, multiple dubbing audios corresponding to the multiple pieces of audio;
    replacing the multiple pieces of audio in the original video with the multiple dubbing audios to obtain a target video.
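The per-speaker dubbing described above could be organized roughly as in the following Python sketch. The `Segment` type, the `pick_timbre` and `synthesize` callbacks, and the toy stand-ins used in the example are hypothetical; a real system would plug in its own speaker identification and speech synthesis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    speaker: str    # identifier of the target object whose voice this is
    subtitle: str   # translated (second) subtitle for this piece of audio
    audio: bytes    # audio payload; replaced by the dubbed audio


def dub_all(segments: List[Segment],
            pick_timbre: Callable[[str], str],
            synthesize: Callable[[str, str], bytes]) -> List[Segment]:
    """Choose one timbre per speaker, then dub every segment with it."""
    timbres: Dict[str, str] = {}
    for seg in segments:
        # Select a timbre once per speaker so the voice stays consistent.
        if seg.speaker not in timbres:
            timbres[seg.speaker] = pick_timbre(seg.speaker)
    # Replace each original piece of audio with its dubbed counterpart.
    return [Segment(seg.speaker, seg.subtitle,
                    synthesize(seg.subtitle, timbres[seg.speaker]))
            for seg in segments]


# Toy stand-ins for timbre selection and speech synthesis.
original = [Segment("A", "Hello.", b"..."),
            Segment("B", "Hi there.", b"..."),
            Segment("A", "How are you?", b"...")]
dubbed = dub_all(original,
                 pick_timbre=lambda speaker: f"voice-{speaker}",
                 synthesize=lambda text, timbre: f"{timbre}:{text}".encode())
print([seg.audio for seg in dubbed])
```

Caching the timbre per speaker keeps every line spoken by the same target object in the same voice across the target video.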
  18. A video processing apparatus, comprising:
    an acquisition module, configured to acquire a first subtitle in an original video;
    a translation module, configured to translate the first subtitle to obtain a second subtitle;
    a determination module, configured to determine a target speech rate of dubbing;
    a dubbing module, configured to generate dubbing audio corresponding to the second subtitle according to the target speech rate of the dubbing.
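As a rough illustration of how the four modules might be wired together in software, the Python sketch below models each module as a callable. The class name, the callable signatures (in particular, that the determination module takes the second subtitle as input) and the toy lambdas are assumptions made here and are not specified by the claim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VideoProcessingApparatus:
    acquire: Callable[[str], str]           # acquisition module: video path -> first subtitle
    translate: Callable[[str], str]         # translation module: first -> second subtitle
    determine_rate: Callable[[str], float]  # determination module: target speech rate
    dub: Callable[[str, float], bytes]      # dubbing module: subtitle + rate -> audio

    def run(self, video_path: str) -> bytes:
        first = self.acquire(video_path)
        second = self.translate(first)
        rate = self.determine_rate(second)
        return self.dub(second, rate)


# Toy modules used only to show the data flow.
apparatus = VideoProcessingApparatus(
    acquire=lambda path: "你好",
    translate=lambda text: "Hello",
    determine_rate=lambda text: 15.0,
    dub=lambda text, rate: f"{text}@{rate}".encode())
dub_audio = apparatus.run("example.mp4")
print(dub_audio)
```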
  19. An electronic device, comprising:
    one or more processors;
    a storage device, configured to store one or more programs;
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-17.
  20. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-17.
PCT/CN2022/087381 2021-04-29 2022-04-18 Video processing method and apparatus, electronic device, and storage medium WO2022228179A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110472124.XA CN113207044A (en) 2021-04-29 2021-04-29 Video processing method and device, electronic equipment and storage medium
CN202110472124.X 2021-04-29

Publications (1)

Publication Number Publication Date
WO2022228179A1 (en) 2022-11-03

Family

ID=77029350

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087381 WO2022228179A1 (en) 2021-04-29 2022-04-18 Video processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113207044A (en)
WO (1) WO2022228179A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113207044A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Video processing method and device, electronic equipment and storage medium
CN114025236A (en) * 2021-11-16 2022-02-08 上海大晓智能科技有限公司 Video content understanding method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160021334A1 (en) * 2013-03-11 2016-01-21 Video Dubber Ltd. Method, Apparatus and System For Regenerating Voice Intonation In Automatically Dubbed Videos
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN111683266A (en) * 2020-05-06 2020-09-18 厦门盈趣科技股份有限公司 Method and terminal for configuring subtitles through simultaneous translation of videos
US20200404386A1 (en) * 2018-02-26 2020-12-24 Google Llc Automated voice translation dubbing for prerecorded video
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN113207044A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Video processing method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006129247A1 (en) * 2005-05-31 2006-12-07 Koninklijke Philips Electronics N. V. A method and a device for performing an automatic dubbing on a multimedia signal
CN109218629B (en) * 2018-09-14 2021-02-05 三星电子(中国)研发中心 Video generation method, storage medium and device
US11159597B2 (en) * 2019-02-01 2021-10-26 Vidubly Ltd Systems and methods for artificial dubbing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160021334A1 (en) * 2013-03-11 2016-01-21 Video Dubber Ltd. Method, Apparatus and System For Regenerating Voice Intonation In Automatically Dubbed Videos
US20200404386A1 (en) * 2018-02-26 2020-12-24 Google Llc Automated voice translation dubbing for prerecorded video
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN111683266A (en) * 2020-05-06 2020-09-18 厦门盈趣科技股份有限公司 Method and terminal for configuring subtitles through simultaneous translation of videos
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN113207044A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113207044A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
WO2022228179A1 (en) Video processing method and apparatus, electronic device, and storage medium
US11972770B2 (en) Systems and methods for intelligent playback
US20220239882A1 (en) Interactive information processing method, device and medium
WO2020098115A1 (en) Subtitle adding method, apparatus, electronic device, and computer readable storage medium
US20190221200A1 (en) Assisted Media Presentation
WO2023011142A1 (en) Video processing method and apparatus, electronic device and storage medium
KR20220103110A (en) Video generating apparatus and method, electronic device, and computer readable medium
US20230011395A1 (en) Video page display method and apparatus, electronic device and computer-readable medium
CN113259740A (en) Multimedia processing method, device, equipment and medium
CN110418183B (en) Audio and video synchronization method and device, electronic equipment and readable medium
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN112908292B (en) Text voice synthesis method and device, electronic equipment and storage medium
WO2023051293A1 (en) Audio processing method and apparatus, and electronic device and storage medium
CN113507637A (en) Media file processing method, device, equipment, readable storage medium and product
WO2023165371A1 (en) Audio playing method and apparatus, electronic device and storage medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
CN113992926B (en) Interface display method, device, electronic equipment and storage medium
WO2022012390A1 (en) Video recording method and apparatus, electronic device, and storage medium
CN114554238A (en) Live broadcast voice simultaneous transmission method, device, medium and electronic equipment
WO2024037480A1 (en) Interaction method and apparatus, electronic device, and storage medium
WO2022257777A1 (en) Multimedia processing method and apparatus, and device and medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN115967833A (en) Video generation method, device and equipment meter storage medium
CN111652002A (en) Text division method, device, equipment and computer readable medium
US11792494B1 (en) Processing method and apparatus, electronic device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 22794650; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 22794650; Country of ref document: EP; Kind code of ref document: A1